### Creation of DataFrames in Pandas

#### 1. **Using a NumPy Array:**
   - Create a DataFrame from a 2D NumPy array with specified row and column indices.
     ```python
     df = pd.DataFrame(np.random.randn(2, 3), columns=["First", "Second", "Third"], index=["a", "b"])
     ```

   - Access the row and column labels through the `index` and `columns` attributes, which hold special Index objects:
     ```python
     df.index  # Row indices
     df.columns  # Column indices
     ```

   - If column names are not provided, implicit integer indices (0, 1, 2, ...) are used for the columns; the same happens for rows when `index` is omitted:
     ```python
     df2 = pd.DataFrame(np.random.randn(2, 3), index=["a", "b"])
     ```
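
As a quick check of the default labels, the following sketch builds a frame with neither `index` nor `columns` and inspects both axes. The name `df3` and the fixed seed are illustrative choices, not part of the notes above.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is reproducible (the seed value is arbitrary).
rng = np.random.default_rng(0)

# Neither index nor columns is given: pandas numbers both axes 0, 1, 2, ...
df3 = pd.DataFrame(rng.standard_normal((2, 3)))

print(df3.index)    # implicit row labels
print(df3.columns)  # implicit column labels

# With integer column labels, an integer inside brackets selects a column.
print(df3[0])
```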

#### 2. **Using Columns:**
   - Create a DataFrame using columns specified as lists, NumPy arrays, or Pandas Series.
     ```python
     s1 = pd.Series([1, 2, 3])
     s2 = pd.Series([4, 5, 6], name="b")
     ```

   - Create a DataFrame specifying column names explicitly:
     ```python
     pd.DataFrame(s1, columns=["a"])
     ```

   - Use the name attribute of Series as the column name:
     ```python
     pd.DataFrame(s2)
     ```

   - For multiple columns, use a dictionary where keys are column names, and values are column content:
     ```python
     pd.DataFrame({"a": s1, "b": s2})
     ```
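
One detail worth seeing in action is how the dictionary form treats row labels. The sketch below uses made-up Series (`a`, `b`) whose indices only partly overlap; it is meant as an illustration under those assumptions, not as the only way to combine columns.

```python
import pandas as pd

# Two Series with partly different row labels (the labels are made up).
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([4, 5], index=["y", "z"])

# Series are matched by row label, not by position. The result covers the
# union of the labels, and missing combinations become NaN.
df_ab = pd.DataFrame({"a": a, "b": b})
print(df_ab)

# Lists and NumPy arrays carry no labels, so they are placed by position
# and must all have the same length.
df_plain = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
print(df_plain)
```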

#### 3. **Using Rows:**
   - Create a DataFrame from a list of rows using dictionaries, lists, Series, or NumPy arrays.
     ```python
     df_dict = pd.DataFrame([{"Wage": 1000, "Name": "Jack", "Age": 21}, {"Wage": 1500, "Name": "John", "Age": 29}])
     ```

   - Alternatively, provide column names explicitly:
     ```python
     df_list = pd.DataFrame([[1000, "Jack", 21], [1500, "John", 29]], columns=["Wage", "Name", "Age"])
     ```

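   - Note: When rows are given as lists, values are assigned to columns by position, so their order must match the `columns` argument. When rows are given as dictionaries, values are matched to columns by key.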
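
The two row-based constructions can be compared side by side. In this sketch the ordering `wanted_order` is an arbitrary illustrative choice; the point is only that `columns` selects by key for dictionaries and names positions for lists.

```python
import pandas as pd

rows_as_dicts = [
    {"Wage": 1000, "Name": "Jack", "Age": 21},
    {"Wage": 1500, "Name": "John", "Age": 29},
]
rows_as_lists = [[1000, "Jack", 21], [1500, "John", 29]]
wanted_order = ["Name", "Age", "Wage"]  # illustrative ordering

# Dictionaries are matched to columns by key, so `columns` only reorders them.
from_dicts = pd.DataFrame(rows_as_dicts, columns=wanted_order)

# Lists are matched by position, so `columns` must name the values in the
# order they appear; the columns are reordered afterwards by selection.
from_lists = pd.DataFrame(rows_as_lists, columns=["Wage", "Name", "Age"])[wanted_order]

print(from_dicts)
print(from_lists)
```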

#### Additional Notes:
- In Pandas, maintaining column order can enhance readability and convey semantic meaning.
- Choosing appropriate methods for DataFrame creation depends on the available data and desired structure.

In [3]:
"""
Exercise 4.1 (cities)
Write function cities that returns the following DataFrame of top Finnish cities by population:

          Population  Total area
Helsinki      643272      715.48
Espoo         279044      528.03
Tampere       231853      689.59
Vantaa        223027      240.35
Oulu          201810     3817.52
"""
import pandas as pd

def cities():
    # Row labels (city names) and one [population, total area] pair per city.
    indices = ["Helsinki", "Espoo", "Tampere", "Vantaa", "Oulu"]
    data = [
        [643272, 715.48],
        [279044, 528.03],
        [231853, 689.59],
        [223027, 240.35],
        [201810, 3817.52],
    ]
    return pd.DataFrame(data, columns=["Population", "Total area"], index=indices)
   
    
def main():
    df = cities()
    print(df.columns)
    print(df.index)
    print(df)
    # print(df.dtypes)  # uncomment to inspect the column types

main()

Index(['Population', 'Total area'], dtype='object')
Index(['Helsinki', 'Espoo', 'Tampere', 'Vantaa', 'Oulu'], dtype='object')
          Population  Total area
Helsinki      643272      715.48
Espoo         279044      528.03
Tampere       231853      689.59
Vantaa        223027      240.35
Oulu          201810     3817.52
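
The same table can also be assembled column by column, which ties back to the dictionary form from the notes. This is an alternative sketch, not the exercise's reference solution; the names `names`, `columns` and `cities_by_column` are illustrative.

```python
import pandas as pd

# Same figures as in the exercise, arranged column by column.
names = ["Helsinki", "Espoo", "Tampere", "Vantaa", "Oulu"]
columns = {
    "Population": [643272, 279044, 231853, 223027, 201810],
    "Total area": [715.48, 528.03, 689.59, 240.35, 3817.52],
}
cities_by_column = pd.DataFrame(columns, index=names)

# Each column keeps its own dtype: integers for population, floats for area.
print(cities_by_column.dtypes)
print(cities_by_column)
```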


In [33]:
"""
Exercise 4.2 (powers of series)
Make function powers_of_series that takes a Series and a positive integer k as parameters and returns a DataFrame.
The resulting DataFrame should have the same index as the input Series. 

The first column of the DataFrame should be the input Series, 
the second column should contain the Series raised to the power of two. 
The third column should contain the Series raised to the power of three, 
and so on until (and including) the power of k. 
The columns should have indices from 1 to k.

The values should be numbers, but the index can have any type. 
Test your function from the main function. Example of usage:

s = pd.Series([1,2,3,4], index=list("abcd"))
print(powers_of_series(s, 3))
Should print:

   1   2   3
a  1   1   1
b  2   4   8
c  3   9  27
d  4  16  64
"""

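import pandas as pd

def powers_of_series(s, k):
    # Column 1 is the input Series itself. to_frame(name=1) sets the column
    # label directly, so it also works when the Series already has a name.
    data = s.to_frame(name=1)

    # Columns 2..k hold the Series raised to the corresponding power.
    for power in range(2, k + 1):
        data[power] = s ** power

    return data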

"""
def powers_of_series(s, k):
    c=[ s**i for i in range(1,k+1) ]
    df = pd.DataFrame(dict(zip(range(1,k+1), c)))
    return df
"""
def main():
    s = pd.Series([1,2,3,4], index=list("abcd"))
    k = 3
    df = powers_of_series(s, k)
    print(df)
    
main()

   1   2   3
a  1   1   1
b  2   4   8
c  3   9  27
d  4  16  64
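
A compact variant builds all columns at once from a dictionary. The function name `powers_via_dict` is hypothetical, and the input Series is given a name here only to show that the column labels come from the dictionary keys, so the name of the Series plays no role.

```python
import pandas as pd

def powers_via_dict(s, k):
    """Hypothetical variant: one dictionary entry per exponent 1..k."""
    return pd.DataFrame({power: s ** power for power in range(1, k + 1)})

# A named Series with a non-integer index, as allowed by the exercise.
s = pd.Series([1, 2, 3, 4], index=list("abcd"), name="base")
result = powers_via_dict(s, 3)
print(result)
```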


In [44]:

"""
Exercise 4.3 (municipal information)
In the main function load a data set of municipal information from the src folder (originally from Statistics Finland).
Use the function pd.read_csv, and note that the separator is a tab character.

Print the shape of the DataFrame (number of rows and columns) and the column names in the following format:

Shape: r,c
Columns:
col1 
col2
...
Note, sometimes file ending tsv (tab separated values) is used instead of csv if the separator is a tab.
"""
import pandas as pd

def main():

    # the separator is a tab character (\t)
    data_frame = pd.read_csv("part04-e03_municipal_information/src/municipal.tsv", sep='\t')
    #print(data_frame.head()) # print the first five rows
    r, c = data_frame.shape

    print(f"Shape: {r},{c}")
    print("Columns:")
    for col in data_frame.columns.tolist():
        print(col)
"""
def main():
    df = pd.read_csv("src/municipal.tsv", sep="\t")
    print("Shape: {}, {}".format(*df.shape))
    print("Columns:")
    for name in df.columns:
        print(name)
"""
main()

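Note: the output below does not match the code above. It appears to come from an earlier run that printed `shape` and `columns` directly and read the file with the default comma separator, so the header was split at the commas inside the column names rather than at the tabs.

Shape:  (490, 6)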
Columns: 
Index(['Region 2018\t"Population"\t"Population change from the previous year',
       ' %"\t"Share of Swedish-speakers of the population',
       ' %"\t"Share of foreign citizens of the population',
       ' %"\t"Proportion of the unemployed among the labour force',
       ' %"\t"Proportion of pensioners of the population', ' %"'],
      dtype='object')
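
The effect of the separator can be reproduced without the data file. The sketch below reads a made-up two-row sample from memory with `io.StringIO`; the sample text and the names `sample`, `wrong` and `right` are assumptions for illustration only.

```python
import io
import pandas as pd

# Made-up sample in the same layout: tab-separated fields, with a comma
# inside a quoted column name.
sample = (
    'Region\t"Population"\t"Change, %"\n'
    "A\t100\t-0.5\n"
    "B\t200\t0.3\n"
)

# Default separator (comma): the header is cut at the comma, not at the tabs.
wrong = pd.read_csv(io.StringIO(sample))
print(wrong.columns)

# Tab separator: three proper columns.
right = pd.read_csv(io.StringIO(sample), sep="\t")

# Print in the format requested by the exercise.
rows, cols = right.shape
print(f"Shape: {rows},{cols}")
print("Columns:")
for name in right.columns:
    print(name)
```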


- **Accessing Elements in a DataFrame:**
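  - A DataFrame is a two-dimensional structure, but its elements are accessed differently from those of a NumPy array.
  - The bracket notation `[]` addresses only one dimension at a time; an index pair such as `df[0, 1]` does not select a single element.
  - The examples below assume that `df` is the wage DataFrame built from rows in section 3.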

- **Accessing Columns:**
  - A single column label within the brackets selects that column as a Series.
    - Example: `df["Wage"]`
      ```
      0    1000
      1    1500
      Name: Wage, dtype: int64
      ```

  - Fancy indexing with a list of labels selects multiple columns and returns a DataFrame.
    - Example: `df[["Wage", "Name"]]`
      ```
         Wage  Name
      0  1000  Jack
      1  1500  John
      ```

- **Accessing Rows:**
  - A slice or a boolean mask inside the brackets selects rows.
    - Example: 
      - Slice: `df[0:1]`
        ```
           Wage  Name  Age
        0  1000  Jack   21
        ```
      - Boolean mask: `df[df.Wage > 1200]`
        ```
           Wage  Name  Age
        1  1500  John   29
        ```

- **Chaining for Single Value:**
  - Because selecting a column returns a Series, a second pair of brackets can pick a single value from it by row label.
    - Example: `df["Wage"][1]`
      ```
      1500
      ```

- **Note on Indexing with Integers:**
  - An integer inside the brackets is looked up among the column labels, so it selects a column only if that integer is an explicit column label. Otherwise a `KeyError` is raised.
    - Example: 
      ```python
      import sys

      try:
          df[0]
      except KeyError:
          print("Key error", file=sys.stderr)
      ```
      Output:
      ```
      Key error
      ```

- **Better Approach for Single Value Retrieval:**
  - There's a more efficient way to retrieve a single value, which will be discussed in the next section.
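
The access rules above can be tried in one self-contained run. This sketch rebuilds the wage DataFrame under the name `df`; the result names (`wage`, `subset`, `first_row`, `well_paid`, `integer_lookup_failed`) are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[1000, "Jack", 21], [1500, "John", 29]],
    columns=["Wage", "Name", "Age"],
)

wage = df["Wage"]                  # single label: one column, as a Series
subset = df[["Wage", "Name"]]      # list of labels: DataFrame with two columns
first_row = df[0:1]                # slice: rows by position
well_paid = df[df["Wage"] > 1200]  # boolean mask: rows where the mask is True

# An integer in brackets is looked up among the column labels, so it fails
# here because the columns are labelled with strings.
try:
    df[0]
    integer_lookup_failed = False
except KeyError:
    integer_lookup_failed = True

print(wage)
print(subset)
print(first_row)
print(well_paid)
print(integer_lookup_failed)
```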

In [76]:
"""
Exercise 4.4 (municipalities of finland)
Load again the municipal information DataFrame. 
The rows of the DataFrame correspond to various geographical areas of Finland. 
The first row is about Finland as a whole, then rows from Akaa to Äänekoski are 
municipalities of Finland in alphabetical order. 
After that some larger regions are listed.

Write function municipalities_of_finland that returns a DataFrame containing only rows about municipalities.

Give an appropriate argument for pd.read_csv so that it interprets the column about region name as the (row) index. 
This way you can index the DataFrame with the names of the regions.

Test your function from the main function.

"""
import pandas as pd

def municipalities_of_finland():
    data_frame = pd.read_csv("part04-e04_municipalities_of_finland/src/municipal.tsv", sep='\t', index_col='Region 2018')
    
    # Label-based slice with .loc; both endpoints are included.
    return data_frame.loc['Akaa':'Äänekoski']
    
    
def main():
    result = municipalities_of_finland()
    print(result)
main()

             Population  Population change from the previous year, %  \
Region 2018                                                            
Akaa              16769                                         -0.9   
Alajärvi           9831                                         -0.7   
Alavieska          2610                                         -1.1   
Alavus            11713                                         -1.6   
Asikkala           8248                                         -0.9   
...                 ...                                          ...   
Ylivieska         15251                                          0.3   
Ylöjärvi          32878                                          0.2   
Ypäjä              2372                                         -0.4   
Ähtäri             5906                                         -1.3   
Äänekoski         19144                                         -1.2   

             Share of Swedish-speakers of the population, %  \
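
The combination of `index_col` and a label slice can be checked on a small in-memory table. The sample below is made up to mimic the layout described in the exercise (a whole-country row, some municipalities, then a larger region); none of its names or numbers come from the real data set.

```python
import io
import pandas as pd

# Made-up miniature of the layout: country row, municipalities, larger region.
sample = (
    "Region\tPopulation\n"
    "Country\t1000\n"
    "Alpha\t10\n"
    "Beta\t20\n"
    "Gamma\t30\n"
    "Big region\t60\n"
)

# index_col makes the region names the row labels.
regions = pd.read_csv(io.StringIO(sample), sep="\t", index_col="Region")

# A label slice with .loc includes both endpoints, unlike a positional slice.
municipalities = regions.loc["Alpha":"Gamma"]
print(municipalities)
```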


In [107]:
"""
Exercise 4.5 (swedish and foreigners)
Write function swedish_and_foreigners that

- Reads the municipalities data set
- Takes the subset about municipalities (like in the previous exercise)
- Further takes the subset of rows where the proportion of Swedish-speaking people and
  the proportion of foreigners are both above the 5 % level
- From this data set takes only the columns about population and
  the proportions of Swedish-speaking people and foreigners, that is, three columns.

The function should return this final DataFrame.

Do you see some kind of correlation between the columns about Swedish speaking and foreign people? 
Do you see correlation between the columns about the population and 
the proportion of Swedish speaking people in this subset?
"""

import pandas as pd
def municipalities_of_finland(path):
    data_frame = pd.read_csv(path, sep='\t', index_col='Region 2018')
    # Keep only the municipality rows, 'Akaa' through 'Äänekoski' (inclusive).
    return data_frame.loc['Akaa':'Äänekoski']

def swedish_and_foreigners():
    path = "part04-e05_swedish_and_foreigners/src/municipal.tsv"
    df = municipalities_of_finland(path)

    # Keep rows where the share of Swedish speakers and the share of
    # foreign citizens are both above the 5 % level.
    swedish_speaking_percent = "Share of Swedish-speakers of the population, %"
    foreign_citizens_percent = "Share of foreign citizens of the population, %"
    percent = 5
    condition = (df[swedish_speaking_percent] > percent) & (df[foreign_citizens_percent] > percent)
    five_percent = df[condition]

    # Return only the three columns the exercise asks for.
    return five_percent[["Population", swedish_speaking_percent, foreign_citizens_percent]]


def main():
    result = swedish_and_foreigners()
    print(result)
main()

               Population  Share of Swedish-speakers of the population, %  \
Region 2018                                                                 
Brändö                452                                            72.6   
Eckerö                948                                            89.7   
Espoo              279044                                             7.2   
Finström             2580                                            89.8   
Föglö                 532                                            84.2   
Geta                  495                                            86.9   
Hammarland           1547                                            89.7   
Helsinki           643272                                             5.7   
Jomala               4859                                            89.1   
Kaskinen             1274                                            29.9   
Kirkkonummi         39170                                            16.6   
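
The filtering step and the correlation questions can be explored on a toy table. All numbers and the shortened column names below are made up for illustration; `corr()` is offered as one possible way to put a number on the "do you see a correlation" questions, not as the expected answer.

```python
import pandas as pd

# Made-up numbers; the column names are shortened stand-ins for the real ones.
toy = pd.DataFrame(
    {
        "Population": [500, 900, 280000, 2600, 40000],
        "Swedish, %": [72.6, 3.0, 7.2, 89.8, 16.6],
        "Foreign, %": [6.1, 8.0, 4.0, 5.5, 5.2],
    },
    index=["A", "B", "C", "D", "E"],
)

# Each comparison needs its own parentheses because & binds tighter than >.
mask = (toy["Swedish, %"] > 5) & (toy["Foreign, %"] > 5)

# Select the matching rows and the three columns in one step.
selected = toy.loc[mask, ["Population", "Swedish, %", "Foreign, %"]]
print(selected)

# Pairwise Pearson correlations between the three columns.
print(selected.corr())
```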