### Creation of DataFrames in Pandas

#### 1. **Using a NumPy Array:**
   - Create a DataFrame from a 2D NumPy array with specified row and column indices.
     ```python
     df = pd.DataFrame(np.random.randn(2, 3), columns=["First", "Second", "Third"], index=["a", "b"])
     ```

   - Access row and column indices using the special Index object:
     ```python
     df.index  # Row indices
     df.columns  # Column indices
     ```

   - If column indices are not provided, implicit integer indices (0, 1, 2, ...) are used for the columns:
     ```python
     df2 = pd.DataFrame(np.random.randn(2, 3), index=["a", "b"])
     ```
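The default indices can be checked directly. The following is a minimal sketch, assuming only NumPy and pandas are available; the names `df_default` and `df_rows_only` are made up for illustration and do not appear in the notes.

```python
import numpy as np
import pandas as pd

# Neither index nor columns given: both axes fall back to 0, 1, 2, ...
df_default = pd.DataFrame(np.zeros((2, 3)))
print(list(df_default.index))      # [0, 1]
print(list(df_default.columns))    # [0, 1, 2]

# Only the row index is given: the columns are still implicit integers
df_rows_only = pd.DataFrame(np.zeros((2, 3)), index=["a", "b"])
print(list(df_rows_only.index))    # ['a', 'b']
print(list(df_rows_only.columns))  # [0, 1, 2]
```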

#### 2. **Using Columns:**
   - Create a DataFrame using columns specified as lists, NumPy arrays, or Pandas Series.
     ```python
     s1 = pd.Series([1, 2, 3])
     s2 = pd.Series([4, 5, 6], name="b")
     ```

   - Create a DataFrame specifying column names explicitly:
     ```python
     pd.DataFrame(s1, columns=["a"])
     ```

   - Use the name attribute of Series as the column name:
     ```python
     pd.DataFrame(s2)
     ```

   - For multiple columns, use a dictionary where keys are column names, and values are column content:
     ```python
     pd.DataFrame({"a": s1, "b": s2})
     ```
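As a rough illustration of the dictionary form, the sketch below mixes a list, a NumPy array, and a Series as column values, and then shows that Series are matched by index label rather than by position. The variable names and the shifted index are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Column values may be lists, NumPy arrays, or Series; the keys become column names
df_cols = pd.DataFrame(
    {"a": [1, 2, 3], "b": np.array([4, 5, 6]), "c": pd.Series([7, 8, 9])}
)
print(df_cols)

# Series are aligned on their index labels, not on position
s_shifted = pd.Series([10, 20, 30], index=[1, 2, 3])
df_aligned = pd.DataFrame({"a": pd.Series([1, 2, 3]), "d": s_shifted})
print(df_aligned)  # row 0 has NaN in "d", row 3 has NaN in "a"
```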

#### 3. **Using Rows:**
   - Create a DataFrame from a list of rows using dictionaries, lists, Series, or NumPy arrays.
     ```python
     df_dict = pd.DataFrame([{"Wage": 1000, "Name": "Jack", "Age": 21}, {"Wage": 1500, "Name": "John", "Age": 29}])
     ```

   - Alternatively, provide column names explicitly:
     ```python
     df_list = pd.DataFrame([[1000, "Jack", 21], [1500, "John", 29]], columns=["Wage", "Name", "Age"])
     ```

   - Note: When a DataFrame is created from a dictionary of columns, the columns appear in the same order as the dictionary keys (in recent versions of Python and pandas).
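When rows are given as dictionaries, the keys do not have to match from row to row. A small sketch, reusing the wage example with one key deliberately left out (the omission is our assumption, not part of the original data):

```python
import pandas as pd

rows = [
    {"Wage": 1000, "Name": "Jack", "Age": 21},
    {"Wage": 1500, "Name": "John"},  # "Age" is missing in this row
]
df_people = pd.DataFrame(rows)
print(df_people)

# The missing key shows up as NaN in the "Age" column
print(df_people["Age"].isna().tolist())  # [False, True]
```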

#### Additional Notes:
- In Pandas, maintaining column order can enhance readability and convey semantic meaning.
- Choosing appropriate methods for DataFrame creation depends on the available data and desired structure.

In [3]:
"""
Exercise 4.1 (cities)
Write function cities that returns the following DataFrame of top Finnish cities by population:

Population Total area
Helsinki         643272     715.48
Espoo            279044     528.03
Tampere          231853     689.59
Vantaa           223027     240.35
Oulu             201810     3817.52
"""
import pandas as pd

def cities():
    index = ["Helsinki", "Espoo", "Tampere", "Vantaa", "Oulu"]
    data = [
        [643272, 715.48],
        [279044, 528.03],
        [231853, 689.59],
        [223027, 240.35],
        [201810, 3817.52]
        ]
    return pd.DataFrame(data, columns=["Population", "Total area"], index=index)
   
    
def main():
    df = cities()
    cols=df.columns
    ind=df.index
    print(cols)
    print(ind)
    print(df)
    return
"""   df = cities()
    print(df.dtypes)
    print(df)
    """
main()

Index(['Population', 'Total area'], dtype='object')
Index(['Helsinki', 'Espoo', 'Tampere', 'Vantaa', 'Oulu'], dtype='object')
          Population  Total area
Helsinki      643272      715.48
Espoo         279044      528.03
Tampere       231853      689.59
Vantaa        223027      240.35
Oulu          201810     3817.52


In [33]:
"""
Exercise 4.2 (powers of series)
Make function powers_of_series that takes a Series and a positive integer k as parameters and returns a DataFrame.
The resulting DataFrame should have the same index as the input Series. 

The first column of the dataFrame should be the input Series, 
the second column should contain the Series raised to power of two. 
The third column should contain the Series raised to the power of three, 
and so on until (and including) power of k. 
The columns should have indices from 1 to k.

The values should be numbers, but the index can have any type. 
Test your function from the main function. Example of usage:

s = pd.Series([1,2,3,4], index=list("abcd"))
print(powers_of_series(s, 3))
Should print:

   1   2   3
a  1   1   1
b  2   4   8
c  3   9  27
d  4  16  64
"""

import pandas as pd
import numpy as np
def powers_of_series(s, k):

    start = 1
    # create pd dataframes here 
    data = pd.DataFrame(s, columns=[start])
    

    for number in range(start, k+1):
      
        data[number] = data[start] ** number

    return data

"""
def powers_of_series(s, k):
    c=[ s**i for i in range(1,k+1) ]
    df = pd.DataFrame(dict(zip(range(1,k+1), c)))
    return df
"""
def main():
    s = pd.Series([1,2,3,4], index=list("abcd"))
    k = 3
    df = powers_of_series(s, k)
    print(df)
    
main()

   1   2   3
a  1   1   1
b  2   4   8
c  3   9  27
d  4  16  64


In [44]:

"""
Exercise 4.3 (municipal information)
In the main function load a data set of municipal information from the src folder (originally from Statistics Finland).
Use the function pd.read_csv, and note that the separator is a tabulator.

Print the shape of the DataFrame (number of rows and columns) and the column names in the following format:

Shape: r,c
Columns:
col1 
col2
...
Note, sometimes file ending tsv (tab separated values) is used instead of csv if the separator is a tab.
"""
import pandas as pd

def main():

    # the separator is a tab character (\t)
    data_frame = pd.read_csv("part04-e03_municipal_information/src/municipal.tsv", sep='\t')
    #print(data_frame.head()) # print the first five rows
    r, c = data_frame.shape

    print(f"Shape: {r},{c}")
    print("Columns:")
    for col in data_frame.columns.tolist():
        print(col)
"""
def main():
    df = pd.read_csv("src/municipal.tsv", sep="\t")
    print("Shape: {}, {}".format(*df.shape))
    print("Columns:")
    for name in df.columns:
        print(name)
"""
main()

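(The output below comes from an earlier run that read the file with the default comma separator, which is why tab characters appear inside the column names.)
Shape:  (490, 6)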
Columns: 
Index(['Region 2018\t"Population"\t"Population change from the previous year',
       ' %"\t"Share of Swedish-speakers of the population',
       ' %"\t"Share of foreign citizens of the population',
       ' %"\t"Proportion of the unemployed among the labour force',
       ' %"\t"Proportion of pensioners of the population', ' %"'],
      dtype='object')


- **Accessing Elements in a DataFrame:**
  - A DataFrame is essentially a two-dimensional array, but its elements are accessed differently from those of a NumPy array.
  - The bracket notation `[]` allows access to only one dimension at a time.

- **Accessing Columns:**
  - A single column label within the brackets selects that column and returns it as a Series.
    - Example: `df["Wage"]`
      ```
      0    1000
      1    1500
      Name: Wage, dtype: int64
      ```

  - Fancy indexing can be used for multiple columns.
    - Example: `df[["Wage", "Name"]]`
      ```
         Wage  Name
      0  1000  Jack
      1  1500  John
      ```

- **Accessing Rows:**
  - Using a slice or boolean mask refers to rows.
    - Example: 
      - Slice: `df[0:1]`
        ```
           Wage  Name  Age
        0  1000  Jack   21
        ```
      - Boolean mask: `df[df.Wage > 1200]`
        ```
           Wage  Name  Age
        1  1500  John   29
        ```

- **Chaining for Single Value:**
  - When a Series object is returned, chaining bracket calls can extract a single value.
    - Example: `df["Wage"][1]`
      ```
      1500
      ```

- **Note on Indexing with Integers:**
  - If an integer is used for indexing, it specifies a column only if it matches explicit column indices.
    - Example: 
      ```python
      try:
          df[0]
      except KeyError:
          import sys
          print("Key error", file=sys.stderr)
      ```
      Output:
      ```
      Key error
      ```

- **Better Approach for Single Value Retrieval:**
  - There's a more efficient way to retrieve a single value, which will be discussed in the next section.
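
The bracket rules above can be collected into one short sketch. It rebuilds the two-row wage DataFrame from the creation section; the variable names on the left are ours.

```python
import pandas as pd

df = pd.DataFrame(
    [[1000, "Jack", 21], [1500, "John", 29]],
    columns=["Wage", "Name", "Age"],
)

wages = df["Wage"]                 # one label -> a column, returned as a Series
subset = df[["Wage", "Name"]]      # list of labels -> a DataFrame with those columns
first_row = df[0:1]                # slice -> rows, selected by position
high_wage = df[df["Wage"] > 1200]  # boolean mask -> rows where the mask is True

print(type(wages).__name__)        # Series
print(subset.shape, first_row.shape, high_wage.shape)  # (2, 2) (1, 3) (1, 3)
```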

In [76]:
"""
Exercise 4.4 (municipalities of finland)
Load again the municipal information DataFrame. 
The rows of the DataFrame correspond to various geographical areas of Finland. 
The first row is about Finland as a whole, then rows from Akaa to Äänekoski are 
municipalities of Finland in alphabetical order. 
After that some larger regions are listed.

Write function municipalities_of_finland that returns a DataFrame containing only rows about municipalities.

Give an appropriate argument for pd.read_csv so that it interprets the column about region name as the (row) index. 
This way you can index the DataFrame with the names of the regions.

Test your function from the main function.

"""
import pandas as pd

def municipalities_of_finland():
    data_frame = pd.read_csv("part04-e04_municipalities_of_finland/src/municipal.tsv", sep='\t', index_col='Region 2018')
    
    #return data_frame.loc['Akaa':'Äänekoski']
    return data_frame['Akaa':'Äänekoski']
    
    
def main():
    result = municipalities_of_finland()
    print(result)
main()

             Population  Population change from the previous year, %  \
Region 2018                                                            
Akaa              16769                                         -0.9   
Alajärvi           9831                                         -0.7   
Alavieska          2610                                         -1.1   
Alavus            11713                                         -1.6   
Asikkala           8248                                         -0.9   
...                 ...                                          ...   
Ylivieska         15251                                          0.3   
Ylöjärvi          32878                                          0.2   
Ypäjä              2372                                         -0.4   
Ähtäri             5906                                         -1.3   
Äänekoski         19144                                         -1.2   

             Share of Swedish-speakers of the population, %  \


In [107]:
"""
Exercise 4.5 (swedish and foreigners)
Write function swedish_and_foreigners that

Reads the municipalities data set
Takes the subset about municipalities (like in previous exercise)

Further take a subset of rows that have proportion of Swedish speaking people and 
proportion of foreigners both above 5 % level

From this data set take only columns about population, 
the proportions of Swedish speaking people and foreigners, that is three columns.
The function should return this final DataFrame.

Do you see some kind of correlation between the columns about Swedish speaking and foreign people? 
Do you see correlation between the columns about the population and 
the proportion of Swedish speaking people in this subset?
"""

import pandas as pd
def municipalities_of_finland(path):
    data_frame = pd.read_csv(path, sep='\t', index_col='Region 2018')
    # extract the municipalities of Finland: rows 'Akaa' through 'Äänekoski'
    finland_municipal = data_frame['Akaa':'Äänekoski']
    return finland_municipal

def swedish_and_foreigners():
    path = "part04-e05_swedish_and_foreigners/src/municipal.tsv"
    df = municipalities_of_finland(path)

    # Swedish speaking people and 
    # proportion of foreigners both above 5 % level
    swedish_speaking_percent = "Share of Swedish-speakers of the population, %"
    foreign_citizens_percent = "Share of foreign citizens of the population, %"
    percent = 5
    condition = (df[swedish_speaking_percent] > percent) & (df[foreign_citizens_percent] > percent)
    five_percent = df[condition]

    return (five_percent[["Population",swedish_speaking_percent,foreign_citizens_percent]])


def main():
    result = swedish_and_foreigners()
    print(result)
main()

               Population  Share of Swedish-speakers of the population, %  \
Region 2018                                                                 
Brändö                452                                            72.6   
Eckerö                948                                            89.7   
Espoo              279044                                             7.2   
Finström             2580                                            89.8   
Föglö                 532                                            84.2   
Geta                  495                                            86.9   
Hammarland           1547                                            89.7   
Helsinki           643272                                             5.7   
Jomala               4859                                            89.1   
Kaskinen             1274                                            29.9   
Kirkkonummi         39170                                            16.6   

In [119]:
"""
Exercise 4.6 (growing municipalities)
Write function growing_municipalities that gets subset of municipalities (a DataFrame) as a parameter and 
returns the proportion of municipalities with increasing population in that subset.

Test your function from the main function using some subset of the municipalities. 
Print the proportion as percentages using 1 decimal precision.

Example output:

Proportion of growing municipalities: 12.4%
"""
import pandas as pd

def growing_municipalities(df):
    # returns the proportion of municipalities with increasing population in that subset.
    # check change of population from above 0
    increasing_pop = df[df["Population change from the previous year, %"]>0]
    value = (len(increasing_pop) / len(df))
    return value

def main():
    path = "src/municipal.tsv"
    df = pd.read_csv(path, sep='\t', index_col='Region 2018')
    # extract the municipalities of Finland: rows 'Akaa' through 'Äänekoski'
    finland_municipal = df['Akaa':'Äänekoski']

    growing_proportion = growing_municipalities(finland_municipal)
    # growing_proportion is a fraction between 0 and 1; the % format multiplies it by 100
    print(f"Proportion of growing municipalities: {growing_proportion:.1%}")
main()


"""def growing_municipalities(df):
    c="Population change from the previous year, %"
    n = len(df)
    k = sum(df[c] > 0.0)
    return k / n
 
def main():
    df = pd.read_csv("src/municipal.tsv", index_col=0, sep="\t")
    df = df["Akaa":"Äänekoski"]
    proportion = growing_municipalities(df)
    print(f"Proportion of growing municipalities: {proportion:.1%}")
    """

Proportion of growing municipalities: 47.9%


### Loc and Iloc Attributes in Pandas DataFrames

#### 1. **Overview:**
   - Pandas DataFrames offer two primary attributes for indexing and data selection: `loc` and `iloc`.
   - These attributes provide an alternative to the methods discussed in the previous section.

#### 2. **Functionality:**
   - **`loc` Attribute:**
     - Uses explicit indices for both rows and columns.
     - Allows the use of index pairs to access a single element.
     - Example:
       ```python
       df.loc[1, "Wage"]
       # Output: 1500
       ```
     - Example with multiple columns:
       ```python
       df.loc[1, ["Name", "Wage"]]
       # Output:
       # Name    John
       # Wage    1500
       # Name: 1, dtype: object
       ```

   - **`iloc` Attribute:**
     - Uses implicit integer indices for both rows and columns.
     - Mimics the behavior of NumPy arrays regarding indexing, slicing, fancy indexing, masking, and their combinations.
     - Example:
       ```python
       df.iloc[-1, -1]  # Right lower corner of the DataFrame
       # Output: 29
       ```

#### 3. **Order of Dimensions:**
   - Both `loc` and `iloc` maintain the same order of dimensions as NumPy arrays.
   - The first index specifies rows, and the second index specifies columns.

#### 4. **Usage Recommendations:**
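   - Use `loc` when selecting by labels (explicit indices) and `iloc` when selecting by position (implicit indices).
   - The two are not interchangeable in general: a `loc` slice includes its end label, whereas an `iloc` slice excludes its end position. The bracket shortcuts from the previous section are sometimes more convenient.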

#### 5. **Compatibility:**
   - `loc` and `iloc` attributes provide a clear and unambiguous way of data selection, avoiding confusion about implicit or explicit indices.

#### 6. **Comparison Example:**
   - Illustrative example comparing `loc` and `iloc`:
     ```python
     # Using iloc
     df.iloc[0:2, 1:3]  # Rows 0 to 1, Columns 1 to 2
     # Equivalent using loc
     df.loc[df.index[0:2], df.columns[1:3]]
     ```

#### 7. **Note:**
   - Ensure a clear understanding of why specific examples work as they do to utilize `loc` and `iloc` effectively.

By leveraging `loc` and `iloc` attributes, you can enhance the clarity and simplicity of data selection in Pandas DataFrames.
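
A minimal sketch of the difference between label-based and position-based slicing. The wage DataFrame is extended with a hypothetical third row ("Jill") and string row labels so that labels and positions no longer coincide.

```python
import pandas as pd

df = pd.DataFrame(
    [[1000, "Jack", 21], [1500, "John", 29], [1200, "Jill", 35]],
    columns=["Wage", "Name", "Age"],
    index=["a", "b", "c"],
)

by_label = df.loc["a":"b", "Wage":"Name"]  # both end labels are included
by_position = df.iloc[0:2, 0:2]            # end positions are excluded

print(by_label.equals(by_position))         # True
print(df.loc["c", "Age"], df.iloc[-1, -1])  # 35 35
```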

In [125]:
import pandas as pd
"""Exercise 4.7 (subsetting with loc)
Write function subsetting_with_loc that in one go takes 
the subset of municipalities from Akaa to Äänekoski and restricts it to columns: 
"Population", "Share of Swedish-speakers of the population, %", and "Share of foreign citizens of the population, %". 

The function should return this content as a DataFrame. 
Use the attribute loc.
"""
def subsetting_with_loc():
    df = pd.read_csv("part04-e07_subsetting_with_loc/src/municipal.tsv", sep='\t', index_col='Region 2018')
    # select the rows and the columns in a single loc call, as the exercise asks
    return df.loc['Akaa':'Äänekoski',
                  ["Population",
                   "Share of Swedish-speakers of the population, %",
                   "Share of foreign citizens of the population, %"]]
"""def subsetting_with_loc():
    df = pd.read_csv("src/municipal.tsv", index_col=0, sep="\t")
    df = df.loc["Akaa":"Äänekoski", ["Population",
                                     "Share of Swedish-speakers of the population, %",
                                     "Share of foreign citizens of the population, %"]]
    return df
"""
def main():
    df = subsetting_with_loc()
    print(df)
main()


             Population  Share of Swedish-speakers of the population, %  \
Region 2018                                                               
Akaa              16769                                             0.2   
Alajärvi           9831                                             0.1   
Alavieska          2610                                             0.2   
Alavus            11713                                             0.1   
Asikkala           8248                                             0.2   
...                 ...                                             ...   
Ylivieska         15251                                             0.3   
Ylöjärvi          32878                                             0.3   
Ypäjä              2372                                             0.7   
Ähtäri             5906                                             0.1   
Äänekoski         19144                                             0.1   

             Share of fo

In [139]:
"""
Exercise 4.8 (subsetting by positions)
Write function subsetting_by_positions that does the following.

Read the data set of the top forty singles from the beginning of the year 1964 from the src folder. 

Return the top 10 entries and only the columns Title and Artist. 
Get these elements by their positions, that is, by using a single call to the iloc attribute. 
The function should return these as a DataFrame.

"""

import pandas as pd

def subsetting_by_positions():
    #return pd.DataFrame()
    df = pd.read_csv("part04-e08_subsetting_by_positions/src/UK-top40-1964-1-2.tsv", sep='\t')
    # top 10 with Title and Artist
    return (df.iloc[0:10, [2,3]])
 
"""def subsetting_by_positions():
    df = pd.read_csv("src/UK-top40-1964-1-2.tsv", sep="\t")
    return df.iloc[:10,2:4]"""
def main():
    result = subsetting_by_positions()
    print(result)
main()

                          Title                    Artist
0      I WANT TO HOLD YOUR HAND               THE BEATLES
1                 GLAD ALL OVER       THE DAVE CLARK FIVE
2                 SHE LOVES YOU               THE BEATLES
3          YOU WERE MADE FOR ME  FREDDIE AND THE DREAMERS
4  TWENTY FOUR HOURS FROM TULSA               GENE PITNEY
5    I ONLY WANT TO BE WITH YOU         DUSTY SPRINGFIELD
6                     DOMINIQUE           THE SINGING NUN
7                   MARIA ELENA      LOS INDIOS TABAJARAS
8                   SECRET LOVE               KATHY KIRBY
9             DON'T TALK TO HIM             CLIFF RICHARD


**Summary Statistics in Pandas:**

1. **Mean Calculation:**
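   - The `mean()` method calculates the mean (average) of each numeric column in the DataFrame. Judging by the output, `wh2` here is the weather DataFrame `wh` without the `Year`, `m`, and `d` columns.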
   - Example:
     ```python
     wh2.mean()
     ```
     Output:
     ```
     Precipitation amount (mm)    1.966301
     Snow depth (cm)              0.966480
     Air temperature (degC)       6.527123
     dtype: float64
     ```

2. **Describe Method:**
   - The `describe()` method provides various summary statistics for each numeric column in the DataFrame.
   - It includes count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum.
   - Example:
     ```python
     wh.describe()
     ```
     Output:
     ```
              Year           m           d  Precipitation amount (mm)  Snow depth (cm)  Air temperature (degC)
     count   365.0  365.000000  365.000000                 365.000000       358.000000              365.000000
     mean   2017.0    6.526027   15.720548                   1.966301         0.966480                6.527123
     std       0.0    3.452584    8.808321                   4.858423         3.717472                7.183934
     min    2017.0    1.000000    1.000000                  -1.000000        -1.000000              -17.800000
     25%    2017.0    4.000000    8.000000                  -1.000000        -1.000000                1.200000
     50%    2017.0    7.000000   16.000000                   0.200000        -1.000000                4.800000
     75%    2017.0   10.000000   23.000000                   2.700000         0.000000               12.900000
     max    2017.0   12.000000   31.000000                  35.000000        15.000000               19.600000
     ```

**Note:**
- The `describe()` method is particularly useful for exploratory data analysis, providing a comprehensive overview of the distribution of values in each column.
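
To see how the rows of `describe()` relate to the data, here is a small sketch on made-up measurements (the values and the name `weather` are hypothetical, chosen so the results are easy to check by hand). One snow depth value is missing, which is why the two columns report different counts.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements; one snow depth value is missing
weather = pd.DataFrame(
    {"Snow depth (cm)": [0.0, 5.0, np.nan, 15.0],
     "Air temperature (degC)": [1.0, -2.0, 3.0, 6.0]}
)

print(weather.mean())  # column-wise means; the missing value is skipped

summary = weather.describe()
print(summary.loc[["count", "mean", "max"]])
# count is 3 for snow depth and 4 for temperature
```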

In [165]:
import pandas as pd
"""Exercise 4.9 (snow depth)
Write function snow_depth that reads in the weather DataFrame from the src folder and
returns the maximum amount of snow in the year 2017.

Print the result in the main function in the following form:

Max snow depth: xx.x
"""
def snow_depth():
    """returns the maximum amount of snow in the year 2017."""
    # read the csv file and compute the statistic
    df = pd.read_csv("part04-e09_snow_depth/src/kumpula-weather-2017.csv", sep=',')
    snow_amount = 'Snow depth (cm)'
  
    # extract rows from year 2017
    target_year = df[df.loc[:, "Year"] == 2017]
    # Series.max() skips missing values (NaN); the built-in max() does not handle them reliably
    max_snow_amount = target_year.loc[:, snow_amount].max()
    
    return max_snow_amount
"""def snow_depth():
    df = pd.read_csv("src/kumpula-weather-2017.csv")
    return df["Snow depth (cm)"].max()
 """
def main():
    result = snow_depth()
    print(f"Max snow depth: {result:.1f}")
main()

Max snow depth: 15.0


In [174]:
"""
Exercise 4.10 (average temperature)
Write function average_temperature that reads the weather data set and returns the average temperature in July.

Print the result in the main function in the following form:

Average temperature in July: xx.x
"""

import pandas as pd

def average_temperature():
    """returns the average temperature in July."""
    df = pd.read_csv("part04-e10_average_temperature/src/kumpula-weather-2017.csv", sep=',')
    month = 7  # July
    target_month = df[df.loc[:, 'm'] == month]
    temperature = target_month['Air temperature (degC)']
    # Series.mean() skips missing values, unlike sum(...) / len(...)
    return temperature.mean()
"""def average_temperature():
    df = pd.read_csv("src/kumpula-weather-2017.csv", sep=",")
    m = df["m"] == 7
    return df[m]["Air temperature (degC)"].mean()"""
def main():
    avg = average_temperature()
    print(f"Average temperature in July: {avg:.1f}")
main()

Average temperature in July: 16.0


In [179]:
"""
Exercise 4.11 (below zero)
Write function below_zero that returns the number of days when the temperature was below zero.

Print the result in the main function in the following form:

Number of days below zero: xx
"""

import pandas as pd

def below_zero():
    """returns the number of days when the temperature was below zero."""
    df = pd.read_csv("part04-e11_below_zero/src/kumpula-weather-2017.csv", sep=',')
    # keep only the rows (days) whose temperature is below zero, then count them
    below = df[df['Air temperature (degC)'] < 0]
    return len(below)
"""def below_zero():
    df = pd.read_csv("src/kumpula-weather-2017.csv")
    return sum(df["Air temperature (degC)"] < 0.0)"""
def main():
    result = below_zero()
    print(f"Number of days below zero: {result:02d}")
main()

Number of days below zero: 49


**Missing Data in Pandas DataFrame:**

1. **Observations:**
   - Minimum values in precipitation and snow depth fields are -1, indicating no rain or snow on those days.
   - Snow depth column has a count of 358, while other columns have 365 entries.

2. **Unique Values in Snow Depth Column:**
   - Using `wh["Snow depth (cm)"].unique()`:
     - Values include -1, various numeric depths, and `nan` (Not A Number).
     - `nan` represents missing values due to measurement issues or other problems.

3. **Handling Missing Values in Pandas:**
   - Float types allow `nan` values.

   - Examples:
     - Integer series with missing values gets promoted to float.
       ```python
       pd.Series([1,3,2])       # dtype: int64
       pd.Series([1,3,2, np.nan])  # dtype: float64
       ```

     - Non-numeric types use `None` for missing values, dtype gets promoted to object.
       ```python
       pd.Series(["jack", "joe", None])  # dtype: object
       ```

4. **Locating Missing Values:**
   - Use `isnull()` to create a boolean mask DataFrame.
     ```python
     wh.isnull()  # boolean mask DataFrame
     ```

   - Combine with `any()` to find rows with at least one missing value.
     ```python
     wh[wh.isnull().any(axis=1)]
     ```

5. **Excluding Missing Values from Statistics:**
   - Pandas excludes missing values from summary statistics.

6. **Handling Missing Values - `dropna()`:**
   - `dropna()` method drops rows or columns with missing values.
     ```python
     wh.dropna().shape   # Default axis is 0, drops rows
     wh.dropna(axis=1).shape   # Drops columns containing missing values
     ```

7. **`dropna()` Parameters:**
   - `how` and `thresh` parameters specify conditions for dropping rows/columns.

8. **Filling Missing Values - `fillna()`:**
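   - The `fillna()` method fills missing values with a constant; forward or backward filling (`ffill()`, `bfill()`) copies a neighboring non-missing value instead.

   - Example:
     ```python
     wh = wh.ffill()   # Forward fill; replaces the deprecated fillna(method='ffill')
     ```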

9. **Interpolating Missing Values:**
   - The `interpolate` method offers more elaborate ways to interpolate missing values from neighboring non-missing values (not covered in detail here).
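
The locating, dropping, and filling steps above can be tried on a tiny frame. This is a sketch on hypothetical data (the frame `df` and its columns `x` and `y` are made up), not the weather data set itself.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is partly missing, row 2 is entirely missing
df = pd.DataFrame(
    {"x": [1.0, np.nan, np.nan, 4.0],
     "y": [10.0, 20.0, np.nan, 40.0]}
)

print(df[df.isnull().any(axis=1)])  # rows with at least one missing value
print(df.dropna().shape)            # (2, 2): any missing value drops the row
print(df.dropna(how="all").shape)   # (3, 2): only the all-missing row is dropped
print(df.dropna(thresh=2).shape)    # (2, 2): keep rows with at least 2 non-missing values

print(df["x"].ffill().tolist())        # [1.0, 1.0, 1.0, 4.0]
print(df["x"].interpolate().tolist())  # [1.0, 2.0, 3.0, 4.0]
```

Forward filling repeats the last seen value, while the default (linear) interpolation spreads the gap evenly between the neighboring values.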

In [221]:
"""Exercise 4.12 (cyclists)
Write function cyclists that does the following.

Load the Helsinki bicycle data set from the src folder (https://hri.fi/data/dataset//helsingin-pyorailijamaarat). 
The dataset contains the number of cyclists passing by measuring points per hour. 
The data is gathered over about four years, and there are 20 measuring points around Helsinki. 
The dataset contains some empty rows at the end. Get rid of these. 
Also, get rid of columns that contain only missing values. 
Return the cleaned dataset.
"""
import pandas as pd

def cyclists():
    df = pd.read_csv("part04-e12_cyclists/src/Helsingin_pyorailijamaarat.csv", sep=';')
    # drop columns that contain only missing values, then rows that are entirely empty
    without_empty_columns = df.dropna(axis=1, how='all')
    cleaned = without_empty_columns.dropna(axis=0, how='all')
    return cleaned
  

def main():
    df = cyclists()
    print(df.shape)
main()

(37128, 21)



**Exercise 4.13: Missing Value Types**

1. **Objective:**
   - Create a function named `missing_value_types` that returns a DataFrame with specific requirements.

2. **DataFrame Structure:**
   - Index: State column
   - Columns:
     - Year of independence (float type)
     - President (object type)

3. **Handling Missing Values:**
   - Replace dashes with appropriate missing value symbols.

4. **Example Call:**
   ```python
   missing_value_types()
   ```
   
5. **Example Output:**
   ```
                        Year of independence  President
   State                                               
   United Kingdom                         NaN        NaN
   Finland                             1917.0   Niinistö
   USA                                 1776.0      Trump
   Sweden                              1523.0        NaN
   Germany                                NaN  Steinmeier
   Russia                              1992.0      Putin
   ```


In [2]:
import pandas as pd
import numpy as np

def missing_value_types():
    
    
    index = ["United Kingdom", "Finland", "USA", "Sweden", "Germany", "Russia"]

    presidents = [np.nan, 'Niinistö', 'Trump', np.nan, 'Steinmeier', 'Putin']
    years = [np.nan, 1917, 1776, 1523, np.nan, 1992]
    data = {
        'Year of independence':years,
        'President':presidents
    }
  
    result = pd.DataFrame(data, index=index)
  

    return(result)
"""def missing_value_types():
    df=pd.DataFrame([["United Kingdom", np.nan, None],
                     ["Finland",        1917,   "Niinistö"],
                     ["USA",            1776,   "Trump"],
                     ["Sweden",         1523,   None],
                     ["Germany",        np.nan, "Steinmeier"],
                     ["Russia",         1992,   "Putin"]],
                    columns=["State", "Year of independence", "President"])
    df = df.set_index("State")
    return df"""       
def main():
    result = missing_value_types()
    print(result)
main()

                Year of independence   President
United Kingdom                   NaN         NaN
Finland                       1917.0    Niinistö
USA                           1776.0       Trump
Sweden                        1523.0         NaN
Germany                          NaN  Steinmeier
Russia                        1992.0       Putin


In [44]:
"""Exercise 4.14 (special missing values)
Write function special_missing_values that does the following.

Read the data set of the top forty singles from the beginning of the year 1964 from the src folder. 
Return the rows whose singles' position dropped compared to last week's position (column LW=Last Week).

To do this you first have to convert the special values "New" and "Re" (Re-entry) to missing values (None).
"""

import pandas as pd
import numpy as np

def special_missing_values():
    """
    Read the data set of the top forty singles from the beginning of the year 1964 from the src folder. 
    Return the rows whose singles' position dropped compared to last week's position (column LW=Last Week).
    """
    # read the file into a DataFrame

    df = pd.read_csv("part04-e14_special_missing_values/src/UK-top40-1964-1-2.tsv", sep='\t')
    # "New" and "Re" are not numbers, so errors='coerce' turns them into missing values (NaN)
    df['LW'] = pd.to_numeric(df['LW'], errors='coerce')
    # a comparison with NaN is False, so new entries and re-entries are left out
    return df[df['Pos'] > df['LW']]
   


def main():
    df = special_missing_values()
    #print(df.shape ==  (17,7))
    print(df)
main()

    Pos    LW                                 Title  \
2     3   2.0                         SHE LOVES YOU   
3     4   3.0                  YOU WERE MADE FOR ME   
5     6   5.0            I ONLY WANT TO BE WITH YOU   
8     9   4.0                           SECRET LOVE   
9    10   8.0                     DON'T TALK TO HIM   
11   12  11.0                              GERONIMO   
14   15  14.0                   I WANNA BE YOUR MAN   
15   16  12.0               YOU'LL NEVER WALK ALONE   
20   21  13.0               I'LL KEEP YOU SATISFIED   
21   22  21.0                  IF I RULED THE WORLD   
23   24  20.0  ALL I WANT FOR CHRISTMAS IS A BEATLE   
29   30  22.0                  IT'S ALMOST TOMORROW   
30   31  24.0                       HUNGRY FOR LOVE   
33   34  33.0                           DEEP PURPLE   
34   35  31.0                   BLOWING IN THE WIND   
37   38  30.0                       SUGAR AND SPICE   
38   39  37.0                      YESTERDAY'S GONE   

         