# Assignment 1 - Python and Pandas Review
Author:  Akinyi Wendy

# Goal 
This assignment gives you the opportunity to practice key aspects of data gathering and preprocessing, a crucial step of the data science workflow.

Execute the following two cells so I can gather information on versions of commonly used libraries:

In [118]:
import numpy as np
import pandas as pd
import sklearn as sk
import re

In [119]:
print('PACKAGE VERSIONS')
print(f'Numpy:   {np.__version__}')
print(f'Pandas:  {pd.__version__}')
print(f'SKLearn: {sk.__version__}')

PACKAGE VERSIONS
Numpy:   1.26.4
Pandas:  2.2.0
SKLearn: 1.4.2


# Question 1
### Part $a$

Create a data set from several independent files

- Read the provided csv files into individual Pandas DataFrames with the following naming convention:
    - boston_weather_1.csv $\rightarrow$ $boston\_1$
    - boston_weather_2.csv $\rightarrow$ $boston\_2$
    - boston_weather_3.csv $\rightarrow$ $boston\_3$
    - boston_weather_4.csv $\rightarrow$ $boston\_4$
    - more_weather_variables.csv $\rightarrow$ $more\_variables$
    - Tip: Consider the Pandas function $read\_csv$

In [120]:
boston_1 = pd.read_csv('boston_weather_1.csv')
boston_2 = pd.read_csv('boston_weather_2.csv')
boston_3 = pd.read_csv('boston_weather_3.csv')
boston_4 = pd.read_csv('boston_weather_4.csv')
more_variables = pd.read_csv('more_weather_variables.csv')


### Part $b$
- In the live session notebook demonstration, details of each DataFrame were displayed (shape and column labels)
- The code was repeated to present these details
    - Review the notebooks shared after the first live session 
- Here, construct a function that takes a DataFrame as input and displays the following:
    - DataFrame dimensions
    - Column labels
    - The first five rows
    - The last five rows
    - The Date Range (earliest Month, Year to latest Month, Year)
- In a Markdown cell, document the following:
    - Compare the dimensions of each of the five DataFrames
    - Compare the column labels (feature names) in each DataFrame
    - What can be assumed about the order of the data in each of the $boston\_*$ DataFrames?
    - What is the Date Range (earliest Month, Year to latest Month, Year) in each of the $boston\_*$ and $more\_variables$ DataFrames?

- Dimensions: boston_1, boston_2, and boston_3 each have 300 rows and 8 columns, boston_4 has 240 rows and 8 columns, and more_variables has 1140 rows and 7 columns.
- Column labels: boston_1, boston_2, boston_3, and boston_4 have identical column names: Month, Year, LowTemp, HighTemp, WarmestMin, ColdestHigh, AveMin, AveMax.
- more_variables shares Month and Year but otherwise has different columns: Month, Year, meanTemp, TotPrecip, TotSnow, Max24hrPrecip, Max24hrSnow.
- Order: each $boston\_*$ DataFrame starts at a specific month/year and progresses forward chronologically, whereas the data in $more\_variables$ is not in order.

In [121]:
def dataframe_display(df, df_name):
    print(f"\nDataframe name: {df_name}")
    print(f"Dataframe Dimensions: {df.shape}")
    print(f"Column Labels: {list(df.columns)}")
    print("First five rows:")
    print(df.head())
    print("Last five rows:")
    print(df.tail())
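    # Sort the (Year, Month) pairs so the range comes from the actual first and
    # last records rather than independent column minima and maxima
    dates = df[['Year', 'Month']].sort_values(['Year', 'Month'])
    first, last = dates.iloc[0], dates.iloc[-1]
    print(f"Date Range: {first['Month']}/{first['Year']} "
          f"to {last['Month']}/{last['Year']}")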


# Display details for each DataFrame
dataframe_display(boston_1, 'boston_1')
dataframe_display(boston_2, 'boston_2')
dataframe_display(boston_3, 'boston_3')
dataframe_display(boston_4, 'boston_4')
dataframe_display(more_variables, 'more_variables')


Dataframe name: boston_1
Dataframe Dimensions: (300, 8)
Column Labels: ['Month', 'Year', 'LowTemp', 'HighTemp', 'WarmestMin', 'ColdestHigh', 'AveMin', 'AveMax']
First five rows:
   Month  Year  LowTemp  HighTemp  WarmestMin  ColdestHigh  AveMin  AveMax
0      1  1920     -8.0      48.0        37.0         12.0    12.5    29.5
1      2  1920     -7.0      52.0        34.0         16.0    20.6    34.6
2      3  1920     10.0      72.0        45.0         25.0    30.5    48.0
3      4  1920     26.0      67.0        46.0         37.0    38.0    52.1
4      5  1920     36.0      86.0        61.0         45.0    47.5    61.8
Last five rows:
     Month  Year  LowTemp  HighTemp  WarmestMin  ColdestHigh  AveMin  AveMax
295      8  1944     53.0     101.0        76.0         69.0    65.1    84.3
296      9  1944     42.0      90.0        70.0         59.0    57.3    72.7
297     10  1944     30.0      86.0        64.0         45.0    45.4    62.1
298     11  1944     29.0      69.0        48.0

You should observe the following from the above analysis:
- Each $boston\_*$ DataFrame is ordered monthly temperature data
- If the individual $boston\_*$ DataFrames were combined in order, the combined DataFrame would contain 95 years of ordered, non-overlapping monthly data
- The total number of rows of the $boston\_*$ DataFrames is equal to the number of rows in $more\_variables$.
- The Date Range of $more\_variables$ is equivalent to the date range observed in $boston\_*$
- The data in $more\_variables$ is not in order
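
The ordering and non-overlap observations can also be checked programmatically. The sketch below is only an illustration on hypothetical miniature frames (`part_a`, `part_b`, and `shuffled` are made-up stand-ins, not the assignment data), and the helpers `month_index` and `is_chronological` are names invented here.

```python
import pandas as pd

# Hypothetical miniature frames standing in for two boston_* pieces
part_a = pd.DataFrame({"Month": [11, 12], "Year": [1944, 1944]})
part_b = pd.DataFrame({"Month": [1, 2], "Year": [1945, 1945]})
# Hypothetical unordered frame, similar in spirit to more_variables
shuffled = pd.DataFrame({"Month": [2, 11, 1], "Year": [1945, 1944, 1945]})


def month_index(df):
    """Map each (Year, Month) pair to one integer that grows by 1 per month."""
    return df["Year"] * 12 + (df["Month"] - 1)


def is_chronological(df):
    """True when the rows are in strictly increasing month order."""
    key = month_index(df)
    return bool(key.is_monotonic_increasing and key.is_unique)


print(is_chronological(part_a))    # True
print(is_chronological(shuffled))  # False

# Two pieces are contiguous and non-overlapping when the second one starts
# exactly one month after the first one ends
gap = month_index(part_b).iloc[0] - month_index(part_a).iloc[-1]
print(gap == 1)  # True
```

Collapsing the two columns into a single integer key avoids comparing months and years separately, which is where date-range checks tend to go wrong.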

### Part $c$
- Combine the $boston\_1$, $boston\_2$, $boston\_3$, and $boston\_4$ DataFrames and assign the results to a DataFrame called $combined\_boston$. 
    - From the details learned above, these four data sets should be combined vertically since they represent the same features over different date ranges.
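
One detail worth knowing about vertical concatenation is that `pd.concat` keeps each input's original row labels unless `ignore_index=True` is passed. The sketch below shows the difference on two hypothetical two-row frames (`first` and `second` are made up for illustration).

```python
import pandas as pd

# Hypothetical two-row frames standing in for boston_1 and boston_2
first = pd.DataFrame({"Month": [1, 2], "Year": [1920, 1920]})
second = pd.DataFrame({"Month": [1, 2], "Year": [1945, 1945]})

# Default behavior: the row labels of each input are preserved, so they repeat
stacked = pd.concat([first, second], axis=0)
print(stacked.index.tolist())      # [0, 1, 0, 1]

# ignore_index=True assigns a fresh 0..n-1 index to the combined frame
renumbered = pd.concat([first, second], axis=0, ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2, 3]
```

Repeated labels are harmless for a later `merge`, which builds a new index, but they can surprise label-based lookups with `loc`.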

In [122]:
combined_boston = pd.concat([boston_1, boston_2, boston_3, boston_4], axis=0)
print(combined_boston.head())

   Month  Year  LowTemp  HighTemp  WarmestMin  ColdestHigh  AveMin  AveMax
0      1  1920     -8.0      48.0        37.0         12.0    12.5    29.5
1      2  1920     -7.0      52.0        34.0         16.0    20.6    34.6
2      3  1920     10.0      72.0        45.0         25.0    30.5    48.0
3      4  1920     26.0      67.0        46.0         37.0    38.0    52.1
4      5  1920     36.0      86.0        61.0         45.0    47.5    61.8


### Part $d$
- From the details we gathered above, $more\_variables$ represents additional weather data over the same date range of $combined\_boston$
- Let's combine the $combined\_boston$ and $more\_variables$ DataFrames into a DataFrame called $boston\_data$
    - These two DataFrames have the same number of instances (Verify for yourself with $shape$)
    - Therefore, the merge type (left, right, inner, or outer) does not change the result, provided every (Month, Year) pair appears exactly once in each DataFrame.
    - However, consider which feature labels are necessary for alignment. 
- Verify and show that this combined data set ($boston\_data$) has 1140 instances and 13 features.
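
The claim that the merge type does not matter can be tested rather than assumed. The sketch below uses the `validate` and `indicator` parameters of `pd.merge` on hypothetical three-row frames (`left` and `right` are invented for illustration, with `right` deliberately out of order).

```python
import pandas as pd

# Hypothetical frames: same (Year, Month) keys, different row order
left = pd.DataFrame({
    "Year": [1920, 1920, 1920],
    "Month": [1, 2, 3],
    "HighTemp": [48.0, 52.0, 72.0],
})
right = pd.DataFrame({
    "Year": [1920, 1920, 1920],
    "Month": [3, 1, 2],
    "TotPrecip": [3.72, 2.72, 4.11],
})

# validate raises MergeError if a key pair repeats on either side;
# indicator adds a _merge column recording where each row came from
merged = pd.merge(left, right, on=["Year", "Month"], how="outer",
                  validate="one_to_one", indicator=True)

all_matched = bool((merged["_merge"] == "both").all())
print(all_matched)    # True
print(merged.shape)   # (3, 5)
```

When every row of an outer merge is tagged `both`, an inner, left, or right merge would contain the same rows, which is the situation the bullet list describes.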

In [123]:
boston_data = pd.merge(combined_boston, more_variables, on=['Year', 'Month'], how='inner')
# Verify the final shape
print("\nFinal boston_data shape:", boston_data.shape)

# Display all column names to confirm proper combination
print("\nAll columns in boston_data:")
print(boston_data.columns.tolist())
# Display a few rows to verify proper alignment
print(boston_data.head())

# Additional verification of date range
print("\nDate range in final dataset:")
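# Sort the (Year, Month) pairs so the range uses the actual first and last records
dates = boston_data[['Year', 'Month']].sort_values(['Year', 'Month'])
first, last = dates.iloc[0], dates.iloc[-1]
print(f"From {first['Month']}/{first['Year']} "
      f"to {last['Month']}/{last['Year']}")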


Final boston_data shape: (1140, 13)

All columns in boston_data:
['Month', 'Year', 'LowTemp', 'HighTemp', 'WarmestMin', 'ColdestHigh', 'AveMin', 'AveMax', 'meanTemp', 'TotPrecip', 'TotSnow', 'Max24hrPrecip', 'Max24hrSnow']
   Month  Year  LowTemp  HighTemp  WarmestMin  ColdestHigh  AveMin  AveMax  \
0      1  1920     -8.0      48.0        37.0         12.0    12.5    29.5   
1      2  1920     -7.0      52.0        34.0         16.0    20.6    34.6   
2      3  1920     10.0      72.0        45.0         25.0    30.5    48.0   
3      4  1920     26.0      67.0        46.0         37.0    38.0    52.1   
4      5  1920     36.0      86.0        61.0         45.0    47.5    61.8   

   meanTemp  TotPrecip TotSnow  Max24hrPrecip Max24hrSnow  
0      21.0       2.72    24.8           0.74         6.7  
1      27.6       4.11    32.5           1.01        12.2  
2      39.3       3.72      11           1.15           6  
3      45.1       5.68       2           1.19           2  
4      

### Part $e$
Format the column labels
- Print the original columns of the $boston\_data$ DataFrame as a list
- Remove any potential whitespace from the column names and convert the column names to snake case
    - As a reminder, for column names that consist of multiple words, include an underscore between the lowercased words
    - For example, meanTemp should become mean_temp, Max24hrPrecip can become max_24hr_precip, HighTemp becomes high_temp, etc.
    - Tip: This can be written in a single line of code
    - Tip: Consider the "re" library 
- Print the columns of $boston\_data$ as a list again
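
As one possible reading of the examples above, the sketch below converts labels to snake case and also places an underscore before a digit run, so that `Max24hrPrecip` becomes `max_24hr_precip`. The helper name `to_snake_case` and the sample list `labels` are assumptions made for illustration, and this is only one of several reasonable patterns.

```python
import re


def to_snake_case(label):
    """Strip whitespace, split at lowercase-to-(uppercase or digit) boundaries."""
    # The pattern matches the empty position between a lowercase letter and
    # an uppercase letter or digit, and an underscore is inserted there
    return re.sub(r"(?<=[a-z])(?=[A-Z0-9])", "_", label.strip()).lower()


# Sample labels; the leading space in " Month" is a hypothetical stray character
labels = [" Month", "HighTemp", "meanTemp", "Max24hrPrecip"]
print([to_snake_case(label) for label in labels])
# ['month', 'high_temp', 'mean_temp', 'max_24hr_precip']
```

Stripping before substituting matters: with the whitespace still present, a pattern keyed on string position could treat the first letter as an interior boundary. The pattern does not split runs of capitals, which is acceptable for the labels in this data set.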

In [124]:
# Print original column names

print("Original column names:")
print(list(boston_data.columns))


# Update column names using list comprehension with regex
# Strip whitespace first so a leading space cannot produce a stray underscore
boston_data.columns = [re.sub(
    r'(?<!^)(?=[A-Z])', '_', col.strip()).lower() for col in boston_data.columns]

# Print updated column names
print("\nUpdated column names:")
print(list(boston_data.columns))

Original column names:
['Month', 'Year', 'LowTemp', 'HighTemp', 'WarmestMin', 'ColdestHigh', 'AveMin', 'AveMax', 'meanTemp', 'TotPrecip', 'TotSnow', 'Max24hrPrecip', 'Max24hrSnow']

Updated column names:
['month', 'year', 'low_temp', 'high_temp', 'warmest_min', 'coldest_high', 'ave_min', 'ave_max', 'mean_temp', 'tot_precip', 'tot_snow', 'max24hr_precip', 'max24hr_snow']


### Part $f$
Inspection of the data set
- Calculate and show the number of missing data values in each column of the $boston\_data$ DataFrame
    - Which columns have missing values?
- Remove the instances that contain any missing data, and assign the resulting DataFrame to a variable called $clean\_boston\_data$.
    - Be sure to use $.copy()$ to avoid future warnings.
- Verify that $clean\_boston\_data$ has no missing data.
- Print the shapes of $boston\_data$ and $clean\_boston\_data$.
    - How many rows were removed due to null values?
    - What $month, year$ pairings were removed from $boston\_data$?
        - Display a DataFrame of the $month$ and $year$ values that were removed
        - Tip: Consider Pandas functions $isna$ and $any$ to identify the rows removed
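
The `isna` and `any` tip can be sketched on a small frame. The data below is hypothetical (`toy` is invented for illustration); the point is that a row-wise boolean mask identifies exactly the rows that `dropna` removes.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two incomplete rows
toy = pd.DataFrame({
    "month": [1, 2, 3, 4],
    "year": [1920, 1920, 1920, 1920],
    "low_temp": [-8.0, np.nan, 10.0, 26.0],
    "high_temp": [48.0, 52.0, np.nan, 67.0],
})

# True for every row that has at least one missing value
has_missing = toy.isna().any(axis=1)

removed = toy.loc[has_missing, ["month", "year"]]
kept = toy.loc[~has_missing].copy()

print(removed["month"].tolist())  # [2, 3]
print(len(kept))                  # 2
# dropna() keeps exactly the rows where the mask is False
print(kept.equals(toy.dropna()))  # True
```

Because several columns can be missing in the same row, the number of removed rows can be smaller than the sum of the per-column missing counts.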

In [125]:
# Check for missing values in each column
print("Missing values in each column:")
print(boston_data.isna().sum())

# Create clean dataset removing rows with any missing values
clean_boston_data = boston_data.dropna().copy()
# Verify no missing values in clean dataset
print("\nMissing values in clean dataset:")
print(clean_boston_data.isna().sum())
# Print shapes to compare
print("\nDataFrame shapes:")
print(f"boston_data shape: {boston_data.shape}")
print(f"clean_boston_data shape: {clean_boston_data.shape}")
# Calculate number of rows removed
rows_removed = boston_data.shape[0] - clean_boston_data.shape[0]
print(f"\nNumber of rows removed: {rows_removed}")
# Find the month/year pairs that were removed
# Create boolean mask of rows with missing values
missing_mask = boston_data.isna().any(axis=1)
# Get month/year values of removed rows
removed_dates = boston_data[missing_mask][[
    'month', 'year']].sort_values(['year', 'month'])

print("\nMonth/Year pairs removed due to missing values:")
print(removed_dates)

Missing values in each column:
month             0
year              0
low_temp          6
high_temp         5
warmest_min       6
coldest_high      9
ave_min           3
ave_max           6
mean_temp         0
tot_precip        0
tot_snow          0
max24hr_precip    0
max24hr_snow      0
dtype: int64

Missing values in clean dataset:
month             0
year              0
low_temp          0
high_temp         0
warmest_min       0
coldest_high      0
ave_min           0
ave_max           0
mean_temp         0
tot_precip        0
tot_snow          0
max24hr_precip    0
max24hr_snow      0
dtype: int64

DataFrame shapes:
boston_data shape: (1140, 13)
clean_boston_data shape: (1112, 13)

Number of rows removed: 28

Month/Year pairs removed due to missing values:
      month  year
5         6  1920
9        10  1920
17        6  1921
18        7  1921
201      10  1936
202      11  1936
203      12  1936
204       1  1937
209       6  1937
223       8  1938
224       9  1938
225      10

### Part $g$
Filtering and analyzing the data set
- From the $clean\_boston\_data$ DataFrame, select all data, except instances where the year is 1930. 
    - You can assign this subset to a variable $excluding\_1930$. 
- Using the $excluding\_1930$ DataFrame, output the first 20 unique values in the year column.
    - You should observe that 1930 is absent

In [126]:
# Create DataFrame excluding year 1930
excluding_1930 = clean_boston_data[clean_boston_data['year'] != 1930].copy()

# Show first 20 unique years
print("First 20 unique years in excluding_1930:")
print(sorted(excluding_1930['year'].unique())[:20])

# Additional verification
print("\nVerification:")
print(f"Original shape of clean_boston_data: {clean_boston_data.shape}")
print(f"Shape after excluding 1930: {excluding_1930.shape}")

# Count how many rows were from 1930
rows_1930 = clean_boston_data[clean_boston_data['year'] == 1930].shape[0]
print(f"Number of rows removed (from year 1930): {rows_1930}")

# Verify 1930 is not in the filtered dataset
print("\nIs 1930 present in filtered dataset?",
      1930 in excluding_1930['year'].values)

First 20 unique years in excluding_1930:
[1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940]

Verification:
Original shape of clean_boston_data: (1112, 13)
Shape after excluding 1930: (1100, 13)
Number of rows removed (from year 1930): 12

Is 1930 present in filtered dataset? False


### Part $h$
- Select the data from the $clean\_boston\_data$ where $year$ is 1995 AND the $high\_temp$ is greater than or equal to 90.
- Output or display the entire selected data. 

In [127]:
# Select data from 1995 where high_temp >= 90
hot_days_1995 = clean_boston_data[(clean_boston_data['year'] == 1995) &
                                  (clean_boston_data['high_temp'] >= 90)]

# Sort by month for better readability
hot_days_1995_sorted = hot_days_1995.sort_values('month')

# Display the results
print("Months in 1995 with high temperature >= 90°F:")
print("\nShape of selected data:", hot_days_1995.shape)
print("\nDetailed data:")
print(hot_days_1995_sorted)

Months in 1995 with high temperature >= 90°F:

Shape of selected data: (3, 13)

Detailed data:
     month  year  low_temp  high_temp  warmest_min  coldest_high  ave_min  \
905      6  1995      53.0       95.0         73.0          61.0     60.2   
906      7  1995      59.0      100.0         75.0          67.0     67.1   
907      8  1995      55.0       96.0         72.0          66.0     64.1   

     ave_max  mean_temp  tot_precip tot_snow  max24hr_precip max24hr_snow  
905     77.0       68.6        1.55        0            0.53            0  
906     84.6       75.9        2.06        0            0.88            0  
907     81.5       72.8        0.82        0            0.63            0  


### Part $i$
- From $clean\_boston\_data$, identify the $month$, $year$, and $high\_temp$ of the warmest $high\_temp$ in the data set
- Tip: The highest temperature occurs more than once
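
The tip matters because `idxmax` reports only the first occurrence of a maximum. The sketch below contrasts it with a boolean mask on a hypothetical three-row frame (`toy` and its middle row are invented for illustration).

```python
import pandas as pd

# Hypothetical frame in which the maximum high_temp appears twice
toy = pd.DataFrame({
    "month": [7, 8, 7],
    "year": [1926, 1950, 2011],
    "high_temp": [103.0, 99.0, 103.0],
})

# idxmax returns the label of the first maximum only
first_only = toy.loc[toy["high_temp"].idxmax(), "year"]
print(first_only)  # 1926

# Comparing against the maximum keeps every tied row
all_ties = toy.loc[toy["high_temp"] == toy["high_temp"].max(),
                   ["month", "year", "high_temp"]]
print(all_ties["year"].tolist())  # [1926, 2011]
```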

In [128]:
# Find the maximum high_temp value
max_temp = clean_boston_data['high_temp'].max()

# Select all instances with this maximum temperature
warmest_days = clean_boston_data[clean_boston_data['high_temp'] == max_temp][[
    'month', 'year', 'high_temp']]

# Sort by year and month for clear display
warmest_days_sorted = warmest_days.sort_values(['year', 'month'])

print(f"The highest temperature recorded was {max_temp}°F")
print("\nThis occurred on the following dates:")
print(warmest_days_sorted)

The highest temperature recorded was 103.0°F

This occurred on the following dates:
      month  year  high_temp
78        7  1926      103.0
1098      7  2011      103.0


### Part $j$
- From the $clean\_boston\_data$, identify the $month$, $year$, and $tot\_precip$ for the month with the largest $tot\_precip$, but only consider the months of March, April, May, or June
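
A minimal sketch of the filter-then-maximize pattern follows, using hypothetical values (`toy` is invented for illustration). Filtering with `isin` before calling `idxmax` ensures that a wetter month outside the March to June window is ignored.

```python
import pandas as pd

# Hypothetical frame: January is the wettest month overall
toy = pd.DataFrame({
    "month": [1, 3, 4, 6, 8],
    "year": [2000, 2000, 2000, 2000, 2000],
    "tot_precip": [9.0, 2.5, 6.1, 4.0, 8.0],
})

# Restrict to March-June first, then locate the largest tot_precip
spring = toy[toy["month"].isin([3, 4, 5, 6])]
# Passing the label inside a list returns a one-row DataFrame, which keeps
# each column's own dtype instead of mixing them in a single Series
wettest = spring.loc[[spring["tot_precip"].idxmax()],
                     ["month", "year", "tot_precip"]]
print(wettest.to_dict("records"))
# [{'month': 4, 'year': 2000, 'tot_precip': 6.1}]
```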

In [129]:
spring_months = [3, 4, 5, 6]
spring_data = clean_boston_data[clean_boston_data['month'].isin(spring_months)]
max_precip_row = spring_data.loc[spring_data['tot_precip'].idxmax(), [
    'month', 'year', 'tot_precip']]
print("\nMonth, year, and total precipitation for the month with the largest total precipitation in March, April, May, or June:")
print("\nMonth values: 3=March, 4=April, 5=May, 6=June")
print(max_precip_row)


Month, year, and total precipitation for the month with the largest total precipitation in March, April, May, or June:

Month values: 3=March, 4=April, 5=May, 6=June
month             3
year           2010
tot_precip    14.87
Name: 1082, dtype: object


### Part $k$
- From $clean\_boston\_data$, calculate the count of each month where the $tot\_snow$ is greater than zero.
- Tip: You should get an error

In [130]:
# The comparison below raises a TypeError because tot_snow is stored as strings,
# so it is left commented out:
# snow_counts = clean_boston_data[clean_boston_data['tot_snow'] > 0]['month'].value_counts()

### Part $l$
- You should have gotten: TypeError: '>' not supported between instances of 'str' and 'int'
- Check the data types in $clean\_boston\_data$, and you'll observe that $tot\_snow$ is an object type
- If you naively try to convert $tot\_snow$ to float64, an error will be returned.
    - The column has 'NR' (Not Recorded) values present.
        - Recall from above that $isna$ returned zero null values for this column.
- Convert the 'NR' values in $tot\_snow$ to null values
    - There are a few ways of doing this:  Pandas $where$, NumPy $where$, or using Pandas $loc$
- Convert the $tot\_snow$ data type to float
- Now, calculate the count of each month where the $tot\_snow$ is greater than zero.
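
The steps above can be sketched on a hypothetical four-row frame (`toy` is invented for illustration). Two of the listed options, Pandas `where` and `loc`, are shown side by side. This is an illustration under those assumptions, not the only valid approach.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: tot_snow is stored as strings, with one 'NR' marker
toy = pd.DataFrame({"month": [1, 2, 7, 12],
                    "tot_snow": ["24.8", "NR", "0", "6"]})

# 'NR' is an ordinary string, so isna does not count it as missing
print(toy["tot_snow"].isna().sum())  # 0

# Option 1: where keeps values that satisfy the condition, NaN elsewhere
via_where = toy["tot_snow"].where(toy["tot_snow"] != "NR")

# Option 2: loc assignment on a copy
via_loc = toy.copy()
via_loc.loc[via_loc["tot_snow"] == "NR", "tot_snow"] = np.nan

# With 'NR' gone, the remaining strings convert cleanly to float
toy["tot_snow"] = via_where.astype(float)
print(toy["tot_snow"].isna().sum())  # 1

snowy = toy.loc[toy["tot_snow"] > 0, "month"].value_counts().sort_index()
print(snowy.to_dict())  # {1: 1, 12: 1}
```

Rows with a null `tot_snow` fail the `> 0` comparison, so they drop out of the count without any extra handling.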

In [131]:
# Check data type and unique values
print("Data type of tot_snow:", clean_boston_data['tot_snow'].dtype)
print("\nUnique values in tot_snow (not sorted):")
print(clean_boston_data['tot_snow'].unique())

# Replace 'NR' values
clean_boston_data['tot_snow'] = clean_boston_data['tot_snow'].replace(
    {'NR': None})

# Convert tot_snow to float
clean_boston_data['tot_snow'] = pd.to_numeric(
    clean_boston_data['tot_snow'], errors='coerce')

# Count months with measurable snow (tot_snow > 0), as Part l asks
snow_counts = clean_boston_data[clean_boston_data['tot_snow']
                                > 0]['month'].value_counts()

# Sort by month number
snow_counts_sorted = snow_counts.sort_index()

print("\nNumber of instances with snow by month:")
for month, count in snow_counts_sorted.items():
    print(f"Month {month}: {count}")

Data type of tot_snow: object

Unique values in tot_snow (not sorted):
['24.8' '32.5' '11' '2' '0' '1.8' '5.5' '1.6' '23.2' '5.7' '4.4' '2.7'
 '15.5' '10.1' '3.3' '0.9' '14.7' '25.3' '14.4' '10.6' '3' '8.5' '10.4'
 '7.4' '0.5' '0.2' '20.7' '27.3' '2.8' '24.2' '16.1' '5.1' '8.7' '6.5'
 '1.4' '4.6' '13.5' '19.6' '3.6' '0.6' 'NR' '3.2' '7.7' '10.2' '0.1'
 '11.7' '14' '7.3' '1.7' '10.8' '5.8' '8.6' '1.3' '20.6' '5' '2.9' '15.6'
 '0.8' '32.9' '10.5' '13.1' '2.1' '1.9' '1' '0.3' '13.2' '12.6' '34' '10'
 '0.4' '7' '4.7' '18.3' '23.8' '1.5' '9.5' '3.5' '20.1' '12.5' '8.2' '6.2'
 '7.8' '6.8' '26.4' '4.5' '8' '9.6' '12.7' '1.2' '10.7' '42.3' '26.3'
 '4.9' '24.6' '9.8' '4' '9' '1.1' '26.8' '17' '11.8' '13.7' '6.9' '9.9'
 '2.4' '7.9' '15.2' '13.9' '9.2' '3.9' '8.4' '12.1' '13.6' '11.4' '2.2'
 '22.8' '2.5' '14.5' '25.8' '15.4' '11.5' '6.6' '23.9' '12' '4.1' '14.6'
 '2.3' '22.3' '16.9' '18.7' '14.9' '28.7' '5.3' '17.7' '12.2' '22.2' '9.7'
 '23.5' '22.9' '3.4' '41.3' '6.1' '18.2' '27.9' '8.1' '21' '1

### Part $m$
- From the counts of total snowfall per month, Boston gets snow as early as October to as late as May
- It seems as though some Decembers didn't get any snow
    - We know the original $boston\_data$ had 95 years of monthly data, and from the analysis in Part f, only two December instances were removed
- From $clean\_boston\_data$, calculate and show the count of each month where the $tot\_snow$ is equal to zero.
- Identify and show the six years in which the month of December did not receive any snow.

In [132]:
# Count of months where tot_snow equals 0
zero_snow_counts = clean_boston_data[clean_boston_data['tot_snow']
                                     == 0]['month'].value_counts()

# Sort by month number
zero_snow_counts_sorted = zero_snow_counts.sort_index()

print("Number of instances with zero snow by month:")
for month, count in zero_snow_counts_sorted.items():
    print(f"Month {month}: {count}")

# Decembers with no snow
december_no_snow = clean_boston_data[
    (clean_boston_data['month'] == 12) &
    (clean_boston_data['tot_snow'] == 0)
][['year', 'month', 'tot_snow']]

print("\nDecembers with no snow:")
print(december_no_snow.sort_values('year'))

# Total December count
total_decembers = clean_boston_data[clean_boston_data['month'] == 12].shape[0]
print(f"\nTotal number of Decembers in dataset: {total_decembers}")
print(f"Number of Decembers with no snow: {december_no_snow.shape[0]}")

Number of instances with zero snow by month:
Month 2: 3
Month 3: 5
Month 4: 52
Month 5: 90
Month 6: 87
Month 7: 91
Month 8: 90
Month 9: 90
Month 10: 84
Month 11: 45
Month 12: 6

Decembers with no snow:
      year  month  tot_snow
95    1927     12       0.0
407   1953     12       0.0
455   1957     12       0.0
647   1973     12       0.0
959   1999     12       0.0
1103  2011     12       0.0

Total number of Decembers in dataset: 93
Number of Decembers with no snow: 6
