# Data Preprocessing

## 1. Introduction

This notebook preprocesses the raw city, country, weather, and migraine data to prepare it for analysis: it drops unneeded columns, checks for missing values, and joins the tables into a single dataset.

## 2. Import Required Libraries

In [None]:
import pandas as pd
import numpy as np
import sys

# Add the path to the scripts folder and import the functions
sys.path.append("../scripts/")

# from merged_data import get_weather_migraine_dataframe
from raw_data import get_raw_dataframes

## 3. Load Data

In [None]:
# Load data
city_data, country_data, weather_data, migraine_data = get_raw_dataframes()

city_data.shape, country_data.shape, weather_data.shape, migraine_data.shape



## 4. Data Cleaning

### 4.1 Drop Unnecessary Columns

#### 4.1.1 DataFrame: `city_data`

Keeping all columns for now.

#### 4.1.2 DataFrame: `country_data`

*Keeping* the following columns:
- 'country'
- 'iso2'
- 'iso3'
- 'population'
- 'region'
- 'continent'

*Removing* the following columns:
- 'native_name'
- 'area'
- 'capital'
- 'capital_lat'
- 'capital_lng'

In [None]:
# Drop columns that are not needed for the analysis
country_data.drop(columns=['native_name', 'area', 'capital', 'capital_lat', 'capital_lng'], inplace=True)
country_data.shape

#### 4.1.3 DataFrame: `weather_data`

*Keeping* the following columns:
- 'station_id'
- 'city_name'
- 'date'
- 'season'
- '*_temp_c' (avg, min, max)
- 'precipitation_mm'
- 'avg_sea_level_pres_hpa'
- 'sunshine_total_min'

*Removing* the following columns:
- 'snow_depth_mm'
- 'avg_wind_dir_deg'
- 'avg_wind_speed_kmh'
- 'peak_wind_gust_kmh'


In [None]:
# Drop columns that are not needed for the analysis
weather_data.drop(columns=['snow_depth_mm', 'avg_wind_dir_deg', 'avg_wind_speed_kmh', 'peak_wind_gust_kmh'], inplace=True)
weather_data.shape

#### 4.1.4 DataFrame: `migraine_data`

*Keeping* only the following columns:
- 'measure_name'
- 'location_name'
- 'sex_name'
- 'age_name'
- 'cause_name'
- 'metric_name'
- 'year'
- 'val'
- 'upper'
- 'lower'

*Removing* the following columns:
- '*_id' (measure, location, sex, age, cause, metric)

In [None]:
# Drop columns that are not needed for the analysis
migraine_data.drop(columns=['measure_id', 'location_id', 'sex_id', 'age_id', 'cause_id', 'metric_id'], inplace=True)
migraine_data.shape
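
Since every column removed from `migraine_data` ends in `_id`, the drop list could also be built from the suffix instead of being typed out. The sketch below shows this on a toy frame; the values are made up, and only the column names mirror the notebook. It is not suitable for `weather_data`, where `station_id` is kept.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names follow the notebook.
toy_migraine = pd.DataFrame({
    "measure_id": [5],
    "measure_name": ["Prevalence"],
    "location_id": [102],
    "location_name": ["United States of America"],
    "year": [2019],
    "val": [0.15],
})

# Collect every column whose name ends in "_id" rather than listing them by hand.
id_columns = [col for col in toy_migraine.columns if col.endswith("_id")]

# Return a new frame without the identifier columns (the original is left untouched).
toy_clean = toy_migraine.drop(columns=id_columns)
print(toy_clean.columns.tolist())
```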

### 4.2 Handle Missing Values

#### 4.2.1 DataFrame: `city_data`

In [None]:
# Check for missing values
print("\nCity:\n")
print(city_data.isnull().sum())

#### 4.2.2 DataFrame: `country_data`

In [None]:
# Check for missing values
print("\nCountry:\n")
print(country_data.isnull().sum())

#### 4.2.3 DataFrame: `weather_data`

In [None]:
# Check for missing values
print("\nWeather:\n")
print(weather_data.isnull().sum())

# weather_data.drop_duplicates(inplace=True)
# weather_data.shape
# weather_data.isnull().sum()

#### 4.2.4 DataFrame: `migraine_data`

In [None]:
# Check for missing values
print("\nMigraine:\n")
print(migraine_data.isnull().sum())
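
The four checks above print one long series per DataFrame. As a possible alternative, the sketch below gathers only the columns that actually contain missing values into a single table with counts and percentages. `missing_summary` is a hypothetical helper, not part of the project's scripts, and the toy frames stand in for the real data.

```python
import numpy as np
import pandas as pd

def missing_summary(frames):
    """Return one row per (frame, column) pair that has at least one missing value."""
    rows = []
    for name, df in frames.items():
        counts = df.isnull().sum()
        for column, n_missing in counts[counts > 0].items():
            rows.append({
                "frame": name,
                "column": column,
                "n_missing": int(n_missing),
                "pct_missing": round(100 * n_missing / len(df), 1),
            })
    return pd.DataFrame(rows, columns=["frame", "column", "n_missing", "pct_missing"])

# Toy frames with hypothetical values.
toy_frames = {
    "weather": pd.DataFrame({
        "avg_temp_c": [1.0, np.nan, 3.0, np.nan],
        "date": ["d1", "d2", "d3", "d4"],
    }),
    "country": pd.DataFrame({"population": [10, 20]}),
}

summary = missing_summary(toy_frames)
print(summary)
```

In the notebook this could be called as `missing_summary({"city": city_data, "country": country_data, "weather": weather_data, "migraine": migraine_data})`.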

### 4.3 Aggregate Weather Data to Annual Level

Because the migraine data is annual, the daily weather data must be aggregated to the annual level as well. We will do this by taking the mean of each weather variable for each year.
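
A minimal sketch of what this aggregation might look like, assuming `date` can be parsed with `pd.to_datetime` and that one row per station, city, and year is the desired grain. The toy frame uses made-up values and only two of the weather columns.

```python
import pandas as pd

# Toy daily weather with hypothetical values; column names follow the notebook.
toy_weather = pd.DataFrame({
    "station_id": ["S1", "S1", "S1", "S2"],
    "city_name": ["A", "A", "A", "B"],
    "date": ["2019-01-01", "2019-07-01", "2020-01-01", "2019-03-01"],
    "avg_temp_c": [0.0, 20.0, 2.0, 10.0],
    "precipitation_mm": [1.0, 3.0, 5.0, 0.0],
})

# Parse the dates and derive the calendar year used as the grouping key.
toy_weather["date"] = pd.to_datetime(toy_weather["date"])
toy_weather["year"] = toy_weather["date"].dt.year

# Average each numeric weather column per station, city, and year.
annual_weather = (
    toy_weather
    .groupby(["station_id", "city_name", "year"], as_index=False)
    [["avg_temp_c", "precipitation_mm"]]
    .mean()
)
print(annual_weather)
```

The mean is used for every column here because that is what the text describes. For totals such as `precipitation_mm` or `sunshine_total_min`, an annual sum may be worth considering instead.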

In [None]:
# Code for aggregating weather data


## 5. Data Integration

### 5.1 Join Countries and Cities Tables

Join the countries and cities tables on the `country`, `iso2`, and `iso3` columns to give more context to the weather data.

In [None]:
# Left-join the country attributes onto each city
city_country = city_data.merge(country_data,
                               how='left',
                               on=['country', 'iso2', 'iso3']
                               )

# Preview the first rows of the new dataframe
city_country.head()
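
A left join silently keeps cities that have no matching country, and silently duplicates rows if a country appears more than once. The sketch below shows one way to guard against both using the `validate` and `indicator` parameters of `DataFrame.merge`. The toy frames and their values are made up.

```python
import pandas as pd

# Toy tables with hypothetical values; city "C" has no matching country row.
toy_city = pd.DataFrame({
    "city_name": ["A", "B", "C"],
    "country": ["X", "X", "Z"],
    "iso2": ["XX", "XX", "ZZ"],
    "iso3": ["XXX", "XXX", "ZZZ"],
})
toy_country = pd.DataFrame({
    "country": ["X", "Y"],
    "iso2": ["XX", "YY"],
    "iso3": ["XXX", "YYY"],
    "population": [100, 200],
})

# validate="many_to_one" raises MergeError if a country key is duplicated on the right.
# indicator=True adds a "_merge" column recording where each row found a match.
checked = toy_city.merge(
    toy_country,
    how="left",
    on=["country", "iso2", "iso3"],
    validate="many_to_one",
    indicator=True,
)

# Rows marked "left_only" are cities without a matching country.
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["city_name", "country"]])
```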

### 5.2 Join Weather Data with Countries and Cities

Join the weather data with the combined countries and cities tables on the `station_id` and `city_name` columns.

In [None]:
# Combine city/country with daily weather data
combined_weather = weather_data.merge(city_country,
                                      how='left',
                                      on=['station_id', 'city_name']
                                      )

# Review the shape of the new dataframe
combined_weather.shape

In [None]:
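# Filter the combined weather data to only include the US
# (.copy() makes the result an independent dataframe rather than a slice)
usa_weather = combined_weather[combined_weather['iso3'] == 'USA'].copy()

# Review the shape of the new dataframe
# (print is needed because only the last expression in a cell is displayed)
print(usa_weather.shape)

# Check for missing values
print(usa_weather.isnull().sum())

# Backfill missing values: each gap takes the next valid value below it
usa_weather_bfill = usa_weather.bfill()
usa_weather_bfill.info()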
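
Calling `bfill()` on the whole frame fills a gap with the next valid value in row order, which at a station boundary may come from a different station. The sketch below shows a per-station variant under the assumption that rows should only be filled from later dates of the same `station_id`. The toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values: station S1 has a gap on its last day.
toy = pd.DataFrame({
    "station_id": ["S1", "S1", "S2", "S2"],
    "date": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-01", "2019-01-02"]),
    "avg_temp_c": [1.0, np.nan, 9.0, 7.0],
})
toy = toy.sort_values(["station_id", "date"])

# Frame-wide backfill: the S1 gap is filled with 9.0, which belongs to S2.
plain = toy["avg_temp_c"].bfill()

# Grouped backfill: values never cross from one station into another,
# so a gap with no later reading at the same station stays missing.
value_columns = ["avg_temp_c"]
filled = toy.copy()
filled[value_columns] = toy.groupby("station_id")[value_columns].bfill()

print(plain.tolist())
print(filled["avg_temp_c"].tolist())
```

Gaps left at the end of a station's series could then be handled separately, for example with a grouped `ffill()` or by dropping those rows.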

### 5.3 Join Migraine Data with Weather Data

Finally, join the migraine data with the combined weather data (countries, cities, daily weather), matching the `city_name` column from the weather data to the `location_name` column from the migraine data.

In [None]:
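# Combine USA combined weather with migraine data
# (uses the backfilled frame from the previous cell so the filled values carry into the merge)
weather_migraine = usa_weather_bfill.merge(migraine_data,
                                           how='left',
                                           left_on='city_name',
                                           right_on='location_name'
                                           )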

# Review the shape of the new dataframe
weather_migraine.shape
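
Matching on location alone pairs every weather row with every migraine row for that location, across all years. If the weather data has first been aggregated to one row per city and year (Section 4.3), the join could also match on `year`. The sketch below assumes that grain and assumes that `location_name` values line up with `city_name` values; the toy frames and values are made up.

```python
import pandas as pd

# Toy annual weather (one row per city and year) with hypothetical values.
toy_annual_weather = pd.DataFrame({
    "city_name": ["A", "A", "B"],
    "year": [2019, 2020, 2019],
    "avg_temp_c": [10.0, 11.0, 15.0],
})

# Toy migraine data with hypothetical values; location "B" is absent.
toy_migraine = pd.DataFrame({
    "location_name": ["A", "A"],
    "year": [2019, 2020],
    "val": [0.14, 0.15],
})

# Rename the location column so both frames share the same key names,
# then join on location and year together.
joined = toy_annual_weather.merge(
    toy_migraine.rename(columns={"location_name": "city_name"}),
    how="left",
    on=["city_name", "year"],
)
print(joined)
```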

## 6. Feature Engineering

Discuss any new features that were created and why, as well as any features that were dropped and why.

In [None]:
# Code for feature engineering


## 7. Summary

Summarize the data preprocessing steps that were taken in this notebook.

## 8. Next Steps

Discuss any next steps that should be taken in the data analysis process/modeling phases.