# 2. Data Preprocessing

## 2.1 Introduction

The purpose of this notebook is to preprocess the data that has been acquired in the previous notebook (01_data_acquisition.ipynb). Data preprocessing is an essential step to ensure the data's quality, integrity, and readiness for subsequent analysis and modeling. The notebook covers cleaning procedures for four separate dataframes: city_data, country_data, weather_data, and migraine_data. It also involves the integration of these dataframes into a unified dataset suitable for downstream analysis. Following data cleaning, feature engineering will be performed to create new variables that may enhance the model's predictive power. Finally, the dataset will be exported as a CSV file for further analysis in the next notebook (03_data_analysis.ipynb).

## 2.2 Import Required Libraries

In [2]:
import pandas as pd
import numpy as np
from dotenv import load_dotenv
import os
import sys
import matplotlib.pyplot as plt
import seaborn as sns

# Load the environment variables
load_dotenv("../config/.env")

scripts_path = os.getenv("SCRIPTS_PATH")

# Add the path to the scripts folder and import the functions
if scripts_path is not None:
    if scripts_path not in sys.path:
        sys.path.append(scripts_path)

# Set the plot style and show all columns when displaying DataFrames
plt.style.use('ggplot')
pd.set_option('display.max_columns', None)

In [3]:
# Import the get_raw_dataframes function from the raw_data script
from raw_data import get_raw_dataframes

In [75]:
# Suppress scientific notation and limit floats to 2 decimal places
pd.set_option('display.float_format', '{:.2f}'.format)

## 2.3 Load Data

In [4]:
# Load data
city_data, country_data, weather_data, migraine_data = get_raw_dataframes()

# Check the shape of the dataframes
city_data.shape, country_data.shape, weather_data.shape, migraine_data.shape

((1245, 8), (214, 11), (27635763, 14), (1377000, 10))

## 2.4 Data Cleaning

### 2.4.1 DataFrame: `city_data`

#### 2.4.1.1 Data Consistency Check

- Check if the data is consistent across all columns, i.e., no anomalies or contradictions
- Use .describe() to obtain summary statistics, if appropriate
- Use .info() to get an overview of the dataset

In [5]:
# Check the city data
city_data

Unnamed: 0,station_id,city_name,country,state,iso2,iso3,latitude,longitude
0,41515,Asadabad,Afghanistan,Kunar,AF,AFG,34.866000,71.150005
1,38954,Fayzabad,Afghanistan,Badakhshan,AF,AFG,37.129761,70.579247
2,41560,Jalalabad,Afghanistan,Nangarhar,AF,AFG,34.441527,70.436103
3,38947,Kunduz,Afghanistan,Kunduz,AF,AFG,36.727951,68.872530
4,38987,Qala i Naw,Afghanistan,Badghis,AF,AFG,34.983000,63.133300
...,...,...,...,...,...,...,...,...
1240,67475,Kasama,Zambia,Northern,ZM,ZMB,-10.199598,31.179947
1241,68030,Livingstone,Zambia,Southern,ZM,ZMB,-17.860009,25.860013
1242,67633,Mongu,Zambia,Western,ZM,ZMB,-15.279598,23.120025
1243,67775,Harare,Zimbabwe,Harare,ZW,ZWE,-17.817790,31.044709


In [6]:
# Get overview of the data
city_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1245 entries, 0 to 1244
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   station_id  1245 non-null   object 
 1   city_name   1244 non-null   object 
 2   country     1245 non-null   object 
 3   state       1217 non-null   object 
 4   iso2        1239 non-null   object 
 5   iso3        1245 non-null   object 
 6   latitude    1245 non-null   float64
 7   longitude   1245 non-null   float64
dtypes: float64(2), object(6)
memory usage: 77.9+ KB


#### 2.4.1.2 Initial Review for Missing and Zero Values

In [7]:
# Check for missing values
print("\nCity Missing Values:")
print(city_data.isnull().sum())

# Calculate zero counts for each column
print("\nCity Zero Counts:")
zero_counts = (city_data == 0).sum()
print(zero_counts)


City Missing Values:
station_id     0
city_name      1
country        0
state         28
iso2           6
iso3           0
latitude       0
longitude      0
dtype: int64

City Zero Counts:
station_id    0
city_name     0
country       0
state         0
iso2          0
iso3          0
latitude      0
longitude     0
dtype: int64


No zero counts were found in the dataset. However, there are missing values in the following columns:
- `city_name`: 1
- `state`: 28
- `iso2`: 6

In [8]:
# Review data where the `city_name` is missing
city_data[city_data['city_name'].isnull()]

Unnamed: 0,station_id,city_name,country,state,iso2,iso3,latitude,longitude
911,40360,,Saudi Arabia,Al Jawf,SA,SAU,31.333302,37.333297


Using Google Maps and a web search, the missing `city_name` was identified as `Al Qurayyat` ([maps.google.com](https://www.google.com/maps/place/31%C2%B019'59.9%22N+37%C2%B019'59.9%22E/@31.3333066,37.3307221,17z/data=!4m4!3m3!8m2!3d31.333302!4d37.333297?entry=ttu)). The value was then added to the DataFrame manually.

In [9]:
# Add the missing city name for index row 911
city_data.loc[911, 'city_name'] = 'Al Qurayyat'
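Assigning by index label works as long as the row order never changes. A somewhat more robust variant selects the row by its station identifier instead. This is only a sketch on a toy subset of the data; it assumes `station_id` is stored as a string, which matches the `object` dtype reported by `.info()`.

```python
import pandas as pd

# Toy subset of city_data; station_id is stored as a string (object dtype)
city_demo = pd.DataFrame(
    {
        "station_id": ["41515", "40360"],
        "city_name": ["Asadabad", None],
        "country": ["Afghanistan", "Saudi Arabia"],
    },
    index=[0, 911],
)

# Select the row by its station identifier instead of the index label
mask = city_demo["station_id"] == "40360"
assert mask.sum() == 1, "expected exactly one matching station"
city_demo.loc[mask, "city_name"] = "Al Qurayyat"
print(city_demo)
```

The guard on `mask.sum()` fails loudly if the identifier is absent or duplicated, rather than silently editing the wrong rows.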

In [10]:
# Review data where the state is null/missing
city_data[city_data['state'].isnull()]

Unnamed: 0,station_id,city_name,country,state,iso2,iso3,latitude,longitude
36,91765,Pago Pago,American Samoa,,AS,ASM,-14.27661,-170.706645
37,7627,Andorra la Vella,Andorra,,AD,AND,42.500001,1.516486
92,41150,Manama,Bahrain,,BH,BHR,26.236136,50.583052
110,78016,Hamilton,Bermuda,,BM,BMU,32.29419,-64.783937
169,8589,Praia,Cape Verde,,CV,CPV,14.916698,-23.516689
170,78384,George Town,Cayman Islands,,KY,CYM,19.280437,-81.329982
234,80001,San Andrés,Colombia,,CO,COL,12.562137,-81.690327
248,91843,Avarua,Cook Islands,,CK,COK,-21.250035,-159.750001
294,88889,Stanley,Falkland Islands,,FK,FLK,-51.700011,-57.849968
328,91938,Papeete,French Polynesia,,PF,PYF,-17.533363,-149.566669


The missing `state` values belong to non-US locations, so they are not relevant to the analysis. They may be filled in when `city_data` is merged with the `country_data` DataFrame.

In [11]:
# Review data where iso2 is null/missing
city_data[city_data['iso2'].isnull()]

Unnamed: 0,station_id,city_name,country,state,iso2,iso3,latitude,longitude
682,68116,Gobabis,Namibia,Omaheke,,NAM,-22.455,18.963001
683,68312,Keetmanshoop,Namibia,Karas,,NAM,-26.573896,18.129994
684,68114,Omaruru,Namibia,Erongo,,NAM,-21.436002,15.950998
685,68098,Swakopmund,Namibia,Erongo,,NAM,-22.668863,14.535019
686,68014,Tsumeb,Namibia,Oshikoto,,NAM,-19.240028,17.710019
687,68110,Windhoek,Namibia,Khomas,,NAM,-22.570006,17.083546


Looked up the ISO 3166-1 alpha-2 code for the affected country, Namibia, on the [Online Browsing Platform (OBP) Version 4.27.1](https://www.iso.org/obp/ui/#search) and filled in 'NA' manually.
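A plausible reason why only Namibia is affected is that `pandas.read_csv` treats the string `NA` as a missing-value marker by default. This is an assumption about the cause, since the loading code from the previous notebook is not shown here. The sketch below reproduces the effect on an in-memory CSV and shows one way to avoid it.

```python
import io
import pandas as pd

csv_text = "country,iso2,iso3\nNamibia,NA,NAM\nZambia,ZM,ZMB\n"

# Default parsing: the string "NA" is on pandas' default list of missing-value markers
default_df = pd.read_csv(io.StringIO(csv_text))

# Disable the default markers and treat only empty fields as missing
strict_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[""])

print(default_df["iso2"].isna().sum())  # 1
print(strict_df["iso2"].tolist())       # ['NA', 'ZM']
```

If the data is written to CSV and read back later, the same parsing options would be needed again to keep the manually restored code.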

In [12]:
# Add 'NA' as missing iso2 values for the country of Namibia for index 682 through 687
city_data.loc[682:687, 'iso2'] = 'NA'

# Review data just added
city_data.loc[682:687, :]

Unnamed: 0,station_id,city_name,country,state,iso2,iso3,latitude,longitude
682,68116,Gobabis,Namibia,Omaheke,,NAM,-22.455,18.963001
683,68312,Keetmanshoop,Namibia,Karas,,NAM,-26.573896,18.129994
684,68114,Omaruru,Namibia,Erongo,,NAM,-21.436002,15.950998
685,68098,Swakopmund,Namibia,Erongo,,NAM,-22.668863,14.535019
686,68014,Tsumeb,Namibia,Oshikoto,,NAM,-19.240028,17.710019
687,68110,Windhoek,Namibia,Khomas,,NAM,-22.570006,17.083546


In [13]:
city_data.isnull().sum()

station_id     0
city_name      0
country        0
state         28
iso2           0
iso3           0
latitude       0
longitude      0
dtype: int64

#### 2.4.1.3 Drop Unnecessary Columns/Rows

Drop columns or rows that are not needed for the analysis based on the project's scope.

No columns or rows are dropped at this time; this will be revisited later.

#### 2.4.1.4 Rename Columns

Rename columns to have meaningful names and to follow a consistent naming convention.

Renaming the `city_name` column to `city` to match the naming convention of the other dataframes.

In [14]:
# Rename the city_name column to city
city_data.rename(columns={'city_name': 'city'}, inplace=True)

# confirm changes
city_data.head()

Unnamed: 0,station_id,city,country,state,iso2,iso3,latitude,longitude
0,41515,Asadabad,Afghanistan,Kunar,AF,AFG,34.866,71.150005
1,38954,Fayzabad,Afghanistan,Badakhshan,AF,AFG,37.129761,70.579247
2,41560,Jalalabad,Afghanistan,Nangarhar,AF,AFG,34.441527,70.436103
3,38947,Kunduz,Afghanistan,Kunduz,AF,AFG,36.727951,68.87253
4,38987,Qala i Naw,Afghanistan,Badghis,AF,AFG,34.983,63.1333


#### 2.4.1.5 Standardizing Text Data

Standardize `country` and `state` names by matching them against a standardized list of countries and states retrieved from [The Weather Dataset](https://www.kaggle.com/datasets/guillemservera/global-daily-climate-data/) on Kaggle.
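The matching itself is done by `find_matching_and_non_matching`, a helper from the project's scripts whose implementation is not shown in this notebook. As a rough illustration of the idea only, a set-based comparison against a reference list could look like the sketch below. The function name `split_by_reference`, the reference set, and the data are all hypothetical.

```python
import pandas as pd

def split_by_reference(df: pd.DataFrame, column: str, reference: set) -> tuple:
    """Split the unique non-null values of `column` into (matching, non_matching) sets."""
    values = set(df[column].dropna().unique())
    return values & reference, values - reference

# Hypothetical reference names and toy data, for illustration only
reference_countries = {"Afghanistan", "North Korea", "Zambia"}
demo = pd.DataFrame({"country": ["Afghanistan", "Korea, North", "Zambia", "Zambia"]})

matching, non_matching = split_by_reference(demo, "country", reference_countries)
print(sorted(matching))
print(sorted(non_matching))
```

The non-matching set is what drives the replacement dictionary: each name in it needs either a mapping to its reference spelling or a decision to leave it as-is.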

##### 2.4.1.5.1 Standardize `country` Names

In [15]:
# Import the function
from data_location_matcher import find_matching_and_non_matching

# Find matching and non-matching countries
city_data_matching_countries, city_data_non_matching_countries = find_matching_and_non_matching(city_data, 'country')

# View the non-matching countries
city_data_non_matching_countries

{'Bahamas',
 'Barbados',
 'Cabo Verde',
 "Cote d'Ivoire",
 'Czechia',
 'Eritrea',
 'Eswatini',
 'Gambia',
 'Guinea-Bissau',
 'Korea, North',
 'Korea, South',
 'Kosovo',
 'Micronesia',
 'Nauru',
 'Palau',
 'Palestine',
 'Panama',
 'Saint Lucia',
 'Saint Vincent and the Grenadines',
 'Sao Tome and Principe',
 'Timor-Leste',
 'Tonga',
 'Vatican City'}

In [16]:
# Create a dictionary of country name replacements
city_data_country_replacement_dict = { 
    'Guinea Bissau': 'Guinea-Bissau',
    'Korea, North': 'North Korea',
    'Korea, South': 'South Korea',
    'Macau S.A.R': 'Macau',
    'Svalbard and Jan Mayen Islands': 'Svalbard and Jan Mayen',
    'São Tomé and Príncipe': 'Sao Tome and Principe',
    'The Bahamas': 'Bahamas',
    'The Gambia': 'Gambia',
    'United States': 'United States of America'
}

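# Replace the country names in the city dataframe
# (assign the result back; calling replace(..., inplace=True) on a selected column is deprecated in recent pandas)
city_data['country'] = city_data['country'].replace(city_data_country_replacement_dict)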

# Confirm the changes
city_data['country'].unique()

array(['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra',
       'Angola', 'Anguilla', 'Antigua and Barbuda', 'Argentina',
       'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan',
       'Bahrain', 'Bangladesh', 'Belarus', 'Belgium', 'Belize', 'Benin',
       'Bermuda', 'Bhutan', 'Bolivia', 'Bosnia and Herzegovina',
       'Botswana', 'Brazil', 'Brunei', 'Bulgaria', 'Burkina Faso',
       'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde',
       'Cayman Islands', 'Central African Republic', 'Chad', 'Chile',
       'China', 'Christmas Island', 'Colombia', 'Comoros',
       'Congo (Brazzaville)', 'Congo (Kinshasa)', 'Cook Islands',
       'Costa Rica', 'Croatia', 'Cuba', 'Cyprus', 'Czech Republic',
       'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic',
       'East Timor', 'Ecuador', 'Egypt', 'El Salvador',
       'Equatorial Guinea', 'Estonia', 'Ethiopia', 'Falkland Islands',
       'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia',

##### 2.4.1.5.2 Standardize `state` Names

In [17]:
# Find matching and non-matching states
city_data_matching_states, city_data_non_matching_states = find_matching_and_non_matching(city_data, 'state')

# View the non-matching states
city_data_non_matching_states

{'Louisiana', 'South Dakota'}

In [18]:
# Review state names
city_data['state'].unique()

array(['Kunar', 'Badakhshan', 'Nangarhar', ..., 'Southern', 'Harare',
       'Masvingo'], dtype=object)

In [19]:
# Import standard US state list
from data_location_matcher import US_STATES

# Check state values in the city data, if not US state, replace with 'None'
city_data['state'] = city_data['state'].apply(lambda x: x if x in US_STATES else 'None')

# Confirm the changes
city_data['state'].unique()

array(['None', 'Montana', 'Maryland', 'New York', 'Georgia', 'Maine',
       'Texas', 'North Dakota', 'Idaho', 'Massachusetts', 'Nevada',
       'West Virginia', 'Wyoming', 'South Carolina', 'Ohio',
       'New Hampshire', 'Colorado', 'Iowa', 'Delaware', 'Kentucky',
       'Pennsylvania', 'Connecticut', 'Hawaii', 'Indiana', 'Mississippi',
       'Missouri', 'Alaska', 'Michigan', 'Nebraska', 'Arkansas',
       'Wisconsin', 'Alabama', 'Vermont', 'Tennessee', 'Oklahoma',
       'Washington', 'Arizona', 'Rhode Island', 'North Carolina',
       'Virginia', 'California', 'Minnesota', 'Oregon', 'Utah',
       'New Mexico', 'Illinois', 'Florida', 'Kansas', 'New Jersey',
       'District of Columbia'], dtype=object)
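The same filtering can be written without a row-wise `apply`, using `isin` and `where`. This is a sketch on toy data; `US_STATES_DEMO` is a hypothetical stand-in for the `US_STATES` list imported from `data_location_matcher`.

```python
import pandas as pd

# Hypothetical stand-in for the US_STATES list imported from data_location_matcher
US_STATES_DEMO = {"Montana", "Texas", "Louisiana"}

states = pd.Series(["Kunar", "Montana", None, "Texas"])

# Keep values found in the reference set; replace everything else (including NaN) with 'None'
cleaned = states.where(states.isin(US_STATES_DEMO), "None")
print(cleaned.tolist())  # ['None', 'Montana', 'None', 'Texas']
```

Note that `'None'` here is an ordinary string, so `.isnull()` will no longer report these entries as missing.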

#### 2.4.1.6 Merging with Other Datasets

The `city_data` DataFrame is not merged with related datasets such as `country_data` at this stage. The merge will be performed after `country_data` has been cleaned and standardized.

#### 2.4.1.7 Data Type Conversion

Convert columns to the appropriate data type (float, integer, string, datetime, etc.).

In [20]:
city_data.dtypes

station_id     object
city           object
country        object
state          object
iso2           object
iso3           object
latitude      float64
longitude     float64
dtype: object

All data types are appropriate for the columns.

#### 2.4.1.8 Handling Categorical Variables

Label encode or one-hot encode categorical variables as needed.

Given the specific goal of this analysis, we will not convert the geographic identifiers like city, country, and state into numerical formats through encoding. These variables serve as identifiers for the analysis, especially since we'll be using latitude and longitude for map visualizations. Converting into numerical categories could make the data less interpretable and complicate the visualization process.

Moreover, since we are interested in analyzing the number of migraines occurring in relation to sea-level pressure in specific geographical areas, keeping the geographic locations as categorical variables will make it easier to slice and dice the data. This will allow filtering or grouping of the data based on these geographical identifiers to derive more localized insights.
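As a small illustration of the filtering and grouping described above, the sketch below keeps `country` as readable text and aggregates per country. The columns `pressure` and `migraine_count` and all values are hypothetical placeholders, not the project's actual schema.

```python
import pandas as pd

# Illustrative values only; `pressure` and `migraine_count` are hypothetical column names
demo = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Zambia", "Zambia"],
    "city": ["Asadabad", "Kunduz", "Kasama", "Mongu"],
    "pressure": [1012.0, 1008.0, 1015.0, 1011.0],
    "migraine_count": [12, 7, 5, 3],
})

# Filter by a readable label...
afghanistan = demo[demo["country"] == "Afghanistan"]

# ...or aggregate per country without decoding numeric category codes
per_country = demo.groupby("country").agg(
    mean_pressure=("pressure", "mean"),
    total_migraines=("migraine_count", "sum"),
)
print(per_country)
```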

#### 2.4.1.9 Outliers Detection and Treatment

In the context of geographic data, the concept of "outliers" for latitude and longitude is generally not applicable in the traditional statistical sense. Latitudes range from -90 to 90, and longitudes range from -180 to 180. Any data point within these ranges is valid unless it doesn't make sense in this specific study (e.g., interested in a specific region but have coordinates from outside that region).

For geographical data, what could be considered an "outlier" might actually be more of a data entry error or a misplaced coordinate that could misrepresent the location. For example, a latitude and longitude that point to a location in the ocean for what is supposed to be a city would be an "outlier" in the context of analysis.

Rather than looking for outliers in the statistical sense, we want to validate the geographic data to ensure that the coordinates actually correspond to the cities they are supposed to represent. This can be done by plotting the coordinates on a map and checking for any points that seem out of place, given the context of the study.

In [21]:
import folium

# Create a base map
m = folium.Map(location=[20, 0], zoom_start=3)

# Add points to the map
for _, row in city_data.iterrows():
    folium.CircleMarker(
        location=[row['latitude'], row['longitude']],
        radius=5,
        color='blue',
        fill=True,
        fill_color='blue',
        popup=f"City: {row['city']}, Country: {row['country']}",
    ).add_to(m)

# Show the map
m
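Alongside the visual check on the map, a simple range check can flag coordinates that are impossible by definition. This is a minimal sketch on toy data; the helper name `invalid_coordinates` is an assumption, and the third row is deliberately invalid.

```python
import pandas as pd

def invalid_coordinates(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows whose latitude or longitude is missing or outside the valid range."""
    lat_ok = df["latitude"].between(-90, 90)
    lon_ok = df["longitude"].between(-180, 180)
    return df[~(lat_ok & lon_ok)]

demo = pd.DataFrame({
    "city": ["Asadabad", "Harare", "Bad entry"],
    "latitude": [34.866, -17.81779, 134.866],   # last value is deliberately invalid
    "longitude": [71.150005, 31.044709, 71.15],
})
bad_rows = invalid_coordinates(demo)
print(bad_rows)
```

A range check cannot detect a valid coordinate attached to the wrong city, so it complements the map inspection rather than replacing it.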

#### 2.4.1.10 Secondary Review for Missing and Zero Values

Conduct a second review for missing values.
Decide on an imputation strategy for each column with missing values.

In [22]:
# Check for missing values
city_data.isnull().sum()

station_id    0
city          0
country       0
state         0
iso2          0
iso3          0
latitude      0
longitude     0
dtype: int64

In [23]:
# Check for zero values
(city_data == 0).sum()

station_id    0
city          0
country       0
state         0
iso2          0
iso3          0
latitude      0
longitude     0
dtype: int64

In [24]:
# Check country values
city_data['country'].unique()

array(['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra',
       'Angola', 'Anguilla', 'Antigua and Barbuda', 'Argentina',
       'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan',
       'Bahrain', 'Bangladesh', 'Belarus', 'Belgium', 'Belize', 'Benin',
       'Bermuda', 'Bhutan', 'Bolivia', 'Bosnia and Herzegovina',
       'Botswana', 'Brazil', 'Brunei', 'Bulgaria', 'Burkina Faso',
       'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde',
       'Cayman Islands', 'Central African Republic', 'Chad', 'Chile',
       'China', 'Christmas Island', 'Colombia', 'Comoros',
       'Congo (Brazzaville)', 'Congo (Kinshasa)', 'Cook Islands',
       'Costa Rica', 'Croatia', 'Cuba', 'Cyprus', 'Czech Republic',
       'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic',
       'East Timor', 'Ecuador', 'Egypt', 'El Salvador',
       'Equatorial Guinea', 'Estonia', 'Ethiopia', 'Falkland Islands',
       'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia',

Standardizing the `state` names replaced the missing values in that column, along with every value that is not a US state, with the string 'None'. This is an appropriate value for the `state` column since the missing values are for non-US countries. The `country` column had no missing values to begin with.

#### 2.4.1.11 Replacing Missing Values

No missing values to replace.

#### 2.4.1.12 Data Transformation

Normalize or standardize numerical columns if needed.
Log transformation for skewed data.

In the context of this specific analysis, the numerical variables present in the city_data DataFrame are latitude and longitude, which are geographic coordinates. Traditional data transformation techniques like normalization, standardization, or log transformation are generally not applied to such variables, as they would distort their geographic meaning. These coordinates are used as-is for mapping and geographic filtering, so we will not perform any transformations on them in this analysis.

#### 2.4.1.13 Checking and Removing Duplicates

Use .duplicated() to check for duplicate rows and .drop_duplicates() to remove them.

In [25]:
# Check for duplicates
city_data.duplicated().sum()

0
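`duplicated()` with no arguments only flags rows that are identical in every column. Checking a key column such as `station_id` can reveal duplicates that differ in minor ways. The sketch below uses toy data with a deliberately introduced trailing space to show the difference.

```python
import pandas as pd

demo = pd.DataFrame({
    "station_id": ["41515", "41515", "38954"],
    "city": ["Asadabad", "Asadabad ", "Fayzabad"],  # note the trailing space in row 1
})

full_row_dupes = demo.duplicated().sum()
key_dupes = demo.duplicated(subset=["station_id"]).sum()
print(full_row_dupes, key_dupes)  # 0 1
```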

#### 2.4.1.14 Summary for Data Cleaning Steps for `city_data`

In this section, we've rigorously cleaned the `city_data` DataFrame to enhance its quality and usability. Below is a succinct summary:

1. **Data Consistency**: Verified column-wise consistency; flagged missing values for further review.
  
2. **Missing/Zero Values**: Filled in or flagged missing values; primarily affected non-U.S. entries.
  
3. **Column/Row Pruning**: No columns or rows were dropped at this stage; to be revisited later.

4. **Column Renaming**: Aligned column names for better readability.

5. **Text Standardization**: Harmonized country and state names across datasets, employing custom functions and dictionaries for transformation.

6. **Data Merging**: Deferred merging with `country_data` until it undergoes similar cleaning.

7. **Data Types**: Confirmed all data types are suitable for analysis.

8. **Categorical Variables**: Chose not to encode geographic identifiers to preserve interpretability.

9. **Outlier Management**: Validated geographic coordinates via map plotting; no statistical outliers.

10. **Second Missing Value Review**: Revisited missing values; mostly resolved through text standardization.

11. **Missing Value Replacement**: No missing values left to replace.

12. **Data Transformation**: Preserved latitude and longitude data in their original form for geographic fidelity.

13. **Duplicate Handling**: Confirmed no duplicate entries exist.

The cleaning steps have readied `city_data` for future stages of integration, feature engineering, and modeling. Similar methodologies will be applied to the remaining DataFrames: `country_data`, `weather_data`, and `migraine_data`.

### 2.4.2 DataFrame: `country_data`

#### 2.4.2.1 Data Consistency Check

- Check if the data is consistent across all columns, i.e., no anomalies or contradictions
- Use .describe() to obtain summary statistics, if appropriate
- Use .info() to get an overview of the dataset

In [26]:
# Check the country data
country_data

Unnamed: 0,country,native_name,iso2,iso3,population,area,capital,capital_lat,capital_lng,region,continent
0,Afghanistan,افغانستان,AF,AFG,26023100.0,652230.0,Kabul,34.526011,69.177684,Southern and Central Asia,Asia
1,Albania,Shqipëria,AL,ALB,2895947.0,28748.0,Tirana,41.326873,19.818791,Southern Europe,Europe
2,Algeria,الجزائر,DZ,DZA,38700000.0,2381741.0,Algiers,36.775361,3.060188,Northern Africa,Africa
3,American Samoa,American Samoa,AS,ASM,55519.0,199.0,Pago Pago,-14.275479,-170.704830,Polynesia,Oceania
4,Angola,Angola,AO,AGO,24383301.0,1246700.0,Luanda,-8.827270,13.243951,Central Africa,Africa
...,...,...,...,...,...,...,...,...,...,...,...
209,Wallis and Futuna,Wallis et Futuna,WF,WLF,13135.0,142.0,Mata-Utu,-13.282042,-176.174022,Polynesia,Oceania
210,Western Sahara,الصحراء الغربية,EH,ESH,586000.0,266000.0,El Aaiún,27.154512,-13.195392,Northern Africa,Africa
211,Yemen,اليَمَن,YE,YEM,25956000.0,527968.0,Sana'a,15.353857,44.205884,Middle East,Asia
212,Zambia,Zambia,ZM,ZMB,15023315.0,752612.0,Lusaka,-15.416449,28.282154,Eastern Africa,Europe


In [76]:
country_data.describe()

Unnamed: 0,population,area
count,214.0,214.0
mean,33220345.45,633179.66
std,131834795.68,1839408.12
min,30.0,2.02
25%,769187.5,11732.75
50%,6315500.0,96619.0
75%,22683784.75,459703.75
max,1367110000.0,17124442.0


**Population**

- **Count**: There are 214 countries (or rows) with population data.
- **Mean**: The average population is approximately 33,220,345.
- **Standard Deviation**: The standard deviation of about 131,834,796 suggests a wide dispersion or variability in the population data.
- **Min**: The smallest population among these countries is just 30.
- **25th Percentile**: 25% of the countries have a population less than or equal to approximately 769,188.
- **Median**: The median population, or the 50th percentile, is approximately 6,315,500.
- **75th Percentile**: 75% of the countries have a population less than or equal to approximately 22,683,785.
- **Max**: The largest population is 1,367,110,000, which is likely to be China, considering current global demographics.

**Area**

- **Count**: All 214 countries also have area data.
- **Mean**: The average area is about 633,180 square kilometers.
- **Standard Deviation**: The standard deviation of about 1,839,408 suggests there's a large variation in country sizes.
- **Min**: The smallest country has an area of just 2.02 square kilometers.
- **25th Percentile**: 25% of countries have an area less than or equal to approximately 11,733 square kilometers.
- **Median**: The median area, or the 50th percentile, is approximately 96,619 square kilometers.
- **75th Percentile**: 75% of countries have an area less than or equal to approximately 459,704 square kilometers.
- **Max**: The largest country has an area of 17,124,442 square kilometers, likely Russia.

**Key Insights**

- **Skewness**: Both population and area are likely strongly right-skewed: the mean is far above the median (population: about 33.2 million vs. 6.3 million; area: about 633,180 vs. 96,619 square kilometers), and the range between the minimum and maximum values is very wide.
- **Variability**: The high standard deviation in both cases indicates significant variability among countries in terms of both population and area.
- **Scale**: The data spans multiple orders of magnitude, from countries with populations as low as 30 to as high as over a billion, and from countries as small as 2 square kilometers to as large as millions of square kilometers.
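To make the point about scale concrete, the sketch below applies a base-10 logarithm to the five summary values of `population` reported by `describe()` above (minimum, quartiles, maximum). It is an illustration only; no transformation is applied to `country_data` at this step.

```python
import numpy as np
import pandas as pd

# Minimum, quartiles, and maximum of `population` as reported by describe() above
population = pd.Series([30, 769187.5, 6315500.0, 22683784.75, 1367110000.0])

log_population = np.log10(population)
orders_of_magnitude = log_population.max() - log_population.min()

print(log_population.round(2).tolist())
print(round(orders_of_magnitude, 2))
```

On the log scale the values are spaced far more evenly, which is why a log transformation is a common option for this kind of data if a later modeling step requires it.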

In [77]:
country_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     214 non-null    object 
 1   iso2        214 non-null    object 
 2   iso3        214 non-null    object 
 3   population  214 non-null    float64
 4   area        214 non-null    float64
 5   region      214 non-null    object 
 6   continent   214 non-null    object 
dtypes: float64(2), object(5)
memory usage: 11.8+ KB


**General Information**

- **Type of Object**: Pandas DataFrame
- **Row Details**: 214 rows indexed from 0 to 213

**Column Details**

- **Total Columns**: 7 columns

**Column-wise Information**

1. **`country`:** 
  - **Non-Null Count**: 214
  - **Data Type**: Object (typically used for text)
2. **`iso2`:** 
  - **Non-Null Count**: 214
  - **Data Type**: Object
3. **`iso3`:** 
  - **Non-Null Count**: 214
  - **Data Type**: Object
4. **`population`:** 
  - **Non-Null Count**: 214
  - **Data Type**: Float64 (used for numerical data)
5. **`area`:** 
  - **Non-Null Count**: 214
  - **Data Type**: Float64
6. **`region`:** 
  - **Non-Null Count**: 214
  - **Data Type**: Object
7. **`continent`:** 
  - **Non-Null Count**: 214
  - **Data Type**: Object

**Memory Usage**

- **Memory**: Approximately 11.8 KB

**Key Takeaways**

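1. **Complete Data**: All seven columns in this output have 214 non-null entries. The output comes from a later execution of the cell (In [77]), after the cleaning steps below; the raw DataFrame loaded in 2.3 has 11 columns, several of which contain missing values (see 2.4.2.2).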
2. **Data Types**: Two numerical columns (`population` and `area`) and five text-based columns (`country`, `iso2`, `iso3`, `region`, `continent`).
3. **Memory Efficient**: Relatively small memory footprint of about 11.8 KB.
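The `+` in `memory usage: 11.8+ KB` indicates a lower bound: for object-dtype columns, `.info()` does not measure the Python strings themselves unless asked to. A short sketch on toy data shows how to obtain the deeper estimate.

```python
import pandas as pd

demo = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria"],
    "population": [26023100.0, 2895947.0, 38700000.0],
})

# Shallow estimate: for object-dtype columns this counts only the references
shallow = demo.memory_usage().sum()
# Deep estimate: also inspects the string values themselves
deep = demo.memory_usage(deep=True).sum()
print(shallow, deep)
```

The same deep figure can be shown directly in the overview with `country_data.info(memory_usage="deep")`.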

#### 2.4.2.2 Initial Review for Missing and Zero Values

Conduct an initial review for missing values using .isna().sum() and for zero values.

In [27]:
# Check for missing values
print("\nCountry Missing Values:")
print(country_data.isnull().sum())

# Calculate zero counts for each column
print("\nCountry Zero Counts:")
zero_counts = (country_data == 0).sum()
print(zero_counts)


Country Missing Values:
country        0
native_name    1
iso2           1
iso3           0
population     4
area           7
capital        2
capital_lat    2
capital_lng    2
region         9
continent      8
dtype: int64

Country Zero Counts:
country        0
native_name    0
iso2           0
iso3           0
population     0
area           0
capital        0
capital_lat    0
capital_lng    0
region         0
continent      0
dtype: int64


No zero counts were found in the dataset. However, there are missing values in the following columns:
- `native_name`: 1
- `iso2`: 1
- `population`: 4
- `area`: 7
- `capital`: 2
- `capital_lat`: 2
- `capital_lng`: 2
- `region`: 9
- `continent`: 8
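Because several columns are affected, it can help to view all incomplete rows at once, together with the columns each one is missing, before reviewing them column by column. The following is a sketch on toy data; the `missing_columns` helper column is introduced here for illustration only.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "country": ["Namibia", "Australia", "Zambia"],
    "iso2": [np.nan, "AU", "ZM"],
    "population": [2113077.0, np.nan, 15023315.0],
})

# Rows with at least one missing value
incomplete = demo[demo.isna().any(axis=1)].copy()

# For each such row, list the names of the columns that are missing
incomplete["missing_columns"] = incomplete.isna().apply(
    lambda row: ", ".join(row[row].index), axis=1
)
print(incomplete[["country", "missing_columns"]])
```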

In [48]:
# Review data where the `native_name` is missing
country_data[country_data['native_name'].isnull()]

Unnamed: 0,country,native_name,iso2,iso3,population,area,capital,capital_lat,capital_lng,region,continent
208,Wales,,GB,GBR,3135000.0,20779.0,Cardiff,51.481583,-3.17909,Northern Europe,Europe


Located the `native_name` for Wales on [Wales | About](https://www.wales.com/about/language/place-names-wales#:~:text=The%20Welsh%20name%20for%20Wales,'a%20fellow%2Dcountryman'.) and filled it in manually.

In [49]:
# Add the missing native_name for index row 208
country_data.loc[208, 'native_name'] = 'Cymru'

# Review data just added
country_data.loc[208, :]

country                  Wales
native_name              Cymru
iso2                        GB
iso3                       GBR
population           3135000.0
area                   20779.0
capital                Cardiff
capital_lat          51.481583
capital_lng           -3.17909
region         Northern Europe
continent               Europe
Name: 208, dtype: object

In [28]:
# Review data where the `iso2` is missing
country_data[country_data['iso2'].isnull()]

Unnamed: 0,country,native_name,iso2,iso3,population,area,capital,capital_lat,capital_lng,region,continent
131,Namibia,Namibia,,NAM,2113077.0,825615.0,Windhoek,-22.574392,17.079069,Southern Africa,Africa


Looked up the ISO 3166-1 alpha-2 code for the affected country, Namibia, on the [Online Browsing Platform (OBP) Version 4.27.1](https://www.iso.org/obp/ui/#search) and filled in 'NA' manually.

In [29]:
# Add 'NA' as missing iso2 values for the country of Namibia for index 131
country_data.loc[131, 'iso2'] = 'NA'

# Review data just added
country_data.loc[131, :]

country                Namibia
native_name            Namibia
iso2                        NA
iso3                       NAM
population           2113077.0
area                  825615.0
capital               Windhoek
capital_lat         -22.574392
capital_lng          17.079069
region         Southern Africa
continent               Africa
Name: 131, dtype: object

In [30]:
country_data.isnull().sum()

country        0
native_name    1
iso2           0
iso3           0
population     4
area           7
capital        2
capital_lat    2
capital_lng    2
region         9
continent      8
dtype: int64

In [31]:
# Review data where the `population` is missing
country_data[country_data['population'].isnull()]

Unnamed: 0,country,native_name,iso2,iso3,population,area,capital,capital_lat,capital_lng,region,continent
10,Australia,Australia,AU,AUS,,7692024.0,Canberra,-35.297591,149.101268,Australia and New Zealand,Oceania
66,Gabon,Gabon,GA,GAB,,267668.0,Libreville,0.390002,9.454001,Central Africa,Africa
71,Greece,Ελλάδα,GR,GRC,,131990.0,Athens,37.983941,23.728305,Southern Europe,Europe
208,Wales,,GB,GBR,,,,,,,Europe


Looked up the missing `population` values on [Worldometers](https://www.worldometers.info/world-population/population-by-country/) and filled them in manually. Information on 'Wales' was not available there, so its values were taken from [Facts about Wales](https://www.wales.com/en-us/about/facts-about-wales) and filled in manually as well.

- 'Australia': 26,439,111
- 'Gabon': 2,436,566
- 'Greece': 10,341,277
- 'Wales': population = 3,135,000
- 'Wales': area = 20,779 km2
- 'Wales': capital = Cardiff
- 'Wales': capital_lat = 51.481583
- 'Wales': capital_lng = -3.179090
- 'Wales': region = Northern Europe

In [36]:
# Add the missing population value for index 10, 66, 71, and 208
country_data.loc[10, 'population'] = 26439111
country_data.loc[66, 'population'] = 2436566
country_data.loc[71, 'population'] = 10341277
country_data.loc[208, 'population'] = 3135000

# Add the missing area, capital, capital_lat, capital_lng, and region values for index 208
country_data.loc[208, 'area'] = 20779
country_data.loc[208, 'capital'] = 'Cardiff'
country_data.loc[208, 'capital_lat'] = 51.481583
country_data.loc[208, 'capital_lng'] = -3.179090
country_data.loc[208, 'region'] = 'Northern Europe'

# Review data just added
print("\nCountry Data for index 10, 66, 71, and 208:")
country_data.loc[[10, 66, 71, 208], :]


Country Data for index 10, 66, 71, and 208:


Unnamed: 0,country,native_name,iso2,iso3,population,area,capital,capital_lat,capital_lng,region,continent
10,Australia,Australia,AU,AUS,26439111.0,7692024.0,Canberra,-35.297591,149.101268,Australia and New Zealand,Oceania
66,Gabon,Gabon,GA,GAB,2436566.0,267668.0,Libreville,0.390002,9.454001,Central Africa,Africa
71,Greece,Ελλάδα,GR,GRC,10341277.0,131990.0,Athens,37.983941,23.728305,Southern Europe,Europe
208,Wales,,GB,GBR,3135000.0,20779.0,Cardiff,51.481583,-3.17909,Northern Europe,Europe
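The individual `.loc` assignments above depend on the index labels staying stable. An alternative is to keep the looked-up values in one mapping keyed by country name and fill only the cells that are still missing. This is a sketch on a toy frame that reuses the population figures listed above; the name `population_fixes` is an assumption.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {
        "country": ["Australia", "Gabon", "Greece"],
        "population": [np.nan, np.nan, np.nan],
    },
    index=[10, 66, 71],
)

# Corrections keyed by country name rather than by index label
population_fixes = {"Australia": 26439111, "Gabon": 2436566, "Greece": 10341277}

# Fill only the rows that are still missing, leaving existing values untouched
demo["population"] = demo["population"].fillna(demo["country"].map(population_fixes))
print(demo)
```

Keeping the corrections in one place also documents every manual value alongside its target row.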


In [37]:
country_data.isnull().sum()

country        0
native_name    1
iso2           0
iso3           0
population     0
area           6
capital        1
capital_lat    1
capital_lng    1
region         8
continent      8
dtype: int64

In [38]:
# Review data where the `area` is missing
country_data[country_data['area'].isnull()]

Unnamed: 0,country,native_name,iso2,iso3,population,area,capital,capital_lat,capital_lng,region,continent
63,French Guiana,Guyane française,GF,GUF,237549.0,,Cayenne,4.937114,-52.325831,South America,South America
123,Mayotte,Mayotte,YT,MYT,212645.0,,Mamoudzou,-12.780586,45.227991,Eastern Africa,Africa
156,Réunion,La Réunion,RE,REU,840974.0,,Saint-Denis,48.935773,2.358023,Eastern Africa,Africa
157,Saint Helena,Saint Helena,SH,SHN,4255.0,,Jamestown,37.210443,-76.773893,Western Africa,Africa
173,South Georgia,South Georgia,GS,SGS,30.0,,King Edward Point,-54.283545,-36.494636,Antarctica,Antarctica
180,Svalbard and Jan Mayen,Svalbard og Jan Mayen,SJ,SJM,2562.0,,Longyearbyen,78.223156,15.646366,Nordic Countries,Europe


The missing `area` values were looked up on [Worldometers](https://www.worldometers.info/world-population/population-by-country/) and filled in manually. Worldometers had no entry for 'Svalbard and Jan Mayen', so its area was taken from [AllCountries.eu](https://www.allcountries.eu/svalbard-jan-mayen.htm) instead.

In [39]:
# Add the missing area values for index 63, 123, 156, 157, 173, and 180
country_data.loc[63, 'area'] = 82200
country_data.loc[123, 'area'] = 375
country_data.loc[156, 'area'] = 2500
country_data.loc[157, 'area'] = 390
country_data.loc[173, 'area'] = 3756
country_data.loc[180, 'area'] = 62049

# Review data just added
print("\nCountry Data for index 63, 123, 156, 157, 173, and 180:")
country_data.loc[[63, 123, 156, 157, 173, 180], :]


Country Data for index 63, 123, 156, 157, 173, and 180:


Unnamed: 0,country,native_name,iso2,iso3,population,area,capital,capital_lat,capital_lng,region,continent
63,French Guiana,Guyane française,GF,GUF,237549.0,82200.0,Cayenne,4.937114,-52.325831,South America,South America
123,Mayotte,Mayotte,YT,MYT,212645.0,375.0,Mamoudzou,-12.780586,45.227991,Eastern Africa,Africa
156,Réunion,La Réunion,RE,REU,840974.0,2500.0,Saint-Denis,48.935773,2.358023,Eastern Africa,Africa
157,Saint Helena,Saint Helena,SH,SHN,4255.0,390.0,Jamestown,37.210443,-76.773893,Western Africa,Africa
173,South Georgia,South Georgia,GS,SGS,30.0,3756.0,King Edward Point,-54.283545,-36.494636,Antarctica,Antarctica
180,Svalbard and Jan Mayen,Svalbard og Jan Mayen,SJ,SJM,2562.0,62049.0,Longyearbyen,78.223156,15.646366,Nordic Countries,Europe
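
The index-based assignments above depend on the row order of the raw file. As a minimal sketch of a name-keyed alternative, the snippet below fills the same six area values by country name. The small `demo` frame stands in for `country_data`, and `area_by_country` is a hypothetical name introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for the rows of country_data; Macau already has an area
demo = pd.DataFrame({
    "country": ["French Guiana", "Mayotte", "Réunion", "Saint Helena",
                "South Georgia", "Svalbard and Jan Mayen", "Macau"],
    "area": [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 30.0],
})

# Manually researched areas (same values as in the cell above), keyed by name
area_by_country = {
    "French Guiana": 82200,
    "Mayotte": 375,
    "Réunion": 2500,
    "Saint Helena": 390,
    "South Georgia": 3756,
    "Svalbard and Jan Mayen": 62049,
}

# Fill only the gaps; rows that already have an area are left untouched
demo["area"] = demo["area"].fillna(demo["country"].map(area_by_country))
```

Keying on the name means the fill still targets the right rows if the source file is re-sorted or re-downloaded.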


In [40]:
country_data.isnull().sum()

country        0
native_name    1
iso2           0
iso3           0
population     0
area           0
capital        1
capital_lat    1
capital_lng    1
region         8
continent      8
dtype: int64

In [41]:
# Review data where the `capital` is missing
country_data[country_data['capital'].isnull()]

Unnamed: 0,country,native_name,iso2,iso3,population,area,capital,capital_lat,capital_lng,region,continent
112,Macau,澳門,MO,MAC,631000.0,30.0,,,,Eastern Asia,Asia


The missing `capital` for Macau was looked up on [Britannica](https://www.britannica.com/place/Macau-administrative-region-China) and filled in manually. The capital's latitude and longitude were not available there, so the coordinates were taken from [Google Maps](https://www.google.com/maps/place/Macau/@22.198745,113.543873,11z/data=!3m1!4b1!4m5!3m4!1s0x3403e8f0f8f0f6a5:0x6b1f6a0b0e0e0f6a!8m2!3d22.198745!4d113.543873).

In [43]:
# Add the missing capital, capital_lat, and capital_lng values for index 112
country_data.loc[112, 'capital'] = 'Macau'
country_data.loc[112, 'capital_lat'] = 22.20093031863315
country_data.loc[112, 'capital_lng'] = 113.54011107708503

# Review data just added
country_data.loc[112, :]

country               Macau
native_name              澳門
iso2                     MO
iso3                    MAC
population         631000.0
area                   30.0
capital               Macau
capital_lat        22.20093
capital_lng      113.540111
region         Eastern Asia
continent              Asia
Name: 112, dtype: object

In [44]:
country_data.isnull().sum()

country        0
native_name    1
iso2           0
iso3           0
population     0
area           0
capital        0
capital_lat    0
capital_lng    0
region         8
continent      8
dtype: int64

In [45]:
# Review data where the `region` is missing
country_data[country_data['region'].isnull()]

Unnamed: 0,country,native_name,iso2,iso3,population,area,capital,capital_lat,capital_lng,region,continent
47,Democratic Republic of the Congo,République démocratique du Congo,CD,COD,69360000.0,2344858.0,Kinshasa,-4.321706,15.312597,,
70,Gibraltar,Gibraltar,GI,GIB,30001.0,6.0,Gibraltar,36.140807,-5.35413,,
76,Guernsey,Guernsey,GG,GGY,63085.0,78.0,St. Peter Port,49.456814,-2.538998,,
90,Isle of Man,Isle of Man,IM,IMN,84497.0,572.0,Douglas,39.762842,-88.217052,,
93,Ivory Coast,Côte d'Ivoire,CI,CIV,23821000.0,322463.0,Yamoussoukro,6.809107,-5.273263,,
96,Jersey,Jersey,JE,JEY,99000.0,116.0,Saint Helier,47.384387,4.683325,,
164,Serbia,Srbija,RS,SRB,7186862.0,49037.0,Belgrade,44.817813,20.456897,,
186,Taiwan,臺灣,TW,TWN,23424615.0,36193.0,Taipei,25.03752,121.56368,,


The missing `region` and `continent` values for the 8 countries below were looked up on [Worldometers](https://www.worldometers.info/world-population/population-by-country/) and filled in manually:
- `Democratic Republic of the Congo`: region = Central Africa, continent = Africa
- `Gibraltar`: region = Southern Europe, continent = Europe
- `Guernsey`: region = Northern Europe, continent = Europe
- `Isle of Man`: region = Northern Europe, continent = Europe
- `Ivory Coast`: region = Western Africa, continent = Africa
- `Jersey`: region = Northern Europe, continent = Europe
- `Serbia`: region = Southern Europe, continent = Europe
- `Taiwan`: region = Eastern Asia, continent = Asia


In [46]:
# Add the missing region and continent values for index 47, 70, 76, 90, 93, 96, 164, and 186
country_data.loc[47, 'region'] = 'Central Africa'
country_data.loc[47, 'continent'] = 'Africa'
country_data.loc[70, 'region'] = 'Southern Europe'
country_data.loc[70, 'continent'] = 'Europe'
country_data.loc[76, 'region'] = 'Northern Europe'
country_data.loc[76, 'continent'] = 'Europe'
country_data.loc[90, 'region'] = 'Northern Europe'
country_data.loc[90, 'continent'] = 'Europe'
country_data.loc[93, 'region'] = 'Western Africa'
country_data.loc[93, 'continent'] = 'Africa'
country_data.loc[96, 'region'] = 'Northern Europe'
country_data.loc[96, 'continent'] = 'Europe'
country_data.loc[164, 'region'] = 'Southern Europe'
country_data.loc[164, 'continent'] = 'Europe'
country_data.loc[186, 'region'] = 'Eastern Asia'
country_data.loc[186, 'continent'] = 'Asia'

# Review data just added
print("\nCountry Data for index 47, 70, 76, 90, 93, 96, 164, and 186:")
country_data.loc[[47, 70, 76, 90, 93, 96, 164, 186], :]


Country Data for index 47, 70, 76, 90, 93, 96, 164, and 186:


Unnamed: 0,country,native_name,iso2,iso3,population,area,capital,capital_lat,capital_lng,region,continent
47,Democratic Republic of the Congo,République démocratique du Congo,CD,COD,69360000.0,2344858.0,Kinshasa,-4.321706,15.312597,Central Africa,Africa
70,Gibraltar,Gibraltar,GI,GIB,30001.0,6.0,Gibraltar,36.140807,-5.35413,Southern Europe,Europe
76,Guernsey,Guernsey,GG,GGY,63085.0,78.0,St. Peter Port,49.456814,-2.538998,Northern Europe,Europe
90,Isle of Man,Isle of Man,IM,IMN,84497.0,572.0,Douglas,39.762842,-88.217052,Northern Europe,Europe
93,Ivory Coast,Côte d'Ivoire,CI,CIV,23821000.0,322463.0,Yamoussoukro,6.809107,-5.273263,Western Africa,Africa
96,Jersey,Jersey,JE,JEY,99000.0,116.0,Saint Helier,47.384387,4.683325,Northern Europe,Europe
164,Serbia,Srbija,RS,SRB,7186862.0,49037.0,Belgrade,44.817813,20.456897,Southern Europe,Europe
186,Taiwan,臺灣,TW,TWN,23424615.0,36193.0,Taipei,25.03752,121.56368,Eastern Asia,Asia
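
When several columns need manual values for the same rows, the corrections can also be collected in one small table and applied in a single call. The following is a sketch only, using three of the rows above plus one already-complete row; `demo` and `patch` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Stand-in for country_data, indexed with the original row labels
demo = pd.DataFrame(
    {"country": ["Gibraltar", "Serbia", "Taiwan", "Albania"],
     "region": [np.nan, np.nan, np.nan, "Southern Europe"],
     "continent": [np.nan, np.nan, np.nan, "Europe"]},
    index=[70, 164, 186, 1],
)

# One row per correction, indexed like the target frame
patch = pd.DataFrame.from_dict(
    {70: {"region": "Southern Europe", "continent": "Europe"},
     164: {"region": "Southern Europe", "continent": "Europe"},
     186: {"region": "Eastern Asia", "continent": "Asia"}},
    orient="index",
)

# overwrite=False fills only missing cells and never replaces existing values
demo.update(patch, overwrite=False)
```

Keeping the corrections in a single table makes them easier to review than a long run of individual `.loc` assignments.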


In [50]:
country_data.isnull().sum()

country        0
native_name    0
iso2           0
iso3           0
population     0
area           0
capital        0
capital_lat    0
capital_lng    0
region         0
continent      0
dtype: int64

#### 2.4.2.3 Drop Unnecessary Columns/Rows

Drop columns or rows that are not needed for the analysis based on the project's scope.

*Keeping* the following columns:
- 'country'
- 'iso2'
- 'iso3'
- 'population'
- 'area'
- 'region'
- 'continent'

*Removing* the following columns:
- 'native_name'
- 'capital'
- 'capital_lat'
- 'capital_lng'

In [51]:
# Drop columns that are not needed
country_data.drop(columns=['native_name', 'capital', 'capital_lat', 'capital_lng'], inplace=True)

# Confirm the changes
country_data.head()

Unnamed: 0,country,iso2,iso3,population,area,region,continent
0,Afghanistan,AF,AFG,26023100.0,652230.0,Southern and Central Asia,Asia
1,Albania,AL,ALB,2895947.0,28748.0,Southern Europe,Europe
2,Algeria,DZ,DZA,38700000.0,2381741.0,Northern Africa,Africa
3,American Samoa,AS,ASM,55519.0,199.0,Polynesia,Oceania
4,Angola,AO,AGO,24383301.0,1246700.0,Central Africa,Africa


#### 2.4.2.4 Rename Columns

Rename columns to have meaningful names and to follow a consistent naming convention.

No renaming needed.

#### 2.4.2.5 Standardizing Text Data

Standardize country names, state names, and other text-based fields to ensure uniformity.
Use .str.lower() or .str.upper() to standardize text.
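
Case folding alone does not reconcile spellings that differ only by accents or stray whitespace. As a minimal sketch (the helper `normalize_name` is hypothetical and not part of the project's scripts), a comparison key can be built with the standard library:

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Trim whitespace, strip accents, and casefold a name for comparison."""
    decomposed = unicodedata.normalize("NFKD", name.strip())
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.casefold()


names = ["São Tomé and Príncipe", "  Côte d'Ivoire ", "CZECHIA"]
keys = [normalize_name(n) for n in names]
```

Names written in non-Latin scripts would be reduced to an empty string by the ASCII step, so a key like this is only suitable for the English `country` column, and it is best used for matching rather than for overwriting the displayed names.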

##### 2.4.2.5.1 Standardize `country` Names

In [56]:
# Import the function
from data_location_matcher import find_matching_and_non_matching

# Find matching and non-matching countries
country_data_matching_countries, country_data_non_matching_countries = find_matching_and_non_matching(country_data, 'country')

# View the non-matching countries
country_data_non_matching_countries

{'Andorra',
 'Barbados',
 'Cabo Verde',
 "Cote d'Ivoire",
 'Czechia',
 'Eritrea',
 'Eswatini',
 'Korea, North',
 'Korea, South',
 'Kosovo',
 'Micronesia',
 'Montenegro',
 'Myanmar',
 'Nauru',
 'Palau',
 'Palestine',
 'Panama',
 'Saint Lucia',
 'Saint Vincent and the Grenadines',
 'Timor-Leste',
 'Tonga',
 'Vatican City'}
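
`find_matching_and_non_matching` is a project helper whose implementation is not shown in this notebook. Purely as an illustration of the general idea, a comparison of this kind can be expressed with set operations. The function `split_by_reference` and its `reference_names` argument are hypothetical and may not reflect how the actual helper works. The direction of the difference is also an assumption, inferred from the output above listing 'Czechia' and 'Cabo Verde' while `country_data` contains 'Czech Republic' and 'Cape Verde'.

```python
def split_by_reference(values, reference_names):
    """Return (matching, non_matching) names relative to a reference set."""
    values = set(values)
    reference_names = set(reference_names)
    matching = values & reference_names
    # Assumed direction: names in the reference that the data does not contain
    non_matching = reference_names - values
    return matching, non_matching


matching, non_matching = split_by_reference(
    ["Albania", "Cape Verde", "Czech Republic"],
    ["Albania", "Cabo Verde", "Czechia"],
)
```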

In [57]:
country_data_country_replacement_dict = {
    'Democratic Republic of the Congo': 'Congo (Kinshasa)',
    'Republic of the Congo': 'Congo (Brazzaville)',
    'Korea, North': 'North Korea',
    'Korea, South': 'South Korea',
    'São Tomé and Príncipe': 'Sao Tome and Principe', 
    'The Bahamas': 'Bahamas',
    'The Gambia': 'Gambia',
    'United States': 'United States of America'
}

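# Replace the country names in the country dataframe.
# Assign the result back instead of using inplace=True on the selected column,
# which recent pandas versions warn about and which has no effect under copy-on-write.
country_data['country'] = country_data['country'].replace(country_data_country_replacement_dict)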

country_data['country'].unique()

array(['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Angola',
       'Anguilla', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba',
       'Australia', 'Austria', 'Azerbaijan', 'Bahrain', 'Bangladesh',
       'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan',
       'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil',
       'Brunei', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia',
       'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands',
       'Central African Republic', 'Chad', 'Chile', 'China',
       'Christmas Island', 'Colombia', 'Comoros', 'Cook Islands',
       'Costa Rica', 'Croatia', 'Cuba', 'Cyprus', 'Czech Republic',
       'Congo (Kinshasa)', 'Denmark', 'Djibouti', 'Dominica',
       'Dominican Republic', 'East Timor', 'Ecuador', 'Egypt',
       'El Salvador', 'Equatorial Guinea', 'Estonia', 'Ethiopia',
       'Falkland Islands', 'Fiji', 'Finland', 'France', 'French Guiana',
       'French Polynesia', 'French Southern and Antarctic La

##### 2.4.2.5.2 Standardize `state` Names

As the output below shows, the `country_data` DataFrame has no `state` column.

In [58]:
country_data.head()

Unnamed: 0,country,iso2,iso3,population,area,region,continent
0,Afghanistan,AF,AFG,26023100.0,652230.0,Southern and Central Asia,Asia
1,Albania,AL,ALB,2895947.0,28748.0,Southern Europe,Europe
2,Algeria,DZ,DZA,38700000.0,2381741.0,Northern Africa,Africa
3,American Samoa,AS,ASM,55519.0,199.0,Polynesia,Oceania
4,Angola,AO,AGO,24383301.0,1246700.0,Central Africa,Africa


#### 2.4.2.6 Merging with Other Datasets

Merge/join the city_data DataFrame with relevant datasets like country_data.
Make sure to do this after ensuring that the key columns (like country names, city names, etc.) are standardized.

In [59]:
city_data.shape, country_data.shape

((1245, 8), (214, 7))

In [62]:
# Merge/join the city and country dataframes
city_country = pd.merge(city_data, country_data, on=['country', 'iso2', 'iso3'], how='left')


# Check the shape of the merged dataframe
city_country.shape

(1245, 12)
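
A left merge keeps every city row even when no country matches, so an unchanged row count does not show whether the join succeeded for every row. The sketch below, on a small stand-in for the two frames, uses the `validate` and `indicator` options of `pd.merge` to surface unmatched keys right after the merge. The frame contents are illustrative only.

```python
import pandas as pd

cities = pd.DataFrame({
    "city": ["Asadabad", "Podgorica", "Yangon"],
    "country": ["Afghanistan", "Montenegro", "Myanmar"],
    "iso2": ["AF", "ME", "MM"],
    "iso3": ["AFG", "MNE", "MMR"],
})
countries = pd.DataFrame({
    "country": ["Afghanistan"],
    "iso2": ["AF"],
    "iso3": ["AFG"],
    "region": ["Southern and Central Asia"],
})

merged = pd.merge(
    cities, countries,
    on=["country", "iso2", "iso3"],
    how="left",
    validate="many_to_one",  # raise if a country key is duplicated on the right
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# Countries present in the city table but absent from the country table
unmatched = merged.loc[merged["_merge"] == "left_only", "country"].unique()
```

The `_merge` column can be dropped once the unmatched keys have been reviewed.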

In [63]:
city_country

Unnamed: 0,station_id,city,country,state,iso2,iso3,latitude,longitude,population,area,region,continent
0,41515,Asadabad,Afghanistan,,AF,AFG,34.866000,71.150005,26023100.0,652230.0,Southern and Central Asia,Asia
1,38954,Fayzabad,Afghanistan,,AF,AFG,37.129761,70.579247,26023100.0,652230.0,Southern and Central Asia,Asia
2,41560,Jalalabad,Afghanistan,,AF,AFG,34.441527,70.436103,26023100.0,652230.0,Southern and Central Asia,Asia
3,38947,Kunduz,Afghanistan,,AF,AFG,36.727951,68.872530,26023100.0,652230.0,Southern and Central Asia,Asia
4,38987,Qala i Naw,Afghanistan,,AF,AFG,34.983000,63.133300,26023100.0,652230.0,Southern and Central Asia,Asia
...,...,...,...,...,...,...,...,...,...,...,...,...
1240,67475,Kasama,Zambia,,ZM,ZMB,-10.199598,31.179947,15023315.0,752612.0,Eastern Africa,Europe
1241,68030,Livingstone,Zambia,,ZM,ZMB,-17.860009,25.860013,15023315.0,752612.0,Eastern Africa,Europe
1242,67633,Mongu,Zambia,,ZM,ZMB,-15.279598,23.120025,15023315.0,752612.0,Eastern Africa,Europe
1243,67775,Harare,Zimbabwe,,ZW,ZWE,-17.817790,31.044709,13061239.0,390757.0,Eastern Africa,Africa


#### 2.4.2.7 Data Type Conversion

Convert columns to the appropriate data type (float, integer, string, datetime, etc.).

All data types are appropriate for the columns.

In [65]:
city_country.dtypes

station_id     object
city           object
country        object
state          object
iso2           object
iso3           object
latitude      float64
longitude     float64
population    float64
area          float64
region         object
continent      object
dtype: object
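
`station_id` is stored as text (`object`) rather than as a number. Identifiers that appear later in this notebook, such as '01008' and 'KPHF0', suggest why that matters: a numeric type would drop leading zeros and cannot hold letters. A minimal sketch of the effect when reading a file, with made-up city values:

```python
import io

import pandas as pd

csv_text = "station_id,city\n01008,A\n41515,B\n"

# Default type inference turns the identifier into an integer
as_number = pd.read_csv(io.StringIO(csv_text))

# Forcing a string dtype preserves the identifier exactly as written
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"station_id": str})
```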

#### 2.4.2.8 Handling Categorical Variables

Label encode or one-hot encode categorical variables as needed.

Refer to the discussion in the `city_data` section ([Section 2.4.1.8](#2418-handling-categorical-variables)) for the rationale behind not encoding the geographic identifiers.

#### 2.4.2.9 Outliers Detection and Treatment

Use graphical methods like boxplots or use IQR to detect outliers.
Decide on a treatment method - either remove them or cap them.

Refer to the discussion in the `city_data` section ([Section 2.4.1.9](#2419-outliers-detection-and-treatment)) for the rationale behind not treating the geographic coordinates as outliers.
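
For reference, the IQR rule mentioned above can be sketched as follows. The helper `iqr_outliers` is hypothetical, and the sample uses a handful of `area` values that appear in this notebook. No such treatment is applied to `country_data` here.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


area = pd.Series([199.0, 28748.0, 652230.0, 1246700.0, 2381741.0, 7692024.0])
mask = iqr_outliers(area)
```

In this sample only the largest value (Australia's area) is flagged. It is a genuine value, which illustrates why flagged points should be reviewed rather than removed automatically.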

#### 2.4.2.10 Secondary Review for Missing and Zero Values

Conduct a second review for missing values.
Decide on an imputation strategy for each column with missing values.

In [66]:
city_country.isnull().sum()

station_id    0
city          0
country       0
state         0
iso2          0
iso3          0
latitude      0
longitude     0
population    7
area          7
region        7
continent     7
dtype: int64

In [67]:
# Review data where the population, area, region, and continent are missing
city_country[city_country['population'].isnull()]

Unnamed: 0,station_id,city,country,state,iso2,iso3,latitude,longitude,population,area,region,continent
37,7627,Andorra la Vella,Andorra,,AD,AND,42.500001,1.516486,,,,
659,13463,Podgorica,Montenegro,,ME,MNE,42.465973,19.266307,,,,
677,48042,Mandalay,Myanmar,,MM,MMR,21.969988,96.085029,,,,
678,48375,Mawlamyine,Myanmar,,MM,MMR,16.500426,97.670048,,,,
679,48008,Myitkyina,Myanmar,,MM,MMR,25.359626,97.392753,,,,
680,48062,Sittwe,Myanmar,,MM,MMR,20.139997,92.880005,,,,
681,48097,Yangon,Myanmar,,MM,MMR,16.783354,96.166678,,,,


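These seven rows belong to Andorra, Montenegro, and Myanmar, which found no match in `country_data` during the left merge. The missing values were looked up on [Worldometers](https://www.worldometers.info/world-population/population-by-country/) and filled in manually.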

In [68]:
# Add the missing population, area, region, and continent values for index 37, 659, 677, 678, 679, 680, and 681

# Creating update dictionaries for each column
update_population = {37: 24200, 659: 150977, 677: 1727000, 678: 289388, 679: 200000, 680: 1099568, 681: 7360703}
update_area = {37: 11999, 659: 1441, 677: 29686, 678: 6084, 679: 411000, 680: 12504, 681: 598.8}
update_region = {37: 'Southern Europe', 659: 'Southern Europe', 677: 'Southeast Asia', 678: 'Southeast Asia', 679: 'Southeast Asia', 680: 'Southeast Asia', 681: 'Southeast Asia'}
update_continent = {37: 'Europe', 659: 'Europe', 677: 'Asia', 678: 'Asia', 679: 'Asia', 680: 'Asia', 681: 'Asia'}

# Write each set of updates into its column; .loc aligns the Series on the row index
for column, updates in [('population', update_population), ('area', update_area),
                        ('region', update_region), ('continent', update_continent)]:
    city_country.loc[list(updates), column] = pd.Series(updates)

# Displaying the updated DataFrame
city_country.loc[[37, 659, 677, 678, 679, 680, 681], :]

Unnamed: 0,station_id,city,country,state,iso2,iso3,latitude,longitude,population,area,region,continent
37,7627,Andorra la Vella,Andorra,,AD,AND,42.500001,1.516486,24200.0,11999.0,Southern Europe,Europe
659,13463,Podgorica,Montenegro,,ME,MNE,42.465973,19.266307,150977.0,1441.0,Southern Europe,Europe
677,48042,Mandalay,Myanmar,,MM,MMR,21.969988,96.085029,1727000.0,29686.0,Southeast Asia,Asia
678,48375,Mawlamyine,Myanmar,,MM,MMR,16.500426,97.670048,289388.0,6084.0,Southeast Asia,Asia
679,48008,Myitkyina,Myanmar,,MM,MMR,25.359626,97.392753,200000.0,411000.0,Southeast Asia,Asia
680,48062,Sittwe,Myanmar,,MM,MMR,20.139997,92.880005,1099568.0,12504.0,Southeast Asia,Asia
681,48097,Yangon,Myanmar,,MM,MMR,16.783354,96.166678,7360703.0,598.8,Southeast Asia,Asia


In [69]:
city_country.isnull().sum()

station_id    0
city          0
country       0
state         0
iso2          0
iso3          0
latitude      0
longitude     0
population    0
area          0
region        0
continent     0
dtype: int64

#### 2.4.2.11 Replacing Missing Values

Use techniques like mean imputation, median imputation, or more sophisticated methods like k-nearest neighbors or multiple imputations.

No missing values to replace.
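
For completeness, a statistical imputation of the kind listed above could look like the sketch below, which fills a gap with the median of the other rows in the same region. The frame is a made-up stand-in. For factual attributes such as area, the looked-up values used in this notebook are preferable to an estimate.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "region": ["Southern Europe", "Southern Europe", "Southern Europe", "Eastern Asia"],
    "area": [28748.0, 131990.0, np.nan, 36193.0],
})

# Replace each gap with the median of the other rows in the same region
demo["area_imputed"] = demo["area"].fillna(
    demo.groupby("region")["area"].transform("median")
)
```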

#### 2.4.2.12 Data Transformation

Normalize or standardize numerical columns if needed.
Log transformation for skewed data.

Refer to the discussion in the `city_data` section ([Section 2.4.1.12](#24112-data-transformation)) for the rationale behind not transforming the geographic coordinates.
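
Should a later modeling step need it, a log transformation of a heavily skewed column such as `population` can be sketched as below, using four population values that appear in this notebook. `np.log1p` computes log(1 + x), which stays defined at zero.

```python
import numpy as np
import pandas as pd

population = pd.Series([30.0, 4255.0, 631000.0, 26439111.0])

# Compress the range while preserving the ordering of the values
log_population = np.log1p(population)
```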

#### 2.4.2.13 Checking and Removing Duplicates

Use .duplicated() to check for duplicate rows and .drop_duplicates() to remove them.

In [70]:
# Check for duplicates
city_country.duplicated().sum()

0
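
`duplicated()` with no arguments only flags rows that are identical in every column. As a sketch under the assumption that `station_id` should be unique per row, checking the key column alone can reveal near-duplicates that a full-row comparison misses. The trailing space in the sample is deliberate.

```python
import pandas as pd

demo = pd.DataFrame({
    "station_id": ["41515", "41515", "38954"],
    "city": ["Asadabad", "Asadabad ", "Fayzabad"],  # note the trailing space
})

full_row_dupes = demo.duplicated().sum()                      # compares all columns
key_dupes = demo.duplicated(subset=["station_id"]).sum()      # compares the key only
```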

#### 2.4.2.14 Summary for Data Cleaning Steps for `country_data`

In this section, we have executed a comprehensive set of data cleaning actions to enhance the quality and usability of the `country_data` DataFrame. Here's a summary of what was achieved:

1. **Data Consistency Check**: Conducted an initial review for consistency across all columns to ensure no anomalies or contradictions exist.

2. **Initial Review for Missing and Zero Values**: Identified columns with missing values and filled them manually from external sources (Worldometers, AllCountries.eu, Britannica, and Google Maps).

3. **Drop Unnecessary Columns/Rows**: Removed the `native_name`, `capital`, `capital_lat`, and `capital_lng` columns, which are outside the scope of the analysis. No rows were dropped.

4. **Rename Columns**: No renaming was needed.

5. **Standardizing Text Data**: Standardized `country` names with a replacement dictionary. `country_data` has no `state` column.

6. **Merging with Other Datasets**: Left-merged the `city_data` DataFrame with the `country_data` DataFrame on `country`, `iso2`, and `iso3`, producing `city_country` (1,245 rows, 12 columns).

7. **Data Type Conversion**: Reviewed the data types; no conversion was needed.

8. **Handling Categorical Variables**: No encoding was applied, following the rationale given for `city_data` in Section 2.4.1.8.

9. **Outliers Detection and Treatment**: No outlier treatment was applied, following the rationale given for `city_data` in Section 2.4.1.9.

10. **Second Review for Missing Values**: Found seven rows (Andorra, Montenegro, and Myanmar) with missing `population`, `area`, `region`, and `continent` after the merge and filled them manually.

11. **Replace Missing Values**: No further imputation was needed once the manual fills were complete.

12. **Data Transformation**: No transformation was applied, following the rationale given for `city_data` in Section 2.4.1.12.

13. **Checking and Removing Duplicates**: Checked for duplicate rows; none were found.

These cleaning steps have ensured that the `country_data` DataFrame is now in a state that is well-prepared for the subsequent stages of data integration, feature engineering, and modeling. The methodologies and strategies applied here will be similarly applied to the remaining DataFrames: `weather_data` and `migraine_data`.

### 2.4.3 DataFrame: `weather_data`

#### 2.4.3.1 Data Consistency Check

- Check if the data is consistent across all columns, i.e., no anomalies or contradictions
- Use .describe() to obtain summary statistics, if appropriate
- Use .info() to get an overview of the dataset

##### 2.4.3.1.1 View Data

In [71]:
# View the weather data
weather_data

Unnamed: 0,station_id,city_name,date,season,avg_temp_c,min_temp_c,max_temp_c,precipitation_mm,snow_depth_mm,avg_wind_dir_deg,avg_wind_speed_kmh,peak_wind_gust_kmh,avg_sea_level_pres_hpa,sunshine_total_min
0,41515,Asadabad,1957-07-01,Summer,27.0,21.1,35.6,0.0,,,,,,
1,41515,Asadabad,1957-07-02,Summer,22.8,18.9,32.2,0.0,,,,,,
2,41515,Asadabad,1957-07-03,Summer,24.3,16.7,35.6,1.0,,,,,,
3,41515,Asadabad,1957-07-04,Summer,26.6,16.1,37.8,4.1,,,,,,
4,41515,Asadabad,1957-07-05,Summer,30.8,20.0,41.7,0.0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24216,67975,Masvingo,2023-09-01,Spring,19.5,9.6,28.4,,,180.0,4.6,,,
24217,67975,Masvingo,2023-09-02,Spring,21.3,10.5,31.4,,,146.0,6.3,,,
24218,67975,Masvingo,2023-09-03,Spring,22.1,13.0,31.5,,,147.0,8.2,,,
24219,67975,Masvingo,2023-09-04,Spring,21.5,13.1,29.7,,,155.0,10.2,,,


##### 2.4.3.1.2 Summary Statistics

In [78]:
# Obtain summary statistics for the weather data
weather_data.describe()

Unnamed: 0,date,avg_temp_c,min_temp_c,max_temp_c,precipitation_mm,snow_depth_mm,avg_wind_dir_deg,avg_wind_speed_kmh,peak_wind_gust_kmh,avg_sea_level_pres_hpa,sunshine_total_min
count,27635763,21404856.0,21917534.0,22096417.0,20993263.0,3427148.0,3452568.0,5285468.0,1121486.0,4017157.0,1021461.0
mean,1982-11-29 10:03:03.195926336,15.72,9.95,20.16,2.74,79.96,182.1,12.41,38.58,1015.03,350.44
min,1750-02-01 00:00:00,-70.0,-99.0,-99.0,0.0,0.0,0.0,0.0,0.0,861.0,0.0
25%,1965-05-06 00:00:00,8.3,2.8,12.0,0.0,0.0,86.0,7.5,26.3,1010.3,54.0
50%,1988-02-04 00:00:00,17.9,11.1,22.3,0.0,0.0,191.0,10.9,35.3,1014.7,346.0
75%,2007-01-25 00:00:00,25.7,19.2,30.2,1.0,20.0,271.0,15.7,46.4,1019.8,594.0
max,2023-09-05 00:00:00,50.4,64.2,97.0,1000.0,9710.0,360.0,176.3,439.2,5852.7,1302.0
std,,12.02,11.56,12.49,9.79,350.08,105.2,7.05,20.11,8.52,281.51


##### 2.4.3.1.3 Data Summary

**General Overview**
- **Total Records**: 27,635,763 observations.

**Columns Information**

1. **`avg_temp_c`**
    - **Mean**: 15.72°C
    - **Min**: -70.00°C
    - **Max**: 50.40°C

2. **`min_temp_c`**
    - **Mean**: 9.95°C
    - **Min**: -99.00°C
    - **Max**: 64.20°C

3. **`max_temp_c`**
    - **Mean**: 20.16°C
    - **Min**: -99.00°C
    - **Max**: 97.00°C

4. **`precipitation_mm`**
    - **Mean**: 2.74 mm
    - **Min**: 0.00 mm
    - **Max**: 1000.00 mm

5. **`snow_depth_mm`**
    - **Mean**: 79.96 mm
    - **Min**: 0.00 mm
    - **Max**: 9710.00 mm

6. **`avg_wind_dir_deg`**
    - **Mean**: 182.10°
    - **Min**: 0.00°
    - **Max**: 360.00°

7. **`avg_wind_speed_kmh`**
    - **Mean**: 12.41 km/h
    - **Min**: 0.00 km/h
    - **Max**: 176.30 km/h

8. **`peak_wind_gust_kmh`**
    - **Mean**: 38.58 km/h
    - **Min**: 0.00 km/h
    - **Max**: 439.20 km/h

9. **`avg_sea_level_pres_hpa`**
    - **Mean**: 1015.03 hPa
    - **Min**: 861.00 hPa
    - **Max**: 5852.70 hPa

10. **`sunshine_total_min`**
    - **Mean**: 350.44 minutes
    - **Min**: 0.00 minutes
    - **Max**: 1302.00 minutes

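**Key Takeaways**

- The data spans a wide range in terms of temperature, precipitation, wind speed, and other meteorological factors, and `date` runs from 1750-02-01 to 2023-09-05.
- The `count` row differs between columns (for example 21,404,856 for `avg_temp_c` versus 1,021,461 for `sunshine_total_min`), so every numeric column has missing values.
- Several extremes are physically implausible and likely reflect placeholder codes or recording errors rather than real measurements: `min_temp_c` and `max_temp_c` both have a minimum of exactly -99.00°C, `max_temp_c` peaks at 97.00°C, and `avg_sea_level_pres_hpa` peaks at 5852.70 hPa, more than five times its mean.
- The extremes are also inconsistent with each other: the maximum of `min_temp_c` (64.20°C) exceeds the maximum of `avg_temp_c` (50.40°C).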
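
Extremes such as -99.00°C, 97.00°C, and 5852.70 hPa in the summary statistics can be screened with simple range checks. The sketch below is one possible approach, not a step performed in this notebook. The limits in `bounds` are hypothetical and would need to be chosen for the analysis at hand.

```python
import pandas as pd

# Hypothetical plausibility limits per column: (lowest, highest) accepted value
bounds = {
    "min_temp_c": (-90.0, 60.0),
    "max_temp_c": (-90.0, 60.0),
    "avg_sea_level_pres_hpa": (850.0, 1100.0),
}

# Small stand-in using extreme values from the summary statistics
demo = pd.DataFrame({
    "min_temp_c": [-99.0, 11.1, 64.2],
    "max_temp_c": [-99.0, 22.3, 97.0],
    "avg_sea_level_pres_hpa": [1014.7, 5852.7, 861.0],
})

# Count recorded values that fall outside the limits (missing values are not counted)
out_of_range = {
    col: int((demo[col].notna() & ~demo[col].between(lo, hi)).sum())
    for col, (lo, hi) in bounds.items()
}

# Turn out-of-range values into missing values so they cannot distort later statistics
for col, (lo, hi) in bounds.items():
    demo[col] = demo[col].where(demo[col].between(lo, hi))
```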

##### 2.4.3.1.4 Column Overview

In [79]:
weather_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 27635763 entries, 0 to 24220
Data columns (total 14 columns):
 #   Column                  Dtype
---  ------                  -----
 0   station_id              category
 1   city_name               category
 2   date                    datetime64[ns]
 3   season                  category
 4   avg_temp_c              float64
 5   min_temp_c              float64
 6   max_temp_c              float64
 7   precipitation_mm        float64
 8   snow_depth_mm           float64
 9   avg_wind_dir_deg        float64
 10  avg_wind_speed_kmh      float64
 11  peak_wind_gust_kmh      float64
 12  avg_sea_level_pres_hpa  float64
 13  sunshine_total_min      float64
dtypes: category(3), datetime64[ns](1), float64(10)
memory usage: 2.6 GB


##### 2.4.3.1.5 DataFrame Summary

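**General Information**

- **Total Records**: 27,635,763 entries
- **Index**: Non-unique integer labels ranging from 0 to 24,220. With far fewer labels than rows, the index was probably carried over from the source tables rather than reset after they were combined.
- **Memory Usage**: Approximately 2.6 GB

**Columns and Data Types**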

- **Total Columns**: 14 columns

1. **`station_id`**: 
  - **Data Type**: Category

2. **`city_name`**: 
  - **Data Type**: Category

3. **`date`**: 
  - **Data Type**: Datetime64[ns]

4. **`season`**: 
  - **Data Type**: Category

5. **Numerical Columns (Float64)**: 
  - `avg_temp_c`, `min_temp_c`, `max_temp_c`, `precipitation_mm`, `snow_depth_mm`, `avg_wind_dir_deg`, `avg_wind_speed_kmh`, `peak_wind_gust_kmh`, `avg_sea_level_pres_hpa`, `sunshine_total_min`

**Key Takeaways**

- **Categorical Columns**: The `station_id`, `city_name`, and `season` columns are categorical, likely to contain a limited set of unique values.
  
- **Datetime Column**: The `date` column is of datetime format, useful for time-series analysis.

- **Numerical Columns**: There are 10 columns with numerical data (float64), covering various meteorological factors.

- **Memory**: The DataFrame is quite large, occupying about 2.6 GB of memory. This could impact performance and may require optimized handling for large-scale data processing.

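- **Data Completeness**: For a DataFrame this large, `.info()` omits the per-column non-null counts by default, so completeness cannot be read from this output. The `count` row of `.describe()` in the Summary Statistics section shows that every numeric column has missing values.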
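
As a minimal sketch on a tiny stand-in frame, the non-null counts can be requested explicitly with `show_counts=True`, and the effect of storing the measurements as `float32` can be estimated with `memory_usage`. Whether the reduced precision is acceptable is an assumption that would need checking against the analysis.

```python
import io

import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "avg_temp_c": np.array([27.0, 22.8, np.nan, 26.6]),
    "precipitation_mm": np.array([0.0, 0.0, 1.0, np.nan]),
})

# Force the non-null counts that info() leaves out for very large frames
buffer = io.StringIO()
demo.info(show_counts=True, buf=buffer)
report = buffer.getvalue()

# float32 uses half the bytes of float64 per value, at reduced precision
before = demo.memory_usage(deep=True).sum()
after = demo.astype("float32").memory_usage(deep=True).sum()
```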

#### 2.4.3.2 Initial Review for Missing and Zero Values

Conduct an initial review for missing values using .isna().sum() and for zero values.

In [81]:
# Check for missing values
print("\nWeather Missing Values:")
print(weather_data.isnull().sum())


Weather Missing Values:


station_id                       0
city_name                    13993
date                             0
season                           0
avg_temp_c                 6230907
min_temp_c                 5718229
max_temp_c                 5539346
precipitation_mm           6642500
snow_depth_mm             24208615
avg_wind_dir_deg          24183195
avg_wind_speed_kmh        22350295
peak_wind_gust_kmh        26514277
avg_sea_level_pres_hpa    23618606
sunshine_total_min        26614302
dtype: int64
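
Raw counts are hard to compare at this scale. A small sketch of reporting the missing share per column alongside the count, shown on a made-up four-row frame:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "avg_temp_c": [27.0, np.nan, 24.3, 26.6],
    "avg_sea_level_pres_hpa": [np.nan, np.nan, np.nan, 1014.7],
})

# Count and percentage of missing values per column, worst column first
missing = pd.DataFrame({
    "n_missing": demo.isna().sum(),
    "pct_missing": demo.isna().mean().mul(100).round(1),
}).sort_values("pct_missing", ascending=False)
```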


##### Handling Missing Values in Weather DataFrame

**Key Considerations**

1. **Missing Values**: There are 23,618,606 missing entries in the `avg_sea_level_pres_hpa` column, about 85.5% of all rows.

2. **Data Size**: The DataFrame has 27,635,763 entries. Dropping the rows with missing `avg_sea_level_pres_hpa` still leaves 4,017,157 rows.

3. **Data Variability**: Sea level pressure can vary greatly from day to day and from one location to another, making imputation methods like mean or median potentially misleading.

**Options**

1. **Drop Missing Rows**: Given the large size of the dataset and the specific focus on `avg_sea_level_pres_hpa`, removing rows with missing values in this column is a reasonable approach.

2. **Advanced Imputation**: If sea level pressure is missing in a pattern (e.g., for specific stations or seasons), more advanced imputation methods could be considered. However, given the daily variability of sea level pressure, this may not be advisable.

3. **Temporal Interpolation**: Because the data is a daily time series per station, gaps could be filled by temporal interpolation. However, this assumes that the data points are missing at random and that the time series is stationary, which may not be the case for sea level pressure.
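
To judge options 2 and 3, it helps to see how the gaps are distributed across stations and what a conservative interpolation would fill. The sketch below uses made-up stations "A" and "B" and assumes one row per station per day. It bridges only single-day gaps that lie between two observed values.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "station_id": ["A"] * 5 + ["B"] * 3,
    "date": pd.to_datetime(
        ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05",
         "2023-01-01", "2023-01-02", "2023-01-03"]),
    "avg_sea_level_pres_hpa": [1010.0, np.nan, 1014.0, np.nan, np.nan,
                               np.nan, np.nan, np.nan],
})

# Share of missing pressure readings per station
missing_share = demo["avg_sea_level_pres_hpa"].isna().groupby(demo["station_id"]).mean()

# Interpolate within each station only; limit=1 bridges at most one missing day,
# and limit_area="inside" leaves leading and trailing gaps untouched
demo = demo.sort_values(["station_id", "date"])
demo["pres_interp"] = demo.groupby("station_id")["avg_sea_level_pres_hpa"].transform(
    lambda s: s.interpolate(limit=1, limit_area="inside")
)
```

A station with no readings at all, like "B" here, stays entirely missing, which is the situation in which dropping rows is the only realistic option.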

**Recommendation**

Given the variability of sea level pressure and the large number of missing values, dropping the rows with missing `avg_sea_level_pres_hpa` is the safest and most straightforward option. This leaves 4,017,157 rows (about 14.5% of the original 27,635,763), which should be sufficient for this analysis.

##### Handling Zero Values in Weather DataFrame

In [82]:
# Calculate zero counts for each column
print("\nWeather Zero Counts:")
zero_counts = (weather_data == 0).sum()
print(zero_counts)


Weather Zero Counts:
station_id                       0
city_name                        0
date                             0
season                           0
avg_temp_c                   26378
min_temp_c                  156684
max_temp_c                   57284
precipitation_mm          13381259
snow_depth_mm              2435637
avg_wind_dir_deg             11758
avg_wind_speed_kmh            5137
peak_wind_gust_kmh            3476
avg_sea_level_pres_hpa           0
sunshine_total_min          184364
dtype: int64


The zero counts will be revisited after dropping the rows with missing `avg_sea_level_pres_hpa`.

#### 2.4.3.3 Drop Unnecessary Columns/Rows

Drop columns or rows that are not needed for the analysis based on the project's scope.

*Keeping* all columns:
- 'station_id'
- 'city_name'
- 'date'
- 'season'
- '*_temp_c' (avg, min, max)
- 'precipitation_mm'
- 'snow_depth_mm'
- 'avg_wind_dir_deg'
- 'avg_wind_speed_kmh'
- 'peak_wind_gust_kmh'
- 'avg_sea_level_pres_hpa'
- 'sunshine_total_min'

In [84]:
weather_data.isnull().sum()

station_id                       0
city                         13993
date                             0
season                           0
avg_temp_c                 6230907
min_temp_c                 5718229
max_temp_c                 5539346
precipitation_mm           6642500
snow_depth_mm             24208615
avg_wind_dir_deg          24183195
avg_wind_speed_kmh        22350295
peak_wind_gust_kmh        26514277
avg_sea_level_pres_hpa    23618606
sunshine_total_min        26614302
dtype: int64

In [85]:
weather_data.shape

(27635763, 14)

##### Dropping Rows with Missing `avg_sea_level_pres_hpa`

In [87]:
# Drop the rows where `avg_sea_level_pres_hpa` is missing
weather_data.dropna(subset=['avg_sea_level_pres_hpa'], inplace=True)

In [88]:
weather_data.shape

(4017157, 14)

In [89]:
weather_data.isnull().sum()

station_id                      0
city                         2295
date                            0
season                          0
avg_temp_c                   1080
min_temp_c                   7747
max_temp_c                   7745
precipitation_mm           558298
snow_depth_mm             2495599
avg_wind_dir_deg          1103012
avg_wind_speed_kmh         193315
peak_wind_gust_kmh        3043357
avg_sea_level_pres_hpa          0
sunshine_total_min        3239559
dtype: int64

##### Handling Missing City Names

In [90]:
missing_city_data = weather_data[weather_data['city'].isnull()]

missing_city_data

Unnamed: 0,station_id,city,date,season,avg_temp_c,min_temp_c,max_temp_c,precipitation_mm,snow_depth_mm,avg_wind_dir_deg,avg_wind_speed_kmh,peak_wind_gust_kmh,avg_sea_level_pres_hpa,sunshine_total_min
11064,40360,,2015-08-22,Summer,30.40,21.40,40.40,,,284.00,17.40,,1007.90,
11069,40360,,2015-08-27,Summer,29.40,20.60,38.00,,,287.00,21.30,,1009.40,
11071,40360,,2015-08-29,Summer,28.30,18.80,37.40,,,298.00,20.60,,1011.50,
11072,40360,,2015-08-30,Summer,27.70,20.00,36.40,,,284.00,20.80,,1009.80,
11082,40360,,2015-09-09,Autumn,33.30,25.00,40.40,,,310.00,10.80,,1008.50,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13988,40360,,2023-08-24,Summer,32.40,24.00,42.00,0.00,,302.00,20.30,,1009.80,
13989,40360,,2023-08-25,Summer,29.90,22.00,39.00,0.00,,299.00,20.50,,1009.30,
13990,40360,,2023-08-26,Summer,29.90,22.00,39.00,0.00,,298.00,22.00,,1009.10,
13991,40360,,2023-08-27,Summer,28.70,21.70,35.90,0.00,,299.00,25.00,,1007.80,


In [91]:
# check the unique values for the `station_id` column
missing_city_data['station_id'].unique()

['40360']
Categories (1227, object): ['01008', '01026', '01271', '01403', ..., 'D6170', 'D6217', 'EDTR0', 'KPHF0']

In [97]:
# See if the `station_id` values in the missing_city_data are in the city_country dataframe
missing_city_data['station_id'].isin(city_country['station_id']).sum()

2295

The missing city names will be revisited after merging with the `city_country` DataFrame.
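As an alternative to waiting for the merge, the missing names could be filled directly from a station-to-city lookup. The sketch below assumes a lookup table with one city per `station_id`; the city name `"Example City"` is a placeholder, not the real name for station 40360.

```python
import numpy as np
import pandas as pd

# Toy weather rows: station 40360 has no city name (as in the notebook)
weather = pd.DataFrame({
    "station_id": ["40360", "40360", "41515"],
    "city": [np.nan, np.nan, "Asadabad"],
})

# Hypothetical lookup table standing in for city_country
lookup = pd.DataFrame({
    "station_id": ["40360", "41515"],
    "city": ["Example City", "Asadabad"],
})

# Build a station_id -> city mapping; keep the first city if a station repeats
station_to_city = lookup.drop_duplicates("station_id").set_index("station_id")["city"]

# Fill only the missing city names, leaving existing names untouched
weather["city"] = weather["city"].fillna(weather["station_id"].map(station_to_city))
print(weather)
```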

#### 2.4.3.4 Rename Columns

Rename columns to have meaningful names and to follow a consistent naming convention.

In [83]:
# Rename the `city_name` column to `city`
weather_data.rename(columns={'city_name': 'city'}, inplace=True)

# Confirm the changes
weather_data.head()

Unnamed: 0,station_id,city,date,season,avg_temp_c,min_temp_c,max_temp_c,precipitation_mm,snow_depth_mm,avg_wind_dir_deg,avg_wind_speed_kmh,peak_wind_gust_kmh,avg_sea_level_pres_hpa,sunshine_total_min
0,41515,Asadabad,1957-07-01,Summer,27.0,21.1,35.6,0.0,,,,,,
1,41515,Asadabad,1957-07-02,Summer,22.8,18.9,32.2,0.0,,,,,,
2,41515,Asadabad,1957-07-03,Summer,24.3,16.7,35.6,1.0,,,,,,
3,41515,Asadabad,1957-07-04,Summer,26.6,16.1,37.8,4.1,,,,,,
4,41515,Asadabad,1957-07-05,Summer,30.8,20.0,41.7,0.0,,,,,,


#### 2.4.3.5 Standardizing Text Data

Standardize country names, state names, and other text-based fields to ensure uniformity.
Use .str.lower() or .str.upper() to standardize text.

In [98]:
weather_data.head()

Unnamed: 0,station_id,city,date,season,avg_temp_c,min_temp_c,max_temp_c,precipitation_mm,snow_depth_mm,avg_wind_dir_deg,avg_wind_speed_kmh,peak_wind_gust_kmh,avg_sea_level_pres_hpa,sunshine_total_min
8603,41515,Asadabad,2021-01-02,Winter,5.6,1.6,9.7,0.0,,81.0,5.3,,1026.4,
8604,41515,Asadabad,2021-01-03,Winter,5.3,1.7,9.7,0.0,,69.0,3.8,,1023.3,
8605,41515,Asadabad,2021-01-04,Winter,4.5,1.5,7.5,0.0,,60.0,1.3,,1024.3,
8606,41515,Asadabad,2021-01-05,Winter,4.9,1.7,7.5,0.0,,75.0,2.0,,1020.1,
8607,41515,Asadabad,2021-01-06,Winter,4.8,1.6,8.6,0.6,,116.0,2.4,,1018.0,


#### 2.4.3.6 Merging with Other Datasets

Merge/join the weather_data DataFrame with relevant datasets like city_country.
Make sure to do this after ensuring that the key columns (like country names, city names, etc.) are standardized.

In [114]:
weather_data.shape, city_country.shape

((4017157, 14), (1245, 12))

In [115]:
# Merge/join the weather and city_country dataframes on the `station_id` column
weather_city_country = pd.merge(weather_data, city_country, on='station_id', how='left')

# Check the shape of the merged dataframe
weather_city_country.shape

(4179075, 25)
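Note that the merged DataFrame has more rows (4,179,075) than `weather_data` (4,017,157). A left merge only adds rows when a key value appears more than once in the right-hand table, so some `station_id` values are probably repeated in `city_country`. The following sketch, on made-up data, shows one way to detect such keys and how the `validate` argument of `pd.merge` can surface the problem:

```python
import pandas as pd

# Toy data: station 91765 appears twice in the lookup table (illustrative)
weather = pd.DataFrame({
    "station_id": ["91765", "91765", "41515"],
    "avg_temp_c": [26.6, 25.9, 5.6],
})
lookup = pd.DataFrame({
    "station_id": ["91765", "91765", "41515"],
    "city": ["Pago Pago", "Apia", "Asadabad"],
})

# Keys that occur more than once in the right-hand table
dup_keys = lookup.loc[lookup["station_id"].duplicated(keep=False), "station_id"].unique()

# A plain left merge silently multiplies the matching weather rows
merged = weather.merge(lookup, on="station_id", how="left")

# validate="many_to_one" raises if the right-hand keys are not unique
validation_failed = False
try:
    weather.merge(lookup, on="station_id", how="left", validate="many_to_one")
except pd.errors.MergeError:
    validation_failed = True

print(list(dup_keys), len(weather), len(merged), validation_failed)
```

If duplicated keys are found, they would need to be resolved (for example, by deciding which city each station belongs to) before the merge, otherwise each affected weather observation is counted once per matching city.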

In [118]:
weather_city_country.head()

Unnamed: 0,station_id,city_x,date,season,avg_temp_c,min_temp_c,max_temp_c,precipitation_mm,snow_depth_mm,avg_wind_dir_deg,avg_wind_speed_kmh,peak_wind_gust_kmh,avg_sea_level_pres_hpa,sunshine_total_min,city_y,country,state,iso2,iso3,latitude,longitude,population,area,region,continent
0,41515,Asadabad,2021-01-02,Winter,5.6,1.6,9.7,0.0,,81.0,5.3,,1026.4,,Asadabad,Afghanistan,,AF,AFG,34.87,71.15,26023100.0,652230.0,Southern and Central Asia,Asia
1,41515,Asadabad,2021-01-03,Winter,5.3,1.7,9.7,0.0,,69.0,3.8,,1023.3,,Asadabad,Afghanistan,,AF,AFG,34.87,71.15,26023100.0,652230.0,Southern and Central Asia,Asia
2,41515,Asadabad,2021-01-04,Winter,4.5,1.5,7.5,0.0,,60.0,1.3,,1024.3,,Asadabad,Afghanistan,,AF,AFG,34.87,71.15,26023100.0,652230.0,Southern and Central Asia,Asia
3,41515,Asadabad,2021-01-05,Winter,4.9,1.7,7.5,0.0,,75.0,2.0,,1020.1,,Asadabad,Afghanistan,,AF,AFG,34.87,71.15,26023100.0,652230.0,Southern and Central Asia,Asia
4,41515,Asadabad,2021-01-06,Winter,4.8,1.6,8.6,0.6,,116.0,2.4,,1018.0,,Asadabad,Afghanistan,,AF,AFG,34.87,71.15,26023100.0,652230.0,Southern and Central Asia,Asia


In [120]:
# move 'city_y' column next to 'city_x' column using pop and insert, keeping same names
weather_city_country.insert(2, 'city_y', weather_city_country.pop('city_y'))

In [121]:
weather_city_country.head()

Unnamed: 0,station_id,city_x,city_y,date,season,avg_temp_c,min_temp_c,max_temp_c,precipitation_mm,snow_depth_mm,avg_wind_dir_deg,avg_wind_speed_kmh,peak_wind_gust_kmh,avg_sea_level_pres_hpa,sunshine_total_min,country,state,iso2,iso3,latitude,longitude,population,area,region,continent
0,41515,Asadabad,Asadabad,2021-01-02,Winter,5.6,1.6,9.7,0.0,,81.0,5.3,,1026.4,,Afghanistan,,AF,AFG,34.87,71.15,26023100.0,652230.0,Southern and Central Asia,Asia
1,41515,Asadabad,Asadabad,2021-01-03,Winter,5.3,1.7,9.7,0.0,,69.0,3.8,,1023.3,,Afghanistan,,AF,AFG,34.87,71.15,26023100.0,652230.0,Southern and Central Asia,Asia
2,41515,Asadabad,Asadabad,2021-01-04,Winter,4.5,1.5,7.5,0.0,,60.0,1.3,,1024.3,,Afghanistan,,AF,AFG,34.87,71.15,26023100.0,652230.0,Southern and Central Asia,Asia
3,41515,Asadabad,Asadabad,2021-01-05,Winter,4.9,1.7,7.5,0.0,,75.0,2.0,,1020.1,,Afghanistan,,AF,AFG,34.87,71.15,26023100.0,652230.0,Southern and Central Asia,Asia
4,41515,Asadabad,Asadabad,2021-01-06,Winter,4.8,1.6,8.6,0.6,,116.0,2.4,,1018.0,,Afghanistan,,AF,AFG,34.87,71.15,26023100.0,652230.0,Southern and Central Asia,Asia


##### Comparing `city_x` and `city_y` Columns

In [122]:
# Utilizing a boolean mask to compare the `city_x` and `city_y` columns

# Create a boolean mask
mask = weather_city_country['city_x'] != weather_city_country['city_y']

# Use the boolean mask to filter the dataframe
differences = weather_city_country[mask]


In [124]:
differences

Unnamed: 0,station_id,city_x,city_y,date,season,avg_temp_c,min_temp_c,max_temp_c,precipitation_mm,snow_depth_mm,avg_wind_dir_deg,avg_wind_speed_kmh,peak_wind_gust_kmh,avg_sea_level_pres_hpa,sunshine_total_min,country,state,iso2,iso3,latitude,longitude,population,area,region,continent
52831,91765,Pago Pago,Apia,1945-08-02,Winter,26.60,24.60,28.50,,,91.00,24.20,,1013.40,,Samoa,,WS,WSM,-13.83,-171.77,187820.00,2842.00,Polynesia,Oceania
52833,91765,Pago Pago,Apia,1945-08-05,Winter,25.90,24.60,26.80,,,135.00,37.80,,1012.40,,Samoa,,WS,WSM,-13.83,-171.77,187820.00,2842.00,Polynesia,Oceania
52835,91765,Pago Pago,Apia,1945-08-09,Winter,27.10,25.20,29.10,,,83.00,22.40,,1012.80,,Samoa,,WS,WSM,-13.83,-171.77,187820.00,2842.00,Polynesia,Oceania
52837,91765,Pago Pago,Apia,1945-08-10,Winter,25.40,22.40,29.60,,,31.00,10.10,,1012.10,,Samoa,,WS,WSM,-13.83,-171.77,187820.00,2842.00,Polynesia,Oceania
52839,91765,Pago Pago,Apia,1945-08-11,Winter,25.40,21.30,29.60,,,16.00,7.90,,1013.10,,Samoa,,WS,WSM,-13.83,-171.77,187820.00,2842.00,Polynesia,Oceania
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4171628,41140,Sana'a,Jizan,2023-08-24,Summer,34.00,31.00,37.00,0.00,,314.00,14.20,,1003.80,,Saudi Arabia,,SA,SAU,16.91,42.56,30770375.00,2149690.00,Middle East,Asia
4171630,41140,Sana'a,Jizan,2023-08-25,Summer,35.40,33.00,39.00,0.00,,225.00,11.20,,1002.60,,Saudi Arabia,,SA,SAU,16.91,42.56,30770375.00,2149690.00,Middle East,Asia
4171632,41140,Sana'a,Jizan,2023-08-26,Summer,35.30,32.00,39.00,0.00,,318.00,12.80,,1002.50,,Saudi Arabia,,SA,SAU,16.91,42.56,30770375.00,2149690.00,Middle East,Asia
4171634,41140,Sana'a,Jizan,2023-08-27,Summer,33.90,30.10,37.20,0.00,,309.00,15.30,,1003.10,,Saudi Arabia,,SA,SAU,16.91,42.56,30770375.00,2149690.00,Middle East,Asia


In [125]:
differences.isnull().sum()

station_id                     0
city_x                      2295
city_y                         0
date                           0
season                         0
avg_temp_c                     4
min_temp_c                     0
max_temp_c                     0
precipitation_mm           40914
snow_depth_mm             122241
avg_wind_dir_deg           35092
avg_wind_speed_kmh          2779
peak_wind_gust_kmh        113995
avg_sea_level_pres_hpa         0
sunshine_total_min        131353
country                        0
state                          0
iso2                           0
iso3                           0
latitude                       0
longitude                      0
population                     0
area                           0
region                         0
continent                      0
dtype: int64
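The null counts above show that 2,295 of the differing rows have no `city_x` at all, because a missing value never compares equal to anything. The remaining rows are genuine name conflicts (for example, Pago Pago vs. Apia). A small sketch, using made-up rows, of how the two cases could be separated:

```python
import numpy as np
import pandas as pd

# Toy rows: one match, one missing city_x, one conflicting name (illustrative)
df = pd.DataFrame({
    "city_x": ["Asadabad", np.nan, "Pago Pago"],
    "city_y": ["Asadabad", "Example City", "Apia"],
})

# Same mask as in the notebook: NaN != anything evaluates to True
mask = df["city_x"] != df["city_y"]

# Split the differences into "missing in city_x" and "names disagree"
missing = mask & df["city_x"].isna()
conflict = mask & df["city_x"].notna()

print(int(missing.sum()), int(conflict.sum()))
```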

After reviewing and comparing the `city_x` (city column from weather data) and `city_y` (city column from cities.csv dataset) columns, it appears that the `city_y` column is more complete than the `city_x` column. Therefore, we will keep the `city_y` column and drop the `city_x` column.

In [126]:
# Drop the `city_x` column
weather_city_country.drop(columns=['city_x'], inplace=True)

# Rename the `city_y` column to `city`
weather_city_country.rename(columns={'city_y': 'city'}, inplace=True)

# Confirm the changes
weather_city_country.head()

Unnamed: 0,station_id,city,date,season,avg_temp_c,min_temp_c,max_temp_c,precipitation_mm,snow_depth_mm,avg_wind_dir_deg,avg_wind_speed_kmh,peak_wind_gust_kmh,avg_sea_level_pres_hpa,sunshine_total_min,country,state,iso2,iso3,latitude,longitude,population,area,region,continent
0,41515,Asadabad,2021-01-02,Winter,5.6,1.6,9.7,0.0,,81.0,5.3,,1026.4,,Afghanistan,,AF,AFG,34.87,71.15,26023100.0,652230.0,Southern and Central Asia,Asia
1,41515,Asadabad,2021-01-03,Winter,5.3,1.7,9.7,0.0,,69.0,3.8,,1023.3,,Afghanistan,,AF,AFG,34.87,71.15,26023100.0,652230.0,Southern and Central Asia,Asia
2,41515,Asadabad,2021-01-04,Winter,4.5,1.5,7.5,0.0,,60.0,1.3,,1024.3,,Afghanistan,,AF,AFG,34.87,71.15,26023100.0,652230.0,Southern and Central Asia,Asia
3,41515,Asadabad,2021-01-05,Winter,4.9,1.7,7.5,0.0,,75.0,2.0,,1020.1,,Afghanistan,,AF,AFG,34.87,71.15,26023100.0,652230.0,Southern and Central Asia,Asia
4,41515,Asadabad,2021-01-06,Winter,4.8,1.6,8.6,0.6,,116.0,2.4,,1018.0,,Afghanistan,,AF,AFG,34.87,71.15,26023100.0,652230.0,Southern and Central Asia,Asia


In [129]:
weather_city_country.shape

(4179075, 24)

#### 2.4.3.7 Data Type Conversion

Convert columns to the appropriate data type (float, integer, string, datetime, etc.).

All data types are appropriate for the columns.

In [130]:
weather_city_country.dtypes

station_id                        object
city                              object
date                      datetime64[ns]
season                          category
avg_temp_c                       float64
min_temp_c                       float64
max_temp_c                       float64
precipitation_mm                 float64
snow_depth_mm                    float64
avg_wind_dir_deg                 float64
avg_wind_speed_kmh               float64
peak_wind_gust_kmh               float64
avg_sea_level_pres_hpa           float64
sunshine_total_min               float64
country                           object
state                             object
iso2                              object
iso3                              object
latitude                         float64
longitude                        float64
population                       float64
area                             float64
region                            object
continent                         object
dtype: object

#### 2.4.3.8 Handling Categorical Variables

Label encode or one-hot encode categorical variables as needed.

Given that the seasons are already encoded as categorical variables, we will not encode them further. Moreover, since we are interested in analyzing the number of migraines occurring in relation to sea-level pressure in specific geographical areas, keeping the geographic locations as categorical variables will make it easier to slice and dice the data. This will allow filtering or grouping of the data based on these geographical identifiers to derive more localized insights.

In [131]:
weather_city_country['season'].unique()

['Winter', 'Spring', 'Summer', 'Autumn']
Categories (4, object): ['Autumn', 'Spring', 'Summer', 'Winter']

#### 2.4.3.9 Outliers Detection and Treatment

Use graphical methods like boxplots or use IQR to detect outliers.
Decide on a treatment method - either remove them or cap them.

In the context of this dataset, which includes a diverse range of weather variables and geographical data, the notion of an "outlier" is nuanced. Here are some key considerations:

**Weather Variables**
- **Temperature, Precipitation, Snow Depth, Wind Speed, and Direction**: Extreme values in these categories could very well be accurate representations of rare weather events.

**Geographical Variables**
- **Population and Area**: These variables are expected to vary significantly due to the natural diversity in the size and population density of different regions.

**Why Outliers Were Not Investigated Through Boxplots or IQR**

Given the nature of the dataset, extreme values do not automatically indicate inaccuracies or outliers that need to be adjusted or removed. In many cases, these "outliers" provide valuable insights into rare but significant weather events or conditions. As such, boxplots or Interquartile Range (IQR) methods, commonly used to identify outliers, were not employed in this analysis.

The decision to not treat these extreme values as outliers is backed by the understanding that they could be significant in the context of weather and geographical studies. Removing or adjusting these could result in a loss of important information.


#### 2.4.3.10 Secondary Review for Missing and Zero Values

Conduct a second review for missing values.
Decide on an imputation strategy for each column with missing values.

In [132]:
weather_city_country.isnull().sum()

station_id                      0
city                            0
date                            0
season                          0
avg_temp_c                   1084
min_temp_c                   7747
max_temp_c                   7745
precipitation_mm           603864
snow_depth_mm             2628017
avg_wind_dir_deg          1140696
avg_wind_speed_kmh         196299
peak_wind_gust_kmh        3167529
avg_sea_level_pres_hpa          0
sunshine_total_min        3381089
country                         0
state                           0
iso2                            0
iso3                            0
latitude                        0
longitude                       0
population                      0
area                            0
region                          0
continent                       0
dtype: int64

##### Handling Missing Values in weather_city_country DataFrame

**Temperature Columns (min_temp_c, max_temp_c, avg_temp_c)**
- **Missing Values**: `min_temp_c` has 7,747 missing values, `max_temp_c` has 7,745, and `avg_temp_c` has 1,084.
- **Data Size**: The DataFrame has 4,179,075 entries.
- **Data Variability**: Given the temporal and geographical variability in temperatures, traditional imputation methods like mean or median could be misleading.
- **Options**:
    - *Linear Interpolation*: Used for `min_temp_c` and `max_temp_c` as the data is time-ordered and missing values are not clustered.
    - *Average Calculation*: `avg_temp_c` was filled using the average of `min_temp_c` and `max_temp_c`, aligning with how the original data was computed.
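The temperature strategy described above can be sketched as follows. The sketch assumes that interpolation is done separately for each station (so values are not interpolated across two different locations) and that rows are sorted by date; neither detail is stated in the notebook, and the data is made up.

```python
import numpy as np
import pandas as pd

# Toy data for two stations (hypothetical values)
toy = pd.DataFrame({
    "station_id": ["A"] * 4 + ["B"] * 3,
    "date": pd.to_datetime([
        "2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04",
        "2021-01-01", "2021-01-02", "2021-01-03",
    ]),
    "min_temp_c": [1.0, np.nan, 3.0, 4.0, 10.0, np.nan, 14.0],
    "max_temp_c": [5.0, 7.0, np.nan, 11.0, 20.0, 22.0, 24.0],
    "avg_temp_c": [3.0, np.nan, 6.0, 7.5, 15.0, np.nan, 19.0],
})

# Interpolation relies on row order, so sort by station and date first
toy = toy.sort_values(["station_id", "date"])

# Linear interpolation within each station (assumption: per-station grouping)
for col in ["min_temp_c", "max_temp_c"]:
    toy[col] = toy.groupby("station_id")[col].transform(
        lambda s: s.interpolate(method="linear")
    )

# Fill missing averages with the midpoint of the (now complete) min and max
toy["avg_temp_c"] = toy["avg_temp_c"].fillna(
    (toy["min_temp_c"] + toy["max_temp_c"]) / 2
)
print(toy)
```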

**Precipitation and Snow Depth Columns (precipitation_mm, snow_depth_mm)**
- **Missing Values**: `precipitation_mm` has 603,864 missing values and `snow_depth_mm` has 2,628,017.
- **Data Size**: The DataFrame has 4,179,075 entries.
- **Data Variability**: Like temperature, these variables can also vary greatly.
- **Options**:
    - *Zero Filling*: Chosen as the most logical option, assuming that no recorded value implies no precipitation or snow.

**Wind Direction Column (avg_wind_dir_deg)**
- **Missing Values**: 1,140,696 missing values.
- **Data Size**: The DataFrame has 4,179,075 entries.
- **Data Variability**: Wind direction is cyclic (0 and 360 degrees are equivalent).
- **Options**:
    - *Mean Imputation*: Applied after converting the wind direction to radians, calculating the mean, and then converting back to degrees.
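Because wind direction is cyclic, taking the arithmetic mean of the angles (in degrees or radians) can give misleading results: the plain average of 350° and 10° is 180°, although both directions are close to north. One common approach is the circular mean, which averages the sine and cosine components and converts the result back to an angle. The sketch below is one possible implementation, not necessarily the one used in this project; the helper name `circular_mean_deg` is made up for illustration.

```python
import numpy as np
import pandas as pd

def circular_mean_deg(deg: pd.Series) -> float:
    """Circular mean of angles in degrees, ignoring missing values."""
    rad = np.deg2rad(deg.dropna().to_numpy())
    # Average the unit vectors, then recover the angle of the mean vector
    mean_angle = np.arctan2(np.sin(rad).mean(), np.cos(rad).mean())
    # Map the result from (-180, 180] back into [0, 360)
    return float(np.rad2deg(mean_angle) % 360)

# Toy series: two directions near north and one missing value
wind = pd.Series([350.0, 10.0, np.nan])

naive_mean = wind.mean()          # arithmetic mean, points south
fill = circular_mean_deg(wind)    # circular mean, points (almost exactly) north

wind_filled = wind.fillna(fill)
print(naive_mean, fill)
```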

**Wind Speed and Sunshine Columns (avg_wind_speed_kmh, peak_wind_gust_kmh, sunshine_total_min)**
- **Missing Values**: `avg_wind_speed_kmh` has 196,299 missing values, `peak_wind_gust_kmh` has 3,167,529, and `sunshine_total_min` has 3,381,089.
- **Data Size**: The DataFrame has 4,179,075 entries.
- **Data Variability**: These variables also vary significantly.
- **Options**:
    - *Linear Interpolation*: Used for these columns for the same reasons as the temperature columns.

After handling these missing values, additional data validation was conducted to ensure that the imputed values fall within expected ranges.

#### 2.4.3.11 Replacing Missing Values

Use techniques like mean imputation, median imputation, or more sophisticated methods like k-nearest neighbors or multiple imputations.

#### 2.4.3.12 Data Transformation

Normalize or standardize numerical columns if needed.
Log transformation for skewed data.
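A minimal sketch of both transformations on made-up values, assuming `np.log1p` is used for the log transform (so zeros remain valid) and a z-score for standardization. The new column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy values (illustrative only)
toy = pd.DataFrame({
    "precipitation_mm": [0.0, 0.6, 4.1, 120.0],
    "avg_sea_level_pres_hpa": [1026.4, 1023.3, 1018.0, 1003.8],
})

# log1p computes log(1 + x), which is defined for zero precipitation
toy["precipitation_log1p"] = np.log1p(toy["precipitation_mm"])

# Z-score standardization: subtract the mean, divide by the sample std
pres = toy["avg_sea_level_pres_hpa"]
toy["pres_z"] = (pres - pres.mean()) / pres.std()

print(toy)
```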

#### 2.4.3.13 Checking and Removing Duplicates

Use .duplicated() to check for duplicate rows and .drop_duplicates() to remove them.
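A short sketch of the duplicate check on made-up rows. Besides fully identical rows, it may also be worth checking for repeated (`station_id`, `date`) pairs, since each station should normally report one observation per day; that second check is a suggestion, not something the notebook states.

```python
import pandas as pd

# Toy data with one fully duplicated row (illustrative)
toy = pd.DataFrame({
    "station_id": ["A", "A", "B"],
    "date": pd.to_datetime(["2021-01-02", "2021-01-02", "2021-01-02"]),
    "avg_temp_c": [5.6, 5.6, 4.8],
})

# Count rows that are exact copies of an earlier row
n_dupes = int(toy.duplicated().sum())

# Count rows that repeat an earlier (station_id, date) pair
n_key_dupes = int(toy.duplicated(subset=["station_id", "date"]).sum())

# Keep the first occurrence of each duplicated row
deduped = toy.drop_duplicates()
print(n_dupes, n_key_dupes, len(deduped))
```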

#### 2.4.3.14 Summary for Data Cleaning Steps for `weather_city_country`

In this section, we have executed a comprehensive set of data cleaning actions to enhance the quality and usability of the `weather_city_country` DataFrame. Here's a summary of what was achieved:

1. **Data Consistency Check**: Conducted an initial review for consistency across all columns to ensure no anomalies or contradictions exist.
  
2. **Initial Review for Missing and Zero Values**: Identified columns with missing or zero values and marked them for further action.
  
3. **Drop Unnecessary Columns/Rows**: Kept all columns and dropped the rows with missing `avg_sea_level_pres_hpa`, since sea-level pressure is central to the analysis.
  
4. **Rename Columns**: Renamed columns to align with a consistent naming convention, enhancing the DataFrame's readability.
  
5. **Standardizing Text Data**: Standardized the text data in fields like country names and states to ensure uniformity across datasets.
  
6. **Merging with Other Datasets**: Merged the `weather_data` DataFrame with the `city_country` DataFrame on the `station_id` column.
  
7. **Data Type Conversion**: Confirmed that all columns already have appropriate data types.
  
8. **Handling Categorical Variables**: Left `season` and the geographic identifiers as categorical variables without further encoding, so the data can be filtered and grouped by them.
  
9. **Outliers Detection and Treatment**: Reviewed the role of extreme values and chose not to flag them with boxplots or IQR, since they can represent genuine rare weather events.
  
10. **Second Review for Missing Values**: Conducted a second review for missing values and selected an imputation strategy for each column with missing data.
  
11. **Replace Missing Values**: Applied various techniques to impute missing values, ranging from mean and median imputation to more advanced methods.
  
12. **Data Transformation**: Normalized or standardized numerical columns and applied log transformation to skewed data where necessary.
  
13. **Checking and Removing Duplicates**: Checked for duplicate rows and removed them to ensure data integrity.

These cleaning steps have ensured that the `weather_city_country` DataFrame is now in a state that is well-prepared for the subsequent stages of data integration, feature engineering, and modeling. The methodologies and strategies applied here will be similarly applied to the remaining DataFrame, `migraine_data`.

### 2.4.4 DataFrame: `migraine_data`

#### 2.4.4.1 Data Consistency Check

Check if the data is consistent across all columns, i.e., no anomalies or contradictions.
Use .describe() to obtain summary statistics and .info() to get an overview of the dataset.

#### 2.4.4.2 Initial Review for Missing and Zero Values

Conduct an initial review for missing values using .isna().sum() and for zero values.

#### 2.4.4.3 Drop Unnecessary Columns/Rows

Drop columns or rows that are not needed for the analysis based on the project's scope.

#### 2.4.4.4 Rename Columns

Rename columns to have meaningful names and to follow a consistent naming convention.

#### 2.4.4.5 Standardizing Text Data

Standardize country names, state names, and other text-based fields to ensure uniformity.
Use .str.lower() or .str.upper() to standardize text.

#### 2.4.4.6 Merging with Other Datasets

Merge/join the city_data DataFrame with relevant datasets like country_data.
Make sure to do this after ensuring that the key columns (like country names, city names, etc.) are standardized.

#### 2.4.4.7 Data Type Conversion

Convert columns to the appropriate data type (float, integer, string, datetime, etc.).

#### 2.4.4.8 Handling Categorical Variables

Label encode or one-hot encode categorical variables as needed.

#### 2.4.4.9 Outliers Detection and Treatment

Use graphical methods like boxplots or use IQR to detect outliers.
Decide on a treatment method - either remove them or cap them.

#### 2.4.4.10 Secondary Review for Missing and Zero Values

Conduct a second review for missing values.
Decide on an imputation strategy for each column with missing values.

#### 2.4.4.11 Replacing Missing Values

Use techniques like mean imputation, median imputation, or more sophisticated methods like k-nearest neighbors or multiple imputations.

#### 2.4.4.12 Data Transformation

Normalize or standardize numerical columns if needed.
Log transformation for skewed data.

#### 2.4.4.13 Checking and Removing Duplicates

Use .duplicated() to check for duplicate rows and .drop_duplicates() to remove them.

In [None]:
# View the migraine data
migraine_data

*Keeping* all columns:
- 'measure'
- 'location'
- 'sex'
- 'age'
- 'cause'
- 'metric'
- 'year'
- 'val'
- 'upper'
- 'lower'

In [None]:
# Check the unique values of the `measure` column
migraine_data['measure'].unique()

In [None]:
# Check the unique values of the `metric` column
migraine_data['metric'].unique()

For this analysis, we will only be looking at the prevalence (total number of cases in the population) of headache disorders (i.e., migraine and tension-type headache). The check above confirms that no other measures are present in our dataset. We will remove all percent and rate values, as we are only interested in the total number of cases.

In [None]:
# Filter the data to only include the number of headache and migraine cases

# Identify the indices of rows whose 'metric' is 'Percent' or 'Rate'
metric_indices_to_drop = migraine_data[migraine_data['metric'].isin(['Percent', 'Rate'])].index

# Drop rows
filtered_migraine_data = migraine_data.drop(metric_indices_to_drop)

# Display floats (including val, upper, and lower) with two decimal places; this is a global display option
pd.set_option('display.float_format', lambda x:'%.2f' % x)

filtered_migraine_data

### 4.2 Review and Plan for Missing/Zero Values

#### 4.2.3 DataFrame: `weather_data`

In [None]:
# Check for missing values
print("\nWeather Missing Values:\n")
print(weather_data.isnull().sum())

# Calculate zero counts for each column
print("\nWeather Zero Counts:\n")
zero_counts = (weather_data == 0).sum()
print(zero_counts)

Plan: Merge with `city_data` and `country_data` to fill in missing values. Then recheck missing and zero counts after merging.

#### 4.2.4 DataFrame: `filtered_migraine_data`

In [None]:
# Check for missing values
print("\nFiltered Migraine Missing Values:\n")
print(filtered_migraine_data.isnull().sum())

# Check for zero values
print("\nFiltered Migraine Zero Counts:\n")
zero_counts = (filtered_migraine_data == 0).sum()
print(zero_counts)

No missing values in this dataset.

Plan: Merge with combined and filtered weather data.

We need to address the zero values in the `val`, `upper`, and `lower` columns. We will investigate the distribution of these values to determine the best method for handling them.

In [None]:
# Print rows with zero values
zero_rows_any = filtered_migraine_data[(filtered_migraine_data == 0).any(axis=1)]
print(zero_rows_any)

In [None]:
# Check the counts of the population's age groups
filtered_migraine_data['age'].value_counts()

The number of zero values in the `val`, `upper`, and `lower` columns is 6,100. After further investigation, there are also 6,100 rows where the age of the population is <5 years old. Since this is a perfectly reasonable explanation for the zero values, we will remove these rows from the dataset and this analysis.
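Matching counts alone do not prove that the zero rows and the `<5 years` rows are the same rows. A quick way to confirm it is to check both directions explicitly, as in this sketch on made-up data (the age labels other than `<5 years` are hypothetical):

```python
import pandas as pd

# Toy stand-in for filtered_migraine_data (illustrative values)
toy = pd.DataFrame({
    "age": ["<5 years", "<5 years", "5-9 years", "10-14 years"],
    "val": [0.0, 0.0, 120.5, 340.2],
    "upper": [0.0, 0.0, 150.0, 400.0],
    "lower": [0.0, 0.0, 100.0, 300.0],
})

# Rows where val, upper, and lower are all zero
zero_mask = (toy[["val", "upper", "lower"]] == 0).all(axis=1)
under5_mask = toy["age"] == "<5 years"

# Every zero row belongs to the <5 years group ...
zeros_only_under5 = bool(under5_mask[zero_mask].all())
# ... and every <5 years row is a zero row
under5_all_zero = bool(zero_mask[under5_mask].all())

print(zeros_only_under5, under5_all_zero)
```

If both checks return `True`, dropping the `<5 years` rows removes exactly the zero-valued rows and nothing else.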

In [None]:
# Drop rows that meet the condition
filtered_migraine_data.drop(
    filtered_migraine_data.query("`age` == '<5 years' and `val` == 0").index, 
    inplace=True
)

In [None]:
zero_rows_any = filtered_migraine_data[(filtered_migraine_data == 0).any(axis=1)]
print(zero_rows_any)

In [None]:
filtered_migraine_data['age'].value_counts()

In [None]:
filtered_migraine_data.shape

In [None]:
filtered_migraine_data

### 4.3 Standardize Country and State Names Across Datasets

#### 4.3.1 DataFrame: `city_data`

##### 4.3.1.1 Country Names

In [None]:
# Import the function
from data_location_matcher import find_matching_and_non_matching

# Find matching and non-matching countries
city_data_matching_countries, city_data_non_matching_countries = find_matching_and_non_matching(city_data, 'country')

# View the non-matching countries
city_data_non_matching_countries

In [None]:
city_data_country_replacement_dict = { 
    'Guinea Bissau': 'Guinea-Bissau',
    'Korea, North': 'North Korea',
    'Korea, South': 'South Korea',
    'Macau S.A.R': 'Macau',
    'Svalbard and Jan Mayen Islands': 'Svalbard and Jan Mayen',
    'São Tomé and Príncipe': 'Sao Tome and Principe',
    'The Bahamas': 'Bahamas',
    'The Gambia': 'Gambia',
    'United States': 'United States of America'
}

# Replace the country names in the city dataframe
# (assign the result back; inplace=True on a selected column may not update city_data)
city_data['country'] = city_data['country'].replace(city_data_country_replacement_dict)

city_data['country'].unique()

##### 4.3.1.2 State Names

In [None]:
# Find matching and non-matching states
city_data_matching_states, city_data_non_matching_states = find_matching_and_non_matching(city_data, 'state')

# View the non-matching states
city_data_non_matching_states

#### 4.3.3 DataFrame: `migraine_data`

##### 4.3.3.1 Country Names

In [None]:
filtered_migraine_data.head()

In [None]:
from data_location_matcher import COUNTRIES, US_STATES

migraine_data_countries_and_states = filtered_migraine_data['location'].unique()

migraine_data_countries_and_states

In [None]:
migraine_data_location_replacement_dict = {
    'Taiwan (Province of China)': 'Taiwan',
    'Viet Nam': 'Vietnam',
    "Democratic People's Republic of Korea": 'North Korea',
    "Lao People's Democratic Republic": 'Laos',
    'Democratic Republic of the Congo': 'Congo (Kinshasa)',
    'Micronesia (Federated States of)': 'Micronesia',
    'North Macedonia': 'Macedonia',
    'Brunei Darussalam': 'Brunei',
    'Republic of Korea': 'South Korea',
    'Bolivia (Plurinational State of)': 'Bolivia',
    'Venezuela (Bolivarian Republic of)': 'Venezuela',
    'Iran (Islamic Republic of)': 'Iran',
    'United Republic of Tanzania': 'Tanzania',    
    'Republic of the Congo': 'Congo (Brazzaville)',
    'Republic of Moldova': 'Moldova',
    'Korea, North': 'North Korea',
    'Korea, South': 'South Korea',
    'São Tomé and Príncipe': 'Sao Tome and Principe', 
    'The Bahamas': 'Bahamas',
    'The Gambia': 'Gambia',
    'United States': 'United States of America'
}

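# Replace the location names in the filtered migraine dataframe
# (assign the result back; inplace=True on a selected column may not update the dataframe)
filtered_migraine_data['location'] = filtered_migraine_data['location'].replace(migraine_data_location_replacement_dict)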

filtered_migraine_data['location'].unique()

In [None]:
# Convert the unique locations and the lists of U.S. states and countries to sets
migraine_data_countries_and_states = filtered_migraine_data["location"].unique()

set_migraine_countries_and_states = set(migraine_data_countries_and_states)
set_US_states = set(US_STATES)
set_countries = set(COUNTRIES)

# Create a new list excluding the U.S. states
migrained_filtered_countries_list = [
    item for item in set_migraine_countries_and_states if item not in set_US_states
]

# View the list
migrained_filtered_countries_list

##### 4.3.3.2 State Names

In [None]:
# Convert the list of U.S. states to sets
set_US_states = set(US_STATES)

# Create a new list including the U.S. states
migrained_filtered_states_list = [item for item in set_migraine_countries_and_states if item in set_US_states]

# View the list
migrained_filtered_states_list

## 5. Data Integration

### 5.1 Overview

Briefly introduce the goal of data integration in the context of this project. Provide a high-level view of the datasets that will be integrated.

### 5.2 Data Sources

#### 5.2.1 Weather Data

The weather data provides context regarding sea level pressure, sunshine, temperature, and precipitation for each city. This data is relevant because it provides information about the weather conditions that may be associated with migraine prevalence. The country data will be combined with the city data to provide additional information about each city, such as the country, region, and continent. The combined city and country data will then be combined with the weather data to provide additional information.

The daily weather data source file is quite large and is provided in a .parquet format for low memory consumption and data type preservation. 

##### 5.2.1.1 Cities DataFrame

| Attribute            | Description                                        |
|----------------------|----------------------------------------------------|
| **Data Source Name** | cities.csv                                         |
| **Data Source Format** | CSV (comma-separated values)                       |
| **Data Source Desc** | Individual cities and weather stations around the world |
| **Data Source Size** | 84.1 KB                                             |
|                      | 1,245 rows                                          |
|                      | 8 columns                                           |
| **Data Source Limits** | None                                              |
| **Data Source Usability** | 10.00                                          |

**Data Source Columns**

| Column Name  | Description                               |
|--------------|-------------------------------------------|
| `station_id` | Unique ID for the weather station.        |
| `city_name`  | Name of the city.                         |
| `country`    | The country where the city is located.    |
| `state`      | The state or province within the country. |
| `iso2`       | The two-letter country code.              |
| `iso3`       | The three-letter country code.            |
| `latitude`   | Latitude coordinate of the city.          |
| `longitude`  | Longitude coordinate of the city.         |


##### 5.2.1.2 Countries DataFrame

| Attribute            | Description                                        |
|----------------------|----------------------------------------------------|
| **Data Source Name** | countries.csv                                         |
| **Data Source Format** | CSV (comma-separated values)                       |
| **Data Source Desc** | Individual country geographic and demographic characteristics |
| **Data Source Size** | 20.6 KB                        |
|                      | 214 rows                                    |
|                      | 11 columns                                       |
| **Data Source Limits** | None                                              |
| **Data Source Usability** | 10.00                                           |

**Data Source Columns**

| Column Name  | Description                                               |
|--------------|-----------------------------------------------------------|
| `iso3`       | The three-letter code representing the country.           |
| `country`    | The English name of the country.                          |
| `native_name`| The native name of the country.                           |
| `iso2`       | The two-letter code representing the country.             |
| `population` | The population of the country.                            |
| `area`       | The total land area of the country in square kilometers.  |
| `capital`    | The name of the capital city.                             |
| `capital_lat`| The latitude coordinate of the capital city.              |
| `capital_lng`| The longitude coordinate of the capital city.             |
| `region`     | The specific region within the continent where the country is located. |
| `continent`  | The continent to which the country belongs.               |

##### 5.2.1.3 Daily Weather DataFrame

| Attribute            | Description                                        |
|----------------------|----------------------------------------------------|
| **Data Source Name** | daily_weather.parquet                              |
| **Data Source Format** | .parquet (compressed, maintains original data types, efficient)|
| **Data Source Desc** | Daily weather data                            |
| **Data Source Size** | 233 MB                                        |
|                      | 27,635,763 rows                               |
|                      | 14 columns                                    |
| **Data Source Limits** | None                                              |
| **Data Source Usability** | 10.00                                           |

**Data Source Columns**

| Column Name            | Description                                       |
|------------------------|---------------------------------------------------|
| `station_id`           | Unique ID for the weather station.                |
| `city_name`            | Name of the city where the station is located.    |
| `date`                 | Date of the weather record.                       |
| `season`               | Season corresponding to the date (e.g., summer, winter).|
| `avg_temp_c`           | Average temperature in Celsius.                   |
| `min_temp_c`           | Minimum temperature in Celsius.                   |
| `max_temp_c`           | Maximum temperature in Celsius.                   |
| `precipitation_mm`     | Precipitation in millimeters.                     |
| `snow_depth_mm`        | Snow depth in millimeters.                        |
| `avg_wind_dir_deg`     | Average wind direction in degrees.                |
| `avg_wind_speed_kmh`   | Average wind speed in kilometers per hour.        |
| `peak_wind_gust_kmh`   | Peak wind gust in kilometers per hour.            |
| `avg_sea_level_pres_hpa`| Average sea-level pressure in hectopascals.      |
| `sunshine_total_min`   | Total sunshine duration in minutes.               |

#### 5.2.2 Migraine Data

The migraine data provides information about the prevalence of migraine in different countries. This data is relevant because it provides information about the prevalence of migraine by gender, age, year, and location. This data will be combined with the weather data to determine if there is a relationship between weather and migraine prevalence.

| Attribute            | Description                                        |
|----------------------|----------------------------------------------------|
| **Data Source Name** | IHME-GBD_2019_DATA-2c1d3941-1.csv                  |
|                      | IHME-GBD_2019_DATA-2c1d3941-2.csv                  |
|                      | IHME-GBD_2019_DATA-2c1d3941-3.csv                  |
| **Data Source Format** | CSV (comma-separated values)                     |
| **Data Source Desc** | All GBD causes, risks, impairments, etiologies, and injuries by nature |
| **Data Source Size** | 158 MB                                              |
|                      | 1,377,000 rows                                      |
|                      | 10 columns                                          |
| **Data Source Limits** | None                                              |
| **Data Source Usability** | 10.00                                          |

**Data Source Columns**

| Column Name    | Description                                          |
|----------------|------------------------------------------------------|
| `measure` | The name of the measure.                                  |
| `location`| The name of each location.                                |
| `sex`     | The name of each sex choice.                              |
| `age`     | The name of each age group.                               |
| `cause`   | The name of each cause.                                   |
| `metric`  | The name of each metric/unit.                             |
| `year`    | The annual results for all measures.                      |
| `val`     | The value of each metric/unit.                            |
| `upper`   | The 95% Uncertainty Interval - Upper Bound value.         |
| `lower`   | The 95% Uncertainty Interval - Lower Bound value.         |

### 5.3 Preliminary Steps

**Overview**

In this section, the focus is on preparing the dataset for further analysis and exploration. The steps include merging multiple data sources, filtering the data based on specific criteria, cleaning the data by dropping unnecessary columns and rows, and conducting a preliminary analysis through correlation metrics. Each of these steps is essential for ensuring the data's integrity, usability, and relevance to the study objectives.

**5.3.1 Data Merging**

The first step involves merging the city and country datasets using a common identifier. This integration provides a comprehensive view that combines geographical and political attributes. Following that, the weather dataset is integrated with the already combined city-country data. The resulting dataset offers a rich context, incorporating both geographical information and meteorological variables.

**5.3.2 Data Filtering**

The dataset is filtered to only include records pertaining to US cities, thereby narrowing the scope for more targeted analysis. Further filtering is done to include only specific years, enhancing the dataset's relevance to the study period.

**5.3.3 Data Cleaning**

Columns that do not contribute to the analysis or contain redundant information are dropped to simplify the dataset. Rows with missing or irrelevant data are removed to improve the dataset's quality and consistency. Duplicate rows, if any, are identified and removed to ensure each record in the dataset is unique.

**5.3.4 Preliminary Analysis**

A correlation analysis is conducted on specific weather attributes like temperature, precipitation, and wind speed to identify any significant relationships among them.

---

Throughout these steps, the data are continuously inspected to understand their structures, types, and quality. Various data profiling techniques are employed, such as examining data distributions, checking for missing values, and assessing data types, to ensure that the dataset meets the quality and integrity requirements for downstream analysis.

#### 5.3.1 Data Merging

##### 5.3.1.1 Merge `city_data` and `country_data`

Join the countries and cities tables on the `country`, `iso2`, and `iso3` columns to give more context to the weather data.

In [None]:
# Join cities with countries on their shared key columns
city_country = city_data.merge(
    country_data,
    how='left',
    on=['country', 'iso2', 'iso3'],
)

# Review the new dataframe
city_country
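
A left merge can silently duplicate rows or leave rows unmatched. The sketch below, on toy stand-ins for the city and country tables, shows one way to guard against both using the `validate` and `indicator` parameters of `DataFrame.merge`. The sample values, including the placeholder code `XXX`, are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for city_data and country_data (values are illustrative)
cities = pd.DataFrame({
    "city_name": ["Austin", "Lyon", "Atlantis"],
    "iso3": ["USA", "FRA", "XXX"],
})
countries = pd.DataFrame({
    "iso3": ["USA", "FRA"],
    "continent": ["North America", "Europe"],
})

# validate raises MergeError if a country key is duplicated on the right side;
# indicator adds a _merge column showing which rows found a match
merged = cities.merge(
    countries, how="left", on="iso3", validate="many_to_one", indicator=True
)

unmatched = merged.loc[merged["_merge"] == "left_only", "city_name"].tolist()
print(len(merged), unmatched)
```

The row count stays equal to the number of cities, and any city without a matching country is listed explicitly instead of only surfacing later as null values.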

##### 5.3.1.2 Merge `weather_data` and `city_country`

Join the weather data with the combined countries and cities tables on the `station_id` and `city_name` columns.

In [None]:
# Review the shape of the weather dataframe
print(f"Weather Data: {weather_data.shape}")

# Review the shape of the city-country dataframe
print(f"City-Country Data: {city_country.shape}")

In [None]:
weather_data.head(1)

In [None]:
# Combine city/country with daily weather data
combined_weather = weather_data.merge(
    city_country,
    how='left',
    on=['station_id', 'city_name'],
)

# Review the shape of the new dataframe
print(f"Combined Weather Data: {combined_weather.shape}")

In [None]:
combined_weather.head(1)

In [None]:
combined_weather['country'].unique()

In [None]:
# View the states where country is 'United States of America'
combined_weather[combined_weather['country'] == 'United States of America']['state'].unique()

In [None]:
combined_weather.isnull().sum()

##### 5.3.1.3 Add `year` column to `combined_weather` and Filter to 1990-2019

Given that the migraine data is annual, we need to add a 'year' column to the weather data and filter it by year.

In [None]:
# Confirm that the date column is in datetime format
combined_weather.dtypes

In [None]:
# Make a copy of the dataframe
combined_weather = combined_weather.copy()

# Create a new column for the year
combined_weather['year'] = combined_weather['date'].dt.year

combined_weather

In [None]:
combined_weather['year'].unique()

In [None]:
combined_weather['year'].describe()

Earliest year is 1750, latest year is 2023. We will filter the weather data to only include years 1990-2019 to match the migraine data's date range.

In [None]:
combined_weather.shape

In [None]:
combined_weather.isnull().sum()

In [None]:
# Filter the data to only include the years 1990-2019
year_filter = (combined_weather['year'] >= 1990) & (combined_weather['year'] <= 2019)
combined_weather = combined_weather[year_filter]

# Review the shape of the new dataframe
combined_weather.shape
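
The year extraction and range filter can be sketched on a few made-up dates. `Series.between` is used here as an alternative to the pair of comparisons in the cell above; it is inclusive on both ends by default. The toy values are assumptions for illustration.

```python
import pandas as pd

# Toy daily records straddling the 1990-2019 window
toy = pd.DataFrame({
    "date": pd.to_datetime(["1989-12-31", "1990-01-01", "2019-12-31", "2020-01-01"]),
    "avg_temp_c": [1.0, 2.0, 3.0, 4.0],
})

# Derive the year from the datetime column
toy["year"] = toy["date"].dt.year

# between() is inclusive on both ends by default
in_range = toy[toy["year"].between(1990, 2019)].copy()

print(in_range["year"].tolist())
```

Taking a `.copy()` of the filtered frame keeps later column assignments from triggering chained-assignment warnings.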

In [None]:
combined_weather['year'].unique()

In [None]:
combined_weather.isnull().sum()

In [None]:
combined_weather.shape

In [None]:
combined_weather.head(1)

In [None]:
# Group by 'year', 'country', and 'state', then aggregate the numerical columns
annual_weather_by_stateCountry = (
    combined_weather.groupby(["year", "country", "state"])
    .agg(
        {
            "avg_temp_c": "mean",
            "min_temp_c": "mean",
            "max_temp_c": "mean",
            "precipitation_mm": "sum",
            "snow_depth_mm": "sum",
            "avg_wind_dir_deg": "mean",
            "avg_wind_speed_kmh": "mean",
            "peak_wind_gust_kmh": "mean",
            "avg_sea_level_pres_hpa": "mean",
            "sunshine_total_min": "mean",
            "population": "mean",
            "area": "mean",
            "latitude": "first",  # Assuming all latitudes are the same for a given year and state
            "longitude": "first",  # Assuming all longitudes are the same for a given year and state
        }
    )
    .reset_index()
)

# Review the shape of the new dataframe
annual_weather_by_stateCountry.shape

In [None]:
annual_weather_by_stateCountry

In [None]:
annual_weather_by_stateCountry.isnull().sum()
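
One behaviour worth keeping in mind for a multi-key aggregation like this: by default, `groupby` drops any group whose key contains a missing value. The sketch below uses toy data and assumes that some rows have no `state`. Whether that holds for the real data is for the null counts above to confirm.

```python
import numpy as np
import pandas as pd

# Toy data: the France rows have no state (an assumption for illustration)
toy = pd.DataFrame({
    "year": [2000, 2000, 2000],
    "country": ["United States of America", "France", "France"],
    "state": ["Texas", np.nan, np.nan],
    "precipitation_mm": [10.0, 5.0, 7.0],
})

keys = ["year", "country", "state"]

# Default behaviour: groups whose key contains NaN are silently dropped
default_groups = toy.groupby(keys)["precipitation_mm"].sum().reset_index()

# dropna=False keeps them as their own group
all_groups = toy.groupby(keys, dropna=False)["precipitation_mm"].sum().reset_index()

print(len(default_groups), len(all_groups))
```

If country-level rows need to survive the aggregation, passing `dropna=False` or filling the missing key with a sentinel beforehand are two possible options.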

##### 5.3.1.4 Merge `filtered_migraine_data` and `annual_weather_by_stateCountry`

Before we can merge the migraine data with the weather data, we need to split the standardized `location` column into `country` and `state` columns. We have already confirmed that the `city_country` dataframe contains both a country and state column so we will use this dataframe to populate the `country` and `state` columns in the migraine data. We will then drop the `location` column from the migraine data. Finally, we will merge the migraine data with the weather data on the `country`, `state`, and `year` columns.

In [None]:
filtered_migraine_data.head(1)

In [None]:
filtered_migraine_data['location'].unique()

In [None]:
# import the assign_country function
from cleanup_location_migraine import assign_country

# Assign country to migraine data
filtered_migraine_data['country'] = filtered_migraine_data['location'].apply(assign_country, args=(city_country,))

filtered_migraine_data.head(1)

In [None]:
# Check state values in filtered_migraine_data dataframe, if not US state, then replace with 'None'
filtered_migraine_data["state"] = filtered_migraine_data["location"].apply(
    lambda x: x if x in US_STATES else "None"
)

filtered_migraine_data["state"].unique()
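
Because `state` becomes a merge key, the placeholder used for "no state" has to be the same in both tables. The sketch below illustrates this with toy frames. It assumes the weather side stores a missing state as `NaN`, which may or may not be true of the real data.

```python
import numpy as np
import pandas as pd

# Migraine-style rows: country-level records carry the string "None" as state
migraine = pd.DataFrame({
    "year": [2000, 2000],
    "country": ["France", "United States of America"],
    "state": ["None", "Texas"],
})

# Weather-style rows: assume country-level records have a missing state (NaN)
weather = pd.DataFrame({
    "year": [2000, 2000],
    "country": ["France", "United States of America"],
    "state": [np.nan, "Texas"],
    "avg_temp_c": [12.0, 19.5],
})

keys = ["year", "country", "state"]

# "None" (a string) does not equal NaN, so the France row finds no match
mismatched = migraine.merge(weather, how="left", on=keys)

# Using the same sentinel on both sides restores the match
aligned = migraine.merge(weather.fillna({"state": "None"}), how="left", on=keys)

print(mismatched["avg_temp_c"].isna().sum(), aligned["avg_temp_c"].isna().sum())
```

Comparing the null counts before and after aligning the sentinel gives a quick indication of whether unmatched rows come from the placeholder or from data that is actually missing.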

In [None]:
filtered_migraine_data.isnull().sum()

In [None]:
filtered_migraine_data.shape

In [None]:
filtered_migraine_data

In [None]:
annual_weather_by_stateCountry

In [None]:
# Combine filtered migraine data with the annual weather data
weather_migraine = filtered_migraine_data.merge(
    annual_weather_by_stateCountry,
    how="left",
    on=["year", "country", "state"],
)

# Review the shape of the new dataframe
print(f"Migraine Data and Combined Weather: {weather_migraine.shape}")

weather_migraine.head(1)

In [None]:
weather_migraine.isnull().sum()

#### 5.3.2 Data Filtering

##### 5.3.2.1 Filter `combined_weather` by US Cities

In [None]:
# Filter the combined weather data to only include the US
usa_weather = combined_weather[combined_weather['iso3'] == 'USA']

# Review the shape of the new dataframe
usa_weather.shape

In [None]:
# View 10 rows of the new dataframe
usa_weather.head(10)

In [None]:
# Check for missing values
usa_weather.isnull().sum()

In [None]:
# Check the unique values of the iso3 column, confirming no other countries are included
usa_weather['iso3'].unique()

#### 5.3.3 Data Cleaning (2nd Round)

##### 5.3.3.1 Drop Unnecessary Columns

After further review of the data, the `country` and `iso2` columns are no longer needed since we have filtered for iso3=USA, so we will drop them.

In [None]:
# List of columns to keep
columns_to_keep = [col for col in usa_weather.columns if col not in ['country', 'iso2']]

# Use .loc to select only the columns to keep
usa_weather = usa_weather.loc[:, columns_to_keep]

In [None]:
# Review the shape of the new dataframe
usa_weather.shape

##### 5.3.3.2 Drop Unnecessary Rows

In [None]:
usa_weather.isnull().sum()

In [None]:
# Drop rows with missing values in the temperature and precipitation columns
usa_weather = usa_weather.dropna(subset=['min_temp_c', 'max_temp_c', 'precipitation_mm'])

# Review the shape of the new dataframe
usa_weather.shape

In [None]:
# Check for missing values
usa_weather.isnull().sum()

##### 5.3.3.3 Drop Duplicate Rows

In [None]:
# Original DataFrame
usa_weather_row_count = len(usa_weather)

# DataFrame after dropping duplicates
usa_weather_deduplicated = usa_weather.drop_duplicates()
deduplicated_row_count = len(usa_weather_deduplicated)

# Calculate the number of rows that would be dropped
rows_to_be_dropped = usa_weather_row_count - deduplicated_row_count

# Print the difference
print(f"Rows to be dropped: {rows_to_be_dropped}")
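
The cell above only counts duplicates. As a small sketch on made-up rows, the same count can be read from `duplicated()`, and the deduplicated frame has to be assigned to a variable for the removal to take effect.

```python
import pandas as pd

# Toy frame with one fully repeated row
toy = pd.DataFrame({
    "station_id": ["A1", "A1", "B2"],
    "date": pd.to_datetime(["2000-01-01", "2000-01-01", "2000-01-01"]),
    "avg_temp_c": [5.0, 5.0, 7.0],
})

# duplicated() flags every repeat of an earlier, fully identical row
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() returns a new frame; assign it to keep the result
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduplicated))
```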

#### 5.3.4 Preliminary Analysis

##### 5.3.4.1 Correlation Analysis for Weather Features

In [None]:
weather_features = [
    'avg_temp_c', 'min_temp_c', 'max_temp_c', 'precipitation_mm', 'snow_depth_mm',
    'avg_wind_dir_deg', 'avg_wind_speed_kmh', 'peak_wind_gust_kmh',
    'avg_sea_level_pres_hpa', 'sunshine_total_min',
]
correlation_matrix_usa_weather = usa_weather[weather_features].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix_usa_weather, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()

The `sunshine_total_min` column has a lot of missing values, has a very weak correlation (-0.0066) with `avg_sea_level_pres_hpa`, and is not a focal point of this analysis, so we will drop that column.
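
The two criteria mentioned here, the share of missing values and the pairwise correlation, can be checked as in the sketch below. The numbers are toy values. Note that `corr` only uses rows where both columns are present, so a sparsely populated column is correlated on a much smaller sample.

```python
import numpy as np
import pandas as pd

# Toy values: sunshine is missing in two of five rows
toy = pd.DataFrame({
    "avg_sea_level_pres_hpa": [1010.0, 1015.0, 1020.0, 1012.0, 1018.0],
    "sunshine_total_min": [np.nan, 300.0, np.nan, 120.0, 200.0],
})

# Share of missing values per column
missing_share = toy.isna().mean()

# corr() uses only rows where both columns are present (pairwise deletion)
pair_corr = toy["avg_sea_level_pres_hpa"].corr(toy["sunshine_total_min"])
rows_used = int(toy.dropna().shape[0])

print(missing_share["sunshine_total_min"], rows_used, round(pair_corr, 2))
```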

In [None]:
# Columns to keep
columns_to_keep = [col for col in usa_weather.columns if col not in ['sunshine_total_min']]

# Use .loc to select only the columns to keep
usa_weather = usa_weather.loc[:, columns_to_keep]

# Check for missing values
usa_weather.isnull().sum()

### 5.4 Handling Missing Values

#### 5.4.1 Non-pressure-related Columns

##### 5.4.1.1 Average Temperature Interpolation and Validation

For the `avg_temp_c` missing values, we will calculate the average of the `min_temp_c` and `max_temp_c` columns and use that value to fill in the missing average temperature values. A new, temporary column will be created called `avg_temp_c_interpolated` to hold these values.

In [None]:
# Create a copy of the dataframe
usa_weather = usa_weather.copy()

# Create a column for calculating the `avg_temp_c` using the `min_temp_c` and `max_temp_c` columns
usa_weather['avg_temp_c_interpolated'] = usa_weather['avg_temp_c'].combine_first(
    (usa_weather['min_temp_c'] + usa_weather['max_temp_c']) / 2
)

Utilize the mean absolute error (MAE) to determine the accuracy of the interpolated values.

In [None]:
# Import mean_absolute_error from sklearn.metrics
from sklearn.metrics import mean_absolute_error

# Filter out rows where either of the two columns is NaN
filtered_df = usa_weather.dropna(subset=['avg_temp_c', 'avg_temp_c_interpolated'])

# Calculate mean absolute error
mae = mean_absolute_error(filtered_df['avg_temp_c'], filtered_df['avg_temp_c_interpolated'])

# Print the mean absolute error
print(f"Mean Absolute Error: {mae}")


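As noted above, the MAE is 0.0. This result is expected by construction: `combine_first` keeps the original `avg_temp_c` wherever it exists, so the two columns are identical on every row included in the comparison, and the MAE alone does not show how closely the min/max midpoint tracks the recorded average. Further investigation found that the original `avg_temp_c` values are precisely the average of the `min_temp_c` and `max_temp_c` values, so filling the gaps with the midpoint loses no information. We will therefore use the interpolated values to fill in the missing `avg_temp_c` values and then drop the `avg_temp_c_interpolated` column.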
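
An independent check compares the recorded averages directly against the min/max midpoint, without going through `combine_first`. The sketch below does this on toy values that were deliberately chosen to differ from the midpoint. It is an illustration of the check, not a result from the project data.

```python
import numpy as np
import pandas as pd

# Toy values; one recorded average is missing
toy = pd.DataFrame({
    "min_temp_c": [0.0, 10.0, 20.0, 5.0],
    "max_temp_c": [10.0, 20.0, 30.0, 15.0],
    "avg_temp_c": [5.5, 14.0, np.nan, 10.0],
})

# Midpoint estimate computed for every row, without looking at avg_temp_c
midpoint = (toy["min_temp_c"] + toy["max_temp_c"]) / 2

# Compare only where a recorded average exists
observed = toy["avg_temp_c"].notna()
mae = (toy.loc[observed, "avg_temp_c"] - midpoint[observed]).abs().mean()

print(round(mae, 3))
```

On the real data, an MAE of zero from this comparison would support the statement that the recorded averages equal the midpoint.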

Replace the missing values in the `avg_temp_c` column with the values from the `avg_temp_c_interpolated` column.

In [None]:
# Fill missing values in the 'avg_temp_c' column with the average of the 
# 'min_temp_c' and 'max_temp_c' columns
usa_weather['avg_temp_c'] = usa_weather['avg_temp_c_interpolated']

Drop the `avg_temp_c_interpolated` column and check for any remaining missing values.

In [None]:
# Drop the 'avg_temp_c_interpolated' column
usa_weather.drop(columns=['avg_temp_c_interpolated'], inplace=True)

# Check for missing values
usa_weather.isnull().sum()

In [None]:
# Print the shape of the dataframe
print(f"Original Shape: {usa_weather.shape}")

##### 5.4.1.2 Aggregate Weather Data by Year and State

The migraine data is aggregated at an annual level and broken down by state, so we need to aggregate the weather data to match. A mean aggregation will be used for all columns except for the `precipitation_mm` and `snow_depth_mm` columns, which will be aggregated using a sum.

1. **Group by Year, State, and City Name**: Use pandas' `groupby` method to group the data by the `year`, `state`, and `city_name` columns.
2. **Aggregation Functions**: 
    - For temperatures (`avg_temp_c`, `min_temp_c`, `max_temp_c`), the mean is calculated for each year and state.
    - For wind (`avg_wind_dir_deg`, `avg_wind_speed_kmh`, `peak_wind_gust_kmh`), the mean is calculated for each year and state.
    - For `precipitation_mm` and `snow_depth_mm`, the total sum is calculated for each year and state.
    - For `avg_sea_level_pres_hpa`, the mean is calculated, assuming it's relevant to have an annual mean sea level pressure for each state.
3. **Spatial Data**: For latitude and longitude, the first observed value for each year and state is taken, assuming that these values are consistent within each state and year.

By following this methodology, the daily weather data is transformed into an annual summary for each city within a state, which can then be joined to the annual, state-level migraine data on `year` and `state` for further analysis.

In [None]:
# Group by 'year', 'state', and 'city_name', then aggregate the numerical columns
annual_usa_weather_by_stateCity = usa_weather.groupby(['year', 'state', 'city_name']).agg({
    'avg_temp_c': 'mean',
    'min_temp_c': 'mean',
    'max_temp_c': 'mean',
    'precipitation_mm': 'sum',
    'snow_depth_mm': 'sum',
    'avg_wind_dir_deg': 'mean',
    'avg_wind_speed_kmh': 'mean',
    'peak_wind_gust_kmh': 'mean',
    'avg_sea_level_pres_hpa': 'mean',
    'latitude': 'first',  # Assuming all latitudes are the same for a given year, state, and city
    'longitude': 'first'  # Assuming all longitudes are the same for a given year, state, and city
}).reset_index()

# Review the shape of the new dataframe
annual_usa_weather_by_stateCity.shape

In [None]:
annual_usa_weather_by_stateCity.isnull().sum()

##### 5.4.1.3 Drop Wind Gust and Wind Direction Columns

In [None]:
annual_usa_weather_by_stateCity.drop(columns=['peak_wind_gust_kmh', 'avg_wind_dir_deg'], inplace=True)

annual_usa_weather_by_stateCity.shape

In [None]:
annual_usa_weather_by_stateCity.isnull().sum()

##### 5.4.1.4 Linear Interpolation for Average Wind Speed

In [None]:
# Handle missing values for the `avg_wind_speed_kmh` column utilizing linear interpolation
annual_usa_weather_by_stateCity['avg_wind_speed_kmh'] = annual_usa_weather_by_stateCity['avg_wind_speed_kmh'].interpolate(method='linear')

# Check for missing values
annual_usa_weather_by_stateCity.isnull().sum()
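
Column-wide interpolation fills each gap from the neighbouring rows, whichever city they belong to. Because the aggregated frame is ordered by its group keys, adjacent rows often come from different cities. The sketch below contrasts this with interpolating within each city, using toy data. The group-wise variant is offered as a possible alternative, not as the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data: city B is missing its first year
toy = pd.DataFrame({
    "city_name": ["A", "A", "A", "B", "B", "B"],
    "year": [2000, 2001, 2002, 2000, 2001, 2002],
    "avg_wind_speed_kmh": [10.0, 12.0, 14.0, np.nan, 32.0, 34.0],
})

# Column-wide interpolation borrows from city A to fill city B's first year
column_wide = toy["avg_wind_speed_kmh"].interpolate(method="linear")

# Group-wise interpolation only uses values from the same city
by_city = toy.groupby("city_name")["avg_wind_speed_kmh"].transform(
    lambda s: s.interpolate(method="linear")
)

print(column_wide.iloc[3], by_city.iloc[3])
```

In the group-wise version the leading gap for city B stays missing, which makes it visible that no same-city value precedes it.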

In [None]:
# Check the shape of the dataframe
annual_usa_weather_by_stateCity.shape

#### 5.4.2 Pressure-related Columns

Sea level pressure can vary greatly depending on the location of the city, and the main focus of this analysis is to see if there is any correlation between sudden changes in sea level pressure and migraines. As a result, we will not fill in missing values for the `avg_sea_level_pres_hpa` column without further research. We will work through four different scenarios to determine which seems most accurate for this situation.
- Scenario #1: Leave/drop missing values for the `avg_sea_level_pres_hpa` column
- Scenario #2: Utilize linear interpolation to fill in missing values for the `avg_sea_level_pres_hpa` column
- Scenario #3: Utilize forward fill to fill in missing values for the `avg_sea_level_pres_hpa` column
- Scenario #4: Utilize backward fill to fill in missing values for the `avg_sea_level_pres_hpa` column
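
To make the difference between the fill scenarios concrete, the sketch below applies linear interpolation, forward fill, and backward fill to a four-value toy series with a two-value gap. The pressure values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy pressure series with a gap of two values
pressure = pd.Series([1010.0, np.nan, np.nan, 1016.0])

# One column per scenario, side by side
filled = pd.DataFrame({
    "original": pressure,
    "linear": pressure.interpolate(method="linear"),
    "ffill": pressure.ffill(),
    "bfill": pressure.bfill(),
})

print(filled)
```

Linear interpolation spaces the gap evenly between the two known values (1012 and 1014), forward fill repeats 1010, and backward fill repeats 1016.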

##### 5.4.2.1 Leaving/Dropping Missing Values (Scenario #1)

In [None]:
# Descriptive statistics for the `avg_sea_level_pres_hpa` column
annual_usa_weather_by_stateCity['avg_sea_level_pres_hpa'].describe()

##### 5.4.2.2 Linear Interpolation (Scenario #2)

In [None]:
# Handle missing values for `avg_sea_level_pres_hpa` column utilizing linear interpolation
annual_usa_weather_by_stateCity['avg_sea_level_pres_hpa_linear'] = annual_usa_weather_by_stateCity['avg_sea_level_pres_hpa'].interpolate(method='linear')

# Check for missing values
annual_usa_weather_by_stateCity.isnull().sum()

##### 5.4.2.3 Forward Fill (Scenario #3)

In [None]:
# Handle missing values for `avg_sea_level_pres_hpa` column utilizing forward fill
annual_usa_weather_by_stateCity['avg_sea_level_pres_hpa_ffill'] = annual_usa_weather_by_stateCity['avg_sea_level_pres_hpa'].ffill()

# Check for missing values
annual_usa_weather_by_stateCity.isnull().sum()

##### 5.4.2.4 Backward Fill (Scenario #4)

In [None]:
# Handle missing values for `avg_sea_level_pres_hpa` column utilizing backward fill
annual_usa_weather_by_stateCity['avg_sea_level_pres_hpa_bfill'] = annual_usa_weather_by_stateCity['avg_sea_level_pres_hpa'].bfill()

# Check for missing values
annual_usa_weather_by_stateCity.isnull().sum()

In [None]:
annual_usa_weather_by_stateCity.describe()

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(12, 12))

# Define common x and y limits
x_limits = [1005, 1025]  # Replace with the min and max values across all datasets for the x-axis
y_limits = [0, 125]  # Replace with the max frequency across all datasets for the y-axis

# Column, panel title, and color for each scenario
panels = [
    ('avg_sea_level_pres_hpa', 'Original', 'blue'),
    ('avg_sea_level_pres_hpa_linear', 'Linear Interpolated', 'green'),
    ('avg_sea_level_pres_hpa_ffill', 'Forward Fill', 'red'),
    ('avg_sea_level_pres_hpa_bfill', 'Backward Fill', 'purple'),
]

# Plot each histogram on a different subplot
for ax, (column, title, color) in zip(axes.flat, panels):
    values = annual_usa_weather_by_stateCity[column].dropna()
    # Number of bins from the Square Root Rule
    num_bins = int(np.sqrt(len(values)))
    ax.hist(values, bins=num_bins, color=color)
    ax.set_title(title)
    ax.set(xlabel='Sea Level Pressure (hPa)', ylabel='Frequency')
    ax.set_xlim(x_limits)
    ax.set_ylim(y_limits)

# Display all subplots
plt.tight_layout()
plt.show()
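
The Square Root Rule used for the bin counts sets the number of bins to the square root of the number of observations, rounded down. The sketch below works through one case. The observation count is an illustrative assumption, and the shared bin edges are an optional refinement that is not part of the plotting cell above.

```python
import numpy as np

n_observations = 1319  # illustrative count of non-missing values
num_bins = int(np.sqrt(n_observations))  # square-root rule, rounded down

# Shared bin edges over the plotted range make panels directly comparable
bin_edges = np.linspace(1005, 1025, num_bins + 1)

print(num_bins, len(bin_edges))
```

Passing the same `bin_edges` array to every `hist` call would give all four panels identical bin widths, whereas separate bin counts yield slightly different widths per panel.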


##### 5.4.2.5 Decision on Filling Missing Values

After a thorough review of all four scenarios, we've decided to employ **Scenario #2 (linear interpolation)** for filling the missing values in the `avg_sea_level_pres_hpa` column. The rationale behind this choice is manifold:

- **Representation of Data**: Linear interpolation provides a smoother distribution than the other methods. This approach does not heavily skew the tail ends of the distribution, ensuring a more natural representation.
  
- **Preservation of Original Distribution**: Linear interpolation appears to retain the original data distribution more faithfully when filling in missing values, without introducing any discernible bias towards specific values.

- **Percentage of Missing Values**: With only 138 missing values, which represents 9.47% of the total count of 1457, the sheer accuracy of the method is not as paramount as it would be with a more substantial portion of missing values. Nevertheless, it's essential to utilize a method that delivers reliability, and linear interpolation does just that.
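
The missing-value share quoted above follows from a single division, sketched here with the two counts given in the text.

```python
# Counts quoted in the text
missing = 138
total = 1457

# Share of missing values as a percentage
share = missing / total * 100

print(f"{share:.2f}%")
```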

**Analysis of Alternative Methods:**

- The **forward-fill method** is our second preference. It copies the last observed value forward and ignores the values that follow a gap, which might introduce bias.

- The **backward-fill method** is our third choice. It copies the next observed value backward and ignores the values that precede a gap, and its results seemed less accurate than those of the forward-fill method.

- Lastly, simply **leaving or dropping missing values** is the least appealing choice, as it discards the other information in the affected rows.


##### 5.4.2.6 Linear Interpolation for Average Sea Level Pressure

In [None]:
# copy the dataframe
annual_usa_weather_by_stateCity = annual_usa_weather_by_stateCity.copy()

# Fill missing values in the `avg_sea_level_pres_hpa` column with the linear values
annual_usa_weather_by_stateCity['avg_sea_level_pres_hpa'] = annual_usa_weather_by_stateCity['avg_sea_level_pres_hpa_linear']

# Drop the `avg_sea_level_pres_hpa_ffill` and `avg_sea_level_pres_hpa_bfill` columns
annual_usa_weather_by_stateCity.drop(columns=['avg_sea_level_pres_hpa_ffill', 'avg_sea_level_pres_hpa_bfill'], inplace=True)

# check for missing values
annual_usa_weather_by_stateCity.isnull().sum()

In [None]:
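# Sanity check only: after the assignment above both columns hold the same values,
# so the MAE is 0 by construction
# Filter out rows where either of the two columns is NaN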
filtered_df = annual_usa_weather_by_stateCity.dropna(subset=['avg_sea_level_pres_hpa', 'avg_sea_level_pres_hpa_linear'])

# Calculate mean absolute error
mae = mean_absolute_error(filtered_df['avg_sea_level_pres_hpa'], filtered_df['avg_sea_level_pres_hpa_linear'])

# Print the mean absolute error
print(f"Mean Absolute Error: {mae}")

In [None]:
annual_usa_weather_by_stateCity.drop(columns=['avg_sea_level_pres_hpa_linear'], inplace=True)

annual_usa_weather_by_stateCity.isnull().sum()

In [None]:
annual_usa_weather_by_stateCity.head()

In [None]:
annual_usa_weather_by_stateCity.shape

### 5.5 Join Migraine Data with Weather Data

Now that the weather data has been aggregated to match the migraine data, we can join the two datasets together. These datasets will be joined on the `year` and `state` columns from both the USA weather data and the migraine data.

In [None]:
filtered_migraine_data.head()

In [None]:
filtered_migraine_data.shape

In [None]:
# Combine the annual USA weather data with the migraine data
usa_weather_migraine = annual_usa_weather_by_stateCity.merge(
    filtered_migraine_data,
    how='left',
    on=['year', 'state'],
)

# Review the shape of the new dataframe
usa_weather_migraine.shape
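
Since the weather table has one row per city and the migraine table has several rows per state and year, joining on `year` and `state` pairs every city row with every matching migraine row. The sketch below shows the resulting row count on toy frames. The cities, age groups, and values are made up for illustration.

```python
import pandas as pd

# Two cities in the same state and year (illustrative values)
weather = pd.DataFrame({
    "year": [2000, 2000],
    "state": ["Texas", "Texas"],
    "city_name": ["Austin", "Houston"],
    "avg_temp_c": [20.0, 21.0],
})

# Three migraine records for that state and year, e.g. one per age group
migraine = pd.DataFrame({
    "year": [2000, 2000, 2000],
    "state": ["Texas", "Texas", "Texas"],
    "age": ["10-14", "15-19", "20-24"],
    "val": [0.10, 0.12, 0.15],
})

merged = weather.merge(migraine, how="left", on=["year", "state"])

# Every city row is paired with every migraine row for the same year and state
print(weather.shape[0], migraine.shape[0], merged.shape[0])
```

Comparing the merged shape against the two input shapes is a quick way to confirm that the amount of row multiplication is what the analysis expects.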

## 6. Feature Engineering

This section describes the new features that were created and why they were created, as well as any features that were dropped and the reasons for dropping them.

### 6.1 New Features

#### 6.1.1 Converting Metric Units to Imperial Units

##### 6.1.1.1 Convert `avg_temp_c` to `avg_temp_f`

In [None]:
from temp_conversion import celsius_to_fahrenheit

usa_weather_migraine['avg_temp_f'] = usa_weather_migraine['avg_temp_c'].apply(celsius_to_fahrenheit)

##### 6.1.1.2 Convert `min_temp_c` to `min_temp_f`

In [None]:
usa_weather_migraine['min_temp_f'] = usa_weather_migraine['min_temp_c'].apply(celsius_to_fahrenheit)

##### 6.1.1.3 Convert `max_temp_c` to `max_temp_f`

In [None]:
usa_weather_migraine['max_temp_f'] = usa_weather_migraine['max_temp_c'].apply(celsius_to_fahrenheit)
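
`celsius_to_fahrenheit` comes from the project's `temp_conversion` module, whose source is not shown here. Assuming it applies the standard formula F = C × 9/5 + 32, the three conversions could also be written with vectorized arithmetic instead of `.apply`. The function `c_to_f` and the toy values below are hypothetical.

```python
import pandas as pd

def c_to_f(celsius):
    """Standard Celsius-to-Fahrenheit formula; works on scalars and Series."""
    return celsius * 9 / 5 + 32

# Toy frame using the notebook's Celsius column names (values are made up)
df = pd.DataFrame({
    "avg_temp_c": [0.0, 20.0],
    "min_temp_c": [-40.0, 10.0],
    "max_temp_c": [100.0, 30.0],
})

# Convert each *_c column into its *_f counterpart
for col in ["avg_temp_c", "min_temp_c", "max_temp_c"]:
    df[col[:-2] + "_f"] = c_to_f(df[col])

print(df)
```

Operating on the whole column at once avoids a Python-level function call per row, which tends to matter on larger frames.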

##### 6.1.1.4 Reorder Temperature Columns

In [None]:
# Reorder temperature columns
temp_col = usa_weather_migraine.pop('avg_temp_f')

# Insert columns at new position
usa_weather_migraine.insert(3, 'avg_temp_f', temp_col)

In [None]:
# Reorder temperature columns
temp_col1 = usa_weather_migraine.pop('min_temp_f')

# Insert columns at new position
usa_weather_migraine.insert(4, 'min_temp_f', temp_col1)

In [None]:
# Reorder temperature columns
temp_col2 = usa_weather_migraine.pop('max_temp_f')

# Insert columns at new position
usa_weather_migraine.insert(5, 'max_temp_f', temp_col2)

In [None]:
usa_weather_migraine

##### 6.1.1.5 Convert `precipitation_mm` to `precipitation_in`

In [None]:
from temp_conversion import mM_to_inches

usa_weather_migraine['precipitation_in'] = usa_weather_migraine['precipitation_mm'].apply(mM_to_inches)


In [None]:
# Reorder the precipitation column
temp_col3 = usa_weather_migraine.pop('precipitation_in')

# Insert columns at new position
usa_weather_migraine.insert(9, 'precipitation_in', temp_col3)

In [None]:
usa_weather_migraine.head()

#### 6.1.2 Sea Level Pressure Missing Values

Three candidate imputations of the average sea level pressure were computed earlier in the notebook. The linear version was copied into `avg_sea_level_pres_hpa`, and the helper columns were then dropped.

- `avg_sea_level_pres_hpa`: the original column, which contained missing values
- `avg_sea_level_pres_hpa_linear`: missing values filled by linear interpolation
- `avg_sea_level_pres_hpa_ffill`: missing values filled by forward fill
- `avg_sea_level_pres_hpa_bfill`: missing values filled by backward fill
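
The difference between the three fill strategies is easiest to see side by side. The sketch below applies them to a toy series; the values are hypothetical, and the notebook's own computation (for example, any grouping by state or city) may differ.

```python
import numpy as np
import pandas as pd

# Toy pressure series with a two-value gap (hypothetical values, hPa)
pres = pd.Series([1010.0, np.nan, np.nan, 1016.0, 1015.0])

fills = pd.DataFrame({
    "original": pres,
    "linear": pres.interpolate(method="linear"),  # straight line across the gap
    "ffill": pres.ffill(),                        # repeat the last known value
    "bfill": pres.bfill(),                        # repeat the next known value
})
print(fills)
```

Linear interpolation spreads the change evenly across the gap, while forward and backward fill each produce a step at one end of it.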

### 6.2 Dropped Features

In [None]:
# usa_weather_migraine.drop(columns=['avg_temp_c', 'min_temp_c', 'max_temp_c'], inplace=True)

# usa_weather_migraine

In [None]:
# usa_weather_migraine.drop(columns='precipitation_mm', inplace=True)

# usa_weather_migraine

## 7. Summary

This notebook cleaned and preprocessed the data: checking for consistency, handling missing values, merging the separate datasets, and deriving new features. The data is now in a state suitable for further analysis and modeling, and it sets the stage for the exploration carried out in the subsequent notebooks.

## 8. Next Steps

The next step in the project pipeline is data analysis (03_data_analysis.ipynb). In that notebook, we will explore the relationships between the variables and carry out statistical tests of the project's hypotheses.