# Feature Engineering

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math

from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
pd.set_option('display.max_rows', 200)

In [None]:
# Import datasets
train = pd.read_csv('./datasets/cleaned_train.csv')
test = pd.read_csv('./datasets/test.csv')
spray = pd.read_csv('./datasets/spray.csv')
weather = pd.read_csv('./datasets/cleaned_weather.csv')

We've observed that our features generally have a pretty low correlation to `WnvPresent`. Our strongest feature is `NumMosquitos` with a correlation of 0.197, which can be used together with our target variable to calculate <b>Mosquito Infection Rate</b>, which is defined by the [CDC](https://www.cdc.gov/westnile/resourcepages/mosqSurvSoft.html#:~:text=The%20simplest%20estimate%2C%20the%20minimum,goals%20of%20the%20surveillance%20program) as: 

$$ \text{MIR} = 1000 * {\text{number of positive pools} \over \text{total number of mosquitos in pools tested}} $$
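As a minimal sketch of this formula, the snippet below computes a weekly MIR from a toy DataFrame. The column names mirror `NumMosquitos` and `WnvPresent` from our training data, but the rows and the `weekly_mir` helper are made up for illustration, and each row is assumed to represent one tested pool.

```python
import pandas as pd

def weekly_mir(df):
    """MIR per week: 1000 * positive pools / total mosquitos tested."""
    grouped = df.groupby('Week').agg(
        positive_pools=('WnvPresent', 'sum'),
        total_mosquitos=('NumMosquitos', 'sum'),
    )
    return 1000 * grouped['positive_pools'] / grouped['total_mosquitos']

# Toy data (hypothetical): each row is one tested pool
toy = pd.DataFrame({
    'Week': [30, 30, 30, 31, 31],
    'NumMosquitos': [50, 30, 20, 40, 60],
    'WnvPresent': [1, 0, 1, 0, 0],
})
mir = weekly_mir(toy)
```

Week 30 has 2 positive pools out of 100 mosquitos, so its MIR is 20; week 31 has none, so its MIR is 0.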

This variable and time-lagged versions of it achieved a high correlation with the target --  <b>an interaction feature between `Week` and `MIR` had a much higher correlation of 0.26</b>. With further polynomial feature engineering, we managed to get features with up to 0.46 correlation with our target variable. Unfortunately, our test data doesn't have the information we need to make this a usable feature. We discussed estimating the number of mosquitos based on total rows in the test set, but we ultimately decided that this was a slightly [<i>'hackish'</i> solution](https://www.kaggle.com/c/predict-west-nile-virus/discussion/14790). We'll drop NumMosquitos moving forward.

In [None]:
# Dropping NumMosquitos as it isn't present within test data
train = train.drop(columns='NumMosquitos')

Our remaining features can be categorised as a mixture of 1. time, 2. weather and 3. location variables. Each of these variables has a low correlation of 0.105 or less to our target. While we could go ahead with these features and jump straight into predictive modelling, feature engineering is a much better approach: without it, our models consistently scored an AUC-ROC of approximately 0.5.

In this section, we'll look to <b>decompose and split our features</b>, as well as carry out <b>data enrichment</b> in the form of historical temperature records from the [National Weather Service](https://www.weather.gov). We'll also carry out a bit of polynomial feature engineering, to try and create features with a higher correlation to our target.

### Preparation for Engineering

In [None]:
# Convert to datetime object
weather['Date'] = pd.to_datetime(weather['Date'])
train['Date'] = pd.to_datetime(train['Date'])

In [None]:
# YearWeek gives us a more precise means of accessing certain weeks in a specific year
def year_week(row):
    week = int(row['Week'])
    year = int(row['Year'])
    # Zero-pad the week so that e.g. week 5 of 2014 becomes 201405 rather than 20145
    row['YearWeek'] = year * 100 + week
    return row

In [None]:
train = train.apply(year_week, axis=1)
weather = weather.apply(year_week, axis=1)
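Row-wise `apply` can be slow on larger frames. The sketch below shows a vectorised way to build the same kind of key on a toy frame. It assumes `Year` and `Week` are integer columns, and it zero-pads the week number so that single-digit weeks still sort correctly.

```python
import pandas as pd

# Toy frame (hypothetical values) standing in for train or weather
toy = pd.DataFrame({'Year': [2013, 2014, 2014], 'Week': [44, 5, 40]})

# Year * 100 + Week gives 201344, 201405, 201440
toy['YearWeek'] = toy['Year'] * 100 + toy['Week']
```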

## Relative Humidity

High humidity is thought to be a strong factor in the spread of the West Nile Virus -- it's been [reported](https://www.mdpi.com/1660-4601/17/4/1403/pdf#:~:text=caspius.,%25%20%5B19%2C23%5D.) that <b>high humidity increases egg production, larval indices and mosquito activity</b>. Other studies have shown that the range of humidity that stimulates mosquito flight activity is between 44% and 69%, with 65% as a focal percentage.

The climate of Chicago is classified as hot-summer humid continental (Köppen climate classification: Dfa), which means that humidity is worth looking into. To calculate relative humidity, we'll first convert some of our temperature readings into degrees Celsius.

#### Calculate Celsius

In [None]:
# To calculate Relative Humidity, we need to convert our features from Fahrenheit to Celsius
def celsius(x):
    c = ((x - 32) * 5.0)/9.0
    return float(c)

In [None]:
weather['TavgC'] = weather['Tavg'].apply(celsius)
weather['TminC'] = weather['Tmin'].apply(celsius)
weather['TmaxC'] = weather['Tmax'].apply(celsius)
weather['DewPointC'] = weather['DewPoint'].apply(celsius)

In [None]:
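def r_humid(row):
    # Relative humidity (%) from dew point and average temperature, both in degrees Celsius
    dew_term = math.exp((17.625 * row['DewPointC']) / (243.04 + row['DewPointC']))
    temp_term = math.exp((17.625 * row['TavgC']) / (243.04 + row['TavgC']))
    row['r_humid'] = round(100 * dew_term / temp_term)
    return row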

Formula for [Relative Humidity](https://bmcnoldy.rsmas.miami.edu/Humidity.html):

$$ \large RH = 100 \cdot {\exp\left({aT_{d} \over {b + T_{d}}}\right) \over \exp\left({aT \over {b + T}}\right)}$$

where: <br>
$ \small a = 17.625 $ <br>
$ \small b = 243.04 $ <br>
$ \small T = \text{Average Temperature (°C)} $ <br>
$ \small T_{d} = \text{Dewpoint Temperature (°C)} $ <br>
$ \small RH = \text{Relative Humidity} $ (%)<br>
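As a worked sketch of the formula, the snippet below evaluates it with NumPy for two (Tavg, DewPoint) pairs in degrees Fahrenheit taken from our weather data (2013-10-31 and 2014-10-14). The helper names `fahrenheit_to_celsius` and `relative_humidity` are illustrative and not part of the notebook.

```python
import numpy as np

A, B = 17.625, 243.04  # constants a and b from the formula

def fahrenheit_to_celsius(f):
    return (np.asarray(f, dtype=float) - 32.0) * 5.0 / 9.0

def relative_humidity(tavg_f, dewpoint_f):
    """Relative humidity (%) from average and dew point temperatures in Fahrenheit."""
    t = fahrenheit_to_celsius(tavg_f)
    td = fahrenheit_to_celsius(dewpoint_f)
    return 100.0 * np.exp(A * td / (B + td)) / np.exp(A * t / (B + t))

# (Tavg, DewPoint) = (55.5, 56.0) and (60.0, 58.0)
rh = relative_humidity([55.5, 60.0], [56.0, 58.0])
```

Rounded to whole percentages, these evaluate to 102 and 93, which matches the `r_humid` values for those two dates.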

In [None]:
weather = weather.apply(r_humid, axis=1)

In [None]:
# Dropping as Celsius features are no longer needed
weather = weather.drop(columns=['TavgC', 'TminC', 'TmaxC', 'DewPointC'])

In [None]:
weather.sort_values(by='r_humid', ascending=False).head()

| | Date | AvgSpeed | Cool | Depart | DewPoint | Heat | PrecipTotal | ResultDir | ResultSpeed | SeaLevel | StnPressure | Sunrise | Sunset | Tavg | Tmax | Tmin | WetBulb | lowvis | rain | r_humid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1287 | 2013-10-31 | 11.45 | 0.0 | 9.5 | 56.0 | 9.5 | 1.535 | 23.5 | 9.45 | 29.46 | 28.73 | 623.0 | 1647.0 | 55.5 | 64.5 | 46.5 | 57.0 | 1.0 | 1.0 | 102 |
| 1454 | 2014-10-14 | 9.4 | 0.0 | 7.0 | 58.0 | 5.0 | 0.915 | 11.0 | 2.65 | 29.545 | 28.86 | 603.0 | 1712.0 | 60.0 | 66.0 | 53.0 | 59.0 | 1.0 | 1.0 | 93 |
| 319 | 2008-09-13 | 9.4 | 7.5 | 7.5 | 70.0 | 0.0 | 4.855 | 21.5 | 6.8 | 29.705 | 29.005 | 529.0 | 1807.0 | 72.5 | 76.0 | 68.5 | 71.0 | 1.0 | 1.0 | 92 |
| 1286 | 2013-10-30 | 6.85 | 0.0 | 8.5 | 52.0 | 10.5 | 0.855 | 14.5 | 5.95 | 30.005 | 29.255 | 622.0 | 1649.0 | 54.5 | 63.5 | 45.0 | 53.0 | 1.0 | 1.0 | 91 |
| 763 | 2011-05-28 | 6.25 | 0.0 | -5.5 | 55.0 | 7.5 | 0.175 | 18.0 | 5.2 | 29.83 | 29.13 | 421.0 | 1916.0 | 57.5 | 63.0 | 51.5 | 56.0 | 1.0 | 1.0 | 91 |


In [None]:
# The average humidity in Chicago could be a factor in the spread of the West Nile Virus
weather['r_humid'].mean()

62.21399456521739

Note: Relative humidity can exceed 100% due to [supersaturation](https://www.chicagotribune.com/news/ct-xpm-2011-07-20-ct-wea-0720-asktom-20110720-story.html#:~:text=Surprisingly%2C%20yes%2C%20the%20condition%20is,is%20needed%20to%20cause%20saturation.). Water vapor begins to condense onto impurities (such as dust or salt particles) in the air as the RH approaches 100 percent, and a cloud or fog forms. In our data, the only value above 100% (102% on 2013-10-31) coincides with a daily average dew point (56.0°F) slightly above the daily average temperature (55.5°F).

## Cumulative Weekly Precipitation

It's often thought that above-average precipitation leads to a higher abundance of mosquitoes and increases the potential for disease outbreaks like the West Nile Virus. This positive association has been confirmed by several [studies](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4342965/#RSTB20130561C42), but precipitation can be slightly more complex as a feature, as heavy rainfall could dilute the nutrients for larvae, thus decreasing development rate. It might also lead to a negative association by [flushing ditches and drainage channels used by Culex larvae](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4342965/#RSTB20130561C56).

Regardless, precipitation is still worth looking into. Instead of looking at daily precipitation amounts, which likely don't affect the presence of WNV on that particular day, we can take cumulative weekly precipitation into account and create a feature measuring weeks with heavy rain.

In [None]:
weather = weather.apply(year_week, axis=1)

In [None]:
# Setting up grouped df for calculation of cumulative weekly precipitation
# (numeric_only=True skips the datetime Date column, which cannot be summed)
group_df = weather.groupby('YearWeek').sum(numeric_only=True)

In [None]:
def WeekPrecipTotal(row):
    YearWeek = row['YearWeek']
    row['WeekPrecipTotal'] = group_df.loc[YearWeek]['PrecipTotal']
    return row

In [None]:
weather = weather.apply(WeekPrecipTotal, axis=1)
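The same weekly total can be attached to every daily row without a row-wise lookup. The sketch below uses `groupby(...).transform('sum')` on a toy frame; the values are hypothetical and only the column names follow the notebook.

```python
import pandas as pd

# Toy daily records (made-up values) for two weeks
toy = pd.DataFrame({
    'YearWeek': [201440, 201440, 201441, 201441, 201441],
    'PrecipTotal': [0.10, 0.25, 0.00, 0.50, 0.05],
})

# transform('sum') broadcasts each week's total back onto its daily rows
toy['WeekPrecipTotal'] = toy.groupby('YearWeek')['PrecipTotal'].transform('sum')
```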

## Weekly Average Temperature

Temperature has been [acknowledged](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4342965/) to be the most prominent feature associated with outbreaks of the West Nile Virus. Among other things, high temperature has been shown to positively correlate with viral replication rates, seasonal phenology of mosquito host populations, growth rates of vector populations, viral transmission efficiency to birds and geographical variations in human case incidence.

Rather than looking at daily temperature, we'll also look at average temperatures by week.

In [None]:
def WeekAvgTemp(row):
    # Retrieve current week
    YearWeek = row['YearWeek']
    
    # Retrieving sum of average temperature for current week
    temp_sum = group_df.loc[YearWeek]['Tavg']
    
    # Getting number of days recorded by weather station for current week
    n_days = weather[weather['YearWeek'] == YearWeek].shape[0]
    
    # Calculate Week Average Temperature
    row['WeekAvgTemp'] = temp_sum / n_days
    
    return row

In [None]:
weather = weather.apply(WeekAvgTemp, axis=1)
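Dividing the weekly sum by the number of recorded days is the same as taking the group mean. The sketch below checks that on a toy frame in which the second week has fewer recorded days; the values are hypothetical.

```python
import pandas as pd

# Toy daily records: three days in week 201443, two days in week 201444
toy = pd.DataFrame({
    'YearWeek': [201443, 201443, 201443, 201444, 201444],
    'Tavg': [54.0, 55.0, 56.0, 49.0, 51.0],
})

grouped = toy.groupby('YearWeek')['Tavg']
toy['WeekAvgTemp'] = grouped.transform('mean')

# The manual sum / n_days calculation gives the same result
manual = grouped.transform('sum') / grouped.transform('count')
```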

## Winter Temperature

Winter temperatures aren't a very intuitive variable when it comes to predicting the West Nile Virus. However, it turns out that the WNV can <b>[overwinter](https://ugaurbanag.com/811-2/)</b>: certain mosquitos, such as Culex species, overwinter in the adult stage as fertilized, non-blood-fed females. Culex pipiens in particular goes into physiological diapause (akin to hibernation) during the winter months.

The virus does not replicate within the mosquito at lower temperatures, <b>but is available to begin replication when temperatures increase</b>. This corresponds with the beginning of the nesting period of birds and the presence of young birds. Circulation of the virus in bird populations leads to the amplification of the virus and growth of vector mosquito populations.

The National Weather Service carries [historical records of January temperatures](https://www.weather.gov/lot/January_Temperature_Rankings_Chicago) -- we created a dataset based on this and carried out some minimal cleaning to create a proxy feature measuring winter temperatures.

In [None]:
# This dataset gives us the average January temperature of each year -- we're using this as a proxy for Winter temperatures.
# We can also see how far each temperature differs from the 30 year normal (23.8 degrees F)
winter_df = pd.read_csv('./datasets/jan_winter.csv')
winter_df.head()

| | Year | AvgTemp | JanDepart |
|---|---|---|---|
| 0 | 2001 | 24.6 | 0.8 |
| 1 | 2002 | 31.9 | 8.1 |
| 2 | 2003 | 21.3 | -2.5 |
| 3 | 2004 | 20.3 | -3.5 |
| 4 | 2005 | 24.5 | 0.7 |


In [None]:
def winter_temp(row):
    year = row['Year']
    #row['WinterTemp'] = winter_df[winter_df['Year'] == year]['AvgTemp'].values[0]
    row['WinterDepart'] = winter_df[winter_df['Year'] == year]['JanDepart'].values[0]
    return row

In [None]:
weather = weather.apply(winter_temp, axis=1)
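As a sketch of how the departure column relates to the 23.8°F normal, and of a vectorised alternative to the row-wise lookup, the snippet below rebuilds `JanDepart` for the first three years of the table and maps it onto a toy stand-in for the weather frame. `JAN_NORMAL_F` and `toy_weather` are illustrative names, not part of the notebook.

```python
import pandas as pd

JAN_NORMAL_F = 23.8  # 30-year January normal quoted in the code comment

# First three rows of the January dataset
winter = pd.DataFrame({'Year': [2001, 2002, 2003],
                       'AvgTemp': [24.6, 31.9, 21.3]})
winter['JanDepart'] = (winter['AvgTemp'] - JAN_NORMAL_F).round(1)

# Toy stand-in for the weather frame (hypothetical rows)
toy_weather = pd.DataFrame({'Year': [2002, 2002, 2003, 2001]})

# Map each row's year to that year's January departure
toy_weather['WinterDepart'] = toy_weather['Year'].map(
    winter.set_index('Year')['JanDepart'])
```

Years missing from the lookup table would come back as `NaN` here rather than raising an error, so they are worth checking for before modelling.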

## Summer Temperature

While the link between summer temperature and WNV isn't as clear, we thought it worth investigating whether warmer summers (or in this case, warm Julys) affect the spread of the WNV. The virus is said to spread most efficiently in the United States at temperatures [between 75.2 and 77 degrees Fahrenheit](https://www.medicinenet.com/script/main/art.asp?articlekey=247250#:~:text=The%20mosquito%2Dborne%20virus%20spreads,15%20in%20the%20journal%20eLife.).

This data also comes from the [National Weather Service](https://www.weather.gov/lot/July_Temperature_Rankings_Chicago).

In [None]:
# This dataset gives us the average July temperature of each year -- we're using this as a proxy for Summer temperatures.
# We can also see how far each temperature differs from the 30 year normal (74.0 degrees F)
summer_df = pd.read_csv('./datasets/jul_summer.csv')
summer_df.head()

| | Year | AvgTemp | JulDepart |
|---|---|---|---|
| 0 | 2006 | 76.5 | 2.5 |
| 1 | 2007 | 73.7 | -0.3 |
| 2 | 2008 | 74.0 | 0.0 |
| 3 | 2009 | 69.4 | -4.6 |
| 4 | 2010 | 77.7 | 3.7 |


In [None]:
def summer_temp(row):
    year = row['Year']
    #row['SummerTemp'] = summer_df[summer_df['Year'] == year]['AvgTemp'].values[0]
    row['SummerDepart'] = summer_df[summer_df['Year'] == year]['JulDepart'].values[0]
    return row

In [None]:
weather = weather.apply(summer_temp, axis=1)

## Temporal Features

Some [studies](https://pubmed.ncbi.nlm.nih.gov/30145430/) have argued that increased precipitation and temperatures might have a <b>lagged direct effect</b> on the incidence of WNV infection. Given that most Culex mosquitos take approximately [7-10 days](https://www.cdc.gov/mosquitoes/about/life-cycles/culex.html#:~:text=Life%20stages%20of%20Culex%20pipiens,develop%20into%20an%20adult%20mosquito) to develop from egg to adult, the temperature, humidity and precipitation of previous weeks could play into higher mosquito growth in following weeks. According to the CDC, eggs are ready to hatch from a few days to several months after being laid. Accordingly, we'll create some time-lagged variables going back to a month before the current date.

### Average Temperature (1 week - 4 weeks before)

In [None]:
def create_templag(row):
    # Retrieve current week
    YearWeek = row['YearWeek']
    
    # Look up the weekly average temperature for each of the previous four weeks
    for i in range(4):
        try:
            row[f'templag{i+1}'] = weather[weather['YearWeek'] == (YearWeek - (i+1))]['WeekAvgTemp'].unique()[0]
            
        # For the first 4 weeks of the year where no previous data exists, create rough estimate of temperatures
        except IndexError:
            row[f'templag{i+1}'] = row['WeekAvgTemp'] - i
    return row

In [None]:
weather = weather.apply(create_templag, axis=1)
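The lag lookup can also be sketched without a row-wise `apply`, by building a week-to-value lookup once and mapping `YearWeek - k` onto it. The `add_week_lags` helper below is hypothetical. It assumes the value is constant within a week (it takes the first row per week), and unlike the function above it leaves `NaN` where no earlier week exists, so any fallback estimate would still need to be filled in afterwards.

```python
import pandas as pd

def add_week_lags(df, value_col, prefix, n_lags=4):
    """Add columns prefix1..prefixN holding value_col from 1..N weeks earlier."""
    # One value per week; assumes value_col is constant within each YearWeek
    weekly = df.groupby('YearWeek')[value_col].first()
    out = df.copy()
    for k in range(1, n_lags + 1):
        # Weeks with no match in the lookup become NaN
        out[f'{prefix}{k}'] = (out['YearWeek'] - k).map(weekly)
    return out

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    'YearWeek': [201440, 201440, 201441, 201442],
    'WeekAvgTemp': [57.0, 57.0, 53.5, 54.0],
})
lagged = add_week_lags(toy, 'WeekAvgTemp', 'templag', n_lags=2)
```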

### Cumulative Weekly Precipitation (1 week - 4 weeks before)

In [None]:
def create_rainlag(row):
    # Retrieve current week
    YearWeek = row['YearWeek']
    
    # Look up the cumulative weekly precipitation for each of the previous four weeks
    for i in range(4):
        try:
            row[f'rainlag{i+1}'] = weather[weather['YearWeek'] == (YearWeek - (i+1))]['WeekPrecipTotal'].unique()[0]
            
        # Use average of column if no data available
        except IndexError:
            row[f'rainlag{i+1}'] = weather['WeekPrecipTotal'].mean()
    return row

In [None]:
weather = weather.apply(create_rainlag, axis=1)

### Relative Humidity (1 week - 4 weeks before)

In [None]:
def create_humidlag(row):
    # Retrieve current week
    YearWeek = row['YearWeek']
    
    # Look up relative humidity for each of the previous four weeks
    # (r_humid is a daily value, so this takes the first distinct daily reading of that week)
    for i in range(4):
        try:
            row[f'humidlag{i+1}'] = weather[weather['YearWeek'] == (YearWeek - (i+1))]['r_humid'].unique()[0]
            
        # Use average of column if no data available
        except IndexError:
            row[f'humidlag{i+1}'] = weather['r_humid'].mean()
    return row

In [None]:
weather = weather.apply(create_humidlag, axis=1)

In [None]:
# Checking that temperature lagged variables are correct
weather.groupby(by='YearWeek')[['WeekAvgTemp', 'templag1', 'templag2', 'templag3', 'templag4']].mean().tail(5)

| YearWeek | WeekAvgTemp | templag1 | templag2 | templag3 | templag4 |
|---|---|---|---|---|---|
| 201440 | 57.142857 | 65.285714 | 62.142857 | 60.071429 | 74.428571 |
| 201441 | 53.714286 | 57.142857 | 65.285714 | 62.142857 | 60.071429 |
| 201442 | 54.0 | 53.714286 | 57.142857 | 65.285714 | 62.142857 |
| 201443 | 54.642857 | 54.0 | 53.714286 | 57.142857 | 65.285714 |
| 201444 | 50.2 | 54.642857 | 54.0 | 53.714286 | 57.142857 |


## Traps

During our exploratory data analysis, we also noticed that several traps had extremely high numbers of mosquitos and accordingly, high numbers of `WnvPresent`. We decided to one-hot encode all of the mosquito traps and compare them with our target variable further on.

In [None]:
train = pd.get_dummies(train, columns=['Trap'])

## Species

We also noticed that only three species were identified as WNV carriers: `CULEX PIPIENS/RESTUANS`, `CULEX RESTUANS` and `CULEX PIPIENS`. Notably, the incidence of the WNV in `CULEX RESTUANS` was 0.002 (49 positive pools vs 23431 mosquitos), while the incidence of the WNV in `CULEX PIPIENS/RESTUANS` was 0.004 (262 positive pools vs 66268 mosquitos). In `CULEX PIPIENS`, the incidence of WNV was measured at 0.005 (240 positive pools vs 44671 mosquitos).

Given this relationship, we placed a lighter weight on the `CULEX RESTUANS`, while assigning no weight to species that weren't identified as WNV carriers by the data.
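The incidence figures quoted above are simply positive pools divided by total mosquitos, rounded to three decimal places. A quick arithmetic check using the counts from the text:

```python
# (positive pools, total mosquitos) per species, as quoted in the text
counts = {
    'CULEX RESTUANS': (49, 23431),
    'CULEX PIPIENS/RESTUANS': (262, 66268),
    'CULEX PIPIENS': (240, 44671),
}

# Incidence = positive pools / total mosquitos, rounded to 3 decimal places
incidence = {species: round(pos / total, 3)
             for species, (pos, total) in counts.items()}
```

`CULEX RESTUANS` comes out at roughly half the incidence of the other two carriers, which is the basis for giving it a weight of 1 rather than 2.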

In [None]:
# WnvPresent by species
train[['Species', 'WnvPresent']].groupby('Species').sum()

| Species | WnvPresent |
|---|---|
| CULEX ERRATICUS | 0 |
| CULEX PIPIENS | 240 |
| CULEX PIPIENS/RESTUANS | 262 |
| CULEX RESTUANS | 49 |
| CULEX SALINARIUS | 0 |
| CULEX TARSALIS | 0 |
| CULEX TERRITANS | 0 |


In [None]:
train['Species'] = train['Species'].map({'CULEX PIPIENS/RESTUANS': 2, 'CULEX PIPIENS': 2, 'CULEX RESTUANS': 1}) \
                                   .fillna(0)

In [None]:
# Checking species value count
train['Species'].value_counts()

2.0    7451
1.0    2740
0.0     315
Name: Species, dtype: int64

# Feature Selection

As previously mentioned, our features were generally quite low in terms of correlation to our target. Polynomial feature engineering is a great way to deal with this -- combining or transforming features can often significantly increase a feature's correlation to the target.

We can also identify relationships of interest between our variables.

## Polynomial Feature Engineering

In [None]:
merged_df = pd.merge(weather, train, on=['Date', 'Year', 'Week', 'Month', 'YearWeek', 'DayOfWeek'])

In [None]:
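# Take both X and y from merged_df so that rows stay aligned after the merge
X = merged_df[[col for col in merged_df.columns if 'WnvPresent' not in col]].select_dtypes(include=['number', 'bool'])
y = merged_df['WnvPresent']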

In [None]:
# Generates the full polynomial feature table
poly = PolynomialFeatures(include_bias=False, degree=2)
X_poly = poly.fit_transform(X)
X_poly.shape

(10506, 16835)
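The width of this table follows from the degree-2 expansion: with n input columns and no bias term, there are n linear terms plus n(n+1)/2 squares and pairwise products. A pure-Python sketch of that enumeration is below. The helper names are hypothetical, the naming format only imitates the space-separated style of the generated feature names, and the figure of 182 input columns is inferred from the shape above rather than stated in the notebook.

```python
from itertools import combinations_with_replacement

def degree2_feature_names(columns):
    """Names of all degree-1 and degree-2 terms (no bias column)."""
    names = list(columns)
    for a, b in combinations_with_replacement(columns, 2):
        names.append(f'{a}^2' if a == b else f'{a} {b}')
    return names

def n_degree2_features(n):
    # n linear terms + n(n+1)/2 squares and pairwise products
    return n + n * (n + 1) // 2

names = degree2_feature_names(['Sunrise', 'WeekAvgTemp'])
n_terms = n_degree2_features(182)
```

For two columns this yields five terms, and `n_degree2_features(182)` gives 16835, consistent with the shape above.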

In [None]:
# Adds appropriate feature names to all polynomial features
X_poly = pd.DataFrame(X_poly, columns=poly.get_feature_names_out(X.columns))

# Generates list of poly feature correlations
X_poly_corrs = X_poly.corrwith(y)

# Shows features most highly correlated (positively) with target
X_poly_corrs.sort_values(ascending=False).head(10)

Sunrise WeekAvgTemp             0.150030
Sunrise templag2                0.147419
Sunrise templag3                0.147091
Sunrise templag1                0.144225
Sunrise WetBulb                 0.143433
DewPoint Sunrise                0.142001
DewPoint Week                   0.141467
Sunrise Tmin                    0.140002
Week WeekAvgTemp                0.139110
WeekPrecipTotal WinterDepart    0.138693
dtype: float64

Unsurprisingly, all of our top features involve some type of temperature variable. To prevent adding too much multicollinearity to our model, we'll pick only three features from this list.

In [None]:
# Creating interaction features -- only 3 due to multicollinearity issues
merged_df['Sunrise_WeekAvgTemp'] = merged_df['Sunrise'] * merged_df['WeekAvgTemp']
merged_df['Sunrise_WetBulb'] = merged_df['Sunrise'] * merged_df['WetBulb']
merged_df['Week_WeekAvgTemp'] = merged_df['Week'] * merged_df['WeekAvgTemp']

In [None]:
cm = merged_df.corr(numeric_only=True)['WnvPresent'].abs().sort_values(ascending=False)

WnvPresent         1.000000
NumMosquitos       0.196820
WinterTemp         0.124978
WinterDepart       0.124978
templag3           0.120568
templag2           0.106181
Sunrise            0.105227
Week_x             0.104171
Week_y             0.104171
Month              0.100143
WeekPrecipTotal    0.098721
templag4           0.087055
DewPoint           0.085883
WetBulb            0.080468
Tmin               0.074048
templag1           0.065369
Tavg               0.064256
Cool               0.058101
WeekAvgTemp        0.054949
YearWeek_x         0.053012
YearWeek_y         0.053012
Depart             0.052586
r_humid            0.052098
Year_x             0.050865
Year_y             0.050865
Tmax               0.048244
rainlag1           0.038998
humidlag2          0.030603
humidlag1          0.029072
humidlag4          0.028938
Latitude           0.028697
lowvis             0.027671
rain               0.024905
DayOfWeek          0.014968
rainlag2           0.010371
StnPressure        0

At this point, we'll look to drop some variables with extremely low correlation to our target. Most of these are the trap features that we one-hot encoded.

In [None]:
# Variables with an absolute correlation below 0.02 to WnvPresent
cols_to_drop = cm[cm < 0.02].index
cols_to_drop

Index(['Trap_T013', 'Trap_T230', 'Trap_T014', 'Trap_T148', 'Trap_T016',
       'Trap_T018', 'Trap_T043', 'Trap_T212', 'Trap_T145', 'Trap_T074',
       ...
       'Trap_T156', 'Trap_T162', 'Trap_T039', 'Trap_T142', 'PrecipTotal',
       'Trap_T227', 'Trap_T033', 'Trap_T066', 'Trap_T226', 'ResultDir'],
      dtype='object', length=132)

In [None]:
merged_df = merged_df.drop(columns=cols_to_drop)
merged_df.shape

(10506, 58)

There's a degree of multicollinearity in our data, but this isn't something we can avoid, given that all of our top variables are too important to drop. Luckily, our interaction features seem to be somewhat distinct from other variables and from each other (pairwise correlations below 0.6), which should limit how much they worsen the effects of multicollinearity.

In [None]:
plt.figure(figsize=(14, 14))
sns.heatmap(merged_df.corr(numeric_only=True), cmap='coolwarm', square=True)
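A heatmap with this many columns can be hard to read precisely. As a complement, the sketch below lists feature pairs whose absolute correlation exceeds a threshold (0.6 here, matching the figure mentioned above). The `high_corr_pairs` helper and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.6):
    """Return feature pairs with absolute correlation above threshold."""
    corr = df.select_dtypes(include='number').corr().abs()
    # Keep only the upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy data: 'a' and 'b' are strongly related, 'c' is not
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 11],
    'c': [5, 1, 4, 2, 3],
})
pairs = high_corr_pairs(toy)
```

On `merged_df`, a call such as `high_corr_pairs(merged_df, 0.6)` would surface the pairs worth a closer look before modelling.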

### Selecting Top Features

In [None]:
cm = merged_df.corr(numeric_only=True)['WnvPresent'].abs().sort_values(ascending=False)
cm = cm.drop('WnvPresent')
cols_to_keep = cm.head(40)
cols_to_keep

Sunrise_WeekAvgTemp    0.150030
Sunrise_WetBulb        0.143433
Week_WeekAvgTemp       0.139110
WinterDepart           0.124978
templag3               0.120568
templag2               0.106181
Sunrise                0.105227
Week                   0.104171
Species                0.103477
Month                  0.100143
WeekPrecipTotal        0.098721
templag4               0.087055
DewPoint               0.085883
WetBulb                0.080468
Tmin                   0.074048
Sunset                 0.068451
templag1               0.065369
Tavg                   0.064256
Longitude              0.060345
Cool                   0.058101
WeekAvgTemp            0.054949
rainlag4               0.054830
Heat                   0.054740
YearWeek               0.053012
Depart                 0.052586
r_humid                0.052098
n_codesum              0.051083
Year                   0.050865
Tmax                   0.048244
ResultSpeed            0.046298
Trap_T900              0.044220
rainlag1

In [None]:
final_train_df = merged_df[cols_to_keep.keys()]

## Prepare Train & Test for Modelling

In [None]:
test = pd.get_dummies(test, columns=['Trap'])

In [None]:
test['Date'] = pd.to_datetime(test['Date'])

In [None]:
test['Species'] = test['Species'].map({'CULEX PIPIENS/RESTUANS': 2, 'CULEX PIPIENS': 2, 'CULEX RESTUANS': 1}).fillna(0)

In [None]:
merged_test_df = pd.merge(weather, test)

In [None]:
merged_test_df.columns

Index(['Date', 'AvgSpeed', 'Cool', 'Depart', 'DewPoint', 'Heat', 'PrecipTotal',
       'ResultDir', 'ResultSpeed', 'SeaLevel',
       ...
       'Trap_T231', 'Trap_T232', 'Trap_T233', 'Trap_T234', 'Trap_T235',
       'Trap_T236', 'Trap_T237', 'Trap_T238', 'Trap_T900', 'Trap_T903'],
      dtype='object', length=200)

In [None]:
merged_test_df['Sunrise_WeekAvgTemp'] = merged_test_df['Sunrise'] * merged_test_df['WeekAvgTemp']
merged_test_df['Sunrise_WetBulb'] = merged_test_df['Sunrise'] * merged_test_df['WetBulb']
merged_test_df['Week_WeekAvgTemp'] = merged_test_df['Week'] * merged_test_df['WeekAvgTemp']

In [None]:
final_test_df = merged_test_df[cols_to_keep.keys()]

In [None]:
# Checking for missing columns
[col for col in final_test_df if col not in final_train_df], [col for col in final_train_df if col not in final_test_df]

[]
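If this check ever does report mismatched columns (for example, a trap that appears in only one of the two datasets), one option is to reindex the test frame to the training columns. The sketch below uses toy frames with hypothetical values. Filling missing columns with 0 is an assumption that only makes sense for one-hot dummy columns.

```python
import pandas as pd

# Toy frames (hypothetical) with partially different trap dummies
toy_train = pd.DataFrame({'Week': [30, 31], 'Trap_T900': [1, 0], 'Trap_T002': [0, 1]})
toy_test = pd.DataFrame({'Trap_T900': [0, 1], 'Week': [32, 33], 'Trap_T234': [1, 0]})

missing_in_test = [c for c in toy_train.columns if c not in toy_test.columns]
extra_in_test = [c for c in toy_test.columns if c not in toy_train.columns]

# Reindex to the training columns: drops extras, adds missing dummies as 0,
# and puts the columns in the same order as the training frame
toy_test_aligned = toy_test.reindex(columns=toy_train.columns, fill_value=0)
```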

In [None]:
final_train_df.shape

(10506, 40)

In [None]:
final_test_df.shape

(116293, 40)

In [None]:
final_train_df.isnull().sum()[final_train_df.isnull().sum() > 0]

Series([], dtype: int64)

In [None]:
final_test_df.isnull().sum()[final_test_df.isnull().sum() > 0]

Series([], dtype: int64)

## Export

In [None]:
final_train_df.to_csv('./datasets/final_train.csv', index=False)

In [None]:
final_test_df.to_csv('./datasets/final_test.csv', index=False)