Links:

* [Kaggle competition discussion](https://www.kaggle.com/c/predict-impact-of-air-quality-on-death-rates/discussion)
* [Iterative Imputer Sklearn](https://scikit-learn.org/stable/auto_examples/impute/plot_iterative_imputer_variants_comparison.html#sphx-glr-auto-examples-impute-plot-iterative-imputer-variants-comparison-py)

# Structure

1. Explore Data:

  * Distribution plots

  * Missing value imputation:
    * Pattern of missing values
    * Dropping all missing values
    * Univariate imputation (Sklearn)
    * Multivariate imputation (Sklearn)
    * GAN-based imputation
    * Time-series based imputation (long-run variance)
  * Stationarity in multivariate time series

  * Feature generation:
    * Log transformation
    * Box-Cox transformation
    * Correlation Analysis
    * VIF Analysis
    * Handling outliers
  
2. Approach for prediction:
  * Regression-based
  * Classification-based

3. Forecasting

---
4. Creating our dataset

5. Repeat the steps above on our dataset

In [0]:
# Helper libraries
import numpy as np
import pandas as pd

# Plots
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

# Time-series Models
from statsmodels.tsa.seasonal import seasonal_decompose
from dateutil.parser import parse

# Imputation
# IterativeImputer is experimental: this import must come first to enable it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

import warnings
warnings.filterwarnings("ignore")

# Importing Data

In [0]:
# Google Drive folder that holds the competition files
DATA_DIR = "/content/drive/My Drive/PROJECTS/DSL/Kaggle"

train = pd.read_csv(f"{DATA_DIR}/train.csv")
test = pd.read_csv(f"{DATA_DIR}/test.csv")
regions = pd.read_csv(f"{DATA_DIR}/regions.csv")

In [0]:
train.head()

,Id,region,date,mortality_rate,O3,PM10,PM25,NO2,T2M
0,1,E12000001,2007-01-02,2.264,42.358,9.021,,,278.138
1,2,E12000001,2007-01-03,2.03,49.506,5.256,,,281.745
2,3,E12000001,2007-01-04,1.874,51.101,4.946,,,280.523
3,4,E12000001,2007-01-05,2.069,47.478,6.823,,,280.421
4,5,E12000001,2007-01-06,1.913,45.226,7.532,,,278.961


In [0]:
test.head()

,Id,region,date,O3,PM10,PM25,NO2,T2M
0,18404,E12000006,2012-05-28,75.98,20.876,19.123,9.713,290.787
1,18405,E12000006,2012-05-29,73.084,21.66,17.794,8.417,288.474
2,18406,E12000006,2012-05-30,59.35,21.925,17.699,10.878,289.889
3,18407,E12000006,2012-05-31,45.991,14.549,11.386,10.302,287.815
4,18408,E12000006,2012-06-01,52.21,11.208,9.545,8.598,287.627


In [0]:
regions.head()

,Code,Region
0,E12000001,North East
1,E12000002,North West
2,E12000003,Yorkshire and The Humber
3,E12000004,East Midlands
4,E12000005,West Midlands


In [0]:
train.shape

(18403, 9)

In [0]:
test.shape

(7886, 8)

In [0]:
regions.shape

(9, 2)

# Data Cleaning

In [0]:
y_train = train['mortality_rate']

In [0]:
# Remove the target so train has the same feature columns as test
train.drop(columns=['mortality_rate'], inplace=True)

In [0]:
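# Stack train and test so cleaning is applied to both; ignore_index avoids duplicate row labels
data = pd.concat([train, test], ignore_index=True)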

In [0]:
data.shape

(26289, 8)
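
Since `train` and `test` are stacked only so that the cleaning steps run on both at once, they have to be separated again before modelling. A minimal sketch of that round trip, assuming the row order of the concatenation is left unchanged. The toy frames and the `split_back` helper are hypothetical stand-ins, not part of the original notebook:

```python
import pandas as pd

# Toy stand-ins for the real train/test frames (hypothetical values).
train_demo = pd.DataFrame({"Id": [1, 2, 3], "O3": [42.4, 49.5, 51.1]})
test_demo = pd.DataFrame({"Id": [4, 5], "O3": [76.0, 73.1]})


def split_back(combined, n_train):
    """Split a row-wise concatenation back into its train and test parts.

    Assumes the first n_train rows of `combined` came from train and that
    no rows were dropped or reordered in between.
    """
    train_part = combined.iloc[:n_train].reset_index(drop=True)
    test_part = combined.iloc[n_train:].reset_index(drop=True)
    return train_part, test_part


combined_demo = pd.concat([train_demo, test_demo], ignore_index=True)
train_back, test_back = split_back(combined_demo, n_train=len(train_demo))
```

With the real data, `n_train` would be the 18403 training rows shown by `train.shape` above. A step that drops rows (such as dropping all missing values) breaks the positional assumption, so in that case a marker column is the safer choice.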

In [0]:
data.isna().sum()

Id           0
region       0
date         0
O3           9
PM10         9
PM25      3276
NO2       6570
T2M          0
dtype: int64
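
Of the 26289 rows, `PM25` is missing in about 12.5% and `NO2` in about 25%, while `O3` and `PM10` are missing only 9 values each. A minimal sketch of the univariate and multivariate options listed in the outline, run on a small synthetic frame rather than the competition data. The synthetic values, the `median` strategy, `max_iter=10`, and `random_state=0` are illustrative assumptions, not settings taken from this notebook:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

# Synthetic pollutant-like columns; PM25 is built to correlate with PM10.
rng = np.random.default_rng(0)
n = 200
pm10 = rng.gamma(shape=4.0, scale=4.0, size=n)
demo = pd.DataFrame({
    "PM10": pm10,
    "PM25": 0.7 * pm10 + rng.normal(0.0, 1.0, size=n),
    "NO2": rng.gamma(shape=3.0, scale=3.0, size=n),
})

# Remove values at random so there is something to impute.
demo.loc[rng.choice(n, size=40, replace=False), "PM25"] = np.nan
demo.loc[rng.choice(n, size=60, replace=False), "NO2"] = np.nan

# Share of missing values per column (the relative version of isna().sum()).
missing_share = demo.isna().mean()

# Univariate: each column is filled with its own median.
univariate = SimpleImputer(strategy="median")
demo_uni = pd.DataFrame(univariate.fit_transform(demo), columns=demo.columns)

# Multivariate: each column is modelled from the other columns, iteratively.
multivariate = IterativeImputer(max_iter=10, random_state=0)
demo_multi = pd.DataFrame(multivariate.fit_transform(demo), columns=demo.columns)
```

On the real `data` frame the imputers would be applied to the numeric columns only (`O3`, `PM10`, `PM25`, `NO2`, `T2M`), since `region` and `date` are strings.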