In [9]:
import os
import sys
import glob
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt 

# ML related imports


# AT utility imports
from preprocessing import read_split


mpl.style.use("bmh")
%config InlineBackend.figure_format = 'retina'

# Reading data

Read the original `dins` file, a heterogeneous tabular dataset, then run a simple exploration of the distribution and nature of its features.

In [14]:
X_train_full, X_test, y_train_full, y_test, col_names = read_split("dins_2017_2022.csv")
X_train_full.sample(5)

Unnamed: 0,ROOFCONSTRUCTION,EAVES,VENTSCREEN,EXTERIORSIDING,WINDOWPANE,DECKPORCHONGRADE,DECKPORCHELEVATED,PATIOCOVER,FENCE,YEARBUILT,LATITUDE,LONGITUDE,DISTANCE
44394,Fire Resistant,Unenclosed,Screened,Combustible,Single Pane,,,,,,38.480848,-122.750318,50.331979
24467,Asphalt,Unenclosed,"Mesh Screen > 1/8""",Combustible,Multi Pane,Composite,Unknown,No Patio Cover/Carport,Non Combustible,1962.0,39.760703,-121.64281,73.647705
40282,Metal,Unenclosed,No Vents,Metal,Single Pane,No Deck/Porch,No Deck/Porch,No Patio Cover/Carport,No Fence,,40.387061,-122.695311,19.370348
27240,Tile,Unenclosed,"Mesh Screen > 1/8""",Wood,Multi Pane,Masonry/Concrete,No Deck/Porch,No Patio Cover/Carport,Combustible,1985.0,38.505486,-122.644597,140.115956
52647,Tile,Unknown,"Mesh Screen > 1/8""",Combustible,Single Pane,No Deck/Porch,No Deck/Porch,Combustible,Non Combustible,1978.0,40.594664,-122.420977,393.142494


In [16]:
X_train_full.info()

<class 'pandas.core.frame.DataFrame'>
Index: 69725 entries, 49272 to 49682
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ROOFCONSTRUCTION   66280 non-null  object 
 1   EAVES              66220 non-null  object 
 2   VENTSCREEN         66180 non-null  object 
 3   EXTERIORSIDING     66267 non-null  object 
 4   WINDOWPANE         66212 non-null  object 
 5   DECKPORCHONGRADE   56244 non-null  object 
 6   DECKPORCHELEVATED  56242 non-null  object 
 7   PATIOCOVER         56238 non-null  object 
 8   FENCE              56242 non-null  object 
 9   YEARBUILT          42390 non-null  float64
 10  LATITUDE           69725 non-null  float64
 11  LONGITUDE          69725 non-null  float64
 12  DISTANCE           69725 non-null  float64
dtypes: float64(4), object(9)
memory usage: 7.4+ MB


# Data processing

The data preprocessing steps are as follows:

1. Separate the data into train and test sets, with 20% going to the test set.
2. Design imputation strategies, fit them on the train set only, and apply the fitted imputers to both the train and test sets.
3. To enable use of a variety of models:
    - Normalize the numerical variables
    - Conduct `OneHotEncoding` on categorical variables
4. Resample so that all classes are equally represented in the train set.
5. If necessary, apply a `PCA` transformation.
6. Put all steps into a pipeline under one function.
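
Steps 1 to 4 above can be sketched with scikit-learn as follows. This is a minimal illustration on made-up data, not the notebook's actual pipeline: the toy table, the target labels, the choice of median and most-frequent imputers, and the use of `sklearn.utils.resample` for oversampling are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.utils import resample

# Toy stand-in for the real table; all values are made up for illustration.
rng = np.random.default_rng(0)
n = 200
roof = rng.choice(["Tile", "Metal", "Asphalt"], size=n).astype(object)
year = rng.integers(1950, 2020, size=n).astype(float)
roof[::7] = np.nan  # inject missing categories
year[::5] = np.nan  # inject missing years
X = pd.DataFrame({"ROOFCONSTRUCTION": roof, "YEARBUILT": year,
                  "DISTANCE": rng.uniform(0, 400, size=n)})
# Hypothetical imbalanced target; the real label column is not shown here.
y = pd.Series(rng.choice(["Destroyed", "No Damage"], size=n, p=[0.8, 0.2]))

# Step 1: hold out 20% of the rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Steps 2-3: impute, then scale numeric columns and one-hot encode categoricals.
num_cols = ["YEARBUILT", "DISTANCE"]
cat_cols = ["ROOFCONSTRUCTION"]
preprocess = ColumnTransformer(
    [
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), num_cols),
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
         cat_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)
X_train_prep = preprocess.fit_transform(X_train)  # statistics learned on train only
X_test_prep = preprocess.transform(X_test)        # reused unchanged on test

# Step 4: draw the same number of rows (with replacement) from every class,
# in the train set only.
y_arr = y_train.to_numpy()
classes, counts = np.unique(y_arr, return_counts=True)
idx = np.concatenate([
    resample(np.flatnonzero(y_arr == c), replace=True,
             n_samples=int(counts.max()), random_state=0)
    for c in classes
])
X_train_bal, y_train_bal = X_train_prep[idx], y_arr[idx]
```

Fitting the transformer on the train split and only calling `transform` on the test split keeps test-set statistics out of the imputation and scaling parameters.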

## Imputation strategies

The strategy differs by feature type, and even among features within the `categorical` and `numerical` groups.
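
As an illustration of what per-feature imputation can look like, the sketch below routes different columns to different `SimpleImputer` strategies through a `ColumnTransformer`. The column-to-strategy assignment here is a placeholder chosen for the example, not the strategy adopted in this notebook, and the toy rows are made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Made-up rows using column names from the dataset.
toy = pd.DataFrame({
    "EAVES": ["Unenclosed", np.nan, "Unknown", "Unenclosed"],
    "FENCE": ["No Fence", "Combustible", np.nan, "No Fence"],
    "YEARBUILT": [1962.0, np.nan, 1985.0, 1978.0],
})

# Placeholder assignment: one strategy per column group.
imputer = ColumnTransformer([
    # Fill with an explicit "Unknown" category.
    ("cat_unknown", SimpleImputer(strategy="constant", fill_value="Unknown"),
     ["EAVES"]),
    # Fill with the most frequent category seen during fit.
    ("cat_mode", SimpleImputer(strategy="most_frequent"), ["FENCE"]),
    # Fill with the median of the observed values.
    ("num_median", SimpleImputer(strategy="median"), ["YEARBUILT"]),
])

# Output columns follow the order of the transformers listed above.
filled = pd.DataFrame(imputer.fit_transform(toy),
                      columns=["EAVES", "FENCE", "YEARBUILT"])
print(filled)
```

In practice the imputer would be fitted on the train set and then applied to the test set with `transform`, so that fill values come from training data only.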

The strategy adopted for each feature is as follows:

- 