This notebook is part of the ultimate student challenge hosted by Analytics Vidhya. In it, I use several machine learning and intuitive methods to fill in missing values and so improve the training data.

In [1]:
# Import pandas to manipulate dataframes
import pandas as pd

In [2]:
# Read train and test data
train_data = pd.read_csv("data/train_data.csv")
test_data = pd.read_csv("data/test_data.csv")

In [3]:
train_data.columns

Index(['ID', 'Park_ID', 'Date', 'Direction_Of_Wind', 'Average_Breeze_Speed',
       'Max_Breeze_Speed', 'Min_Breeze_Speed', 'Var1',
       'Average_Atmospheric_Pressure', 'Max_Atmospheric_Pressure',
       'Min_Atmospheric_Pressure', 'Min_Ambient_Pollution',
       'Max_Ambient_Pollution', 'Average_Moisture_In_Park',
       'Max_Moisture_In_Park', 'Min_Moisture_In_Park', 'Location_Type',
       'Footfall'],
      dtype='object')

In [4]:
train_data.head(5)

|  | ID | Park_ID | Date | Direction_Of_Wind | Average_Breeze_Speed | Max_Breeze_Speed | Min_Breeze_Speed | Var1 | Average_Atmospheric_Pressure | Max_Atmospheric_Pressure | Min_Atmospheric_Pressure | Min_Ambient_Pollution | Max_Ambient_Pollution | Average_Moisture_In_Park | Max_Moisture_In_Park | Min_Moisture_In_Park | Location_Type | Footfall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3311712 | 12 | 01-09-1990 | 194.0 | 37.24 | 60.8 | 15.2 | 92.13 | 8225.0 | 8259.0 | 8211.0 | 92.0 | 304.0 | 255.0 | 288.0 | 222.0 | 3 | 1406 |
| 1 | 3311812 | 12 | 02-09-1990 | 285.0 | 32.68 | 60.8 | 7.6 | 14.11 | 8232.0 | 8280.0 | 8205.0 | 172.0 | 332.0 | 252.0 | 297.0 | 204.0 | 3 | 1409 |
| 2 | 3311912 | 12 | 03-09-1990 | 319.0 | 43.32 | 60.8 | 15.2 | 35.69 | 8321.0 | 8355.0 | 8283.0 | 236.0 | 292.0 | 219.0 | 279.0 | 165.0 | 3 | 1386 |
| 3 | 3312012 | 12 | 04-09-1990 | 297.0 | 25.84 | 38.0 | 7.6 | 0.0249 | 8379.0 | 8396.0 | 8358.0 | 272.0 | 324.0 | 225.0 | 261.0 | 192.0 | 3 | 1365 |
| 4 | 3312112 | 12 | 05-09-1990 | 207.0 | 28.88 | 45.6 | 7.6 | 0.83 | 8372.0 | 8393.0 | 8335.0 | 236.0 | 332.0 | 234.0 | 273.0 | 183.0 | 3 | 1413 |


There are a number of things we can do to improve our training data:  
1. Machine learning algorithms cannot handle date strings directly, so we split each date into day, month and year.
2. ID is just an increasing record identifier. It is not an informative feature, so we discard it.

In [5]:
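# Split date into day, month and year in train and test data.
# The sample rows show consecutive dates (01-09-1990, 02-09-1990, ...), so the
# strings are day-first; an explicit format avoids swapping day and month.
train_dtObj = pd.DatetimeIndex(pd.to_datetime(train_data['Date'], format="%d-%m-%Y"))
train_data['year'] = train_dtObj.year
train_data['month'] = train_dtObj.month
train_data['day'] = train_dtObj.day

test_dtObj = pd.DatetimeIndex(pd.to_datetime(test_data['Date'], format="%d-%m-%Y"))
test_data['year'] = test_dtObj.year
test_data['month'] = test_dtObj.month
test_data['day'] = test_dtObj.day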

# Save Ids of test data set
IDs = test_data.ID

# Drop ID and original date columns
# (the positional axis argument, drop('ID',1), is no longer accepted by recent pandas)
train_data = train_data.drop(columns=['ID', 'Date'])
test_data = test_data.drop(columns=['ID', 'Date'])
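
The sample rows shown earlier look day-first (01-09-1990, 02-09-1990, ...), so the result of the date split depends on how the strings are parsed. The following is a minimal sketch on made-up dates, not on the competition files, showing how an explicit format pins down the day and month; the `dates` and `parts` names are only illustrative.

```python
import pandas as pd

# Toy dates in the same DD-MM-YYYY layout as the sample rows (assumed layout)
dates = pd.Series(["01-09-1990", "02-09-1990", "13-09-1990"])

# An explicit format removes any day/month ambiguity
parsed = pd.to_datetime(dates, format="%d-%m-%Y")

# Split each timestamp into its calendar parts
parts = pd.DataFrame({
    "year": parsed.dt.year,
    "month": parsed.dt.month,
    "day": parsed.dt.day,
})
print(parts)
```

With the format stated explicitly, every toy row lands in month 9, and the day column carries the leading number of each string.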

In [6]:
train_data.isnull().sum()

Park_ID                             0
Direction_Of_Wind                3931
Average_Breeze_Speed             3931
Max_Breeze_Speed                 3936
Min_Breeze_Speed                 3934
Var1                             8282
Average_Atmospheric_Pressure    40195
Max_Atmospheric_Pressure        40195
Min_Atmospheric_Pressure        40195
Min_Ambient_Pollution           31645
Max_Ambient_Pollution           31645
Average_Moisture_In_Park           40
Max_Moisture_In_Park               40
Min_Moisture_In_Park               40
Location_Type                       0
Footfall                            0
year                                0
month                               0
day                                 0
dtype: int64

In [7]:
test_data.isnull().sum()

Park_ID                             0
Direction_Of_Wind                1493
Average_Breeze_Speed             1493
Max_Breeze_Speed                 1493
Min_Breeze_Speed                 1493
Var1                             2920
Average_Atmospheric_Pressure    13173
Max_Atmospheric_Pressure        13173
Min_Atmospheric_Pressure        13173
Min_Ambient_Pollution            9655
Max_Ambient_Pollution            9655
Average_Moisture_In_Park           39
Max_Moisture_In_Park               39
Min_Moisture_In_Park               39
Location_Type                       0
year                                0
month                               0
day                                 0
dtype: int64

Many records in both the train and test data have missing values. We cannot simply discard them, as losing that much data can hurt performance badly. Instead, we can use some intuitive methods and machine learning models to estimate the missing values.
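
To make the cost of discarding rows concrete, here is a small sketch on a made-up frame (the values are invented; only the column names are borrowed from the data). It reports the share of missing entries per column and how many rows a plain `dropna()` would keep.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; values are illustrative only
toy = pd.DataFrame({
    "Park_ID": [12, 12, 13, 13],
    "Var1": [92.13, np.nan, 35.69, np.nan],
    "Average_Atmospheric_Pressure": [8225.0, np.nan, np.nan, np.nan],
})

# Fraction of missing entries per column, largest first
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Number of rows that survive if every incomplete row is dropped
rows_left = len(toy.dropna())
print("rows left after dropna():", rows_left)
```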

**I use the following 6 approaches to estimate missing values**

1. Generalized linear models
2. Group by month and fill missing values with the mean of each column
3. K-Nearest neighbour regressor
4. Random forest regressor
5. Gradient boosting regressor
6. Average all models to reduce variance
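
The model-based approaches in this list share one pattern: fit a regressor on the rows where a column is known, then predict the rows where it is missing. A minimal sketch of that pattern, assuming a toy frame in which `Var1` depends linearly on `month` (an invented relationship, chosen so the result is easy to check):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data: Var1 is exactly 10 * month where it is observed
toy = pd.DataFrame({
    "month": [1, 2, 3, 4, 5, 6],
    "Var1": [10.0, 20.0, np.nan, 40.0, np.nan, 60.0],
})

# Rows where the target column is observed
known = toy["Var1"].notnull()

# Fit on the observed rows, then fill the gaps with predictions
model = LinearRegression()
model.fit(toy.loc[known, ["month"]], toy.loc[known, "Var1"])
toy.loc[~known, "Var1"] = model.predict(toy.loc[~known, ["month"]])
print(toy)
```

Any scikit-learn regressor with `fit` and `predict` can be swapped in for `LinearRegression` here.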

In [8]:
# Import machine learning algorithms to predict missing values
from sklearn.linear_model import LinearRegression,RidgeCV,LassoCV
from sklearn.ensemble import GradientBoostingRegressor,RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

In [9]:
def impute_missing_values_using_all_features(_data,model):
    """Fill NaNs with group means first, then predict what is left with `model`."""
    data = _data.copy()
    group_keys = ["year","month","Location_Type","Park_ID"]

    # Step 1: fill each gap with the mean of its (year, month, location, park) group
    columns_with_NA = data.columns[data.isnull().any()]
    for column_name in columns_with_NA:
        data[column_name] = data.groupby(group_keys)[column_name].transform(lambda x: x.fillna(x.mean()))

    # Step 2: for columns that still have gaps, train on the rows where the
    # column is known and predict the rows where it is missing
    columns_with_NA = data.columns[data.isnull().any()]
    for missingF in columns_with_NA:
        # Use every column that is currently complete as a predictor
        # (in the train data this includes Footfall)
        good_features = data.columns[data.isnull().sum() == 0]
        print("Finding missing values for ",missingF)
        missing_rows = data[missingF].isnull()
        x_train = data.loc[~missing_rows, good_features]
        y = data.loc[~missing_rows, missingF]
        x_test = data.loc[missing_rows, good_features]
        model.fit(x_train,y)
        data.loc[missing_rows,missingF] = model.predict(x_test)
    return data

In [10]:
def impute_missing_values_using_categorical_features(_data,model):
    """Predict every column that has NaNs from the date and location columns only."""
    data = _data.copy()
    # These predictors are always complete: calendar parts plus park/location identifiers
    good_features = ["year","month","Location_Type","Park_ID","day"]

    columns_with_NA = data.columns[data.isnull().any()]
    for missingF in columns_with_NA:
        print("Finding missing values for ",missingF)
        missing_rows = data[missingF].isnull()
        # Train on rows where the column is known, predict rows where it is missing
        x_train = data.loc[~missing_rows, good_features]
        y = data.loc[~missing_rows, missingF]
        x_test = data.loc[missing_rows, good_features]
        model.fit(x_train,y)
        data.loc[missing_rows,missingF] = model.predict(x_test)
    return data

In [11]:
def impute_missing_values_combined(_data,model):
    """Group-mean fill, then predict "Var1" first, then the remaining columns."""
    data = _data.copy()
    group_keys = ["year","month","Location_Type","Park_ID"]

    # Step 1: fill each gap with the mean of its (year, month, location, park) group
    columns_with_NA = data.columns[data.isnull().any()]
    for column_name in columns_with_NA:
        data[column_name] = data.groupby(group_keys)[column_name].transform(lambda x: x.fillna(x.mean()))

    # Step 2: predict "Var1" first, using only the columns that are already complete
    good_features = data.columns[data.isnull().sum() == 0]
    missing_rows = data["Var1"].isnull()
    if missing_rows.any():
        x_train = data.loc[~missing_rows, good_features]
        y = data.loc[~missing_rows, "Var1"]
        x_test = data.loc[missing_rows, good_features]
        model.fit(x_train,y)
        data.loc[missing_rows,"Var1"] = model.predict(x_test)

    # Step 3: predict the remaining columns; "Var1" is now complete and acts as a predictor
    columns_with_NA = data.columns[data.isnull().any()]
    for missingF in columns_with_NA:
        good_features = data.columns[data.isnull().sum() == 0]
        missing_rows = data[missingF].isnull()
        x_train = data.loc[~missing_rows, good_features]
        y = data.loc[~missing_rows, missingF]
        x_test = data.loc[missing_rows, good_features]
        model.fit(x_train,y)
        data.loc[missing_rows,missingF] = model.predict(x_test)
    return data

### Featureset 1 - Linear, Ridge, Lasso average using combined features

In [12]:
# Initialize regression models with parameters
ridge_model = RidgeCV(alphas=[1,0.1,0.01,0.001,0.0001])
lasso_model = LassoCV(alphas=[1,0.1,0.01,0.001,0.0001])
reg_model = LinearRegression()

# Impute missing values using approach 3: both categorical and continuous variables (Predict var 1 first)
tr11 = impute_missing_values_combined(train_data,ridge_model)
tr12 = impute_missing_values_combined(train_data,lasso_model)
tr13 = impute_missing_values_combined(train_data,reg_model)

ts11 = impute_missing_values_combined(test_data,ridge_model)
ts12 = impute_missing_values_combined(test_data,lasso_model)
ts13 = impute_missing_values_combined(test_data,reg_model)

# Average all regression models to reduce variance
tr1 = (tr11 + tr12 + tr13)/3
ts1 = (ts11 + ts12 + ts13)/3

# Print number of missing values to make sure there are no missing values
print("Missing values in featureset 1: in train - ",tr1.isnull().sum().sum()," in test - ",ts1.isnull().sum().sum())

# Save featureset 1 into filesystem
tr1.to_csv("featuresets/tr1.csv")
ts1.to_csv("featuresets/ts1.csv")

Missing values in featureset 1: in train -  0  in test -  0
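
`RidgeCV` and `LassoCV` choose their regularization strength by cross-validation among the supplied `alphas`. The sketch below, on synthetic data that is an assumption of this example and unrelated to the park data, shows how to inspect which candidate `RidgeCV` selected.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic regression problem (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=40)

# Same candidate grid as the notebook cell
candidate_alphas = [1, 0.1, 0.01, 0.001, 0.0001]
ridge = RidgeCV(alphas=candidate_alphas).fit(X, y)

# alpha_ holds the candidate that scored best in cross-validation
print("alpha chosen by cross-validation:", ridge.alpha_)
```

Because each call to the imputation function refits the model per column, the selected `alpha_` can differ from one imputed column to the next.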


### Featureset 2 - Group by month and average columns

In [13]:
# Group train and test data by month and fill missing values with the per-month mean of each column
tr2 = train_data.groupby(["month"]).transform(lambda x: x.fillna(x.mean()))
ts2 = test_data.groupby(["month"]).transform(lambda x: x.fillna(x.mean()))

# Add month column as it is 
tr2["month"] = train_data["month"]
ts2["month"] = test_data["month"]

# Print number of missing values to make sure there are no missing values
print("Missing values in featureset 2: in train - ",tr2.isnull().sum().sum()," in test - ",ts2.isnull().sum().sum())

# Save featureset 2 into filesystem
tr2.to_csv("featuresets/tr2.csv")
ts2.to_csv("featuresets/ts2.csv")

Missing values in featureset 2: in train -  0  in test -  0
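
A small sketch of the group-mean fill on invented values makes its behaviour, and one caveat, easy to see: a group whose values are all missing has no mean, so its gaps stay unfilled. The toy frame and the month 11 group are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: month 11 has no observed Var1 at all
toy = pd.DataFrame({
    "month": [9, 9, 9, 10, 10, 11],
    "Var1": [1.0, np.nan, 3.0, 10.0, np.nan, np.nan],
})

# Replace each NaN with the mean of its own month
filled = toy.groupby("month")["Var1"].transform(lambda s: s.fillna(s.mean()))
print(filled)

# Gaps that remain because their whole group was missing
still_missing = int(filled.isnull().sum())
print("still missing:", still_missing)
```

This is why the final missing-value count is worth printing after the fill, as the cell above does.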


### Featureset 3 - KNN with all features

In [14]:
# Initialize nearest neighbour regressor with parameters
knr = KNeighborsRegressor(n_neighbors=5)

# Impute missing values using approach 1: all features
tr3 = impute_missing_values_using_all_features(train_data,knr)
ts3 = impute_missing_values_using_all_features(test_data,knr)

# Print number of missing values to make sure there are no missing values
print("Missing values in featureset 3: in train - ",tr3.isnull().sum().sum()," in test - ",ts3.isnull().sum().sum())

# Save featureset 3 into filesystem
tr3.to_csv("featuresets/tr3.csv")
ts3.to_csv("featuresets/ts3.csv")

Finding missing values for  Direction_Of_Wind
Finding missing values for  Average_Breeze_Speed
Finding missing values for  Max_Breeze_Speed
Finding missing values for  Min_Breeze_Speed
Finding missing values for  Var1
Finding missing values for  Average_Atmospheric_Pressure
Finding missing values for  Max_Atmospheric_Pressure
Finding missing values for  Min_Atmospheric_Pressure
Finding missing values for  Min_Ambient_Pollution
Finding missing values for  Max_Ambient_Pollution
Finding missing values for  Direction_Of_Wind
Finding missing values for  Average_Breeze_Speed
Finding missing values for  Max_Breeze_Speed
Finding missing values for  Min_Breeze_Speed
Finding missing values for  Var1
Finding missing values for  Average_Atmospheric_Pressure
Finding missing values for  Max_Atmospheric_Pressure
Finding missing values for  Min_Atmospheric_Pressure
Finding missing values for  Min_Ambient_Pollution
Finding missing values for  Max_Ambient_Pollution
Missing values in featureset 3: in tra

### Featureset 4 - Random forest regressor with categorical features

In [15]:
# Initialize random forest regressor with parameters
rfr = RandomForestRegressor(n_estimators=70)

# Impute missing values using approach 2: categorical features
tr4 = impute_missing_values_using_categorical_features(train_data,rfr)
ts4 = impute_missing_values_using_categorical_features(test_data,rfr)

# Print number of missing values to make sure there are no missing values
print("Missing values in featureset 4: in train - ",tr4.isnull().sum().sum()," in test - ",ts4.isnull().sum().sum())

# Save featureset 4 into filesystem
tr4.to_csv("featuresets/tr4.csv")
ts4.to_csv("featuresets/ts4.csv")

Finding missing values for  Direction_Of_Wind
Finding missing values for  Average_Breeze_Speed
Finding missing values for  Max_Breeze_Speed
Finding missing values for  Min_Breeze_Speed
Finding missing values for  Var1
Finding missing values for  Average_Atmospheric_Pressure
Finding missing values for  Max_Atmospheric_Pressure
Finding missing values for  Min_Atmospheric_Pressure
Finding missing values for  Min_Ambient_Pollution
Finding missing values for  Max_Ambient_Pollution
Finding missing values for  Average_Moisture_In_Park
Finding missing values for  Max_Moisture_In_Park
Finding missing values for  Min_Moisture_In_Park
Finding missing values for  Direction_Of_Wind
Finding missing values for  Average_Breeze_Speed
Finding missing values for  Max_Breeze_Speed
Finding missing values for  Min_Breeze_Speed
Finding missing values for  Var1
Finding missing values for  Average_Atmospheric_Pressure
Finding missing values for  Max_Atmospheric_Pressure
Finding missing values for  Min_Atmosphe

### Featureset 5 - GBM with categorical features

In [16]:
# Initialize gbm regressor with parameters
# (current scikit-learn requires an integer min_samples_split of at least 2)
gbm = GradientBoostingRegressor(n_estimators=200,learning_rate=0.2,max_depth=4, min_samples_split=2)

# Impute missing values using approach 2: categorical features
tr5 = impute_missing_values_using_categorical_features(train_data,gbm)
ts5 = impute_missing_values_using_categorical_features(test_data,gbm)

# Print number of missing values to make sure there are no missing values
print("Missing values in featureset 5: in train - ",tr5.isnull().sum().sum()," in test - ",ts5.isnull().sum().sum())

# Save featureset 5 into filesystem
tr5.to_csv("featuresets/tr5.csv")
ts5.to_csv("featuresets/ts5.csv")

Finding missing values for  Direction_Of_Wind
Finding missing values for  Average_Breeze_Speed
Finding missing values for  Max_Breeze_Speed
Finding missing values for  Min_Breeze_Speed
Finding missing values for  Var1
Finding missing values for  Average_Atmospheric_Pressure
Finding missing values for  Max_Atmospheric_Pressure
Finding missing values for  Min_Atmospheric_Pressure
Finding missing values for  Min_Ambient_Pollution
Finding missing values for  Max_Ambient_Pollution
Finding missing values for  Average_Moisture_In_Park
Finding missing values for  Max_Moisture_In_Park
Finding missing values for  Min_Moisture_In_Park
Finding missing values for  Direction_Of_Wind
Finding missing values for  Average_Breeze_Speed
Finding missing values for  Max_Breeze_Speed
Finding missing values for  Min_Breeze_Speed
Finding missing values for  Var1
Finding missing values for  Average_Atmospheric_Pressure
Finding missing values for  Max_Atmospheric_Pressure
Finding missing values for  Min_Atmosphe

### Featureset 6 - Average all

In [17]:
# Average featureset 1 to 5
tr6 = (tr1 + tr2 + tr3 + tr4 + tr5)/5
ts6 = (ts1 + ts2 + ts3 + ts4 + ts5)/5

# Print number of missing values to make sure there are no missing values
print("Missing values in featureset 6: in train - ",tr6.isnull().sum().sum()," in test - ",ts6.isnull().sum().sum())

# Save featureset 6 into filesystem
tr6.to_csv("featuresets/tr6.csv")
ts6.to_csv("featuresets/ts6.csv")

Missing values in featureset 6: in train -  0  in test -  0
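
Averaging DataFrames with `+` relies on pandas aligning by row and column labels rather than by position, which matters here because featureset 2 has its `month` column re-attached at the end. A minimal sketch on toy frames (the names `a`, `b` and `c` are illustrative):

```python
import pandas as pd

# Two imputed versions of the same data, with columns in a different order
a = pd.DataFrame({"Var1": [1.0, 2.0], "month": [9, 9]})
b = pd.DataFrame({"month": [9, 9], "Var1": [3.0, 4.0]})

# Addition aligns on labels, so the column order does not matter
avg = (a + b) / 2
print(avg)

# A column present in only one frame becomes NaN after alignment
c = a.drop(columns="month")
partial = a + c
print(partial)
```

If any featureset lacked a column, the average would contain NaN in that column, which the missing-value count printed above would reveal.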
