<a href="https://colab.research.google.com/github/a-apte/DS1_Project_2/blob/master/Project_2_Kaggle_Dataset_Tanzania_Water.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Data Overview

**Features**

The goal is to predict the operating condition of a waterpoint for each record in the dataset. The following information is provided about each waterpoint:

      amount_tsh : Total static head (amount water available to waterpoint) - numeric
      date_recorded : The date the row was entered - datetime
      funder : Who funded the well
      gps_height : Altitude of the well
      installer : Organization that installed the well
      longitude : GPS coordinate
      latitude : GPS coordinate
      wpt_name : Name of the waterpoint if there is one
      num_private :
      basin : Geographic water basin
      subvillage : Geographic location
      region : Geographic location
      region_code : Geographic location (coded)
      district_code : Geographic location (coded)
      lga : Geographic location
      ward : Geographic location
      population : Population around the well
      public_meeting : True/False
      recorded_by : Group entering this row of data
      scheme_management : Who operates the waterpoint
      scheme_name : Who operates the waterpoint
      permit : If the waterpoint is permitted
      construction_year : Year the waterpoint was constructed
      extraction_type : The kind of extraction the waterpoint uses
      extraction_type_group : The kind of extraction the waterpoint uses
      extraction_type_class : The kind of extraction the waterpoint uses
      management : How the waterpoint is managed
      management_group : How the waterpoint is managed
      payment : What the water costs
      payment_type : What the water costs
      water_quality : The quality of the water
      quality_group : The quality of the water
      quantity : The quantity of water
      quantity_group : The quantity of water
      source : The source of the water
      source_type : The source of the water
      source_class : The source of the water
      waterpoint_type : The kind of waterpoint
      waterpoint_type_group : The kind of waterpoint


**Labels**

There are three possible values:

      functional : the waterpoint is operational and there are no repairs needed
      functional needs repair : the waterpoint is operational, but needs repairs
      non functional : the waterpoint is not operational

### File Loading

In [0]:
!pip install kaggle



In [0]:
# Note: you also have to sign up for Kaggle and authorize the API
# https://github.com/Kaggle/kaggle-api#api-credentials

# Authorizing means providing a kaggle.json file that holds your credentials.
# Keep the key itself out of the notebook; on Colab the file can live in Google Drive.
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
%env KAGGLE_CONFIG_DIR=/content/drive/My Drive/

In [0]:
!kaggle competitions list

In [0]:

!kaggle competitions download -c ds1-predictive-modeling-challenge

In [0]:
!unzip test_features.csv.zip
!unzip train_labels.csv.zip
!unzip train_features.csv.zip


### First Data Investigation

In [0]:
import pandas as pd

test_features = pd.read_csv('test_features.csv')

test_features.head()

In [0]:
train_features = pd.read_csv('train_features.csv')

train_features.head()

In [0]:
train_labels = pd.read_csv('train_labels.csv')

train_labels.head()

In [0]:
train_features.dtypes

In [0]:
test_features.dtypes

In [0]:
train_labels.dtypes

In [0]:
train_labels.shape, train_features.shape, test_features.shape

In [0]:
train_labels.isnull().sum()

In [0]:
train_features.isnull().sum()

In [0]:
train_features.dropna().shape

In [0]:
test_features.isnull().sum()

In [0]:
test_features.dropna().shape

In [0]:
test_features.columns

In [0]:
null_columns = ['funder', 'installer', 'subvillage', 'public_meeting', 'scheme_management', 'scheme_name', 'permit']

train_clean = train_features.drop(null_columns, axis='columns')

test_clean = test_features.drop(null_columns, axis='columns')

train_clean.shape, test_clean.shape

In [0]:
train_clean.isnull().sum().sum(), test_clean.isnull().sum().sum()

In [0]:
sample_sub = pd.read_csv('sample_submission.csv')

print (sample_sub.shape)

sample_sub.head()

### Exporting Files to Local Machine (will later be uploaded to GitHub)

In [0]:
# from google.colab import files

# sample_sub.to_csv('sample_submission.csv')
# files.download('sample_submission.csv')

In [0]:
# train_labels.to_csv('train_labels.csv')
# files.download('train_labels.csv')

In [0]:
# train_features.to_csv('train_features.csv')
# files.download('train_features.csv')

In [0]:
# test_features.to_csv('test_features.csv')
# files.download('test_features.csv')

In [0]:
# train_clean.to_csv('train_clean.csv')
# files.download('train_clean.csv')

In [0]:
# test_clean.to_csv('test_clean.csv')
# files.download('test_clean.csv')

### Importing the Same Files from GitHub (START HERE)

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.style as style
style.use('seaborn-whitegrid')
import seaborn as sns

# 'display.height' is no longer a pandas option, so only the supported ones are set
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [0]:
train_feature_url = 'https://raw.githubusercontent.com/a-apte/DS1_Project_2/master/Input_DataSets/train_features.csv'
test_feature_url = 'https://raw.githubusercontent.com/a-apte/DS1_Project_2/master/Input_DataSets/test_features.csv'
train_labels_url = 'https://raw.githubusercontent.com/a-apte/DS1_Project_2/master/Input_DataSets/train_labels.csv'

# train_feature_clean_url = 'https://raw.githubusercontent.com/a-apte/DS1_Project_2/master/Input_DataSets/train_clean.csv'
# test_feature_clean_url = 'https://raw.githubusercontent.com/a-apte/DS1_Project_2/master/Input_DataSets/test_clean.csv'

In [0]:
drop_col = ['Unnamed: 0']

# drop_col_label = ['Unnamed: 0']

train_F = pd.read_csv(train_feature_url, index_col='id').drop(drop_col,axis='columns')
test_F = pd.read_csv(test_feature_url, index_col='id').drop(drop_col,axis='columns')
train_L = pd.read_csv(train_labels_url, index_col='id').drop(drop_col,axis='columns')

# train_F_C = pd.read_csv(train_feature_clean_url, index_col='id').drop(drop_col,axis='columns')
# test_F_C = pd.read_csv(test_feature_clean_url, index_col='id').drop(drop_col,axis='columns')

train_F.shape, test_F.shape, train_L.shape  #,  train_F_C.shape, test_F_C.shape, # 

((59400, 39), (14358, 39), (59400, 1))

### Feature Cleanup and Engineering

**Finding Columns with Null Values for later**

In [0]:
# null_list = [train_F.isnull().sum() > 0]

null_list = []

for col in train_F.columns:
  if train_F[col].isnull().sum() > 0:
    null_list.append(col)
    
null_list.append('recorded_by')
null_list.append('date_recorded')
    
null_list

['funder',
 'installer',
 'subvillage',
 'public_meeting',
 'scheme_management',
 'scheme_name',
 'permit',
 'recorded_by',
 'date_recorded']
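
The loop above can also be written as a single boolean-mask expression. This is a minimal sketch on a toy frame; the `toy` DataFrame and its values are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_F
toy = pd.DataFrame({
    'funder': ['funder_a', None, 'funder_b'],
    'basin': ['basin_x', 'basin_y', 'basin_z'],
    'permit': [True, np.nan, False],
})

# isnull().any() is True for every column with at least one missing value
null_cols = toy.columns[toy.isnull().any()].tolist()
print(null_cols)
```

Columns that should be excluded for other reasons, such as `recorded_by` and `date_recorded` above, can still be appended to the resulting list afterwards.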

**Converting the date variable into a datetime object and splitting it into pieces**

In [0]:
def convert_datetime(df, col):
  # Parse the column as datetime and add calendar parts as new columns (modifies df in place)
  df[col] = pd.to_datetime(df[col])
  df['day_of_week'] = df[col].dt.day_name()  # .dt.weekday_name was removed in pandas 1.0
  df['year'] = df[col].dt.year
  df['month'] = df[col].dt.month
  df['day'] = df[col].dt.day

In [0]:
convert_datetime(train_F, 'date_recorded')
convert_datetime(test_F, 'date_recorded')

train_F.dtypes

amount_tsh                      float64
date_recorded            datetime64[ns]
funder                           object
gps_height                        int64
installer                        object
longitude                       float64
latitude                        float64
wpt_name                         object
num_private                       int64
basin                            object
subvillage                       object
region                           object
region_code                       int64
district_code                     int64
lga                              object
ward                             object
population                        int64
public_meeting                   object
recorded_by                      object
scheme_management                object
scheme_name                      object
permit                           object
construction_year                 int64
extraction_type                  object
extraction_type_group            object


**Converting code and name columns to category type**

In [0]:
# Treat coded and name-like columns as categories in both sets
for col in ['region_code', 'district_code', 'wpt_name', 'ward']:
  train_F[col] = train_F[col].astype('category')
  test_F[col] = test_F[col].astype('category')


train_F.dtypes

amount_tsh                      float64
date_recorded            datetime64[ns]
funder                           object
gps_height                        int64
installer                        object
longitude                       float64
latitude                        float64
wpt_name                       category
num_private                       int64
basin                            object
subvillage                       object
region                           object
region_code                    category
district_code                  category
lga                              object
ward                           category
population                        int64
public_meeting                   object
recorded_by                      object
scheme_management                object
scheme_name                      object
permit                           object
construction_year                 int64
extraction_type                  object
extraction_type_group            object


**Finding the number of unique values per column**

In [0]:
for col in train_F.columns:  
  print(col, train_F[col].nunique())

amount_tsh 98
date_recorded 356
funder 1897
gps_height 2428
installer 2145
longitude 57516
latitude 57517
wpt_name 37400
num_private 65
basin 9
subvillage 19287
region 21
region_code 27
district_code 20
lga 125
ward 2092
population 1049
public_meeting 2
recorded_by 1
scheme_management 12
scheme_name 2696
permit 2
construction_year 55
extraction_type 18
extraction_type_group 13
extraction_type_class 7
management 12
management_group 5
payment 7
payment_type 7
water_quality 8
quality_group 6
quantity 5
quantity_group 5
source 10
source_type 7
source_class 3
waterpoint_type 7
waterpoint_type_group 6
day_of_week 7
year 5
month 12
day 31


#### Splitting into Numeric and Nonnumeric variables

In [0]:
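def df_split(df):
  # Split a frame into its numeric columns and everything else
  numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
  # .copy() makes both halves independent frames, so later assignments do not warn
  df_num = df.select_dtypes(include=numerics).copy()
  df_cat = df.drop(df_num.columns, axis='columns').copy()
  print(df.shape, df_num.shape, df_cat.shape)
  return df_num, df_cat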
  

In [0]:
train_F_num, train_F_cat = df_split(train_F)

test_F_num, test_F_cat = df_split(test_F)

(59400, 43) (59400, 10) (59400, 33)
(14358, 43) (14358, 10) (14358, 33)


#### Fixing numeric sets

In [0]:
train_F_num.describe().T

| feature | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| amount_tsh | 59400.0 | 317.650385 | 2997.574558 | 0.0 | 0.0 | 0.0 | 20.0 | 350000.0 |
| gps_height | 59400.0 | 668.297239 | 693.11635 | -90.0 | 0.0 | 369.0 | 1319.25 | 2770.0 |
| longitude | 59400.0 | 34.077427 | 6.567432 | 0.0 | 33.090347 | 34.908743 | 37.178387 | 40.34519 |
| latitude | 59400.0 | -5.706033 | 2.946019 | -11.64944 | -8.540621 | -5.021597 | -3.326156 | -2e-08 |
| num_private | 59400.0 | 0.474141 | 12.23623 | 0.0 | 0.0 | 0.0 | 0.0 | 1776.0 |
| population | 59400.0 | 179.909983 | 471.482176 | 0.0 | 0.0 | 25.0 | 215.0 | 30500.0 |
| construction_year | 59400.0 | 1300.652475 | 951.620547 | 0.0 | 0.0 | 1986.0 | 2004.0 | 2013.0 |
| year | 59400.0 | 2011.921667 | 0.958758 | 2002.0 | 2011.0 | 2012.0 | 2013.0 | 2013.0 |
| month | 59400.0 | 4.37564 | 3.029247 | 1.0 | 2.0 | 3.0 | 7.0 | 12.0 |
| day | 59400.0 | 15.621498 | 8.687553 | 1.0 | 8.0 | 16.0 | 23.0 | 31.0 |


In [0]:
# Rows with construction_year == 0 get the year the row was recorded instead.
# A single .loc[rows, column] assignment avoids the chained-indexing warning.
train_zero = train_F_num['construction_year'] == 0
test_zero = test_F_num['construction_year'] == 0
train_F_num.loc[train_zero, 'construction_year'] = train_F_num.loc[train_zero, 'year']
test_F_num.loc[test_zero, 'construction_year'] = test_F_num.loc[test_zero, 'year']

train_F_num.describe().T

| feature | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| amount_tsh | 59400.0 | 317.650385 | 2997.574558 | 0.0 | 0.0 | 0.0 | 20.0 | 350000.0 |
| gps_height | 59400.0 | 668.297239 | 693.11635 | -90.0 | 0.0 | 369.0 | 1319.25 | 2770.0 |
| longitude | 59400.0 | 34.077427 | 6.567432 | 0.0 | 33.090347 | 34.908743 | 37.178387 | 40.34519 |
| latitude | 59400.0 | -5.706033 | 2.946019 | -11.64944 | -8.540621 | -5.021597 | -3.326156 | -2e-08 |
| num_private | 59400.0 | 0.474141 | 12.23623 | 0.0 | 0.0 | 0.0 | 0.0 | 1776.0 |
| population | 59400.0 | 179.909983 | 471.482176 | 0.0 | 0.0 | 25.0 | 215.0 | 30500.0 |
| construction_year | 59400.0 | 2001.919495 | 12.254881 | 1960.0 | 1996.0 | 2008.0 | 2011.0 | 2013.0 |
| year | 59400.0 | 2011.921667 | 0.958758 | 2002.0 | 2011.0 | 2012.0 | 2013.0 | 2013.0 |
| month | 59400.0 | 4.37564 | 3.029247 | 1.0 | 2.0 | 3.0 | 7.0 | 12.0 |
| day | 59400.0 | 15.621498 | 8.687553 | 1.0 | 8.0 | 16.0 | 23.0 | 31.0 |


**Creating new variables "distance" and "distance3d"**

In [0]:
mean_lat_train = train_F_num['latitude'].mean()
mean_long_train = train_F_num['longitude'].mean()
mean_lat_test = test_F_num['latitude'].mean()
mean_long_test = test_F_num['longitude'].mean()

# Planar distance (in degrees) from the mean coordinate of each set
train_F_num['distance'] = np.sqrt((train_F_num['longitude'] - mean_long_train)**2 + (train_F_num['latitude'] - mean_lat_train)**2)
test_F_num['distance'] = np.sqrt((test_F_num['longitude'] - mean_long_test)**2 + (test_F_num['latitude'] - mean_lat_test)**2)

# Same distance with gps_height as a third squared term; each offset is squared separately
train_F_num['distance3d'] = np.sqrt(train_F_num['gps_height']**2 + (train_F_num['longitude'] - mean_long_train)**2 + (train_F_num['latitude'] - mean_lat_train)**2)
test_F_num['distance3d'] = np.sqrt(test_F_num['gps_height']**2 + (test_F_num['longitude'] - mean_long_test)**2 + (test_F_num['latitude'] - mean_lat_test)**2)

train_F_num.describe().T

| feature | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| amount_tsh | 59400.0 | 317.650385 | 2997.575 | 0.0 | 0.0 | 0.0 | 20.0 | 350000.0 |
| gps_height | 59400.0 | 668.297239 | 693.1164 | -90.0 | 0.0 | 369.0 | 1319.25 | 2770.0 |
| longitude | 59400.0 | 34.077427 | 6.567432 | 0.0 | 33.090347 | 34.908743 | 37.17839 | 40.34519 |
| latitude | 59400.0 | -5.706033 | 2.946019 | -11.64944 | -8.540621 | -5.021597 | -3.326156 | -2e-08 |
| num_private | 59400.0 | 0.474141 | 12.23623 | 0.0 | 0.0 | 0.0 | 0.0 | 1776.0 |
| population | 59400.0 | 179.909983 | 471.4822 | 0.0 | 0.0 | 25.0 | 215.0 | 30500.0 |
| construction_year | 59400.0 | 2001.919495 | 12.25488 | 1960.0 | 1996.0 | 2008.0 | 2011.0 | 2013.0 |
| year | 59400.0 | 2011.921667 | 0.9587576 | 2002.0 | 2011.0 | 2012.0 | 2013.0 | 2013.0 |
| month | 59400.0 | 4.37564 | 3.029247 | 1.0 | 2.0 | 3.0 | 7.0 | 12.0 |
| day | 59400.0 | 15.621498 | 8.687553 | 1.0 | 8.0 | 16.0 | 23.0 | 31.0 |
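
The `distance` feature is a Euclidean distance measured in degrees, which treats a degree of longitude and a degree of latitude as equal lengths. As an alternative (not what this notebook uses), a great-circle distance in kilometres could be sketched with the haversine formula. The function name `haversine_km`, the sample coordinates, and the 6371 km mean Earth radius are assumptions made for this illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Works on scalars or on whole NumPy/pandas columns
lats = np.array([-5.0, -6.0])
lons = np.array([34.0, 34.0])
dist = haversine_km(lats, lons, -5.0, 34.0)
print(dist)
```

Using one fixed reference point for both the training and the test set would also keep the feature comparable across the two.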


#### Converting Categorical variables to Dummy Variables

In [0]:
for col in train_F_cat.columns:
  print (col, train_F_cat[col].nunique())

date_recorded 356
funder 1897
installer 2145
wpt_name 37400
basin 9
subvillage 19287
region 21
region_code 27
district_code 20
lga 125
ward 2092
public_meeting 2
recorded_by 1
scheme_management 12
scheme_name 2696
permit 2
extraction_type 18
extraction_type_group 13
extraction_type_class 7
management 12
management_group 5
payment 7
payment_type 7
water_quality 8
quality_group 6
quantity 5
quantity_group 5
source 10
source_type 7
source_class 3
waterpoint_type 7
waterpoint_type_group 6
day_of_week 7


In [0]:
null_list

['funder',
 'installer',
 'subvillage',
 'public_meeting',
 'scheme_management',
 'scheme_name',
 'permit',
 'recorded_by',
 'date_recorded']

#### Creating dummy variables from the low-cardinality (<= 125 categories) categorical variables

In [0]:
cols_kept = []

for col in train_F_cat.columns:
  if col not in null_list:
    if train_F_cat[col].nunique() <= 125:
      cols_kept.append(col)
    
print (len(cols_kept))
    
cols_kept



22


['basin',
 'region',
 'region_code',
 'district_code',
 'lga',
 'extraction_type',
 'extraction_type_group',
 'extraction_type_class',
 'management',
 'management_group',
 'payment',
 'payment_type',
 'water_quality',
 'quality_group',
 'quantity',
 'quantity_group',
 'source',
 'source_type',
 'source_class',
 'waterpoint_type',
 'waterpoint_type_group',
 'day_of_week']

In [0]:
small_cat_train = train_F_cat[cols_kept]
small_cat_test = test_F_cat[cols_kept]


small_cat_train.shape, small_cat_test.shape

((59400, 22), (14358, 22))

In [0]:
def dummy_df(category_df):
  # One-hot encode each column; prefixing with the column name keeps dummy names unique
  dummies = [pd.get_dummies(category_df[col], drop_first=True, prefix=f'Is_{col}') for col in category_df.columns]
  return pd.concat(dummies, axis='columns')

In [0]:
df_dumb_train = dummy_df(small_cat_train)
df_dumb_test = dummy_df(small_cat_test)

#### Extracting more dummy variables without blowing up the computer

In [0]:
cols_lost = []

for item in train_F_cat.columns:
  if item not in null_list:
    if item not in cols_kept:
      cols_lost.append(item)

cols_lost 

['wpt_name', 'ward']

In [0]:
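def make_more_features(df, col, count, df_cat):
  # Cast to object so unused categories of a category column do not become dummy columns
  values = df[col].astype(object)
  counts = values.value_counts()
  # Keep only the values of `col` that occur more than `count` times
  mask = values.isin(counts[counts > count].index)

  # One dummy column per frequent value, joined back to df_cat on the 'id' index
  dummy = pd.get_dummies(values[mask], prefix="Is_Big").reset_index()
  temp = df_cat.reset_index()

  temp_new = pd.merge(temp, dummy, on='id', how='left')
  temp_new.set_index('id', inplace=True)

  return temp_new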
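
An alternative to masking and merging is to collapse infrequent values into a single placeholder before encoding, which keeps every row and bounds the number of dummy columns. The following is only a sketch on a toy series; the `'other_rare'` label, the threshold of 1, and the ward names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical ward values; the real column has about 2,000 distinct wards
toy_ward = pd.Series(['ward_a', 'ward_a', 'ward_a', 'ward_b', 'ward_c', 'ward_c'], name='ward')
min_count = 1  # hypothetical threshold; the notebook uses 100 on the training set

counts = toy_ward.value_counts()
frequent = counts[counts > min_count].index

# Values at or below the threshold are replaced by one shared label
lumped = toy_ward.where(toy_ward.isin(frequent), 'other_rare')
dummies = pd.get_dummies(lumped, prefix='Is_Big')
print(sorted(dummies.columns))
```

Because the frequent values are chosen from one frame, applying the same `frequent` index to the test set would give both sets the same dummy columns.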

In [0]:
aa = 100
bb = int(aa*len(test_F_cat)/len(train_F_cat))

cc = 100
dd = int(cc*len(test_F_cat)/len(train_F_cat))

wpt_dummy_train = make_more_features(train_F_cat, 'wpt_name', aa, small_cat_train)
wpt_dummy_test = make_more_features(test_F_cat, 'wpt_name', bb, small_cat_test)

ward_dummy_train = make_more_features(train_F_cat, 'ward', cc, small_cat_train)
ward_dummy_test = make_more_features(test_F_cat, 'ward', dd, small_cat_test)




wpt_dummy_train.shape, wpt_dummy_test.shape, ward_dummy_train.shape, ward_dummy_test.shape


In [0]:
wpt_dummy_train.fillna(0,inplace=True) 
wpt_dummy_test.fillna(0,inplace=True) 
ward_dummy_train.fillna(0,inplace=True) 
ward_dummy_test.fillna(0,inplace=True)


In [0]:
for col in small_cat_train.columns:
  del wpt_dummy_train[col]
  del ward_dummy_train[col]
  del wpt_dummy_test[col]
  del ward_dummy_test[col]
  
wpt_dummy_train.shape, wpt_dummy_test.shape, ward_dummy_train.shape, ward_dummy_test.shape

In [0]:
df_dumb_train = pd.concat([df_dumb_train, wpt_dummy_train, ward_dummy_train], axis='columns')
df_dumb_test = pd.concat([df_dumb_test, wpt_dummy_test, ward_dummy_test], axis='columns')

# df_dumb_train.shape, df_dumb_test.shape

#### Check to see the Train and Test Features have the same columns in the same order.

In [0]:
# %%time


df_dumb_train.shape, df_dumb_test.shape


In [0]:
a = list(df_dumb_train.columns.values)

print(a)
print(len(a))

In [0]:
b = list(df_dumb_test.columns.values)

print(b)
print(len(b))

In [0]:
a == b

In [0]:
def ex_cols(a, b):
  # Return the column names that appear in only one of the two lists
  ex_a = [col for col in a if col not in b]
  ex_b = [col for col in b if col not in a]
  return ex_a, ex_b

ex_a, ex_b = ex_cols(a,b)

ex_a,ex_b

In [0]:
# Drop the columns that appear in only one of the two sets
df_dumb_train = df_dumb_train.drop(columns=ex_a)
df_dumb_test = df_dumb_test.drop(columns=ex_b)

df_dumb_train.shape, df_dumb_test.shape

In [0]:
c = list(df_dumb_train.columns.values)
d = list(df_dumb_test.columns.values)


c == d

In [0]:
for i in range(0,len(c)):
  if c[i] != d[i]:
    print("No match")
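
The manual comparison above can also be expressed with `DataFrame.align`, which keeps only the columns present in both frames and returns them in a matching order. This is a minimal sketch on toy frames; the frames and their column names are invented for illustration.

```python
import pandas as pd

toy_train = pd.DataFrame({'Is_a': [1, 0], 'Is_b': [0, 1], 'Is_c': [0, 0]})
toy_test = pd.DataFrame({'Is_b': [1], 'Is_a': [0], 'Is_d': [1]})

# join='inner' keeps the intersection of columns; axis=1 aligns columns only
train_aligned, test_aligned = toy_train.align(toy_test, join='inner', axis=1)

print(list(train_aligned.columns) == list(test_aligned.columns))
```

Rows are left untouched because only the column axis is aligned.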
    

#### Combining to form the Feature Matrices

In [0]:
X_train = pd.concat([train_F_num,df_dumb_train],axis='columns')
X_test = pd.concat([test_F_num,df_dumb_test],axis='columns')

# X_train = train_F_num
# X_test = test_F_num

# X_train = df_dumb_train
# X_test = df_dumb_test

X_train.shape, X_test.shape

In [0]:
X_train.head()

In [0]:
np.any(np.isnan(X_train)), np.any(np.isnan(X_test))


### Target Variable

In [0]:
train_L.head()

In [0]:
train_L['status_group'] = train_L['status_group'].astype('category')

train_L.dtypes, train_L.shape

In [0]:
y_train = train_L['status_group']

y_train.value_counts()

### Majority Class Baseline

In [0]:


majority_class = y_train.mode()[0]

print(majority_class)

y_pred = pd.DataFrame(np.full(shape=len(X_test), fill_value = majority_class))

# y_pred.head()

# y_pred['id'] = X_test['id']

In [0]:
# accuracy_score(y_train, y_pred)
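
A majority-class baseline can be scored on the training labels without fitting a model, because its accuracy equals the share of the most frequent class. The sketch below uses made-up labels; the `toy_labels` series is illustrative and does not reflect the real class distribution.

```python
import pandas as pd

toy_labels = pd.Series(
    ['functional', 'functional', 'non functional',
     'functional needs repair', 'functional'],
    name='status_group',
)

majority = toy_labels.mode()[0]
# Fraction of rows that a constant "always predict the majority" model gets right
baseline_accuracy = (toy_labels == majority).mean()
print(majority, baseline_accuracy)
```

On the real data the same two lines applied to `y_train` would give the number any classifier has to beat.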

In [0]:
temp = X_test.reset_index()

temp.head()

In [0]:
# Attach the test ids and name the prediction column as the submission expects
y_pred['id'] = temp['id'].values
y_pred.rename(columns={0:'status_group'}, inplace=True)
y_pred.set_index('id', inplace=True)

print(y_pred.shape)

y_pred.head()

In [0]:
from google.colab import files

y_pred.to_csv('majority_class.csv')
files.download('majority_class.csv')

### Classifier

### Validation with just the training set split

**Classifiers**

In [0]:
from sklearn.linear_model import LogisticRegression
# from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
# from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier


clf_rf = RandomForestClassifier(n_estimators=100, max_depth=27, # criterion = 'entropy', # max_features = 'auto', 
                             oob_score = True, random_state=237)

clf_lr = LogisticRegression(random_state=237, solver='lbfgs', multi_class='multinomial', max_iter=1000)


clf_nb = GaussianNB()


# clf_knn = KNeighborsClassifier(3) Stalls out

**Ensemble Classifiers - Did not perform as well as Random Forest**

In [0]:
eclf1 = VotingClassifier(estimators=[('lr', clf_lr), ('rf', clf_rf), ('gnb', clf_nb)], voting='hard')

eclf2 = VotingClassifier(estimators=[('lr', clf_lr), ('rf', clf_rf), ('gnb', clf_nb)], voting='soft')

eclf3 = VotingClassifier(estimators=[('lr', clf_lr), ('rf', clf_rf), ('gnb', clf_nb)], voting='soft', weights=[1,8,1], flatten_transform=True)

In [0]:
def quick_eval(X, y, clf):
  # Hold out 20% of the training data and return (train accuracy, validation accuracy)
  from sklearn.model_selection import train_test_split
  from sklearn.preprocessing import StandardScaler
  from sklearn.preprocessing import RobustScaler
  from sklearn.metrics import accuracy_score

  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=True, random_state=237)
  print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

  # Fit the scaler on the training split only, then apply it to the held-out split
  scaler = StandardScaler()
#   scaler = RobustScaler()
  X_train_clf = scaler.fit_transform(X_train)
  X_test_clf = scaler.transform(X_test)

  clf.fit(X_train_clf, y_train)

  y_pred_train = clf.predict(X_train_clf)
  y_pred = clf.predict(X_test_clf)

  return accuracy_score(y_train, y_pred_train), accuracy_score(y_test, y_pred)

In [0]:
# %%time

quick_eval(X_train, y_train, clf_rf)

### Cross Validation

In [0]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import MaxAbsScaler

scaler = StandardScaler()
# scaler = MinMaxScaler()
# scaler = RobustScaler()
# scaler = MaxAbsScaler()

X_train_clf = scaler.fit_transform(X_train)
X_test_clf = scaler.transform(X_test)

# X_train_clf = X_train
# X_test_clf = X_test

print(type(y_train))
print(type(X_train_clf))
print(type(X_test_clf))

X_train_clf.shape, y_train.shape, X_test_clf.shape

In [0]:
%%time

from sklearn.model_selection import cross_validate

from sklearn.metrics import accuracy_score


scores = cross_validate(clf_rf,
                        X_train_clf,y_train, 
                        scoring = 'accuracy', cv=5) 



In [0]:
pd.DataFrame(scores)
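
In the cross-validation above the scaler is fitted on the full training matrix before the folds are split, so each validation fold has already influenced the scaling. One common way to avoid this is to wrap the scaler and the classifier in a scikit-learn pipeline, so the scaler is refit inside every fold. The sketch below runs on synthetic data; the `make_classification` data and the small forest are stand-ins chosen so it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class problem standing in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=5,
                                     n_classes=3, random_state=237)

pipe = make_pipeline(StandardScaler(),
                     RandomForestClassifier(n_estimators=20, random_state=237))

# The scaler is fitted on the training part of each fold only
cv_scores = cross_validate(pipe, X_demo, y_demo, scoring='accuracy', cv=5)
print(cv_scores['test_score'].mean())
```

For a random forest the scaling has little effect on the splits, so this matters more for the logistic regression and naive Bayes classifiers defined earlier.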

### Creating the Prediction Vector

**Tuning the parameters - Grid Search - Random Forest**

In [0]:
param_grid = { "criterion" : ["gini", "entropy"], 
              "min_samples_leaf" : [1, 5, 10, 25], 
              "min_samples_split" : [20, 28, 36,], 
              "n_estimators": [100, 200, 300]}

param_grid2 = { #"criterion" : ["gini", "entropy"], 
              "min_samples_leaf" : [1, 5, 10, 25, 50, 70], 
              "min_samples_split" : [2, 4, 10, 12, 16, 18, 25, 35], 
              "n_estimators": [100, 400, 700, 1000, 1500]}

from sklearn.model_selection import GridSearchCV

# Exhaustive search over param_grid with 3-fold cross-validation (param_grid2 is not used here)
clf = GridSearchCV(estimator=clf_rf, param_grid=param_grid, n_jobs=-1, cv=3)

clf.fit(X_train, y_train)

# The fitted search exposes the winning combination as best_params_
clf.best_params_
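
After fitting, a grid search exposes its results through attributes that end in an underscore. The sketch below shows how they can be read on a deliberately tiny grid; the synthetic data and the `small_grid` values are assumptions chosen so the example runs in seconds, not the grid used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic three-class data standing in for the real feature matrix
X_demo, y_demo = make_classification(n_samples=200, n_features=8, n_informative=4,
                                     n_classes=3, random_state=0)

small_grid = {'min_samples_leaf': [1, 5], 'n_estimators': [10, 20]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid=small_grid, cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)           # winning parameter combination
print(round(search.best_score_, 3))  # its mean cross-validated accuracy
```

With the default `refit=True`, `search.best_estimator_` is already retrained on all the data and can be used to predict directly.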

In [0]:
def predictor(X_train, X_test, y_train, clf):
  from sklearn.preprocessing import StandardScaler
  from sklearn.preprocessing import RobustScaler
 
  from sklearn.metrics import accuracy_score
  

  y_pred = pd.DataFrame()
  
  temp_test = X_test.reset_index()
  y_id = temp_test['id']
#   scaler = StandardScaler()
  scaler = RobustScaler()
  
  X_train_clf = scaler.fit_transform(X_train)

  X_test_clf = scaler.transform(X_test)
  clf.fit(X_train_clf, y_train)
  
  y_pred_train = clf.predict(X_train_clf)
  
  print (f'\nThe accuracy score of the training set is {round(accuracy_score(y_train, y_pred_train), 5)}\n')
  
  prediction = pd.DataFrame(clf.predict(X_test_clf))
  
  y_pred = pd.concat([y_id, prediction], axis='columns')

  y_pred.rename(columns={0:'status_group'}, inplace=True)
  
  y_pred.set_index('id', inplace=True)
  
  return y_pred

In [0]:
%%time

df = predictor(X_train, X_test, y_train, clf_rf)



In [0]:
df.head()

In [0]:
df['status_group'].value_counts()

In [0]:
df.shape

### Submission Download

In [0]:
from google.colab import files

df.to_csv('submission.csv')
files.download('submission.csv')