# Importing Libraries

In [777]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier


# Problem Statement



Congratulations – you have been hired as Chief Data Scientist of MedCamp – a not for profit organization dedicated in making health conditions for working professionals better. MedCamp was started because the founders saw their family suffer due to bad work life balance and neglected health.

*MedCamp* organizes health camps in several cities with low work life balance. They reach out to working people and ask them to register for these health camps. For those who attend, MedCamp provides them facility to undergo health checks or increase awareness by visiting various stalls (depending on the format of camp).

MedCamp has conducted 65 such events over a period of 4 years and they see a high drop off between “Registration” and Number of people taking tests at the Camps. In last 4 years, they have stored data of ~110,000 registrations they have done.

One of the huge costs in arranging these camps is the amount of inventory you need to carry. If you carry more than required inventory, you incur unnecessarily high costs. On the other hand, if you carry less than required inventory for conducting these medical checks, people end up having bad experience.

## Process

    MedCamp employees / volunteers reach out to people and drive registrations.
    During the camp, People who “ShowUp” either undergo the medical tests or visit stalls depending on the format of healthcamp.

## Note

    Since this is a completely voluntary activity for the working professionals, MedCamp usually has little profile information about these people.
    For a few camps, there was hardware failure, so some information about date and time of registration is lost.
    MedCamp runs 3 formats of these camps. The first and second format provides people with an instantaneous health score. The third format provides information about several health issues through various awareness stalls.

## Favorable outcome:

    For the first 2 formats, a favourable outcome is defined as getting a health_score, while in the third format it is defined as visiting at least a stall.
    You need to predict the chances (probability) of having a favourable outcome.

## Data Description
Train Data

<b>train.zip</b> contains 6 different csv files apart from the data dictionary as described below:

<b>Health_Camp_Detail.csv</b> – File containing Health_Camp_Id, Camp_Start_Date, Camp_End_Date and Category details of each camp.

<b>Train.csv</b> – File containing registration details for all the test camps. This includes Patient_ID, Health_Camp_ID, Registration_Date and a few anonymized variables as on registration date.

<b>Patient_Profile.csv</b> – This file contains Patient profile details like Patient_ID, Online_Follower, Social media details, Income, Education, Age, First_Interaction_Date, City_Type and Employer_Category

<b>First_Health_Camp_Attended.csv</b> – This file contains details about people who attended health camp of first format. This includes Donation (amount) & Health_Score of the person.

<b>Second_Health_Camp_Attended.csv</b> - This file contains details about people who attended health camp of second format. This includes Health_Score of the person.

<b>Third_Health_Camp_Attended.csv</b> - This file contains details about people who attended health camp of third format. This includes Number_of_stall_visited & Last_Stall_Visited_Number.
<b>Test Data

<b>Test.csv</b> – File containing registration details for all the camps done after 1st April 2006. This includes Patient_ID, Health_Camp_ID, Registration_Date and a few anonymized variables as on registration date. Participant should make predictions for these patient camp combinations
Submission File

<b>sample_submission.csv</b>

<b>Patient_ID</b>: Unique Identifier for each patient. This ID is not sequential in nature and can not be used in modeling

<b>Health_Camp_ID</b>: Unique Identifier for each camp. This ID is not sequential in nature and can not be used in modeling

<b>Outcome</b>: Predicted probability for having a favourable outcome depending on the format


In [17]:
os.chdir('C:\\Users\\chandan.malla\\Desktop\\Analytics India\\Data\\Train')
print('Data Files available:')
os.listdir()

Data Files available:


['Data_Dictionary.xlsx',
 'First_Health_Camp_Attended.csv',
 'Health_Camp_Detail.csv',
 'Patient_Profile.csv',
 'Second_Health_Camp_Attended.csv',
 'Third_Health_Camp_Attended.csv',
 'Train.csv']

# Load Data

In [43]:
##Storing each Sheet of ReadMe (Data_Dictionary.xlsx) Excel file into dictionary
xl_file = pd.ExcelFile('Data_Dictionary.xlsx')
info_df = {sheet_name: xl_file.parse(sheet_name) 
          for sheet_name in xl_file.sheet_names}

print('Detail of Each File:\n')
for i in range(info_df['ReadMe'].shape[0]):
    print(info_df['ReadMe']['Details of the Files'][i])
    print('---')

Detail of Each File:

Health_Camp_Detail.csv – File containing Health_Camp_Id, Camp_Start_Date, Camp_End_Date and Category details of each camp.
---
Train.csv – File containing registration details for all the test camps. This includes Patient_ID, Health_Camp_ID, Registration_Date and a few anonymized variables as on registration date.
---
Patient_Profile.csv – This file contains Patient profile details like Patient_ID, Online_Follower, Social media details, Income, Education, Age, First_Interaction_Date, City_Type and Employer_Category
---
First_Health_Camp_Attended.csv – This file contains details about people who attended health camp of first format. This includes Donation (amount) & Health_Score of the person.
---
Second_Health_Camp_Attended.csv - This file contains details about people who attended health camp of second format. This includes Health_Score of the person.
---
Third_Health_Camp_Attended.csv - This file contains details about people who attended health camp of third fo

In [201]:
## Storing each data file into dictionary    
data = {}
data['First_Health_Camp_Attended'] = pd.read_csv('First_Health_Camp_Attended.csv')
data['Health_Camp_Detail'] = pd.read_csv('Health_Camp_Detail.csv')
data['Patient_Profile']= pd.read_csv('Patient_Profile.csv')
data['Second_Health_Camp_Attended'] = pd.read_csv('Second_Health_Camp_Attended.csv')
data['Third_Health_Camp_Attended'] = pd.read_csv('Third_Health_Camp_Attended.csv')
data['Train'] = pd.read_csv('Train.csv')
data['Test'] = pd.read_csv('test.csv')

In [202]:
## Examining each file
for i in data.keys():
    print(i)
    print(data[i].head())

First_Health_Camp_Attended
   Patient_ID  Health_Camp_ID  Donation  Health_Score  Unnamed: 4
0      506181            6560        40      0.439024         NaN
1      494977            6560        20      0.097561         NaN
2      518680            6560        10      0.048780         NaN
3      509916            6560        30      0.634146         NaN
4      488006            6560        20      0.024390         NaN
Health_Camp_Detail
   Health_Camp_ID Camp_Start_Date Camp_End_Date Category1 Category2  Category3
0            6560       16-Aug-03     20-Aug-03     First         B          2
1            6530       16-Aug-03     28-Oct-03     First         C          2
2            6544       03-Nov-03     15-Nov-03     First         F          1
3            6585       22-Nov-03     05-Dec-03     First         E          2
4            6561       30-Nov-03     18-Dec-03     First         E          1
Patient_Profile
   Patient_ID  Online_Follower  LinkedIn_Shared  Twitter_Shared  \
0

# Data cleaning

In [203]:
## Dropping Unwanted Row
data['First_Health_Camp_Attended'] = data['First_Health_Camp_Attended'].drop('Unnamed: 4',axis=1)

In [218]:
# Merging Each Individual data file with Train.csv and 
# adding a new column _merge which stores information on how join
# was performed (both(inner)/left/right) for referencing Outcome
# There will be lot of NaN values due to left join and less data in files other than train.csv
data['Train_Final_Data'] =pd.merge(data['Train'],data['First_Health_Camp_Attended'],
                                   on=['Patient_ID','Health_Camp_ID'],how='left',indicator='camp1_merge')

data['Train_Final_Data'] =pd.merge(data['Train_Final_Data'],data['Second_Health_Camp_Attended'],
                                   on=['Patient_ID','Health_Camp_ID'],how='left',indicator='camp2_merge')

data['Train_Final_Data'] =pd.merge(data['Train_Final_Data'],data['Third_Health_Camp_Attended'],
                                   on=['Patient_ID','Health_Camp_ID'],how='left',indicator='camp3_merge')

data['Train_Final_Data'] =pd.merge(data['Train_Final_Data'],data['Health_Camp_Detail'],
                                   on=['Health_Camp_ID'],how='left',indicator=False)

data['Train_Final_Data'] =pd.merge(data['Train_Final_Data'],data['Patient_Profile'],
                                   on=['Patient_ID'],how='left',indicator=False)


### Performing above pre-processing for test data also
data['Test_Final_Data'] =pd.merge(data['Test'],data['First_Health_Camp_Attended'],
                              on=['Patient_ID','Health_Camp_ID'],how='left',indicator='camp1_merge')

data['Test_Final_Data'] =pd.merge(data['Test_Final_Data'],data['Second_Health_Camp_Attended'],
                              on=['Patient_ID','Health_Camp_ID'],how='left',indicator='camp2_merge')

data['Test_Final_Data'] =pd.merge(data['Test_Final_Data'],data['Third_Health_Camp_Attended'],
                              on=['Patient_ID','Health_Camp_ID'],how='left',indicator='camp3_merge')

data['Test_Final_Data'] =pd.merge(data['Test_Final_Data'],data['Health_Camp_Detail'],
                              on=['Health_Camp_ID'],how='left',indicator=False)

data['Test_Final_Data'] =pd.merge(data['Test_Final_Data'],data['Patient_Profile'],
                                   on=['Patient_ID'],how='left',indicator=False)

In [219]:
##Creating Predictor Variable based on Rules of Problem
train_data = data['Train_Final_Data'].copy()
train_data['Outcome'] = 0
train_data.loc[(train_data['camp1_merge']=='both') |
                         (train_data['camp2_merge']=='both') |
                         ((train_data['camp3_merge']=='both') & (train_data['Number_of_stall_visited']>0)),'Outcome'] = 1

##No Predictor variable for Test data, as they have been registered later and need to upload prediction from test to analytics vidhya
test_data = data['Test_Final_Data'].copy()

In [220]:
##dropping not necessary features
train_data = train_data.drop(['camp1_merge','camp2_merge','camp3_merge'],axis = 1)
test_data = test_data.drop(['camp1_merge','camp2_merge','camp3_merge'],axis = 1)

In [221]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 75278 entries, 0 to 75277
Data columns (total 29 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Patient_ID                 75278 non-null  int64  
 1   Health_Camp_ID             75278 non-null  int64  
 2   Registration_Date          74944 non-null  object 
 3   Var1                       75278 non-null  int64  
 4   Var2                       75278 non-null  int64  
 5   Var3                       75278 non-null  int64  
 6   Var4                       75278 non-null  int64  
 7   Var5                       75278 non-null  int64  
 8   Donation                   6218 non-null   float64
 9   Health_Score               6218 non-null   float64
 10  Health Score               7819 non-null   float64
 11  Number_of_stall_visited    6515 non-null   float64
 12  Last_Stall_Visited_Number  6515 non-null   float64
 13  Camp_Start_Date            75278 non-null  obj

In [222]:
##function to convert some features in object dtype to date type
def convert_to_date(data,columns):
    for i in columns:
        data[columns]=data[columns].astype('datetime64[ns]') 
    return data

##function to convert some features in object dtype to int type
def convert_to_float(data,columns):
    for i in columns:
        data[columns]=data[columns].astype('float') 
    return data
## Replacing None Data with NaN 
train_data = train_data.replace('None', np.nan)
test_data = test_data.replace('None', np.nan)

In [223]:
cols = ['Registration_Date','First_Interaction','Camp_Start_Date','Camp_End_Date']
train_data = convert_to_date(train_data,cols)
test_data = convert_to_date(test_data,cols)

cols = ['Income','Education_Score','Age']
train_data = convert_to_int(train_data,cols)
test_data = convert_to_int(test_data,cols)

In [224]:
train_data.dtypes

Patient_ID                            int64
Health_Camp_ID                        int64
Registration_Date            datetime64[ns]
Var1                                  int64
Var2                                  int64
Var3                                  int64
Var4                                  int64
Var5                                  int64
Donation                            float64
Health_Score                        float64
Health Score                        float64
Number_of_stall_visited             float64
Last_Stall_Visited_Number           float64
Camp_Start_Date              datetime64[ns]
Camp_End_Date                datetime64[ns]
Category1                            object
Category2                            object
Category3                             int64
Online_Follower                       int64
LinkedIn_Shared                       int64
Twitter_Shared                        int64
Facebook_Shared                       int64
Income                          

In [225]:
#pd.set_option('display.max_columns', 30)
#pd.set_option('display.max_rows', 50)
#train_data.head(50)

## Checking for Null Values

In [241]:
print('**'*20,'Train Data info','**'*20)
for col in train_data.columns.values:
    print('Any Null value in ', col,':\t\t',train_data[col].isnull().any(),' and Count is ',train_data[train_data[col].isnull()].shape[0])

print('\n','**'*20,'Test Data info','**'*20)
for col in test_data.columns.values:
    print('Any Null value in ', col,':\t\t',test_data[col].isnull().any(),' and Count is ',test_data[test_data[col].isnull()].shape[0])

**************************************** Train Data info ****************************************
Any Null value in  Patient_ID :		 False  and Count is  0
Any Null value in  Health_Camp_ID :		 False  and Count is  0
Any Null value in  Registration_Date :		 True  and Count is  334
Any Null value in  Var1 :		 False  and Count is  0
Any Null value in  Var2 :		 False  and Count is  0
Any Null value in  Var3 :		 False  and Count is  0
Any Null value in  Var4 :		 False  and Count is  0
Any Null value in  Var5 :		 False  and Count is  0
Any Null value in  Donation :		 True  and Count is  69060
Any Null value in  Health_Score :		 True  and Count is  69060
Any Null value in  Health Score :		 True  and Count is  67459
Any Null value in  Number_of_stall_visited :		 True  and Count is  68763
Any Null value in  Last_Stall_Visited_Number :		 True  and Count is  68763
Any Null value in  Camp_Start_Date :		 False  and Count is  0
Any Null value in  Camp_End_Date :		 False  and Count is  0
Any Null val

In [580]:
## Removing data where  Registration Data is Null, as there are only 334 records,
#also test data does not have this discrepancy in this feature, so proceeing to delete

record_before = train_data.shape[0]

train_data_cleaned = train_data.dropna(subset=['Registration_Date'])
test_data_cleaned = test_data.copy()

record_after = train_data_cleaned.shape[0]
print('Records before Cleaning:',record_before)
print('Records after Cleaning:',record_after)
print('Records Deleted',record_before - record_after)
print('% Data left',record_after/record_before)


Records before Cleaning: 75278
Records after Cleaning: 74944
Records Deleted 334
% Data left 0.9955631127288185


# Data Imputation

## Zero Value Imputation

In [581]:
cols = ['Donation','Health Score','Health_Score','Number_of_stall_visited','Last_Stall_Visited_Number']
train_data_cleaned[cols] = train_data_cleaned[cols].fillna(0)
test_data_cleaned[cols] = test_data_cleaned[cols].fillna(0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]


## Predicting missing value imputation

    As city Feature has more than 40000 data present, we can predict with this much amount of data missing city records

### For Train Data

In [582]:
imputation_data = train_data_cleaned.copy()


##Dictionary for city_type(Dict is Used to map back numbers to Alphabet after training)
cat = {'A':1,'B':2,'C':3,'D':4,'E':5,'F':6,'G':7,'H':8,'I':9}
##Converting City_type to Categories
for i in cat.keys():
    imputation_data.loc[imputation_data['City_Type']==i,'City_Type'] = cat[i]
  

## https://stackoverflow.com/questions/32011359/convert-categorical-data-in-pandas-dataframe/32011969
cat_columns = ['Category1','Category2','Employer_Category']
## Converting rest of the categorical variable to numerical
imputation_data[cat_columns] = imputation_data[cat_columns].astype('category')
imputation_data[cat_columns] = imputation_data[cat_columns].apply(lambda x: x.cat.codes)


## Dividing data into train and test(upon which we make prediction)
imputation_data_train = imputation_data[~imputation_data['City_Type'].isnull()]
imputation_data_test = imputation_data[imputation_data['City_Type'].isnull()]
imputation_data_train['City_Type'] = imputation_data_train['City_Type'].astype(int)  ## Changing Data type


## Droppping Date Features
x_train,y_train = imputation_data_train.drop(['Health_Camp_ID','City_Type','Registration_Date',
                                              'Camp_Start_Date','Camp_End_Date',
                                              'First_Interaction','Outcome'],axis=1) , imputation_data_train['City_Type']
x_test = imputation_data_test.drop(['Health_Camp_ID','Registration_Date','Camp_Start_Date',
                                    'Camp_End_Date','First_Interaction','City_Type','Outcome'],axis=1)
x_test = x_test.fillna(0) ## Changing NaN to zero as Fit method does not work on NaN
x_train = x_train.fillna(0)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [583]:
## USing RandomForest to Predict

clf = RandomForestClassifier(n_estimators=50)
clf.fit(x_train,y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=50,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [584]:
##Predicting Missing City:
p = clf.predict(x_test)
#https://stackoverflow.com/questions/8023306/get-key-by-value-in-dictionary
p_city  = [list(cat.keys())[list(cat.values()).index(i)] for i in p]

In [585]:
##Mapping back numerical entries of City_type to alphabet and adding it back to *train_data_cleaned*
train_data_cleaned.loc[train_data_cleaned['City_Type'].isnull(),'City_Type'] = p_city

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


In [586]:
print('Before:\n',train_data['City_Type'].value_counts())
print('=='*50)
print('After:\n',train_data_cleaned['City_Type'].value_counts())

Before:
 B    8273
H    6139
D    5451
G    4360
C    4259
E    4025
A    3441
I    3312
F    2810
Name: City_Type, dtype: int64
After:
 B    14999
H    11794
D     9756
G     7375
A     7014
E     6890
C     6866
I     5577
F     4673
Name: City_Type, dtype: int64


### For Test Data

In [587]:
imputation_data_for_test = test_data_cleaned.copy()
imputation_data_for_test = imputation_data_for_test[imputation_data_for_test['City_Type'].isnull()]
cat_columns = ['Category1','Category2','Employer_Category']
## Converting rest of the categorical variable to numerical
imputation_data_for_test[cat_columns] = imputation_data_for_test[cat_columns].astype('category')
imputation_data_for_test[cat_columns] = imputation_data_for_test[cat_columns].apply(lambda x: x.cat.codes)

x_test = imputation_data_for_test.drop(['Health_Camp_ID','City_Type','Registration_Date',
                                              'Camp_Start_Date','Camp_End_Date',
                                              'First_Interaction'],axis=1)
x_test = x_test.fillna(0)

In [588]:
##Predicting Missing City from clf trained on Test
p = clf.predict(x_test)
#https://stackoverflow.com/questions/8023306/get-key-by-value-in-dictionary
p_city  = [list(cat.keys())[list(cat.values()).index(i)] for i in p]

In [589]:
##Mapping back numerical entries of City_type to alphabe and adding it back to *train_data_cleaned*
test_data_cleaned.loc[test_data_cleaned['City_Type'].isnull(),'City_Type'] = p_city

In [590]:
print('Before:\n',test_data['City_Type'].value_counts())
print('=='*50)
print('After:\n',test_data_cleaned['City_Type'].value_counts())

Before:
 H    4064
B    3969
A    2464
D    2430
C    2084
E    1984
G    1938
I    1592
F    1330
Name: City_Type, dtype: int64
After:
 B    6819
H    6079
D    4387
A    3778
C    3265
G    3206
E    3080
I    2488
F    2147
Name: City_Type, dtype: int64


## Mean Imputation

In [591]:
from sklearn.impute import SimpleImputer
mean_impute_cols = ['Age']
imp_mean = SimpleImputer(strategy='mean')
imp_mean.fit(train_data_cleaned[mean_impute_cols])
##Imputing mean values
train_data_cleaned[mean_impute_cols] = imp_mean.transform(train_data_cleaned[mean_impute_cols])
test_data_cleaned[mean_impute_cols] = imp_mean.transform(test_data_cleaned[mean_impute_cols])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


## Mode Imputer

In [592]:
freq_impute_cols = ['Income', 'Education_Score', 'City_Type', 'Employer_Category']
imp_freq = SimpleImputer(strategy='most_frequent')
imp_freq.fit(train_data_cleaned[freq_impute_cols])
train_data_cleaned[freq_impute_cols] = imp_freq.transform(train_data_cleaned[freq_impute_cols])
test_data_cleaned[freq_impute_cols] = imp_freq.transform(test_data_cleaned[freq_impute_cols])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


In [593]:
print('Any null value present',test_data_cleaned.isnull().any().sum())

Any null value present 0


# Saving Data

In [532]:
train_data_cleaned.to_csv('train_data_cleaned.csv')
test_data_cleaned.to_csv('test_data_cleaned.csv')

# Feature Engineering

## Camp duration

In [594]:
train_data_cleaned['Camp Duration'] = (train_data_cleaned['Camp_End_Date'] - train_data_cleaned['Camp_Start_Date']).dt.days
test_data_cleaned['Camp Duration'] = (test_data_cleaned['Camp_End_Date'] - test_data_cleaned['Camp_Start_Date']).dt.days

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


## Registeration before or after camp started?

In [595]:
train_data_cleaned['Reg_bef_after'] = (train_data_cleaned['Camp_Start_Date'] - train_data_cleaned['Registration_Date']).dt.days
test_data_cleaned['Reg_bef_after'] = (test_data_cleaned['Camp_Start_Date'] - test_data_cleaned['Registration_Date']).dt.days

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


## How many days after user registered, camp ended?

In [596]:
train_data_cleaned['days_left_for_camp_end'] = (train_data_cleaned['Camp_End_Date'] - train_data_cleaned['Registration_Date'] ).dt.days
test_data_cleaned['days_left_for_camp_end'] = (test_data_cleaned['Camp_End_Date'] - test_data_cleaned['Registration_Date'] ).dt.days

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


## Days Since First Interaction

In [597]:
train_data_cleaned['days_first_interaction'] = (train_data_cleaned['Registration_Date'] - train_data_cleaned['First_Interaction'] ).dt.days
test_data_cleaned['days_first_interaction'] = (test_data_cleaned['Registration_Date'] - test_data_cleaned['First_Interaction'] ).dt.days

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


## Days since Last Interaction

In [1076]:
all_data = train_data_cleaned.sort_values(['Patient_ID', 'Registration_Date'])
all_data = all_data.reset_index(drop=True)
patient_wise_visits = all_data.loc[:,['Patient_ID','Registration_Date']]
#patient_wise_visits = patient_wise_visits.drop_duplicates()
patient_wise_visits = patient_wise_visits.reset_index(drop=True)
train_data_cleaned['Last_Interaction'] = patient_wise_visits.groupby('Patient_ID')['Registration_Date'].shift()

In [1077]:
all_data = test_data_cleaned.sort_values(['Patient_ID', 'Registration_Date'])
all_data = all_data.reset_index(drop=True)
patient_wise_visits = all_data.loc[:,['Patient_ID','Registration_Date']]
#patient_wise_visits = patient_wise_visits.drop_duplicates()
patient_wise_visits = patient_wise_visits.reset_index(drop=True)
test_data_cleaned['Last_Interaction'] = patient_wise_visits.groupby('Patient_ID')['Registration_Date'].shift()

In [1082]:
test_data_cleaned['Last_Int_days'] = (test_data_cleaned['Registration_Date'] - test_data_cleaned['Last_Interaction'] ).dt.days
train_data_cleaned['Last_Int_days'] = (train_data_cleaned['Registration_Date'] - train_data_cleaned['Last_Interaction'] ).dt.days

In [1095]:
train_data_cleaned = train_data_cleaned.fillna(0)
test_data_cleaned = test_data_cleaned.fillna(0)

## Scaling

In [1099]:
target = 'Outcome'
Id = ['Patient_ID','Health_Camp_ID']
date = ['Registration_Date', 'Camp_Start_Date', 'Camp_End_Date', 'First_Interaction','Last_Interaction']
categorical = ['Var1', 'Var2', 'Var3', 'Var4', 'Var5', 'Category1', 'Category2', 'Category3', 'Online_Follower', 
                   'LinkedIn_Shared', 'Twitter_Shared', 'Facebook_Shared', 'City_Type', 'Employer_Category']

In [1100]:
train_data_cleaned_scaled = train_data_cleaned.copy()
test_data_cleaned_scaled = test_data_cleaned.copy()

In [1101]:
for col in train_data_cleaned_scaled.columns:
    if (col != target) and (col not in Id)  and (col not in date) and (col not in categorical):
        clf = StandardScaler()
        train_data_cleaned_scaled[col] = clf.fit_transform(train_data_cleaned[[col]])
        test_data_cleaned_scaled[col] = clf.transform(test_data_cleaned[[col]])

## OHE

In [1102]:
cols_for_ohe = ['Category1', 'Category2', 'City_Type', 'Employer_Category']

In [1103]:
train_data_cleaned_scaled = pd.concat([train_data_cleaned_scaled.drop(cols_for_ohe,axis=1),
                                       pd.get_dummies(train_data_cleaned_scaled[cols_for_ohe])],axis=1)
test_data_cleaned_scaled = pd.concat([test_data_cleaned_scaled.drop(cols_for_ohe,axis=1),
                                      pd.get_dummies(test_data_cleaned_scaled[cols_for_ohe])],axis=1)

# Modelling

## Test Train Split

In [1104]:
train_data_cleaned_scaled.groupby(target)[target].count()

Outcome
0    54606
1    20338
Name: Outcome, dtype: int64

In [1105]:
from sklearn.model_selection import train_test_split
ignore_cols_train = [Id[0],Id[1], target, 'Registration_Date', 'Camp_Start_Date',
                     'Camp_End_Date', 'First_Interaction','Donation','Health_Score','Health Score',
                     'Number_of_stall_visited','Last_Stall_Visited_Number','Last_Interaction']
ignore_cols_test = [Id[0],Id[1], 'Registration_Date', 'Camp_Start_Date',
                     'Camp_End_Date', 'First_Interaction','Donation','Health_Score','Health Score',
                     'Number_of_stall_visited','Last_Stall_Visited_Number','Last_Interaction']
X, y = train_data_cleaned_scaled.drop(ignore_cols_train, axis=1), train_data_cleaned_scaled[target]
X_test = test_data_cleaned_scaled.drop(ignore_cols_test, axis=1)

In [1106]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.30, random_state=10)

## Random Forest

In [1107]:
X_test['Category2_B'] = [0]*X_test.shape[0]

In [1108]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
clf = RandomForestClassifier(n_estimators=1000,class_weight={0:1,1:2.5})
clf.fit(X_train,y_train)
y_pred = clf.predict(X_val)
accuracy_score(y_val,y_pred)

0.8199163849848782

In [1110]:
X_test.columns =X_train.columns

In [1111]:
randomforest_auc = roc_auc_score(y_val,y_pred)
print(randomforest_auc)

0.7435572855316975


In [1112]:
data['Test_result'] =data['Test'].drop(['Registration_Date','Var1','Var2','Var3','Var4','Var5'],axis=1)
data['Test_result']['Outcome'] = clf.predict(X_test)

In [1113]:
data['Test_result']['Outcome'].value_counts()

0    31886
1     3363
Name: Outcome, dtype: int64

## Using Lazy Predict

In [891]:
from lazypredict.Supervised import LazyClassifier



In [1125]:
clf = LazyClassifier(verbose=0,ignore_warnings=True, custom_metric=None)
models,predictions = clf.fit(X_train, X_val, y_train, y_val)

 60%|█████████████████████████████████████████████████▏                                | 18/30 [01:59<00:54,  4.57s/it]

KeyboardInterrupt: 

In [893]:
models

Unnamed: 0_level_0,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
LGBMClassifier,0.82,0.73,0.73,0.81,1.56
RandomForestClassifier,0.81,0.73,0.73,0.8,7.12
XGBClassifier,0.82,0.72,0.72,0.81,4.73
ExtraTreesClassifier,0.8,0.72,0.72,0.79,6.25
BaggingClassifier,0.8,0.72,0.72,0.79,2.19
DecisionTreeClassifier,0.78,0.71,0.71,0.77,0.37
ExtraTreeClassifier,0.76,0.69,0.69,0.76,0.22
KNeighborsClassifier,0.76,0.67,0.67,0.75,31.95
AdaBoostClassifier,0.79,0.66,0.66,0.76,2.47
NuSVC,0.75,0.57,0.57,0.7,490.79


## Using XGboost

In [897]:
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import GridSearchCV 
from sklearn import  metrics 

In [981]:
def modelfit(alg, dtrain,dpredictor,useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(dtrain.values, label=dpredictor.values)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
            metrics='auc', early_stopping_rounds=early_stopping_rounds)
        alg.set_params(n_estimators=cvresult.shape[0])
    
    #Fit the algorithm on the data
    alg.fit(dtrain,dpredictor,eval_metric='auc')
        
    #Predict training set:
    dtrain_predictions = alg.predict(dtrain)
    dtrain_predprob = alg.predict_proba(dtrain)[:,1]
    print(cvresult.shape[0])
    #Print model report:
    print ("\nModel Report")
    print ("Accuracy : %.4g" % metrics.accuracy_score(dpredictor.values, dtrain_predictions))
    print ("AUC Score (Train): %f" % metrics.roc_auc_score(dpredictor, dtrain_predprob))
                    

In [1130]:
## All the hyper parameters are determined using randomsearch Cv not present in this notebook,
## n_estimator is determined in modelfit()
xgb1 = XGBClassifier(
 learning_rate =0.01,
 n_estimators=200,
 max_depth=9,
 min_child_weight=1,
 gamma=0.1,
 subsample=0.9,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1.75,
 reg_alpha=0
 )

In [1131]:
modelfit(xgb1,X_train,y_train)

200

Model Report
Accuracy : 0.848
AUC Score (Train): 0.889691


In [1132]:
data['Test_result']['Outcome'] = xgb1.predict(X_test)
data['Test_result'].to_csv('Submission.csv',index=False)

In [1133]:
boosted = roc_auc_score(xgb1.predict(X_val),y_val)
print(boosted)

0.7728601249523224


## Using LGBMClassifier

In [994]:
def learning_rate_010_decay_power_099(current_iter):
    base_learning_rate = 0.1
    lr = base_learning_rate  * np.power(.99, current_iter)
    return lr if lr > 1e-3 else 1e-3

def learning_rate_010_decay_power_0995(current_iter):
    base_learning_rate = 0.1
    lr = base_learning_rate  * np.power(.995, current_iter)
    return lr if lr > 1e-3 else 1e-3

def learning_rate_005_decay_power_099(current_iter):
    base_learning_rate = 0.05
    lr = base_learning_rate  * np.power(.99, current_iter)
    return lr if lr > 1e-3 else 1e-3

In [1117]:
import lightgbm as lgb
fit_params={"early_stopping_rounds":30, 
            "eval_metric" : 'auc', 
            "eval_set" : [(X_val,y_val)],
            'eval_names': ['valid'],
            #'callbacks': [lgb.reset_parameter(learning_rate=learning_rate_010_decay_power_099)],
            'verbose': 100,
            'categorical_feature': 'auto'}

In [1118]:
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform
param_test ={'num_leaves': sp_randint(6, 50), 
             'min_child_samples': sp_randint(100, 500), 
             'min_child_weight': [1e-5, 1e-3, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3, 1e4],
             'subsample': sp_uniform(loc=0.2, scale=0.8), 
             'colsample_bytree': sp_uniform(loc=0.4, scale=0.6),
             'reg_alpha': [0, 1e-1, 1, 2, 5, 7, 10, 50, 100],
             'reg_lambda': [0, 1e-1, 1, 5, 10, 20, 50, 100]}

In [1119]:
#This parameter defines the number of HP points to be tested
n_HP_points_to_test = 100

import lightgbm as lgb
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

#n_estimators is set to a "large value". The actual number of trees build will depend on early stopping and 5000 define only the absolute maximum
clf = lgb.LGBMClassifier(max_depth=-1, random_state=314, silent=True, metric='None', n_jobs=4, n_estimators=5000)
gs = RandomizedSearchCV(
    estimator=clf, param_distributions=param_test, 
    n_iter=n_HP_points_to_test,
    scoring='roc_auc',
    cv=3,
    refit=True,
    random_state=314,
    verbose=True)

In [998]:
## This section takes lot of time to run
gs.fit(X_train, y_train, **fit_params)
print('Best score reached: {} with params: {} '.format(gs.best_score_, gs.best_params_))



Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.853679
[200]	valid's auc: 0.855562
[300]	valid's auc: 0.85639
[400]	valid's auc: 0.856638
Early stopping, best iteration is:
[419]	valid's auc: 0.856727
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.853227
[200]	valid's auc: 0.855412
[300]	valid's auc: 0.856567
[400]	valid's auc: 0.85725
[500]	valid's auc: 0.857835
[600]	valid's auc: 0.858036
Early stopping, best iteration is:
[625]	valid's auc: 0.858138
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.853036
[200]	valid's auc: 0.855186
[300]	valid's auc: 0.856059
[400]	valid's auc: 0.856392
Early stopping, best iteration is:
[386]	valid's auc: 0.856472
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.853428
[200]	valid's auc: 0.855129
[300]	valid's auc: 0.855847
[400]	valid's auc: 0.856201
Early stopping, best iteration is:
[398]	valid's auc: 0.856204
Tr

[200]	valid's auc: 0.856173
[300]	valid's auc: 0.856602
[400]	valid's auc: 0.857169
[500]	valid's auc: 0.857307
Early stopping, best iteration is:
[483]	valid's auc: 0.857371
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.854892
[200]	valid's auc: 0.856051
[300]	valid's auc: 0.856618
[400]	valid's auc: 0.857032
Early stopping, best iteration is:
[385]	valid's auc: 0.857122
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.854757
[200]	valid's auc: 0.856165
[300]	valid's auc: 0.857217
[400]	valid's auc: 0.857657
Early stopping, best iteration is:
[413]	valid's auc: 0.857754
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.85452
[200]	valid's auc: 0.856116
[300]	valid's auc: 0.856765
[400]	valid's auc: 0.856949
Early stopping, best iteration is:
[459]	valid's auc: 0.857098
Training until validation scores don't improve for 30 rounds
Early stopping, best iteration is:
[1]	valid's auc: 0.5
T

[100]	valid's auc: 0.850249
[200]	valid's auc: 0.852485
[300]	valid's auc: 0.853298
[400]	valid's auc: 0.853876
[500]	valid's auc: 0.854296
[600]	valid's auc: 0.854697
[700]	valid's auc: 0.855063
[800]	valid's auc: 0.855408
[900]	valid's auc: 0.855668
Early stopping, best iteration is:
[911]	valid's auc: 0.855728
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.850058
[200]	valid's auc: 0.852227
[300]	valid's auc: 0.853411
[400]	valid's auc: 0.853909
[500]	valid's auc: 0.854282
[600]	valid's auc: 0.854753
[700]	valid's auc: 0.855078
[800]	valid's auc: 0.855559
[900]	valid's auc: 0.855862
Early stopping, best iteration is:
[944]	valid's auc: 0.85602
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.849525
[200]	valid's auc: 0.852059
[300]	valid's auc: 0.853053
[400]	valid's auc: 0.853882
[500]	valid's auc: 0.854349
[600]	valid's auc: 0.854906
[700]	valid's auc: 0.855323
[800]	valid's auc: 0.855703
[900]	valid's auc: 0.8560

[400]	valid's auc: 0.857574
Early stopping, best iteration is:
[378]	valid's auc: 0.857645
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.855397
[200]	valid's auc: 0.857033
[300]	valid's auc: 0.857874
[400]	valid's auc: 0.858158
[500]	valid's auc: 0.858663
[600]	valid's auc: 0.858849
Early stopping, best iteration is:
[648]	valid's auc: 0.858999
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.854457
[200]	valid's auc: 0.856172
[300]	valid's auc: 0.856991
[400]	valid's auc: 0.857541
Early stopping, best iteration is:
[466]	valid's auc: 0.857732
Training until validation scores don't improve for 30 rounds
Early stopping, best iteration is:
[1]	valid's auc: 0.5
Training until validation scores don't improve for 30 rounds
Early stopping, best iteration is:
[1]	valid's auc: 0.5
Training until validation scores don't improve for 30 rounds
Early stopping, best iteration is:
[1]	valid's auc: 0.5
Training until validation scor

[200]	valid's auc: 0.832814
[300]	valid's auc: 0.834458
[400]	valid's auc: 0.835603
[500]	valid's auc: 0.83636
[600]	valid's auc: 0.836938
[700]	valid's auc: 0.837301
[800]	valid's auc: 0.837665
[900]	valid's auc: 0.83795
[1000]	valid's auc: 0.838171
[1100]	valid's auc: 0.83837
[1200]	valid's auc: 0.838557
[1300]	valid's auc: 0.838677
[1400]	valid's auc: 0.838793
[1500]	valid's auc: 0.838964
[1600]	valid's auc: 0.839054
[1700]	valid's auc: 0.83913
[1800]	valid's auc: 0.839216
[1900]	valid's auc: 0.83925
Early stopping, best iteration is:
[1879]	valid's auc: 0.839273
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.847796
Early stopping, best iteration is:
[130]	valid's auc: 0.848205
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.847008
Early stopping, best iteration is:
[142]	valid's auc: 0.847702
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.847348
Early stopping, best iteration is:

Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.848448
Early stopping, best iteration is:
[143]	valid's auc: 0.848662
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.848179
Early stopping, best iteration is:
[142]	valid's auc: 0.848467
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.850418
[200]	valid's auc: 0.852327
[300]	valid's auc: 0.85328
[400]	valid's auc: 0.853962
[500]	valid's auc: 0.854377
[600]	valid's auc: 0.854821
[700]	valid's auc: 0.855109
[800]	valid's auc: 0.85539
Early stopping, best iteration is:
[777]	valid's auc: 0.855422
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.849927
[200]	valid's auc: 0.852177
[300]	valid's auc: 0.853288
[400]	valid's auc: 0.853931
[500]	valid's auc: 0.854399
[600]	valid's auc: 0.854675
Early stopping, best iteration is:
[601]	valid's auc: 0.854681
Training until validation scores don't improve for 30 roun

[100]	valid's auc: 0.84985
[200]	valid's auc: 0.852344
[300]	valid's auc: 0.853312
[400]	valid's auc: 0.853909
[500]	valid's auc: 0.854174
Early stopping, best iteration is:
[524]	valid's auc: 0.85423
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.852675
[200]	valid's auc: 0.855138
[300]	valid's auc: 0.856147
[400]	valid's auc: 0.8567
Early stopping, best iteration is:
[463]	valid's auc: 0.856853
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.853164
[200]	valid's auc: 0.855212
[300]	valid's auc: 0.85623
[400]	valid's auc: 0.856874
[500]	valid's auc: 0.857226
[600]	valid's auc: 0.857573
[700]	valid's auc: 0.85772
Early stopping, best iteration is:
[698]	valid's auc: 0.85773
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.852842
[200]	valid's auc: 0.855009
[300]	valid's auc: 0.85621
[400]	valid's auc: 0.856789
[500]	valid's auc: 0.857068
[600]	valid's auc: 0.857259
[700]	valid's auc: 0

[1100]	valid's auc: 0.838941
[1200]	valid's auc: 0.839067
[1300]	valid's auc: 0.83917
[1400]	valid's auc: 0.83932
[1500]	valid's auc: 0.839416
Early stopping, best iteration is:
[1535]	valid's auc: 0.839494
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.846887
Early stopping, best iteration is:
[157]	valid's auc: 0.848015
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.846379
Early stopping, best iteration is:
[168]	valid's auc: 0.847497
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.846123
[200]	valid's auc: 0.847328
Early stopping, best iteration is:
[176]	valid's auc: 0.847328
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.852896
[200]	valid's auc: 0.854814
[300]	valid's auc: 0.855409
Early stopping, best iteration is:
[345]	valid's auc: 0.855652
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.852662
[200]	valid'

Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.852599
[200]	valid's auc: 0.854676
[300]	valid's auc: 0.855387
Early stopping, best iteration is:
[349]	valid's auc: 0.855728
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.852891
[200]	valid's auc: 0.8545
[300]	valid's auc: 0.855122
Early stopping, best iteration is:
[294]	valid's auc: 0.855123
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.853218
[200]	valid's auc: 0.854846
Early stopping, best iteration is:
[256]	valid's auc: 0.855051
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.852777
[200]	valid's auc: 0.854545
Early stopping, best iteration is:
[249]	valid's auc: 0.854882
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.852367
[200]	valid's auc: 0.854379
[300]	valid's auc: 0.855137
Early stopping, best iteration is:
[362]	valid's auc: 0.855397
Training until val

[Parallel(n_jobs=1)]: Done 300 out of 300 | elapsed: 14.9min finished


Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.859565
[200]	valid's auc: 0.861817
[300]	valid's auc: 0.86295
[400]	valid's auc: 0.863813
Early stopping, best iteration is:
[433]	valid's auc: 0.863986
Best score reached: 0.8582075884899406 with params: {'colsample_bytree': 0.952164731370897, 'min_child_samples': 111, 'min_child_weight': 0.01, 'num_leaves': 38, 'reg_alpha': 0, 'reg_lambda': 0.1, 'subsample': 0.3029313662262354} 


In [1004]:
## finding weights for minority class
clf_sw = lgb.LGBMClassifier(**clf.get_params())
#set optimal parameters
clf_sw.set_params(**opt_parameters)

LGBMClassifier(boosting_type='gbdt', class_weight=None,
               colsample_bytree=0.952164731370897, importance_type='split',
               learning_rate=0.1, max_depth=-1, metric='None',
               min_child_samples=111, min_child_weight=0.01, min_split_gain=0.0,
               n_estimators=5000, n_jobs=4, num_leaves=38, objective=None,
               random_state=314, reg_alpha=0, reg_lambda=0.1, silent=True,
               subsample=0.3029313662262354, subsample_for_bin=200000,
               subsample_freq=0)

In [1007]:
gs_sample_weight = GridSearchCV(estimator=clf_sw, 
                                param_grid={'scale_pos_weight':[2.5,3,3.5,4]},
                                scoring='roc_auc',
                                cv=5,
                                refit=True,
                                verbose=True)

In [1008]:
gs_sample_weight.fit(X_train, y_train, **fit_params)
print('Best score reached: {} with params: {} '.format(gs_sample_weight.best_score_, gs_sample_weight.best_params_))

Fitting 5 folds for each of 4 candidates, totalling 20 fits
Training until validation scores don't improve for 30 rounds

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.



[100]	valid's auc: 0.859509
[200]	valid's auc: 0.860679
Early stopping, best iteration is:
[227]	valid's auc: 0.861065
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.8586
[200]	valid's auc: 0.859937
[300]	valid's auc: 0.860582
Early stopping, best iteration is:
[274]	valid's auc: 0.860637
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.858683
[200]	valid's auc: 0.859878
Early stopping, best iteration is:
[257]	valid's auc: 0.860466
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.860294
[200]	valid's auc: 0.862079
Early stopping, best iteration is:
[237]	valid's auc: 0.862298
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.858681
[200]	valid's auc: 0.860443
[300]	valid's auc: 0.86097
Early stopping, best iteration is:
[312]	valid's auc: 0.861011
Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.859462
[200]	valid's auc:

[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:   47.2s finished


Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.860104
[200]	valid's auc: 0.862213
[300]	valid's auc: 0.863424
[400]	valid's auc: 0.864037
Early stopping, best iteration is:
[433]	valid's auc: 0.864209
Best score reached: 0.8600527319750176 with params: {'scale_pos_weight': 2.5} 


In [1120]:
## Best Parameters
opt_parameters = {'colsample_bytree': 0.952164731370897, 
         'min_child_samples': 111, 'min_child_weight': 0.01, 'num_leaves': 38, 
         'reg_alpha': 0.3, 'reg_lambda': 0.2, 'subsample': 0.3029313662262354,
         'scale_pos_weight': 1.75} 

In [1116]:
X_train

Unnamed: 0,Var1,Var2,Var3,Var4,Var5,Category3,Online_Follower,LinkedIn_Shared,Twitter_Shared,Facebook_Shared,Income,Education_Score,Age,Camp Duration,Reg_bef_after,days_left_for_camp_end,days_first_interaction,Last_Int_days,Category1_First,Category1_Second,Category1_Third,Category2_A,Category2_B,Category2_C,Category2_D,Category2_E,Category2_F,Category2_G,City_Type_A,City_Type_B,City_Type_C,City_Type_D,City_Type_E,City_Type_F,City_Type_G,City_Type_H,City_Type_I,Employer_Category_BFSI,Employer_Category_Broadcasting,Employer_Category_Consulting,Employer_Category_Education,Employer_Category_Food,Employer_Category_Health,Employer_Category_Manufacturing,Employer_Category_Others,Employer_Category_Real Estate,Employer_Category_Retail,Employer_Category_Software Industry,Employer_Category_Technology,Employer_Category_Telecom,Employer_Category_Transport
17496,0,0,0,0,0,2,0,0,0,0,0.67,-3.54,-1.22,-0.09,-0.19,-0.23,2.50,-0.12,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
33732,0,0,0,0,0,2,0,0,0,0,-0.39,0.15,-1.99,-0.07,-0.38,-0.33,-0.69,-0.12,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
26633,0,0,0,0,0,2,0,0,0,0,-0.39,0.15,-0.00,1.98,-0.08,2.66,-0.71,1.38,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
18067,0,0,0,0,0,2,0,0,0,0,4.90,2.42,-0.14,-0.57,0.67,-0.39,-0.46,-0.80,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
30674,1,0,0,0,1,2,0,0,0,0,-0.39,0.15,-0.00,-0.53,0.40,-0.49,2.22,-0.12,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17976,0,0,0,0,0,2,0,0,0,0,1.73,2.14,-0.30,-0.70,0.53,-0.64,-0.73,-0.12,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
37733,0,0,0,0,0,2,0,0,0,0,-0.39,0.15,-0.00,0.64,-1.60,-0.06,-0.73,-0.12,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
10246,0,0,0,0,0,2,0,0,0,0,-0.39,0.15,-0.00,0.64,0.32,1.07,-0.73,-2.11,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
9412,0,0,0,0,0,2,0,0,0,0,-0.39,0.15,-0.00,-0.54,0.23,-0.60,0.57,-0.12,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0


In [1121]:
clf_final = lgb.LGBMClassifier(**clf.get_params())
#set optimal parameters
clf_final.set_params(**opt_parameters)
#Train the final model with learning rate decay
clf_final.fit(X_train, y_train, **fit_params, callbacks=[lgb.reset_parameter(learning_rate=learning_rate_010_decay_power_0995)])

Training until validation scores don't improve for 30 rounds
[100]	valid's auc: 0.858186
[200]	valid's auc: 0.859786
[300]	valid's auc: 0.860386
[400]	valid's auc: 0.860707
[500]	valid's auc: 0.860815
[600]	valid's auc: 0.860965
[700]	valid's auc: 0.861016
Early stopping, best iteration is:
[747]	valid's auc: 0.861034


LGBMClassifier(boosting_type='gbdt', class_weight=None,
               colsample_bytree=0.952164731370897, importance_type='split',
               learning_rate=0.1, max_depth=-1, metric='None',
               min_child_samples=111, min_child_weight=0.01, min_split_gain=0.0,
               n_estimators=5000, n_jobs=4, num_leaves=38, objective=None,
               random_state=314, reg_alpha=0.3, reg_lambda=0.2,
               scale_pos_weight=1.75, silent=True, subsample=0.3029313662262354,
               subsample_for_bin=200000, subsample_freq=0)

In [1123]:
data['Test_result']['Outcome'] = clf_final.predict(X_test)

In [1124]:
data['Test_result'].to_csv('Submission.csv',index=False)

In [1122]:
LGBM_AUC = roc_auc_score(clf_final.predict(X_val),y_val)
print(LGBM_AUC)

0.7668441040513237


In [1134]:
print('Random Forest AUC Score',randomforest_auc)
print('XGB Boost AUC Score',boosted)
print('LGBM AUC Score',LGBM_AUC)

Random Forest AUC Score 0.7435572855316975
XGB Boost AUC Score 0.7728601249523224
LGBM AUC Score 0.7668441040513237


####### On Test LGBM gave 75% accuracy