# Problem Statement:

Congratulations – you have been hired as Chief Data Scientist of MedCamp – a not for profit organization dedicated in making health conditions for working professionals better. MedCamp was started because the founders saw their family suffer due to bad work life balance and neglected health.

MedCamp organizes health camps in several cities with low work life balance. They reach out to working people and ask them to register for these health camps. For those who attend, MedCamp provides them facility to undergo health checks or increase awareness by visiting various stalls (depending on the format of camp). 

MedCamp has conducted 65 such events over a period of 4 years and they see a high drop off between “Registration” and Number of people taking tests at the Camps. In last 4 years, they have stored data of ~110,000 registrations they have done.

One of the huge costs in arranging these camps is the amount of inventory you need to carry. If you carry more than required inventory, you incur unnecessarily high costs. On the other hand, if you carry less than required inventory for conducting these medical checks, people end up having bad experience.

 

The Process:
1. MedCamp employees / volunteers reach out to people and drive registrations.

2. During the camp, People who “ShowUp” either undergo the medical tests or visit stalls depending on the format of health camp.
 

Other things to note:
1. Since this is a completely voluntary activity for the working professionals, MedCamp usually has little profile information about these people.

2. For a few camps, there was hardware failure, so some information about date and time of registration is lost.

3. MedCamp runs 3 formats of these camps. The first and second format provides people with an instantaneous health score. The third format provides information about several health issues through various awareness stalls.

Favorable outcome:
1. For the first 2 formats, a favourable outcome is defined as getting a health_score, while in the third format it is defined as visiting at least a stall.

2. You need to predict the chances (probability) of having a favourable outcome.
 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

  import pandas.util.testing as tm


In [2]:
from google.colab import files
uploaded = files.upload()

Saving Data_Dictionary.xlsx to Data_Dictionary.xlsx
Saving First_Health_Camp_Attended.csv to First_Health_Camp_Attended.csv
Saving Health_Camp_Detail.csv to Health_Camp_Detail.csv
Saving Patient_Profile.csv to Patient_Profile.csv
Saving sample_submmission.csv to sample_submmission.csv
Saving Second_Health_Camp_Attended.csv to Second_Health_Camp_Attended.csv
Saving test_l0Auv8Q.csv to test_l0Auv8Q.csv
Saving Third_Health_Camp_Attended.csv to Third_Health_Camp_Attended.csv
Saving Train.csv to Train.csv


In [3]:
train_data = pd.read_csv("Train.csv")
first_health_camp_data = pd.read_csv("First_Health_Camp_Attended.csv")
health_camp_details = pd.read_csv("Health_Camp_Detail.csv")
second_health_camp_data = pd.read_csv("Second_Health_Camp_Attended.csv")
third_health_camp_data = pd.read_csv("Third_Health_Camp_Attended.csv")
patient_profile = pd.read_csv("Patient_Profile.csv")
test_data = pd.read_csv("test_l0Auv8Q.csv")

# Data Cleaning:

In [4]:
patient_profile.head()

Unnamed: 0,Patient_ID,Online_Follower,LinkedIn_Shared,Twitter_Shared,Facebook_Shared,Income,Education_Score,Age,First_Interaction,City_Type,Employer_Category
0,516956,0,0,0,0,1,90.0,39,18-Jun-03,,Software Industry
1,507733,0,0,0,0,1,,40,20-Jul-03,H,Software Industry
2,508307,0,0,0,0,3,87.0,46,02-Nov-02,D,BFSI
3,512612,0,0,0,0,1,75.0,47,02-Nov-02,D,Education
4,521075,0,0,0,0,3,,80,24-Nov-02,H,Others


In [5]:
train = pd.merge(train_data, first_health_camp_data.drop('Unnamed: 4',axis = 1), how = 'left', on = ['Patient_ID','Health_Camp_ID'], indicator = "merge_1")
train = pd.merge(train, second_health_camp_data, how = 'left', on = ['Patient_ID','Health_Camp_ID'], indicator = "merge_2")
train = pd.merge(train, third_health_camp_data, how = 'left', on = ['Patient_ID','Health_Camp_ID'], indicator = "merge_3")
train = pd.merge(train, health_camp_details, how = 'left', on = ['Health_Camp_ID'], indicator = "merge_healthcamp")
train = pd.merge(train, patient_profile, how = 'left', on = ['Patient_ID'], indicator = "merge_patientprofile")

In [6]:
train

Unnamed: 0,Patient_ID,Health_Camp_ID,Registration_Date,Var1,Var2,Var3,Var4,Var5,Donation,Health_Score,merge_1,Health Score,merge_2,Number_of_stall_visited,Last_Stall_Visited_Number,merge_3,Camp_Start_Date,Camp_End_Date,Category1,Category2,Category3,merge_healthcamp,Online_Follower,LinkedIn_Shared,Twitter_Shared,Facebook_Shared,Income,Education_Score,Age,First_Interaction,City_Type,Employer_Category,merge_patientprofile
0,489652,6578,10-Sep-05,4,0,0,0,2,,,left_only,,left_only,2.0,1.0,both,16-Aug-05,14-Oct-05,Third,G,2,both,0,0,0,0,,,,06-Dec-04,,,both
1,507246,6578,18-Aug-05,45,5,0,0,7,,,left_only,,left_only,,,left_only,16-Aug-05,14-Oct-05,Third,G,2,both,0,0,0,0,1,75,40,08-Sep-04,C,Others,both
2,523729,6534,29-Apr-06,0,0,0,0,0,,,left_only,0.402054,both,,,left_only,17-Oct-05,07-Nov-07,Second,A,2,both,0,0,0,0,,,,22-Jun-04,,,both
3,524931,6535,07-Feb-04,0,0,0,0,0,,,left_only,,left_only,,,left_only,01-Feb-04,18-Feb-04,First,E,2,both,0,0,0,0,,,,07-Feb-04,I,,both
4,521364,6529,28-Feb-06,15,1,0,0,7,,,left_only,0.845597,both,,,left_only,30-Mar-06,03-Apr-06,Second,A,2,both,0,0,0,1,1,70,40,04-Jul-03,I,Technology,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75273,500969,6539,03-Jan-05,0,0,0,0,0,,,left_only,,left_only,,,left_only,07-Aug-04,12-Feb-05,First,F,2,both,0,0,0,0,,,,14-Aug-04,,,both
75274,511952,6528,13-Feb-06,0,0,0,0,0,,,left_only,,left_only,2.0,1.0,both,10-Feb-06,25-Apr-06,Third,G,2,both,1,1,1,0,3,73,51,12-Sep-04,I,Real Estate,both
75275,521236,6554,24-May-05,0,0,0,0,0,20.0,0.927746,both,,left_only,,,left_only,19-Jun-05,01-Jul-05,First,B,2,both,0,0,0,0,1,92,37,11-May-05,G,Software Industry,both
75276,518817,6580,22-Dec-04,0,0,0,0,0,,,left_only,,left_only,,,left_only,22-Dec-04,06-Jan-05,First,E,2,both,0,0,0,0,3,76,44,24-Sep-04,E,Technology,both


In [7]:
train['outcome']  = 0

In [8]:
train.loc[(train.merge_1 == "both") |
          (train.merge_2 == 'both') |
          (train.merge_3 == 'both') & (train.Number_of_stall_visited > 0),'outcome'] = 1

In [9]:
train.head()

Unnamed: 0,Patient_ID,Health_Camp_ID,Registration_Date,Var1,Var2,Var3,Var4,Var5,Donation,Health_Score,merge_1,Health Score,merge_2,Number_of_stall_visited,Last_Stall_Visited_Number,merge_3,Camp_Start_Date,Camp_End_Date,Category1,Category2,Category3,merge_healthcamp,Online_Follower,LinkedIn_Shared,Twitter_Shared,Facebook_Shared,Income,Education_Score,Age,First_Interaction,City_Type,Employer_Category,merge_patientprofile,outcome
0,489652,6578,10-Sep-05,4,0,0,0,2,,,left_only,,left_only,2.0,1.0,both,16-Aug-05,14-Oct-05,Third,G,2,both,0,0,0,0,,,,06-Dec-04,,,both,1
1,507246,6578,18-Aug-05,45,5,0,0,7,,,left_only,,left_only,,,left_only,16-Aug-05,14-Oct-05,Third,G,2,both,0,0,0,0,1.0,75.0,40.0,08-Sep-04,C,Others,both,0
2,523729,6534,29-Apr-06,0,0,0,0,0,,,left_only,0.402054,both,,,left_only,17-Oct-05,07-Nov-07,Second,A,2,both,0,0,0,0,,,,22-Jun-04,,,both,1
3,524931,6535,07-Feb-04,0,0,0,0,0,,,left_only,,left_only,,,left_only,01-Feb-04,18-Feb-04,First,E,2,both,0,0,0,0,,,,07-Feb-04,I,,both,0
4,521364,6529,28-Feb-06,15,1,0,0,7,,,left_only,0.845597,both,,,left_only,30-Mar-06,03-Apr-06,Second,A,2,both,0,0,0,1,1.0,70.0,40.0,04-Jul-03,I,Technology,both,1


In [10]:
test = pd.merge(test_data, first_health_camp_data.drop('Unnamed: 4',axis = 1), how = 'left', on = ['Patient_ID','Health_Camp_ID'], indicator = "merge_1")
test = pd.merge(test, second_health_camp_data, how = 'left', on = ['Patient_ID','Health_Camp_ID'], indicator = "merge_2")
test = pd.merge(test, third_health_camp_data, how = 'left', on = ['Patient_ID','Health_Camp_ID'], indicator = "merge_3")
test = pd.merge(test, health_camp_details, how = 'left', on = ['Health_Camp_ID'], indicator = "merge_healthcamp")
test = pd.merge(test, patient_profile, how = 'left', on = ['Patient_ID'], indicator = "merge_patientprofile")

In [11]:
test.head()

Unnamed: 0,Patient_ID,Health_Camp_ID,Registration_Date,Var1,Var2,Var3,Var4,Var5,Donation,Health_Score,merge_1,Health Score,merge_2,Number_of_stall_visited,Last_Stall_Visited_Number,merge_3,Camp_Start_Date,Camp_End_Date,Category1,Category2,Category3,merge_healthcamp,Online_Follower,LinkedIn_Shared,Twitter_Shared,Facebook_Shared,Income,Education_Score,Age,First_Interaction,City_Type,Employer_Category,merge_patientprofile
0,505701,6548,21-May-06,1,0,0,0,2,,,left_only,,left_only,,,left_only,13-Jun-06,18-Aug-06,Third,G,2,both,0,0,0,0,0.0,,44.0,05-Feb-03,E,,both
1,500633,6584,02-Jun-06,0,0,0,0,0,,,left_only,,left_only,,,left_only,04-Aug-06,09-Aug-06,Second,A,2,both,0,1,0,0,1.0,67.0,41.0,11-Dec-04,D,Consulting,both
2,506945,6582,10-Aug-06,0,0,0,0,0,,,left_only,,left_only,,,left_only,06-Apr-06,07-Nov-07,First,F,2,both,0,0,0,0,,,,19-Apr-06,,,both
3,497447,6551,27-Aug-06,0,0,0,0,0,,,left_only,,left_only,,,left_only,13-Nov-06,18-Nov-06,Second,D,2,both,0,0,0,0,0.0,,47.0,25-Aug-06,B,,both
4,496446,6533,19-Sep-06,0,0,0,0,0,,,left_only,,left_only,,,left_only,20-Sep-06,23-Sep-06,First,E,2,both,0,0,0,0,,,,19-Sep-06,B,,both


In [12]:
train.columns

Index(['Patient_ID', 'Health_Camp_ID', 'Registration_Date', 'Var1', 'Var2',
       'Var3', 'Var4', 'Var5', 'Donation', 'Health_Score', 'merge_1',
       'Health Score', 'merge_2', 'Number_of_stall_visited',
       'Last_Stall_Visited_Number', 'merge_3', 'Camp_Start_Date',
       'Camp_End_Date', 'Category1', 'Category2', 'Category3',
       'merge_healthcamp', 'Online_Follower', 'LinkedIn_Shared',
       'Twitter_Shared', 'Facebook_Shared', 'Income', 'Education_Score', 'Age',
       'First_Interaction', 'City_Type', 'Employer_Category',
       'merge_patientprofile', 'outcome'],
      dtype='object')

In [13]:
date_cols = ['Registration_Date','Camp_Start_Date','Camp_End_Date','First_Interaction']

In [14]:
def to_date(df):
    for col in date_cols:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col],format = "%d-%b-%y")
    return df        

In [15]:
train = to_date(train)
test = to_date(test)

In [16]:
num_cols = ['Income', 'Education_Score', 'Age']

In [17]:
import pandas.api.types as ptypes
~(ptypes.is_numeric_dtype(train['Age']))

-1

In [18]:
def to_numeric(df):
    for col in df.columns:
        if (col in num_cols) & ~(ptypes.is_numeric_dtype(df[col])):
            df[col] = df[col].replace({"None":""})
            df[col] = pd.to_numeric(df[col], errors = 'coerce')
    return df

In [19]:
train = to_numeric(train)
test = to_numeric(test)

In [20]:
train.sort_values(['Patient_ID','Registration_Date'])

Unnamed: 0,Patient_ID,Health_Camp_ID,Registration_Date,Var1,Var2,Var3,Var4,Var5,Donation,Health_Score,merge_1,Health Score,merge_2,Number_of_stall_visited,Last_Stall_Visited_Number,merge_3,Camp_Start_Date,Camp_End_Date,Category1,Category2,Category3,merge_healthcamp,Online_Follower,LinkedIn_Shared,Twitter_Shared,Facebook_Shared,Income,Education_Score,Age,First_Interaction,City_Type,Employer_Category,merge_patientprofile,outcome
69348,485679,6578,2005-08-22,0,0,0,0,0,,,left_only,,left_only,4.0,4.0,both,2005-08-16,2005-10-14,Third,G,2,both,0,0,0,0,,,,2005-08-12,I,,both,1
64479,485679,6555,2005-08-31,0,0,0,0,0,,,left_only,,left_only,,,left_only,2005-09-15,2005-09-19,Second,A,2,both,0,0,0,0,,,,2005-08-12,I,,both,0
6484,485680,6543,2006-07-10,0,0,0,0,0,,,left_only,,left_only,,,left_only,2005-09-27,2007-11-07,First,F,2,both,0,0,0,0,,,,2006-07-10,A,,both,0
18999,485681,6580,2004-12-20,0,0,0,0,0,,,left_only,,left_only,,,left_only,2004-12-22,2005-01-06,First,E,2,both,0,0,0,1,0.0,,46.0,2004-12-19,G,,both,0
2604,485681,6526,2005-01-01,0,0,0,0,0,,,left_only,,left_only,,,left_only,2005-01-03,2005-02-20,First,E,2,both,0,0,0,1,0.0,,46.0,2004-12-19,G,,both,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18124,528657,6531,2004-12-11,0,0,0,0,0,20.0,0.670886,both,,left_only,,,left_only,2004-12-09,2004-12-14,First,C,2,both,0,0,0,0,,,,2004-10-25,D,,both,1
32744,528657,6580,2004-12-18,0,0,0,0,0,,,left_only,,left_only,,,left_only,2004-12-22,2005-01-06,First,E,2,both,0,0,0,0,,,,2004-10-25,D,,both,0
7632,528657,6526,2004-12-30,0,0,0,0,0,,,left_only,,left_only,,,left_only,2005-01-03,2005-02-20,First,E,2,both,0,0,0,0,,,,2004-10-25,D,,both,0
24471,528657,6536,2005-02-13,0,0,0,0,0,,,left_only,0.102063,both,,,left_only,2005-02-15,2005-02-18,Second,D,2,both,0,0,0,0,,,,2004-10-25,D,,both,1


# Drop unnecessary columns:

In [21]:
cols = ['merge_1','merge_2','merge_3','merge_healthcamp','merge_patientprofile']

In [22]:
train.drop(cols,axis = 1,inplace = True)
test.drop(cols,axis = 1,inplace = True)

# Missing Imputations:

In [23]:
from sklearn.impute import SimpleImputer

In [24]:
train.isnull().sum()

Patient_ID                       0
Health_Camp_ID                   0
Registration_Date              334
Var1                             0
Var2                             0
Var3                             0
Var4                             0
Var5                             0
Donation                     69060
Health_Score                 69060
Health Score                 67459
Number_of_stall_visited      68763
Last_Stall_Visited_Number    68763
Camp_Start_Date                  0
Camp_End_Date                    0
Category1                        0
Category2                        0
Category3                        0
Online_Follower                  0
LinkedIn_Shared                  0
Twitter_Shared                   0
Facebook_Shared                  0
Income                       53546
Education_Score              65345
Age                          51612
First_Interaction                0
City_Type                    33208
Employer_Category            60075
outcome             

# Mean Imputation:

In [25]:
mean_impute_cols = ['Age']
impute_mean = SimpleImputer(strategy = 'mean')
impute_mean.fit(train[mean_impute_cols])
train[mean_impute_cols] = impute_mean.transform(train[mean_impute_cols])
test[mean_impute_cols] = impute_mean.transform(test[mean_impute_cols])

# Frequent Imputation:

In [26]:
freq_impute_cols = ['Income', 'Education_Score', 'City_Type', 'Employer_Category']

In [27]:
impute_freq = SimpleImputer(strategy = 'most_frequent')
impute_freq.fit(train[freq_impute_cols])
train[freq_impute_cols] = impute_freq.transform(train[freq_impute_cols])
test[freq_impute_cols] = impute_freq.transform(test[freq_impute_cols])

# Zero Imputation:

In [28]:
zero_impute_cols = ['Donation', 'Health_Score', 'Health Score', 'Number_of_stall_visited', 'Last_Stall_Visited_Number']

In [29]:
train[zero_impute_cols] = train[zero_impute_cols].fillna(0)
test[zero_impute_cols] = test[zero_impute_cols].fillna(0)

# Missing Date Imputation:

In [30]:
def date_impute(df):
    midpoint = df['Camp_Start_Date'] + (df['Camp_End_Date'] - df['Camp_Start_Date'])/2
    df['Registration_Date'] = df['Registration_Date'].fillna(midpoint)
    df['Registration_Date'] = pd.to_datetime(df['Registration_Date'], format = "%Y-%m-%d")
    return df

In [31]:
train = date_impute(train)
test = date_impute(test)

# Feature engineering:

# 1.Duration of camp:

In [32]:
train['camp_duration'] = (train['Camp_End_Date'] - train['Camp_Start_Date']).dt.days
test['camp_duration'] = (test['Camp_End_Date'] - test['Camp_Start_Date']).dt.days

# 2.Registered before/after start of camp:

In [33]:
train['reg_start_diff'] = (train['Camp_Start_Date'] - train['Registration_Date']).dt.days
test['reg_start_diff'] = (test['Camp_Start_Date'] - test['Registration_Date']).dt.days

# 3.Days left for camp end:

In [34]:
train['days_for_camp_end'] = (train['Registration_Date'] - train['Camp_End_Date']).dt.days
test['days_for_camp_end'] = (test['Registration_Date'] - test['Camp_End_Date']).dt.days

# 4.Point in camp:

In [35]:
train['point_in_camp'] = 1 - train['days_for_camp_end']/train['camp_duration']
test['point_in_camp'] = 1 - test['days_for_camp_end']/test['camp_duration']

# Days since first and last interaction:

In [36]:
train['is_train'] = True
test['is_train'] = False
all_data = pd.concat([train,test])
all_data = all_data.reset_index(drop=True)
all_data = all_data.sort_values(['Patient_ID', 'Registration_Date'])
all_data = all_data.reset_index(drop=True)
patient_wise_visits = all_data.loc[:,['Patient_ID','Registration_Date']]
patient_wise_visits = patient_wise_visits.drop_duplicates()
patient_wise_visits = patient_wise_visits.reset_index(drop=True)
patient_wise_visits['Last_Interaction'] = patient_wise_visits.groupby('Patient_ID')['Registration_Date'].shift()
all_data = pd.merge(all_data,patient_wise_visits,on=['Patient_ID', 'Registration_Date'],how='left')
all_data.loc[all_data['Last_Interaction'].isna(),'Last_Interaction'] = all_data['First_Interaction']
all_data['days_since_first_interaction'] = (all_data['Registration_Date'] - all_data['First_Interaction']).dt.days
all_data['days_since_last_interaction'] = (all_data['Registration_Date'] - all_data['Last_Interaction']).dt.days

# 6.Historic Features:

In [37]:
import sqlite3
#Make the db in memory
conn = sqlite3.connect(':memory:')
#write the tables
all_data.to_sql('all_data', conn, index=False)

qry = '''
        select a.Patient_ID, a.Registration_Date,
        count(b.Health_Camp_ID) as prev_registration_count,
        sum(b.Outcome) / count(b.Health_Camp_ID) as prev_response_rate,
        sum(case when b.Category2 = 'First' then b.Outcome else NULL end)/
        count( case when b.Category2 = 'First' then b.Health_Camp_ID else NULL end) as prev_first_response_rate,
        sum(case when b.Category2 = 'Second' then b.Outcome else NULL end)/
        count( case when b.Category2 = 'Second' then b.Health_Camp_ID else NULL end) as prev_second_response_rate,
        sum(case when b.Category2 = 'Third' then b.Outcome else NULL end)/
        count( case when b.Category2 = 'Third' then b.Health_Camp_ID else NULL end) as prev_third_response_rate,
        sum(b.Donation) as prev_donation,
        avg(case when b.Category2 = 'First' then b.Donation else NULL end) as prev_donation_avg,
        avg(case when b.Category2 = 'First' then b.Health_Score else NULL end) as prev_avg_health_score1,
        avg(case when b.Category2 = 'Second' then b.[Health Score] else NULL end) as prev_avg_health_score2,
        sum(b.Number_of_stall_visited) as prev_stall_count,
        avg(case when b.Category2 = 'Third' then b.Number_of_stall_visited else NULL end) as prev_stall_count_avg,
        count(distinct b.Last_Stall_Visited_Number) as prev_distinct_stalls
        from all_data as a left join all_data as b 
        on a.Patient_ID = b.Patient_ID 
        and b.Registration_Date < a.Registration_Date 
        group by a.Patient_ID, a.Registration_Date
      '''
patient_history = pd.read_sql_query(qry, conn)
patient_history['Registration_Date'] = pd.to_datetime(patient_history['Registration_Date'], format='%Y-%m-%d')

  method=method,


In [38]:
all_data = pd.merge(all_data,patient_history,on=['Patient_ID','Registration_Date'],how='left')

# Missing imputation for new features:

In [39]:
new_feat = ['prev_response_rate', 'prev_first_response_rate', 'prev_second_response_rate', 'prev_third_response_rate',
           'prev_donation', 'prev_donation_avg', 'prev_avg_health_score1', 'prev_avg_health_score2', 'prev_stall_count',
           'prev_stall_count_avg']
all_data[new_feat] = all_data[new_feat].fillna(-999)

In [40]:
train_final_data = all_data.loc[all_data['is_train']==True]
test_final_data = all_data.loc[all_data['is_train']==False]

# Constants:

In [41]:
ID1 = 'Patient_ID'
ID2 = 'Health_Camp_ID'
target = 'outcome'
date_columns = ['Registration_Date', 'Camp_Start_Date', 'Camp_End_Date', 'First_Interaction']
discrete_columns = ['Var1', 'Var2', 'Var3', 'Var4', 'Var5', 'Category1', 'Category2', 'Category3', 'Online_Follower', 
                   'LinkedIn_Shared', 'Twitter_Shared', 'Facebook_Shared', 'City_Type', 'Employer_Category']
ignore_cols = [ID1, ID2, target, 'Registration_Date', 'Camp_Start_Date', 'Camp_End_Date', 'First_Interaction', 
               'Last_Interaction', 'Donation', 'Health_Score', 'Health Score', 'Number_of_stall_visited', 
               'Last_Stall_Visited_Number','is_train']

In [42]:
random_state = 1234

In [43]:
should_ohe = True
should_scale = True

# Scaling:

In [44]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import FunctionTransformer

In [45]:
if should_scale:
    for col in train_final_data.columns:
        if (col != target) and (col != ID1) and (col != ID2) and (col not in date_columns) and (col not in discrete_columns) and (col not in ignore_cols):
            mms = MinMaxScaler()
            ss = StandardScaler()
            rs = RobustScaler()
            pt = PowerTransformer()
            ft_log = FunctionTransformer(np.log)
            
            train_final_data[f"{col}_MMS"] = mms.fit_transform(train_final_data[[col]])
            test_final_data[f"{col}_MMS"] = mms.transform(test_final_data[[col]])
            
            train_final_data[f"{col}_SS"] = ss.fit_transform(train_final_data[[col]])
            test_final_data[f"{col}_SS"] = ss.transform(test_final_data[[col]])
            
            train_final_data[f"{col}_RS"] = rs.fit_transform(train_final_data[[col]])
            test_final_data[f"{col}_RS"] = rs.transform(test_final_data[[col]])
            
            train_final_data[f"{col}_PT"] = pt.fit_transform(train_final_data[[col]])
            test_final_data[f"{col}_PT"] = pt.transform(test_final_data[[col]])
            
#             train_final_data[f"{col}_FT_log"] = ft_log.fit_transform(train_final_data[[col]])
#             test_final_data[f"{col}_FT_log"] = ft_log.transform(test_final_data[[col]])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # Remove the CWD from sys.path while we load stuff.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # This is added back by InteractiveShellApp.init_path()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  del sys.path[0]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc

# One hot encoding:

In [46]:
cols_for_ohe = ['Category1', 'Category2', 'City_Type', 'Employer_Category']

In [47]:
train_final_data = pd.concat([train_final_data.drop(cols_for_ohe,axis=1),pd.get_dummies(train_final_data[cols_for_ohe])],axis=1)
test_final_data = pd.concat([test_final_data.drop(cols_for_ohe,axis=1),pd.get_dummies(test_final_data[cols_for_ohe])],axis=1)

# Test train split:

In [48]:
from sklearn.model_selection import train_test_split

In [49]:
ignore_cols_train = [ID1, ID2, target, 'Registration_Date', 'Camp_Start_Date', 'Camp_End_Date', 'First_Interaction', 
               'Last_Interaction', 'Donation', 'Health_Score', 'Health Score', 'Number_of_stall_visited', 
               'Last_Stall_Visited_Number','is_train','Category2_B']
ignore_cols = [ID1, ID2, target, 'Registration_Date', 'Camp_Start_Date', 'Camp_End_Date', 'First_Interaction', 
               'Last_Interaction', 'Donation', 'Health_Score', 'Health Score', 'Number_of_stall_visited', 
               'Last_Stall_Visited_Number','is_train']
X, y = train_final_data.drop(ignore_cols_train, axis=1), train_final_data[target]
X_test = test_final_data.drop(ignore_cols, axis=1)

In [50]:
set(X.columns) - set(X_test.columns)

set()

In [51]:

X_test.shape

(35249, 147)

In [52]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.30, random_state=random_state)

# Base Models:

In [53]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
import xgboost as xgb
from sklearn.metrics import f1_score, auc, confusion_matrix, roc_curve, roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import plot_confusion_matrix

In [62]:
n_esitmators = 1000
classifiers = {
    "DT": DecisionTreeClassifier(),
    "RF": RandomForestClassifier(n_estimators=n_esitmators, random_state=random_state),
    "GBM": GradientBoostingClassifier(n_estimators=n_esitmators, random_state=random_state),
    "GBM_ES": GradientBoostingClassifier(n_estimators=n_esitmators, validation_fraction=0.2, 
                                         n_iter_no_change=5,tol=0.01,random_state=random_state)
}

# Prepare learning rate shrinkage:

In [63]:
def learning_rate_010_decay_power_099(current_iter):
    base_learning_rate = 0.1
    lr = base_learning_rate  * np.power(.99, current_iter)
    return lr if lr > 1e-3 else 1e-3

def learning_rate_010_decay_power_0995(current_iter):
    base_learning_rate = 0.1
    lr = base_learning_rate  * np.power(.995, current_iter)
    return lr if lr > 1e-3 else 1e-3

def learning_rate_005_decay_power_099(current_iter):
    base_learning_rate = 0.05
    lr = base_learning_rate  * np.power(.99, current_iter)
    return lr if lr > 1e-3 else 1e-3

Use test subset for early stopping criterion

In [64]:
import lightgbm as lgb
fit_params={"early_stopping_rounds":30, 
            "eval_metric" : 'auc', 
            "eval_set" : [(X_val,y_val)],
            'eval_names': ['valid'],
            #'callbacks': [lgb.reset_parameter(learning_rate=learning_rate_010_decay_power_099)],
            'verbose': 100,
            'categorical_feature': 'auto'}

Set up HyperParameter search

In [57]:
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform
param_test ={'num_leaves': sp_randint(6, 50), 
             'min_child_samples': sp_randint(100, 500), 
             'min_child_weight': [1e-5, 1e-3, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3, 1e4],
             'subsample': sp_uniform(loc=0.2, scale=0.8), 
             'colsample_bytree': sp_uniform(loc=0.4, scale=0.6),
             'reg_alpha': [0, 1e-1, 1, 2, 5, 7, 10, 50, 100],
             'reg_lambda': [0, 1e-1, 1, 5, 10, 20, 50, 100]}

In [58]:
#This parameter defines the number of HP points to be tested
n_HP_points_to_test = 100

import lightgbm as lgb
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

#n_estimators is set to a "large value". The actual number of trees build will depend on early stopping and 5000 define only the absolute maximum
clf = lgb.LGBMClassifier(max_depth=-1, random_state=314, silent=True, metric='None', n_jobs=4, n_estimators=5000)
gs = RandomizedSearchCV(
    estimator=clf, param_distributions=param_test, 
    n_iter=n_HP_points_to_test,
    scoring='roc_auc',
    cv=3,
    refit=True,
    random_state=314,
    verbose=True)

In [None]:
gs.fit(X_train, y_train, **fit_params)
print('Best score reached: {} with params: {} '.format(gs.best_score_, gs.best_params_))

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Training until validation scores don't improve for 30 rounds.
[100]	valid's auc: 0.864678
[200]	valid's auc: 0.866965
[300]	valid's auc: 0.867963
[400]	valid's auc: 0.868569
Early stopping, best iteration is:
[422]	valid's auc: 0.868665
Training until validation scores don't improve for 30 rounds.
[100]	valid's auc: 0.864568
[200]	valid's auc: 0.867167
[300]	valid's auc: 0.868111
[400]	valid's auc: 0.868567
[500]	valid's auc: 0.869049
[600]	valid's auc: 0.869238
Early stopping, best iteration is:
[579]	valid's auc: 0.869242
Training until validation scores don't improve for 30 rounds.
[100]	valid's auc: 0.864532
[200]	valid's auc: 0.86697
[300]	valid's auc: 0.867778
[400]	valid's auc: 0.86841
Early stopping, best iteration is:
[445]	valid's auc: 0.868672
Training until validation scores don't improve for 30 rounds.
[100]	valid's auc: 0.864746
[200]	valid's auc: 0.867362
[300]	valid's auc: 0.868238
[400]	valid's auc: 0.868784
[500]	valid's auc: 0.869249
Early stopping, best iteration is

[Parallel(n_jobs=1)]: Done 300 out of 300 | elapsed: 50.9min finished


Training until validation scores don't improve for 30 rounds.
[100]	valid's auc: 0.869727
[200]	valid's auc: 0.872551
[300]	valid's auc: 0.874022
[400]	valid's auc: 0.875265
[500]	valid's auc: 0.876093
[600]	valid's auc: 0.876759
[700]	valid's auc: 0.877503
[800]	valid's auc: 0.877977
[900]	valid's auc: 0.878361
Early stopping, best iteration is:
[913]	valid's auc: 0.87841
Best score reached: 0.8724027449215775 with params: {'colsample_bytree': 0.8665631328558623, 'min_child_samples': 122, 'min_child_weight': 0.1, 'num_leaves': 48, 'reg_alpha': 2, 'reg_lambda': 50, 'subsample': 0.7252600946741159} 


In [59]:
opt_parameters = {'colsample_bytree': 0.8665631328558623, 'min_child_samples': 122, 'min_child_weight': 0.1, 'num_leaves': 48, 'reg_alpha': 2, 'reg_lambda': 50, 'subsample': 0.7252600946741159}

In [60]:
clf_sw = lgb.LGBMClassifier(**clf.get_params())
#set optimal parameters
clf_sw.set_params(**opt_parameters)

LGBMClassifier(boosting_type='gbdt', class_weight=None,
               colsample_bytree=0.8665631328558623, importance_type='split',
               learning_rate=0.1, max_depth=-1, metric='None',
               min_child_samples=122, min_child_weight=0.1, min_split_gain=0.0,
               n_estimators=5000, n_jobs=4, num_leaves=48, objective=None,
               random_state=314, reg_alpha=2, reg_lambda=50, silent=True,
               subsample=0.7252600946741159, subsample_for_bin=200000,
               subsample_freq=0)

In [65]:
clf_sw.fit(X_train, y_train, **fit_params, callbacks=[lgb.reset_parameter(learning_rate=learning_rate_010_decay_power_0995)])

Training until validation scores don't improve for 30 rounds.
[100]	valid's auc: 0.868707
[200]	valid's auc: 0.870869
[300]	valid's auc: 0.871713
[400]	valid's auc: 0.872077
[500]	valid's auc: 0.872242
[600]	valid's auc: 0.872405
[700]	valid's auc: 0.872475
[800]	valid's auc: 0.872514
[900]	valid's auc: 0.872548
[1000]	valid's auc: 0.872563
[1100]	valid's auc: 0.872574
Early stopping, best iteration is:
[1103]	valid's auc: 0.872574


LGBMClassifier(boosting_type='gbdt', class_weight=None,
               colsample_bytree=0.8665631328558623, importance_type='split',
               learning_rate=0.1, max_depth=-1, metric='None',
               min_child_samples=122, min_child_weight=0.1, min_split_gain=0.0,
               n_estimators=5000, n_jobs=4, num_leaves=48, objective=None,
               random_state=314, reg_alpha=2, reg_lambda=50, silent=True,
               subsample=0.7252600946741159, subsample_for_bin=200000,
               subsample_freq=0)

In [66]:
#Configure locally from hardcoded values
#clf_final = lgb.LGBMClassifier(**clf.get_params())
#set optimal parameters
#clf_final.set_params(**opt_parameters)

#Train the final model with learning rate decay
#clf_final.fit(X_train, y_train, **fit_params, callbacks=[lgb.reset_parameter(learning_rate=learning_rate_010_decay_power_0995)])

In [67]:
y_pred = clf_sw.predict(X_test)


In [68]:
X_test.columns

Index(['Var1', 'Var2', 'Var3', 'Var4', 'Var5', 'Category3', 'Online_Follower',
       'LinkedIn_Shared', 'Twitter_Shared', 'Facebook_Shared',
       ...
       'Employer_Category_Food', 'Employer_Category_Health',
       'Employer_Category_Manufacturing', 'Employer_Category_Others',
       'Employer_Category_Real Estate', 'Employer_Category_Retail',
       'Employer_Category_Software Industry', 'Employer_Category_Technology',
       'Employer_Category_Telecom', 'Employer_Category_Transport'],
      dtype='object', length=147)

In [69]:
set(X.columns) - set(X_test.columns)

set()

In [70]:
submission_df = test_final_data.copy()
submission_df['Outcome'] = y_pred
submission_df = submission_df[['Patient_ID', 'Health_Camp_ID','Outcome']]
submission_df.to_csv('submission_1st_attempt.csv', index=False)

In [71]:
from google.colab import files
files.download("submission_1st_attempt.csv")

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>