# Modelling for Madugital
Madugital is a company that sells honey online. Its CRM Manager wants to know which customers are most likely to buy, based on their activity on the website, so that Madugital can contact those customers sooner and with the right treatment.

Madugital collects leads through forms filled in by customers who reach its website via various marketing channels. Some of these customers go on to buy the product and some do not; the outcome is recorded in the `Converted` column.

## 1. Data Preparation

### Import Packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

### Data Overview
Load the data, then inspect the first and last rows, the shape, and the column types.

In [2]:
data=pd.read_csv("Data Madugital.csv")

In [3]:
data.head()

Unnamed: 0,Prospect ID,Lead Number,Lead Origin,Lead Source,Do Not Email,Do Not Call,Converted,TotalVisits,Total Time Spent on Website,Page Views Per Visit,...,Get updates on DM Content,Lead Profile,City,Asymmetrique Activity Index,Asymmetrique Profile Index,Asymmetrique Activity Score,Asymmetrique Profile Score,I agree to pay the amount through cheque,A free copy of Mastering The Interview,Last Notable Activity
0,7927b2df-8bba-4d29-b9a2-b6e0beafe620,660737,API,Olark Chat,No,No,0,0.0,0,0.0,...,No,Select,Select,02.Medium,02.Medium,15.0,15.0,No,No,Modified
1,2a272436-5132-4136-86fa-dcc88c88f482,660728,API,Organic Search,No,No,0,5.0,674,2.5,...,No,Select,Select,02.Medium,02.Medium,15.0,15.0,No,No,Email Opened
2,8cc8c611-a219-4f35-ad23-fdfd2656bd8a,660727,Landing Page Submission,Direct Traffic,No,No,1,2.0,1532,2.0,...,No,Potential Lead,Jakarta,02.Medium,01.High,14.0,20.0,No,Yes,Email Opened
3,0cc2df48-7cf4-4e39-9de9-19797f9b38cc,660719,Landing Page Submission,Direct Traffic,No,No,0,1.0,305,1.0,...,No,Select,Jakarta,02.Medium,01.High,13.0,17.0,No,No,Modified
4,3256f628-e534-4826-9d63-4a8b88782852,660681,Landing Page Submission,Google,No,No,1,2.0,1428,1.0,...,No,Select,Jakarta,02.Medium,01.High,15.0,18.0,No,No,Modified


In [4]:
data.tail()

Unnamed: 0,Prospect ID,Lead Number,Lead Origin,Lead Source,Do Not Email,Do Not Call,Converted,TotalVisits,Total Time Spent on Website,Page Views Per Visit,...,Get updates on DM Content,Lead Profile,City,Asymmetrique Activity Index,Asymmetrique Profile Index,Asymmetrique Activity Score,Asymmetrique Profile Score,I agree to pay the amount through cheque,A free copy of Mastering The Interview,Last Notable Activity
9235,19d6451e-fcd6-407c-b83b-48e1af805ea9,579564,Landing Page Submission,Direct Traffic,Yes,No,1,8.0,1845,2.67,...,No,Potential Lead,Jakarta,02.Medium,01.High,15.0,17.0,No,No,Email Marked Spam
9236,82a7005b-7196-4d56-95ce-a79f937a158d,579546,Landing Page Submission,Direct Traffic,No,No,0,2.0,238,2.0,...,No,Potential Lead,Jakarta,02.Medium,01.High,14.0,19.0,No,Yes,SMS Sent
9237,aac550fe-a586-452d-8d3c-f1b62c94e02c,579545,Landing Page Submission,Direct Traffic,Yes,No,0,2.0,199,2.0,...,No,Potential Lead,Jakarta,02.Medium,01.High,13.0,20.0,No,Yes,SMS Sent
9238,5330a7d1-2f2b-4df4-85d6-64ca2f6b95b9,579538,Landing Page Submission,Google,No,No,1,3.0,499,3.0,...,No,,Other Metro Cities,02.Medium,02.Medium,15.0,16.0,No,No,SMS Sent
9239,571b5c8e-a5b2-4d57-8574-f2ffb06fdeff,579533,Landing Page Submission,Direct Traffic,No,No,1,6.0,1279,3.0,...,No,Potential Lead,Other Cities,02.Medium,01.High,15.0,18.0,No,Yes,Modified


In [5]:
data.shape

(9240, 37)

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9240 entries, 0 to 9239
Data columns (total 37 columns):
 #   Column                                          Non-Null Count  Dtype  
---  ------                                          --------------  -----  
 0   Prospect ID                                     9240 non-null   object 
 1   Lead Number                                     9240 non-null   int64  
 2   Lead Origin                                     9240 non-null   object 
 3   Lead Source                                     9204 non-null   object 
 4   Do Not Email                                    9240 non-null   object 
 5   Do Not Call                                     9240 non-null   object 
 6   Converted                                       9240 non-null   int64  
 7   TotalVisits                                     9103 non-null   float64
 8   Total Time Spent on Website                     9240 non-null   int64  
 9   Page Views Per Visit                     

### Data Quality Check
Check for missing values, inspect the unique values of selected columns, and look for duplicated rows.

In [7]:
data.isnull().sum()

Prospect ID                                          0
Lead Number                                          0
Lead Origin                                          0
Lead Source                                         36
Do Not Email                                         0
Do Not Call                                          0
Converted                                            0
TotalVisits                                        137
Total Time Spent on Website                          0
Page Views Per Visit                               137
Last Activity                                      103
Country                                           2461
Specialization                                    1438
How did you hear about Madugital                  2207
What is your current occupation                   2690
What matters most to you in choosing a product    2709
Search                                               0
Magazine                                             0
Newspaper 

In [8]:
data.isna().mean() * 100

Prospect ID                                        0.000000
Lead Number                                        0.000000
Lead Origin                                        0.000000
Lead Source                                        0.389610
Do Not Email                                       0.000000
Do Not Call                                        0.000000
Converted                                          0.000000
TotalVisits                                        1.482684
Total Time Spent on Website                        0.000000
Page Views Per Visit                               1.482684
Last Activity                                      1.114719
Country                                           26.634199
Specialization                                    15.562771
How did you hear about Madugital                  23.885281
What is your current occupation                   29.112554
What matters most to you in choosing a product    29.318182
Search                                  
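
The counts and percentages above can be combined into one table, which makes it easier to sort columns by how incomplete they are. This is a minimal sketch on a small made-up frame; `missing_summary` is a hypothetical helper name and is not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value count and percentage per column, worst first."""
    summary = pd.DataFrame({
        "missing": df.isna().sum(),
        "percent": df.isna().mean() * 100,
    })
    # Keep only the columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("percent", ascending=False)

# Toy data standing in for the lead table (values are illustrative only)
toy = pd.DataFrame({
    "Lead Source": ["Google", None, "Olark Chat", "Google"],
    "TotalVisits": [0.0, 5.0, np.nan, np.nan],
    "Converted": [0, 1, 0, 1],
})
summary = missing_summary(toy)
print(summary)
```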

In [9]:
data['Lead Source'].unique()

array(['Olark Chat', 'Organic Search', 'Direct Traffic', 'Google',
       'Referral Sites', 'Welingak Website', 'Reference', 'google',
       'Facebook', nan, 'blog', 'Pay per Click Ads', 'bing',
       'Social Media', 'WeLearn', 'Click2call', 'Live Chat',
       'welearnblog_Home', 'youtubechannel', 'testone', 'Press_Release',
       'NC_EDM'], dtype=object)

In [10]:
data['TotalVisits'].unique()

array([  0.,   5.,   2.,   1.,   4.,   8.,  11.,   6.,   3.,   7.,  13.,
        17.,  nan,   9.,  12.,  10.,  16.,  14.,  21.,  15.,  22.,  19.,
        18.,  20.,  43.,  30.,  23.,  55., 141.,  25.,  27.,  29.,  24.,
        28.,  26.,  74.,  41.,  54., 115., 251.,  32.,  42.])

In [11]:
data['Page Views Per Visit'].unique()

array([ 0.  ,  2.5 ,  2.  ,  1.  ,  4.  ,  8.  ,  2.67, 11.  ,  5.  ,
        6.  ,  3.  ,  1.33,  1.5 ,  3.5 ,  7.  ,  2.33, 13.  ,  8.5 ,
        5.5 ,  1.67,   nan,  4.5 ,  3.33, 16.  , 12.  ,  1.71,  1.8 ,
        6.5 ,  4.33, 14.  ,  3.4 , 10.  ,  1.25,  1.75,  2.63, 15.  ,
        2.25,  3.67,  1.43,  9.  ,  2.6 ,  4.75,  1.27,  3.25,  5.33,
        2.57,  2.17,  2.75,  2.8 ,  2.2 ,  2.86,  3.91,  1.4 ,  5.67,
        3.2 ,  1.38,  2.09,  2.4 , 55.  ,  5.25,  6.71,  3.57,  2.22,
        1.83,  3.6 ,  1.2 ,  1.57,  1.56,  5.4 ,  4.25,  1.31,  1.6 ,
        2.9 ,  1.23,  1.78,  3.83,  7.5 ,  1.14,  2.71,  1.45,  2.38,
        1.86,  2.29,  1.21, 12.33,  3.43,  2.56,  6.33,  1.64,  8.21,
        4.4 ,  3.17,  8.33,  1.48,  1.22, 24.  ,  3.75,  6.67,  1.54,
        2.13,  2.14,  2.45,  3.29,  4.17,  1.63,  3.38,  1.17, 14.5 ,
        3.8 ,  1.19,  3.82,  2.83,  1.93, 11.5 ,  2.08])

In [12]:
data['Country'].unique()

array([nan, 'Indonesia', 'Russia', 'Kuwait', 'Oman',
       'United Arab Emirates', 'United States', 'Australia',
       'United Kingdom', 'Bahrain', 'Ghana', 'Singapore', 'Qatar',
       'Saudi Arabia', 'Belgium', 'France', 'Sri Lanka', 'China',
       'Canada', 'Netherlands', 'Sweden', 'Nigeria', 'Hong Kong',
       'Germany', 'Asia/Pacific Region', 'Uganda', 'Kenya', 'Italy',
       'South Africa', 'Tanzania', 'unknown', 'Malaysia', 'Liberia',
       'Switzerland', 'Denmark', 'Philippines', 'Bangladesh', 'Vietnam',
       'India'], dtype=object)

In [13]:
data['Specialization'].unique()

array(['Select', 'Business Administration', 'Media and Advertising', nan,
       'Supply Chain Management', 'IT Projects Management',
       'Finance Management', 'Travel and Tourism',
       'Human Resource Management', 'Marketing Management',
       'Banking, Investment And Insurance', 'International Business',
       'E-COMMERCE', 'Operations Management', 'Retail Management',
       'Services Excellence', 'Hospitality Management',
       'Rural and Agribusiness', 'Healthcare Management', 'E-Business'],
      dtype=object)

In [14]:
data['How did you hear about Madugital'].unique()

array(['Select', 'Word Of Mouth', 'Other', nan, 'Online Search',
       'Multiple Sources', 'Advertisements', 'Student of SomeSchool',
       'Email', 'Social Media', 'SMS'], dtype=object)

In [15]:
data['What is your current occupation'].unique()

array(['Unemployed', 'Student', nan, 'Working Professional',
       'Businessman', 'Other', 'Housewife'], dtype=object)

In [16]:
data['What matters most to you in choosing a product'].unique()

array(['Healthy for life', nan, 'Branding', 'Other'], dtype=object)

In [17]:
data['Tags'].unique()

array(['Interested in other courses', 'Ringing',
       'Will revert after reading the email', nan, 'Lost to EINS',
       'In confusion whether part time or DLP', 'Busy', 'switched off',
       'in touch with EINS', 'Already a student',
       'Diploma holder (Not Eligible)', 'Graduation in progress',
       'Closed by Horizzon', 'number not provided', 'opp hangup',
       'Not doing further education', 'invalid number',
       'wrong number given', 'Interested  in full time MBA',
       'Still Thinking', 'Lost to Others',
       'Shall take in the next coming month', 'Lateral student',
       'Interested in Next batch', 'Recognition issue (DEC approval)',
       'Want to take admission but has financial problems',
       'University not recognized'], dtype=object)

In [18]:
data['Lead Quality'].unique()

array(['Low in Relevance', nan, 'Might be', 'Not Sure', 'Worst',
       'High in Relevance'], dtype=object)

In [19]:
data['Lead Profile'].unique()

array(['Select', 'Potential Lead', nan, 'Other Leads', 'Lateral Student',
       'Dual Specialization Student', 'Student of SomeSchool'],
      dtype=object)
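
Several columns contain the value 'Select', which looks like an untouched dropdown default rather than a real answer. That reading is an interpretation; nothing in the data itself confirms it. Under that assumption, one way to see how much of a column is effectively unanswered is to count 'Select' together with NaN, sketched here on toy data:

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values); 'Select' is ASSUMED to mean "no answer"
toy = pd.DataFrame({
    "Specialization": ["Select", "Finance Management", np.nan, "Select"],
    "Lead Profile": ["Potential Lead", "Select", "Other Leads", np.nan],
})

# Treat both NaN and the 'Select' placeholder as unanswered
unanswered = toy.isna() | toy.eq("Select")
unanswered_pct = unanswered.mean() * 100
print(unanswered_pct)
```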

In [20]:
data['Asymmetrique Activity Score'].unique()

array([15., 14., 13., 17., 16., 11., 12., 10.,  9.,  8., 18., nan,  7.])

In [21]:
data['Asymmetrique Profile Score'].unique()

array([15., 20., 17., 18., 14., 16., 13., 19., 12., nan, 11.])

In [22]:
data['Asymmetrique Activity Index'].unique()

array(['02.Medium', '01.High', '03.Low', nan], dtype=object)

In [23]:
data['Asymmetrique Profile Index'].unique()

array(['02.Medium', '01.High', '03.Low', nan], dtype=object)

In [24]:
data.loc[data['Prospect ID'].duplicated()]

Unnamed: 0,Prospect ID,Lead Number,Lead Origin,Lead Source,Do Not Email,Do Not Call,Converted,TotalVisits,Total Time Spent on Website,Page Views Per Visit,...,Get updates on DM Content,Lead Profile,City,Asymmetrique Activity Index,Asymmetrique Profile Index,Asymmetrique Activity Score,Asymmetrique Profile Score,I agree to pay the amount through cheque,A free copy of Mastering The Interview,Last Notable Activity


In [25]:
data.duplicated().sum()

0

### Handling Missing Value
Fill the missing values and correct mistyped or inconsistent labels.

In [26]:
data_dropped=data.copy()

In [27]:
data_dropped['Lead Source']=data_dropped['Lead Source'].replace('google','Google')

In [28]:
data_dropped['Lead Source']=data_dropped['Lead Source'].replace('Facebook','Social Media')

In [29]:
data_dropped['Lead Source']=data_dropped['Lead Source'].replace('welearnblog_Home','blog')

In [30]:
data_dropped['Lead Source']=data_dropped['Lead Source'].fillna('Unknown')
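
The three `replace` calls and the `fillna` on `Lead Source` can also be written with a single mapping, which keeps all label corrections in one place. This is a sketch on toy data; `lead_source_fixes` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Same corrections as in the cells above, collected in one dictionary
lead_source_fixes = {
    "google": "Google",
    "Facebook": "Social Media",
    "welearnblog_Home": "blog",
}

toy = pd.Series(
    ["google", "Facebook", np.nan, "welearnblog_Home", "Olark Chat"],
    name="Lead Source",
)
# Apply the label fixes, then label the remaining gaps as 'Unknown'
cleaned = toy.replace(lead_source_fixes).fillna("Unknown")
print(cleaned.tolist())
```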

In [31]:
data_dropped['TotalVisits']=data_dropped['TotalVisits'].fillna(data_dropped['TotalVisits'].mean())

In [32]:
data_dropped['Page Views Per Visit']=data_dropped['Page Views Per Visit'].fillna(data_dropped['Page Views Per Visit'].mean())
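
`TotalVisits` and `Page Views Per Visit` both contain extreme values (251 visits and 55 page views per visit appear in the unique-value listings), and the mean is sensitive to such values. The notebook fills with the mean; the sketch below only illustrates, on made-up numbers, how the median would differ as a fill value.

```python
import numpy as np
import pandas as pd

# Illustrative values with one extreme observation, similar in spirit to TotalVisits
visits = pd.Series([0.0, 2.0, 3.0, 4.0, 251.0, np.nan])

mean_fill = visits.mean()      # pulled upward by the outlier
median_fill = visits.median()  # robust to the outlier

filled_mean = visits.fillna(mean_fill)
filled_median = visits.fillna(median_fill)
print(mean_fill, median_fill)
```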

In [33]:
data_dropped['Country']=data_dropped['Country'].fillna('unknown')

In [34]:
data_dropped['Specialization']=data_dropped['Specialization'].fillna('Select')

In [35]:
data_dropped['How did you hear about Madugital']=data_dropped['How did you hear about Madugital'].fillna('Select')

In [36]:
data_dropped['What is your current occupation']=data_dropped['What is your current occupation'].fillna('Unknown')

In [37]:
data_dropped['What matters most to you in choosing a product']=data_dropped['What matters most to you in choosing a product'].fillna('Unknown')

In [38]:
data_dropped['Tags']=data_dropped['Tags'].fillna('Unknown')

In [39]:
data_dropped['Lead Quality']=data_dropped['Lead Quality'].fillna('Unknown')

In [40]:
data_dropped['Lead Profile']=data_dropped['Lead Profile'].fillna('Select')

In [41]:
data_dropped['Asymmetrique Activity Score']=data_dropped['Asymmetrique Activity Score'].fillna(data_dropped['Asymmetrique Activity Score'].mean())

In [42]:
data_dropped['Asymmetrique Profile Score']=data_dropped['Asymmetrique Profile Score'].fillna(data_dropped['Asymmetrique Profile Score'].mean())

In [43]:
data_dropped['Asymmetrique Activity Index']=data_dropped['Asymmetrique Activity Index'].fillna('Unknown')

In [44]:
data_dropped['Asymmetrique Profile Index']=data_dropped['Asymmetrique Profile Index'].fillna('Unknown')

In [45]:
data_dropped.isna().mean() * 100

Prospect ID                                        0.000000
Lead Number                                        0.000000
Lead Origin                                        0.000000
Lead Source                                        0.000000
Do Not Email                                       0.000000
Do Not Call                                        0.000000
Converted                                          0.000000
TotalVisits                                        0.000000
Total Time Spent on Website                        0.000000
Page Views Per Visit                               0.000000
Last Activity                                      1.114719
Country                                            0.000000
Specialization                                     0.000000
How did you hear about Madugital                   0.000000
What is your current occupation                    0.000000
What matters most to you in choosing a product     0.000000
Search                                  

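The columns treated above no longer contain missing values. Among the columns shown, only `Last Activity` still has missing values (about 1.11%); it is dropped before modelling.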
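
The per-column `fillna` calls above can be condensed by passing a dictionary to `DataFrame.fillna`, which maps each column to its own fill value. A minimal sketch on toy data (the values are illustrative; the fill rules mirror the ones used above):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Country": ["Indonesia", np.nan, "India"],
    "Tags": [np.nan, "Ringing", "Busy"],
    "Asymmetrique Activity Score": [15.0, np.nan, 13.0],
})

# One fill value per column: labels for categoricals, the mean for numeric scores
fill_values = {
    "Country": "unknown",
    "Tags": "Unknown",
    "Asymmetrique Activity Score": toy["Asymmetrique Activity Score"].mean(),
}
filled = toy.fillna(value=fill_values)
print(filled)
```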

## 2. Finding Data Insight 
Compare the mean of each numeric column between converted and non-converted leads, and cross-tabulate the categorical columns against `Converted`.

In [46]:
# Select the column before aggregating so non-numeric columns are not averaged
data_dropped.groupby('Converted')[['TotalVisits']].mean()

Converted,TotalVisits
0,3.330423
1,3.628341


In [47]:
# Select the column before aggregating so non-numeric columns are not averaged
data_dropped.groupby('Converted')[['Page Views Per Visit']].mean()

Converted,Page Views Per Visit
0,2.368416
1,2.353896


In [48]:
# Select the column before aggregating so non-numeric columns are not averaged
data_dropped.groupby('Converted')[['Total Time Spent on Website']].mean()

Converted,Total Time Spent on Website
0,330.404473
1,738.546757


In [49]:
# Select the column before aggregating so non-numeric columns are not averaged
data_dropped.groupby('Converted')[['Asymmetrique Profile Score']].mean()

Converted,Asymmetrique Profile Score
0,16.174965
1,16.615864


In [50]:
# Select the column before aggregating so non-numeric columns are not averaged
data_dropped.groupby('Converted')[['Asymmetrique Activity Score']].mean()

Converted,Asymmetrique Activity Score
0,14.206293
1,14.465666
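
The five group means above can be produced in one call by selecting all numeric columns of interest before aggregating. A sketch on toy data with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({
    "Converted": [0, 0, 1, 1],
    "TotalVisits": [2.0, 4.0, 3.0, 5.0],
    "Total Time Spent on Website": [100, 300, 600, 800],
})

numeric_cols = ["TotalVisits", "Total Time Spent on Website"]
# Select columns first, then aggregate: one row per Converted value
group_means = toy.groupby("Converted")[numeric_cols].mean()
print(group_means)
```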


In [51]:
pd.crosstab(data_dropped['What is your current occupation'],data_dropped['Converted'],normalize=True)

What is your current occupation,Converted = 0,Converted = 1
Businessman,0.000325,0.000541
Housewife,0.0,0.001082
Other,0.000649,0.001082
Student,0.014286,0.008442
Unemployed,0.341883,0.264177
Unknown,0.251082,0.040043
Working Professional,0.006385,0.070022
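
With `normalize=True`, each cell is a share of all leads, so a category's conversion rate has to be worked out by hand (for Working Professional: 0.070022 / (0.006385 + 0.070022) ≈ 0.916). Passing `normalize='index'` makes each row sum to 1, so the column for `Converted = 1` can be read directly as the conversion rate within that category. A sketch on toy data with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({
    "occupation": ["Student", "Student", "Unemployed", "Unemployed",
                   "Working Professional", "Working Professional"],
    "Converted": [0, 1, 0, 0, 1, 1],
})

# Row-wise normalisation: each row sums to 1
rates = pd.crosstab(toy["occupation"], toy["Converted"], normalize="index")
print(rates)
```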


In [52]:
pd.crosstab(data_dropped['Lead Quality'],data_dropped['Converted'],normalize=True)

Lead Quality,Converted = 0,Converted = 1
High in Relevance,0.00368,0.06526
Low in Relevance,0.011472,0.051623
Might be,0.041234,0.127597
Not Sure,0.089394,0.028788
Unknown,0.405087,0.110823
Worst,0.063745,0.001299


In [53]:
pd.crosstab(data_dropped['Asymmetrique Profile Index'],data_dropped['Converted'],normalize=True)

Asymmetrique Profile Index,Converted = 0,Converted = 1
01.High,0.125325,0.113095
02.Medium,0.209957,0.091775
03.Low,0.001732,0.001623
Unknown,0.277597,0.178896


In [54]:
pd.crosstab(data_dropped['Asymmetrique Activity Index'],data_dropped['Converted'],normalize=True)

Asymmetrique Activity Index,Converted = 0,Converted = 1
01.High,0.062338,0.026515
02.Medium,0.239069,0.176407
03.Low,0.035606,0.003571
Unknown,0.277597,0.178896


In [55]:
pd.crosstab(data_dropped['Tags'],data_dropped['Converted'],normalize=True)

Tags,Converted = 0,Converted = 1
Already a student,0.05,0.000325
Busy,0.008766,0.011364
Closed by Horizzon,0.000216,0.038528
Diploma holder (Not Eligible),0.00671,0.000108
Graduation in progress,0.011255,0.000758
In confusion whether part time or DLP,0.000433,0.000108
Interested in full time MBA,0.012338,0.000325
Interested in Next batch,0.0,0.000541
Interested in other courses,0.054113,0.001407
Lateral student,0.0,0.000325


In [56]:
pd.crosstab(data_dropped['Last Notable Activity'],data_dropped['Converted'],normalize=True)

Last Notable Activity,Converted = 0,Converted = 1
Approached upfront,0.0,0.000108
Email Bounced,0.005519,0.000974
Email Link Clicked,0.013853,0.00487
Email Marked Spam,0.0,0.000216
Email Opened,0.192965,0.112987
Email Received,0.0,0.000108
Form Submitted on Website,0.000108,0.0
Had a Phone Conversation,0.000108,0.001407
Modified,0.283983,0.08474
Olark Chat Conversation,0.0171,0.002706


In [57]:
pd.crosstab(data_dropped['Do Not Email'],data_dropped['Converted'],normalize=True)

Do Not Email,Converted = 0,Converted = 1
No,0.547944,0.372619
Yes,0.066667,0.012771


In [58]:
pd.crosstab(data_dropped['Do Not Call'],data_dropped['Converted'],normalize=True)

Do Not Call,Converted = 0,Converted = 1
No,0.61461,0.385173
Yes,0.0,0.000216


In [59]:
pd.crosstab(data_dropped['Lead Source'],data_dropped['Converted'],normalize=True)

Lead Source,Converted = 0,Converted = 1
Click2call,0.000108,0.000325
Direct Traffic,0.186688,0.088528
Google,0.186797,0.124134
Live Chat,0.0,0.000216
NC_EDM,0.0,0.000108
Olark Chat,0.14145,0.048485
Organic Search,0.077706,0.047186
Pay per Click Ads,0.000108,0.0
Press_Release,0.000216,0.0
Reference,0.004762,0.05303


## 3. Data Manipulation
One-hot encode the categorical columns so they can be used in the modelling process.

In [60]:
dummies=pd.get_dummies(data_dropped[['Lead Source','Country','Do Not Email', 'Do Not Call','Specialization',
                                     'How did you hear about Madugital','Tags','What matters most to you in choosing a product',
                                     'What is your current occupation','Through Recommendations','Lead Quality',
                                     'Asymmetrique Activity Index','Asymmetrique Profile Index','Lead Profile',
                                     'Last Notable Activity',]],drop_first=True)
dummies

Unnamed: 0,Lead Source_Direct Traffic,Lead Source_Google,Lead Source_Live Chat,Lead Source_NC_EDM,Lead Source_Olark Chat,Lead Source_Organic Search,Lead Source_Pay per Click Ads,Lead Source_Press_Release,Lead Source_Reference,Lead Source_Referral Sites,...,Last Notable Activity_Form Submitted on Website,Last Notable Activity_Had a Phone Conversation,Last Notable Activity_Modified,Last Notable Activity_Olark Chat Conversation,Last Notable Activity_Page Visited on Website,Last Notable Activity_Resubscribed to emails,Last Notable Activity_SMS Sent,Last Notable Activity_Unreachable,Last Notable Activity_Unsubscribed,Last Notable Activity_View in browser link Clicked
0,0,0,0,0,1,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4,0,1,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9235,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9236,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
9237,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
9238,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
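
`drop_first=True` drops the first level of each categorical column, so a column with k levels becomes k-1 indicator columns and the dropped level is represented by all zeros. Recent pandas versions return boolean indicators by default; passing `dtype=int` (an addition here, not in the original cell) keeps the 0/1 output shown above. A sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    "Do Not Email": ["No", "Yes", "No"],
    "Lead Source": ["Google", "Olark Chat", "Direct Traffic"],
})

# 'No' and 'Direct Traffic' sort first, so they become the dropped baseline levels
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded)
```

The last row (`No`, `Direct Traffic`) is encoded as zeros in every column because both of its values are baseline levels.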


## 4. Modelling Preparations
Join the dummy columns to the cleaned data, then drop the original categorical columns and the other variables that are not used for modelling.

In [61]:
final_data=data_dropped.join(dummies)

In [62]:
final_data.drop(['Prospect ID', 'Lead Number', 'Lead Origin', 'Lead Source',
       'Do Not Email', 'Do Not Call', 'Last Activity',
       'Country', 'Specialization', 'How did you hear about Madugital',
       'What is your current occupation',
       'What matters most to you in choosing a product', 'Search', 'Magazine',
       'Newspaper Article', 'Madugital Telegram', 'Newspaper',
       'Digital Advertisement','Through Recommendations',
       'Receive More Updates About Our Products', 'Tags', 'Lead Quality',
       'Update me on Supply Chain Content', 'Get updates on DM Content',
       'Lead Profile', 'City', 'Asymmetrique Activity Index',
       'Asymmetrique Profile Index','I agree to pay the amount through cheque',
       'A free copy of Mastering The Interview','Last Notable Activity'], axis = 1, inplace = True)

In [63]:
final_data.columns

Index(['Converted', 'TotalVisits', 'Total Time Spent on Website',
       'Page Views Per Visit', 'Asymmetrique Activity Score',
       'Asymmetrique Profile Score', 'Lead Source_Direct Traffic',
       'Lead Source_Google', 'Lead Source_Live Chat', 'Lead Source_NC_EDM',
       ...
       'Last Notable Activity_Form Submitted on Website',
       'Last Notable Activity_Had a Phone Conversation',
       'Last Notable Activity_Modified',
       'Last Notable Activity_Olark Chat Conversation',
       'Last Notable Activity_Page Visited on Website',
       'Last Notable Activity_Resubscribed to emails',
       'Last Notable Activity_SMS Sent', 'Last Notable Activity_Unreachable',
       'Last Notable Activity_Unsubscribed',
       'Last Notable Activity_View in browser link Clicked'],
      dtype='object', length=157)

## 5. Modelling
Split the data into training and test sets, then fit a Decision Tree Classifier.

In [64]:
train, test = train_test_split(final_data, test_size=0.3, random_state=1000)
train.shape, test.shape

((6468, 157), (2772, 157))
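
The split above is purely random. Roughly 38.5% of leads are converted (the `Converted = 1` column of the `Do Not Call` cross-tab sums to about 0.385), so one option is to pass `stratify` so that both parts keep the same class ratio. This is a suggested variation, not what the notebook does, and the toy data below is made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: 40% positive class
toy = pd.DataFrame({
    "feature": range(20),
    "Converted": [1] * 8 + [0] * 12,
})

# stratify keeps the share of Converted == 1 equal in train and test
train, test = train_test_split(
    toy, test_size=0.25, random_state=1000, stratify=toy["Converted"]
)
print(train["Converted"].mean(), test["Converted"].mean())
```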

In [65]:
# Separate features from the target; `columns=` replaces the deprecated positional axis argument
X_train=train.drop(columns=['Converted'])
y_train=train['Converted']

X_test=test.drop(columns=['Converted'])
y_test=test['Converted']

# Fit the tree on the training set only
dt=DecisionTreeClassifier(random_state=1000)
dt.fit(X_train,y_train)

DecisionTreeClassifier(random_state=1000)

In [66]:
dt.predict(X_test)

array([1, 1, 1, ..., 1, 1, 0], dtype=int64)
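
Since the goal is to contact the most promising leads first, class probabilities can be more useful than hard 0/1 predictions. `predict_proba` returns one column per class, and the column for class 1 can be used to rank leads. This is a sketch on toy data: the feature names are hypothetical, and because a fully grown tree tends to output probabilities of exactly 0 or 1, `max_depth=1` is assumed here to obtain graded scores.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy training data (made-up values, hypothetical feature names)
X = pd.DataFrame({
    "time_on_site": [50, 80, 120, 400, 700, 900, 1000, 1500],
    "visits": [1, 2, 1, 3, 4, 2, 5, 6],
})
y = [0, 0, 0, 1, 0, 1, 1, 1]

tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

# Column order of predict_proba follows tree.classes_ (here [0, 1])
scores = tree.predict_proba(X)[:, 1]

# Highest estimated conversion probability first
ranked = X.assign(score=scores).sort_values("score", ascending=False)
print(ranked)
```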

In [67]:
y_test

1562    1
3407    1
8623    1
2575    1
5419    0
       ..
3567    0
3438    0
1518    1
4546    1
5554    0
Name: Converted, Length: 2772, dtype: int64

## 6. Modelling Evaluation
Check whether the model performs well enough or needs improvement.

In [68]:
y_pred = dt.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print(accuracy, precision, recall)

0.9242424242424242 0.9023255813953488 0.9023255813953488
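
Precision and recall come out identical here. With precision = TP / (TP + FP) and recall = TP / (TP + FN), that happens exactly when the number of false positives equals the number of false negatives. A confusion matrix makes this visible; the sketch below uses made-up labels rather than the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 2 false positives and 2 false negatives
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0, 0, 0]

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print(tn, fp, fn, tp, precision, recall)
```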


## 7. Improvement
On the test set, accuracy is about 92.4%, and precision and recall are both about 90.2%. Because all three exceed 90%, the model is considered good enough and no further improvement is made here.
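
A single train/test split gives only one estimate. An unconstrained decision tree can also fit its training data almost perfectly, so comparing training and test accuracy, or running k-fold cross-validation, is a common way to check how stable a result is. The sketch below uses synthetic data from `make_classification`, not the Madugital leads, and its numbers say nothing about the model above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: binary target, numeric features)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = tree.score(X_train, y_train)  # accuracy on data the tree has seen
test_acc = tree.score(X_test, y_test)     # accuracy on held-out data

# 5-fold cross-validated accuracy on the full synthetic set
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(train_acc, test_acc, cv_scores.mean())
```

A large gap between training and test accuracy, or a wide spread across folds, would suggest trying settings such as `max_depth` or `min_samples_leaf`.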