## Holiday Package Prediction

##### "Trips & Travel.Com" wants to establish a viable business model to expand its customer base. One way to do so is to introduce a new package offering. The company currently offers five types of packages: Basic, Standard, Deluxe, Super Deluxe, and King. Last year's data shows that 18% of customers purchased a package. However, the marketing cost was quite high because customers were contacted at random, without using the available information. The company now plans to launch a new product, the Wellness Tourism Package. Wellness tourism is defined as travel that allows the traveler to maintain, enhance, or kick-start a healthy lifestyle and to support or increase their sense of well-being. This time, the company wants to use the available data on existing and potential customers to make its marketing expenditure more efficient.

In [3]:
## Importing the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [4]:
## Reading the file
Travel = pd.read_csv('Travel.csv')
Travel

Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,200000,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Single,1.0,1,2,1,0.0,Manager,20993.0
1,200001,0,49.0,Company Invited,1,14.0,Salaried,Male,3,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,2.0,Manager,20130.0
2,200002,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,3,4.0,Basic,3.0,Single,7.0,1,3,0,0.0,Executive,17090.0
3,200003,0,33.0,Company Invited,1,9.0,Salaried,Female,2,3.0,Basic,3.0,Divorced,2.0,1,5,1,1.0,Executive,17909.0
4,200004,0,,Self Enquiry,1,8.0,Small Business,Male,2,3.0,Basic,4.0,Divorced,1.0,0,5,1,0.0,Executive,18468.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4883,204883,1,49.0,Self Enquiry,3,9.0,Small Business,Male,3,5.0,Deluxe,4.0,Unmarried,2.0,1,1,1,1.0,Manager,26576.0
4884,204884,1,28.0,Company Invited,1,31.0,Salaried,Male,4,5.0,Basic,3.0,Single,3.0,1,3,1,2.0,Executive,21212.0
4885,204885,1,52.0,Self Enquiry,3,17.0,Salaried,Female,4,4.0,Standard,4.0,Married,7.0,0,1,1,3.0,Senior Manager,31820.0
4886,204886,1,19.0,Self Enquiry,3,16.0,Small Business,Male,3,4.0,Basic,3.0,Single,3.0,0,5,0,2.0,Executive,20289.0


In [5]:
# Top 5 of the Dataset
Travel.head()

Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,200000,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Single,1.0,1,2,1,0.0,Manager,20993.0
1,200001,0,49.0,Company Invited,1,14.0,Salaried,Male,3,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,2.0,Manager,20130.0
2,200002,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,3,4.0,Basic,3.0,Single,7.0,1,3,0,0.0,Executive,17090.0
3,200003,0,33.0,Company Invited,1,9.0,Salaried,Female,2,3.0,Basic,3.0,Divorced,2.0,1,5,1,1.0,Executive,17909.0
4,200004,0,,Self Enquiry,1,8.0,Small Business,Male,2,3.0,Basic,4.0,Divorced,1.0,0,5,1,0.0,Executive,18468.0


In [6]:
# Bottom 5 of the Dataset
Travel.tail()

Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
4883,204883,1,49.0,Self Enquiry,3,9.0,Small Business,Male,3,5.0,Deluxe,4.0,Unmarried,2.0,1,1,1,1.0,Manager,26576.0
4884,204884,1,28.0,Company Invited,1,31.0,Salaried,Male,4,5.0,Basic,3.0,Single,3.0,1,3,1,2.0,Executive,21212.0
4885,204885,1,52.0,Self Enquiry,3,17.0,Salaried,Female,4,4.0,Standard,4.0,Married,7.0,0,1,1,3.0,Senior Manager,31820.0
4886,204886,1,19.0,Self Enquiry,3,16.0,Small Business,Male,3,4.0,Basic,3.0,Single,3.0,0,5,0,2.0,Executive,20289.0
4887,204887,1,36.0,Self Enquiry,1,14.0,Salaried,Male,4,4.0,Basic,4.0,Unmarried,3.0,1,3,1,2.0,Executive,24041.0


#### Data Cleaning
- 1). Handle missing values
- 2). Handle duplicates
- 3). Check data types
- 4). Understand the dataset
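
The cells that follow concentrate on missing values and category labels. As a rough sketch, the duplicate and data-type checks from the list could look like the code below. The small `demo` frame is made up for illustration and only borrows a few column names from the dataset; in the notebook the same calls would be applied to `Travel`.

```python
import pandas as pd

# Hypothetical toy frame; in the notebook this would be the `Travel` DataFrame.
demo = pd.DataFrame({
    "CustomerID": [200000, 200001, 200001, 200003],
    "Age": [41.0, 49.0, 49.0, None],
    "Gender": ["Female", "Male", "Male", "Fe Male"],
})

# Count rows that are exact copies of an earlier row.
n_duplicates = int(demo.duplicated().sum())
print("Duplicate rows:", n_duplicates)

# Drop the repeats and keep the first occurrence of each row.
deduped = demo.drop_duplicates().reset_index(drop=True)

# Inspect the dtype pandas inferred for each column.
print(deduped.dtypes)
```

Whether duplicates should actually be dropped depends on the data: two different customers could legitimately share every attribute once an identifier column is removed.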

In [8]:
## Checking for null values
Travel.isnull().sum()

CustomerID                    0
ProdTaken                     0
Age                         226
TypeofContact                25
CityTier                      0
DurationOfPitch             251
Occupation                    0
Gender                        0
NumberOfPersonVisiting        0
NumberOfFollowups            45
ProductPitched                0
PreferredPropertyStar        26
MaritalStatus                 0
NumberOfTrips               140
Passport                      0
PitchSatisfactionScore        0
OwnCar                        0
NumberOfChildrenVisiting     66
Designation                   0
MonthlyIncome               233
dtype: int64

In [9]:
# Checking for info of the dataset
Travel.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   CustomerID                4888 non-null   int64  
 1   ProdTaken                 4888 non-null   int64  
 2   Age                       4662 non-null   float64
 3   TypeofContact             4863 non-null   object 
 4   CityTier                  4888 non-null   int64  
 5   DurationOfPitch           4637 non-null   float64
 6   Occupation                4888 non-null   object 
 7   Gender                    4888 non-null   object 
 8   NumberOfPersonVisiting    4888 non-null   int64  
 9   NumberOfFollowups         4843 non-null   float64
 10  ProductPitched            4888 non-null   object 
 11  PreferredPropertyStar     4862 non-null   float64
 12  MaritalStatus             4888 non-null   object 
 13  NumberOfTrips             4748 non-null   float64
 14  Passport

In [10]:
### Check all the categories
Travel['Gender'].value_counts()

Gender
Male       2916
Female     1817
Fe Male     155
Name: count, dtype: int64

In [11]:
Travel['MaritalStatus'].value_counts()

MaritalStatus
Married      2340
Divorced      950
Single        916
Unmarried     682
Name: count, dtype: int64

In [12]:
Travel['TypeofContact'].value_counts()

TypeofContact
Self Enquiry       3444
Company Invited    1419
Name: count, dtype: int64

In [13]:
Travel['Occupation'].value_counts()

Occupation
Salaried          2368
Small Business    2084
Large Business     434
Free Lancer          2
Name: count, dtype: int64

In [14]:
Travel['Designation'].value_counts()

Designation
Executive         1842
Manager           1732
Senior Manager     742
AVP                342
VP                 230
Name: count, dtype: int64

###### The 'Gender' and 'MaritalStatus' columns contain inconsistent labels: 'Fe Male' is a misspelling of 'Female', and 'Single' is merged into 'Unmarried'. Both are replaced below.

In [16]:
Travel['Gender'] = Travel['Gender'].replace('Fe Male', 'Female')
Travel['MaritalStatus'] = Travel['MaritalStatus'].replace('Single', 'Unmarried')

In [17]:
## Checking that the values have been replaced
Travel['Gender'].value_counts()

Gender
Male      2916
Female    1972
Name: count, dtype: int64

In [18]:
Travel['MaritalStatus'].value_counts()

MaritalStatus
Married      2340
Unmarried    1598
Divorced      950
Name: count, dtype: int64
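
The 'Fe Male' label was found by reading the `value_counts()` output by eye. As a minimal sketch, labels that differ only in case or spacing could also be flagged programmatically. The `gender` series below is a made-up sample, and the normalisation rule (lowercase, strip whitespace) is an assumption, not something the notebook does.

```python
import pandas as pd

# Hypothetical sample of the raw Gender column.
gender = pd.Series(["Male", "Female", "Fe Male", "Male", "Female"])

# Normalise each label: lowercase and remove all whitespace.
normalised = gender.str.lower().str.replace(r"\s+", "", regex=True)

# Group raw labels by their normalised form; a group with more than one
# raw spelling points at an inconsistency worth fixing.
variants = gender.groupby(normalised).unique()
suspicious = {key: sorted(vals) for key, vals in variants.items() if len(vals) > 1}
print(suspicious)
```

A check like this only catches spelling variants. Semantic overlaps such as 'Single' and 'Unmarried' still need a manual decision.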

In [19]:
## Checking missing values
features_with_na = [features for features in Travel.columns if Travel[features].isnull().sum()>=1]
for features in features_with_na :
    print(features,np.round(Travel[features].isnull().mean()*100,5), '% missing values')

Age 4.62357 % missing values
TypeofContact 0.51146 % missing values
DurationOfPitch 5.13502 % missing values
NumberOfFollowups 0.92062 % missing values
PreferredPropertyStar 0.53191 % missing values
NumberOfTrips 2.86416 % missing values
NumberOfChildrenVisiting 1.35025 % missing values
MonthlyIncome 4.76678 % missing values


In [20]:
features_with_null = []
for feature in Travel.columns :
    if Travel[feature].isnull().sum() >= 1:
        features_with_null.append(feature)

In [21]:
features_with_null

['Age',
 'TypeofContact',
 'DurationOfPitch',
 'NumberOfFollowups',
 'PreferredPropertyStar',
 'NumberOfTrips',
 'NumberOfChildrenVisiting',
 'MonthlyIncome']

In [22]:
for features in features_with_null :
     print(features,np.round(Travel[features].isnull().mean()*100,5), '% missing values')

Age 4.62357 % missing values
TypeofContact 0.51146 % missing values
DurationOfPitch 5.13502 % missing values
NumberOfFollowups 0.92062 % missing values
PreferredPropertyStar 0.53191 % missing values
NumberOfTrips 2.86416 % missing values
NumberOfChildrenVisiting 1.35025 % missing values
MonthlyIncome 4.76678 % missing values


In [23]:
## Summary statistics for the numerical columns that contain nulls
Travel[features_with_na].select_dtypes(exclude='object').describe()

Unnamed: 0,Age,DurationOfPitch,NumberOfFollowups,PreferredPropertyStar,NumberOfTrips,NumberOfChildrenVisiting,MonthlyIncome
count,4662.0,4637.0,4843.0,4862.0,4748.0,4822.0,4655.0
mean,37.622265,15.490835,3.708445,3.581037,3.236521,1.187267,23619.853491
std,9.316387,8.519643,1.002509,0.798009,1.849019,0.857861,5380.698361
min,18.0,5.0,1.0,3.0,1.0,0.0,1000.0
25%,31.0,9.0,3.0,3.0,2.0,1.0,20346.0
50%,36.0,13.0,4.0,3.0,3.0,1.0,22347.0
75%,44.0,20.0,4.0,4.0,4.0,2.0,25571.0
max,61.0,127.0,6.0,5.0,22.0,3.0,98678.0
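
In the summary above, the maximum of DurationOfPitch (127) and of MonthlyIncome (98678) sits far above the 75th percentile, which suggests a few extreme values. One common reason to prefer the median over the mean for imputation in such columns is that the median is less affected by outliers. The sketch below illustrates this with made-up numbers.

```python
import pandas as pd

# Hypothetical pitch durations: three typical values and one extreme one.
durations = pd.Series([9.0, 13.0, 20.0, 127.0])

mean_value = durations.mean()      # pulled upward by the 127
median_value = durations.median()  # stays among the typical values

print(f"mean={mean_value}, median={median_value}")
```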


##### Imputing Null Values
- 1). Impute median for Age
- 2). Impute mode for TypeofContact
- 3). Impute median for DurationOfPitch
- 4). Impute mode for NumberOfFollowups, as it is a discrete feature
- 5). Impute mode for PreferredPropertyStar
- 6). Impute median for NumberOfTrips
- 7). Impute mode for NumberOfChildrenVisiting
- 8). Impute median for MonthlyIncome

In [25]:
# Assign the filled column back instead of calling fillna(..., inplace=True) on a
# column accessor, which can act on a temporary copy in newer pandas versions.

# Age
Travel['Age'] = Travel['Age'].fillna(Travel['Age'].median())

# TypeofContact
Travel['TypeofContact'] = Travel['TypeofContact'].fillna(Travel['TypeofContact'].mode()[0])

# DurationOfPitch
Travel['DurationOfPitch'] = Travel['DurationOfPitch'].fillna(Travel['DurationOfPitch'].median())

# NumberOfFollowups
Travel['NumberOfFollowups'] = Travel['NumberOfFollowups'].fillna(Travel['NumberOfFollowups'].mode()[0])

# PreferredPropertyStar
Travel['PreferredPropertyStar'] = Travel['PreferredPropertyStar'].fillna(Travel['PreferredPropertyStar'].mode()[0])

# NumberOfTrips
Travel['NumberOfTrips'] = Travel['NumberOfTrips'].fillna(Travel['NumberOfTrips'].median())

# NumberOfChildrenVisiting
Travel['NumberOfChildrenVisiting'] = Travel['NumberOfChildrenVisiting'].fillna(Travel['NumberOfChildrenVisiting'].mode()[0])

# MonthlyIncome
Travel['MonthlyIncome'] = Travel['MonthlyIncome'].fillna(Travel['MonthlyIncome'].median())
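
The per-column imputation above can also be driven by a single mapping from column name to fill statistic, which keeps the plan and the code in one place. The following is only a sketch on a made-up two-column frame; the `strategy` dictionary and the `demo` frame are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `Travel`.
demo = pd.DataFrame({
    "Age": [30.0, np.nan, 40.0, 50.0],
    "TypeofContact": ["Self Enquiry", None, "Self Enquiry", "Company Invited"],
})

# Map each column to the statistic used to fill its gaps.
strategy = {"Age": "median", "TypeofContact": "mode"}

for column, how in strategy.items():
    if how == "median":
        fill_value = demo[column].median()
    else:
        fill_value = demo[column].mode()[0]  # most frequent value
    demo[column] = demo[column].fillna(fill_value)

print(demo)
```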

In [26]:
## Checking that no null values remain after imputation
Travel.isnull().sum()

CustomerID                  0
ProdTaken                   0
Age                         0
TypeofContact               0
CityTier                    0
DurationOfPitch             0
Occupation                  0
Gender                      0
NumberOfPersonVisiting      0
NumberOfFollowups           0
ProductPitched              0
PreferredPropertyStar       0
MaritalStatus               0
NumberOfTrips               0
Passport                    0
PitchSatisfactionScore      0
OwnCar                      0
NumberOfChildrenVisiting    0
Designation                 0
MonthlyIncome               0
dtype: int64

In [27]:
# Dropping 'CustomerID': it is a unique identifier and carries no predictive information
Travel.drop('CustomerID', inplace = True, axis =1)

#### Feature Engineering

###### Feature Extraction

In [29]:
## Combining 'NumberOfPersonVisiting' and 'NumberOfChildrenVisiting' into a single 'TotalVisiting' feature
Travel['TotalVisiting'] = Travel['NumberOfChildrenVisiting'] + Travel['NumberOfPersonVisiting']
Travel.drop(columns=['NumberOfChildrenVisiting', 'NumberOfPersonVisiting'], inplace=True)

In [30]:
Travel.columns

Index(['ProdTaken', 'Age', 'TypeofContact', 'CityTier', 'DurationOfPitch',
       'Occupation', 'Gender', 'NumberOfFollowups', 'ProductPitched',
       'PreferredPropertyStar', 'MaritalStatus', 'NumberOfTrips', 'Passport',
       'PitchSatisfactionScore', 'OwnCar', 'Designation', 'MonthlyIncome',
       'TotalVisiting'],
      dtype='object')

In [31]:
## Splitting Numerical Features and Categorical Features

num_features = []
cat_features = []

for feature in Travel.columns :
    if Travel[feature].dtype != 'O':
        num_features.append(feature)
    else :
        cat_features.append(feature)

print("Number of numerical features : ", len(num_features))
print("Number of categorical features : ", len(cat_features))

Number of numerical features :  12
Number of categorical features :  6


In [32]:
## Further splitting numerical features into continuous and discrete features

discrete_features = []
continuous_features = []

for i in num_features :
    if len(Travel[i].unique()) <= 25 :
        discrete_features.append(i)
    else :
        continuous_features.append(i)

print("The number of discrete features in num_features is : ", len(discrete_features))
print("The number of continuous features in num_features is : ", len(continuous_features))

The number of discrete features in num_features is :  9
The number of continuous features in num_features is :  3


### Train Test Split and Model Training

In [34]:
from sklearn.model_selection import train_test_split

In [35]:
X = Travel.drop(['ProdTaken'], axis = 1)
y = Travel['ProdTaken']

In [36]:
y.value_counts()

ProdTaken
0    3968
1     920
Name: count, dtype: int64

In [37]:
## Separate the data into training and test sets
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [38]:
X_train.shape, X_test.shape

((3910, 17), (978, 17))
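
The target is imbalanced: 920 of 4888 customers (about 18.8%) took a product. The split above does not stratify, so the class ratio in the test set can drift slightly from that figure. A hedged sketch of a stratified variant, on made-up data, is shown below. Using it in the notebook would change the reported metrics, so it is offered only as an option.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 negatives, 20 positives.
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo keeps the class ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```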

In [39]:
## Creating a ColumnTransformer that combines two transformers (OneHotEncoder and StandardScaler)

cat_features = X.select_dtypes(include = "object").columns
num_features = X.select_dtypes(exclude = "object").columns

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

numeric_transformer = StandardScaler()
oh_transformer = OneHotEncoder(drop = 'first')

preprocessor = ColumnTransformer(
    [ 
        ("OneHotEncoder", oh_transformer, cat_features),
        ("StandardScaler", numeric_transformer, num_features)
    ]
)

In [40]:
## Preprocessor - ColumnTransformer with OneHotEncoding and StandardScaler
preprocessor

In [41]:
X_train = preprocessor.fit_transform(X_train)

In [42]:
pd.DataFrame(X_train)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,16,17,18,19,20,21,22,23,24,25
0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,-0.721400,-1.020350,1.284279,-0.725271,-0.127737,-0.632399,0.679690,0.782966,-0.382245,-0.774151
1,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,-0.721400,0.690023,0.282777,-0.725271,1.511598,-0.632399,0.679690,0.782966,-0.459799,0.643615
2,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.721400,-1.020350,0.282777,1.771041,0.418708,-0.632399,0.679690,0.782966,-0.245196,-0.065268
3,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,...,-0.721400,-1.020350,1.284279,-0.725271,-0.127737,-0.632399,1.408395,-1.277194,0.213475,-0.065268
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.721400,2.400396,-1.720227,-0.725271,1.511598,-0.632399,-0.049015,-1.277194,-0.024889,2.061382
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3905,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,-0.721400,-0.653841,1.284279,-0.725271,-0.674182,-0.632399,-1.506426,0.782966,-0.536973,0.643615
3906,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.455047,-0.898180,-0.718725,1.771041,-1.220627,-0.632399,1.408395,0.782966,1.529609,-0.065268
3907,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.455047,1.545210,0.282777,-0.725271,2.058043,-0.632399,-0.777720,0.782966,-0.360576,0.643615
3908,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,...,1.455047,1.789549,1.284279,-0.725271,-0.127737,-0.632399,-1.506426,0.782966,-0.252799,0.643615


In [43]:
X_test = preprocessor.transform(X_test)

In [44]:
pd.DataFrame(X_test)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,16,17,18,19,20,21,22,23,24,25
0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,...,1.455047,-0.287333,1.284279,-0.725271,-1.220627,-0.632399,-0.777720,-1.277194,-0.737510,-0.774151
1,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,...,0.366823,-0.531672,0.282777,0.522885,-1.220627,1.581280,1.408395,-1.277194,-0.670411,-0.065268
2,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,...,1.455047,0.812193,0.282777,-0.725271,0.965153,-0.632399,1.408395,0.782966,-0.420832,-0.774151
3,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,-0.721400,2.522566,2.285781,-0.725271,1.511598,-0.632399,-0.049015,0.782966,-0.113658,0.643615
4,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.455047,-1.020350,0.282777,0.522885,-0.127737,1.581280,0.679690,0.782966,-0.317047,2.061382
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
973,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,-0.721400,-1.020350,0.282777,-0.725271,1.511598,-0.632399,1.408395,0.782966,0.498219,0.643615
974,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,-0.721400,-1.142519,0.282777,1.771041,-0.674182,1.581280,-1.506426,-1.277194,-1.184015,-1.483035
975,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,-0.721400,1.056532,1.284279,-0.725271,-0.674182,1.581280,1.408395,0.782966,0.690012,0.643615
976,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,-0.721400,-0.287333,-2.721728,-0.725271,-0.674182,-0.632399,1.408395,0.782966,-0.228278,-0.774151


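##### The preprocessor is fitted on the training set only (fit_transform) and then applied unchanged to the test set (transform), so that no test-set statistics influence the scaling or encoding.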
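
A minimal sketch of the same idea with a single scaler and made-up numbers: the mean and standard deviation are learned from the training rows, and the test row is scaled with those same values rather than with its own statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature train and test sets.
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the train statistics

print("learned mean:", scaler.mean_[0])
print("scaled test value:", test_scaled[0, 0])
```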

### Machine Learning Model Training (Tree-Based Classifiers)

In [47]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

In [48]:
from sklearn.metrics import accuracy_score, classification_report, ConfusionMatrixDisplay, precision_score, recall_score, f1_score, roc_auc_score, roc_curve

In [49]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier

In [61]:
models = {

    "Random Forest Classifier" : RandomForestClassifier(),
    "Decision Tree Classifier" : DecisionTreeClassifier(),
    "AdaBoost Classifier" : AdaBoostClassifier(),
    "Gradient Boost" : GradientBoostingClassifier()
}

for name, model in models.items():  # Runs once per entry in `models` (four classifiers here)
    model.fit(X_train, y_train)  # Train the model

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Training set performance
    # F1 is weighted across both classes; precision and recall refer to the positive class only
    model_train_accuracy = accuracy_score(y_train, y_train_pred)
    model_train_f1 = f1_score(y_train, y_train_pred, average = 'weighted')
    model_train_precision = precision_score(y_train, y_train_pred)
    model_train_recall = recall_score(y_train, y_train_pred)
    # ROC AUC is computed from hard 0/1 predictions here, not from predicted probabilities
    model_train_rocauc_score = roc_auc_score(y_train, y_train_pred)

    # Test set performance
    model_test_accuracy = accuracy_score(y_test, y_test_pred)
    model_test_f1 = f1_score(y_test, y_test_pred, average = 'weighted')
    model_test_precision = precision_score(y_test, y_test_pred)
    model_test_recall = recall_score(y_test, y_test_pred)
    model_test_rocauc_score = roc_auc_score(y_test, y_test_pred)

    print(name)


    print("Model Performance for Training Set")
    print("- Accuracy: {:.4f}".format(model_train_accuracy))
    print("- F1 score: {:.4f}".format(model_train_f1))
    print("- Precision: {:.4f}".format(model_train_precision))
    print("- recall: {:.4f}".format(model_train_recall))
    print("- Roc Auc Score: {:.4f}".format(model_train_rocauc_score))


    print('-----------------------------------------------------')

    print("Model Performance for Test Set")
    print("- Accuracy: {:.4f}".format(model_test_accuracy))
    print("- F1 score: {:.4f}".format(model_test_f1))
    print("- Precision: {:.4f}".format(model_test_precision))
    print("- recall: {:.4f}".format(model_test_recall))
    print("- Roc Auc Score: {:.4f}".format(model_test_rocauc_score))




    print('='*35)
    print('\n')



Random Forest Classifier
Model Performance for Training Set
- Accuracy: 1.0000
- F1 score: 1.0000
- Precision: 1.0000
- recall: 1.0000
- Roc Auc Score: 1.0000
-----------------------------------------------------
Model Performance for Test Set
- Accuracy: 0.9284
- F1 score: 0.9226
- Precision: 0.9764
- recall: 0.6492
- Roc Auc Score: 0.8227


Decision Tree Classifier
Model Performance for Training Set
- Accuracy: 1.0000
- F1 score: 1.0000
- Precision: 1.0000
- recall: 1.0000
- Roc Auc Score: 1.0000
-----------------------------------------------------
Model Performance for Test Set
- Accuracy: 0.9192
- F1 score: 0.9186
- Precision: 0.8043
- recall: 0.7749
- Roc Auc Score: 0.8646


AdaBoost Classifier
Model Performance for Training Set
- Accuracy: 0.8565
- F1 score: 0.8365
- Precision: 0.7308
- recall: 0.3649
- Roc Auc Score: 0.6670
-----------------------------------------------------
Model Performance for Test Set
- Accuracy: 0.8354
- F1 score: 0.8115
- Precision: 0.6630
- recall: 0.3
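
The evaluation loop passes the output of `model.predict` to `roc_auc_score`, so the reported "Roc Auc Score" is based on hard 0/1 labels. For binary labels that value coincides with balanced accuracy. A threshold-independent AUC would instead use the predicted probability of the positive class. The sketch below illustrates the difference on made-up labels and probabilities; it does not reproduce any of the notebook's numbers.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Hypothetical labels and predicted probabilities for the positive class.
# With a fitted classifier these would come from model.predict_proba(X_test)[:, 1].
y_true = np.array([0, 0, 0, 1, 1])
y_proba = np.array([0.10, 0.40, 0.60, 0.45, 0.90])

# Hard labels at a 0.5 threshold, similar to what model.predict returns.
y_pred = (y_proba >= 0.5).astype(int)

auc_from_labels = roc_auc_score(y_true, y_pred)
auc_from_proba = roc_auc_score(y_true, y_proba)
balanced_acc = balanced_accuracy_score(y_true, y_pred)

print("AUC from hard labels:  ", round(auc_from_labels, 4))
print("AUC from probabilities:", round(auc_from_proba, 4))
print("Balanced accuracy:     ", round(balanced_acc, 4))
```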

In [65]:
# HyperParameter Tuning
# 'auto' was removed as a max_features option in scikit-learn 1.3;
# for classifiers it was equivalent to 'sqrt', which is used here instead.
rf_params = {"max_depth" : [5,8,15,None,10],
             "max_features" : [5,7,"sqrt",8],
             "min_samples_split" : [2,8,15,20],
             "n_estimators" : [100,200,500,1000]}

In [67]:
rf_params

{'max_depth': [5, 8, 15, None, 10],
 'max_features': [5, 7, 'sqrt', 8],
 'min_samples_split': [2, 8, 15, 20],
 'n_estimators': [100, 200, 500, 1000]}

In [69]:
df_params = {"criterion" : ['gini', 'entropy', 'log_loss'],
             "splitter" : ['best', 'random'],
             "max_depth" : [1,2,3,4,5,6,7],
            }

In [71]:
df_params

{'criterion': ['gini', 'entropy', 'log_loss'],
 'splitter': ['best', 'random'],
 'max_depth': [1, 2, 3, 4, 5, 6, 7]}

In [73]:
gb_params = {"loss" : ['log_loss', 'exponential'],
             "criterion" : ['friedman_mse', 'squared_error'],
            }

In [75]:
gb_params

{'loss': ['log_loss', 'exponential'],
 'criterion': ['friedman_mse', 'squared_error']}

In [77]:
# Models list for HyperParameter Tuning
randomcv_models = [("RF", RandomForestClassifier(), rf_params),
                   ("DT", DecisionTreeClassifier(), df_params),
                   ("GB", GradientBoostingClassifier(), gb_params)
                  ]

In [79]:
from sklearn.model_selection import RandomizedSearchCV

In [81]:
model_param = {}
for name, model, params in randomcv_models:
    # n_iter caps the number of sampled candidates; a grid with fewer
    # combinations than n_iter is searched exhaustively.
    search = RandomizedSearchCV(estimator = model,
                                param_distributions = params,
                                n_iter = 100,
                                cv = 3,
                                verbose = 2,
                                n_jobs = -1)
    search.fit(X_train,y_train)
    model_param[name] = search.best_params_

for model_name in model_param:
    print(f"------------------Best Params for {model_name}------------------------")
    print(model_param[model_name])

Fitting 3 folds for each of 100 candidates, totalling 300 fits
Fitting 3 folds for each of 42 candidates, totalling 126 fits
Fitting 3 folds for each of 4 candidates, totalling 12 fits
------------------Best Params for RF------------------------
{'n_estimators': 100, 'min_samples_split': 2, 'max_features': 8, 'max_depth': None}
------------------Best Params for DT------------------------
{'splitter': 'best', 'max_depth': 7, 'criterion': 'gini'}
------------------Best Params for GB------------------------
{'loss': 'log_loss', 'criterion': 'friedman_mse'}
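
The candidate counts in the log above (100, 42 and 4) follow from the grid sizes: the search samples at most `n_iter` combinations and cannot try more than a grid contains. A small sketch of that arithmetic using `ParameterGrid` is shown below. The grid names ending in `_grid` are illustrative, and 'sqrt' stands in for the string option of `max_features`, since only the number of options matters for the count.

```python
from sklearn.model_selection import ParameterGrid

# Grids mirroring the cells above; only the number of options per key matters.
rf_grid = {"max_depth": [5, 8, 15, None, 10],
           "max_features": [5, 7, "sqrt", 8],
           "min_samples_split": [2, 8, 15, 20],
           "n_estimators": [100, 200, 500, 1000]}
dt_grid = {"criterion": ["gini", "entropy", "log_loss"],
           "splitter": ["best", "random"],
           "max_depth": [1, 2, 3, 4, 5, 6, 7]}
gb_grid = {"loss": ["log_loss", "exponential"],
           "criterion": ["friedman_mse", "squared_error"]}

n_iter = 100
candidates = {}
for label, grid in [("RF", rf_grid), ("DT", dt_grid), ("GB", gb_grid)]:
    grid_size = len(ParameterGrid(grid))
    # A randomized search cannot try more candidates than the grid holds.
    candidates[label] = min(n_iter, grid_size)
    print(label, "grid size:", grid_size, "candidates tried:", candidates[label])
```

Only the random forest grid is actually sampled. The decision tree and gradient boosting grids are small enough that every combination is evaluated.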


In [83]:
models = {
    
    "Random Forest" : RandomForestClassifier(n_estimators=100, min_samples_split=2, max_features=8, max_depth=None),
    "Decision Tree" : DecisionTreeClassifier(splitter = 'best', max_depth = 7, criterion = 'gini'),
    "GradientBoosting" : GradientBoostingClassifier(loss = 'log_loss', criterion = 'friedman_mse')
}

for name, model in models.items():  # Runs once per entry in `models` (three tuned classifiers here)
    model.fit(X_train, y_train)  # Train the model

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Training set performance
    # F1 is weighted across both classes; precision and recall refer to the positive class only
    model_train_accuracy = accuracy_score(y_train, y_train_pred)
    model_train_f1 = f1_score(y_train, y_train_pred, average = 'weighted')
    model_train_precision = precision_score(y_train, y_train_pred)
    model_train_recall = recall_score(y_train, y_train_pred)
    # ROC AUC is computed from hard 0/1 predictions here, not from predicted probabilities
    model_train_rocauc_score = roc_auc_score(y_train, y_train_pred)

    # Test set performance
    model_test_accuracy = accuracy_score(y_test, y_test_pred)
    model_test_f1 = f1_score(y_test, y_test_pred, average = 'weighted')
    model_test_precision = precision_score(y_test, y_test_pred)
    model_test_recall = recall_score(y_test, y_test_pred)
    model_test_rocauc_score = roc_auc_score(y_test, y_test_pred)

    print(name)


    print("Model Performance for Training Set")
    print("- Accuracy: {:.4f}".format(model_train_accuracy))
    print("- F1 score: {:.4f}".format(model_train_f1))
    print("- Precision: {:.4f}".format(model_train_precision))
    print("- recall: {:.4f}".format(model_train_recall))
    print("- Roc Auc Score: {:.4f}".format(model_train_rocauc_score))


    print('-----------------------------------------------------')

    print("Model Performance for Test Set")
    print("- Accuracy: {:.4f}".format(model_test_accuracy))
    print("- F1 score: {:.4f}".format(model_test_f1))
    print("- Precision: {:.4f}".format(model_test_precision))
    print("- recall: {:.4f}".format(model_test_recall))
    print("- Roc Auc Score: {:.4f}".format(model_test_rocauc_score))




    print('='*35)
    print('\n')


Random Forest
Model Performance for Training Set
- Accuracy: 1.0000
- F1 score: 1.0000
- Precision: 1.0000
- recall: 1.0000
- Roc Auc Score: 1.0000
-----------------------------------------------------
Model Performance for Test Set
- Accuracy: 0.9356
- F1 score: 0.9314
- Precision: 0.9638
- recall: 0.6963
- Roc Auc Score: 0.8450


Decision Tree
Model Performance for Training Set
- Accuracy: 0.9013
- F1 score: 0.8933
- Precision: 0.8522
- recall: 0.5693
- Roc Auc Score: 0.7733
-----------------------------------------------------
Model Performance for Test Set
- Accuracy: 0.8681
- F1 score: 0.8554
- Precision: 0.7627
- recall: 0.4712
- Roc Auc Score: 0.7178


GradientBoosting
Model Performance for Training Set
- Accuracy: 0.8939
- F1 score: 0.8819
- Precision: 0.8756
- recall: 0.5021
- Roc Auc Score: 0.7429
-----------------------------------------------------
Model Performance for Test Set
- Accuracy: 0.8589
- F1 score: 0.8398
- Precision: 0.7732
- recall: 0.3927
- Roc Auc Score: 0.68