# Description

## Background and Context

You are a Data Scientist for a tourism company named "Visit with us". The Policy Maker of the company wants to enable and establish a viable business model to expand the customer base.

A viable business model is a central concept that helps you to understand the existing ways of doing the business and how to change the ways for the benefit of the tourism sector.

One of the ways to expand the customer base is to introduce a new offering of packages.

Currently, there are 5 types of packages the company is offering - Basic, Standard, Deluxe, Super Deluxe, King. Looking at the data of the last year, we observed that 18% of the customers purchased the packages.

However, the marketing cost was quite high because customers were contacted at random without looking at the available information.

The company is now planning to launch a new product, the Wellness Tourism Package. Wellness Tourism is defined as travel that allows the traveler to maintain, enhance, or kick-start a healthy lifestyle, and support or increase one's sense of well-being.

However, this time company wants to harness the available data of existing and potential customers to make the marketing expenditure more efficient.

As a Data Scientist at the "Visit with us" travel company, you have to analyze the customers' data to provide recommendations to the Policy Maker and the Marketing Team, and also build a model to predict which potential customers will purchase the newly introduced travel package.

## Objective

To predict which customer is more likely to purchase the newly introduced travel package.

## Index

- <a href = #link1>Import Libraries and Load data </a>


- <a href = #link2>Overview of the dataset</a> 


- <a href = #link3>EDA</a> 


- <a href = #link4>Illustrate the insights based on EDA</a> 


- <a href = #link3>Split the dataset</a>


- <a href = #link4>Decision Tree Model </a> 


- <a href = #link5>Random Forest Model</a>


- <a href = #link6>Boosting Models</a>


- <a href = #link7>Stacking Model</a>


- <a href = #link8>Business Recommendations</a>

## <a id = "link1"></a> Import Libraries and Load data

**Importing libraries we need.**

In [27]:
!pip install xgboost



In [28]:
conda install py-xgboost

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 4.8.3
  latest version: 4.10.0

Please update conda by running

    $ conda update -n base -c defaults conda



# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.


In [29]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np   
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.gridspec as gridspec

from sklearn.model_selection import GridSearchCV, train_test_split

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier,RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, StackingClassifier
from xgboost import XGBClassifier
from sklearn import metrics

# increase output cell width and height
from IPython.display import HTML, display
display(HTML("<style>div.output_scroll {width:100%; height:50em}</style>"))

In [30]:
tourismDf = pd.read_excel (r'Tourism.xlsx', sheet_name='Tourism')  # load data from excel

## <a id = "link2"></a>Overview of the dataset

In [31]:
tourismDf.head()

|  | CustomerID | ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisited | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisited | Designation | MonthlyIncome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 200000 | 1 | 41.0 | Self Enquiry | 3 | 6.0 | Salaried | Female | 3 | 3.0 | Deluxe | 3.0 | Single | 1.0 | 1 | 2 | 1 | 0.0 | Manager | 20993.0 |
| 1 | 200001 | 0 | 49.0 | Company Invited | 1 | 14.0 | Salaried | Male | 3 | 4.0 | Deluxe | 4.0 | Divorced | 2.0 | 0 | 3 | 1 | 2.0 | Manager | 20130.0 |
| 2 | 200002 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 3 | 4.0 | Basic | 3.0 | Single | 7.0 | 1 | 3 | 0 | 0.0 | Executive | 17090.0 |
| 3 | 200003 | 0 | 33.0 | Company Invited | 1 | 9.0 | Salaried | Female | 2 | 3.0 | Basic | 3.0 | Divorced | 2.0 | 1 | 5 | 1 | 1.0 | Executive | 17909.0 |
| 4 | 200004 | 0 | NaN | Self Enquiry | 1 | 8.0 | Small Business | Male | 2 | 3.0 | Basic | 4.0 | Divorced | 1.0 | 0 | 5 | 1 | 0.0 | Executive | 18468.0 |


In [32]:
tourismDf.info()   # print data info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 20 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   CustomerID               4888 non-null   int64  
 1   ProdTaken                4888 non-null   int64  
 2   Age                      4662 non-null   float64
 3   TypeofContact            4863 non-null   object 
 4   CityTier                 4888 non-null   int64  
 5   DurationOfPitch          4637 non-null   float64
 6   Occupation               4888 non-null   object 
 7   Gender                   4888 non-null   object 
 8   NumberOfPersonVisited    4888 non-null   int64  
 9   NumberOfFollowups        4843 non-null   float64
 10  ProductPitched           4888 non-null   object 
 11  PreferredPropertyStar    4862 non-null   float64
 12  MaritalStatus            4888 non-null   object 
 13  NumberOfTrips            4748 non-null   float64
 14  Passport                

**Convert relevant columns to categorical columns.**

In [33]:
# object type columns are converted to categorical datatype

for feature in tourismDf.columns: # Loop through all columns in the dataframe
    if tourismDf[feature].dtype == 'object': # Only apply for columns with categorical strings
        tourismDf[feature] = pd.Categorical(tourismDf[feature])# Convert object datatype to categorical datatype
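
The loop above can also be written without iterating over every column. The following is a minimal sketch on a made-up three-column frame (`toy` and its values are assumptions for illustration, not rows of `tourismDf`): `select_dtypes` picks the object-typed columns and a single `astype` call converts them.

```python
import pandas as pd

# Toy frame standing in for tourismDf (values are made up for illustration)
toy = pd.DataFrame({
    "Occupation": pd.Series(["Salaried", "Small Business", "Salaried"], dtype=object),
    "Gender": pd.Series(["Male", "Female", "Male"], dtype=object),
    "Age": [41.0, 49.0, 37.0],
})

# Names of all object-typed columns, found in one call
obj_cols = toy.select_dtypes(include="object").columns

# Convert only those columns to the categorical datatype
toy = toy.astype({col: "category" for col in obj_cols})

print(toy.dtypes)
```

Numeric columns such as `Age` are left untouched, which matches the behavior of the explicit loop.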


In [34]:
# A few int and float columns actually hold categorical data, so they are converted to the categorical datatype as well.


cat_cols = ['ProdTaken','CityTier','PreferredPropertyStar','Passport','OwnCar','Gender'] # columns which are supposed to be of category datatype
for feature in cat_cols: # Loop through items in the list
    tourismDf[feature] = pd.Categorical(tourismDf[feature])# Convert column datatype to categorical datatype
tourismDf.head(5)


|  | CustomerID | ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisited | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisited | Designation | MonthlyIncome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 200000 | 1 | 41.0 | Self Enquiry | 3 | 6.0 | Salaried | Female | 3 | 3.0 | Deluxe | 3.0 | Single | 1.0 | 1 | 2 | 1 | 0.0 | Manager | 20993.0 |
| 1 | 200001 | 0 | 49.0 | Company Invited | 1 | 14.0 | Salaried | Male | 3 | 4.0 | Deluxe | 4.0 | Divorced | 2.0 | 0 | 3 | 1 | 2.0 | Manager | 20130.0 |
| 2 | 200002 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 3 | 4.0 | Basic | 3.0 | Single | 7.0 | 1 | 3 | 0 | 0.0 | Executive | 17090.0 |
| 3 | 200003 | 0 | 33.0 | Company Invited | 1 | 9.0 | Salaried | Female | 2 | 3.0 | Basic | 3.0 | Divorced | 2.0 | 1 | 5 | 1 | 1.0 | Executive | 17909.0 |
| 4 | 200004 | 0 | NaN | Self Enquiry | 1 | 8.0 | Small Business | Male | 2 | 3.0 | Basic | 4.0 | Divorced | 1.0 | 0 | 5 | 1 | 0.0 | Executive | 18468.0 |


In [35]:
tourismDf.info()  # after converting all relevant columns with categorical data into categorical datatype.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 20 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   CustomerID               4888 non-null   int64   
 1   ProdTaken                4888 non-null   category
 2   Age                      4662 non-null   float64 
 3   TypeofContact            4863 non-null   category
 4   CityTier                 4888 non-null   category
 5   DurationOfPitch          4637 non-null   float64 
 6   Occupation               4888 non-null   category
 7   Gender                   4888 non-null   category
 8   NumberOfPersonVisited    4888 non-null   int64   
 9   NumberOfFollowups        4843 non-null   float64 
 10  ProductPitched           4888 non-null   category
 11  PreferredPropertyStar    4862 non-null   category
 12  MaritalStatus            4888 non-null   category
 13  NumberOfTrips            4748 non-null   float64 
 14  Passport

**Check missing values.**

In [36]:
tourismDf.isna().sum().sort_values()

CustomerID                   0
ProdTaken                    0
OwnCar                       0
CityTier                     0
PitchSatisfactionScore       0
Occupation                   0
Gender                       0
NumberOfPersonVisited        0
Designation                  0
ProductPitched               0
MaritalStatus                0
Passport                     0
TypeofContact               25
PreferredPropertyStar       26
NumberOfFollowups           45
NumberOfChildrenVisited     66
NumberOfTrips              140
Age                        226
MonthlyIncome              233
DurationOfPitch            251
dtype: int64
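
Raw counts are easier to judge as a share of the rows. The sketch below reuses the counts printed above and the 4888 rows reported by `tourismDf.info()`; the names `missing`, `n_rows`, and `pct_missing` are introduced here only for illustration.

```python
import pandas as pd

# Missing-value counts copied from the isna().sum() output above
missing = pd.Series({
    "TypeofContact": 25, "PreferredPropertyStar": 26, "NumberOfFollowups": 45,
    "NumberOfChildrenVisited": 66, "NumberOfTrips": 140, "Age": 226,
    "MonthlyIncome": 233, "DurationOfPitch": 251,
})
n_rows = 4888  # RangeIndex size reported by tourismDf.info()

# Share of rows with a missing value, in percent
pct_missing = (missing / n_rows * 100).round(2)
print(pct_missing.sort_values(ascending=False))

# On the real frame the same figures would come from: tourismDf.isna().mean() * 100
```

No column is missing much more than about 5% of its values, so none of them needs to be discarded for sparsity alone.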

In [37]:
tourismDf.describe()

|  | CustomerID | Age | DurationOfPitch | NumberOfPersonVisited | NumberOfFollowups | NumberOfTrips | PitchSatisfactionScore | NumberOfChildrenVisited | MonthlyIncome |
|---|---|---|---|---|---|---|---|---|---|
| count | 4888.0 | 4662.0 | 4637.0 | 4888.0 | 4843.0 | 4748.0 | 4888.0 | 4822.0 | 4655.0 |
| mean | 202443.5 | 37.622265 | 15.490835 | 2.905074 | 3.708445 | 3.236521 | 3.078151 | 1.187267 | 23619.853491 |
| std | 1411.188388 | 9.316387 | 8.519643 | 0.724891 | 1.002509 | 1.849019 | 1.365792 | 0.857861 | 5380.698361 |
| min | 200000.0 | 18.0 | 5.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1000.0 |
| 25% | 201221.75 | 31.0 | 9.0 | 2.0 | 3.0 | 2.0 | 2.0 | 1.0 | 20346.0 |
| 50% | 202443.5 | 36.0 | 13.0 | 3.0 | 4.0 | 3.0 | 3.0 | 1.0 | 22347.0 |
| 75% | 203665.25 | 44.0 | 20.0 | 3.0 | 4.0 | 4.0 | 4.0 | 2.0 | 25571.0 |
| max | 204887.0 | 61.0 | 127.0 | 5.0 | 6.0 | 22.0 | 5.0 | 3.0 | 98678.0 |


In [38]:
for feature in tourismDf.columns:
    if tourismDf[feature].dtype.name == 'category':
        print('---------------------------------')
        print(tourismDf[feature].value_counts())
        print('---------------------------------')        

---------------------------------
0    3968
1     920
Name: ProdTaken, dtype: int64
---------------------------------
---------------------------------
Self Enquiry       3444
Company Invited    1419
Name: TypeofContact, dtype: int64
---------------------------------
---------------------------------
1    3190
3    1500
2     198
Name: CityTier, dtype: int64
---------------------------------
---------------------------------
Salaried          2368
Small Business    2084
Large Business     434
Free Lancer          2
Name: Occupation, dtype: int64
---------------------------------
---------------------------------
Male       2916
Female     1817
Fe Male     155
Name: Gender, dtype: int64
---------------------------------
---------------------------------
Basic           1842
Deluxe          1732
Standard         742
Super Deluxe     342
King             230
Name: ProductPitched, dtype: int64
---------------------------------
---------------------------------
3.0    2993
5.0     956
4.0  

In [39]:
# 'Fe Male' is a mislabeled variant of 'Female'; merge it into 'Female'
tourismDf['Gender'] = tourismDf['Gender'].astype('object').replace({'Fe Male': 'Female'})

In [None]:
tourismDf['Gender'] = pd.Categorical(tourismDf['Gender'])

print(tourismDf['Gender'].value_counts())

print(tourismDf['Gender'].dtype)
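
A quick sanity check for this cleanup: merging `Fe Male` into `Female` should leave two levels whose counts still add up to all 4888 rows. The sketch below works on the counts from the earlier `value_counts()` output rather than on the frame itself; `counts` and `merged` are names introduced here for illustration.

```python
import pandas as pd

# Counts copied from the Gender value_counts() output above
counts = pd.Series({"Male": 2916, "Female": 1817, "Fe Male": 155})

# Relabel the mislabeled level, then add up the two 'Female' entries
merged = counts.rename(index={"Fe Male": "Female"}).groupby(level=0).sum()

print(merged)
print("total rows:", merged.sum())
```

If the cell above ran correctly, `tourismDf['Gender'].value_counts()` should report the same two totals.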

## <a id = "link3"></a>EDA

**Goals: univariate analysis, bivariate analysis, appropriate visualizations to identify patterns and insights, a customer profile (characteristics of a customer) for each package, and any other exploratory deep dive.**

## Univariate analysis

In [None]:
for feature in tourismDf.columns:
    if tourismDf[feature].dtype.name == 'category':
        plt.figure(figsize=(10,5))
        g = sns.countplot(x=tourismDf[feature])  # pass the column as x; recent seaborn versions treat a positional argument as data
        for p in g.patches:
            g.annotate(format(p.get_height(), '.1f'), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0, 9), 
                   textcoords = 'offset points')
            


### Insights from univariate count plots

1.  More customers made contact through self-enquiry (3,444) than through a company invitation (1,419), which suggests that advertising is reaching the audience.
2.  Most customers come from CityTier 1.
3.  Salaried customers form the largest occupation group (about 48%), followed closely by Small Business.
4.  Surprisingly, the most preferred property star rating is 3; customers may be trading quality for cost.
5.  Married customers form the largest marital-status group.
6.  Most customers do not have a passport, so travel packages could concentrate on destinations within the country.

In [None]:
for feature in tourismDf.columns[1:]:
    if tourismDf[feature].dtype.name != 'category':
        #print(feature)
        plt.figure(figsize=(10,5))
        g = sns.distplot(tourismDf[feature])

### Insights from univariate dist plots
1. The largest share of customers is between 30 and 40 years old.
2. Duration of pitch is right-skewed (mean 15.5 versus median 13, with a maximum of 127): three quarters of the pitches have a duration of 20 or less.
3. Most customers have a monthly income between approximately 20,000 and 25,000.
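
The direction of skew can also be checked numerically instead of by eye. In the `describe()` output, `DurationOfPitch` has a mean (15.49) above its median (13.0) and a maximum of 127, which points to a long right tail. The sketch below shows the check on made-up durations (the `durations` values are an assumption, not the real column); a positive `skew()` indicates a right-skewed distribution.

```python
import pandas as pd

# Made-up pitch durations with a long right tail (not the real data)
durations = pd.Series([6, 8, 8, 9, 9, 10, 13, 14, 16, 20, 30, 127])

print("mean  :", durations.mean())
print("median:", durations.median())
print("skew  :", durations.skew())   # positive => right-skewed

# On the real frame: tourismDf['DurationOfPitch'].skew()
```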


## Bivariate analysis

In [None]:
sns.distplot(tourismDf[(tourismDf['ProdTaken']==0) & (tourismDf['Age']>0)]['Age'],color='r',label=0)
sns.distplot(tourismDf[  (tourismDf['ProdTaken']==1)  & (tourismDf['Age']>0)   ]['Age'],color='g',label=1)
plt.legend()
plt.show()

# Customers who accepted the product and customers who did not show similar age trends.

In [None]:
sns.distplot(tourismDf[(tourismDf['ProdTaken']==0) & (tourismDf['DurationOfPitch']>0)]['DurationOfPitch'],color='r',label=0)
sns.distplot(tourismDf[  (tourismDf['ProdTaken']==1)  & (tourismDf['DurationOfPitch']>0)   ]['DurationOfPitch'],color='g',label=1)
plt.legend()
plt.show()

In [None]:
sns.distplot(tourismDf[(tourismDf['ProdTaken']==0) & (tourismDf['NumberOfPersonVisited']>0)]['NumberOfPersonVisited'],color='r',label=0)
sns.distplot(tourismDf[  (tourismDf['ProdTaken']==1)  & (tourismDf['NumberOfPersonVisited']>0)   ]['NumberOfPersonVisited'],color='g',label=1)
plt.legend()
plt.show()



In [None]:
sns.distplot(tourismDf[(tourismDf['ProdTaken']==0) & (tourismDf['NumberOfFollowups']>0)]['NumberOfFollowups'],color='r',label=0)
sns.distplot(tourismDf[  (tourismDf['ProdTaken']==1)  & (tourismDf['NumberOfFollowups']>0)   ]['NumberOfFollowups'],color='g',label=1)
plt.legend()
plt.show()

In [None]:


sns.distplot(tourismDf[(tourismDf['ProdTaken']==0) & (tourismDf['NumberOfTrips']>0)]['NumberOfTrips'],color='r',label=0)
sns.distplot(tourismDf[  (tourismDf['ProdTaken']==1)  & (tourismDf['NumberOfTrips']>0)   ]['NumberOfTrips'],color='g',label=1)
plt.legend()
plt.show()

# Customers who accepted the product have taken more trips.

In [None]:
sns.distplot(tourismDf[(tourismDf['ProdTaken']==0) & (tourismDf['PitchSatisfactionScore']>0)]['PitchSatisfactionScore'],color='r',label=0)
sns.distplot(tourismDf[  (tourismDf['ProdTaken']==1)  & (tourismDf['PitchSatisfactionScore']>0)   ]['PitchSatisfactionScore'],color='g',label=1)
plt.legend()
plt.show()

In [None]:
sns.distplot(tourismDf[(tourismDf['ProdTaken']==0) & (tourismDf['MonthlyIncome']>0)]['MonthlyIncome'],color='r',label=0)
sns.distplot(tourismDf[  (tourismDf['ProdTaken']==1)  & (tourismDf['MonthlyIncome']>0)   ]['MonthlyIncome'],color='g',label=1)
plt.legend()
plt.show()

# Most customers who accepted the product have a monthly income between 20000 and 30000.

In [None]:
print( 'cross tab between TypeofContact and ProdTaken' )
typeofContact_df = pd.crosstab(tourismDf['TypeofContact'],tourismDf['ProdTaken'],normalize=False)
print(typeofContact_df)
print('-----------------------------------------------')
print( 'cross tab percentages between TypeofContact and ProdTaken' )
typeofContact_df = pd.crosstab(tourismDf['TypeofContact'],tourismDf['ProdTaken'],normalize='index')
print(typeofContact_df)
typeofContact_df = typeofContact_df.stack().reset_index().rename(columns={0:'value'})
typeofContact_df


plt.figure(figsize=(15,5)) 
g = sns.barplot(x=typeofContact_df['TypeofContact'],y=typeofContact_df['value'],hue='ProdTaken',data=typeofContact_df)
for p in g.patches:
    g.annotate(format(p.get_height()*100, '.2f'), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0, 9), 
                   textcoords = 'offset points')
    
    
# Insights
# A slightly higher percentage of company-invited customers accepted the product compared with self-enquiry customers.

In [None]:
print( 'cross tab between Occupation and ProdTaken' )
typeofContact_df = pd.crosstab(tourismDf['Occupation'],tourismDf['ProdTaken'],normalize=False)
print(typeofContact_df)
print('-----------------------------------------------')
print( 'cross tab percentages between Occupation and ProdTaken' )
typeofContact_df = pd.crosstab(tourismDf['Occupation'],tourismDf['ProdTaken'],normalize='index')
print(typeofContact_df)
typeofContact_df = typeofContact_df.stack().reset_index().rename(columns={0:'value'})
typeofContact_df


plt.figure(figsize=(15,5)) 
g = sns.barplot(x=typeofContact_df['Occupation'],y=typeofContact_df['value'],hue='ProdTaken',data=typeofContact_df)
for p in g.patches:
    g.annotate(format(p.get_height()*100, '.2f'), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0, 9), 
                   textcoords = 'offset points')
    
    
# Insights
# All Free Lancer customers accepted the product, but there are only 2 of them, so the group is too small to generalize from.
# A higher percentage of Large Business customers accepted the product.

In [None]:


print( 'cross tab between ProductPitched and ProdTaken' )
typeofContact_df = pd.crosstab(tourismDf['ProductPitched'],tourismDf['ProdTaken'],normalize=False)
print(typeofContact_df)
print('-----------------------------------------------')
print( 'cross tab percentages between ProductPitched and ProdTaken' )
typeofContact_df = pd.crosstab(tourismDf['ProductPitched'],tourismDf['ProdTaken'],normalize='index')
print(typeofContact_df)
typeofContact_df = typeofContact_df.stack().reset_index().rename(columns={0:'value'})
typeofContact_df


plt.figure(figsize=(15,5)) 
g = sns.barplot(x=typeofContact_df['ProductPitched'],y=typeofContact_df['value'],hue='ProdTaken',data=typeofContact_df)
for p in g.patches:
    g.annotate(format(p.get_height()*100, '.2f'), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0, 9), 
                   textcoords = 'offset points')
    
    
# Insights
# A higher percentage of the customers who were pitched the Basic product accepted it.

In [None]:
print( 'cross tab between Gender and ProdTaken' )
typeofContact_df = pd.crosstab(tourismDf['Gender'],tourismDf['ProdTaken'],normalize=False)
print(typeofContact_df)
print('-----------------------------------------------')
print( 'cross tab percentages between Gender and ProdTaken' )
typeofContact_df = pd.crosstab(tourismDf['Gender'],tourismDf['ProdTaken'],normalize='index')
print(typeofContact_df)
typeofContact_df = typeofContact_df.stack().reset_index().rename(columns={0:'value'})
typeofContact_df


plt.figure(figsize=(15,5)) 
g = sns.barplot(x=typeofContact_df['Gender'],y=typeofContact_df['value'],hue='ProdTaken',data=typeofContact_df)
for p in g.patches:
    g.annotate(format(p.get_height()*100, '.2f'), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0, 9), 
                   textcoords = 'offset points')
    
    
# There is no visible gender difference in acceptance: male and female customers show the same trend.

In [None]:
print( 'cross tab between MaritalStatus and ProdTaken' )
typeofContact_df = pd.crosstab(tourismDf['MaritalStatus'],tourismDf['ProdTaken'],normalize=False)
print(typeofContact_df)
print('-----------------------------------------------')
print( 'cross tab percentages between MaritalStatus and ProdTaken' )
typeofContact_df = pd.crosstab(tourismDf['MaritalStatus'],tourismDf['ProdTaken'],normalize='index')
print(typeofContact_df)
typeofContact_df = typeofContact_df.stack().reset_index().rename(columns={0:'value'})
typeofContact_df


plt.figure(figsize=(15,5)) 
g = sns.barplot(x=typeofContact_df['MaritalStatus'],y=typeofContact_df['value'],hue='ProdTaken',data=typeofContact_df)
for p in g.patches:
    g.annotate(format(p.get_height()*100, '.2f'), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0, 9), 
                   textcoords = 'offset points')
    
# Single customers have a higher acceptance rate.

    
## Insights from bivariate graphs
1.  All Free Lancer customers accepted the product, but there are only 2 of them, so the group is too small to generalize from.
    A higher percentage of Large Business customers accepted the product.

2.  A higher percentage of the customers who were pitched the Basic product accepted it.

3.  There is no visible gender difference in acceptance: male and female customers show the same trend.

4.  Single customers have a higher acceptance rate.

5.  Customers who accepted the product have taken more trips.

6.  The DurationOfPitch distributions for ProdTaken = 0 and ProdTaken = 1 behave the same way.

7.  Most customers who accepted the product have a monthly income between 20000 and 30000.

8.  There is no significant age difference between customers who accepted the product and those who did not.
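
The crosstab cells above repeat the same steps for each categorical column. As a rough sketch, the numeric part can be folded into one helper that also reports the group size, which makes very small groups (such as the 2 Free Lancer customers) easy to spot. The function name `acceptance_by` and the toy rows are assumptions made for illustration.

```python
import pandas as pd

def acceptance_by(df, col, target="ProdTaken"):
    """Per-category customer count and share of customers with target == 1."""
    taken = df[target].astype(int)  # works for int or categorical 0/1 targets
    stats = taken.groupby(df[col], observed=True).agg(["count", "mean"])
    stats.columns = ["n_customers", "acceptance_rate"]
    return stats.sort_values("acceptance_rate", ascending=False)

# Made-up rows for illustration only
toy = pd.DataFrame({
    "TypeofContact": ["Self Enquiry"] * 4 + ["Company Invited"] * 2,
    "ProdTaken": [0, 0, 0, 1, 1, 0],
})

result = acceptance_by(toy, "TypeofContact")
print(result)
```

On the real frame, `acceptance_by(tourismDf, 'Occupation')` would give the same rates as the normalized crosstab, with the counts next to them.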

In [None]:
corr = abs(tourismDf.iloc[:,1:].select_dtypes(include='number').corr()) # absolute correlations between the numeric columns (CustomerID is skipped)
lower_triangle = np.tril(corr, k = -1)  # select only the lower triangle of the correlation matrix
mask = lower_triangle == 0  # to mask the upper triangle in the following heatmap

plt.figure(figsize = (15,8))  # setting the figure size
sns.heatmap(lower_triangle, center=0.5, cmap= 'Blues', annot= True, xticklabels = corr.index, yticklabels = corr.columns,
            cbar= False, linewidths= 1, mask = mask)   # heatmap of the lower triangle
plt.xticks(rotation = 50)   # Aesthetic purposes
plt.yticks(rotation = 20)   # Aesthetic purposes
plt.show()

NumberOfChildrenVisited and NumberOfPersonVisited have a correlation of 0.61, so NumberOfChildrenVisited will be dropped.
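
Strongly correlated pairs can also be listed directly rather than read off the heatmap. The following is a minimal sketch on made-up columns; the helper name `high_corr_pairs` and the 0.6 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.6):
    """List column pairs whose absolute correlation exceeds the threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep the strict upper triangle so that each pair is listed once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Made-up columns for illustration only
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],   # nearly proportional to a
    "c": [5, 1, 4, 2, 3],    # only weakly related to a and b
})

pairs = high_corr_pairs(toy)
print(pairs)
```

Applied to `tourismDf`, a call like this would be expected to surface the NumberOfChildrenVisited / NumberOfPersonVisited pair noted above.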

In [None]:
tourismDf['ProductPitched'].value_counts()

In [None]:
def get_customer_profile(prod_cust_df):
    """Plot a 3x3 grid profiling the customers who were pitched one package."""

    tot = len(prod_cust_df)

    # (column, plot kind) in row-major grid order:
    # 'dist' draws a distribution plot, 'count' draws a count plot labeled with percentages
    panels = [('Age', 'dist'), ('Occupation', 'count'), ('MaritalStatus', 'count'),
              ('Designation', 'count'), ('MonthlyIncome', 'dist'), ('Gender', 'count'),
              ('OwnCar', 'count'), ('Passport', 'count'), ('ProdTaken', 'count')]

    fig, ax_list = plt.subplots(3, 3, figsize=(20,20))

    for ax, (col, kind) in zip(ax_list.flat, panels):
        if kind == 'dist':
            sns.distplot(prod_cust_df[col], ax=ax)
            continue

        g = sns.countplot(x=prod_cust_df[col], ax=ax)
        for p in g.patches:  # label each bar with its share of this segment, in percent
            g.annotate(format((p.get_height()/tot)*100, '.2f'),
                       (p.get_x() + p.get_width() / 2., p.get_height()),
                       ha = 'center', va = 'center',
                       xytext = (0, 9),
                       textcoords = 'offset points')

    plt.show()

In [None]:
tourismDf_Basic = tourismDf[tourismDf['ProductPitched']=='Basic']
get_customer_profile(tourismDf_Basic)

In [None]:
tourismDf_Deluxe = tourismDf[tourismDf['ProductPitched']=='Deluxe']
get_customer_profile(tourismDf_Deluxe)

In [None]:
tourismDf_Standard = tourismDf[tourismDf['ProductPitched']=='Standard']
get_customer_profile(tourismDf_Standard)

In [None]:
tourismDf_SD = tourismDf[tourismDf['ProductPitched']=='Super Deluxe']
get_customer_profile(tourismDf_SD)

In [None]:
tourismDf_king = tourismDf[tourismDf['ProductPitched']=='King']
get_customer_profile(tourismDf_king)
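
The profile plots can be complemented by a compact table with one row per package. The sketch below is one possible summary under assumed choices (median age, median income, most common designation) and uses made-up rows; the `profile` table and its column names are illustrative, not part of the original analysis.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "ProductPitched": ["Basic", "Basic", "Deluxe", "Deluxe", "Deluxe"],
    "Age": [28.0, 32.0, 40.0, 36.0, 44.0],
    "MonthlyIncome": [17000.0, 19000.0, 22000.0, 24000.0, 23000.0],
    "Designation": ["Executive", "Executive", "Manager", "Manager", "Executive"],
})

# One row per package: size of the segment and its typical customer
profile = toy.groupby("ProductPitched").agg(
    customers=("Designation", "count"),
    median_age=("Age", "median"),
    median_income=("MonthlyIncome", "median"),
    top_designation=("Designation", lambda s: s.mode().iloc[0]),
)

print(profile)
```

Medians are used here because the income and duration columns have long tails, so a mean would be pulled toward the extreme values.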

## <a id = "link4"></a> Illustrate the insights based on EDA

## Key points from Description

Extracts from the description which lead me towards my conclusion

i) <i>"The Policy Maker of the company wants to enable and establish a <B> viable business model to expand the customer base. </B>"</i>

ii) <i>"However, this time company wants to harness the available data of existing and potential customers <B> to make the marketing expenditure more efficient.</B>"</i>

From the excerpts above, the emphasis is on spending the marketing budget efficiently.

This means that, of all the customers the ML model predicts as likely to buy the package, most should actually buy it; if they do, the objective is reached.
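
The quantity described above, the share of predicted buyers who actually buy, is the precision of a classifier; recall is the share of actual buyers that the model finds. A small worked example on made-up labels (the `y_true` and `y_pred` values are assumptions for illustration) shows how the two are computed from the confusion counts:

```python
# Made-up labels for eight customers: 1 = bought the package, 0 = did not
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

# Confusion counts for the positive class
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # predicted buyers who bought
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # predicted buyers who did not buy
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # buyers the model missed

precision = tp / (tp + fp)  # marketing contacts that pay off
recall = tp / (tp + fn)     # buyers that get contacted at all

print("precision:", precision)
print("recall   :", recall)
```

With scikit-learn, `metrics.precision_score(y_true, y_pred)` and `metrics.recall_score(y_true, y_pred)` compute the same two quantities.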

That is, we need to maximize the true positives among all positive predictions, which is the precision of the model.


If we are able to identify the majority of the customers who are likely to purchase the travel package, then

