# Project - Ensemble Techniques - Travel Package Purchase Prediction

## Background and Context

A tourism company named "Visit with us" wants to enable and establish a viable business model to expand the customer base by harnessing the available data of existing and potential customers to make the marketing expenditure more efficient.

One of the ways to expand the customer base is to introduce a new offering of packages. Currently, there are 5 types of packages the company is offering - Basic, Standard, Deluxe, Super Deluxe, King. Looking at the data of the last year, it has been observed that 18% of the customers purchased the packages.

However, the marketing cost was quite high because customers were contacted at random without looking at the available information.

The company is now planning to launch a new product i.e. Wellness Tourism Package. Wellness Tourism is defined as Travel that allows the traveler to maintain, enhance or kick-start a healthy lifestyle, and support or increase one's sense of well-being.

We are required to analyze the customers' data and information to provide recommendations to the Policy Maker and Marketing Team and also build a model to predict the potential customer who is going to purchase the newly introduced travel package.

## Objective

To predict which customer is more likely to purchase the newly introduced travel package based on the available data so that customer base can be expanded and the marketing expenditure can be optimized.


## Key Questions

1. What are the key factors influencing whether a customer will buy Wellness Tourism Package or not?
2. Is there a good predictive model that can increase the conversion rate of targeted marketing?
3. What does the performance assessment look like for such a model?


## Data Dictionary

### Customer details:

* CustomerID: Unique customer ID
* ProdTaken: Whether the customer has purchased a package or not (0: No, 1: Yes)
* Age: Age of customer
* TypeofContact: How customer was contacted (Company Invited or Self Enquiry)
* CityTier: City tier depends on the development of a city, population, facilities, and living standards. The categories are ordered i.e. Tier 1 > Tier 2 > Tier 3
* Occupation: Occupation of customer
* Gender: Gender of customer
* NumberOfPersonVisiting: Total number of persons planning to take the trip with the customer
* PreferredPropertyStar: Preferred hotel property rating by customer
* MaritalStatus: Marital status of customer
* NumberOfTrips: Average number of trips in a year by customer
* Passport: The customer has a passport or not (0: No, 1: Yes)
* OwnCar: Whether the customers own a car or not (0: No, 1: Yes)
* NumberOfChildrenVisiting: Total number of children with age less than 5 planning to take the trip with the customer
* Designation: Designation of the customer in the current organization
* MonthlyIncome: Gross monthly income of the customer

### Customer interaction data: 

* PitchSatisfactionScore: Sales pitch satisfaction score
* ProductPitched: Product pitched by the salesperson
* NumberOfFollowups: Total number of follow-ups done by the salesperson after the sales pitch
* DurationOfPitch: Duration of the pitch by a salesperson to the customer

## Import necessary libraries and load data

In [None]:
# Library to suppress warnings or deprecation notes 
import warnings
warnings.filterwarnings('ignore')

# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Removes the limit from the number of displayed columns and rows.
# This is so that we can see the entire dataframe when we print it
pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)
pd.set_option('display.max_rows', 200)
# We are setting the random seed via np.random.seed so that
# we get the same random results every time
np.random.seed(1)

In [None]:
# Load the Tourism data from sheet Tourism of Tourism.xlsx
data = pd.read_excel('Tourism.xlsx', sheet_name='Tourism')

In [None]:
print(f'Number of rows: {data.shape[0]} and number of columns: {data.shape[1]}')

In [None]:
# Check the first 5 rows of data
data.head()

In [None]:
# Check the last 5 rows of data
data.tail()

In [None]:
# checking column names, datatypes and number of non-null values
data.info()

> **Observations:**
> * CustomerID column can be dropped as it is a unique number and does not contain any useful information.
> * There are null or missing values in the Age, TypeofContact, DurationOfPitch, NumberOfFollowups, PreferredPropertyStar, NumberOfTrips, NumberOfChildrenVisiting and MonthlyIncome columns.
> * Both numerical and object data type columns are present in the dataset.

In [None]:
# Take a backup of the original dataset
data_bkp = data.copy()

# Delete the CustomerID column from the working dataset
data.drop(['CustomerID'], axis=1, inplace=True)

In [None]:
# Checking for duplicate rows.
data.duplicated().sum()

In [None]:
data[data.duplicated()].sample(10)

> **Observations:**
> * There are 141 duplicated rows in the current dataset.
> * We are going to remove the duplicate rows as they will not help in model building as such.
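
As a minimal illustration of what `duplicated()` and `drop_duplicates()` do, the sketch below uses a small made-up frame (the `toy` name and its values are assumptions for illustration, not rows from Tourism.xlsx). By default the first occurrence of each repeated row is kept, so only the later copies are counted and removed.

```python
import pandas as pd

# Toy frame (illustrative values only): rows 0 and 2 are identical
toy = pd.DataFrame(
    {
        "Age": [30, 41, 30, 25],
        "Gender": ["Male", "Female", "Male", "Male"],
    }
)

# duplicated() flags every repeat after the first occurrence
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence; ignore_index renumbers the rows
deduped = toy.drop_duplicates(ignore_index=True)

print(n_dupes, deduped.shape)
```

Note that rows only count as duplicates here because every column matches; once CustomerID is dropped, two different customers with identical attributes would also be flagged.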

In [None]:
# Remove the duplicate rows and reset the index
data.drop_duplicates(ignore_index=True, inplace=True)
print(f'Number of rows: {data.shape[0]} and number of columns: {data.shape[1]}')

In [None]:
# Check the Summary of the data
data.describe().T

> **Observations:**
> * *ProdTaken*: This is our target variable. Values are 0 or 1. Only 18.8% have 1 values.
> * *Age*: Minimum 18 and maximum 61 years, with a median of 36 years. The majority of customers are middle-aged.
> * *CityTier*: Customers come from three different city tiers. Most of them are from Tier 1 cities.
> * *DurationOfPitch*: The mean pitch duration is about 15 (assuming minutes), but there are some outliers, as the maximum and the 75th percentile differ a lot.
> * *NumberOfPersonVisiting*: Varies from 1 to 5 persons with a median of 3. This data looks good.
> * *NumberOfFollowups*: Every customer was followed up at least once. Some customers were followed up to 6 times.
> * *PreferredPropertyStar*: Varies from 3 to 5 stars with a median of 3 stars.
> * *NumberOfTrips*: Every customer had at least one trip. Some customers have more than 20 trips, which looks like an outlier since the 75th percentile is 4 trips.
> * *Passport*: Around 29% of customers have passports.
> * *PitchSatisfactionScore*: Varies from 1 to 5, with both mean and median around 3.
> * *OwnCar*: About 61% of customers own a car.
> * *NumberOfChildrenVisiting*: Customers planned to take up to 3 children on their trips.
> * *MonthlyIncome*: Monthly income varies from 1K to 98.6K with a median of 25.5K. There appear to be some outliers at the high end.

Let's check the summary of other categorical variables, including the null values if any.

In [None]:
cat_cols = data.select_dtypes(include=["object", "category"]).columns
for col in cat_cols:
    print('>> Domain of: ', col)
    print('-------------------------------')
    cat = data[col].value_counts(dropna=False).sort_values(ascending=False)
    print(cat)
    print('-------------------------------\n')

> **Observations:**
> * *TypeofContact*: There are **25 missing values**; otherwise it has two unique values.
> * *Occupation*: Consists of four unique values; the majority are Salaried and Small Business.
> * *Gender*: **143 rows have the value "Fe Male"**, which needs to be treated as **"Female"**.
> * *ProductPitched*: There are 5 different types of products as stated in the problem description.
> * *MaritalStatus*: Single and Unmarried will be treated as separate categories, based on the advice.
> * *Designation*: Looks good, with 5 different designations; the majority are Executive and Manager.

## Fix missing and wrong values
We already know that there are null or missing values in many columns and also some wrong values such as "Fe Male". Let us fix these issues in the dataset.

In [None]:
data.isnull().sum()

In [None]:
# Take a look at the sample data where Age is null to see if there are any patterns
data[data.Age.isnull()].sample(10)

In this sample we see no relation between the null values of Age and the null values of DurationOfPitch or MonthlyIncome. The null values seem to be random data entry errors.
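
Eyeballing a random sample is only a rough check. A slightly more systematic sketch, on a made-up frame (the `toy` values are assumptions for illustration), measures how often the other columns are missing in exactly the rows where Age is missing:

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only) with scattered missing entries
toy = pd.DataFrame(
    {
        "Age": [30, np.nan, 45, np.nan, 28, 39],
        "DurationOfPitch": [10, 12, np.nan, 9, 15, np.nan],
        "MonthlyIncome": [20000, 21000, np.nan, 25000, 19000, 30000],
    }
)

# Boolean indicator of missingness per column
is_missing = toy.isnull()

# Share of missing values per column, restricted to rows where Age is missing
co_missing = is_missing[is_missing["Age"]].mean()

print(co_missing)
```

Values near 0 for the other columns suggest the gaps do not occur together; values near 1 would point to a shared cause.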

### Fix Gender column values

In [None]:
# Correct the Gender value "Fe Male" to "Female"
data['Gender'] = data['Gender'].replace('Fe Male', 'Female')

In [None]:
data['Gender'].value_counts()

### Fix TypeofContact column values

Null values of TypeofContact will be filled with the mode value since there are only 25 of them.
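
Rather than typing the mode by hand, it can also be looked up from the column itself. The sketch below shows the idea on a made-up Series (the `contact` values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy column (illustrative values only) with two missing entries
contact = pd.Series(
    ["Self Enquiry", "Company Invited", "Self Enquiry", np.nan, "Self Enquiry", np.nan],
    name="TypeofContact",
)

# mode() ignores NaN by default and returns a Series; take the first entry
most_common = contact.mode().iloc[0]

# Fill the gaps with the most frequent value
filled = contact.fillna(most_common)

print(most_common, filled.isnull().sum())
```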

In [None]:
# Handling Type of contact
data['TypeofContact'] = data['TypeofContact'].fillna('Self Enquiry')

In [None]:
data['TypeofContact'].value_counts()

### Fix Age and MonthlyIncome column values

Missing values of Age and MonthlyIncome will be replaced with median.

In [None]:
# Fill Age and MonthlyIncome null values with the respective column median
for col in ['Age', 'MonthlyIncome']:
    data[col] = data[col].fillna(data[col].median())

### Fix Others column values

All other missing values will be filled with the column mean, rounded to the nearest integer.

In [None]:
cols = ['NumberOfFollowups', 'DurationOfPitch', 'PreferredPropertyStar', 'NumberOfTrips', 'NumberOfChildrenVisiting']
for col in cols:
    data[col] = data[col].fillna(round(data[col].mean()))


Finally let's again check if there are any missing values pending to be treated.

In [None]:
data.isnull().sum()

In [None]:
# The dataset has no more missing values; let's take a look at the first 5 rows.
data.head()

## Univariate Analysis
To make the univariate analysis easier, we define the functions below. They plot non-categorical columns such as Age and MonthlyIncome as box and hist plots, and categorical columns such as TypeofContact, CityTier and Occupation as count plots with percentages.

In [None]:
# function to plot a boxplot and a histogram along the same scale.
def box_hist_plot(data, feature, figsize=(12, 7), bins=20):
    """
    Show a box plot and a hist plot stacked in one column. For the hist plot, kde is set to True.

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    bins: number of histogram bins (default 20)
    """
    # creating the 2 subplots
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  
    # boxplot will be created and a triangle marker will indicate the mean value of the column
    sns.boxplot(data=data, x=feature, ax=ax_box2, showmeans=True, color="violet")  
    # For hist plot
    sns.histplot(data=data, x=feature, kde=True, bins=bins, ax=ax_hist2) 
    # Add mean to the histogram
    ax_hist2.axvline(data[feature].mean(), color="green", linestyle="--") 
    # Add median to the histogram
    ax_hist2.axvline(data[feature].median(), color="black", linestyle="-")  

In [None]:
# function to create bar plot with percent labels
def bar_perc_plot(data, feature):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    """

    total = len(data[feature])  # number of rows, used for the percentages
    # calculate the figure width dynamically from the number of categories
    count = data[feature].nunique()
    plt.figure(figsize=(count + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        # Sort the bars from high to low
        order=data[feature].value_counts().sort_values(ascending=False).index,
    )

    for bar in ax.patches:
        # percentage of each class of the category
        label = "{:.2f}%".format(100 * bar.get_height() / total)
        x = bar.get_x() + bar.get_width() / 2  # horizontal center of the bar
        y = bar.get_height()  # top of the bar

        # annotate the percentage
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        ) 

    plt.show()  # show the plot

### Analysis of Age

In [None]:
#Age
box_hist_plot(data, 'Age')

> **Observations:**
> * The majority of customers are between 30 and 40 years old.
> * The mean and median age are both between 35 and 40 years.
> * Age is approximately normally distributed and there are no outliers.

### Analysis of DurationOfPitch

In [None]:
#DurationOfPitch
box_hist_plot(data, 'DurationOfPitch')

> **Observations:**
> * Duration of pitch is right-skewed because of a few outliers.
> * The mean and median pitch durations are around 15 mins.

Pitch durations of more than 45 mins seem very odd. We will check these outliers and cap them at 45 mins.

In [None]:
# Check the durations above 45 mins
data[data.DurationOfPitch > 45]

In [None]:
# There are only two rows with such high values, and neither of these customers has taken the product.
# Let's cap their values at 45 mins
data['DurationOfPitch'] = data['DurationOfPitch'].apply(lambda x: x if x < 45 else 45)

In [None]:
box_hist_plot(data, 'DurationOfPitch')

The duration of pitch now looks closer to a bell shape, although it is still right-skewed.
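
For non-missing values, the same capping can be written with `Series.clip`. The sketch below uses made-up durations (the `duration` values are assumptions for illustration):

```python
import pandas as pd

# Toy pitch durations in minutes (illustrative values only)
duration = pd.Series([8, 15, 22, 36, 126, 127], name="DurationOfPitch")

# Cap everything above 45 at 45; values at or below 45 are unchanged
capped = duration.clip(upper=45)

print(capped.tolist())
```

Capping keeps the rows in the dataset while limiting the pull of extreme values on the mean and on distance- or split-based models.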

### Analysis of MonthlyIncome

In [None]:
box_hist_plot(data, 'MonthlyIncome')

> **Observations:**
> * Monthly income also has a right-skewed distribution.
> * Median monthly income is around 22K. We are not sure about the currency, but the outliers close to 100K seem very high.

We can estimate a plausible monthly income from the median income of each designation. But first, let's check the outliers above 50K.

In [None]:
data[data.MonthlyIncome > 50000]

In [None]:
# Check the median monthly income grouped by designation.
data.groupby('Designation')['MonthlyIncome'].median()

Since the outliers are Executives, whose median income is around 20K, it is very unlikely that they have a monthly income around 100K. Let's set these income values to the Executive median income.

In [None]:
# 20755 is the Executive median income taken from the groupby output above
data['MonthlyIncome'] = data['MonthlyIncome'].apply(lambda x: x if x < 50000 else 20755)

In [None]:
box_hist_plot(data, 'MonthlyIncome')

Monthly income is now more centrally distributed. Some outliers remain, but they are acceptable given the designations.
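
The cell above hard-codes the replacement value 20755. A possible variation, sketched here on made-up data (the `toy` frame, `threshold` and `median_by_desig` names are assumptions for illustration), looks up the median of each outlier's own designation instead. Unlike the notebook, it computes the medians from the non-outlier rows only.

```python
import pandas as pd

# Toy frame (illustrative values only); one Executive has an implausible income
toy = pd.DataFrame(
    {
        "Designation": ["Executive", "Executive", "Executive", "Manager", "Manager"],
        "MonthlyIncome": [20000.0, 21000.0, 98000.0, 24000.0, 26000.0],
    }
)

threshold = 50000  # same cut-off as in the notebook

# Median income per designation, computed from the rows below the threshold
normal = toy[toy["MonthlyIncome"] < threshold]
median_by_desig = normal.groupby("Designation")["MonthlyIncome"].median()

# Replace only the outlier rows with the median of their own designation
is_outlier = toy["MonthlyIncome"] >= threshold
toy.loc[is_outlier, "MonthlyIncome"] = toy.loc[is_outlier, "Designation"].map(median_by_desig)

print(toy)
```

This avoids a magic number and still works if an outlier turns up under a different designation.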

### Analysis of ProdTaken

In [None]:
bar_perc_plot(data, 'ProdTaken')

> **Observations:**
> * Only 18.8% of customers have bought travel packages in the given dataset.
> * Most of the customers have not bought any travel package, which means we have an imbalanced dataset.
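
To see why this imbalance matters, consider a trivial "model" that predicts 0 for everyone. The arithmetic below assumes 1000 customers with the 18.8% purchase rate observed above (the round total is an assumption for illustration):

```python
# Assumed class balance: 18.8% positives out of a round 1000 customers
n_total = 1000
n_pos = 188
n_neg = n_total - n_pos

# A "model" that always predicts 0 (no purchase)
true_pos = 0
true_neg = n_neg   # every actual 0 is predicted correctly
false_neg = n_pos  # every actual 1 is missed

accuracy = (true_pos + true_neg) / n_total
recall = true_pos / (true_pos + false_neg)

print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")
```

Accuracy looks respectable at about 81% even though not a single buyer is found, which is why accuracy alone is a poor yardstick for this dataset.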

### Analysis of TypeofContact

In [None]:
bar_perc_plot(data, 'TypeofContact')

> **Observations:**
> * Company-invited customers make up around 29%.
> * Self-enquiry customers make up around 71%. We would have expected self-enquiry customers to buy the travel packages, but that does not appear to be the case: only 18.8% of all customers have bought a travel package, so some factors seem to be discouraging self-enquiry customers.

### Analysis of CityTier

In [None]:
bar_perc_plot(data, 'CityTier')

> **Observations:**
> * Most of the customers are from Tier 1, followed by 2 and 3.
> * Tier 3 city customers are very few, around 4%.

### Analysis of Occupation

In [None]:
bar_perc_plot(data, 'Occupation')

> **Observations:**
> * Most of the customers are Salaried or Small Business.
> * We have around 9% Large Business customers and a negligible number of Free Lancers.

### Analysis of Gender

In [None]:
bar_perc_plot(data, 'Gender')

> **Observations:**
> * Around 60% of our customers are Male.
> * Around 40% of customers are Female.

### Analysis of NumberOfPersonVisiting

In [None]:
bar_perc_plot(data, 'NumberOfPersonVisiting')

> **Observations:**
> * The most common NumberOfPersonVisiting value is 3, followed by 2 and 4.
> * Very few customers plan to travel alone or with more than 4 persons.

### Analysis of NumberOfFollowups

In [None]:
bar_perc_plot(data, 'NumberOfFollowups')

> **Observations:**
> * Most customers were followed up 4 times; 3 and 5 follow-ups are the next most common.
> * Few customers were followed up only 1 or 2 times, or more than 5 times.

### Analysis of ProductPitched

In [None]:
bar_perc_plot(data, 'ProductPitched')

> **Observations:**
> * Basic and Deluxe travel packages were pitched the most, followed by Standard and Super Deluxe.
> * The King travel package was offered to only around 7.4% of customers.

Although the dataset has the above five travel products, for this analysis we are interested in another product, namely the Wellness Tourism Package. During model building we will encode this column as ordered numerical values, as we are not interested in these existing products.

### Analysis of PreferredPropertyStar

In [None]:
bar_perc_plot(data, 'PreferredPropertyStar')

> **Observations:**
> * Around 61% of customers preferred 3-star rated properties.
> * 5-star and 4-star rated properties are preferred by almost equal numbers of customers, around 20% each.

### Analysis of MaritalStatus

In [None]:
bar_perc_plot(data, 'MaritalStatus')

> **Observations:**
> * The largest group of customers in the dataset, around 48%, are Married.
> * Around 20% of customers are Divorced, followed by Single and Unmarried.

### Analysis of NumberOfTrips

In [None]:
bar_perc_plot(data, 'NumberOfTrips')

> **Observations:**
> * Around 30% of customers take 2 trips per year, followed by 3 trips and 1 trip per year.
> * A few customers take 19 or more trips per year, which is very unusual.

Let's check how many customers have more than 8 trips.

In [None]:
data[data.NumberOfTrips > 8]

In [None]:
# There are only 4 customers having Number of trips more than 8. 
# In order to fix these outliers, for these customers we are going to set the number of trips to 8.
data['NumberOfTrips'] = data['NumberOfTrips'].apply(lambda x: x if x < 8 else 8)

In [None]:
# See the distribution of NumberOfTrips after outlier treatment
bar_perc_plot(data, 'NumberOfTrips')

### Analysis of PitchSatisfactionScore

In [None]:
bar_perc_plot(data, 'PitchSatisfactionScore')

> **Observations:**
> * The largest group of customers (30%) gave a satisfaction score of 3.
> * However, more than 30% of customers gave low satisfaction scores of 1 or 2. The company should investigate this issue.
> * Around 40% of customers gave a very good satisfaction score of 4 or 5.

### Analysis of OwnCar

In [None]:
bar_perc_plot(data, 'OwnCar')

> **Observations:**
> * A little over 60% of customers in the dataset own a car.
> * However, close to 40% of customers don't own a car.

### Analysis of NumberOfChildrenVisiting

In [None]:
bar_perc_plot(data, 'NumberOfChildrenVisiting')

> **Observations:**
> * Most customers planning to travel had one accompanying child, followed by 2.
> * 22% of customers reported no accompanying children.
> * Trips with 3 or more accompanying children are rare, at around 7%.

### Analysis of Designation

In [None]:
bar_perc_plot(data, 'Designation')

> **Observations:**
> * There are five different designations; the majority of customers are Executives or Managers.
> * Customers with higher designations, such as AVP and VP, are fewer in number.

Company designation does not generally decide a person's travel package buying. It is more correlated with Age and monthly income. Hence we will encode this data as an ordered numerical feature before running our model.

## Multi-Variate Analysis

### Distribution of Product Taken with Non-Categorical features

Ignoring the outliers let's check the distribution of Product Taken based on the non-categorical features

In [None]:
cols = ["Age", "DurationOfPitch", "MonthlyIncome", "NumberOfTrips"]
plt.figure(figsize=(12, 8))

for i, variable in enumerate(cols):
    plt.subplot(2, 2, i + 1)
    sns.boxplot(data = data, x="ProdTaken", y=variable, showfliers=False)
    plt.tight_layout()
    plt.title(variable)
plt.show()

> **Observations:**
> * Mostly younger people, from 25 to 40 years old, have taken a travel package.
> * The duration of pitch is also higher for customers who have taken a travel package.
> * People in relatively lower monthly income groups, such as Executives and Managers, have taken a travel package.
> * The NumberOfTrips distribution shows no difference between customers who took a product and those who did not.

### Distribution of Product Taken with Categorical features

Now let's check the distribution of customers who took a product with regard to the categorical features.

In [None]:
# function to print category counts and plot a normalized stacked bar chart
def count_and_normalize_plot(data, predictor, target):
    """
    Print the category counts and plot a normalized stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    # least frequent class of the target (last entry of value_counts), used for sorting
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 100)
    
    fig = plt.figure(figsize=(14,5))
    
    # Add the first plot
    ax1 = fig.add_subplot(121)
    ax1.set_title(predictor + ' by ' + target)
    sns.countplot(data = data, x=target, hue=predictor, ax=ax1)

    # Add the second plot
    ax2 = fig.add_subplot(122)
    ax2.set_title('Normalized ' + predictor + ' by ' + target)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, ax=ax2)
    # place the legend outside the plot area
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()

#### TypeofContact Vs ProdTaken

In [None]:
count_and_normalize_plot(data, 'TypeofContact', 'ProdTaken')

> **Observations:**
> * Self-enquiry customers are more numerous in the product-taken category, which makes sense.
> * However, the normalized chart shows that a **higher percentage of company-invited customers** have taken a travel package.
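
The difference between the two charts comes down to raw counts versus within-group rates. The sketch below reproduces the effect on made-up data (groups "A" and "B" and their values are assumptions for illustration), using the same `pd.crosstab(..., normalize="index")` call as the plotting function:

```python
import pandas as pd

# Toy data (illustrative values only): group A is large, group B is small
toy = pd.DataFrame(
    {
        "TypeofContact": ["A"] * 10 + ["B"] * 4,
        "ProdTaken": [1, 1, 1] + [0] * 7 + [1, 1] + [0] * 2,
    }
)

# Raw counts: A has more buyers in absolute terms (3 vs 2)
counts = pd.crosstab(toy["TypeofContact"], toy["ProdTaken"])

# Row-normalized: share of buyers within each group (A: 0.30, B: 0.50)
rates = pd.crosstab(toy["TypeofContact"], toy["ProdTaken"], normalize="index")

print(counts)
print(rates)
```

A larger group can contribute more buyers overall while converting at a lower rate, so both views are needed.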

#### CityTier Vs ProdTaken

In [None]:
count_and_normalize_plot(data, 'CityTier', 'ProdTaken')

> **Observations:**
> * Customers who have taken a travel package are mainly from Tier 1 and 3 cities.
> * However, the normalized chart shows that a **higher percentage of Tier 3 city customers** have taken a travel package.

#### Occupation Vs ProdTaken

In [None]:
count_and_normalize_plot(data, 'Occupation', 'ProdTaken')

> **Observations:**
> * Salaried and Small Business customers are more numerous in the product-taken category.
> * However, a **higher percentage of Large Business customers** have taken a travel package.
> * Since there are only 2 data points for Free Lancer customers, we cannot draw any conclusion.

#### Gender Vs ProdTaken

In [None]:
count_and_normalize_plot(data, 'Gender', 'ProdTaken')

> **Observations:**
> * More Male customers have taken travel packages.
> * A slightly higher percentage of Male customers have also taken a product compared to Female customers.

#### NumberOfPersonVisiting Vs ProdTaken

In [None]:
count_and_normalize_plot(data, 'NumberOfPersonVisiting', 'ProdTaken')

> **Observations:**
> * Customers who planned to visit with 2 to 4 persons have taken a product.
> * Most of the customers who have taken a travel product planned to visit with 3 persons.
> * There are no data points for customers who have taken a product and planned to visit with either 1 or 5 persons.

#### NumberOfFollowups Vs ProdTaken

In [None]:
count_and_normalize_plot(data, 'NumberOfFollowups', 'ProdTaken')

> **Observations:**
> * Most of the customers who took a travel package were followed up 4 times.
> * However, the normalized chart shows that a **higher percentage of customers who were followed up more often**, 5 or 6 times, have taken a product.

#### ProductPitched Vs ProdTaken

In [None]:
count_and_normalize_plot(data, 'ProductPitched', 'ProdTaken')

> **Observations:**
> * Most of the customers who took a product took the Basic travel package, followed by Standard.
> * The Basic travel package seems to be popular, as a higher percentage of the customers who were pitched it have taken it.
> * There is not much demand for the Super Deluxe and King travel packages.

#### PreferredPropertyStar Vs ProdTaken

In [None]:
count_and_normalize_plot(data, 'PreferredPropertyStar', 'ProdTaken')

> **Observations:**
> * Most of the customers who took a product preferred 3-star rated properties.
> * However, a **higher percentage of customers who preferred 5-star and 4-star properties** have taken a travel package.

#### MaritalStatus Vs ProdTaken

In [None]:
count_and_normalize_plot(data, 'MaritalStatus', 'ProdTaken')

> **Observations:**
> * Most of the customers who have taken a travel package are Single or Married, followed by Unmarried and Divorced.
> * However, the normalized chart shows that **Single customers stand out**, followed by Married.

#### Passport Vs ProdTaken

In [None]:
count_and_normalize_plot(data, 'Passport', 'ProdTaken')

> **Observations:**
> * Very few customers without a passport, about 10%, have taken a travel package.
> * But the normalized chart clearly shows that **close to 40% of customers who have a passport have taken a travel package**. This observation is very encouraging.

#### PitchSatisfactionScore Vs ProdTaken

In [None]:
count_and_normalize_plot(data, 'PitchSatisfactionScore', 'ProdTaken')

> **Observations:**
> * Most of the customers who have taken a travel package gave an average pitch satisfaction score of 3.
> * It is interesting to see that a higher percentage of customers have taken a travel package when the pitch satisfaction score is 5.

#### OwnCar Vs ProdTaken

In [None]:
count_and_normalize_plot(data, 'OwnCar', 'ProdTaken')

> **Observations:**
> * Owning a car did not make much difference with regards to product taken.

#### NumberOfChildrenVisiting Vs ProdTaken

In [None]:
count_and_normalize_plot(data, 'NumberOfChildrenVisiting', 'ProdTaken')


#### Designation Vs ProdTaken

In [None]:
count_and_normalize_plot(data, 'Designation', 'ProdTaken')

> **Observations:**
> * Most of the customers who have taken a travel package are Executives, in other words, the designation with the lowest median monthly salary.
> * Customers with higher designations and higher median monthly income have not preferred the travel packages.

Based on the data dictionary and observations we see ProductPitched and Designation values are ordered or ranked in nature. Hence we will encode these features with ordered numerical values. Chances are that Designation and Monthly Income may be highly correlated.
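
Ordered encoding simply maps each label to its rank. As a minimal sketch, the example below applies the Designation ranking used in this notebook to a made-up column (the `designation` values are assumptions for illustration) with `Series.map`:

```python
import pandas as pd

# Ranking of designations, lowest to highest, as used in this notebook
designation_rank = {"Executive": 1, "Manager": 2, "Senior Manager": 3, "AVP": 4, "VP": 5}

# Toy column (illustrative values only)
designation = pd.Series(["Manager", "Executive", "VP", "AVP"], name="Designation")

# map() returns NaN for any label missing from the dictionary,
# which makes unexpected categories easy to spot
encoded = designation.map(designation_rank)

print(encoded.tolist())
```

Unlike one-hot encoding, this keeps the rank information in a single column, at the cost of assuming the steps between ranks are comparable.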

In [None]:
#data_bkp2 = data.copy()
replaceStruct = {
    "ProductPitched": {"Basic": 1, "Standard": 2, "Deluxe": 3, "Super Deluxe": 4, "King": 5},
    "Designation": {"Executive": 1, "Manager": 2, "Senior Manager": 3, "AVP": 4, "VP": 5},
}
data = data.replace(replaceStruct)

In [None]:
data.head()

In [None]:
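# Check the correlation between numeric features
# (numeric_only=True skips the remaining text columns, which would otherwise raise an error in pandas 2.x)
plt.figure(figsize=(15,8))
sns.heatmap(data.corr(numeric_only=True), annot=True);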

> **Observations:**
> * As expected, Designation and Monthly Income are highly correlated (0.85).
> * Designation and Product Pitched are also highly correlated (0.82), which again makes sense, as the company may have pitched higher-value products to customers with higher designations.
> * Similarly, we can see a correlation (0.67) between Product Pitched and Monthly Income, although it is not very strong.
> * Number of children visiting and number of persons visiting are also correlated (0.61).
> * Age has a positive correlation with Monthly Income and Designation, which is also self-explanatory.
> * There is no significant correlation among the other features.
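
Reading pairs off a heatmap gets tedious with many features. The sketch below ranks the pairwise correlations on a made-up frame (the `toy` values are assumptions for illustration), keeping each pair once:

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only); "Gender" is non-numeric on purpose
toy = pd.DataFrame(
    {
        "Designation": [1, 2, 3, 4, 5, 1],
        "MonthlyIncome": [20, 23, 27, 32, 36, 21],
        "NumberOfTrips": [3, 1, 4, 1, 5, 2],
        "Gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
    }
)

# numeric_only=True skips text columns instead of raising an error
corr = toy.corr(numeric_only=True)

# Keep the upper triangle only, so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

print(pairs)
```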

In [None]:
# Let us see the pairplot for all numerical variables
sns.pairplot(data, hue='ProdTaken');

## Model Building

We are now going to build several classification models based on ensemble techniques. The evaluation criteria remain the same for each model. At the end we'll compare the performance of all the models and provide recommendations for the **"Visit With Us"** marketing department's Wellness Tourism Package campaign. We are going to follow the steps below:

1. Split the data into the train and test set.
2. Train models on the training data.
3. Try to improve the model performance using hyperparameter tuning.
4. Test the performance on the test data.

### Model evaluation criterion:

### The model can make wrong predictions by:
1. Predicting that a person will take a travel package when in reality he/she does not (false positive).
2. Predicting that a person will not take a travel package when in reality he/she does (false negative).

### Which case is more important? 

* If we predict that a person will take a travel package but in reality he/she does not, the marketing cost spent on that customer is wasted. We are looking to optimize the marketing cost, but this matters less than an opportunity cost.

* If we predict that a customer will not take a travel package but in reality he/she would, it is a lost opportunity for the company. This is precisely what we want to avoid, since we want to increase the customer base for the travel package.


### How to reduce this loss, i.e. the need to reduce False Negatives?
* `recall` should be maximized: the greater the `recall`, the higher the chances of identifying customers who would like to buy the Wellness Tourism Package.
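
To make the trade-off concrete, the sketch below computes recall and precision from a hypothetical confusion matrix (all four counts are made up for illustration, not model output):

```python
# Hypothetical confusion-matrix counts (made up for illustration)
true_pos = 60    # predicted buyer, actually bought
false_neg = 40   # predicted non-buyer, actually bought (missed opportunity)
false_pos = 30   # predicted buyer, did not buy (wasted marketing spend)
true_neg = 870   # predicted non-buyer, did not buy

# Recall: share of actual buyers the model finds
recall = true_pos / (true_pos + false_neg)

# Precision: share of predicted buyers who really buy
precision = true_pos / (true_pos + false_pos)

print(f"recall={recall:.2f}, precision={precision:.2f}")
```

Reducing false negatives raises recall directly, while false positives only affect precision, which is why recall is the metric that tracks the opportunity cost described above.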

In [None]:
# Libraries to split data, impute missing values 
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

# Libraries to import decision tree classifier and different ensemble classifiers
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier

# Libraries to tune models and get different metric scores
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV

### Split Data

In [None]:
# Let's drop the Designation and ProductPitched features, as they are highly correlated with MonthlyIncome.
# Also, for ProductPitched, we are not interested in a particular existing product, since we have a new product;
# we only care about the end result, i.e. whether a product was taken.
X = data.drop(['ProdTaken', 'Designation', 'ProductPitched'], axis=1)
y = data['ProdTaken']

In [None]:
# One-hot encode the remaining categorical columns, dropping the first level of each
X = pd.get_dummies(X, drop_first=True)
# Splitting data into training and test set:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
print(X_train.shape, X_test.shape)

The `stratify` argument maintains the original distribution of classes in the target variable while splitting the data into train and test sets.
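
The effect is easy to verify on a small made-up target (the `X_toy` and `y_toy` arrays are assumptions for illustration) with the same `train_test_split` settings:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy target (illustrative values only): 20% positives, like an imbalanced label
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy
)

# Both splits keep the 20% positive rate of the full target
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, a random split of a small or imbalanced dataset can leave the test set with noticeably more or fewer positives than the training set, which distorts recall estimates.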

In [None]:
# Let's take a look at a sample of the encoded feature matrix
X.sample(10)

In [None]:
y.value_counts(normalize=True)

In [None]:
y_test.value_counts(normalize=True)

To calculate the different metrics, plot the confusion matrix and promote code reuse, we are going to write the utility functions below.
* The model_performance_classification_sklearn function will be used to check the performance of the models.
* The confusion_matrix_sklearn function will be used to plot the confusion matrix.
* The show_feature_imp function will show the important features for the given model, when applicable.

In [None]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
        },
        index=[0],
    )

    return df_perf

In [None]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

In [None]:
def show_feature_imp(model):
    """
    Plot feature importances for tree-based models (feature names are taken from the global X_train)
    """
    feature_names = X_train.columns
    importances = model.feature_importances_
    indices = np.argsort(importances)

    plt.figure(figsize=(12,12))
    plt.title('Feature Importances')
    plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
    plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
    plt.xlabel('Relative Importance')
    plt.show()

## Decision Tree Classifier

In [None]:
#Fitting the model
d_tree = DecisionTreeClassifier(random_state=1)
d_tree.fit(X_train,y_train)

#Calculating different metrics
d_tree_model_train_perf=model_performance_classification_sklearn(d_tree,X_train,y_train)
print("Training performance:\n",d_tree_model_train_perf)
d_tree_model_test_perf=model_performance_classification_sklearn(d_tree,X_test,y_test)
print("\nTesting performance:\n",d_tree_model_test_perf)

#Creating confusion matrix
print("\nConfusion Matrix:")
confusion_matrix_sklearn(d_tree,X_test,y_test)

> **Observations:**
> * As expected, the unpruned DecisionTreeClassifier overfits the training dataset, with a training recall score of 1.0.
> * The testing recall is 0.65, a large gap from the training recall.

### Hyperparameter Tuning

In [None]:
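# Choose the type of classifier.
# Weight the minority class (buyers, about 18% of customers) more heavily than the majority class.
dtree_estimator = DecisionTreeClassifier(class_weight={0:0.18,1:0.82},random_state=1)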

# Grid of parameters to choose from
parameters = {'max_depth': [2, 4, 6, 8, None], 
              'min_samples_leaf': [1, 2, 5, 7, 10],
              'max_leaf_nodes' : [2, 3, 5, 10,15],
              'min_impurity_decrease': [0.0001,0.001,0.01,0.1]
             }

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Run the grid search
grid_obj = GridSearchCV(dtree_estimator, parameters, scoring=scorer, n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
dtree_estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data. 
dtree_estimator.fit(X_train, y_train)

In [None]:
#Calculating different metrics for the tuned tree
dtree_estimator_model_train_perf=model_performance_classification_sklearn(dtree_estimator,X_train,y_train)
print("Training performance:\n",dtree_estimator_model_train_perf)
dtree_estimator_model_test_perf=model_performance_classification_sklearn(dtree_estimator,X_test,y_test)
print("Testing performance:\n",dtree_estimator_model_test_perf)

#Creating confusion matrix
confusion_matrix_sklearn(dtree_estimator,X_test,y_test)

> **Observations:**
> * The recall scores recorded for this step (1.0 on training, 0.65 on testing) are identical to those of the untuned tree, because they were computed on `d_tree` rather than on the tuned `dtree_estimator`.
> * With the tuned estimator passed to the metric function as above, these scores need to be regenerated before judging whether tuning reduces overfitting.

In [None]:
show_feature_imp(d_tree)

> **Observations:**
> * The model has given high importance to Age, DurationOfPitch, MonthlyIncome, Passport.
> * It has also given importance to PitchSatisfactionScore, NumberOfFollowups, Self Enquiry, CityTier. 
> * It has also given importance to the gender (Male), marital status (Single) and occupation (Small Business, Large Business) features.

## Random Forest Classifier

In [None]:
#Fitting the model
rf_estimator = RandomForestClassifier(random_state=1)
rf_estimator.fit(X_train,y_train)

#Calculating different metrics
rf_estimator_model_train_perf=model_performance_classification_sklearn(rf_estimator,X_train,y_train)
print("Training performance:\n",rf_estimator_model_train_perf)
rf_estimator_model_test_perf=model_performance_classification_sklearn(rf_estimator,X_test,y_test)
print("Testing performance:\n",rf_estimator_model_test_perf)

#Creating confusion matrix
confusion_matrix_sklearn(rf_estimator,X_test,y_test)

> **Observations:**
> * Training and testing recall scores are 1.0 and 0.50 respectively.
> * The model is highly overfitted.

### Hyperparameter Tuning

In [None]:
# Choose the type of classifier. 
rf_tuned = RandomForestClassifier(class_weight={0:0.18,1:0.82},random_state=1,oob_score=True,bootstrap=True)

parameters = {  
                'max_depth': [3, 6, 9, None],
                'max_features': ['sqrt','log2',None],
                'min_samples_leaf': np.arange(1,15,5),
                'min_samples_split': np.arange(2, 20, 5),
                'n_estimators': [20,40,80,100]
            }


# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Run the grid search
grid_obj = GridSearchCV(rf_tuned, parameters, scoring=scorer, cv=5, n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
rf_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data. 
rf_tuned.fit(X_train, y_train)

In [None]:
#Calculating different metrics
rf_tuned_model_train_perf=model_performance_classification_sklearn(rf_tuned,X_train,y_train)
print("Training performance:\n",rf_tuned_model_train_perf)
rf_tuned_model_test_perf=model_performance_classification_sklearn(rf_tuned,X_test,y_test)
print("Testing performance:\n",rf_tuned_model_test_perf)

#Creating confusion matrix
confusion_matrix_sklearn(rf_tuned,X_test,y_test)

> **Observations:**
> * Training and testing recall scores are 0.69 and 0.67 respectively.
> * Although the training performance has dropped, the model is no longer overfitted.
> * A good improvement over the last model.
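
The class weights `{0:0.18, 1:0.82}` passed to the tuned forest appear to mirror the 18% purchase rate. The sketch below, which assumes that reading, shows why such weights help: each class receives the other class's share as its weight, so both classes end up with the same total weight. The helper `inverse_frequency_weights` and the toy labels are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """For a binary target, weight each class by (1 - its share of the data)."""
    counts = Counter(labels)
    n = len(labels)
    return {cls: 1 - count / n for cls, count in counts.items()}


# Toy target with 18% positives, as reported for ProdTaken.
labels = [1] * 18 + [0] * 82
weights = inverse_frequency_weights(labels)

# Total weight carried by each class = per-sample weight * number of samples.
total_weight = {cls: weights[cls] * labels.count(cls) for cls in weights}

print({cls: round(w, 2) for cls, w in weights.items()})
print({cls: round(w, 2) for cls, w in total_weight.items()})
```

With equal total weight per class, a split that isolates buyers reduces impurity as much as one that isolates non-buyers, which pushes the trees toward higher recall.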

In [None]:
show_feature_imp(rf_tuned)

> **Observations:**
> * The model has given high importance to Passport, Age, Single and MonthlyIncome.
> * It has also given importance to DurationOfPitch, Married, NumberOfFollowups, Large Business and CityTier.
> * Very low importance to Self Enquiry or OwnCar features.

## Bagging Classifier with LogisticRegression

In [None]:
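#Fitting the model
from sklearn.linear_model import LogisticRegression
# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator`; use `estimator=` on newer versions
lg_bagging_classifier = BaggingClassifier(random_state=1,base_estimator=LogisticRegression(solver='liblinear', random_state=1))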
lg_bagging_classifier.fit(X_train,y_train)

#Calculating different metrics
lg_bagging_classifier_model_train_perf=model_performance_classification_sklearn(lg_bagging_classifier,X_train,y_train)
print("Training performance:\n", lg_bagging_classifier_model_train_perf)
lg_bagging_classifier_model_test_perf=model_performance_classification_sklearn(lg_bagging_classifier,X_test,y_test)
print("Testing performance:\n", lg_bagging_classifier_model_test_perf)

#Creating confusion matrix
confusion_matrix_sklearn(lg_bagging_classifier,X_test,y_test)

> **Observations:**
> * The model gives very low recall scores on both the training and test sets.
> * Although the test recall score is higher than the training score, the performance is not acceptable.
> * For this use case, the Bagging Classifier with LogisticRegression is not useful.

### Hyperparameter Tuning

In [None]:
# Choose the type of classifier. 
lg_bagging_estimator_tuned = BaggingClassifier(random_state=1, base_estimator=LogisticRegression(solver='liblinear', random_state=1))

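# Grid of parameters to choose from
# Floats are fractions of the data; an integer 1 would mean a single sample or feature, so use 1.0 for "all"
parameters = {'max_samples': [0.7,0.8,0.9,1.0], 
              'max_features': [0.7,0.8,0.9,1.0],
              'n_estimators' : [10,20,30,40,50],
             }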

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Run the grid search
grid_obj = GridSearchCV(lg_bagging_estimator_tuned, parameters, scoring=scorer, cv=5, n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
lg_bagging_estimator_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
lg_bagging_estimator_tuned.fit(X_train, y_train)

In [None]:
#Calculating different metrics
lg_bagging_estimator_tuned_model_train_perf=model_performance_classification_sklearn(lg_bagging_estimator_tuned,X_train,y_train)
print("Training performance:\n", lg_bagging_estimator_tuned_model_train_perf)
lg_bagging_estimator_tuned_model_test_perf=model_performance_classification_sklearn(lg_bagging_estimator_tuned,X_test,y_test)
print("Testing performance:\n", lg_bagging_estimator_tuned_model_test_perf)

#Creating confusion matrix
confusion_matrix_sklearn(lg_bagging_estimator_tuned,X_test,y_test)

> **Observations:**
> * After hyperparameter tuning, both training and test recall scores are 1.
> * A recall of 1 combined with very low Accuracy and Precision suggests that the model predicts almost every customer as a buyer, rather than learning a useful decision boundary.
> * Since the Accuracy and Precision scores are very low, this model is not useful.
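
A perfect recall paired with very low precision is the signature of a model that labels nearly everyone as a buyer. The sketch below illustrates this with a hypothetical "always buyer" predictor on toy labels with 18% positives; the numbers are illustrative and are not the notebook's actual scores.

```python
def recall_precision_accuracy(y_true, y_pred):
    """Compute recall, precision and accuracy for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / len(y_true)
    return recall, precision, accuracy


y_true = [1] * 18 + [0] * 82          # 18% buyers
always_buyer = [1] * len(y_true)      # predicts "buyer" for every customer

recall, precision, accuracy = recall_precision_accuracy(y_true, always_buyer)
print(recall, precision, accuracy)    # prints: 1.0 0.18 0.18
```

Such a model finds every buyer but offers no saving over contacting customers at random, which is why recall has to be read together with precision.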

## Bagging Classifier with DecisionTreeClassifier

In [None]:
#Fitting the model, Default base_estimator for BaggingClassifier is DecisionTreeClassifier
dt_bagging_classifier = BaggingClassifier(random_state=1)
dt_bagging_classifier.fit(X_train,y_train)

#Calculating different metrics
dt_bagging_classifier_model_train_perf=model_performance_classification_sklearn(dt_bagging_classifier,X_train,y_train)
print("Training performance:\n", dt_bagging_classifier_model_train_perf)
dt_bagging_classifier_model_test_perf=model_performance_classification_sklearn(dt_bagging_classifier,X_test,y_test)
print("Testing performance:\n", dt_bagging_classifier_model_test_perf)

#Creating confusion matrix
confusion_matrix_sklearn(dt_bagging_classifier,X_test,y_test)

> **Observations:**
> * Training and testing recall scores are 0.95 and 0.55 respectively.
> * The model is highly overfitted in terms of recall.
> * Accuracy and Precision, however, look good.

### Hyperparameter Tuning

In [None]:
# Choose the type of classifier. 
dt_bagging_estimator_tuned = BaggingClassifier(random_state=1)

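# Grid of parameters to choose from
# Floats are fractions of the data; an integer 1 would mean a single sample or feature, so use 1.0 for "all"
parameters = {'max_samples': [0.7,0.8,0.9,1.0], 
              'max_features': [0.7,0.8,0.9,1.0],
              'n_estimators' : [20,40,80,100],
             }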

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Run the grid search
grid_obj = GridSearchCV(dt_bagging_estimator_tuned, parameters, scoring=scorer, cv=5, n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
dt_bagging_estimator_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
dt_bagging_estimator_tuned.fit(X_train, y_train)

In [None]:
#Calculating different metrics
dt_bagging_estimator_tuned_model_train_perf=model_performance_classification_sklearn(dt_bagging_estimator_tuned,X_train,y_train)
print("Training performance:\n", dt_bagging_estimator_tuned_model_train_perf)
dt_bagging_estimator_tuned_model_test_perf=model_performance_classification_sklearn(dt_bagging_estimator_tuned,X_test,y_test)
print("Testing performance:\n", dt_bagging_estimator_tuned_model_test_perf)

#Creating confusion matrix
confusion_matrix_sklearn(dt_bagging_estimator_tuned,X_test,y_test)

> **Observations:**
> * After hyperparameter tuning, Training and testing recall scores are 1.0 and 0.60 respectively.
> * Tuning brings little improvement; the model is still highly overfitted in terms of recall.
> * Accuracy and Precision, however, have improved.

BaggingClassifier does not expose a `feature_importances_` attribute, so `show_feature_imp` cannot be used for these models.
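
One common workaround, sketched here under assumptions, is to average the importances of the individual trees after mapping each tree's importances back to the feature subset it was trained on. In scikit-learn these would come from the fitted ensemble's `estimators_` and `estimators_features_` attributes; the inputs below are hypothetical plain lists so that the logic can be followed without a fitted model.

```python
def average_importances(tree_importances, tree_features, n_features):
    """Average per-tree importances over the ensemble.

    tree_importances[k][j] is the importance tree k assigns to its j-th feature;
    tree_features[k][j] is that feature's column index in the full dataset.
    """
    totals = [0.0] * n_features
    for importances, features in zip(tree_importances, tree_features):
        for importance, feature_idx in zip(importances, features):
            totals[feature_idx] += importance
    n_trees = len(tree_importances)
    return [total / n_trees for total in totals]


# Hypothetical ensemble: 3 trees, 4 features, each tree trained on 3 of them.
tree_features = [[0, 1, 2], [1, 2, 3], [0, 2, 3]]
tree_importances = [[0.5, 0.3, 0.2], [0.6, 0.1, 0.3], [0.4, 0.4, 0.2]]

averaged = average_importances(tree_importances, tree_features, n_features=4)
print([round(value, 3) for value in averaged])
```

Because each tree's importances sum to 1, the averaged values also sum to 1 and can be plotted in the same way as the other models' importances.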

## AdaBoost Classifier

In [None]:
#Fitting the model
ab_classifier = AdaBoostClassifier(random_state=1)
ab_classifier.fit(X_train,y_train)

#Calculating different metrics
ab_classifier_model_train_perf=model_performance_classification_sklearn(ab_classifier,X_train,y_train)
print("Training performance:\n", ab_classifier_model_train_perf)
ab_classifier_model_test_perf=model_performance_classification_sklearn(ab_classifier,X_test,y_test)
print("Testing performance:\n", ab_classifier_model_test_perf)

#Creating confusion matrix
confusion_matrix_sklearn(ab_classifier,X_test,y_test)

> **Observations:**
> * Training and testing recall scores are 0.30 and 0.26 respectively.
> * The model is not overfitted, but the recall score is unacceptably low.
> * Accuracy and Precision look good on the training dataset but not on the test dataset.

### Hyperparameter Tuning

In [None]:
# Choose the type of classifier. 
abc_tuned = AdaBoostClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {
    # Let's try different max_depth values for the base estimator
    # (the `base_estimator` parameter is named `estimator` from scikit-learn 1.2 onwards)
    "base_estimator":[DecisionTreeClassifier(max_depth=3),DecisionTreeClassifier(max_depth=5),
                      DecisionTreeClassifier(max_depth=7)],
    "n_estimators": [20,40,80,100],
    "learning_rate":np.arange(0.1,2,0.1)
}

# Type of scoring used to compare parameter  combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Run the grid search
grid_obj = GridSearchCV(abc_tuned, parameters, scoring=scorer, cv=5, n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
abc_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
abc_tuned.fit(X_train, y_train)

In [None]:
#Calculating different metrics
abc_tuned_model_train_perf=model_performance_classification_sklearn(abc_tuned,X_train,y_train)
print("Training performance:\n", abc_tuned_model_train_perf)
abc_tuned_model_test_perf=model_performance_classification_sklearn(abc_tuned,X_test,y_test)
print("Testing performance:\n", abc_tuned_model_test_perf)

#Creating confusion matrix
confusion_matrix_sklearn(abc_tuned,X_test,y_test)

> **Observations:**
> * After hyperparameter tuning, Training and testing recall scores are 0.89 and 0.54 respectively.
> * Recall improves after tuning, but the model is now highly overfitted.
> * Accuracy and Precision have also improved.
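
Exhaustive grid search fits one model per parameter combination and per cross-validation fold, so the cost of a grid grows multiplicatively. The sketch below counts the candidates for a grid shaped like the AdaBoost one above; the key `max_depth` stands in for the three base estimators, and the learning rates are written out as a plain list instead of `np.arange`.

```python
from itertools import product

# Same shape as the AdaBoost grid: 3 base-estimator depths, 4 ensemble sizes, 19 learning rates.
param_grid = {
    "max_depth": [3, 5, 7],
    "n_estimators": [20, 40, 80, 100],
    "learning_rate": [round(0.1 * k, 1) for k in range(1, 20)],  # 0.1, 0.2, ..., 1.9
}
cv_folds = 5

candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)  # prints: 228 1140
```

With the default `refit=True`, GridSearchCV also refits the best candidate on the full training set once the search is done, so `best_estimator_` is already a fitted model.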

In [None]:
show_feature_imp(abc_tuned)

> **Observations:**
> * The model has given high importance to MonthlyIncome, Age and DurationOfPitch.
> * It has also given importance to PitchSatisfactionScore, NumberOfTrips, PreferredPropertyStar, Passport
> * Very low importance to OwnCar, Salaried and Married customers.

## Gradient Boosting Classifier

In [None]:
#Fitting the model
gb_classifier = GradientBoostingClassifier(random_state=1)
gb_classifier.fit(X_train,y_train)

#Calculating different metrics
gb_classifier_model_train_perf=model_performance_classification_sklearn(gb_classifier,X_train,y_train)
print("Training performance:\n",gb_classifier_model_train_perf)
gb_classifier_model_test_perf=model_performance_classification_sklearn(gb_classifier,X_test,y_test)
print("Testing performance:\n",gb_classifier_model_test_perf)

#Creating confusion matrix
confusion_matrix_sklearn(gb_classifier,X_test,y_test)

> **Observations:**
> * Training and testing recall scores are 0.43 and 0.34 respectively.
> * The model is not overfitted, but the recall score is unacceptably low.
> * Accuracy and Precision look good on the training dataset but not on the test dataset.

### Hyperparameter Tuning

In [None]:
# Choose the type of classifier. 
gbc_tuned = GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),random_state=1)

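# Grid of parameters to choose from
# Floats are fractions; an integer max_features of 1 would mean a single feature, so use 1.0 for "all"
parameters = {
    "n_estimators": [100,150,200,250],
    "subsample":[0.8,0.9,1.0],
    "max_features":[0.7,0.8,0.9,1.0]
}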

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Run the grid search
grid_obj = GridSearchCV(gbc_tuned, parameters, scoring=scorer, cv=5, n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
gbc_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
gbc_tuned.fit(X_train, y_train)

In [None]:
#Calculating different metrics
gbc_tuned_model_train_perf=model_performance_classification_sklearn(gbc_tuned,X_train,y_train)
print("Training performance:\n",gbc_tuned_model_train_perf)
gbc_tuned_model_test_perf=model_performance_classification_sklearn(gbc_tuned,X_test,y_test)
print("Testing performance:\n",gbc_tuned_model_test_perf)

#Creating confusion matrix
confusion_matrix_sklearn(gbc_tuned,X_test,y_test)

> **Observations:**
> * After hyperparameter tuning, Training and testing recall scores are 0.60 and 0.42 respectively.
> * The scores have improved slightly, but the recall score is still low.
> * The model is overfitted.

In [None]:
show_feature_imp(gbc_tuned)

> **Observations:**
> * The model has given high importance to Age, MonthlyIncome, Passport and DurationOfPitch.
> * It has also given importance to NumberOfFollowups, Single, NumberOfTrips, PreferredPropertyStar, CityTier
> * Very low importance to OwnCar, Salaried and Married customers.

## XGBoost Classifier

In [None]:
#Fitting the model
xgb_classifier = XGBClassifier(random_state=1, eval_metric='logloss')
xgb_classifier.fit(X_train,y_train)

#Calculating different metrics
xgb_classifier_model_train_perf=model_performance_classification_sklearn(xgb_classifier,X_train,y_train)
print("Training performance:\n",xgb_classifier_model_train_perf)
xgb_classifier_model_test_perf=model_performance_classification_sklearn(xgb_classifier,X_test,y_test)
print("Testing performance:\n",xgb_classifier_model_test_perf)

#Creating confusion matrix
confusion_matrix_sklearn(xgb_classifier,X_test,y_test)

> **Observations:**
> * Training and testing recall scores are 0.99 and 0.61 respectively.
> * Among the untuned models so far, only the decision tree (0.65) has a higher recall score on test data.
> * Accuracy and Precision look good on the training dataset.
> * The gap between training and testing recall shows that the model is overfitted.

### Hyperparameter Tuning

In [None]:
# Choose the type of classifier. 
xgb_tuned = XGBClassifier(random_state=1, eval_metric='logloss')

# Grid of parameters to choose from
parameters = {
    "n_estimators": [10,30,50],
    "scale_pos_weight":[1,2,5],
    "subsample":[0.7,0.9,1],
    "learning_rate":[0.05, 0.1,0.2],
    "colsample_bytree":[0.7,0.9,1],
    "colsample_bylevel":[0.5,0.7,1]
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Run the grid search
grid_obj = GridSearchCV(xgb_tuned, parameters,scoring=scorer, cv=5, n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
xgb_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
xgb_tuned.fit(X_train, y_train)

In [None]:
#Calculating different metrics
xgb_tuned_model_train_perf=model_performance_classification_sklearn(xgb_tuned,X_train,y_train)
print("Training performance:\n",xgb_tuned_model_train_perf)
xgb_tuned_model_test_perf=model_performance_classification_sklearn(xgb_tuned,X_test,y_test)
print("Testing performance:\n",xgb_tuned_model_test_perf)

#Creating confusion matrix
confusion_matrix_sklearn(xgb_tuned,X_test,y_test)

> **Observations:**
> * Training and testing recall scores are 0.87 and 0.71 respectively.
> * The tuned XGBoost model has given a better recall score than all previous models on test data.
> * Accuracy and Precision look good.
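
The grid above tries `scale_pos_weight` values of 1, 2 and 5. The XGBoost documentation suggests the ratio of negative to positive instances as a typical starting value; the sketch below computes that ratio for toy labels with 18% positives. The helper name `suggested_scale_pos_weight` is hypothetical.

```python
def suggested_scale_pos_weight(labels):
    """Ratio of negative to positive instances in a binary target."""
    positives = sum(1 for label in labels if label == 1)
    negatives = len(labels) - positives
    return negatives / positives


# Toy target with 18% positives, as reported for ProdTaken.
labels = [1] * 18 + [0] * 82
ratio = suggested_scale_pos_weight(labels)
print(round(ratio, 2))
```

The ratio comes out a little above 4.5, so the largest grid value of 5 sits close to the suggested starting point for this level of imbalance.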

In [None]:
show_feature_imp(xgb_tuned)

> **Observations:**
> * The model has given high importance to Passport and Single followed by occupation Large Business and PreferredPropertyStar.
> * It has also given importance to Unmarried, MonthlyIncome, Age, NumberOfFollowups, DurationOfPitch, CityTier
> * Interestingly, it has given low importance to Male, OwnCar, NumberOfPersonVisiting and NumberOfChildrenVisiting.

## Stacking Classifier

In [None]:
estimators = [('Ada Boost',AdaBoostClassifier(random_state=1)), ('Gradient Boost',GradientBoostingClassifier(random_state=1))]

final_estimator = XGBClassifier(random_state=1, eval_metric='logloss')

stacking_classifier= StackingClassifier(estimators=estimators,final_estimator=final_estimator)

stacking_classifier.fit(X_train,y_train)

In [None]:
#Calculating different metrics
stacking_classifier_model_train_perf=model_performance_classification_sklearn(stacking_classifier,X_train,y_train)
print("Training performance:\n",stacking_classifier_model_train_perf)
stacking_classifier_model_test_perf=model_performance_classification_sklearn(stacking_classifier,X_test,y_test)
print("Testing performance:\n",stacking_classifier_model_test_perf)

#Creating confusion matrix
confusion_matrix_sklearn(stacking_classifier,X_test,y_test)

> **Observations:**
> * Training and testing recall scores are 0.46 and 0.35 respectively.
> * The model is not overfitted, but the recall score is not very good.
> * Accuracy and Precision look good on the training dataset but not on the test dataset.
> * A more diverse set of base models needs to be tried to improve the recall score of the StackingClassifier.

## Comparing all models

### Training Performance Comparison

In [None]:
# training performance comparison
models_train_comp_df = pd.concat(
    [d_tree_model_train_perf.T,dtree_estimator_model_train_perf.T,rf_estimator_model_train_perf.T,rf_tuned_model_train_perf.T,
     dt_bagging_classifier_model_train_perf.T,dt_bagging_estimator_tuned_model_train_perf.T,ab_classifier_model_train_perf.T,
     abc_tuned_model_train_perf.T,gb_classifier_model_train_perf.T,gbc_tuned_model_train_perf.T,xgb_classifier_model_train_perf.T,
    xgb_tuned_model_train_perf.T,stacking_classifier_model_train_perf.T],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree",
    "Decision Tree Estimator",
    "Random Forest Estimator",
    "Random Forest Tuned",
    "Bagging Classifier (DT)",
    "Bagging Estimator Tuned (DT)",
    "Adaboost Classifier",
    "Adaboost Classifier Tuned",
    "Gradient Boost Classifier",
    "Gradient Boost Classifier Tuned",
    "XGBoost Classifier",
    "XGBoost Classifier Tuned",
    "Stacking Classifier"]
print("Training performance comparison:")
models_train_comp_df.T.sort_values(by='Recall', ascending=False)

### Testing Performance Comparison

In [None]:
# testing performance comparison
models_test_comp_df = pd.concat(
    [d_tree_model_test_perf.T,dtree_estimator_model_test_perf.T,rf_estimator_model_test_perf.T,rf_tuned_model_test_perf.T,
     dt_bagging_classifier_model_test_perf.T,dt_bagging_estimator_tuned_model_test_perf.T,ab_classifier_model_test_perf.T,
     abc_tuned_model_test_perf.T,gb_classifier_model_test_perf.T,gbc_tuned_model_test_perf.T,xgb_classifier_model_test_perf.T,
    xgb_tuned_model_test_perf.T,stacking_classifier_model_test_perf.T],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree",
    "Decision Tree Estimator",
    "Random Forest Estimator",
    "Random Forest Tuned",
    "Bagging Classifier (DT)",
    "Bagging Estimator Tuned (DT)",
    "Adaboost Classifier",
    "Adaboost Classifier Tuned",
    "Gradient Boost Classifier",
    "Gradient Boost Classifier Tuned",
    "XGBoost Classifier",
    "XGBoost Classifier Tuned",
    "Stacking Classifier"]
print("Testing performance comparison:")
models_test_comp_df.T.sort_values(by='Recall', ascending=False)
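
Besides sorting by test recall, it helps to look at the gap between training and testing recall as a rough overfitting signal. The sketch below does this in plain Python with the recall scores quoted in the observations above (the tuned decision tree and the logistic regression bagging models are left out); the dictionary layout is only an illustration, not the notebook's comparison dataframe.

```python
# (training recall, testing recall) as quoted in the observations above.
recall_scores = {
    "Decision Tree": (1.00, 0.65),
    "Random Forest": (1.00, 0.50),
    "Random Forest Tuned": (0.69, 0.67),
    "Bagging (DT)": (0.95, 0.55),
    "Bagging Tuned (DT)": (1.00, 0.60),
    "AdaBoost": (0.30, 0.26),
    "AdaBoost Tuned": (0.89, 0.54),
    "Gradient Boost": (0.43, 0.34),
    "Gradient Boost Tuned": (0.60, 0.42),
    "XGBoost": (0.99, 0.61),
    "XGBoost Tuned": (0.87, 0.71),
    "Stacking": (0.46, 0.35),
}

# Rank by test recall, highest first, and report the train-test gap alongside.
ranked = sorted(recall_scores.items(), key=lambda item: item[1][1], reverse=True)
for name, (train, test) in ranked:
    print(f"{name:<22} test={test:.2f}  gap={train - test:.2f}")

# Model with the smallest gap between training and testing recall.
smallest_gap = min(recall_scores, key=lambda n: recall_scores[n][0] - recall_scores[n][1])
```

On these numbers the tuned XGBoost model has the highest test recall, while the tuned random forest has the smallest gap between training and testing recall.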

## Conclusion

We analyzed the historical data of the "Visit With Us" company's previous travel package campaigns and their conversion rates, including customer interaction parameters, using different EDA techniques. We also explored different ensemble techniques, such as Bagging, Boosting and Stacking, to build various predictive models and to better understand the features that may have influenced the campaign results.

Based on the results of the EDA and the model predictions, the following conclusions can be drawn:

* Mostly younger customers, between 25 and 40 years of age, have taken one of the previous travel packages. Many models, such as Random Forest, AdaBoost and GradientBoost, have given high importance to Age.
* Customers who have passports are more likely to have taken travel packages, and the models have also given high importance to this feature. XGBoost in particular has given the highest importance to Passport.
* Customers who are single or unmarried have taken previous travel packages more often than customers with other marital statuses, and the models agree.
* Out of the five designations (Executive, Manager, Sr. Manager, AVP, VP), customers in the relatively lower-income designations (Executive, Manager and Sr. Manager) have preferred to buy a travel package from the company. Since monthly income and designation are correlated, this is consistent with many models giving high importance to the monthly income feature.
* Customer location also seems to influence the conversion rate. Customers from Tier 3 cities have taken travel packages more often than others.
* The Large Business and Salaried occupations seem to influence whether a customer buys a travel package.
* Needless to say, customers who enquired themselves have a higher chance of taking a travel package.
* Many models also suggest that customers who preferred hotels or resorts with above-average property ratings are more likely to buy travel packages.
* Other customer features, such as owning a car, the number of people or children planning to visit, gender and number of trips, have less influence on a customer's decision to buy a travel package.

From the customer interaction analysis, we draw the following conclusions:

* Duration of pitch plays a big role in converting a potential customer into a buyer. Most of the models have given high importance to this feature.
* PitchSatisfactionScore is also an important factor in the decision of many customers.
* The number of follow-ups has in many cases helped a customer decide to buy a previous travel package.

## Recommendations

The company "Visit With Us" is now planning to launch a new product, the Wellness Tourism Package. Based on the data from previous travel package campaigns, the analysis and the model predictions, the following recommendations can be made to the policy makers and the marketing team to optimize the cost of the campaign and expand the customer base.

Customers can be profiled into the groups below, in order of priority as campaign targets:

* Younger customers who are single or unmarried and have passports: this group should be the go-to target for the new travel package campaign.
* Customers working as Executives, Managers and Sr. Managers, in Large Business and Salaried occupations, are the next targets.
* Customers who reside in Tier 3 and Tier 2 cities, previously preferred well-rated properties and have taken few trips come next.

When a customer enquires on their own, the sales representative handling the case should take the enquiry seriously, with a proper product pitch and follow-ups, as these customers have a high probability of buying the package.

It is recommended that sales representatives are properly trained to give a detailed product pitch on the Wellness Tourism Package, to increase the customers' pitch satisfaction score and to follow up multiple times, as these customer interaction behaviors are highly important for expanding the customer base.

To expand the customer base further in the future, the company needs to investigate why earlier travel packages were not popular with older and wealthier customers with higher designations such as AVP and VP. It should also investigate how to attract more customers from Tier 1 and Tier 2 cities to expand the customer base in those cities.
