# Team Travel Insurance

---

Context<br />
A tour & travels company is offering a travel insurance package to its customers.
The new insurance package also includes Covid coverage.
The company wants to know which customers would be interested in buying it, based on its database history.
The insurance was offered to some of the customers in 2019, and the given data has been extracted from the performance/sales of the package during that period.
The data is provided for almost 2000 of its previous customers.

Focus on:
- Isolating and analyzing the target variable
- Cleaning data
    - Assessing valid types
    - Converting corrupted values
    - Removing invalid data
- Identifying features that have relationships to your target variable and plotting the relationship
- Examining potential outliers and documenting limitations of the dataset
- Deriving information that might predict your target variable
- Articulating the potential value of your findings to a business, company, government, or other organization
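
The cleaning steps in the list above (assessing types, converting corrupted values, removing invalid data) might look roughly like the sketch below. The rows are made-up toy values, not records from the dataset, and the `clean` helper and its validity rules are assumptions for illustration only.

```python
import pandas as pd

# Toy rows with made-up values; column names follow the dataset description.
raw = pd.DataFrame({
    "Age": [25, 31, "thirty", 34],
    "GraduateOrNot": ["Yes", "No", "yes ", "Yes"],
    "AnnualIncome": [400000, 1250000, 500000, -1],
})

def clean(df):
    """Hypothetical cleaning helper: coerce types, normalise labels, drop bad rows."""
    out = df.copy()
    # Convert corrupted values: non-numeric ages become NaN instead of raising.
    out["Age"] = pd.to_numeric(out["Age"], errors="coerce")
    # Normalise Yes/No text so stray whitespace or casing does not create new labels.
    out["GraduateOrNot"] = out["GraduateOrNot"].str.strip().str.title()
    # Remove invalid data: rows with a missing age or a non-positive income.
    valid = out["Age"].notna() & (out["AnnualIncome"] > 0)
    return out[valid].reset_index(drop=True)

cleaned = clean(raw)
print(cleaned)
```

Which rows count as invalid is a judgment call that should be documented alongside the findings.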


**Also, we would like to know:**
 - Is this a good source of data?
 - Why / why not?


One of the needs for data science in organizations is to bring measure to vague problems. What can be measured in this dataset with certainty? Drive your presentation from what can be measured and reported.
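
One possible way to attach a measure of certainty to a reported rate is a normal-approximation confidence interval for a proportion. The sketch below is an illustration under that assumption; the counts are made up and the `rate_with_interval` helper is hypothetical.

```python
import math

def rate_with_interval(bought, total, z=1.96):
    """Return the purchase rate and an approximate 95% interval (normal approximation)."""
    rate = bought / total
    # Standard error of a proportion, scaled by the z value for roughly 95% coverage.
    half_width = z * math.sqrt(rate * (1 - rate) / total)
    return rate, max(0.0, rate - half_width), min(1.0, rate + half_width)

# Made-up counts: 36 buyers out of 100 customers.
rate, low, high = rate_with_interval(bought=36, total=100)
print(f"{rate:.2f} (roughly {low:.2f} to {high:.2f})")
```

The approximation is less reliable for small groups or rates close to 0 or 1, which is worth keeping in mind when reporting on small cohorts.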

Also, if possible, suggest what can be done with this data in terms of actionable outcomes and to what extent.
     

Content
* Age - age of the customer
* Employment Type - the sector in which the customer is employed
* GraduateOrNot - whether the customer is a college graduate
* AnnualIncome - the yearly income of the customer in Indian rupees (rounded to the nearest 50 thousand rupees)
* FamilyMembers - number of members in the customer's family
* ChronicDiseases - whether the customer suffers from any major disease or condition such as diabetes, high blood pressure, or asthma
* FrequentFlyer - derived from the customer's history of booking air tickets on at least 4 different instances in the last 2 years (2017-2019)
* EverTravelledAbroad - whether the customer has ever travelled to a foreign country (not necessarily using the company's services)
* TravelInsurance - whether the customer bought the travel insurance package during the introductory offering held in 2019
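
Some of the claims in the column descriptions above can be checked directly against the data, which also speaks to whether this is a good source of data. The sketch below uses made-up toy rows and is only an illustration of the idea.

```python
import pandas as pd

# Toy rows with made-up values; the last row deliberately breaks both rules.
toy = pd.DataFrame({
    "AnnualIncome": [400000, 1250000, 525000],
    "FrequentFlyer": ["No", "Yes", "Maybe"],
})

# Incomes are described as rounded to the nearest 50 thousand rupees.
off_grid = toy[toy["AnnualIncome"] % 50000 != 0]

# Yes/No columns should contain only those two labels.
bad_labels = toy[~toy["FrequentFlyer"].isin(["Yes", "No"])]

print(len(off_grid), "income values off the 50,000 grid")
print(len(bad_labels), "unexpected FrequentFlyer labels")
```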


Travel Insurance Prediction Data. Retrieved 10.3.21 from https://www.kaggle.com/tejashvi14/travel-insurance-prediction-data.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')
%matplotlib inline



filepath = './data/TravelInsurancePrediction.csv'


In [None]:
ti = pd.read_csv(filepath)
ti.head()

In [None]:
# split into train and test
train = ti.sample(frac=.9)
mask = ~ti.index.isin(train.index)
test = ti[mask].copy()
print(train.shape[0],test.shape[0])
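
The split above draws a new random sample on every run, so the rates quoted in later comments can shift slightly between runs. A minimal sketch of a seeded split that also keeps the buyer/non-buyer mix similar in both parts is shown below; the toy labels are made up and the `stratified_split` helper is hypothetical.

```python
import pandas as pd

# Toy labels with made-up values: 26 non-buyers and 14 buyers.
toy = pd.DataFrame({"TravelInsurance": [0] * 26 + [1] * 14})

def stratified_split(df, target, frac=0.9, seed=0):
    """Sample the same fraction from each class; the rest becomes the test set."""
    # A fixed random_state makes the split repeatable across runs.
    train_part = df.groupby(target).sample(frac=frac, random_state=seed)
    test_part = df.drop(train_part.index)
    return train_part, test_part

train_toy, test_toy = stratified_split(toy, "TravelInsurance")
print(train_toy.shape[0], test_toy.shape[0])
```

With a test set this small, any score computed on it is noisy, so a seeded split mainly helps make the notebook's comments match its output.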

In [None]:
# start with the insight that no government sector employees without college degrees purchased insurance
# it also seems like non-governmental employees buy more insurance...
train.groupby(['Employment Type','GraduateOrNot'])['TravelInsurance'].agg(['mean','count'])

#ti[mask]['TravelInsurance'].sum()

In [None]:
# so I feel like this cohort is resolved and I'm going to look at the remaining people with them split off
mask = (train['Employment Type']=='Government Sector')&(train['GraduateOrNot']=='No')
split_train = train.drop(train[mask].index).copy()

In [None]:
split_train.head(1)

In [None]:
# larger families seem to buy more insurance
split_train.groupby(['FamilyMembers'])['TravelInsurance'].agg(['mean','count'])

In [None]:
# chronic diseases don't seem to factor too much...
split_train.groupby(['ChronicDiseases'])['TravelInsurance'].agg(['mean','count'])

In [None]:
# age is pretty noisy...
split_train.groupby(['Age'])['TravelInsurance'].agg(['mean','count'])

In [None]:
# now we're getting somewhere, Frequent Flyers buy at a rate 2x higher
split_train.groupby(['FrequentFlyer'])['TravelInsurance'].agg(['mean','count'])

In [None]:
# people who have traveled abroad buy at a rate 3x...
split_train.groupby(['EverTravelledAbroad'])['TravelInsurance'].agg(['mean','count'])


In [None]:
# if we combine them, we see a further splitting of customers... 
split_train.groupby(['EverTravelledAbroad','FrequentFlyer'])['TravelInsurance'].agg(['mean','count'])

In [None]:
# affluent people definitely buy more insurance, but how should we read this?
sns.boxplot(y=split_train['AnnualIncome'],x=split_train['TravelInsurance']);
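
One way to read the boxplot is to turn income into bands and compare purchase rates per band. The sketch below uses made-up toy rows, and the band edges are illustrative assumptions rather than values taken from this analysis.

```python
import pandas as pd

# Toy rows with made-up values.
toy = pd.DataFrame({
    "AnnualIncome": [300000, 500000, 800000, 1200000, 1400000, 1500000],
    "TravelInsurance": [0, 0, 1, 0, 1, 1],
})

# Band edges are illustrative; each band includes its upper edge.
bands = pd.cut(toy["AnnualIncome"], bins=[0, 500000, 1000000, 2000000])

# Purchase rate and group size per income band.
rate_by_band = toy.groupby(bands, observed=True)["TravelInsurance"].agg(["mean", "count"])
print(rate_by_band)
```

Reporting the count next to each rate shows how much data supports each band.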

In [None]:
def compute_purchase_rate_at_income_thresholds(df, min_pct=80, max_pct=96):
    """Print the purchase rate among customers at or above each income percentile."""
    seen = set()
    # step through income percentiles in steps of 3 (80th to 95th by default)
    for pct in range(min_pct, max_pct, 3):
        income = df['AnnualIncome'].quantile(pct * .01)
        # several percentiles can share one income value, so report each value once
        if income in seen:
            continue
        seen.add(income)
        subset = df[df['AnnualIncome'] >= income]
        # the mean of the 0/1 target is a fraction, so scale it to a percentage
        rate = subset['TravelInsurance'].mean() * 100
        print(f'income at or above {int(income)}: {rate:.0f}% of people bought insurance, '
              f'{subset.shape[0]} in total')

In [None]:
compute_purchase_rate_at_income_thresholds(split_train)

In [None]:
#<37% of people bought insurance, but 91% of people with income above 1.35M
# so we create a boolean value to mark income >= 1.35M
split_train['income_thresh'] = split_train['AnnualIncome'].map(lambda x: 1 if x>=1350000 else 0)
split_train.sample(3)

In [None]:
# so we combine our three strongest features and we see, that affluent people almost always buy insurance
# except if they have never traveled abroad and are not a frequent flyer
split_train.groupby(['EverTravelledAbroad','FrequentFlyer','income_thresh'])['TravelInsurance'].agg(['mean','count'])

In [None]:
# so we pull these people out and continue our analysis of the remaining people
mask = (split_train['AnnualIncome']>=1350000)&(
    (split_train['EverTravelledAbroad']=='Yes')|(split_train['FrequentFlyer']=='Yes'))
split_train.drop(split_train[mask].index,inplace=True)

In [None]:
# just as high income was different, really low income is different
mask = split_train['AnnualIncome']<=split_train['AnnualIncome'].quantile(.07)
print(split_train['AnnualIncome'].quantile(.07))
print(split_train[mask].shape[0])
split_train[mask]['TravelInsurance'].mean()

In [None]:
# the next tranche of income buys at average rates
mask = (split_train['AnnualIncome']>=split_train['AnnualIncome'].quantile(.08))&(
    split_train['AnnualIncome']<=split_train['AnnualIncome'].quantile(.12))
print(split_train[mask].shape[0])
split_train[mask]['TravelInsurance'].mean()

In [None]:
# so let's pull out the least affluent people
mask = split_train['AnnualIncome']<=350000
split_train.drop(split_train[mask].index,inplace=True)

In [None]:
# this is no longer as meaningful after you account for high income
split_train.groupby(['EverTravelledAbroad','FrequentFlyer'])['TravelInsurance'].agg(['mean','count'])

In [None]:
# but we can see how other features still split the data
split_train.groupby(['GraduateOrNot'])['TravelInsurance'].agg(['mean','count'])

In [None]:
# affluence still matters
print(split_train['TravelInsurance'].mean())
compute_purchase_rate_at_income_thresholds(split_train)

In [None]:
# now that we've accounted for wealthy people, age seems to matter quite a bit, 
# with older people more likely to buy insurance
split_train.groupby(['Age'])['TravelInsurance'].agg(['mean','count'])

In [None]:
# but we can try combining age with other variables to look for interactions, and look what we find... 
# not merely older people, but specifically older people with large families buy insurance at high rates
split_train['age_thresh'] = split_train['Age'].map(lambda x: 1 if x>=33 else 0)
split_train.groupby(['age_thresh','FamilyMembers'])['TravelInsurance'].agg(['mean','count'])

In [None]:
# so we pull these people out
mask = (split_train['Age']>=33)&(split_train['FamilyMembers']>=6)
split_train.drop(split_train[mask].index,inplace=True)

In [None]:
# new baseline is 20%
split_train['TravelInsurance'].mean()

In [None]:
# there is some more information here that we could mine, and it would help, but it seems marginal

# income seems to matter a bit
split_train['income_thresh'] = split_train['AnnualIncome'].map(lambda x: 1 if x>=1300000 else 0)
split_train.groupby(['income_thresh'])['TravelInsurance'].agg(['mean','count'])

In [None]:
# non graduates seem to buy at a bit higher rate
split_train.groupby(['GraduateOrNot'])['TravelInsurance'].agg(['mean','count'])

In [None]:
# chronic diseases matter a bit...
split_train.groupby(['ChronicDiseases'])['TravelInsurance'].agg(['mean','count'])

In [None]:
# but let's just use the features we derived

def add_feature_column(df, col, mask):
    # flag rows matching the mask with 1 and every other row with 0
    df[col] = mask.astype(int)
    return df
    

def add_features_for_training(df):

    # affluence + experience traveling
    mask = (df['AnnualIncome']>=1350000)&(
        (df['EverTravelledAbroad']=='Yes')|(df['FrequentFlyer']=='Yes'))
    df = add_feature_column(df, col='high_inc_travel_exp',mask=mask)
   

    # lowest incomes: at or below 350,000 rupees (the bottom few percent of customers)
    mask = df['AnnualIncome']<=350000
    df = add_feature_column(df, col='low_inc',mask=mask)

    # government workers without a degree
    mask = (df['Employment Type']=='Government Sector')&(df['GraduateOrNot']=='No')
    df = add_feature_column(df, col='gov_no_deg',mask=mask)

    # older people with large families
    mask = (df['Age']>=33)&(df['FamilyMembers']>=6)
    df = add_feature_column(df, col='lg_fam_older',mask=mask)
    
    return df


In [None]:
train = add_features_for_training(train)
test = add_features_for_training(test)

In [None]:
# pull out our features and target from the training data
train_features = ['high_inc_travel_exp','low_inc','gov_no_deg','lg_fam_older']
X_train = train[train_features].copy()
y_train = train['TravelInsurance']
# identical features in test data
X_test = test[train_features].copy()
y_test = test['TravelInsurance']

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
# instantiate
logr = LogisticRegression()
# fit on training data, happens inplace
logr.fit(X_train,y_train)

In [None]:
# score on test data
logr.score(X_test,y_test)

In [None]:
# compare to baseline - we just predict the modal value - you'd be right ~64% of the time if you just predicted
# no one buys travel insurance
1-y_train.mean()

In [None]:
# so because of the features we derived via our EDA, our predictions are 20% better

In [None]:
# we can pull out our preds and look at them
# what did the model seem to do that we couldn't have accomplished just using some averages instead?
preds = X_test.copy()
preds['preds'] = logr.predict_proba(X_test)[:,1]
preds.drop_duplicates(subset=train_features)
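
To explore the question about averages, one option is to predict directly from the training purchase rate of each flag combination and compare that accuracy with the model's score. The sketch below uses made-up toy rows and only two of the derived flags; it is an illustration of the idea, not a result from this dataset.

```python
import pandas as pd

flags = ["high_inc_travel_exp", "low_inc"]

# Toy training rows with made-up values; flag names mirror the derived features.
train_toy = pd.DataFrame({
    "high_inc_travel_exp": [1, 1, 1, 0, 0, 0, 0, 0],
    "low_inc":             [0, 0, 0, 1, 1, 0, 0, 0],
    "TravelInsurance":     [1, 1, 0, 0, 0, 0, 1, 0],
})
test_toy = pd.DataFrame({
    "high_inc_travel_exp": [1, 0, 0],
    "low_inc":             [0, 1, 0],
    "TravelInsurance":     [1, 0, 1],
})

# Average purchase rate for every flag combination seen in training.
group_rate = (train_toy.groupby(flags)["TravelInsurance"]
              .mean().rename("group_rate").reset_index())

# Look up each test row's group rate and predict a purchase when it exceeds one half.
scored = test_toy.merge(group_rate, on=flags, how="left")
scored["pred"] = (scored["group_rate"] > 0.5).astype(int)
accuracy = (scored["pred"] == scored["TravelInsurance"]).mean()
print(round(accuracy, 2))
```

Flag combinations that never appear in training get a missing rate and default to a prediction of 0 here. If this lookup scores about as well as the logistic regression, the value came mostly from the derived features rather than from the model.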