# <div style="color:white;display:fill;border-radius:15px;background-color:#032137;letter-spacing:0.5px;overflow:hidden"><p style="padding:15px;color:white;overflow:hidden;text-align: center;margin:0;font-size:180%">Template DS</p></div>

### **The Business Problem**

The goal of this project is to rank the customers of the health insurance product by their probability of also buying car insurance. The ranking gives the sales team a list of high-probability buyers for the company's new product.

**The main goal:** provide a list of customers ordered by buying probability, highest first.

**The finished product:** an API that feeds a Telegram chat bot or a Google Sheets file with the names of the customers to be contacted.

### **The Dataset**
The dataset has 381,109 rows and 12 columns. Each column describes one feature of the company's customers, and each row represents a different customer.
- **id**: a customer's identification number
- **gender**: a customer's gender
- **age**: a customer's age
- **driving_license**: whether a customer has a driving license or not
- **region_code**: the code of the region where a customer lives
- **previously_insured**: whether a customer was previously insured or not
- **vehicle_age**: a customer's vehicle age
- **vehicle_damage**: whether a vehicle has any kind of damage or not
- **annual_premium**: the amount a customer pays as premium in a year
- **policy_sales_channel**: an anonymized code for the channel used to reach the customer
- **vintage**: the number of days a customer has been associated with the company
- **response**: whether a customer is interested in vehicle insurance or not

## <div style="color:white;display:fill;border-radius:15px;background-color:#123752;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:110%">1. Imports</p></div>

# <div style="color:white;display:fill;border-radius:15px;background-color:#486D88;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:80%">1.1 Libraries</p></div>

In [1]:
import pandas as pd
import numpy as np

# <div style="color:white;display:fill;border-radius:15px;background-color:#486D88;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:80%">1.2. Helper Functions</p></div>

# <div style="color:white;display:fill;border-radius:15px;background-color:#486D88;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:80%">1.3 Data Loading</p></div>

In [17]:
data_raw = pd.read_csv('/home/bruno/repos/PA004_HealthInsuranceCrossSell/data/train.csv')

# <div style="color:white;display:fill;border-radius:15px;background-color:#123752;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:110%">2. Data exploration and problem comprehension</p></div>


- Main goal/problem
- Sub-goals
- What will the finished product be?

# <div style="color:white;display:fill;border-radius:15px;background-color:#486D88;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:80%">2.1 First Data Exploration and Manipulation</p></div>

- Renaming columns
- Inspecting the dataset
- Filling in NAs

### Rename features to lower case

In [30]:
df2 = data_raw.copy()

In [31]:
# Lower-case every column name (the original line referenced an undefined `df1`)
df2.columns = [col.lower() for col in df2.columns]

### Check NAs

In [32]:
df2.isna().sum()

id                      0
gender                  0
age                     0
driving_license         0
region_code             0
previously_insured      0
vehicle_age             0
vehicle_damage          0
annual_premium          0
policy_sales_channel    0
vintage                 0
response                0
dtype: int64

### Data description

In [33]:
print(f'Number of rows: {df2.shape[0]}')
print(f'Number of columns: {df2.shape[1]}')

Number of rows: 381109
Number of columns: 12


In [34]:
df2.head()

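| | id | gender | age | driving_license | region_code | previously_insured | vehicle_age | vehicle_damage | annual_premium | policy_sales_channel | vintage | response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | 44 | 1 | 28.0 | 0 | > 2 Years | Yes | 40454.0 | 26.0 | 217 | 1 |
| 1 | 2 | Male | 76 | 1 | 3.0 | 0 | 1-2 Year | No | 33536.0 | 26.0 | 183 | 0 |
| 2 | 3 | Male | 47 | 1 | 28.0 | 0 | > 2 Years | Yes | 38294.0 | 26.0 | 27 | 1 |
| 3 | 4 | Male | 21 | 1 | 11.0 | 1 | < 1 Year | No | 28619.0 | 152.0 | 203 | 0 |
| 4 | 5 | Female | 29 | 1 | 41.0 | 1 | < 1 Year | No | 27496.0 | 152.0 | 39 | 0 |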


In [35]:
df2.sample(5)

| | id | gender | age | driving_license | region_code | previously_insured | vehicle_age | vehicle_damage | annual_premium | policy_sales_channel | vintage | response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 164205 | 164206 | Female | 29 | 1 | 30.0 | 1 | < 1 Year | No | 28897.0 | 152.0 | 194 | 0 |
| 342921 | 342922 | Male | 44 | 1 | 28.0 | 1 | 1-2 Year | Yes | 31089.0 | 26.0 | 114 | 0 |
| 116340 | 116341 | Female | 28 | 1 | 8.0 | 1 | < 1 Year | No | 33443.0 | 152.0 | 54 | 0 |
| 2663 | 2664 | Male | 64 | 1 | 28.0 | 0 | 1-2 Year | Yes | 27309.0 | 13.0 | 265 | 0 |
| 60833 | 60834 | Male | 24 | 1 | 28.0 | 0 | 1-2 Year | Yes | 2630.0 | 156.0 | 37 | 1 |


In [36]:
df2.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 381109.0 | 190555.0 | 110016.836208 | 1.0 | 95278.0 | 190555.0 | 285832.0 | 381109.0 |
| age | 381109.0 | 38.822584 | 15.511611 | 20.0 | 25.0 | 36.0 | 49.0 | 85.0 |
| driving_license | 381109.0 | 0.997869 | 0.04611 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| region_code | 381109.0 | 26.388807 | 13.229888 | 0.0 | 15.0 | 28.0 | 35.0 | 52.0 |
| previously_insured | 381109.0 | 0.45821 | 0.498251 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| annual_premium | 381109.0 | 30564.389581 | 17213.155057 | 2630.0 | 24405.0 | 31669.0 | 39400.0 | 540165.0 |
| policy_sales_channel | 381109.0 | 112.034295 | 54.203995 | 1.0 | 29.0 | 133.0 | 152.0 | 163.0 |
| vintage | 381109.0 | 154.347397 | 83.671304 | 10.0 | 82.0 | 154.0 | 227.0 | 299.0 |
| response | 381109.0 | 0.122563 | 0.327936 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


In [37]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 381109 entries, 0 to 381108
Data columns (total 12 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   id                    381109 non-null  int64  
 1   gender                381109 non-null  object 
 2   age                   381109 non-null  int64  
 3   driving_license       381109 non-null  int64  
 4   region_code           381109 non-null  float64
 5   previously_insured    381109 non-null  int64  
 6   vehicle_age           381109 non-null  object 
 7   vehicle_damage        381109 non-null  object 
 8   annual_premium        381109 non-null  float64
 9   policy_sales_channel  381109 non-null  float64
 10  vintage               381109 non-null  int64  
 11  response              381109 non-null  int64  
dtypes: float64(3), int64(6), object(3)
memory usage: 34.9+ MB


In [38]:
# Getting categorical and numerical variables
num_features = df2.select_dtypes(include=['int64', 'float64'])
cat_features = df2.select_dtypes(exclude=['int64', 'float64', 'datetime64[ns]'])

# <div style="color:white;display:fill;border-radius:15px;background-color:#123752;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:110%">3. Hypothesis Mental Map Creation</p></div>


- Mental map for hypotheses and questions
- List of hypotheses and questions

# <div style="color:white;display:fill;border-radius:15px;background-color:#486D88;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:80%">3.1 Hypothesis</p></div>

# <div style="color:white;display:fill;border-radius:15px;background-color:#123752;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:110%">4. Feature Engineering</p></div>


- Fill in remaining NAs
- Derive new variables as needed

# <div style="color:white;display:fill;border-radius:15px;background-color:#486D88;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:80%">4.1 Changing Features</p></div>

In [48]:
df4 = df2.copy()

In [46]:
# Relabel vehicle_age with identifier-friendly category names
df4['vehicle_age'] = df4['vehicle_age'].apply(lambda x: 'over_2_years' if x == '> 2 Years' else 'between_1_2_years' if x == '1-2 Year' else 'below_1_year')
# Encode vehicle_damage as 1 (Yes) or 0 (No)
df4['vehicle_damage'] = df4['vehicle_damage'].apply(lambda x: 1 if x == 'Yes' else 0)

# <div style="color:white;display:fill;border-radius:15px;background-color:#486D88;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:80%">4.2 New Features</p></div>

# <div style="color:white;display:fill;border-radius:15px;background-color:#123752;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:110%">5. Data selection and filtering</p></div>



- Filter data rows
- Filter data columns
- Based on the questions and hypothesis, select columns
- Create a new filtered dataframe
- Create the widgets to filter the data

# <div style="color:white;display:fill;border-radius:15px;background-color:#486D88;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:80%">5.1 Selecting variables</p></div>

In [50]:
df5 = df4.copy()

# <div style="color:white;display:fill;border-radius:15px;background-color:#123752;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:110%">6. Exploratory Data Analysis (EDA)</p></div>


- Test each hypothesis on the list
- Build data visualization solutions and plots

# <div style="color:white;display:fill;border-radius:15px;background-color:#486D88;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:80%">6.1 Univariate Analysis</p></div>

# <div style="color:white;display:fill;border-radius:15px;background-color:#486D88;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:80%">6.2 Bivariate Analysis</p></div>

# <div style="color:white;display:fill;border-radius:15px;background-color:#486D88;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:80%">6.3 Multivariate Analysis</p></div>

# <div style="color:white;display:fill;border-radius:15px;background-color:#123752;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:110%">7. Data Preparation</p></div>


- Normalize, re-scale and transform (encode) variables to suit model requirements
- It may be a good idea to normalize all of the features so they are comparable in magnitude
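
The preparation steps listed above can be sketched with pandas alone. This is a minimal illustration, assuming a small toy frame that reuses a few of the dataset's column names; which scaler or encoder suits which column is an assumption here, not a result from this notebook.

```python
import pandas as pd

# Toy frame mirroring a few dataset columns (values are illustrative only).
toy = pd.DataFrame({
    'age': [44, 76, 47, 21, 29],
    'annual_premium': [40454.0, 33536.0, 38294.0, 28619.0, 27496.0],
    'gender': ['Male', 'Male', 'Male', 'Male', 'Female'],
    'vehicle_age': ['> 2 Years', '1-2 Year', '> 2 Years', '< 1 Year', '< 1 Year'],
})

prepared = toy.copy()

# Standardization (z-score): zero mean and unit variance.
premium = prepared['annual_premium']
prepared['annual_premium'] = (premium - premium.mean()) / premium.std(ddof=0)

# Min-max re-scaling to the [0, 1] interval.
age = prepared['age']
prepared['age'] = (age - age.min()) / (age.max() - age.min())

# Binary encoding for a two-level category.
prepared['gender'] = (prepared['gender'] == 'Male').astype(int)

# One-hot encoding for a multi-level category.
prepared = pd.get_dummies(prepared, columns=['vehicle_age'], dtype=int)

print(prepared.round(3))
```

In a real pipeline the scaling parameters (mean, standard deviation, minimum, maximum) would be computed on the training split only and then reused on validation and production data, so that no information leaks from unseen rows.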

# <div style="color:white;display:fill;border-radius:15px;background-color:#486D88;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:80%">7.1 Normalization</p></div>

# <div style="color:white;display:fill;border-radius:15px;background-color:#486D88;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:80%">7.2 Transformations</p></div>

# <div style="color:white;display:fill;border-radius:15px;background-color:#486D88;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:80%">7.3 Encodings</p></div>

# <div style="color:white;display:fill;border-radius:15px;background-color:#123752;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:110%">8. Feature Selection through Boruta algorithm</p></div>


- Use the Boruta algorithm to select the best features for the machine learning models
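
Boruta judges each real feature against shuffled "shadow" copies of the features: a feature is only worth keeping if it is more important than the best shadow. The sketch below shows a single round of that idea on synthetic data. It is a simplification, since the full algorithm repeats the comparison over many iterations and applies a statistical test. It also assumes scikit-learn is installed, which this notebook does not import; the column names `signal` and `noise` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Synthetic data: 'signal' drives the target, 'noise' is unrelated to it.
n = 500
X = pd.DataFrame({
    'signal': rng.normal(size=n),
    'noise': rng.normal(size=n),
})
y = (X['signal'] + 0.1 * rng.normal(size=n) > 0).astype(int)

# Shadow features: each real column shuffled independently,
# which destroys any relationship with the target.
shadows = pd.DataFrame({
    f'shadow_{col}': rng.permutation(X[col].to_numpy()) for col in X.columns
})
X_aug = pd.concat([X, shadows], axis=1)

forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(X_aug, y)

importances = pd.Series(forest.feature_importances_, index=X_aug.columns)

# A real feature scores a "hit" if it beats the most important shadow.
threshold = importances[shadows.columns].max()
hits = importances[X.columns] > threshold
print(hits)
```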

# <div style="color:white;display:fill;border-radius:15px;background-color:#486D88;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:80%">8.1 Boruta implementation</p></div>

# <div style="color:white;display:fill;border-radius:15px;background-color:#123752;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:110%">9. Model Implementation</p></div>


- Implement different machine learning models and algorithms
- Run cross-validation
- Compute single-split performance metrics
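
As an illustration of the cross-validation step, the sketch below runs stratified 5-fold cross-validation on synthetic, imbalanced data. Stratification matters here because only about 12% of the rows in this dataset have `response = 1` (mean 0.1226 in the description table), so each fold should keep that class ratio. The logistic regression model, the ROC AUC metric, and the synthetic features are assumptions chosen for brevity, and scikit-learn is assumed to be installed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic imbalanced problem: only the first feature carries signal.
n = 2000
X = rng.normal(size=(n, 3))
logits = 2.0 * X[:, 0] - 2.5
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

# Stratified folds keep the positive rate roughly equal across splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
print(f'ROC AUC per fold: {np.round(scores, 3)}')
print(f'Mean +/- std: {scores.mean():.3f} +/- {scores.std():.3f}')
```

Reporting the spread across folds alongside the mean helps when comparing models, since a small difference in mean score can be within fold-to-fold noise.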

# <div style="color:white;display:fill;border-radius:15px;background-color:#486D88;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:80%">9.1 First model</p></div>

# <div style="color:white;display:fill;border-radius:15px;background-color:#123752;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:110%">10. Hyperparameter Fine-Tuning</p></div>


- Implement a hyperparameter search (e.g. Bayes Search) to find the best hyperparameter values
- Re-train the model using the best values
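
The search-then-retrain loop can be sketched as follows. Note that this example swaps in a random search (`RandomizedSearchCV` from scikit-learn) as a stand-in for Bayes search, to avoid an extra dependency; the workflow is the same, but the sampling strategy is not Bayesian. The model, the search range for `C`, and the synthetic data are illustrative assumptions.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(1)

# Synthetic binary problem with two informative features.
n = 1000
X = rng.normal(size=(n, 4))
y = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=n) > 1.0).astype(int)

search = RandomizedSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    # Sample the regularization strength on a log scale.
    param_distributions={'C': loguniform(1e-3, 1e2)},
    n_iter=10,
    scoring='roc_auc',
    cv=3,
    random_state=1,
    refit=True,  # re-train on all data with the best values found
)
search.fit(X, y)

best_model = search.best_estimator_
print(f'Best params: {search.best_params_}')
print(f'Best cross-validated ROC AUC: {search.best_score_:.3f}')
```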

# <div style="color:white;display:fill;border-radius:15px;background-color:#486D88;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:80%">10.1 Bayes Search</p></div>

# <div style="color:white;display:fill;border-radius:15px;background-color:#123752;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:110%">11. Model Error Estimation and Interpretation</p></div>


- Use model errors to interpret the goals 
- Model learning performance
- Model generalization performance
- What does it mean for the business?
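
Because the deliverable is a ranked contact list, one way to translate model quality into business terms is precision and recall among the top k customers. The sketch below is a minimal example; the helper name `ranking_metrics_at_k` and the toy labels and scores are hypothetical.

```python
import numpy as np
import pandas as pd


def ranking_metrics_at_k(y_true, y_score, k):
    """Return (precision, recall) among the k highest-scored customers."""
    ranked = pd.DataFrame({'y_true': y_true, 'score': y_score})
    ranked = ranked.sort_values('score', ascending=False).reset_index(drop=True)
    hits_in_top_k = ranked.head(k)['y_true'].sum()
    precision = hits_in_top_k / k
    recall = hits_in_top_k / ranked['y_true'].sum()
    return precision, recall


# Toy example: 10 customers, 3 of them actually interested.
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])

precision, recall = ranking_metrics_at_k(y_true, y_score, k=4)
print(f'Precision@4: {precision:.2f}')
print(f'Recall@4: {recall:.2f}')
```

In this toy case, contacting the top 4 of 10 customers reaches 2 of the 3 interested ones. Computing the same metrics on the training data and on held-out data gives a simple view of learning versus generalization performance: a large gap between the two suggests overfitting.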

# <div style="color:white;display:fill;border-radius:15px;background-color:#486D88;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:80%">11.1 Model error comparison</p></div>

# <div style="color:white;display:fill;border-radius:15px;background-color:#123752;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:110%">12. Model Deployment</p></div>


- Deploy the model to a cloud service so it can be used by its consumers
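
A deployment class typically bundles the cleaning steps with the scoring call, so the API handler only has to pass raw records in and return a ranked list. The sketch below is one hypothetical shape for such a class: `CrossSellPipeline`, `StubModel`, and the toy input are all illustrative names and values, and the web layer (the API handler itself) is left out.

```python
import numpy as np
import pandas as pd


class CrossSellPipeline:
    """Hypothetical wrapper: clean raw records, score them, rank them."""

    def __init__(self, model):
        # `model` is any object exposing predict_proba(X) -> array of shape (n, 2).
        self.model = model

    def clean(self, raw):
        df = raw.copy()
        # Same steps as in the notebook: lower-case names, binary vehicle_damage.
        df.columns = [col.lower() for col in df.columns]
        df['vehicle_damage'] = (df['vehicle_damage'] == 'Yes').astype(int)
        return df

    def rank(self, raw, feature_cols):
        df = self.clean(raw)
        df['score'] = self.model.predict_proba(df[feature_cols])[:, 1]
        return df.sort_values('score', ascending=False).reset_index(drop=True)


class StubModel:
    """Stand-in for a trained model, for illustration only."""

    def predict_proba(self, X):
        p = 0.1 + 0.6 * X['vehicle_damage'].to_numpy() + 0.002 * X['age'].to_numpy()
        return np.column_stack([1 - p, p])


# Toy raw input (values are illustrative).
raw = pd.DataFrame({
    'id': [1, 2, 3],
    'Age': [44, 76, 21],
    'Vehicle_Damage': ['Yes', 'No', 'Yes'],
})

ranked = CrossSellPipeline(StubModel()).rank(raw, ['age', 'vehicle_damage'])
payload = ranked[['id', 'score']].to_json(orient='records')
print(payload)
```

Keeping the cleaning logic inside the class means the same transformations run in training and in production, which avoids a common source of mismatched predictions.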

# <div style="color:white;display:fill;border-radius:15px;background-color:#486D88;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:80%">12.1 Class</p></div>

# <div style="color:white;display:fill;border-radius:15px;background-color:#486D88;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:80%">12.2 API Handler</p></div>

# <div style="color:white;display:fill;border-radius:15px;background-color:#486D88;letter-spacing:0.5px;overflow:hidden"><p style="padding:10px;color:white;overflow:hidden;text-align: center;margin:0;font-size:80%">12.3 API Tester</p></div>