<br>
<div class="alert alert-info">
    <b> <h1> Life Insurance case study </h1></b>
</div>


https://www.kaggle.com/c/prudential-life-insurance-assessment/data


## Introduction

Siham, one of the largest issuers of life insurance in Morocco, wants to develop an online life insurance application process. Today, customers provide extensive information to identify risk classification and eligibility, including scheduling medical exams. Siham wants to make it quicker and less labor-intensive for new and existing customers to get a quote while maintaining privacy boundaries.

By developing a predictive model that accurately classifies risk using a more automated approach, you can greatly help Siham to better understand the predictive power of the data points in the existing assessment, enabling it to significantly streamline the process.

### Data : 

|  Num 	|   Name	|   Values	|
|-------|-------|-------|
|  1 	|  `Id` 	|   int	|
|  2	|   `Product_Info_1`	|  real 	|
|  3 	|   `Product_Info_2`	|  'D3', 'A1', 'E1', 'D4', 'D2', 'A8', 'A2', 'D1', 'A7', 'A6', 'A3','A5', 'C4', 'C1', 'B2', 'C3', 'C2', 'A4', 'B1' 	|
|  4 	|   `Product_Info_3`	|  integer 	|
|  5 	|  `Product_Info_4`|	real   	|
|  6 	|  `Product_Info_5`|	integer 2, 3	|
|  7 	|   `Product_Info_6`|	integer 1, 3   	|
|  8 	|   `Product_Info_7`|	integer 1, 2, 3   	|
|  9 	|   `Ins_Age`|	real   	|
|  10 	|   `Ht`|	real   	|
|  11 	|   `Wt`|	real    |
|  12 	|   `BMI`|	real	|
|  13 	|   `Medical_Keyword_`|	int 1/0   	|
|  14 	|   `Response`|	integer 1 -> 8   	|

### Description of the data
* `Id`: A unique identifier associated with an application.
* `Product_Info_1-7`: A set of normalized variables relating to the product applied for
* `Ins_Age`: Normalized age of applicant
* `Ht`: Normalized height of applicant
* `Wt`: Normalized weight of applicant
* `BMI`: Normalized BMI of applicant. Body mass index (BMI) is a measure of body fat based on height and weight.

* `Employment_Info_1-6`: A set of normalized variables relating to the employment history of the applicant.

* `InsuredInfo_1-6`: A set of normalized variables providing information about the applicant.
* `Insurance_History_1-9`: A set of normalized variables relating to the insurance history of the applicant.

* `Family_Hist_1-5`: A set of normalized variables relating to the family history of the applicant.

* `Medical_Keyword_`: A set of dummy variables relating to the presence of/absence of a medical keyword being associated with the application.
* `Response`: This is the target variable, an ordinal variable relating to the final decision associated with an application 

In [None]:
import warnings
warnings.filterwarnings("ignore")

%matplotlib inline
# Importations 
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport

sns.set()

#pd.set_option('display.max_columns', None)  # or 1000
#pd.set_option('display.max_rows', None)  # or 1000
#pd.set_option('display.max_colwidth', -1)  # or 199

## 1 Acquire the data


### 1.1 Read  CSV

**Q:** Load the data from the current directory and display the head. The file is named `trainCourse.csv`.

In [None]:
df = pd.read_csv('trainCourse.csv')
df.head()

**Q:** How many samples and features does the dataset contain?

In [None]:
df.shape

### 1.2 Describing data


**Q:** Which features are available in the dataset?


In [None]:
df.columns

**Q:** Print the global information about the data and the different feature types 

In [None]:
df.info(verbose=True)

**Q:** Which features are categorical?

In [None]:
df.describe(include=['O'])

In [None]:
# Select columns of type 'O'
df.select_dtypes(include='O')

**Q:** Which features are numerical?

In [None]:
df.describe(exclude='O')

In [None]:
df.select_dtypes(exclude='O')

#df.describe(include=[np.number])

**Q:** Can the `Id` variable be used as a feature in machine learning? Print the number of unique values of `Id`.

In [None]:
df['Id'].describe()

In [None]:
len(df.Id.unique())
df['Id'].nunique()

There are 5938 instances and every `Id` is unique, so the variable carries no predictive information. We can drop it using the **drop** function from pandas.

In [None]:
df = df.drop('Id',axis=1)

In [None]:
df.shape

### 1.3 Exploration

**Q:** What is the distribution of numerical feature values across the samples?

In [None]:
df.describe()

**Q:** Are there missing values in the data? How many? Give the ratio for each feature.

In [None]:
# Way 1 : no nan data
df.notnull().sum()[df.notnull().sum() < df.shape[0]]

In [None]:
#Way 2 : only nan data
df.isnull().sum()[df.isnull().sum() >0]

In [None]:
# Ratio of missing values for each feature that has at least one NaN
df.isnull().sum()[df.isnull().sum() > 0] / df.shape[0]
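
As a small illustration of the ratio computation, here is a minimal sketch on a hypothetical toy DataFrame (`toy` and its values are made up, not the course data). Taking the mean of the boolean mask returned by `isnull()` gives the fraction of missing values per column directly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the course dataset
toy = pd.DataFrame({
    "Ht": [0.7, np.nan, 0.6, 0.8],
    "Wt": [0.3, 0.4, 0.2, 0.5],
    "Employment_Info_1": [np.nan, np.nan, 0.1, np.nan],
})

# Fraction of missing values per column (mean of a boolean mask)
missing_ratio = toy.isnull().mean()

# Keep only the columns that actually have missing values, largest first
missing_ratio = missing_ratio[missing_ratio > 0].sort_values(ascending=False)
print(missing_ratio)
```

Sorting the result makes it easy to spot features that are mostly empty and may be candidates for dropping rather than imputing.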

**Q:** Now let's look at the variable we want to predict, **the Y variable**, which is **Response**. Using the `value_counts` function, display the class imbalance.

In [None]:
df.Response.value_counts()/df.shape[0]
df['Response'].value_counts(normalize=True)

##  2 Exploratory Data Analysis

We arrive at the following assumptions based on the data analysis done so far. We may validate these assumptions further before taking appropriate actions.

* **Correlating**: We want to know how well each feature correlates with `Response`.
* **Completing**: We want to complete missing values.
* **Correcting**: We want to drop some features.
* **Creating**: We want to create new features.


**Q:** Draw a big picture of the data using ProfileReport

In [None]:
profile = ProfileReport(df, title='Life Insurance', html={'style':{'full_width':True}})

In [None]:
profile.to_widgets()

It can be very interesting to look at the *distribution* of our variables, comparing them to the *output variable*.

Here we plot it using the seaborn library.

In [None]:
features = df.columns

def PlotDistributiontest(X_train,Y_train,nb_of_features=20):
    plt.figure(figsize=(12,nb_of_features*4))
    gs = gridspec.GridSpec(nb_of_features, 2) #Customizing Figure Layouts Using GridSpec 
    for i in range(nb_of_features):
        
        ax = plt.subplot(gs[i,0])
        try :
            sns.distplot(X_train[Y_train == 0][X_train.columns[i]], bins=50,color='red') 
            ax.set_xlabel('')
            ax.set_title('histogram of feature of risk 0: ' + str(X_train.columns[i]))
        except Exception:
            print('error')

        try :
            ax = plt.subplot(gs[i,1])
            sns.distplot(X_train[Y_train == 1][X_train.columns[i]], bins=50,color='green')
            ax.set_xlabel('')
            ax.set_title('histogram of feature of risk 1: ' + str(X_train.columns[i]))
        except Exception:
            print('error')
    plt.show()

In [None]:
PlotDistributiontest(df.drop('Product_Info_2', axis=1),df['Response'],nb_of_features=5)

In [None]:
def PlotDistribution(X_train,Y_train,nb_of_features=20):
    plt.figure(figsize=(12,nb_of_features*4))
    gs = gridspec.GridSpec(nb_of_features, 1) #Customizing Figure Layouts Using GridSpec 
    for i in range(nb_of_features):
        
        try :
            ax = plt.subplot(gs[i])
            sns.distplot(X_train[Y_train == 0][X_train.columns[i]], bins=50,color='red') 
            sns.distplot(X_train[Y_train == 1][X_train.columns[i]], bins=50,color='green')
            ax.set_xlabel('')
            ax.set_title('histogram of feature of risk: ' + str(X_train.columns[i]))
        except Exception:
            print('error')
    plt.show()

In [None]:
PlotDistribution(df.drop('Product_Info_2', axis=1),df['Response'],nb_of_features=5)

## 3 Preparing the data

### 3.1 Splitting the Dataset into Training set and Test Set

**Q:** Splitting the dataset between X and y

In [None]:
y = df["Response"]
X = df.drop(["Response"],axis=1)

**Q:** Split the data between train 80%  and test 20% in stratified manner (X_train, X_test, y_train, y_test)

In [None]:
from sklearn.model_selection import train_test_split

# stratify=y keeps the class proportions of Response the same in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
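
To see what a stratified split does, here is a minimal sketch on a hypothetical imbalanced target (`X_toy` and `y_toy` are made up for illustration, and `random_state=0` is an arbitrary seed). With `stratify`, the class proportions are preserved in both subsets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 samples of class 1, 20 of class 2
y_toy = pd.Series([1] * 80 + [2] * 20)
X_toy = pd.DataFrame({"feature": np.arange(100)})

# stratify=y_toy keeps class proportions identical in both subsets;
# random_state is an arbitrary seed chosen here for reproducibility
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

Without `stratify`, a rare class can end up over- or under-represented in the test set purely by chance.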

### 3.2  Sampling and covariate shift

We now have a training set and a test set, together with the labels of the **Y** variable. Let's take a closer look.

**Q:** Look at the difference in class imbalance between the train and test sets.

In [None]:
y_train.value_counts()/X_train.shape[0]

In [None]:
y_test.value_counts()/X_test.shape[0]

**Q:** Check that there is no covariate shift between the train and test sets.

In [None]:
sns.distplot(X_train['Product_Info_4'],bins=50)
sns.distplot(X_test['Product_Info_4'],bins=50)

In [None]:
# Step 1 :  remove nan column

train_sample = X_train.copy().dropna(axis=1) # keep only the features without NaN
train_sample['Target'] = 1
test_sample = X_test.copy().dropna(axis=1)
test_sample['Target'] = 0

cov_shift = pd.concat([train_sample,test_sample], axis=0,ignore_index=True)
y_shift = cov_shift['Target']
cov_shift = cov_shift.drop('Target',axis=1)
cov_shift = cov_shift.drop('Product_Info_2',axis=1) # categorical


train_sample.shape,test_sample.shape

In [None]:
# Step 2 : Run a model to check whether it can tell the train set apart from the test set
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve
from sklearn.ensemble import RandomForestClassifier

features = cov_shift.columns

clf =  RandomForestClassifier(n_estimators=150, max_depth=2)

predictions = np.zeros(y_shift.shape)

for feature in features : 
    cv = StratifiedKFold(n_splits=20,shuffle=True)

    for fold, (train_idx, test_idx) in enumerate(cv.split(cov_shift,y_shift)):
        X_train_s, X_test_s = cov_shift.loc[train_idx], cov_shift.loc[test_idx]
        y_train_s, y_test_s = y_shift[train_idx], y_shift[test_idx]

        clf.fit(X_train_s[[feature]], y_train_s)
        probs = clf.predict_proba(X_test_s[[feature]])[:, 1]
        predictions[test_idx] = probs
    
    print ('Feature {}: ROC-AUC {}'.format(feature, roc_auc_score(y_shift, predictions)))
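
The per-feature check above can also be run on all features at once. The sketch below is one possible variant on synthetic data, not the notebook's own method: the helper `shift_auc` and the Gaussian samples are hypothetical. An AUC close to 0.5 means the classifier cannot separate train rows from test rows, while an AUC close to 1 signals a shift.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

rng = np.random.default_rng(0)

def shift_auc(train_part, test_part):
    """AUC of a classifier that tries to tell train rows (1) from test rows (0)."""
    data = np.vstack([train_part, test_part])
    origin = np.r_[np.ones(len(train_part)), np.zeros(len(test_part))]
    clf = RandomForestClassifier(n_estimators=50, max_depth=2, random_state=0)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    # Out-of-fold probabilities of being a "train" row
    probs = cross_val_predict(clf, data, origin, cv=cv, method="predict_proba")[:, 1]
    return roc_auc_score(origin, probs)

# Same distribution: the AUC should stay close to 0.5
auc_same = shift_auc(rng.normal(size=(400, 3)), rng.normal(size=(100, 3)))
# Test set shifted by 3 standard deviations on every feature: AUC close to 1
auc_shifted = shift_auc(rng.normal(size=(400, 3)), rng.normal(loc=3.0, size=(100, 3)))
print(auc_same, auc_shifted)
```

Because the split here comes from a random `train_test_split` of one dataset, we expect the first situation: no feature should separate train from test much better than chance.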
    

### 3.3 Data encoding

We have seen one-hot encoding (the creation of dummy variables). Let's use a scikit-learn utility to do the trick. Remember our categorical variable `Product_Info_2`:

**Q:** Print the different modalities of this variable.

In [None]:
X_train['Product_Info_2'].unique()

We will now create 19 dummy variables, one per modality of the `Product_Info_2` variable, and populate them with 0 and 1 according to the instances. There are several ways of doing this.

* We can use `OneHotEncoder`, `MultiLabelBinarizer`, `LabelBinarizer` or `get_dummies`
* https://chrisalbon.com/machine_learning/preprocessing_structured_data/one-hot_encode_features_with_multiple_labels/
* https://stackoverflow.com/questions/50473381/scikit-learns-labelbinarizer-vs-onehotencoder

**Q:** Use any method to encode the labels of the `Product_Info_2` feature. Place the resulting variables in separate dataframes, `X_train_dum` and `X_test_dum`, and print the head of the result.

In [None]:
from sklearn.preprocessing import LabelBinarizer

encoder = LabelBinarizer()
encoder.fit(X_train['Product_Info_2'])
X_train_dum = encoder.transform(X_train['Product_Info_2'])
X_test_dum = encoder.transform(X_test['Product_Info_2'])

X_train_dum = pd.DataFrame(data=X_train_dum)
X_test_dum = pd.DataFrame(data=X_test_dum)


X_train_dum.head()

encoder.classes_
#TODO

We now have 19 columns labeled 0 through 18, which we are going to quickly rename using a list comprehension.

**Q:** Use a list to rename the columns of X_train_dum, and X_test_dum as 'Product_Info_2_' + 'modality'

In [None]:
X_train_dum.columns = ['Product_Info_2_'+mode for mode in encoder.classes_]
X_test_dum.columns = ['Product_Info_2_'+mode for mode in encoder.classes_]
X_train_dum.index = X_train.index
X_test_dum.index = X_test.index

X_train_dum.head()

**Q:** Add X_train_dum and X_test_dum to the original data using the pandas join function, then remove the categorical variable

In [None]:
X_train = X_train.join(X_train_dum)
X_train = X_train.drop('Product_Info_2',axis=1)
# Apply the same transformation to the test set
X_test = X_test.join(X_test_dum)
X_test = X_test.drop('Product_Info_2',axis=1)
X_train.sample()

### 3.4 Data imputation

Fast and classic Imputation transformer for completing missing values.
* https://scikit-learn.org/stable/modules/impute.html
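
Here is a minimal sketch of both imputation routes on hypothetical toy data (`train_toy` and `test_toy` are made up, and the median strategy is just one possible choice). In both cases the fill values are computed on the training set only and then reused for the test set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy train/test sets with missing values
train_toy = pd.DataFrame({"Ht": [0.6, np.nan, 0.8], "Wt": [0.2, 0.4, np.nan]})
test_toy = pd.DataFrame({"Ht": [np.nan, 0.7], "Wt": [0.3, np.nan]})

# Way 1: pandas fillna, using medians computed on the training set only
train_medians = train_toy.median()
train_filled = train_toy.fillna(train_medians)
test_filled = test_toy.fillna(train_medians)

# Way 2: SimpleImputer, fitted on train and applied to both sets
imputer = SimpleImputer(strategy="median")
train_imputed = pd.DataFrame(imputer.fit_transform(train_toy),
                             columns=train_toy.columns, index=train_toy.index)
test_imputed = pd.DataFrame(imputer.transform(test_toy),
                            columns=test_toy.columns, index=test_toy.index)

print(test_imputed)
```

`SimpleImputer` returns a NumPy array, which is why the result is wrapped back into a DataFrame with the original column names and index.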

**Q:** Use imputation to fill the missing values (first check whether there are any missing values).

In [None]:
## Check if there are any missing values


In [None]:
## Way 1 : Impute your values using pandas (fillna)


In [None]:
## Way 2 : Impute using SimpleImputer

from sklearn.impute import SimpleImputer



### 3.5 Scaling the data 

**Q:** a) MinMaxScaler: normalize each feature by the range of the data

In [None]:
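# X = (X - min(X)) / (max(X) - min(X))    range between 0 and 1

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Fit on the training set only, then apply the same scaling to the test set
minmax_scaler = MinMaxScaler().fit(X_train)
X_train_minmax = minmax_scaler.transform(X_train)
X_test_minmax = minmax_scaler.transform(X_test)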

#### b) StandardScaler : Standardize features by removing the mean and scaling to unit variance

In [None]:
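# X = (X - mean(X)) / std(X)   zero mean and unit variance
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Fit on the training set only, then apply the same scaling to the test set
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)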
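
To make the difference between the two scalers concrete, here is a minimal sketch on a hypothetical one-feature dataset (`train_toy` and `test_toy` are made up). It also shows that test values can fall outside [0, 1] after min-max scaling, because the minimum and maximum are learned from the training set.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data: one feature, train and test
train_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
test_toy = np.array([[0.0], [6.0]])

minmax = MinMaxScaler().fit(train_toy)      # learns min=1, max=5 from train
standard = StandardScaler().fit(train_toy)  # learns mean=3, std=sqrt(2) from train

train_minmax = minmax.transform(train_toy)
test_minmax = minmax.transform(test_toy)
train_std = standard.transform(train_toy)

print(train_minmax.ravel())  # spans exactly [0, 1]
print(test_minmax.ravel())   # can fall outside [0, 1]
print(train_std.mean(), train_std.std())
```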

## 4 Model Baseline

In [None]:
X_train.shape,y_train.shape

In [None]:
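# Evaluate metric(s) by cross-validation and also record fit/score times. Use Decision tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate

# Assumes X_train is numeric with missing values imputed; cv=5 is an arbitrary choice
cv_dt = cross_validate(DecisionTreeClassifier(), X_train, y_train, cv=5)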



In [None]:
cv_dt

In [None]:
print("Mean score: ", cv_dt['test_score'].mean(), "Std: ", cv_dt['test_score'].std())


## 5 Feature Selection

#### 5.1 Pearson Correlation

Pearson's correlation coefficient is the covariance of the two variables divided by the product of their standard deviations.

$$\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_{X}\sigma_{Y}}$$

$$ \text{where cov is the covariance, } \sigma_{X} \text{ is the standard deviation of } X \text{ and } \sigma_{Y} \text{ is the standard deviation of } Y $$
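
The definition can be checked numerically. The sketch below uses two hypothetical toy vectors and compares the coefficient computed from the formula with the value returned by `scipy.stats.pearsonr`.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical toy vectors
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Definition: covariance divided by the product of standard deviations
# (population versions, ddof=0, used consistently for both)
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
rho_manual = cov_xy / (x.std() * y.std())

# scipy returns the coefficient and a two-sided p-value
rho_scipy, p_value = pearsonr(x, y)

print(rho_manual, rho_scipy, p_value)
```

Here the covariance is 1.2 and the variances are 2 and 1.2, so the coefficient equals 1.2 / sqrt(2.4) = sqrt(0.6), about 0.775.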

In [None]:
# Pearson Correlation
from scipy.stats import pearsonr

**NB:** The function below computes the Pearson correlation using `pearsonr`.

In [None]:
def TabPearsonr(X_train,y_train):
    """
    Computes the Pearson correlation between each feature column of X_train and y_train.
    
    Arguments:
    X_train -- learning set
    y_train -- learning output
    
    Returns:
    TabResultsPearson -- the table of the Pearson correlations and p-values such that
    TabResultsPearson.loc[feature] = pearsonr(X_train[feature], y_train)
    """
    ListPearsonCorr = []
    # NB: `features` is the global list of column names defined earlier in the notebook
    for feature in features:
        ListPearsonCorr.append(pearsonr(X_train[feature],y_train))
    
    TabResultsPearson = pd.DataFrame(ListPearsonCorr,columns=['Pearson Correlation','P-Value'],index=features)
    #TabResultsPearson = TabResultsPearson.transpose()
    return TabResultsPearson


In [None]:
TabResultsPearson = TabPearsonr(X_train,y_train)
TabResultsPearson

In [None]:
# https://dataschool.com/fundamentals-of-analysis/correlation-and-p-value/
def SelectBestFeatures(TabResultsPearson,thresPVal=0.05):
    """
    Extracts the features that are most correlated with Y_train and whose p-value is below thresPVal.
    
    Arguments:
    TabResultsPearson -- table of the correlations
    thresPVal -- pValue threshold
    
    Returns:
    ListSelectedFeatures -- List of selected features
    """
    
    potentiel = TabResultsPearson[TabResultsPearson['P-Value'] < thresPVal]
    AbsCorr = potentiel['Pearson Correlation'].abs()
    # Median absolute deviation of the absolute correlations
    MAD = (AbsCorr - AbsCorr.median()).abs().median()
    print(potentiel)
    # Keep the features whose absolute correlation exceeds median + MAD
    ListSelectedFeatures = AbsCorr[AbsCorr > AbsCorr.median() + MAD].index.values
    return ListSelectedFeatures

In [None]:
ListSelectedFeatures = SelectBestFeatures(TabResultsPearson,thresPVal=0.05)

print("Selected Features : ")
print(ListSelectedFeatures)


In [None]:
X_train2 = pd.DataFrame(X_train2,columns=Cols)
X_test2 = pd.DataFrame(X_test2,columns=Cols)

newX_train = X_train2[ListSelectedFeatures].copy()
newX_test = X_test2[ListSelectedFeatures].copy()

#### 5.2 Wrapper method


#### Recursive Feature Elimination

The RFE procedure removes the least significant variables one at a time, using the chosen learning model to rank them; in this case, we use random forests.
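
As an illustration of the procedure, here is a minimal sketch on synthetic data generated with `make_classification`. The dataset, the number of trees and the number of features to keep are all hypothetical settings chosen for this example, not values prescribed by the case study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data: 8 features, of which 3 are informative (shuffle=False keeps
# the informative ones in the first columns)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)

# Hypothetical settings: keep 3 features, drop one per iteration (step=1)
selector = RFE(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=3,
    step=1,
)
selector.fit(X_toy, y_toy)

print(selector.support_)   # boolean mask of the kept features
print(selector.ranking_)   # 1 = selected, larger = eliminated earlier
```

The `ranking_` attribute is what lets us sort the columns by importance afterwards, for example with `np.argsort`.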

In [None]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier



In [None]:
sortedIdSelected = np.argsort(tf.ranking_)
sortedCols = newX_train.columns[sortedIdSelected]
newX_train[sortedCols].head()

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
def modelfit(X_train,Y_train,X_test,Y_test,ListSelectFeatures,Labels):
    """
    Computes the 200-trees RandomForest fitted model on Y_train using the selected features.
    Then predicts on X_test.
    
    Arguments:
    X_train -- learning set
    Y_train -- learning output
    X_test -- test set
    Y_test -- test output
    ListSelectFeatures -- list of selected features
    Labels -- names of the features
    
    Returns:
    pred -- prediction vector
    MSE -- mean squared error
    R2 -- R2 score
    
    NB :
    model = RFModel.fit(X_train,Y_train)
    pred = model.predict(X_test)
    """
    
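    # Assumption: the 200-tree model is a RandomForestClassifier (imported above)
    # and ListSelectFeatures holds integer column positions
    cols = list(ListSelectFeatures)
    RFModel = RandomForestClassifier(n_estimators=200)
    model = RFModel.fit(np.asarray(X_train)[:, cols], Y_train)
    pred = model.predict(np.asarray(X_test)[:, cols])
    print("MSE=",mean_squared_error(Y_test,pred))
    print("R2=",r2_score(Y_test,pred))
    return pred,mean_squared_error(Y_test,pred), r2_score(Y_test,pred)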

Dummy example :

In [None]:
## Example

LabelsRand = range(50)
matRand_Train = np.array([[1.0] * 30 + [0.0] * 20] * 50 + [[1.0] * 30 + [0.0] * 20] * 50)
yRand_Train = np.array([1] * 50 + [0] * 50)
matRand_Test = np.array([[0.0] * 25 + [1.0] * 25] * 40)
yRand_Test = np.array([0] * 20 + [0] * 20)
print(matRand_Train.shape)
ListSelectFeatures = range(20)

In [None]:
modelfit(matRand_Train,yRand_Train,matRand_Test,yRand_Test,range(15),LabelsRand);
