# IGA-02. Motor Own Damage Insurance in Russia

### Students

- Dobrego Daria
- Du Shaohui
- Magomedova Zamira
- Makarkina Irina


### Problem setup

*In this assignment you are asked to work with the **MODIpolicies** dataset to test the hypothesis of a positive influence of down-sampling on prediction performance of a classifier, specifically, multinomial logistic regression.
The pseudo-code for completing the assignment is provided below.*

In [48]:
# load packages
import pandas as pd
import numpy as np

import statsmodels.api as sm
import sklearn.metrics as sklm
from sklearn.utils import resample

from imblearn.under_sampling import RandomUnderSampler as RUS
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression as LR
from sklearn.metrics import classification_report

In [2]:
# read the data from the MS Excel file
data_modi = pd.read_excel('MODIpolicies.xlsx', 'data', index_col=None, na_values=['NA'])
data_modi.head()

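| | Claims | Franchise | Loan | CarAge | Experience | Gender | Class |
|---|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 0 | 0 | 9 | 0 | 3 |
| 1 | 0 | 0 | 1 | 0 | 4 | 0 | 8 |
| 2 | 1 | 0 | 0 | 4 | 4 | 0 | 9 |
| 3 | 0 | 0 | 0 | 0 | 17 | 1 | 9 |
| 4 | 1 | 0 | 0 | 2 | 7 | 1 | 6 |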


In [3]:
# separate the outcome variable and the features
X = data_modi.drop("Claims", axis=1)
y = data_modi["Claims"]

# split the sample into training (80%) and test (20%)
# (!) do not change 'random_state=0'
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)


### Task 1 (1 point).
*Describe the ORIGINAL FULL data (any two aspects) and give brief comments.*

In [4]:
# data description
data_modi.describe()

| | Claims | Franchise | Loan | CarAge | Experience | Gender | Class |
|---|---|---|---|---|---|---|---|
| count | 3720.0 | 3720.0 | 3720.0 | 3720.0 | 3720.0 | 3720.0 | 3720.0 |
| mean | 0.253763 | 0.189247 | 0.125806 | 1.263978 | 9.658333 | 0.557796 | 5.149462 |
| std | 0.523307 | 0.68416 | 0.331676 | 1.581757 | 8.102376 | 0.496715 | 2.866338 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 0.0 | 3.0 |
| 50% | 0.0 | 0.0 | 0.0 | 1.0 | 8.0 | 1.0 | 4.0 |
| 75% | 0.0 | 0.0 | 0.0 | 2.0 | 13.0 | 1.0 | 9.0 |
| max | 2.0 | 4.0 | 1.0 | 7.0 | 46.0 | 1.0 | 10.0 |


In [32]:
data_modi["Claims"].value_counts()

0    2933
1     630
2     157
Name: Claims, dtype: int64

#### Give your comments in this chunk

*Consideration 1*:
The Experience variable probably has outliers: 75% of the sample has 13 years of experience or less, while the maximum is 46 years. The mean (9.66) also exceeds the median (8), which is consistent with a right-skewed distribution.

*Consideration 2*:
The dataset has no missing values: every variable has 3,720 observations. The two genders are also fairly balanced (the mean of Gender is 0.56), so the dataset is reasonably representative at least with regard to that variable.

We also note that the 75th percentile of Claims is 0: most policyholders (2,933 of 3,720, about 79%) made no claims, so the outcome classes are strongly imbalanced.
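The three observations above (missing values, skewness, class shares) can be checked directly rather than read off the summary table. The following is a minimal sketch on synthetic data: the frame `demo` and its values are hypothetical stand-ins for `data_modi`, and only the column names are taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for data_modi (hypothetical values, real column names)
rng = np.random.default_rng(0)
n = 1000
demo = pd.DataFrame({
    "Claims": rng.choice([0, 1, 2], size=n, p=[0.79, 0.17, 0.04]),
    "Experience": rng.gamma(shape=1.5, scale=6.5, size=n).round(),
    "Gender": rng.integers(0, 2, size=n),
})

# Number of missing values per column (all zeros means no gaps)
missing_per_column = demo.isna().sum()

# Sample skewness: a positive value indicates a right-skewed distribution
experience_skew = demo["Experience"].skew()

# Share of each outcome class, ordered by class label
class_shares = demo["Claims"].value_counts(normalize=True).sort_index()

print(missing_per_column)
print(f"Experience skewness: {experience_skew:.2f}")
print(class_shares.round(3))
```

On the real data the same three calls would be applied to `data_modi` instead of `demo`.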

### Task 2 (1 point).
*Estimate a multinomial logistic regression on the training sample and construct a classification report. Give a brief comment on the obtained classification results (specify any two principal considerations).*

In [29]:
# estimate MLR on the training sample, predict the outcomes on the test sample
# and construct a classification report

def myLR(X_tr, y_tr, cutoff=0.5):
    """Fit a logistic regression on (X_tr, y_tr), print the classification
    report for the global test sample (X_test, y_test) and return the model."""
    LRmodel = LR(random_state=0, solver='lbfgs', max_iter=1000000).fit(X_tr, y_tr)
    y_test_pred_probs = LRmodel.predict_proba(X_test)
    # an observation is assigned to class 1 if P(class 1) > cutoff and to class 0
    # otherwise, so class 2 can never be predicted by this rule
    y_test_pred_classes = (y_test_pred_probs[:, 1] > cutoff).astype(int)
    print(classification_report(y_true=y_test, y_pred=y_test_pred_classes))
    return LRmodel

LR_full = myLR(X_train, y_train, cutoff=0.5)


              precision    recall  f1-score   support

           0       0.78      1.00      0.88       582
           1       0.00      0.00      0.00       133
           2       0.00      0.00      0.00        29

    accuracy                           0.78       744
   macro avg       0.26      0.33      0.29       744
weighted avg       0.61      0.78      0.69       744





#### Give your comments in this chunk

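*Consideration 1:* Class 0 has a recall of 1.00, but classes 1 and 2 are never predicted (recall 0.00), presumably because of class imbalance: the model assigns every test observation to class 0.

*Consideration 2:* Accuracy looks acceptable at 78%, but it simply equals the share of class 0 in the test sample (582 of 744). Precision and recall for classes 1 and 2 are 0.00, so accuracy alone is not informative here.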
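The `myLR` function thresholds only the probability of class 1, so its predictions are limited to classes 0 and 1. A multinomial alternative is to pick the class with the highest predicted probability. The sketch below illustrates the difference on synthetic data from `make_classification`; the data, the class weights, and all variable names are assumptions for illustration, not the MODIpolicies sample.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced three-class problem (hypothetical stand-in for the data)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1,
    weights=[0.79, 0.17, 0.04], random_state=0,
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

model = LogisticRegression(solver="lbfgs", max_iter=1000).fit(Xd_train, yd_train)
probs = model.predict_proba(Xd_test)  # one column per class, rows sum to 1

# Rule used in myLR: class 1 if P(class 1) > 0.5, else class 0
cutoff_pred = (probs[:, 1] > 0.5).astype(int)

# Multinomial rule: the class with the highest predicted probability
argmax_pred = model.classes_[probs.argmax(axis=1)]

print("classes produced by the cutoff rule:", sorted(set(cutoff_pred)))
print("argmax rule equals model.predict:",
      np.array_equal(argmax_pred, model.predict(Xd_test)))
```

Switching `myLR` to the argmax rule would change the classification reports shown in this notebook, so the two rules should not be mixed when comparing results.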


### Task 3 (2 points).
*Set random seed to **1234** and RANDOMLY downsample (without replacement) Class0 in the training sample to **500** observations. Estimate a multinomial logistic regression on the obtained (down-sampled) training sample, predict on the ORIGINAL test sample, and construct a classification report.*

In [60]:
### DOWNSAMPLING ###

# NOTE: the classes are extracted from the full data set (data_modi), not from
# the training sample, so the result overlaps with the test sample

# extract Class0
dm_class0 = data_modi[data_modi['Claims']==0]
# extract classes 1 and 2
dm_class1 = data_modi[data_modi['Claims']==1]
dm_class2 = data_modi[data_modi['Claims']==2]

# downsample Class0 to 500 observations without replacement,
# using resample from sklearn.utils with the seed fixed at 1234
dm_class0_downsampled = resample(dm_class0, replace=False, n_samples=500, random_state=1234)

# combine the downsampled Class0 with classes 1 and 2
dm_downsampled = pd.concat([dm_class0_downsampled, dm_class1, dm_class2])
dm_downsampled['Claims'].value_counts()

1    630
0    500
2    157
Name: Claims, dtype: int64
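The task asks for down-sampling within the training sample, whereas the counts above (630 and 157 for classes 1 and 2) are those of the full data. A minimal sketch of the training-only variant is given below. It runs on synthetic data, and the names `demo`, `train`, `X_tr_down`, and `y_tr_down` are hypothetical; the split and `resample` settings mirror those used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for data_modi (hypothetical values, real column names)
rng = np.random.default_rng(0)
n = 3720
demo = pd.DataFrame({
    "Claims": rng.choice([0, 1, 2], size=n, p=[0.79, 0.17, 0.04]),
    "Experience": rng.integers(0, 47, size=n),
    "CarAge": rng.integers(0, 8, size=n),
})
X_all = demo.drop("Claims", axis=1)
y_all = demo["Claims"]
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=0)

# Re-attach the outcome to the training features (indices are aligned)
train = X_tr.assign(Claims=y_tr)

# Down-sample class 0 inside the training sample only
train_class0 = train[train["Claims"] == 0]
train_rest = train[train["Claims"] != 0]
train_class0_down = resample(train_class0, replace=False, n_samples=500, random_state=1234)
train_down = pd.concat([train_class0_down, train_rest])

X_tr_down = train_down.drop("Claims", axis=1)
y_tr_down = train_down["Claims"]

print(y_tr_down.value_counts().sort_index())
print("rows shared with the test sample:",
      len(train_down.index.intersection(X_te.index)))
```

With this variant no test observation enters the fit, so the report on the original test sample measures out-of-sample performance.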

In [64]:
# estimate MLR on the down-sampled training sample;
# myLR also predicts the outcomes on the original test sample
# and prints the classification report
y_down = dm_downsampled["Claims"]
X_down = dm_downsampled.drop('Claims', axis=1)

LR_down = myLR(X_down, y_down, cutoff=0.5)


              precision    recall  f1-score   support

           0       0.80      0.57      0.67       582
           1       0.19      0.47      0.27       133
           2       0.00      0.00      0.00        29

    accuracy                           0.53       744
   macro avg       0.33      0.35      0.31       744
weighted avg       0.66      0.53      0.57       744


### Task 4 (1 point).
*Compare the results of classification on the original and the down-sampled samples. Provide any **two** considerations.*

#### Give your comments in this chunk

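*Consideration 1*:
After down-sampling, accuracy falls from 0.78 to 0.53 and the recall of class 0 falls from 1.00 to 0.57, while the recall of class 1 rises from 0.00 to 0.47 (with a precision of 0.19). The model no longer assigns every observation to class 0.

*Consideration 2*:
The macro-averaged F1-score improves only slightly (0.29 to 0.31) and the weighted F1-score drops (0.69 to 0.57). Class 2 is still never predicted, so these results give at most weak support to the hypothesis that down-sampling improves prediction performance.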
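Comparing two classification reports is easier when the key metrics sit side by side. One possible approach, sketched here under assumptions, is to request the report as a dictionary and collect selected entries in a table. The label vectors below are small hypothetical examples, and the helper `summarize` is not part of the assignment.

```python
import pandas as pd
from sklearn.metrics import classification_report

# Hypothetical true labels and two sets of predictions (illustration only)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
pred_original = [0] * 10                         # everything assigned to class 0
pred_downsampled = [0, 0, 0, 1, 1, 0, 1, 0, 1, 1]


def summarize(y_true, y_pred):
    """Collect accuracy, macro F1 and per-class recall from a report."""
    rep = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
    return {
        "accuracy": rep["accuracy"],
        "macro_f1": rep["macro avg"]["f1-score"],
        "recall_0": rep["0"]["recall"],
        "recall_1": rep["1"]["recall"],
        "recall_2": rep["2"]["recall"],
    }


comparison = pd.DataFrame({
    "original": summarize(y_true, pred_original),
    "down-sampled": summarize(y_true, pred_downsampled),
})
print(comparison.round(2))
```

Macro-averaged F1 and per-class recall are included because accuracy alone is dominated by the majority class.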
