# Project 1 - Regression
## Forecasting the number of motor insurance claims
### This notebook uses the dataset *freMTPL2freq.csv*

(c) Nuno António 2022 - Rev. 1.0

## Dataset description

- **IDpol**: The policy ID (used to link with the claims dataset).
- **ClaimNb**: Number of claims during the exposure period.
- **Exposure**: The exposure period.
- **Area**: The area code.
- **VehPower**: The power of the car (ordered categorical).
- **VehAge**: The vehicle age, in years.
- **DrivAge**: The driver age, in years (in France, people can drive a car at 18).
- **BonusMalus**: Bonus/malus, between 50 and 350: <100 means bonus, >100 means malus in France.
- **VehBrand**: The car brand (unknown categories).
- **VehGas**: The fuel type, Diesel or Regular.
- **Density**: The population density (inhabitants per km²) of the city where the driver lives.
- **Region**: The policy regions in France (based on a standard French classification).

For additional information on the dataset check https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3164764

## Work description

### Overview
<p>You should organize into groups of 3 to 5 students, with each group assuming the role of a consultant. You are asked to develop a model that forecasts how many claims each policyholder of a car insurer in France will have in the following year. The insurance company wants to use this model to improve its policy premiums (pricing).</p>
<p>Following the CRISP-DM process model, you are expected to define, describe, and explain the model you build. At the same time, you should explain how your model can help the insurance company reach its objectives.</p>

### Questions or additional information
For any additional questions, don't hesitate to get in touch with the instructor. The instructor will also act as the insurance company/project stakeholder.

## Initializations and data loading

In [7]:
# Install third-party packages (run once; the PyPI name is scikit-learn, not sklearn)
!pip install pandas seaborn category_encoders scikit-learn yellowbrick

# Standard library
import os
import csv

# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Encoding, modeling, and evaluation
import category_encoders as ce
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn import metrics

# Model diagnostics and visualization
import yellowbrick
from yellowbrick.model_selection import RFECV
from yellowbrick.model_selection import LearningCurve
from yellowbrick.regressor import ResidualsPlot
from yellowbrick.regressor import PredictionError


In [8]:
# Load the dataset (summary statistics follow in the next cells)
ds = pd.read_csv('freMTPL2freq.csv')
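As an alternative to converting types after loading, the column types can be declared in the `read_csv` call itself. The sketch below is only illustrative: it reads a three-row in-memory stand-in for the file (rows copied from the first records of the dataset), and the names `sample_csv`, `cat_dtypes`, and `ds_typed` are hypothetical, not part of the original notebook.

```python
import io
import pandas as pd

# Small in-memory stand-in for freMTPL2freq.csv (assumption: same column layout).
sample_csv = io.StringIO(
    "IDpol,ClaimNb,Exposure,Area,VehPower,VehAge,DrivAge,BonusMalus,VehBrand,VehGas,Density,Region\n"
    "1.0,1,0.1,D,5,0,55,50,B12,Regular,1217,R82\n"
    "3.0,1,0.77,D,5,0,55,50,B12,Regular,1217,R82\n"
    "5.0,1,0.75,B,6,2,52,50,B12,Diesel,54,R22\n"
)

# Declare the text columns as categorical at load time.
cat_dtypes = {col: "category" for col in ["Area", "VehBrand", "VehGas", "Region"]}
ds_typed = pd.read_csv(sample_csv, dtype=cat_dtypes)

# IDpol is an identifier, so an integer type is a more natural fit than float.
ds_typed["IDpol"] = ds_typed["IDpol"].astype("int64")

print(ds_typed.dtypes)
```

With the real file, the same `dtype` mapping could be passed to `pd.read_csv('freMTPL2freq.csv', ...)`.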


### Describe the data

In [9]:
ds.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| IDpol | 678013.0 | 2621857.0 | 1641783.0 | 1.0 | 1157951.0 | 2272152.0 | 4046274.0 | 6114330.0 |
| ClaimNb | 678013.0 | 0.05324677 | 0.2401173 | 0.0 | 0.0 | 0.0 | 0.0 | 16.0 |
| Exposure | 678013.0 | 0.5287501 | 0.3644415 | 0.002732 | 0.18 | 0.49 | 0.99 | 2.01 |
| VehPower | 678013.0 | 6.454631 | 2.050906 | 4.0 | 5.0 | 6.0 | 7.0 | 15.0 |
| VehAge | 678013.0 | 7.044265 | 5.666232 | 0.0 | 2.0 | 6.0 | 11.0 | 100.0 |
| DrivAge | 678013.0 | 45.49912 | 14.13744 | 18.0 | 34.0 | 44.0 | 55.0 | 100.0 |
| BonusMalus | 678013.0 | 59.7615 | 15.63666 | 50.0 | 50.0 | 50.0 | 64.0 | 230.0 |
| Density | 678013.0 | 1792.422 | 3958.647 | 1.0 | 92.0 | 393.0 | 1658.0 | 27000.0 |


In [10]:
ds.describe(include='all').T

| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| IDpol | 678013.0 | | | | 2621856.921071 | 1641782.752655 | 1.0 | 1157951.0 | 2272152.0 | 4046274.0 | 6114330.0 |
| ClaimNb | 678013.0 | | | | 0.053247 | 0.240117 | 0.0 | 0.0 | 0.0 | 0.0 | 16.0 |
| Exposure | 678013.0 | | | | 0.52875 | 0.364442 | 0.002732 | 0.18 | 0.49 | 0.99 | 2.01 |
| Area | 678013.0 | 6.0 | C | 191880.0 | | | | | | | |
| VehPower | 678013.0 | | | | 6.454631 | 2.050906 | 4.0 | 5.0 | 6.0 | 7.0 | 15.0 |
| VehAge | 678013.0 | | | | 7.044265 | 5.666232 | 0.0 | 2.0 | 6.0 | 11.0 | 100.0 |
| DrivAge | 678013.0 | | | | 45.499122 | 14.137444 | 18.0 | 34.0 | 44.0 | 55.0 | 100.0 |
| BonusMalus | 678013.0 | | | | 59.761502 | 15.636658 | 50.0 | 50.0 | 50.0 | 64.0 | 230.0 |
| VehBrand | 678013.0 | 11.0 | B12 | 166024.0 | | | | | | | |
| VehGas | 678013.0 | 2.0 | Regular | 345877.0 | | | | | | | |


In [11]:
# Show top rows
ds.head()

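| | IDpol | ClaimNb | Exposure | Area | VehPower | VehAge | DrivAge | BonusMalus | VehBrand | VehGas | Density | Region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1 | 0.1 | D | 5 | 0 | 55 | 50 | B12 | Regular | 1217 | R82 |
| 1 | 3.0 | 1 | 0.77 | D | 5 | 0 | 55 | 50 | B12 | Regular | 1217 | R82 |
| 2 | 5.0 | 1 | 0.75 | B | 6 | 2 | 52 | 50 | B12 | Diesel | 54 | R22 |
| 3 | 10.0 | 1 | 0.09 | B | 7 | 0 | 46 | 50 | B12 | Diesel | 76 | R72 |
| 4 | 11.0 | 1 | 0.84 | B | 7 | 0 | 46 | 50 | B12 | Diesel | 76 | R72 |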
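Because each policy is observed for a different exposure period, raw claim counts are not directly comparable across policies. One common way to put them on the same footing is to divide claims by exposure. The sketch below does this for the five rows shown above; the `ClaimFreq` column and the `portfolio_freq` variable are hypothetical names, and treating `Exposure` as a fraction of a year is an assumption that should be confirmed against the dataset documentation.

```python
import pandas as pd

# The ClaimNb and Exposure values of the five rows returned by ds.head().
head = pd.DataFrame({
    "ClaimNb": [1, 1, 1, 1, 1],
    "Exposure": [0.1, 0.77, 0.75, 0.09, 0.84],
})

# Per-policy claim frequency: claims per unit of exposure.
head["ClaimFreq"] = head["ClaimNb"] / head["Exposure"]

# Portfolio-level frequency: total claims over total exposure,
# which weights each policy by how long it was observed.
portfolio_freq = head["ClaimNb"].sum() / head["Exposure"].sum()

print(head)
print(round(portfolio_freq, 4))
```

The same ratio applied to the full dataset follows from the means reported by `ds.describe()`: 0.0532 claims over 0.5288 units of exposure per policy, or roughly 0.10 claims per unit of exposure.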


In [12]:
ds.dtypes

IDpol         float64
ClaimNb         int64
Exposure      float64
Area           object
VehPower        int64
VehAge          int64
DrivAge         int64
BonusMalus      int64
VehBrand       object
VehGas         object
Density         int64
Region         object
dtype: object

In [13]:
#Need to switch Area, VehBrand, VehGas, Region to categorical variables

### Data quality

In [14]:
#Check for null values
print(ds.isnull().sum())

IDpol         0
ClaimNb       0
Exposure      0
Area          0
VehPower      0
VehAge        0
DrivAge       0
BonusMalus    0
VehBrand      0
VehGas        0
Density       0
Region        0
dtype: int64
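Missing values are only one aspect of data quality. A few further checks follow directly from the dataset description (drivers must be at least 18, BonusMalus lies between 50 and 350, exposure should be positive). The sketch below runs them on a tiny frame whose last row is deliberately invalid and made up for illustration; the names `sample` and `report` are hypothetical.

```python
import pandas as pd

# Three plausible rows plus one deliberately invalid, made-up row (the last one).
sample = pd.DataFrame({
    "IDpol": [1, 3, 5, 5],
    "Exposure": [0.1, 0.77, 0.75, 0.0],
    "DrivAge": [55, 55, 52, 17],
    "BonusMalus": [50, 50, 50, 40],
})

# Count the rows that violate each rule taken from the dataset description.
report = pd.Series({
    "duplicate_rows": sample.duplicated().sum(),
    "duplicate_ids": sample["IDpol"].duplicated().sum(),
    "nonpositive_exposure": (sample["Exposure"] <= 0).sum(),
    "driver_under_18": (sample["DrivAge"] < 18).sum(),
    "bonusmalus_out_of_range": (~sample["BonusMalus"].between(50, 350)).sum(),
})

print(report)
```

On the real data, replacing `sample` with `ds` would apply the same rules to every policy.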


### Data Preparation

In [15]:
# Switch necessary columns to categorical variables 
cat_cols = ['Area', 'VehBrand', 'VehGas', 'Region']
ds[cat_cols] = ds[cat_cols].apply(lambda x:x.astype('category'))
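The dataset description lists VehPower as an ordered categorical, which a plain `category` conversion would not capture. A minimal sketch of one way to encode the ordering with pandas follows; the values are illustrative, and `veh_power`, `levels`, and `power_dtype` are hypothetical names.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Illustrative VehPower values (the real column ranges from 4 to 15).
veh_power = pd.Series([5, 5, 6, 7, 7, 4, 15], name="VehPower")

# Build an ordered categorical dtype from the sorted distinct levels.
levels = sorted(veh_power.unique())
power_dtype = CategoricalDtype(categories=levels, ordered=True)
veh_power_cat = veh_power.astype(power_dtype)

# Ordered categoricals support min/max and comparisons against a level.
print(veh_power_cat.min(), veh_power_cat.max())
print((veh_power_cat > 5).sum())
```

Whether to model VehPower as ordered levels, as a number, or in bins remains a modeling decision; this only shows how the ordering can be preserved.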

In [16]:
# List the categorical columns and print the frequency of each level
categorical = ds.select_dtypes(include=['category']).columns.tolist()
for var in categorical:
    print(var, ":\n", ds[var].value_counts(), sep="")

Area:
C    191880
D    151596
E    137167
A    103957
B     75459
F     17954
Name: Area, dtype: int64
VehBrand:
B12    166024
B1     162736
B2     159861
B3      53395
B5      34753
B6      28548
B4      25179
B10     17707
B11     13585
B13     12178
B14      4047
Name: VehBrand, dtype: int64
VehGas:
Regular    345877
Diesel     332136
Name: VehGas, dtype: int64
Region:
R24    160601
R82     84752
R93     79315
R11     69791
R53     42122
R52     38751
R91     35805
R72     31329
R31     27285
R54     19046
R73     17141
R41     12990
R25     10893
R26     10492
R23      8784
R22      7994
R83      5287
R74      4567
R94      4516
R21      3026
R42      2200
R43      1326
Name: Region, dtype: int64
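Relative frequencies are often easier to compare than raw counts. The sketch below rebuilds the Area counts printed above and converts them to shares; on the full frame, `ds["Area"].value_counts(normalize=True)` should give the same proportions directly. The names `area_counts` and `area_share` are hypothetical.

```python
import pandas as pd

# Area counts copied from the value_counts() output above.
area_counts = pd.Series(
    {"C": 191880, "D": 151596, "E": 137167, "A": 103957, "B": 75459, "F": 17954},
    name="Area",
)

# Share of policies per area code.
area_share = area_counts / area_counts.sum()

print(area_share.round(3))
```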


### Exploratory Data Analysis 

In [17]:
!pip install pandas-profiling
from pandas_profiling import ProfileReport




In [18]:
profile = ProfileReport(ds, title='Pandas Profiling Report - Motor Insurance Claims Dataset', explorative=True)

In [19]:
#profile.to_file('Pandas Profiling Report - Motor Insurance Claims Dataset')

#### Findings from Profiling:
- Numerical variables: 8, Categorical: 4, Total=12
- No missing values, No duplicate rows

**Variables**

- ClaimNb:
  - 11 distinct values
  - About 95% of policies have 0 claims
  - 95th percentile 1 (4.7% of observations), max=16, IQR = 0
  - no negative numbers
  - Mean = 0.05 approximately, std_dev = 0.24, skew = 5.59, kurt= -0.65
- Exposure:
  - 187 distinct values, no negative values, no zeros
  - min= 0.0027, max = 2.01, mean= 0.528, 95th percentile = 1, IQR = 0.81, Skew = 0.08, Kurt=-1.52, std_dev = 0.36
  -  most common value = 1 (24.8% of observations)
- Area:
  - 6 distinct values
  - highest frequency 'C' (28.3%), lowest frequency 'F' (2.6%)
- VehPower:
  - **VehPower is an ordinal categorical variable but is stored as integers; consider how that affects modeling, and consider converting it to a categorical type or binning it**
- VehAge:
  - 78 distinct values
  - no negative numbers 
  - Mean = 7.04 approximately, std_dev = 5.66, skew = 1.14, kurt= 6.52
  - min= 0.0, max = 100, 95th percentile = 17, IQR = 9
  -  most common value = 1 (10.5% of observations)
  -  consider removing outliers
- DrivAge:
  - 83 distinct values
  - min=18, max=100, mean=45.49, 5th percentile=25, 95th percentile, std_dev= 14.13
  - skew= 0.43, kurt=-0.34
- BonusMalus:
  - Strongly correlated with DrivAge, which makes sense: older drivers have had more time to accumulate bonus
  - Note the extreme values
  - 5th percentile = 50, 95th percentile 95, min=50, max=230
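The shape statistics in these findings are easy to recompute with pandas, which is a cheap sanity check: for any distribution, excess kurtosis cannot fall below skew² − 2, a bound that the ClaimNb pair reported above (skew = 5.59, kurt = -0.65) does not satisfy. The sketch below uses a small made-up, zero-heavy sample rather than the real column; `claims`, `zero_share`, `skew`, and `kurt` are hypothetical names.

```python
import pandas as pd

# Made-up zero-heavy count sample: 95 zeros, four ones, one two.
claims = pd.Series([0] * 95 + [1] * 4 + [2] * 1, name="ClaimNb")

# Share of observations with no claim.
zero_share = (claims == 0).mean()

# pandas reports bias-corrected skewness and excess kurtosis (normal = 0).
skew = claims.skew()
kurt = claims.kurt()

print(f"zero share = {zero_share:.2f}, skew = {skew:.2f}, excess kurtosis = {kurt:.2f}")
```

Running `ds["ClaimNb"].skew()` and `ds["ClaimNb"].kurt()` on the full dataset would show which of the two reported figures needs correcting.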

In [20]:
variables = list(ds.columns[1:])
variables

['ClaimNb',
 'Exposure',
 'Area',
 'VehPower',
 'VehAge',
 'DrivAge',
 'BonusMalus',
 'VehBrand',
 'VehGas',
 'Density',
 'Region']

In [21]:
ds1 = ds.drop('IDpol', axis=1)

In [22]:
numerical=ds1.select_dtypes(include=[np.number]).columns.tolist()
numerical

['ClaimNb',
 'Exposure',
 'VehPower',
 'VehAge',
 'DrivAge',
 'BonusMalus',
 'Density']

In [23]:
# Histograms on all numeric variables
#ds1[numerical].hist(bins=20, figsize=(7,7), layout=(5,5), xlabelsize=8, ylabelsize=8 )
ds1[numerical].hist()


[Figure: histograms of ClaimNb, Exposure, VehPower, VehAge, DrivAge, BonusMalus, and Density]

In [24]:
sns.boxplot(x=ds1['VehAge'])

[Figure: box plot of VehAge]
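The box plot flags points beyond 1.5 × IQR from the quartiles, and the same rule can be applied numerically when deciding which VehAge values to treat as outliers. The sketch below is one possible approach, not a prescribed one; it uses the five-number summary of VehAge from `describe()` as stand-in data, and `iqr_fences` is a hypothetical helper.

```python
import pandas as pd


def iqr_fences(values: pd.Series, k: float = 1.5) -> tuple:
    """Return the (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Stand-in data: min, 25%, 50%, 75%, and max of VehAge from describe().
veh_age = pd.Series([0, 2, 6, 11, 100], name="VehAge")

lower, upper = iqr_fences(veh_age)

# Values outside the fences are candidates for removal or capping.
outliers = veh_age[(veh_age < lower) | (veh_age > upper)]

print(lower, upper)
print(outliers.tolist())
```

With quartiles of 2 and 11, the upper fence is 24.5 years, so the maximum of 100 falls well outside it. Whether such vehicles are data errors or genuine vintage cars is a judgment call for the data preparation step.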

In [28]:
x=[1,2,4,5,6,7,8,9,0]
plt.plot(x)
plt.show()


Matplotlib is currently using agg, which is a non-GUI backend, so cannot show the figure.

