# Project 1 - Regression
## Forecasting the number of motor insurance claims
### This notebook uses the dataset *freMTPL2freq.csv*

(c) Nuno António 2022 - Rev. 1.0

## Dataset description

- **IDpol**: The policy ID (used to link with the claims dataset).
- **ClaimNb**: Number of claims during the exposure period.
- **Exposure**: The exposure period, in years.
- **Area**: The area code.
- **VehPower**: The power of the car (ordered categorical).
- **VehAge**: The vehicle age, in years.
- **DrivAge**: The driver age, in years (in France, people can drive a car at 18).
- **BonusMalus**: Bonus/malus, between 50 and 350: <100 means bonus, >100 means malus in France.
- **VehBrand**: The car brand (unknown categories).
- **VehGas**: The fuel type, Diesel or Regular.
- **Density**: The population density (number of inhabitants per km²) of the city where the driver lives.
- **Region**: The policy region in France (based on a standard French classification).

For additional information on the dataset, see https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3164764

## Work description

### Overview
<p>You should organize into groups of 3 to 5 students, with each group assuming the role of a consultant. You are asked to develop a model to forecast how many claims each policyholder of a car insurer in France will have in the following year. The insurance company wants to use this model to improve the policies' premiums (pricing).</p>
<p>Employing the CRISP-DM process model, you are expected to define, describe, and explain the model you build. At the same time, you should explain how your model can help the insurance company reach its objectives.</p>

### Questions or additional information
For any additional questions, don't hesitate to get in touch with the instructor. The instructor will also act as the insurance company/project stakeholder.

## Initializations and data loading

In [1]:
# Loading packages
import pandas as pd

In [2]:
# Loading the dataset and visualizing summary statistics
ds = pd.read_csv('freMTPL2freq.csv')
ds.describe(include='all').T

| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| IDpol | 678013 | | | | 2621860.0 | 1641780.0 | 1.0 | 1157950.0 | 2272150.0 | 4046270.0 | 6114330.0 |
| ClaimNb | 678013 | | | | 0.0532468 | 0.240117 | 0.0 | 0.0 | 0.0 | 0.0 | 16.0 |
| Exposure | 678013 | | | | 0.52875 | 0.364442 | 0.00273224 | 0.18 | 0.49 | 0.99 | 2.01 |
| Area | 678013 | 6.0 | C | 191880.0 | | | | | | | |
| VehPower | 678013 | | | | 6.45463 | 2.05091 | 4.0 | 5.0 | 6.0 | 7.0 | 15.0 |
| VehAge | 678013 | | | | 7.04426 | 5.66623 | 0.0 | 2.0 | 6.0 | 11.0 | 100.0 |
| DrivAge | 678013 | | | | 45.4991 | 14.1374 | 18.0 | 34.0 | 44.0 | 55.0 | 100.0 |
| BonusMalus | 678013 | | | | 59.7615 | 15.6367 | 50.0 | 50.0 | 50.0 | 64.0 | 230.0 |
| VehBrand | 678013 | 11.0 | B12 | 166024.0 | | | | | | | |
| VehGas | 678013 | 2.0 | Regular | 345877.0 | | | | | | | |
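
The summary above shows a few values worth checking against the dataset description: `Exposure` reaches 2.01, and both `VehAge` and `DrivAge` reach 100. As a minimal sketch (not part of the original notebook), the helper below counts rows that fall outside a set of expected ranges. The function name `count_out_of_range`, the bounds in `EXPECTED_RANGES`, and the small sample frame are all illustrative assumptions; the bounds in particular would need to be agreed with the stakeholder.

```python
import pandas as pd

# Hypothetical bounds, loosely based on the dataset description.
# They are assumptions for illustration, not rules from the data provider.
EXPECTED_RANGES = {
    "Exposure": (0.0, 1.0),    # assumes exposure is a fraction of one year
    "DrivAge": (18, 100),      # minimum driving age in France is 18
    "BonusMalus": (50, 350),   # range given in the dataset description
}


def count_out_of_range(df, ranges):
    """Return, per column, how many rows fall outside [low, high].

    Missing values are counted as out of range.
    """
    counts = {}
    for column, (low, high) in ranges.items():
        outside = ~df[column].between(low, high)
        counts[column] = int(outside.sum())
    return pd.Series(counts, name="rows_out_of_range")


# Tiny made-up sample so the sketch runs without the CSV file.
sample = pd.DataFrame({
    "Exposure": [0.10, 0.77, 2.01],
    "DrivAge": [55, 17, 46],
    "BonusMalus": [50, 64, 230],
})

report = count_out_of_range(sample, EXPECTED_RANGES)
print(report)
```

On the full data, the same check could be run as `count_out_of_range(ds, EXPECTED_RANGES)`. Whether flagged rows should be capped, dropped, or kept is a modelling decision this sketch does not make.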


In [3]:
# Show top rows
ds.head()

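| | IDpol | ClaimNb | Exposure | Area | VehPower | VehAge | DrivAge | BonusMalus | VehBrand | VehGas | Density | Region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1 | 0.1 | D | 5 | 0 | 55 | 50 | B12 | Regular | 1217 | R82 |
| 1 | 3.0 | 1 | 0.77 | D | 5 | 0 | 55 | 50 | B12 | Regular | 1217 | R82 |
| 2 | 5.0 | 1 | 0.75 | B | 6 | 2 | 52 | 50 | B12 | Diesel | 54 | R22 |
| 3 | 10.0 | 1 | 0.09 | B | 7 | 0 | 46 | 50 | B12 | Diesel | 76 | R72 |
| 4 | 11.0 | 1 | 0.84 | B | 7 | 0 | 46 | 50 | B12 | Diesel | 76 | R72 |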
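
Because policies are observed for different lengths of time, raw claim counts are not directly comparable across rows. One common way to account for this, shown here only as a sketch and not as the notebook's prescribed method, is to look at claims per unit of exposure. The column name `ClaimFreq` is an assumption introduced for illustration, and the frame below re-types the five rows printed above (only the columns needed) so the example runs without the CSV file.

```python
import pandas as pd

# The five rows shown by ds.head(), re-typed with only the columns needed here.
head = pd.DataFrame({
    "IDpol": [1.0, 3.0, 5.0, 10.0, 11.0],
    "ClaimNb": [1, 1, 1, 1, 1],
    "Exposure": [0.10, 0.77, 0.75, 0.09, 0.84],
})

# IDpol appears as a float in the output; cast it to integer since it is an identifier.
head["IDpol"] = head["IDpol"].astype("int64")

# Per-policy claims per unit of exposure (hypothetical column name).
head["ClaimFreq"] = head["ClaimNb"] / head["Exposure"]

# Exposure-weighted frequency for the sample: total claims / total exposure.
overall_freq = head["ClaimNb"].sum() / head["Exposure"].sum()

print(head)
print(f"Overall frequency: {overall_freq:.3f}")
```

The per-policy ratio is very noisy for short exposures (the row with `Exposure` 0.09 yields more than 11 claims per unit of exposure from a single claim), which is why the aggregate divides total claims by total exposure instead of averaging the per-row ratios.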
