**Libraries**

In [1]:
import pandas as pd
import numpy as np
import arff
from ydata_profiling import ProfileReport

## Data preparation

**Import the data**

In [45]:
# Claim-frequency table: one row per policy
data_freq = arff.load('data/freMTPL2freq.arff')
df_freq = pd.DataFrame(data_freq, columns=["idpol", "claimnb", "exposure", "area", "vehpower",
"vehage","drivage", "bonusmalus", "vehbrand", "vehgas", "density", "region"])
# Claim-severity table: one row per claim
data_sev = arff.load('data/freMTPL2sev.arff')
df_sev = pd.DataFrame(data_sev, columns=["idpol", "claimamount"])

In [6]:
# TODO: mtpl2 #3 - Get a first overview of the dataset.

**Generate an automated data report**

In [8]:
profile_sev = ProfileReport(df_sev, title="Profiling Report df_sev")
profile_freq = ProfileReport(df_freq, title="Profiling Report df_freq")

# Save the profile reports as HTML

profile_sev.to_file("data/profiling-df_sev.html")
profile_freq.to_file("data/profiling-df_freq.html")

#### First findings from the profile report

First critical findings in df_sev:

* claimamount is strongly right-skewed.
* There are 241 duplicate rows.
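
Both findings can be cross-checked directly in pandas without the report. The following is a minimal sketch on a hypothetical toy frame (`toy_sev` and its values are illustrative, not taken from the dataset). Note that `duplicated()` counts every repeat of an earlier row, which may not match how the profiling report defines its duplicate count.

```python
import pandas as pd

# Hypothetical toy stand-in for df_sev (values are illustrative only).
toy_sev = pd.DataFrame({
    "idpol": [1.0, 1.0, 2.0, 3.0, 3.0, 4.0],
    "claimamount": [100.0, 100.0, 250.0, 80.0, 80.0, 50000.0],
})

# A positive sample skewness indicates a long right tail.
skewness = toy_sev["claimamount"].skew()

# Count rows that repeat an earlier row in every column.
n_duplicates = int(toy_sev.duplicated().sum())

print(f"skewness: {skewness:.2f}, duplicate rows: {n_duplicates}")
```

On the real data, the same two calls would be applied to `df_sev` instead of `toy_sev`.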

First critical findings in df_freq:

* drivage and bonusmalus are highly correlated with each other (High correlation).
* density and area are highly correlated with each other (High correlation).
* idpol has only unique values (Unique).
* claimnb has 643953 (95.0%) zeros (Zeros).
* vehage has 57739 (8.5%) zeros (Zeros).
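
The zero share and a numeric correlation can be reproduced by hand as a sanity check. The sketch below uses a hypothetical toy frame (`toy_freq` and its values are made up for illustration) and Spearman rank correlation. Which correlation measure the profiling report uses for its alert is not stated here, so the numbers need not match the report.

```python
import pandas as pd

# Hypothetical toy stand-in for df_freq (values are illustrative only).
toy_freq = pd.DataFrame({
    "claimnb":    [0, 0, 0, 1, 0, 0, 2, 0],
    "drivage":    [18, 22, 30, 41, 52, 60, 70, 85],
    "bonusmalus": [100, 95, 80, 68, 60, 55, 50, 50],
})

# Share of policies without any claim.
zero_share = (toy_freq["claimnb"] == 0).mean()

# Spearman rank correlation between driver age and bonus-malus level.
corr_matrix = toy_freq[["drivage", "bonusmalus"]].corr(method="spearman")
rho = corr_matrix.loc["drivage", "bonusmalus"]

print(f"zero share: {zero_share:.1%}, Spearman rho: {rho:.2f}")
```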

#### Overview of dimensions

In [15]:
print(f"df_sev has {df_sev.shape[0]} rows and {df_sev.shape[1]} columns.")

df_sev has 26639 rows and 2 columns.


In [39]:
print(f"df_freq has {df_freq.shape[0]} rows and {df_freq.shape[1]} columns.")

df_freq has 678013 rows and 12 columns.


#### Overview of data types

In [46]:
# info() prints its summary itself and returns None, so no print() is needed
df_sev.info()
df_freq.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26639 entries, 0 to 26638
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   idpol        26639 non-null  float64
 1   claimamount  26639 non-null  float64
dtypes: float64(2)
memory usage: 416.4 KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 678013 entries, 0 to 678012
Data columns (total 12 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   idpol       678013 non-null  float64
 1   claimnb     678013 non-null  float64
 2   exposure    678013 non-null  float64
 3   area        678013 non-null  object 
 4   vehpower    678013 non-null  float64
 5   vehage      678013 non-null  float64
 6   drivage     678013 non-null  float64
 7   bonusmalus  678013 non-null  float64
 8   vehbrand    678013 non-null  object 
 9   vehgas      678013 non-null  object 
 10  density     678013 non-null  float64
 11  region      678013 n

Overview of **df_sev**:

| Column    | Feature name | Content | Data type (actual) | Data type (target) |
|-----------|--------------|---------|--------------------|--------------------|
| 0         |  idpol       | Policy ID | float64 | $\textcolor{red}{categorical}$ |
| 1         |  claimamount | Claim amount | float64 | float64 |

Overview of **df_freq**:

| Column    | Feature name | Content | Data type (actual) | Data type (target) |
|-----------|--------------|---------|--------------------|--------------------|
| 0         |  idpol      | Policy ID | float64 | $\textcolor{red}{categorical}$ |
| 1         |  claimnb    | Number of claims | float64 | $\textcolor{red}{integer}$ |
| 2         |  exposure   | Length of the insurance period | float64 | float64 |
| 3         |  area       | Area code | object | $\textcolor{red}{categorical}$ |
| 4         |  vehpower   | Vehicle power | float64 | $\textcolor{red}{integer}$ |
| 5         |  vehage     | Vehicle age | float64 | $\textcolor{red}{integer}$ |
| 6         |  drivage    | Age of the policyholder | float64 | $\textcolor{red}{integer}$ |
| 7         |  bonusmalus | Bonus-malus level (no-claims discount) | float64 | $\textcolor{red}{integer}$ |
| 8         |  vehbrand   | Vehicle brand | object | $\textcolor{red}{categorical}$ |
| 9         |  vehgas     | Vehicle fuel type | object | $\textcolor{red}{categorical}$ |
| 10        |  density    | Population density at the place of residence | float64 | $\textcolor{red}{integer}$ |
| 11        |  region     | Region of residence | object | $\textcolor{red}{categorical}$ |
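
The target types in the table could be applied with `DataFrame.astype` and a mapping from column name to dtype. The following is a minimal sketch on a hypothetical toy frame that contains only a subset of the columns; `toy_freq` and `target_dtypes` are illustrative names, and the cast to `int64` assumes the float columns hold whole numbers without missing values.

```python
import pandas as pd

# Hypothetical toy stand-in for df_freq with a subset of its columns.
toy_freq = pd.DataFrame({
    "idpol": [1.0, 3.0, 5.0],
    "claimnb": [1.0, 1.0, 1.0],
    "exposure": [0.10, 0.77, 0.75],
    "area": ["D", "D", "B"],
    "vehpower": [5.0, 5.0, 6.0],
})

# Target dtypes following the table above (subset only).
target_dtypes = {
    "claimnb": "int64",
    "vehpower": "int64",
    "area": "category",
}
toy_freq = toy_freq.astype(target_dtypes)

# Cast the ID to integer first so the categories read 1, 3, 5 instead of 1.0, 3.0, 5.0.
toy_freq["idpol"] = toy_freq["idpol"].astype("int64").astype("category")

print(toy_freq.dtypes)
```

Columns that are not listed in the mapping, such as `exposure`, keep their current dtype.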

In [3]:
df_freq.head(5)

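|   | IDpol | ClaimNb | Exposure | Area | VehPower | VehAge | DrivAge | BonusMalus | VehBrand | VehGas | Density | Region |
|---|-------|---------|----------|------|----------|--------|---------|------------|----------|--------|---------|--------|
| 0 | 1.0 | 1.0 | 0.1 | 'D' | 5.0 | 0.0 | 55.0 | 50.0 | 'B12' | Regular | 1217.0 | 'R82' |
| 1 | 3.0 | 1.0 | 0.77 | 'D' | 5.0 | 0.0 | 55.0 | 50.0 | 'B12' | Regular | 1217.0 | 'R82' |
| 2 | 5.0 | 1.0 | 0.75 | 'B' | 6.0 | 2.0 | 52.0 | 50.0 | 'B12' | Diesel | 54.0 | 'R22' |
| 3 | 10.0 | 1.0 | 0.09 | 'B' | 7.0 | 0.0 | 46.0 | 50.0 | 'B12' | Diesel | 76.0 | 'R72' |
| 4 | 11.0 | 1.0 | 0.84 | 'B' | 7.0 | 0.0 | 46.0 | 50.0 | 'B12' | Diesel | 76.0 | 'R72' |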
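
The preview shows that some text values, such as 'D', 'B12' and 'R82', still carry literal single quotes from the ARFF file, while others (Regular, Diesel) do not. One possible cleanup is to strip the quote character from both ends of every text column. The sketch below assumes a hypothetical toy frame `toy_freq` with the lowercase column names assigned during import.

```python
import pandas as pd

# Hypothetical toy stand-in for the text columns of df_freq.
toy_freq = pd.DataFrame({
    "area": ["'D'", "'B'"],
    "vehbrand": ["'B12'", "'B12'"],
    "vehgas": ["Regular", "Diesel"],
    "region": ["'R82'", "'R22'"],
})

text_columns = ["area", "vehbrand", "vehgas", "region"]
for column in text_columns:
    # Remove leading and trailing single quotes; unquoted values stay unchanged.
    toy_freq[column] = toy_freq[column].str.strip("'")

print(toy_freq)
```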


In [73]:
# TODO: adjust the data types

In [76]:
# TODO: remove the quotation marks

In [78]:
# TODO: before merging: how to handle duplicates?
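
One possible approach to this open question, sketched under assumptions: identical rows in the severity table may be separate claims of the same amount rather than data errors, so this sketch keeps them and sums all claim amounts per policy before merging. Whether that is the right treatment is not decided in this notebook. The frames `toy_freq` and `toy_sev` and their values are hypothetical.

```python
import pandas as pd

# Hypothetical toy stand-ins for df_freq and df_sev.
toy_freq = pd.DataFrame({"idpol": [1, 2, 3], "claimnb": [2, 0, 1]})
toy_sev = pd.DataFrame({
    "idpol": [1, 1, 3, 3],
    "claimamount": [100.0, 250.0, 80.0, 80.0],
})

# Aggregate the severity table to one row per policy.
sev_per_policy = toy_sev.groupby("idpol", as_index=False)["claimamount"].sum()

# Left merge keeps policies without claims; validate guards against
# accidental row multiplication from duplicate keys.
merged = toy_freq.merge(sev_per_policy, on="idpol", how="left", validate="one_to_one")

# Policies without a severity record get a total claim amount of 0.
merged["claimamount"] = merged["claimamount"].fillna(0.0)

print(merged)
```

If the repeated rows turn out to be genuine duplicates instead, `toy_sev.drop_duplicates()` before the aggregation would be the alternative.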

In [None]:
#TODO: 