<a href="https://colab.research.google.com/github/aderdouri/ActuarialThesis/blob/master/Notebooks/myAutoPortfolioEncoding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clone the Git repository containing the BASEAUTO dataset.

In [None]:
!rm -rf ActuarialThesis
!git clone https://github.com/aderdouri/ActuarialThesis.git
%ls -ltr ActuarialThesis

Cloning into 'ActuarialThesis'...
remote: Enumerating objects: 857, done.
remote: Counting objects: 100% (140/140), done.
remote: Compressing objects: 100% (103/103), done.
remote: Total 857 (delta 66), reused 98 (delta 36), pack-reused 717
Receiving objects: 100% (857/857), 123.09 MiB | 23.82 MiB/s, done.
Resolving deltas: 100% (356/356), done.
total 96
drwxr-xr-x 2 root root  4096 Oct 26 06:32 AllstateClaimPredictionChallenge/
drwxr-xr-x 2 root root  4096 Oct 26 06:32 AllstateClaimsSeverity/
drwxr-xr-x 2 root root  4096 Oct 26 06:32 Data/
drwxr-xr-x 2 root root  4096 Oct 26 06:32 EMTboost/
drwxr-xr-x 2 root root  4096 Oct 26 06:32 FrenchMotorThirdPartyLiabilityClaims/
drwxr-xr-x 2 root root  4096 Oct 26 06:32 Models/
-rw-r--r-- 1 root root    54 Oct 26 06:32 README.md
drwxr-xr-x 2 root root  4096 Oct 26 06:32 Plots/
drwxr-xr-x 2 root root  4096 Oct 26 06:32 Notebooks/
drwxr

# Import the core Python libraries and the sklearn tools used to preprocess the dataset.

In [None]:
# The basics
import pandas as pd
import numpy as np

# Data preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.model_selection import train_test_split

In [None]:
df = pd.read_csv('ActuarialThesis/Data/BASEAUTO.csv')

In [None]:
df.columns

Index(['PERMIS', 'ACV', 'SEX', 'STATUT', 'CSP', 'USAGE', 'AGECOND', 'K8000',
       'RM', 'CAR', 'CLA', 'ALI', 'ENE', 'VIT', 'SEGM', 'CHARGE', 'GARAGE'],
      dtype='object')

## Encoding the variables of our dataset

The two variables `PERMIS` and `AGECOND` are strongly correlated, so I drop `PERMIS`.
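The correlation claim can be checked before dropping the column. The sketch below uses made-up toy values and assumes `PERMIS` is numeric (for instance licence seniority); the 0.9 threshold is an arbitrary illustration, not a value taken from this notebook.

```python
import pandas as pd

# Toy data (hypothetical values): licence seniority grows almost linearly with age.
toy = pd.DataFrame({
    "AGECOND": [20, 25, 33, 41, 56, 63, 72],
    "PERMIS": [2, 6, 14, 22, 37, 44, 52],
})

# Pearson correlation between the two columns (the default method of Series.corr).
corr = toy["AGECOND"].corr(toy["PERMIS"])

# Hypothetical decision rule: drop one of the two when |corr| exceeds 0.9.
drop_permis = abs(corr) > 0.9
print(f"corr={corr:.3f}, drop PERMIS: {drop_permis}")
```

On the real data the same call would be `df['AGECOND'].corr(df['PERMIS'])`, run before the `drop`.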


## The `PERMIS` variable

In [None]:
df.drop('PERMIS', axis=1, inplace=True)

## The `USAGE` variable

The `USAGE` variable is clearly categorical, so we convert it accordingly.

In [None]:
df['USAGE'].unique()

array([2, 1, 3, 4])

In [None]:
df['USAGE'] = df['USAGE'].astype(object)

## The `GARAGE` variable

The `GARAGE` variable is clearly categorical, so we convert it accordingly.

In [None]:
df['GARAGE'].unique()

array([3, 2, 1])

In [None]:
df['GARAGE'] = df['GARAGE'].astype(object)

In [None]:
df['GARAGE'].unique()

array([3, 2, 1], dtype=object)

In [None]:
df['ENE'].unique()

array(['ES', 'GO', 'EL'], dtype=object)

## The `CSP` variable

In [None]:
df['CSP'].unique()

array([50, 26, 37, 55,  1, 48, 60, 42,  2, 46, 47, 40, 77, 66,  3, 49, 91,
       74, 20, 22,  6, 21,  7, 65])

In [None]:
100*df['CSP'].value_counts(normalize=True)

50    57.650534
60    12.385671
55     8.955793
1      5.297256
48     3.439405
42     2.743902
26     2.620046
37     1.991235
46     1.876905
66     1.591082
49     0.409680
6      0.181021
47     0.152439
2      0.133384
40     0.123857
77     0.114329
22     0.095274
20     0.085747
3      0.066692
21     0.038110
91     0.019055
74     0.009527
7      0.009527
65     0.009527
Name: CSP, dtype: float64

More than 57% of the observations have a `CSP` equal to 50. Either we drop this variable, or we transform it as follows: we keep the categories 50, 60, 55 and 1 and group all the other categories into a single level labelled `AUTRE` (which becomes the column `CSP_AUTRE` after encoding). We then apply one-hot encoding to this variable. This grouping can be revisited and refined if the `CSP` feature turns out to be important during training.

In [None]:
# Transform the CSP variable: keep the four most frequent levels, group the rest
df['CSP'] = df['CSP'].apply(lambda CSP : str(CSP) if CSP in [50, 60, 55, 1] else 'AUTRE')

In [None]:
df['CSP'].unique()

array(['50', 'AUTRE', '55', '1', '60'], dtype=object)
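The grouping above hard-codes the levels to keep. A frequency-based variant is sketched below; `lump_rare` and its `min_share` parameter are hypothetical names introduced for illustration, and the toy series is made up. Going by the shares printed above, a 5% cut-off would keep 50, 60, 55 and 1, which matches the manual choice.

```python
import pandas as pd

def lump_rare(series, min_share=0.05, other="AUTRE"):
    """Keep levels whose relative frequency is at least min_share; relabel the rest."""
    labels = series.astype(str)                      # work on string labels throughout
    shares = labels.value_counts(normalize=True)     # relative frequency of each level
    keep = shares[shares >= min_share].index
    return labels.where(labels.isin(keep), other)

# Toy series: shares are 60%, 20%, 10%, 5% and 5%.
toy_csp = pd.Series([50] * 12 + [60] * 4 + [55] * 2 + [1] * 1 + [7] * 1)
lumped = lump_rare(toy_csp, min_share=0.08)
print(lumped.value_counts())
```

The same helper could be applied to other skewed categorical columns, with a cut-off chosen per variable.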

## The `CAR` variable

In [None]:
100*df['CAR'].value_counts(normalize=True)

BER      71.036585
CTE       5.173399
BRK       5.020960
MSPHF     4.373095
TT        4.249238
CPE       3.687119
CAB       3.029726
MSPF      2.924924
BUS       0.504954
Name: CAR, dtype: float64

More than 70% of the observations have `CAR` equal to `BER` (berline, i.e. sedan). Either we drop this variable, or we transform it as follows: we keep the categories `BER`, `CTE` and `BRK` and group the other categories into a single level labelled `AUTRE` (which becomes the column `CAR_AUTRE` after encoding).

In [None]:
df['CAR'] = df['CAR'].apply(lambda CAR : CAR if CAR in ['BER', 'CTE', 'BRK'] else 'AUTRE')

In [None]:
df.head()

Unnamed: 0,ACV,SEX,STATUT,CSP,USAGE,AGECOND,K8000,RM,CAR,CLA,ALI,ENE,VIT,SEGM,CHARGE,GARAGE
0,10,F,C,50,2,40,N,64,BER,A,CAR,ES,130-140,A,0.0,3
1,10,F,A,50,1,63,N,50,BER,A,CAR,ES,001-130,0,0.0,3
2,10,F,C,AUTRE,2,20,N,95,BER,A,CAR,ES,001-130,0,0.0,3
3,10,F,A,50,1,56,N,50,BER,A,CAR,ES,001-130,0,0.0,3
4,10,F,A,50,1,29,N,95,BER,A,CAR,ES,140-150,B,0.0,3


We apply no further transformation at this stage and save the dataset as it is.

In [None]:
df.to_csv('ActuarialThesis/Data/transformedBASEAUTO.csv', index=False)

In [None]:
# Reload the dataset that was just saved.
transformed_df = pd.read_csv('ActuarialThesis/Data/transformedBASEAUTO.csv')
transformed_df.head()

Unnamed: 0,ACV,SEX,STATUT,CSP,USAGE,AGECOND,K8000,RM,CAR,CLA,ALI,ENE,VIT,SEGM,CHARGE,GARAGE
0,10,F,C,50,2,40,N,64,BER,A,CAR,ES,130-140,A,0.0,3
1,10,F,A,50,1,63,N,50,BER,A,CAR,ES,001-130,0,0.0,3
2,10,F,C,AUTRE,2,20,N,95,BER,A,CAR,ES,001-130,0,0.0,3
3,10,F,A,50,1,56,N,50,BER,A,CAR,ES,001-130,0,0.0,3
4,10,F,A,50,1,29,N,95,BER,A,CAR,ES,140-150,B,0.0,3


In [None]:
!ls -ltr ActuarialThesis/Data

total 9844
-rw-r--r-- 1 root root  562212 Oct 26 06:32 BASEAUTO.csv
-rw-r--r-- 1 root root  201188 Oct 26 06:32 df_models.csv
-rw-r--r-- 1 root root 3263186 Oct 26 06:32 encodedBASEAUTO.csv
-rw-r--r-- 1 root root 3273326 Oct 26 06:32 encodedCategoricalBASEAUTO.csv
-rw-r--r-- 1 root root 2236112 Oct 26 06:32 encodedTDboostEMTDboost.csv
-rw-r--r-- 1 root root  529386 Oct 26 06:33 transformedBASEAUTO.csv


In [None]:
transformed_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10496 entries, 0 to 10495
Data columns (total 16 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   ACV      10496 non-null  int64  
 1   SEX      10496 non-null  object 
 2   STATUT   10496 non-null  object 
 3   CSP      10496 non-null  object 
 4   USAGE    10496 non-null  int64  
 5   AGECOND  10496 non-null  int64  
 6   K8000    10496 non-null  object 
 7   RM       10496 non-null  int64  
 8   CAR      10496 non-null  object 
 9   CLA      10496 non-null  object 
 10  ALI      10496 non-null  object 
 11  ENE      10496 non-null  object 
 12  VIT      10496 non-null  object 
 13  SEGM     10496 non-null  object 
 14  CHARGE   10496 non-null  float64
 15  GARAGE   10496 non-null  int64  
dtypes: float64(1), int64(5), object(10)
memory usage: 1.3+ MB
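Note that `USAGE` and `GARAGE` are reported as `int64` in the reloaded frame: a CSV file stores no dtype information, so the `astype(object)` conversions are not preserved and pandas infers the types again on read. A minimal sketch of the effect, on a made-up in-memory frame, and of one way to force string labels with the `dtype` argument of `read_csv`:

```python
import io
import pandas as pd

# Toy frame: integer codes stored as object, as done for USAGE above.
toy = pd.DataFrame({"USAGE": [2, 1, 3], "CSP": ["50", "AUTRE", "1"]})
toy["USAGE"] = toy["USAGE"].astype(object)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

buffer.seek(0)
reloaded = pd.read_csv(buffer)                      # dtypes are inferred again
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={"USAGE": str})   # keep USAGE as string labels

print(reloaded["USAGE"].dtype)
print(typed["USAGE"].tolist())
```

This only matters if the reloaded file, rather than `df`, is fed to the encoder later on.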


## Transforming `AGECOND` and `ACV` into categorical variables

We can also consider an alternative encoding for the variables `AGECOND` and `ACV`.
* The `AGECOND` feature is split into the groups `≤21, (21,25], (25,35], (35,45], (45,55], (55,70], >70`.
* Likewise, the `ACV` feature is split into the groups `[0,1], (1,4], (4,9], ≥10`.

We distinguish the following two cases and plan to train our models on both transformations so that they can be compared:
- Case 1: `AGECOND` and `ACV` are treated as continuous variables.
- Case 2: `AGECOND` and `ACV` are treated as categorical variables.

In [None]:
def AGETransform(age):
  if age <=21:
    return '<21'
  elif age >21 and age <=25:
    return '21_25'
  elif age >25 and age <=35:
    return '25_35'
  elif age >35 and age <=45:
    return '35_45'
  elif age >45 and age <=55:
    return '45_55'
  elif age >55 and age <=70:
    return '55_70'
  elif age >70:
    return '>70'

In [None]:
df['AGECOND_CAT'] = df['AGECOND'].apply(AGETransform)

In [None]:
df['AGECOND_CAT'].unique()

array(['35_45', '55_70', '<21', '25_35', '45_55', '>70', '21_25'],
      dtype=object)

In [None]:
def ACVTransform(acv):
  if acv <=1:
    return '0_1'
  elif acv >1 and acv <=4:
    return '2_4'
  elif acv >4 and acv <=9:
    return '5_9'
  elif acv >=10:
    return '>=10'

In [None]:
df['ACV_CAT'] = df['ACV'].apply(ACVTransform)

In [None]:
df['ACV_CAT'].unique()

array(['>=10', '5_9', '2_4', '0_1'], dtype=object)
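The two hand-written functions `AGETransform` and `ACVTransform` can also be expressed with `pandas.cut`. The sketch below uses made-up toy values; the bin edges and labels mirror the `if`/`elif` chains above. For the integer values present in this dataset the result should be the same, but a non-integer `ACV` between 9 and 10 would fall into `'>=10'` here, whereas `ACVTransform` returns `None` for it.

```python
import numpy as np
import pandas as pd

# Right-closed intervals (a, b], matching the "<=" comparisons of the functions above.
age_bins = [-np.inf, 21, 25, 35, 45, 55, 70, np.inf]
age_labels = ['<21', '21_25', '25_35', '35_45', '45_55', '55_70', '>70']
acv_bins = [-np.inf, 1, 4, 9, np.inf]
acv_labels = ['0_1', '2_4', '5_9', '>=10']

toy = pd.DataFrame({
    'AGECOND': [20, 21, 22, 40, 70, 71],
    'ACV': [0, 1, 2, 4, 9, 10],
})

# pd.cut returns a categorical; cast to str to get plain labels like the original code.
toy['AGECOND_CAT'] = pd.cut(toy['AGECOND'], bins=age_bins, labels=age_labels).astype(str)
toy['ACV_CAT'] = pd.cut(toy['ACV'], bins=acv_bins, labels=acv_labels).astype(str)
print(toy)
```

Keeping the edges in a list makes it easier to try other groupings when comparing models.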

In [None]:
cat_cols = ['CAR', 'USAGE', 'CLA', 'ALI', 'ENE', 'VIT', 'SEGM', 'GARAGE', 'CSP', 'AGECOND_CAT', 'ACV_CAT']

In [None]:
bin_cols = ['SEX', 'STATUT', 'K8000']

In [None]:
num_cols = ['AGECOND', 'ACV', 'RM']

In [None]:
# Separating target from dataset
Y = df['CHARGE']
X = df.drop('CHARGE', axis=1)

# Mapping and encoding all the variables.

In [None]:
newX = X.copy()
newX.head()

Unnamed: 0,ACV,SEX,STATUT,CSP,USAGE,AGECOND,K8000,RM,CAR,CLA,ALI,ENE,VIT,SEGM,GARAGE,AGECOND_CAT,ACV_CAT
0,10,F,C,50,2,40,N,64,BER,A,CAR,ES,130-140,A,3,35_45,>=10
1,10,F,A,50,1,63,N,50,BER,A,CAR,ES,001-130,0,3,55_70,>=10
2,10,F,C,AUTRE,2,20,N,95,BER,A,CAR,ES,001-130,0,3,<21,>=10
3,10,F,A,50,1,56,N,50,BER,A,CAR,ES,001-130,0,3,55_70,>=10
4,10,F,A,50,1,29,N,95,BER,A,CAR,ES,140-150,B,3,25_35,>=10


In [None]:
# Mapping SEX
sex_map = {'M': 0,
           'F': 1}
newX.SEX = newX.SEX.map(sex_map)

In [None]:
# Mapping STATUT
statut_map = {'A': 0,
              'C': 1}
newX.STATUT = newX.STATUT.map(statut_map)

In [None]:
# Mapping K8000
k8000_map = {'N': 0,
             'O': 1}
newX.K8000 = newX.K8000.map(k8000_map)

In [None]:
newX.head()

Unnamed: 0,ACV,SEX,STATUT,CSP,USAGE,AGECOND,K8000,RM,CAR,CLA,ALI,ENE,VIT,SEGM,GARAGE,AGECOND_CAT,ACV_CAT
0,10,1,1,50,2,40,0,64,BER,A,CAR,ES,130-140,A,3,35_45,>=10
1,10,1,0,50,1,63,0,50,BER,A,CAR,ES,001-130,0,3,55_70,>=10
2,10,1,1,AUTRE,2,20,0,95,BER,A,CAR,ES,001-130,0,3,<21,>=10
3,10,1,0,50,1,56,0,50,BER,A,CAR,ES,001-130,0,3,55_70,>=10
4,10,1,0,50,1,29,0,95,BER,A,CAR,ES,140-150,B,3,25_35,>=10
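`Series.map` with a dictionary silently returns `NaN` for any label missing from the dictionary. A defensive wrapper is sketched below; `map_binary` is a hypothetical helper, not part of the notebook, and it assumes the column has no missing values (as `info()` reports for this dataset).

```python
import pandas as pd

def map_binary(series, mapping):
    """Map labels to 0/1 and raise if a label is absent from the mapping."""
    mapped = series.map(mapping)
    # Labels that were present in the input but produced NaN after mapping.
    unmapped = series[mapped.isna() & series.notna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unmapped labels: {list(unmapped)}")
    return mapped.astype(int)

sex = pd.Series(['F', 'M', 'F'])
encoded_sex = map_binary(sex, {'M': 0, 'F': 1})
print(encoded_sex.tolist())
```

With the notebook's variables this would read, for example, `newX.SEX = map_binary(newX.SEX, sex_map)`.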


In [None]:
# Apply one-hot encoding to all the categorical variables
ohe = OneHotEncoder()
ohe.fit(newX[cat_cols])
transformed = ohe.transform(newX[cat_cols]).toarray()
ohe_df = pd.DataFrame(transformed, columns=ohe.get_feature_names_out())
# Concat with original data
newX = pd.concat([newX, ohe_df], axis=1).drop(cat_cols, axis=1)

In [None]:
newX.head(5)

Unnamed: 0,ACV,SEX,STATUT,AGECOND,K8000,RM,CAR_AUTRE,CAR_BER,CAR_BRK,CAR_CTE,...,AGECOND_CAT_25_35,AGECOND_CAT_35_45,AGECOND_CAT_45_55,AGECOND_CAT_55_70,AGECOND_CAT_<21,AGECOND_CAT_>70,ACV_CAT_0_1,ACV_CAT_2_4,ACV_CAT_5_9,ACV_CAT_>=10
0,10,1,1,40,0,64,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,10,1,0,63,0,50,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
2,10,1,1,20,0,95,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
3,10,1,0,56,0,50,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
4,10,1,0,29,0,95,0.0,1.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [None]:
newX.columns

Index(['ACV', 'SEX', 'STATUT', 'AGECOND', 'K8000', 'RM', 'CAR_AUTRE',
       'CAR_BER', 'CAR_BRK', 'CAR_CTE', 'USAGE_1', 'USAGE_2', 'USAGE_3',
       'USAGE_4', 'CLA_A', 'CLA_B', 'CLA_C', 'CLA_D', 'CLA_E', 'CLA_F',
       'CLA_G', 'CLA_H', 'CLA_I', 'CLA_J', 'CLA_K', 'CLA_L', 'CLA_M', 'CLA_N',
       'CLA_O', 'CLA_P', 'CLA_Q', 'CLA_R', 'CLA_S', 'CLA_T', 'CLA_U', 'CLA_V',
       'CLA_W', 'CLA_X', 'CLA_Y', 'CLA_Z', 'ALI_CAR', 'ALI_ELC', 'ALI_IDS',
       'ALI_INJ', 'ALI_INS', 'ENE_EL', 'ENE_ES', 'ENE_GO', 'VIT_001-130',
       'VIT_130-140', 'VIT_140-150', 'VIT_150-160', 'VIT_160-170',
       'VIT_170-180', 'VIT_180-190', 'VIT_190-200', 'VIT_200-220', 'VIT_S220',
       'SEGM_0', 'SEGM_A', 'SEGM_B', 'SEGM_H', 'SEGM_M1', 'SEGM_M2',
       'GARAGE_1', 'GARAGE_2', 'GARAGE_3', 'CSP_1', 'CSP_50', 'CSP_55',
       'CSP_60', 'CSP_AUTRE', 'AGECOND_CAT_21_25', 'AGECOND_CAT_25_35',
       'AGECOND_CAT_35_45', 'AGECOND_CAT_45_55', 'AGECOND_CAT_55_70',
       'AGECOND_CAT_<21', 'AGECOND_CAT_>70', 'AC

In [None]:
# Apply standard scaling to all the numerical variables.
scaler = StandardScaler()
scaler.fit(X[num_cols])
scaledx = scaler.transform(X[num_cols])

In [None]:
scaledx

array([[-0.45766311,  1.13492859, -0.09406372],
       [ 1.10116548,  1.13492859, -0.81727943],
       [-1.81316623,  1.13492859,  1.50734251],
       ...,
       [ 0.55896424, -1.67661061, -0.81727943],
       [ 0.08453814, -1.39545669, -0.81727943],
       [-0.18656248,  0.85377467, -0.81727943]])

In [None]:
scaledx_df = pd.DataFrame(data = scaledx, columns = num_cols)
scaledx_df.head()

Unnamed: 0,AGECOND,ACV,RM
0,-0.457663,1.134929,-0.094064
1,1.101165,1.134929,-0.817279
2,-1.813166,1.134929,1.507343
3,0.626739,1.134929,-0.817279
4,-1.20319,1.134929,1.507343


In [None]:
newX.head()

Unnamed: 0,ACV,SEX,STATUT,AGECOND,K8000,RM,CAR_AUTRE,CAR_BER,CAR_BRK,CAR_CTE,...,AGECOND_CAT_25_35,AGECOND_CAT_35_45,AGECOND_CAT_45_55,AGECOND_CAT_55_70,AGECOND_CAT_<21,AGECOND_CAT_>70,ACV_CAT_0_1,ACV_CAT_2_4,ACV_CAT_5_9,ACV_CAT_>=10
0,10,1,1,40,0,64,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,10,1,0,63,0,50,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
2,10,1,1,20,0,95,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
3,10,1,0,56,0,50,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
4,10,1,0,29,0,95,0.0,1.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [None]:
newX.AGECOND = scaledx_df.AGECOND
newX.ACV = scaledx_df.ACV
newX.RM = scaledx_df.RM
newX.head()

Unnamed: 0,ACV,SEX,STATUT,AGECOND,K8000,RM,CAR_AUTRE,CAR_BER,CAR_BRK,CAR_CTE,...,AGECOND_CAT_25_35,AGECOND_CAT_35_45,AGECOND_CAT_45_55,AGECOND_CAT_55_70,AGECOND_CAT_<21,AGECOND_CAT_>70,ACV_CAT_0_1,ACV_CAT_2_4,ACV_CAT_5_9,ACV_CAT_>=10
0,1.134929,1,1,-0.457663,0,-0.094064,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,1.134929,1,0,1.101165,0,-0.817279,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
2,1.134929,1,1,-1.813166,0,1.507343,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
3,1.134929,1,0,0.626739,0,-0.817279,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
4,1.134929,1,0,-1.20319,0,1.507343,0.0,1.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


## Saving the transformed dataset (newX, Y) separately for each case

- Case 1: `AGECOND` and `ACV` are treated as continuous variables.

In [None]:
AGECOND_CAT_cols = [col for col in newX if col.startswith('AGECOND_')]
AGECOND_CAT_cols

['AGECOND_CAT_21_25',
 'AGECOND_CAT_25_35',
 'AGECOND_CAT_35_45',
 'AGECOND_CAT_45_55',
 'AGECOND_CAT_55_70',
 'AGECOND_CAT_<21',
 'AGECOND_CAT_>70']

In [None]:
ACV_CAT_cols = [col for col in newX if col.startswith('ACV_')]
ACV_CAT_cols

['ACV_CAT_0_1', 'ACV_CAT_2_4', 'ACV_CAT_5_9', 'ACV_CAT_>=10']

In [None]:
newX_CONT = newX.drop(AGECOND_CAT_cols, axis=1)
newX_CONT = newX_CONT.drop(ACV_CAT_cols, axis=1)

In [None]:
# Save the transformed dataset.
encoded_df = pd.concat([newX_CONT, Y], axis=1)
encoded_df.to_csv('ActuarialThesis/Data/encodedBASEAUTO.csv', index=False)

In [None]:
# Reload the dataset that was just saved.
encoded_df = pd.read_csv('ActuarialThesis/Data/encodedBASEAUTO.csv')
encoded_df.head()

Unnamed: 0,ACV,SEX,STATUT,AGECOND,K8000,RM,CAR_AUTRE,CAR_BER,CAR_BRK,CAR_CTE,...,SEGM_M2,GARAGE_1,GARAGE_2,GARAGE_3,CSP_1,CSP_50,CSP_55,CSP_60,CSP_AUTRE,CHARGE
0,1.134929,1,1,-0.457663,0,-0.094064,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
1,1.134929,1,0,1.101165,0,-0.817279,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
2,1.134929,1,1,-1.813166,0,1.507343,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
3,1.134929,1,0,0.626739,0,-0.817279,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
4,1.134929,1,0,-1.20319,0,1.507343,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0


In [None]:
# Separate the target variable from the dataset
Y = encoded_df['CHARGE']
X = encoded_df.drop('CHARGE', axis=1)

In [None]:
# Splitting dataset to obtain train/test sets: 80:20

train, test = train_test_split(encoded_df, test_size=0.2, random_state=42)
train.to_csv('ActuarialThesis/Data/encodedTrainBASEAUTO.csv', index=False)
test.to_csv('ActuarialThesis/Data/encodedTestBASEAUTO.csv', index=False)
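In the cells above, `StandardScaler` is fit on every row before `train_test_split` is called, so the test rows contribute to the mean and variance used for scaling. An alternative ordering, shown here on randomly generated toy data rather than the real file, is to split first and fit the scaler on the training rows only. This is a sketch of that ordering, not the notebook's actual procedure.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Randomly generated toy data; the column names mirror the notebook, the values do not.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'AGECOND': rng.integers(18, 90, size=50),
    'RM': rng.integers(50, 130, size=50),
    'CHARGE': rng.gamma(2.0, 500.0, size=50),
})
num_cols = ['AGECOND', 'RM']

# Split first, then learn the scaling parameters from the training rows only.
train, test = train_test_split(toy, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(train[num_cols])

train_scaled = pd.DataFrame(scaler.transform(train[num_cols]),
                            columns=num_cols, index=train.index)
test_scaled = pd.DataFrame(scaler.transform(test[num_cols]),
                           columns=num_cols, index=test.index)
print(train_scaled.mean().round(6).tolist(), test_scaled.shape)
```

The training columns then have mean 0 and unit variance exactly, while the test columns are only approximately standardized, which is the expected behaviour.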

- Case 2: `AGECOND` and `ACV` are treated as categorical variables.

In [None]:
newX_CAT = newX.drop(['AGECOND', 'ACV'], axis=1)

In [None]:
# Save the transformed dataset.
encoded_categorical_df = pd.concat([newX_CAT, Y], axis=1)
encoded_categorical_df.to_csv('ActuarialThesis/Data/encodedCategoricalBASEAUTO.csv', index=False)

In [None]:
# Reload the dataset that was just saved.
new_encoded_categorical_df = pd.read_csv('ActuarialThesis/Data/encodedCategoricalBASEAUTO.csv')
new_encoded_categorical_df.head()

Unnamed: 0,SEX,STATUT,K8000,RM,CAR_AUTRE,CAR_BER,CAR_BRK,CAR_CTE,USAGE_1,USAGE_2,...,AGECOND_CAT_35_45,AGECOND_CAT_45_55,AGECOND_CAT_55_70,AGECOND_CAT_<21,AGECOND_CAT_>70,ACV_CAT_0_1,ACV_CAT_2_4,ACV_CAT_5_9,ACV_CAT_>=10,CHARGE
0,1,1,0,-0.094064,0.0,1.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,1,0,0,-0.817279,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,1,1,0,1.507343,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
3,1,0,0,-0.817279,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,1,0,0,1.507343,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [None]:
def get_categories(category_name):
  return [col for col in encoded_df if col.startswith(category_name)]
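`get_categories` reads the global `encoded_df` and matches on a bare prefix, so a call such as `get_categories('ACV')` would also return the continuous `ACV` column. A slightly stricter variant is sketched below; `get_dummy_columns` is a hypothetical name, and it assumes the one-hot columns follow the `VARIABLE_level` naming produced by `get_feature_names_out`.

```python
import pandas as pd

def get_dummy_columns(frame, variable):
    """Return the columns of `frame` whose name starts with '<variable>_'."""
    prefix = f"{variable}_"
    return [col for col in frame.columns if col.startswith(prefix)]

# Empty toy frame: only the column names matter here.
toy = pd.DataFrame(columns=['ACV', 'ACV_CAT_0_1', 'ENE_EL', 'ENE_ES', 'ENE_GO'])

print(get_dummy_columns(toy, 'ENE'))
print(get_dummy_columns(toy, 'ACV'))
```

Passing the frame explicitly also lets the same helper be used on `new_encoded_categorical_df`.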

In [None]:
ENE_CATEGORIES = get_categories('ENE')
ENE_CATEGORIES

['ENE_EL', 'ENE_ES', 'ENE_GO']

In [None]:
df['ENE'].unique()

array(['ES', 'GO', 'EL'], dtype=object)

In [None]:
encoded_df[ENE_CATEGORIES].head()

Unnamed: 0,ENE_EL,ENE_ES,ENE_GO
0,0.0,1.0,0.0
1,0.0,1.0,0.0
2,0.0,1.0,0.0
3,0.0,1.0,0.0
4,0.0,1.0,0.0


In [None]:
SEGM_CATEGORIES = get_categories('SEGM')
SEGM_CATEGORIES

['SEGM_0', 'SEGM_A', 'SEGM_B', 'SEGM_H', 'SEGM_M1', 'SEGM_M2']

In [None]:
df['SEGM'].unique()

array(['A', '0', 'B', 'M1', 'H', 'M2'], dtype=object)

In [None]:
encoded_df[SEGM_CATEGORIES].head()

Unnamed: 0,SEGM_0,SEGM_A,SEGM_B,SEGM_H,SEGM_M1,SEGM_M2
0,0.0,1.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0


In [None]:
CLA_CATEGORIES = get_categories('CLA')
CLA_CATEGORIES

['CLA_A',
 'CLA_B',
 'CLA_C',
 'CLA_D',
 'CLA_E',
 'CLA_F',
 'CLA_G',
 'CLA_H',
 'CLA_I',
 'CLA_J',
 'CLA_K',
 'CLA_L',
 'CLA_M',
 'CLA_N',
 'CLA_O',
 'CLA_P',
 'CLA_Q',
 'CLA_R',
 'CLA_S',
 'CLA_T',
 'CLA_U',
 'CLA_V',
 'CLA_W',
 'CLA_X',
 'CLA_Y',
 'CLA_Z']

In [None]:
df['CLA'].unique()

array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
       'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'],
      dtype=object)

In [None]:
encoded_df[CLA_CATEGORIES].head()

Unnamed: 0,CLA_A,CLA_B,CLA_C,CLA_D,CLA_E,CLA_F,CLA_G,CLA_H,CLA_I,CLA_J,...,CLA_Q,CLA_R,CLA_S,CLA_T,CLA_U,CLA_V,CLA_W,CLA_X,CLA_Y,CLA_Z
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
VIT_CATEGORIES = get_categories('VIT')
VIT_CATEGORIES

['VIT_001-130',
 'VIT_130-140',
 'VIT_140-150',
 'VIT_150-160',
 'VIT_160-170',
 'VIT_170-180',
 'VIT_180-190',
 'VIT_190-200',
 'VIT_200-220',
 'VIT_S220']

In [None]:
df['VIT'].unique()

array(['130-140', '001-130', '140-150', '150-160', '160-170', '170-180',
       '180-190', '190-200', '200-220', 'S220'], dtype=object)

In [None]:
encoded_df[VIT_CATEGORIES].head()

Unnamed: 0,VIT_001-130,VIT_130-140,VIT_140-150,VIT_150-160,VIT_160-170,VIT_170-180,VIT_180-190,VIT_190-200,VIT_200-220,VIT_S220
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
df['AGECOND'].unique()

array([ 40,  63,  20,  56,  29,  43,  21,  35,  31,  48,  19,  39,  36,
        60,  57,  41,  58,  30,  32,  72,  27,  37,  45,  26,  25,  52,
        80,  62,  24,  65,  50,  33,  49,  47,  34,  22,  51,  55,  28,
        64,  76,  53,  38,  42,  54,  23,  70,  61,  44,  79,  46,  67,
        78,  59,  77,  71,  82,  66,  68,  74,  69,  75,  73,  87,  81,
        94,  86,  83,  85,  84, 103,  88,  91,  89,  90,  97,  95,  92,
        98])

In [None]:
df['ACV'].unique()

array([10,  9,  7,  5,  2,  4,  1,  3,  0])

In [None]:
df['RM'].unique()

array([ 64,  50,  95,  90,  72,  60,  57, 100,  61,  85,  51,  80,  52,
        68,  54,  76,  53,  69,  58, 118,  73,  55, 106,  62,  59, 125,
        56,  66, 121,  63, 112,  97,  65, 174,  67,  84,  71, 156,  74,
        93,  78,  77, 166,  92,  79, 116,  81, 132, 139,  83, 104,  86,
       164,  70, 148, 110,  96,  82, 140,  87,  88, 105, 220, 133, 258,
        75, 150,  91,  89, 147, 102,  94, 119, 126, 185,  98, 108,  99,
       115, 144, 120, 101, 149])

In [None]:
encoded_df[['AGECOND', 'ACV', 'RM']].head()

Unnamed: 0,AGECOND,ACV,RM
0,-0.457663,1.134929,-0.094064
1,1.101165,1.134929,-0.817279
2,-1.813166,1.134929,1.507343
3,0.626739,1.134929,-0.817279
4,-1.20319,1.134929,1.507343
