# Predicting whether a deputy will vote
### Random Forest with 'Présence' as the target (i.e. whether the deputy took part in the ballot); the features are data on the deputies' profiles

We split the dataset by the main themes of the ballots, so that for a given theme we can predict whether deputies vote or not (one random forest per theme).
We therefore train on a single theme here, and a few themes will be selected for study at the end.

## Creating the 'Présence' variable for each deputy on each ballot (full dataset)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:

df_votants = pd.read_csv('database_deputes.csv', index_col=0)
df_votes2 = pd.read_csv('database_votes2.csv', index_col=0)

In [3]:
df_votes = df_votes2[['idScrutin', 'idVotant']]

In [4]:
copy = df_votes2.copy()
copy = copy.pivot(index = 'idVotant', columns = 'idScrutin', values = 'vote')

In [5]:
# Missing (deputy, ballot) pairs are NaN after the pivot: encode them as '0' (absent),
# which conflates abstention with not taking part in the vote
copy = copy.fillna('0')

# Any recorded 'Pour', 'Contre' or 'Non-votant' counts as present ('1')
copy = copy.replace(['Pour', 'Contre', 'Non-votant'], '1')

In [6]:
copy2=copy.unstack()

In [7]:
copy2 = copy2.reset_index()
copy2 = copy2.rename(columns = {0 : 'présence'})

In [8]:
copy2 = copy2.sort_values(['idScrutin', 'idVotant'])
copy2


Unnamed: 0,idScrutin,idVotant,présence
0,0,PA1008,0
1,0,PA1012,0
2,0,PA1029,0
3,0,PA1198,0
4,0,PA1206,0
...,...,...,...
1792270,3116,PA774962,0
1792271,3116,PA856,0
1792272,3116,PA923,0
1792273,3116,PA942,0
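
The reshaping steps above can be checked on a tiny example. The sketch below uses an invented three-row table (`toy_votes` and its values are hypothetical), and `notna()` stands in for the `fillna`/`replace` calls; it gives the same flags as long as every recorded vote counts as present.

```python
import pandas as pd

# Toy stand-in for database_votes2.csv (values invented for illustration)
toy_votes = pd.DataFrame({
    "idScrutin": [0, 0, 1],
    "idVotant": ["PA1", "PA2", "PA1"],
    "vote": ["Pour", "Contre", "Non-votant"],
})

# One row per deputy, one column per ballot; missing pairs become NaN
wide = toy_votes.pivot(index="idVotant", columns="idScrutin", values="vote")

# A recorded vote counts as present (1), a missing one as absent (0)
presence_wide = wide.notna().astype(int)

# Back to long format: one row per (ballot, deputy) pair
presence_long = (
    presence_wide.unstack()
    .rename("présence")
    .reset_index()
    .sort_values(["idScrutin", "idVotant"])
    .reset_index(drop=True)
)
print(presence_long)
```

Deputy `PA2` has no recorded vote on ballot 1, so that pair is the only one flagged as absent.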


#### Merging the 'Présence' variable into the full dataset

In [14]:
bjr = df_votes2.merge(copy2, on = ['idScrutin', 'idVotant'], how = 'outer')

### Importing the scraped data on deputies (home region, socio-professional category (CSP), gender, age, ...)

In [15]:
df_votants = pd.read_csv('database_votants.csv', index_col=0)


In [16]:
df_votants2=df_votes2.drop_duplicates('idVotant')[['idVotant','Département','Numéro de circonscription','Profession','Groupe politique (complet)','Groupe politique (abrégé)','date_naissance']]

In [17]:
bjr=df_votants2.merge(bjr, on='idVotant',how='outer')

In [18]:
df_scrutins2 = df_votes2.drop_duplicates('idScrutin')[['idScrutin','titre','demandeur','date_scrutin']]

In [19]:
bjr=df_scrutins2.merge(bjr, on='idScrutin',how='outer')

In [20]:
df_régions = pd.read_csv('database_deputes.csv', index_col=0)
df_régions = df_régions.drop_duplicates('idVotant')[['idVotant','Région']]
bjr = bjr.merge(df_régions, on ='idVotant', how = 'outer')
df_age = pd.read_csv('age.csv', index_col=0)
bjr = bjr.merge(df_age, on ='idVotant', how = 'outer')
df_accord = pd.read_csv('indice_accord.csv', index_col=0)
bjr = bjr.merge(df_accord, on ='idVotant', how = 'outer')
df_vote = pd.read_csv('proportion_vote.csv', index_col=0)
bjr = bjr.merge(df_vote, on ='idVotant', how = 'outer')
df_genre = pd.read_csv('genre.csv', index_col=0)
bjr = bjr.merge(df_genre, on ='idVotant', how = 'outer')
df_csp = pd.read_excel('csp.xlsx')
bjr = bjr.merge(df_csp, on ='idVotant', how = 'outer')
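
Chained outer merges on `idVotant` can add rows for deputies missing from one table and can multiply rows if a key is duplicated. As a hedged alternative, the sketch below assumes each profile table holds at most one row per `idVotant`; the helper `merge_profiles` and the toy tables are hypothetical, and `how="left"` with `validate` is a swapped-in variant, not what the notebook runs.

```python
import pandas as pd

# Toy stand-ins for the vote table and two per-deputy profile tables
votes = pd.DataFrame({"idVotant": ["PA1", "PA1", "PA2"], "idScrutin": [0, 1, 0]})
age = pd.DataFrame({"idVotant": ["PA1", "PA2"], "age": [58, 42]})
genre = pd.DataFrame({"idVotant": ["PA1", "PA2"], "Genre": ["M.", "Mme"]})


def merge_profiles(vote_table, profile_tables):
    """Attach each per-deputy table to the votes without changing the row count."""
    merged = vote_table
    for table in profile_tables:
        # validate raises a MergeError if idVotant is duplicated in `table`
        merged = merged.merge(table, on="idVotant", how="left",
                              validate="many_to_one")
    return merged


merged = merge_profiles(votes, [age, genre])
print(merged)
```

With a left merge the result keeps exactly one row per original vote, which makes unexpected row growth easy to spot.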

In [21]:
df_randomforest = pd.read_csv('df_randomforest.csv', index_col=0,low_memory=False)
df_randomforest = df_randomforest.drop_duplicates('idScrutin')[['idScrutin','ratioabstention','cluster']]
df_randomforest = df_randomforest.merge(bjr, on ='idScrutin', how='outer')

### df_randomforest is our dataframe with all the required information; we convert the date to datetime and finish removing duplicates and unneeded columns to obtain df_randomforest3 (sorted chronologically)

In [23]:
df_randomforest.date_scrutin_x = pd.to_datetime(df_randomforest.date_scrutin_x)

In [26]:
df_randomforest=df_randomforest.sort_values('date_scrutin_x')

In [27]:
df_randomforest3=df_randomforest.drop(['organeRefGroupe','idScrutin','vote','abstention','titre_x','Profession_x','date_naissance_x','Région_x','date_naissance','resultat','demandeur_x','Prénom_x','Nom_x','Groupe politique (abrégé)_x','wikilink','Département_x','Numéro de circonscription_x','code_type_vote','non_votants','type_organe','type_mandat','non_votants_volontaires','qualite_mandat','organe_ref','pour','contre','votants'],axis=1)

In [28]:
df_randomforest3['Région2']=df_randomforest3['Région_y']

In [29]:
df_randomforest3.drop(df_randomforest3.filter(regex='_y$').columns.tolist(),axis=1, inplace=True)

In [30]:
df_randomforest3 = df_randomforest3.drop((['idVotant']),axis=1).reset_index()

### We pick the cluster (i.e. theme) to work on and keep only the ballots belonging to that theme

In [31]:
df_randomforest_19=df_randomforest3[df_randomforest3['cluster']==33]
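
Since the plan is one random forest per theme, the same filter can be applied in a loop. The following is only a sketch on synthetic data: the `toy` frame, its values, and the choice of 50 trees are assumptions made for illustration, with column names borrowed from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for df_randomforest3 with two themes (clusters 33 and 7)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "cluster": np.repeat([33, 7], 40),
    "IndiceAccord": rng.uniform(0, 100, 80),
    "ratioabstention": rng.uniform(0.5, 1.0, 80),
})
toy["présence"] = (toy["IndiceAccord"] > 50).astype(int)

# Fit one model per theme, keyed by cluster id
models = {}
for cluster_id, group in toy.groupby("cluster"):
    X = group.drop(columns=["cluster", "présence"])
    y = group["présence"]
    models[cluster_id] = RandomForestClassifier(
        n_estimators=50, random_state=0
    ).fit(X, y)

print(sorted(models))
```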


In [32]:
df_randomforest_19_présence=df_randomforest_19['présence']

#### Final features:

In [33]:
df_randomforest_19.columns

Index(['index', 'ratioabstention', 'cluster', 'date_scrutin_x',
       'Groupe politique (complet)_x', 'présence', 'age', 'IndiceAccord',
       'Parti', 'Nom_Parti', 'Contre', 'Pour', 'Genre', 'CSP', 'Région2'],
      dtype='object')

#### Splitting features and target:

In [28]:
df_randomforest_19_présence=df_randomforest_19_présence.astype(int).to_frame()
df_randomforest_19=df_randomforest_19.drop('présence', axis =1)
df_randomforest_19=df_randomforest_19.drop('Groupe politique (complet)_x', axis=1)
df_randomforest_19

Unnamed: 0,index,ratioabstention,cluster,date_scrutin_x,age,IndiceAccord,Parti,Nom_Parti,Contre,Pour,Genre,CSP,Région2
33350,33350,0.820870,33,2019-09-11,58,24.434180,PO730964,LREM,5.034642,70.161663,M.,Cadres d'entreprise,Bretagne
33351,33351,0.820870,33,2019-09-11,42,28.533333,PO730964,LREM,19.466667,19.733333,M.,"Cadres de la fonction publique, professions in...",Ile-de-France
33352,33352,0.820870,33,2019-09-11,53,74.000000,PO730964,LREM,64.571429,35.428571,M.,"Cadres de la fonction publique, professions in...",Auvergne-Rhône-Alpes
33353,33353,0.820870,33,2019-09-11,51,71.171171,PO730964,LREM,60.360360,39.189189,M.,Cadres d'entreprise,Normandie
33354,33354,0.820870,33,2019-09-11,55,76.120959,PO730964,LREM,69.968717,30.031283,Mme,Sans profession déclarée,Provence-Alpes-Côte d'Azur
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1689345,1689345,0.914783,33,2019-06-27,62,45.161290,PO758835,Socialistes,41.935484,58.064516,Mme,"Cadres de la fonction publique, professions in...",Normandie
1689346,1689346,0.914783,33,2019-06-27,42,88.461538,PO730964,LREM,73.076923,26.923077,Mme,Cadres d'entreprise,Ile-de-France
1689347,1689347,0.914783,33,2019-06-27,35,40.000000,PO730940,Gauche Dem et Rep,80.000000,20.000000,Mme,"Professions intermédiaires de l'enseignement,...",Réunion
1689348,1689348,0.914783,33,2019-06-27,72,58.333333,PO730934,LR,75.000000,25.000000,Mme,Anciens employés et ouvriers,Provence-Alpes-Côte d'Azur


In [29]:
df_randomforest_19_dummies=pd.get_dummies(df_randomforest_19)

df_randomforest_19_dummies

Unnamed: 0,index,ratioabstention,cluster,date_scrutin_x,IndiceAccord,Contre,Pour,age_ a,age_27,age_28,...,Région2_Nouvelle-Aquitaine,Région2_Nouvelle-Calédonie,Région2_Occitanie,Région2_Pays de la Loire,Région2_Polynésie française,Région2_Provence-Alpes-Côte d'Azur,Région2_Réunion,Région2_Saint-Barthélemy et Saint-Martin,Région2_Saint-Pierre-et-Miquelon,Région2_Wallis-et-Futuna
33350,33350,0.820870,33,2019-09-11,24.434180,5.034642,70.161663,0,0,0,...,0,0,0,0,0,0,0,0,0,0
33351,33351,0.820870,33,2019-09-11,28.533333,19.466667,19.733333,0,0,0,...,0,0,0,0,0,0,0,0,0,0
33352,33352,0.820870,33,2019-09-11,74.000000,64.571429,35.428571,0,0,0,...,0,0,0,0,0,0,0,0,0,0
33353,33353,0.820870,33,2019-09-11,71.171171,60.360360,39.189189,0,0,0,...,0,0,0,0,0,0,0,0,0,0
33354,33354,0.820870,33,2019-09-11,76.120959,69.968717,30.031283,0,0,0,...,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1689345,1689345,0.914783,33,2019-06-27,45.161290,41.935484,58.064516,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1689346,1689346,0.914783,33,2019-06-27,88.461538,73.076923,26.923077,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1689347,1689347,0.914783,33,2019-06-27,40.000000,80.000000,20.000000,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1689348,1689348,0.914783,33,2019-06-27,58.333333,75.000000,25.000000,0,0,0,...,0,0,0,0,0,1,0,0,0,0
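
The `age_ a`, `age_27`, `age_28`, ... columns in the output above suggest that `age` is stored as text (at least one value reads ` a`), so `get_dummies` one-hot encodes it instead of keeping a single numeric column. One possible fix, sketched on invented rows, is to coerce `age` to numbers first and to turn the date into a day count, since tree models need numeric inputs. The `toy` frame and the `days_since_first` column are hypothetical.

```python
import pandas as pd

# Invented rows mimicking the dtypes seen in the notebook
toy = pd.DataFrame({
    "age": ["58", " a", "42"],
    "date_scrutin_x": pd.to_datetime(["2019-09-11", "2019-09-11", "2019-06-27"]),
    "Genre": ["M.", "Mme", "Mme"],
})

# Unparseable ages such as ' a' become NaN instead of spawning dummy columns
toy["age"] = pd.to_numeric(toy["age"], errors="coerce")

# Replace the datetime column with the number of days since the earliest ballot
first_day = toy["date_scrutin_x"].min()
toy["days_since_first"] = (toy["date_scrutin_x"] - first_day).dt.days
toy = toy.drop(columns="date_scrutin_x")

# Only the remaining text column (Genre) is one-hot encoded
toy_dummies = pd.get_dummies(toy)
print(toy_dummies.columns.tolist())
```

Rows whose age could not be parsed would still need to be dropped or imputed before fitting, depending on whether the estimator accepts missing values.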


In [30]:
df_randomforest_19_dummies= df_randomforest_19_dummies.drop('index',axis=1)


In [31]:
label_19 = df_randomforest_19_présence['présence']
label_19

33350      1
33351      1
33352      0
33353      0
33354      0
          ..
1689345    0
1689346    0
1689347    0
1689348    0
1689349    0
Name: présence, Length: 17825, dtype: int32

In [32]:
from sklearn.model_selection import train_test_split


In [33]:
train_features, test_features, train_labels, test_labels = train_test_split(df_randomforest_19_dummies, label_19, test_size = 0.25)
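
Because the target is imbalanced, a plain random split can leave the two sets with different class ratios, and without a seed the split changes on every run. A minimal sketch of a stratified, reproducible split follows; the toy data and the seed value are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with 20% of positives, loosely mimicking the imbalance
X = pd.DataFrame({"x": range(100)})
y = pd.Series([1] * 20 + [0] * 80, name="présence")

# stratify=y keeps the class ratio identical in both parts;
# random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print(y_train.value_counts(), y_test.value_counts())
```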

### The 'Présence' target is imbalanced (more 0s than 1s), so we rebalance it

In [34]:
from sklearn.utils import resample,shuffle
test_label_counts = test_labels.value_counts()
test_features_absent = test_features[test_labels==0]
test_labels_absent = test_labels[test_labels==0]
features_test_less, labels_test_less = resample(test_features_absent,test_labels_absent,n_samples=test_label_counts[1],replace=False)
features_test_ = pd.concat([features_test_less,test_features[test_labels==1]])
labels_test_ = pd.concat([labels_test_less,test_labels[test_labels==1]])
test_features_, test_labels_ = shuffle(features_test_,labels_test_)
print(test_labels_.value_counts())

1    459
0    459
Name: présence, dtype: int64


In [35]:
train_label_counts = train_labels.value_counts()
train_features_absent = (train_features[train_labels==0])
train_labels_absent = (train_labels[train_labels==0])
features_train_less, labels_train_less = resample(train_features_absent,train_labels_absent,n_samples=train_label_counts[1],replace=False)
features_train_ = pd.concat([features_train_less,train_features[train_labels==1]])
labels_train_ = pd.concat([labels_train_less,train_labels[train_labels==1]])
train_features_, train_labels_ = shuffle(features_train_,labels_train_)
print(train_labels_.value_counts())

1    1341
0    1341
Name: présence, dtype: int64
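
The two cells above apply the same undersampling logic to the test and training sets. A compact way to express it is a single helper; the sketch below uses pandas' `groupby(...).sample` rather than `sklearn.utils.resample`, and assumes the index is unique. The name `balance_by_undersampling` and the toy data are hypothetical.

```python
import pandas as pd


def balance_by_undersampling(features, labels, random_state=0):
    """Keep as many rows of each class as the rarest class has, then shuffle."""
    n_min = int(labels.value_counts().min())
    # Draw n_min rows per class without replacement
    balanced = labels.groupby(labels).sample(n=n_min, random_state=random_state)
    # Shuffle so the classes are not stored in blocks
    balanced = balanced.sample(frac=1, random_state=random_state)
    return features.loc[balanced.index], balanced


# Toy usage: four absences (0) and two presences (1)
toy_features = pd.DataFrame({"x": range(6)})
toy_labels = pd.Series([0, 0, 0, 0, 1, 1], name="présence")
bal_features, bal_labels = balance_by_undersampling(toy_features, toy_labels)
print(bal_labels.value_counts())
```

Passing `class_weight="balanced"` to `RandomForestClassifier` is another common option that avoids discarding rows.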


### Running the models

In [35]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators = 500, max_depth=10)

rf.fit(train_features_, train_labels_)

importances = rf.feature_importances_
std = np.std([tree.feature_importances_ for tree in rf.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(train_features_.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))


NameError: name 'train_features_' is not defined

In [36]:
from sklearn.model_selection import GridSearchCV
param_grid = {'max_depth':[5,6,7,8,10], 'n_estimators' : [100,200,300,500,1000]}
predictor= GridSearchCV(RandomForestClassifier(random_state=0),param_grid=param_grid)
predictor.fit(train_features_,train_labels_)
print('Selected parameters:',predictor.best_params_)
print('Training score: ',predictor.score(train_features_,train_labels_))
print('Test score: ',predictor.score(test_features_,test_labels_))

NameError: name 'train_features_' is not defined
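
By default `GridSearchCV` relies on the estimator's own score (accuracy for classifiers) and on 5-fold cross-validation. The sketch below makes both explicit and reads the cross-validated score of the best setting; the synthetic dataset, the reduced grid, and the choice of `f1_macro` are assumptions made to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data standing in for the balanced training set
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [3, 5], "n_estimators": [20, 50]},
    cv=5,                # number of cross-validation folds
    scoring="f1_macro",  # metric used to rank parameter settings
)
search.fit(X, y)

# best_score_ is the mean cross-validated score of the best setting
print(search.best_params_, search.best_score_)
```

The cross-validated `best_score_` is usually a fairer guide than the training score, which is computed on data the refitted model has already seen.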

In [None]:
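print("Feature ranking:")
feature_names = df_randomforest_19_dummies.columns
# Sort features by decreasing importance so the printed rank is meaningful
for rank, i in enumerate(np.argsort(importances)[::-1], start=1):
    print("{}. {} ({})".format(rank, feature_names[i], importances[i]))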


In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators = 100, max_depth=3)

rf.fit(train_features_, train_labels_);

In [None]:
predictions_test = rf.predict(test_features_)

### Gradient boosting trial (labelled XGBoost here, but run with scikit-learn's GradientBoostingClassifier)

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import recall_score, f1_score

# Test accuracy of the random forest fitted above
test_acc = accuracy_score(test_labels_, predictions_test)
print(test_acc)



In [None]:
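# The loss is left at its default (the binomial log-loss, which is what "deviance"
# referred to before scikit-learn renamed it to "log_loss")
model_boosting = GradientBoostingClassifier(
    learning_rate=0.2,
    max_depth=5,
    max_features="sqrt",
    subsample=0.95,
    n_estimators=200)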


model_boosting.fit(train_features_, train_labels_)


predictions_test_xgb = model_boosting.predict(test_features_)
predictions_train_xgb = model_boosting.predict(train_features_)


train_acc = accuracy_score(train_labels_, predictions_train_xgb)
print(train_acc)

test_acc = accuracy_score(test_labels_, predictions_test_xgb)
print(test_acc)

In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = {'max_depth':[5,6,7,8,10], 'n_estimators' : [300,500,1000]}
predictor= GridSearchCV(GradientBoostingClassifier(random_state=0),param_grid=param_grid)
predictor.fit(train_features_,train_labels_)
print('Selected parameters:',predictor.best_params_)
print('Training score: ',predictor.score(train_features_,train_labels_))
print('Test score: ',predictor.score(test_features_,test_labels_))

### Scores

In [None]:
# Recall for the Random Forest

recall = recall_score(test_labels_, predictions_test, average='macro')
print('Recall: %.3f' % recall)

# Recall for gradient boosting (the `_xgb` predictions)

recall = recall_score(test_labels_, predictions_test_xgb, average='macro')
print('Recall: %.3f' % recall)

In [None]:
# F1-score for the Random Forest

f1 = f1_score(test_labels_, predictions_test, average='macro')
print('F1-Score: %.3f' % f1)

# F1-score for gradient boosting (the `_xgb` predictions)

f1 = f1_score(test_labels_, predictions_test_xgb, average='macro')
print('F1-Score: %.3f' % f1)
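
Macro-averaged recall and F1 summarise both classes in one number; a confusion matrix shows where the errors actually fall. `confusion_matrix` is imported earlier in the notebook but not used, so here is a minimal sketch on invented labels (`y_true`, `y_pred` and the class names are hypothetical).

```python
from sklearn.metrics import classification_report, confusion_matrix

# Invented labels: 0 = absent, 1 = present
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Per-class precision, recall and F1 in a single table
print(classification_report(y_true, y_pred, target_names=["absent", "present"]))
```

In the notebook, the same calls could be applied to `test_labels_` with `predictions_test` or `predictions_test_xgb`.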