# End of preprocessing: inputs of equal length

Before we can run our neural network (CNN), the data needs one more processing step.

Our model will learn the general characteristics of the flights between two water washes, for each aircraft. However, to feed a neural network, the intervals must all have the same length, that is, the same number of flights in each interval. For now, some aircraft have more than 9000 flights between two water washes while others have very few, which is why we need to process them.
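To make the constraint concrete, here is a minimal sketch on made-up data (the array sizes and the four features are illustrative assumptions, not taken from the dataset). Intervals of different lengths cannot be stacked into a single array, whereas intervals cut to a common length can.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three hypothetical intervals with 40, 25 and 90 flights, 4 features each
intervals = [rng.normal(size=(n_flights, 4)) for n_flights in (40, 25, 90)]

# Stacking ragged intervals fails: np.stack requires identical shapes
try:
    np.stack(intervals)
    stacked_ragged = True
except ValueError:
    stacked_ragged = False

# Keeping the same number of flights (here the first 25) makes stacking possible
fixed_length = 25
batch = np.stack([interval[:fixed_length] for interval in intervals])

print(stacked_ragged)  # False
print(batch.shape)     # (3, 25, 4)
```

The resulting array has one axis for the intervals, one for the flights and one for the features, which is the kind of regular input a CNN layer expects.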

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [2]:
# Import the preprocessed data
path_df = r'D:/Données/ENSAE/2A/S2/Séminaire de modélisation statistique/pretraitement.csv'
safran=pd.read_csv(path_df ,sep=',', encoding='latin-1')

In [3]:
safran

Unnamed: 0.1,date,Unnamed: 0,engine_serial_number,engine_family,engine_series,cycles,cycles_counter,egt_margin,var_mot_1,flight_leg_hours,...,Interpolate_flight_leg_hours,Interpolate_SV_rank,Interpolate_Config_B_rank,Interpolate_WW_rank,Interpolate_var_env_1,Interpolate_var_env_2,Interpolate_var_env_3,Interpolate_var_env_4,Interpolate_var_env_5,Interpolate_egt_slope
0,2019-04-29 06:29:58,1,ESN_1,Engine_family_1,Engine_series_1,14.699402,14,0.881646,-0.313549,0.857778,...,0.857778,0.0,0.0,0.0,-0.261068,0.193871,0.448627,0.0,0.601803,-0.029193
1,2019-04-29 08:10:00,2,ESN_1,Engine_family_1,Engine_series_1,15.284274,15,0.792029,0.006330,0.794167,...,0.794167,0.0,0.0,0.0,-0.064202,0.273855,1.500848,0.0,-1.056965,-0.029193
2,2019-04-29 09:55:00,3,ESN_1,Engine_family_1,Engine_series_1,15.898185,16,0.706729,-0.286324,0.736667,...,0.736667,0.0,0.0,0.0,-0.292673,0.193871,0.764293,0.0,0.149412,-0.029193
3,2019-04-29 11:36:53,4,ESN_1,Engine_family_1,Engine_series_1,16.493874,17,0.702078,0.430174,0.802500,...,0.802500,0.0,0.0,0.0,0.070056,0.273855,1.500848,0.0,-1.056965,-0.029193
4,2019-04-30 04:28:40,5,ESN_1,Engine_family_1,Engine_series_1,22.409543,18,0.645941,0.299420,0.817500,...,0.817500,0.0,0.0,0.0,-0.463185,0.193871,0.448627,0.0,0.601803,-0.029193
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2450270,2022-10-26 07:39:15,2911298,ESN_1369,Engine_family_1,Engine_series_6,34.993550,50,0.531868,-0.731730,2.654444,...,2.654444,0.0,0.0,0.0,0.728625,-0.356159,-0.708817,0.0,-1.207762,0.000968
2450271,2022-10-26 11:36:39,2911299,ESN_1369,Engine_family_1,Engine_series_6,35.190820,51,0.973045,0.364383,2.501667,...,2.501667,0.0,0.0,0.0,0.421328,-0.351647,-0.077484,0.0,0.199678,0.000968
2450272,2022-10-27 03:55:34,2911301,ESN_1369,Engine_family_1,Engine_series_6,36.001950,53,0.800778,0.949444,2.165000,...,2.165000,0.0,0.0,0.0,1.092857,-0.351647,0.343405,0.0,-0.403511,0.000968
2450273,2022-10-27 07:33:12,2911302,ESN_1369,Engine_family_1,Engine_series_6,36.182090,54,0.619281,-0.748008,2.536667,...,2.536667,0.0,0.0,0.0,1.383080,-0.356159,-0.708817,0.0,-0.705105,0.000968


In [4]:
safran.columns

Index(['date', 'Unnamed: 0', 'engine_serial_number', 'engine_family',
       'engine_series', 'cycles', 'cycles_counter', 'egt_margin', 'var_mot_1',
       'flight_leg_hours', 'event_rank', 'egt_slope', 'SV_indicator',
       'SV_rank', 'Config_B_indicator', 'Config_B_rank', 'WW_indicator',
       'WW_rank', 'config_A', 'config_B', 'var_env_1', 'var_env_2',
       'var_env_3', 'var_env_4', 'var_env_5', 'Interpolate_egt_margin',
       'Interpolate_var_mot_1', 'Interpolate_flight_leg_hours',
       'Interpolate_SV_rank', 'Interpolate_Config_B_rank',
       'Interpolate_WW_rank', 'Interpolate_var_env_1', 'Interpolate_var_env_2',
       'Interpolate_var_env_3', 'Interpolate_var_env_4',
       'Interpolate_var_env_5', 'Interpolate_egt_slope'],
      dtype='object')

In [5]:
# For clarity, keep only the columns that have already been processed and those needed to handle the intervals
safran_2 = safran[['date',"engine_serial_number",'engine_series', 'cycles', 'cycles_counter','Interpolate_egt_margin',
       'Interpolate_var_mot_1', 'Interpolate_flight_leg_hours',
       'Interpolate_SV_rank', 'Interpolate_Config_B_rank',
       'Interpolate_WW_rank', 'Interpolate_var_env_1', 'Interpolate_var_env_2',
       'Interpolate_var_env_3', 'Interpolate_var_env_4',
       'Interpolate_var_env_5', 'Interpolate_egt_slope']]

## 1) Choosing the interval size

In [6]:
# For readability, keep only the two grouping keys and a single value column
safran_3 = safran_2[["engine_serial_number", "Interpolate_WW_rank", "Interpolate_egt_slope"]]

In [7]:
# Since we use .count(), any column can be used to count the flights in each interval
safran_3.groupby(by=["engine_serial_number", "Interpolate_WW_rank"]).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,Interpolate_egt_slope
engine_serial_number,Interpolate_WW_rank,Unnamed: 2_level_1
ESN_1,0.0,9884
ESN_1,1.0,2116
ESN_1,2.0,3552
ESN_1,3.0,1000
ESN_1,4.0,2584
...,...,...
ESN_998,0.0,505
ESN_998,1.0,38
ESN_999,0.0,465
ESN_999,1.0,442


There is often more data in the interval before the first water wash (Interpolate_WW_rank = 0), and somewhat less in the following intervals.
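One way to check this impression beyond the few rows displayed is to compare the typical interval length for each water-wash rank. The sketch below uses a small invented DataFrame (the `toy` data is an assumption); only the column names come from the notebook.

```python
import pandas as pd

# Toy data (hypothetical): one row per flight
toy = pd.DataFrame({
    "engine_serial_number": ["ESN_A"] * 6 + ["ESN_B"] * 5,
    "Interpolate_WW_rank": [0.0, 0.0, 0.0, 0.0, 1.0, 1.0,
                            0.0, 0.0, 0.0, 1.0, 2.0],
})

# Number of flights in each (engine, WW rank) interval
flights_per_interval = toy.groupby(
    ["engine_serial_number", "Interpolate_WW_rank"]
).size()

# Median interval length for each WW rank, across engines
median_by_rank = flights_per_interval.groupby(level="Interpolate_WW_rank").median()
print(median_by_rank)
```

Applied to `safran_3` instead of `toy`, the same two `groupby` calls would show whether rank 0 intervals are indeed longer on average.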

- **Intervals with the least data**

Let us now look at the intervals that contain the fewest flights.

In [8]:
safran_3.groupby(by=["engine_serial_number", "Interpolate_WW_rank"]).count().sort_values(by="Interpolate_egt_slope").head(120)

Unnamed: 0_level_0,Unnamed: 1_level_0,Interpolate_egt_slope
engine_serial_number,Interpolate_WW_rank,Unnamed: 2_level_1
ESN_201,4.0,1
ESN_137,1.0,1
ESN_1073,2.0,2
ESN_164,3.0,2
ESN_1043,2.0,2
...,...,...
ESN_766,5.0,29
ESN_765,5.0,29
ESN_1078,2.0,29
ESN_300,8.0,29


In [9]:
saf_group = safran_3.groupby(by=["engine_serial_number", "Interpolate_WW_rank"]).count().sort_values(by="Interpolate_egt_slope")
saf_group = saf_group.rename(columns = {'Interpolate_egt_slope': 'Nb_vols_entre_WW'})

# engine_serial_number and Interpolate_WW_rank are still in the index; move them back to regular columns
saf_group = saf_group.reset_index()

# Add a column holding an identifier for each interval,
# called id_int (interval identifier). The frame is already sorted by
# Nb_vols_entre_WW, so id_int increases with the number of flights.
saf_group['id_int'] = saf_group.index
saf_group


Unnamed: 0,engine_serial_number,Interpolate_WW_rank,Nb_vols_entre_WW,id_int
0,ESN_201,4.0,1,0
1,ESN_137,1.0,1,1
2,ESN_1073,2.0,2,2
3,ESN_164,3.0,2,3
4,ESN_1043,2.0,2,4
...,...,...,...,...
6345,ESN_5,0.0,2672,6345
6346,ESN_15,0.0,2676,6346
6347,ESN_10,0.0,2827,6347
6348,ESN_1,2.0,3552,6348


In [10]:
saf_group.dtypes

engine_serial_number     object
Interpolate_WW_rank     float64
Nb_vols_entre_WW          int64
id_int                    int64
dtype: object

In [11]:
# This lets us display the intervals whose number of flights is at or below a given threshold
saf_group[(saf_group.Nb_vols_entre_WW <= 25)]

Unnamed: 0,engine_serial_number,Interpolate_WW_rank,Nb_vols_entre_WW,id_int
0,ESN_201,4.0,1,0
1,ESN_137,1.0,1,1
2,ESN_1073,2.0,2,2
3,ESN_164,3.0,2,3
4,ESN_1043,2.0,2,4
...,...,...,...,...
101,ESN_941,0.0,24,101
102,ESN_824,2.0,24,102
103,ESN_680,2.0,24,103
104,ESN_1311,1.0,24,104


In [12]:
# Share of intervals at or below several candidate thresholds
for threshold in (25, 50, 100, 150):
    share = (saf_group.Nb_vols_entre_WW <= threshold).mean() * 100
    print("Intervals with at most", threshold, "flights account for", round(share, 2), "% of all intervals")

Intervals with at most 25 flights account for 1.67 % of all intervals
Intervals with at most 50 flights account for 3.26 % of all intervals
Intervals with at most 100 flights account for 8.02 % of all intervals
Intervals with at most 150 flights account for 14.25 % of all intervals


Somewhat arbitrarily, I start with 25 flights per interval so as to lose as few intervals as possible; this value can be raised later.
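The trade-off behind this choice can be tabulated: a higher threshold drops more intervals but keeps more flights in each remaining one. The helper below is a sketch with a hypothetical name (`threshold_summary`) and invented counts. It follows the notebook's convention of keeping only intervals with strictly more flights than the threshold.

```python
import pandas as pd

def threshold_summary(counts: pd.Series, thresholds) -> pd.DataFrame:
    """For each threshold, report the intervals kept (strictly more flights
    than the threshold) and the rows obtained after sampling `threshold`
    flights from each kept interval."""
    rows = []
    for threshold in thresholds:
        kept = int((counts > threshold).sum())
        rows.append({
            "threshold": threshold,
            "intervals_kept": kept,
            "share_dropped_pct": round(100 * (1 - kept / len(counts)), 2),
            "rows_after_sampling": threshold * kept,
        })
    return pd.DataFrame(rows)

# Hypothetical flight counts per interval
toy_counts = pd.Series([1, 2, 24, 26, 60, 120, 400, 3000])
summary = threshold_summary(toy_counts, [25, 50, 100])
print(summary)
```

On the real data, the call would presumably be `threshold_summary(saf_group["Nb_vols_entre_WW"], [25, 50, 100, 150])`.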


## 2) Sampling

### A) Minimum of 25 flights per interval

In [13]:
saf_group.sort_values(by="Nb_vols_entre_WW")

Unnamed: 0,engine_serial_number,Interpolate_WW_rank,Nb_vols_entre_WW,id_int
0,ESN_201,4.0,1,0
1,ESN_137,1.0,1,1
2,ESN_1073,2.0,2,2
3,ESN_164,3.0,2,3
4,ESN_1043,2.0,2,4
...,...,...,...,...
6345,ESN_5,0.0,2672,6345
6346,ESN_15,0.0,2676,6346
6347,ESN_10,0.0,2827,6347
6348,ESN_1,2.0,3552,6348


In [14]:
safran_2_int = pd.merge(safran_2, saf_group, on=["engine_serial_number", "Interpolate_WW_rank"], how="inner")
safran_2_int

Unnamed: 0,date,engine_serial_number,engine_series,cycles,cycles_counter,Interpolate_egt_margin,Interpolate_var_mot_1,Interpolate_flight_leg_hours,Interpolate_SV_rank,Interpolate_Config_B_rank,Interpolate_WW_rank,Interpolate_var_env_1,Interpolate_var_env_2,Interpolate_var_env_3,Interpolate_var_env_4,Interpolate_var_env_5,Interpolate_egt_slope,Nb_vols_entre_WW,id_int
0,2019-04-29 06:29:58,ESN_1,Engine_series_1,14.699402,14,0.881646,-0.313549,0.857778,0.0,0.0,0.0,-0.261068,0.193871,0.448627,0.0,0.601803,-0.029193,9884,6349
1,2019-04-29 08:10:00,ESN_1,Engine_series_1,15.284274,15,0.792029,0.006330,0.794167,0.0,0.0,0.0,-0.064202,0.273855,1.500848,0.0,-1.056965,-0.029193,9884,6349
2,2019-04-29 09:55:00,ESN_1,Engine_series_1,15.898185,16,0.706729,-0.286324,0.736667,0.0,0.0,0.0,-0.292673,0.193871,0.764293,0.0,0.149412,-0.029193,9884,6349
3,2019-04-29 11:36:53,ESN_1,Engine_series_1,16.493874,17,0.702078,0.430174,0.802500,0.0,0.0,0.0,0.070056,0.273855,1.500848,0.0,-1.056965,-0.029193,9884,6349
4,2019-04-30 04:28:40,ESN_1,Engine_series_1,22.409543,18,0.645941,0.299420,0.817500,0.0,0.0,0.0,-0.463185,0.193871,0.448627,0.0,0.601803,-0.029193,9884,6349
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2450270,2022-10-26 07:39:15,ESN_1369,Engine_series_6,34.993550,50,0.531868,-0.731730,2.654444,0.0,0.0,0.0,0.728625,-0.356159,-0.708817,0.0,-1.207762,0.000968,34,137
2450271,2022-10-26 11:36:39,ESN_1369,Engine_series_6,35.190820,51,0.973045,0.364383,2.501667,0.0,0.0,0.0,0.421328,-0.351647,-0.077484,0.0,0.199678,0.000968,34,137
2450272,2022-10-27 03:55:34,ESN_1369,Engine_series_6,36.001950,53,0.800778,0.949444,2.165000,0.0,0.0,0.0,1.092857,-0.351647,0.343405,0.0,-0.403511,0.000968,34,137
2450273,2022-10-27 07:33:12,ESN_1369,Engine_series_6,36.182090,54,0.619281,-0.748008,2.536667,0.0,0.0,0.0,1.383080,-0.356159,-0.708817,0.0,-0.705105,0.000968,34,137


In [15]:
# Kept as a separate cell for now; it could later be merged with the sampling into a single step

df_mauvais = safran_2_int[safran_2_int["Nb_vols_entre_WW"].between(0, 25)]  # Flights we drop (intervals with at most 25 flights)
df_keep = safran_2_int[~safran_2_int["Nb_vols_entre_WW"].between(0, 25)]  # Flights we keep
df_keep
# We keep only the intervals with more than 25 flights

Unnamed: 0,date,engine_serial_number,engine_series,cycles,cycles_counter,Interpolate_egt_margin,Interpolate_var_mot_1,Interpolate_flight_leg_hours,Interpolate_SV_rank,Interpolate_Config_B_rank,Interpolate_WW_rank,Interpolate_var_env_1,Interpolate_var_env_2,Interpolate_var_env_3,Interpolate_var_env_4,Interpolate_var_env_5,Interpolate_egt_slope,Nb_vols_entre_WW,id_int
0,2019-04-29 06:29:58,ESN_1,Engine_series_1,14.699402,14,0.881646,-0.313549,0.857778,0.0,0.0,0.0,-0.261068,0.193871,0.448627,0.0,0.601803,-0.029193,9884,6349
1,2019-04-29 08:10:00,ESN_1,Engine_series_1,15.284274,15,0.792029,0.006330,0.794167,0.0,0.0,0.0,-0.064202,0.273855,1.500848,0.0,-1.056965,-0.029193,9884,6349
2,2019-04-29 09:55:00,ESN_1,Engine_series_1,15.898185,16,0.706729,-0.286324,0.736667,0.0,0.0,0.0,-0.292673,0.193871,0.764293,0.0,0.149412,-0.029193,9884,6349
3,2019-04-29 11:36:53,ESN_1,Engine_series_1,16.493874,17,0.702078,0.430174,0.802500,0.0,0.0,0.0,0.070056,0.273855,1.500848,0.0,-1.056965,-0.029193,9884,6349
4,2019-04-30 04:28:40,ESN_1,Engine_series_1,22.409543,18,0.645941,0.299420,0.817500,0.0,0.0,0.0,-0.463185,0.193871,0.448627,0.0,0.601803,-0.029193,9884,6349
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2450270,2022-10-26 07:39:15,ESN_1369,Engine_series_6,34.993550,50,0.531868,-0.731730,2.654444,0.0,0.0,0.0,0.728625,-0.356159,-0.708817,0.0,-1.207762,0.000968,34,137
2450271,2022-10-26 11:36:39,ESN_1369,Engine_series_6,35.190820,51,0.973045,0.364383,2.501667,0.0,0.0,0.0,0.421328,-0.351647,-0.077484,0.0,0.199678,0.000968,34,137
2450272,2022-10-27 03:55:34,ESN_1369,Engine_series_6,36.001950,53,0.800778,0.949444,2.165000,0.0,0.0,0.0,1.092857,-0.351647,0.343405,0.0,-0.403511,0.000968,34,137
2450273,2022-10-27 07:33:12,ESN_1369,Engine_series_6,36.182090,54,0.619281,-0.748008,2.536667,0.0,0.0,0.0,1.383080,-0.356159,-0.708817,0.0,-0.705105,0.000968,34,137


In [16]:
# WARNING: this cell can take 2-3 minutes to run
df_ech = df_keep.groupby("id_int").sample(25)  # sample(25) draws 25 random flights per interval
df_ech

Unnamed: 0,date,engine_serial_number,engine_series,cycles,cycles_counter,Interpolate_egt_margin,Interpolate_var_mot_1,Interpolate_flight_leg_hours,Interpolate_SV_rank,Interpolate_Config_B_rank,Interpolate_WW_rank,Interpolate_var_env_1,Interpolate_var_env_2,Interpolate_var_env_3,Interpolate_var_env_4,Interpolate_var_env_5,Interpolate_egt_slope,Nb_vols_entre_WW,id_int
2416411,2022-10-20 19:21:42,ESN_1170,Engine_series_1,374.547400,390,0.560847,-0.943725,3.970833,0.0,0.0,2.0,-0.459548,-0.364009,0.027738,0.0,1.154726,0.003373,26,106
2416406,2022-10-19 12:19:56,ESN_1170,Engine_series_1,369.792800,384,0.371915,0.734473,2.147500,0.0,0.0,2.0,-0.502017,-0.324236,-1.234928,0.0,0.853132,0.003373,26,106
2416413,2022-10-22 11:32:55,ESN_1170,Engine_series_1,380.534200,395,0.125859,0.273162,1.969444,0.0,0.0,2.0,-0.604503,-0.211525,-1.866260,1.0,0.853132,0.003373,26,106
2416407,2022-10-20 02:43:10,ESN_1170,Engine_series_1,372.019300,386,0.307502,0.947934,3.203056,0.0,0.0,2.0,-0.593356,-0.266357,-1.819222,0.0,0.416712,0.003373,26,106
2416408,2022-10-20 08:02:21,ESN_1170,Engine_series_1,372.818900,387,0.183692,0.906828,0.841944,0.0,0.0,2.0,-0.608678,-0.211525,-1.550594,0.0,-0.353245,0.003373,26,106
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8555,2019-10-17 06:27:17,ESN_1,Engine_series_1,1200.122426,1203,-0.031293,0.332825,0.907500,0.0,0.0,0.0,1.209277,0.159720,-1.445372,0.0,1.104460,-0.012429,9884,6349
1770,2020-02-17 03:20:57,ESN_1,Engine_series_1,1847.259673,1854,-0.422398,0.620519,0.790833,0.0,2.0,0.0,-0.585515,0.344733,-0.708817,0.0,1.154726,-0.010711,9884,6349
5010,2019-05-09 05:04:22,ESN_1,Engine_series_1,98.392466,82,0.551587,0.307188,0.786389,0.0,0.0,0.0,-0.469093,0.193871,0.553849,0.0,0.400740,-0.029193,9884,6349
4867,2020-10-18 08:42:16,ESN_1,Engine_series_1,2511.942936,2522,-0.479514,0.764486,0.764722,1.0,4.0,0.0,-0.258746,-0.245680,-1.234928,1.0,1.154726,-0.015459,9884,6349


In [17]:
# 6350 intervals in total, 106 of which have at most 25 flights and were dropped
print("We expect a dataset with", 25*(6350-106), "rows")
print("Our dataset has", df_ech.shape[0], "rows")
if df_ech.shape[0] == 25*(6350-106):
    print("Great, hooray!!!")

We expect a dataset with 156100 rows
Our dataset has 156100 rows
Great, hooray!!!
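The comment in the filtering cell mentions doing everything in a single step. A possible sketch is given here, under the assumption that the interval size can be computed directly with `groupby(...).transform("size")` rather than through the merge. The function name `sample_fixed_length`, the toy data and the fixed `random_state` (added for reproducibility) are illustrative choices, not part of the notebook.

```python
import pandas as pd

def sample_fixed_length(df, keys, n_flights, random_state=None):
    """Keep the groups with more than n_flights rows, then draw
    n_flights rows at random from each of them."""
    group_size = df.groupby(keys)[keys[0]].transform("size")
    kept = df[group_size > n_flights]
    return kept.groupby(keys).sample(n=n_flights, random_state=random_state)

# Toy data (hypothetical): intervals of 5, 2 and 4 flights
toy = pd.DataFrame({
    "engine_serial_number": ["ESN_A"] * 5 + ["ESN_B"] * 2 + ["ESN_C"] * 4,
    "Interpolate_WW_rank": [0.0] * 5 + [0.0] * 2 + [1.0] * 4,
    "Interpolate_egt_slope": range(11),
})

keys = ["engine_serial_number", "Interpolate_WW_rank"]
sampled = sample_fixed_length(toy, keys, n_flights=3, random_state=0)
print(sampled.groupby(keys).size())
```

The interval with only 2 flights is dropped and the two others contribute exactly 3 flights each. Passing a `random_state` makes the draw repeatable from one run to the next.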


Remark on improving the sampling: 
we could use data augmentation. It may take a bit more time, but it would let us keep more data. In my notebook we keep only 25 data points per interval; if we want, we can easily keep 100. To keep more, we would then need to look at the intervals that do not have enough data and add some artificially, so that we could still use them. 
What do you think? 
Cheers! 
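One very simple way to follow up on this idea, shown purely as an illustration and not as the method used in this notebook, is to pad short intervals by resampling their own flights with replacement. The function name `pad_or_sample` and the toy data are assumptions. More elaborate augmentation (added noise, interpolation) would need its own validation.

```python
import pandas as pd

def pad_or_sample(group: pd.DataFrame, n_flights: int, random_state=None) -> pd.DataFrame:
    """Return exactly n_flights rows: a plain sample for long intervals,
    a sample with replacement (duplicated flights) for short ones."""
    needs_padding = len(group) < n_flights
    return group.sample(n=n_flights, replace=needs_padding, random_state=random_state)

# Toy data (hypothetical): interval 0 has 2 flights, interval 1 has 6
toy = pd.DataFrame({
    "id_int": [0, 0, 1, 1, 1, 1, 1, 1],
    "Interpolate_egt_margin": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
})

parts = [
    pad_or_sample(group, n_flights=4, random_state=0)
    for _, group in toy.groupby("id_int")
]
augmented = pd.concat(parts)
print(augmented.groupby("id_int").size())
```

Both intervals end up with 4 rows. The short one necessarily contains repeated flights, which is worth keeping in mind when judging how much new information such padding really adds.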

### B) Minimum of 50 flights per interval
- With a minimum of 50 flights per interval, we get the following:


In [18]:
# Kept as a separate cell for now; it could later be merged with the sampling into a single step

df_mauvais_50 = safran_2_int[safran_2_int["Nb_vols_entre_WW"].between(0, 50)]  # Flights we drop (intervals with at most 50 flights)
df_keep_50 = safran_2_int[~safran_2_int["Nb_vols_entre_WW"].between(0, 50)]  # Flights we keep
df_keep_50
# We keep only the intervals with more than 50 flights

Unnamed: 0,date,engine_serial_number,engine_series,cycles,cycles_counter,Interpolate_egt_margin,Interpolate_var_mot_1,Interpolate_flight_leg_hours,Interpolate_SV_rank,Interpolate_Config_B_rank,Interpolate_WW_rank,Interpolate_var_env_1,Interpolate_var_env_2,Interpolate_var_env_3,Interpolate_var_env_4,Interpolate_var_env_5,Interpolate_egt_slope,Nb_vols_entre_WW,id_int
0,2019-04-29 06:29:58,ESN_1,Engine_series_1,14.699402,14,0.881646,-0.313549,0.857778,0.0,0.0,0.0,-0.261068,0.193871,0.448627,0.0,0.601803,-0.029193,9884,6349
1,2019-04-29 08:10:00,ESN_1,Engine_series_1,15.284274,15,0.792029,0.006330,0.794167,0.0,0.0,0.0,-0.064202,0.273855,1.500848,0.0,-1.056965,-0.029193,9884,6349
2,2019-04-29 09:55:00,ESN_1,Engine_series_1,15.898185,16,0.706729,-0.286324,0.736667,0.0,0.0,0.0,-0.292673,0.193871,0.764293,0.0,0.149412,-0.029193,9884,6349
3,2019-04-29 11:36:53,ESN_1,Engine_series_1,16.493874,17,0.702078,0.430174,0.802500,0.0,0.0,0.0,0.070056,0.273855,1.500848,0.0,-1.056965,-0.029193,9884,6349
4,2019-04-30 04:28:40,ESN_1,Engine_series_1,22.409543,18,0.645941,0.299420,0.817500,0.0,0.0,0.0,-0.463185,0.193871,0.448627,0.0,0.601803,-0.029193,9884,6349
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2450219,2022-10-27 03:49:41,ESN_1360,Engine_series_1,254.604500,274,0.919735,0.869409,1.854167,0.0,0.0,0.0,0.064519,-0.368762,0.238182,0.0,0.451006,-0.004884,210,1407
2450220,2022-10-27 12:43:08,ESN_1360,Engine_series_1,256.982600,275,0.686130,0.279454,1.973611,0.0,0.0,0.0,-0.223848,-0.311284,-0.919261,0.0,-1.056965,-0.004884,210,1407
2450221,2022-10-27 15:30:42,ESN_1360,Engine_series_1,257.729600,276,0.721306,0.346712,1.688056,0.0,0.0,0.0,-0.334779,-0.303992,-1.971483,0.0,0.551538,-0.004884,210,1407
2450222,2022-10-27 18:33:59,ESN_1360,Engine_series_1,258.546700,277,0.675019,0.313983,1.695556,0.0,0.0,0.0,1.174560,-0.311284,0.132960,0.0,-2.363873,-0.004884,210,1407


In [20]:
# WARNING: this cell can take 2-3 minutes to run
df_ech = df_keep_50.groupby("id_int").sample(50)  # sample(50) draws 50 random flights per interval
df_ech

Unnamed: 0,date,engine_serial_number,engine_series,cycles,cycles_counter,Interpolate_egt_margin,Interpolate_var_mot_1,Interpolate_flight_leg_hours,Interpolate_SV_rank,Interpolate_Config_B_rank,Interpolate_WW_rank,Interpolate_var_env_1,Interpolate_var_env_2,Interpolate_var_env_3,Interpolate_var_env_4,Interpolate_var_env_5,Interpolate_egt_slope,Nb_vols_entre_WW,id_int
1935190,2022-10-26 13:45:26,ESN_670,Engine_series_1,1940.234000,2090,0.770653,0.014156,2.316389,0.0,0.0,4.0,-0.069661,-0.365579,-2.287149,0.0,0.149412,-0.038316,51,207
1935154,2022-10-18 12:40:33,ESN_670,Engine_series_1,1902.085000,2047,0.798438,-1.189130,4.230278,0.0,0.0,4.0,-0.324595,-0.366001,-2.497593,0.0,-0.101916,-0.038316,51,207
1935192,2022-10-27 01:41:06,ESN_670,Engine_series_1,1942.591000,2093,0.881217,1.025370,2.168333,0.0,0.0,4.0,-0.147171,-0.368209,-0.393150,0.0,0.702335,-0.038316,51,207
1935175,2022-10-23 04:29:39,ESN_670,Engine_series_1,1924.178000,2071,0.874190,-0.180022,2.260556,0.0,0.0,4.0,-0.236126,-0.367801,-0.708817,0.0,-0.353245,-0.038316,51,207
1935173,2022-10-22 20:58:35,ESN_670,Engine_series_1,1922.694000,2069,0.780252,-0.485308,1.341667,0.0,0.0,4.0,-0.516367,-0.057779,0.027738,0.0,-1.559622,-0.038316,51,207
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6846,2020-03-05 18:21:38,ESN_1,Engine_series_1,1988.873190,1997,-0.619881,0.184972,0.844444,0.0,2.0,0.0,-0.401689,-0.034603,-0.393150,0.0,0.652069,-0.010711,9884,6349
9080,2020-02-04 04:08:39,ESN_1,Engine_series_1,1743.008054,1750,-0.362595,0.475404,0.986389,0.0,2.0,0.0,0.522306,0.344733,-0.603595,0.0,0.551538,-0.010711,9884,6349
8704,2019-11-15 09:28:20,ESN_1,Engine_series_1,1348.079572,1360,-0.125952,-0.403763,3.236667,0.0,0.0,0.0,-0.301352,0.207607,-1.129705,0.0,-1.258028,-0.012429,9884,6349
1287,2019-11-14 21:26:28,ESN_1,Engine_series_1,1345.483573,1356,-0.256136,-0.224207,1.600833,0.0,0.0,0.0,0.025375,0.207607,-1.971483,0.0,0.199678,-0.012429,9884,6349


In [22]:
# 6350 intervals in total, minus those with at most 50 flights
print("We expect a dataset with", 50*(6350-saf_group[(saf_group.Nb_vols_entre_WW <= 50)].shape[0]), "rows")
print("Our dataset has", df_ech.shape[0], "rows")
if df_ech.shape[0] == 50*(6350-saf_group[(saf_group.Nb_vols_entre_WW <= 50)].shape[0]):
    print("Great, hooray!!!")

We expect a dataset with 307150 rows
Our dataset has 307150 rows
Great, hooray!!!
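Once every interval has the same number of flights, the sampled table can be reshaped into the array a CNN would consume. The sketch below assumes that flights should be put back in chronological order within each interval, since `sample` returns them shuffled. The function name `to_tensor`, the feature names `f1` and `f2` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def to_tensor(df, id_col, time_col, feature_cols, n_flights):
    """Sort flights chronologically within each interval and reshape to
    (n_intervals, n_flights, n_features)."""
    ordered = df.sort_values([id_col, time_col])
    sizes = ordered.groupby(id_col).size()
    if not (sizes == n_flights).all():
        raise ValueError("every interval must contain exactly n_flights rows")
    values = ordered[feature_cols].to_numpy()
    return values.reshape(len(sizes), n_flights, len(feature_cols))

# Toy data (hypothetical): 2 intervals of 3 flights, given in shuffled order
toy = pd.DataFrame({
    "id_int": [1, 0, 1, 0, 0, 1],
    "date": pd.to_datetime([
        "2022-01-03", "2022-01-02", "2022-01-01",
        "2022-01-01", "2022-01-03", "2022-01-02",
    ]),
    "f1": [13.0, 2.0, 11.0, 1.0, 3.0, 12.0],
    "f2": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
})

tensor = to_tensor(toy, "id_int", "date", ["f1", "f2"], n_flights=3)
print(tensor.shape)  # (2, 3, 2)
```

With the notebook's data, the call would look like `to_tensor(df_ech, "id_int", "date", feature_cols, 50)`, where `feature_cols` is whichever list of `Interpolate_*` columns is chosen as model input.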
