# Data Cleaning and Preparation

In [153]:
import pandas as pd
import datetime

data = pd.read_csv("../data/data_concatenated.csv")
data.head()

Unnamed: 0,date,airline,ch_code,num_code,dep_time,from,time_taken,stop,arr_time,to,price,class
0,11-02-2022,Air India,AI,868,18:00,Delhi,02h 00m,non-stop,20:00,Mumbai,25612,business
1,11-02-2022,Air India,AI,624,19:00,Delhi,02h 15m,non-stop,21:15,Mumbai,25612,business
2,11-02-2022,Air India,AI,531,20:00,Delhi,24h 45m,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,20:45,Mumbai,42220,business
3,11-02-2022,Air India,AI,839,21:25,Delhi,26h 30m,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,23:55,Mumbai,44450,business
4,11-02-2022,Air India,AI,544,17:15,Delhi,06h 40m,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,23:55,Mumbai,46690,business


We start by removing duplicate rows.

In [154]:
data.drop_duplicates(inplace=True, ignore_index=True)
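
As a quick illustration of what `ignore_index=True` changes, here is a minimal sketch on a hypothetical toy frame (`toy` is not part of the dataset). A contiguous index matters for any code that addresses rows by the labels `0..len(data)-1`.

```python
import pandas as pd

# Hypothetical toy frame (not the real dataset); row 1 duplicates row 0.
toy = pd.DataFrame({
    "airline": ["Air India", "Air India", "Vistara"],
    "price": [25612, 25612, 50264],
})

# Without ignore_index, the surviving rows keep their old labels.
kept_labels = toy.drop_duplicates().index.tolist()

# With ignore_index=True, the result is renumbered from 0.
deduped = toy.drop_duplicates(ignore_index=True)
new_labels = deduped.index.tolist()

print(kept_labels, new_labels)
```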

We also drop the columns we do not need.

In [155]:
data.drop(columns=["ch_code", "stop"], inplace=True)

We convert the price column to numeric values.

In [156]:
data["price"] = data["price"].str.replace(",", "")
data["price"] = pd.to_numeric(data["price"], errors="coerce")
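
The two steps above can be checked on a small sample. This is only a sketch: the `raw` values are made up, and `"n/a"` stands in for any malformed entry, which `errors="coerce"` turns into `NaN` instead of raising.

```python
import pandas as pd

# Hypothetical raw price strings with thousands separators.
raw = pd.Series(["25,612", "1,105", "n/a"])

# Strip the separator literally (regex=False), then convert to numbers.
cleaned = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

print(cleaned)
```

Because of the coerced `NaN`, the resulting column is floating point, so it is worth checking `isnull().sum()` afterwards.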

Next, we convert the date column to datetime, and the departure and arrival time columns to `datetime.time` objects.

In [157]:
data["date"] = pd.to_datetime(data["date"], format="%d-%m-%Y", errors="coerce")
data["dep_time"] = pd.to_datetime(data["dep_time"], format="%H:%M", errors="coerce").dt.time
data["arr_time"] = pd.to_datetime(data["arr_time"], format="%H:%M", errors="coerce").dt.time
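
A small sketch of what these conversions produce, using made-up sample values in the same layout as the raw columns. The explicit `format` makes sure `11-02-2022` is read day-first, as 11 February rather than 2 November.

```python
import datetime
import pandas as pd

# Hypothetical sample values (not read from the dataset).
raw_dates = pd.Series(["11-02-2022", "31-03-2022", "not a date"])
raw_times = pd.Series(["18:00", "23:55"])

# Day-first dates; anything that does not match the format becomes NaT.
dates = pd.to_datetime(raw_dates, format="%d-%m-%Y", errors="coerce")

# Times are parsed as datetimes first, then reduced to datetime.time objects.
times = pd.to_datetime(raw_times, format="%H:%M", errors="coerce").dt.time

print(dates)
print(times)
```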

Converting the `time_taken` variable to `datetime.time` objects

In [158]:
data["time_taken"] = pd.to_datetime(data["time_taken"], format="%Hh %Mm", errors="coerce").dt.time
# Values that failed to parse (missing after coercion) are set to 23:00
data.loc[data["time_taken"].isnull(), "time_taken"] = datetime.time(hour=23)
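
The `%H` directive only accepts hours from 00 to 23, so a duration such as `24h 45m` cannot be parsed and is coerced to a missing value, which the fill then replaces with 23:00. The sketch below shows that effect on a few sample strings and, as one possible alternative (not the approach used in this notebook), extracts total minutes with a regular expression so that long durations keep their real value. The names `raw`, `parts` and `total_minutes` are illustrative.

```python
import pandas as pd

# Sample duration strings in the dataset's "HHh MMm" layout.
raw = pd.Series(["02h 00m", "24h 45m", "06h 40m"])

# Clock-time parsing: hours above 23 are rejected and become NaT.
parsed = pd.to_datetime(raw, format="%Hh %Mm", errors="coerce")
unparsed_mask = parsed.isna()

# Alternative sketch: pull out hours and minutes as numbers.
parts = raw.str.extract(r"(\d+)h\s*(\d+)m")
hours = pd.to_numeric(parts[0], errors="coerce")
minutes = pd.to_numeric(parts[1], errors="coerce")
total_minutes = hours * 60 + minutes

print(unparsed_mask.tolist())
print(total_minutes.tolist())
```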

Finally, we turn the `time_taken`, `dep_time` and `arr_time` columns into categorical variables.

Time-of-day categories (lower bound included, upper bound excluded):
- Dawn (`aube`): 04:00 - 06:00
- Morning (`matin`): 06:00 - 12:00
- Afternoon (`apres-midi`): 12:00 - 18:00
- Evening (`soir`): 18:00 - 21:00
- Night (`nuit`): 21:00 - 04:00

For the duration, we use:
- `t_court` (very short): 0h to 2h
- `court` (short): 2h to 4h
- `moyen` (medium): 4h to 8h
- `long` (long): 8h to 12h
- `t_long` (very long): 12h and above

In [159]:
def classification_time(t):
    """Map a datetime.time value to its time-of-day label."""
    if datetime.time(hour=4) <= t < datetime.time(hour=6):
        return "aube"
    elif datetime.time(hour=6) <= t < datetime.time(hour=12):
        return "matin"
    elif datetime.time(hour=12) <= t < datetime.time(hour=18):
        return "apres-midi"
    elif datetime.time(hour=18) <= t < datetime.time(hour=21):
        return "soir"
    else:
        # 21:00 to 04:00, wrapping around midnight
        return "nuit"

# Apply the function element-wise to each column
data["dep_time"] = data["dep_time"].map(classification_time)
data["arr_time"] = data["arr_time"].map(classification_time)
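
The same classification can also be expressed with array operations. The sketch below is one possible alternative, not the notebook's method: it assumes it runs on the raw `HH:MM` strings (before they are reduced to `datetime.time`), and it relies on the fact that every boundary falls on a whole hour, so the hour alone decides the label. The sample values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of raw departure times.
raw_times = pd.Series(["18:00", "21:25", "05:30", "00:15", "12:00"])
hour = pd.to_datetime(raw_times, format="%H:%M").dt.hour

# One boolean mask per category; "nuit" is the fallback because it
# wraps around midnight (21:00 to 04:00).
conditions = [
    (hour >= 4) & (hour < 6),
    (hour >= 6) & (hour < 12),
    (hour >= 12) & (hour < 18),
    (hour >= 18) & (hour < 21),
]
labels = ["aube", "matin", "apres-midi", "soir"]

period = pd.Series(
    np.select(conditions, labels, default="nuit"),
    index=raw_times.index,
)
print(period.tolist())
```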

In [160]:
data[["dep_time", "arr_time"]]

Unnamed: 0,dep_time,arr_time
0,soir,soir
1,soir,nuit
2,soir,soir
3,nuit,nuit
4,apres-midi,nuit
...,...,...
300254,matin,soir
300255,matin,soir
300256,apres-midi,matin
300257,matin,matin


In [161]:
def classification_duree(t):
    """Map a flight duration, stored as datetime.time, to its duration label."""
    if datetime.time(hour=0) <= t < datetime.time(hour=2):
        return "t_court"
    elif datetime.time(hour=2) <= t < datetime.time(hour=4):
        return "court"
    elif datetime.time(hour=4) <= t < datetime.time(hour=8):
        return "moyen"
    elif datetime.time(hour=8) <= t < datetime.time(hour=12):
        return "long"
    else:
        # 12h and above, including the values set to 23:00 earlier
        return "t_long"

# Apply the function element-wise
data["time_taken"] = data["time_taken"].map(classification_duree)
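
Since the duration classes are consecutive, non-overlapping intervals, `pd.cut` is a natural fit. This is a sketch under the assumption that durations are available as total minutes (a hypothetical `minutes` Series here, not a column of the dataset); `right=False` reproduces the "lower bound included, upper bound excluded" rule of the function above.

```python
import pandas as pd

# Hypothetical durations in minutes (e.g. 120 = "02h 00m", 1485 = "24h 45m").
minutes = pd.Series([120, 135, 1485, 400, 60, 540])

# Bin edges in minutes: 0h, 2h, 4h, 8h, 12h, then open-ended.
bins = [0, 120, 240, 480, 720, float("inf")]
labels = ["t_court", "court", "moyen", "long", "t_long"]

duration_class = pd.cut(minutes, bins=bins, labels=labels, right=False)
print(duration_class.tolist())
```

The result is an ordered categorical, which keeps the natural ordering of the classes for later sorting or plotting.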

In [163]:
data["time_taken"].head(20)

0      court
1      court
2     t_long
3     t_long
4      moyen
5      court
6     t_long
7     t_long
8     t_long
9      court
10      long
11    t_long
12      long
13    t_long
14    t_long
15    t_long
16     moyen
17    t_long
18    t_long
19    t_long
Name: time_taken, dtype: object

In [167]:
data.describe()

Unnamed: 0,date,num_code,price
count,300259,300259.0,300259.0
mean,2022-03-08 00:06:31.342141696,1417.776883,20883.800386
min,2022-02-11 00:00:00,101.0,1105.0
25%,2022-02-25 00:00:00,637.0,4783.0
50%,2022-03-08 00:00:00,818.0,7425.0
75%,2022-03-20 00:00:00,927.0,42521.0
max,2022-03-31 00:00:00,9991.0,123071.0
std,,1974.519951,22695.96223


In [168]:
data.isnull().sum()

date          0
airline       0
num_code      0
dep_time      0
from          0
time_taken    0
arr_time      0
to            0
price         0
class         0
dtype: int64

Cleaned DataFrame

In [165]:
data.head(20)

Unnamed: 0,date,airline,num_code,dep_time,from,time_taken,arr_time,to,price,class
0,2022-02-11,Air India,868,soir,Delhi,court,soir,Mumbai,25612,business
1,2022-02-11,Air India,624,soir,Delhi,court,nuit,Mumbai,25612,business
2,2022-02-11,Air India,531,soir,Delhi,t_long,soir,Mumbai,42220,business
3,2022-02-11,Air India,839,nuit,Delhi,t_long,nuit,Mumbai,44450,business
4,2022-02-11,Air India,544,apres-midi,Delhi,moyen,nuit,Mumbai,46690,business
5,2022-02-11,Vistara,985,soir,Delhi,court,nuit,Mumbai,50264,business
6,2022-02-11,Air India,479,nuit,Delhi,t_long,apres-midi,Mumbai,50669,business
7,2022-02-11,Air India,473,soir,Delhi,t_long,apres-midi,Mumbai,51059,business
8,2022-02-11,Vistara,871,soir,Delhi,t_long,apres-midi,Mumbai,51731,business
9,2022-02-11,Vistara,977,soir,Delhi,court,nuit,Mumbai,53288,business


### Saving

In [166]:
data.to_csv("../data/data_cleaned.csv", index=False)
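
When the cleaned file is read back, the date column comes in as plain text unless it is parsed again. The sketch below illustrates the round trip on a hypothetical one-row frame, using an in-memory buffer instead of a file on disk. It also shows that `index=False` keeps an extra `Unnamed: 0` column from appearing on reload.

```python
import io
import pandas as pd

# Hypothetical one-row frame shaped like the cleaned data.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2022-02-11"]),
    "dep_time": ["soir"],
    "price": [25612],
})

# Write to an in-memory buffer without the index, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer, parse_dates=["date"])

print(reloaded.dtypes)
```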