# Data Cleaning and Manipulation

In [58]:
import numpy as np
import pandas as pd

data = pd.read_csv("../data/tmdb_movies_data.csv")

We start by removing the duplicate row that was detected.

In [59]:
data.drop_duplicates(inplace=True)
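
One way to confirm that a duplicate is really gone is to count fully duplicated rows before and after the call. The sketch below assumes a small made-up DataFrame (`toy` and its values are illustrative and not taken from the dataset).

```python
import pandas as pd

# Hypothetical toy data: the second and third rows are identical
toy = pd.DataFrame({
    "original_title": ["Film A", "Film B", "Film B"],
    "runtime": [100, 95, 95],
})

# duplicated() flags every repeat of an earlier row (the first occurrence is not flagged)
n_before = int(toy.duplicated().sum())

# Same method as in the notebook, without inplace so `toy` stays intact
deduped = toy.drop_duplicates()
n_after = int(deduped.duplicated().sum())

print(n_before, n_after, deduped.shape)
```

Running the same two counts on `data` around the `drop_duplicates` call would show whether exactly one row was removed.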

Next, we drop the columns we do not need.

In [60]:
list_columns_to_delete = ['id', 'imdb_id', 'budget', 'revenue', 'cast', 'homepage', 'director', 'tagline', 'keywords', 'overview', 'production_companies']

data.drop(columns=list_columns_to_delete, inplace=True)

In [61]:
data.head()

Unnamed: 0,popularity,original_title,runtime,genres,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,32.985763,Jurassic World,124,Action|Adventure|Science Fiction|Thriller,6/9/2015,5562,6.5,2015,137999939.3,1392446000.0
1,28.419936,Mad Max: Fury Road,120,Action|Adventure|Science Fiction|Thriller,5/13/2015,6185,7.1,2015,137999939.3,348161300.0
2,13.112507,Insurgent,119,Adventure|Science Fiction|Thriller,3/18/2015,2480,6.3,2015,101199955.5,271619000.0
3,11.173104,Star Wars: The Force Awakens,136,Action|Adventure|Science Fiction|Fantasy,12/15/2015,5292,7.5,2015,183999919.0,1902723000.0
4,9.335014,Furious 7,137,Action|Crime|Thriller,4/1/2015,2947,7.3,2015,174799923.1,1385749000.0


Let's check again for missing values.

In [62]:
data.isnull().sum()

popularity         0
original_title     0
runtime            0
genres            23
release_date       0
vote_count         0
vote_average       0
release_year       0
budget_adj         0
revenue_adj        0
dtype: int64

The `genres` column contains 23 missing values, which is negligible (about 0.2%) relative to our sample of 10866 rows, so we can simply drop those rows.

In [63]:
data.dropna(inplace=True)
data.shape

(10842, 10)

The rows with missing values were dropped: 10842 rows remain.
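
The choice to drop rather than impute rests on the missing share being small, which can be measured directly. Here is a minimal sketch on a made-up frame (`toy` is hypothetical); it also uses `subset=["genres"]`, a variation on the notebook's plain `dropna()` that restricts the check to a single column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing genre out of four rows
toy = pd.DataFrame({
    "original_title": ["Film A", "Film B", "Film C", "Film D"],
    "genres": ["Action", np.nan, "Drama|Comedy", "Horror"],
})

# Fraction of missing values per column (mean of a boolean mask)
missing_share = toy.isnull().mean()

# Drop only the rows where `genres` is missing
cleaned = toy.dropna(subset=["genres"])

print(missing_share["genres"], cleaned.shape)
```

In this notebook `genres` is the only column with missing values, so `dropna()` and `dropna(subset=["genres"])` should remove the same rows.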

In [64]:
data.isnull().sum()

popularity        0
original_title    0
runtime           0
genres            0
release_date      0
vote_count        0
vote_average      0
release_year      0
budget_adj        0
revenue_adj       0
dtype: int64

No missing values remain.

Keep in mind, however, that the `genres` column holds multiple values separated by pipe characters (`|`). One option is to keep only the first genre listed for each row and discard the others.

In [65]:
# Split on the pipe separator and keep only the first genre of each row
data["genres"] = data["genres"].str.split("|").str[0]

data.head()

Unnamed: 0,popularity,original_title,runtime,genres,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,32.985763,Jurassic World,124,Action,6/9/2015,5562,6.5,2015,137999939.3,1392446000.0
1,28.419936,Mad Max: Fury Road,120,Action,5/13/2015,6185,7.1,2015,137999939.3,348161300.0
2,13.112507,Insurgent,119,Adventure,3/18/2015,2480,6.3,2015,101199955.5,271619000.0
3,11.173104,Star Wars: The Force Awakens,136,Action,12/15/2015,5292,7.5,2015,183999919.0,1902723000.0
4,9.335014,Furious 7,137,Action,4/1/2015,2947,7.3,2015,174799923.1,1385749000.0
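
Keeping only the first genre discards the secondary ones. If they turn out to matter for the analysis, one alternative (not what this notebook does) is to produce one row per film and genre pair with `explode`. A sketch on a made-up frame, where `toy` and its values are hypothetical:

```python
import pandas as pd

# Hypothetical toy data in the same pipe-separated format as `genres`
toy = pd.DataFrame({
    "original_title": ["Film A", "Film B"],
    "genres": ["Action|Thriller", "Drama"],
})

# Split into lists, then emit one row per (film, genre) pair
exploded = toy.assign(genres=toy["genres"].str.split("|")).explode("genres")

# Each genre is now counted once per film that lists it
genre_counts = exploded["genres"].value_counts()

print(exploded)
```

The trade-off is that a film then appears on several rows, so per-film statistics such as mean runtime would need deduplication first.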


All that remains is to convert the `release_date` column to datetime.

In [66]:
data["release_date"] = pd.to_datetime(data["release_date"], format="%m/%d/%Y", errors="coerce")

In [67]:
data.head()

Unnamed: 0,popularity,original_title,runtime,genres,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,32.985763,Jurassic World,124,Action,2015-06-09,5562,6.5,2015,137999939.3,1392446000.0
1,28.419936,Mad Max: Fury Road,120,Action,2015-05-13,6185,7.1,2015,137999939.3,348161300.0
2,13.112507,Insurgent,119,Adventure,2015-03-18,2480,6.3,2015,101199955.5,271619000.0
3,11.173104,Star Wars: The Force Awakens,136,Action,2015-12-15,5292,7.5,2015,183999919.0,1902723000.0
4,9.335014,Furious 7,137,Action,2015-04-01,2947,7.3,2015,174799923.1,1385749000.0
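
Because `errors="coerce"` silently turns any value that does not match the format into `NaT`, it is worth counting the failures after the conversion. The sketch below uses a made-up Series (`toy` is hypothetical) whose last value deliberately breaks the `%m/%d/%Y` format.

```python
import pandas as pd

# Hypothetical toy data: the last value does not match the expected format
toy = pd.Series(["6/9/2015", "12/15/2015", "2015-04-01"])

# Same call as in the notebook: unparseable values become NaT instead of raising
parsed = pd.to_datetime(toy, format="%m/%d/%Y", errors="coerce")

# Count and list the original values that failed to parse
n_failed = int(parsed.isna().sum())
failed_values = toy[parsed.isna()].tolist()

print(n_failed, failed_values)
```

Applied to `data`, a non-zero `data["release_date"].isna().sum()` after the conversion would indicate dates that were lost and need a second look.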


Our DataFrame is now clean.

### Saving the data

In [68]:
data.to_csv("../data/data_cleaned.csv", index=False)
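
Note that CSV stores dates as text, so the datetime dtype is not preserved on disk. A minimal sketch of the round trip, assuming a made-up frame (`toy` is hypothetical) and an in-memory buffer in place of the real file:

```python
import io
import pandas as pd

# Hypothetical toy frame with a datetime column, standing in for `data`
toy = pd.DataFrame({
    "original_title": ["Film A", "Film B"],
    "release_date": pd.to_datetime(["2015-06-09", "2015-05-13"]),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Without parse_dates the column comes back as plain strings
buffer.seek(0)
as_text = pd.read_csv(buffer)

# With parse_dates the datetime dtype is restored
buffer.seek(0)
as_dates = pd.read_csv(buffer, parse_dates=["release_date"])

print(as_text["release_date"].dtype, as_dates["release_date"].dtype)
```

When reloading `data_cleaned.csv` in a later notebook, passing `parse_dates=["release_date"]` to `pd.read_csv` should avoid repeating the conversion step.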