# Preprocessing the DF_tmdb_full dataset

### **Objective**
Clean and prepare the `DF_tmdb_full` dataset so that it is ready for analysis.

### **Summary of the preprocessing applied to this DataFrame:**

- Keep only films whose `runtime` is greater than 40 minutes.
- Keep only films that have already been released: filter the `status` column on `"Released"`.
- Remove adult films (keep only rows where `adult` is `False`).
- Replace `\N` values with NaN.
- Clean each value in the `genres`, `production_companies_name` and `spoken_languages` columns.
- Drop the `video` column, because about 98% of its values are `False`, along with `status` and `adult`, which are constant once the filters are applied.
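
The `\N` replacement listed above is not shown as a cell in this notebook. A minimal sketch of how it could be done, assuming the placeholder appears as the literal two-character string `\N` in text columns; the toy DataFrame `df_demo` is invented for illustration and is not the TMDB data:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical), with "\N" used as a missing-value placeholder
df_demo = pd.DataFrame({
    "title": ["Ariel", "Blondie", r"\N"],
    "tagline": [r"\N", "The favorite comic strip", "Some tagline"],
})

# Replace cells that are exactly "\N" with NaN (whole-cell match, not a substring match)
df_demo = df_demo.replace(r"\N", np.nan)

# Count the missing values per column after the replacement
print(df_demo.isna().sum())
```

The same effect could be obtained at load time by passing `na_values=[r"\N"]` to `pd.read_csv`.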


In [None]:
pip install gdown

Note: you may need to restart the kernel to use updated packages.


## 1️⃣ Loading the libraries and the DataFrame

In [None]:
import pandas as pd
import numpy as np
import datetime as dt

In [None]:
# gdown is a Python library for downloading files from Google Drive
import gdown
# URL of the CSV file on Google Drive
url = 'https://drive.google.com/uc?id=1VB5_gl1fnyBDzcIOXZ5vUSbCY68VZN1v'
# Download the file
output = 'tmdb_full.csv'
gdown.download(url, output, quiet=False)
# Read the CSV file
df_tmdb = pd.read_csv(output)

Downloading...
From (original): https://drive.google.com/uc?id=1VB5_gl1fnyBDzcIOXZ5vUSbCY68VZN1v
From (redirected): https://drive.google.com/uc?id=1VB5_gl1fnyBDzcIOXZ5vUSbCY68VZN1v&confirm=t&uuid=7659fc48-b6be-462f-a430-bcc12dff29f2
To: C:\Users\vika_\Projet_2\tmdb_full.csv
100%|██████████| 157M/157M [00:16<00:00, 9.67MB/s] 


## 2️⃣ Understanding the raw data
- Column overview (`df_tmdb.info()`, `df_tmdb.head()`)
- Quick description: dimensions, types, sample values
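
The checks listed above can be gathered into one small summary table. A minimal sketch, assuming a helper named `overview` (hypothetical, not part of pandas) and an invented toy DataFrame standing in for `df_tmdb`:

```python
import pandas as pd

def overview(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, non-null count, and % of missing values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(2),
    })

# Toy data (hypothetical) standing in for df_tmdb
df_demo = pd.DataFrame({
    "title": ["Ariel", "Blondie", "Love at Twenty", "Shadows in Paradise"],
    "homepage": [None, None, None, "https://example.org"],
    "runtime": [73, 70, 110, 76],
})

summary = overview(df_demo)
print(df_demo.shape)  # dimensions: (rows, columns)
print(summary)        # per-column types and missing-value share
```

Sorting `summary` by `missing_pct` is one way to spot sparsely filled columns such as `homepage` or `tagline`.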

In [None]:
df_tmdb.shape

(309572, 25)

In [None]:
df_tmdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309572 entries, 0 to 309571
Data columns (total 25 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   adult                         309572 non-null  bool   
 1   backdrop_path                 151760 non-null  object 
 2   budget                        309572 non-null  int64  
 3   genres                        309572 non-null  object 
 4   homepage                      44262 non-null   object 
 5   id                            309572 non-null  int64  
 6   imdb_id                       309572 non-null  object 
 7   original_language             309572 non-null  object 
 8   original_title                309572 non-null  object 
 9   overview                      282512 non-null  object 
 10  popularity                    309572 non-null  float64
 11  poster_path                   264159 non-null  object 
 12  production_countries          309572 non-nul

In [None]:
df_tmdb.head(5)

Unnamed: 0,adult,backdrop_path,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,production_companies_name,production_companies_country
0,False,/dvQj1GBZAZirz1skEEZyWH2ZqQP.jpg,0,['Comedy'],,3924,tt0029927,en,Blondie,Blondie and Dagwood are about to celebrate the...,...,70,['en'],Released,The favorite comic strip of millions at last o...,Blondie,False,7.214,7,['Columbia Pictures'],['US']
1,False,,0,['Adventure'],,6124,tt0011436,de,Der Mann ohne Namen,,...,420,[],Released,,"Peter Voss, Thief of Millions",False,0.0,0,[],[]
2,False,/uJlc4aNPF3Y8yAqahJTKBwgwPVW.jpg,0,"['Drama', 'Romance']",,8773,tt0055747,fr,L'Amour à vingt ans,Love at Twenty unites five directors from five...,...,110,"['it', 'ja', 'pl', 'fr', 'de']",Released,The Intimate Secrets of Young Lovers,Love at Twenty,False,6.7,41,"['Ulysse Productions', 'Unitec Films', 'Cinese...","['', 'NZ', 'IT', 'JP', 'DE', 'PL', '']"
3,False,/hQ4pYsIbP22TMXOUdSfC2mjWrO0.jpg,0,"['Drama', 'Comedy', 'Crime']",,2,tt0094675,fi,Ariel,Taisto Kasurinen is a Finnish coal miner whose...,...,73,['fi'],Released,,Ariel,False,7.046,248,['Villealfa Filmproductions'],['FI']
4,False,/l94l89eMmFKh7na2a1u5q67VgNx.jpg,0,"['Drama', 'Comedy', 'Romance']",,3,tt0092149,fi,Varjoja paratiisissa,"An episode in the life of Nikander, a garbage ...",...,76,['en'],Released,,Shadows in Paradise,False,7.182,269,['Villealfa Filmproductions'],['FI']


## 3️⃣ Filtering out irrelevant data

In [None]:
# Keep only films whose runtime is greater than 40 minutes:
df_tmdb = df_tmdb[df_tmdb['runtime'] > 40]

In [None]:
df_tmdb.shape

(221698, 25)

In [None]:
df_tmdb['video'].value_counts(normalize=True) * 100
# => drop this column: about 98% of the rows have video == False

video
False    98.117259
True      1.882741
Name: proportion, dtype: float64

In [None]:
df_tmdb['status'].value_counts(normalize=True) * 100
# => keep only films that have already been released: status == "Released"

status
Released           99.773566
Post Production     0.148400
In Production       0.067209
Planned             0.009472
Canceled            0.000902
Rumored             0.000451
Name: proportion, dtype: float64

In [None]:
# Keep only films that have already been released
df_tmdb = df_tmdb[df_tmdb['status'] == "Released"]

In [None]:
# Remove adult films (keep rows where "adult" is False)
df_tmdb = df_tmdb[~df_tmdb['adult']]

In [None]:
df_tmdb.shape

(221195, 25)
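
The three filters above can also be written as a single boolean mask, which makes it easy to count how many rows each condition keeps. A minimal sketch on invented toy data (the column names match `df_tmdb`, the values do not). The `.copy()` at the end is an addition of this sketch, meant to avoid pandas' `SettingWithCopyWarning` in later column assignments:

```python
import pandas as pd

# Toy data (hypothetical) with the same column names as df_tmdb
df_demo = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "runtime": [70, 30, 110, 95, 120],
    "status": ["Released", "Released", "Planned", "Released", "Released"],
    "adult": [False, False, False, True, False],
})

# One boolean Series per filtering rule
long_enough = df_demo["runtime"] > 40
released = df_demo["status"] == "Released"
not_adult = ~df_demo["adult"]

# Number of rows each condition keeps on its own
print(long_enough.sum(), released.sum(), not_adult.sum())

# Apply all three conditions at once and take an independent copy
df_filtered = df_demo[long_enough & released & not_adult].copy()
print(df_filtered.shape)
```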

## 4️⃣ Cleaning specific columns

In [None]:
# Clean each value in the "genres" column:
def nettoyer_genres(genres):
    genre_str1 = genres.strip("[]'")          # Strip the surrounding brackets and outer apostrophes
    genre_str2 = genre_str1.replace("'", "")  # Remove the remaining apostrophes
    return genre_str2

df_tmdb.loc[:, 'genres'] = df_tmdb['genres'].apply(nettoyer_genres)

In [None]:
# Clean each value in the "production_companies_name" column:
def nettoyer_production_companies_name(production_companies_name):
    production_companies_name_str1 = production_companies_name.strip("[]'")          # Strip the surrounding brackets and outer apostrophes
    production_companies_name_str2 = production_companies_name_str1.replace("'", "")  # Remove the remaining apostrophes
    return production_companies_name_str2

df_tmdb.loc[:, 'production_companies_name'] = df_tmdb['production_companies_name'].apply(nettoyer_production_companies_name)

In [None]:
# Clean each value in the "spoken_languages" column:
def nettoyer_spoken_languages(spoken_languages):
    spoken_languages_str1 = spoken_languages.strip("[]'")          # Strip the surrounding brackets and outer apostrophes
    spoken_languages_str2 = spoken_languages_str1.replace("'", "")  # Remove the remaining apostrophes
    return spoken_languages_str2

df_tmdb.loc[:, 'spoken_languages'] = df_tmdb['spoken_languages'].apply(nettoyer_spoken_languages)
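
The three cleaning functions above share the same logic, so they could be folded into one helper. The `strip`/`replace` approach also has a limitation: it removes every apostrophe, including those inside names. The sketch below is an alternative, not the method used in this notebook. It assumes these columns hold Python-style list literals and parses them with `ast.literal_eval`. The helper name `list_string_to_text` and the sample values are hypothetical:

```python
import ast
import pandas as pd

def list_string_to_text(value: str) -> str:
    """Turn a string such as "['Drama', 'Romance']" into "Drama, Romance"."""
    try:
        items = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        # Not a valid Python literal: fall back to the strip/replace logic
        return value.strip("[]'").replace("'", "")
    if not isinstance(items, (list, tuple)):
        return str(items)
    return ", ".join(str(item) for item in items)

# Toy data (hypothetical); the first company name contains an apostrophe
df_demo = pd.DataFrame({
    "genres": ["['Drama', 'Romance']", "['Comedy']", "[]"],
    "production_companies_name": [
        "[\"Director's Cut\", 'Unitec Films']",
        "['Columbia Pictures']",
        "[]",
    ],
})

# Apply the same helper to every list-like column
for column in ["genres", "production_companies_name"]:
    df_demo[column] = df_demo[column].apply(list_string_to_text)

print(df_demo)
```

For values without inner apostrophes, this produces the same text as the functions above. An empty list still becomes an empty string.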

In [None]:
df_tmdb.isna().sum()

adult                                0
backdrop_path                    92512
budget                               0
genres                               0
homepage                        185548
id                                   0
imdb_id                              0
original_language                    0
original_title                       0
overview                          6621
popularity                           0
poster_path                      18684
production_countries                 0
release_date                      1539
revenue                              0
runtime                              0
spoken_languages                     0
status                               0
tagline                         153911
title                                0
video                                0
vote_average                         0
vote_count                           0
production_companies_name            0
production_companies_country    114547
dtype: int64
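
`isna()` reports 0 missing values for `genres`, `spoken_languages` and `production_companies_name`, but an empty list such as `[]` (visible in the `head()` output for `spoken_languages`) becomes an empty string after cleaning, and empty strings are not counted as missing. A minimal sketch, on invented toy values, of how such cells could be counted or converted to NaN:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): cleaned columns where empty lists became empty strings
df_demo = pd.DataFrame({
    "genres": ["Drama, Romance", "", "Comedy"],
    "spoken_languages": ["", "", "fi"],
})

# isna() does not see empty strings
print(df_demo.isna().sum())

# Count the empty strings per column
empty_counts = (df_demo == "").sum()
print(empty_counts)

# Convert cells that are exactly "" to NaN so that isna() counts them
df_marked = df_demo.replace("", np.nan)
print(df_marked.isna().sum())
```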

## 5️⃣ Dropping unnecessary columns

In [None]:
# Drop the "video", "status" and "adult" columns
df_tmdb = df_tmdb.drop(['video','status','adult'], axis=1)

In [None]:
# Dimensions of the final DataFrame
df_tmdb.shape

(221195, 22)

In [None]:
df_tmdb_full_clean = df_tmdb
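
A plain assignment does not copy the data: both names refer to the same DataFrame object, so a later change made through one name is visible through the other. If an independent copy is wanted, `.copy()` can be used. A small sketch on invented toy data:

```python
import pandas as pd

# Toy data (hypothetical)
df_demo = pd.DataFrame({"title": ["Ariel", "Blondie"], "runtime": [73, 70]})

alias = df_demo               # same object, no data copied
independent = df_demo.copy()  # separate object with its own data

# Add a column through the original name
df_demo["runtime_hours"] = df_demo["runtime"] / 60

print("runtime_hours" in alias.columns)        # the alias sees the new column
print("runtime_hours" in independent.columns)  # the copy does not
```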

## Export (optional): saving the cleaned DataFrame

In [None]:
df_tmdb_full_clean.to_csv("df_tmdb_full_clean.csv")
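
By default `to_csv` also writes the DataFrame index as a first, unnamed column, which reappears as `Unnamed: 0` when the file is read back. Since the index carries no information after row filtering, passing `index=False` may be preferable. A minimal round-trip sketch on invented toy data; the file name `demo_clean.csv` is hypothetical and the file is written to a temporary directory:

```python
import os
import tempfile
import pandas as pd

# Toy data (hypothetical) with a non-contiguous index, as left behind by row filtering
df_demo = pd.DataFrame(
    {"title": ["Ariel", "Blondie"], "runtime": [73, 70]},
    index=[3, 0],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_clean.csv")

    # Default behaviour: the index is written as an extra first column
    df_demo.to_csv(path)
    with_index = pd.read_csv(path)

    # index=False: only the data columns are written
    df_demo.to_csv(path, index=False)
    without_index = pd.read_csv(path)

print(list(with_index.columns))
print(list(without_index.columns))
```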