In [1]:
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt



## 0. Loading the data

In [2]:
#load data/moviesummaries/character.metadata.tsv
character_metadata = pd.read_csv('../data/moviesummaries/character.metadata.tsv', sep='\t', header=None)

#load data/moviesummaries/plot_summaries.txt
plot_summaries = pd.read_csv('../data/moviesummaries/plot_summaries.txt', sep='\t', header=None)

#load data/moviesummaries/movie.metadata.tsv
movie_metadata = pd.read_csv('../data/moviesummaries/movie.metadata.tsv', sep='\t', header=None)

#load data/moviesummaries/name.clusters.txt
name_clusters = pd.read_csv('../data/moviesummaries/name.clusters.txt', sep='\t', header=None)


In [3]:
# Rename columns of each dataset to match documentation
character_metadata.columns = ["Wikipedia movie ID", "Freebase movie ID", "Movie release date", "Character name", "Actor date of birth", "Actor gender", 
                              "Actor height", "Actor ethnicity", "Actor name", "Actor age", "Freebase character/actor map ID", 
                              "Freebase character ID", "Freebase actor ID"]

plot_summaries.columns = ["Wikipedia movie ID", "Summary"]

movie_metadata.columns = ["Wikipedia movie ID", "Freebase movie ID", "Movie name", "Movie release date", "Movie revenue", "Movie runtime",
                          "Movie languages", "Movie countries", "Movie genres"]

name_clusters.columns = ["Character name", "Freebase character/actor map ID"]

We get 4 different dataframes; we'll merge movie_metadata and plot_summaries together since it makes sense to get the plot information directly linked with the movie metadata, and keep the others as is.

## 1. Preprocessing plot and metadata about movies

We can see from the column names that the plot summaries can simply be added to the movie metadata dataframe. Let's first look at how many rows each dataset contains:

In [4]:
# Print the size of each dataset
print("Number of data in the metadata dataframe :", movie_metadata.shape[0])
print("Number of data in the plot summaries dataframe :", plot_summaries.shape[0])

Number of data in the metadata dataframe : 81741
Number of data in the plot summaries dataframe : 42303


We can see that roughly half of the movies in the metadata have a plot description. Let's now join the two datasets on the ID column:

In [8]:
# Merge movie_metadata and plot_summaries on the Wikipedia movie ID. The left join keeps movies without a summary
# but drops summaries without a matching movie
all_movies = movie_metadata.merge(plot_summaries, on="Wikipedia movie ID", how="left")
all_movies.head()

Unnamed: 0,Wikipedia movie ID,Freebase movie ID,Movie name,Movie release date,Movie revenue,Movie runtime,Movie languages,Movie countries,Movie genres,Summary
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science...","Set in the second half of the 22nd century, th..."
1,3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biograp...",
2,28463795,/m/0crgdbh,Brun bitter,1988,,83.0,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D...",
3,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic...",A series of murders of rich young women throug...
4,261236,/m/01mrr1,A Woman in Flames,1983,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}","Eva, an upper class housewife, becomes frustra..."


In [9]:
# Check the number and fraction of null values in each column of all_movies, as well as the number of distinct values in each column.
all_movies_null = pd.DataFrame(all_movies.isnull().sum(), columns=['Number of null values'])
all_movies_null['Percentage of null values'] = all_movies_null['Number of null values'] / len(all_movies)
all_movies_null['Number of unique values'] = all_movies.nunique()

all_movies_null

Unnamed: 0,Number of null values,Percentage of null values,Number of unique values
Wikipedia movie ID,0,0.0,81741
Freebase movie ID,0,0.0,81741
Movie name,0,0.0,75478
Movie release date,6902,0.084437,20389
Movie revenue,73340,0.897224,7362
Movie runtime,20450,0.25018,597
Movie languages,0,0.0,1817
Movie countries,0,0.0,2124
Movie genres,0,0.0,23817
Summary,39537,0.483686,42196


As mentioned above, approximately 48% of the movies in the dataset don't have a corresponding plot summary!
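To see where unmatched rows come from on both sides of the join, one option is an outer merge with `indicator=True`. The sketch below uses two small toy dataframes with made-up values as stand-ins for `movie_metadata` and `plot_summaries`; it is not run on the real data.

```python
import pandas as pd

# Toy stand-ins for movie_metadata and plot_summaries (hypothetical values)
movies = pd.DataFrame({"Wikipedia movie ID": [1, 2, 3],
                       "Movie name": ["A", "B", "C"]})
plots = pd.DataFrame({"Wikipedia movie ID": [1, 3, 4],
                      "Summary": ["plot of A", "plot of C", "orphan plot"]})

# An outer merge with indicator=True adds a "_merge" column telling
# which side(s) each row came from
coverage = movies.merge(plots, on="Wikipedia movie ID", how="outer", indicator=True)
counts = coverage["_merge"].value_counts()

# left_only  -> movies without a summary
# right_only -> summaries without a matching movie (a left join drops these)
# both       -> movies with a summary
print(counts)
```

On the real data, the same call on `movie_metadata` and `plot_summaries` should show how many summaries are lost by the left join.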

But what about duplicated plot summaries? Let's check:

In [43]:
# Show plot summary duplicates which are not NaNs
duplicate_plot_movies = all_movies[all_movies.duplicated(subset=['Summary'], keep = False) & all_movies['Summary'].notnull()]

# Print number of plot summaries having at least one duplicate
print("Number of plot summaries having at least one duplicate :", duplicate_plot_movies['Summary'].nunique())

duplicate_plot_movies[['Wikipedia movie ID', 'Movie name', 'Summary']].sort_values(by=['Summary'])


Number of plot summaries having at least one duplicate : 5


Unnamed: 0,Wikipedia movie ID,Movie name,Summary
4551,14055212,The Trial of Madame X,A woman is thrown out of her home by her jealo...
18993,14022275,Madame X,A woman is thrown out of her home by her jealo...
49381,14037732,Madame X,A woman is thrown out of her home by her jealo...
57569,14051944,Madame X,A woman is thrown out of her home by her jealo...
65014,14053389,Madame X,A woman is thrown out of her home by her jealo...
28621,29481480,Drohi,An orphan Raghav turns into a ruthless contrac...
67464,25493367,Antham,An orphan Raghav turns into a ruthless contrac...
15783,14616220,The Warrens of Virginia,"As the American Civil War begins, Ned Burton l..."
57508,28852030,The Warrens of Virginia,"As the American Civil War begins, Ned Burton l..."
22185,19609453,Amar Deep,Raja was adopted by a criminal don at a very ...


We see that 5 distinct plot summaries are shared by more than one movie.

In some cases the duplicates have the same movie name; in others they have different movie names.

Therefore, we can't tell which duplicate is the "correct" one.

This leaves two options: discard all the duplicates or keep them all.
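If the duplicates end up being discarded, a minimal sketch could look like the following. It assumes that every movie sharing a summary should be dropped while movies without any summary are kept; the toy dataframe and its values are hypothetical.

```python
import pandas as pd

# Toy stand-in for all_movies (hypothetical values)
toy = pd.DataFrame({
    "Movie name": ["Madame X", "Madame X", "Drohi", "Antham",
                   "Solo", "No plot 1", "No plot 2"],
    "Summary": ["same plot", "same plot", "shared plot", "shared plot",
                "unique plot", None, None],
})

# True for every row whose non-null summary occurs more than once.
# duplicated() treats missing values as equal to each other, hence the
# extra notnull() condition to protect movies without a summary.
is_dup = toy.duplicated(subset=["Summary"], keep=False) & toy["Summary"].notnull()

# Discard all movies that share a summary, keep everything else
deduplicated = toy[~is_dup]
print(deduplicated)
```

Keeping all duplicates instead simply means skipping this filter.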

# MAYBE REMOVE THEM HERE? 
row duplicates, but not the others

Let's now take a look at the movie release dates. Looking at the data, we see 4 cases:
 - The release date is a year only
 - The release date is a month and a year
 - The release date is a day, a month and a year
 - The release date is missing (NaN)

Let's convert every entry to a year only, whenever possible:

TODO: REPLACE WITH CLARA'S CODE

In [11]:
# Convert the 'Movie release date' column to datetime (unparseable entries become NaT), then keep only the year
all_movies['Movie release date'] = pd.to_datetime(all_movies['Movie release date'], errors='coerce')
all_movies['Movie release date'] = all_movies['Movie release date'].dt.year
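Depending on the pandas version, `pd.to_datetime` may infer a single format from the first value and coerce year-only or year-month strings to NaT. It also cannot represent dates before 1677 at nanosecond resolution. A possible alternative, sketched here on a toy series with hypothetical values, reads the year directly from the first four digits of the string. It assumes every non-missing entry starts with a four-digit year, as in the cases listed above.

```python
import pandas as pd

# Toy release dates covering the four cases, plus one very old date
# (hypothetical values, not taken from the dataset)
dates = pd.Series(["2001-08-24", "1988", "1987-05", None, "1010-12-02"])

# Extract the leading four digits; entries that don't match become NaN
year_strings = dates.str.extract(r"^(\d{4})", expand=False)

# Convert to a nullable integer column so missing years stay missing
years = pd.to_numeric(year_strings, errors="coerce").astype("Int64")
print(years)
```

The nullable `Int64` dtype avoids the `1960.0`-style float years that appear when a column mixes integers and NaN.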


In [66]:
all_movies.sample(10)

Unnamed: 0,Wikipedia movie ID,Freebase movie ID,Movie name,Movie release date,Movie revenue,Movie runtime,Movie languages,Movie countries,Movie genres,Summary
35691,3907764,/m/0b67pw,The Moon and the Son: An Imagined Conversation,,,,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02hmvc"": ""Short Film"", ""/m/07s9rl0"": ""Dra...",
27595,10597501,/m/02qjtpn,Ming Ming,,,105.0,"{""/m/02k30q"": ""Shanghainese"", ""/m/03115z"": ""Ma...","{""/m/03h64"": ""Hong Kong""}","{""/m/02l7c8"": ""Romance Film"", ""/m/02kdv5l"": ""A...",Fiery Ming Ming has always been the kind to t...
7739,23797592,/m/06_v1nm,The Baccahe,,,88.0,{},"{""/m/09c7w0"": ""United States of America""}","{""/m/07s9rl0"": ""Drama""}",
30930,807979,/m/03dhpv,Swiss Family Robinson,1960.0,40000000.0,126.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/04xvh5"": ""Costume drama"", ""/m/0hqxf"": ""Fa...",A family on their way to New Guinea is chased ...
23581,9085771,/m/027x41y,I Stole a Million,1939.0,,80.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/02n4kr"": ""My...",The plot was summarized by a reviewer thus: R...
48649,31156225,/m/0gh91_5,Sree Krishnaleela,,,,"{""/m/0999q"": ""Malayalam Language""}","{""/m/03rk0"": ""India""}","{""/m/05p553"": ""Comedy film"", ""/m/07s9rl0"": ""Dr...",
32514,9423251,/m/0288856,English Babu Desi Mem,1996.0,,150.0,"{""/m/03k50"": ""Hindi Language"", ""/m/02hxcvy"": ""...","{""/m/03rk0"": ""India""}","{""/m/07s9rl0"": ""Drama"", ""/m/03q4nz"": ""World ci...",Hari and Vikram are brothers of the head of M...
60527,15928986,/m/03qh49_,The Black Secret,1919.0,,,"{""/m/06ppq"": ""Silent film"", ""/m/02h40lc"": ""Eng...","{""/m/09c7w0"": ""United States of America""}","{""/m/06ppq"": ""Silent film"", ""/m/03k9fj"": ""Adve...",
40420,19409166,/m/04mxzqw,Quiet Night In,,,87.0,"{""/m/02h40lc"": ""English Language""}","{""/m/0ctw_b"": ""New Zealand""}","{""/m/01z4y"": ""Comedy""}",
73496,26021982,/m/0b6c_bf,Tora-san Goes North,1987.0,,107.0,{},"{""/m/03_3d"": ""Japan""}","{""/m/0gw5n2f"": ""Japanese Movies""}","When his travels take him to rural Hokkaido, T..."


Now, we have a proper dataframe containing metadata and plot summaries about movies.

## 2. Preprocessing characters metadata

In [41]:
character_metadata.sample(10)

Unnamed: 0,Wikipedia movie ID,Freebase movie ID,Movie release date,Character name,Actor date of birth,Actor gender,Actor height,Actor ethnicity,Actor name,Actor age,Freebase character/actor map ID,Freebase character ID,Freebase actor ID
352417,26388629,/m/0bbv0qf,1956.0,,1922-05-24,M,1.98,,Don Megowan,34.0,/m/0gcx76g,,/m/0c3yhsk
235395,13333148,/m/03c20cm,1941.0,,1906-07-03,M,1.9,,George Sanders,34.0,/m/040kk4l,,/m/02cj_f
424984,21997257,/m/05n_kl5,1952.0,Stableman,1903-07-09,M,,,Jack Hendricks,48.0,/m/0n1tw0_,/m/0n1xn9s,/m/0n1tw12
447219,28071647,/m/0cm8qby,,,1963-09-27,M,,,Fu Biao,35.0,/m/0cmxxqt,,/m/07p27m
147696,1901270,/m/064wly,,,1902-11-19,M,1.9,,Richard Alexander,28.0,/m/0c6_xv0,,/m/02qpvjm
399144,34602448,/m/0hhggk5,,,1979-11-25,M,1.626,,Jerry Ferrara,33.0,/m/0hjb4ss,,/m/0b1xzp
212281,23956617,/m/076zl8h,,Matka Gorzelaka,1916-03-01,F,,,Krystyna Feldman,,/m/0n5cgr2,/m/0n5cgr5,/m/027xn2t
247938,1994850,/m/06cm6h,2004.0,,1978,F,,,Kyoko Hasegawa,26.0,/m/0gcbx28,,/m/071g7x
101299,30520872,/m/0g9w9s3,1925.0,,1891-02-09,M,1.77,,Ronald Colman,,/m/0gw2tvz,,/m/01201_
268934,24132517,/m/07kc0gx,1956.0,,,,,,Peter Mosbacher,,/m/0gcgy26,,/m/0gc4mby


Taking a quick look, we run into the same problem as before with the "Movie release date" and "Actor date of birth" columns. Let's convert every entry to a year only, whenever possible:

In [22]:
# Convert the 'Movie release date' and 'Actor date of birth' columns to datetime, then keep only the year
character_metadata['Movie release date'] = pd.to_datetime(character_metadata['Movie release date'], errors='coerce')
character_metadata['Movie release date'] = character_metadata['Movie release date'].dt.year
character_metadata['Actor date of birth'] = pd.to_datetime(character_metadata['Actor date of birth'], errors='coerce')
character_metadata['Actor date of birth'] = character_metadata['Actor date of birth'].dt.year



In [45]:
character_metadata.sample(10)

Unnamed: 0,Wikipedia movie ID,Freebase movie ID,Movie release date,Character name,Actor date of birth,Actor gender,Actor height,Actor ethnicity,Actor name,Actor age,Freebase character/actor map ID,Freebase character ID,Freebase actor ID
219944,28967077,/m/0dgs5yf,1937.0,,1885-01-12,F,,,Maire O'Neill,,/m/0gds9qm,,/m/074tdb
217212,27750710,/m/0cc51m2,,,1948-03-07,M,,,Ruperto Ares,56.0,/m/0gbz2g1,,/m/0gbz2g3
417708,19410544,/m/04mxnxc,,,1908-11-02,M,,,Reginald Beckwith,52.0,/m/0cpnj0v,,/m/02r6t2p
301438,21651686,/m/05msp_b,,Christine Vole,1938-07-20,F,1.74,,Diana Rigg,43.0,/m/0gm1y1y,/m/0h3b4yq,/m/01bqmx
172144,4302862,/m/0bw0sw,2003.0,Louis Stevens,1986-06-11,M,1.759,/m/041rx,Shia LaBeouf,17.0,/m/0k5025,/m/05vc9qh,/m/04w391
323901,32172821,/m/05f512r,2007.0,,,F,,,Jeena,,/m/0k55jc4,,/m/0k55jc7
305821,576198,/m/02rqd_,1987.0,Sue Ann,1961-07-15,F,1.72,/m/026cybk,Lolita Davidovich,25.0,/m/0220zfd,/m/02nw9w5,/m/04393b
388582,22730023,/m/05zxtjr,2002.0,,1967-10-10,M,,/m/0dryh9k,Ali,34.0,/m/0jmx855,,/m/02rzmzk
294624,14183099,/m/03cx47d,2008.0,Simran,1978-08-21,F,1.6,/m/0dryh9k,Bhumika Chawla,29.0,/m/040lwml,/m/0h33ps2,/m/04bfn2
397797,20672258,/m/05pdh86,2009.0,Paul,1985-01-10,M,1.8,,Alex Meraz,24.0,/m/07grkpd,/m/05lwx7h,/m/05szh9x


# To do

 - Merge movie metadata and character metadata??? to be asked
 - Change the date handling with Clara's code -> Clara
 - Filter the data: remove unnecessary columns??? to be asked
 - Understand the dictionary structure -> Faye


 - Plot number of movies per year -> note that we'll have more data for recent years -> Clara
 - Bar plot of countries -> our study will be more representative of the US -> Clara
 - Bar plot of genres ???


Romain's part:
 - output the % of NaN in the revenue -> Faye
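For the dictionary-structure item above: judging from the dataframe previews, the language, country and genre columns appear to hold JSON-encoded mappings from a Freebase ID to a readable name. Under that assumption, a minimal sketch for turning them into lists of names could look like this. The helper `dict_values`, the `Genre names` column and the toy rows are hypothetical.

```python
import json
import pandas as pd

# Toy stand-in for the "Movie genres" column (values in the style of the previews)
toy = pd.DataFrame({
    "Movie genres": ['{"/m/01jfsb": "Thriller", "/m/07s9rl0": "Drama"}',
                     '{"/m/07s9rl0": "Drama"}',
                     '{}'],
})

def dict_values(cell):
    """Return the readable names from a JSON-encoded {id: name} cell."""
    return list(json.loads(cell).values())

# One list of genre names per movie
toy["Genre names"] = toy["Movie genres"].apply(dict_values)

# One row per (movie, genre) pair, then count the movies per genre;
# movies with an empty dictionary contribute nothing to the counts
genre_counts = toy["Genre names"].explode().value_counts()
print(genre_counts)
```

Counts like these could feed the country and genre bar plots listed above.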


