# Setup


In [3]:
import pandas as pd
import json

# METADATA

1. `movie.metadata.tsv.gz` [3.4 M]


Metadata for 81,741 movies, extracted from the November 4, 2012 dump of Freebase. Tab-separated; columns:

1. Wikipedia movie ID
2. Freebase movie ID
3. Movie name
4. Movie release date
5. Movie box office revenue
6. Movie runtime
7. Movie languages (Freebase ID:name tuples)
8. Movie countries (Freebase ID:name tuples)
9. Movie genres (Freebase ID:name tuples)



2. `character.metadata.tsv.gz` [14 M]

Metadata for 450,669 characters aligned to the movies above, extracted from the November 4, 2012 dump of Freebase. Tab-separated; columns:

1. Wikipedia movie ID
2. Freebase movie ID
3. Movie release date
4. Character name
5. Actor date of birth
6. Actor gender
7. Actor height (in meters)
8. Actor ethnicity (Freebase ID)
9. Actor name
10. Actor age at movie release
11. Freebase character/actor map ID
12. Freebase character ID
13. Freebase actor ID

# Importing the datasets


In [4]:
movies_path = "data/movie.metadata.tsv"
character_path = "data/character.metadata.tsv"

movie_column_names = ["WikiID", "FreeID", "Title", "RelDate", "Revenue", "Runtime", "Languages", "Countries", "Genres"]
character_column_names = ["WikiID", "FreeID", "MovieRelDate", "CharName", "DOB", "Gender", "Height", "Ethnicity", "Actor", "Age", "FreeMapID", "FreeCharID", "FreeActorID"]

movies = pd.read_csv(movies_path, sep='\t', header=None, names=movie_column_names)
characters = pd.read_csv(character_path, sep='\t', header=None, names=character_column_names)

# Looking at the data

In [5]:
display(movies.head())
display(characters.head())

Unnamed: 0,WikiID,FreeID,Title,RelDate,Revenue,Runtime,Languages,Countries,Genres
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science..."
1,3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biograp..."
2,28463795,/m/0crgdbh,Brun bitter,1988,,83.0,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D..."
3,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic..."
4,261236,/m/01mrr1,A Woman in Flames,1983,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}"


Unnamed: 0,WikiID,FreeID,MovieRelDate,CharName,DOB,Gender,Height,Ethnicity,Actor,Age,FreeMapID,FreeCharID,FreeActorID
0,975900,/m/03vyhn,2001-08-24,Akooshay,1958-08-26,F,1.62,,Wanda De Jesus,42.0,/m/0bgchxw,/m/0bgcj3x,/m/03wcfv7
1,975900,/m/03vyhn,2001-08-24,Lieutenant Melanie Ballard,1974-08-15,F,1.78,/m/044038p,Natasha Henstridge,27.0,/m/0jys3m,/m/0bgchn4,/m/0346l4
2,975900,/m/03vyhn,2001-08-24,Desolation Williams,1969-06-15,M,1.727,/m/0x67,Ice Cube,32.0,/m/0jys3g,/m/0bgchn_,/m/01vw26l
3,975900,/m/03vyhn,2001-08-24,Sgt Jericho Butler,1967-09-12,M,1.75,,Jason Statham,33.0,/m/02vchl6,/m/0bgchnq,/m/034hyc
4,975900,/m/03vyhn,2001-08-24,Bashira Kincaid,1977-09-25,F,1.65,,Clea DuVall,23.0,/m/02vbb3r,/m/0bgchp9,/m/01y9xg




*   Note: `RelDate` doesn't have a standard format (some entries are full dates such as `2001-08-24`, others are only a year such as `1988`)
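One possible way to work around the mixed formats is to keep only the release year. The snippet below is a minimal sketch on a toy series, assuming every non-missing value starts with a four-digit year; `rel_dates` and `year` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy sample mimicking the mixed formats seen in `RelDate` (values taken from the preview above).
rel_dates = pd.Series(["2001-08-24", "2000-02-16", "1988", None])

# Assumption: every non-missing value starts with a four-digit year.
year_str = rel_dates.str.extract(r"^(\d{4})", expand=False)

# Convert to a nullable integer column so missing dates stay missing.
year = pd.to_numeric(year_str, errors="coerce").astype("Int64")
print(year)
```

Keeping the year as a nullable `Int64` column avoids turning missing dates into a placeholder value.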



### Exploring length of the two datasets

In [6]:
n_mov = len(movies)
n_char = len(characters)
print('Number of rows in the movie.metadata dataset :', n_mov)
print('Number of rows in the character.metadata dataset :', n_char)

Number of rows in the movie.metadata dataset : 81741
Number of rows in the character.metadata dataset : 450669


# Cleaning Dictionaries
Columns `Languages`, `Countries` and `Genres` contain a Freebase ID plus the actual name for each entry. For the moment, we keep just the name for the sake of clarity.

In [7]:
def extract_values(column):
  # Each cell is a JSON object mapping Freebase ID -> name; keep only the names.
  return list(json.loads(column).values())

movies.Languages = movies.Languages.apply(extract_values)
movies.Countries = movies.Countries.apply(extract_values)
movies.Genres = movies.Genres.apply(extract_values)

In [8]:
display(movies[["Languages", "Countries", "Genres"]].head(2))

Unnamed: 0,Languages,Countries,Genres
0,[English Language],[United States of America],"[Thriller, Science Fiction, Horror, Adventure,..."
1,[English Language],[United States of America],"[Mystery, Biographical film, Drama, Crime Drama]"


# Duplicates

In [9]:
for column in movies.columns:
  duplicated_rows = movies[column].duplicated().sum()
  print("{} has {} duplicated rows".format(column, duplicated_rows))

WikiID has 0 duplicated rows
FreeID has 0 duplicated rows
Title has 6263 duplicated rows
RelDate has 61351 duplicated rows
Revenue has 74378 duplicated rows
Runtime has 81143 duplicated rows
Languages has 79924 duplicated rows
Countries has 79617 duplicated rows
Genres has 57924 duplicated rows


`WikiID` & `FreeID` -> Good

`Title` -> Bit concerning -> Check with `Runtime`
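A check of titles against runtimes could look like the sketch below. It runs on a toy frame rather than on `movies`; the two "Harlow" rows reuse the IDs and runtimes displayed further down in this notebook, and `toy` and `same_title_runtime` are illustrative names.

```python
import pandas as pd

# Toy frame; the two "Harlow" rows reuse values displayed later in this notebook.
toy = pd.DataFrame({
    "WikiID": [3670013, 27171821, 975900],
    "Title": ["Harlow", "Harlow", "Ghosts of Mars"],
    "Runtime": [109.0, 109.0, 98.0],
})

# keep=False flags every member of a duplicated (Title, Runtime) group, not only the repeats.
same_title_runtime = toy[toy.duplicated(subset=["Title", "Runtime"], keep=False)]
print(same_title_runtime)
```

Rows flagged this way are only candidates: they still need a manual look to decide whether they describe the same movie.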


> ~~Runtime not enough. There are :~~

1.   ~~Movies with same runtime + Title but actually different~~
2.   ~~Movies with same runtime + Title but actually the same~~





In [10]:
movies[movies.Title=="Hunting Season"]

Unnamed: 0,WikiID,FreeID,Title,RelDate,Revenue,Runtime,Languages,Countries,Genres
62836,29666067,/m/0fphzrf,Hunting Season,1010-12-02,12160978.0,140.0,"[Turkish Language, English Language]",[Turkey],"[Crime Fiction, Mystery, Drama, Thriller]"


The release date is wrong: the year `1010` should be `2010`.
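If this entry is to be corrected by hand, one option is to patch the single row by its `WikiID`. This is a sketch on a toy frame, assuming 2010 is indeed the intended year as noted above; `toy` is an illustrative name.

```python
import pandas as pd

# Toy frame reproducing the faulty row next to a correct one.
toy = pd.DataFrame({
    "WikiID": [29666067, 975900],
    "Title": ["Hunting Season", "Ghosts of Mars"],
    "RelDate": ["1010-12-02", "2001-08-24"],
})

# Assumption: the intended year is 2010; patch only this row, selected by its WikiID.
toy.loc[toy["WikiID"] == 29666067, "RelDate"] = "2010-12-02"
print(toy)
```

Selecting by `WikiID` rather than by title avoids touching other movies that happen to share the same name.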

In [11]:
movies[movies.Title=="Harlow"]

Unnamed: 0,WikiID,FreeID,Title,RelDate,Revenue,Runtime,Languages,Countries,Genres
623,3670013,/m/09thsq,Harlow,1965-06-23,1000000.0,109.0,[English Language],[United States of America],"[Biographical film, Biography, Drama, Black-an..."
1223,27171821,/m/0bwklv0,Harlow,1965,,109.0,[],[United States of America],[Biographical film]


These are actually two different movies, but the first one (`WikiID` = 3670013) has a wrong runtime (the correct value is 125).

So far we couldn't find an example of a duplicated movie, and there are no duplicated Wiki/Freebase IDs -> Let's trust the dataset 🙃

# Missing values

### Check for missing entries :
In which columns are they present, and for those columns, what is the percentage of missing entries?

In [12]:
movies.isna().any()

WikiID       False
FreeID       False
Title        False
RelDate       True
Revenue       True
Runtime       True
Languages    False
Countries    False
Genres       False
dtype: bool

In [13]:
print('Percentage of missing entries in the movie dataset:\n', 100*movies[['RelDate', 'Revenue', 'Runtime']].isna().sum() / n_mov)

Percentage of missing entries in the movie dataset:
 RelDate     8.443743
Revenue    89.722416
Runtime    25.018045
dtype: float64


About 90% of the revenues are missing, so we probably won't be able to use this feature.  
8% of the release dates and 25% of the runtimes are missing; we can fill them in if we find the correct values.

In [14]:
characters.isna().any()

WikiID          False
FreeID          False
MovieRelDate     True
CharName         True
DOB              True
Gender           True
Height           True
Ethnicity        True
Actor            True
Age              True
FreeMapID       False
FreeCharID       True
FreeActorID      True
dtype: bool

In [15]:
missing_cols = ['FreeID', 'MovieRelDate', 'CharName', 'DOB', 'Gender', 'Height', 'Ethnicity',
                'Actor', 'Age', 'FreeMapID', 'FreeCharID', 'FreeActorID']
print('Percentage of missing entries in the character dataset:\n', 100*characters[missing_cols].isna().sum() / n_char)

Percentage of missing entries in the character dataset:
 FreeID           0.000000
MovieRelDate     2.217814
CharName        57.220488
DOB             23.552763
Gender          10.120288
Height          65.645740
Ethnicity       76.466542
Actor            0.272484
Age             35.084064
FreeMapID        0.000000
FreeCharID      57.218269
FreeActorID      0.180842
dtype: float64


# DATA

This part deals with the plot summary data.

1. `plot_summaries.txt.gz` [29 M]

Plot summaries of 42,306 movies extracted from the November 2, 2012 dump of English-language Wikipedia.  Each line contains the Wikipedia movie ID (which indexes into `movie.metadata.tsv`) followed by the summary.


2. `corenlp_plot_summaries.tar.gz` [628 M, separate download]

The plot summaries from above, run through the Stanford CoreNLP pipeline (tagging, parsing, NER and coref). Each filename begins with the Wikipedia movie ID (which indexes into `movie.metadata.tsv`).
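Loading the first file could be sketched as follows. The separator is an assumption: the description only says the ID is followed by the summary, and a tab is assumed here. The in-memory sample and its summaries are made up, and the column names `WikiID` and `Summary` are illustrative.

```python
import csv
import io
import pandas as pd

# Stand-in for the file contents. Assumption: ID and summary are separated by a tab.
sample = io.StringIO("975900\tA made-up summary.\n3196793\tAnother made-up summary.\n")

# QUOTE_NONE keeps quotation marks inside summaries from being read as field delimiters.
summaries = pd.read_csv(
    sample,
    sep="\t",
    header=None,
    names=["WikiID", "Summary"],
    quoting=csv.QUOTE_NONE,
)
print(summaries)
```

Naming the first column `WikiID` keeps it aligned with the `movies` frame, so the two can later be joined on that key.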


# TEST DATA

1. `tvtropes.clusters.txt`

72 character types drawn from tvtropes.com, along with 501 instances of those types.  The ID field indexes into the Freebase character/actor map ID in `character.metadata.tsv`.

2. `name.clusters.txt`


970 unique character names used in at least two different movies, along with 2,666 instances of those types.  The ID field indexes into the Freebase character/actor map ID in `character.metadata.tsv`.


In [16]:
tvtropes_path = "data/tvtropes.clusters.txt"
name_path = "data/name.clusters.txt"

##### Mapping: put clusters info into character dataset:
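A rough sketch of such a mapping for the name clusters is shown below. The file layout is an assumption (one cluster name and one Freebase character/actor map ID per line, separated by a tab), the cluster rows are made up, and `name_clusters`, `toy_characters` and `NameCluster` are illustrative names.

```python
import io
import pandas as pd

# Assumption: each line is "<cluster name>\t<Freebase map ID>"; these rows are made up.
sample = io.StringIO("Example Name\t/m/0bgchxw\nExample Name\t/m/0jys3m\n")
name_clusters = pd.read_csv(sample, sep="\t", header=None, names=["NameCluster", "FreeMapID"])

# Tiny stand-in for `characters`, reusing map IDs from the preview earlier in the notebook.
toy_characters = pd.DataFrame({
    "Actor": ["Wanda De Jesus", "Natasha Henstridge", "Ice Cube"],
    "FreeMapID": ["/m/0bgchxw", "/m/0jys3m", "/m/0jys3g"],
})

# Left merge keeps every character; rows without a cluster get NaN in `NameCluster`.
merged = toy_characters.merge(name_clusters, on="FreeMapID", how="left")
print(merged)
```

A left merge is used so that characters outside any cluster are kept rather than dropped.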

---