# Merging CMU dataset with the IMDb dataset

## Motivation

Why merge the two datasets? To get clean metadata.

<div class='alert-warning'> Motivate the merge. </div>

## Methods

### Naive approach

To merge the CMU dataset with the IMDb dataset, we need a combination of common columns that uniquely identifies a movie in both datasets.

The easiest option is to merge on the movie title.

<div class='alert-warning'> Here we show why merging on the title is problematic. </div>

A more specific candidate key is the combination of the movie title and the release year.

<div class='alert-warning'> Here we show why this is also problematic. </div>
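As a minimal sketch of why title-based keys are fragile, the toy example below merges two small hand-made tables. These are not the real datasets: every title, year, and ID is made up for illustration. When a title occurs more than once on either side, a merge on the title pairs every matching row with every other one.

```python
import pandas as pd

# Toy stand-ins for the two datasets; all values are illustrative only.
cmu_toy = pd.DataFrame({
    "wikipedia": [1, 2, 3],
    "title": ["Same Title", "Same Title", "Unique Title"],
    "year": [1920, 1938, 1950],
})
imdb_toy = pd.DataFrame({
    "imdb": ["tt0000001", "tt0000002", "tt0000003"],
    "title": ["Same Title", "Same Title", "Unique Title"],
    "year": [1920, 1938, 1950],
})

# Merging on the title alone pairs each "Same Title" row with both candidates.
by_title = pd.merge(cmu_toy, imdb_toy, on="title")

# Adding the year removes the ambiguity here, but only because the toy years
# differ and agree exactly across the two tables.
by_title_year = pd.merge(cmu_toy, imdb_toy, on=["title", "year"])

print(len(cmu_toy), len(by_title), len(by_title_year))
```

In this toy case the title-only merge inflates three movies to five rows. The title-and-year key works only because the years were chosen to match. Two same-titled movies released in the same year, or a release year recorded differently in the two sources, would break it again.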

### Crawling Wikipedia and querying Wikidata

The movies in the CMU dataset are identified by a Wikipedia ID and a Freebase ID. In the IMDb dataset, they are identified by IMDb IDs. If we had a mapping from Wikipedia IDs or Freebase IDs to IMDb IDs, the two datasets could be merged directly. First, we check that these IDs uniquely identify a movie:

<div class='alert-warning'> ... </div>

We notice that the IMDb ID also appears in the URL of the movie page on the IMDb website. With the help of an external library, [wikipedia](https://wikipedia.readthedocs.org/en/latest/), we can get the content of the Wikipedia page of a movie from its Wikipedia ID in the CMU dataset. We also notice that the Wikipedia page of a movie references its IMDb page most of the time, meaning that we can link the two by crawling Wikipedia. However, this approach fails if the IMDb page is not linked anywhere in the Wikipedia page, or if the Wikipedia page cannot be retrieved from its page ID.
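The extraction step can be sketched as a regular-expression search over a page's external links. This is an illustration, not the code in `helpers.crawl_wikipedia`: the function name `find_imdb_id` and the example links are hypothetical. In practice the link list could come from the `references` attribute of the page object returned by `wikipedia.page(pageid=...)`.

```python
import re

# IMDb title IDs are "tt" followed by at least seven digits.
IMDB_TITLE_RE = re.compile(r"imdb\.com/title/(tt\d{7,})")

def find_imdb_id(links):
    """Return the first IMDb title ID found in an iterable of URLs, else None."""
    for link in links:
        match = IMDB_TITLE_RE.search(link)
        if match:
            return match.group(1)
    return None

# Hypothetical external links of one Wikipedia page.
example_links = [
    "https://www.example.org/review",
    "https://www.imdb.com/title/tt0000001/",
]
print(find_imdb_id(example_links))
```

Returning `None` when no link matches makes the failure case explicit, so those pages can be handed to a fallback method.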

For these cases, we follow an alternative approach. In the Wikidata page of a movie, both the Freebase ID and the IMDb ID are listed. We can use the [Wikidata Query Service](https://query.wikidata.org) to match these two together.
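One way such a lookup could be phrased is shown below, as a sketch rather than the query used by `helpers.crawl_wikidata`. On Wikidata, `P646` is the Freebase ID property and `P345` is the IMDb ID property. The helper names `build_query` and `parse_imdb_ids` and the sample response are assumptions made for illustration. The response dictionary only mimics the shape of the SPARQL JSON results format.

```python
# P646 = Freebase ID, P345 = IMDb ID (Wikidata property identifiers).
QUERY_TEMPLATE = """
SELECT ?item ?imdb WHERE {
  ?item wdt:P646 "%s" ;
        wdt:P345 ?imdb .
}
"""

def build_query(freebase_id):
    """Fill the Freebase ID into the SPARQL template."""
    return QUERY_TEMPLATE % freebase_id

def parse_imdb_ids(response_json):
    """Extract the IMDb IDs from a SPARQL JSON response."""
    return [b["imdb"]["value"] for b in response_json["results"]["bindings"]]

# Made-up response with the structure returned for `format=json`.
fake_response = {
    "results": {
        "bindings": [
            {
                "item": {"value": "http://www.wikidata.org/entity/Q0"},
                "imdb": {"value": "tt0000001"},
            }
        ]
    }
}
print(parse_imdb_ids(fake_response))
```

The query string would be sent to `https://query.wikidata.org/sparql` with `format=json`. An empty `bindings` list means Wikidata has no item carrying both identifiers.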

`helpers.extract_cmu_imdb_mapping` uses `helpers.crawl_wikipedia` and `helpers.crawl_wikidata` to build these mappings. It generates one mapping per method, aggregates them, and stores the final, most complete mapping in `./generated/wp2imdb.csv`. The whole process takes around 24 hours. It could be sped up by engineering the requests that the external library sends to Wikipedia, but since we only need to run it once, we keep the library's implementation and focus on the other parts. The file is available in the repository and can be regenerated by running this function.

We use this mapping in the rest of this section to merge the two datasets.
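The aggregation of the two mappings could look roughly like the sketch below. The actual logic lives in `helpers.extract_cmu_imdb_mapping` and may differ. The toy tables, the intermediate column names `imdb_wp` and `imdb_wd`, and the choice to prefer the Wikidata result are all assumptions.

```python
import pandas as pd

# Toy data: three CMU movies and two partial mappings (values are made up).
cmu_toy = pd.DataFrame({"wikipedia": [1, 2, 3], "freebase": ["/m/a", "/m/b", "/m/c"]})
map_wikipedia = pd.DataFrame({"wikipedia": [1, 2], "imdb": ["tt0000001", "tt0000009"]})
map_wikidata = pd.DataFrame({"freebase": ["/m/b", "/m/c"], "imdb": ["tt0000002", "tt0000003"]})

# Attach each mapping to the CMU table through its own key.
combined = (
    cmu_toy
    .merge(map_wikipedia.rename(columns={"imdb": "imdb_wp"}), on="wikipedia", how="left")
    .merge(map_wikidata.rename(columns={"imdb": "imdb_wd"}), on="freebase", how="left")
)

# Prefer the Wikidata result and fall back to the Wikipedia crawl when it is missing.
combined["imdb"] = combined["imdb_wd"].fillna(combined["imdb_wp"])
print(combined[["wikipedia", "freebase", "imdb"]])
```

Here the second movie has conflicting IDs from the two methods, and the Wikidata one wins. Reversing the `fillna` call would give priority to the Wikipedia crawl instead.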

In [1]:
import requests
import pandas as pd
from time import sleep
import wikipedia

In [15]:
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

## Analyze the generated mappings

In [2]:
mapping_01 = pd.read_csv('generated/wp2imdb_01.csv')
mapping_02 = pd.read_csv('generated/wp2imdb_02.csv')
mapping = pd.read_csv('generated/wp2imdb.csv')
cmu_movies = pd.read_csv('data/MovieSummaries/movie.metadata.tsv', sep='\t', usecols=[0, 1, 2], names=['wikipedia', 'freebase', 'title'])

In [3]:
len(mapping_01), len(mapping_02), len(mapping.imdb.unique()), len(cmu_movies)

(72180, 73894, 76885, 81741)

57 IMDb IDs appear more than once in the mapping from the first method, and 11 in the mapping from the second one:

In [12]:
duplicates_01 = mapping_01[mapping_01.imdb.duplicated(keep=False)]
duplicates_02 = mapping_02[mapping_02.imdb.duplicated(keep=False)]

len(duplicates_01.imdb.unique()), len(duplicates_02.imdb.unique())

(57, 11)
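The role of `keep=False` in the cell above can be seen on a toy frame with made-up values: it flags every row that shares its `imdb` value with another row, whereas the default flags only the second and later occurrences.

```python
import pandas as pd

# Toy mapping in which one IMDb ID is assigned to two Wikipedia pages.
toy = pd.DataFrame({
    "wikipedia": [10, 11, 12],
    "imdb": ["tt0000001", "tt0000001", "tt0000002"],
})

all_copies = toy[toy.imdb.duplicated(keep=False)]  # both rows of the repeated ID
later_copies = toy[toy.imdb.duplicated()]          # only the second occurrence

print(len(all_copies), len(later_copies), all_copies.imdb.nunique())
```

Keeping all copies is what allows the conflicting pages to be compared side by side. Counting unique IDs afterwards gives the number of affected movies rather than the number of affected rows.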

Let's analyze the two mappings separately. First, we merge the duplicated mappings with the CMU table to get the titles. Then we add the URL of the Wikipedia page to check them more easily.

In [17]:
duplicates_01 = pd.merge(left=duplicates_01, right=cmu_movies.drop('freebase', axis=1), on='wikipedia', how='left').sort_values(by='imdb')
duplicates_01['url'] = duplicates_01.wikipedia.apply(lambda pageid: wikipedia.page(pageid=pageid).url)
display(duplicates_01, display_id=False)

Unnamed: 0,wikipedia,imdb,title,url
26,29912713,tt0011325,If I Were King,https://en.wikipedia.org/wiki/If_I_Were_King_(1920_film)
101,1364238,tt0011325,If I Were King,https://en.wikipedia.org/wiki/If_I_Were_King
6,7971186,tt0021644,Laughing Gravy,https://en.wikipedia.org/wiki/Laughing_Gravy
90,7531222,tt0021644,Be Big!,https://en.wikipedia.org/wiki/Be_Big!
34,26192132,tt0025472,Marie Galante,https://en.wikipedia.org/wiki/Marie_Galante_(film)
77,1416847,tt0025472,The Power and the Glory,https://en.wikipedia.org/wiki/The_Power_and_the_Glory_(1933_film)
53,35030671,tt0026191,The Flying Fleet,https://en.wikipedia.org/wiki/The_Flying_Fleet
28,14711494,tt0026191,Ceiling Zero,https://en.wikipedia.org/wiki/Ceiling_Zero
92,22610953,tt0030337,A Woman's Face,https://en.wikipedia.org/wiki/A_Woman%27s_Face_(1938_film)
72,11635934,tt0030337,A Woman's Face,https://en.wikipedia.org/wiki/A_Woman%27s_Face


[tt0104536](https://www.imdb.com/title/tt0104536/): Wikipedia page ([17864265](https://en.wikipedia.org/wiki/Itsy_Bitsy_Spider_(film))) corresponds to the right movie, *Itsy Bitsy Spider*. Wikipedia page ([1380383](https://en.wikipedia.org/wiki/Bebe%27s_Kids)) corresponds to a different movie, *Bebe's Kids*. The wrong IMDb ID was picked up because the References section of this page contains an external link to the IMDb page of *Itsy Bitsy Spider*, which caused the confusion.

We learn two things from this case:
1. The first method can make such mistakes when matching a Wikipedia ID with an IMDb ID.
2. It is safer to stick with the second method.

> Most of the duplicated cases are like this.

> An easy fix is to loop over the external links in reverse order.
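A sketch of that fix, under the assumption that a page's own IMDb link tends to appear after links cited in the references. The function name `find_last_imdb_id` and the example links are hypothetical.

```python
import re

IMDB_TITLE_RE = re.compile(r"imdb\.com/title/(tt\d{7,})")

def find_last_imdb_id(links):
    """Scan the links from last to first and return the first IMDb ID found."""
    for link in reversed(links):
        match = IMDB_TITLE_RE.search(link)
        if match:
            return match.group(1)
    return None

# Hypothetical page: a reference cites another movie's IMDb page first,
# and the page's own IMDb link comes last.
links = [
    "https://www.imdb.com/title/tt0000002/",  # cited in a reference
    "https://www.example.org/article",
    "https://www.imdb.com/title/tt0000001/",  # the movie's own IMDb page
]
print(find_last_imdb_id(links))
```

This is only a heuristic. It returns the wrong ID whenever an unrelated IMDb link follows the movie's own link, so the duplicate check remains useful after applying it.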

Now we do the same thing for the mapping from method 2 and we get the Wikidata pages:

In [31]:
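# SPARQL query: find the Wikidata items whose Freebase ID (property P646) matches.
Q = """
SELECT ?item WHERE {
  ?item wdt:P646 '%s' .
}
"""
URL = 'https://query.wikidata.org/sparql'

def get_wikidata_items(freebaseid):
    """Return the Wikidata entity URLs that carry the given Freebase ID."""
    response = requests.get(URL, params={'format': 'json', 'query': Q % freebaseid})
    response.raise_for_status()  # surface HTTP errors before parsing the body
    items = [b['item']['value'] for b in response.json()['results']['bindings']]
    sleep(5)  # pause between requests to avoid overloading the query service
    return items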

In [32]:
duplicates_02 = pd.merge(left=duplicates_02, right=cmu_movies.drop('wikipedia', axis=1), on='freebase', how='left').sort_values(by='imdb')
duplicates_02['url'] = duplicates_02.freebase.apply(get_wikidata_items)
display(duplicates_02, display_id=False)

Unnamed: 0,freebase,imdb,title_x,title_y,title,url
0,/m/0gwygmm,tt0003886,Enoch Arden,Enoch Arden,Enoch Arden,[http://www.wikidata.org/entity/Q5379267]
1,/m/04csqxh,tt0003886,Enoch Arden,Enoch Arden,Enoch Arden,[http://www.wikidata.org/entity/Q629580]
2,/m/05zkmzh,tt0004047,Half Breed,Half Breed,Half Breed,[http://www.wikidata.org/entity/Q4979898]
3,/m/05zm_7t,tt0004047,The Conflicts of Life,The Conflicts of Life,The Conflicts of Life,[http://www.wikidata.org/entity/Q7435149]
4,/m/06zm9kt,tt0044592,Era lei che lo voleva,Era lei che lo voleva,Era lei che lo voleva,[http://www.wikidata.org/entity/Q3730996]
5,/m/06zp6s1,tt0044592,Oggi sposi,Oggi sposi,Oggi sposi,[http://www.wikidata.org/entity/Q3730996]
6,/m/0283_5p,tt0080422,Toothache,Toothache,Toothache,[http://www.wikidata.org/entity/Q2744030]
7,/m/0283_05,tt0080422,Dental Hygiene,Dental Hygiene,Dental Hygiene,[http://www.wikidata.org/entity/Q2744030]
8,/m/0gxvwy,tt0102359,Surprise,Surprise,Surprise,[http://www.wikidata.org/entity/Q16944380]
9,/m/0gxvw7,tt0102359,Light & Heavy,Light & Heavy,Light & Heavy,[http://www.wikidata.org/entity/Q6545959]


Checking the Wikidata pages of these items shows that the movies in each pair are closely related and, most of the time, are actually the same movie on IMDb listed under different Freebase IDs. We conclude that this method is reliable.

## Merge

<div class='alert-warning'> Let's use method 2 for now. We will get 3k more movies but we need to re-run the first method (15 hours). </div>

In [49]:
mapping_02 = pd.read_csv('generated/wp2imdb_02.csv')
movies = pd.merge(left=cmu_movies, right=mapping_02, on='freebase', how='left')

In [50]:
movies.sample(10)

Unnamed: 0,wikipedia,freebase,title,imdb
41649,3838436,/m/0b2mcl,Follow Me Quietly,tt0041378
4344,2685725,/m/07xxt4,Sinbad and the Eye of the Tiger,tt0076716
74168,2388874,/m/078j5m,Jumborg Ace & Giant,tt3263160
37244,17061845,/m/041737v,Business and Pleasure,tt0021702
12603,35436294,/m/0j9mryz,Jail Yatra,tt0390139
30437,27850610,/m/0cc5fmz,Make a Wish,
52172,5127062,/m/0d3_lf,Once Upon a Texas Train,tt0095781
69689,24485822,/m/080k97p,Skiptracers,tt1270847
34724,1160944,/m/04cd1m,Buddy the Gob,tt0024924
47034,18916948,/m/04jkrxj,The Last Target,tt0068834
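Rows with an empty `imdb` value, such as *Make a Wish* in the sample above, are movies for which the mapping has no entry. The sketch below counts them and guards against accidental row duplication. It uses toy tables with made-up values, and the `validate` and `indicator` arguments are an addition for illustration, not part of the merge cell above.

```python
import pandas as pd

# Toy stand-ins for the CMU table and the Freebase-to-IMDb mapping.
cmu_toy = pd.DataFrame({"freebase": ["/m/a", "/m/b", "/m/c"], "title": ["A", "B", "C"]})
mapping_toy = pd.DataFrame({"freebase": ["/m/a", "/m/b"], "imdb": ["tt0000001", "tt0000002"]})

merged = pd.merge(
    cmu_toy,
    mapping_toy,
    on="freebase",
    how="left",
    validate="many_to_one",  # raises if a Freebase ID is repeated in the mapping
    indicator=True,          # adds a `_merge` column: "both" or "left_only"
)

n_unmatched = int(merged["imdb"].isna().sum())
print(merged["_merge"].value_counts())
print(n_unmatched)
```

With `how="left"` every CMU movie is kept, so the row count stays unchanged as long as the mapping has at most one row per Freebase ID.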
