## Motivation

Why do we want to do that? To get clean metadata.

<div class='alert-warning'> Motivate the merge. </div>

## Methods

### Naive approach

To merge the CMU dataset with the IMDb dataset, we need a combination of common columns that uniquely identifies a movie in both datasets.

The easiest option is to merge on the movie title.

<div class='alert-warning'> Here we show why merging on the title is problematic. </div>

A natural refinement is to merge on the combination of movie title and year, which is more likely to be unique.

<div class='alert-warning'> Here we show why this is also problematic. </div>
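
A minimal sketch of how both candidate keys could be checked for uniqueness. The table below is a hypothetical toy example (titles, years, and the `year` column name are made up for illustration), not the real CMU or IMDb data.

```python
import pandas as pd

# Hypothetical toy table; titles and years are placeholders, not real entries.
movies = pd.DataFrame({
    'title': ['Title A', 'Title A', 'Title B', 'Title B', 'Title C'],
    'year':  [1920, 1938, 1999, 1999, 2005],
})

def count_ambiguous(df, keys):
    """Return the number of rows whose key combination is shared with another row."""
    return int(df.duplicated(subset=keys, keep=False).sum())

ambiguous_by_title = count_ambiguous(movies, ['title'])
ambiguous_by_title_year = count_ambiguous(movies, ['title', 'year'])
print(ambiguous_by_title, ambiguous_by_title_year)
```

Any row counted here would be matched to more than one row in a merge on those keys, so a non-zero count means the key cannot be used as is.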

### Crawling Wikipedia and querying Wikidata

Movies in the CMU dataset are identified by a Wikipedia ID and a Freebase ID; in the IMDb dataset, they are identified by an IMDb ID. Given a mapping from Wikipedia ID or Freebase ID to IMDb ID, the merge becomes straightforward. First, we check that these IDs uniquely identify a movie:

<div class='alert-warning'> ... </div>

We notice that the IMDb ID also appears in the URL of the movie page on the IMDb website. With the help of an external library, [wikipedia](https://wikipedia.readthedocs.org/en/latest/), we can get the content of the Wikipedia page of a movie from its Wikipedia ID in the CMU dataset. We also notice that the IMDb page of a movie is referenced in its Wikipedia page most of the time, meaning that we can link the two by crawling Wikipedia. However, this approach fails if the IMDb page is not linked anywhere in the Wikipedia page, or if the Wikipedia page cannot be retrieved from its page ID.
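
The extraction step can be sketched as follows. This is not the code of `helpers.crawl_wikipedia`, which is not shown here; the regular expression, the function name `extract_imdb_ids`, and the example links are assumptions for illustration. In practice the list of links could come from the `references` attribute of a page object returned by `wikipedia.page(pageid=...)`.

```python
import re

# Assumption: IMDb title IDs are "tt" followed by at least seven digits.
IMDB_TITLE_RE = re.compile(r'imdb\.com/title/(tt\d{7,})')

def extract_imdb_ids(urls):
    """Return the distinct IMDb title IDs found in a list of URLs, in order of appearance."""
    found = []
    for url in urls:
        match = IMDB_TITLE_RE.search(url)
        if match and match.group(1) not in found:
            found.append(match.group(1))
    return found

# Hypothetical list of external links standing in for a page's references.
links = [
    'https://www.imdb.com/title/tt0104536/',
    'https://example.org/some-review',
    'http://www.imdb.com/title/tt0104536/fullcredits',
]
ids = extract_imdb_ids(links)
print(ids)
```

A page that yields no ID, or more than one distinct ID, cannot be resolved by this pattern alone and needs another source.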

For these cases, we follow an alternative approach. In the Wikidata page of a movie, both the Freebase ID and the IMDb ID are listed. We can use the [Wikidata Query Service](https://query.wikidata.org) to match these two together.
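
As a rough sketch of this lookup, the snippet below builds a SPARQL query for a batch of Freebase IDs and sends it to the public endpoint of the Wikidata Query Service. It is not the implementation of `helpers.crawl_wikidata`; the function names, the variable names in the query, and the `User-Agent` string are assumptions. `P345` and `P646` are the Wikidata properties for the IMDb ID and the Freebase ID.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = 'https://query.wikidata.org/sparql'

def build_query(freebase_ids):
    """Build a SPARQL query pairing Freebase IDs (P646) with IMDb IDs (P345)."""
    values = ', '.join(f"'{fid}'" for fid in freebase_ids)
    return (
        'SELECT ?item ?freebase ?imdb WHERE {\n'
        '  ?item wdt:P345 ?imdb .\n'
        '  ?item wdt:P646 ?freebase .\n'
        f'  FILTER(?freebase IN ({values}))\n'
        '}'
    )

def run_query(query, user_agent='cmu-imdb-merge-example/0.1'):
    """Send the query and return (freebase, imdb) pairs. Requires network access."""
    url = WDQS_ENDPOINT + '?' + urllib.parse.urlencode({'query': query, 'format': 'json'})
    request = urllib.request.Request(url, headers={'User-Agent': user_agent})
    with urllib.request.urlopen(request) as response:
        bindings = json.load(response)['results']['bindings']
    return [(b['freebase']['value'], b['imdb']['value']) for b in bindings]

# Only the query is built here; call run_query(query) to actually fetch results.
query = build_query(['/m/0gwygmm', '/m/04csqxh'])
print(query)
```

Sending the IDs in batches rather than one request per movie keeps the number of requests small, which matters for a dataset of this size.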

The `helpers.crawl_wikipedia` and `helpers.crawl_wikidata` methods are used in `helpers.extract_cmu_imdb_mapping` to get such mappings. The whole process takes around 24 hours; it could be sped up by optimizing the requests that the external library sends to Wikipedia, but since we only need to run it once, we keep the library's implementation and focus on the other parts. `helpers.extract_cmu_imdb_mapping` generates one mapping per method, aggregates them, and stores the final, most complete mapping in `./generated/wp2imdb.csv`. The file is available in the repository and can be regenerated by running this function.
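
One way the aggregation step might look is sketched below. This is an assumption about the general shape of the operation, not the code of `helpers.extract_cmu_imdb_mapping`; the dataframe names and all IDs are hypothetical placeholders, and only the column names `wikipedia`, `freebase`, and `imdb` follow the tables in this section.

```python
import pandas as pd

# Hypothetical inputs: every ID below is a placeholder, not a real entry.
cmu = pd.DataFrame({'wikipedia': [1, 2, 3], 'freebase': ['/m/a', '/m/b', '/m/c']})
from_wikipedia = pd.DataFrame({'wikipedia': [1, 2], 'imdb': ['tt0000001', 'tt0000002']})
from_wikidata = pd.DataFrame({'freebase': ['/m/b', '/m/c'], 'imdb': ['tt0000002', 'tt0000003']})

# Attach the missing key to each partial mapping via the CMU table.
part_1 = from_wikipedia.merge(cmu, on='wikipedia', how='left')
part_2 = from_wikidata.merge(cmu, on='freebase', how='left')

# Stack both parts and drop the rows that both methods agree on.
columns = ['wikipedia', 'freebase', 'imdb']
combined = (
    pd.concat([part_1[columns], part_2[columns]], ignore_index=True)
    .drop_duplicates()
    .reset_index(drop=True)
)
print(combined)
```

Rows on which the two methods disagree survive `drop_duplicates`, so they remain visible for later inspection instead of being silently resolved.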

We use this mapping in the rest of this section to merge the two datasets.

In [156]:
import tqdm
import pandas as pd
from time import sleep
import wikipedia
from helpers import crawl_wikidata

### Analyze the generated mapping

In [134]:
mapping_01 = pd.read_csv('generated/wp2imdb_01.csv')
mapping_02 = pd.read_csv('generated/wp2imdb_02.csv')
mapping = pd.read_csv('generated/wp2imdb.csv')
cmu_movies = pd.read_csv('data/MovieSummaries/movie.metadata.tsv', sep='\t', usecols=[0, 1, 2], names=['wikipedia', 'freebase', 'title'])

In [135]:
len(mapping_01), len(mapping_02), len(mapping.imdb.unique()), len(cmu_movies)

(72180, 73894, 76885, 81741)

57 IMDb IDs appear more than once in the mapping from the first method, and 11 in the mapping from the second:

In [142]:
duplicates_01 = mapping_01[mapping_01.imdb.duplicated(keep=False)]
duplicates_02 = mapping_02[mapping_02.imdb.duplicated(keep=False)]

len(duplicates_01.imdb.unique()), len(duplicates_02.imdb.unique())

(57, 11)

Let's analyze the two mappings separately. First, we merge the duplicated mappings with the CMU table to get the titles.

In [155]:
pd.merge(left=duplicates_01, right=cmu_movies.drop('freebase', axis=1), on='wikipedia', how='left').sort_values(by='imdb')

Unnamed: 0,wikipedia,imdb,title
26,29912713,tt0011325,If I Were King
101,1364238,tt0011325,If I Were King
6,7971186,tt0021644,Laughing Gravy
90,7531222,tt0021644,Be Big!
34,26192132,tt0025472,Marie Galante
...,...,...,...
84,13734892,tt1535550,G.I. Joe: The Rise of Cobra
58,36281191,tt1954470,Gangs of Wasseypur Part 2
47,31439778,tt1954470,Gangs of Wasseypur
88,27019957,tt2382698,Pulliman


By checking the Wikipedia page ([17864265](https://en.wikipedia.org/wiki/Itsy_Bitsy_Spider_(film))) of one of these movies, *Itsy Bitsy Spider*, we realize that although the IMDb ID found for it, [tt0104536](https://www.imdb.com/title/tt0104536/), is right, the other Wikipedia ID mapped to the same IMDb ID ([1380383](https://en.wikipedia.org/wiki/Bebe%27s_Kids)) corresponds to a different movie, *Bebe's Kids*. The IMDb ID was picked up there because that Wikipedia page contains an external link to *Itsy Bitsy Spider* in its References section, which caused the confusion.

We realize two things from this case:
1. There could be such mistakes in the first method for matching Wikipedia ID with IMDb ID.
2. It is safer to stick with the second method.

<div class='alert-warning'> Is it the only possible situation?</div>

In [163]:
# TODO: Check more cases (57 in total)

wikipedia.page(pageid=29912713).url, wikipedia.page(pageid=1364238).url

('https://en.wikipedia.org/wiki/If_I_Were_King_(1920_film)',
 'https://en.wikipedia.org/wiki/If_I_Were_King')

Now we do the same thing for the mapping from method 2:

In [164]:
pd.merge(left=duplicates_02, right=cmu_movies.drop('wikipedia', axis=1), on='freebase', how='left').sort_values(by='imdb')

Unnamed: 0,freebase,imdb,title
17,/m/0gwygmm,tt0003886,Enoch Arden
14,/m/04csqxh,tt0003886,Enoch Arden
19,/m/05zkmzh,tt0004047,Half Breed
20,/m/05zm_7t,tt0004047,The Conflicts of Life
21,/m/06zm9kt,tt0044592,Era lei che lo voleva
12,/m/06zp6s1,tt0044592,Oggi sposi
3,/m/0283_5p,tt0080422,Toothache
7,/m/0283_05,tt0080422,Dental Hygiene
5,/m/0gxvwy,tt0102359,Surprise
18,/m/0gxvw7,tt0102359,Light & Heavy


In [165]:
# TODO: Check more cases (11 in total)
crawl_wikidata(values=['/m/0gwygmm', '/m/04csqxh'], by='freebase')

[{'freebase': '/m/04csqxh', 'imdb': 'tt0003886'},
 {'freebase': '/m/0gwygmm', 'imdb': 'tt0003886'},
 {'freebase': '/m/04csqxh', 'imdb': 'tt0001593'},
 {'freebase': '/m/04csqxh', 'imdb': 'tt0001594'}]
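
To read this output more easily, the pairs can be grouped in both directions. The sketch below copies the four pairs returned by the call above; the variable names are our own.

```python
from collections import defaultdict

# Output of the `crawl_wikidata` call above, copied verbatim.
pairs = [
    {'freebase': '/m/04csqxh', 'imdb': 'tt0003886'},
    {'freebase': '/m/0gwygmm', 'imdb': 'tt0003886'},
    {'freebase': '/m/04csqxh', 'imdb': 'tt0001593'},
    {'freebase': '/m/04csqxh', 'imdb': 'tt0001594'},
]

imdb_by_freebase = defaultdict(set)
freebase_by_imdb = defaultdict(set)
for pair in pairs:
    imdb_by_freebase[pair['freebase']].add(pair['imdb'])
    freebase_by_imdb[pair['imdb']].add(pair['freebase'])

# Keys that map to more than one counterpart are ambiguous in that direction.
ambiguous_freebase = {k: sorted(v) for k, v in imdb_by_freebase.items() if len(v) > 1}
ambiguous_imdb = {k: sorted(v) for k, v in freebase_by_imdb.items() if len(v) > 1}
print(ambiguous_freebase)
print(ambiguous_imdb)
```

Here `/m/04csqxh` carries three IMDb IDs on Wikidata, and `tt0003886` is shared by both Freebase IDs, so the relation between the two identifiers is many-to-many for this pair.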

Use the following query in [Wikidata Query Service](https://query.wikidata.org) to get the Wikidata pages and see what is happening:

```sparql
SELECT ?item ?attr ?imdbid WHERE {
  ?item wdt:P345 ?imdbid .
  ?item wdt:P646 ?attr
  FILTER(?attr IN ('/m/0gwygmm', '/m/04csqxh'))
}
```

------

<div class="alert-warning"> UPDATE THE CELLS BELOW </div>

What should we do when both methods return an IMDb ID? Which key should we merge on? In principle it should not matter, but what if, in these cases, the Wikipedia-to-Freebase pairing in our mapping disagrees with the one in the CMU dataset?

Let's check that. We first merge the mapping with the movies dataframe on `freebase`. Then we look at the cases where `wikipedia` differs between the two sides of the merge (`wikipedia_x` from the CMU dataset, `wikipedia_y` from the mapping). Ideally, they should always be the same, but as we will see, this is not the case.

In [48]:
merge_01 = pd.merge(left=cmu_movies, right=mapping, on='freebase', how='left')
mismatch = merge_01[~merge_01.wikipedia_y.isna() & (merge_01.wikipedia_x != merge_01.wikipedia_y)]

In [49]:
display(mismatch)

Unnamed: 0,wikipedia_x,freebase,title,wikipedia_y,imdb
2282,17864265,/m/047cn91,Itsy Bitsy Spider,1380383.0,tt0104536
4982,32136310,/m/0gx13n5,Tokyo Koen,30855569.0,tt1783792
5870,20903211,/m/05b13d8,Maya Machhindra,18849274.0,tt0242660
6992,28089857,/m/0cm83hd,Istanbul,28090138.0,tt0050552
7117,3963507,/m/025sxwm,The Year of the Quiet Sun,32755136.0,tt0088009
...,...,...,...,...,...
76844,22521800,/m/05zm_7t,The Conflicts of Life,22522087.0,tt0004047
77762,14748669,/m/03gwgp0,Ro.Go.Pa.G.,1952819.0,tt0056171
78610,22611251,/m/05zwphf,Deaf Sam-yong,4471177.0,tt0388760
79179,14712775,/m/03gt_yd,Follow the Boys,23951493.0,tt0057066


For the movie *Tokyo Koen*, we realize that the Wikipedia page provided in the CMU dataset (32136310) does not exist. The IMDb ID found, [tt1783792](https://www.imdb.com/title/tt1783792/), belongs to a movie called *Tokyo Park*. However, when we check the Wikipedia page [30855569](https://en.wikipedia.org/wiki/Tokyo_Park) that lists this URL, we can see that the movie has two titles: *Tokyo Koen* and *Tokyo Park*.

From this case, we realize that some Wikipedia pages listed in the CMU dataset can be outdated or changed.

There are 99 movies in this situation. Let's zoom in on a few of them to see what is happening. A different `wikipedia_y` means that the IMDb ID found via the Freebase ID was also found on a Wikipedia page other than the one listed in the CMU dataset.

<div class='alert-warning'> Actually, this makes the first method rubbish. I think we should try to improve that by checking the "External Links" section and run it again (for 15 hours), or to just stick with the second method, which limits us with only 73k movies instead of 76k. But let's include all of these analyses for P2.</div>

In [25]:
import wikipedia

In [59]:
wikipedia.page(pageid=28090138).url

'https://en.wikipedia.org/wiki/Singapore_(1947_film)'

### Merge

...