# Conversion of CMU movies to IMDb IDs
We want to map each movie in the CMU movies dataset to its IMDb ID (`ttconst` values), so that we can attach the rating and the number of ratings to each movie. We try two approaches:
1. Use the Wikipedia page ID of each movie entry to look up its Wikidata ID, then use the Wikidata ID to look up the IMDb ID. We use the Wikimedia API to access these values.
2. Use the Freebase IDs from the CMU dataset and Wikipedia queries to access the IMDb ID.

We then compare the conversion success rate of the two approaches.
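The first approach is a composition of two lookups, and a movie is lost as soon as either lookup fails. A minimal sketch of that idea, using plain dictionaries in place of the real tables (the function name `wiki_to_imdb` and the two toy mappings are illustrative assumptions; the ID values are the ones that appear in this notebook's outputs):

```python
from typing import Optional

# Toy mappings built from example values shown in this notebook's outputs
wiki_to_wikidata = {"975900": "Q261700"}
wikidata_to_imdb = {"Q261700": "tt0228333"}


def wiki_to_imdb(wiki_id: str) -> Optional[str]:
    """Chain the two lookups; return None if either step has no match."""
    wikidata_id = wiki_to_wikidata.get(wiki_id)
    if wikidata_id is None:
        return None
    return wikidata_to_imdb.get(wikidata_id)


print(wiki_to_imdb("975900"))    # tt0228333
print(wiki_to_imdb("18998739"))  # None (no Wikidata ID in the toy mapping)
```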

In [66]:
import requests
import pandas as pd
import numpy as np
from IPython.display import clear_output

In [5]:
# Import token from config.py
from config import WIKI_API_TOKEN

In [6]:
raw_dir = './raw_data/'
processed_dir = './processed_data/'

# Import the movie data
movies_dir = raw_dir + 'CMU/movie.metadata.tsv'

# Read the file into a DataFrame, add headers
movie_df = pd.read_csv(movies_dir, sep='\t', header=None)

# Add column names deduced from README
movie_df.columns = ['wiki_ID', 'free_ID', 'mov_name', 'release', 'revenue', 'runtime', 'languages', 'countries', 'genres']

# Set the index to wiki_ID
movie_df.set_index('wiki_ID', inplace=True)
display(movie_df)

wiki_ID,free_ID,mov_name,release,revenue,runtime,languages,countries,genres
975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science..."
3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biograp..."
28463795,/m/0crgdbh,Brun bitter,1988,,83.0,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D..."
9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic..."
261236,/m/01mrr1,A Woman in Flames,1983,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}"
...,...,...,...,...,...,...,...,...
35228177,/m/0j7hxnt,Mermaids: The Body Found,2011-03-19,,120.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/07s9rl0"": ""Drama""}"
34980460,/m/0g4pl34,Knuckle,2011-01-21,,96.0,"{""/m/02h40lc"": ""English Language""}","{""/m/03rt9"": ""Ireland"", ""/m/07ssc"": ""United Ki...","{""/m/03bxz7"": ""Biographical film"", ""/m/07s9rl0..."
9971909,/m/02pygw1,Another Nice Mess,1972-09-22,,66.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06nbt"": ""Satire"", ""/m/01z4y"": ""Comedy""}"
913762,/m/03pcrp,The Super Dimension Fortress Macross II: Lover...,1992-05-21,,150.0,"{""/m/03_9r"": ""Japanese Language""}","{""/m/03_3d"": ""Japan""}","{""/m/06n90"": ""Science Fiction"", ""/m/0gw5n2f"": ..."


## Method 1: From Wikipedia page ID to IMDb ID
### Part 1: Wikipedia ID to Wikidata ID
Fetch the Wikidata ID for each Wikipedia page ID using the Wikipedia API, for all movies in the CMU dataset.

#### Test for single ID
We first request a single movie's Wikipedia page ID to check that authentication works.

In [5]:
# Request setup
base_url = 'https://en.wikipedia.org/w/api.php'

headers = {
    "Authorization": "Bearer {}".format(WIKI_API_TOKEN)
}

In [6]:
test_id = '975900'

# Request parameters
params = {
    "action": "query",
    "format": "json",
    "prop": "pageprops",
    "pageids": test_id
}

# Request the pageprops for a page
response = requests.get(base_url, headers=headers, params=params).json()

print('Title: {} - WikidataID: {}'.format(response['query']['pages'][test_id]['title'], response['query']['pages'][test_id]['pageprops']['wikibase_item']))

Title: Ghosts of Mars - WikidataID: Q261700


#### Fetch Wikidata IDs for all movies
We map the wiki_IDs from the CMU dataset to Wikidata IDs. A couple of test queries showed that some Wikipedia page IDs are missing (indicated by a `missing` key in the response), so we handle that case explicitly.
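As a minimal sketch of how such a response can be parsed without relying on exceptions, the helper below maps every returned page ID to its Wikidata ID, or to `None` when the page has no `pageprops`. The function name `extract_wikidata_ids` is hypothetical, and the sample response is hand-written: its shape follows the fields this notebook reads, while the second page ID and the exact form of its `missing` entry are assumptions made for illustration.

```python
def extract_wikidata_ids(response: dict) -> dict:
    """Map each page ID in a pageprops response to its Wikidata ID (or None)."""
    result = {}
    pages = response.get("query", {}).get("pages", {})
    for page_id, page in pages.items():
        # Pages flagged as missing carry no `pageprops`, so the lookup yields None
        result[page_id] = page.get("pageprops", {}).get("wikibase_item")
    return result


# Hand-written sample response (the second entry is made up for illustration)
sample_response = {
    "query": {
        "pages": {
            "975900": {
                "pageid": 975900,
                "title": "Ghosts of Mars",
                "pageprops": {"wikibase_item": "Q261700"},
            },
            "1": {"pageid": 1, "missing": ""},
        }
    }
}

print(extract_wikidata_ids(sample_response))
```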

In [7]:
# Parameters
max_url_batch = 50

In [8]:
# We group all the IDs in batches of 50 to make the requests
batch_ids = ['|'.join(map(str,movie_df.index[i:i + max_url_batch])) for i in range(0, len(movie_df.index), max_url_batch)]
print("Batch requests: {}\nTotal movie count: {}".format(len(batch_ids), len(movie_df.index)))

Batch requests: 1635
Total movie count: 81741
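The batch count can be checked independently: splitting `n` IDs into groups of at most 50 gives `ceil(n / 50)` requests. A small sketch, where `make_batches` is a hypothetical helper that mirrors the list comprehension above and the integer range stands in for the real page IDs:

```python
import math


def make_batches(ids, batch_size=50):
    """Join IDs into pipe-separated strings of at most `batch_size` IDs each."""
    return [
        "|".join(map(str, ids[i:i + batch_size]))
        for i in range(0, len(ids), batch_size)
    ]


# Placeholder IDs: only the count matters for this check
placeholder_ids = list(range(81741))
batches = make_batches(placeholder_ids)

print(len(batches), math.ceil(len(placeholder_ids) / 50))  # 1635 1635
print(len(batches[-1].split("|")))                         # 41 IDs in the last batch
```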


In [9]:
wiki_ids = []
wdata_ids = []

# Sanity check: keep track of all wiki_ids that are missing a wikidata_id
missing_wdata_ids = []

# enumerate avoids shadowing the built-in `iter` with a manual counter
for batch_num, batch in enumerate(batch_ids, start=1):

    params = {
        "action": "query",
        "format": "json",
        "prop": "pageprops",
        "pageids": batch
    }

    # Request the pageprops for all wiki_ids in the batch
    response = requests.get(base_url, headers=headers, params=params).json()

    # For each page in query.pages
    for key, page in response['query']['pages'].items():
        wiki_ids.append(key)

        # Pages that are missing or have no Wikidata item lack these keys
        try:
            wdata_id = page['pageprops']['wikibase_item']
        except KeyError:
            wdata_id = ''
            missing_wdata_ids.append(key)
        wdata_ids.append(wdata_id)

    print("Batch {} of {}".format(batch_num, len(batch_ids)))



Batch 1 of 1635
Batch 2 of 1635
Batch 3 of 1635
...

In [10]:
# Sanity check for alignment
print(len(wiki_ids))
print(len(wdata_ids))

# Collect all the missing wiki_ids
missing_test = []
for i in range(len(wiki_ids)):
    if wdata_ids[i] == '':
        missing_test.append(wiki_ids[i])

# Check if all missing wiki_ids are in the missing_wdata_ids list
for i in range(len(missing_test)):
    if missing_test[i] not in missing_wdata_ids:
        print('Missing wiki_id not in missing_wdata_ids: {}'.format(missing_test[i]))

81741
81741


In [11]:
# Create table to store relations between wiki_ID and WikidataID
wpedia_wdata_df = pd.DataFrame(columns=['wiki_ID', 'wikidata_ID'])
wpedia_wdata_df['wiki_ID'] = wiki_ids
wpedia_wdata_df['wikidata_ID'] = wdata_ids

# Set the index to wiki_ID
wpedia_wdata_df.set_index('wiki_ID', inplace=True)

# Set all empty wikidata_ID to NaN
wpedia_wdata_df['wikidata_ID'] = wpedia_wdata_df['wikidata_ID'].replace('', np.nan)

display(wpedia_wdata_df)

wiki_ID,wikidata_ID
18998739,
9997961,
20604092,
31025505,
77856,Q209170
...,...
31422084,Q8073901
32468537,Q965863
34474142,Q5505996
34980460,Q12125420


In [12]:
# Compute statistics
print("Total movies: {}".format(len(wpedia_wdata_df.index)))
print("Movies with WikidataID: {}".format(len(wpedia_wdata_df.dropna().index)))
print("Movies without WikidataID: {}".format(len(wpedia_wdata_df[wpedia_wdata_df['wikidata_ID'].isnull()].index)))
print("=> {}% of movies have a WikidataID".format(round(len(wpedia_wdata_df.dropna().index) / len(wpedia_wdata_df.index) * 100, 2)))

Total movies: 81741
Movies with WikidataID: 76572
Movies without WikidataID: 5169
=> 93.68% of movies have a WikidataID


In [13]:
# Save the table to a csv file
wpedia_wdata_df.to_csv(processed_dir + 'wpedia_wdata.csv')

In [14]:
# Checking length of both dataframes
assert len(movie_df.index) == len(wpedia_wdata_df.index), "Dataframes have different lengths"

With 93.68% of the movies linkable to a Wikidata ID, we assume that the missing ones correspond to Wikipedia pages that were removed or updated since 2012.

This first attempt at converting the Wikipedia IDs to Wikidata IDs achieves decent results. However, we will try to improve the conversion rate by going through the Freebase IDs to reach the Wikidata IDs.

### Part 2: Wikidata to IMDb
We now fetch the IMDb ID (tconst) from the properties of each Wikidata item. We again use the MediaWiki Action API (`api.php`) to do so instead of SPARQL queries.

In [74]:
# Import the wikipedia to wikidata table
wiki_to_wdata_df = pd.read_csv(processed_dir + 'wpedia_wdata.csv', index_col='wiki_ID')
display(wiki_to_wdata_df)

wiki_ID,wikidata_ID
18998739,
9997961,
20604092,
31025505,
77856,Q209170
...,...
31422084,Q8073901
32468537,Q965863
34474142,Q5505996
34980460,Q12125420


In [75]:
# The IMDB ID is stored as property P345 in Wikidata
IMDB_claim = 'P345'

#### Test for a single ID

In [76]:
# Request setup
base_url = 'https://www.wikidata.org/w/api.php'

headers = {
    "Authorization": "Bearer {}".format(WIKI_API_TOKEN)
}

In [77]:
# Test fetching for a single wikidata imdb_id property
test_wikidata_id = 'Q261700'

# Request parameters
params = {
    "action": "wbgetentities",
    "ids": test_wikidata_id,
    "format": "json",
    "props": "claims"
}

# Request the claims for a single Wikidata entity
response = requests.get(base_url, headers=headers, params=params).json()

print('WikidataID: {} - IMDB tconst: {}'.format(test_wikidata_id, response['entities'][test_wikidata_id]['claims'][IMDB_claim][0]['mainsnak']['datavalue']['value']))

WikidataID: Q261700 - IMDB tconst: tt0228333
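The nested lookup above fails with an exception whenever an entity has no `P345` claim or the claim has no value. A minimal sketch of a defensive alternative is shown below. The helper name `extract_imdb_id` is hypothetical, and the sample entities are hand-written and reduced to the keys this notebook reads. Unlike the notebook, which reads only the first statement, this sketch returns the first statement that carries a value.

```python
from typing import Optional

IMDB_PROPERTY = "P345"  # Wikidata property holding the IMDb ID


def extract_imdb_id(entity: dict) -> Optional[str]:
    """Return the first IMDb ID found in an entity's claims, or None."""
    claims = entity.get("claims")
    if not isinstance(claims, dict):
        # No usable claims at all (for example, an entity flagged as missing)
        return None
    for statement in claims.get(IMDB_PROPERTY, []):
        value = statement.get("mainsnak", {}).get("datavalue", {}).get("value")
        if value:
            return value
    return None


# Hand-written sample entities, reduced to the keys used above
with_imdb = {"claims": {"P345": [{"mainsnak": {"datavalue": {"value": "tt0228333"}}}]}}
without_imdb = {"claims": {}}
missing_entity = {"missing": ""}

print(extract_imdb_id(with_imdb))       # tt0228333
print(extract_imdb_id(without_imdb))    # None
print(extract_imdb_id(missing_entity))  # None
```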


#### Fetch for all wikidata IDs

In [78]:
# Parameters
max_url_batch = 50

In [79]:
# Collect the Wikidata IDs, dropping the NaN values
wikidata_ids = wiki_to_wdata_df['wikidata_ID'].dropna().tolist()

# We group all the Wikidata IDs in batches of 50 to make the requests
batch_wikidata_ids = ['|'.join(map(str, wikidata_ids[i:i + max_url_batch])) for i in range(0, len(wikidata_ids), max_url_batch)]
print("Batch requests: {}\nTotal movie count: {}".format(len(batch_wikidata_ids), len(wikidata_ids)))

Batch requests: 1532
Total movie count: 76572


In [84]:
wdata_ids = []
ttconsts = []

missing_ids_log = []

for i, batch in enumerate(batch_wikidata_ids, start=1):
    print("Batch {} of {} (processed {} entries)".format(i, len(batch_wikidata_ids), len(wdata_ids)))

    params = {
        "action": "wbgetentities",
        "ids": batch,
        "format": "json",
        "props": "claims"
    }

    # Request the claims for all Wikidata IDs in the batch
    response = requests.get(base_url, headers=headers, params=params).json()

    # For each entity returned
    for key, entity in response['entities'].items():
        wdata_ids.append(key)

        # Entities without a P345 claim (or without a value for it) are logged as missing
        try:
            ttconst = entity['claims'][IMDB_claim][0]['mainsnak']['datavalue']['value']
        except (KeyError, IndexError, TypeError):
            ttconst = ''
            missing_ids_log.append(key)
        ttconsts.append(ttconst)

    # Clear the previous progress line so that only the latest one is shown
    clear_output(wait=True)

Batch 1532 of 1532 (processed 76550 entries)


In [85]:
# Create table to store relations between wikidataID and ttconst
wdata_ttconst_df = pd.DataFrame(columns=['wikidata_ID', 'ttconst'])
wdata_ttconst_df['wikidata_ID'] = wdata_ids
wdata_ttconst_df['ttconst'] = ttconsts

# Set the index to wikidata_ID
wdata_ttconst_df.set_index('wikidata_ID', inplace=True)

# Set all empty ttconst to NaN
wdata_ttconst_df['ttconst'] = wdata_ttconst_df['ttconst'].replace('', np.nan)

display(wdata_ttconst_df)

wikidata_ID,ttconst
Q209170,tt0058331
Q607122,tt0255819
Q114115,tt0097499
Q729807,tt0020823
Q1579725,tt0021335
...,...
Q8073901,tt0120554
Q965863,tt0459759
Q5505996,tt0035905
Q12125420,tt1606259


In [86]:
# Save the table to a csv file
wdata_ttconst_df.to_csv(processed_dir + 'wdata_ttconst.csv')

In [120]:
# Compute statistics
print("Total movies: {}".format(len(wdata_ttconst_df.index)))
print("Movies with IMDB IDs: {}".format(len(wdata_ttconst_df.dropna().index)))
print("Movies without IMDB IDs: {}".format(len(wdata_ttconst_df[wdata_ttconst_df['ttconst'].isnull()].index)))
print("=> {}% of movies have a IMDB ID linked".format(round(len(wdata_ttconst_df.dropna().index) / len(wdata_ttconst_df.index) * 100, 2)))

Total movies: 76572
Movies with IMDB IDs: 74853
Movies without IMDB IDs: 1719
=> 97.76% of movies have a IMDB ID linked


### Conversion Results
We now link the Wikipedia IDs to the Wikidata IDs, and then to the IMDb IDs, to measure the overall conversion rate of this method.

In [122]:
# Import two tables
wiki_to_wdata_df = pd.read_csv(processed_dir + 'wpedia_wdata.csv', index_col='wiki_ID')
wdata_to_ttconst_df = pd.read_csv(processed_dir + 'wdata_ttconst.csv', index_col='wikidata_ID')

# Left join the two tables on wikidata_ID
wiki_to_ttconst_df = wiki_to_wdata_df.join(wdata_to_ttconst_df, how='left', on='wikidata_ID')
display(wiki_to_ttconst_df)

# Compute the overall data loss
ttconst_loss = len(wiki_to_ttconst_df[wiki_to_ttconst_df['ttconst'].isnull()].index)
print("Total conversion success: {}% ({} out of {} movies lost)".format(round((1 - ttconst_loss / len(wiki_to_ttconst_df.index)) * 100, 2), ttconst_loss, len(wiki_to_ttconst_df.index)))

wiki_ID,wikidata_ID,ttconst
18998739,,
9997961,,
20604092,,
31025505,,
77856,Q209170,tt0058331
...,...,...
31422084,Q8073901,tt0120554
32468537,Q965863,tt0459759
34474142,Q5505996,tt0035905
34980460,Q12125420,tt1606259


Total conversion success: 91.57% (6888 out of 81741 movies lost)
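The overall figure is consistent with the per-stage statistics printed earlier: a movie lost at the first stage never reaches the second one, so the two losses simply add up. A short worked check using the counts reported above (the variable names are illustrative):

```python
# Counts taken from the statistics printed earlier in this notebook
total_movies = 81741
lost_wikipedia_to_wikidata = 5169  # movies without a Wikidata ID
lost_wikidata_to_imdb = 1719       # Wikidata items without an IMDb ID

total_lost = lost_wikipedia_to_wikidata + lost_wikidata_to_imdb
success_rate = round((1 - total_lost / total_movies) * 100, 2)

# Share of the total loss that happens at the first stage
first_stage_share = round(lost_wikipedia_to_wikidata / total_lost * 100, 1)

print(total_lost, success_rate)  # 6888 91.57
print(first_stage_share)         # 75.0
```

About three quarters of the lost movies are therefore lost at the Wikipedia-to-Wikidata step, which is the step the Freebase route tries to improve.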


## Method 2: Freebase conversion