I am using the [`wikipedia`](https://pypi.org/project/wikipedia/) library. Its documentation is available [here](https://wikipedia.readthedocs.io/en/latest/code.html).

In [None]:
!pip install wikipedia

In [60]:
import wikipedia as wp
import pandas as pd

# Wikipedia ID -> IMDb ID

### Example

We want to find the IMDb ID (`tt0228333`) from the Wikipedia page ID (`975900`), which links the [Wikipedia page](https://en.wikipedia.org/wiki/Ghosts_of_Mars) to the [IMDb page](https://www.imdb.com/title/tt0228333/) of the movie *Ghosts of Mars*.


We can easily spot the external links on this page:

In [42]:
p = wp.page(pageid='975900')
p.references

['http://www.theofficialjohncarpenter.com/pages/themovies/gm/gmstrk.html',
 'http://www.filmtracks.com/titles/ghosts_mars.html',
 'http://www.soundtrack.net/albums/database/?id=2877',
 'http://www.sbs.com.au/movies/review/sci-fi-fans-will-love-it',
 'http://www.contactmusic.com/ice-cube/news/ice-cube-regrets-turning-down-menace-taking-ghosts-of-mars_1081122',
 'https://web.archive.org/web/20121222230453/http://www.theofficialjohncarpenter.com/pages/themovies/gm/gmstrk.html',
 'https://m.cinemascore.com/',
 'https://www.allmovie.com/movie/v250566',
 'http://ihorrordatabase.blogspot.com/2016/08/on-this-day-in-horror-august-24th.html',
 'https://web.archive.org/web/20040907041359/http://www.theofficialjohncarpenter.com/pages/themovies/gm/gm.html',
 'https://www.imdb.com/title/tt0228333/',
 'https://www.boxofficemojo.com/movies/?id=ghostsofmars.htm',
 'https://www.rottentomatoes.com/m/john_carpenters_ghosts_of_mars',
 'https://www.austinchronicle.com/events/film/2001-08-24/141402/',
 'http

### Function to do this (to be moved to a helper module):

In [77]:
import re
from typing import Optional, Union
from wikipedia import PageError, DisambiguationError

In [97]:
def get_refs(pageid: Union[int, str]) -> Optional[dict[str, Optional[str]]]:
    """Collect external IDs from the reference URLs of a Wikipedia page.

    Returns None if the page cannot be loaded.
    """
    imdb_id = None
    wikidata_id = None
    freebase_id = None
    bomojo_id = None

    try:
        wikipedia_page = wp.page(pageid=pageid)
    except (PageError, AttributeError, DisambiguationError):
        return None

    for url in wikipedia_page.references:
        # NOTE: There might be several external links to IMDb pages, needs to be checked
        if 'imdb.com/title' in url:
            # Take the path segment after "title", with or without a trailing slash
            match = re.search(r'imdb\.com/title/([^/?#]+)', url)
            if match:
                imdb_id = match.group(1)  # TODO: Sometimes gets overwritten by other IMDb URLs, improve it
        # NOTE: There might be several external links to Wikidata pages, needs to be checked
        if 'wikidata.org/wiki' in url:
            wikidata_id = url.split('/')[-1].split('#')[0]
        # NOTE: There might be several external links to Box Office Mojo pages, needs to be checked
        if 'boxofficemojo.com/movies' in url:
            query = url.split('/')[-1].split('.')[0]
            matches = [item.split('=')[1] for item in query.split('?') if item.split('=')[0] == 'id']
            if matches:  # avoid an IndexError when the URL has no "id=" parameter
                bomojo_id = matches[0]

    ids = {
        'wikidata': wikidata_id,
        'imdb': imdb_id,
        'freebase': freebase_id,
        'boxofficemojo': bomojo_id,
    }

    return ids

Voilà:

In [98]:
get_refs('975900')

{'wikidata': 'Q261700',
 'imdb': 'tt0228333',
 'freebase': None,
 'boxofficemojo': ['ghostsofmars']}
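
One way to address the TODO about IMDb IDs being overwritten is to collect every IMDb title ID found in the references and keep the most frequent one. This is only a sketch: `pick_imdb_id` is a hypothetical helper, and the assumption that the most frequently linked title is the movie's own has not been checked against the dataset.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Matches IMDb title IDs such as "tt0228333" inside a URL.
IMDB_TITLE_RE = re.compile(r'imdb\.com/title/(tt\d+)')


def pick_imdb_id(urls: Iterable[str]) -> Optional[str]:
    """Return the IMDb title ID that appears most often in `urls`, or None."""
    counts = Counter(
        match.group(1)
        for url in urls
        for match in IMDB_TITLE_RE.finditer(url)
    )
    if not counts:
        return None
    # Heuristic (assumption): the page's own title is linked most often.
    return counts.most_common(1)[0][0]
```

It could be called as `pick_imdb_id(wikipedia_page.references)` inside `get_refs`.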

### Try on the dataset

In [99]:
movie_column_names = [
    "wikipedia",
    "freebase",
    "title",
    "release",
    "borevenue",
    "runtime",
    "languages",
    "countries",
    "genres",
]
cmu_movies = pd.read_csv('data/MovieSummaries/movie.metadata.tsv', sep='\t', names=movie_column_names)

In [108]:
wikidata_ids = []
imdb_ids = []
freebase_ids = []
bomojo_ids = []

for wp_id in cmu_movies.sample(1000).wikipedia:
    try:
        refs = get_refs(pageid=wp_id)
    except Exception as e:
        print(wp_id, e)
        refs = None

    if refs:
        wikidata_ids.append(refs['wikidata'])
        imdb_ids.append(refs['imdb'])
        freebase_ids.append(refs['freebase'])
        bomojo_ids.append(refs['boxofficemojo'])



It took 700 s for 1,000 movies, i.e. about 0.7 s per page. At that rate, the whole dataset would take roughly 16 hours.
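
The estimate can be reproduced with a few lines of arithmetic. The corpus size used here (81,741 movies) is an assumption; `len(cmu_movies)` gives the actual number.

```python
SECONDS_MEASURED = 700
MOVIES_MEASURED = 1_000
N_MOVIES = 81_741  # assumed corpus size; check with len(cmu_movies)

seconds_per_movie = SECONDS_MEASURED / MOVIES_MEASURED
total_hours = seconds_per_movie * N_MOVIES / 3600

print(f'{seconds_per_movie:.1f} s per movie, about {total_hours:.0f} hours in total')
```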

We could make it faster by replicating what the library does. It sends one page ID per query, which is why it takes so long; the same query could be sent with multiple page IDs at once. But is it worth it?

In [117]:
imdb_ids.count(None)

40
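
A rough sketch of the batched query mentioned above, assuming the library's `references` property is backed by the MediaWiki API's `prop=extlinks` query. The helper names (`extlinks_params`, `parse_extlinks`) are hypothetical, the limit of 50 page IDs per request is the one I believe applies to ordinary clients, and the sketch ignores continuation, so a real loop would also have to follow the `continue` field of each response.

```python
from typing import Iterable

API_URL = 'https://en.wikipedia.org/w/api.php'
MAX_IDS_PER_QUERY = 50  # assumed limit for non-bot clients


def extlinks_params(pageids: Iterable[int]) -> dict[str, str]:
    """Build query parameters asking for the external links of several pages."""
    return {
        'action': 'query',
        'format': 'json',
        'prop': 'extlinks',
        'ellimit': 'max',
        # Several page IDs are joined with "|" in a single request.
        'pageids': '|'.join(str(pageid) for pageid in pageids),
    }


def parse_extlinks(payload: dict) -> dict[int, list[str]]:
    """Map each page ID in an API response to its list of external URLs."""
    pages = payload.get('query', {}).get('pages', {})
    return {
        int(pageid): [link['*'] for link in page.get('extlinks', [])]
        for pageid, page in pages.items()
    }
```

With `requests`, one batch would then be `parse_extlinks(requests.get(API_URL, params=extlinks_params(batch)).json())`, where `batch` holds at most `MAX_IDS_PER_QUERY` page IDs.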

# Freebase ID -> Wikidata ID -> IMDb ID

We want to find the IMDb ID (`tt0228333`) from the Freebase ID (`/m/03vyhn`) using the Wikidata ID (`Q261700`) and [Wikidata page](https://www.wikidata.org/wiki/Q261700).

In [17]:
import requests

## First method (doesn't work properly)

In [50]:
URL = 'https://query.wikidata.org/sparql'
Q_GET_IMDB = """
SELECT ?imdbid WHERE {
  ?item wdt:P646 '%s' .
  ?item wdt:P345 ?imdbid .
}
"""

# PROBLEM: Requests get blocked!
def get_imdbid(freebaseid: str) -> list[str]:
    # NOTE: Could be much faster if we sent multiple Freebase IDs in one query
    r = requests.get(URL, params={'format': 'json', 'query': Q_GET_IMDB % freebaseid})
    imdbids = [binding['imdbid']['value'] for binding in r.json()['results']['bindings']]
    return imdbids


In [51]:
get_imdbid(freebaseid='/m/03vyhn')

['tt0228333']

### Try on the dataset

In [48]:
movie_column_names = [
    "wikipedia",
    "freebase",
    "title",
    "release",
    "borevenue",
    "runtime",
    "languages",
    "countries",
    "genres",
]
cmu_movies = pd.read_csv('data/MovieSummaries/movie.metadata.tsv', sep='\t', names=movie_column_names)

In [58]:
get_imdbid('/m/0g9g9x')

['tt0023196']

In [None]:
imdb_ids = []

for fb_id in cmu_movies.sample(1000).freebase:
    try:
        imdb_ids.append(get_imdbid(fb_id))
    except Exception as e:
        print(f'"{fb_id}": {e}')
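
If the blocking is rate limiting (an assumption; the notebook does not record the HTTP status), two things might help: sending a descriptive `User-Agent`, which Wikimedia's policy asks for, and waiting before retrying when the service answers HTTP 429. The sketch below uses only the standard library; `run_sparql`, `retry_delay`, and the contact string are hypothetical.

```python
import json
import time
import urllib.error
import urllib.parse
import urllib.request
from typing import Optional

SPARQL_URL = 'https://query.wikidata.org/sparql'
# Hypothetical contact string; replace with your own project and e-mail.
USER_AGENT = 'cmu-movies-id-mapper/0.1 (mailto:you@example.org)'


def retry_delay(retry_after: Optional[str], attempt: int) -> float:
    """Seconds to wait: the Retry-After value if numeric, else 2 ** attempt."""
    if retry_after and retry_after.isdigit():
        return float(retry_after)
    return float(2 ** attempt)


def run_sparql(query: str, max_retries: int = 3) -> dict:
    """Run a SPARQL query, backing off when the service answers HTTP 429."""
    url = SPARQL_URL + '?' + urllib.parse.urlencode({'format': 'json', 'query': query})
    request = urllib.request.Request(url, headers={'User-Agent': USER_AGENT})
    for attempt in range(max_retries + 1):
        try:
            with urllib.request.urlopen(request, timeout=60) as response:
                return json.load(response)
        except urllib.error.HTTPError as error:
            # Give up on other errors, or once the retries are used up.
            if error.code != 429 or attempt == max_retries:
                raise
            time.sleep(retry_delay(error.headers.get('Retry-After'), attempt))
```

`run_sparql(Q_GET_IMDB % '/m/03vyhn')` would return the same JSON structure that `r.json()` yields in `get_imdbid`.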

## Second method

In [59]:
Q_GET_IMDBS = """
SELECT ?imdbid WHERE {
  ?item wdt:P345 ?imdbid .
  ?item wdt:P646 ?freebaseid
  FILTER(?freebaseid IN (%s))
}
"""

In [100]:
URL = 'https://query.wikidata.org/sparql'
Q_GET_IMDBS = """
SELECT ?freebaseid ?imdbid WHERE {
  ?item wdt:P345 ?imdbid .
  ?item wdt:P646 ?freebaseid
  FILTER(?freebaseid IN (%s))
}
"""

def get_mapping(freebaseids: list[str]) -> list[dict[str, str]]:
    # Send all Freebase IDs in a single query instead of one request per ID
    id_list = ', '.join(f"'{fbid}'" for fbid in freebaseids)
    r = requests.get(URL, params={'format': 'json', 'query': Q_GET_IMDBS % id_list})
    mapping = [
        {'imdb': binding['imdbid']['value'], 'freebase': binding['freebaseid']['value']}
        for binding in r.json()['results']['bindings']
    ]
    return mapping

In [104]:
pd.DataFrame(get_mapping(['/m/03vyhn', '/m/0g9g9x']))

Unnamed: 0,imdb,freebase
0,tt0228333,/m/03vyhn
1,tt0023196,/m/0g9g9x


### Try on the dataset

In [105]:
pd.DataFrame(get_mapping(cmu_movies.sample(10).freebase))

Unnamed: 0,imdb,freebase
0,tt0157210,/m/0hnd9rx
1,tt0806008,/m/04n0h6y
2,tt0091105,/m/02w6t29
3,tt0041758,/m/04grn47
4,tt0075386,/m/04gkp22
5,tt0038148,/m/0ggv_y
6,tt0111301,/m/04nssc
7,tt0023853,/m/0hhv7p8
8,tt1525835,/m/0gls6ck
9,tt0093543,/m/0bt9bn


In [111]:
pd.DataFrame(get_mapping(cmu_movies.sample(100).freebase))

Unnamed: 0,imdb,freebase
0,tt0480001,/m/0glymk
1,tt0154506,/m/047gphb
2,tt0150915,/m/02chhq
3,tt0218867,/m/0gg9qrq
4,tt0190006,/m/05mz1t8
...,...,...
87,tt0061565,/m/05q7qzb
88,tt0294856,/m/0cp0sww
89,tt0030377,/m/04mznxs
90,tt13914002,/m/03cztq8


In [114]:
pd.DataFrame(get_mapping(cmu_movies.sample(200).freebase))

Unnamed: 0,imdb,freebase
0,tt0107076,/m/05j3s1
1,tt0032983,/m/0777s1
2,tt0494834,/m/09s8t4
3,tt0152238,/m/0cp02np
4,tt1448755,/m/0b__g0s
...,...,...
173,tt0216149,/m/05zzv6g
174,tt6447518,/m/02805bs
175,tt14336726,/m/0gvqxnz
176,tt13919770,/m/0jt0fjg


In [116]:
pd.DataFrame(get_mapping(cmu_movies.sample(300).freebase))

Unnamed: 0,imdb,freebase
0,tt0274407,/m/03gvf3x
1,tt0452694,/m/02qp2yn
2,tt0439544,/m/02qlth5
3,tt0816150,/m/02qsqjz
4,tt0073200,/m/0j66v56
...,...,...
271,tt8263552,/m/02pkhq2
272,tt1575576,/m/0kt_sz
273,tt1710335,/m/0j9mxwk
274,tt13683840,/m/0grzzh


This doesn't work for very long queries: the maximum is around 200 to 300 Freebase IDs at a time. At that batch size, the whole dataset will take about 5 hours.

In [None]:
# TODO: Send batches of 200 freebase IDs at a time and loop

for _ in range(10):
    ...
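
The batching loop from the TODO could look roughly like this. It is a sketch under a few assumptions: `chunked` and `map_in_batches` are hypothetical helpers, `fetch` stands for a function with the same interface as `get_mapping`, and the pause between batches is a guess at what keeps the endpoint happy.

```python
import time
from typing import Callable, Iterator, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def map_in_batches(
    freebaseids: Sequence[str],
    fetch: Callable[[Sequence[str]], list],
    batch_size: int = 200,
    pause: float = 1.0,
) -> list:
    """Call `fetch` on each batch of IDs and concatenate the results."""
    mapping = []
    for batch in chunked(freebaseids, batch_size):
        try:
            mapping.extend(fetch(batch))
        except Exception as error:
            # Log the failed batch and continue with the next one.
            print(f'batch starting with {batch[0]!r} failed: {error}')
        time.sleep(pause)  # be gentle with the endpoint between batches
    return mapping
```

In the notebook this would be called as `pd.DataFrame(map_in_batches(cmu_movies.freebase.tolist(), get_mapping))`.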