I am using the [`wikipedia`](https://pypi.org/project/wikipedia/) library. You can read the documentation [here](https://wikipedia.readthedocs.io/en/latest/code.html).

In [None]:
!pip install wikipedia

In [60]:
import wikipedia as wp
import pandas as pd

### Example

We want to find the IMDb ID (`tt0228333`) from the Wikipedia page ID (`975900`), which links the [Wikipedia page](https://en.wikipedia.org/wiki/Ghosts_of_Mars) to the [IMDb page](https://www.imdb.com/title/tt0228333/) of the movie *Ghosts of Mars*.


We can easily spot the external links on this page:

In [42]:
p = wp.page(pageid='975900')
p.references

['http://www.theofficialjohncarpenter.com/pages/themovies/gm/gmstrk.html',
 'http://www.filmtracks.com/titles/ghosts_mars.html',
 'http://www.soundtrack.net/albums/database/?id=2877',
 'http://www.sbs.com.au/movies/review/sci-fi-fans-will-love-it',
 'http://www.contactmusic.com/ice-cube/news/ice-cube-regrets-turning-down-menace-taking-ghosts-of-mars_1081122',
 'https://web.archive.org/web/20121222230453/http://www.theofficialjohncarpenter.com/pages/themovies/gm/gmstrk.html',
 'https://m.cinemascore.com/',
 'https://www.allmovie.com/movie/v250566',
 'http://ihorrordatabase.blogspot.com/2016/08/on-this-day-in-horror-august-24th.html',
 'https://web.archive.org/web/20040907041359/http://www.theofficialjohncarpenter.com/pages/themovies/gm/gm.html',
 'https://www.imdb.com/title/tt0228333/',
 'https://www.boxofficemojo.com/movies/?id=ghostsofmars.htm',
 'https://www.rottentomatoes.com/m/john_carpenters_ghosts_of_mars',
 'https://www.austinchronicle.com/events/film/2001-08-24/141402/',
 'http

### Function to do this (to be moved to a helper module):

In [77]:
from typing import Union
from wikipedia import PageError, DisambiguationError

In [97]:
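def get_refs(pageid: Union[int, str]) -> Union[dict, None]:
    """Return the external IDs found in a page's references, or None if the page cannot be loaded."""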
    imdb_id = None
    wikidata_id = None
    freebase_id = None  # never populated below: no Freebase URL pattern is matched yet
    bomojo_id = None

    try:
        wikipedia_page = wp.page(pageid=pageid)
    except (PageError, AttributeError, DisambiguationError):
        return None

    for url in wikipedia_page.references:
        # NOTE: There might be several external links to IMDb pages, needs to be checked
        if 'imdb.com/title' in url:
            imdb_id = url.split('/')[-2]  # TODO: Sometimes gets overwritten by other IMDb URLs, improve it
        # NOTE: There might be several external links to Wikidata pages, needs to be checked
        if 'wikidata.org/wiki' in url:
            wikidata_id = url.split('/')[-1].split('#')[0]
        # NOTE: There might be several external links to Box Office Mojo pages, needs to be checked
        if 'boxofficemojo.com/movies' in url:
            bomojo_id = [item.split('=')[1] for item in url.split('/')[-1].split('.')[0].split('?') if item.split('=')[0] == 'id']

    ids = {
        'wikidata': wikidata_id,
        'imdb': imdb_id,
        'freebase': freebase_id,
        'boxofficemojo': bomojo_id,
    }

    return ids
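
The TODO about IMDb IDs being overwritten could be handled by stopping at the first matching URL, and the Box Office Mojo parsing could lean on `urllib.parse` instead of chained `split` calls. The following is only a sketch: `first_imdb_id` and `bomojo_id` are hypothetical helper names, and taking the *first* IMDb link is an assumption that still needs to be checked against the data.

```python
import re
from urllib.parse import parse_qs, urlparse

# Matches title URLs such as https://www.imdb.com/title/tt0228333/
IMDB_TITLE_RE = re.compile(r"imdb\.com/title/(tt\d+)")


def first_imdb_id(urls):
    """Return the IMDb title ID of the first matching URL, or None."""
    for url in urls:
        match = IMDB_TITLE_RE.search(url)
        if match:
            return match.group(1)
    return None


def bomojo_id(url):
    """Return the `id` query parameter of a Box Office Mojo URL without its '.htm' suffix."""
    values = parse_qs(urlparse(url).query).get("id", [])
    if not values:
        return None
    value = values[0]
    return value[:-4] if value.endswith(".htm") else value
```

Unlike the list built in `get_refs`, `bomojo_id` returns a single string (or `None`), so the two are not drop-in replacements for each other.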

Voilà:

In [98]:
get_refs('975900')

{'wikidata': 'Q261700',
 'imdb': 'tt0228333',
 'freebase': None,
 'boxofficemojo': ['ghostsofmars']}

### Try on the dataset

In [99]:
movie_column_names = [
    "wikipedia",
    "freebase",
    "title",
    "release",
    "borevenue",
    "runtime",
    "languages",
    "countries",
    "genres",
]
cmu_movies = pd.read_csv('data/MovieSummaries/movie.metadata.tsv', sep='\t', names=movie_column_names)

In [108]:
wikidata_ids = []
imdb_ids = []
freebase_ids = []
bomojo_ids = []

for wp_id in cmu_movies.sample(1000).wikipedia:
    try:
        refs = get_refs(pageid=wp_id)
    except Exception as e:
        print(wp_id, e)
        refs = None

    if refs:
        wikidata_ids.append(refs['wikidata'])
        imdb_ids.append(refs['imdb'])
        freebase_ids.append(refs['freebase'])
        bomojo_ids.append(refs['boxofficemojo'])
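
One caveat with the loop above: pages that fail are skipped, and the sampled page IDs are not stored, so the four lists cannot be matched back to rows of `cmu_movies`. A minimal sketch of an alternative that keys every result by its page ID is shown here. `collect_refs` is a hypothetical helper, and `lookup` stands for any function with the same contract as `get_refs`.

```python
def collect_refs(page_ids, lookup):
    """Map every page ID to its lookup result, storing None for failures."""
    results = {}
    for page_id in page_ids:
        try:
            results[page_id] = lookup(page_id)
        except Exception as exc:  # same broad catch as the loop above
            print(page_id, exc)
            results[page_id] = None
    return results
```

With this shape, a call such as `collect_refs(cmu_movies.sample(1000).wikipedia, get_refs)` would keep one entry per sampled page, including the ones that could not be resolved.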





It took 700 s for 1,000 movies, i.e. about 0.7 s per page. At that rate, the whole dataset would take about 16 hours.

We can make it faster by replicating what the library does. It sends one page ID per query, which is why it takes so long. We could send the same query with multiple page IDs at once. But is it worth it?
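
A rough sketch of what such a batched query could look like against the MediaWiki API, using only the standard library. Several things here are assumptions rather than tested facts: that the library's `references` come from `prop=extlinks`, that 50 page IDs per request is the cap for regular clients, and that the URL sits under the `*` key (or `url`, depending on the response format). All function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"
BATCH_SIZE = 50  # assumed cap on page IDs per request for regular clients


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_params(page_ids):
    """Build query parameters asking for the external links of several pages."""
    return {
        "action": "query",
        "format": "json",
        "prop": "extlinks",
        "ellimit": "max",
        "pageids": "|".join(str(p) for p in page_ids),
    }


def parse_extlinks(data):
    """Map each page ID (as a string) to the external URLs in one response."""
    links = {}
    for page_id, page in data.get("query", {}).get("pages", {}).items():
        urls = [link.get("*") or link.get("url") for link in page.get("extlinks", [])]
        links[page_id] = [u for u in urls if u]
    return links


def fetch_extlinks(page_ids):
    """Fetch external links in batches, following API continuation tokens."""
    merged = {}
    for batch in chunked(page_ids, BATCH_SIZE):
        params = build_params(batch)
        while True:
            url = API_URL + "?" + urllib.parse.urlencode(params)
            request = urllib.request.Request(url, headers={"User-Agent": "extlinks-sketch/0.1"})
            with urllib.request.urlopen(request) as response:
                data = json.load(response)
            for page_id, urls in parse_extlinks(data).items():
                merged.setdefault(page_id, []).extend(urls)
            if "continue" not in data:
                break
            params.update(data["continue"])  # ask for the next slice of links
    return merged
```

Because the link limit applies to a whole response rather than to each page, a batch may still need several continuation requests, so the real speed-up would have to be measured.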

In [117]:
imdb_ids.count(None)

40