# OMDB API

Now that I have a list of TV shows from Wikipedia, I can pass it through the APIs I identified earlier. The first API I'll be accessing is the OMDB API (The Open Movie Database). This API has a Python wrapper, so I'll be using that.


Source: http://www.omdbapi.com/

In [1]:
import pickle
import json
import requests
import re
import time
import pandas as pd
import numpy as np 
import pprint
from collections import defaultdict

pp = pprint.PrettyPrinter(indent=2)

# Loading in the list 

In [2]:
with open('./Assets_&_Data/clean_show_list.pickle', 'rb') as fp:
    clean_show_list = pickle.load(fp)

In [3]:
len(clean_show_list)

9528

# OMDB API Wrapper

Once the list is cleaned, I can put it through the OMDB API wrapper. Access requires an API key: the free tier is limited to 500 pulls a day, and there are donation options to unlock higher limits. Because of the length of my initial list, I decided to donate at the first level, which gets me 100,000 requests a day.

In [4]:
import omdb

In [5]:
omdb.api._client.params_map['apikey'] = 'c998cfb7'
omdb.set_default('apikey', 'c998cfb7')
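
Hard-coding the key in the notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the key has been exported in an environment variable beforehand (the variable name `OMDB_API_KEY` and the helper `load_api_key` are hypothetical choices for this example, not part of the wrapper):

```python
import os


def load_api_key(env=None, var_name="OMDB_API_KEY"):
    """Return the API key stored in an environment variable.

    `var_name` is a hypothetical name chosen for this sketch.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set {var_name} before running the notebook")
    return key


# Usage (commented out so the sketch runs without the wrapper installed):
# omdb.set_default('apikey', load_api_key())
```

The `env` argument exists only so the helper can be exercised with a plain dictionary instead of the real environment.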

# Getting series info from OMDB

As a first step, I'm iterating through the show list and making a GET request for each show. I want synopses/plot information for some NLP/topic modeling (fullplot=True), and I also want to see whether there is enough information from Rotten Tomatoes to be used in modeling (tomatoes=True).

In [6]:
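def get_omdb_series(link_list):
    """Query OMDB by title and return {title: result} for every title."""
    omdb_dict = {}
    for count, title in enumerate(link_list):
        # Every title gets an entry, whatever the lookup returns
        omdb_dict[title] = omdb.get(title=title, fullplot=True, tomatoes=True)
        if count % 500 == 0:
            print("Currently pulling: ", count)  # progress marker
    return omdb_dict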

In [7]:
omdb_shows = get_omdb_series(clean_show_list)

Currently pulling:  0
Currently pulling:  500
Currently pulling:  1000
Currently pulling:  1500
Currently pulling:  2000
Currently pulling:  2500
Currently pulling:  3000
Currently pulling:  3500
Currently pulling:  4000
Currently pulling:  4500
Currently pulling:  5000
Currently pulling:  5500
Currently pulling:  6000
Currently pulling:  6500
Currently pulling:  7000
Currently pulling:  7500
Currently pulling:  8000
Currently pulling:  8500
Currently pulling:  9000
Currently pulling:  9500


In [8]:
len(omdb_shows)

3733

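After iterating through the 9,528 titles in my original list, the dictionary holds only 3,733 entries. Because the function stores an entry for every title it is given, including lookups that return nothing, this number reflects the unique titles in the list rather than the number of successful searches. In other words, there were still a lot of duplicate or bad links in my original Wikipedia pull, and my dataset will be significantly smaller than anticipated.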
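
Since `get_omdb_series` assigns one dictionary entry per title whatever the lookup returns, a quick tally helps separate unique titles from lookups that actually came back with data. A minimal sketch, assuming a failed lookup is stored as an empty (falsy) value, as the `if not omdb_keys[i]` check later in this notebook suggests; `summarize_lookups` is a hypothetical helper and the data is a toy stand-in:

```python
def summarize_lookups(titles, results):
    """Compare the input title list with the lookup results."""
    unique_titles = set(titles)
    # Count only results that are non-empty
    found = sum(1 for value in results.values() if value)
    return {
        "titles_in_list": len(titles),
        "unique_titles": len(unique_titles),
        "entries_stored": len(results),
        "non_empty_results": found,
    }


# Toy data standing in for clean_show_list and omdb_shows
titles = ["Silicon Valley", "Silicon Valley", "Not A Real Show"]
results = {
    "Silicon Valley": {"title": "Silicon Valley"},
    "Not A Real Show": {},
}
summary = summarize_lookups(titles, results)
print(summary)
```

Run against the real objects, this would be `summarize_lookups(clean_show_list, omdb_shows)`.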


In any case, I will save this dictionary and continue.

In [9]:
with open('omdb_show_metadata.json', 'w') as fp:
    json.dump(omdb_shows, fp)

In [10]:
omdb_shows['Silicon Valley'].keys()

dict_keys(['title', 'year', 'rated', 'released', 'runtime', 'genre', 'director', 'writer', 'actors', 'plot', 'language', 'country', 'awards', 'poster', 'ratings', 'metascore', 'imdb_rating', 'imdb_votes', 'imdb_id', 'type', 'total_seasons', 'tomato_meter', 'tomato_image', 'tomato_rating', 'tomato_reviews', 'tomato_fresh', 'tomato_rotten', 'tomato_consensus', 'tomato_user_meter', 'tomato_user_rating', 'tomato_user_reviews', 'tomato_url', 'dvd', 'box_office', 'production', 'website', 'response'])

In [11]:
pp.pprint(omdb_shows['Silicon Valley'])

{ 'actors': 'Thomas Middleditch, Josh Brener, Martin Starr, Kumail Nanjiani',
  'awards': 'Nominated for 2 Golden Globes. Another 13 wins & 76 nominations.',
  'box_office': 'N/A',
  'country': 'USA',
  'director': 'N/A',
  'dvd': 'N/A',
  'genre': 'Comedy',
  'imdb_id': 'tt2575988',
  'imdb_rating': '8.6',
  'imdb_votes': '96,430',
  'language': 'English',
  'metascore': 'N/A',
  'plot': 'In the high-tech gold rush of modern Silicon Valley, the people '
          'most qualified to succeed are the least capable of handling '
          "success. A comedy partially inspired by Mike Judge's own "
          'experiences as a Silicon Valley engineer in the late 1980s.',
  'poster': 'https://m.media-amazon.com/images/M/MV5BMTAxNTEyODE5MTNeQTJeQWpwZ15BbWU4MDE3MjM3ODQz._V1_SX300.jpg',
  'production': 'N/A',
  'rated': 'TV-MA',
  'ratings': [{'source': 'Internet Movie Database', 'value': '8.6/10'}],
  'released': '06 Apr 2014',
  'response': 'True',
  'runtime': '28 min',
  'title': 'Silicon V

Information about these features can be found in the data dictionary.

# Creating a dictionary with IMDB ID as the key

Once I have the series data, I can use it to find the IMDB ID for each of the shows. I will need this ID to further query this API for show information, and I also want to keep it in case it makes iterating through the other APIs easier.

In [12]:
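def create_imdb_keys(omdb_keys):
    """Re-key the OMDB results by IMDB ID, skipping empty results."""
    imdb_keys = {}
    for title, record in omdb_keys.items():
        # Skip titles whose lookup returned nothing or that lack an IMDB ID
        if not record or not record.get('imdb_id'):
            continue
        imdb_keys[record['imdb_id']] = record
    return imdb_keys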

In [13]:
imdb_show_keys = create_imdb_keys(omdb_shows)

In [14]:
len(imdb_show_keys)

2881

# Function to remove movies from the imdb key dictionaries

After looking at the features of this API, I can see that there is a "type" key that indicates whether an entry is a movie or a TV show. I want to get rid of movies, which have no recurring run to contribute to the model (although television movie specials would have been alright). As a backup in case the type is misclassified, I've also decided to remove any shows that are missing the "total_seasons" value.

In [15]:
def remove_movies(omdb_keys):
    """Keep only entries that are not movies and have a season count."""
    omdb_tv_dict = {}
    for imdb_id, record in omdb_keys.items():
        if not record:
            continue
        # Drop movies outright
        if record.get('type') == 'movie':
            continue
        # Backup check: drop entries with no usable season count
        if record.get('total_seasons', 'N/A') == 'N/A':
            continue
        omdb_tv_dict[imdb_id] = record
    return omdb_tv_dict

In [16]:
shows_omdb_keys = remove_movies(imdb_show_keys)

In [17]:
len(shows_omdb_keys)

1957
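
Before trusting the filter, it can help to tally the `type` values to see what is actually being dropped. A small sketch using `collections.Counter`; the sample records are made up for illustration, and `count_types` is a hypothetical helper:

```python
from collections import Counter


def count_types(records):
    """Tally the 'type' field across non-empty records."""
    return Counter(
        record.get("type", "missing")
        for record in records.values()
        if record
    )


# Made-up records standing in for imdb_show_keys
sample = {
    "tt0000001": {"type": "series", "total_seasons": "4"},
    "tt0000002": {"type": "movie"},
    "tt0000003": {"type": "series", "total_seasons": "N/A"},
    "tt0000004": {},
}
type_counts = count_types(sample)
print(type_counts)
```

On the real data the call would be `count_types(imdb_show_keys)`.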

# Function to grab IMDB ID as key, Season Number is subkey, episode information is value

Next, I want to grab information about each show's seasons. The parameters for the GET request are the IMDB ID (which I've just gathered) and the season number. This is being handled in a bit of a roundabout way: "total_seasons" is one of the features in the show dictionary I built in the previous step, so the function uses that number as the maximum to iterate up to (+1 because range() excludes its upper bound).

In [18]:
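def get_episode_count(link_dict):
    """Return {imdb_id: {season_number: season_data}} for every show."""
    episode_count_dict = {}  # all episodes are in here
    for imdb_id in link_dict:
        seasons = {}
        # Seasons are numbered from 1; range() excludes its upper bound
        for season in range(1, int(link_dict[imdb_id]['total_seasons']) + 1):
            seasons[season] = omdb.get(imdbid=str(imdb_id),
                                       season=season,
                                       fullplot=True,
                                       tomatoes=True)
        episode_count_dict[imdb_id] = seasons
    return episode_count_dict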
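
The same nested loop can also be written with the request function passed in as an argument, which makes the season logic testable without touching the network. This is a sketch under assumptions: `fetch_season` is a hypothetical callable standing in for a call like `omdb.get(imdbid=..., season=...)`, and `pause` is an optional delay added here for illustration, not something the API is known to require.

```python
import time


def collect_seasons(shows, fetch_season, pause=0.0):
    """Build {imdb_id: {season_number: season_data}} for every show."""
    all_seasons = {}
    for imdb_id, info in shows.items():
        try:
            total = int(info.get("total_seasons", ""))
        except (TypeError, ValueError):
            continue  # skip 'N/A' or missing season counts
        seasons = {}
        # range() stops before its upper bound, hence the + 1
        for number in range(1, total + 1):
            seasons[number] = fetch_season(imdb_id, number)
            if pause:
                time.sleep(pause)
        all_seasons[imdb_id] = seasons
    return all_seasons


# Stub fetcher so the sketch runs offline
def fake_fetch(imdb_id, season):
    return {"season": str(season), "episodes": []}


demo = collect_seasons(
    {"tt0490532": {"total_seasons": "4"}, "tt0000000": {"total_seasons": "N/A"}},
    fake_fetch,
)
print(sorted(demo["tt0490532"]))
```

Passing the fetcher in also makes it straightforward to swap in retries or caching later without rewriting the loop.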


In [19]:
show_episode_count = get_episode_count(shows_omdb_keys)

In [20]:
len(show_episode_count)

1957

In [21]:
show_episode_count['tt0490532'][1]

{'title': 'Costas Now',
 'season': '1',
 'total_seasons': '4',
 'episodes': [{'title': 'Episode #1.2',
   'released': '2005-06-10',
   'episode': '2',
   'imdb_rating': 'N/A',
   'imdb_id': 'tt0772383'},
  {'title': 'Episode #1.3',
   'released': '2005-07-08',
   'episode': '3',
   'imdb_rating': 'N/A',
   'imdb_id': 'tt0772384'},
  {'title': 'Episode #1.4',
   'released': '2005-08-12',
   'episode': '4',
   'imdb_rating': 'N/A',
   'imdb_id': 'tt0772385'},
  {'title': 'Episode #1.5',
   'released': '2005-09-09',
   'episode': '5',
   'imdb_rating': 'N/A',
   'imdb_id': 'tt0772386'},
  {'title': 'Episode #1.6',
   'released': '2005-10-21',
   'episode': '6',
   'imdb_rating': 'N/A',
   'imdb_id': 'tt0772387'},
  {'title': 'Episode #1.7',
   'released': '2005-11-11',
   'episode': '7',
   'imdb_rating': 'N/A',
   'imdb_id': 'tt0772388'},
  {'title': 'Episode #1.8',
   'released': '2005-12-09',
   'episode': '8',
   'imdb_rating': 'N/A',
   'imdb_id': 'tt0772389'}],
 'response': 'True'}

In [24]:
with open('./Assets_&_Data/show_episode_data.json', 'w') as fp:
    json.dump(show_episode_count, fp)
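
One thing to keep in mind when reading this file back: JSON object keys are always strings, so the integer season keys become `'1'`, `'2'`, and so on after a round trip. A short sketch of converting them back on load (`restore_season_keys` is a hypothetical helper, and the sample dictionary is a trimmed stand-in for the real data):

```python
import json


def restore_season_keys(loaded):
    """Convert season keys from str back to int after json.load."""
    return {
        imdb_id: {int(season): data for season, data in seasons.items()}
        for imdb_id, seasons in loaded.items()
    }


original = {"tt0490532": {1: {"season": "1"}, 2: {"season": "2"}}}
round_tripped = json.loads(json.dumps(original))
print(list(round_tripped["tt0490532"]))  # keys are strings after the round trip
restored = restore_season_keys(round_tripped)
```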

So now I have some rudimentary information about the episodes within each season as well. Spot-checking a couple of random entries, I notice that some episodes seem to be missing and there appear to be a lot of "N/A" values for imdb_rating. At first I thought this might be an issue with my function or the API, but it turned out that the database is simply missing this information altogether.
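
To put numbers on this spot check, one option is to compute, per season, the share of episodes with a rating and the episode numbers that are skipped. A minimal sketch, assuming each season has the shape shown in the Costas Now output above; the helper names are hypothetical, and the sample is trimmed from that output to the relevant fields:

```python
def rating_coverage(season):
    """Return the fraction of listed episodes that carry an IMDb rating."""
    episodes = season.get("episodes", [])
    if not episodes:
        return 0.0
    rated = sum(1 for ep in episodes if ep.get("imdb_rating", "N/A") != "N/A")
    return rated / len(episodes)


def missing_episodes(season):
    """Return episode numbers skipped between 1 and the highest listed one.

    Episodes missing after the highest listed number cannot be detected.
    """
    numbers = {
        int(ep["episode"])
        for ep in season.get("episodes", [])
        if ep.get("episode", "").isdigit()
    }
    if not numbers:
        return []
    return [n for n in range(1, max(numbers) + 1) if n not in numbers]


# Season 1 of tt0490532, trimmed to the fields used here
costas_s1 = {
    "season": "1",
    "episodes": [
        {"episode": str(n), "imdb_rating": "N/A"} for n in range(2, 9)
    ],
}
print(rating_coverage(costas_s1), missing_episodes(costas_s1))
```

For this season the check flags episode 1 as absent and no episode as rated, which matches what the raw output shows.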

## OMDB Data Overview

Even at a cursory glance, there are a lot of nulls and "N/A" values. I will need to figure out which features can be dropped, which shows will need to be dropped, and how to impute data for the shows I keep that have missing data.

Information to keep from this query:
- Actors
- Director
- Genre
- Episode (just for tracking purposes)
- Plot
- Production
- Ratings
- Released
- Runtime
- Season (same as Episode)
- Writer
- imdbRating
- imdbVotes
- seriesID (same as Episode/Season)
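
Most of these fields arrive as strings, with `'N/A'` marking a missing value, so some light parsing is needed before modeling. A sketch of one way to do it, assuming the string formats seen in the Silicon Valley output above (`'8.6'`, `'96,430'`, `'28 min'`); the helper names and the `runtime_min` output key are hypothetical:

```python
def na_to_none(value):
    """Map the API's missing-value markers to None."""
    return None if value in (None, "", "N/A") else value


def parse_record(record):
    """Convert a few string fields into numeric types where possible."""
    rating = na_to_none(record.get("imdb_rating"))
    votes = na_to_none(record.get("imdb_votes"))
    runtime = na_to_none(record.get("runtime"))
    return {
        "imdb_rating": float(rating) if rating else None,
        # Votes use a thousands separator, e.g. '96,430'
        "imdb_votes": int(votes.replace(",", "")) if votes else None,
        # Assumes the 'NN min' format; other formats would need more handling
        "runtime_min": int(runtime.split()[0]) if runtime else None,
        "director": na_to_none(record.get("director")),
    }


# Values copied from the Silicon Valley record
sample = {
    "imdb_rating": "8.6",
    "imdb_votes": "96,430",
    "runtime": "28 min",
    "director": "N/A",
}
cleaned = parse_record(sample)
print(cleaned)
```

Converting `'N/A'` to `None` early keeps missing values distinct from real strings when the records are later loaded into a dataframe.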

# Next Steps

Once I have obtained all the relevant information from the APIs, I will want to save them as individual dictionaries/sparse dataframes.

In [26]:
with open('./Assets_&_Data/omdb_dict.json', 'w') as fp:
    json.dump(omdb_shows, fp)

In [28]:
with open('./Assets_&_Data/omdb_tv_dict.json', 'w') as fp:
    json.dump(shows_omdb_keys, fp)