<img src="https://raw.githubusercontent.com/ashlitaylor/ashlitaylor.github.io/master/images/Popcorn.jpeg" height="600" width="400">
<header> 
    <h2 align="center"> Scraping, Cleaning, and Graphing The Movie Database "Drama" data </h2>
    <h6 align="center"> Python code to run API </h6>
</header>

The Movie Database ([TMDb](https://www.themoviedb.org/)) is a popular, user-editable database for movies and TV shows. This Python code uses the TMDb API to retrieve and parse data for the 350 most popular movies released in the 'Drama' genre since 2004, along with up to five similar movies for each, with the goal of graphing a network of similar Drama movies.

#### Before running the code
An API key is needed to access the TMDb data and run the code.
* How to get a [TMDb account](https://www.themoviedb.org/account/signup)
* Request an [API key](https://docs.google.com/document/d/e/2PACX-1vQkWjHiLS1Xu2HZNQ7Egv08l_DdPnugoxUOZ0ugqBNHWY529xWB417QoSS0MbIih6PS9gu1Y1D-NFDT/pub)
* TMDb API [Documentation](https://developers.themoviedb.org/3/getting-started/introduction)

After retrieving up to five similar movies for each title, I removed all duplicate movie pairs. That is, if both the pairs A,B and B,A are present, I keep only A,B, where the ID of movie A is less than the ID of movie B. I then wrote the pairs to a CSV file and used that file to generate a graph in Gephi that maps movie similarity.

##### Importing the necessary modules and libraries

In [1]:
import csv
import json
import threading
import urllib.request
from collections import defaultdict

###### Drama Query
The code cell below prompts for your API key, which is then included in every request.

In [None]:
API_KEY = input('Enter your TMDb API key: ').strip()

I first created an empty dictionary to hold the movie ID and title of each result, and an empty list to hold the pages returned by the query.

In [3]:
drama_genre_query = {}
drama_list_pages = list()

I used the API to search for movies in the 'Drama' genre released in 2004 or later, sorted from most popular to least popular, and kept the 350 most popular. Multiple API calls are needed to retrieve them all: each query returns one page of 20 results, so I used a 'for' loop to request the first 18 pages, which cover the top 18 x 20 = 360 movie titles.

In [4]:
# Base query: Drama (genre ID 18), released on or after 2004-01-01, most popular first
q1b_url = 'https://api.themoviedb.org/3/discover/movie?api_key=' + API_KEY + '&language=en-US&sort_by=popularity.desc&include_adult=false&include_video=false&primary_release_date.gte=2004-01-01&with_genres=18'
# Each page holds 20 results, so pages 1 through 18 cover the top 360 movies
for i in range(1, 19):
    page_url_append = '&page=' + str(i)
    with urllib.request.urlopen(q1b_url + page_url_append) as response:
        drama_page = json.load(response)
    drama_list_pages.append(drama_page['results'])
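
The query string above is assembled by hand. As an alternative, here is a minimal sketch that builds the same parameters with `urllib.parse.urlencode` and makes the page count explicit. The helper name `build_discover_url` and the placeholder key `'YOUR_KEY'` are assumptions for illustration, not part of the original notebook, and no network request is made.

```python
import math
from urllib.parse import urlencode

BASE_URL = 'https://api.themoviedb.org/3/discover/movie'
MOVIES_WANTED = 350
RESULTS_PER_PAGE = 20  # page size stated in the text above

def build_discover_url(api_key, page):
    """Return the discover URL for one page of Drama results (hypothetical helper)."""
    params = {
        'api_key': api_key,
        'language': 'en-US',
        'sort_by': 'popularity.desc',
        'include_adult': 'false',
        'include_video': 'false',
        'primary_release_date.gte': '2004-01-01',
        'with_genres': 18,  # genre ID used for Drama in the query above
        'page': page,
    }
    return BASE_URL + '?' + urlencode(params)

# 350 movies at 20 per page rounds up to 18 pages
pages_needed = math.ceil(MOVIES_WANTED / RESULTS_PER_PAGE)
urls = [build_discover_url('YOUR_KEY', page) for page in range(1, pages_needed + 1)]
print(pages_needed)
print(urls[0])
```

Keeping the parameters in a dictionary makes it easier to change a single filter, such as the release date, without editing a long string.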

I created a dictionary of movie-ID:title pairs for the top 350 movies and saved it to a CSV file. Each line in the file describes one movie in the format movie-ID,movie-name.

In [5]:
# Keep the first 350 movies as movie-ID: title pairs
for page in drama_list_pages:
    for movie in page:
        if len(drama_genre_query) < 350:
            drama_genre_query[movie['id']] = movie['title']

# Write one movie-ID,movie-name row per movie
csv_file = 'movie_ID_name.csv'
with open(csv_file, 'w', encoding="utf-8", newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(drama_genre_query.items())
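
To illustrate the flattening step without calling the API, here is a small sketch on made-up data. The sample pages, the limit of 3, and the helper name `first_n_movies` are assumptions for illustration only.

```python
from itertools import chain, islice

def first_n_movies(pages, limit):
    """Flatten a list of result pages and keep the first `limit` movies as {id: title}."""
    all_movies = chain.from_iterable(pages)
    return {movie['id']: movie['title'] for movie in islice(all_movies, limit)}

# Two made-up pages of two results each
sample_pages = [
    [{'id': 1, 'title': 'A'}, {'id': 2, 'title': 'B'}],
    [{'id': 3, 'title': 'C'}, {'id': 4, 'title': 'D'}],
]
print(first_n_movies(sample_pages, 3))  # {1: 'A', 2: 'B', 3: 'C'}
```

Unlike the loop above, which caps the number of unique IDs, this sketch counts raw results, so a movie repeated across pages would leave the dictionary slightly short of the limit.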

###### Similar movie retrieval
For each movie I retrieved, I used the API to find up to five similar movies. The API returns as many as it can find, so I made my code flexible enough to work with however many movies it returned. I looped through each movie ID collected by my first API call and passed it to a function that requests the similar movies. The API allows 40 requests every 10 seconds, so I added a short pause between requests while looping.

In [6]:
import time

urlfront = 'https://api.themoviedb.org/3/movie/'
urlend = '/similar?api_key=' + API_KEY + '&language=en-US&page=1'
similar_movie_list = list()

def similarrequest(source):
    """Request the movies similar to `source` and record up to five (source, target) pairs."""
    queryurl = urlfront + str(source) + urlend
    with urllib.request.urlopen(queryurl) as response:
        similar_movie_query = json.load(response)
    # The API may return fewer than five results; slicing handles any count
    for similar_movie in similar_movie_query['results'][:5]:
        similar_movie_list.append((source, similar_movie['id']))

for source in drama_genre_query:
    similarrequest(source)
    # 40 requests per 10 seconds allows one request every 0.25 seconds
    time.sleep(0.25)
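
The rate limit quoted above (40 requests every 10 seconds) works out to at most one request every 10 / 40 = 0.25 seconds. Below is a minimal sketch of a throttle built on that figure. The `Throttle` class and its injectable `clock` and `sleep` parameters are hypothetical names introduced here so the logic can be tried without calling the API.

```python
import time

class Throttle:
    """Block so that successive calls to wait() are at least `min_interval` seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now

MIN_INTERVAL = 10 / 40  # seconds per request under the limit stated above

throttle = Throttle(MIN_INTERVAL)
for movie_id in [101, 102, 103]:  # made-up IDs
    throttle.wait()
    # a real loop would request the similar movies for movie_id here
```

Measuring the time since the previous call means a slow response already counts toward the interval, so the loop only sleeps for whatever is left.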

###### Deduplication
After I found all similar movies, I removed all duplicate movie pairs. That is, if both the pairs A,B and B,A are present, I keep only A,B, where A < B. I saved the results in a CSV file, where each line describes one pair of similar movies in the format movie-ID,similar-movie-ID.

In [7]:
# Lookup table of every (source, target) pair that was retrieved
retrieved_pairs = set(similar_movie_list)
seen_pairs = set()
update_list = []
for source, target in similar_movie_list:
    if (target, source) in retrieved_pairs:
        # Both A,B and B,A exist: keep the ordering with the smaller ID first
        pair = (min(source, target), max(source, target))
    else:
        pair = (source, target)
    if pair not in seen_pairs:
        seen_pairs.add(pair)
        update_list.append(pair)

# Write one movie-ID,similar-movie-ID row per pair
csv_file = 'movie_ID_sim_movie.csv'
with open(csv_file, 'w', encoding="utf-8", newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(update_list)
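
A small worked example of the deduplication rule described above, using made-up IDs. The function name `dedupe_pairs` is hypothetical; it is one way to express the rule, not necessarily the exact code behind the published graph.

```python
def dedupe_pairs(pairs):
    """Drop repeats; when both (a, b) and (b, a) occur, keep only (min, max)."""
    present = set(pairs)
    seen = set()
    result = []
    for source, target in pairs:
        if (target, source) in present:
            pair = (min(source, target), max(source, target))
        else:
            pair = (source, target)
        if pair not in seen:
            seen.add(pair)
            result.append(pair)
    return result

# (10, 20) and (20, 40) appear in both directions; (30, 10) appears twice
sample = [(10, 20), (20, 10), (30, 10), (20, 40), (40, 20), (30, 10)]
print(dedupe_pairs(sample))  # [(10, 20), (30, 10), (20, 40)]
```

Pairs that occur in only one direction, such as (30, 10), keep their original order, so the edge direction is preserved for them.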

##### Similar movie network graph

After I pulled the movie-ID,similar-movie-ID pairs, I used [Gephi](https://gephi.org/) to create the graph below, which maps movie similarity. I colored and scaled the nodes by in-degree, the number of incoming connections.

<img src="https://ashlitaylor.github.io/TMDb/graph.png" height="600" width="400">
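
Gephi computes in-degree itself, but the same quantity can be sketched in a few lines as a sanity check. This assumes each CSV row is read as a directed edge from the first column to the second; the sample edges and the name `in_degree` are made up for illustration.

```python
from collections import Counter

def in_degree(edges):
    """Count how many edges point at each node."""
    return Counter(target for _source, target in edges)

# Made-up edges: node 2 is the target of three of them
sample_edges = [(1, 2), (3, 2), (1, 4), (5, 2)]
degrees = in_degree(sample_edges)
print(degrees.most_common(1))  # [(2, 3)]
```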