# Processing the scraped IMDB data

![title](../docs/img/img2.png)

This notebook builds upon the previous `Scraper.ipynb` and processes the returned `JSON` file.

In [140]:
import pandas as pd
import imdb_scraper as scrape
import project_funcs
import json
import os

### Finding the Movies
Because we are working with a user-generated list, we cannot use the IMDB API or any third-party alternatives to obtain the film IDs from this page. Instead, we parse the JSON-LD metadata embedded in the page's HTML and extract each film ID from its URL, returning the IDs as a list.

In [35]:
def getMovies(url):
    ''' scrapes imdb user created list to get the film IDs
        these IDs get used in later computation.
    '''
    response = scrape.getHTML(url)
    data = json.loads(response.find('script', type='application/ld+json').text)
    data = data['about']['itemListElement']
    df = pd.DataFrame(data)
    
    movies = []
    for i in df['url']:
        # slice off the 7-character prefix and the trailing character,
        # leaving only the ID (assumes URLs of the form '/title/<id>/')
        movies.append(i[7:-1])
    
    return movies

In [82]:
url = 'https://www.imdb.com/list/ls076439519/'
movies = getMovies(url)
print(movies)
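
The fixed slice in `getMovies()` only works while every URL has exactly the same prefix and a trailing slash. As a minimal sketch of a more tolerant alternative, the ID can be matched with a regular expression instead. The `sample` payload, the `extract_ids()` helper, and the assumption that IDs look like `tt` followed by digits are illustrative and not part of the original scraper; the real page content may differ.

```python
import json
import re

# Hypothetical JSON-LD payload, shaped like the structure getMovies() reads.
# The real page content may differ.
sample = json.dumps({
    "about": {
        "itemListElement": [
            {"url": "/title/tt0111161/"},
            {"url": "/title/tt0068646/"},
        ]
    }
})

# Assumption: a film ID is 'tt' followed by one or more digits.
ID_PATTERN = re.compile(r"tt\d+")


def extract_ids(ld_json):
    """Return the film IDs found in a JSON-LD list payload."""
    items = json.loads(ld_json)["about"]["itemListElement"]
    ids = []
    for item in items:
        match = ID_PATTERN.search(item["url"])
        if match:  # skip entries with no recognisable ID
            ids.append(match.group())
    return ids


ids = extract_ids(sample)
print(ids)  # ['tt0111161', 'tt0068646']
```

Because the pattern searches anywhere in the string, the same helper would also cope with absolute URLs or a missing trailing slash.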

### Building the Dataset
From the list of movies generated above, we can now get the required data for each one. The following function passes each movie ID to the `getURL()` function and builds a Python list of dictionaries, one per data row. We then create a data frame from that list. We chose this method over appending to a pre-existing data frame row by row because of the large performance gain: its running time grows approximately 30x more slowly with the size of the input.

In [92]:
def scrapeToDataFrame(movies):
    ''' passes each movie ID to the getURL() function,
        collects one dictionary per movie, then creates
        a data frame from that list.
    '''
    rows = []

    # iterate over every ID; range(len(movies)-1) would skip the last one
    for movie_id in movies:
        item = json.loads(scrape.getURL(movie_id))
        rows.append({'imdbID': item['id'],
                     'title':  item['title'],
                     'rating': item['rating'],
                     'votes':  item['votes'],
                     'rated':  item['rated']})

    return pd.DataFrame(rows)

In [94]:
# using movies list created above
df = scrapeToDataFrame(movies)
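
The difference between building a data frame once from a list of dictionaries and growing it row by row can be checked with a rough timing sketch like the one below. It uses synthetic rows rather than scraped data, and `pd.concat` stands in for row-wise appending (the older `DataFrame.append` was removed in pandas 2.0). The function names and the row count are illustrative; measured times depend on the machine and pandas version, so no particular speed-up is claimed here.

```python
import time

import pandas as pd

# Synthetic rows standing in for scraped records (not real IMDB data).
rows = [
    {"imdbID": f"tt{i:07d}", "title": f"Film {i}", "votes": i}
    for i in range(500)
]


def build_from_list(records):
    """Collect all records first, then build the frame in one call."""
    return pd.DataFrame(records)


def build_row_by_row(records):
    """Grow the frame one row at a time; each concat copies the frame so far."""
    df = pd.DataFrame([records[0]])
    for record in records[1:]:
        df = pd.concat([df, pd.DataFrame([record])], ignore_index=True)
    return df


start = time.perf_counter()
fast = build_from_list(rows)
list_seconds = time.perf_counter() - start

start = time.perf_counter()
slow = build_row_by_row(rows)
concat_seconds = time.perf_counter() - start

print(f"list of dicts: {list_seconds:.4f}s")
print(f"row by row:    {concat_seconds:.4f}s")
print("same result:", fast.equals(slow))
```

Both approaches produce the same frame; the row-by-row version repeats a copy of the growing frame on every iteration, which is why its cost rises faster as the list gets longer.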

### Export to File
Now that we have compiled the data frame, it is time to write it to a `.csv` file in our `../data/` directory. To do this, we use the `change_dir()` function defined in `project_funcs.py`, a file that contains various helper functions used throughout the project.

In [150]:
project_funcs.change_dir('data') # change directory to ../data/
path = os.getcwd()
# to_csv returns None when given a path, so there is nothing to assign
df.to_csv(os.path.join(path, 'df.csv'))
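
Changing the working directory affects every later relative path in the notebook. As a sketch of an alternative that leaves the working directory alone, the output path can be built explicitly with `pathlib`. The `export_csv()` helper, the placeholder frame, and the temporary directory are assumptions made for illustration; they are not part of `project_funcs.py`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_csv(df, directory, filename="df.csv"):
    """Write df to directory/filename and return the full path."""
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    out_path = target / filename
    df.to_csv(out_path, index=False)  # index=False omits the row-number column
    return out_path


# Placeholder frame standing in for the scraped data.
demo = pd.DataFrame([{"imdbID": "tt0000001", "title": "Example", "rating": 7.5}])

# A temporary directory keeps the sketch from touching the project's data folder.
with tempfile.TemporaryDirectory() as tmp:
    out_path = export_csv(demo, Path(tmp) / "data")
    restored = pd.read_csv(out_path)

print(out_path.name)
print(restored)
```

In the notebook itself, the call would look something like `export_csv(df, '../data')`, assuming the notebook runs from a sibling directory of `data`.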