# IMDB Dataset Description Extraction

This notebook scrapes the descriptions of the datasets listed at https://www.imdb.com/interfaces/. To run it, you must be connected to the Guest Nationwide network or be off the Nationwide network entirely.

The notebook parses the HTML content and isolates the paragraph element `<p>` whose class is _blurb_. Within that element, it first looks for the bold tags `<b>`, which enclose the dataset names. A second parsing pass then looks for the list item tags `<li>`, which contain the individual field descriptions.
    
After all the parsing is complete, a CSV file will be generated for each dataset containing the field descriptions for that file. The file can be imported into other notebooks for exploration.

In [58]:
# encoding: utf-8

from bs4 import BeautifulSoup
import requests

# specify the url
url = 'https://www.imdb.com/interfaces/'

# fetch url
response = requests.get(url)

# parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(response.content, 'html.parser')

In [59]:
# find the class blurb section of the page
divs = soup.find_all('p', {'class': 'blurb'})

# validate that the class was found.
len(divs)

1

In [60]:
# capture the results
div = divs[0]

In [86]:
# parse through the results and look for the bold attribute - the dataset is encased in bold tags: <b>title</b>
# store the data set names in a list for future use

dataset = []
for section in div.find_all('b'):
    ds = section.find_all(text=True)[0]
    dataset.append(ds)

In [87]:
# delete 'Data Location' from the list
del dataset[0]

In [88]:
# delete 'IMDb Dataset Details' from the list - after the previous deletion it sits at index 0
del dataset[0]

In [89]:
# parse through the results and look for the list item attribute - the field descriptions are encased in li tags: <li>title</li>
# store the field descriptions in a list for future use
description = []
for section in div.find_all('li'):
    des = section.find_all(text=True)[0]
    description.append(des)
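
The loops above keep only the first text node of each tag. The sketch below, using made-up markup, shows how that differs from `get_text()` when a tag contains nested elements. Whether the IMDb page actually nests tags inside its `<li>` entries is not established here; the HTML string is purely an illustrative assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one list item whose text is interrupted by a nested tag.
html = '<ul><li>ordering (integer) - a number to <i>uniquely</i> identify rows</li></ul>'
item = BeautifulSoup(html, 'html.parser').find('li')

first_node = item.find_all(string=True)[0]  # only the text before <i>
full_text = item.get_text()                 # all text inside the <li>

print(repr(first_node))
print(repr(full_text))
```

If the descriptions ever come back truncated, switching to `get_text()` may be worth trying.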

In [94]:
import re

fldDescription = []

# iterate through the field descriptions and break apart the elements
# use the regular expression library to locate and replace any en dash characters

for i in description:
    txt = re.sub(u"\u2013", "-", i) # convert every en dash to a hyphen
    txt = txt.replace('‘', "'") # convert left single quotation mark to an apostrophe
    txt = txt.replace('’', "'") # convert right single quotation mark to an apostrophe
    txt = txt.split('-', 1) # split txt on the first occurrence of a hyphen
    fldDesc = txt[1].strip()
    txt = txt[0].split(' ', 1) # split txt on the first occurrence of a space
    fldName = txt[0].strip()
    fldType = txt[1].strip()
    fldType = fldType.replace('(', '').replace(')', '') # remove the parentheses around the type value
    fldDescription.append([fldName, fldType, fldDesc])
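
The loop above assumes every entry has the shape `name (type) - description` and raises `IndexError` on anything else. As a minimal sketch, the same split can be expressed as one regular expression that returns `None` for lines of a different shape. The names `FIELD_PATTERN` and `parse_field` and the sample strings are illustrative assumptions, not part of the notebook.

```python
import re

# Matches "name (type) - description"; the dash may be a hyphen or an en dash.
FIELD_PATTERN = re.compile(r"^\s*(\S+)\s*\(([^)]*)\)\s*[-\u2013]\s*(.*)$", re.DOTALL)

def parse_field(line):
    """Return [name, type, description], or None if the line has another shape."""
    match = FIELD_PATTERN.match(line)
    if match is None:
        return None
    name, field_type, desc = (part.strip() for part in match.groups())
    # Normalize en dashes and curly single quotes, as the loop above does.
    desc = desc.replace("\u2013", "-").replace("\u2018", "'").replace("\u2019", "'")
    return [name, field_type, desc]

print(parse_field("titleId (string) - a tconst"))
print(parse_field("no type or dash here"))
```

Rows that come back as `None` can be logged and skipped rather than stopping the whole run.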

In [95]:
# generate a header row
HEADER = ['fldName', 'fldType', 'fldDesc']

In [96]:
# add header and copy the first 8 indexes from field description for title.akas
file1 = []
file1.append(HEADER)
for i in fldDescription[0:8]:
    file1.append(i)

In [97]:
# add header and copy the next 9 indexes from field description for title.basics
file2 = []
file2.append(HEADER)
for i in fldDescription[8:17]:
    file2.append(i)

In [98]:
# add header and copy the next 3 indexes from field description for title.crew
file3 = []
file3.append(HEADER)
for i in fldDescription[17:20]:
    file3.append(i)

In [99]:
# add header and copy the next 4 indexes from field description for title.episode
file4 = []
file4.append(HEADER)
for i in fldDescription[20:24]:
    file4.append(i)

In [100]:
# add header and copy the next 6 indexes from field description for title.principals
file5 = []
file5.append(HEADER)
for i in fldDescription[24:30]:
    file5.append(i)

In [101]:
# add header and copy the next 3 indexes from field description for title.ratings
file6 = []
file6.append(HEADER)
for i in fldDescription[30:33]:
    file6.append(i)

In [102]:
# add header and copy the remaining indexes from field description for name.basics
file7 = []
file7.append(HEADER)
for i in fldDescription[33:]:
    file7.append(i)
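
The seven cells above repeat one pattern with hard-coded slice bounds. A minimal sketch of the same split, driven by a list of field counts, follows. The counts (8, 9, 3, 4, 6, 3, then the remainder) are taken from the slices above; the helper `split_by_counts` and the placeholder rows are hypothetical.

```python
HEADER = ['fldName', 'fldType', 'fldDesc']

def split_by_counts(rows, counts):
    """Split rows into consecutive chunks; the last chunk takes whatever remains."""
    chunks = []
    start = 0
    for count in counts:
        chunks.append([HEADER] + rows[start:start + count])
        start += count
    chunks.append([HEADER] + rows[start:])
    return chunks

# Placeholder rows standing in for fldDescription (36 is an arbitrary length).
rows = [['field%d' % i, 'string', 'placeholder'] for i in range(36)]
files = split_by_counts(rows, [8, 9, 3, 4, 6, 3])
print([len(chunk) - 1 for chunk in files])  # prints [8, 9, 3, 4, 6, 3, 3]
```

Keeping the counts in one list means a change to the page layout only needs to be corrected in one place.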

In [103]:
import csv

def writeFile(fileName, listin):

    # format the output file name
    fileName = fileName.split('.')
    fileName = fileName[0] + '_' + fileName[1]
    
    # write the field description list to a caret-delimited (^) file
    with open('./data/' + fileName + '_flddesc.csv', "w", newline='') as f:
        writer = csv.writer(f, delimiter='^')
        writer.writerows(listin)
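
Although the output files end in `.csv`, they are delimited with `^`, so any notebook that imports them has to pass the same delimiter. The round trip below is a sketch of that; it writes to a temporary directory instead of `./data`, and the sample row is made up for illustration.

```python
import csv
import os
import tempfile

rows = [
    ['fldName', 'fldType', 'fldDesc'],
    ['titleId', 'string', 'a tconst, an alphanumeric unique identifier of the title'],
]

# A temporary directory stands in for ./data so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'title_akas_flddesc.csv')

    with open(path, 'w', newline='', encoding='utf-8') as f:
        csv.writer(f, delimiter='^').writerows(rows)

    # Reading back requires the same delimiter that was used for writing.
    with open(path, newline='', encoding='utf-8') as f:
        loaded = list(csv.reader(f, delimiter='^'))

print(loaded == rows)
```

Note that `writeFile` expects the `./data` directory to exist already; calling `os.makedirs('./data', exist_ok=True)` beforehand creates it if needed.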

In [104]:
# write field descriptions for title.akas to csv file
writeFile(dataset[0], file1)

In [105]:
# write field descriptions for title.basics to csv file
writeFile(dataset[1], file2)

In [106]:
# write field descriptions for title.crew to csv file
writeFile(dataset[2], file3)

In [107]:
# write field descriptions for title.episode to csv file
writeFile(dataset[3], file4)

In [108]:
# write field descriptions for title.principals to csv file
writeFile(dataset[4], file5)

In [109]:
# write field descriptions for title.ratings to csv file
writeFile(dataset[5], file6)

In [110]:
# write field descriptions for name.basics to csv file
writeFile(dataset[6], file7)
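
The seven write cells can also be expressed as a single loop over paired names and row lists. The sketch below only prints the file name each call would produce, using the same naming rule as `writeFile`. The dataset names shown are an assumption about what the scraped `dataset` list contains, and `output_name` is a hypothetical helper.

```python
def output_name(dataset_name):
    """Build the file name the way writeFile does: first two dot-separated parts."""
    parts = dataset_name.split('.')
    return parts[0] + '_' + parts[1] + '_flddesc.csv'

# Hypothetical values standing in for the scraped `dataset` list and file1..file7.
dataset = ['title.akas.tsv.gz', 'name.basics.tsv.gz']
files = [[['fldName', 'fldType', 'fldDesc']], [['fldName', 'fldType', 'fldDesc']]]

for name, rows in zip(dataset, files):
    # In the notebook this line would be: writeFile(name, rows)
    print(output_name(name), len(rows))
```

Because `zip` stops at the shorter input, a mismatch between the number of names and the number of row lists would go unnoticed, so comparing the two lengths first is a sensible precaution.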