# Scrape the IMDb website

![title](../docs/img/img3.png)

A collection of functions for scraping the IMDb website. Every function in this notebook is also available in `imdb_scraper.py` in the same directory.

In [15]:
# import the required packages
from bs4 import BeautifulSoup
import requests
import json

### Process HTML from URL
A simple function that downloads the page at a given URL and parses it with Beautiful Soup.

In [4]:
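def getHTML(url):
    # Fetch the page; give up after 10 seconds and raise on HTTP errors (4xx/5xx)
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return BeautifulSoup(response.content, 'html.parser')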
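
Because `getHTML` needs network access, a quick way to see what kind of object it returns is to hand Beautiful Soup some markup directly. The snippet below is a minimal sketch: the HTML is invented for the example (including the placeholder ID `tt0000000`) and is only assumed to resemble the tags the later functions look for.

```python
from bs4 import BeautifulSoup

# Invented markup standing in for a downloaded page (bytes, like response.content)
sample = b"""
<html><head>
<title>Example Film (2001)</title>
<meta property="pageId" content="tt0000000">
</head><body><h1>Example Film</h1></body></html>
"""

soup = BeautifulSoup(sample, 'html.parser')

# Tags can be reached by name or searched for by attribute
page_title = soup.title.text
page_id = soup.find(attrs={'property': 'pageId'})['content']

print(page_title)  # Example Film (2001)
print(page_id)     # tt0000000
```

The same `find(attrs=...)` lookup is what the later cells rely on to pull individual fields out of the parsed page.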

### Parse Persons
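Function used to extract the names of the cast, writers, and directors from the scraped page's metadata. The input can be either a single dictionary or a list of dictionaries; in a list, only entries whose `@type` is `"Person"` are kept.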

In [18]:
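def parsePersons(persons):
    names = []
    # A single entry arrives as a dict rather than a list
    if isinstance(persons, dict):
        names.append(persons['name'])
        return names

    # Otherwise walk the list and keep only entries typed "Person"
    for person in persons:
        if person['@type'] == "Person":
            names.append(person['name'])
    return names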
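
To illustrate the two input shapes, here is a small self-contained sketch. It is not the notebook's function: `parse_person_names` is a hypothetical variant that normalises a single dict to a list and uses `.get`, so entries missing `@type` or `name` are skipped instead of raising `KeyError`. Unlike `parsePersons`, it also applies the `"Person"` check to the single-dict case. The sample records are invented.

```python
def parse_person_names(persons):
    """Return the names of all entries typed as "Person".

    Hypothetical variant of parsePersons: accepts a single dict or a list
    of dicts, and skips entries that lack '@type' or 'name'.
    """
    if isinstance(persons, dict):
        persons = [persons]  # normalise the single-entry case to a list
    return [
        p['name']
        for p in persons
        if p.get('@type') == 'Person' and 'name' in p
    ]


# Invented sample data shaped like the page's metadata entries
single = {'@type': 'Person', 'name': 'Jane Doe'}
mixed = [
    {'@type': 'Person', 'name': 'Jane Doe'},
    {'@type': 'Organization', 'url': '/company/co0000000/'},
    {'@type': 'Person', 'name': 'John Roe'},
]

print(parse_person_names(single))  # ['Jane Doe']
print(parse_person_names(mixed))   # ['Jane Doe', 'John Roe']
```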

### Generate JSON File
This function takes the parsed HTML returned by `getHTML`, collects the relevant fields in a Python dictionary, and returns that dictionary serialized as a JSON string.

In [3]:
def getJSON(html):
    data = {}
    # Title ID and the page URL built from it
    data['id'] = html.find(attrs={'property': 'pageId'})['content']
    data['url'] = 'https://www.imdb.com/title/' + data['id']
    # Structured metadata embedded in the page as JSON-LD;
    # .string returns the raw contents of the script tag
    html_json = html.find(attrs={'type': 'application/ld+json'}).string
    fetchedJson = json.loads(html_json)
    # Fields scraped from the visible markup
    data['poster'] = html.find(attrs={'class': 'poster'}).find('img')['src']
    title_wrapper = html.find(attrs={'class': 'title_wrapper'}).text.strip()
    # Keep the text up to and including the first ')'
    data['title'] = title_wrapper[:title_wrapper.find(')') + 1]
    data['rating'] = html.find(itemprop='ratingValue').text
    data['bestRating'] = html.find(itemprop='bestRating').text
    data['votes'] = html.find(itemprop='ratingCount').text
    # Fields taken from the JSON-LD block
    data['rated'] = fetchedJson['contentRating']
    data['genres'] = fetchedJson['genre']
    data['description'] = fetchedJson['description']
    data['cast'] = parsePersons(fetchedJson['actor'])
    data['writers'] = parsePersons(fetchedJson['creator'])
    data['directors'] = parsePersons(fetchedJson['director'])
    # Serialize the dictionary to a JSON string
    return json.dumps(data)
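
Much of the data above comes from the `application/ld+json` script tag. The sketch below shows that one step in isolation, without touching the network. The HTML is an invented sample and only assumed to resemble the real page, which is larger and may be structured differently.

```python
import json
from bs4 import BeautifulSoup

# Invented sample page containing a small JSON-LD block
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "Movie", "name": "Example Film", "genre": ["Drama", "Crime"],
 "director": {"@type": "Person", "name": "Jane Doe"}}
</script>
</head><body></body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Locate the JSON-LD script tag by its type attribute
script = soup.find('script', attrs={'type': 'application/ld+json'})

# .string is the tag's single child string, here the raw JSON text
metadata = json.loads(script.string)

print(metadata['genre'])             # ['Drama', 'Crime']
print(metadata['director']['name'])  # Jane Doe
```

Note that `director` is a single dictionary in this sample, which is the case the `isinstance(persons, dict)` branch of `parsePersons` handles.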

### Handle URL
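Function used to build the page URL from the user's input and to handle any exceptions raised along the way. If the input starts with `tt`, it is treated as an IMDb title ID. Otherwise it is used as a Google search query, and the first result that points to an IMDb title page is scraped. If anything fails, the exception is printed and an error message is returned.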

In [6]:
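from urllib.parse import quote_plus

def getURL(query):
    try:
        if query.startswith('tt'):
            # Input looks like an IMDb title ID
            html = getHTML('https://www.imdb.com/title/' + query + '/')
        else:
            # Otherwise search Google and follow the first IMDb title link;
            # quote_plus escapes spaces and special characters in the query
            html = getHTML('https://www.google.co.in/search?q=' + quote_plus(query))
            for cite in html.find_all('cite'):
                if 'imdb.com/title/tt' in cite.text:
                    html = getHTML(cite.text)
                    break
        return getJSON(html)
    except Exception as e:
        print(e)
        return 'Invalid input or Network Error!'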
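
The URL-building part of this logic can be tried without any network calls. The following is a rough sketch rather than the notebook's code: `build_request_url` is a hypothetical helper, and it uses a stricter test than the function above (`tt` followed only by digits) so that a search term that merely begins with "tt" is still sent to the search engine. The ID `tt0000000` is a placeholder.

```python
import re
from urllib.parse import quote_plus

IMDB_TITLE_URL = 'https://www.imdb.com/title/'
SEARCH_URL = 'https://www.google.co.in/search?q='


def build_request_url(query):
    """Hypothetical helper: choose the URL to fetch for a given input."""
    query = query.strip()
    # 'tt' followed only by digits is treated as a title ID
    if re.fullmatch(r'tt\d+', query):
        return IMDB_TITLE_URL + query + '/'
    # Anything else becomes a URL-encoded search query
    return SEARCH_URL + quote_plus(query)


print(build_request_url('tt0000000'))
# https://www.imdb.com/title/tt0000000/
print(build_request_url('the godfather imdb'))
# https://www.google.co.in/search?q=the+godfather+imdb
```

Keeping URL construction separate from fetching also makes it easier to check this step without hitting either site.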

## Conclusion
This notebook walks through the functions defined in `imdb_scraper.py`. That Python file is used in the next notebook for further processing.