# Film Script Analyzer

## Data preparation

### Web scraping

All scripts were scraped from the following page: [SpringfieldSpringfield](http://www.springfieldspringfield.co.uk/).

The next step is to scrape the information from the pages and parse it into *UTF-8* encoded text.

The **'fetch'** function requests the contents of a webpage and returns the response. If the server answers with an error status, the resulting exception is caught and a message is printed. It is used throughout to scrape the site for information.

In [None]:
import operator
import requests
import string
import sys

def fetch(address):
    res = requests.get(address)
    try:
        res.raise_for_status()
    except Exception as exc:
        print('Problem: %s'%(exc))
    return res

The source URL varies in two parameters: the page letter, which is the first letter of the script name, and the page number. The first step is to collect the number of pages available for each initial letter.

In [None]:
import bs4
import json

url       = ["http://www.springfieldspringfield.co.uk/movie_scripts.php?order=","&page="]
letters   = string.ascii_uppercase
charNum   = {}

for i in list('0'+letters):
    start = '1'
    # Load the first listing page for this letter
    main = bs4.BeautifulSoup(fetch(url[0]+i+url[1]+start).text,'lxml')
    temp = main.find_all('a')
    # The last link on the page is taken to be the highest page number
    for j in temp[-1:]:
        charNum[i] = int(j.contents[0])
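As a small illustration of the URL pattern described above, the two varying parameters can be wrapped in a helper. This is only a sketch: `BASE` and `listing_url` are hypothetical names that do not appear in the notebook, and the query string simply mirrors the `url` list defined in the cell above.

```python
# Hypothetical helper; BASE and listing_url are illustrative names,
# built from the same pieces as the `url` list in the notebook.
BASE = "http://www.springfieldspringfield.co.uk/movie_scripts.php"

def listing_url(letter, page=1):
    """Return the listing URL for a given initial letter and page number."""
    return "{}?order={}&page={}".format(BASE, letter, page)

print(listing_url('A', 3))
```

Building the address in one place keeps the letter and page arguments explicit instead of relying on the order of string concatenations.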

Now that the letter/page-count pairs are available, proceed to scrape the links. To avoid overloading the server, randomly select one page number for each letter, that is, visit one randomly chosen page per letter. That should serve the purpose of a statistically valid sample.

The links are returned as relative URLs, so the site root has to be prepended. A dictionary is defined to map each movie name to the link of the page that contains its script, and it is then saved to disk.

In [None]:
import pandas as pd
import numpy  as np

prefix = 'http://www.springfieldspringfield.co.uk'
pages  = {}

for char in charNum.keys():
    for num in range(1,charNum[char]+1):
        # Parse one listing page and collect its script links
        temp = bs4.BeautifulSoup(fetch(url[0]+char+url[1]+str(num)).text,'lxml')
        aLinks = temp.find_all('a',class_='script-list-item')
        for link in aLinks:
            # Key: movie name as text (JSON keys must be strings); value: absolute URL
            pages[link.get_text()] = prefix+link.get('href')
            
with open('data/pages.json','w') as f:
    json.dump(pages,f)
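The loop above visits every page for every letter. The one-random-page-per-letter sampling described at the start of this step could be sketched as follows. This is only an illustration under assumptions: `sample_pages` and `example_counts` are hypothetical names, and the page counts shown are made up; the real values would come from `charNum`.

```python
import random

def sample_pages(page_counts, seed=None):
    """Pick one page number uniformly at random for each letter.

    page_counts maps a letter to its number of listing pages
    (the role played by charNum in the notebook).
    """
    rng = random.Random(seed)  # a seed makes the selection reproducible
    return {letter: rng.randint(1, count)
            for letter, count in sorted(page_counts.items())}

# Made-up page counts, for illustration only.
example_counts = {'0': 3, 'A': 40, 'B': 35}
picked = sample_pages(example_counts, seed=0)
print(picked)
```

The inner loop over `range(1, charNum[char]+1)` would then be replaced by a single request for the picked page of each letter, which lowers the number of requests to one per letter.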

The **'springScrap'** function is defined to scrape the script pages of this particular site. It is customized to extract the script text from the layout of the Springfield pages.

In [None]:
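def springScrap(raw):
    result = ' '
    soup = bs4.BeautifulSoup(raw.text,'lxml')
    # Drop <br> tags so only the script text remains
    for e in soup.find_all('br'):
        e.extract()
    # Concatenate the text of every script container on the page
    for text in soup.find_all('div',class_='scrolling-script-container'):
        # get_text() already returns text; it is encoded when written to disk
        result += text.get_text()
    return result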
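To check the extraction logic offline, the same idea can be tried on an inline HTML snippet. The sketch below deliberately swaps BeautifulSoup for the standard library's `html.parser`, so that it runs without third-party packages; it is not the notebook's method. `ScriptTextParser`, `extract_script` and the sample markup are hypothetical, and only the class name `scrolling-script-container` is taken from the function above.

```python
from html.parser import HTMLParser

class ScriptTextParser(HTMLParser):
    """Collect the text found inside <div class="scrolling-script-container">."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside the target div
        self.chunks = []  # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return  # <br> and other tags contribute no text
        classes = (dict(attrs).get('class') or '').split()
        # Enter the target div, or track a div nested inside it
        if self.depth > 0 or 'scrolling-script-container' in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)

def extract_script(html):
    """Return the concatenated text of the script container(s) in html."""
    parser = ScriptTextParser()
    parser.feed(html)
    parser.close()
    return ''.join(parser.chunks)

# Made-up markup that imitates the structure springScrap expects.
sample = ('<html><body><div class="nav">Menu</div>'
          '<div class="scrolling-script-container">'
          'Hello there.<br>General Kenobi.</div></body></html>')
print(extract_script(sample))
```

As in `springScrap`, the `<br>` tags are dropped without inserting a separator, so consecutive lines of dialogue end up joined directly.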

For each name in the pages dictionary, request the script page and save the extracted text to a file named after the movie.

In [None]:
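for name, link in pages.items():
    temp = springScrap(fetch(link))
    # Keep only letters, digits and spaces so the name is a valid filename
    fname = ''.join(e for e in name if e.isalnum() or e == ' ')
    # Write UTF-8 bytes whether springScrap returned text or encoded bytes
    if not isinstance(temp, bytes):
        temp = temp.encode('utf8')
    with open('scripts/'+fname+'.txt','wb') as f:
        f.write(temp)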
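The filename cleaning used in the loop above can be isolated and tried on a few names. This is a minimal sketch: `safe_filename` is a hypothetical helper, not part of the notebook, and the titles are only examples.

```python
def safe_filename(name):
    """Keep only letters, digits and spaces, as the saving loop does."""
    return ''.join(e for e in name if e.isalnum() or e == ' ')

print(safe_filename('Alien: Resurrection (1997)'))

# Two different titles can collapse to the same filename,
# in which case the later script overwrites the earlier one.
print(safe_filename('M*A*S*H') == safe_filename('MASH'))
```

If such collisions matter, one option would be to check whether the target file already exists before writing and to append a counter to the name.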

A total of **17,271** scripts were stored on disk.

Next section: [IMDB](https://github.com/luisecastro/film_script_analysis/blob/master/01_imdb.ipynb)