## 1. Loading Required Packages

`BeautifulSoup` is one of Python's main web-scraping libraries. `numpy` and `pandas` are standard data manipulation packages (they provide functionality you may know from `R`, e.g. data frames). `requests` lets you fetch web pages over HTTP. `re` handles regular expressions. `sklearn` (scikit-learn) is Python's main general-purpose machine learning library (it is widely regarded as the gold standard). `nltk` is for natural language processing, in our case for lemmatizing the words we will be scraping.

In [1]:
# Beautiful Soup
from bs4 import BeautifulSoup

# Utils
import numpy as np
import pandas as pd
import requests, re, time
from time import sleep

# Sklearn
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# NLTK
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\dsb48\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\dsb48\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\dsb48\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

## 2. Scraping Plots from Rotten Tomatoes Top 100 Movies for Each Genre

The webpage for each movie has a storyline section that we can scrape once we have identified the HTML element that contains it. Rotten Tomatoes movie webpages have URLs of the form https://www.rottentomatoes.com/m/[code]/, where [code] is specific to each movie, so we first need to collect those codes. For simplicity, we'll just create a list of the genres we want to scrape from and their associated URLs; the commented-out code below can be used instead to scrape the top movies from all genres. To find the part of each page we are interested in, we inspect it with Chrome's developer tools. In this case, we want the anchor tags (`a`) of class `unstyled articleLink`, and we will use the `.find_all(tag, class_)` method of our `BeautifulSoup` object to extract the relevant HTML.
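
To make the `.find_all(tag, class_)` call concrete, here is a minimal sketch run against a small hand-written HTML fragment rather than the live site. The fragment and its links are invented for illustration; only the tag name and class string come from the text above.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for a genre page
html = """
<table>
  <tr><td><a class="unstyled articleLink" href="/m/skyfall">Skyfall</a></td></tr>
  <tr><td><a class="unstyled articleLink" href="/m/looper">Looper</a></td></tr>
  <tr><td><a class="other" href="/about">About</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Passing the full class string matches anchors whose class attribute is exactly that string
links = soup.find_all("a", class_="unstyled articleLink")

# Each result is a Tag; attributes are read like dictionary keys
hrefs = [link["href"] for link in links]
print(hrefs)
```

Reading `link["href"]` directly is one alternative to running a regex over `str(link)`; the regex used in the next cell additionally restricts matches to paths that start with `/m`.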

#### 2.a Get Links

In [2]:
# Getting Links to different genre lists
#url = "https://www.rottentomatoes.com/top/bestofrt/"
#resp = requests.get(url)
#soup = BeautifulSoup(resp.text)
#soup = BeautifulSoup(str(soup.find_all("ul", class_="dropdown-menu")[0]))

#genre_links = soup.find_all('a')
#pattern = "href=\"(.*)\">"
#genres = [re.search(pattern, str(el)).group(1) for el in genre_links if re.search(pattern, str(el)) is not None]

# defining the list of genres we are interested in scraping
genres = ["/top/bestofrt/top_100_action__adventure_movies/",            # action/adventure
          "/top/bestofrt/top_100_comedy_movies/",                       # comedy
          "/top/bestofrt/top_100_drama_movies/",                        # drama
          "/top/bestofrt/top_100_science_fiction__fantasy_movies/",     # sci-fi/fantasy
          "/top/bestofrt/top_100_sports__fitness_movies/",              # sports
          "/top/bestofrt/top_100_horror_movies/"]                       # horror

# initialize list of movie url extensions
movie_pages = []

# getting links to movie pages
for genre in genres:
    # not strictly necessary: a random 1-2 second pause between requests so we are less likely to be flagged as a bot
    sleep(np.random.randint(low=1, high=3))
    url = "https://www.rottentomatoes.com" + genre
    resp = requests.get(url)                        # 1. connecting to the site
    soup = BeautifulSoup(resp.text, "html.parser")  # 2. creating a BeautifulSoup object from the html text
    pattern = "href=\"(/m.*)\">"                    # 3. defining the regex pattern that allows us to extract links
    
    # 4. extract and clean relevant html into a list using python's list comprehension
    links = soup.find_all("a", class_="unstyled articleLink")
    movie_links = [re.search(pattern, str(link)).group(1) for link in links if re.search(pattern, str(link)) is not None]
    movie_pages.extend(movie_links)                 # 5. adding the cleaned links to the list we initialized
    
# remove duplicates
movie_pages = list(set(movie_pages))

We are able to go from this:

In [3]:
BeautifulSoup(resp.text)

<!DOCTYPE html>
<html lang="en" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/">
<head prefix="og: http://ogp.me/ns# flixstertomatoes: http://ogp.me/ns/apps/flixstertomatoes#">
<script src="//cdn.optimizely.com/js/594670329.js"></script>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<meta content="VPPXtECgUUeuATBacnqnCm4ydGO99reF-xgNklSbNbc" name="google-site-verification"/>
<meta content="034F16304017CA7DCF45D43850915323" name="msvalidate.01"/>
<link href="https://staticv2-4.rottentomatoes.com/static/images/iphone/apple-touch-icon.png" rel="apple-touch-icon"/>
<link href="https://staticv2-4.rottentomatoes.com/static/images/icons/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<link href="https://staticv2-4.rottentomatoes.com/static/styles/css/rt_main.css" rel="stylesheet"/>
<script id="jsonLdSchema" type="application/ld+json">{"@context":"http

To this:

In [4]:
movie_pages

['/m/murderball',
 '/m/anatomy-of-a-murder',
 '/m/skyfall',
 '/m/dawn_of_the_planet_of_the_apes',
 '/m/gentlemen_prefer_blondes',
 '/m/looper',
 '/m/city_lights',
 '/m/step_into_liquid',
 '/m/save_the_green_planet_2004',
 '/m/all_is_lost_2013',
 '/m/french_connection',
 '/m/2001_a_space_odyssey',
 '/m/1197992-perfect_game',
 '/m/up',
 '/m/1021749-touch_of_evil',
 '/m/et_the_extraterrestrial',
 '/m/wages_of_fear',
 '/m/roman_holiday',
 '/m/no_country_for_old_men',
 '/m/the_rules_of_the_game',
 '/m/red_army_2015',
 '/m/blair_witch_project',
 '/m/the_hurt_locker',
 '/m/the_lego_movie',
 '/m/the_african_queen_1951',
 '/m/1046129-fugitive',
 '/m/ghostbusters',
 '/m/tangerine_2015',
 '/m/freaks',
 '/m/up_for_grabs',
 '/m/damned_united',
 '/m/8-12',
 '/m/marvels_the_avengers',
 '/m/geralds_game',
 '/m/pinocchio_1940',
 '/m/the_wrestler',
 '/m/balthazar',
 '/m/crouching_tiger_hidden_dragon',
 '/m/ginger_snaps',
 '/m/hell_or_high_water',
 '/m/split_2017',
 '/m/it_comes_at_night',
 '/m/moana_201
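
Note that `list(set(...))` does not preserve the order in which the links were scraped, so the list above is not in ranking order. If order matters, one possible alternative (not what the cell above does) is `dict.fromkeys`, sketched here on a made-up list:

```python
# Hypothetical scraped links, with two movies appearing in more than one genre list
scraped = ["/m/skyfall", "/m/looper", "/m/up", "/m/looper", "/m/skyfall"]

# dict keys are unique and keep insertion order (Python 3.7+)
unique_in_order = list(dict.fromkeys(scraped))
print(unique_in_order)  # ['/m/skyfall', '/m/looper', '/m/up']
```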

#### 2.b Scraping Plots

We repeat the approach we took above. From inspecting the movie webpages, we know that the synopsis we are interested in sits in an element with the attribute `id="movieSynopsis"`, which we can pass to `.find_all()`.

In [5]:
plots = []

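for movie_page in movie_pages:
    url = "https://www.rottentomatoes.com" + movie_page
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, "html.parser")
    # the synopsis text sits on its own line inside the element with id="movieSynopsis"
    synopsis = soup.find(id="movieSynopsis")
    match = re.search("\\n(.*)\\n", str(synopsis))
    if match is not None:                       # skip pages without a synopsis
        plots.append(match.group(1).strip())
plots[0:4]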

["Like any other great sports story, 'Murderball' features fierce rivalry, stopwatch suspense, dazzling athletic prowess, larger-than-life personalities and triumph over daunting odds. But murderball, the original name for the full-contact sport now known as quad rugby, is played by quadriplegics in armored wheelchairs. 'Murderball' is a story like no other, told by men who see the world from a different angle. Quad rugby players have suffered injuries that have left them with limited function in all four limbs. Whether by car wreck, gunshot, fist fight, rogue bacteria or any of an endless list of possible misadventures, quad rugby's young men have found their lives dramatically altered. Watching them in action -- both on court and off -- smashes every stereotype one has ever had about the handicapped. It also redefines what it is to be a man, what it is to live a full life, and what it is to be a winner.",
 'Based on the best-selling novel by Robert Traver (the pseudonym for Michigan 
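
As an illustration of looking an element up by `id`, the sketch below uses an invented HTML fragment (the surrounding markup is an assumption, not copied from Rotten Tomatoes). It uses `.find(id=...)` with `.get_text(strip=True)`, which is an alternative to the regex used above, and the hypothetical helper `extract_synopsis` returns `None` rather than raising when the element is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment; only the id value comes from the tutorial
html = """
<div class="info">
  <div id="movieSynopsis">
    A hypothetical synopsis used only for this example.
  </div>
</div>
"""

def extract_synopsis(soup):
    """Return the stripped synopsis text, or None if the page has no such element."""
    node = soup.find(id="movieSynopsis")
    return node.get_text(strip=True) if node is not None else None

synopsis = extract_synopsis(BeautifulSoup(html, "html.parser"))
missing = extract_synopsis(BeautifulSoup("<p>no synopsis here</p>", "html.parser"))
print(synopsis)
print(missing)
```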

## 3. Latent Dirichlet Allocation

First we create functions for cleaning and 'lemmatizing' our plots.

In [16]:
import string

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

def lemmatize_plots(plots):
    """ cleans scraped list of documents to be ready for LDA """
    # remove all punctuation (lower-casing happens later, inside CountVectorizer)
    no_punct = [plot.translate(str.maketrans('', '', string.punctuation)) for plot in plots]
    # keep only words with these Penn Treebank POS tags (adjectives, nouns, adverbs, verbs)
    pos_list = ["JJ", "JJR", "JJS", "NN", "NNS", "NNPS", "RB", "RBR", "RBS", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"]
    clean = []
    for plot in no_punct:
        clean.append(' '.join([w for w in plot.split() if nltk.pos_tag([w])[0][1] in pos_list]))
    
    lemmatizer = WordNetLemmatizer()
    
    cleaner = []
    # lemmatize documents
    for plot in clean:
        cleaner.append(' '.join([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(plot)]))
    return cleaner

plots_lemmatized = lemmatize_plots(plots)
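
The punctuation-removal step relies on the three-argument form of `str.maketrans`, where the third argument lists characters to delete. A small standalone illustration, using a sentence adapted from the first synopsis:

```python
import string

# Third argument of maketrans: every character in it is mapped to None (deleted)
table = str.maketrans('', '', string.punctuation)

sentence = "Quad rugby's young men -- on court and off -- smash every stereotype."
cleaned = sentence.translate(table)
print(cleaned.split())
```

One side effect worth knowing about: apostrophes are deleted too, so "rugby's" becomes "rugbys" before it ever reaches the lemmatizer.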

Then we initialize a `CountVectorizer` object, which creates a matrix of word counts that we will pass to the `LatentDirichletAllocation`.

In [17]:
vectorizer = CountVectorizer(analyzer='word',       
                             min_df=0.01,                   # eliminate words occurring in less than this proportion of docs
                             max_df=0.7,                    # eliminate words occurring in more than this proportion of docs
                             stop_words='english',          # remove stop words
                             token_pattern='[a-zA-Z]{3,}',  # letters only, at least 3 characters
                             #max_features=50000            # max number of unique words
                            )

vectorized_plots = vectorizer.fit_transform(plots_lemmatized)

Finally, we specify our LDA model, fit it, and report our results.

In [24]:
lda_model = LatentDirichletAllocation(n_components = 15, # Number of topics
                                      max_iter=50,                # Max learning iterations
                                      learning_method='online',   
                                      random_state=100,          # Random state
                                      batch_size=128,            # n docs in each learning iter
                                      evaluate_every = -1,       # compute perplexity every n iters, default: Don't
                                      n_jobs = -1                # Use all available CPUs
                                     )
lda_output = lda_model.fit_transform(vectorized_plots)

# Show top n keywords for each topic
def show_topics(vectorizer=vectorizer, lda_model=lda_model, n_words=20):
    keywords = np.array(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
    topic_keywords = []
    for topic_weights in lda_model.components_:
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    return topic_keywords

topic_keywords = show_topics(vectorizer=vectorizer, lda_model=lda_model, n_words=15)        

# Topic - Keywords Dataframe
df_topic_keywords = pd.DataFrame(topic_keywords)
df_topic_keywords.columns = ['Word '+str(i) for i in range(df_topic_keywords.shape[1])]
df_topic_keywords.index = ['Topic '+str(i) for i in range(df_topic_keywords.shape[0])]
df_topic_keywords

| | Word 0 | Word 1 | Word 2 | Word 3 | Word 4 | Word 5 | Word 6 | Word 7 | Word 8 | Word 9 | Word 10 | Word 11 | Word 12 | Word 13 | Word 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Topic 0 | voice | murder | dead | capture | film | woman | force | race | classic | plan | stun | hero | set | novel | attack |
| Topic 1 | camp | driver | film | escape | french | captain | nicholson | british | american | group | set | new | half | southern | japanese |
| Topic 2 | child | family | award | jane | turn | academy | horror | black | home | terrify | film | begin | evil | town | nominee |
| Topic 3 | eve | life | love | work | world | star | story | rescue | man | way | rovi | meet | home | planet | people |
| Topic 4 | quest | voice | guide | encounter | teenager | band | film | feature | island | haunt | sexual | vision | sense | strange | comedy |
| Topic 5 | social | explores | century | relationship | struggle | carrie | way | drama | film | special | classic | focus | story | set | rovi |
| Topic 6 | young | family | time | kill | make | home | mission | order | begin | come | force | men | life | alien | future |
| Topic 7 | film | best | director | movie | rovi | make | picture | hal | erickson | academy | play | story | version | star | won |
| Topic 8 | film | friend | best | movie | focus | son | father | way | ordinary | girl | social | wood | shoot | year | start |
| Topic 9 | chance | bob | hotel | adventure | fox | survival | family | war | world | lady | new | trust | fight | famous | change |
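
Besides the topic-keyword view, the matrix returned by `fit_transform` (called `lda_output` above) has one row per document and one column per topic. The sketch below, on an invented count matrix with hypothetical variable names, shows how each document's dominant topic could be read off with `argmax`.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Invented word-count matrix: 4 documents x 5 vocabulary terms
counts = np.array([
    [3, 2, 0, 0, 0],
    [2, 3, 1, 0, 0],
    [0, 0, 1, 3, 2],
    [0, 0, 0, 2, 3],
])

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)      # shape: (n_documents, n_topics)

# Each row is a probability distribution over topics
row_sums = doc_topics.sum(axis=1)

# Index of the highest-weight topic for each document
dominant_topic = doc_topics.argmax(axis=1)

print(doc_topics.round(2))
print(dominant_topic)
```

Topic indices are arbitrary labels, so which number a given theme receives can change with `random_state` or `n_components`.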


## 4. Scraping Plots from IMDb's '1,000 Greatest Films of All Time'

The webpage for each movie has a storyline section that we can scrape once we have identified the right CSS selector. IMDb movie webpages look like https://www.imdb.com/title/[code]/, where [code] follows the pattern "tt1234567". First we will scrape these codes from the "Top 1,000" list, then we will scrape the individual storylines.

In [None]:
def movie_link_getter(tags):
    """Return a list of title codes, one for each movie's webpage."""
    link_pattern = re.compile('tt[0-9]{7}')
    links = [link_pattern.search(str(tag)) for tag in tags]
    return [link.group(0) for link in links if link is not None]  # skip tags without a code

# get the url codes for the top 1,000 movies on IMDb
links = []

# webpages are broken up into 100 movie chunks, so we will need to loop through them
for start in range(1,11):
    url = "https://www.imdb.com/list/ls006266261/?sort=list_order,asc&st_dt=&mode=detail&page=" + str(start)
    #url = 'https://www.imdb.com/search/title/?groups=top_1000&count=100&start=' + str(start) + '&ref_=adv_nxt'
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, "html.parser")
    tags = soup.select('.lister-item-header a')
    links.extend(movie_link_getter(tags))
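
As a quick check of the title-code pattern, the sketch below runs it over a few hand-written anchor strings (stand-ins for what `soup.select` would return). `search` returns `None` for a tag with no code, which is why such entries are filtered out before calling `.group(0)`.

```python
import re

link_pattern = re.compile(r'tt[0-9]{7}')

# Hypothetical anchor tags as strings; the last one has no title code
tags = [
    '<a href="/title/tt0068646/">The Godfather</a>',
    '<a href="/title/tt0111161/">The Shawshank Redemption</a>',
    '<a href="/chart/top/">Top rated</a>',
]

matches = [link_pattern.search(tag) for tag in tags]
codes = [m.group(0) for m in matches if m is not None]   # skip tags without a code
print(codes)
```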

In [None]:
from itertools import cycle

# download proxies so we don't get blocked
#proxies = requests.get("https://free-proxy-list.net/")
#proxies_soup = BeautifulSoup(proxies.text)
#proxies_df = pd.read_html(str(proxies_soup.find_all('table')[0]))[0]
#proxies_df.dropna(inplace=True)
#proxy_pool = cycle(proxies_df['IP Address'])

story_lines = []

for link in links[0:10]:
    url = 'https://www.imdb.com/title/' + link + "/plotsummary?ref_=tt_stry_pl"
    #while True:
    #    try:
    #        proxy = next(proxy_pool) # get a new proxy
    #        resp = requests.get(url,proxies={"http": proxy, "https": proxy})
    #    except:
    #        continue
    #    break
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, "html.parser")
    synopsis = soup.find_all(id=re.compile("^synopsis-.*"))
    if not synopsis:          # no synopsis on this page; skip it
        continue
    
    # clean up the text a little: replace line breaks, then strip the outer tag
    synopsis = re.sub("<br/>", " ", str(synopsis[0]))
    synopsis = re.search(">(.*)<", synopsis).group(1)
    
    story_lines.append(synopsis)
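
The commented-out proxy code above rotates through a pool with `itertools.cycle`. The sketch below illustrates that rotation idea without any network access: the proxy addresses are made up, and `fake_fetch` is a hypothetical stand-in for `requests.get(url, proxies=...)`. Unlike the commented-out `while True` loop, it caps the number of attempts and catches one specific exception type.

```python
from itertools import cycle

# Hypothetical proxy addresses; real ones would come from a proxy list
proxy_pool = cycle(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])

def fake_fetch(url, proxy):
    """Stand-in for a real request: pretend only the third proxy works."""
    if proxy != "10.0.0.3:8080":
        raise ConnectionError(f"proxy {proxy} failed")
    return f"fetched {url} via {proxy}"

def fetch_with_rotation(url, max_attempts=5):
    """Try successive proxies until one succeeds or attempts run out."""
    for _ in range(max_attempts):
        proxy = next(proxy_pool)          # get the next proxy in the cycle
        try:
            return fake_fetch(url, proxy)
        except ConnectionError:
            continue                      # move on to the next proxy
    raise RuntimeError(f"all {max_attempts} attempts failed for {url}")

result = fetch_with_rotation("https://www.imdb.com/title/tt0068646/plotsummary")
print(result)
```

Bounding the retries means a dead proxy list surfaces as an error instead of an endless loop.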