# Lab Assignment 8



## Problem 1: Analyzing a movie script

The [Internet Movie Script Database](http://www.imsdb.com/) contains the text of screenplays for thousands of movies, including many current releases. Let's analyze a script. If you want to learn more about how screenplays are formatted, [here you go](https://www.finaldraft.com/mm_media/mm_pdf/How_to_Format_a_Screenplay.pdf).

In [1]:
import requests
from bs4 import BeautifulSoup, NavigableString, Tag
import nltk
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
screenplay_raw = requests.get('http://www.imsdb.com/scripts/Star-Wars-The-Force-Awakens.html').text

Turn the `screenplay_raw` into soup.

In [4]:
screenplay_soup = BeautifulSoup(screenplay_raw,'lxml')

In Google Chrome, use the developer tools (for example, the Elements panel) to identify the part of the page's HTML tree that contains the body of the script. Then use BeautifulSoup's [`.find_all()`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all) method to extract it. **(Hint: it's beneath the `<td class="scrtext">` tag.)**

In [5]:
screenplay_body = screenplay_soup.find_all('pre')
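
As a rough illustration of the hint, the sketch below narrows the search to the `scrtext` cell before looking for `<pre>` tags. The `sample_html` string is a made-up stand-in for the page, all `sample_*` names are hypothetical, and the built-in `'html.parser'` is used only so the sketch runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page layout, not the real IMSDb markup
sample_html = (
    '<table><tr>'
    '<td class="nav"><pre>site menu</pre></td>'
    '<td class="scrtext"><pre><b>   REY\n</b>   Classified.\n</pre></td>'
    '</tr></table>'
)
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Find the cell named in the hint, then search only beneath it
sample_cell = sample_soup.find('td', class_='scrtext')
sample_body = sample_cell.find_all('pre')

print(len(sample_soup.find_all('pre')))  # every <pre> in the sample
print(len(sample_body))                  # only those inside td.scrtext
```

Searching the whole soup works when the page has a single `<pre>` tag; scoping the search to the cell is one way to stay safe if a page has more than one.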

Check that only one (huge) tag matches your search by calling `len` on `screenplay_body`.

In [6]:
len(screenplay_body)

1

In [10]:
screenplay_body[0]

<pre>

 
<b>                               STAR WARS: THE FORCE AWAKENS
</b>
                         

                         

                                       Written by

                         
                      Lawrence Kasdan, J.J. Abrams &amp; Michael Arndt

                      

                         
                       Based on characters created by George Lucas
                         

                
         

                         
          A long time ago in a galaxy far, far away...

                         

                         

<b>                                        STAR WARS
</b> 
<b>                                       EPISODE VII
</b>
<b>                                    THE FORCE AWAKENS
</b>


           Luke Skywalker has vanished. In his absence,
           the sinister FIRST ORDER has risen from the
           ashes of the Empire and will not rest until
           Skywalker, the last Jedi, has been destroyed.
       

Since `screenplay_body` is a list with (hopefully) one giant tag, access the `.children` attribute of the first element of the list. This returns an iterator, so wrap the expression in `list()` to make it spill its guts, and save the result as `screenplay_children`.

In [9]:
screenplay_children = list(screenplay_body[0].children)
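
To see what `.children` hands back, here is a minimal sketch on a made-up fragment shaped like the screenplay's `<pre>` block. The fragment and the `sample_*` names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the screenplay's <pre> block
sample_pre = BeautifulSoup(
    '<pre><b>      POE\n</b>      So who talks first?\n</pre>',
    'html.parser',
).pre

# .children walks direct children only; list() consumes the iterator
sample_children = list(sample_pre.children)
for child in sample_children:
    print(type(child).__name__, repr(child))
```

The text inside the `<b>` tag is not listed separately because it is a child of `<b>`, not of `<pre>`.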

How many child nodes are in `screenplay_children`?

In [11]:
len(screenplay_children)

2976

What's an example of a child node in the screenplay? Slice the `screenplay_children` to return 10 examples of what it contains.

['\r\n', <b>                          KYLO REN
 </b>, "           You're so right.\r\n          And as he RIPS IT DOWN ACROSS SAN TEKKA!\r\n          Poe, RUNNING, SEES THIS AND YELLS, AIMS HIS BLASTER AND FIRES\r\n          AT KYLO REN! Instantly:\r\n          Kylo Ren RAISES HIS HAND -- POE'S BLAST FREEZES -- THE BOLT\r\n\r\n", <b>          OF ENERGY STRAINING AND VIBRATING IN MID AIR!
 </b>, "          Kylo Ren sees Poe, who suddenly CANNOT MOVE, but strains to.\r\n          He is grabbed by Stormtroopers who drag him past the\r\n          VIBRATING, FROZEN BLAST, to Kylo Ren.\r\n          A Stormtrooper begins a brutal PAT DOWN. Kylo Ren moves\r\n          closer. Poe just glares. The Stormtrooper KICKS OUT Poe's\r\n          legs -- he lands hard on his knees.\r\n          Kylo Ren kneels to look at Poe.\r\n\r\n", <b>                          POE
 </b>, '           So who talks first? You talk first?\r\n\r\n', <b>                          KYLO REN
 </b>, '           The old man ga

Many scripts (though not necessarily all) appear to use `<b>` tags to mark when a character speaks. Call the `.find_all()` method on the first element of `screenplay_body` to get all of these headings and save them as `screenplay_headings`.

In [64]:
screenplay_headings = screenplay_body[0].find_all('b')
len(screenplay_headings)

1488

What are some examples of screenplay headings?

In [65]:
screenplay_headings[90:100]

[<b>                          (CONTINUED)
 </b>, <b>                         CONTINUED:
 </b>, <b>          EXT. NIIMA OUTPOST - CLEANING TABLE - DAY
 </b>, <b>          INT. NIIMA TRADING STRUCTURE - DAY
 </b>, <b>                          UNKAR
 </b>, <b>          EXT. NIIMA OUTPOST - DAY
 </b>, <b>          INT. REY'S DWELLING - DAY
 </b>, <b>          EXT. REY'S DWELLING - DUSK
 </b>, <b>                          (CONTINUED)
 </b>, <b>                          REY
 </b>]

It looks like there's a lot of whitespace in these headings. Loop through the headings, clean each one up with Python string methods like [`.strip()`](https://docs.python.org/3.5/library/stdtypes.html#string-methods), and store the results as `cleaned_headings`. **IMPORTANT**: The elements in this list are soup `Tag` objects, not strings. Use the `.text` attribute of each object to get a string before applying the `.strip()` method.
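
A minimal sketch of that loop, run on two made-up headings padded the way the real ones are. The `sample_*` names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical headings padded the same way as the real ones
sample_soup = BeautifulSoup(
    "<pre><b>                          REY\n</b>"
    "<b>          INT. REY'S DWELLING - DAY\n</b></pre>",
    'html.parser',
)
sample_headings = sample_soup.find_all('b')

# .text turns each Tag into a plain string; .strip() trims both ends
sample_cleaned = []
for heading in sample_headings:
    sample_cleaned.append(heading.text.strip())

print(sample_cleaned)
```

Note that `.strip()` only removes leading and trailing whitespace, so spaces inside a heading are left alone.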

What do these headings look like now?

In [69]:
cleaned_headings[90:100]

['(CONTINUED)',
 'CONTINUED:',
 'EXT. NIIMA OUTPOST - CLEANING TABLE - DAY',
 'INT. NIIMA TRADING STRUCTURE - DAY',
 'UNKAR',
 'EXT. NIIMA OUTPOST - DAY',
 "INT. REY'S DWELLING - DAY",
 "EXT. REY'S DWELLING - DUSK",
 '(CONTINUED)',
 'REY']

Make a dictionary called `heading_counter` that is keyed by the heading and has values for the number of times the heading occurs. Count how often these different headings occur.
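
One possible way to build such a counter, shown on a short made-up list. The `sample_*` names are hypothetical, and `collections.Counter` appears only as a cross-check; the plain dictionary is what the exercise asks for.

```python
from collections import Counter

# Hypothetical cleaned headings
sample_headings = ['REY', 'FINN', '(CONTINUED)', 'REY', 'FINN', 'REY']

# Plain-dictionary version: treat a missing heading as 0, then add 1
sample_counter = {}
for heading in sample_headings:
    sample_counter[heading] = sample_counter.get(heading, 0) + 1

print(sample_counter)

# collections.Counter does the same bookkeeping in one call
print(sample_counter == dict(Counter(sample_headings)))
```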

Use my `sorting_dict` function to give you the top 20 headings.

In [81]:
from operator import itemgetter

def sorting_dict(d):
    _sorted = sorted(d.items(),key=itemgetter(1),reverse=True)
    return _sorted[:20]

sorting_dict(heading_counter)

[('FINN', 139),
 ('REY', 136),
 ('HAN', 122),
 ('(CONTINUED)', 77),
 ('CONTINUED:', 73),
 ('POE', 61),
 ('KYLO REN', 54),
 ('LEIA', 25),
 ('GENERAL HUX', 22),
 ("HAN (CONT'D)", 21),
 ("REY (CONT'D)", 19),
 ("FINN (CONT'D)", 19),
 ('MAZ', 17),
 ('FN-2187', 16),
 ('INT. MILLENNIUM FALCON - COCKPIT - DAY', 15),
 ('INT. RESISTANCE BASE - DAY', 13),
 ('SNOKE', 12),
 ('BALA-TIK', 11),
 ('STORMTROOPER', 10),
 ('CAPTAIN PHASMA', 10)]

## Problem 2: Cleaning text of screenplay

Loop through the `screenplay_children`, check the type of each element, and add only those that are `NavigableString` type to a list `screenplay_lines`.
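
The type check can be sketched with `isinstance`, again on a made-up fragment rather than the real screenplay. The `sample_*` names are hypothetical.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical fragment: two <b> headings, each followed by a text node
sample_pre = BeautifulSoup(
    '<pre><b>      UNKAR\n</b>      What about the droid?\n'
    '<b>      REY\n</b>      What about him?\n</pre>',
    'html.parser',
).pre

# Keep the bare text nodes; skip the <b> Tag objects
sample_lines = []
for child in sample_pre.children:
    if isinstance(child, NavigableString):
        sample_lines.append(child)

print(sample_lines)
```

`NavigableString` is a subclass of `str`, so the usual string methods work on the collected items.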

There's still a lot of whitespace breaking up these lines. Clean them up using a regular expression, the string split and join methods, or some combination, and store the result as `cleaned_lines`.
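
Both approaches can be sketched on a single raw string in the style of the text nodes shown earlier. The variable names are hypothetical, and either option should give the same result for this kind of input.

```python
import re

# Raw text node in the style of the screenplay output above
sample_raw = ("           You're so right.\r\n"
              "          And as he RIPS IT DOWN ACROSS SAN TEKKA!\r\n\r\n")

# Option 1: split on any run of whitespace, then rejoin with single spaces
via_split = ' '.join(sample_raw.split())

# Option 2: collapse whitespace runs with a regular expression, then trim
via_regex = re.sub(r'\s+', ' ', sample_raw).strip()

print(via_split)
print(via_split == via_regex)
```

Calling `.split()` with no arguments treats spaces, tabs, `\r`, and `\n` alike, which is why no separate step is needed for the Windows-style line endings.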

["Don't give up. He still might show up. Whoever it is you're waiting for. Classified. I know all about waiting. BB-8 BEEPTALKS a question.",
 "For my family. They'll be back. One day. Come on. She tries to force a smile, but can't, really. She heads off. BB-8 BEEPS... then heads after her.",
 'Rey stands with BB-8 in front of Unkar Plutt, at his window. He reviews her goods. He glances quickly at BB-8.',
 'These five pieces are worth... Let me see here... One half portion.',
 'Last week they were a half portion each. She hates him. He leans forward.',
 'What about the droid?',
 'What about him?',
 "I'll pay for him. BB-8 doesn't like this at all. Rey is awkward, but curious.",
 'Sixty portions. CLOSE ON HER. Stunned. Literally hungry for this amount of food, her stomach practically rumbles. BB-8 sees her interest and BEEPS furiously, not liking this conversation at all. She looks at Unkar. Looks down at BB-8. Considers it all. Finally, she hears herself say:',
 "Actually... the droid'

## Problem 3: Normalizing screenplay text

Now that you have the lines from the movie, we want to analyze their content. The lines already have some sentence segmentation done for you, so we want to proceed farther down the Pipeline of Information Extraction (from the [NLTK book](http://www.nltk.org/book/ch07.html)):

![](http://www.nltk.org/images/pipeline1.png)

Set up some support functions from NLTK.

In [100]:
porter = nltk.PorterStemmer()
wnl = nltk.WordNetLemmatizer()

Create a `line_tokens` list containing all the lowercase tokens by looping over each line in `cleaned_lines` and applying the `nltk.word_tokenize` function.

['any',
 'we',
 'have',
 'seen',
 '--',
 'hurtles',
 'past',
 'us',
 ',',
 'of',
 'seemingly',
 'endless',
 'length',
 ',',
 'eclipsing',
 'the',
 'moon',
 '.',
 'after',
 'a',
 'long',
 'beat',
 ',',
 'four',
 'transport',
 'ships',
 'fly',
 'from',
 'a',
 'hangar',
 '.',
 'we',
 'hold',
 'on',
 'them',
 'now',
 ',',
 'as',
 'they',
 'fly',
 'off',
 'toward',
 'a',
 'distant',
 'planet',
 '.',
 'jakku',
 '.',
 'music',
 'builds']

Python's string method `.isalpha()` will tell you whether each token is purely alphabetic or contains punctuation or digits.

In [None]:
for token in line_tokens[:20]:
    print(token.rjust(10),token.isalpha())

NLTK also has a list of common English stopwords.

In [108]:
stopwords = nltk.corpus.stopwords.words('english')
stopwords

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 'her',
 'hers',
 'herself',
 'it',
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 'few',
 'more',
 'most',
 'other',
 'some',
 'such',
 'no',
 'nor',
 '

Create a `cleaned_tokens` list that removes the stopwords and the punctuation.
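
A minimal sketch of that filter, assuming a tiny hand-written stopword set in place of NLTK's full list. The `sample_*` names and the three-word set are hypothetical.

```python
# Hypothetical lowercase tokens and a tiny stand-in for NLTK's stopword list
sample_tokens = ['after', 'a', 'long', 'beat', ',', 'four', 'transport',
                 'ships', 'fly', 'from', 'a', 'hangar', '.']
sample_stopwords = {'after', 'a', 'from'}

# Keep a token only if it is purely alphabetic and not a stopword
sample_cleaned = [token for token in sample_tokens
                  if token.isalpha() and token not in sample_stopwords]

print(sample_cleaned)
```

Storing the stopwords in a `set` keeps the membership test fast on long token lists. Be aware that `.isalpha()` also rejects tokens that mix letters with other characters, such as `'bb-8'`, which may or may not be what you want.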

['written',
 'lawrence',
 'kasdan',
 'abrams',
 'michael',
 'arndt',
 'based',
 'characters',
 'created',
 'george',
 'lucas',
 'long',
 'time',
 'ago',
 'galaxy',
 'far',
 'far',
 'away',
 'luke',
 'skywalker',
 'vanished',
 'absence',
 'sinister',
 'first',
 'order',
 'risen',
 'ashes',
 'empire',
 'rest',
 'skywalker',
 'last',
 'jedi',
 'destroyed',
 'support',
 'republic',
 'general',
 'leia',
 'organa',
 'leads',
 'brave',
 'resistance',
 'desperate',
 'find',
 'brother',
 'luke',
 'gain',
 'help',
 'restoring',
 'peace',
 'justice']

Now use the `porter` PorterStemmer to stem each token.
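
A short sketch of stemming a few made-up tokens with `nltk.PorterStemmer`, which needs no corpus download. The token list and `sample_*` names are hypothetical.

```python
import nltk

sample_porter = nltk.PorterStemmer()

# Hypothetical cleaned tokens
sample_tokens = ['ships', 'moved', 'stormtroopers', 'chewie']
sample_stems = [sample_porter.stem(token) for token in sample_tokens]

# Print each token next to its stem for comparison
for token, stem in zip(sample_tokens, sample_stems):
    print(token.rjust(15), stem)
```

Stemming strips suffixes by rule, so the results are not guaranteed to be dictionary words. That is the main contrast to look for when comparing against the lemmatizer.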

Now use the `wnl` WordNet lemmatizer to lemmatize each token as a comparison.

Create a counter dictionary for each bag of tokens (stemmed and lemmatized).

Use the `sorting_dict` function from above to show the most frequent words.

In [115]:
sorting_dict(porter_freq)

[('rey', 276),
 ('finn', 239),
 ('han', 173),
 ('ren', 126),
 ('back', 119),
 ('see', 112),
 ('kylo', 107),
 ('look', 106),
 ('chewi', 105),
 ('poe', 101),
 ('turn', 97),
 ('get', 87),
 ('move', 85),
 ('ship', 77),
 ('one', 67),
 ('fire', 67),
 ('come', 66),
 ('stormtroop', 64),
 ('head', 64),
 ('falcon', 62)]

In [116]:
sorting_dict(lemma_freq)

[('rey', 276),
 ('finn', 239),
 ('han', 173),
 ('ren', 126),
 ('back', 119),
 ('kylo', 107),
 ('chewie', 105),
 ('see', 103),
 ('poe', 101),
 ('turn', 93),
 ('look', 89),
 ('get', 81),
 ('move', 77),
 ('ship', 77),
 ('one', 67),
 ('falcon', 62),
 ('fighter', 57),
 ('head', 55),
 ('come', 54),
 ('fire', 53)]