INTRODUCTION
--

The goal of this project was to combine two of my main interests: teen dramas and American politics. I investigated what impact, if any, 9/11 had on the television series Gilmore Girls, using dictionary-based sentiment analysis and frequency distributions. Gilmore Girls is known for the sheer amount of dialogue in any given episode. The titular characters, Rory and Lorelai Gilmore, are known for their fast talking and immense repertoire of cultural references. This made Gilmore Girls an interesting choice for analyzing the effect of such a monumental global event.

Through sentiment and frequency distribution analysis, I sought to uncover any change in words related to 9/11 between seasons one and three of Gilmore Girls, which were released in 2000 and 2002 respectively.


The data used for this project, the transcripts of Gilmore Girls episodes, was obtained from https://www.gilmoregirls.org. The dataset consists of a handful of episodes from seasons 1 and 3, scraped from that website using BeautifulSoup. Season two was omitted because its episodes aired only a short time after the events of 9/11 and were most likely written before 9/11/2001.

CODE
---

In [1]:
import seaborn as sns
import matplotlib.pyplot as plt

The above packages will help us plot the frequency of words later.

In [2]:
#import pandas as pd
#import numpy as np

In [None]:
#import gensim
#import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

In [3]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

In [4]:
stops = set(stopwords.words('english'))
#print(stops)

NLTK allows us to tokenize the text, i.e., to split a large block of text into individual words, and to remove stop words. The list of English stop words provided by NLTK includes words like "because", "only", and "am", which are common in spoken and written language. These words carry little meaning on their own and would otherwise crowd out the more informative words in the text.

In [5]:
import requests

In [6]:
from bs4 import BeautifulSoup

Importing 'requests' allows us to get the HTML from a given url where the desired text is located. This is how we will get the transcripts of the Gilmore Girls episodes.

BeautifulSoup allows us to easily parse data that has been scraped from the web.

In [None]:
#import seaborn as sns
#import re

In [7]:
def getWordsFromPage(pageNumber):
    # pageNumber encodes season and episode (e.g. 301 = season 3, episode 1)
    url = 'https://www.gilmoregirls.org/eguide/transcripts/episode{pageNumber}.html'
    url = url.format(pageNumber=pageNumber)

    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Use the text of the <body> element; fall back to the whole page if it is missing
    body = soup.find('body')
    script = body.get_text() if body is not None else soup.get_text()

    # Tokenize, lowercase, and keep only purely alphanumeric tokens (drops punctuation)
    lower_words = [x.lower() for x in word_tokenize(script)]
    words = [w for w in lower_words if w.isalnum()]

    # Show-specific stop words: character names, interjections, and site navigation text.
    # Every entry needs a trailing comma; adjacent string literals are silently merged.
    GGStops = ["morey", "oh", "rory", "lorelai", "ah", "ooh", "l", "r", "emily", "max",
               "luke", "lane", "richard", "know", "well", "really", "okay", "sookie",
               "would", "like", "dean", "go", "paris", "zach", "dave", "terry", "michel", "kyle",
               "brian", "louise", "jackson", "yes", "yeah", "kirk", "taylor", "darren",
               "madeline", "tristan", "jess", "ok", "going", "get", "got",
               "think", "bye", "hi", "uh", "gon", "na", "tell", "one", "grandma",
               "grandpa", "grandmother", "1", "chilton", "something", "carol", "site", "navigation",
               "transcript", "summary", "cast", "characters", "episode", "guide",
               "drella", "said", "say", "joey", "debbie", "jamie", "jennifer", "ian"]

    # Combine NLTK's English stop words with the show-specific list
    allStops = set(nltk.corpus.stopwords.words('english'))
    allStops.update(GGStops)

    wordsFiltered = [w for w in words if w not in allStops]
    return wordsFiltered


In the above code, we created a function that retrieves the script from the url (https://www.gilmoregirls.org/eguide/transcripts/episode{pageNumber}.html) by page number, which corresponds to the season and episode number. Using BeautifulSoup, we extract the text of the page, and with nltk's word_tokenize we split it into single words. We then transform all of the words into lowercase. Using the 'isalnum()' method, we drop any tokens that are not purely alphanumeric, such as punctuation.

Using nltk we were able to remove common English stopwords. However, upon inspection of the data, we find that character names, expressions like "oh," and other unimportant words prevent us from focusing on the important words. Therefore, we appended a new list (GGStops) onto the list of stopwords to filter out the unwanted words. 
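
The cleaning steps described above can be tried on a toy token list without any network access. This is only a minimal sketch, not the notebook's actual function: `clean_tokens`, `base_stops`, and `extra_stops` are hypothetical names, and the tiny stop lists stand in for the nltk list and GGStops.

```python
def clean_tokens(tokens, base_stops, extra_stops):
    """Lowercase tokens, drop non-alphanumeric ones, then drop stop words."""
    # Combine general and show-specific stop words into one set for fast lookup
    all_stops = set(base_stops) | set(extra_stops)
    lowered = [t.lower() for t in tokens]
    # str.isalnum() is False for punctuation and for tokens containing an apostrophe
    alnum_only = [t for t in lowered if t.isalnum()]
    return [t for t in alnum_only if t not in all_stops]


# Toy input standing in for tokenized transcript text (assumed, not real data)
tokens = ["Rory", ":", "Oh", ",", "I", "need", "coffee", "because", "it", "'s", "Monday", "!"]
base_stops = ["i", "because", "it"]   # stand-in for nltk's English stop words
extra_stops = ["rory", "oh"]          # stand-in for the GGStops list

cleaned = clean_tokens(tokens, base_stops, extra_stops)
print(cleaned)
```

Only the content words survive, which is the effect the extended stop word list is meant to have on the real transcripts.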

In [8]:
season_1 = (getWordsFromPage(1)) + (getWordsFromPage(2)) + (getWordsFromPage(3)) + (getWordsFromPage(4)) + (getWordsFromPage(5)) 
season_3 = (getWordsFromPage(301)) + (getWordsFromPage(302)) + (getWordsFromPage(303)) + (getWordsFromPage(304)) + (getWordsFromPage(305)) 

Here we concatenated the word lists from five episodes of each season and assigned the result to a variable named for that season.
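
The same concatenation could be written as a loop over page numbers, which scales more easily to a full season. The following is a sketch under assumptions: `collect_words` is a hypothetical helper, and `fake_get_words` is a stand-in with made-up words so the example runs offline. In the notebook, `getWordsFromPage` would be passed in its place.

```python
def collect_words(page_numbers, get_words):
    """Concatenate the word lists returned by get_words for each page number."""
    words = []
    for page in page_numbers:
        words.extend(get_words(page))
    return words


def fake_get_words(page_number):
    # Hypothetical stand-in for getWordsFromPage so no network request is made
    toy_pages = {1: ["coffee", "school"], 2: ["coffee", "dance"], 301: ["yale", "coffee"]}
    return toy_pages.get(page_number, [])


season_1_demo = collect_words(range(1, 3), fake_get_words)
season_3_demo = collect_words([301], fake_get_words)
print(season_1_demo, season_3_demo)
```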

In [None]:
print(season_3)

Using the print function, we can inspect the list of words for a season. Page numbers follow a simple pattern: season one episodes are simply 1 - 21, and
later seasons use the season number followed by the two-digit episode number, so season 2, episode 1 is 201, etc.
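
The numbering scheme can be captured in a small helper. This is a sketch based only on the pattern described above. `page_number` is a hypothetical name, and whether every season on the site follows the pattern is an assumption.

```python
def page_number(season, episode):
    """Return the transcript page number for a season and episode.

    Season 1 pages are just the episode number; later seasons are
    season * 100 + episode (e.g. season 2, episode 1 -> 201).
    """
    if season < 1 or episode < 1:
        raise ValueError("season and episode must be positive")
    if season == 1:
        return episode
    return season * 100 + episode


print(page_number(1, 5), page_number(2, 1), page_number(3, 19))
```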

In [None]:
from wordcloud import WordCloud

WordCloud lets us make a word cloud from the list of words.

In [None]:
wordcloud = WordCloud(background_color="white", max_words=150, contour_width=1, contour_color='steelblue')

These options set the word cloud's appearance (white background, at most 150 words, steel-blue contour settings).

In [None]:
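wordcloud.generate(' '.join(getWordsFromPage(319)))

Generate a word cloud for a specific episode (here page 319, i.e. season 3, episode 19). The words are joined with spaces so that WordCloud receives plain text rather than the printed form of a Python list.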

In [None]:
wordcloud.generate(' '.join(season_1))

Generate a word cloud from one of the pre-defined season variables.

In [None]:
wordcloud.to_image()

Display the word cloud as an image.

In [None]:
sns.set_style('darkgrid')
nlp_words=nltk.FreqDist(season_1)
nlp_words.plot(25);


Another visualization: a plot of the 25 most frequent words in a given season.

In [None]:
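from collections import Counter
data_set1 = season_1
# Store the counts under their own name so the Counter class is not overwritten
counts_1 = Counter(data_set1)

most_occur = counts_1.most_common(50)
  
#print(most_occur)


The 50 words that occur most often in the season 1 episodes.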

In [None]:
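from collections import Counter
data_set2 = season_3
# Store the counts under their own name so the Counter class is not overwritten
counts_3 = Counter(data_set2)

most_occur = counts_3.most_common(50)
  
#print(most_occur)

The 50 words that occur most often in the season 3 episodes.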
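
Because the two word lists differ in length, raw counts are hard to compare across seasons. One option, sketched here with toy data, is to convert counts to a rate per 1,000 words. `rate_per_thousand` and the sample lists are hypothetical and are not results from the transcripts.

```python
from collections import Counter


def rate_per_thousand(words, targets):
    """Return occurrences per 1,000 words for each target word."""
    counts = Counter(words)
    total = len(words)
    if total == 0:
        return {t: 0.0 for t in targets}
    return {t: 1000 * counts[t] / total for t in targets}


# Toy word lists standing in for season_1 and season_3 (assumed data)
demo_s1 = ["coffee"] * 3 + ["school"] * 1
demo_s3 = ["coffee"] * 2 + ["war"] * 2 + ["college"] * 4

rates_s1 = rate_per_thousand(demo_s1, ["coffee", "war"])
rates_s3 = rate_per_thousand(demo_s3, ["coffee", "war"])
print(rates_s1)
print(rates_s3)
```

With the real data, the same function could be called on `season_1` and `season_3` with a chosen list of 9/11-related words.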

In [None]:
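li1 = season_1
li2 = season_3

# Build a set of season 3 words so each membership test is fast
vocab3 = set(li2)

wordsin1 = []
for element in li1:
    if element not in vocab3:
        wordsin1.append(element)

from collections import Counter
# Store the counts under their own name so the Counter class is not overwritten
counts_only1 = Counter(wordsin1)

most_occur = counts_only1.most_common(110)
  
#print(most_occur)
#print(wordsin1)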


In [None]:
sns.set_style('darkgrid')
w1_words=nltk.FreqDist(wordsin1)
w1_words.plot(25);

w1_words.tabulate(10)

Words occurring in season 1 but not in season 3.
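
The filtering step can be checked on a small example. This is a sketch with toy lists. `words_only_in` is a hypothetical helper that uses a set for membership tests, so each lookup does not rescan the other season's list.

```python
from collections import Counter


def words_only_in(source, other):
    """Count the words of source that never appear in other."""
    other_vocab = set(other)
    return Counter(w for w in source if w not in other_vocab)


# Toy word lists standing in for two seasons (assumed data)
demo_a = ["prom", "coffee", "prom", "school"]
demo_b = ["coffee", "college", "school"]

only_a = words_only_in(demo_a, demo_b)
only_b = words_only_in(demo_b, demo_a)
print(only_a.most_common())
print(only_b.most_common())
```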

In [None]:
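li1 = season_1
li2 = season_3

# Build a set of season 1 words so each membership test is fast
vocab1 = set(li1)

wordsin3 = []
for element in li2:
    if element not in vocab1:
        wordsin3.append(element)

from collections import Counter
# Store the counts under their own name so the Counter class is not overwritten
counts_only3 = Counter(wordsin3)

most_occur = counts_only3.most_common(100)
  
#print(most_occur)
#print(wordsin3)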


In [None]:
sns.set_style('darkgrid')
w3_words=nltk.FreqDist(wordsin3)
w3_words.plot(25);

w3_words.tabulate(10)


Words occurring in season 3 but not in season 1.

Resources