# Differentiating the Humor of Long-Running Animated Adult Comedy TV Shows
#### Ben Jablonski
#### 11-28-2020


In [1]:
from IPython.display import HTML
HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
The raw code for this notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')



### Introduction
Comedy has been a perennial genre in television since the emergence of network television, but the writing and themes of different TV shows vary enormously. Specifically, animated adult comedy shows have become increasingly popular since the emergence of *The Simpsons* in the early 1990s. The creation of Adult Swim in 2001 established a dedicated time slot and network for adult animation, and the emergence of streaming services like Netflix and Hulu has cemented funding for new animation projects. Over the last decade, shows like *Rick and Morty* have become cultural sensations. For a more thorough history of adult animation, this article from [*Time*](https://time.com/5752400/adult-animation-golden-age/) is a fantastic resource. I wanted to test whether computational techniques such as classification and word2vec models can differentiate between the writing and themes of various animated comedy TV shows. Specifically, which words, themes, and relationships between characters define the TV shows and separate them from each other? Potentially, these methods can reveal subtextual themes that may not be clear to a casual viewer and could elucidate choices made by the writers, whether those choices are conscious or subconscious. Because of the increasing popularity of animated comedy shows, differentiating and understanding their writing and themes serves an important cultural role by complementing the experience of watching a popular TV show. Specifically, these computational techniques allow a human viewer to investigate relationships that previously seemed banal, and exposing these relationships can provide a critical acuity that enables a richer understanding of a TV show.


### Relevant Studies 
There are two types of studies that are relevant to this analysis. Some studies are data-based, while others are more anecdotal and focus on the philosophical themes of TV shows. For example, for the TV show *South Park*, there are 45 podcast episodes and 12 YouTube videos on the YouTube channel [Wisecrack](https://www.youtube.com/channel/UC6-ymYjG0SU0jUWnWh9ZzEQ) alone. These analyses fall into the second category, with most of the Wisecrack crew being philosophy graduate students or former philosophy professors. One example is the podcast episode on the "Band in China" *South Park* episode, which discusses international copyright issues and censorship as financial blackmail from China. There are various other studies on YouTube or sites like Medium, but most of them focus on individual episodes or individual TV shows rather than on the genre of animated comedy as a whole.
The other category of studies is more data science based. Each of these studies focuses on an individual TV show and uses the scripts of every episode. Some examples include "[Going Down to South Park -- Text Analysis with R](https://medium.com/@vertabeloacdm/going-down-to-south-park-text-analysis-with-r-61e8beef6851)", "[The Simpsons meets Data Visualization](https://towardsdatascience.com/the-simpsons-meets-data-visualization-ef8ef0819d13)", "[Visualizing Archer](https://medium.com/@Elijah_Meeks/visualizing-archer-bcb80e319625)", or "[Futurama: Bender's NLP](https://towardsdatascience.com/futarama-benders-nlp-775c47871ad5)". However, many of these studies simply count individual characters' lines and compare those counts between seasons. For example, *The Simpsons* study above finds that Homer and Marge talk to each other more than any other pair of characters and that Flanders's lines have the most positive sentiment. Neither of these conclusions provides particular insight into the characters or the themes of the TV show. Additionally, I could not find studies that compare different animated adult comedy TV shows.





### Data 
It is difficult to be unbiased in deciding which TV shows to include in the analysis, but only a dozen animated adult comedy shows have run for more than 5 seasons. I wanted to select TV shows with at least 60 or 70 episodes in order to have a larger amount of text to analyze for each show. Given these conditions, I decided to select the following TV shows: *South Park*, *The Simpsons*, *Futurama*, *American Dad!*, and *Family Guy*. Other shows that met the 70-episode threshold but were not selected include *Archer* and *King of the Hill*. The production studios for these shows had filed DMCA takedowns of publicly available transcripts, so they are not included. Other shows such as *Bob's Burgers* and *Aqua Teen Hunger Force* did not have complete transcripts of every episode online. 

There are five datasets, one for each TV show listed above, and each row of the dataset represents one line spoken in an episode of a given TV show. Therefore, a row includes the line spoken, the character speaking the line, the episode in which the line is spoken, and the season in which the episode occurs. 
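As a rough illustration of this layout, the sketch below builds a tiny table with one row per spoken line and collapses it into one document per episode. The column names and the rows are hypothetical placeholders, not the actual headers or contents of the CSV files.

```python
import pandas as pd

# Hypothetical rows: one row per spoken line (column names are illustrative only).
lines = pd.DataFrame({
    "season": [1, 1, 1, 2],
    "episode": [1, 1, 2, 1],
    "character": ["A", "B", "A", "B"],
    "line": ["Hello there.", "Hi.", "Good morning.", "See you later."],
})

# Collapse the line-level table into one text document per (season, episode).
episode_docs = (
    lines.groupby(["season", "episode"])["line"]
    .apply(" ".join)
    .reset_index(name="text")
)
print(episode_docs)
```

Grouping on both season and episode keeps episodes with the same within-season number from being merged.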

While I have tried to fill in missing episode transcripts in order to have every episode represented in the datasets, there are some exceptions, such as the *South Park* episodes "200" and "201", which have been banned for their depiction of the Islamic prophet Muhammad. Overall, these gaps in the dataset are relatively limited and do not significantly affect the accuracy of the conclusions presented.

Additionally, all of these datasets can be found and downloaded on my [GitHub repo](https://github.com/bjablonski20/final-project-qtm340) along with all code used for the analysis in order to enable replication of the results below.

The *Futurama*, *South Park*, and *The Simpsons* datasets were created by fans of the shows on Kaggle and GitHub, but I scraped the *Family Guy* and *American Dad!* transcripts myself using Beautiful Soup.
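To make the scraping step more concrete, here is a minimal sketch of pulling speaker and line pairs out of transcript-style HTML with Beautiful Soup. The HTML snippet is invented for illustration, and the assumption that each speaker name sits in a bold tag at the start of a paragraph is only a guess at how such pages are commonly laid out.

```python
from bs4 import BeautifulSoup

# Invented transcript fragment; real pages will differ in structure.
html = """
<p><b>Stan:</b> Good morning, USA!</p>
<p><b>Roger:</b> I need a drink.</p>
<p>(Roger leaves.)</p>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
speaker = "NONE"  # paragraphs without a bold tag keep the previous speaker
for p in soup.find_all("p"):
    bold = p.find("b")
    if bold is not None:
        speaker = bold.get_text(strip=True).rstrip(":")
        bold.extract()  # drop the speaker tag so only the spoken line remains
    rows.append((speaker, p.get_text(strip=True)))

for row in rows:
    print(row)
```

Carrying the last seen speaker forward means stage directions are attributed to whoever spoke last, which is a simplification worth keeping in mind when counting lines per character.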

The following code loads all data needed for this project from GitHub. 

In [None]:

## importing
import re
import os
import pandas as pd
import little_mallet_wrapper
import seaborn
import glob
from pathlib import Path
import nltk
import numpy as np
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from pandas import Series, DataFrame
from sklearn.feature_extraction import text
from scipy.stats import pearsonr, norm
sid = SentimentIntensityAnalyzer()

## South Park 
### retrieve csv from my github repo
if not os.path.exists('sp_ep_data.csv'):
    !wget https://raw.githubusercontent.com/bjablonski20/final-project-qtm340/main/Transcript%20CSVs/sp_ep_data.csv
### convert into a pandas dataframe
### (pandas >= 1.3 uses on_bad_lines='skip' in place of error_bad_lines=False)
sp_ep_data = pd.read_csv('sp_ep_data.csv', error_bad_lines=False)
sp_ep_data = sp_ep_data.drop(['episode_link', 'season_link', 'season_name'], axis=1)
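### split dataframe into text files for each episode
os.makedirs("SouthPark_Data", exist_ok=True) ## make sure the output folder exists
for season in range(1,sp_ep_data['season_number'].max()+1): ## season numbers
    ## +1 so that the last episode of each season is included
    for episode in range(1, sp_ep_data[sp_ep_data['season_number'] == season]['season_episode_number'].max()+1):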
        filename = "sp_ep" + str(season) + "_" + str(episode)+"_.txt"
        path = "SouthPark_Data/" + filename
        with open(path, "w") as file:
            file.writelines(sp_ep_data[(sp_ep_data['season_number'] == season) & (sp_ep_data['season_episode_number'] == episode)]['text'])
       

 ##The Simpsons

### retrieve csv from my github repo
if not os.path.exists('simpsons_script_lines.csv'):
    !wget https://raw.githubusercontent.com/bjablonski20/final-project-qtm340/main/Transcript%20CSVs/simpsons_script_lines.csv
### convert into a pandas dataframe
simp_ep_data = pd.read_csv ('simpsons_script_lines.csv',error_bad_lines=False)
### Adding season column to simpsons dataset -- i know this is unwieldy but i decided to brute force it 
simp_ep_data['season'] = np.where(simp_ep_data['episode_id'] < (14),1,0)
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (14+22))&(simp_ep_data['episode_id'] >= (14)),2,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (14+22+24))&((simp_ep_data['episode_id'] >= (14+22))),3,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (14+22+24+22))&(simp_ep_data['episode_id'] >= (14+22+24)),4,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (14+22+24+22+22))&(simp_ep_data['episode_id'] >= (14+22+24+22)),5,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (14+22+24+22+22+25))&(simp_ep_data['episode_id'] >= (14+22+24+22+22)),6,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (14+22+24+22+22+25+25))&(simp_ep_data['episode_id'] >= (14+22+24+22+22+25)),7,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (14+22+24+22+22+25+25+25))&(simp_ep_data['episode_id'] >= (14+22+24+22+22+25+25)),8,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (14+22+24+22+22+25+25+25+25))&(simp_ep_data['episode_id'] >= (14+22+24+22+22+25+25+25)),9,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (14+22+24+22+22+25+25+25+25+23))&(simp_ep_data['episode_id'] >= (14+22+24+22+22+25+25+25+25)),10,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (14+22+24+22+22+25+25+25+25+23+22))&(simp_ep_data['episode_id'] >= (14+22+24+22+22+25+25+25+25+23)),11,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (14+22+24+22+22+25+25+25+25+23+22+21))&(simp_ep_data['episode_id'] >= (14+22+24+22+22+25+25+25+25+23+22)),12,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (14+22+24+22+22+25+25+25+25+23+22+21+22))&(simp_ep_data['episode_id'] >= (14+22+24+22+22+25+25+25+25+23+22+21)),13,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (14+22+24+22+22+25+25+25+25+23+22+21+22+22))&(simp_ep_data['episode_id'] >= (14+22+24+22+22+25+25+25+25+23+22+21+22)),14,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (14+22+24+22+22+25+25+25+25+23+22+21+22+22+22))&(simp_ep_data['episode_id'] >= (14+22+24+22+22+25+25+25+25+23+22+21+22+22)),15,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (14+22+24+22+22+25+25+25+25+23+22+21+22+22+22+21))&(simp_ep_data['episode_id'] >= (14+22+24+22+22+25+25+25+25+23+22+21+22+22+22)),16,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (14+22+24+22+22+25+25+25+25+23+22+21+22+22+22+21+22))&(simp_ep_data['episode_id'] >= (14+22+24+22+22+25+25+25+25+23+22+21+22+22+22+21)),17,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (401))&(simp_ep_data['episode_id'] >= (14+22+24+22+22+25+25+25+25+23+22+21+22+22+22+21+22)),18,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (401+20))&(simp_ep_data['episode_id'] >= (401)),19,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (401+20+21))&(simp_ep_data['episode_id'] >= (401+20)),20,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (401+20+21+23))&(simp_ep_data['episode_id'] >= (401+20+21)),21,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (401+20+21+23+22))&(simp_ep_data['episode_id'] >= (401+20+21+23)),22,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (401+20+21+23+22+22))&(simp_ep_data['episode_id'] >= (401+20+21+23+22)),23,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (401+20+21+23+22+22+22))&(simp_ep_data['episode_id'] >= (401+20+21+23+22+22)),24,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (401+20+21+23+22+22+22+22))&(simp_ep_data['episode_id'] >= (401+20+21+23+22+22+22)),25,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] >= (401+20+21+23+22+22+22+22)),26,simp_ep_data['season'])
### Split dataframe into individual text files for each episode
for season in range(1,simp_ep_data['season'].max()+1): ## season numbers
    if type(simp_ep_data[simp_ep_data['season'] == season]['episode_id'].max()) == type(simp_ep_data[simp_ep_data['season'] == 1]['episode_id'].max()):
        ## +1 so that the last episode of each season is included
        for episode in range(int(simp_ep_data[simp_ep_data['season'] == season]['episode_id'].min()), int(simp_ep_data[simp_ep_data['season'] == season]['episode_id'].max()) + 1):
            filename = "simp_ep" + str(season) + "_" + str(episode)+"_.txt"
            path = "Simpsons_Data/" + filename
            with open(path, "w") as file:
                file.writelines(simp_ep_data[(simp_ep_data['season'] == season) & (simp_ep_data['episode_id'] == episode)]['spoken_words'])
### these files were empty 
os.remove("./Simpsons_Data/simp_ep21_447_.txt")
os.remove("./Simpsons_Data/simp_ep20_424_.txt")
os.remove("./Simpsons_Data/simp_ep25_550_.txt")
                                  
    


## Futurama 
### retrieve csv from my github repo
if not os.path.exists('futurama_ep_data.csv'):
    !wget https://raw.githubusercontent.com/bjablonski20/final-project-qtm340/main/Transcript%20CSVs/futurama_ep_data.csv
futur_ep_data = pd.read_csv ('futurama_ep_data.csv',error_bad_lines=False)

### adds episode column: number the episodes consecutively, starting a new number whenever the episode title changes
futur_ep_data['Episode Number'] = (futur_ep_data['Episode'] != futur_ep_data['Episode'].shift()).cumsum()

### Splits the dataframe into individual text files for each episode 
for season in range(1,futur_ep_data['Season'].max()+1): ## season numbers
    ## +1 so that the last episode of each season is included
    for episode in range(1, futur_ep_data[futur_ep_data['Season'] == season]['Episode Number'].max()+1):
        filename = "futur_ep" + str(season) + "_" + str(episode)+"_.txt"
        path = "Futurama_Data/" + filename
        with open(path, "w") as file:
            try:
                file.writelines(futur_ep_data[(futur_ep_data['Season'] == season) & (futur_ep_data['Episode Number'] == episode)]['Line'])
            except TypeError:
                break
### removes all empty episodes 
from os import listdir
from os.path import isfile, join
mypath = "Futurama_Data/"
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
for file in onlyfiles:
    if os.path.getsize(mypath + file) == 0:
        os.remove(mypath + file)     


    



In [2]:
## Web Scraping 

### beautiful soup function that creates a pandas Dataframe with the episode, speaker and the line for the series 
from bs4 import BeautifulSoup
import requests
import re 
start = "<b>"
end = "</b>"
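def grab_urls(soup):
    """Collect the episode links from each season's two-column list on the show's index page."""
    episode_urls = []
    for season in soup.findAll('div', style="column-count:2"):
        for episode in season.findAll('a'):
            href = episode.attrs.get('href') ## skip anchors that have no link target
            if href is not None:
                episode_urls.append(href)
    return episode_urls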
def scripts_from_html(html):
    html_page = requests.get(html, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(html_page.content, 'html.parser')
    urls = grab_urls(soup)
    texts = []
    characters = []
    episode = []
    ep_no = 1
    character = "NONE"
    for url in urls:
        html = "https://transcripts.fandom.com/" + url
        count = 0
        html_page = requests.get(html)
        html_string = html_page.text
        soup = BeautifulSoup(html_string, 'html.parser')
        for paragraph in soup.find_all("p"):
            text1 = str(paragraph)
            if "<p>" in text1:
                text1 = text1[text1.index("<p>")+len("<p>"):text1.index("</p>")]
            episode.append(ep_no)
            ## a bold tag marks the speaker; paragraphs without one keep the previous speaker
            if "</b>" in str(paragraph):
                text2 = str(paragraph)
                character = text2[text2.index("<b>")+len("<b>"):text2.index("</b>")]
            characters.append(character)
            count = count + 1
            if "</b>" in text1:
                text1 = text1[text1.index("</b>")+len("</b>"):len(text1)]
            texts.append(text1)
        ep_no = ep_no + 1

    d = {'Character': characters,'Episode': episode, 'Line': texts}
    return pd.DataFrame(d)

In [3]:
## American Dad
html_amDad = "https://transcripts.fandom.com/wiki/American_Dad!"
amDad_ep_data = scripts_from_html(html_amDad)
## adds the season variable -- done manually because i couldnt find season number as a header when parsing through the html 
amDad_ep_data['season'] = np.where(amDad_ep_data['Episode'] < (23),1,0)
amDad_ep_data['season'] = np.where((amDad_ep_data['Episode'] < (23+19))&(amDad_ep_data['Episode'] >= (23)),2,amDad_ep_data['season'])
amDad_ep_data['season'] = np.where((amDad_ep_data['Episode'] < (23+19+16))&(amDad_ep_data['Episode'] >= (23+19)),3,amDad_ep_data['season'])
amDad_ep_data['season'] = np.where((amDad_ep_data['Episode'] < (23+19+16+20))&(amDad_ep_data['Episode'] >= (23+19+16)),4,amDad_ep_data['season'])
amDad_ep_data['season'] = np.where((amDad_ep_data['Episode'] < (23+19+16+20+18))&(amDad_ep_data['Episode'] >= (23+19+16+20)),5,amDad_ep_data['season'])
amDad_ep_data['season'] = np.where((amDad_ep_data['Episode'] < (23+19+16+20+18+19))&(amDad_ep_data['Episode'] >= (23+19+16+20+18)),6,amDad_ep_data['season'])
amDad_ep_data['season'] = np.where((amDad_ep_data['Episode'] < (23+19+16+20+18+19+18))&(amDad_ep_data['Episode'] >= (23+19+16+20+18+19)),7,amDad_ep_data['season'])

amDad_ep_data['season'] = np.where((amDad_ep_data['Episode'] < (133+19))&(amDad_ep_data['Episode'] >= (133)),8,amDad_ep_data['season'])
amDad_ep_data['season'] = np.where((amDad_ep_data['Episode'] < (133+19+20))&(amDad_ep_data['Episode'] >= (133+19)),9,amDad_ep_data['season'])
amDad_ep_data['season'] = np.where((amDad_ep_data['Episode'] < (133+19+20+18))&(amDad_ep_data['Episode'] >= (133+19+20)),10,amDad_ep_data['season'])
amDad_ep_data['season'] = np.where((amDad_ep_data['Episode'] < (133+19+20+18+22))&(amDad_ep_data['Episode'] >= (133+19+20+18)),11,amDad_ep_data['season'])
amDad_ep_data['season'] = np.where((amDad_ep_data['Episode'] < (133+19+20+18+22+22))&(amDad_ep_data['Episode'] >= (133+19+20+18+22)),12,amDad_ep_data['season'])
amDad_ep_data['season'] = np.where((amDad_ep_data['Episode'] < (133+19+20+18+22+22+22))&(amDad_ep_data['Episode'] >= (133+19+20+18+22+22)),13,amDad_ep_data['season'])
amDad_ep_data['season'] = np.where((amDad_ep_data['Episode'] < (133+19+20+18+22+22+22+22))&(amDad_ep_data['Episode'] >= (133+19+20+18+22+22+22)),14,amDad_ep_data['season'])

amDad_ep_data['season'] = np.where((amDad_ep_data['Episode'] < (278+22))&(amDad_ep_data['Episode'] >= (278)),15,amDad_ep_data['season'])

## create individual .txt docs for each episode
for season in range(1,amDad_ep_data['season'].max()+1): ## season numbers
    try:
        ## +1 so that the last episode of each season is included
        for episode in range(1, int(amDad_ep_data[amDad_ep_data['season'] == season]['Episode'].max()) + 1):
            filename = "amDad_ep" + str(season) + "_" + str(episode)+"_.txt"
            path = "amDad_Data/" + filename
            with open(path, "w") as file:
                file.writelines(amDad_ep_data[(amDad_ep_data['season'] == season) & (amDad_ep_data['Episode'] == episode)]['Line'])
    except ValueError:
        break


## removes all empty episodes 
from os import listdir
from os.path import isfile, join
mypath = "amDad_Data/"
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
for file in onlyfiles:
    if os.path.getsize(mypath + file) == 0:
        os.remove(mypath + file)

In [4]:
## Family Guy
html_famGuy = "https://transcripts.fandom.com/wiki/Family_Guy"
famGuy_ep_data = scripts_from_html(html_famGuy)
famGuy_ep_data['season'] = np.where(famGuy_ep_data['Episode'] < (8),1,0)
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (8+21))&(famGuy_ep_data['Episode'] >= (8)),2,famGuy_ep_data['season'])
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (8+21+22))&(famGuy_ep_data['Episode'] >= (8+21)),3,famGuy_ep_data['season'])
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (8+21+22+30))&(famGuy_ep_data['Episode'] >= (8+21+22)),4,famGuy_ep_data['season'])
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (8+21+22+30+18))&(famGuy_ep_data['Episode'] >= (8+21+22+30)),5,famGuy_ep_data['season'])
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (8+21+22+30+18+12))&(famGuy_ep_data['Episode'] >= (8+21+22+30+18)),6,famGuy_ep_data['season'])
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (8+21+22+30+18+12+16))&(famGuy_ep_data['Episode'] >= (8+21+22+30+18+12)),7,famGuy_ep_data['season'])

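## NOTE: this range starts at 112 and overlaps the season 7 range above (111-126), so episodes 112-126 are relabeled as season 8
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (112+21))&(famGuy_ep_data['Episode'] >= (112)),8,famGuy_ep_data['season'])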
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (112+21+18))&(famGuy_ep_data['Episode'] >= (112+21)),9,famGuy_ep_data['season'])
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (112+21+18+23))&(famGuy_ep_data['Episode'] >= (112+21+18)),10,famGuy_ep_data['season'])
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (112+21+18+23+22))&(famGuy_ep_data['Episode'] >= (112+21+18+23)),11,famGuy_ep_data['season'])
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (112+21+18+23+22+21))&(famGuy_ep_data['Episode'] >= (112+21+18+23+22)),12,famGuy_ep_data['season'])
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (112+21+18+23+22+21+18))&(famGuy_ep_data['Episode'] >= (112+21+18+23+22+21)),13,famGuy_ep_data['season'])
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (112+21+18+23+22+21+18+20))&(famGuy_ep_data['Episode'] >= (112+21+18+23+22+21+18)),14,famGuy_ep_data['season'])

famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (255+20))&(famGuy_ep_data['Episode'] >= (255)),15,famGuy_ep_data['season'])
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (255+20+20))&(famGuy_ep_data['Episode'] >= (255+20)),16,famGuy_ep_data['season'])
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (255+20+20+20))&(famGuy_ep_data['Episode'] >= (255+20+20)),17,famGuy_ep_data['season'])
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (255+20+20+20+20))&(famGuy_ep_data['Episode'] >= (255+20+20+20)),18,famGuy_ep_data['season'])
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (255+20+20+20+20+22))&(famGuy_ep_data['Episode'] >= (255+20+20+20+20)),19,famGuy_ep_data['season'])


## creates individual .txt files for each episode
for season in range(1,famGuy_ep_data['season'].max()+1): ## season numbers
    try:
        ## +1 so that the last episode of each season is included
        for episode in range(1, int(famGuy_ep_data[famGuy_ep_data['season'] == season]['Episode'].max()) + 1):
            filename = "famGuy_ep" + str(season) + "_" + str(episode)+"_.txt"
            path = "famGuy_Data/" + filename
            with open(path, "w") as file:
                file.writelines(famGuy_ep_data[(famGuy_ep_data['season'] == season) & (famGuy_ep_data['Episode'] == episode)]['Line'])
    except ValueError:
        break


## removes empty files 
mypath = "famGuy_Data/"
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
for file in onlyfiles:
    if os.path.getsize(mypath + file) == 0:
        os.remove(mypath + file)

### Classification Method
I created three different classification models. The models differentiate between *American Dad!* and *Family Guy* episodes, *The Simpsons* and *Futurama* episodes, and *South Park* and *The Simpsons* episodes. The first two pairs were selected because the shows in each pair are created and written by many of the same people, while the last pair was picked because *South Park* and *The Simpsons* are the two longest-running shows in the dataset. Additionally, stop words are removed from the text, including show-specific place and character names. The stop words used for this analysis can also be downloaded from my GitHub repo.
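The stop-word step can be sketched as follows. The two "episodes" and the show-specific names are invented for illustration, and the real analysis reads its extra stop words from a file rather than from a hard-coded list.

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical show-specific names; the actual list comes from the stop-word file.
show_specific = ["homer", "bender"]
stop_words = list(text.ENGLISH_STOP_WORDS.union(show_specific))

# Two invented one-line "episodes".
episodes = [
    "Homer eats a donut and a pretzel and a donut",
    "Bender drinks a beer and eats a pretzel",
]

# Build the document-term matrix with the combined stop-word list.
vectorizer = CountVectorizer(stop_words=stop_words)
dtm = vectorizer.fit_transform(episodes)

# vocabulary_ maps each remaining word to its column in the matrix.
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(dtm.toarray())
```

Removing character and place names matters here because otherwise a classifier could separate two shows on names alone, which would say little about their writing.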

The goal of this section is both to test whether the classification method can differentiate between similar TV shows based only on the text of an episode and to find which words are most important in differentiating between any two TV shows. The relative ability of the classification model to distinguish between episodes of specific shows can indicate how similar two shows are, which is important for identifying the diversity of humor and theme between them. Additionally, identifying which episodes the model repeatedly misidentifies provides an opportunity for close reading: looking at the text of the show and identifying *why* the classification model has failed. The logistic regression weights given to individual words indicate the importance of those words to the identification of a show. While it is an imperfect measure, the highly weighted words can potentially paint an overall theme that differentiates a show from its counterpart in the model.
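A minimal sketch of this procedure on made-up word counts is shown below: each episode is held out in turn, an L1-penalized logistic regression is fit on the rest, and the weights of a final fit on all rows are read off per word. The counts, labels, and vocabulary are invented, and `LeaveOneOut` is used here as a convenient stand-in for a hand-written hold-one-out loop.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

# Invented word counts: rows are episodes, columns are words.
vocab = ["cia", "alien", "mayor"]
X = np.array([
    [5, 3, 0], [4, 2, 0], [6, 4, 1], [5, 5, 0],   # episodes of show 1
    [0, 0, 4], [1, 0, 5], [0, 1, 6], [0, 0, 5],   # episodes of show 0
], dtype=float)
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

model = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")

# Hold out one episode at a time and predict it from all the others.
correct = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    model.fit(X[train_idx], y[train_idx])
    correct += int(model.predict(X[test_idx])[0] == y[test_idx][0])
accuracy = correct / len(y)

# Fit once on everything to inspect the per-word weights.
model.fit(X, y)
weights = dict(zip(vocab, model.coef_[0]))
print(accuracy)
print(weights)
```

Positive weights push a prediction toward the class labeled 1 and negative weights toward the class labeled 0, and the L1 penalty tends to set the weights of uninformative words to exactly zero.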


#### *Family Guy*-*American Dad!* Model



In [None]:
## Family Guy-American Dad
### General Things 
### load stopwords
from sklearn.feature_extraction import text
text_file = open('./docs/stopwords.txt')
jockers_words = text_file.read().split()
new_stopwords = text.ENGLISH_STOP_WORDS.union(jockers_words)

### create dtm
corpus_path = './Classification_Data_AmDadFamGuy/'
vectorizer = CountVectorizer(input='filename', encoding='utf8', stop_words = new_stopwords, min_df=20, dtype='float64')


directory = "./Classification_Data_AmDadFamGuy/"
files = glob.glob(f"{directory}/*.txt")
titles = [Path(file).stem for file in files]

corpus = []
for title in titles:
    filename = title + ".txt"
    corpus.append(corpus_path + filename)
dtm = vectorizer.fit_transform(corpus)
vocab = vectorizer.get_feature_names()
matrix = dtm.toarray()
df = DataFrame(matrix, columns=vocab)









filepaths = []
fulltext = []
for filepath in glob.iglob('Classification_Data_AmDadFamGuy/*.txt'): ## grab titles
    filepaths.append(filepath)
for filepath in filepaths: ## grab the full text of each individual episode
    with open(filepath, "r") as file:
        episode_text = file.read() ## avoid shadowing the sklearn `text` module imported above
        fulltext.append(episode_text)
clasdict = {'Name': filepaths, 'Text': fulltext
}
classdf = pd.DataFrame(clasdict)
def changeFilePath(filePath):
    ## keep only the file name, dropping the directory prefix
    return os.path.basename(filePath)
classdf['Name'] = classdf['Name'].apply(changeFilePath)
amDadBool = []

for name in classdf['Name']:
    if name[0:4] == "amDa":
        amDadBool.append(1)
    else:
        amDadBool.append(0)
classdf['amDad'] = amDadBool

df_concat = pd.concat([classdf, df], axis = 1)
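noAD = df_concat['amDad'].sum()
noAD = int(noAD)
## balance the classes: sample as many Family Guy episodes as there are American Dad! episodes
## (df_simp holds the Family Guy episodes and df_sp the American Dad! episodes)
df_simp = df_concat[df_concat['amDad'] == 0]
df_sp = df_concat[df_concat['amDad'] == 1]
df_simp = df_simp.sample(n=noAD)
df_final = pd.concat([df_simp, df_sp])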
df_final = df_final.reset_index()
df_final = df_final.drop(columns="index")
meta = df_final[["Name", "amDad"]].copy() ## copy so that new columns can be added without a SettingWithCopyWarning
df = df_final.loc[:,'000':]

meta['PROBS'] = ''
meta['PREDICTED'] = ''

model = LogisticRegression(penalty = 'l1', C = 1.0, solver='liblinear')


for this_index in df_final.index.tolist():
    title = meta.loc[meta.index[this_index], 'Name'] 
    CLASS = meta.loc[meta.index[this_index], 'amDad']
    #print(title, CLASS) 
    
    train_index_list = [index_ for index_ in df.index.tolist() if index_ != this_index] # exclude the title to be predicted
    X = df.loc[train_index_list] # the model trains on all the data except the excluded title row
    y = meta.loc[train_index_list, 'amDad'] # the y row tells the model which class each title belongs to
    TEST_CASE = df.loc[[this_index]]

    model.fit(X,y) # fit the model
    prediction = model.predict_proba(TEST_CASE) # calculate probability of test case
    predicted = model.predict(TEST_CASE) # calculate predicted class of test case
    meta.at[this_index, 'PREDICTED'] = predicted # add predicted class to metadata
    meta.at[this_index, 'PROBS'] = str(prediction) # add probabilities to metadata

## predict() returns a length-1 array, so unwrap each prediction to a plain integer
meta['PREDICTED'] = meta['PREDICTED'].apply(lambda p: int(np.ravel(p)[0]))
## RESULT is 0 when the prediction is correct, 1 for an American Dad! episode predicted as Family Guy,
## and -1 for a Family Guy episode predicted as American Dad!
meta['RESULT'] = meta['amDad'] - meta['PREDICTED']


meta[meta['RESULT'] == 1]

canonic_c = 1.0

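def Ztest(vec1, vec2):
    """Two-sample z-test on the difference in means; returns (z, two-sided p-value)."""
    X1, X2 = np.mean(vec1), np.mean(vec2)
    sd1, sd2 = np.std(vec1), np.std(vec2)
    n1, n2 = len(vec1), len(vec2)

    ## standard error of the difference in means (unpooled, despite the variable name)
    pooledSE = np.sqrt(sd1**2/n1 + sd2**2/n2)
    z = (X1 - X2)/pooledSE
    pval = 2*(norm.sf(abs(z)))

    return z, pval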
def feat_pval_weight(meta_df_, dtm_df_):

    dtm0 = dtm_df_.loc[meta_df_[meta_df_['amDad']==0].index.tolist()].to_numpy()
    dtm1 = dtm_df_.loc[meta_df_[meta_df_['amDad']==1].index.tolist()].to_numpy()

    pvals = [Ztest(dtm0[ : ,i], dtm1[ : ,i])[1] for i in range(dtm_df_.shape[1])]
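    clf = LogisticRegression(penalty = 'l1', C = canonic_c, class_weight = 'balanced', solver='liblinear')
    ## the target is True for Family Guy (amDad == 0), so positive weights point to Family Guy
    ## and negative weights point to American Dad!
    clf.fit(dtm_df_, meta_df_['amDad']==0)
    weights = clf.coef_[0]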

    feature_df = pd.DataFrame()

    feature_df['FEAT'] = dtm_df_.columns
    feature_df['P_VALUE'] = pvals
    feature_df['LR_WEIGHT'] = weights

    return feature_df


feat_df = feat_pval_weight(meta, df)


In [10]:
accuracy = len(meta[meta['RESULT'] == 0])/len(meta)
print("The accuracy of the Family Guy-American Dad model is " + str(accuracy))

The accuracy of the Family Guy-American Dad model is 0.9162790697674419


In [11]:
feat_df.sort_values('LR_WEIGHT', ascending = True).head(20)

| | FEAT | P_VALUE | LR_WEIGHT |
|---|---|---|---|
| 3779 | usa | 2.588796e-30 | -1.14164 |
| 1549 | cia | 4.619195e-11 | -0.579678 |
| 1143 | alien | 0.0002602196 | -0.431902 |
| 2770 | need | 0.0001935979 | -0.430273 |
| 2724 | morning | 2.00274e-15 | -0.406824 |
| 1150 | american | 1.272633e-10 | -0.370146 |
| 1682 | crazy | 0.00774967 | -0.35722 |
| 3928 | worry | 0.1449264 | -0.353994 |
| 1772 | didn | 0.005865955 | -0.257974 |
| 1087 | aah | 0.02638893 | -0.229395 |
1087,aah,0.02638893,-0.229395


In [12]:
feat_df.sort_values('LR_WEIGHT', ascending = True).tail(20)

| | FEAT | P_VALUE | LR_WEIGHT |
|---|---|---|---|
| 3878 | white | 0.283302 | 0.130519 |
| 2223 | grunting | 0.574259 | 0.145611 |
| 2517 | laugh | 3.555786e-24 | 0.152365 |
| 2662 | mean | 1.03866e-07 | 0.15699 |
| 2239 | guys | 4.260305e-16 | 0.172757 |
| 2262 | happy | 0.1843361 | 0.188815 |
| 2363 | huh | 6.124952e-09 | 0.194199 |
| 3732 | trouble | 0.004781575 | 0.209467 |
| 2659 | mayor | 0.00414859 | 0.214255 |
| 3797 | violence | 3.336212e-82 | 0.222522 |


#### *Futurama*-*The Simpsons* Model



In [None]:
## Futurama-Simpsons
## General Things 
# load stopwords
from sklearn.feature_extraction import text
text_file = open('./docs/stopwords.txt')
jockers_words = text_file.read().split()
new_stopwords = text.ENGLISH_STOP_WORDS.union(jockers_words)

# create dtm
corpus_path = './Classification_Data_SimpFuturama/'
vectorizer = CountVectorizer(input='filename', encoding='utf8', stop_words = new_stopwords, min_df=20, dtype='float64')


directory = "./Classification_Data_SimpFuturama/"
files = glob.glob(f"{directory}/*.txt")
titles = [Path(file).stem for file in files]

corpus = []
for title in titles:
    filename = title + ".txt"
    corpus.append(corpus_path + filename)
dtm = vectorizer.fit_transform(corpus)
vocab = vectorizer.get_feature_names()
matrix = dtm.toarray()
df = DataFrame(matrix, columns=vocab)


filepaths = []
fulltext = []
for filepath in glob.iglob('Classification_Data_SimpFuturama/*.txt'): ## grab titles
    filepaths.append(filepath)
for filepath in filepaths: ## grab text and get the scores of each individual episode
    with open(filepath, "r") as file:
        text = file.read()
        fulltext.append(text)
clasdict = {'Name': filepaths, 'Text': fulltext
}
classdf = pd.DataFrame(clasdict)
def changeFilePath(filePath):
    newPath = filePath[33:]
    return newPath
classdf['Name'] = classdf['Name'].apply(changeFilePath)
futramaBool = []

for name in classdf['Name']:
    if name[0:4] == "simp":
        futramaBool.append(0)
    else:
        futramaBool.append(1)
classdf['Futurama'] = futramaBool

df_concat = pd.concat([classdf, df], axis = 1)
noF = df_concat['Futurama'].sum()

df_simp = df_concat[df_concat['Futurama'] == 0]
df_sp = df_concat[df_concat['Futurama'] == 1]
df_simp = df_simp.sample(n=noF)
df_final = pd.concat([df_simp, df_sp])
df_final = df_final.reset_index()
df_final = df_final.drop(columns="index")
meta = df_final[["Name", "Futurama"]].copy() # copy so the columns added below are not written to a slice of df_final
df = df_final.loc[:,'000':]

meta['PROBS'] = ''
meta['PREDICTED'] = ''

model = LogisticRegression(penalty = 'l1', C = 1.0, solver='liblinear')


for this_index in df_final.index.tolist():
    title = meta.loc[meta.index[this_index], 'Name'] 
    CLASS = meta.loc[meta.index[this_index], 'Futurama']
    #print(title, CLASS) 
    
    train_index_list = [index_ for index_ in df.index.tolist() if index_ != this_index] # exclude the title to be predicted
    X = df.loc[train_index_list] # the model trains on all the data except the excluded title row
    y = meta.loc[train_index_list, 'Futurama'] # the y row tells the model which class each title belongs to
    TEST_CASE = df.loc[[this_index]]

    model.fit(X,y) # fit the model
    prediction = model.predict_proba(TEST_CASE) # calculate probability of test case
    predicted = model.predict(TEST_CASE) # calculate predicted class of test case
    meta.at[this_index, 'PREDICTED'] = predicted[0] # add predicted class (a scalar, not a one-element array) to metadata
    meta.at[this_index, 'PROBS'] = str(prediction) # add probabilities to metadata

# RESULT is 0 for a correct prediction, 1 when a Futurama episode is predicted
# as The Simpsons, and -1 when a Simpsons episode is predicted as Futurama
meta['PREDICTED'] = meta['PREDICTED'].apply(lambda p: int(np.ravel(p)[0]))
meta['RESULT'] = meta['Futurama'] - meta['PREDICTED']

# show every misclassified episode, in either direction
meta[meta['RESULT'] != 0]

canonic_c = 1.0

def Ztest(vec1, vec2):

    X1, X2 = np.mean(vec1), np.mean(vec2)
    sd1, sd2 = np.std(vec1), np.std(vec2)
    n1, n2 = len(vec1), len(vec2)

    pooledSE = np.sqrt(sd1**2/n1 + sd2**2/n2)
    z = (X1 - X2)/pooledSE
    pval = 2*(norm.sf(abs(z)))

    return z, pval
def feat_pval_weight(meta_df_, dtm_df_):

    dtm0 = dtm_df_.loc[meta_df_[meta_df_['Futurama']==0].index.tolist()].to_numpy()
    dtm1 = dtm_df_.loc[meta_df_[meta_df_['Futurama']==1].index.tolist()].to_numpy()

    pvals = [Ztest(dtm0[ : ,i], dtm1[ : ,i])[1] for i in range(dtm_df_.shape[1])]
    clf = LogisticRegression(penalty = 'l1', C = canonic_c, class_weight = 'balanced', solver='liblinear')
    # the positive class is The Simpsons (label 0), so negative weights point toward Futurama
    clf.fit(dtm_df_, meta_df_['Futurama']==0)
    weights = clf.coef_[0]

    feature_df = pd.DataFrame()

    feature_df['FEAT'] = dtm_df_.columns
    feature_df['P_VALUE'] = pvals
    feature_df['LR_WEIGHT'] = weights

    return feature_df


feat_df = feat_pval_weight(meta, df)



In [11]:
accuracy = len(meta[meta['RESULT'] == 0])/len(meta)
print("The accuracy of the Futurama-Simpsons model is: " + str(accuracy))

The accuracy of the Futurama-Simpsons model is: 0.9375


In [12]:
feat_df.sort_values('LR_WEIGHT', ascending = True).head(20)

Unnamed: 0,FEAT,P_VALUE,LR_WEIGHT
2114,planet,2.125784e-10,-0.974364
2370,robot,2.366492e-11,-0.795632
2214,professor,3.293145e-07,-0.774413
2655,space,2.360484e-05,-0.468828
757,delivery,6.85094e-06,-0.331318
2881,things,0.002060697,-0.291092
3067,wanna,0.0002764642,-0.261212
127,ass,1.11797e-06,-0.249741
2798,surface,0.02312275,-0.217342
2520,ship,3.045339e-05,-0.212384


In [13]:
feat_df.sort_values('LR_WEIGHT', ascending = True).tail(20)

Unnamed: 0,FEAT,P_VALUE,LR_WEIGHT
1076,festival,0.01718463,0.0
1846,mom,0.001907975,0.002129
223,beer,0.07911002,0.005314
852,dream,0.2870535,0.012255
1681,little,4.464262e-05,0.016235
3172,work,0.1939321,0.026852
2441,school,2.758281e-07,0.056057
293,book,5.541133e-05,0.062601
1571,kids,3.093289e-05,0.086007
1034,family,1.264796e-07,0.088111


#### *South Park*-*The Simpsons* Model


In [None]:
## South Park-Simpsons
## General Things 
# load stopwords
from sklearn.feature_extraction import text
text_file = open('./docs/stopwords.txt')
jockers_words = text_file.read().split()
new_stopwords = list(text.ENGLISH_STOP_WORDS.union(jockers_words)) # recent scikit-learn versions expect a list here

# create dtm
corpus_path = './Classification_Data/'
vectorizer = CountVectorizer(input='filename', encoding='utf8', stop_words = new_stopwords, min_df=20, dtype='float64')


directory = "./Classification_Data/"
files = glob.glob(f"{directory}/*.txt")
titles = [Path(file).stem for file in files]

corpus = []
for title in titles:
    filename = title + ".txt"
    corpus.append(corpus_path + filename)
dtm = vectorizer.fit_transform(corpus)
vocab = vectorizer.get_feature_names_out() # on scikit-learn < 1.0, use get_feature_names() instead
matrix = dtm.toarray()
df = DataFrame(matrix, columns=vocab)


filepaths = []
fulltext = []
for filepath in glob.iglob('Classification_Data/*.txt'): ## grab titles
    filepaths.append(filepath)
for filepath in filepaths: ## grab text and get the scores of each individual episode
    with open(filepath, "r") as file:
        text = file.read()
        fulltext.append(text)
clasdict = {'Name': filepaths, 'Text': fulltext
}
classdf = pd.DataFrame(clasdict)
def changeFilePath(filePath):
    newPath = filePath[20:]
    return newPath
classdf['Name'] = classdf['Name'].apply(changeFilePath)
southParkBool = []

for name in classdf['Name']:
    if name[0:4] == "simp":
        southParkBool.append(0)
    else:
        southParkBool.append(1)
classdf['SouthPark'] = southParkBool

df_concat = pd.concat([classdf, df], axis = 1)
noSP = df_concat['SouthPark'].sum()

df_simp = df_concat[df_concat['SouthPark'] == 0]
df_sp = df_concat[df_concat['SouthPark'] == 1]
df_simp = df_simp.sample(n=noSP)
df_final = pd.concat([df_simp, df_sp])
df_final = df_final.reset_index()
df_final = df_final.drop(columns="index")
meta = df_final[["Name", "SouthPark"]].copy() # copy so the columns added below are not written to a slice of df_final
df = df_final.loc[:,'000':]

meta['PROBS'] = ''
meta['PREDICTED'] = ''

model = LogisticRegression(penalty = 'l1', C = 1.0, solver='liblinear')


for this_index in df_final.index.tolist():
    title = meta.loc[meta.index[this_index], 'Name'] 
    CLASS = meta.loc[meta.index[this_index], 'SouthPark']
    #print(title, CLASS) 
    
    train_index_list = [index_ for index_ in df.index.tolist() if index_ != this_index] # exclude the title to be predicted
    X = df.loc[train_index_list] # the model trains on all the data except the excluded title row
    y = meta.loc[train_index_list, 'SouthPark'] # the y row tells the model which class each title belongs to
    TEST_CASE = df.loc[[this_index]]

    model.fit(X,y) # fit the model
    prediction = model.predict_proba(TEST_CASE) # calculate probability of test case
    predicted = model.predict(TEST_CASE) # calculate predicted class of test case
    meta.at[this_index, 'PREDICTED'] = predicted[0] # add predicted class (a scalar, not a one-element array) to metadata
    meta.at[this_index, 'PROBS'] = str(prediction) # add probabilities to metadata

# RESULT is 0 for a correct prediction, 1 when a South Park episode is predicted
# as The Simpsons, and -1 when a Simpsons episode is predicted as South Park
meta['PREDICTED'] = meta['PREDICTED'].apply(lambda p: int(np.ravel(p)[0]))
meta['RESULT'] = meta['SouthPark'] - meta['PREDICTED']

# show every misclassified episode, in either direction
meta[meta['RESULT'] != 0]

canonic_c = 1.0

def Ztest(vec1, vec2):

    X1, X2 = np.mean(vec1), np.mean(vec2)
    sd1, sd2 = np.std(vec1), np.std(vec2)
    n1, n2 = len(vec1), len(vec2)

    pooledSE = np.sqrt(sd1**2/n1 + sd2**2/n2)
    z = (X1 - X2)/pooledSE
    pval = 2*(norm.sf(abs(z)))

    return z, pval
def feat_pval_weight(meta_df_, dtm_df_):

    dtm0 = dtm_df_.loc[meta_df_[meta_df_['SouthPark']==0].index.tolist()].to_numpy()
    dtm1 = dtm_df_.loc[meta_df_[meta_df_['SouthPark']==1].index.tolist()].to_numpy()

    pvals = [Ztest(dtm0[ : ,i], dtm1[ : ,i])[1] for i in range(dtm_df_.shape[1])]
    clf = LogisticRegression(penalty = 'l1', C = canonic_c, class_weight = 'balanced', solver='liblinear')
    # the positive class is The Simpsons (label 0), so negative weights point toward South Park
    clf.fit(dtm_df_, meta_df_['SouthPark']==0)
    weights = clf.coef_[0]

    feature_df = pd.DataFrame()

    feature_df['FEAT'] = dtm_df_.columns
    feature_df['P_VALUE'] = pvals
    feature_df['LR_WEIGHT'] = weights

    return feature_df


feat_df = feat_pval_weight(meta, df)


In [15]:
accuracy = len(meta[meta['RESULT'] == 0])/len(meta)
print("The accuracy of The South Park-Simpsons Model is: " + str(accuracy))

The accuracy of The South Park-Simpsons Model is: 0.9868421052631579


In [16]:
feat_df.sort_values('LR_WEIGHT', ascending = True).head(20)

Unnamed: 0,FEAT,P_VALUE,LR_WEIGHT
4120,walks,1.023394e-162,-0.655405
4007,turns,7.402690000000001e-140,-0.569603
1200,dude,7.295699e-111,-0.482903
3783,takes,8.699849e-89,-0.388868
231,away,7.299929e-116,-0.283372
2439,monitor,0.0003230522,-0.255395
194,ass,1.234175e-29,-0.241539
2265,looks,4.888705e-116,-0.159034
563,canada,0.02300872,-0.141829
1752,guys,7.788472e-67,-0.137286


In [17]:
feat_df.sort_values('LR_WEIGHT', ascending = True).tail(20)

Unnamed: 0,FEAT,P_VALUE,LR_WEIGHT
1919,hot,0.3868896,0.075627
1455,fine,0.006462251,0.079655
3732,super,0.0131443,0.080141
2261,look,5.866443e-45,0.082228
3851,thanks,0.06286511,0.092077
684,chief,1.990508e-06,0.093275
2207,life,0.2314188,0.095925
246,baby,0.03167185,0.114769
1687,gotta,2.027379e-05,0.129279
445,boy,0.1730123,0.130369


### Word2Vec Method 
I trained five separate Word2Vec models, one per show, each on every available script of that show. Comparing the words most similar to a given word in one model against those in another draws out differences in vocabulary and theme between the shows. Each list of similar words also works as that show's provisional definition of the word.
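
As a rough illustration of what "similar" means here: gensim's `most_similar` ranks words by the cosine similarity of their vectors. The sketch below reproduces that ranking in plain Python on a tiny, made-up set of three-dimensional vectors. The words and numbers in `toy_vectors` are hypothetical and are not taken from any of the trained models.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=10):
    """Rank every other word by its cosine similarity to `word`."""
    target = vectors[word]
    scores = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy vectors; a real model here has 200 dimensions per word
toy_vectors = {
    "family": [0.9, 0.1, 0.0],
    "mother": [0.8, 0.2, 0.1],
    "father": [0.7, 0.3, 0.0],
    "robot":  [0.0, 0.2, 0.9],
}

ranked = most_similar("family", toy_vectors, topn=3)
for other, score in ranked:
    print(f"{other}: {score:.3f}")
```

Words whose vectors point in nearly the same direction as "family" rank first, which is why a show's list of similar words can be read as the contexts in which that show uses the word.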

While the classification section attempts to define and differentiate shows on a binary basis (i.e., what themes and words are uniquely characteristic of *The Simpsons* rather than *Futurama*?), this section looks for what is distinctive in the vocabulary of an individual TV show relative to the four other shows in the dataset. This is, by nature, much more open-ended: the goal is to identify major themes that uniquely characterize the writing of an individual show.

The following code trains all five word2vec models and then displays the 10 most similar words to the words "girl", "science" and "family".



In [None]:
## importing packages
import gensim
from nltk.tokenize import sent_tokenize
from nltk.tokenize.treebank import TreebankWordTokenizer
import nltk
nltk.download('punkt')
## South Park
base_dir = "./SouthPark_Data/" 

all_docs = [] 

docs = os.listdir(base_dir) # get a list of all the files in the directory

for doc in docs: # iterate through the docs
    if not doc.startswith('.'): # skip hidden files such as .DS_Store
        with open(base_dir + doc, "r", encoding="utf-8") as file: # read as UTF-8 regardless of the platform default
            text = file.read() # read in the file as a single text string
            all_docs.append(text) # append it to the all_docs list

tokenizer = TreebankWordTokenizer()

# Get titles
directory = "./SouthPark_Data/"
files = glob.glob(f"{directory}/*.txt")
sp_titles = [Path(file).stem for file in files]

# and the function
def make_sentences(list_txt):
    """Lowercase each transcript, split it into sentences, and tokenize each sentence."""
    all_txt = []
    for txt in list_txt:
        lower_txt = txt.lower()
        sentences = sent_tokenize(lower_txt)
        sentences = [tokenizer.tokenize(sent) for sent in sentences]
        all_txt += sentences
    return all_txt

sentences = make_sentences(all_docs)

# trains our model!
sp_model = gensim.models.Word2Vec(
    sentences,
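    min_count=5, # default is 5; drops words that appear fewer than 5 times
    size=200, # dimensionality of the word vectors (default 100); renamed vector_size in gensim 4.0
    workers=5) # number of worker threads for parallel training; needs Cython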
sp_model.save('sp_model') ## saves the model

## Simpsons


base_dir = "./Simpsons_Data/" 

all_docs = [] # our list which will store the text of each doc; empty for now

docs = os.listdir(base_dir) # get a list of all the files in the directory

for doc in docs: # iterate through the docs
    if not doc.startswith('.'): # skip hidden files such as .DS_Store
        with open(base_dir + doc, "r", encoding="utf-8") as file: # read as UTF-8 regardless of the platform default
            text = file.read() # read in the file as a single text string
            all_docs.append(text) # append it to the all_docs list

# need our handy nltk tokenizer 
tokenizer = TreebankWordTokenizer()

# and we'll get titles
directory = "./Simpsons_Data/"
files = glob.glob(f"{directory}/*.txt")
simp_titles = [Path(file).stem for file in files]

# make_sentences, defined in the South Park section above, is reused here

sentences = make_sentences(all_docs)

# let's train our model!
simp_model = gensim.models.Word2Vec(
    sentences,
    min_count=5, # default is 5; drops words that appear fewer than 5 times
    size=200, # dimensionality of the word vectors (default 100); renamed vector_size in gensim 4.0
    workers=5) # number of worker threads for parallel training; needs Cython
simp_model.save('simp_model')



## Family Guy
# make_sentences, defined above, is reused by the helper below
def createW2VSentences(dire):
    """Read every transcript in `dire` and return its tokenized sentences."""
    all_docs = [] # stores the text of each transcript

    docs = os.listdir(dire) # get a list of all the files in the directory

    for doc in docs: # iterate through the docs
        if not doc.startswith('.'): # skip hidden files such as .DS_Store
            with open(dire + doc, "r", encoding="utf-8") as file: # read as UTF-8 regardless of the platform default
                text = file.read() # read in the file as a single text string
                all_docs.append(text) # append it to the all_docs list

    # make_sentences relies on the module-level tokenizer created earlier
    sentences = make_sentences(all_docs)
    return sentences

# let's train our model!
famGuyDir = "./famGuy_Data/"
sentences = createW2VSentences(famGuyDir)

famGuy_model = gensim.models.Word2Vec(
    sentences,
    min_count=5, # default is 5; drops words that appear fewer than 5 times
    size=200, # dimensionality of the word vectors (default 100); renamed vector_size in gensim 4.0
    workers=5) # number of worker threads for parallel training; needs Cython
famGuy_model.save('famGuy_model')

## American Dad
amDadDir = "./amDad_Data/"
sentences = createW2VSentences(amDadDir)

amDad_model = gensim.models.Word2Vec(
    sentences,
    min_count=5, # default is 5; drops words that appear fewer than 5 times
    size=200, # dimensionality of the word vectors (default 100); renamed vector_size in gensim 4.0
    workers=5) # number of worker threads for parallel training; needs Cython
amDad_model.save('amDad_model')

## Futurama
futurDir = "./Futurama_Data/"
sentences = createW2VSentences(futurDir)

futur_model = gensim.models.Word2Vec(
    sentences,
    min_count=5, # default is 5; drops words that appear fewer than 5 times
    size=200, # dimensionality of the word vectors (default 100); renamed vector_size in gensim 4.0
    workers=5) # number of worker threads for parallel training; needs Cython
futur_model.save('futur_model')

In [23]:
def print_similar_words(word):
    print("Similar words for the word: " + word + "\n")
    print("Futurama model: \n" + "")
    print(futur_model.wv.most_similar(word, topn=10))
    print("\n")
    print("American Dad model: \n")
    print(amDad_model.wv.most_similar(word, topn=10))
    print("\n")
    print("Family Guy model: \n")
    print(famGuy_model.wv.most_similar(word, topn=10))
    print("\n")
    print("Simpsons model: \n")
    print(simp_model.wv.most_similar(word, topn=10))
    print("\n")
    print("South Park model: \n")
    print(sp_model.wv.most_similar(word, topn=10))
print_similar_words("girl")

Similar words for the word: girl

Futurama model: 

[('murder', 0.9974381923675537), ('information', 0.9971227645874023), ('clear', 0.9971027374267578), ('lying', 0.9970874786376953), ('kinda', 0.9968289136886597), ('heat', 0.9967789649963379), ('xmas', 0.9966915249824524), ('worse', 0.9965395927429199), ('acting', 0.996530294418335), ('hungry', 0.9964852333068848)]


American Dad model: 

[('kid', 0.866576075553894), ('dog', 0.8120333552360535), ('joke', 0.8057672381401062), ('alien', 0.7916181683540344), ('movie', 0.7905248403549194), ('little', 0.7783411741256714), ('lady', 0.7713030576705933), ('guy', 0.7710555791854858), ('mistake', 0.7640916109085083), ('person', 0.7620455026626587)]


Family Guy model: 

[('kid', 0.8343309164047241), ('woman', 0.8112009763717651), ('lady', 0.8006547093391418), ('dog', 0.7698657512664795), ('chick', 0.7621921896934509), ('person', 0.7526925802230835), ('guy', 0.7453711032867432), ('joke', 0.7126767635345459), ('big', 0.6856566667556763), ('man', 

In [24]:
print_similar_words("science")

Similar words for the word: science

Futurama model: 

[('cards', 0.9975329041481018), ('fall', 0.9974583387374878), ('rescue', 0.9970595836639404), ('penny', 0.9970381855964661), ('respect', 0.9968644976615906), ('bath', 0.9966974258422852), ('lover', 0.9966610670089722), ('animals', 0.9965253472328186), ('minds', 0.9964526891708374), ('making', 0.9964302778244019)]


American Dad model: 

[('kit', 0.9689302444458008), ('annual', 0.9665029048919678), ('glory', 0.963065505027771), ('spare', 0.9608169198036194), ('level', 0.9607637524604797), ('ending', 0.9607020616531372), ('fitness', 0.9603885412216187), ('pie', 0.9584792852401733), ('commitment', 0.9575470685958862), ('mount', 0.9571338295936584)]


Family Guy model: 

[('papers', 0.8639549612998962), ('fecal', 0.8612370491027832), ('jeep', 0.8589364290237427), ('hippie', 0.8526805639266968), ('flu', 0.849153995513916), ('pretzel', 0.8460206985473633), ('whassat', 0.8455643653869629), ('theft', 0.8438236713409424), ('wires', 0.842278

In [25]:
print_similar_words("family")

Similar words for the word: family

Futurama model: 

[('surprise', 0.9983640909194946), ('national', 0.9979523420333862), ('date', 0.9979090690612793), ('liquor', 0.9978708028793335), ('died', 0.9978538751602173), ('white', 0.997847318649292), ('filthy', 0.9978205561637878), ('private', 0.9977128505706787), ('inside', 0.9976363182067871), ('third', 0.9975912570953369)]


American Dad model: 

[('father', 0.8370622992515564), ('life', 0.8351610898971558), ('party', 0.8196209669113159), ('mother', 0.8146016597747803), ('hair', 0.8101903200149536), ('husband', 0.798795759677887), ('house', 0.7953490018844604), ('wife', 0.7951940894126892), ('friend', 0.7911969423294067), ('lady', 0.7898155450820923)]


Family Guy model: 

[('story', 0.5983825922012329), ('dog', 0.5799961090087891), ('new', 0.5631469488143921), ('book', 0.5606899261474609), ('life', 0.5385695099830627), ('guest', 0.5376495122909546), ('body', 0.5346227288246155), ('star', 0.532819390296936), ('baby', 0.529739499092102), (

### Results 
#### Classification: *American Dad!*-*Family Guy* Model
The model classifying *Family Guy* and *American Dad!* episodes is accurate around 89% to 92% of the time. This is the lowest accuracy among the three models, reflecting the similar writing styles and themes of the two shows. The words most important to classifying an episode as *American Dad!* are words like "alien" or "CIA", while the words most important to classifying an episode as *Family Guy* are words such as "white" or "fat".

While both shows center around dysfunctional families with a boisterous patriarch, these results give insight into the shows' different themes. *Family Guy* is more likely to involve topics pertaining to race or to comment on the weight of a character. *American Dad!*, on the other hand, has an alien as one of the main characters and uses decisions made by the CIA to move the plot of individual episodes forward.

I think the biggest insight this section provides, however, comes from the classification model's relative inability to differentiate between the two shows. This parallels a theme that I have seen in commentary on animated adult comedies: after the success of *Family Guy*, many shows attempted to mimic its art style and humor. These similar shows include *Brickleberry*, *Paradise PD*, and *The Cleveland Show*. *American Dad!* was created by Seth MacFarlane, who also created *Family Guy*. Unlike Matt Groening, the creator of *The Simpsons* and *Futurama*, he largely remained within the thematic structure that he had already set up in *Family Guy*, which could be why the classification model has a lower rate of accuracy.



#### Classification: *Futurama*-*The Simpsons* Model
The classification model differentiating between *Futurama* and *The Simpsons* is more accurate than the previous model, with a success rate of around 94%. This is likely because the themes of the two shows are radically different despite their similar writing styles, spearheaded by Matt Groening and David X. Cohen.

Because `feat_pval_weight` fits the regression with *The Simpsons* as the positive class, negative weights point toward *Futurama* and positive weights toward *The Simpsons*. The words with the most negative logistic regression weights, indicating that they are significant to the model classifying an episode as *Futurama*, are words such as "professor", "robot" and "planet". The words with the highest positive weights, indicating that they are significant to the model classifying an episode as *The Simpsons*, are words such as "dad", "kids" and "school". The *Futurama* words reflect the science-fiction themes of the show, while *The Simpsons* words reflect its familial themes. On screen, the characters in the two shows behave very similarly: Professor Farnsworth in *Futurama* resembles Professor John Frink in *The Simpsons*, and Fry's antics in *Futurama* mimic those of Bart and Homer in *The Simpsons*. Despite these shared character traits, there are consistent thematic differences that allow the model to classify episodes reliably.
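
To see why the sign of a weight depends on which show is coded as the positive class, the sketch below fits a tiny one-feature logistic regression twice on the same made-up word counts, once with *Futurama* as the positive class and once with *The Simpsons*. It is an illustration under stated assumptions: the counts are hypothetical, and the fit uses plain unregularized gradient descent rather than the L1-penalized scikit-learn model used in this notebook.

```python
import math

def fit_logistic(xs, ys, lr=0.1, steps=2000):
    """Fit a one-feature logistic regression by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # predicted P(y = 1)
            grad_w += (p - y) * x
            grad_b += (p - y)
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

# Hypothetical counts of the word "robot" in eight episodes
robot_counts = [0, 1, 0, 3, 2, 5, 7, 6]
is_futurama = [0, 0, 0, 0, 1, 1, 1, 1]      # 1 = Futurama, 0 = The Simpsons
is_simpsons = [1 - y for y in is_futurama]  # same episodes, labels flipped

w_futurama_positive, _ = fit_logistic(robot_counts, is_futurama)
w_simpsons_positive, _ = fit_logistic(robot_counts, is_simpsons)
print(f"{w_futurama_positive:.3f} {w_simpsons_positive:.3f}")
```

The two weights have the same magnitude and opposite signs, so a table of weights can only be read once it is clear which show was the positive class in the fit.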

An episode that is repeatedly misidentified as a *Futurama* episode is the season 23 episode "Them, Robot". This is one of the few episodes of *The Simpsons* that bends genre from a family sitcom into a science-fiction narrative. Its themes of robotics and capitalist exploitation mirror recurring language and themes in *Futurama*. This misidentification helps pin down what separates the two shows: the science-fiction genre of *Futurama* versus the family sitcom genre of *The Simpsons*.
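
Because the classification code draws a fresh random sample of *The Simpsons* episodes on each run (`df_simp.sample(n=noF)`), the set of misclassified episodes can change from run to run. One possible way to make "repeatedly misidentified" concrete is to collect the misclassified titles from several runs and keep those that recur. The sketch below assumes such lists have already been gathered; the titles in `runs` are hypothetical placeholders, not actual results.

```python
from collections import Counter

def repeated_errors(runs, min_fraction=0.5):
    """Return titles misclassified in at least `min_fraction` of the runs."""
    # set(run) ensures a title is counted at most once per run
    counts = Counter(title for run in runs for title in set(run))
    threshold = min_fraction * len(runs)
    return sorted(title for title, n in counts.items() if n >= threshold)

# Hypothetical misclassified titles from four separate runs
runs = [
    ["simp_episode_a", "fut_episode_x"],
    ["simp_episode_a"],
    ["simp_episode_a", "simp_episode_b"],
    ["fut_episode_y"],
]

stable = repeated_errors(runs, min_fraction=0.5)
print(stable)
```

An episode from the downsampled show can only be misclassified in runs where it was sampled, so counts for *The Simpsons* episodes are a lower bound.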

#### Classification: *South Park*-*The Simpsons* Model
The *South Park*-*Simpsons* model is by far the most accurate, and this is largely expected given they are very different shows. *South Park* is raunchier and directly satirizes political and social figures, while *The Simpsons* centers around a family and has a plot that is less directly character driven compared to *South Park*.

Because the regression in `feat_pval_weight` is fit with *The Simpsons* as the positive class, the most negative weights point toward *South Park*. The words with the most negative logistic regression weights, indicating importance to classifying an episode as *South Park*, are words like "dude" or are swear words, while the words with the highest positive weights are words like "dad", "kids" or "baby". These words, which are key to identifying an episode as a *Simpsons* episode, again reflect the familial themes of the show. Even though this familial structure may not be the core plot of many episodes, it nevertheless undergirds the show and its plots.

An episode that is repeatedly misidentified is the season 2 episode "Terrance and Phillip in Not Without My Anus", which centers around Cartman finding his father. This theme of fatherhood is not present in many *South Park* episodes, but it is very reminiscent of the familial themes of *The Simpsons*, which is why the classification model misinterprets the text of the episode as *The Simpsons*. Interestingly, no episodes from *The Simpsons* are repeatedly misidentified, which is indicative of the rigidity of the familial and thematic structure of *The Simpsons* when compared with the diversity of themes in *South Park*.


#### Word2Vec
The word2vec models are frequently indistinguishable when looking at the ten most similar words to common words such as "man", but there are notable differences that highlight the differing underlying themes and content of each show. These differences in how words are used can also give insight into the subconscious decisions that the writers make. It is worth noting that the models come out slightly different every time they are trained, so the similar words can vary, but because of the size of the datasets, the output is generally similar on each run.
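
One hedged way to check how much the lists of similar words move between training runs is the Jaccard overlap of two lists: the number of shared words divided by the number of distinct words across both. The sketch below is a plain-Python illustration; `run_1` and `run_2` are illustrative word lists, not output from two actual training runs.

```python
def jaccard(list_a, list_b):
    """Shared words divided by all distinct words across both lists."""
    a, b = set(list_a), set(list_b)
    if not a and not b:
        return 1.0  # two empty lists are treated as identical
    return len(a & b) / len(a | b)

# Illustrative stand-ins for the words returned by most_similar on two runs
run_1 = ["kid", "woman", "lady", "dog", "person"]
run_2 = ["kid", "lady", "chick", "person", "guy"]

overlap = jaccard(run_1, run_2)
print(f"Jaccard overlap: {overlap:.2f}")
```

A value near 1 means the two runs agree on almost all neighbors, while a value near 0 means the lists are mostly different. To my understanding, gensim training can also be made more repeatable by passing a fixed `seed` and `workers=1`, at the cost of slower training.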

First, the words "girl" and "woman" have radically different similar words across the TV shows. For *Family Guy*, *American Dad!* and *The Simpsons*, the similar words are all familial in nature, including words like "kid", but the similar words for *South Park* and *Futurama* do not include these familial themes. *South Park*'s similar words for "girl" include "penis", "vagina", "turd" and "dog". This is a clear outlier and gives a glimpse of a theme in the show: women and girls are frequently brought into the plot only in a sexual context. While this is likely not a conscious choice by the writers, it is nevertheless alarming and points to subconscious decisions that limit the role of female characters in the show. *Futurama*'s similar words to "girl" are "pet", "swamped" and "damned". This is a result of Leela, the female protagonist of the show, who is a "sewer mutant" and constantly has her pet nearby. Despite both Matt Groening and David X. Cohen being heavily involved with both *Futurama* and *The Simpsons*, there are clear differences in the use of female characters between the two shows. This contrasts with the shows created by Seth MacFarlane, *American Dad!* and *Family Guy*, which have nearly identical similar words across the board. This again shows the thematic range between *Futurama* and *The Simpsons* when compared with the range of *American Dad!* and *Family Guy*.

Second, the word "science" also has interesting results. *South Park*'s similar words to "science" are "Christian" and "comedy". After seeing these words, I watched the episode "Go God Go", where "science" is frequently mentioned in opposition to Christianity and religion. This oppositional relationship between science and religion is not mirrored in the other shows in the dataset: "science" is similar to "progress" in the *Family Guy* model and to "modern" or "mysterious" in *The Simpsons* model.

Finally, the words "dad" and "family" are both highly weighted in the logistic regression for both the *Futurama*-*The Simpsons* and the *South Park*-*The Simpsons* classification models. Because of this high weighting, I investigated the similar words to both "dad" and "family" in the word2vec models of all five shows. Interestingly, every show except *Futurama* has nearly identical similar words for "dad" and "family", including "life", "marriage", "mother", "child" and "friends". *Futurama* breaks from the pack, with similar words like "store", "date", "liquor", "crew" and "class". This is again computational evidence of Matt Groening's and David X. Cohen's different writing in *Futurama* versus *The Simpsons*, but here it also suggests that the vocabulary and themes of *Futurama* break from the genre of animated adult comedy as a whole.

Almost every show in the genre is inundated with familial themes; a show that breaks from them is the exception rather than the rule. Before googling a full list of shows in the genre, my dad, my brother and I tried to think of popular animated adult comedies that do not center around a familial unit or do not use a familial relationship to move the plot of many episodes across the show's run. The only examples we could think of were *Futurama* and *BoJack Horseman*. I know many episodes of season 4 of *BoJack Horseman* center around BoJack's mother and his childhood, but compared to shows like *Archer*, the familial themes of the show are much more sporadic. Of course, this is anecdotal evidence based on three men with similar TV viewing habits, but it is nevertheless interesting that *Futurama* breaks from this convention despite Matt Groening having success with a family-based show in *The Simpsons*.

After researching this further, I found an interesting [analysis](https://justtv.files.wordpress.com/2007/03/mittell_simpsons.pdf) written by Jason Mittell titled "Cartoon Realism: Genre Mixing and the Cultural Life of *The Simpsons*". In this paper, Mittell argues that using a familial unit as the basis for a TV comedy is relatively generic and uncontroversial. This gives a show wider appeal and makes it more likely to succeed, especially on network television. Following this thread shows how word2vec can prompt close reading and deeper investigation of themes that may be overlooked when watching a show.


### Conclusion
This project adds to the ongoing discussion about adult animated comedy shows and aims to clarify the distinctions between the most popular shows in the genre over the last three decades. However, there are limitations to the analysis presented. First, as I discussed in the data section above, a few shows, including *Archer* and *King of the Hill*, are not included in the dataset due to copyright claims filed by the shows' production studios. Second, for shows that are still releasing new episodes, the most recent season does not have transcripts available. For example, "The Pandemic Special" of *South Park* does not have an available transcript because it was released recently.

Analyzing Matt Groening's and Seth MacFarlane's writing across different shows was intriguing, and going forward I would like to see how the same comedy writers change over time or write differently when writing for two shows that are running at the same time. Additionally, similar analyses could be done for other TV genres or subgenres. For example, one could compare the writing of multi-cam sitcoms such as *The Big Bang Theory* and *2 Broke Girls* with that of single-cam sitcoms such as *Modern Family* and *Silicon Valley*.