# Differentiating the Humor of Long-Running Animated Adult Comedy TV Shows
#### Ben Jablonski
#### 11-28-2020


In [22]:
from IPython.display import HTML
HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
The raw code for this notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')



### Introduction
Since the emergence of network television, comedy has been a perennial TV genre, but the writing and themes of different shows vary enormously. Following the success of *The Simpsons* in the early 1990s, animated adult comedy shows gained a foothold in the cultural psyche. The launch of Adult Swim in 2001 gave adult animation a dedicated schedule and network, and streaming services like Netflix and Hulu have cemented funding for new animation projects. Over the last decade, adult animation has moved from a niche genre into the mainstream, with shows like *Rick and Morty* becoming cultural sensations. For a more thorough history of adult animation, this article from [*Time*](https://time.com/5752400/adult-animation-golden-age/) is a fantastic resource. 

I wanted to test whether computational techniques such as classification and word2vec models can differentiate between the writing and themes of various animated comedy TV shows. Specifically, which words, themes, and relationships between characters define each show and separate it from the others? These methods can potentially reveal subtextual themes that may not be clear to a casual viewer and could elucidate choices made by a show's writers, whether those choices are conscious or subconscious. Because animated comedy shows are increasingly popular, differentiating and understanding their writing and themes serves an important cultural role by complementing the experience of watching them. These computational techniques let a viewer investigate relationships that previously seemed banal, and exposing these relationships can provide a critical acuity that enables a richer understanding of a TV show and of the genre as a whole. 


### Relevant Studies 
There are two types of studies that are relevant to this analysis: some are data based, while others are more anecdotal and focus on the philosophical themes of TV shows. For example, for the TV show *South Park*, there are 45 podcast episodes and 12 YouTube videos on the YouTube channel [Wisecrack](https://www.youtube.com/channel/UC6-ymYjG0SU0jUWnWh9ZzEQ) alone. These analyses fall into the second category, with most of the crew of Wisecrack being philosophy graduate students or former philosophy professors. One example is the podcast episode on the "Band in China" *South Park* episode, which discusses international copyright issues and censorship as financial blackmail from China. There are various other studies on YouTube or sites like Medium, but most of them focus on individual episodes or individual TV shows rather than on the genre of animated comedy as a whole. 

The other category of studies is more data science based. These studies focus on individual TV shows and use the scripts of every episode. Some examples include "[Going Down to South Park -- Text Analysis with R](https://medium.com/@vertabeloacdm/going-down-to-south-park-text-analysis-with-r-61e8beef6851)" by Patrik Drhlik, "[The Simpsons meets Data Visualization](https://towardsdatascience.com/the-simpsons-meets-data-visualization-ef8ef0819d13)" by Adam Reevesman, "[Visualizing Archer](https://medium.com/@Elijah_Meeks/visualizing-archer-bcb80e319625)" by Elijah Meeks, and "[Futurama: Bender's NLP](https://towardsdatascience.com/futarama-benders-nlp-775c47871ad5)" by Isaac Kim. However, many of these studies count individual characters' lines and compare those counts between seasons, and I find the method of simply counting characters' lines to be relatively unfulfilling. For example, *The Simpsons* study above finds that Homer and Marge talk to each other more than any other character pairing and that Flanders' lines have the most positive sentiment. Neither of these conclusions provides particular insight into the characters or the themes of the TV show. 

Another study that I wanted to specifically highlight is ["Forecasting the Success of Television Series using Machine Learning"](https://arxiv.org/pdf/1910.12589.pdf) by Ramya Akula, Zachary Wieselthier, Laura Martin and Ivan Garibay from the University of Central Florida. Their dataset includes mostly live action sitcoms, but I found their methods and analysis to be interesting. They use clustering and machine learning forecast models to predict the rating of an episode based on how many lines individual characters have. I found this relevant because they find statistically significant relationships between IMDb rating and a character's lines for many of the shows, but they nevertheless acknowledge the many limitations of analyzing TV show success, including varying demographics, marketing budgets and radically changing consumer preferences. I wanted to bring this sentiment into this project. The medium of television is always in flux, with themes and writing constantly changing, so many conclusions about TV genres are temporally provisional. 

Additionally, I could not find studies that compare different animated adult comedy TV shows.





### Data 
It is difficult to be unbiased in deciding which TV shows to include in the analysis, but fewer than a dozen animated adult comedy shows have run for more than 5 seasons. I wanted to select TV shows with at least 60 to 70 episodes so that each show would have a large amount of text to analyze. Given these conditions, I selected the following TV shows: *South Park*, *The Simpsons*, *Futurama*, *American Dad!*, and *Family Guy*. Other shows that met the 70-episode threshold but were not selected include *Archer* and *King of the Hill*: the production studios for these shows had filed DMCA takedowns of publicly available transcripts, so they are not included. Other shows, such as *Bob's Burgers* and *Aqua Teen Hunger Force*, did not have complete transcripts of every episode online. 

There are five datasets, one for each TV show listed above, and each row of the dataset represents one line spoken in an episode of a given TV show. Therefore, a row includes the line spoken, the character speaking the line, the episode in which the line is spoken, and the season in which the episode occurs. 
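
As a rough illustration of this row layout, the sketch below groups line-level rows into one text per episode using only the standard library. The field names (`season`, `episode`, `character`, `line`) and the sample rows are placeholders, not the actual column names or contents of the five datasets, which differ from show to show.

```python
from collections import defaultdict

# Hypothetical rows: one spoken line each, tagged with speaker, episode and season.
rows = [
    {"season": 1, "episode": 1, "character": "A", "line": "Hello there."},
    {"season": 1, "episode": 1, "character": "B", "line": "Hi!"},
    {"season": 1, "episode": 2, "character": "A", "line": "New episode."},
]

def episode_texts(rows):
    """Join every spoken line of an episode into one newline-separated string."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["season"], row["episode"])].append(row["line"])
    return {key: "\n".join(lines) for key, lines in grouped.items()}

texts = episode_texts(rows)
print(texts[(1, 1)])
```

Joining with a newline keeps the last word of one line from running into the first word of the next, which matters once the text is split into sentences and tokens.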

While I have tried to fill in missing episode transcripts in order to have every episode represented in the datasets, there are some exceptions such as the *South Park* episodes "200" and "201" which have been banned for their depiction of the Islamic prophet Muhammad. Overall, these gaps in the dataset are relatively limited and do not significantly affect the accuracy of the conclusions presented. 

Additionally, all of these datasets can be found and downloaded on my [GitHub repo](https://github.com/bjablonski20/final-project-qtm340) along with all code used for the analysis in order to enable replication of the results below.

The *Futurama*, *South Park*, and *The Simpsons* datasets were created by fans of the shows and shared on Kaggle and GitHub, while I built the *Family Guy* and *American Dad!* datasets myself by web scraping with Beautiful Soup. 

The following code loads all data needed for this project from GitHub. 

In [None]:

## importing
import re
import os
import glob
from pathlib import Path
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import little_mallet_wrapper
import seaborn
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer
from scipy.stats import pearsonr, norm

## VADER needs its lexicon downloaded before the analyzer can be constructed
nltk.download('vader_lexicon')
sid = SentimentIntensityAnalyzer()

## South Park 
### retrieve csv from my github repo
if not os.path.exists('sp_ep_data.csv'):
    !wget https://raw.githubusercontent.com/bjablonski20/final-project-qtm340/main/Transcript%20CSVs/sp_ep_data.csv
### convert into a pandas dataframe
### (error_bad_lines was deprecated in pandas 1.3 in favor of on_bad_lines='skip')
sp_ep_data = pd.read_csv('sp_ep_data.csv', error_bad_lines=False)
sp_ep_data = sp_ep_data.drop(['episode_link', 'season_link', 'season_name'], axis=1)
### split dataframe into text files for each episode
os.makedirs("SouthPark_Data", exist_ok=True)
for season in range(1, sp_ep_data['season_number'].max() + 1): ## season numbers
    season_rows = sp_ep_data[sp_ep_data['season_number'] == season]
    ## + 1 so that the last episode of each season is included
    for episode in range(1, season_rows['season_episode_number'].max() + 1):
        filename = "sp_ep" + str(season) + "_" + str(episode) + "_.txt"
        path = "SouthPark_Data/" + filename
        with open(path, "w") as file:
            file.writelines(season_rows[season_rows['season_episode_number'] == episode]['text'])
       

## The Simpsons

### retrieve csv from my github repo
if not os.path.exists('simpsons_script_lines.csv'):
    !wget https://raw.githubusercontent.com/bjablonski20/final-project-qtm340/main/Transcript%20CSVs/simpsons_script_lines.csv
### convert into a pandas dataframe
simp_ep_data = pd.read_csv ('simpsons_script_lines.csv',error_bad_lines=False)
### Adding season column to simpsons dataset -- i know this is unwieldy but i decided to brute force it 
simp_ep_data['season'] = np.where(simp_ep_data['episode_id'] < (14),1,0)
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (14+22))&(simp_ep_data['episode_id'] >= (14)),2,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (14+22+24))&((simp_ep_data['episode_id'] >= (14+22))),3,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (14+22+24+22))&(simp_ep_data['episode_id'] >= (14+22+24)),4,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (14+22+24+22+22))&(simp_ep_data['episode_id'] >= (14+22+24+22)),5,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (14+22+24+22+22+25))&(simp_ep_data['episode_id'] >= (14+22+24+22+22)),6,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (14+22+24+22+22+25+25))&(simp_ep_data['episode_id'] >= (14+22+24+22+22+25)),7,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (14+22+24+22+22+25+25+25))&(simp_ep_data['episode_id'] >= (14+22+24+22+22+25+25)),8,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (14+22+24+22+22+25+25+25+25))&(simp_ep_data['episode_id'] >= (14+22+24+22+22+25+25+25)),9,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (14+22+24+22+22+25+25+25+25+23))&(simp_ep_data['episode_id'] >= (14+22+24+22+22+25+25+25+25)),10,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (14+22+24+22+22+25+25+25+25+23+22))&(simp_ep_data['episode_id'] >= (14+22+24+22+22+25+25+25+25+23)),11,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (14+22+24+22+22+25+25+25+25+23+22+21))&(simp_ep_data['episode_id'] >= (14+22+24+22+22+25+25+25+25+23+22)),12,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (14+22+24+22+22+25+25+25+25+23+22+21+22))&(simp_ep_data['episode_id'] >= (14+22+24+22+22+25+25+25+25+23+22+21)),13,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (14+22+24+22+22+25+25+25+25+23+22+21+22+22))&(simp_ep_data['episode_id'] >= (14+22+24+22+22+25+25+25+25+23+22+21+22)),14,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (14+22+24+22+22+25+25+25+25+23+22+21+22+22+22))&(simp_ep_data['episode_id'] >= (14+22+24+22+22+25+25+25+25+23+22+21+22+22)),15,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (14+22+24+22+22+25+25+25+25+23+22+21+22+22+22+21))&(simp_ep_data['episode_id'] >= (14+22+24+22+22+25+25+25+25+23+22+21+22+22+22)),16,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (14+22+24+22+22+25+25+25+25+23+22+21+22+22+22+21+22))&(simp_ep_data['episode_id'] >= (14+22+24+22+22+25+25+25+25+23+22+21+22+22+22+21)),17,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (401))&(simp_ep_data['episode_id'] >= (14+22+24+22+22+25+25+25+25+23+22+21+22+22+22+21+22)),18,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (401+20))&(simp_ep_data['episode_id'] >= (401)),19,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (401+20+21))&(simp_ep_data['episode_id'] >= (401+20)),20,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (401+20+21+23))&(simp_ep_data['episode_id'] >= (401+20+21)),21,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (401+20+21+23+22))&(simp_ep_data['episode_id'] >= (401+20+21+23)),22,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (401+20+21+23+22+22))&(simp_ep_data['episode_id'] >= (401+20+21+23+22)),23,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (401+20+21+23+22+22+22))&(simp_ep_data['episode_id'] >= (401+20+21+23+22+22)),24,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] < (401+20+21+23+22+22+22+22))&(simp_ep_data['episode_id'] >= (401+20+21+23+22+22+22)),25,simp_ep_data['season'])
simp_ep_data['season'] = np.where((simp_ep_data['episode_id'] >= (401+20+21+23+22+22+22+22)),26,simp_ep_data['season'])
### Split dataframe into individual text files for each episode
os.makedirs("Simpsons_Data", exist_ok=True)
for season in range(1,simp_ep_data['season'].max()+1): ## season numbers
    season_ids = simp_ep_data[simp_ep_data['season'] == season]['episode_id']
    ## skip seasons with no rows, where min() and max() would be NaN
    if season_ids.empty:
        continue
    ## + 1 so that the last episode of each season is included
    for episode in range(int(season_ids.min()), int(season_ids.max()) + 1):
        filename = "simp_ep" + str(season) + "_" + str(episode)+"_.txt"
        path = "Simpsons_Data/" + filename
        with open(path, "w") as file:
            file.writelines(simp_ep_data[(simp_ep_data['season'] == season) & (simp_ep_data['episode_id'] == episode)]['spoken_words'])
### these files were empty 
os.remove("./Simpsons_Data/simp_ep21_447_.txt")
os.remove("./Simpsons_Data/simp_ep20_424_.txt")
os.remove("./Simpsons_Data/simp_ep25_550_.txt")
                                  
    


## Futurama 
### retrieve csv from my github repo
if not os.path.exists('futurama_ep_data.csv'):
    !wget https://raw.githubusercontent.com/bjablonski20/final-project-qtm340/main/Transcript%20CSVs/futurama_ep_data.csv
futur_ep_data = pd.read_csv ('futurama_ep_data.csv',error_bad_lines=False)

### adds episode column: a running counter that increases by one whenever the
### episode title changes from one row to the next (assumes rows are in episode order)
episode_changed = futur_ep_data['Episode'] != futur_ep_data['Episode'].shift()
futur_ep_data['Episode Number'] = episode_changed.cumsum()

### Splits the dataframe into individual text files for each episode 
os.makedirs("Futurama_Data", exist_ok=True)
for season in range(1,futur_ep_data['Season'].max()+1): ## season numbers
    ## 'Episode Number' counts across the whole series, so most season/episode pairs
    ## match no rows and produce empty files, which are removed below
    ## + 1 so that the last episode of each season is included
    for episode in range(1, futur_ep_data[futur_ep_data['Season'] == season]['Episode Number'].max() + 1):
        filename = "futur_ep" + str(season) + "_" + str(episode)+"_.txt"
        path = "Futurama_Data/" + filename
        with open(path, "w") as file:
            try:
                file.writelines(futur_ep_data[(futur_ep_data['Season'] == season) & (futur_ep_data['Episode Number'] == episode)]['Line'])
            except TypeError:
                break
### removes all empty episodes 
from os import listdir
from os.path import isfile, join
mypath = "Futurama_Data/"
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
for file in onlyfiles:
    if os.path.getsize(mypath + file) == 0:
        os.remove(mypath + file)     


    



In [24]:
## Web Scraping 

### Beautiful Soup functions that create a pandas DataFrame with the episode, speaker and line for a series
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests

def grab_urls(soup):
    ## collect the link to every episode listed on the series index page
    episode_urls = []
    for season in soup.find_all('div', style="column-count:2"):
        for episode in season.find_all('a'):
            if 'href' in episode.attrs:
                episode_urls.append(episode.attrs['href'])
    return episode_urls

def scripts_from_html(html):
    index_page = requests.get(html, headers={'User-Agent': 'Mozilla/5.0'})
    urls = grab_urls(BeautifulSoup(index_page.content, 'html.parser'))
    texts = []
    characters = []
    episode = []
    ep_no = 1
    character = "NONE" ## speaker label used until the first bolded name is seen
    for url in urls:
        ## urljoin avoids a doubled slash when the link already starts with "/"
        episode_page = requests.get(urljoin("https://transcripts.fandom.com/", url))
        soup = BeautifulSoup(episode_page.text, 'html.parser')
        for paragraph in soup.find_all("p"):
            text1 = str(paragraph)
            if "<p>" in text1:
                text1 = text1[text1.index("<p>")+len("<p>"):text1.index("</p>")]
            episode.append(ep_no)
            ## a bolded name starts a new speaker; otherwise the previous speaker continues
            if "</b>" in str(paragraph):
                text2 = str(paragraph)
                character = text2[text2.index("<b>")+len("<b>"):text2.index("</b>")]
            characters.append(character)
            if "</b>" in text1:
                text1 = text1[text1.index("</b>")+len("</b>"):len(text1)]
            texts.append(text1)
        ep_no = ep_no + 1

    d = {'Character': characters,'Episode': episode, 'Line': texts}
    return pd.DataFrame(d)
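
To make the speaker-tracking logic in `scripts_from_html` easier to follow, here is a small standalone sketch of the same idea using only regular expressions. The sample paragraphs, speaker names and the helper name `split_speaker_and_line` are invented for illustration; the real transcript pages may be marked up differently, and unlike the function above this sketch also strips surrounding whitespace.

```python
import re

# Hypothetical paragraph markup in the style the scraper expects.
paragraphs = [
    "<p><b>Alice:</b> Good morning.</p>",
    "<p>How are you?</p>",            # no <b> tag: the previous speaker continues
    "<p><b>Bob:</b> Fine.</p>",
]

SPEAKER = re.compile(r"<b>(.*?)</b>")

def split_speaker_and_line(paragraphs):
    """Return (speaker, line) pairs, carrying the last seen speaker forward."""
    rows = []
    speaker = "NONE"  # same default as scripts_from_html
    for html in paragraphs:
        # drop the enclosing <p> ... </p> tags
        body = re.sub(r"^<p>|</p>$", "", html.strip())
        match = SPEAKER.search(body)
        if match:
            speaker = match.group(1)
            body = body[match.end():]  # keep only the text after </b>
        rows.append((speaker, body.strip()))
    return rows

rows = split_speaker_and_line(paragraphs)
for speaker, line in rows:
    print(speaker, line)
```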

In [25]:
## American Dad
html_amDad = "https://transcripts.fandom.com/wiki/American_Dad!"
amDad_ep_data = scripts_from_html(html_amDad)
## adds the season variable -- done manually because i couldnt find season number as a header when parsing through the html 
amDad_ep_data['season'] = np.where(amDad_ep_data['Episode'] < (23),1,0)
amDad_ep_data['season'] = np.where((amDad_ep_data['Episode'] < (23+19))&(amDad_ep_data['Episode'] >= (23)),2,amDad_ep_data['season'])
amDad_ep_data['season'] = np.where((amDad_ep_data['Episode'] < (23+19+16))&(amDad_ep_data['Episode'] >= (23+19)),3,amDad_ep_data['season'])
amDad_ep_data['season'] = np.where((amDad_ep_data['Episode'] < (23+19+16+20))&(amDad_ep_data['Episode'] >= (23+19+16)),4,amDad_ep_data['season'])
amDad_ep_data['season'] = np.where((amDad_ep_data['Episode'] < (23+19+16+20+18))&(amDad_ep_data['Episode'] >= (23+19+16+20)),5,amDad_ep_data['season'])
amDad_ep_data['season'] = np.where((amDad_ep_data['Episode'] < (23+19+16+20+18+19))&(amDad_ep_data['Episode'] >= (23+19+16+20+18)),6,amDad_ep_data['season'])
amDad_ep_data['season'] = np.where((amDad_ep_data['Episode'] < (23+19+16+20+18+19+18))&(amDad_ep_data['Episode'] >= (23+19+16+20+18+19)),7,amDad_ep_data['season'])

amDad_ep_data['season'] = np.where((amDad_ep_data['Episode'] < (133+19))&(amDad_ep_data['Episode'] >= (133)),8,amDad_ep_data['season'])
amDad_ep_data['season'] = np.where((amDad_ep_data['Episode'] < (133+19+20))&(amDad_ep_data['Episode'] >= (133+19)),9,amDad_ep_data['season'])
amDad_ep_data['season'] = np.where((amDad_ep_data['Episode'] < (133+19+20+18))&(amDad_ep_data['Episode'] >= (133+19+20)),10,amDad_ep_data['season'])
amDad_ep_data['season'] = np.where((amDad_ep_data['Episode'] < (133+19+20+18+22))&(amDad_ep_data['Episode'] >= (133+19+20+18)),11,amDad_ep_data['season'])
amDad_ep_data['season'] = np.where((amDad_ep_data['Episode'] < (133+19+20+18+22+22))&(amDad_ep_data['Episode'] >= (133+19+20+18+22)),12,amDad_ep_data['season'])
amDad_ep_data['season'] = np.where((amDad_ep_data['Episode'] < (133+19+20+18+22+22+22))&(amDad_ep_data['Episode'] >= (133+19+20+18+22+22)),13,amDad_ep_data['season'])
amDad_ep_data['season'] = np.where((amDad_ep_data['Episode'] < (133+19+20+18+22+22+22+22))&(amDad_ep_data['Episode'] >= (133+19+20+18+22+22+22)),14,amDad_ep_data['season'])

amDad_ep_data['season'] = np.where((amDad_ep_data['Episode'] < (278+22))&(amDad_ep_data['Episode'] >= (278)),15,amDad_ep_data['season'])

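## create individual .txt docs for each episode
os.makedirs("amDad_Data", exist_ok=True)
for season in range(1,amDad_ep_data['season'].max()+1): ## season numbers
    try:
        ## 'Episode' counts across the whole series, so episode numbers outside the
        ## season produce empty files, which are removed below
        ## + 1 so that the last episode of each season is included
        for episode in range(1, int(amDad_ep_data[amDad_ep_data['season'] == season]['Episode'].max()) + 1):
            filename = "amDad_ep" + str(season) + "_" + str(episode)+"_.txt"
            path = "amDad_Data/" + filename
            with open(path, "w") as file:
                file.writelines(amDad_ep_data[(amDad_ep_data['season'] == season) & (amDad_ep_data['Episode'] == episode)]['Line'])
    except ValueError:
        ## a season with no rows has a NaN maximum; skip it instead of dropping all later seasons
        continue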


## removes all empty episodes 
from os import listdir
from os.path import isfile, join
mypath = "amDad_Data/"
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
for file in onlyfiles:
    if os.path.getsize(mypath + file) == 0:
        os.remove(mypath + file)

In [26]:
## Family Guy
html_famGuy = "https://transcripts.fandom.com/wiki/Family_Guy"
famGuy_ep_data = scripts_from_html(html_famGuy)
famGuy_ep_data['season'] = np.where(famGuy_ep_data['Episode'] < (8),1,0)
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (8+21))&(famGuy_ep_data['Episode'] >= (8)),2,famGuy_ep_data['season'])
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (8+21+22))&(famGuy_ep_data['Episode'] >= (8+21)),3,famGuy_ep_data['season'])
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (8+21+22+30))&(famGuy_ep_data['Episode'] >= (8+21+22)),4,famGuy_ep_data['season'])
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (8+21+22+30+18))&(famGuy_ep_data['Episode'] >= (8+21+22+30)),5,famGuy_ep_data['season'])
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (8+21+22+30+18+12))&(famGuy_ep_data['Episode'] >= (8+21+22+30+18)),6,famGuy_ep_data['season'])
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (8+21+22+30+18+12+16))&(famGuy_ep_data['Episode'] >= (8+21+22+30+18+12)),7,famGuy_ep_data['season'])

famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (112+21))&(famGuy_ep_data['Episode'] >= (112)),8,famGuy_ep_data['season'])
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (112+21+18))&(famGuy_ep_data['Episode'] >= (112+21)),9,famGuy_ep_data['season'])
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (112+21+18+23))&(famGuy_ep_data['Episode'] >= (112+21+18)),10,famGuy_ep_data['season'])
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (112+21+18+23+22))&(famGuy_ep_data['Episode'] >= (112+21+18+23)),11,famGuy_ep_data['season'])
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (112+21+18+23+22+21))&(famGuy_ep_data['Episode'] >= (112+21+18+23+22)),12,famGuy_ep_data['season'])
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (112+21+18+23+22+21+18))&(famGuy_ep_data['Episode'] >= (112+21+18+23+22+21)),13,famGuy_ep_data['season'])
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (112+21+18+23+22+21+18+20))&(famGuy_ep_data['Episode'] >= (112+21+18+23+22+21+18)),14,famGuy_ep_data['season'])

famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (255+20))&(famGuy_ep_data['Episode'] >= (255)),15,famGuy_ep_data['season'])
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (255+20+20))&(famGuy_ep_data['Episode'] >= (255+20)),16,famGuy_ep_data['season'])
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (255+20+20+20))&(famGuy_ep_data['Episode'] >= (255+20+20)),17,famGuy_ep_data['season'])
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (255+20+20+20+20))&(famGuy_ep_data['Episode'] >= (255+20+20+20)),18,famGuy_ep_data['season'])
famGuy_ep_data['season'] = np.where((famGuy_ep_data['Episode'] < (255+20+20+20+20+22))&(famGuy_ep_data['Episode'] >= (255+20+20+20+20)),19,famGuy_ep_data['season'])


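## creates individual .txt files for each episode
os.makedirs("famGuy_Data", exist_ok=True)
for season in range(1,famGuy_ep_data['season'].max()+1): ## season numbers
    try:
        ## 'Episode' counts across the whole series, so episode numbers outside the
        ## season produce empty files, which are removed below
        ## + 1 so that the last episode of each season is included
        for episode in range(1, int(famGuy_ep_data[famGuy_ep_data['season'] == season]['Episode'].max()) + 1):
            filename = "famGuy_ep" + str(season) + "_" + str(episode)+"_.txt"
            path = "famGuy_Data/" + filename
            with open(path, "w") as file:
                file.writelines(famGuy_ep_data[(famGuy_ep_data['season'] == season) & (famGuy_ep_data['Episode'] == episode)]['Line'])
    except ValueError:
        ## a season with no rows has a NaN maximum; skip it instead of dropping all later seasons
        continue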


## removes empty files 
mypath = "famGuy_Data/"
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
for file in onlyfiles:
    if os.path.getsize(mypath + file) == 0:
        os.remove(mypath + file)
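
The chained `np.where` calls used above to assign seasons are hard to audit, since a single wrong boundary silently leaves episodes in season 0. One possible alternative, sketched here with the standard library only, derives the boundaries from a list of per-season episode counts. The counts are the ones implied by the first seven *American Dad!* boundaries in the code above; the function name `season_of` is hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate

# Episodes per season, as implied by the first seven American Dad! boundaries above.
episodes_per_season = [22, 19, 16, 20, 18, 19, 18]

# Last (1-indexed) episode number of each season.
season_ends = list(accumulate(episodes_per_season))

def season_of(episode_number):
    """Return the 1-indexed season of a 1-indexed episode number, or 0 if it is out of range."""
    if episode_number < 1 or episode_number > season_ends[-1]:
        return 0
    # the first season whose last episode is >= episode_number
    return bisect_left(season_ends, episode_number) + 1

print(season_ends)
print(season_of(22), season_of(23))
```

With pandas, a function like this could be applied to a whole column with `amDad_ep_data['Episode'].map(season_of)`.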

### Classification Method
I created three different classification models. The models differentiate between *American Dad!* and *Family Guy* episodes, *The Simpsons* and *Futurama* episodes, and *South Park* and *The Simpsons* episodes. The first two pairs were selected because the shows in each pair were created and written by many of the same people, while the last pair was picked because *South Park* and *The Simpsons* are the two longest-running shows in the dataset. Stop words, including show-specific place and character names, are removed from the text. The stop words used for this analysis can also be downloaded from my GitHub repo. 

The goal of this section is both to test whether a classification model can differentiate between similar TV shows based only on the text of an episode and to find which words are most important in differentiating between any two shows. How well the model distinguishes between episodes of two shows indicates how similar the shows are, which helps gauge the diversity of humor and theme between them. Additionally, identifying which episodes the model repeatedly misidentifies provides an opportunity for close reading: looking at the text of the episode and identifying *why* the classification model has failed. The logistic regression weights given to individual words indicate how important each word is to identifying a show. While it is an imperfect measure, the highly weighted words can potentially paint an overall theme that differentiates a show from its counterpart in the model. 
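
The idea of reading themes off word weights can be sketched in a few lines. The notebook imports scikit-learn's `CountVectorizer` and `LogisticRegression`; to stay dependency-free, the version below hand-rolls a bag-of-words logistic regression with plain stochastic gradient ascent and no regularization, so it is a stand-in and its weights would not match scikit-learn's. The four toy "episodes", the stop-word list and the function names are invented for illustration.

```python
import math
from collections import Counter

def bag_of_words(text, stop_words=frozenset()):
    """Count the lowercase tokens of a text, skipping stop words."""
    return Counter(w for w in text.lower().split() if w not in stop_words)

def train_logistic(docs, labels, stop_words=frozenset(), epochs=200, lr=0.1):
    """Fit one weight per word with stochastic gradient ascent (no regularization)."""
    counts = [bag_of_words(doc, stop_words) for doc in docs]
    weights = {word: 0.0 for bow in counts for word in bow}
    bias = 0.0
    for _ in range(epochs):
        for bow, label in zip(counts, labels):
            score = bias + sum(weights[w] * c for w, c in bow.items())
            prob = 1.0 / (1.0 + math.exp(-score))
            error = label - prob  # positive values push weights toward class 1
            bias += lr * error
            for w, c in bow.items():
                weights[w] += lr * error * c
    return weights, bias

# Toy "episodes" (invented, not real transcripts); label 0 = show A, label 1 = show B.
docs = ["the alien met the cia agent", "cia alien mission",
        "the fat guy laughed", "fat dog martini"]
labels = [0, 0, 1, 1]
stop_words = frozenset({"the"})

weights, bias = train_logistic(docs, labels, stop_words)
ranked = sorted(weights, key=weights.get)  # most negative weight first
print("most indicative of show A:", ranked[:2])
print("most indicative of show B:", ranked[-2:])
```

Words that appear in more episodes of one show accumulate larger weights in that show's direction, which is why the most extreme weights are the ones worth reading closely.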






### Word2Vec Method 
I created five separate Word2Vec models with each model being trained on every script of an individual TV show. This technique enables parsing out vocabulary and thematic differences between TV shows by investigating the similar words in one model compared to another, and it additionally provides a show's provisional definition of a word by displaying which words are similar in the vocabulary of that show. 

While the classification section attempts to define and differentiate shows on a binary basis (i.e. What themes and words are uniquely characteristic of *The Simpsons* rather than *Futurama*?), this section primarily seeks to find unique differences in the vocabulary of an individual TV show in relation to the four other shows in the dataset. This is, by nature, much more open ended, seeking to identify major themes that can uniquely characterize an individual TV show. 
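
To illustrate what "similar words" means here, the sketch below ranks words by cosine similarity, which is the measure behind gensim's `model.wv.most_similar(word)` query. The two dictionaries of 3-dimensional vectors are made up to stand in for two shows' trained embeddings; real models use the 200-dimensional vectors trained below, and `most_similar` here is a hypothetical pure-Python helper, not the gensim method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def most_similar(vectors, word, topn=2):
    """Rank the other words in one model by cosine similarity to `word`."""
    target = vectors[word]
    scores = [(other, cosine(target, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors: the same word can sit near different neighbors in each show.
show_a = {"school": [1.0, 0.1, 0.0], "teacher": [0.9, 0.2, 0.1], "beer": [0.0, 1.0, 0.2]}
show_b = {"school": [0.1, 1.0, 0.0], "teacher": [0.9, 0.2, 0.1], "beer": [0.2, 0.9, 0.1]}

print(most_similar(show_a, "school", topn=1))
print(most_similar(show_b, "school", topn=1))
```

Comparing the nearest neighbors of the same word across the five models is what gives each show its provisional "definition" of that word.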

The following code trains all five word2vec models.


In [None]:
## importing packages
import gensim
from nltk.tokenize import sent_tokenize
from nltk.tokenize.treebank import TreebankWordTokenizer
import nltk
nltk.download('punkt')
## South Park
base_dir = "./SouthPark_Data/" 

all_docs = [] # list that will store the text of each episode

docs = os.listdir(base_dir) # get a list of all the files in the directory

for doc in docs: # iterate through the docs
    if not doc.startswith('.'): # skip hidden files such as .DS_Store
        with open(base_dir + doc, "r", encoding="utf-8") as file: # force unicode conversion to keep PCs happy
            text = file.read() # read in the file as a single text string
            all_docs.append(text) # append it to the all_docs list

tokenizer = TreebankWordTokenizer()

# Get titles
directory = "./SouthPark_Data/"
files = glob.glob(f"{directory}/*.txt")
sp_titles = [Path(file).stem for file in files]

# split each episode into sentences, then each sentence into lowercase tokens
def make_sentences(list_txt):
    all_txt = []
    for txt in list_txt:
        lower_txt = txt.lower()
        sentences = sent_tokenize(lower_txt)
        sentences = [tokenizer.tokenize(sent) for sent in sentences]
        all_txt += sentences
    return all_txt

sentences = make_sentences(all_docs)

# trains our model!
sp_model = gensim.models.Word2Vec(
    sentences,
    min_count=5, # default is 5; drops words that appear fewer than 5 times
    size=200, # dimensionality of the word vectors; default is 100 (renamed vector_size in gensim 4.0)
    workers=5) # parallel processing
sp_model.save('sp_model') ## saves the model

## Simpsons


base_dir = "./Simpsons_Data/" 

all_docs = [] # our list which will store the text of each doc; empty for now

docs = os.listdir(base_dir) # get a list of all the files in the directory

for doc in docs: # iterate through the docs
    if not doc.startswith('.'): # skip hidden files such as .DS_Store
        with open(base_dir + doc, "r", encoding="utf-8") as file: # force unicode conversion to keep PCs happy
            text = file.read() # read in the file as a single text string
            all_docs.append(text) # append it to the all_docs list

# and we'll get titles
directory = "./Simpsons_Data/"
files = glob.glob(f"{directory}/*.txt")
simp_titles = [Path(file).stem for file in files]

# make_sentences and tokenizer are already defined in the South Park section above
sentences = make_sentences(all_docs)

# let's train our model!
simp_model = gensim.models.Word2Vec(
    sentences,
    min_count=5, # default is 5; drops words that appear fewer than 5 times
    size=200, # dimensionality of the word vectors; default is 100 (renamed vector_size in gensim 4.0)
    workers=5) # parallel processing; needs Cython
simp_model.save('simp_model')



## Family Guy
## helper that reads every episode file in a directory and returns tokenized sentences
def createW2VSentences(dire):
    all_docs = [] # our list which will store the text of each doc; empty for now

    docs = os.listdir(dire) # get a list of all the files in the directory

    for doc in docs: # iterate through the docs
        if not doc.startswith('.'): # skip hidden files such as .DS_Store
            with open(dire + doc, "r", encoding="utf-8") as file: # force unicode conversion to keep PCs happy
                text = file.read() # read in the file as a single text string
                all_docs.append(text) # append it to the all_docs list

    # make_sentences and its tokenizer are defined in the South Park section above
    sentences = make_sentences(all_docs)
    return sentences

# let's train our model!
famGuyDir = "./famGuy_Data/"
sentences = createW2VSentences(famGuyDir)

famGuy_model = gensim.models.Word2Vec(
    sentences,
    min_count=5, # ignore words that occur fewer than 5 times (5 is also the default)
    size=200, # dimensionality of the word vectors; default is 100 (renamed `vector_size` in gensim 4)
    workers=5) # number of worker threads; parallel training needs Cython
famGuy_model.save('famGuy_model')

## American Dad
amDadDir = "./amDad_Data/"
sentences = createW2VSentences(amDadDir)

amDad_model = gensim.models.Word2Vec(
    sentences,
    min_count=5, # ignore words that occur fewer than 5 times (5 is also the default)
    size=200, # dimensionality of the word vectors; default is 100 (renamed `vector_size` in gensim 4)
    workers=5) # number of worker threads; parallel training needs Cython
amDad_model.save('amDad_model')

## Futurama
futurDir = "./Futurama_Data/"
sentences = createW2VSentences(futurDir)

futur_model = gensim.models.Word2Vec(
    sentences,
    min_count=5, # ignore words that occur fewer than 5 times (5 is also the default)
    size=200, # dimensionality of the word vectors; default is 100 (renamed `vector_size` in gensim 4)
    workers=5) # number of worker threads; parallel training needs Cython
futur_model.save('futur_model')

### Results 
#### Classification: *American Dad!*-*Family Guy* Model
The model classifying *Family Guy* and *American Dad!* episodes is accurate around 89% to 92% of the time; the figure varies from run to run because the larger class is randomly downsampled to balance the two shows. This is the lowest accuracy among the three models, reflecting the similar writing styles and themes of the two shows. The words most important to classifying an episode as *American Dad!* are words like "Alien", "USA" or "CIA", while words such as "white" or "fat" are most important to classifying an episode as *Family Guy*.

While both shows center around dysfunctional families with boisterous patriarchs, these results give insight into their different themes. *Family Guy* is more likely to involve topics pertaining to race or to comment on the weight of a character. *American Dad!*, on the other hand, has an alien as one of the main characters and uses decisions made by the CIA to move the plot of individual episodes forward. 

I think the biggest insight this section provides, however, comes from the classification model's relative inability to differentiate between the two shows. This parallels a theme that I have seen in commentary on animated adult comedies. After the success of *Family Guy*, many shows attempted to mimic its art style and humor, including *Brickleberry*, *Paradise PD*, and *The Cleveland Show*. *American Dad!* was co-created by Seth MacFarlane, who also created *Family Guy*. Unlike Matt Groening, the creator of *The Simpsons* and *Futurama*, he largely remained within the thematic structure that he had already set up in *Family Guy*, which could be why the classification model has a lower accuracy. 
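
The accuracy figures in this section come from a leave-one-out procedure: each episode is held out in turn and predicted by a model trained on every other episode. The sketch below shows the same idea with scikit-learn's `LeaveOneOut` and `cross_val_predict`. It is only an illustration: the toy count matrix `X_toy` and labels `y_toy` are invented and are not the notebook's data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Hypothetical toy document-term matrix: 8 "episodes" x 3 "words".
# Rows 0-3 stand in for one show, rows 4-7 for the other.
X_toy = np.array([
    [5, 0, 1], [4, 1, 0], [6, 0, 2], [5, 1, 1],
    [0, 5, 1], [1, 4, 0], [0, 6, 2], [1, 5, 1],
], dtype=float)
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")

# Each episode is predicted by a model fitted on all the other episodes.
loo_pred = cross_val_predict(clf, X_toy, y_toy, cv=LeaveOneOut())

# Accuracy is the share of held-out episodes that were predicted correctly.
loo_accuracy = float((loo_pred == y_toy).mean())
print(loo_accuracy)
```

Because every episode is scored by a model that never saw it, the accuracy is less optimistic than one computed on the training data itself.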

In [None]:
## Family Guy-American Dad
### General Things 
### load stopwords
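from sklearn.feature_extraction import text
with open('./docs/stopwords.txt') as text_file: # close the file once the stopwords are read
    jockers_words = text_file.read().split()
new_stopwords = list(text.ENGLISH_STOP_WORDS.union(jockers_words)) # recent scikit-learn versions expect a list, not a frozenset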

### create dtm
corpus_path = './Classification_Data_AmDadFamGuy/'
vectorizer = CountVectorizer(input='filename', encoding='utf8', stop_words = new_stopwords, min_df=20, dtype='float64')


directory = "./Classification_Data_AmDadFamGuy/"
files = glob.glob(f"{directory}/*.txt")
titles = [Path(file).stem for file in files]

corpus = []
for title in titles:
    filename = title + ".txt"
    corpus.append(corpus_path + filename)
dtm = vectorizer.fit_transform(corpus)
vocab = vectorizer.get_feature_names_out() # scikit-learn >= 1.0; older versions use get_feature_names()
matrix = dtm.toarray()
df = DataFrame(matrix, columns=vocab)









filepaths = []
fulltext = []
for filepath in glob.iglob('Classification_Data_AmDadFamGuy/*.txt'): ## grab titles
    filepaths.append(filepath)
for filepath in filepaths: ## grab text and get the scores of each individual episode
    with open(filepath, "r") as file:
        text = file.read()
        fulltext.append(text)
clasdict = {'Name': filepaths, 'Text': fulltext
}
classdf = pd.DataFrame(clasdict)
def changeFilePath(filePath):
    newPath = filePath[32:]
    return newPath
classdf['Name'] = classdf['Name'].apply(changeFilePath)
amDadBool = []

for name in classdf['Name']:
    if name[0:4] == "amDa":
        amDadBool.append(1)
    else:
        amDadBool.append(0)
classdf['amDad'] = amDadBool

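df_concat = pd.concat([classdf, df], axis = 1)
noAD = int(df_concat['amDad'].sum()) # number of American Dad episodes
df_fg = df_concat[df_concat['amDad'] == 0] # Family Guy episodes
df_ad = df_concat[df_concat['amDad'] == 1] # American Dad episodes
df_fg = df_fg.sample(n=noAD) # randomly downsample Family Guy so both classes are the same size
df_final = pd.concat([df_fg, df_ad])
df_final = df_final.reset_index(drop=True)
meta = df_final[["Name", "amDad"]].copy() # copy so the new columns are not added to a slice of df_final
df = df_final.loc[:,'000':] # keep only the word-count columns (assumes '000' is the first vocabulary column)

meta['PROBS'] = ''
meta['PREDICTED'] = ''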

model = LogisticRegression(penalty = 'l1', C = 1.0, solver='liblinear')


for this_index in df_final.index.tolist():
    train_index_list = [index_ for index_ in df.index.tolist() if index_ != this_index] # exclude the title to be predicted
    X = df.loc[train_index_list] # the model trains on all the data except the excluded title row
    y = meta.loc[train_index_list, 'amDad'] # the y row tells the model which class each title belongs to
    TEST_CASE = df.loc[[this_index]]

    model.fit(X,y) # fit the model
    prediction = model.predict_proba(TEST_CASE) # calculate probability of test case
    predicted = model.predict(TEST_CASE) # calculate predicted class of test case
    meta.at[this_index, 'PREDICTED'] = predicted[0] # add predicted class (a scalar, not a 1-element array) to metadata
    meta.at[this_index, 'PROBS'] = str(prediction) # add probabilities to metadata

# RESULT is 0 when the prediction is correct, 1 when an American Dad episode is
# predicted as Family Guy, and -1 when a Family Guy episode is predicted as American Dad
meta['PREDICTED'] = meta['PREDICTED'].astype(int)
meta['RESULT'] = meta['amDad'] - meta['PREDICTED']

meta[meta['RESULT'] != 0] # show every misclassified episode, in either direction

canonic_c = 1.0

def Ztest(vec1, vec2):

    X1, X2 = np.mean(vec1), np.mean(vec2)
    sd1, sd2 = np.std(vec1), np.std(vec2)
    n1, n2 = len(vec1), len(vec2)

    pooledSE = np.sqrt(sd1**2/n1 + sd2**2/n2)
    z = (X1 - X2)/pooledSE
    pval = 2*(norm.sf(abs(z)))

    return z, pval
def feat_pval_weight(meta_df_, dtm_df_):

    dtm0 = dtm_df_.loc[meta_df_[meta_df_['amDad']==0].index.tolist()].to_numpy()
    dtm1 = dtm_df_.loc[meta_df_[meta_df_['amDad']==1].index.tolist()].to_numpy()

    pvals = [Ztest(dtm0[ : ,i], dtm1[ : ,i])[1] for i in range(dtm_df_.shape[1])]
    clf = LogisticRegression(penalty = 'l1', C = canonic_c, class_weight = 'balanced', solver='liblinear')
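    # the positive class is Family Guy (amDad == 0): positive weights point toward
    # Family Guy and negative weights toward American Dad
    clf.fit(dtm_df_, meta_df_['amDad']==0)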
    weights = clf.coef_[0]

    feature_df = pd.DataFrame()

    feature_df['FEAT'] = dtm_df_.columns
    feature_df['P_VALUE'] = pvals
    feature_df['LR_WEIGHT'] = weights

    return feature_df


feat_df = feat_pval_weight(meta, df)

In [69]:
accuracy = len(meta[meta['RESULT'] == 0])/len(meta)
print("The accuracy of the Family Guy-American Dad model is " + str(accuracy))

The accuracy of the Family Guy-American Dad model is 0.9023255813953488


In [70]:
feat_df.sort_values('LR_WEIGHT', ascending = True).head(20)

| index | FEAT | P_VALUE | LR_WEIGHT |
|---|---|---|---|
| 3764 | usa | 3.490741e-33 | -1.36643 |
| 3219 | save | 3.435987e-05 | -0.432497 |
| 3519 | straight | 0.009139758 | -0.432255 |
| 3720 | true | 0.01286436 | -0.42852 |
| 3363 | sky | 4.786872e-33 | -0.41346 |
| 3312 | shot | 0.9190419 | -0.396616 |
| 1547 | cia | 9.839928e-11 | -0.375171 |
| 2761 | need | 6.653096e-05 | -0.371639 |
| 2318 | hmm | 0.0003276732 | -0.340708 |
| 3863 | whoo | 0.004405436 | -0.336842 |


In [71]:
feat_df.sort_values('LR_WEIGHT', ascending = True).tail(20)

| index | FEAT | P_VALUE | LR_WEIGHT |
|---|---|---|---|
| 2814 | old | 1.013702e-06 | 0.162016 |
| 1675 | crap | 1.738423e-18 | 0.167223 |
| 1395 | brought | 0.004178643 | 0.176746 |
| 2345 | horse | 0.6421214 | 0.194281 |
| 2359 | huh | 2.723837e-08 | 0.195783 |
| 1719 | dancing | 0.2645358 | 0.196719 |
| 3727 | tucker | 2.309205e-08 | 0.205411 |
| 1528 | children | 0.0003521932 | 0.212781 |
| 3159 | right | 2.552637e-26 | 0.231068 |
| 2235 | guys | 9.218887e-14 | 0.239201 |


#### Classification: *Futurama*-*The Simpsons* Model
The classification model differentiating between *Futurama* and *The Simpsons* is more accurate than the previous model, with an accuracy of around 92% to 94%. This likely reflects the radically different themes of the two shows, despite their similar writing styles, both shaped by Matt Groening and David X. Cohen. 

Because the feature model is fitted with *The Simpsons* as the positive class, negative weights point toward *Futurama* and positive weights toward *The Simpsons*. The words with the most negative logistic regression weights, which push the model toward classifying an episode as *Futurama*, are words such as "professor", "robot" and "planet". The words with the highest weights, which push it toward *The Simpsons*, are words such as "dad", "kids" and "school". The *Futurama* words reflect the science-fiction themes of the show, while *The Simpsons* words reflect its familial themes. The behaviors of the characters in each show are nevertheless very similar. Professor Farnsworth in *Futurama* resembles Professor John Frink in *The Simpsons*, and Fry's antics in *Futurama* mimic those of Bart and Homer in *The Simpsons*. Despite these similar character traits, there are reliable thematic differences that allow the model to classify episodes. 
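
Whether a large positive weight points toward *Futurama* or toward *The Simpsons* depends on which label the classifier treats as the positive class. In scikit-learn's binary logistic regression, `coef_` describes the second entry of `classes_`, so flipping the label flips every sign. The sketch below illustrates this with invented counts; `vocab_toy`, `X_toy` and `is_futurama` are hypothetical and not taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical counts for two made-up words in six "episodes".
vocab_toy = ["robot", "school"]
X_toy = np.array([[6, 0], [5, 1], [7, 0], [0, 6], [1, 5], [0, 7]], dtype=float)
is_futurama = np.array([1, 1, 1, 0, 0, 0])

clf = LogisticRegression(solver="liblinear")

# Fit with "is Futurama" as the positive class.
w_pos = clf.fit(X_toy, is_futurama == 1).coef_[0].copy()
# Fit with the label flipped, in the style of `clf.fit(dtm_df_, meta_df_[...] == 0)`.
w_flipped = clf.fit(X_toy, is_futurama == 0).coef_[0].copy()

for word, a, b in zip(vocab_toy, w_pos, w_flipped):
    print(f"{word}: positive=Futurama {a:+.3f} | positive=Simpsons {b:+.3f}")
```

Reading a sorted weight table therefore requires checking the target passed to `fit` before deciding which end of the ranking belongs to which show.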

A *Simpsons* episode that is repeatedly misidentified as a *Futurama* episode is the season 23 episode "Them, Robot". This is one of the few episodes of *The Simpsons* that bends the family sitcom into a science-fiction narrative. Its themes of robotics and capitalist exploitation mirror recurring language and themes in *Futurama*. This misidentification helps identify what separates the two shows: the science-fiction genre of *Futurama* versus the familial sitcom genre of *The Simpsons*.
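
Since the larger class is randomly downsampled on every run, "repeatedly misidentified" only has meaning across several runs. One possible way to make that concrete is to repeat the balanced leave-one-out procedure and count how often each episode is misclassified. The sketch below does this on invented data; the episode names, the Poisson-generated counts and `n_runs` are all assumptions made for illustration.

```python
from collections import Counter

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)

# Hypothetical data: 12 majority-class and 6 minority-class "episodes".
names = [f"simp_{i}" for i in range(12)] + [f"futur_{i}" for i in range(6)]
X_toy = np.vstack([
    rng.poisson([5, 1], size=(12, 2)),  # majority class leans on word 0
    rng.poisson([1, 5], size=(6, 2)),   # minority class leans on word 1
]).astype(float)
y_toy = np.array([0] * 12 + [1] * 6)

miss_counts = Counter()
n_runs = 10
for _ in range(n_runs):
    # Downsample the majority class so both classes are the same size.
    keep0 = rng.choice(np.where(y_toy == 0)[0], size=6, replace=False)
    idx = np.concatenate([keep0, np.where(y_toy == 1)[0]])
    clf = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")
    pred = cross_val_predict(clf, X_toy[idx], y_toy[idx], cv=LeaveOneOut())
    # Record every held-out episode whose prediction was wrong in this run.
    for i in idx[pred != y_toy[idx]]:
        miss_counts[names[i]] += 1

# Episodes with high counts are the ones worth reading closely.
print(miss_counts.most_common(5))
```

An episode that is wrong in most of the runs in which it is sampled is a better candidate for close reading than one that is wrong once.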

In [None]:
## Futurama-Simpsons
## General Things 
# load stopwords
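from sklearn.feature_extraction import text
with open('./docs/stopwords.txt') as text_file: # close the file once the stopwords are read
    jockers_words = text_file.read().split()
new_stopwords = list(text.ENGLISH_STOP_WORDS.union(jockers_words)) # recent scikit-learn versions expect a list, not a frozenset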

# create dtm
corpus_path = './Classification_Data_SimpFuturama/'
vectorizer = CountVectorizer(input='filename', encoding='utf8', stop_words = new_stopwords, min_df=20, dtype='float64')


directory = "./Classification_Data_SimpFuturama/"
files = glob.glob(f"{directory}/*.txt")
titles = [Path(file).stem for file in files]

corpus = []
for title in titles:
    filename = title + ".txt"
    corpus.append(corpus_path + filename)
dtm = vectorizer.fit_transform(corpus)
vocab = vectorizer.get_feature_names_out() # scikit-learn >= 1.0; older versions use get_feature_names()
matrix = dtm.toarray()
df = DataFrame(matrix, columns=vocab)


filepaths = []
fulltext = []
for filepath in glob.iglob('Classification_Data_SimpFuturama/*.txt'): ## grab titles
    filepaths.append(filepath)
for filepath in filepaths: ## grab text and get the scores of each individual episode
    with open(filepath, "r") as file:
        text = file.read()
        fulltext.append(text)
clasdict = {'Name': filepaths, 'Text': fulltext
}
classdf = pd.DataFrame(clasdict)
def changeFilePath(filePath):
    newPath = filePath[33:]
    return newPath
classdf['Name'] = classdf['Name'].apply(changeFilePath)
futramaBool = []

for name in classdf['Name']:
    if name[0:4] == "simp":
        futramaBool.append(0)
    else:
        futramaBool.append(1)
classdf['Futurama'] = futramaBool

df_concat = pd.concat([classdf, df], axis = 1)
noF = df_concat['Futurama'].sum()

df_simp = df_concat[df_concat['Futurama'] == 0]
df_sp = df_concat[df_concat['Futurama'] == 1]
df_simp = df_simp.sample(n=noF)
df_final = pd.concat([df_simp, df_sp])
df_final = df_final.reset_index()
df_final = df_final.drop(columns="index")
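meta = df_final[["Name", "Futurama"]].copy() # copy so the new columns are not added to a slice of df_final
df = df_final.loc[:,'000':] # keep only the word-count columns (assumes '000' is the first vocabulary column)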

meta['PROBS'] = ''
meta['PREDICTED'] = ''

model = LogisticRegression(penalty = 'l1', C = 1.0, solver='liblinear')


for this_index in df_final.index.tolist():
    train_index_list = [index_ for index_ in df.index.tolist() if index_ != this_index] # exclude the title to be predicted
    X = df.loc[train_index_list] # the model trains on all the data except the excluded title row
    y = meta.loc[train_index_list, 'Futurama'] # the y row tells the model which class each title belongs to
    TEST_CASE = df.loc[[this_index]]

    model.fit(X,y) # fit the model
    prediction = model.predict_proba(TEST_CASE) # calculate probability of test case
    predicted = model.predict(TEST_CASE) # calculate predicted class of test case
    meta.at[this_index, 'PREDICTED'] = predicted[0] # add predicted class (a scalar, not a 1-element array) to metadata
    meta.at[this_index, 'PROBS'] = str(prediction) # add probabilities to metadata

# RESULT is 0 when the prediction is correct, 1 when a Futurama episode is
# predicted as The Simpsons, and -1 when a Simpsons episode is predicted as Futurama
meta['PREDICTED'] = meta['PREDICTED'].astype(int)
meta['RESULT'] = meta['Futurama'] - meta['PREDICTED']

meta[meta['RESULT'] != 0] # show every misclassified episode, in either direction

canonic_c = 1.0

def Ztest(vec1, vec2):

    X1, X2 = np.mean(vec1), np.mean(vec2)
    sd1, sd2 = np.std(vec1), np.std(vec2)
    n1, n2 = len(vec1), len(vec2)

    pooledSE = np.sqrt(sd1**2/n1 + sd2**2/n2)
    z = (X1 - X2)/pooledSE
    pval = 2*(norm.sf(abs(z)))

    return z, pval
def feat_pval_weight(meta_df_, dtm_df_):

    dtm0 = dtm_df_.loc[meta_df_[meta_df_['Futurama']==0].index.tolist()].to_numpy()
    dtm1 = dtm_df_.loc[meta_df_[meta_df_['Futurama']==1].index.tolist()].to_numpy()

    pvals = [Ztest(dtm0[ : ,i], dtm1[ : ,i])[1] for i in range(dtm_df_.shape[1])]
    clf = LogisticRegression(penalty = 'l1', C = canonic_c, class_weight = 'balanced', solver='liblinear')
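    # the positive class is The Simpsons (Futurama == 0): positive weights point toward
    # The Simpsons and negative weights toward Futurama
    clf.fit(dtm_df_, meta_df_['Futurama']==0)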
    weights = clf.coef_[0]

    feature_df = pd.DataFrame()

    feature_df['FEAT'] = dtm_df_.columns
    feature_df['P_VALUE'] = pvals
    feature_df['LR_WEIGHT'] = weights

    return feature_df


feat_df = feat_pval_weight(meta, df)

In [73]:
accuracy = len(meta[meta['RESULT'] == 0])/len(meta)
print("The accuracy of the Futurama-Simpsons model is: " + str(accuracy))

The accuracy of the Futurama-Simpsons model is: 0.9204545454545454


In [74]:
feat_df.sort_values('LR_WEIGHT', ascending = True).head(20)

| index | FEAT | P_VALUE | LR_WEIGHT |
|---|---|---|---|
| 2209 | professor | 2.256554e-07 | -0.953139 |
| 2365 | robot | 1.658894e-11 | -0.822602 |
| 2109 | planet | 4.154998e-10 | -0.650316 |
| 757 | delivery | 6.52555e-06 | -0.373202 |
| 1915 | news | 2.294563e-06 | -0.313089 |
| 65 | alien | 0.000423025 | -0.283459 |
| 3006 | universe | 0.002864122 | -0.266051 |
| 413 | captain | 0.001411391 | -0.210065 |
| 559 | coffee | 0.3491012 | -0.192436 |
| 1064 | feel | 0.1547857 | -0.17312 |


In [75]:
feat_df.sort_values('LR_WEIGHT', ascending = True).tail(20)

| index | FEAT | P_VALUE | LR_WEIGHT |
|---|---|---|---|
| 1076 | fetch | 0.2623289 | 0.0 |
| 1075 | festival | 0.09725443 | 0.0 |
| 1074 | female | 0.02678444 | 0.0 |
| 1247 | gonna | 0.001227304 | 0.008175 |
| 1921 | night | 0.05719649 | 0.010504 |
| 1595 | la | 0.06857983 | 0.01402 |
| 2563 | sir | 0.1742952 | 0.014594 |
| 293 | book | 0.0001331759 | 0.017958 |
| 1534 | job | 0.3744752 | 0.022288 |
| 193 | baseball | 0.1720918 | 0.03882 |


#### Classification: *South Park*-*The Simpsons* Model
The *South Park*-*Simpsons* model is by far the most accurate of the three, at roughly 98%, which is largely expected given how different the two shows are. *South Park* is raunchier and directly satirizes political and social figures, while *The Simpsons* centers around a family and has a plot that is less directly character driven compared to *South Park*.

Because the feature model is fitted with *The Simpsons* as the positive class, negative weights point toward *South Park*. The words with the most negative logistic regression weights, indicating importance to classifying an episode as *South Park*, are words like "dude" or swear words, while the words with the highest weights are words like "dad", "kids" or "baby". The words that are key to identifying an episode as a *Simpsons* episode, again, reflect the familial themes of the show. Even though this familial structure may not be the core plot of many episodes, the familial themes nevertheless undergird the show and its plot. 

A *South Park* episode that is repeatedly misidentified is the season 2 episode "Terrance and Phillip in Not Without My Anus", which aired in place of the promised reveal of Cartman's father and instead follows Terrance as he tries to rescue his daughter. This theme of fatherhood is not present in many *South Park* episodes, but it is very reminiscent of the familial themes of *The Simpsons*, which is likely why the classification model misinterprets the text of the episode as *The Simpsons*. Interestingly, no episodes from *The Simpsons* are repeatedly misidentified, which suggests the rigidity of the familial and thematic structure of *The Simpsons* when compared with the diversity of themes in *South Park*.
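
Checking which show's episodes are misidentified depends on reading the sign of the `RESULT` column, which is the true label minus the predicted label. A small sketch with invented predictions (the names and values in `meta_toy` are hypothetical) shows how the two directions of error separate:

```python
import pandas as pd

# Hypothetical labels and predictions for six episodes (1 = South Park).
meta_toy = pd.DataFrame({
    "Name": ["sp_a", "sp_b", "sp_c", "simp_a", "simp_b", "simp_c"],
    "SouthPark": [1, 1, 1, 0, 0, 0],
    "PREDICTED": [1, 0, 1, 0, 0, 1],
})
meta_toy["RESULT"] = meta_toy["SouthPark"] - meta_toy["PREDICTED"]

# RESULT ==  1: a South Park episode predicted as The Simpsons.
# RESULT == -1: a Simpsons episode predicted as South Park.
sp_as_simp = meta_toy.loc[meta_toy["RESULT"] == 1, "Name"].tolist()
simp_as_sp = meta_toy.loc[meta_toy["RESULT"] == -1, "Name"].tolist()
print(sp_as_simp, simp_as_sp)
```

Filtering only on `RESULT == 1` would surface the first list and hide the second, so both values need to be inspected before concluding that one show is never misclassified.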

In [None]:
## South Park-Simpsons
## General Things 
# load stopwords
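from sklearn.feature_extraction import text
with open('./docs/stopwords.txt') as text_file: # close the file once the stopwords are read
    jockers_words = text_file.read().split()
new_stopwords = list(text.ENGLISH_STOP_WORDS.union(jockers_words)) # recent scikit-learn versions expect a list, not a frozenset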

# create dtm
corpus_path = './Classification_Data/'
vectorizer = CountVectorizer(input='filename', encoding='utf8', stop_words = new_stopwords, min_df=20, dtype='float64')


directory = "./Classification_Data/"
files = glob.glob(f"{directory}/*.txt")
titles = [Path(file).stem for file in files]

corpus = []
for title in titles:
    filename = title + ".txt"
    corpus.append(corpus_path + filename)
dtm = vectorizer.fit_transform(corpus)
vocab = vectorizer.get_feature_names_out() # scikit-learn >= 1.0; older versions use get_feature_names()
matrix = dtm.toarray()
df = DataFrame(matrix, columns=vocab)


filepaths = []
fulltext = []
for filepath in glob.iglob('Classification_Data/*.txt'): ## grab titles
    filepaths.append(filepath)
for filepath in filepaths: ## grab text and get the scores of each individual episode
    with open(filepath, "r") as file:
        text = file.read()
        fulltext.append(text)
clasdict = {'Name': filepaths, 'Text': fulltext
}
classdf = pd.DataFrame(clasdict)
def changeFilePath(filePath):
    newPath = filePath[20:]
    return newPath
classdf['Name'] = classdf['Name'].apply(changeFilePath)
southParkBool = []

for name in classdf['Name']:
    if name[0:4] == "simp":
        southParkBool.append(0)
    else:
        southParkBool.append(1)
classdf['SouthPark'] = southParkBool

df_concat = pd.concat([classdf, df], axis = 1)
noSP = df_concat['SouthPark'].sum()

df_simp = df_concat[df_concat['SouthPark'] == 0]
df_sp = df_concat[df_concat['SouthPark'] == 1]
df_simp = df_simp.sample(n=noSP)
df_final = pd.concat([df_simp, df_sp])
df_final = df_final.reset_index()
df_final = df_final.drop(columns="index")
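meta = df_final[["Name", "SouthPark"]].copy() # copy so the new columns are not added to a slice of df_final
df = df_final.loc[:,'000':] # keep only the word-count columns (assumes '000' is the first vocabulary column)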

meta['PROBS'] = ''
meta['PREDICTED'] = ''

model = LogisticRegression(penalty = 'l1', C = 1.0, solver='liblinear')


for this_index in df_final.index.tolist():
    train_index_list = [index_ for index_ in df.index.tolist() if index_ != this_index] # exclude the title to be predicted
    X = df.loc[train_index_list] # the model trains on all the data except the excluded title row
    y = meta.loc[train_index_list, 'SouthPark'] # the y row tells the model which class each title belongs to
    TEST_CASE = df.loc[[this_index]]

    model.fit(X,y) # fit the model
    prediction = model.predict_proba(TEST_CASE) # calculate probability of test case
    predicted = model.predict(TEST_CASE) # calculate predicted class of test case
    meta.at[this_index, 'PREDICTED'] = predicted[0] # add predicted class (a scalar, not a 1-element array) to metadata
    meta.at[this_index, 'PROBS'] = str(prediction) # add probabilities to metadata

# RESULT is 0 when the prediction is correct, 1 when a South Park episode is
# predicted as The Simpsons, and -1 when a Simpsons episode is predicted as South Park
meta['PREDICTED'] = meta['PREDICTED'].astype(int)
meta['RESULT'] = meta['SouthPark'] - meta['PREDICTED']

meta[meta['RESULT'] != 0] # show every misclassified episode, in either direction

canonic_c = 1.0

def Ztest(vec1, vec2):

    X1, X2 = np.mean(vec1), np.mean(vec2)
    sd1, sd2 = np.std(vec1), np.std(vec2)
    n1, n2 = len(vec1), len(vec2)

    pooledSE = np.sqrt(sd1**2/n1 + sd2**2/n2)
    z = (X1 - X2)/pooledSE
    pval = 2*(norm.sf(abs(z)))

    return z, pval
def feat_pval_weight(meta_df_, dtm_df_):

    dtm0 = dtm_df_.loc[meta_df_[meta_df_['SouthPark']==0].index.tolist()].to_numpy()
    dtm1 = dtm_df_.loc[meta_df_[meta_df_['SouthPark']==1].index.tolist()].to_numpy()

    pvals = [Ztest(dtm0[ : ,i], dtm1[ : ,i])[1] for i in range(dtm_df_.shape[1])]
    clf = LogisticRegression(penalty = 'l1', C = canonic_c, class_weight = 'balanced', solver='liblinear')
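    # the positive class is The Simpsons (SouthPark == 0): positive weights point toward
    # The Simpsons and negative weights toward South Park
    clf.fit(dtm_df_, meta_df_['SouthPark']==0)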
    weights = clf.coef_[0]

    feature_df = pd.DataFrame()

    feature_df['FEAT'] = dtm_df_.columns
    feature_df['P_VALUE'] = pvals
    feature_df['LR_WEIGHT'] = weights

    return feature_df


feat_df = feat_pval_weight(meta, df)

In [77]:
accuracy = len(meta[meta['RESULT'] == 0])/len(meta)
print("The accuracy of The South Park-Simpsons Model is: " + str(accuracy))

The accuracy of The South Park-Simpsons Model is: 0.9849624060150376


In [78]:
feat_df.sort_values('LR_WEIGHT', ascending = True).head(20)

| index | FEAT | P_VALUE | LR_WEIGHT |
|---|---|---|---|
| 3999 | turns | 6.106483e-139 | -0.71389 |
| 4110 | walks | 7.096521e-163 | -0.680366 |
| 1199 | dude | 2.348527e-111 | -0.50231 |
| 3775 | takes | 1.80023e-90 | -0.413393 |
| 2435 | monitor | 0.0006575918 | -0.277334 |
| 230 | away | 4.399156e-117 | -0.231929 |
| 686 | children | 3.973744e-12 | -0.211516 |
| 2262 | looks | 1.64248e-114 | -0.171067 |
| 193 | ass | 2.034393e-29 | -0.145451 |
| 562 | canada | 0.02603544 | -0.131874 |


In [79]:
feat_df.sort_values('LR_WEIGHT', ascending = True).tail(20)

| index | FEAT | P_VALUE | LR_WEIGHT |
|---|---|---|---|
| 358 | big | 1.047561e-06 | 0.087279 |
| 3419 | sir | 0.2093214 | 0.090096 |
| 2204 | life | 0.1037431 | 0.091707 |
| 4115 | want | 6.869435e-08 | 0.093827 |
| 3032 | real | 0.005184926 | 0.102954 |
| 2258 | look | 9.231539e-45 | 0.113109 |
| 1226 | eat | 0.7992704 | 0.114671 |
| 3843 | thanks | 0.09818325 | 0.116476 |
| 3725 | super | 0.002839899 | 0.120854 |
| 1414 | feel | 0.0191556 | 0.127865 |


#### Word2Vec
The word2vec models are frequently indistinguishable when looking at the ten most similar words to words such as "man", but there are notable differences that can highlight the differing underlying themes and content of each show. These differences in the way that words are used can also give insight into the subconscious decisions that the writers of the TV shows make. It is worth noting that the model is different every time it is trained, so the similar words can differ between runs, and the words quoted in this section may not match the printed output exactly. Because of the size of the datasets, however, the output of the models is generally similar on each run. 
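
One rough way to check how similar two training runs are is to compare the neighbour lists they return for the same word. The sketch below computes the Jaccard overlap between two such lists. The helper `neighbor_overlap` and the two lists are hypothetical: the lists only imitate the `(word, score)` format returned by `most_similar`, and the words and scores are made up.

```python
def neighbor_overlap(run_a, run_b):
    """Jaccard overlap between two lists of (word, score) neighbours."""
    words_a = {word for word, _ in run_a}
    words_b = {word for word, _ in run_b}
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)

# Hypothetical outputs of `model.wv.most_similar("girl", topn=4)` from two
# separate training runs; the words and scores are invented.
run_1 = [("kid", 0.88), ("lady", 0.80), ("dog", 0.79), ("guy", 0.78)]
run_2 = [("kid", 0.86), ("dog", 0.81), ("woman", 0.77), ("lady", 0.76)]

overlap = neighbor_overlap(run_1, run_2)
print(round(overlap, 2))  # 3 shared words out of 5 distinct words -> 0.6
```

An overlap near 1.0 across several runs would support treating a neighbour list as stable; a low overlap suggests the list reflects training noise more than the show.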

First, the word "girl" or the word "woman" has radically different similar words across the TV shows. For *Family Guy*, *American Dad!* and *The Simpsons*, the similar words are all familial in nature, including words like "kid", but the similar words for *South Park* and *Futurama* do not include these familial themes. *South Park*'s similar words for "girl" include "penis", "vagina", "turd" and "dog". This is a clear outlier and gives a glimpse of a theme in the show: women and girls are frequently only brought into the plot in a sexual context, especially in the first 10 seasons. While this is likely not a conscious choice made by the writers, it is nevertheless alarming and points to subconscious decisions that limit the role of female characters in the show. *Futurama*'s similar words to "girl" are "pet", "swamped" and "damned". This is likely a result of Leela, the female protagonist of the show, who is a "sewer mutant" and constantly has her pet nearby. However, despite both Matt Groening and David X. Cohen being heavily involved with both *Futurama* and *The Simpsons*, there are clear differences in the use of female characters between the two shows. This is in contrast to the shows created by Seth MacFarlane, *American Dad!* and *Family Guy*, which have nearly identical similar words across the board. This, again, shows the thematic range between *Futurama* and *The Simpsons* when compared with the range of *American Dad!* and *Family Guy*.
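
The "similar words" discussed here are ranked by cosine similarity between word vectors, which is the score `most_similar` reports. The sketch below reproduces that ranking on a tiny, invented vocabulary; `toy_vectors` and `most_similar_toy` are hypothetical stand-ins, not part of gensim or of the trained models.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings"; the models here use 200 dimensions.
toy_vectors = {
    "girl": np.array([0.9, 0.1, 0.0]),
    "kid": np.array([0.8, 0.2, 0.1]),
    "robot": np.array([0.0, 0.1, 0.9]),
    "planet": np.array([0.1, 0.0, 0.8]),
}

def most_similar_toy(word, vectors, topn=2):
    """Rank the other words by cosine similarity to `word`."""
    query = vectors[word] / np.linalg.norm(vectors[word])
    scores = []
    for other, vec in vectors.items():
        if other == word:
            continue
        scores.append((other, float(query @ (vec / np.linalg.norm(vec)))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

neighbours = most_similar_toy("girl", toy_vectors)
print(neighbours)
```

When every neighbour scores close to 1.0, as in some of the printed lists in this section, the vectors are barely separated from one another, so those rankings should be read with extra caution.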

In [40]:
def print_similar_words(word):
    print("Similar words for the word: " + word + "\n")
    print("Futurama model: \n" + "")
    print(futur_model.wv.most_similar(word, topn=10))
    print("\n")
    print("American Dad model: \n")
    print(amDad_model.wv.most_similar(word, topn=10))
    print("\n")
    print("Family Guy model: \n")
    print(famGuy_model.wv.most_similar(word, topn=10))
    print("\n")
    print("Simpsons model: \n")
    print(simp_model.wv.most_similar(word, topn=10))
    print("\n")
    print("South Park model: \n")
    print(sp_model.wv.most_similar(word, topn=10))
print_similar_words("girl")

Similar words for the word: girl

Futurama model: 

[('fake', 0.9985529184341431), ('crash', 0.9983259439468384), ('dollars', 0.9982154369354248), ('control', 0.9979572296142578), ('eating', 0.9978959560394287), ('shooting', 0.9978055357933044), ('less', 0.997783362865448), ('dig', 0.9977428317070007), ('lazy', 0.9976521134376526), ('angry', 0.9976294040679932)]


American Dad model: 

[('kid', 0.8849986791610718), ('dog', 0.8093185424804688), ('alien', 0.8025319576263428), ('lady', 0.7974755167961121), ('little', 0.7966359257698059), ('second', 0.7924938201904297), ('guy', 0.7896062135696411), ('shot', 0.7758139371871948), ('person', 0.7632725238800049), ('fish', 0.762179970741272)]


Family Guy model: 

[('kid', 0.8633959293365479), ('lady', 0.7928106784820557), ('dog', 0.7781652212142944), ('woman', 0.7661473751068115), ('guy', 0.7580353021621704), ('person', 0.7306180000305176), ('bit', 0.7255604863166809), ('big', 0.7163251638412476), ('joke', 0.7130960822105408), ('baby', 0.70977

Second, the word "science" also has interesting results. *South Park*'s similar words to "science" are "Christian" and "comedy". After seeing these words, I watched the *South Park* episode "Go God Go", where "science" and religion are the core themes. Throughout the episode, Richard Dawkins and Ms. Garrison attempt to rid the world of religious ideology in favor of scientific belief. In fact, the entire episode's plot centers upon the ways that scientific dogma can create factionalism similar to religious factionalism. This oppositional relationship between "science" and religion is not mirrored in the other TV shows in the dataset: "science" is similar to "progress" in the *Family Guy* model and to "modern" or "mysterious" in *The Simpsons* model. This relationship between religion and science is what differentiates episodes of *South Park* that investigate science from thematically similar episodes of *The Simpsons*. For example, the episode "The Monkey Suit" involves themes of both religion and science, beginning with a debate around creationism and evolution. However, this dichotomy between creationism and evolution is merely the setup for the plot rather than the central focus. The plot veers away and centers upon a trial against Lisa and an investigation that decides Homer is the evolutionary missing link. Although science and religion begin the episode in opposition, there are no episodes of *The Simpsons* in which investigating this perceived dichotomy is the central focus of the plot.

In [41]:
print_similar_words("science")

Similar words for the word: science

Futurama model: 

[('animals', 0.9976933598518372), ('form', 0.9974814057350159), ('straight', 0.9962860941886902), ('dignity', 0.9955741763114929), ('eternity', 0.9955719709396362), ('destroyed', 0.995414137840271), ('parts', 0.9952705502510071), ('sleeping', 0.9948841333389282), ('episode', 0.9948302507400513), ('tail', 0.9948232173919678)]


American Dad model: 

[('post', 0.9672064185142517), ('points', 0.9643006324768066), ('patio', 0.9625817537307739), ('finest', 0.959784984588623), ('legal', 0.9576351642608643), ('market', 0.956371009349823), ('trees', 0.9562504291534424), ('replaced', 0.9558205604553223), ('backup', 0.9555937051773071), ('uniform', 0.955528736114502)]


Family Guy model: 

[('custody', 0.9071251153945923), ('certificate', 0.9017797112464905), ('chevy', 0.8972327709197998), ('grade', 0.8955509662628174), ('profits', 0.8922510743141174), ('tradition', 0.8919973373413086), ('ticking', 0.891424298286438), ('elsewhere', 0.8908125

Finally, the words "dad" and "family" are both highly weighted in the logistic regression for both the *Futurama*-*The Simpsons* and the *South Park*-*The Simpsons* classification models. Because of this high weighting, I decided to investigate the similar words to both "dad" and "family" in the word2vec models of all five shows in the dataset. Interestingly, every show except *Futurama* has nearly identical similar words for "dad" and "family". These similar words include "life", "marriage", "mother", "child" or "friends". However, *Futurama* breaks from the pack, possessing similar words like "store", "date", "liquor", "crew" and "class". There is again computational evidence for Matt Groening's and David X. Cohen's different writing in *Futurama* versus *The Simpsons*, but in this case, there is also computational evidence that the vocabulary and themes of *Futurama* break from the genre of animated adult comedy TV shows as a whole. Almost every show in the genre is inundated with familial themes, and a show that breaks from these themes is the exception rather than the rule. Before googling a full list of shows in the genre, my dad, my brother and I tried to think of popular animated adult comedy TV shows that do not center around a familial unit or do not use a familial relationship to move the plot of many episodes across the length of the show's run. The only examples we could think of were *Futurama* and *BoJack Horseman*. I know many episodes of season 4 of *BoJack Horseman* center around BoJack's mother and his childhood, but compared to shows like *Archer*, the familial themes of the show are much more sporadic. Of course, this is anecdotal evidence based on three men with similar TV viewing habits, but it is nevertheless interesting that *Futurama* breaks from this convention despite Matt Groening having success with a family-based show in *The Simpsons*. 

After researching this further, I found an interesting [analysis](https://justtv.files.wordpress.com/2007/03/mittell_simpsons.pdf) written by Jason Mittell titled "Cartoon Realism: Genre Mixing and the Cultural Life of *The Simpsons*". In this paper, Mittell argues that using a familial unit as the basis for a TV comedy is relatively generic and uncontroversial in nature. This enables the TV show to be more successful, especially on network television, because of its much wider appeal. I agree with this analysis, but I do think that the emergence of streaming services allows shows with less of a broad appeal to see success. Mittell's article was written in 2007, so of course he could not predict this trend, but perhaps, in the coming decades, the structure of a familial sitcom will become less of a norm in animated adult comedy. The deeper investigation that word2vec enabled highlights its usefulness in supporting close reading and analysis of themes that may be overlooked when watching a show. 

In [42]:
print_similar_words("family")

Similar words for the word: family

Futurama model: 

[('against', 0.9983259439468384), ('whose', 0.9980803728103638), ('wherever', 0.9980451464653015), ('warming', 0.9979935884475708), ('cause', 0.9979621767997742), ('large', 0.9979024529457092), ('smaller', 0.9977759122848511), ('judge', 0.9977691173553467), ('mutant', 0.9977593421936035), ('eyes', 0.9977487325668335)]


American Dad model: 

[('house', 0.816673219203949), ('life', 0.8013492226600647), ('friends', 0.7967162132263184), ('party', 0.7960283756256104), ('chance', 0.7941237688064575), ('husband', 0.7878105640411377), ('child', 0.7807456254959106), ('room', 0.7767513990402222), ('wife', 0.7750695943832397), ('money', 0.7724897861480713)]


Family Guy model: 

[('child', 0.5585479736328125), ('story', 0.5509899854660034), ('dog', 0.5475580096244812), ('new', 0.5388461947441101), ('body', 0.5280017852783203), ('book', 0.5255934596061707), ('girl', 0.5255507826805115), ('star', 0.5248533487319946), ('guy', 0.5237522721290588)
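
The claim that several shows have "nearly identical" similar words can also be checked numerically. The snippet below is only a minimal sketch and is not part of the original analysis: `neighbor_overlap` and `pairwise_overlaps` are hypothetical helper names, the show names and word lists are made-up toy data, and the Jaccard index (shared words divided by all distinct words) is just one possible overlap measure. It assumes each model's neighbors arrive as a list of `(word, similarity)` pairs, the shape printed in the output above.

```python
from itertools import combinations


def neighbor_overlap(neighbors_a, neighbors_b):
    """Jaccard overlap between two lists of (word, similarity) pairs.

    Only the words are compared; the similarity scores are ignored.
    """
    words_a = {word for word, _ in neighbors_a}
    words_b = {word for word, _ in neighbors_b}
    union = words_a | words_b
    # Two empty lists share nothing, so define their overlap as 0.0.
    return len(words_a & words_b) / len(union) if union else 0.0


def pairwise_overlaps(neighbors_by_show):
    """Map every pair of show names to the overlap of their neighbor lists."""
    return {
        (a, b): neighbor_overlap(neighbors_by_show[a], neighbors_by_show[b])
        for a, b in combinations(sorted(neighbors_by_show), 2)
    }


# Hypothetical toy data, NOT taken from the models in this notebook.
toy_neighbors = {
    "Show A": [("life", 0.80), ("child", 0.78), ("wife", 0.77)],
    "Show B": [("life", 0.75), ("child", 0.70), ("house", 0.69)],
    "Show C": [("crew", 0.90), ("store", 0.88), ("date", 0.85)],
}

overlaps = pairwise_overlaps(toy_neighbors)
for pair, score in overlaps.items():
    print(pair, round(score, 2))
```

In this toy data, "Show A" and "Show B" share two of four distinct words, while "Show C" shares none with either. A low score for one show against all the others would be the numerical counterpart of the observation that *Futurama* "breaks from the pack".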

### Conclusion
This project adds to the ongoing discussion about adult animated comedy shows and aims to clarify what distinguishes the most popular shows in the genre over the last three decades. However, the analysis has limitations. First, as I discussed in the data section above, a few shows, including *Archer* and *King of the Hill*, are not included in the dataset because of copyright claims filed by the shows' production studios. Second, if a show is still releasing new episodes, its most recent season does not have transcripts available. For example, "The Pandemic Special" of *South Park* does not have a transcript available because it was released recently. Additionally, as stated in the "Relevant Studies" section above, TV show genres and their writing are constantly changing, so many conclusions could be altered or even reversed as consumer preferences change or streaming services become more important to the production of television shows. 

Analyzing Matt Groening's and Seth MacFarlane's writing across different shows was intriguing. Going forward, I would like to see how the same comedy writers change over time or write differently for two shows that are running at the same time. Similar analyses could also be done for other TV show genres or subsections of genres. For example, one could compare the writing of multi-cam sitcoms such as *The Big Bang Theory* and *2 Broke Girls* with that of single-cam sitcoms such as *Modern Family* and *Silicon Valley*.