# Metadata

```yaml
course:   DS 5001 
module:   Final Project
topic:    Exploratory Text Analytics using Television Sitcom Scripts Final Report
author:   Eric Tria
date:     5 May 2023
```

# Student Info

```yaml
name:     Eric Tria
user_id:  emt4wf
email:    emt4wf@virginia.edu
```

# Project: Exploratory Text Analytics using Television Sitcom Scripts

# 1. Introduction

Television shows have been around for decades and are part of many people's daily routines. One interesting thing to notice is how shows have changed over the years. For this project, I focus on sitcoms, or situational comedies. In earlier years, sitcoms were known for being filmed in front of a live audience whose laughter and reactions were recorded. More recently, however, single-camera comedies have become more popular. Single-camera comedies are filmed without a live audience, and their scripts are more similar to a feature comedy [[3]](https://screencraft.org/blog/differences-single-camera-multi-camera-tv-pilot-scripts/). Due to these differences, I focus on the episode scripts of three popular single-camera sitcoms from the 2000s and 2010s: Parks and Recreation (2009-2015), Brooklyn Nine-Nine (2013-2021), and The Office US (2005-2013). I selected these three shows because they revolve around characters in various workplaces: a government office, a police precinct, and a corporate office. Given these three popular shows, are there similarities between their scripts in terms of general topics, language used, and sentiments? To answer this question, I analyzed the scripts of all the episodes of these shows.

The corpus used for this project contains episode scripts for all the seasons of each show: 7 seasons for Parks and Recreation, 8 for Brooklyn Nine-Nine, and 9 for The Office US. A sitcom episode usually runs between 20 and 30 minutes, so each episode script is relatively short compared to a full movie. Sitcoms usually follow a three-act structure [[2]](https://www.masterclass.com/articles/how-to-write-a-sitcom) where each act consists of 3-5 scenes [[1]](https://www.theatlantic.com/entertainment/archive/2014/12/cracking-the-sitcom-code/384068/). In total, this provided me with a large corpus from which to extract insights that can help answer the question I posed.

# 2. Data

### 2.1 Provenance

The episode scripts were programmatically scraped from [Subslikescript.com](https://subslikescript.com/), a website that hosts scripts for a wide range of shows, including the three that I wanted to use for my project. The links I used were:
- [Parks and Recreation](https://subslikescript.com/series/Parks_and_Recreation-1266020)
- [Brooklyn Nine-Nine](https://subslikescript.com/series/Brooklyn_Nine-Nine-2467372)
- [The Office US](https://subslikescript.com/series/The_Office-386676) 

To do this, I wrote a Python class for the web scraping, located in `/lib/script_scraper.py`. This class takes a Subslikescript link and converts it into a corpus of lines and tokens.

In addition to the scripts, I extracted metadata from Wikipedia to enrich each episode. I was able to extract the episode title, director, writer, air date, and total US viewers in millions for each episode. For each season, I was also able to get the Rotten Tomatoes rating, which shows how critics view these shows. I wrote a second Python script to scrape the data from Wikipedia, located in `/lib/wiki_scraper.py`, which takes a link and converts the details into a tabular format. This information was extracted from the following links:
- Parks and Recreation: [Wikipedia page](https://en.wikipedia.org/wiki/Parks_and_Recreation) and [episode list](https://en.wikipedia.org/wiki/List_of_Parks_and_Recreation_episodes)
- Brooklyn Nine-Nine: [Wikipedia page](https://en.wikipedia.org/wiki/Brooklyn_Nine-Nine) and [episode list](https://en.wikipedia.org/wiki/List_of_Brooklyn_Nine-Nine_episodes)
- The Office US: [Wikipedia page](https://en.wikipedia.org/wiki/The_Office_(American_TV_series)) and [episode list](https://en.wikipedia.org/wiki/List_of_The_Office_(American_TV_series)_episodes)

The process of using these scripts to generate the corpus is available in a Jupyter notebook called `EXTRACT_DATA.ipynb`.

### 2.2 Location

The raw scraped data can be accessed in the `/raw` folder of the repository. The different files are:
- Parks and Recreation:
    - `/raw/PNR-LINES.csv`
    - `/raw/PNR-TOKENS.csv`
    - `/raw/PNR-LIB.csv`
    - `/raw/PNR-LIB-EPS.csv`
- Brooklyn Nine-Nine:
    - `/raw/B99-LINES.csv`
    - `/raw/B99-TOKENS.csv`
    - `/raw/B99-LIB.csv`
    - `/raw/B99-LIB-EPS.csv`
- The Office US:
    - `/raw/OFFICE-LINES.csv`
    - `/raw/OFFICE-TOKENS.csv`
    - `/raw/OFFICE-LIB.csv`
    - `/raw/OFFICE-LIB-EPS.csv`

I have also provided files that combine all three series in the `/final` folder of the repository. The different files are:
- `/final/ALL-CORPUS.csv`
- `/final/ALL-LIB.csv`
- `/final/ALL-LIB-EPS.csv`

### 2.3 Description

In general, the data is split into the following levels: series, seasons, episodes, scenes, lines, and tokens. Since the data source did not have explicit scene tagging, I relied on the structure discussed earlier, where each sitcom has 3 acts with up to 5 scenes each. For this project, I used an arbitrary number of 15 scenes per episode.
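The report does not say how lines were assigned to the 15 scenes, so the following is only a minimal sketch under the assumption that an episode's lines are split, in order, into 15 roughly equal chunks. The function name `assign_scene_ids` is hypothetical and is not taken from `/lib/script_scraper.py`.

```python
def assign_scene_ids(n_lines, n_scenes=15):
    """Return a scene number (1..n_scenes) for each of n_lines consecutive lines.

    Assumption: scenes are equal-sized, ordered chunks of the episode.
    """
    # Integer arithmetic spreads the lines as evenly as possible across scenes.
    return [(i * n_scenes) // n_lines + 1 for i in range(n_lines)]


# A 300-line episode gets 20 lines per scene.
scene_ids = assign_scene_ids(300)
print(scene_ids[0], scene_ids[-1], scene_ids.count(1))
```

With this scheme every scene is non-empty as long as the episode has at least 15 lines.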

#### Files under `/raw`
The raw version of the corpus called `LINES` contains lines for each script. This table is formatted as follows:

| Column     | Type    | Description          |
|------------|---------|----------------------|
| series_id  | Index   | ID for the TV series |
| season_id  | Index   | Season number        |
| episode_id | Index   | Episode number       |
| scene_id   | Index   | Scene number         |
| line_id    | Index   | Line number          |
| line       | Feature | Line contents        |

The tokenized version of the corpus called `TOKENS` contains the script broken down into individual tokens. This table is formatted as follows:

| Column     | Type    | Description          |
|------------|---------|----------------------|
| series_id  | Index   | ID for the TV series |
| season_id  | Index   | Season number        |
| episode_id | Index   | Episode number       |
| scene_id   | Index   | Scene number         |
| line_id    | Index   | Line number          |
| token_id   | Index   | Token number         |
| token_str  | Feature | String token         |
| term_str   | Feature | Cleaned string term  |

The `LIB` tables contain additional information on a season level:

| Column           | Type    | Description                              |
|------------------|---------|------------------------------------------|
| series_id        | Index   | ID for the TV series                     |
| season_id        | Index   | Season number                            |
| num_episodes     | Feature | Number of episodes in the season         |
| year             | Feature | TV season year of the season             |
| viewers_millions | Feature | Average number of US viewers in millions |
| rt_rating        | Feature | Rotten Tomatoes rating                   |
| series_name      | Feature | Series name                              |

The `LIB-EPS` tables contain additional information on an episode level:

| Column        | Type    | Description                       |
|---------------|---------|-----------------------------------|
| series_id     | Index   | ID for the TV series              |
| season_id     | Index   | Season number                     |
| episode_id    | Index   | Episode number                    |
| episode_title | Feature | Title of the episode              |
| director      | Feature | Name of the episode's director    |
| first_writer  | Feature | Name of the episode's main writer |
| date          | Feature | Date when the episode aired       |
| us_viewers    | Feature | Number of US viewers in millions  |

#### Files under `/final`

The `ALL-CORPUS.csv` file contains the same details as the raw `TOKENS` files but with additional features:

| Column     | Type    | Description          |
|------------|---------|----------------------|
| pos_tuple  | Feature   | Token and its tagged part of speech |
| pos        | Feature   | Tagged part of speech       |

The `ALL-LIB.csv` file contains the same details as the raw `LIB` files but with additional engineered features:

| Column     | Type    | Description          |
|------------|---------|----------------------|
| series_season  | Feature   | Combined label using series name and season number |
| series_rotten_tomatoes        | Feature   | Category if rating is 90+ or not      |

The `ALL-LIB-EPS.csv` file contains the same details as the raw `LIB-EPS` files but with additional engineered features:

| Column     | Type    | Description          |
|------------|---------|----------------------|
| series_season_ep  | Feature   | Combined label using series name, season number, and episode number |
| year        | Feature   | Year when the episode aired      |

In terms of sizes, the breakdown is as follows:
- `ALL-CORPUS` has a total of **1,563,801** tokens
- `ALL-LIB` has a total of **24** seasons
- `ALL-LIB-EPS` has a total of **479** episodes

### 2.4 Format

All of the files listed above are in CSV format. The sources, however, were originally HTML since they were scraped from web pages. The HTML was transformed into text using BeautifulSoup's LXML parser, and that text was then converted into the CSV files.
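The HTML-to-CSV step can be illustrated with a small sketch. Note the substitution: to stay runnable without extra packages, it uses Python's standard-library `html.parser` instead of BeautifulSoup with the LXML parser, so it is not the code in `/lib/script_scraper.py`. The sample HTML and the helper names (`TextExtractor`, `html_to_lines`) are made up for illustration.

```python
import csv
import io
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text chunks, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_lines(html):
    """Return the text chunks of an HTML page, one per list element."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Made-up page standing in for a scraped script page.
sample_html = (
    "<html><body><div>Hello, Leslie.<br>Hi, Ron.</div>"
    "<script>var x = 1;</script></body></html>"
)
lines = html_to_lines(sample_html)

# Write the lines to an in-memory CSV with a line_id column.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["line_id", "line"])
for line_id, line in enumerate(lines):
    writer.writerow([line_id, line])
print(buffer.getvalue())
```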

# 3. Models

Using this source data, I proceeded to create additional models to aid in my analysis of the corpus. These models and algorithms included the following: term frequency, hierarchical clustering, principal component analysis, topic modelling, word embeddings, and sentiment analysis.

### 3.0 OHCO Explanation

The OHCO or Ordered Hierarchy of Content Objects that I used for this corpus is:
- Series
- Seasons
- Episodes
- Scenes
- Lines
- Tokens

For my analysis, I treat these OHCO levels similarly to the usual OHCO levels for books:
- Seasons used similarly to Books
- Episodes used similarly to Chapters
- Scenes used similarly to Paragraphs
- Lines used similarly to Sentences
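To make the role of the OHCO concrete, here is a minimal sketch of how a token table indexed by the full hierarchy can be collapsed into bags of words at any level by truncating the index. The toy tokens and the `bag_of_words` helper are illustrative assumptions, not the project's actual code, and the `"PNR"` series ID is only a placeholder value.

```python
from collections import Counter

OHCO = ["series_id", "season_id", "episode_id", "scene_id", "line_id", "token_id"]


def bag_of_words(tokens, level):
    """Count terms per document, where a document is defined by an OHCO level."""
    depth = OHCO.index(level) + 1
    bags = {}
    for index, term in tokens:
        doc_key = index[:depth]  # keep only the index parts down to `level`
        bags.setdefault(doc_key, Counter())[term] += 1
    return bags


# Toy tokens: (full OHCO index, term_str)
tokens = [
    (("PNR", 1, 1, 1, 0, 0), "parks"),
    (("PNR", 1, 1, 1, 0, 1), "department"),
    (("PNR", 1, 1, 2, 5, 0), "parks"),
    (("PNR", 1, 2, 1, 0, 0), "pit"),
]
episode_bags = bag_of_words(tokens, "episode_id")
scene_bags = bag_of_words(tokens, "scene_id")
print(len(episode_bags), len(scene_bags))
```

Choosing a coarser level (episodes instead of scenes) produces fewer, larger documents, which is why it reduces memory use in the later models.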

### 3.1 VOCAB / Term Frequency

The first model that I created uses term frequency and other metrics derived from it. This makes up the `VOCAB` table for my corpus. This table highlights the TFIDF or Term Frequency - Inverse Document Frequency, which is the frequency of a word in a document weighted by that word's information value over the corpus. The included fields are summarized as follows:

| Column       | Type    | Description                                                 |
|--------------|---------|-------------------------------------------------------------|
| term_str     | Index   | String term                                                 |
| n            | Feature | Term frequency                                              |
| n_chars      | Feature | Term length                                                 |
| p            | Feature | Term probability                                            |
| i            | Feature | Term information                                            |
| h            | Feature | Term entropy                                                |
| max_pos      | Feature | Most frequently associated part-of-speech character         |
| n_pos        | Feature | POS ambiguity                                               |
| cat_pos      | Feature | POS ambiguity                                               |
| stop         | Feature | Dummy encoded variable if term is part of NLTK's stop words |
| tfidf_mean   | Feature | Mean value of term TFIDF                                    |
| tfidf_median | Feature | Median value of term TFIDF                                  |
| tfidf_max    | Feature | Max value of term TFIDF                                     |
| dfidf        | Feature | Document frequency                                          |

For the whole corpus, TFIDF and DFIDF values were computed using **EPISODES** as the OHCO level due to memory constraints. This file is located at:
- `/final/ALL-VOCAB-EPS.csv`
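As an illustration of the two measures, the sketch below computes TFIDF and DFIDF on toy episode bags. It assumes raw counts for term frequency and `log2(N / df)` for the inverse document frequency; the report does not state which variants `lib/corpus_enhancer.py` uses, so treat these choices as assumptions.

```python
import math
from collections import Counter


def tfidf(bags):
    """bags maps a document ID to a Counter of term counts.

    Returns (tfidf_values, dfidf), assuming TF = raw count and
    IDF = log2(number of documents / document frequency).
    """
    n_docs = len(bags)
    df = Counter()
    for counts in bags.values():
        df.update(counts.keys())  # each document counts once per term
    idf = {term: math.log2(n_docs / d) for term, d in df.items()}
    tfidf_values = {
        doc: {term: n * idf[term] for term, n in counts.items()}
        for doc, counts in bags.items()
    }
    dfidf = {term: d * idf[term] for term, d in df.items()}
    return tfidf_values, dfidf


# Toy episode-level bags.
bags = {
    "ep1": Counter({"paper": 3, "time": 1}),
    "ep2": Counter({"precinct": 2, "time": 2}),
}
scores, dfidf = tfidf(bags)
print(scores["ep1"], dfidf)
```

A term that appears in every document, like `time` here, gets an IDF of zero and therefore contributes nothing to either measure.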

Additionally, I created separate versions of the VOCAB table for each TV series with the TFIDF and DFIDF values computed using **SCENES** as the OHCO level. These files are located at:
- `/final/PNR-VOCAB-SCENES.csv`
- `/final/B99-VOCAB-SCENES.csv`
- `/final/OFFICE-VOCAB-SCENES.csv`

I created these VOCAB tables using the functions from the `lib/corpus_enhancer.py` Python class, as demonstrated in `EXTRACT_DATA.ipynb`.

### 3.2 Hierarchical Clustering

To create the hierarchical clustering model, I used the 1000 most significant terms in the VOCAB table, using DFIDF as the significance measure. I also filtered the terms to those whose most frequent part-of-speech tag belongs to `NN NNS VB VBD VBG VBN VBP VBZ JJ JJR JJS RB RBR RBS`.

I then created three normalization levels: Binary, Probabilistic, and Pythagorean/Euclidean. Next, I computed pairwise distances using the following metrics: City block, Cosine, Euclidean, Jaccard, and Jensen-Shannon. I ran the clustering to compare seasons, and the final table is as follows:

| Column       | Type    | Description                                                 |
|--------------|---------|-------------------------------------------------------------|
| doc_a     | Index   | First document formatted as (series, season)           |
| doc_b            | Index | Second document formatted as (series, season) |
| cityblock      | Feature | City block distance                                    |
| cosine            | Feature | Cosine distance                                     |
| euclidean            | Feature | Euclidean distance                                     |
| jaccard            | Feature | Jaccard distance                                     |
| js            | Feature | Jensen-Shannon distance                                     |

This file is located at `/final/clustering/ALL-PAIRS.csv`. This process is demonstrated in `TEXT_ANALYSIS.ipynb`.
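For reference, the five distance measures can be written out in plain Python. This is a sketch of the standard definitions applied to made-up term-count vectors, not the notebook's implementation, and it does not reproduce which normalization level was paired with which metric. Jensen-Shannon is computed here as a distance (the square root of the base-2 divergence), which is an assumption about the variant used.

```python
import math
from itertools import combinations


def cityblock(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)


def jaccard(a, b):
    """Jaccard distance on term presence/absence."""
    either = sum(1 for x, y in zip(a, b) if x > 0 or y > 0)
    both = sum(1 for x, y in zip(a, b) if x > 0 and y > 0)
    return 1 - both / either


def jensen_shannon(a, b):
    """Jensen-Shannon distance between the two normalized count vectors."""
    p = [x / sum(a) for x in a]
    q = [y / sum(b) for y in b]
    m = [(x + y) / 2 for x, y in zip(p, q)]

    def kl(u, v):
        return sum(x * math.log2(x / y) for x, y in zip(u, v) if x > 0)

    return math.sqrt((kl(p, m) + kl(q, m)) / 2)


# Made-up count vectors keyed by (series, season).
docs = {
    ("PNR", 7): [2, 0, 1, 1],
    ("B99", 1): [1, 1, 0, 2],
    ("OFFICE", 9): [0, 3, 1, 0],
}
pairs = [
    {
        "doc_a": a,
        "doc_b": b,
        "cityblock": cityblock(docs[a], docs[b]),
        "cosine": cosine(docs[a], docs[b]),
        "euclidean": euclidean(docs[a], docs[b]),
        "jaccard": jaccard(docs[a], docs[b]),
        "js": jensen_shannon(docs[a], docs[b]),
    }
    for a, b in combinations(docs, 2)
]
print(pairs[0])
```

Each element of `pairs` has the same columns as the table above, one row per unordered pair of seasons.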

### 3.3 Principal Component Analysis (PCA)

The next model I created was for principal component analysis. For this model, I limited the VOCAB to nouns and plural nouns: `NN` and `NNS`. However, I noticed that in my corpus, some of the proper nouns were tagged as `NN`. To fix this, I did additional filtering on the VOCAB table to remove the names of the main and recurring characters of all 3 TV series. For the actual PCA model, I generated 10 components while also normalizing the input matrix. I computed the principal components using episodes as the OHCO level due to memory constraints. There are two tables saved for PCA.

The first one is a table of documents and components:

| Column       | Type    | Description                                                 |
|--------------|---------|-------------------------------------------------------------|
|series_id|Index|ID of the series|
|season_id|Index| Season number|
|episode_id|Index|Episode number|
|PC0 to PC9 (10 columns)| Features | Associated values of the principal components|
|LIB-EPS details (multiple columns)| Features | Additional details joined from the ALL-LIB-EPS table|

This table is located at `/final/pca/ALL-DOC-COMPS.csv`.

The second table is a table of components and word counts (i.e., the “loadings”):

| Column       | Type    | Description                                                 |
|--------------|---------|-------------------------------------------------------------|
|term_str|Index| String term|
|PC0 to PC9 (10 columns) | Features | Associated values of the principal components|
|n|Feature| Word count|

This table is located at `/final/pca/ALL-LOADINGS.csv`.

To conduct further exploration, I also generated separate principal components for each TV series. The files are also located at `/final/pca`. This process is demonstrated in `TEXT_ANALYSIS.ipynb`.
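The core of the PCA step can be sketched without any external library. The example below L2-normalizes each document row, builds a term-by-term covariance matrix, and finds only the leading component by power iteration. This is a simplified stand-in: the project generated 10 components, which would need deflation or a linear-algebra library, and the helper names, the `center` flag, and the toy counts are all assumptions for illustration.

```python
import math
import random


def l2_normalize_rows(matrix):
    """Scale each document row to unit length (document normalization)."""
    out = []
    for row in matrix:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        out.append([v / norm for v in row])
    return out


def covariance(matrix, center=False):
    """Term-by-term covariance; with center=False the column means are kept."""
    n_docs, n_terms = len(matrix), len(matrix[0])
    if center:
        means = [sum(row[j] for row in matrix) / n_docs for j in range(n_terms)]
        matrix = [[v - m for v, m in zip(row, means)] for row in matrix]
    return [
        [sum(row[i] * row[j] for row in matrix) / (n_docs - 1) for j in range(n_terms)]
        for i in range(n_terms)
    ]


def leading_component(cov, n_iter=500, seed=0):
    """Leading eigenvector and eigenvalue of cov via power iteration."""
    rng = random.Random(seed)
    vec = [rng.gauss(0, 1) for _ in cov]
    for _ in range(n_iter):
        vec = [sum(c * v for c, v in zip(row, vec)) for row in cov]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    cov_vec = [sum(c * v for c, v in zip(row, vec)) for row in cov]
    eigenvalue = sum(v * cv for v, cv in zip(vec, cov_vec))
    return vec, eigenvalue


# Toy document-term counts: 4 episodes, 3 terms.
counts = [[4, 0, 1], [3, 1, 0], [0, 5, 1], [1, 4, 0]]
docs = l2_normalize_rows(counts)
cov = covariance(docs)
component, explained = leading_component(cov)

# Project each document onto the component (one column of a doc-component table).
scores = [sum(d * c for d, c in zip(row, component)) for row in docs]
print(component, explained, scores)
```

The entries of `component` play the role of the loadings for that component, and `scores` corresponds to a single `PC` column in the document table.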

### 3.4 Topic Models (LDA)

For topic modelling, I also limited the VOCAB to nouns and plural nouns: `NN` and `NNS`. As with PCA, I removed the names of main and recurring characters. I also noticed that some filler words were not being tagged as stop words. To resolve this, I removed these filler words before generating the topics. I used the following parameters for the topic model:
- Ngram range: [1, 2]
- Max features for the count vectorizer: 4000
- Number of components for LDA: 20
- Max iterations for LDA: 5
- Number of words used to characterize a topic: 7

After a few experiments in generating topics, I found that it is also effective to ignore terms that appear in more than 50% of the documents since a lot of the words in my corpus are usual conversational words. I did this by setting the *max_df* parameter of the CountVectorizer. I generated the topic models using scenes as the OHCO level. There are two tables saved for topic modelling.

The first table is a table of document and topic concentrations:

| Column       | Type    | Description                                                 |
|--------------|---------|-------------------------------------------------------------|
|series_id|Index|ID of the series|
|season_id|Index| Season number|
|episode_id|Index|Episode number|
|scene_id|Index|Scene number|
|T00 to T19 (20 columns)|Features|Topic concentration values|

This table is located at `/final/topic_modeling/ALL-DOCS.csv`.

The second table is a table of topics and term counts:

| Column       | Type    | Description                                                 |
|--------------|---------|-------------------------------------------------------------|
|term_str|Index| String term|
|T00 to T19 (20 columns)|Features| Topic concentration values|
|n|Feature|Term count|

This table is located at `/final/topic_modeling/ALL-VOCAB.csv`. This process is demonstrated in `TEXT_ANALYSIS.ipynb`. For additional exploration, I also generated topic models using plural nouns only (`NNS`). The resulting tables are also located at `/final/topic_modeling`.
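The effect of the *max_df* setting can be shown with a small pure-Python sketch. It mimics the documented behavior of a proportional `max_df` (terms whose document frequency is strictly above the threshold are ignored) on made-up scenes; the helper `filter_by_max_df` is hypothetical and is not part of scikit-learn or of the project code.

```python
from collections import Counter


def filter_by_max_df(docs, max_df=0.5):
    """Drop terms that occur in more than max_df (a proportion) of the documents."""
    n_docs = len(docs)
    # Document frequency: count each term once per document.
    df = Counter(term for doc in docs for term in set(doc))
    keep = {term for term, d in df.items() if d / n_docs <= max_df}
    return [[term for term in doc if term in keep] for doc in docs]


# Made-up scenes, each a list of noun terms.
scenes = [
    ["time", "parks", "people"],
    ["time", "precinct", "people"],
    ["time", "paper", "sales"],
    ["time", "sales", "people"],
]
filtered = filter_by_max_df(scenes, max_df=0.5)
print(filtered)
```

Here `time` (4 of 4 scenes) and `people` (3 of 4) are removed, while `sales` (exactly half of the scenes) survives, leaving the more show-specific terms for the topic model.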

### 3.5 Word Embeddings (word2vec)

For the word embedding model, I limited the VOCAB to nouns and verbs: `'NN', 'NNS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'`. Similar to PCA and LDA, I filtered out the names of the main and recurring characters. I used the following parameters for the word2vec model:
- Window: 2
- Vector size: 256
- Minimum count: 50
- Workers: 4

The resulting table is a table of terms and embeddings:

| Column       | Type    | Description                                                 |
|--------------|---------|-------------------------------------------------------------|
|term_str|Index| String term|
|vector|Feature| word2vec vector|
|x|Feature| x coordinate|
|y|Feature| y coordinate|
|n|Feature| Term count|
|max_pos|Feature| Most frequent part-of-speech of the term|
|pos_group|Feature| Either Noun or Verb|

This table is located at `/final/word_embedding/ALL-COORDS.csv`. This process is demonstrated in `TEXT_ANALYSIS.ipynb`. For additional exploration, I also generated separate word2vec models for each TV series which are also located at `/final/word_embedding`.
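To show what the `window` and `min_count` parameters listed above control, here is a simplified sketch of how training pairs are formed from a tokenized scene. It is not the word2vec algorithm itself and omits the sampling details of real implementations; both helper functions and the sample sentence are made up for illustration.

```python
from collections import Counter


def context_pairs(sentence, window=2):
    """Yield (target, context) pairs for words at most `window` positions apart."""
    for i, target in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, sentence[j]


def apply_min_count(sentences, min_count):
    """Remove words that occur fewer than min_count times in the whole corpus."""
    counts = Counter(word for s in sentences for word in s)
    return [[w for w in s if counts[w] >= min_count] for s in sentences]


sentence = ["captain", "assigns", "detective", "case", "precinct"]
pairs = list(context_pairs(sentence, window=2))
print(len(pairs), pairs[:4])
```

With a window of 2, `captain` is paired with `assigns` and `detective` but not with `case`, so a small window emphasizes words that occur right next to each other.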

### 3.6 Sentiment Analysis

For sentiment analysis, I made use of the SALEX NRC lexicon. I applied this to my VOCAB table where sentiment values were then associated to the terms. These sentiment values were also multiplied by the term TFIDF values. I computed the sentiment analysis using episodes as the OHCO level. I was able to generate two tables for this model.

The first table is a table of sentiment and emotion values as features of each term:

| Column       | Type    | Description                                                 |
|--------------|---------|-------------------------------------------------------------|
|term_str|Index| String term|
|anger|Feature| Anger value of the term|
|anticipation|Feature| Anticipation value of the term|
|disgust|Feature| Disgust value of the term|
|fear|Feature| Fear value of the term|
|joy|Feature| Joy value of the term|
|sadness|Feature| Sadness value of the term|
|surprise|Feature| Surprise value of the term|
|trust|Feature| Trust value of the term|
|sentiment|Feature| Sentiment value of the term|

This table is located at `/final/sentiment_analysis/ALL-VOCAB-SA.csv`.

The second table is a table of sentiment polarity and emotions for each document:

| Column       | Type    | Description                                                 |
|--------------|---------|-------------------------------------------------------------|
|series_id|Index|ID of the series|
|season_id|Index| Season number|
|episode_id|Index|Episode number|
|anger|Feature| Anger value of the document|
|anticipation|Feature| Anticipation value of the document|
|disgust|Feature| Disgust value of the document|
|fear|Feature| Fear value of the document|
|joy|Feature| Joy value of the document|
|sadness|Feature| Sadness value of the document|
|surprise|Feature| Surprise value of the document|
|trust|Feature| Trust value of the document|
|sentiment|Feature| Sentiment value of the document|

This table is located at `/final/sentiment_analysis/ALL-DOC-SA.csv`. This process is demonstrated in `TEXT_ANALYSIS.ipynb`.
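The weighting step described in this section can be sketched as follows. The lexicon entries are invented placeholders rather than actual NRC values, and summing the weighted values per document is an assumption, since the report only states that sentiment values were multiplied by TFIDF.

```python
def document_sentiment(doc_tfidf, lexicon):
    """Sum lexicon values weighted by TFIDF for each emotion column."""
    totals = {}
    for term, weight in doc_tfidf.items():
        # Terms missing from the lexicon contribute nothing.
        for emotion, value in lexicon.get(term, {}).items():
            totals[emotion] = totals.get(emotion, 0.0) + value * weight
    return totals


# Hypothetical lexicon rows (not real NRC entries).
lexicon = {
    "party": {"joy": 1, "sentiment": 1},
    "arrest": {"fear": 1, "sentiment": -1},
}
# TFIDF values of one episode's terms.
doc_tfidf = {"party": 2.0, "arrest": 0.5, "paper": 3.0}
result = document_sentiment(doc_tfidf, lexicon)
print(result)
```

Weighting by TFIDF means that a sentiment-bearing term that is distinctive for an episode moves that episode's score more than one that is common across the corpus.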

# 4. Exploration

The entire code for the explorations that I did is in `TEXT_ANALYSIS.ipynb`.

The first step of my exploration was to compute TFIDF values for the entire corpus. Due to memory constraints, I computed the TFIDF at the episode level of the OHCO. Additionally, I computed the TFIDF values separately for each TV series using scenes as the OHCO level, which did fit in memory. This was the prerequisite for the other explorations I did.

### Clustering

One of the initial things that I did was to check the similarity of different seasons from these shows and if there were similar seasons across different shows. I did this using hierarchical clustering and various distance metrics: City block, Cosine, Euclidean, Jaccard, and Jensen-Shannon. I was able to visualize the results using correlation tables and hierarchical cluster diagrams. The correlation table looks like this:

![Correlation](pictures/clustering_1.png)

I also plotted the hierarchical cluster diagrams for all the different metrics.

City block:

![City Block](pictures/clustering_2.png)

Cosine:

![Cosine](pictures/clustering_3.png)

Euclidean:

![Euclidean](pictures/clustering_4.png)

Jaccard:

![Jaccard](pictures/clustering_5.png)

Jensen-Shannon:

![Jensen-Shannon](pictures/clustering_6.png)

Most of the distance metrics, except for Jaccard, showed similar results, with seasons from the same series typically clustered closer to each other. It was interesting to see that Parks and Recreation season 7 and Brooklyn Nine-Nine season 1 appeared to be similar under most of the distance metrics. One possible explanation is similarity in the writing. Side note: Michael Schur wrote (or co-wrote) 2 episodes for both [Parks and Recreation season 7](https://en.wikipedia.org/wiki/Parks_and_Recreation_(season_7)) and [Brooklyn Nine-Nine season 1](https://en.wikipedia.org/wiki/Brooklyn_Nine-Nine_(season_1))!

### Principal Component Analysis (PCA)

For PCA, I created a first model that only uses Nouns and Plural Nouns (NN and NNS). The significant parameters that I used were:
- k = 10
- norm_docs = True
- center_by_mean = False
- center_by_variance = False
- OHCO = EPISODES

I visualized this using scatter plots. It was the first and second principal components that showed distinct boundaries for each series. The first principal component looks like this:

![1st PCA](pictures/pca_1.png)

The second principal component looks like this:

![2nd PCA](pictures/pca_2.png)

I also visualized the loadings for the second principal component:

![2nd PCA LOADINGS](pictures/pca_3.png)

We can see that the different corners have distinct words that relate to each individual show: one corner has words like `detective` (Brooklyn Nine-Nine), another has `parks` and `government` (Parks and Recreation), and the third has `paper`, `company`, and `sales` (The Office US).

I also checked whether the year of an episode showed any significance in the principal components. In the first principal component there were some noticeable patterns, but they were not as easily distinguishable:

![1st PCA - Years](pictures/pca_4.png)

![1st PCA - Years, LOADINGS](pictures/pca_5.png)

I also checked whether other labels showed any distinct groups, but none were noticeable.

Seasons (2nd PC):

![2nd PC - Seasons](pictures/pca_6.png)

Rotten Tomatoes Rating (1st PC):

![1st PC - Rotten Tomatoes](pictures/pca_7.png)

Director (1st PC):

![1st PC - Director](pictures/pca_8.png)

US Viewers (1st PC):

![2nd PC - US Viewers](pictures/pca_9.png)

I also tried to check if the boundaries for directors would show more in separate principal components for each series, but they just ended up the same.

### Topic Modeling (LDA)

For topic modeling I used the following important parameters:
- ngram_range = [1,2]
- n_terms = 4000
- n_topics = 20
- max_iter = 5
- n_top_terms = 7
- max_df = 0.5
- OHCO = SCENES

I also removed additional filler words that I observed were popping up consistently when I was initially generating the models. 

The first model I created was for nouns (NN and NNS), which produced 20 topics in total. Here is a look at the topic distributions for some documents:

![Nouns Model](pictures/topic_1.png)

I was also able to check which topics were more associated with which TV series:

![Nouns Model - Association](pictures/topic_2.png)

It wasn't surprising to see the top topics for each of the TV series:
- Parks and Recreation: `T14 time, wait, look, god, need, mean, way`
- Brooklyn Nine-Nine: `T17 need, good, mean, tell, look, work, thank`
- The Office US: `T10 good, time, work, need, party, thank, way`

Most of the top topics had more generic words. This pushed me to generate a second model for my analysis. I generated a second topic model using the same parameters but with plural nouns (NNS) only:

![Plural Nouns Model](pictures/topic_3.png)

![Plural Nouns Model - Association](pictures/topic_4.png)

For this model, we were able to see more distinct words in the top topics for each of the TV series:
- Parks and Recreation: `T11 parks, minutes, lives, people, em, likes, things`
- Brooklyn Nine-Nine: `T10 police, people, sounds, thinks, men, things, kids`
- The Office US: `T00 people, people people, looks, turns, things, people things, things people` and `T01 knew, sales, things, people, smells, gotten, knew knew`

Although a lot of the topics still had generic words such as 'people' and 'things', we can also see the specific words for each series such as 'parks', 'police', and 'sales'.

### Word Embeddings (word2vec)

For the word2vec model, I used the following important parameters:
- window = 2
- vector_size = 256
- min_count = 50
- workers = 4
- OHCO = SCENES

I only used nouns and verbs to generate the model, and I was able to generate an interesting scatter plot of words:

![word2vec](pictures/word_1.png)

This had interesting word clusters such as this group of nouns containing various reactions such as laughing, chuckling, grunts, and groans:

![word2vec - Noun Group](pictures/word_2.png)

Another interesting word cluster is this group of verbs portraying different actions:

![word2vec - Verb Group](pictures/word_3.png)

Additionally, I checked for word analogies using this model. I wanted to see if this model can give insight on how genders are portrayed in workplace comedies. I tried the following analogies:

![word2vec - Analogies](pictures/word_4.png)

Interestingly, this seemed to represent the main characters of the shows. Both Brooklyn Nine-Nine (`police`) and The Office (`office, business`) mostly have male lead characters while Parks and Recreation (`department, government, parks`) has a female lead character. 

Additionally, I created separate word2vec models for each TV series using the same parameters. They showed clusters containing the main themes of the respective shows:

Parks and Recreation (`parks, government, department`)
![word2vec - Parks and Recreation](pictures/word_5.png)

Brooklyn Nine-Nine (`undercover, precinct, detectives`)
![word2vec - Brooklyn Nine-Nine](pictures/word_6.png)

The Office US (`office, branch, manager`)
![word2vec - The Office](pictures/word_7.png)

### Sentiment Analysis

For sentiment analysis, I used the SALEX NRC lexicon. I created two models, using EPISODES and SEASONS as the OHCO levels.

This is what part of the sentiment values for episodes look like:

![Sentiment - Episodes](pictures/sentiment_1.png)

The one for seasons was easier to visualize:

![Sentiment - Seasons](pictures/sentiment_2.png)

![Sentiment - Seasons, Bar](pictures/sentiment_3.png)

It was interesting to see that seasons of Brooklyn Nine-Nine had generally lower sentiment values compared to the others.

Additionally, I visualized the trends of the sentiments for each TV series. I used both seasons and episodes to do this.

Parks and Recreation (Seasons):

![PNR - Seasons](pictures/sentiment_4.png)

Parks and Recreation (Episodes):

![PNR - Episodes](pictures/sentiment_7.png)

For Parks and Recreation, it is easier to see a smooth progression when comparing seasons instead of episodes.

Brooklyn Nine-Nine (Seasons):

![B99 - Seasons](pictures/sentiment_5.png)

Brooklyn Nine-Nine (Episodes):

![B99 - Episodes](pictures/sentiment_8.png)

Similarly for Brooklyn Nine-Nine, the episode sentiment visualization was more erratic compared to the one for seasons.

The Office (Seasons):

![OFFICE - Seasons](pictures/sentiment_6.png)

The Office (Episodes):

![OFFICE - Episodes](pictures/sentiment_9.png)

For The Office, it was interesting to see that there was an episode where all the emotion values peaked but it wasn't as evident when I visualized seasons.

# 5. Conclusion

This project explored many interesting aspects of a corpus of sitcom scripts. Even with three different TV series combined into one corpus, the analysis showed promising results.

For hierarchical clustering, we saw that consecutive seasons tend to be more related to each other, which would make sense since the storyline of the entire show continues to flow. There are also instances where some seasons are more separated from the rest of the show. One interesting example is the last two seasons of The Office (Season 8 and 9). The Office famously had main characters leave the show for its last few seasons. This change in character make-up could also play a role in this clustering. As discussed earlier, there are also cases where some seasons from other shows are more related to each other. Factors such as writing style could possibly play into this.

For the principal components analysis, we saw that distinct words from the different shows tend to define the groups in the principal components. This makes sense since each TV series would have its own set of main themes. The same thing can be observed from the topic modeling and word embedding analyses that I conducted for this project, where distinct words from each series showed up in the respective models. It was a bit harder to generate the topic models, however, due to a lot of filler and generic words being used in the scripts. This can be attributed to the conversational nature of the language used in these scripts since sitcoms tend to portray day-to-day situations. 

For sentiment analysis, I was not able to see any distinct trends besides some sporadic dips and peaks for some episodes. This may suggest that the writers of each show tend to write episodes in a similar way in terms of sentiment throughout all the seasons.

Overall, I was able to see how these 3 famous workplace sitcoms (Parks and Recreation, Brooklyn Nine-Nine, The Office) were related to each other in terms of their episode scripts. Although there were some similarities, this exploration mostly showed how each show is distinct from the rest. This uniqueness may help explain why these 3 shows are beloved by their fans and viewers everywhere. I believe that this project can be used as an exploratory framework for other similar projects. Maybe comparing a single-camera sitcom to a multi-camera sitcom would yield interesting results, or even comparing shows from drastically different decades or eras - there are a lot of exciting possibilities. This project has shown me how text analytics can provide valuable insight into various corpora.

# Additional References

[1] Charney, N. 2014. "Cracking the Sitcom Code." The Atlantic. Retrieved May 1, 2023. https://www.theatlantic.com/entertainment/archive/2014/12/cracking-the-sitcom-code/384068/

[2] MasterClass. 2022. "How to Write a Sitcom: Sitcom Writing Guide." MasterClass. Retrieved May 1, 2023. https://www.masterclass.com/articles/how-to-write-a-sitcom

[3] Miyamoto, K. 2023. "Single-Camera vs Multi-Camera TV Sitcom Scripts: What's the Difference?" ScreenCraft. Retrieved May 1, 2023. https://screencraft.org/blog/differences-single-camera-multi-camera-tv-pilot-scripts/
