In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LogisticRegressionCV
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from helpers.helper_functions import *

# Handling text 2 exercise

The Sheldon Cooper we all know and love (OK, some of us might not know him, and some might not love him) from the TV series "The Big Bang Theory" has gotten into an argument with Leonard from the same show. Sheldon insists that he knows the show better than anyone, and keeps making various claims about it, which neither of them knows how to prove or disprove. The two of them have reached out to you, ladies and gentlemen, as data scientists, to help them. You will be given the full script of the series, with information on the episode, the scene, the person saying each dialogue line, and the dialogue lines themselves.

Leonard has challenged several of Sheldon's claims about the show, and throughout this exam you will see some of those and you will get to prove or disprove them, but remember: sometimes, we can neither prove a claim, nor disprove it!

## Task A: Picking up the shovel

**Note: You will use the data you preprocess in this task in all the subsequent ones.**

Our friends' argument concerns the entire show. We have given you a file in the `data/` folder that contains the script of every single episode. New episodes are indicated by '>>', new scenes by '>', and the rest of the lines are dialogue lines. Some lines are said by multiple people (for example, lines indicated by 'All' or 'Together'); **you must discard these lines**, for the sake of simplicity. However, you do not need to do it for Q1 in this task -- you'll take care of it when you solve Q2.

**Q1**. Your first task is to extract all lines of dialogue in each scene and episode, creating a dataframe where each row has the episode and scene where a dialogue line was said, the character who said it, and the line itself. You do not need to extract the proper name of the episode (e.g. episode 1 can appear as "Series 01 Episode 01 - Pilot Episode", and doesn't need to appear as "Pilot Episode"). Then, answer the following question: In total, how many scenes are there in each season? We're not asking about unique scenes; the same location appearing in two episodes counts as two scenes. You can use a Pandas dataframe with a season column and a scene count column as the response.

**Note: The data refers to seasons as "series".**

In [None]:
# your code goes here
import pandas as pd

EPISODE_MARK = '>>'
SCENE_MARK = '>'

with open('./data/all_scripts.txt', 'r') as f:
    script = f.read()

# Columns: Episode (name), Scene (description), Character, Dialogue line
rows = []
episode = ""
scene = ""
for line in script.strip().split('\n'):
    # Check '>>' before '>' because every episode line also starts with '>'
    if line.startswith(EPISODE_MARK):
        episode = line[len(EPISODE_MARK):]
    elif line.startswith(SCENE_MARK):
        scene = line[len(SCENE_MARK):]
    elif ':' in line:
        character, dialogue = line.split(':', 1)
        # One inner list per row, matching the four column names below
        rows.append([episode, scene, character.strip(), dialogue.strip()])
    # Lines without a "Character: text" separator (e.g. blank lines) are skipped

# Build the dataframe once; concatenating inside the loop is quadratic
df = pd.DataFrame(rows, columns=['Episode', 'Scene', 'Character', 'Dialogue'])

print(df.head(20).to_string())

In [None]:
SERIES_PREFIX = 'Series 01'

# Count scene markers directly from the script, so that back-to-back scenes with
# the same description (and scenes without any dialogue) are still counted.
scene_counts = {}
series = ""
for line in script.strip().split('\n'):
    if line.startswith(EPISODE_MARK):
        # Season label such as "Series 01"; strip() guards against a leading space
        series = line[len(EPISODE_MARK):].strip()[:len(SERIES_PREFIX)]
        scene_counts.setdefault(series, 0)
    elif line.startswith(SCENE_MARK):
        scene_counts[series] = scene_counts.get(series, 0) + 1

scene_df = pd.DataFrame(list(scene_counts.items()), columns=['Series', 'Scene Count'])

print(scene_df.to_string())

**Q2**. Now, let's define two sets of characters: all the characters, and recurrent characters. Recurrent characters are those who appear in more than one episode. For the subsequent sections, you will need to have a list of recurrent characters. Assume that there are no two _named characters_ (i.e. characters who have actual names and aren't referred to generically as "little girl", "grumpy grandpa", etc.) with the same name, i.e. there are no two Sheldons, etc. Generate a list of recurrent characters who have more than 90 dialogue lines in total, and then take a look at the list you have. If you've done this correctly, you should have a list of 20 names. However, one of these is clearly not a recurrent character. Manually remove that one, and print out your list of recurrent characters. To remove that character, pay attention to the _named character_ assumption we gave you earlier on. **For all the subsequent questions, you must only keep the dialogue lines said by the recurrent characters in your list.**
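
The two conditions in this question (more than one episode, more than a minimum number of lines) can be illustrated on a toy example. The sketch below uses only the standard library; `toy_lines`, `recurrent` and the threshold of 1 are made up for illustration and are not part of the exam data or helpers.

```python
from collections import defaultdict

# Toy (episode, character) pairs, one per dialogue line; purely illustrative.
toy_lines = [
    ("E1", "Sheldon"), ("E1", "Sheldon"), ("E1", "Waiter"),
    ("E2", "Sheldon"), ("E2", "Leonard"), ("E3", "Leonard"),
]

def recurrent(lines, min_lines):
    """Characters seen in more than one episode with more than min_lines lines."""
    line_count = defaultdict(int)
    episodes = defaultdict(set)
    for episode, character in lines:
        line_count[character] += 1
        episodes[character].add(episode)  # a set ignores repeated episodes
    return sorted(c for c in line_count
                  if len(episodes[c]) > 1 and line_count[c] > min_lines)

toy_recurrent = recurrent(toy_lines, min_lines=1)
print(toy_recurrent)
```

On the real data the threshold would be 90, and a generic label shared by many unnamed characters can still pass both filters, which is why the manual check is needed.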

In [None]:
# your code goes here
# Per character: total dialogue lines and number of distinct episodes, e.g.
#   Sheldon -> lines: 300000, episodes: 200   (illustrative numbers only)
all_characters = df.groupby('Character', sort=False).agg(
    lines=('Dialogue', 'size'),
    episodes=('Episode', 'nunique'),
)

# Recurrent: appears in more than one episode and has more than 90 lines in total
recurrent = all_characters[(all_characters['lines'] > 90) & (all_characters['episodes'] > 1)]

print(f"Recurrent characters: {len(recurrent)}")
print(recurrent.to_string())

# Simplify to a list of names and remove 'Man', a generic label rather than a named character
recurrent_characters = [name for name in recurrent.index if name != 'Man']

print()
print(recurrent_characters, len(recurrent_characters))

## Task B: Read the scripts carefully

### Part 1: Don't put the shovel down just yet

**Q3**. From each dialogue line, replace punctuation marks (listed in the EXCLUDE_CHARS variable provided in `helpers/helper_functions.py`) with whitespaces, and lowercase all the text. **Do not remove any stopwords, leave them be for all the questions in this task.**
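
As a small illustration of this cleaning step, here is a sketch on a single line. `TOY_EXCLUDE_CHARS` is a hypothetical stand-in, since the real `EXCLUDE_CHARS` lives in `helpers/helper_functions.py` and its contents are not shown here; `clean_line` is likewise an illustrative name.

```python
import re

# Hypothetical stand-in for EXCLUDE_CHARS (the real set is defined in the helpers).
TOY_EXCLUDE_CHARS = set("!?.,'\"-")

# Character class matching any single excluded character; re.escape keeps '-' literal.
toy_pattern = re.compile(
    "[" + "".join(re.escape(c) for c in sorted(TOY_EXCLUDE_CHARS)) + "]"
)

def clean_line(text):
    """Lowercase the text and replace each excluded character with one space."""
    return toy_pattern.sub(" ", text.lower())

cleaned = clean_line("It's unobserved, Leonard!")
print(repr(cleaned))
print(cleaned.split())
```

Because each mark becomes a space rather than being deleted, contractions such as "it's" split into two tokens ("it" and "s"), and runs of spaces can appear; splitting on whitespace afterwards absorbs them.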

In [6]:
# your code goes here
import re  # the standard-library module is sufficient here

from helpers.helper_functions import EXCLUDE_CHARS

# Character class matching any single excluded punctuation mark
pattern = "[" + "".join(re.escape(c) for c in EXCLUDE_CHARS) + "]"

# Keep only lines by recurrent characters, then lowercase and blank out punctuation
df_recurrent = df[df['Character'].isin(recurrent_characters)].copy()
df_recurrent['Dialogue'] = (
    df_recurrent['Dialogue'].str.lower().str.replace(pattern, " ", regex=True)
)
df_recurrent = df_recurrent.reset_index(drop=True)

In [None]:
print(df_recurrent[:1000].to_string())

**Q4**. For each term, calculate its "corpus frequency", i.e. its number of occurrences in the entire series. Visualize the distribution of corpus frequency using a histogram. Explain your observations. What are the appropriate x and y scales for this plot?
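
To make the two quantities in this question concrete, the sketch below computes corpus frequency and then its distribution (how many terms occur exactly k times) on three made-up lines. `toy_dialogue` is illustrative only; on the real data the histogram plays the role of `freq_of_freq`.

```python
from collections import Counter

# Made-up, already-cleaned dialogue lines; purely illustrative.
toy_dialogue = [
    "i am not crazy my mother had me tested",
    "i am a physicist",
    "i am leaving",
]

# Corpus frequency: total occurrences of each term across all lines.
corpus_freq = Counter(word for line in toy_dialogue for word in line.split())

# Distribution of corpus frequency: how many terms occur exactly k times.
freq_of_freq = Counter(corpus_freq.values())

print(corpus_freq.most_common(3))
print(sorted(freq_of_freq.items()))
```

Even in this tiny sample, most terms occur once and only a couple occur repeatedly; that skew is what makes the choice of axis scales matter.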

In [None]:
# your code goes here
from collections import Counter

# Punctuation was already replaced by whitespace in Q3, so splitting on whitespace
# is enough to tokenize. Stopwords are kept, as the task requires, and only the
# lines by recurrent characters (df_recurrent) are counted.
words = " ".join(df_recurrent['Dialogue']).split()

# Corpus frequency of every term, most frequent first
word_freq = Counter(words)
common_words = word_freq.most_common()

print(common_words[:20])

In [None]:
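# Noted from an earlier run: 24210 words, max freq 4099
import numpy as np
import matplotlib.pyplot as plt

# Extract only the frequencies from common_words
word_frequencies = [freq for _, freq in common_words]

# Log-spaced bins and log axes: the frequencies span several orders of magnitude
bins = np.logspace(0, np.log10(max(word_frequencies)), 30)
plt.hist(word_frequencies, bins=bins)
plt.xscale('log')
plt.yscale('log')
plt.xlabel('Corpus frequency (occurrences of a term)')
plt.ylabel('Number of terms')
plt.show()

# ANSWER: The distribution is heavy-tailed. Most terms occur only a handful of
# times, while a few terms occur thousands of times. On linear axes nearly all
# the mass collapses into the first bin, so logarithmic x and y scales are
# the appropriate choice.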

### Part 2: Talkativity
**Q5**. For each of the recurrent characters, calculate their total number of words uttered across all episodes. Based on this, who seems to be the most talkative character?

In [None]:
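# your code goes here
talkativity = {}

for _, row in df_recurrent.iterrows():
    character = row['Character']
    # Count words, not characters: split on whitespace before taking the length
    words_spoken = len(row['Dialogue'].split())
    talkativity[character] = talkativity.get(character, 0) + words_spoken

# Most talkative first
for character, total in sorted(talkativity.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{character} {total}")

# ANSWER: Sheldon. The "more than double" margin noted earlier came from
# character counts, so re-check it against the word counts printed above.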

## Task D: The Detective's Hat

Sheldon claims that given a dialogue line, he can, with an accuracy of above 70%, say whether it's by himself or by someone else. Leonard contests this claim, since he believes that this claimed accuracy is too high.

**Q6**. Divide the set of all dialogue lines into two subsets: the training set, consisting of all the seasons except the last two, and the test set, consisting of the last two seasons.

In [None]:
# Season number parsed from the episode label, e.g. "Series 01 Episode 01 ..." -> 1
season = df_recurrent['Episode'].str.extract(r'Series (\d+)', expand=False).astype(int)

# Test set: the last two seasons; training set: every season before them
is_test = season >= season.max() - 1
training = df_recurrent[~is_test]
testing = df_recurrent[is_test]

In [95]:
print("Training")
print(training[:4].to_string())

print("Testing")
print(testing[:4].to_string())

Training
                                 Episode                         Scene Character                                                                                                                                                                                                                                                                                 Dialogue
0   Series 01 Episode 01 – Pilot Episode   A corridor at a sperm bank.   Sheldon  so if a photon is directed through a plane with two slits in it and either slit is observed it will not go through both slits  if it s unobserved it will  however  if it s observed after it s left the plane but before it hits its target  it will not have gone through both slits 
1   Series 01 Episode 01 – Pilot Episode   A corridor at a sperm bank.   Leonard                                                                                                                                                                                           

**Q7**. Find the set of all words in the training set that are only uttered by Sheldon. Is it possible for Sheldon to identify himself only based on these? Use the test set to assess this possibility, and explain your method.
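
One possible way to frame the assessment, sketched on made-up lines rather than the real script: collect the words that only Sheldon utters in the training set, predict "Sheldon" for any test line containing at least one of them, and measure accuracy. The names `sheldon_only_vocab`, `evaluate`, `toy_train` and `toy_test` are illustrative assumptions, not part of the provided helpers, and this is not the only valid method.

```python
def sheldon_only_vocab(train):
    """Words that appear in Sheldon's training lines and in nobody else's."""
    sheldon, others = set(), set()
    for character, text in train:
        (sheldon if character == "Sheldon" else others).update(text.split())
    return sheldon - others

def evaluate(test, vocab):
    """Predict 'Sheldon' when a line contains any Sheldon-only word; return accuracy."""
    correct = 0
    for character, text in test:
        predicted_sheldon = any(word in vocab for word in text.split())
        correct += predicted_sheldon == (character == "Sheldon")
    return correct / len(test)

# Made-up (character, cleaned line) pairs; purely illustrative.
toy_train = [
    ("Sheldon", "bazinga that is my spot"),
    ("Leonard", "that is not funny"),
    ("Penny", "my car broke down"),
]
toy_test = [
    ("Sheldon", "bazinga"),
    ("Sheldon", "that is not funny"),
    ("Leonard", "where is my spot"),
    ("Penny", "hello"),
]

vocab = sheldon_only_vocab(toy_train)
accuracy = evaluate(toy_test, vocab)
print(sorted(vocab), accuracy)
```

The toy test set shows both ways this rule can fail: a Sheldon line that contains none of his exclusive words is missed, and another character who later uses one of those words is misattributed to him.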

In [None]:
# your code goes here
