### Natural Language Processing - Text Classification using Seinfeld Transcript Data

##### 0 - Setting up required imports, importing data, and cleaning data.

In [1]:
## for data
import pandas as pd
import numpy as np

## for plotting
import matplotlib.pyplot as plt
import seaborn as sns

## for processing
import re
import nltk

## for bag-of-words
from sklearn import feature_extraction, model_selection, naive_bayes, pipeline, manifold, preprocessing

## for explainer
from lime import lime_text

## for word embedding (Word2Vec)
import gensim
import gensim.downloader as gensim_api

## for deep learning
from tensorflow.keras import models, layers, preprocessing as kprocessing
from tensorflow.keras import backend as K

## for bert language model
import transformers

In [5]:
# read the transcripts CSV generated by the Data Scraper notebook
seinfeldDF = pd.read_csv('seinfeld_transcripts.csv')

In [67]:
# Some data cleaning
# transpose the data frame
seinfeldDFT = seinfeldDF.T
# flatten the data frame into a single column, then drop NaNs and lower-case the text
combineddf = pd.Series(seinfeldDFT.values.ravel('F')).dropna()
combineddf = combineddf.to_frame(name='Character')
combineddf['Character'] = combineddf['Character'].str.lower()

# separate the character from the text (split on the first colon only;
# `n` must be passed by keyword in pandas 2.0 and later)
combineddf[['Character','Character Text']] = combineddf["Character"].str.split(":", n=1, expand=True)
combineddf = combineddf.dropna()
print(combineddf)

        Character                                     Character Text
21738       jerry   you know, why we're here? [he means: here in ...
21740      [scene   pete's luncheonette. jerry and george are sit...
21741       jerry   seems to me, that button is in the worst poss...
21742      george               are you through? [kind of irritated]
21743       jerry             you do of course try on, when you buy?
...           ...                                                ...
86921       jerry    grand theft auto - don't steal any of my jokes.
86922  prisoner 3                      you suck - i'm gonna cut you.
86923       jerry   hey, i don't come down to where you work, and...
86924       guard   alright, seinfeld, that's it. let's go. come on.
86925       jerry   alright, hey, you've been great! see you in t...

[36582 rows x 2 columns]
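
The flatten-and-split steps above can be illustrated on a tiny, made-up frame. This is only a sketch: the `toy` data and its one-column-per-episode layout are assumptions for illustration, not the actual layout of `seinfeld_transcripts.csv`.

```python
import pandas as pd

# Hypothetical toy data standing in for the scraped transcripts:
# each cell holds one "speaker: line" string, and some cells are empty.
toy = pd.DataFrame({
    "episode_1": ["JERRY: Hello, Newman.", "NEWMAN: Hello, Jerry."],
    "episode_2": ["ELAINE: Get out: now!", None],
})

# Transpose, flatten column by column into one Series, drop empty cells,
# and lower-case everything (the same steps used in the cleaning cell).
lines = pd.Series(toy.T.to_numpy().ravel("F")).dropna().str.lower()

# Split on the FIRST colon only, so colons inside the dialogue survive.
parts = lines.str.split(":", n=1, expand=True)
parts.columns = ["Character", "Character Text"]

# Optional extra step (not in the original cell): trim the leading space
# that is left behind after the colon.
parts["Character Text"] = parts["Character Text"].str.strip()

print(parts)
```

Limiting the split with `n=1` matters because dialogue can itself contain colons; without it, `expand=True` would produce more than two columns and the two-column assignment would fail.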


##### 1 - Text Analysis

In [69]:
# let's take a look at the distribution of characters and their lines
# (note: splitting on whitespace counts every token of a speaker label separately)
combineddf.Character.str.split(expand=True).stack().value_counts()

jerry             9723
george            6494
elaine            5267
kramer            4335
newman             459
                  ... 
it...?               1
owns                 1
wallet)              1
(interrupting)       1
tells                1
Length: 2386, dtype: int64
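
Because the count above is taken over whitespace-separated tokens, multi-word labels and stage directions such as `(interrupting)` show up as separate entries. A possible alternative, sketched here on made-up rows, counts whole speaker labels after removing parenthetical directions. The `toy` frame and the regular expression are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical speaker labels, including a stage direction and a multi-word name.
toy = pd.DataFrame({
    "Character": ["jerry", "george ", "jerry", "prisoner 3", "jerry (interrupting)"]
})

# Remove parenthetical stage directions, trim whitespace, then count
# each complete speaker label once per line.
speakers = (
    toy["Character"]
    .str.replace(r"\s*\(.*?\)", "", regex=True)
    .str.strip()
)
speaker_counts = speakers.value_counts()

print(speaker_counts)
```

With this approach "prisoner 3" stays a single label and "jerry (interrupting)" is credited to "jerry", so the totals correspond to lines per speaker rather than tokens.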

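Unsurprisingly, Jerry has the most lines (9,723), followed by George (6,494), Elaine (5,267), and Kramer (4,335); the next most frequent character, Newman, has only 459. We will use these four characters as the target classes for our classifier.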
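
One way to restrict the data to these four targets is a simple membership filter. The sketch below uses a made-up frame; the names `MAIN_CHARACTERS`, `toy`, and `main_df` are hypothetical, and the notebook's own filtering step may differ.

```python
import pandas as pd

# The four target classes named above (lower-cased, as in the cleaned data).
MAIN_CHARACTERS = ["jerry", "george", "elaine", "kramer"]

# Hypothetical rows in the same two-column shape as the cleaned data frame.
toy = pd.DataFrame({
    "Character": ["jerry", "newman", "elaine", "kramer", "guard", "george"],
    "Character Text": [
        " hello, newman.",
        " hello, jerry.",
        " get out!",
        " giddy up.",
        " let's go.",
        " i'm back, baby.",
    ],
})

# Keep only rows whose speaker is one of the four main characters.
mask = toy["Character"].str.strip().isin(MAIN_CHARACTERS)
main_df = toy.loc[mask].reset_index(drop=True)

print(main_df["Character"].value_counts())
```

Applied to the real data, the same mask would drop scene headings such as `[scene` as well as minor characters, leaving a four-class problem.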