# Natural Language Processing (NLP) Process For Seinfeld Transcripts

This notebook outlines the process of cleaning, tokenizing, and vectorizing text transcripts of Seinfeld Season 5 episodes. Source of transcripts: https://www.seinfeldscripts.com/seinfeld-scripts.html

In [1]:
# Import the NLTK library and its tokenizers
import nltk
nltk.download('punkt')  # tokenizer models used by word_tokenize and sent_tokenize
from nltk import word_tokenize, sent_tokenize

# Import data handling and dimensionality reduction libraries
import pandas as pd
import numpy as np
from sklearn.manifold import TSNE

# Import visualization libraries
import matplotlib.pyplot as plt
%matplotlib inline
from mpl_toolkits.mplot3d import Axes3D

[nltk_data] Downloading package punkt to /Users/Alex/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Read The Files

In [2]:
seinfeld_directory = 'Seinfeld_Episodes/Season_5/'

seinfeld_season_5_episodes = ['S05_E01_The_Mango.txt', 'S05_E02_The_Puffy_Shirt.txt',
                              'S05_E03_The_Glasses.txt', 'S05_E04_The_Sniffing_Accountant.txt',
                              'S05_E05_The_Bris.txt', 'S05_E06_The_Lip_Reader.txt',
                              'S05_E07_The_Non_Fat_Yogurt.txt', 'S05_E08_The_Barber.txt',
                              'S05_E09_The_Masseuse.txt', 'S05_E10_The_Cigar_Store_Indian.txt',
                              'S05_E11_The_Conversion.txt', 'S05_E12_The_Stall.txt',
                              'S05_E13_The_Dinner_Party.txt', 'S05_E14_The_Marine_Biologist.txt',
                              'S05_E15_The_Pie.txt', 'S05_E16_The_Stand-In.txt',
                              'S05_E17_The_Wife.txt', 'S05_E18_The_Raincoats_Part_1.txt',
                              'S05_E19_The_Raincoats_Part_2.txt', 'S05_E20_The_Fire.txt',
                              'S05_E21_The_Hamptons.txt', 'S05_E22_The Opposite.txt']


In [3]:
# Read the first episode and flatten its newlines into spaces
with open(seinfeld_directory + seinfeld_season_5_episodes[0], 'r') as file:
    raw_text_episode_1 = file.read().replace('\n', ' ')
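
The cell above reads only the first episode. A minimal sketch of how the same step could be applied to every file in the list follows; the helper name `read_episodes` and the `utf-8` encoding are assumptions, not part of the original notebook.

```python
from pathlib import Path


def read_episodes(directory, filenames):
    """Return a dict mapping each filename to its text, with newlines replaced by spaces."""
    episodes = {}
    for name in filenames:
        path = Path(directory) / name
        # Assumption: the transcripts are UTF-8 encoded; adjust if they are not
        with open(path, 'r', encoding='utf-8') as file:
            episodes[name] = file.read().replace('\n', ' ')
    return episodes
```

Calling `read_episodes(seinfeld_directory, seinfeld_season_5_episodes)` would then give one raw string per episode, keyed by filename.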

In [4]:
raw_text_episode_1

"[location: nightclub] JERRY: A female orgasm is kinda like the bat cave. A very few people know where it is and if you're lucky enough to see it you probably don't know how you got there and you can't find you way back after you left. You know there are two types of female orgasm: the real and the fake. And I'll tell you right now, as a man, we don't know. We do not know, because to man sex is like a car accident and determining the female orgasm is like being asked 'What did you see after the car went out of control?'. 'I heard a lot of screeching sounds. I remember I was facing the wrong way at one point. And in the end my body was thrown clear. [location: Monk's] JERRY: So, what's her name? GEORGE: Karin. JERRY: Is she nice? GEORGE: Great. JERRY: So you like her? GEORGE: I think so. JERRY: You don't know? GEORGE: I can't tell anymore. JERRY: Well do you feel anything? GEORGE: Feel? What's that? JERRY: All right, let me ask you this: when she comes over, you're cleaning up a lot? GE

## Cleaning The Text Data

### Step 1: Manual

In [5]:
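def clean_text(raw_text):
    """Lowercase each word, strip , ? ! ' and skip words containing parentheses."""
    cleaned_text = []
    for word in raw_text.split(" "):
        # Skip any word that contains a parenthesis
        if '(' not in word and ')' not in word:
            word = word.lower()
            for symbol in ",?!'":
                word = word.replace(symbol, '')
            cleaned_text.append(word)

    return cleaned_text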
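
Note that `clean_text` drops only the individual words that contain a parenthesis, so the middle words of a longer parenthetical such as `(walks over there)` would survive. The sketch below shows one possible alternative that removes whole parenthetical spans with a regular expression. It assumes the parentheses in the transcripts mark stage directions and are never nested; `clean_text_regex` is a hypothetical name, not a function from the original notebook.

```python
import re


def clean_text_regex(raw_text):
    """Lowercase the text, drop parenthetical spans, and strip , ? ! ' characters."""
    # Remove each "( ... )" span, including the words between the parentheses
    without_directions = re.sub(r'\([^)]*\)', ' ', raw_text)
    # Strip the same four symbols that clean_text removes
    without_symbols = re.sub(r"[,?!']", '', without_directions)
    # split() with no argument also discards empty strings left by repeated spaces
    return without_symbols.lower().split()


print(clean_text_regex("JERRY: (to George) So, what's her name?"))
# ['jerry:', 'so', 'whats', 'her', 'name']
```

Unlike `clean_text`, this variant also discards the empty strings produced by consecutive spaces, so its output is not guaranteed to match the original function element for element.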

In [6]:
cleaned_text_episode_1 = clean_text(raw_text_episode_1)

In [7]:
cleaned_text_episode_1

['[location:',
 'nightclub]',
 'jerry:',
 'a',
 'female',
 'orgasm',
 'is',
 'kinda',
 'like',
 'the',
 'bat',
 'cave.',
 'a',
 'very',
 'few',
 'people',
 'know',
 'where',
 'it',
 'is',
 'and',
 'if',
 'youre',
 'lucky',
 'enough',
 'to',
 'see',
 'it',
 'you',
 'probably',
 'dont',
 'know',
 'how',
 'you',
 'got',
 'there',
 'and',
 'you',
 'cant',
 'find',
 'you',
 'way',
 'back',
 'after',
 'you',
 'left.',
 'you',
 'know',
 'there',
 'are',
 'two',
 'types',
 'of',
 'female',
 'orgasm:',
 'the',
 'real',
 'and',
 'the',
 'fake.',
 'and',
 'ill',
 'tell',
 'you',
 'right',
 'now',
 'as',
 'a',
 'man',
 'we',
 'dont',
 'know.',
 'we',
 'do',
 'not',
 'know',
 'because',
 'to',
 'man',
 'sex',
 'is',
 'like',
 'a',
 'car',
 'accident',
 'and',
 'determining',
 'the',
 'female',
 'orgasm',
 'is',
 'like',
 'being',
 'asked',
 'what',
 'did',
 'you',
 'see',
 'after',
 'the',
 'car',
 'went',
 'out',
 'of',
 'control.',
 'i',
 'heard',
 'a',
 'lot',
 'of',
 'screeching',
 'sounds.',
 

## Tokenize The Sentence

In [8]:
def tokenize(cleaned_text):
    """Join the cleaned words into one string and split it into NLTK word tokens."""
    joined_sentence = ' '.join(cleaned_text)
    tokenized_sentence = word_tokenize(joined_sentence)

    return tokenized_sentence
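
Because `clean_text` leaves characters such as `[`, `]`, `:` and `.` in place, `word_tokenize` emits them as separate tokens. If those tokens are not wanted downstream, one option is to filter them out afterwards. The following is a minimal sketch under that assumption; `drop_punctuation_tokens` is a hypothetical helper and the sample list is illustrative only.

```python
def drop_punctuation_tokens(tokens):
    """Keep only tokens that contain at least one letter or digit."""
    return [token for token in tokens if any(ch.isalnum() for ch in token)]


sample_tokens = ['[', 'location', ':', 'nightclub', ']', 'jerry', ':', 'a', 'female']
print(drop_punctuation_tokens(sample_tokens))
# ['location', 'nightclub', 'jerry', 'a', 'female']
```

This also removes the speaker colons, so whether it is appropriate depends on whether speaker labels need to stay distinguishable from dialogue.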


In [9]:
tokenize(cleaned_text_episode_1)

['[',
 'location',
 ':',
 'nightclub',
 ']',
 'jerry',
 ':',
 'a',
 'female',
 'orgasm',
 'is',
 'kinda',
 'like',
 'the',
 'bat',
 'cave',
 '.',
 'a',
 'very',
 'few',
 'people',
 'know',
 'where',
 'it',
 'is',
 'and',
 'if',
 'youre',
 'lucky',
 'enough',
 'to',
 'see',
 'it',
 'you',
 'probably',
 'dont',
 'know',
 'how',
 'you',
 'got',
 'there',
 'and',
 'you',
 'cant',
 'find',
 'you',
 'way',
 'back',
 'after',
 'you',
 'left',
 '.',
 'you',
 'know',
 'there',
 'are',
 'two',
 'types',
 'of',
 'female',
 'orgasm',
 ':',
 'the',
 'real',
 'and',
 'the',
 'fake',
 '.',
 'and',
 'ill',
 'tell',
 'you',
 'right',
 'now',
 'as',
 'a',
 'man',
 'we',
 'dont',
 'know',
 '.',
 'we',
 'do',
 'not',
 'know',
 'because',
 'to',
 'man',
 'sex',
 'is',
 'like',
 'a',
 'car',
 'accident',
 'and',
 'determining',
 'the',
 'female',
 'orgasm',
 'is',
 'like',
 'being',
 'asked',
 'what',
 'did',
 'you',
 'see',
 'after',
 'the',
 'car',
 'went',
 'out',
 'of',
 'control',
 '.',
 'i',
 'heard',