# Sitcoms 1: Obtaining and Cleaning the Data

This is the first of four notebooks on sitcom pilot script data. In this notebook, we will introduce the project, obtain the data through web scraping, build our dataframe, and clean the data. Once the data is gathered and scrubbed, I will explore it in [Notebook 2 - EDA](2_sitcoms_EDA.ipynb), run classifiers on it in [Notebook 3 - Classification](3_sitcoms_classification.ipynb), and experiment with deep learning in [Notebook 4 - Deep Learning](4_sitcoms_deep_learning.ipynb).

## Table of Contents

  * 1. [Introduction](#intro)  
      * 1.1. [Why Sitcoms?](#why)  
      * 1.2. [About the Data](#about)  
      * 1.3. [Defining "Fail" and "Success"](#define)  
      * 1.4. [Collection Sources](#sources)    
  * 2. [Obtaining the Data](#obtain)  
      * 2.1. [Import Libraries](#import)  
      * 2.2. [Open and Preview Data](#open)  
      * 2.3. [Description of Columns](#cols)  
  * 3. [Web Scraping Using Selenium](#web)  
      * 3.1. [Creating a Dataframe of Scripts](#create)
  * 4. [Data Cleaning](#clean)  
      * 4.1. [Initial Text Preprocessing/Splitting Data](#itp)
      * 4.2. [Exploding Rows into More Rows](#explode)
      * 4.3. [Datatypes](#data)
      * 4.4. [Null Data](#null)
            * 4.4.1. [Notes Column](#notes)
            * 4.4.2. [Actors Column](#actors)

<a id='intro'></a>

# 1. Introduction

   Sitcoms have always been a staple of cable television. Over the years, audiences all over the nation have witnessed major networks debut hundreds of different sitcom pilots in an attempt to create the next "Seinfeld" or "Friends". So what makes certain sitcoms reach "water cooler conversation" status while others vanish just as fast as they came into existence? 

   Sometimes, the reason why a show fails is clear. Maybe a lead actor on the program got caught in a PR scandal, or perhaps network budget cuts led to an early cancellation. Personally, I believe that the key to a show's success is in the script. For this project, I decided to try to answer the question **Is there a relationship between a sitcom's script and the success it has?** and see what other information I could uncover while analyzing television sitcom script data.

<a id='why'></a>

### 1.1. Why Sitcoms?

I've always had an appreciation for television and movie scripts, and I wanted to incorporate them into my Capstone project somehow. While I tend to prefer dramatic television over sitcoms when praising TV, I can't deny the comfort that an old sitcom provides me. Of course, while my adoration of well-written television spans many levels, the real reasons that I chose to focus on sitcoms are:

1.) I chose sitcoms because their scripts are short (as the episodes tend to span just over 20 minutes) and I knew that it would be less time-consuming to model on shorter scripts.

2.) Additionally, sitcoms are historically both simple and formulaic. This means that trends are likely to be easier to identify and replicate than in more complex dramatic shows such as Breaking Bad. 

3.) And finally, with The Big Bang Theory airing its twelfth and final season this year - after beating out television giant Game of Thrones for the most-watched series finale of 2019 - it looks like the sitcom game is about to open up and pick a new MVP. In my mind, this creates a demand for a project like this.

<a id='about'></a>

### 1.2. About The Data

I was unable to find an adequate dataset of television sitcom scripts online, so I ended up sourcing my own data for this project. To answer my questions about the relationship between a sitcom's script and its success, I decided to compile a dataframe of roughly 50 different sitcoms. About half of the sitcoms failed and half were successful, as defined by the criteria listed in the section below. For each program, I included several data points that I thought could be relevant to my research, such as the year the show first debuted, the network it debuted on, a summary of the premise, etc. Most importantly, I also included a link to the script of the pilot episode of each show. I then used the browser-automation tool Selenium to scrape the pilot scripts from the web into my dataframe.

<a id='define'></a>

### 1.3. Defining "Fail" and "Success"

In choosing the television series that I would include in my failed list and successful list, I developed a set of criteria. Sitcoms in the failed list each had a score of 6.5 or lower on IMDB, a score of 45% or lower on Rotten Tomatoes, and were each canceled after one season. 

Television series in my successful set each had a score of 7 or higher on IMDB, a score of 60% or higher on Rotten Tomatoes, and had at least 3 seasons (with the exception of Atlanta, which had 2 seasons at the time of data collection, but had already been renewed for both a 3rd and 4th season).

To keep the data as even as possible and to control for as many confounding variables as possible, all television programs aired on major cable television networks, were American-made, and were created in the last 30 years. These criteria meant excluding all of the successful and unsuccessful content released by Netflix, HBO, or other platforms of the like, as well as excluding several sitcom giants, such as 'Cheers' and 'I Love Lucy'. Each show also fits into the genre of "situational comedy" and has an average episode duration of 22 minutes.
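The score and season thresholds above can be written down as a small rule. The sketch below is only an illustration: the function name `label_show` and its arguments are hypothetical, the labels in `sitcoms.csv` were assigned by hand, and the Atlanta exception is not encoded.

```python
def label_show(imdb, rt_score, seasons):
    """Return 'fail', 'success', or None for a show that meets neither set of criteria."""
    # Failed list: low scores on both sites and canceled after one season
    if imdb <= 6.5 and rt_score <= 45 and seasons == 1:
        return "fail"
    # Successful list: high scores on both sites and at least three seasons
    if imdb >= 7 and rt_score >= 60 and seasons >= 3:
        return "success"
    # Anything in between was not a candidate for either list
    return None

# How to Be a Gentleman: IMDB 4.7, Rotten Tomatoes 27%, one season
print(label_show(4.7, 27, 1))
```

Shows that fall between the two sets of thresholds return `None`, which mirrors the fact that they were left out of the dataset entirely.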

<a id='sources'></a>

### 1.4. Collection Sources

All scripts were obtained via web scraping from the website "Springfield! Springfield!" (https://springfieldspringfield.co.uk), a database of thousands of television episode scripts and movie scripts. IMDB scores and Rotten Tomatoes scores were gathered directly from IMDB and Rotten Tomatoes, respectively. Data on the number of people who viewed the pilot as well as viewing averages were largely collected from tvtango.com and tvseriesfinale.com. 

All data was collected during October 2019 by me, Emily Pfeifer.

<a id='obtain'></a>

# 2. Obtaining the Data

<a id='import'></a>

### 2.1. Import Libraries

In [1]:
#import necessary libraries

#webscraping
import requests
import re
from selenium import webdriver

#dataframes
import pandas as pd

#math
import numpy as np
np.random.seed(0)

#visualizations
import seaborn as sns

import matplotlib.pyplot as plt
%matplotlib inline

<a id='open'></a>

### 2.2. Open and Preview Data

In [2]:
df = pd.read_csv('sitcoms.csv') #open and read file

In [3]:
df.head() #preview the head

Unnamed: 0,program,year,rtscore,IMDB,label,premise,url,seasons,episodes,network,notes,actors,viewpilot,viewseason1average,viewseason1finale,viewaverage,viewfinale
0,How to Be a Gentleman,2011,27,4.7,fail,Andrew Carlson (David Hornsby) has so immersed...,https://www.springfieldspringfield.co.uk/view_...,1,9,cbs,,,8.98,3.1,1.56,3.1,1.56
1,Accidentally on Purpose,2009,30,6.3,fail,"Billie, a newspaper film critic, has a one-nig...",https://www.springfieldspringfield.co.uk/view_...,1,18,cbs,show was in primetime spot (after HIMYM) for f...,,8.91,7.8,5.22,7.8,5.22
2,Imaginary Mary,2017,25,5.5,fail,"Up until recently, fiercely independent Alice ...",https://www.springfieldspringfield.co.uk/view_...,1,6,abc,3 additional episodes unaired,rachel dratch,5.39,3.15,2.13,3.15,2.13
3,Rush Hour,2016,23,5.7,fail,"The popular ""Rush Hour"" feature-film franchise...",https://www.springfieldspringfield.co.uk/view_...,1,13,cbs,based on movie of same name,,5.06,3.47,1.61,3.47,1.61
4,Bad Judge,2014,20,5.9,fail,"Rebecca Wright parties hard, rocks out as the ...",https://www.springfieldspringfield.co.uk/view_...,1,13,nbc,inspired by chelsea handler's book,chelsea handler,5.84,3.81,2.86,3.81,2.86


<a id='cols'></a>

### 2.3. Description of Columns

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 17 columns):
program               51 non-null object
year                  51 non-null int64
rtscore               51 non-null int64
IMDB                  51 non-null float64
label                 51 non-null object
premise               51 non-null object
url                   51 non-null object
seasons               51 non-null int64
episodes              51 non-null int64
network               51 non-null object
notes                 28 non-null object
actors                21 non-null object
viewpilot             49 non-null float64
viewseason1average    49 non-null float64
viewseason1finale     50 non-null float64
viewaverage           48 non-null float64
viewfinale            50 non-null object
dtypes: float64(5), int64(4), object(8)
memory usage: 6.9+ KB


The columns measure the following:

**program** - The title of the sitcom

**year** - the year that the sitcom premiered

**rtscore** - the Rotten Tomatoes score of the sitcom - this score is mostly generated by film critics

**IMDB** - the IMDB score of the sitcom - the score is mostly generated by audiences (rather than critics)

**label** - 'fail' or 'success' - denoting whether the program was successful or not

**premise** - a short summary of the program

**url** - the URL of the script for the pilot episode of the program

**seasons** - the number of seasons the program had

**episodes** - the number of episodes the program had

**network** - the network that the program first aired on

**notes** - general notes about the program. This column was a place for me to note if there was anything that differentiated the show from other shows, i.e., the program was based off of a movie of the same name or the program was animated, etc.

**actors** - any notable actors that the show starred.

**viewpilot** - the number of people (in millions) who viewed the pilot episode

**viewseason1average** - the average number of people (in millions) who viewed each episode in the first season 

**viewseason1finale** - the number of people (in millions) who viewed the last episode of season 1

**viewaverage** - the average number of people (in millions) who viewed each episode

**viewfinale** - the number of people (in millions) who viewed the last episode of the series.

For a few of the columns, such as actors and notes, about half of the values are blank (or null) because they weren't necessary. Additionally, a few values in each of the viewers columns are blank because I was unable to find a reliable source for the information. This mostly happened for shows that were made a while back or weren't very well known.

<a id='web'></a>

# 3. Web Scraping Using Selenium

We have the URLs for the location of each pilot script, but have not extracted the text data from them yet. Using Selenium, an open-source browser-automation tool, we will gather our script data from Springfield! Springfield!.

Before we go through every link and grab our data, let's try it on the first link we have and make sure we understand the method.

In [5]:
driver = webdriver.Chrome() #create new instance of google chrome

In [6]:
#access chrome/open our website
driver.get('https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=how-to-be-a-gentleman-2011&episode=s01e01')

In [7]:
#pass xpath of title into find_elements function
title_element = driver.find_elements_by_xpath('//*[@id="content_container"]/div[2]/div[2]/h1')[0]
title = title_element.text #extract text from element

In [8]:
print(title) #check our work

How to Be a Gentleman (2011) s01e01 Episode Script


In [9]:
#pass xpath of script into find_elements function
script_element = driver.find_elements_by_xpath('//*[@id="content_container"]/div[2]/div[2]/div[2]/div[1]')[0]
script = script_element.text #extract text from element

In [10]:
print(script) #let's see if this worked

I am one of the last of my kind.
Those who came before me, ruled the world.
Everyday, I carry on their proud legacy.
I open the door for a lady, but I am not a doorman.
I protect fellow citizens but I am not a policeman.
I put out cigarettes but I am not a cigarette putter-outer man.
I am a gentleman.
Oop.
That was close.
Jerry, hey.
Never miss.
Andrew, I've got some good news.
The magazin's been sold.
When a gentleman receives bad news, he is never at a loss for words.
What? Yep.
New owners are changing the entire format.
No more urbane and upscale.
We're going young and sexy; be women in thongs, articles about abs.
Yep, they want to expand the readership by targeting people who don't read.
This is insane.
How are you taking this so well? Simple.
I'm faking it.
Look, this is unacceptable.
You're the editor.
We should both walk in there right now and tell them we are not changing a thing.
I was thinking about doing that, I'm 50.
So I decided that I'm actually very, very excited about t

So far, so good!
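One caveat for anyone re-running these cells: the `find_elements_by_xpath` method used above was removed in Selenium 4, where the same lookup is written `driver.find_elements(By.XPATH, ...)`. The sketch below wraps that call in a hypothetical helper, `first_text`, which also returns `None` instead of raising an `IndexError` when the XPath matches nothing. It passes the locator strategy as the plain string `"xpath"`, which is the value of `By.XPATH`, so the helper itself needs no Selenium import. Treat it as a sketch rather than the method used to build this dataset.

```python
def first_text(driver, xpath):
    """Return the text of the first element matching xpath, or None if nothing matches."""
    elements = driver.find_elements("xpath", xpath)  # "xpath" is the value of By.XPATH
    return elements[0].text if elements else None

# XPaths copied from the cells above
TITLE_XPATH = '//*[@id="content_container"]/div[2]/div[2]/h1'
SCRIPT_XPATH = '//*[@id="content_container"]/div[2]/div[2]/div[2]/div[1]'
```

With a live driver, `first_text(driver, TITLE_XPATH)` would stand in for the two-line title lookup, and `first_text(driver, SCRIPT_XPATH)` for the script lookup.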

<a id='create'></a>

### 3.1. Creating a Dataframe of Scripts

Now that we have verified how to use Selenium in this context, we will need to repeat the process for the remaining pilots. Our first step will be to create a list of URLs to iterate through.

In [11]:
url_list = df.url.unique() #creating url list from unique values in url column

Next, we're going to repeat the process of web-scraping above for every link in our list. We already have a list of program names that we can obtain from our dataframe, so the only thing we're going to extract from each link is the script text. 

In [12]:
res_script=[] #where to store our results
#iterating through the list of urls
for link in url_list:
    driver.get(link) #go to webpage
    script_element = driver.find_elements_by_xpath('//*[@id="content_container"]/div[2]/div[2]/div[2]/div[1]')[0] #script xpath
    script = script_element.text #extract text from element
    res_script.append(script) #adding each script to the results list

In [13]:
print(res_script[23]) #take a look at a random script to verify this worked

Mom? Hey, I need your help.
I'm at dad's.
Because I need to borrow money.
I got laid off.
I'm broke.
[Suspenseful music.]
You were married to the guy.
What's the best way to get money from him? Yeah.
I can't divorce him.
I'm over-thinking this.
I haven't seen him in two years.
Maybe he's a different guy.
Maybe he's mellowed.
You know what? I bet he's changed.
Guts or nuts.
Your choice.
[Whispers.]
He hasn't changed.
[Ben folds five's one angry dwarf and solemn faces.]
Don't you think you wanna be just a little more like me Why didn't you call first? Almost decorated my Buick with your balls! Come on, dad.
You wouldn't do that to your Buick.
I thought you were one of those jackasses who show up on my front door lookin' for a handout.
Right.
The elections are coming up.
I'm talking about the girl scouts.
They're nothing but beggars with merit badges.
There's nothing worse than someone coming to your home, trying to get money out of you.
What brings you to town, Henry? Um you know, I just

That all looks good! Let's go ahead and make our new dataframe.

In [14]:
titles = df.program.unique() #create list of program names

In [15]:
df_raw = pd.DataFrame(list(zip(titles, res_script)), #setting titles and scripts as columns
              columns=['title','script'])

In [16]:
#adding label (fail/success) to dataframe
df_raw['label'] = df['label'].values
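The cells above line up titles, scripts, and labels purely by position. That works as long as every program name and URL in `df` is unique, so that `unique()` keeps exactly one entry per row in the original order. A minimal sketch of a more defensive assembly step, using a hypothetical helper `build_records` and placeholder inputs, checks the lengths before pairing anything:

```python
def build_records(titles, scripts, labels):
    """Pair each title with its script and label, refusing to proceed on a length mismatch."""
    if not (len(titles) == len(scripts) == len(labels)):
        raise ValueError(
            f"length mismatch: {len(titles)} titles, "
            f"{len(scripts)} scripts, {len(labels)} labels"
        )
    return [
        {"title": title, "script": script, "label": label}
        for title, script, label in zip(titles, scripts, labels)
    ]

# Placeholder inputs for illustration only
records = build_records(
    ["How to Be a Gentleman", "Rush Hour"],
    ["I am one of the last of my kind.", "(script text)"],
    ["fail", "fail"],
)
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame` to get the same three columns.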

As previously mentioned, a portion of the data collected in the original dataframe were columns that I had included as notes to myself, such as the premise of the show or whether or not any notable actors were in the show. This content will be useful for us in our EDA notebook, but in the interest of keeping our dataframe as streamlined as possible, we will not include these columns.

In [17]:
df_raw.to_csv('pilotdata.csv') #saving the file to access it in next notebook

<a id='clean'></a>

# 4. Data Cleaning

The first thing that I want to do is split the script data into sentences. While we do not have a huge number of data points, our actual text sequences are very long (in other words, a 22-minute pilot contains a lot of text). To help our model run as quickly and efficiently as possible, we are going to extract sentences from our scripts and then get ready to split our scripts into rows, sentence by sentence.

<a id='itp'></a>

### 4.1. Initial Text Preprocessing and Splitting the Text Data

To make our "split_into_sentences" function run properly, we're going to do a little bit of text pre-processing within the function so that it can parse our script data as accurately as possible without mistaking the periods in certain prefixes, acronyms, or other common text occurrences for the ends of sentences.

In [18]:
# defining text variables for function
alphabets = r"([A-Za-z])" #any single letter, upper- or lowercase
prefixes = r"(Mr|St|Mrs|Ms|Dr)[.]" #titles whose period does not end a sentence
suffixes = r"(Inc|Ltd|Jr|Sr|Co)" #abbreviations whose period may not end a sentence
starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)" #words that typically begin a new sentence
acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)" #acronyms such as U.S. or U.S.A.
websites = r"[.](com|net|org|io|gov)" #domain endings such as .com
digits   = r"([0-9])" #single digit, for protecting decimals like '5.5' (only used by the commented-out step in the function)

#create a function to split text into sentences
def split_into_sentences(text):
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub(r"\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    #text = re.sub(digits + "[.]" + digits,"\\1<prd>\\2",text) #optional: keep decimals like '5.5' in one sentence
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences
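The core trick in the function above is a three-step placeholder swap: hide the periods that do not end a sentence, mark the punctuation that does, then restore the hidden periods. The stripped-down sketch below shows just that mechanism. The name `split_simple` and the example sentence are made up for illustration, and it handles far fewer cases than `split_into_sentences`.

```python
import re

def split_simple(text):
    """Split text into sentences using the same <prd>/<stop> placeholder idea."""
    # 1. Protect periods that follow a title so they are not treated as sentence ends
    text = re.sub(r"\b(Mr|Mrs|Ms|Dr|St)[.]", r"\1<prd>", text)
    # 2. Mark every remaining sentence-ending punctuation character
    text = re.sub(r"([.?!])", r"\1<stop>", text)
    # 3. Restore the protected periods
    text = text.replace("<prd>", ".")
    # 4. Split on the markers and drop empty leftovers
    return [s.strip() for s in text.split("<stop>") if s.strip()]

sentences = split_simple("Dr. Cooper is late. Is Mr. Smith here? Yes!")
print(sentences)
```

Without step 1, "Dr." and "Mr." would each be split off as their own one-word sentences.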

In [19]:
data_text = df_raw.copy() #making a copy of dataframe (plain assignment would only create a second name for df_raw)

In [20]:
#apply sentences function to script column
data_text['script'] = data_text['script'].map(lambda script: split_into_sentences(script)) 

In [21]:
data_text.head() #let's see

Unnamed: 0,title,script,label
0,How to Be a Gentleman,"[I am one of the last of my kind., Those who c...",fail
1,Accidentally on Purpose,"[Oh, I can't do another office party., I've al...",fail
2,Imaginary Mary,"[1 Mary: This is Alice from way back., Just a ...",fail
3,Rush Hour,"[1 [intense music., ] [indistinct chattering.,...",fail
4,Bad Judge,[1 I'm gonna let loose no I don't care letting...,fail


In [22]:
data_text['script'][10] #check random value

['Oh, my God!',
 "I can't believe we're really married!",
 'I know.',
 'I know.',
 'Mmm.',
 "We didn't rush into this, did we?",
 '-No.',
 'No.',
 '-No.',
 "If getting married impulsively was a bad idea, Vegas wouldn't have chapels open at 3:00 in the morning.",
 'Right.',
 'Right.',
 "It's crazy.",
 "It's just crazy.",
 "Six weeks ago, a guy comes running into my dress shop to get away from a bee, and now I'm married to him.",
 'I guess I can tell you now.',
 '-What?',
 '-I made up the bee.',
 'I saw you through the window, and you looked really cute.',
 'So wait.',
 'So so why were you waving your arms around your head like that?',
 "That's just the way I run.",
 '-So where should I put my stuff?',
 '-Anywhere you want.',
 'This is your place now.',
 'Come on, make yourself at home.',
 '-Okay.',
 'Okay!',
 '-Anywhere you want.',
 'Ah, but not there.',
 'Not there.',
 'No, I I just reorganized that closet.',
 "That's all.",
 'But anywhere else you want.',
 'Okay.',
 'Oh I know you hav

Now that our scripts have been split into sentences, let's split each script into a series of rows in our dataframe so that it is easier to manipulate later.

<a id='explode'></a>

### 4.2. Exploding Rows into More Rows

We want our newly split sentences to be cast into individual rows that are indexed so that we can still see which pilot they belong to, whether or not that pilot was successful, etc. To do this, we will create a function to "explode" the rows containing our script data into more rows.

In [23]:
def explode(df, lst_cols, fill_value='', preserve_index=False): #creating a function to explode row content into more rows
    # make sure `lst_cols` is list-alike
    if (lst_cols is not None
        and len(lst_cols) > 0
        and not isinstance(lst_cols, (list, tuple, np.ndarray, pd.Series))):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)
    # calculate lengths of lists
    lens = df[lst_cols[0]].str.len()
    # preserve original index values    
    idx = np.repeat(df.index.values, lens)
    # create "exploded" DF
    res = (pd.DataFrame({
                col:np.repeat(df[col].values, lens)
                for col in idx_cols},
                index=idx)
             .assign(**{col:np.concatenate(df.loc[lens>0, col].values)
                            for col in lst_cols}))
    # append those rows that have empty lists
    if (lens == 0).any():
        # at least one list in cells is empty
        res = (pd.concat([res, df.loc[lens==0, idx_cols]], sort=False) #DataFrame.append was removed in pandas 2.0
                  .fillna(fill_value))
    # revert the original index order
    res = res.sort_index()
    # reset index if requested
    if not preserve_index:        
        res = res.reset_index(drop=True)
    return res
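To make the behaviour of `explode` concrete, here is the same idea in plain Python on a list of dictionaries. The helper name `explode_records` and the two toy rows are made up for illustration. It repeats the non-list fields once per list item and, like the function above, keeps rows whose list is empty by substituting a fill value. (pandas 0.25 and later also provide a built-in `DataFrame.explode` method that covers the common case.)

```python
def explode_records(records, list_key, fill_value=""):
    """Turn each record holding a list under list_key into one record per list item."""
    exploded = []
    for record in records:
        items = record[list_key] or [fill_value]  # keep rows whose list is empty
        for item in items:
            row = dict(record)    # copy the other fields (title, label, ...)
            row[list_key] = item  # replace the list with a single item
            exploded.append(row)
    return exploded

# Toy rows for illustration only
rows = [
    {"title": "Show A", "label": "fail", "script": ["Hello.", "Goodbye."]},
    {"title": "Show B", "label": "success", "script": []},
]
flat = explode_records(rows, "script")
```

Each sentence of "Show A" ends up in its own row with the title and label repeated, while "Show B" survives as a single row with an empty script.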

In [24]:
data_text = explode(data_text, ['script'], fill_value='') #apply our new function to dataframe

In [25]:
data_text.head() #checking our work

Unnamed: 0,label,title,script
0,fail,How to Be a Gentleman,I am one of the last of my kind.
1,fail,How to Be a Gentleman,"Those who came before me, ruled the world."
2,fail,How to Be a Gentleman,"Everyday, I carry on their proud legacy."
3,fail,How to Be a Gentleman,"I open the door for a lady, but I am not a doo..."
4,fail,How to Be a Gentleman,I protect fellow citizens but I am not a polic...


Looks good! 

<a id='data'></a>

### 4.3. Datatypes

Now we must check how each column is encoded so that pandas knows how to "treat" each column during analysis.

In [26]:
data_text.dtypes #check datatypes on script dataframe

label     object
title     object
script    object
dtype: object

data_text looks good - label, title, and script are all encoded as objects, which makes sense because each of those columns holds text data. Let's also take a step back and look at our previous dataframe.

In [27]:
df.dtypes #check datatypes on original dataframe

program                object
year                    int64
rtscore                 int64
IMDB                  float64
label                  object
premise                object
url                    object
seasons                 int64
episodes                int64
network                object
notes                  object
actors                 object
viewpilot             float64
viewseason1average    float64
viewseason1finale     float64
viewaverage           float64
viewfinale             object
dtype: object

All of our text data columns (program, label, premise, etc.) are encoded as objects, which is correct. Our numerical columns are all correctly encoded as integers or floats except for our last column, "viewfinale" (which measures the number of viewers that tuned into the series finale), which is encoded as an object. This is probably because while collecting data for viewfinale, a couple of the successful shows in my dataframe hadn't aired their series finale yet, so instead of leaving the cell blank, I wrote in "TBD" and figured I would deal with it later. I may replace all the "TBD"s with zeros later, but I'm not sure if I will use that column in my EDA so for now I will let it be.
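If the column is needed later, one option is to turn the placeholder into a missing value rather than a zero, since a zero would read as a finale that nobody watched and would drag down any average. The sketch below shows the idea in plain Python with a hypothetical helper, `parse_viewers`, on a few made-up raw values. In pandas, `pd.to_numeric(df['viewfinale'], errors='coerce')` does much the same thing by converting unparseable entries to NaN.

```python
def parse_viewers(value):
    """Convert a viewer count to float, returning None for blanks or placeholders like 'TBD'."""
    try:
        return float(value)
    except (TypeError, ValueError):
        # TypeError covers None; ValueError covers strings such as "TBD"
        return None

# Made-up raw values for illustration
raw = ["1.56", "5.22", "TBD", None]
cleaned = [parse_viewers(v) for v in raw]

# Average only the values that are actually known
known = [v for v in cleaned if v is not None]
average = sum(known) / len(known)
```

Here the two placeholder entries are skipped, so the average reflects only the finales that actually aired.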

<a id='null'></a>

## 4.4. Null Data

The last thing that we want to check is the number of NaN values in our dataset. Since I sourced this data myself, I am already familiar with which values are and are not there, but for the purposes of this notebook, as well as to provide the reader with full information, I will still look into how much null data our dataframe contains.

I do not plan on removing or replacing any of the null data, since most of the descriptive data is only included to get a feel for what sort of shows were included in this project and will not be included in modeling. When I get to EDA in the following notebook, I will revisit the null data and possibly leave some of the shows with missing data out from some of my visuals.

In [28]:
df.isna().sum() #sum of all null data in df

program                0
year                   0
rtscore                0
IMDB                   0
label                  0
premise                0
url                    0
seasons                0
episodes               0
network                0
notes                 23
actors                30
viewpilot              2
viewseason1average     2
viewseason1finale      1
viewaverage            3
viewfinale             1
dtype: int64

As previously specified, I was unable to get data on the viewer columns (number of viewers who watched the pilot episode, number of viewers who watched final episode of season 1, etc.) because I couldn't find it reported from a reliable source for a few shows that aired more than 15 years ago and weren't one of the sitcom giants such as Seinfeld or Friends. 

I also already discussed the instances of null data in the Notes column and Actors column, but let's take a closer look.

<a id='notes'></a>

### 4.4.1. Notes Column

In [29]:
notes_list = df.notes.dropna().unique() #list of unique notes values (without nulls)
notes_list

array(['show was in primetime spot (after HIMYM) for first 15 episodes and averaged 8.23 mil, cbs tested it by moving it to less prime timeslot for last 3 episodes, and last 3 episodes averaged 5.5 mil viewers',
       '3 additional episodes unaired', 'based on movie of same name',
       "inspired by chelsea handler's book",
       'pilot aired after summer olympics (primetime airing)',
       'launched from web-series, 2 additional episodes unaired',
       'good numbers, cbs had plenty of other sitcoms at the time though',
       '6 additional unaired episodes in US', 'animated',
       '5 additional unaired', '11 additional episodes  unaired',
       '11 additional unaired', '6 additional episodes unaired',
       '1 additional unaired, remake', 'still airing (final season)',
       'after season 7, charlie sheen was replaced by ashton kutcher',
       'still airing', 'spinoff of UK show',
       'still airing, eventually moved to nbc',
       'chris rock canceled it himself', 'mus

As I explained partially in the "Columns" section of my notebook, I only filled in values for roughly half of the rows in the "notes" column and the "actors" column. The purpose of the notes column was to list things that I wanted to keep track of while collecting data. For instance, while I was gathering data for failed pilots, I saw that a decent amount of the failed sitcoms were pulled before all of their episodes had aired. I wanted to keep track of this to see if there was a pattern among my data between shows pulled early and especially low IMDB scores or something of the like. Additionally, if and when I revisit this data someday, maybe I will make a variable for "pulled_early".

I also wanted to make notes along the way so that I could keep my failed and successful pilots data somewhat even. For example, I knew I would have at least one animated show in my successful pilots list (The Simpsons), so I wanted to take care to include at least one animated show in my failed list. 

The most common note I made for failed pilots was regarding unaired episodes, while the most common note I made for successful pilots was that the show was still on the air. That note was also important for getting viewer-finale data, because obviously that data would not exist if the finale hadn't aired yet.
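A `pulled_early` variable could be derived from the free-text notes with a regular expression. This is a sketch only: the helper name `unaired_count` is hypothetical, and the pattern is written against the phrasings visible in the notes output, so other wordings would need their own handling.

```python
import re

# Matches e.g. "3 additional episodes unaired" and "6 additional unaired episodes in US"
UNAIRED = re.compile(r"(\d+)\s+additional\b.*?\bunaired")

def unaired_count(note):
    """Return the number of unaired episodes mentioned in a note, or 0 if none is mentioned."""
    if not isinstance(note, str):  # null notes arrive as NaN (a float) or None
        return 0
    match = UNAIRED.search(note)
    return int(match.group(1)) if match else 0

notes = [
    "3 additional episodes unaired",
    "launched from web-series, 2 additional episodes unaired",
    "1 additional unaired, remake",
    "based on movie of same name",
    None,
]
counts = [unaired_count(n) for n in notes]
pulled_early = [c > 0 for c in counts]
```

Keeping the count as well as the boolean flag would make it possible to ask whether shows that lost more episodes also scored lower.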


<a id='actors'></a>

### 4.4.2. Actors Column

In [30]:
actors_list = df.actors.dropna().values #all non-null actors values (duplicates kept)
actors_list

array(['rachel dratch', 'chelsea handler', 'chelsea handler',
       'rob schneider', 'jonah hill', 'kelsey grammar', 'heather graham',
       'george lopez', 'william shatner', 'ashton kutcher, charlie sheen',
       'kirsten bell', 'amy poehler', 'zooey deschanel',
       'tina fey, alec baldwin, tracey morgan', 'hannibal burress',
       'chevy chase', 'steve carrell', 'andy sandberg', 'chris rock',
       'danny devito', 'donald glover'], dtype=object)

For the actors column, I knew that many of the popular, successful sitcoms I planned on including had some "star power" - for instance, The Office had Steve Carell, 30 Rock had Alec Baldwin and Tina Fey, etc. I was worried that popular actors could act as a confounding variable in my data, so I tried to keep track of any instances of famous actors so that I could spot an imbalance if there was one. I was pleasantly surprised to find that a lot of popular actors made their way into unsuccessful shows as well, such as Jonah Hill in "Allen Gregory" or William Shatner in "Bleep My Dad Says". Also, no, that is not a mistake - Chelsea Handler actually did appear in two separate sitcoms (both of which failed)! 
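One quick way to spot such an imbalance is to count, per label, how many shows list a notable actor. The rows below are a small illustrative sample rather than the full dataset, and `star_power_by_label` is a hypothetical helper name.

```python
from collections import Counter

def star_power_by_label(rows):
    """Count shows with and without a listed actor for each label."""
    counts = Counter()
    for label, actors in rows:
        # Null entries arrive as None or NaN, so only non-empty strings count as a star
        has_star = isinstance(actors, str) and actors.strip() != ""
        counts[(label, has_star)] += 1
    return counts

# Illustrative sample only, not the full dataset
sample = [
    ("fail", None),
    ("fail", "rachel dratch"),
    ("fail", "chelsea handler"),
    ("success", "steve carrell"),
    ("success", None),
]
counts = star_power_by_label(sample)
```

Comparing `counts[("fail", True)]` with `counts[("success", True)]`, relative to the size of each group, gives a rough sense of whether star power is balanced across the two lists.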

### Save Data to CSV and get ready for Notebook 2!

In [31]:
data_text.to_csv('scripts.csv') #saving the file to access it in next notebook