# Purpose 

The purpose of this notebook is to read in subtitle data in vtt format and create dataframes from the text based on episode, speaker and sentence.

##  Read in libraries and data

Using the webvtt python library, I will read in all of the subtitle files that I obtained from Netflix. I will then strip just the text from the vtt files.

In [1]:
import webvtt
import pandas as pd
import numpy as np
import re

In [2]:
# creating a list of episode numbers
episodes = ['0101', '0102', '0103', '0104', '0105', '0106', '0107', '0108', '0109', '0110',
            '0201', '0202', '0203', '0204', '0205', '0206', '0207', '0208', '0209', '0210',
            '0301', '0302', '0303', '0304', '0305', '0306', '0307', '0308', '0309', '0310',
            '0401', '0402', '0403', '0404', '0405', '0406', '0407', '0408', '0409', '0410']
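
The same list could also be built programmatically. This is only a sketch: it assumes every season has exactly ten episodes, as the hard-coded list above does, and the name `generated_episodes` is hypothetical.

```python
# Build season/episode codes such as '0101' ... '0410'
# Assumption: 4 seasons with 10 episodes each, matching the list above
generated_episodes = [
    '{:02d}{:02d}'.format(season, episode)
    for season in range(1, 5)
    for episode in range(1, 11)
]

print(len(generated_episodes), generated_episodes[0], generated_episodes[-1])
```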

In [3]:
# Create list of dictionaries by episode and text 
data = []

for e in episodes:
    episode = {}
    episode['episode_num'] = e
    episode['season'] = e[:2]
    episode['episode'] = e[2:]
    vtt = webvtt.read('gbbo_{}.vtt'.format(e))
    episode['text'] = " ".join([ele.text for ele in vtt])
    data.append(episode)

## Read into DataFrame

Now that I have all of the text, I will read all of the episodes into a dataframe.

In [4]:
cols = ['episode_num','season','episode','text']

df = pd.DataFrame(data, columns = cols)
df.tail()

Unnamed: 0,episode_num,season,episode,text
35,406,4,6,"It's week six, and we feel like Snow White bec..."
36,407,4,7,[man] Support from viewers like you makes this...
37,408,4,8,"We've gone historical. I'm talking Henry VIII,..."
38,409,4,9,[announcer]\nSupport from viewers like you\nma...
39,410,4,10,"[Sue] In the beginning... [Paul] Gorgeous, gor..."


## Separate by Speaker 

With the text from each episode in hand, I will separate each episode into individual lines of dialogue by speaker. There are a few challenges in separating the speaker from the dialogue.

First, speakers are identified and separated from dialogue in two ways:

  * A colon ( : )
   
  * Square brackets ( [ ] )

Second, there are two ways speakers' names are formatted:
   * All capital letters 
   
   * Regular capitalization

Third, when a speaker is off camera, their dialogue is noted as a voiceover and is added as part of the speaker's name. For example, "Mary" would then become "Mary, voice-over" or "Mary voice-over".  


To address all of these challenges, I will first use RegEx to find all character names before a colon or within square brackets. Then, I will replace the voice-over text with "vo" so these cases are easier to identify. Next, I will count the number of colons and square brackets to identify which separation method was used in the episode. Depending on the result, the for loop will apply the corresponding RegEx pattern to extract the names and separate them from the dialogue.

Capitalization also varies between episodes: in some, the dialogue is entirely in capital letters, while in others it uses regular capitalization. To handle this, I count the number of uppercase characters in each episode. Within the colon/bracket if statement, a second if statement uses this count to choose which pattern to use for separation.
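
As a rough illustration of the two checks described above, here is a minimal sketch on made-up strings. The helper names `detect_format` and `mostly_upper` and the sample lines are hypothetical and not part of the notebook; the counting rules mirror the loop below, where every character that is not uppercase (including spaces and punctuation) counts as lowercase.

```python
def detect_format(text):
    """Guess the speaker marker: 'colon' if colons outnumber '[', else 'bracket'."""
    return 'colon' if text.count(':') > text.count('[') else 'bracket'


def mostly_upper(text):
    """True if uppercase characters outnumber all other characters."""
    upper = sum(1 for ch in text if ch.isupper())
    return upper > len(text) - upper


# Made-up samples in the two subtitle styles
colon_text = "Mel: THOUSANDS OF PEOPLE APPLIED. Paul: GOOD LUCK."
bracket_text = "[Sue] In the beginning... [Paul] Gorgeous, gorgeous."

print(detect_format(colon_text), mostly_upper(colon_text))
print(detect_format(bracket_text), mostly_upper(bracket_text))
```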

In [5]:
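# RegEx
## to get everything between square brackets: r'[[].*?[]]'
## to get bracketed text that starts with a capital letter: r'[[][A-Z].*?[]]'
## to get a single bracketed capitalized word (no white space): r'[[][A-Z][a-z]*?[]]'

# name with regular capitalization (capital followed by a lowercase letter) before a colon
regex_person_reg = r"([A-Z][a-z].*?[^\s]*)\:"
# name starting with a capital letter, with no white space, before a colon
regex_person_upper = r"([A-Z][^\s]*)\:"
# text in square brackets that starts with a capital letter
regex_parens = r'[[][A-Z].*?[]]'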
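
To see what each pattern captures, the following sketch applies them to short made-up strings. The helper `split_by_speaker` and the sample text are hypothetical; the approach (find the names, replace them with a delimiter, then split) follows the loop in the next cell.

```python
import re

# Patterns copied from the cell above
regex_person_reg = r"([A-Z][a-z].*?[^\s]*)\:"
regex_person_upper = r"([A-Z][^\s]*)\:"
regex_parens = r'[[][A-Z].*?[]]'


def split_by_speaker(text, pattern):
    """Return (speakers, dialogue) found in text using one of the patterns."""
    speakers = re.findall(pattern, text)
    # Replace each speaker tag with a delimiter, then split on it;
    # the first piece is whatever precedes the first speaker, so drop it
    pieces = re.sub(pattern, '|||||||', text).split('|||||||')[1:]
    return speakers, [piece.strip() for piece in pieces]


# Made-up samples for each of the three styles
caps_dialogue = "Mel: THOUSANDS OF PEOPLE APPLIED. Woman: I'VE BEEN BAKING FOR 60 YEARS."
caps_names = "MEL: Thousands of people applied. PAUL: Good luck."
bracketed = "[Sue] In the beginning... [Paul] Gorgeous, gorgeous."

print(split_by_speaker(caps_dialogue, regex_person_reg))
print(split_by_speaker(caps_names, regex_person_upper))
print(split_by_speaker(bracketed, regex_parens))
```

Note that the bracket pattern has no capture group, so the matched names keep their square brackets until punctuation is removed later on.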

In [6]:
# Iterate through text in rows and split out dialogue from speaker 
final = []

for ix, row in df.iterrows():
    text = row['text'].replace(', voice-over'," vo")
    # chain on `text` (not row['text']) so the first replacement is not discarded
    text = text.replace('voice-over',"Vo")
    
    # count how many colons and brackets to determine which case to use 
    c = text.count(':')
    p = text.count('[')
    if c > p:
        
        # count how many upper case and lower case to determine which case to use 
        count_upper = 0
        count_lower = 0
        for t in row['text']:
            if t.isupper():
                count_upper += 1
            else:
                count_lower += 1
        
        if count_upper > count_lower:
            matches = re.findall(regex_person_reg, text, re.MULTILINE)
    
            characters = []
            for match in matches:
                characters.append(match)
        
            replaced_text = re.sub(regex_person_reg, '|||||||', text)
            split_text = replaced_text.split('|||||||')
            
            if len(characters) < len(split_text):
                final.append(pd.DataFrame({'episode_num':row['episode_num'], 'season':row['season']
                                       , 'episode':row['episode'],'character':characters, 'dialogue':split_text[1:]}))
            else:
                final.append(pd.DataFrame({'episode_num':row['episode_num'], 'season':row['season']
                                       , 'episode':row['episode'], 'character':characters, 'dialogue':split_text}))
        else:
            matches = re.findall(regex_person_upper, text, re.MULTILINE)
        
            characters = []
        
            for match in matches:
                characters.append(match)
        
            replaced_text = re.sub(regex_person_upper, '|||||||', text)
            split_text = replaced_text.split('|||||||')
            
            if len(characters) < len(split_text):
                final.append(pd.DataFrame({'episode_num':row['episode_num'], 'season':row['season']
                                           , 'episode':row['episode'],'character':characters, 'dialogue':split_text[1:]}))
            else:
                final.append(pd.DataFrame({'episode_num':row['episode_num'], 'season':row['season']
                                           , 'episode':row['episode'], 'character':characters, 'dialogue':split_text}))
    else:
        matches = re.findall(regex_parens, text, re.MULTILINE)
        
        characters = []
        
        for match in matches:
            characters.append(match)
        
        replaced_text = re.sub(regex_parens, '|||||||', text)
        split_text = replaced_text.split('|||||||')
        
        if len(characters) < len(split_text):
            final.append(pd.DataFrame({'episode_num':row['episode_num'], 'season':row['season'], 'episode':row['episode'],'character':characters, 'dialogue':split_text[1:]}))
        else:
            final.append(pd.DataFrame({'episode_num':row['episode_num'], 'season':row['season'], 'episode':row['episode'], 'character':characters, 'dialogue':split_text}))



Now that each speaker has been separated from their dialogue, I will concatenate all of the episodes into one big dataframe.

In [7]:
full = pd.concat(final)

In [8]:
full.index = pd.RangeIndex(len(full.index))
full.head()

Unnamed: 0,episode_num,season,episode,character,dialogue
0,101,1,1,Announcer,HELP EVERYONE EXPLORE NEW WORLDS AND IDEAS. S...
1,101,1,1,Mel,THOUSANDS OF PEOPLE APPLIED. IT'S BEEN QUITE ...
2,101,1,1,Mel,"JUST 12 HAVE MADE IT THROUGH, AND OVER THE NE..."
3,101,1,1,Mel,"THEIR BAKING WILL BE SCRUTINIZED, WHATEVER TH..."
4,101,1,1,Woman,I'VE BEEN BAKING FOR 60 YEARS. I SUPPOSE I'M ...


In [9]:
full['dialogue'] = full['dialogue'].str.lower()

In [10]:
pbs = full[full['dialogue'].str.contains('support your pbs station')].index
full.drop(index=pbs, inplace=True)

Now that I have my dataframe, I will clean up any edge cases and consolidate the names by formatting them all the same way.

In [None]:
# All unique characters 
full['character'].unique()

I will also remove any blank dialogue that I might have captured. 

In [11]:
# Replace empty cells with na
full = full.replace(r'^\s*$', np.nan, regex=True)

In [12]:
#Drop rows with no dialogue
total_rows = len(full.index)

# Drop na's 
full.dropna(subset=['dialogue'],inplace=True)
data_kept = len(full.index)/total_rows
print('Data Retained:'+str(round(data_kept*100,2))+' %')

Data Retained:99.85 %


There are a few cases where the formatting was different. I will edit these individually.

In [13]:
# dialogue was separated by a semicolon
sue_richard = "Sue; OK, THAT'S IT, THE BAKE'S OVER. BAKERS, STEP AWAY FROM YOUR ENTREMETS, S'IL VOUS PLAIT. OH... ALL OF IT WENT... [SHEEP BLEATING] IT TAKES A LOT OF GUTS TO BE ABLE TO SHOW ALL THE LAYERS. Richard"
full.loc[full['character'] == sue_richard, 'character'] = 'Richard'

# dialogue separated by colon and bracket
full.loc[full['character'] == "Whispering] YOU'RE KIDDING! Paul", 'character'] = 'Paul'

# Paul Hollywood referred to as Paul H.
full.loc[full['character'] == 'H.', 'character'] = 'Paul'

I will now remove all of the punctuation from the character names and lowercase them, to match the dialogue, which was lowercased above.

In [14]:
import string

# Remove punctuation and lower case character names 
punc_lower = lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x.lower())

full['character'] = full['character'].map(punc_lower)
full['character'] = [character.strip() for character in full['character']]
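
As a quick check of what this cleaning step does, the sketch below runs the same lambda on a few made-up names (the sample values are hypothetical). Each punctuation character becomes a space, so a comma followed by a space leaves a double space behind.

```python
import re
import string

# Same lambda as in the cell above
punc_lower = lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x.lower())

# Made-up raw speaker labels in the styles seen so far
raw_names = ['[Sue]', 'Mary, Vo', 'PAUL']
cleaned = [punc_lower(name).strip() for name in raw_names]

print(cleaned)
```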

Because there are a number of typos and cases where several characters speak together, I will fix and relabel these.

In [15]:
# map typos and combined speakers to a single label
name_fixes = {
    'paul and mary': 'judges',
    'sue and mel': 'both',
    'sure': 'sue',
    'different man': 'man',
    'pual': 'paul',
    'mary  paul  and mel': 'judges',
    'hollywood': 'paul',
    'paul and sue': 'sue',
    'narration': 'narrator',
    'french accent': 'sue',
}
full['character'] = full['character'].replace(name_fixes)

In [None]:
full['character'].unique()

Some names are still followed by extra words, such as an action or the "vo" marker. I will strip these by keeping only the first word of each name.

In [16]:
name = full['character'].str.split(' ', n=1, expand = True)
full['character'] = name[0]
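
The effect of this split can be seen on a small made-up series (the sample labels, including "paul laughing", are hypothetical). With `n=1` the string is split only at the first space, and column 0 of the expanded result holds the first word.

```python
import pandas as pd

# Made-up cleaned labels: a voice-over marker, an action word, and a plain name
names = pd.Series(['mary  vo', 'paul laughing', 'sue'])

# Split once on the first space and keep the first piece
first_word = names.str.split(' ', n=1, expand=True)[0]

print(first_word.tolist())
```

Because only the first word survives, multi-word labels need to be mapped to a single word beforehand.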

In [17]:
print(full['character'].nunique())

77


Now that I have cleaned all of the names, I will create a column with each person's role in the show (judge, host, contestant).

In [18]:
# function to assign character roles 
def roles(row):
    if row['character'] in ('paul', 'mary', 'judges'):
        return 'judge'
    elif row['character'] in ('sue', 'mel', 'both', 'announcer', 'narrator'):
        return 'host'
    else:
        return 'contestant'
        

In [19]:
full['role'] = full.apply(roles, axis=1)

In [20]:
full.head()

Unnamed: 0,episode_num,season,episode,character,dialogue,role
1,101,1,1,mel,thousands of people applied. it's been quite ...,host
2,101,1,1,mel,"just 12 have made it through, and over the ne...",host
3,101,1,1,mel,"their baking will be scrutinized, whatever th...",host
4,101,1,1,woman,i've been baking for 60 years. i suppose i'm ...,contestant
5,101,1,1,man,the thing that worries me the most is probabl...,contestant


In addition to identifying the characters' roles in the show, for the purposes of my project I would like to distinguish between male and female characters. I will do this by listing all of the unique character names and creating a dictionary that maps each character to female or male.

In [None]:
full.character.unique().tolist()

In [21]:
# Dictionary with gender identification
mapping = {'announcer':'male',
 'mel':'female',
 'woman':'female',
 'man':'male',
 'sue':'female',
 'mary':'female',
 'paul':'male',
 'diana':'female',
 'chetna':'female',
 'claire':'female',
 'richard':'male',
 'jordan':'male',
 'enwezor':'male',
 'children':'both',
 'kate':'female',
 'martha':'female',
 'iain':'male',
 'nancy':'female',
 'luis':'male',
 'both':'female',
 'judges':'both',
 'norman':'male',
 'narrator':'male',
 'all':'both',
 'peter':'male',
 'sarah':'female',
 'tim':'male',
 'girl':'female',
 'louise':'female',
 'glenn':'male',
 'ali':'male',
 'lucy':'female',
 'howard':'male',
 'frances':'female',
 'mark':'male',
 'ruby':'female',
 'christine':'female',
 'robert':'male',
 'toby':'male',
 'deborah':'female',
 'beca':'female',
 'kimberley':'female',
 'meg':'female',
 'rob':'male',
 'kimberly':'female',
 'deirdre':'female',
 'kevin':'male',
 'natalie':'female',
 'giuseppe':'male',
 'marie':'female',
 'nadiya':'female',
 'stu':'male',
 'ian':'male',
 'sandy':'female',
 'ugne':'female',
 'dorret':'female',
 'flora':'female',
 'tamal':'male',
 'mat':'male',
 'alvin':'male',
 'jagger':'male',
 'abdal':'male',
 'shoma':'female',
 'candice':'female',
 'andrew':'male',
 'val':'female',
 'benjamina':'female',
 'michael':'male',
 'tom':'male',
 'rav':'male',
 'jane':'female',
 'selasi':'male',
 'helen':'female',
 'kay':'female',
 'nigel':'male',
 'amy':'female',
 'henry':'male'}

In [22]:
full['gender'] = full['character'].map(mapping)
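
Any character missing from the dictionary ends up with a missing value in the new column, so it can be worth listing those names. A minimal sketch on a made-up dataframe (the names `toy`, `toy_mapping` and `unmapped` are hypothetical):

```python
import pandas as pd

# Made-up data: 'stranger' is deliberately absent from the mapping
toy = pd.DataFrame({'character': ['mel', 'paul', 'stranger', 'mel']})
toy_mapping = {'mel': 'female', 'paul': 'male'}

toy['gender'] = toy['character'].map(toy_mapping)

# Characters that did not receive a gender label
unmapped = toy.loc[toy['gender'].isna(), 'character'].unique().tolist()

print(unmapped)
```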

In [23]:
full.replace({'\n': ' '}, regex=True, inplace=True)

In [None]:
full.head()

Next, I will save the final cleaned and separated data into a csv file for the next process.

In [24]:
full.to_csv('clean.csv', index=False)
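
One thing to watch when loading the file in the next step: codes such as '0101' are read back as integers by default, which drops the leading zeros. The sketch below shows this on a small made-up frame, with an in-memory buffer standing in for `clean.csv`; passing `dtype` for those columns is one possible way to keep them as text.

```python
import io
import pandas as pd

# Made-up rows with zero-padded codes, written to an in-memory CSV
sample = pd.DataFrame({'episode_num': ['0101', '0410'],
                       'season': ['01', '04'],
                       'episode': ['01', '10']})
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Default parsing turns the codes into integers
buffer.seek(0)
as_default = pd.read_csv(buffer)

# Reading the same columns as strings preserves the padding
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={'episode_num': str, 'season': str, 'episode': str})

print(as_default['episode_num'].tolist(), as_text['episode_num'].tolist())
```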