# Notebook 2: Clean Source to Extract Transcripts

## Introduction

Now that I've extracted the source for each page of transcripts, I need to extract the actual transcript text. Unfortunately, there is no single, easily digestible method for getting a clean transcript, so I relied mostly on trial and error to get the text I wanted. I'm certain there are places where the transcripts are not 100% clean. I either addressed this at a later point with more data cleaning or let it go, because the analysis makes sense with the data in its current state.

In [1]:
import pandas as pd
import numpy as np
import time, os
import re
import pickle
import string

## Clean de Blasio Transcripts

### Extracting transcripts from source code

This process is messy, and I ended up using two different cleaning functions: the second one catches the pages missed by the first.

In [2]:
with open('../data/bdbsource_519.pickle', 'rb') as read_file:
    bdbsource_519 = pickle.load(read_file)
with open('../data/bdblinks_519.pickle', 'rb') as read_file:
    bdblinks_519 = pickle.load(read_file) 

In [3]:
def get_text(source_object):
    text_list = []
    for s in source_object:
        text = s.find_all('p')
        text_list.append(text)
    return text_list

In [4]:
transcript_length = []
for i in get_text(bdbsource_519):
    transcript_length.append(len(i))
transcript_length_array = np.array(transcript_length)

In [5]:
first_transcripts = get_text(bdbsource_519)

In [6]:
# These are the indices that need to be re-extracted with get_text2
indices = list(np.argwhere(transcript_length_array == 1).flatten())

In [7]:
def get_text2(source_object):
    text_list = []
    for s in source_object:
        text = s.find('p').parent
        text_list.append(text)
    return text_list

In [8]:
# Extract a better transcript from the bad indices
better_transcripts = []
for i in indices:
    better_transcripts.append(get_text2(bdbsource_519[i]))

In [9]:
# Replace the first-pass transcripts with the re-extracted ones.
# The loop variables get their own names so they don't overwrite the two lists being zipped.
for idx, better in zip(indices, better_transcripts):
    first_transcripts[idx] = better
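
The two-pass idea above (extract everything once, flag results that look wrong, re-extract only those) can be sketched without BeautifulSoup. This is a minimal illustration on made-up strings: `split_paragraphs`, `split_lines`, and `extract_with_fallback` are hypothetical stand-ins for `get_text` and `get_text2`, not the functions used in this notebook.

```python
def split_paragraphs(page):
    """First pass (hypothetical): paragraphs separated by blank lines."""
    return [p for p in page.split('\n\n') if p]

def split_lines(page):
    """Second pass (hypothetical): one paragraph per line."""
    return [p for p in page.split('\n') if p]

def extract_with_fallback(pages, first, second, suspicious_length=1):
    """Run `first` on every page, then redo pages whose result has the suspicious length."""
    results = [first(page) for page in pages]
    retry = [i for i, r in enumerate(results) if len(r) == suspicious_length]
    for i in retry:
        results[i] = second(pages[i])
    return results, retry

# Made-up pages: the second one has no blank lines, so the first pass returns one block
pages = ["Intro.\n\nBody.\n\nClose.", "Intro.\nBody.\nClose."]
results, retried = extract_with_fallback(pages, split_paragraphs, split_lines)
print(retried)
print(results[1])
```

Keeping the list of retried indices makes it easy to spot-check only the pages that needed the second function.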

### Additional Cleaning Steps

I also wanted to remove HTML tags, extract the date of the speech, and, in instances where the speech was actually an interview, keep only the parts where Mayor de Blasio was speaking.

In [10]:
def remove_html_tags(text):
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)
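
As a quick check of what the non-greedy pattern does, here is a small sketch on a made-up HTML snippet. It assumes the same `<.*?>` pattern as `remove_html_tags`; the sample string and the use of `html.unescape` are my additions for illustration.

```python
import re
from html import unescape

TAG_PATTERN = re.compile(r'<.*?>')  # same non-greedy pattern as remove_html_tags

def strip_tags(text):
    """Delete anything that looks like a tag, leaving the text between tags."""
    return TAG_PATTERN.sub('', text)

# Made-up sample, not taken from the scraped pages
sample = '<p>Good morning, <strong>everyone</strong>.</p><p>Q&amp;A follows.</p>'
stripped = strip_tags(sample)
print(stripped)            # tags are gone, entities such as &amp; are still encoded
print(unescape(stripped))  # html.unescape decodes the remaining entities
```

Note that the tags are removed without inserting whitespace, so the end of one paragraph and the start of the next run together. That may be why some cleaned transcripts contain sentences with no space after the period.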

In [11]:
def clean_bdb(transcript_list, link_list):
    # Two page layouts: '[', date line, text line, ']' or blank line, date line, text line, BOM
    bracket_pattern = re.compile(r'\[\n(.+)\n(.+)\]')
    dated_pattern = re.compile(r'\n\n(.+20.{2})\n(.+)\n\ufeff')
    date = []
    text = []
    for i in transcript_list:
        cleaned = remove_html_tags(str(i))
        # Try the bracketed layout first, then fall back to the dated layout
        match = bracket_pattern.search(cleaned) or dated_pattern.search(cleaned)
        date.append(match.group(1))
        text.append(match.group(2))
    date = pd.to_datetime(date)
    df = pd.DataFrame([date, link_list, text]).T
    df.columns = ['date', 'link', 'text']
    return df
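
To make the two layouts concrete, here is a minimal sketch that applies the same two regular expressions to made-up strings. The helper `split_date_and_text` is hypothetical, and unlike `clean_bdb` it returns `None` instead of raising when neither pattern matches.

```python
import re

BRACKETED = re.compile(r'\[\n(.+)\n(.+)\]')             # '[', date line, text line, ']'
DATED = re.compile(r'\n\n(.+20.{2})\n(.+)\n\ufeff')     # blank line, date line, text line, BOM

def split_date_and_text(cleaned):
    """Return (date, text) from the first pattern that matches, or None."""
    for pattern in (BRACKETED, DATED):
        match = pattern.search(cleaned)
        if match:
            return match.group(1), match.group(2)
    return None

# Made-up inputs that mimic the two layouts
bracketed = '[\nMay 19, 2020\nMayor Bill de Blasio: Good morning, everybody.]'
dated = 'Transcript\n\nMay 19, 2020\nMayor Bill de Blasio: Good morning, everybody.\n\ufeff'

print(split_date_and_text(bracketed))
print(split_date_and_text(dated))
print(split_date_and_text('no date here'))
```

Because `.` does not match a newline by default, each group captures a single line, so both patterns rely on the whole transcript sitting on one line after the tags are stripped.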

In [12]:
clean = clean_bdb(first_transcripts, bdblinks_519)

In [13]:
def take_monologue(transcript):
    if len(re.findall('sio:([^:]+)|yor:([^:]+)', str(transcript))) > 0:
        return str(re.findall('sio:([^:]+)|yor:([^:]+)', str(transcript)))
    else:
        return ''
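
It helps to see what `re.findall` returns for a pattern with two alternative groups. The sketch below runs the same pattern on a made-up exchange; `join_groups` is a hypothetical helper shown only to illustrate one way of flattening the result, and it is not what the notebook does (the notebook keeps `str()` of the raw list and strips the punctuation later).

```python
import re

SPEAKER = re.compile(r'sio:([^:]+)|yor:([^:]+)')  # same pattern as take_monologue

# Made-up exchange with both speaker labels
sample = ('Mayor: Good morning. Question: What about schools? '
          'Mayor Bill de Blasio: They reopen Monday.')

matches = SPEAKER.findall(sample)
print(matches)  # one tuple per match, with an empty string for the group that did not match

def join_groups(matches):
    """Hypothetical helper: keep the non-empty group of each tuple and join them."""
    return ' '.join(part.strip() for pair in matches for part in pair if part)

joined = join_groups(matches)
print(joined)
```

Since `[^:]+` runs up to the next colon, the label of the next speaker (here `Question`) is swept into the captured text. This is one of the places where the transcripts are not perfectly clean.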

In [14]:
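# Characters to strip: quotes, backslash, comma, brackets, parentheses, hyphen, and en dash.
# re.escape handles the escaping later, so the brackets need no backslashes here.
toremove = "'\\,\"[]()-–"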

In [15]:
punc_lower = lambda x: re.sub('[%s]' % re.escape(toremove), '', x.lower())
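# str() on the list of matches writes non-breaking spaces as the literal text '\xa0';
# punc_lower strips the backslash, so the leftover 'xa0' is removed here
remove_xa0 = lambda x: x.replace('xa0', '')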
remove_space = lambda x: x.replace('  ', ' ')

In [16]:
clean['monologue'] = clean['text'].map(take_monologue).map(punc_lower).map(remove_xa0).map(remove_space)
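
The chain of small cleaning functions is easier to follow on one concrete value. This sketch reproduces the three lambdas (with the same character set as `toremove`) and applies them to a made-up `findall` result containing a non-breaking space.

```python
import re

toremove = "'\\,\"[]()-–"  # same characters the notebook strips
punc_lower = lambda x: re.sub('[%s]' % re.escape(toremove), '', x.lower())
remove_xa0 = lambda x: x.replace('xa0', '')
remove_space = lambda x: x.replace('  ', ' ')

# Made-up findall result; '\xa0' is a real non-breaking space character
matches = [('', ' Good morning,\xa0everybody.')]

as_text = str(matches)           # repr of the list: the space is now the 4 characters \xa0
step1 = punc_lower(as_text)      # lowercases and strips quotes, commas, brackets, backslashes
step2 = remove_xa0(step1)        # removes the leftover literal 'xa0'
cleaned = remove_space(step2)    # collapses each double space once
print(repr(as_text))
print(repr(cleaned))
```

In this example the non-breaking space is dropped rather than replaced by a regular space, so the two neighbouring words end up fused. Whether that matters depends on how often non-breaking spaces separate words in the real transcripts.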

In [17]:
# Include a column indicating this is a transcript from Mayor de Blasio
clean.insert(0, 'speaker', 'de blasio')

In [18]:
# with open('../data/bdbtranscript_519.pickle', 'wb') as to_write:
#     pickle.dump(clean, to_write)

## Clean Cuomo Transcripts

### Extract transcripts from source code

First I get each transcript's date and text. The text requires multiple sequential cleaning steps, each taking a more specific extract and removing text that isn't relevant to the speech.

In [19]:
with open('../data/cuomosource_519.pickle', 'rb') as read_file:
    cuomosource_519 = pickle.load(read_file)
with open('../data/cuomolinks_519.pickle', 'rb') as read_file:
    cuomolinks_519 = pickle.load(read_file) 

In [20]:
def get_date(source_object):
    date_list = []
    for i in source_object:
        date = i.find('div', class_="published-date").text
        date_clean = re.search('\\n\\n(.+20.{2})', date).group(1).strip()
        date_list.append(date_clean)
    return date_list

In [21]:
cuomo_date = get_date(cuomosource_519)

In [22]:
def get_text(source_object):
    text_list = []
    for s in source_object:
        text = s.find('div', class_='field field--name-field-body field--type-text-long field--label-hidden')
        text_list.append(text)
    return text_list

In [23]:
cuomo_step1 = get_text(cuomosource_519)

In [24]:
def clean_cuomo(transcript_list):
    clean_transcripts = []
    for i in transcript_list:
        cleaned = remove_html_tags(str(i))
        # Keep the text after 'below:' if present, otherwise the text after 'here' plus one character
        match = re.search('below:(.+)', cleaned) or re.search('here.(.+)', cleaned)
        # Pages matching neither pattern are skipped, so the output can be shorter than the input
        if match is not None:
            clean_transcripts.append(match.group(1))
    return clean_transcripts
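
Here is a small sketch of how the two marker patterns behave, using made-up page text. `text_after_marker` is a hypothetical helper that mirrors the logic of `clean_cuomo` for a single string and returns `None` when nothing matches.

```python
import re

def text_after_marker(cleaned):
    """Return the text after 'below:' or, failing that, after 'here' plus one character."""
    for pattern in (r'below:(.+)', r'here.(.+)'):
        match = re.search(pattern, cleaned)
        if match:
            return match.group(1)
    return None

# Made-up examples, not taken from the scraped pages
print(text_after_marker('A rush transcript is available below:Good morning, everyone.'))
print(text_after_marker('VIDEO of the remarks is available here.Good afternoon.'))
print(text_after_marker('Thanks for being here today.'))  # the dot matches the space
print(text_after_marker('No marker.'))
```

The dot in `here.` is unescaped, so it matches any character after `here`, not only a period. The third example shows the consequence: the match starts at the first `here` in the text, wherever it occurs.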

In [25]:
cuomo_step2 = clean_cuomo(cuomo_step1)

I decided to check whether I was actually pulling transcripts for all of the pages, so I calculated the length for each transcript and looked at the shortest ones.

In [26]:
test_len = []
for i in cuomo_step2:
    test_len.append(len(i))
test_len_array = np.array(test_len)
test_len_array.argsort()

array([ 68,  99, 152,  48,  82,  37, 128,  35, 169,  77,  25,  43, 136,
        81,  22,  96, 162,  15,  45,  93, 113, 147, 116, 165, 161, 157,
        58,  63, 123,  20, 101,  44,   1,  97, 108,  76,  85,  30,  79,
       156,  14,  60, 124, 146, 110,  46, 103,  95,  39, 111,  83, 109,
       148, 132,   2, 179, 140, 182,  69,  71,  31,  16, 127,  21,  72,
       114,  88, 133, 115, 168, 174,  28,   9, 178,  51,  53,  55, 172,
        65, 171,  94, 173, 160,  70, 154,  78,  18,  24, 107,  52,  11,
        89, 102,   6,  32,  75,  13, 145,  42,   5, 153, 137,  59,  47,
       135,  17,  80,  12, 166,  61, 143, 151,  90, 129, 141,  57, 118,
       144,  92,  41,   4,  98,  86,  50,  33,  19, 158, 177, 117, 100,
         0, 139, 175, 106, 163, 180,  29, 155, 138, 112, 120, 130,  74,
       131,  40, 150,  66,   8, 126,  73,  10,  67,  27, 105,  54,  38,
        87, 104, 142, 176,  34,  23, 122, 181, 119,  84,   3, 121, 184,
       183,  64, 125,  91,  49, 134,  56, 159,  26, 149,   7, 16

In [27]:
cuomo_step2[68]

"and in TV quality (h.264, mp4) format\xa0here, with ASL interpretation available on YouTube\xa0here\xa0and in TV quality format\xa0here.\xa0\xa0AUDIO\xa0of today's remarks is available\xa0here.PHOTOS\xa0are available on the Governor's Flickr\xa0page."

This is not a valid transcript, so I will be removing this from the list. I'll look at the next shortest one, too.

In [28]:
cuomo_step2[99]

"The State has a 1-800 number. It is 1-800-942-6906. That is our domestic violence hotline. Women should know that they don't have to stay in those situations. We will help them relocate. We will help them find safe shelter. And if there is an issue where you are in immediate harm, call 911 immediately. I spoke to the State Police this morning. There is a reported uptick, as you said, some reports as high as 15 to 20 percent. It's unacceptable on any day and I want people to know that in every single case that is reported, the State Police is going to investigate fully and bring the full bear\xa0of the law behind it."

That one looks okay, so I will only delete the shortest transcript from my lists.

In [29]:
delete_index = test_len_array.argsort()[0]
del cuomo_step2[delete_index]
del cuomolinks_519[delete_index]
del cuomo_date[delete_index]
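
Deleting the same position from three lists only works if the lists are aligned. Since `clean_cuomo` appends a transcript only when one of its patterns matches, a length check before deleting is a cheap safeguard. The sketch below uses plain Python and made-up data; `drop_shortest` is a hypothetical helper, not part of the notebook.

```python
def drop_shortest(texts, *parallel_lists):
    """Delete the shortest entry of `texts` and the same position in every parallel list."""
    sizes = {len(texts)} | {len(p) for p in parallel_lists}
    if len(sizes) != 1:
        raise ValueError(f'lists are not aligned: lengths {sorted(sizes)}')
    shortest = min(range(len(texts)), key=lambda i: len(texts[i]))
    for seq in (texts, *parallel_lists):
        del seq[shortest]
    return shortest

# Made-up stand-ins for cuomo_step2, cuomolinks_519 and cuomo_date
texts = ['a full transcript of remarks', 'video here', 'another full transcript']
links = ['link-0', 'link-1', 'link-2']
dates = ['2020-03-19', '2020-01-23', '2020-03-04']

removed = drop_shortest(texts, links, dates)
print(removed, texts, links, dates)
```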

Now I will put them into a dataframe with the date, link, and transcript text, as I did for the de Blasio transcripts.

In [30]:
cuomo_date = pd.to_datetime(cuomo_date)

In [31]:
cuomo_clean = pd.DataFrame([cuomo_date, cuomolinks_519, cuomo_step2]).T
cuomo_clean.columns = ['date', 'links', 'text']

In [32]:
cuomo_clean

| | date | links | text |
|---|---|---|---|
| 0 | 2020-03-19 | https://www.governor.ny.gov/news/video-audio-p... | Good morning, everyone. Let me introduce the ... |
| 1 | 2020-01-23 | https://www.governor.ny.gov/news/video-audio-p... | The topic today is transportation which is vit... |
| 2 | 2020-03-04 | https://www.governor.ny.gov/news/video-audio-p... | We have some good news, we have some bad news.... |
| 3 | 2020-04-28 | https://www.governor.ny.gov/news/video-audio-p... | Good afternoon to everyone. I want to introduc... |
| 4 | 2020-05-01 | https://www.governor.ny.gov/news/video-audio-p... | Good morning. Pleasure to be with you. Everybo... |
| ... | ... | ... | ... |
| 179 | 2020-02-04 | https://www.governor.ny.gov/news/rush-transcri... | Brian Lehrer: New York's Medicaid program has ... |
| 180 | 2020-04-01 | https://www.governor.ny.gov/news/video-audio-p... | Good afternoon. Lots going on today, coronavir... |
| 181 | 2020-03-19 | https://www.governor.ny.gov/news/audio-rush-tr... | Alisyn Camerota: Joining us now is New York Go... |
| 182 | 2020-03-25 | https://www.governor.ny.gov/news/video-audio-p... | Good morning. Thank you for being here today.I... |


In [33]:
# with open('../data/cuomotranscript_519.pickle', 'wb') as to_write:
#     pickle.dump(cuomo_clean, to_write)

### Further Cleaning

I ended up needing to use three different functions using different regular expressions to extract just the parts of the transcript where Cuomo was speaking.

In [34]:
def take_monologue(transcript):
    if len(re.findall('Cuomo:([^:]+)', str(transcript))) > 0:
        return str(re.findall('Cuomo:([^:]+)', str(transcript)))
    else:
        return str(transcript)

In [35]:
def take_monologue2(transcript):
    if len(re.findall('Cuomo:(.*?)[A-Z][a-z]+:', str(transcript))) > 0:
        return str(re.findall('Cuomo:(.*?)[A-Z][a-z]+:', str(transcript)))
    else:
        return str(transcript)

In [36]:
def take_monologue3(transcript):
    if len(re.findall('^(.*?)[A-Z][a-z]+:', str(transcript))) > 0:
        return str(re.findall('^(.*?)[A-Z][a-z]+:', str(transcript)))
    else:
        return str(transcript)
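
The three patterns differ in where they stop capturing. A minimal sketch on two made-up transcripts shows the difference; the sample strings and constant names are assumptions for illustration only.

```python
import re

STOP_AT_COLON = re.compile(r'Cuomo:([^:]+)')            # as in take_monologue
STOP_AT_LABEL = re.compile(r'Cuomo:(.*?)[A-Z][a-z]+:')  # as in take_monologue2
OPENING_ONLY = re.compile(r'^(.*?)[A-Z][a-z]+:')        # as in take_monologue3

# Made-up interview: the answer contains a colon inside a time
interview = ('Brian Lehrer: Governor, welcome. '
             'Governor Cuomo: Thanks. We meet at 10:30 tomorrow. '
             'Brian Lehrer: And then?')
# Made-up briefing: unlabeled opening remarks followed by questions
briefing = ('Good morning. Thank you for coming. '
            'Question: Any updates? Governor Cuomo: Not yet.')

first = STOP_AT_COLON.findall(interview)   # cut short at the colon in 10:30
second = STOP_AT_LABEL.findall(interview)  # runs to the next 'Name:' label
third = OPENING_ONLY.findall(briefing)     # everything before the first label
print(first)
print(second)
print(third)
```

The first pattern stops at any colon, so it truncates answers that contain one. The second stops at the next capitalized word followed by a colon, which keeps the full answer but also keeps the interviewer's first name. The third recovers opening remarks that carry no speaker label at all.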

In [37]:
cuomo_clean['monologue'] = cuomo_clean['text'].map(take_monologue)

In [38]:
cuomo_clean['monologue2'] = cuomo_clean['text'].map(take_monologue2)

In [39]:
# Only use the third function when the first one captured very little:
# the full text must be more than 11 times as long as the first extraction
def extract_m3(col1, col2):
    if (len(col1)-len(col2))/len(col2) > 10:
        return take_monologue3(col1)
    else:
        return ''

In [40]:
cuomo_clean['monologue3'] = cuomo_clean.apply(lambda x: extract_m3(x.text, x.monologue), axis = 1)

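I want to keep the first extraction only when it is more than twice as long as the second (otherwise I keep the second), and supplement the result by prepending the text taken from the third function if it exists.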

In [41]:
def compare_two(col1, col2):
    if (len(col1)-len(col2))/len(col2) > 1:
        return col1
    else:
        return col2

In [42]:
cuomo_clean['final_text'] = cuomo_clean.apply(lambda x: compare_two(x.monologue, x.monologue2), axis = 1)

In [43]:
cuomo_clean['final_text2'] = cuomo_clean.monologue3 + cuomo_clean.final_text
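
Both selection rules rest on the same relative-length ratio, which is easier to read with numbers plugged in. The sketch below uses placeholder strings of known lengths; `relative_excess` and `pick_between` are hypothetical names for the arithmetic inside `extract_m3` and `compare_two`.

```python
def relative_excess(col1, col2):
    """How much longer col1 is than col2, as a multiple of col2's length."""
    return (len(col1) - len(col2)) / len(col2)

def pick_between(col1, col2):
    """Same rule as compare_two: keep col1 only if it is more than twice as long as col2."""
    return col1 if relative_excess(col1, col2) > 1 else col2

# Threshold used by extract_m3: the ratio must exceed 10
print(relative_excess('x' * 1200, 'y' * 100))  # (1200 - 100) / 100 = 11.0

# Thresholds used by compare_two
kept_first = pick_between('a' * 100, 'b' * 40)   # ratio 1.5, first extraction kept
kept_second = pick_between('a' * 100, 'b' * 60)  # ratio about 0.67, second extraction kept
print(len(kept_first), len(kept_second))
```

The last case shows that the rule does not always keep the longer string: a first extraction that is longer, but less than twice as long, loses to the second one.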

In [44]:
remove_sxa0 = lambda x: x.replace('\xa0', '')

In [45]:
cuomo_clean['final_clean'] = cuomo_clean['final_text2'].map(punc_lower).map(remove_xa0).map(remove_sxa0).map(remove_space)

In [46]:
cuomo_final = cuomo_clean.loc[:, ['date', 'links', 'text', 'final_clean']].copy()
cuomo_final.columns = ['date', 'link', 'text', 'monologue']

In [47]:
# Include a column indicating this is a transcript from Governor Cuomo
cuomo_final.insert(0, 'speaker', 'cuomo')

In [48]:
cuomo_final.head()

| | speaker | date | link | text | monologue |
|---|---|---|---|---|---|
| 0 | cuomo | 2020-03-19 | https://www.governor.ny.gov/news/video-audio-p... | Good morning, everyone. Let me introduce the ... | good morning everyone. let me introduce the pe... |
| 1 | cuomo | 2020-01-23 | https://www.governor.ny.gov/news/video-audio-p... | The topic today is transportation which is vit... | the topic today is transportation which is vit... |
| 2 | cuomo | 2020-03-04 | https://www.governor.ny.gov/news/video-audio-p... | We have some good news, we have some bad news.... | we have some good news we have some bad news. ... |
| 3 | cuomo | 2020-04-28 | https://www.governor.ny.gov/news/video-audio-p... | Good afternoon to everyone. I want to introduc... | good afternoon to everyone. i want to introduc... |
| 4 | cuomo | 2020-05-01 | https://www.governor.ny.gov/news/video-audio-p... | Good morning. Pleasure to be with you. Everybo... | good morning. pleasure to be with you. everybo... |


In [49]:
# with open('../data/cuomotranscript_519.pickle', 'wb') as to_write:
#     pickle.dump(cuomo_final, to_write)