# Data Cleaning

## Problem Statement

The goal is to analyze the first 2020 presidential debate between Trump and Biden: the topics each speaker covered, how positive their language was over the course of the debate, and other aspects of their speech.


## Getting The Data

In [39]:
# Web scraping, pickle imports
import requests
from bs4 import BeautifulSoup
import pickle

# Scrapes transcript data from rev.com and splits it by speaker
def url_to_transcript(url):
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    trump = ''
    biden = ''
    wallace = ''
    for p in soup.find(class_="fl-callout-text").find_all('p'):
        if p.text.split():
            name = p.text.split()[0]
            if name == 'Chris':
                wallace += p.text
            elif name == 'President':
                trump += p.text
            elif name == 'Vice':
                biden += p.text
            
    return trump,biden,wallace

# Speakers
speakers = ['Trump','Biden','Wallace']
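
The speaker assignment in `url_to_transcript` depends only on the first word of each paragraph, so it can be checked offline without any network access. The following is a minimal sketch of that dispatch logic using a lookup table instead of an `if`/`elif` chain; `split_by_speaker` and the sample paragraphs are hypothetical and are not part of the original notebook.

```python
def split_by_speaker(paragraphs):
    """Concatenate paragraphs per speaker, keyed on each paragraph's first word."""
    first_word_to_speaker = {"President": "Trump", "Vice": "Biden", "Chris": "Wallace"}
    transcripts = {speaker: "" for speaker in first_word_to_speaker.values()}
    for text in paragraphs:
        words = text.split()
        if not words:
            continue  # skip empty paragraphs
        speaker = first_word_to_speaker.get(words[0])
        if speaker is not None:
            transcripts[speaker] += text
    return transcripts

# Made-up sample paragraphs, only to exercise the function
sample = [
    "Chris Wallace: (00:00)\nFirst question.",
    "President Donald J. Trump: (00:05)\nFirst answer.",
    "",
    "Vice President Joe Biden: (00:10)\nSecond answer.",
    "Chris Wallace: (00:15)\nNext topic.",
]
result = split_by_speaker(sample)
```

Paragraphs whose first word matches none of the three labels are silently dropped, which mirrors the behavior of the original function.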

In [40]:
# Scrape the transcript and assign each speaker's text to its own variable
url = 'https://www.rev.com/blog/transcripts/donald-trump-joe-biden-1st-presidential-debate-transcript-2020'
trump,biden,wallace = url_to_transcript(url)
transcripts = []
transcripts.append(trump)
transcripts.append(biden)
transcripts.append(wallace)

In [41]:
mkdir transcripts

A subdirectory or file transcripts already exists.
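
Re-running `mkdir` on a directory that already exists produces a message like the one above. One possible way to avoid it is to create the directory from Python with `os.makedirs(..., exist_ok=True)`. The sketch below works in a temporary directory so it does not touch the notebook's files; the file name `Example.pkl` and the placeholder string are assumptions made for illustration.

```python
import os
import pickle
import tempfile

# A temporary base directory keeps this sketch away from the notebook's files
base_dir = tempfile.mkdtemp()
out_dir = os.path.join(base_dir, "transcripts")

# exist_ok=True makes the call safe to repeat: no error if the directory exists
os.makedirs(out_dir, exist_ok=True)
os.makedirs(out_dir, exist_ok=True)

# Round-trip a placeholder string through pickle
path = os.path.join(out_dir, "Example.pkl")
with open(path, "wb") as f:
    pickle.dump("placeholder transcript text", f)
with open(path, "rb") as f:
    restored = pickle.load(f)
```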


In [42]:
# Pickle each speaker's transcript for later use
# (the transcripts directory was created in the previous cell)

for i, c in enumerate(speakers):
    with open("transcripts/" + c + ".txt", "wb") as file:
        pickle.dump(transcripts[i], file)

In [43]:
# Load pickled files
data = {}
for i, c in enumerate(speakers):
    with open("transcripts/" + c + ".txt", "rb") as file:
        data[c] = pickle.load(file)

In [44]:
# Double check to make sure data has been loaded properly
data.keys()

dict_keys(['Trump', 'Biden', 'Wallace'])

In [45]:
# More checks
# data['Trump']

In [46]:
# Let's take a look at our data again
next(iter(data.keys()))

'Trump'

In [47]:
# Notice that our dictionary is currently in key: speaker, value: string containing the speech
#next(iter(data.values()))

In [48]:
# Remove each speaker's name label (e.g. 'Chris Wallace') that precedes their turns in the transcript
def combine_text(list_of_text, key):
    '''Takes a list of text, combines it into one large chunk of text, and removes the speaker's name label.'''
    if key == 'Trump':
        word_to_clean = 'President Donald J. Trump'
    elif key == 'Biden':
        word_to_clean = 'Vice President Joe Biden'
    else:
        word_to_clean = 'Chris Wallace'
    combined_text = ''.join(list_of_text)
    # Strip every occurrence of the name label from the combined text
    combined_text = combined_text.replace(word_to_clean, '')
    return combined_text

In [49]:
# Combine it!
data_combined = {key: [combine_text(value,key)] for (key, value) in data.items()}

In [50]:
# We can either keep it in dictionary format or put it into a pandas dataframe
import pandas as pd
pd.set_option('max_colwidth',150)

data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ['transcript']
data_df = data_df.sort_index()
data_df

Unnamed: 0,transcript
Biden,": (02:49)\nHow you doing, man?: (02:51)\nI’m well.: (05:29)\nWell, first of all, thank you for doing this and looking forward to this, Mr. Preside..."
Trump,": (02:51)\nHow are you doing?: (04:01)\nThank you very much, Chris. I will tell you very simply. We won the election. Elections have consequences...."
Wallace,: (01:20)\nGood evening from the Health Education Campus of Case Western Reserve University and the Cleveland Clinic. I’m of Fox News and I welco...


In [51]:
# Let's take a look at the transcript for Trump
#data_df.transcript.loc['Trump']

In [52]:
# Apply a first round of text cleaning techniques
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)  # raw strings avoid invalid-escape warnings
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

round1 = lambda x: clean_text_round1(x)
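
To see what each step of the first cleaning round does, it can help to run the function on a single short string. The sketch below repeats the function so that it is self-contained; the sample line is invented in the style of the transcript and is not taken from the actual data.

```python
import re
import string

def clean_text_round1(text):
    """Lowercase, drop [bracketed] text, punctuation, and tokens containing digits."""
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

# Invented sample in the style of the transcript
sample = ": (02:49)\nHow you doing, man? [crosstalk 00:03:10] It's 2020."
cleaned = clean_text_round1(sample)
print(repr(cleaned))
```

Only ASCII punctuation from `string.punctuation` is removed here. Curly quotes and newlines survive this round, which is why a second cleaning round follows.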


In [53]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_df.transcript.apply(round1))
data_clean

Unnamed: 0,transcript
Biden,\nhow you doing man \ni’m well \nwell first of all thank you for doing this and looking forward to this mr president \nthe american people have a...
Trump,\nhow are you doing \nthank you very much chris i will tell you very simply we won the election elections have consequences we have the senate we...
Wallace,\ngood evening from the health education campus of case western reserve university and the cleveland clinic i’m of fox news and i welcome you to...


In [54]:
# Apply a second round of cleaning
def clean_text_round2(text):
    '''Get rid of curly quotes, ellipses and newlines that were missed the first time around.'''
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text)
    return text

round2 = lambda x: clean_text_round2(x)

In [55]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_clean.transcript.apply(round2))
data_clean

Unnamed: 0,transcript
Biden,how you doing man im well well first of all thank you for doing this and looking forward to this mr president the american people have a right to...
Trump,how are you doing thank you very much chris i will tell you very simply we won the election elections have consequences we have the senate we hav...
Wallace,good evening from the health education campus of case western reserve university and the cleveland clinic im of fox news and i welcome you to th...


## Organizing The Data

### Corpus

We already created a corpus in an earlier step. A corpus is a collection of texts; here, the three speakers' transcripts are stored together in a pandas DataFrame.

In [56]:
# Let's take a look at our dataframe
data_df

Unnamed: 0,transcript
Biden,": (02:49)\nHow you doing, man?: (02:51)\nI’m well.: (05:29)\nWell, first of all, thank you for doing this and looking forward to this, Mr. Preside..."
Trump,": (02:51)\nHow are you doing?: (04:01)\nThank you very much, Chris. I will tell you very simply. We won the election. Elections have consequences...."
Wallace,: (01:20)\nGood evening from the Health Education Campus of Case Western Reserve University and the Cleveland Clinic. I’m of Fox News and I welco...


In [57]:
# Let's add the speakers' full names as well
full_names = ['Vice President Joe Biden','President Donald J. Trump','Chris Wallace']

data_clean['full_name'] = full_names
data_clean

Unnamed: 0,transcript,full_name
Biden,how you doing man im well well first of all thank you for doing this and looking forward to this mr president the american people have a right to...,Vice President Joe Biden
Trump,how are you doing thank you very much chris i will tell you very simply we won the election elections have consequences we have the senate we hav...,President Donald J. Trump
Wallace,good evening from the health education campus of case western reserve university and the cleveland clinic im of fox news and i welcome you to th...,Chris Wallace


In [58]:
# Let's pickle it for later use
data_clean.to_pickle("corpus.pkl")

### Document-Term Matrix

For many of the techniques we'll be using in future notebooks, the text must be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break down text into words. We can do this using scikit-learn's CountVectorizer, where every row will represent a different document and every column will represent a different word.

In addition, with CountVectorizer, we can remove stop words. Stop words are common words that add no additional meaning to text such as 'a', 'the', etc.
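
To make the idea of a document-term matrix concrete, here is a minimal sketch that builds one with only the standard library. It is a simplification and not how `CountVectorizer` is implemented: tokens are split on whitespace only, and `STOP_WORDS`, `build_dtm`, and the two sample documents are made up for illustration.

```python
from collections import Counter

# Tiny hand-picked stop-word list for illustration only;
# scikit-learn's built-in 'english' list is much longer.
STOP_WORDS = {"a", "the", "of", "and", "to"}

def build_dtm(docs):
    """Return (vocabulary, rows): one row of word counts per document."""
    counts = {
        name: Counter(w for w in text.split() if w not in STOP_WORDS)
        for name, text in docs.items()
    }
    # Columns are the sorted union of all remaining words
    vocabulary = sorted(set().union(*counts.values()))
    rows = {name: [c[word] for word in vocabulary] for name, c in counts.items()}
    return vocabulary, rows

docs = {
    "doc_a": "the people of the country and the people",
    "doc_b": "a plan to help the country",
}
vocabulary, rows = build_dtm(docs)
print(vocabulary)
for name, row in rows.items():
    print(name, row)
```

Each row corresponds to a document and each column to a word, which is the same layout the notebook builds with `CountVectorizer`.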

In [59]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data_clean.transcript)
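# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())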
data_dtm.index = data_clean.index
data_dtm

Unnamed: 0,ability,able,abolishing,abraham,absolutely,absorbed,abuse,academic,accept,accepted,...,years,yes,york,youd,youll,young,younger,youre,youve,zero
Biden,2,17,0,0,3,1,0,0,2,3,...,3,5,0,1,0,1,0,10,2,1
Trump,0,1,0,0,3,0,0,1,1,0,...,18,5,4,2,6,2,1,19,13,0
Wallace,0,1,1,1,0,0,1,0,0,0,...,14,3,0,2,2,0,0,10,6,1


In [60]:
# Let's pickle it for later use
data_dtm.to_pickle("dtm.pkl")

In [61]:
# Let's also pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object
data_clean.to_pickle('data_clean.pkl')
# Use a context manager so the file is closed after writing
with open("cv.pkl", "wb") as file:
    pickle.dump(cv, file)