## Getting and Cleaning the Data

### Introduction
This notebook walks through a necessary step of any data project: getting the data and then cleaning it.

We will be walking through:
1. Getting the data - we will be scraping data from a website
2. Cleaning the data - we will walk through text-cleaning techniques
3. Organizing the data - we will organize the cleaned data into a format that is easy to feed into other algorithms

The output of this notebook will be clean, organized data in two standard text formats:
1. Corpus - a collection of text
2. Document-Term Matrix - word counts in matrix format

### Problem Statement
My goal is to look at transcripts of various speakers and note their similarities and differences. Specifically, I'd like to know whether Donald Trump's speech style differs from that of other speakers, since he is a controversial figure.

### Getting the Data
Many websites keep track of speech transcripts of public figures. I picked the speakers to analyze at random, and I can add more later if I decide to.

In [1]:
# Simple web scraping
import requests
from bs4 import BeautifulSoup

# Scrapes transcript
def url_to_transcript(url):
    '''Returns the transcript paragraphs scraped from a speech page.'''
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    wrappers = soup.find_all(class_='wpb_wrapper')
    # First try the second 'wpb_wrapper' block, skipping its first six paragraphs
    text = [p.text for p in wrappers[1].find_all('p')[6:]]
    if len(text) == 0:
        # Fallback: take every paragraph of the 25th 'wpb_wrapper' block
        text = [p.text for p in wrappers[24].find_all('p')]
    print(url, "done")
    return text

# Transcript URLs
urls = ['https://www.englishspeecheschannel.com/english-speeches/nelson-mandela-speech/',
        'https://www.englishspeecheschannel.com/english-speeches/president-kennedy-speech/',
        'https://www.englishspeecheschannel.com/english-speeches/simon-sinek-speech/',
        'https://www.englishspeecheschannel.com/english-speeches/greta-thunberg-speech/',
        'https://www.englishspeecheschannel.com/english-speeches/tim-cook-speech/',
        'https://www.englishspeecheschannel.com/english-speeches/donald-trump-speech/',
        'https://www.englishspeecheschannel.com/english-speeches/hillary-clinton-speech/',
        'https://www.englishspeecheschannel.com/english-speeches/steve-jobs-speech/',
        'https://www.englishspeecheschannel.com/english-speeches/sundar-pichai-speech/',
        'https://www.englishspeecheschannel.com/english-speeches/justin-trudeau-nyu-speech/',
        'https://www.englishspeecheschannel.com/english-speeches/jacinda-ardern-speech/'
        ]

# Speaker names
speakers = ['Nelson Mandela',
            'John F Kennedy',
            'Simon Sinek',
            'Greta Thunberg',
            'Tim Cook',
            'Donald Trump',
            'Hillary Clinton',
            'Steve Jobs',
            'Sundar Pichai',
            'Justin Trudeau',
            'Jacinda Ardern']

In [2]:
transcript = [url_to_transcript(u) for u in urls]

# Join each speaker's paragraphs into one string, kept in a one-item list
transcript = [[' '.join(paragraphs)] for paragraphs in transcript]

https://www.englishspeecheschannel.com/english-speeches/nelson-mandela-speech/ done
https://www.englishspeecheschannel.com/english-speeches/president-kennedy-speech/ done
https://www.englishspeecheschannel.com/english-speeches/simon-sinek-speech/ done
https://www.englishspeecheschannel.com/english-speeches/greta-thunberg-speech/ done
https://www.englishspeecheschannel.com/english-speeches/tim-cook-speech/ done
https://www.englishspeecheschannel.com/english-speeches/donald-trump-speech/ done
https://www.englishspeecheschannel.com/english-speeches/hillary-clinton-speech/ done
https://www.englishspeecheschannel.com/english-speeches/steve-jobs-speech/ done
https://www.englishspeecheschannel.com/english-speeches/sundar-pichai-speech/ done
https://www.englishspeecheschannel.com/english-speeches/justin-trudeau-nyu-speech/ done
https://www.englishspeecheschannel.com/english-speeches/jacinda-ardern-speech/ done
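
The list comprehension above stops at the first URL that raises an error, for example when a page's layout differs and one of the hard-coded indices is out of range. A minimal sketch of a more forgiving loop follows. The names `fetch_all`, `delay_seconds` and `fake_fetch` are hypothetical and not part of the original notebook, and `fake_fetch` only stands in for `url_to_transcript` so that the sketch runs without network access.

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between calls; failed URLs map to None."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # pause between requests to go easy on the server
        try:
            results[url] = fetch(url)
        except Exception as exc:  # e.g. a network error or an IndexError from a changed layout
            print(url, "failed:", exc)
            results[url] = None
    return results


# Hypothetical stand-in for url_to_transcript (assumption: no real scraping here).
def fake_fetch(url):
    if "broken" in url:
        raise IndexError("list index out of range")
    return ["paragraph one", "paragraph two"]


demo = fetch_all(
    ["https://example.com/ok", "https://example.com/broken"],
    fake_fetch,
    delay_seconds=0,
)
print(demo)
```

In the notebook itself, `fetch_all(urls, url_to_transcript)` could take the place of the list comprehension, with any `None` entries dropped before the paragraphs are joined.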


In [3]:
# Pack it into a dictionary: speaker name -> transcript
data = dict(zip(speakers, transcript))

In [4]:
import pandas as pd

pd.set_option('display.max_colwidth', 150)

df = pd.DataFrame.from_dict(data).transpose()
df.columns = ['Transcript']
df = df.sort_index()
df

Unnamed: 0,Transcript
Donald Trump,"“Thank you very much, everybody. And congratulations to the class of 2017. That’s some achievement. This is your day and you’ve earned every minut..."
Greta Thunberg,"“Greta your first climate strike was a lonely event a little over a year ago, and in the intervening time, you have sparked the interest of millio..."
Hillary Clinton,"“Do all the good you can, for all the people you can, in all the ways you can, as long as you can.” Hillary Clinton “Being here with you brings ba..."
Jacinda Ardern,"“Mr. President, Mr. Secretary-General, Friends. I greet you in te reo Māori, language of the tangata whenua, or first people, of Aotearoa New Zeal..."
John F Kennedy,"“Vice President Johnson, Mr. Speaker, Mr. Chief Justice, President Eisenhower, Vice President Nixon, President Truman, Reverend Clergy, fellow cit..."
Justin Trudeau,"“I have to say, to be here now, speaking with all of you — in Yankee Stadium, one of the greatest places in one of the greatest cities on Earth — ..."
Nelson Mandela,“Ladies and Gentlemen. This may very well be our last official visit to the United States before retiring from office next year. There could not b...
Simon Sinek,“So I had the chance to meet with some of the kids in the program today. Where are you? Scream out. There you go. I love those kids. What I though...
Steve Jobs,\n“Stay hungry. Stay Foolish.” Steve Jobs\n “I am honored to be with you today at your commencement from one of the finest universities in the wor...
Sundar Pichai,"“Hello, everyone. And congratulations to the Class of 2020, as well as your parents, your teachers, and everyone who helped you get to this day. I..."


In [5]:
# Apply a first round of text-cleaning
import re
import string

def clean_text_round1(text):
    '''Make text lowercase and remove text in square brackets, punctuation, words containing numbers, curly quotes, ellipses, newlines and em-dashes.'''
    text = text.lower()
    # Raw strings keep the regex backslashes from being read as string escapes
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub('—', '', text)
    return text
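
To see what each substitution removes, the sketch below applies the same patterns one at a time to a made-up sentence. The sample string and the `steps` list are assumptions for illustration and are not taken from any transcript. Note that the order matters: the square-bracket pattern has to run before punctuation is stripped, otherwise the brackets would already be gone and the bracketed text would stay.

```python
import re
import string

# Hypothetical sample line, not taken from any transcript.
sample = "[Applause] Thank you — it’s 2017, and you’ve earned it!\n"

# The same substitutions as clean_text_round1, applied one at a time.
steps = [
    ("text in square brackets", r"\[.*?\]"),
    ("ASCII punctuation", "[%s]" % re.escape(string.punctuation)),
    ("words containing digits", r"\w*\d\w*"),
    ("curly quotes and ellipses", "[‘’“”…]"),
    ("newlines", r"\n"),
    ("em-dashes", "—"),
]

text = sample.lower()
for label, pattern in steps:
    text = re.sub(pattern, "", text)
    print(f"after removing {label}: {text!r}")

# Removed pieces leave extra spaces behind; splitting on whitespace ignores them.
tokens = text.split()
print(tokens)
```

The leftover double spaces are harmless for the later steps, because the text is split into words again before counting.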

In [6]:
data_clean = pd.DataFrame(df['Transcript'].apply(clean_text_round1))
data_clean

Unnamed: 0,Transcript
Donald Trump,thank you very much everybody and congratulations to the class of thats some achievement this is your day and youve earned every minute of it and...
Greta Thunberg,greta your first climate strike was a lonely event a little over a year ago and in the intervening time you have sparked the interest of millions ...
Hillary Clinton,do all the good you can for all the people you can in all the ways you can as long as you can hillary clinton being here with you brings back a fl...
Jacinda Ardern,mr president mr secretarygeneral friends i greet you in te reo māori language of the tangata whenua or first people of aotearoa new zealand i do s...
John F Kennedy,vice president johnson mr speaker mr chief justice president eisenhower vice president nixon president truman reverend clergy fellow citizens we o...
Justin Trudeau,i have to say to be here now speaking with all of you in yankee stadium one of the greatest places in one of the greatest cities on earth is mor...
Nelson Mandela,ladies and gentlemen this may very well be our last official visit to the united states before retiring from office next year there could not been...
Simon Sinek,so i had the chance to meet with some of the kids in the program today where are you scream out there you go i love those kids what i thought i wo...
Steve Jobs,stay hungry stay foolish steve jobs i am honored to be with you today at your commencement from one of the finest universities in the world i neve...
Sundar Pichai,hello everyone and congratulations to the class of as well as your parents your teachers and everyone who helped you get to this day i never imag...


### Organizing the Data
As noted in the introduction, the cleaned data is organized into two standard text formats:
1. Corpus
2. Document-Term Matrix

#### Corpus
A corpus is a collection of texts; here they are all held together neatly in a pandas DataFrame.

In [7]:
df = data_clean
df.head(3)

Unnamed: 0,Transcript
Donald Trump,thank you very much everybody and congratulations to the class of thats some achievement this is your day and youve earned every minute of it and...
Greta Thunberg,greta your first climate strike was a lonely event a little over a year ago and in the intervening time you have sparked the interest of millions ...
Hillary Clinton,do all the good you can for all the people you can in all the ways you can as long as you can hillary clinton being here with you brings back a fl...


#### Document-Term Matrix
For many of the techniques that we will be using, the text must be tokenized, meaning it is broken down into smaller pieces. We can do this using scikit-learn's CountVectorizer, where every row represents a different document and every column represents a different word.

In addition, CountVectorizer can remove stop words, i.e. very common words such as "the" and "and" that carry little meaning on their own.
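
As a rough illustration of what a document-term matrix holds, here is a minimal sketch that builds one by hand with the standard library only. It is not how CountVectorizer is implemented: the two documents and the tiny `stop_words` set are made up, and splitting on whitespace is a simplification (CountVectorizer's default token pattern, for instance, only keeps tokens of two or more alphanumeric characters).

```python
from collections import Counter

# Two made-up documents standing in for cleaned transcripts.
docs = {
    "speaker_a": "thank you thank you very much",
    "speaker_b": "you have sparked the interest of millions",
}

# Tiny hypothetical stop-word list; scikit-learn's built-in English list is much longer.
stop_words = {"you", "the", "of", "very", "have"}

# Tokenize on whitespace and count the remaining words per document.
counts = {
    name: Counter(word for word in text.split() if word not in stop_words)
    for name, text in docs.items()
}

# The sorted vocabulary becomes the columns; each document becomes one row.
vocabulary = sorted(set().union(*counts.values()))
dtm = {name: [c[word] for word in vocabulary] for name, c in counts.items()}

print(vocabulary)
for name, row in dtm.items():
    print(name, row)
```

Each row has one count per vocabulary word, so most entries are zero once the vocabulary grows.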

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(df['Transcript'])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_dtm.index = df.index
data_dtm.head(3)

Unnamed: 0,abandon,ability,able,abolish,abraham,absence,absolute,absolutely,abundantly,accent,...,youd,youll,young,youre,youtube,youve,youwe,zealand,zero,zones
Donald Trump,0,1,1,0,0,0,0,1,0,0,...,1,8,4,11,0,5,0,0,0,0
Greta Thunberg,0,0,0,0,0,0,0,0,0,0,...,0,0,2,1,0,0,1,0,0,0
Hillary Clinton,2,1,2,0,0,0,0,1,0,0,...,0,0,1,4,0,7,0,0,1,0


In [9]:
# Save as pickle
data_dtm.to_pickle('../1 - Data/dtm.pkl')

### Additional steps
Some improvements that may still be relevant:
1. Stemming/lemmatization - e.g. mark "cheering" and "cheer" as the same word
2. Bi-grams - combine "thank" and "you" into one term
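
The bi-gram idea can be sketched with the standard library alone. The sample text below is made up, and this is only an illustration of the concept. In scikit-learn, CountVectorizer's `ngram_range` parameter (for example `ngram_range=(1, 2)`) covers the same ground, though with `stop_words='english'` the stop words are dropped before pairs are formed, so a pair like "thank you" would likely not survive.

```python
from collections import Counter

# Made-up cleaned text, not taken from any transcript.
text = "thank you very much thank you all"
tokens = text.split()

# Pair each token with the one that follows it to form bi-grams.
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
bigram_counts = Counter(bigrams)

print(bigrams)
print(bigram_counts.most_common(1))
```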