# Day 1 - Exercise 1 - Preprocessing of textual data

## Necessary imports

Before we can work with the data, we have to import the modules we need:

In [1]:
# modules
import pandas as pd
import numpy as np
import re
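import nltk  # word_tokenize and the stopword list need NLTK data; fetch it once with nltk.download() if it is missing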
import urllib.request

### Import the data set
We will work with a data set of speeches given by Donald Trump.  
It is available in this GitHub repository: https://github.com/ryanmcdermott/trump-speeches

In [2]:
data = urllib.request.urlopen("https://raw.githubusercontent.com/ryanmcdermott/trump-speeches/master/speeches.txt")
speeches = [line.decode('utf-8') for line in data]
speeches = " ".join(speeches)

### Inspect the data
What is the format? How do we handle it?

In [3]:
print(speeches[:1000])

﻿SPEECH 1
 
 
 ...Thank you so much.  That's so nice.  Isn't he a great guy.  He doesn't get a fair press; he doesn't get it.  It's just not fair.  And I have to tell you I'm here, and very strongly here, because I have great respect for Steve King and have great respect likewise for Citizens United, David and everybody, and tremendous resect for the Tea Party.  Also, also the people of Iowa.  They have something in common.  Hard-working people.  They want to work, they want to make the country great.  I love the people of Iowa.  So that's the way it is.  Very simple.
 With that said, our country is really headed in the wrong direction with a president who is doing an absolutely terrible job.  The world is collapsing around us, and many of the problems we've caused.  Our president is either grossly incompetent, a word that more and more people are using, and I think I was the first to use it, or he has a completely different agenda than you want to know about, which could be possib
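
The printed text starts with an invisible byte-order mark (`\ufeff`), which is why it reappears as a separate list element after splitting. As a minimal sketch, assuming the file is UTF-8 with a BOM, the `utf-8-sig` codec would strip it during decoding. The byte string below is a made-up stand-in for the first bytes of the download, not the real file.

```python
# Hypothetical sample standing in for the first bytes of the downloaded file.
raw = b"\xef\xbb\xbfSPEECH 1\r\n"

with_bom = raw.decode("utf-8")         # keeps the BOM as '\ufeff'
without_bom = raw.decode("utf-8-sig")  # drops a leading BOM if there is one

print(repr(with_bom[:1]))
print(repr(without_bom[:1]))
```

The notebook keeps plain `utf-8`, so the outputs shown here still contain the BOM.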


### Can we find out how many speeches there are in our data set?  
(Hint: Use regex patterns in a clever way)

In [4]:
re.findall(r"SPEECH [0-9]+", speeches)

['SPEECH 1',
 'SPEECH 2',
 'SPEECH 3',
 'SPEECH 4',
 'SPEECH 5',
 'SPEECH 6',
 'SPEECH 7',
 'SPEECH 8',
 'SPEECH 9',
 'SPEECH 10']
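
To turn the list of markers into an actual count, wrapping the result in `len()` is enough. The sketch below uses a tiny made-up string (`sample`) in place of `speeches` so that it runs without the download.

```python
import re

# Hypothetical miniature stand-in for the `speeches` string.
sample = "SPEECH 1\r\nHello.\r\nSPEECH 2\r\nThank you.\r\nSPEECH 3\r\nWow."

markers = re.findall(r"SPEECH [0-9]+", sample)
n_speeches = len(markers)  # one marker per speech
print(n_speeches)
```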

### Split up the data into the different speeches

In [5]:
list_speeches = re.split(r"SPEECH [0-9]+", speeches)

### Display the beginning of each speech  
(Hint: Use a list comprehension)

In [6]:
[re.sub(r"(\r\n)+", "", s[:100]) for s in list_speeches]

['\ufeff',
 "   ...Thank you so much.  That's so nice.  Isn't he a great guy.  He doesn't get a fair press;",
 '   Thank you for the opportunity to speak to you, and thank you to the Center for National Int',
 '   A hand with little fingers coming out of a stem. Like, little. Look at my hands. They’re fi',
 '     Oh boy. We love Nevada. We love Nevada. Thank you. Thank you. Oh this is a great place. T',
 '   Wow. Whoa. That is some group of people. Thousands.   So nice, thank you very much. T',
 '   Thank you. It’s true, and these are the best and the finest. When Mexico sends its people, ',
 '     Chris, thank you very much. I appreciate it. This has been an amazing evening. Alre',
 '   This is so — so incredible. —    We — we have had, no matter where we go — you know, ',
 '   (AUDIENCE MEMBERS SHOUTING)    — let me just tell you — let me just tell you a little s',
 '   Wow. Thank you. Thank you so much. Thank you. We start by paying our great respects to Pe']
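
The first element of the result is not a speech: `re.split` also returns whatever precedes the first match, which here is only the byte-order mark. One possible way to drop such leftovers is sketched below on a made-up sample. Note that after this kind of filtering the first speech would sit at index 0 rather than index 1, so the notebook's later cells would need adjusting.

```python
import re

# Hypothetical miniature stand-in for the `speeches` string, including the BOM.
sample = "\ufeffSPEECH 1\r\nHello.\r\nSPEECH 2\r\nThank you."

parts = re.split(r"SPEECH [0-9]+", sample)

# Keep only parts that still contain text once BOM, spaces and line breaks are stripped.
cleaned = [p.strip() for p in parts if p.strip("\ufeff \r\n")]
print(cleaned)
```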

### Extract just the first speech from the list of speeches

In [7]:
speech = list_speeches[1]
speech[:300]

"\r\n \r\n \r\n ...Thank you so much.  That's so nice.  Isn't he a great guy.  He doesn't get a fair press; he doesn't get it.  It's just not fair.  And I have to tell you I'm here, and very strongly here, because I have great respect for Steve King and have great respect likewise for Citizens United, Davi"

### Transform the whole speech to lowercase

In [8]:
speech = speech.lower()
print(speech[:300])


 
 
 ...thank you so much.  that's so nice.  isn't he a great guy.  he doesn't get a fair press; he doesn't get it.  it's just not fair.  and i have to tell you i'm here, and very strongly here, because i have great respect for steve king and have great respect likewise for citizens united, davi


### Use a regex pattern to find all contractions in the speech

(Hint: You can use `sorted()` to get the contractions in the same order as below)

In [9]:
contractions = sorted(list(set(re.findall(r"\w+'\w+", speech))))
print(contractions)

["aren't", "can't", "country's", "didn't", "doesn't", "don't", "everyone's", "he's", "i'll", "i'm", "i've", "isn't", "it's", "that's", "they'd", "they're", "they've", "we'll", "we're", "we've", "weren't", "what's", "who's", "won't", "wouldn't", "you'd", "you'll", "you're", "you've"]
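
The pattern above only matches the straight apostrophe (`'`). Some of the other speeches in the data set use the typographic apostrophe (`’`), as in "They’re" in the speech previews earlier. A minimal sketch of a pattern that covers both, shown on a made-up sentence:

```python
import re

# Hypothetical sentence mixing both apostrophe styles.
sample = "they’re fine. it's true, and we don't know."

straight_only = re.findall(r"\w+'\w+", sample)  # misses "they’re"
both = re.findall(r"\w+['’]\w+", sample)        # matches either apostrophe

print(straight_only)
print(both)
```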


### Here is a list of the expanded contractions to use as replacements in the next step

In [10]:
expanded = ['are not', 'can not', 'countrys', 'did not', 'does not', 'do not', 'everyone is', 'he is', 'i will', 'i am', 'i have', 'is not', 'it is', 'that is', 
            'they would', 'they are', 'they have', 'we will', 'we are', 'we have', 'were not', 'what is', 'who is', 'will not', 'would not', 'you would', 
            'you will', 'you are', 'you have']

### Use a for loop to replace each contraction by its expanded version

In [11]:
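# contractions and expanded are aligned by position, so iterate over them in pairs
for contraction, expansion in zip(contractions, expanded):
    speech = re.sub(contraction, expansion, speech)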

In [12]:
speech[:100]

'\r\n \r\n \r\n ...thank you so much.  that is so nice.  is not he a great guy.  he does not get a fair pre'
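
An alternative to looping over two parallel lists is a single dictionary and one compiled pattern, so the text is scanned only once. This is a sketch, not the notebook's method: the `expansions` mapping and the `expand` helper are illustrative names, and only three contractions are included.

```python
import re

# Small illustrative mapping; the two lists above could be combined with dict(zip(...)).
expansions = {"isn't": "is not", "that's": "that is", "doesn't": "does not"}

# One alternation of all contractions, escaped and anchored at word boundaries.
pattern = re.compile(r"\b(" + "|".join(re.escape(c) for c in expansions) + r")\b")

def expand(text):
    """Replace every known contraction in `text` with its expanded form."""
    return pattern.sub(lambda m: expansions[m.group(1)], text)

result = expand("that's so nice.  isn't he a great guy.  he doesn't get it.")
print(result)
```

The word boundaries keep a short contraction from matching inside a longer word.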

### Tokenize the speech (it is already lowercase from the step above)

In [13]:
speech = nltk.word_tokenize(speech)
print(speech[:50])

['...', 'thank', 'you', 'so', 'much', '.', 'that', 'is', 'so', 'nice', '.', 'is', 'not', 'he', 'a', 'great', 'guy', '.', 'he', 'does', 'not', 'get', 'a', 'fair', 'press', ';', 'he', 'does', 'not', 'get', 'it', '.', 'it', 'is', 'just', 'not', 'fair', '.', 'and', 'i', 'have', 'to', 'tell', 'you', 'i', 'am', 'here', ',', 'and', 'very']
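
To get a feel for what a tokenizer does, here is a rough regex-based approximation. It is only a sketch: `simple_tokenize` is a hypothetical helper, and it will not agree with `nltk.word_tokenize` in general (for example on abbreviations or contractions).

```python
import re

def simple_tokenize(text):
    # A token is a run of word characters, a literal ellipsis,
    # or any single character that is neither a word character nor whitespace.
    return re.findall(r"\w+|\.{3}|[^\w\s]", text)

tokens = simple_tokenize("...thank you so much.  that is so nice.")
print(tokens)
```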


### Assume we have the following stopword list

In [14]:
from nltk.corpus import stopwords
# use a separate name so the imported `stopwords` module is not overwritten
stop_words = stopwords.words("english")

### Remove these stopwords from the speech

In [15]:
speech = [word for word in speech if word not in stop_words]
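
Membership tests against a list scan it from the start every time, so for long texts it can help to hold the stopwords in a `set`. The sketch below uses a tiny made-up stopword set and token list, not NLTK's English list.

```python
# Hypothetical, tiny stopword set for illustration; NLTK's English list is much longer.
stop_words = {"you", "so", "that", "is", "he", "a", "not"}

tokens = ["thank", "you", "so", "much", ".", "that", "is", "so", "nice", "."]

# Set lookups take roughly constant time, regardless of how many stopwords there are.
filtered = [t for t in tokens if t not in stop_words]
print(filtered)
```

Punctuation tokens survive this step, which is why the notebook removes them separately afterwards.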

### Remove the unwanted punctuation characters from the speech

In [16]:
speech = re.sub(r"[.,;:!?\$\-'`´]", "", " ".join(speech))

### Remove numbers from the speech

In [17]:
speech = re.sub(r"[0-9]+", "", speech)
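
The two substitutions above list the characters to remove explicitly, so anything not listed (for instance the typographic apostrophes and long dashes visible in the other speeches) would stay in the text. A stricter variant, sketched here on a made-up lowercase sample, keeps only the letters a-z and whitespace and therefore handles punctuation and digits in one pass. It assumes plain English text; accented letters would be removed as well.

```python
import re

# Hypothetical lowercase sample with a curly apostrophe, a long dash and digits.
sample = "they’re fine \u2014 100 % sure , hard-working people ."

# Remove every character that is not a lowercase ASCII letter or whitespace.
letters_only = re.sub(r"[^a-z\s]", "", sample)
print(letters_only.split())
```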

### Print the speech one more time

In [18]:
print(speech[:500])

 thank much  nice  great guy  get fair press  get  fair  tell  strongly  great respect steve king great respect likewise citizens united  david everybody  tremendous resect tea party  also  also people iowa  something common  hardworking people  want work  want make country great  love people iowa  way  simple  said  country really headed wrong direction president absolutely terrible job  world collapsing around us  many problems caused  president either grossly incompetent  word people using  t
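
Removing characters leaves runs of spaces behind, as the double spaces in the output show. If a clean string or token list is needed for later steps, the whitespace can be collapsed. A minimal sketch on a short excerpt of the output:

```python
import re

# Short excerpt in the style of the output above.
sample = " thank much  nice  great guy  get fair press "

normalized = re.sub(r"\s+", " ", sample).strip()  # collapse whitespace runs, trim the ends
tokens = normalized.split()                       # back to a list of words

print(normalized)
```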


## Congratulations, you have successfully pre-processed one of Trump's speeches

![](https://media.giphy.com/media/l2JhIUyUs8KDCCf3W/giphy.gif)