## Content

1. Import libraries
2. Data Cleaning
    - Removing Twitter Handles (@user)
    - Removing Punctuations, Numbers, and Special Characters
    - Removing Short Words
    - Lower Casing
    - Remove html tags
    - Decontract text
    - Remove stopwords
    - Remove frequent words
    - Remove rare words
    - Spelling Correction
3. Tweets before and after cleaning

## 1. Import libraries

In [1]:
import re  # for regular expressions 
import nltk # for text manipulation 
import string
import warnings
import numpy as np
import pandas as pd
import seaborn as sns # for visualization
import matplotlib.pyplot as plt # for visualization
from IPython.display import display

# setting up the background style for the plots
plt.style.use('fivethirtyeight')

from nltk.stem import WordNetLemmatizer # lemmatizer class (reduces words to their dictionary form)
lemma = WordNetLemmatizer()
from nltk import FreqDist
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from textblob import TextBlob
%matplotlib inline

pd.set_option("display.max_colwidth", 200)
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [2]:
# Read data
path = ''
df = pd.read_csv(path + 'trump.csv')
# test_df = pd.read_csv(path + 'test.csv')

In [6]:
df = df.drop('Unnamed: 0', axis=1)

In [7]:
df.head()

Unnamed: 0,id,script
0,0,"\n\n\n\n \nDonald Trump: (00:13)\nHello, Iowa. Congratulations to the Iowa hawkers. That was a big win today. I’m thrilled to be back. That was a big win. But I am thrilled to be back especially o..."
1,1,"\n\n\n\n \nDonald Trump: (03:37)\nWe have great, great people running. Many of them are right here. I love Marjorie. With your help, we’re going to take back the House and send Nancy Pelosi back t..."
2,2,"\n\n\n\n \nGreg Gutfeld: (00:05)\nAll right. Why should I say a thing? Let’s just get to round two. I want to ask you a question about COVID because you had COVID and my wife, you met my wife, I d..."
3,3,"\n\n\n\n \nDonald Trump: (00:00)\nAs one nation, America mourns the loss of our brave and brilliant American service members in a savage and barbaric terrorist attack in Afghanistan. These noble A..."
4,4,"\n\n\n\n \nDonald Trump: (08:53)\nThank you. Thank you. Wow, this is a big crowd. I’ll tell you. This goes all the way back. I wish they’d show it because they just don’t do that. They don’t like…..."


In [10]:
df.columns

Index(['id', 'script'], dtype='object')

## 2. Data Cleaning

All the following pre-processing steps are essential and help us in reducing our vocabulary clutter so that the features produced in the end are more effective.

**A. Combine training and testing data**

When a project has separate train and test datasets, it is convenient to combine them before cleaning so that both are preprocessed in the same way, and to split them back afterwards. Here we work with a single file, so the combining step is left commented out and we only check the shape of the data.

In [8]:
# combi = train_df.append(test_df, ignore_index=True)
df.shape

(12, 2)

Given below is a user-defined function to remove unwanted text patterns from the text.

In [9]:
# This function looks inside the input text and removes every match of the pattern.
def remove_pattern(input_txt, pattern):
    r = re.findall(pattern, input_txt)
    for i in r:
        # re.escape makes sure the matched text is treated literally, not as a new regex
        input_txt = re.sub(re.escape(i), '', input_txt)
    return input_txt

We will do the following pre-processing steps:

1. **Twitter handles**: We will remove the Twitter handles, as they are already masked as @user due to privacy concerns. These handles give hardly any information about the nature of the tweet.

2. **Punctuation, numbers, and special characters**: These will also be removed, since they do not help in differentiating between types of tweets.

3. **Short words**: Words like ‘pdx’, ‘his’, ‘all’ do not add much value. So, we will try to remove them as well from our data.

4. **Text data normalization**: Terms like loves, loving, and lovable can be normalized to their base word, i.e., ‘love’. This will help reduce the total number of unique words in our data without losing a significant amount of information.

### 2.1 Removing Twitter Handles (@user)

Let’s create a new column (tidy_script here) that will contain the cleaned and processed text. Note that we pass “@[\w]*” as the pattern to the remove_pattern function. It is a regular expression that matches any word starting with ‘@’.
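As a minimal sketch of what this pattern does, the snippet below applies a variant of remove_pattern to one sample tweet. The sample string is made up for illustration, and the `re.escape` call is an added safeguard (an assumption, not necessarily part of the notebook's own function) so that each matched handle is treated as literal text.

```python
import re

def remove_pattern(input_txt, pattern):
    # Find every match of the pattern, then delete each match as literal text.
    for match in re.findall(pattern, input_txt):
        input_txt = re.sub(re.escape(match), '', input_txt)
    return input_txt

# Hypothetical sample tweet with two masked handles.
sample = "@user @user thanks for #lyft credit"
cleaned = remove_pattern(sample, r"@[\w]*")

# The handles are gone; the spaces that followed them remain, so strip() tidies the result.
print(repr(cleaned))
print(cleaned.strip())
```

For a simple pattern like this one, a single `re.sub(r"@[\w]*", "", sample)` call gives the same result; the helper is mainly useful when the same removal logic is reused with different patterns.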

#### Check if @ exists at all in any of the scripts

In [12]:
df['script'].str.contains('@').sum()

0

In [6]:
# df['tidy_script'] = np.vectorize(remove_pattern)(df['script'], "@[\w]*")
# df.head()

Unnamed: 0,id,label,tweet,tidy_tweet
0,1,0.0,@user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run,when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run
1,2,0.0,@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked,thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked
2,3,0.0,bihday your majesty,bihday your majesty
3,4,0.0,#model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦,#model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦
4,5,0.0,factsguide: society now #motivation,factsguide: society now #motivation


**YOUR TURN**

See what happens if you use capital letter W instead of w!

### 2.2 Removing Punctuations, Numbers, and Special Characters

Here we will replace everything except letters and hashtags with spaces. The regular expression “[^a-zA-Z#]” matches anything except letters and ‘#’. Since the scripts contain no ‘#’ (checked below), the code uses “[^a-zA-Z]” instead.
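To make the difference between the two character classes concrete, here is a small sketch using the standard `re` module on a single made-up string (the sample text is an assumption for illustration, not a row from the dataset).

```python
import re

# Hypothetical sample containing a timestamp, punctuation, a hashtag and a curly apostrophe.
sample = "Donald Trump: (00:13)\nHello, Iowa. #win I’m thrilled!"

# Keep letters and '#'; everything else becomes a space.
keep_hash = re.sub(r"[^a-zA-Z#]", " ", sample)

# Keep letters only; '#' is replaced as well.
letters_only = re.sub(r"[^a-zA-Z]", " ", sample)

# Collapse the runs of spaces to make the result easier to read.
print(" ".join(keep_hash.split()))
print(" ".join(letters_only.split()))
```

Note that the apostrophe in “I’m” is also replaced, which leaves a stray “m” behind; the short-word removal step later takes care of such fragments.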

#### Check if # exists at all in any of the scripts

In [13]:
df['script'].str.contains('#').sum()

0

In [14]:
# regex=True is passed explicitly so the pattern is treated as a regular expression
df['tidy_script'] = df['script'].str.replace("[^a-zA-Z]", " ", regex=True)
df.head(10)



Unnamed: 0,id,script,tidy_script
0,0,"\n\n\n\n \nDonald Trump: (00:13)\nHello, Iowa. Congratulations to the Iowa hawkers. That was a big win today. I’m thrilled to be back. That was a big win. But I am thrilled to be back especially o...",Donald Trump Hello Iowa Congratulations to the Iowa hawkers That was a big win today I m thrilled to be back That was a big win But I am thrilled to be back especially on such...
1,1,"\n\n\n\n \nDonald Trump: (03:37)\nWe have great, great people running. Many of them are right here. I love Marjorie. With your help, we’re going to take back the House and send Nancy Pelosi back t...",Donald Trump We have great great people running Many of them are right here I love Marjorie With your help we re going to take back the House and send Nancy Pelosi back to San ...
2,2,"\n\n\n\n \nGreg Gutfeld: (00:05)\nAll right. Why should I say a thing? Let’s just get to round two. I want to ask you a question about COVID because you had COVID and my wife, you met my wife, I d...",Greg Gutfeld All right Why should I say a thing Let s just get to round two I want to ask you a question about COVID because you had COVID and my wife you met my wife I don t k...
3,3,"\n\n\n\n \nDonald Trump: (00:00)\nAs one nation, America mourns the loss of our brave and brilliant American service members in a savage and barbaric terrorist attack in Afghanistan. These noble A...",Donald Trump As one nation America mourns the loss of our brave and brilliant American service members in a savage and barbaric terrorist attack in Afghanistan These noble America...
4,4,"\n\n\n\n \nDonald Trump: (08:53)\nThank you. Thank you. Wow, this is a big crowd. I’ll tell you. This goes all the way back. I wish they’d show it because they just don’t do that. They don’t like…...",Donald Trump Thank you Thank you Wow this is a big crowd I ll tell you This goes all the way back I wish they d show it because they just don t do that They don t like This ...
5,5,"\n\n\n\n \nSean Hannity: (00:00)\nMr. President, thank you for being with us.\nSean Hannity: (00:02)\nLet me go back. I have had a number of people tell me that there were very specific conditions...",Sean Hannity Mr President thank you for being with us Sean Hannity Let me go back I have had a number of people tell me that there were very specific conditions and ver...
6,6,"\n\n\n\n \nDonald Trump: (03:17)\nThank you very much, thank you. And thank you to Charlie for that introduction, which was so beautiful and for your fearless leadership of Turning Point Action an...",Donald Trump Thank you very much thank you And thank you to Charlie for that introduction which was so beautiful and for your fearless leadership of Turning Point Action and Turn...
7,7,"\n\n\n\n \nDonald Trump: (00:07)\nThank you very much. Thank you.\nAudience: (00:18)\nUSA, USA, USA, USA, USA, USA, USA.\nDonald Trump: (00:18)\nThank you very much. Thank you to Matt. What a job....",Donald Trump Thank you very much Thank you Audience USA USA USA USA USA USA USA Donald Trump Thank you very much Thank you to Matt What a job He and Me...
8,8,"\n\n\n\n \nBrooke Rollins: (00:00)\n… In the way. There’s no topic on which they, the elites, the big firms, the progressives, the office holders and the bureaucrats, there is no other topic that ...",Brooke Rollins In the way There s no topic on which they the elites the big firms the progressives the office holders and the bureaucrats there is no other topic that they a...
9,9,"\n\n\n\n \nAudience: (00:00)\nUSA, USA, USA, USA, USA, USA-\nDonald Trump: (00:00)\nWow. Thank you.\nAudience: (02:10)\n… USA, USA, USA, USA, USA, USA, USA, [crosstalk 00:02:10]-\nDonald Trump: (0...",Audience USA USA USA USA USA USA Donald Trump Wow Thank you Audience USA USA USA USA USA USA USA crosstalk Donald Trump Well ...


### 2.3 Removing Short Words

We have to be a little careful in choosing the length of the words we want to remove. Here, all words of length 3 or less are removed. For example, terms like “hmm” and “oh” are of very little use, so it is better to get rid of them.

In [15]:
# keep a word only if its length is greater than 3, then join the remaining words with spaces
df['tidy_script'] = df['tidy_script'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>3]))
df.head()

Unnamed: 0,id,script,tidy_script
0,0,"\n\n\n\n \nDonald Trump: (00:13)\nHello, Iowa. Congratulations to the Iowa hawkers. That was a big win today. I’m thrilled to be back. That was a big win. But I am thrilled to be back especially o...",Donald Trump Hello Iowa Congratulations Iowa hawkers That today thrilled back That thrilled back especially such great news that that been great school great team great tradition really amazing st...
1,1,"\n\n\n\n \nDonald Trump: (03:37)\nWe have great, great people running. Many of them are right here. I love Marjorie. With your help, we’re going to take back the House and send Nancy Pelosi back t...",Donald Trump have great great people running Many them right here love Marjorie With your help going take back House send Nancy Pelosi back Francisco where work very hard bring back city which hel...
2,2,"\n\n\n\n \nGreg Gutfeld: (00:05)\nAll right. Why should I say a thing? Let’s just get to round two. I want to ask you a question about COVID because you had COVID and my wife, you met my wife, I d...",Greg Gutfeld right should thing just round want question about COVID because COVID wife wife know like years Lago still gotten vaccinated keep talking vaccinated hasn what would Donald Trump Well ...
3,3,"\n\n\n\n \nDonald Trump: (00:00)\nAs one nation, America mourns the loss of our brave and brilliant American service members in a savage and barbaric terrorist attack in Afghanistan. These noble A...",Donald Trump nation America mourns loss brave brilliant American service members savage barbaric terrorist attack Afghanistan These noble American warriors laid down their lives line duty They sac...
4,4,"\n\n\n\n \nDonald Trump: (08:53)\nThank you. Thank you. Wow, this is a big crowd. I’ll tell you. This goes all the way back. I wish they’d show it because they just don’t do that. They don’t like…...",Donald Trump Thank Thank this crowd tell This goes back wish they show because they just that They like This goes back just looked television television show they show know Because they fake news ...


You can see the difference between the raw scripts and the cleaned scripts (tidy_script) quite clearly. Only words longer than three letters have been retained, and the noise (numbers, punctuation, and special characters) has been removed.
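The length filter can also be tried on a single sentence, which makes the effect of the threshold easier to see. The helper name `drop_short_words` and its `min_len` parameter are hypothetical, introduced only for this sketch.

```python
def drop_short_words(text, min_len=4):
    # Keep only the words with at least min_len characters.
    return " ".join(w for w in text.split() if len(w) >= min_len)

sample = "That was a big win today"

# With the default threshold, words of length 3 or less are dropped.
print(drop_short_words(sample))

# A lower threshold keeps three-letter words such as "big" and "win".
print(drop_short_words(sample, min_len=3))
```

A threshold this aggressive also removes meaningful short words ("big", "win"), which is why the choice of length deserves some care.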

**YOUR TURN**

See what happens if you use an empty string '' with the join function, like this: ''.join([...]).
You can also try numbers other than 3.

### 2.4 Lower Casing

- Another pre-processing step is to transform the text into lower case.
- This avoids having multiple copies of the same words. 
- For example, while calculating the word count, ‘Analytics’ and ‘analytics’ will be taken as different words.
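The word-count effect described above can be sketched with `collections.Counter` from the standard library. The sample sentence is made up for illustration.

```python
from collections import Counter

text = "Analytics is fun and analytics is useful"

# Without lower casing, the two spellings are counted separately.
raw_counts = Counter(text.split())

# After lower casing, they collapse into a single entry.
lower_counts = Counter(text.lower().split())

print(raw_counts["Analytics"], raw_counts["analytics"])
print(lower_counts["analytics"])
```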

In [16]:
# Convert each word to lower case, then join the words back together with spaces.
df['tidy_script'] = df['tidy_script'].apply(lambda x: " ".join(w.lower() for w in x.split()))
df.head()

Unnamed: 0,id,script,tidy_script
0,0,"\n\n\n\n \nDonald Trump: (00:13)\nHello, Iowa. Congratulations to the Iowa hawkers. That was a big win today. I’m thrilled to be back. That was a big win. But I am thrilled to be back especially o...",donald trump hello iowa congratulations iowa hawkers that today thrilled back that thrilled back especially such great news that that been great school great team great tradition really amazing st...
1,1,"\n\n\n\n \nDonald Trump: (03:37)\nWe have great, great people running. Many of them are right here. I love Marjorie. With your help, we’re going to take back the House and send Nancy Pelosi back t...",donald trump have great great people running many them right here love marjorie with your help going take back house send nancy pelosi back francisco where work very hard bring back city which hel...
2,2,"\n\n\n\n \nGreg Gutfeld: (00:05)\nAll right. Why should I say a thing? Let’s just get to round two. I want to ask you a question about COVID because you had COVID and my wife, you met my wife, I d...",greg gutfeld right should thing just round want question about covid because covid wife wife know like years lago still gotten vaccinated keep talking vaccinated hasn what would donald trump well ...
3,3,"\n\n\n\n \nDonald Trump: (00:00)\nAs one nation, America mourns the loss of our brave and brilliant American service members in a savage and barbaric terrorist attack in Afghanistan. These noble A...",donald trump nation america mourns loss brave brilliant american service members savage barbaric terrorist attack afghanistan these noble american warriors laid down their lives line duty they sac...
4,4,"\n\n\n\n \nDonald Trump: (08:53)\nThank you. Thank you. Wow, this is a big crowd. I’ll tell you. This goes all the way back. I wish they’d show it because they just don’t do that. They don’t like…...",donald trump thank thank this crowd tell this goes back wish they show because they just that they like this goes back just looked television television show they show know because they fake news ...


**YOUR TURN**

Can you achieve the same results if using the upper() function instead of lower()?

### 2.5 Remove html tags

Using a regex, you can remove everything inside <>. However, **some HTML text can also contain entities that are not enclosed in brackets,** such as '&nbsp;'. If that is the case, you might want to write the regex as follows:

In [17]:
# Python’s re.compile() method is used to compile a regular expression pattern provided as a 
# string into a regex pattern object (re.Pattern). Later we can use this pattern object to 
# search for a match inside different target strings using regex methods such as a re.match() 
# or re.search().
# Syntax: re.compile(pattern)

CLEANR = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

# the function will replace the above pattern with ''.
def cleanhtml(raw_html):
    cleantext = re.sub(CLEANR, '', raw_html)
    return cleantext
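A quick way to check the pattern is to run it on a small HTML fragment. The sketch below repeats the pattern and function so that it is self-contained; the sample strings are assumptions made up for illustration.

```python
import re

CLEANR = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

def cleanhtml(raw_html):
    # Remove tags such as <p> and entities such as &nbsp; or &#160;
    return CLEANR.sub('', raw_html)

raw = "<p>Hello&nbsp;Iowa &amp; thank you</p>"
print(cleanhtml(raw))

# The same text after punctuation has already been replaced with spaces:
# no '<', '>', '&' or ';' is left, so the pattern finds nothing to remove.
already_stripped = "p Hello nbsp Iowa amp thank you p"
print(cleanhtml(already_stripped))
```

Two things are worth noticing in this sketch. Deleting `&nbsp;` outright glues "Hello" and "Iowa" together, so replacing matches with a space may be preferable. Also, the pattern only has an effect if it runs before punctuation is stripped, which suggests applying this step earlier in a pipeline where HTML is actually present.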

In [18]:
df['tidy_script'] = df['tidy_script'].apply(cleanhtml)
df.head(10)

Unnamed: 0,id,script,tidy_script
0,0,"\n\n\n\n \nDonald Trump: (00:13)\nHello, Iowa. Congratulations to the Iowa hawkers. That was a big win today. I’m thrilled to be back. That was a big win. But I am thrilled to be back especially o...",donald trump hello iowa congratulations iowa hawkers that today thrilled back that thrilled back especially such great news that that been great school great team great tradition really amazing st...
1,1,"\n\n\n\n \nDonald Trump: (03:37)\nWe have great, great people running. Many of them are right here. I love Marjorie. With your help, we’re going to take back the House and send Nancy Pelosi back t...",donald trump have great great people running many them right here love marjorie with your help going take back house send nancy pelosi back francisco where work very hard bring back city which hel...
2,2,"\n\n\n\n \nGreg Gutfeld: (00:05)\nAll right. Why should I say a thing? Let’s just get to round two. I want to ask you a question about COVID because you had COVID and my wife, you met my wife, I d...",greg gutfeld right should thing just round want question about covid because covid wife wife know like years lago still gotten vaccinated keep talking vaccinated hasn what would donald trump well ...
3,3,"\n\n\n\n \nDonald Trump: (00:00)\nAs one nation, America mourns the loss of our brave and brilliant American service members in a savage and barbaric terrorist attack in Afghanistan. These noble A...",donald trump nation america mourns loss brave brilliant american service members savage barbaric terrorist attack afghanistan these noble american warriors laid down their lives line duty they sac...
4,4,"\n\n\n\n \nDonald Trump: (08:53)\nThank you. Thank you. Wow, this is a big crowd. I’ll tell you. This goes all the way back. I wish they’d show it because they just don’t do that. They don’t like…...",donald trump thank thank this crowd tell this goes back wish they show because they just that they like this goes back just looked television television show they show know because they fake news ...
5,5,"\n\n\n\n \nSean Hannity: (00:00)\nMr. President, thank you for being with us.\nSean Hannity: (00:02)\nLet me go back. I have had a number of people tell me that there were very specific conditions...",sean hannity president thank being with sean hannity back have number people tell that there were very specific conditions very specific warnings that gave personally taliban biden trying blame se...
6,6,"\n\n\n\n \nDonald Trump: (03:17)\nThank you very much, thank you. And thank you to Charlie for that introduction, which was so beautiful and for your fearless leadership of Turning Point Action an...",donald trump thank very much thank thank charlie that introduction which beautiful your fearless leadership turning point action turning point thank charlie very much also express incredible appre...
7,7,"\n\n\n\n \nDonald Trump: (00:07)\nThank you very much. Thank you.\nAudience: (00:18)\nUSA, USA, USA, USA, USA, USA, USA.\nDonald Trump: (00:18)\nThank you very much. Thank you to Matt. What a job....",donald trump thank very much thank audience donald trump thank very much thank matt what mercedes have done cpac item people standing outside trying would anybody like give their slot would anybod...
8,8,"\n\n\n\n \nBrooke Rollins: (00:00)\n… In the way. There’s no topic on which they, the elites, the big firms, the progressives, the office holders and the bureaucrats, there is no other topic that ...",brooke rollins there topic which they elites firms progressives office holders bureaucrats there other topic that they seeing bigger obstacle achieve their ambitions than first amendment first ame...
9,9,"\n\n\n\n \nAudience: (00:00)\nUSA, USA, USA, USA, USA, USA-\nDonald Trump: (00:00)\nWow. Thank you.\nAudience: (02:10)\n… USA, USA, USA, USA, USA, USA, USA, [crosstalk 00:02:10]-\nDonald Trump: (0...",audience donald trump thank audience crosstalk donald trump well want thank ohio incredible turnout there thousands people trying unbelievable hardworking patriots here tonight very first rally el...


### 2.6 Decontract text

Contractions are words or combinations of words that are shortened by dropping letters and replacing them with an apostrophe. For example: I’ll be there within 5 min. Are u not gng there? Am I mssng out on smthng? I’d like to see u near d park. Here “I’ll” and “I’d” are contractions.

**Removing contractions contributes to text standardization and is useful when we are working on Twitter data** or on product reviews, where individual words play an important role in sentiment analysis.

In [19]:
import re
contractions_dict = {
        "can't": "cannot",
        "can't've": "cannot have",
        "'cause": "because",
        "could've": "could have",
        "couldn't": "could not",
        "couldn't've": "could not have",
        "didn't": "did not",
        "doesn't": "does not",
        "don't": "do not",
        "hadn't": "had not",
        "hadn't've": "had not have",
        "hasn't": "has not",
        "haven't": "have not",
        "he'd": "he had / he would",
        "he'd've": "he would have",
        "he'll": "he shall / he will",
        "he'll've": "he shall have / he will have",
        "he's": "he has / he is",
        "how'd": "how did",
        "how'd'y": "how do you",
        "how'll": "how will",
        "how's": "how has / how is / how does",
        "I'd": "I had / I would",
        "I'd've": "I would have",
        "I'll": "I shall / I will",
        "I'll've": "I shall have / I will have",
        "I'm": "I am",
        "I've": "I have",
        "isn't": "is not",
        "it'd": "it had / it would",
        "it'd've": "it would have",
        "it'll": "it shall / it will",
        "it'll've": "it shall have / it will have",
        "it's": "it has / it is",
        "let's": "let us",
        "ma'am": "madam",
        "mayn't": "may not",
        "might've": "might have",
        "mightn't": "might not",
        "mightn't've": "might not have",
        "must've": "must have",
        "mustn't": "must not",
        "mustn't've": "must not have",
        "needn't": "need not",
        "needn't've": "need not have",
        "o'clock": "of the clock",
        "oughtn't": "ought not",
        "oughtn't've": "ought not have",
        "shan't": "shall not",
        "sha'n't": "shall not",
        "shan't've": "shall not have",
        "she'd": "she had / she would",
        "she'd've": "she would have",
        "she'll": "she shall / she will",
        "she'll've": "she shall have / she will have",
        "she's": "she has / she is",
        "should've": "should have",
        "shouldn't": "should not",
        "shouldn't've": "should not have",
        "so've": "so have",
        "so's": "so as / so is",
        "that'd": "that would / that had",
        "that'd've": "that would have",
        "that's": "that has / that is",
        "there'd": "there had / there would",
        "there'd've": "there would have",
        "there's": "there has / there is",
        "they'd": "they had / they would",
        "they'd've": "they would have",
        "they'll": "they shall / they will",
        "they'll've": "they shall have / they will have",
        "they're": "they are",
        "they've": "they have",
        "to've": "to have",
        "wasn't": "was not",
        "we'd": "we had / we would",
        "we'd've": "we would have",
        "we'll": "we will",
        "we'll've": "we will have",
        "we're": "we are",
        "we've": "we have",
        "weren't": "were not",
        "what'll": "what shall / what will",
        "what'll've": "what shall have / what will have",
        "what're": "what are",
        "what's": "what has / what is",
        "what've": "what have",
        "when's": "when has / when is",
        "when've": "when have",
        "where'd": "where did",
        "where's": "where has / where is",
        "where've": "where have",
        "who'll": "who shall / who will",
        "who'll've": "who shall have / who will have",
        "who's": "who has / who is",
        "who've": "who have",
        "why's": "why has / why is",
        "why've": "why have",
        "will've": "will have",
        "won't": "will not",
        "won't've": "will not have",
        "would've": "would have",
        "wouldn't": "would not",
        "wouldn't've": "would not have",
        "y'all": "you all",
        "y'all'd": "you all would",
        "y'all'd've": "you all would have",
        "y'all're": "you all are",
        "y'all've": "you all have",
        "you'd": "you had / you would",
        "you'd've": "you would have",
        "you'll": "you shall / you will",
        "you'll've": "you shall have / you will have",
        "you're": "you are",
        "you've": "you have"
}

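# Try longer keys first so that "can't've" is matched before "can't",
# and escape each key so it is treated as literal text.
contractions_re = re.compile('(%s)' % '|'.join(
    re.escape(k) for k in sorted(contractions_dict, key=len, reverse=True)))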

In [21]:
def expand_contractions(s, contractions_dict=contractions_dict):
    def replace(match):
        return contractions_dict[match.group(0)]
    return contractions_re.sub(replace, s)
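The following sketch shows the same idea on a tiny dictionary so the behaviour is easy to verify. The names `demo_contractions` and `expand` are hypothetical. Two assumptions are added here that go beyond the function above: curly apostrophes (’) are normalized to straight ones, and the text is lower-cased before matching, since dictionary keys only match text written in exactly the same form.

```python
import re

# Small illustrative subset; keys are lower case with straight apostrophes.
demo_contractions = {
    "can't": "cannot",
    "can't've": "cannot have",
    "don't": "do not",
    "i'm": "i am",
}

# Longest keys first, so "can't've" wins over its prefix "can't".
demo_re = re.compile("|".join(
    re.escape(k) for k in sorted(demo_contractions, key=len, reverse=True)))

def expand(text):
    # Normalize curly apostrophes and case so the keys can match.
    text = text.replace("’", "'").lower()
    return demo_re.sub(lambda m: demo_contractions[m.group(0)], text)

print(expand("I’m sure they don’t"))
print(expand("You can't've seen it"))
```

Contraction expansion relies on apostrophes being present, so in a pipeline like this one it would need to run before punctuation is removed in order to change anything.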

In [22]:
df['tidy_script'] = df['tidy_script'].apply(expand_contractions)
df

Unnamed: 0,id,script,tidy_script
0,0,"\n\n\n\n \nDonald Trump: (00:13)\nHello, Iowa. Congratulations to the Iowa hawkers. That was a big win today. I’m thrilled to be back. That was a big win. But I am thrilled to be back especially o...",donald trump hello iowa congratulations iowa hawkers that today thrilled back that thrilled back especially such great news that that been great school great team great tradition really amazing st...
1,1,"\n\n\n\n \nDonald Trump: (03:37)\nWe have great, great people running. Many of them are right here. I love Marjorie. With your help, we’re going to take back the House and send Nancy Pelosi back t...",donald trump have great great people running many them right here love marjorie with your help going take back house send nancy pelosi back francisco where work very hard bring back city which hel...
2,2,"\n\n\n\n \nGreg Gutfeld: (00:05)\nAll right. Why should I say a thing? Let’s just get to round two. I want to ask you a question about COVID because you had COVID and my wife, you met my wife, I d...",greg gutfeld right should thing just round want question about covid because covid wife wife know like years lago still gotten vaccinated keep talking vaccinated hasn what would donald trump well ...
3,3,"\n\n\n\n \nDonald Trump: (00:00)\nAs one nation, America mourns the loss of our brave and brilliant American service members in a savage and barbaric terrorist attack in Afghanistan. These noble A...",donald trump nation america mourns loss brave brilliant american service members savage barbaric terrorist attack afghanistan these noble american warriors laid down their lives line duty they sac...
4,4,"\n\n\n\n \nDonald Trump: (08:53)\nThank you. Thank you. Wow, this is a big crowd. I’ll tell you. This goes all the way back. I wish they’d show it because they just don’t do that. They don’t like…...",donald trump thank thank this crowd tell this goes back wish they show because they just that they like this goes back just looked television television show they show know because they fake news ...
5,5,"\n\n\n\n \nSean Hannity: (00:00)\nMr. President, thank you for being with us.\nSean Hannity: (00:02)\nLet me go back. I have had a number of people tell me that there were very specific conditions...",sean hannity president thank being with sean hannity back have number people tell that there were very specific conditions very specific warnings that gave personally taliban biden trying blame se...
6,6,"\n\n\n\n \nDonald Trump: (03:17)\nThank you very much, thank you. And thank you to Charlie for that introduction, which was so beautiful and for your fearless leadership of Turning Point Action an...",donald trump thank very much thank thank charlie that introduction which beautiful your fearless leadership turning point action turning point thank charlie very much also express incredible appre...
7,7,"\n\n\n\n \nDonald Trump: (00:07)\nThank you very much. Thank you.\nAudience: (00:18)\nUSA, USA, USA, USA, USA, USA, USA.\nDonald Trump: (00:18)\nThank you very much. Thank you to Matt. What a job....",donald trump thank very much thank audience donald trump thank very much thank matt what mercedes have done cpac item people standing outside trying would anybody like give their slot would anybod...
8,8,"\n\n\n\n \nBrooke Rollins: (00:00)\n… In the way. There’s no topic on which they, the elites, the big firms, the progressives, the office holders and the bureaucrats, there is no other topic that ...",brooke rollins there topic which they elites firms progressives office holders bureaucrats there other topic that they seeing bigger obstacle achieve their ambitions than first amendment first ame...
9,9,"\n\n\n\n \nAudience: (00:00)\nUSA, USA, USA, USA, USA, USA-\nDonald Trump: (00:00)\nWow. Thank you.\nAudience: (02:10)\n… USA, USA, USA, USA, USA, USA, USA, [crosstalk 00:02:10]-\nDonald Trump: (0...",audience donald trump thank audience crosstalk donald trump well want thank ohio incredible turnout there thousands people trying unbelievable hardworking patriots here tonight very first rally el...


### 2.7 Remove stopwords

- Stop words (or commonly occurring words) should be removed from the text data. 
- For this purpose, we can either create a list of stopwords ourselves or we can use predefined libraries.
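As a sketch of the first option, a hand-made stop word list can be as simple as a Python set. The words in `custom_stop_words` and the helper name `remove_stop_words` are assumptions chosen for this example; a real list would normally be much longer.

```python
# A tiny hand-made stop word list (a set gives fast membership checks).
custom_stop_words = {"that", "was", "a", "to", "the", "be"}

def remove_stop_words(text, stop_words):
    # Keep only the words that are not in the stop word collection.
    return " ".join(w for w in text.split() if w not in stop_words)

sample = "that was a big win today"
print(remove_stop_words(sample, custom_stop_words))
```

The comparison is case-sensitive, so the text should be lower-cased first if the stop words are stored in lower case.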

In [23]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Drew\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [24]:
# let's import the stopwords and see them
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [25]:
# Only join words if not in stop words
df['tidy_script'] = df['tidy_script'].apply(lambda x: " ".join(x for x in x.split() if x not in stop_words))
df.head()

Unnamed: 0,id,script,tidy_script
0,0,"\n\n\n\n \nDonald Trump: (00:13)\nHello, Iowa. Congratulations to the Iowa hawkers. That was a big win today. I’m thrilled to be back. That was a big win. But I am thrilled to be back especially o...",donald trump hello iowa congratulations iowa hawkers today thrilled back thrilled back especially great news great school great team great tradition really amazing started right going keep number ...
1,1,"\n\n\n\n \nDonald Trump: (03:37)\nWe have great, great people running. Many of them are right here. I love Marjorie. With your help, we’re going to take back the House and send Nancy Pelosi back t...",donald trump great great people running many right love marjorie help going take back house send nancy pelosi back francisco work hard bring back city helped much destroy like destroying nation de...
2,2,"\n\n\n\n \nGreg Gutfeld: (00:05)\nAll right. Why should I say a thing? Let’s just get to round two. I want to ask you a question about COVID because you had COVID and my wife, you met my wife, I d...",greg gutfeld right thing round want question covid covid wife wife know like years lago still gotten vaccinated keep talking vaccinated would donald trump well first kind religious thing greg gutf...
3,3,"\n\n\n\n \nDonald Trump: (00:00)\nAs one nation, America mourns the loss of our brave and brilliant American service members in a savage and barbaric terrorist attack in Afghanistan. These noble A...",donald trump nation america mourns loss brave brilliant american service members savage barbaric terrorist attack afghanistan noble american warriors laid lives line duty sacrifice country loved r...
4,4,"\n\n\n\n \nDonald Trump: (08:53)\nThank you. Thank you. Wow, this is a big crowd. I’ll tell you. This goes all the way back. I wish they’d show it because they just don’t do that. They don’t like…...",donald trump thank thank crowd tell goes back wish show like goes back looked television television show show know fake news right fake news hello alabama thrilled back incredible wonderful state ...
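The filter above checks `x not in stop_words` once per token. If `stop_words` is a plain list (as the printed output suggests), each check scans the list, which adds up on long transcripts. The sketch below shows the same filter with a set. It uses a hypothetical miniature list, `toy_stop_words`, in place of the notebook's full one.

```python
# Hypothetical miniature stop word list; the notebook's `stop_words` is much longer.
toy_stop_words = ["the", "to", "is", "a", "of", "and"]

# A set gives constant-time membership checks; a list is scanned on every lookup.
stop_set = frozenset(toy_stop_words)


def remove_stop_words(text, stop_set):
    """Keep only the whitespace-separated tokens that are not stop words."""
    return " ".join(tok for tok in text.split() if tok not in stop_set)


sample = "the border is a big part of the story"
cleaned_sample = remove_stop_words(sample, stop_set)
print(cleaned_sample)  # border big part story
```

In the notebook this would amount to building `set(stop_words)` once and passing it to the function inside `apply`.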


### 2.8 Remove frequent words

- We can also remove the most frequently occurring words from our text data.

- First, let’s check the 20 most frequent words in our text data, then decide whether to remove or retain them.

In [27]:
# Count the 20 most frequent words across all scripts
common_top20 = pd.Series(' '.join(df['tidy_script']).split()).value_counts()[:20]
print(common_top20)

# Remove these top 20 freq words
common = list(common_top20.index)
df['tidy_script'] = df['tidy_script'].apply(lambda x: " ".join(x for x in x.split() if x not in common))
df.head()

trump       1146
donald      1036
people       731
going        708
said         664
great        530
country      522
know         493
want         454
like         413
would        366
think        360
never        355
thank        347
right        341
election     314
ever         301
back         299
much         298
biden        281
dtype: int64


Unnamed: 0,id,script,tidy_script
0,0,"\n\n\n\n \nDonald Trump: (00:13)\nHello, Iowa. Congratulations to the Iowa hawkers. That was a big win today. I’m thrilled to be back. That was a big win. But I am thrilled to be back especially o...",hello iowa congratulations iowa hawkers today thrilled thrilled especially news school team tradition really amazing started keep number keep fairgrounds broke record tonight history fairgrounds h...
1,1,"\n\n\n\n \nDonald Trump: (03:37)\nWe have great, great people running. Many of them are right here. I love Marjorie. With your help, we’re going to take back the House and send Nancy Pelosi back t...",running many love marjorie help take house send nancy pelosi francisco work hard bring city helped destroy destroying nation destroying nation fire ultra left wing senator ralph warnock elect hers...
2,2,"\n\n\n\n \nGreg Gutfeld: (00:05)\nAll right. Why should I say a thing? Let’s just get to round two. I want to ask you a question about COVID because you had COVID and my wife, you met my wife, I d...",greg gutfeld thing round question covid covid wife wife years lago still gotten vaccinated keep talking vaccinated well first kind religious thing greg gutfeld skeptical skeptical greg gutfeld ske...
3,3,"\n\n\n\n \nDonald Trump: (00:00)\nAs one nation, America mourns the loss of our brave and brilliant American service members in a savage and barbaric terrorist attack in Afghanistan. These noble A...",nation america mourns loss brave brilliant american service members savage barbaric terrorist attack afghanistan noble american warriors laid lives line duty sacrifice loved racing time rescue fel...
4,4,"\n\n\n\n \nDonald Trump: (08:53)\nThank you. Thank you. Wow, this is a big crowd. I’ll tell you. This goes all the way back. I wish they’d show it because they just don’t do that. They don’t like…...",crowd tell goes wish show goes looked television television show show fake news fake news hello alabama thrilled incredible wonderful state record number state also states numbers tell rigged terr...


**YOUR TURN**

Check if you can achieve better results by NOT removing common words!
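One way to run this comparison is to leave the input untouched and produce the filtered text separately, so both variants stay available. The following is a minimal sketch on plain Python lists. `remove_top_n` and `toy_texts` are illustrative names, not part of the notebook.

```python
from collections import Counter


def remove_top_n(texts, n):
    """Drop the n most frequent tokens (counted over all texts) from every text.

    Returns the filtered texts together with the set of removed tokens,
    leaving the input list unchanged.
    """
    counts = Counter(tok for text in texts for tok in text.split())
    top = {tok for tok, _ in counts.most_common(n)}
    cleaned = [
        " ".join(tok for tok in text.split() if tok not in top)
        for text in texts
    ]
    return cleaned, top


# Hypothetical toy data standing in for df['tidy_script'].
toy_texts = [
    "great people great country",
    "great election people vote",
    "great border wall",
]
cleaned_texts, removed = remove_top_n(toy_texts, 2)
print(removed)
print(cleaned_texts)
```

Applied to the notebook, the result could be stored in a new column (for example a hypothetical `tidy_no_common`) next to `tidy_script`, so a downstream model can be evaluated on each.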

### 2.9 Remove rare words

- Next, we remove rarely occurring words from the text.
- Because they are so rare, any association between them and other words is dominated by noise.
- Alternatively, rare words can be replaced with a more general form, which will then have a higher count.

In [43]:
# The 20 least frequent words across all scripts
rare_top20 = pd.Series(" ".join(df['tidy_script']).split()).value_counts()[-20:]
print(rare_top20)

# Remove these 20 rare words
rare = list(rare_top20.index)
df['tidy_script'] = df['tidy_script'].apply(lambda x: " ".join(x for x in x.split() if x not in rare))
df.head()

mcshade          1
totals           1
outright         1
whopping         1
easiest          1
server           1
erasing          1
devices          1
documentation    1
cooperation      1
supervisors      1
postmen          1
error            1
pension          1
answers          1
obey             1
dress            1
review           1
valid            1
singing          1
dtype: int64


Unnamed: 0,id,script,tidy_script
0,0,"\n\n\n\n \nDonald Trump: (00:13)\nHello, Iowa. Congratulations to the Iowa hawkers. That was a big win today. I’m thrilled to be back. That was a big win. But I am thrilled to be back especially o...",hello iowa congratulations iowa hawkers today thrilled thrilled especially news school team tradition really amazing started keep number keep fairgrounds broke record tonight history fairgrounds h...
1,1,"\n\n\n\n \nDonald Trump: (03:37)\nWe have great, great people running. Many of them are right here. I love Marjorie. With your help, we’re going to take back the House and send Nancy Pelosi back t...",running many love marjorie help take house send nancy pelosi francisco work hard bring city helped destroy destroying nation destroying nation fire ultra left wing senator ralph warnock elect hers...
2,2,"\n\n\n\n \nGreg Gutfeld: (00:05)\nAll right. Why should I say a thing? Let’s just get to round two. I want to ask you a question about COVID because you had COVID and my wife, you met my wife, I d...",greg gutfeld thing round question covid covid wife wife years lago still gotten vaccinated keep talking vaccinated well first kind religious thing greg gutfeld skeptical skeptical greg gutfeld ske...
3,3,"\n\n\n\n \nDonald Trump: (00:00)\nAs one nation, America mourns the loss of our brave and brilliant American service members in a savage and barbaric terrorist attack in Afghanistan. These noble A...",nation america mourns loss brave brilliant american service members savage barbaric terrorist attack afghanistan noble american warriors laid lives line duty sacrifice loved racing time rescue fel...
4,4,"\n\n\n\n \nDonald Trump: (08:53)\nThank you. Thank you. Wow, this is a big crowd. I’ll tell you. This goes all the way back. I wish they’d show it because they just don’t do that. They don’t like…...",crowd tell goes wish show goes looked television television show show fake news fake news hello alabama thrilled incredible wonderful state record number state also states numbers tell rigged terr...


**YOUR TURN**

Check if you can achieve better results by NOT removing rare words!
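The cell above removes only the last 20 entries of `value_counts()`, all of which occur once. A corpus of this size likely has many more words with the same count. The sketch below shows the alternative mentioned in the bullets: replacing every word below a frequency threshold with a generic token. The threshold `min_count` and the placeholder `<rare>` are assumptions chosen for illustration.

```python
from collections import Counter


def replace_rare(texts, min_count=2, placeholder="<rare>"):
    """Replace every token seen fewer than min_count times with a placeholder."""
    counts = Counter(tok for text in texts for tok in text.split())
    return [
        " ".join(
            tok if counts[tok] >= min_count else placeholder
            for tok in text.split()
        )
        for text in texts
    ]


# Hypothetical toy data standing in for df['tidy_script'].
toy_rare_texts = ["fake news fake server", "news postmen fake"]
replaced = replace_rare(toy_rare_texts)
print(replaced)  # ['fake news fake <rare>', 'news <rare> fake']
```

Because all rare words collapse into one token, that token gets a usable count, while the positions where rare words occurred are preserved.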

### 2.10 Spelling Correction

- Text such as tweets can contain a plethora of spelling mistakes, and our task is to correct them.
- Spelling correction is a useful pre-processing step because it also reduces the number of spelling variants of the same word. For example, “Analytics” and “analytcs” would otherwise be treated as different words even though they are used in the same sense.
- To accomplish this, we will use the textblob library as follows:

In [28]:
# Using textblob
from textblob import TextBlob

In [29]:
# create function
def spell_correction(df):
    return df['tidy_script'].apply(lambda x: str(TextBlob(x).correct()))

In [45]:
# call the function
df['tidy_script'] = spell_correction(df)
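TextBlob's documentation describes `correct()` as based on Peter Norvig's spelling corrector. The sketch below is not TextBlob's code. It is a minimal standard-library illustration of that idea under simplifying assumptions: only candidates one edit away are considered, and `toy_vocab` is a hypothetical word-frequency table standing in for a real corpus.

```python
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct_word(word, vocab_counts):
    """Return word if known, else the most frequent known word one edit away."""
    if word in vocab_counts:
        return word
    candidates = [w for w in edits1(word) if w in vocab_counts]
    if not candidates:
        return word  # nothing plausible found, leave the token unchanged
    return max(candidates, key=lambda w: vocab_counts[w])


def correct_text(text, vocab_counts):
    """Correct each whitespace-separated token independently."""
    return " ".join(correct_word(tok, vocab_counts) for tok in text.split())


# Hypothetical frequency table; a real one would be built from a large corpus.
toy_vocab = Counter({"analytics": 5, "election": 3, "country": 4})
corrected = correct_text("analytcs of the electon", toy_vocab)
print(corrected)  # analytics of the election
```

The sketch also shows why such correctors can alter proper nouns: a name missing from the vocabulary is replaced whenever a known word lies one edit away, so it is worth spot-checking the corrected transcripts.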

## 3. Tweets before and after cleaning

We will compare the text before and after cleaning, starting with the length (in characters) of each cleaned script.

In [53]:
for s in df['tidy_script']:
    print(len(s))

37370
33227
2787
1000
32980
16345
38351
32082
19224
31712
33551
33208
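The lengths above describe the cleaned text only. To see how much each cleaning pass removed, the raw and cleaned versions can be measured side by side. The following is a minimal sketch on plain lists. `length_report` and the toy strings are illustrative, not part of the notebook.

```python
def length_report(before, after):
    """Character and word counts for each (raw, cleaned) pair of texts."""
    rows = []
    for raw, clean in zip(before, after):
        rows.append({
            "chars_before": len(raw),
            "chars_after": len(clean),
            "words_before": len(raw.split()),
            "words_after": len(clean.split()),
            # Share of characters that survived cleaning (0.0 for empty input).
            "kept_ratio": round(len(clean) / len(raw), 2) if raw else 0.0,
        })
    return rows


# Hypothetical toy pair standing in for df['script'] and df['tidy_script'].
toy_before = ["Hello, Iowa. That was a big win today."]
toy_after = ["hello iowa today"]
report = length_report(toy_before, toy_after)
for row in report:
    print(row)
```

In the notebook, the two lists would be `df['script']` and `df['tidy_script']`. A very low ratio for one script can flag a row where cleaning removed more than intended.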


**1. Positive tweets before and after cleaning**

In [46]:
# Example 1
print('BEFORE - ',df['script'][1])
print('AFTER - ',df['tidy_script'][1])
print('')

BEFORE -  



 
Donald Trump: (03:37)
We have great, great people running. Many of them are right here. I love Marjorie. With your help, we’re going to take back the House and send Nancy Pelosi back to San Francisco where she can work very hard to bring back a city which she has helped to very much destroy, just like they’re destroying our nation. They’re destroying our nation. We’re going to fire your ultra left-wing Senator Ralph Warnock, and elect the great Herschel Walker to the United States Senate. And we’re going to take back our country from these lunatics.
Donald Trump: (04:23)
In just eight months Joe Biden and the radical Democrats are well on their way to turning America into a third-world nation. That’s what’s happening. You see it here as much as anybody. I told you so during the election and during the campaign. Inflation is skyrocketing. Unemployment is rising at a level that nobody can believe. Main streets are being boarded up. Murders are through the roof. You look a

In [47]:
# Example 2
print('BEFORE - ',df['script'][4])
print('AFTER - ',df['tidy_script'][4])
print('')

BEFORE -  



 
Donald Trump: (08:53)
Thank you. Thank you. Wow, this is a big crowd. I’ll tell you. This goes all the way back. I wish they’d show it because they just don’t do that. They don’t like… This goes all the way back. I just looked at it on television, but it’s our television. We show it, but they don’t show it. You know why? Because they’re fake news. Right? They’re fake news. Hello, Alabama. And I’m thrilled to be back in your incredible, wonderful state that we won by a record number. We won this state. We also won a lot of other states by numbers that they don’t tell you about. We did have a rigged election. Didn’t we. It was terrible, terrible. And you look at what’s going on now. You look at what’s going on now and the border, but take a look at Afghanistan, what’s happening.
Donald Trump: (09:47)
But I’m with thousands of proud, hardworking, incredible American patriots. With your help, we’re going to elect our friend, Mo Brooks, to the U.S. Senate. We’re going to fir

In [48]:
# Example 3
print('BEFORE - ',df['script'][11])
print('AFTER - ',df['tidy_script'][11])
print('')

BEFORE -  



 
Donald Trump: (01:24)
Well, thank you very much. And hello, CPAC. Do you miss me yet? Do you miss me yet? A lot of things going on.
Donald Trump: (01:36)
There’s so many wonderful friends, conservatives and fellow citizens in this room and all across our country. I stand before you today to declare that the incredible journey we’ve begun together, we went through a journey like nobody else. There’s never been a journey like it. There’s never been a journey so successful. We began it together four years ago, and it is far from being over. We’ve just started.
Crowd: (02:04)
USA USA USA USA.
Donald Trump: (02:11)
Our movement of proud, hardworking, and you know what? This is the hardest working people, hard working American Patriots is just getting started. And in the end we will win. We will win.
Donald Trump: (02:29)
We’ve been doing a lot of winning as we gather this week, we’re in the middle of a historic struggle for America’s future, America’s culture, and America’s 

**2. Negative tweets before and after cleaning**

In [49]:
# Example 1
print('BEFORE - ',df['script'][3])
print('AFTER - ',df['tidy_script'][3])
print('')

BEFORE -  



 
Donald Trump: (00:00)
As one nation, America mourns the loss of our brave and brilliant American service members in a savage and barbaric terrorist attack in Afghanistan. These noble American warriors laid down their lives in the line of duty. They sacrifice themselves so the country that they loved, racing against time to rescue their fellow citizens from harm’s way. They died as American heroes and our nation will honor their memory forever. I want to express my deepest condolences to the families of those we have lost. Today all Americans grieve alongside you. Together we also pray that God will heal the other courageous American service members who were wounded in this heinous attack. In addition, our hearts are with the families of all the innocent civilians who died, and with the many men, women, and children who were terribly injured in this act of evil.
Donald Trump: (01:00)
This tragedy should never have taken place, it should never have happened, and it would 

In [50]:
# Example 2
print('BEFORE - ',df['script'][8])
print('AFTER - ',df['tidy_script'][8])
print('')

BEFORE -  



 
Brooke Rollins: (00:00)
… In the way. There’s no topic on which they, the elites, the big firms, the progressives, the office holders and the bureaucrats, there is no other topic that they are seeing as a bigger obstacle to achieve their ambitions than the first amendment. The first amendment, the bulwark of our liberties, is what enables us as citizens to resist them at every turn. The first amendment truly stands between them and us. When they try to tell us what to read, the first amendment gets in their way. When they try to tell us what to think, the first amendment gets in their way.
Brooke Rollins: (00:42)
When they try to tell us what to believe, the first amendment gets in their way. And when they try to tell us with whom to worship, with whom to associate, with whom to congregate and with whom to be friends, the first amendment gets in their way. It’s no surprise then that they want the first amendment gone. They don’t advocate for abolition of course, they kn

In [51]:
# Example 3
print('BEFORE - ',df['script'][10])
print('AFTER - ',df['tidy_script'][10])
print('')

BEFORE -  



 
Donald Trump: (00:00)
Thank you very much to Lee. Thank you, Lee. Well, he’s been so great. Michael, thank you very much and congratulations on your reelection today as chairman of the North Carolina Republican Party. We love this state. You’ve done a great job, Michael, and we thank you very much. We really appreciate it. Thank you. Thank you. I love you too. I love you too.
Donald Trump: (02:31)
It’s great to be back in Greenville with so many proud North Carolina Patriots who love our country, support our military, respect our police, honor our flag and always put America first. We don’t put America second. As we gather tonight, our country is being destroyed before our very own eyes. Crime is exploding. Police departments are being ripped apart and defunded. Can you believe that? Is that good politics, defund our police? Number one, it’s bad for our country, but think of it, defund our police. You know, I’ve long said they’re vicious, they’re violent. They in many c

In [52]:
# save dataframe
df.to_csv(path + 'trump_cleaned_df.csv')