# Data Wrangling 

This notebook walks through the transformations and preprocessing steps I applied to my tweet dataset.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df_tweets = pd.read_csv('tweets.csv')

In [3]:
df_tweets.head()

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


In [4]:
df_tweets.describe()

Unnamed: 0,textID,text,selected_text,sentiment
count,27481,27480,27480,27481
unique,27481,27480,22463,3
top,95a71a0799,you are a good child.,good,neutral
freq,1,1,199,11118


In [5]:
df_tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27481 entries, 0 to 27480
Data columns (total 4 columns):
textID           27481 non-null object
text             27480 non-null object
selected_text    27480 non-null object
sentiment        27481 non-null object
dtypes: object(4)
memory usage: 858.9+ KB


## Balance Dataset 
The dataset has noticeably more 'neutral' records (11,118) than 'positive' (8,582) or 'negative' (7,781) records, roughly 2.5k and 3.3k more respectively. We will downsample the 'neutral' class to balance the data.

In [6]:
df_tweets['sentiment'].value_counts()

neutral     11118
positive     8582
negative     7781
Name: sentiment, dtype: int64

In [7]:
## create a separate dataframe for each sentiment
df_positive = df_tweets[df_tweets['sentiment'] == 'positive']
df_negative = df_tweets[df_tweets['sentiment'] == 'negative']
df_neutral = df_tweets[df_tweets['sentiment'] == 'neutral']

## downsample 'neutral' to 8100 random records (without replacement) and combine all records
from sklearn.utils import resample

df_neutral_reshaped = resample(df_neutral, replace=False, n_samples=8100, random_state=42)
df_tweets = pd.concat([df_neutral_reshaped, df_positive, df_negative])
df_tweets.sort_index(axis=0, inplace=True)
df_tweets['sentiment'].value_counts()

positive    8582
neutral     8100
negative    7781
Name: sentiment, dtype: int64
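
The same downsampling idea can be sketched without scikit-learn. The snippet below is a minimal, library-free illustration on made-up rows, not the code used in this notebook: `downsample_label`, `rows`, and the cap of 4 are hypothetical names and values chosen for the example.

```python
import random
from collections import Counter


def downsample_label(rows, label, n_samples, seed=42):
    """Keep at most n_samples rows carrying `label`; leave other rows untouched.

    `rows` is a list of (text, sentiment) tuples. Sampling is done without
    replacement, which corresponds to replace=False in resample().
    """
    rng = random.Random(seed)
    target_idx = [i for i, (_, s) in enumerate(rows) if s == label]
    if len(target_idx) <= n_samples:
        return list(rows)
    keep = set(rng.sample(target_idx, n_samples))
    # Preserve the original row order, similar to sorting by index afterwards.
    return [row for i, row in enumerate(rows) if row[1] != label or i in keep]


# Hypothetical toy data: 6 neutral, 4 positive, 3 negative rows.
rows = (
    [(f"neutral tweet {i}", "neutral") for i in range(6)]
    + [(f"positive tweet {i}", "positive") for i in range(4)]
    + [(f"negative tweet {i}", "negative") for i in range(3)]
)

balanced = downsample_label(rows, "neutral", n_samples=4)
counts = Counter(sentiment for _, sentiment in balanced)
print(counts)
```

Only the over-represented class is sampled; the other classes pass through unchanged, so the result is more balanced but not perfectly equal.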

In [8]:
#dataframe for machine learning models
df_tweets_final = df_tweets.drop(columns=['textID', 'selected_text'])
df_tweets_final = df_tweets_final.dropna()

#dataframe for deep learning models
df_tweets_final_dl = df_tweets_final.copy()

# Preprocessing Data (Machine Learning)

## Creating Scraper

This scraper will extract a list of commonly used internet and text acronyms, which will later be used to replace any matched acronyms within the tweets.

In [10]:
#create beautifulsoup scraper
import requests
from bs4 import BeautifulSoup

url = "https://www.netlingo.com/acronyms.php"
page = requests.get(url)

soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <title>
   The Largest List of Chat Acronyms and Text Message Shorthand (IM, SMS) found of the Web - updated daily by NetLingo The Internet Dictionary: Online Dictionary of Internet Terms, Acronyms, Text Messaging, Smileys ;-)
  </title>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <meta content="Leading Internet dictionary defines thousands of online communication, technology and business terms :-) plus list of texting jargon and chat acronyms ;-) | NetLingo.com" name="description">
   <meta content="netlingo, online dictionary, Internet terms, what is, what does it mean, stands for, computer, web glossary, online jargon, slang, net lingo, new words, Internet help, abbreviations, text messaging, IM, SMS, smileys, :-), social media, online privacy, cyber safe

In [11]:
div = soup.find('div', class_='list_box3')

acro_dict = {}
for li in div.find_all('li'):
    link = li.find('a').get_text().lower()  ##acronym 
    divtext = li.find('div').get_text().lower()   ##meaning
    acro_dict[link] = divtext

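## remove acronyms with multiple meanings
## note: `"-or-" and "or" not in v` reduces to `"or" not in v` (a non-empty string is always truthy),
## so this drops every meaning containing the substring "or", not only the "-or-" entries
final_dic = {k: v for k, v in acro_dict.items() if "or" not in v}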

In [12]:
#adding common acronyms
final_dic['lol'] = 'laughing out loud'

#removing a few uncommon acronyms that clash with ordinary words
rem_list = ['was', 'of', 'its', 'ps', 'oh', 'so']

for key in rem_list:
    final_dic.pop(key, None)  # None default avoids a KeyError if a key is absent

print(len(final_dic))

2254
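
A compound test such as `"-or-" and "or" not in v` is easy to misread. Python parses it as `"-or-" and ("or" not in v)`, and a non-empty string literal is always truthy, so the test reduces to `"or" not in v`. The sketch below shows the difference on a made-up three-entry dictionary; `sample`, `broad`, and `narrow` are hypothetical names, and the entries are illustrative rather than scraped.

```python
# Hypothetical entries in the same acronym -> meaning shape as acro_dict.
sample = {
    "brb": "be right back",
    "fyi": "for your information",                # "or" appears inside "for"
    "abc": "first meaning -or- second meaning",   # multi-meaning marker
}

# Parsed as: "-or-" and ("or" not in v)  ->  "or" not in v
broad = {k: v for k, v in sample.items() if "-or-" and "or" not in v}

# Tests only for the multi-meaning marker.
narrow = {k: v for k, v in sample.items() if "-or-" not in v}

print(sorted(broad))   # ['brb']
print(sorted(narrow))  # ['brb', 'fyi']
```

The broad form discards single-meaning entries such as "for your information", so the choice between the two changes how many acronyms survive the filter.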


## Lowercasing

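Lowercasing keeps the same word from being treated as several different tokens just because of capitalization (for example 'SAD' and 'sad'). It also lets the tweets match the lowercased acronym keys built above.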

In [13]:
#lowercase all text data
df_tweets_final['text'] = df_tweets_final['text'].astype(str).str.lower()

## Replace acronyms found in tweets with their full meaning

To replace each acronym used within the tweets, I combined a lambda function with a list comprehension. Each tweet is split on whitespace, and every token that exactly matches a key in the acronym dictionary is swapped for its full meaning. A token with punctuation attached, such as 'lol!', is left unchanged.

In [14]:
df_tweets_final['clean_tweets'] = df_tweets_final['text'].str.split().apply(lambda x:' '.join([final_dic.get(word,word) for word in x]))

In [15]:
df_tweets_final.tail(10)

Unnamed: 0,text,sentiment,clean_tweets
27470,lol i know and haha..did you fall asleep?? o...,negative,laughing out loud i know and haha..did you fal...
27472,http://twitpic.com/663vr - wanted to visit the...,negative,http://twitpic.com/663vr - wanted to visit the...
27473,in spoke to you yesterday and u didnt respond...,neutral,in spoke to you yesterday and you didnt respon...
27474,so i get up early and i feel good about the da...,positive,so i get up early and i feel good about the da...
27475,enjoy ur night,positive,enjoy you are night
27476,wish we could come see u on denver husband l...,negative,wish whatever could come see you on denver hus...
27477,i`ve wondered about rake to. the client has ...,negative,i`ve wondered about rake to. the client has ma...
27478,yay good for both of you. enjoy the break - y...,positive,yay good for both of you. enjoy the break - yo...
27479,but it was worth it ****.,positive,but it was worth it ****.
27480,all this flirting going on - the atg smiles...,neutral,all this flirting going on - the atg smiles. y...
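
Because the lookup matches whole whitespace-separated tokens, an acronym glued to punctuation is not replaced. The following sketch contrasts that behaviour with a regex-based variant that matches acronyms at word boundaries. `toy_dic`, `expand_tokens`, and `expand_with_boundaries` are hypothetical names, and the two-entry dictionary is an assumption for illustration, not the scraped one.

```python
import re

# Tiny stand-in for the scraped acronym dictionary.
toy_dic = {"lol": "laughing out loud", "u": "you"}


def expand_tokens(text, mapping):
    """Whitespace-token lookup: replace only tokens that match a key exactly."""
    return " ".join(mapping.get(word, word) for word in text.split())


def expand_with_boundaries(text, mapping):
    """Variant: replace acronyms wherever they appear as whole words."""
    # Longest keys first so a longer acronym wins over a shorter one.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, keys)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)


tweet = "lol! see u soon"
print(expand_tokens(tweet, toy_dic))           # lol! see you soon
print(expand_with_boundaries(tweet, toy_dic))  # laughing out loud! see you soon
```

Word-boundary matching is not free of side effects: it can also fire inside contractions or on keys that contain non-word characters, so it is worth spot-checking the output before adopting it.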


## Clean up tweets with REGEX

In [16]:
### removing website links, punctuation marks, numbers
import re

def clean_data(text):
    """Remove website links, punctuation marks, and numbers."""
    text = re.sub(r'https?://\S+|www\.\S+', "", text)  # links
    text = re.sub(r'[^\w\s]', "", text)                # anything that is not a word character or whitespace
    text = re.sub(r'\d+', "", text)                    # digits
    return text

df_tweets_final['clean_tweets'] = df_tweets_final['clean_tweets'].apply(clean_data)


In [17]:
df_tweets_final.tail(10)

Unnamed: 0,text,sentiment,clean_tweets
27470,lol i know and haha..did you fall asleep?? o...,negative,laughing out loud i know and hahadid you fall ...
27472,http://twitpic.com/663vr - wanted to visit the...,negative,wanted to visit the animals but whatever wer...
27473,in spoke to you yesterday and u didnt respond...,neutral,in spoke to you yesterday and you didnt respon...
27474,so i get up early and i feel good about the da...,positive,so i get up early and i feel good about the da...
27475,enjoy ur night,positive,enjoy you are night
27476,wish we could come see u on denver husband l...,negative,wish whatever could come see you on denver hus...
27477,i`ve wondered about rake to. the client has ...,negative,ive wondered about rake to the client has made...
27478,yay good for both of you. enjoy the break - y...,positive,yay good for both of you enjoy the break you ...
27479,but it was worth it ****.,positive,but it was worth it
27480,all this flirting going on - the atg smiles...,neutral,all this flirting going on the atg smiles yay...
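
To see what each substitution contributes, the sketch below applies three patterns equivalent to those in the cleaning cell, one at a time, to a made-up string. `sample` and the intermediate variable names are hypothetical.

```python
import re

sample = "check http://example.com now!!! 2 good_times, i`ve won"

no_links = re.sub(r"https?://\S+|www\.\S+", "", sample)
no_punct = re.sub(r"[^\w\s]", "", no_links)   # \w keeps letters, digits, and "_"
no_digits = re.sub(r"\d+", "", no_punct)

# Removed pieces leave their surrounding spaces behind, so collapse them.
collapsed = " ".join(no_digits.split())

print(repr(no_links))   # 'check  now!!! 2 good_times, i`ve won'
print(repr(no_punct))   # 'check  now 2 good_times ive won'
print(repr(no_digits))  # 'check  now  good_times ive won'
print(repr(collapsed))  # 'check now good_times ive won'
```

Two details stand out: underscores survive because `\w` includes them, and the substitutions leave doubled spaces that a later tokenizer will absorb but a plain string comparison will not.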


## Stemming 

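After cleaning up the text, I tokenized each tweet and passed every token through the Porter stemmer. Stemming strips suffixes so that related word forms collapse to a common stem, which decreases the number of distinct words in our corpus. The stem is not always a dictionary word; for example, 'bullying' becomes 'bulli'.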

In [18]:
from nltk.tokenize import word_tokenize  # needs NLTK's Punkt tokenizer data
from nltk.stem import PorterStemmer

# init stemmer
porter_stemmer = PorterStemmer()

def stemSentence(sentence):
    """Tokenize a sentence and return it with every token stemmed."""
    token_words = word_tokenize(sentence)
    return " ".join(porter_stemmer.stem(word) for word in token_words)


df_tweets_final['clean_tweets'] = df_tweets_final['clean_tweets'].apply(stemSentence)
df_tweets_final.tail()

Unnamed: 0,text,sentiment,clean_tweets
27476,wish we could come see u on denver husband l...,negative,wish whatev could come see you on denver husba...
27477,i`ve wondered about rake to. the client has ...,negative,ive wonder about rake to the client ha made it...
27478,yay good for both of you. enjoy the break - y...,positive,yay good for both of you enjoy the break you p...
27479,but it was worth it ****.,positive,but it wa worth it
27480,all this flirting going on - the atg smiles...,neutral,all thi flirt go on the atg smile yay hug


## Removing stop words

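We will remove stop words from our corpus. Stop words are common words in the English language (such as 'the', 'is', and 'and') that don't provide much value to our algorithms. Because this step runs after punctuation removal and stemming, altered forms such as 'whi' (from 'why') and 'couldnt' no longer match the stop word list and remain in the text.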

In [19]:
from nltk.corpus import stopwords  # needs the NLTK 'stopwords' corpus
stopword = set(stopwords.words('english'))  # set for fast membership tests

def removestopwords(sentence):
    '''Removing Stopwords'''
    token_words = word_tokenize(sentence)
    filtered_tweet = [word for word in token_words if word not in stopword]
    return " ".join(filtered_tweet)


df_tweets_final['clean_tweets'] = df_tweets_final['clean_tweets'].apply(removestopwords)
df_tweets_final.head()

Unnamed: 0,text,sentiment,clean_tweets
0,"i`d have responded, if i were going",neutral,id respond go
1,sooo sad i will miss you here in san diego!!!,negative,sooo sad miss san diego
2,my boss is bullying me...,negative,boss bulli
3,what interview! leave me alone,negative,interview leav alon
4,"sons of ****, why couldn`t they put them on t...",negative,son whi couldnt put releas whatev alreadi bought
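
Ordering matters when stemming and stop-word removal are combined. In the sketch below, a small hand-written stop list and a hard-coded stem table stand in for NLTK's resources. Both are assumptions for illustration (the stems 'wa', 'thi', and 'whi' are forms visible in the outputs of this notebook), and `stem` and `remove_stops` are hypothetical helpers.

```python
# Stand-ins for NLTK's stop-word list and Porter stemmer (illustrative only).
STOP_WORDS = {"was", "this", "why", "it", "but"}
STEMS = {"was": "wa", "this": "thi", "why": "whi", "flirting": "flirt"}


def stem(tokens):
    """Look up a stem for each token; unknown tokens pass through unchanged."""
    return [STEMS.get(tok, tok) for tok in tokens]


def remove_stops(tokens):
    """Drop tokens that appear in the stop list."""
    return [tok for tok in tokens if tok not in STOP_WORDS]


tokens = "but why was this flirting worth it".split()

stem_then_filter = remove_stops(stem(tokens))
filter_then_stem = stem(remove_stops(tokens))

print(stem_then_filter)  # ['whi', 'wa', 'thi', 'flirt', 'worth']
print(filter_then_stem)  # ['flirt', 'worth']
```

Filtering before stemming removes more stop words, because the list is matched against unaltered word forms.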


In [45]:
df_tweets_final.dropna(inplace=True)

In [46]:
## Saving final df for general machine learning 
#df_tweets_final.to_csv('clean_tweets.csv', index=False)

# Preprocessing Data (Deep Learning)

When preprocessing text for neural networks, it often helps to keep more of the original information. For these more complex networks, I repeated the earlier preprocessing steps but left out stemming and stop-word removal.

In [21]:
df_tweets_final_dl.head()

Unnamed: 0,text,sentiment
0,"I`d have responded, if I were going",neutral
1,Sooo SAD I will miss you here in San Diego!!!,negative
2,my boss is bullying me...,negative
3,what interview! leave me alone,negative
4,"Sons of ****, why couldn`t they put them on t...",negative


## Lowercasing + replacing acronyms with their full meaning

In [22]:
df_tweets_final_dl['text'] = df_tweets_final_dl['text'].astype(str).str.lower()

df_tweets_final_dl['clean_tweets'] = df_tweets_final_dl['text'].str.split().apply(lambda x:' '.join([final_dic.get(word,word) for word in x]))

df_tweets_final_dl.head()

Unnamed: 0,text,sentiment,clean_tweets
0,"i`d have responded, if i were going",neutral,"i`d have responded, if i were going"
1,sooo sad i will miss you here in san diego!!!,negative,sooo sad i will miss you here in san diego!!!
2,my boss is bullying me...,negative,my boss is bullying me...
3,what interview! leave me alone,negative,what interview! leave me alone
4,"sons of ****, why couldn`t they put them on t...",negative,"sons of ****, why couldn`t they put them on th..."


## Cleaning up tweets with Regex

In [23]:
#removing punctuation, numbers, and website links
df_tweets_final_dl['clean_tweets'] = df_tweets_final_dl['clean_tweets'].apply(clean_data)
df_tweets_final_dl.head()

Unnamed: 0,text,sentiment,clean_tweets
0,"i`d have responded, if i were going",neutral,id have responded if i were going
1,sooo sad i will miss you here in san diego!!!,negative,sooo sad i will miss you here in san diego
2,my boss is bullying me...,negative,my boss is bullying me
3,what interview! leave me alone,negative,what interview leave me alone
4,"sons of ****, why couldn`t they put them on t...",negative,sons of why couldnt they put them on the rele...


In [24]:
## Saving final df for deep learning 
# df_tweets_final_dl.to_csv('clean_tweets_dl.csv', index=False)