A Natural Language Processing project that aims to classify the sentiment of tweets as negative or positive.

In [3]:
#importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import nltk
from nltk import pos_tag
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize, regexp_tokenize, RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer
import string
import re
import os


## DATA MUNGING

In [4]:
#import stop words
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /home/ezzy/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [5]:
#file path
file_path = "/home/ezzy/Desktop/repos/Text-classifier/judge-1377884607_tweet_product_company.csv"
#os.path.join with a single argument returns the path unchanged
path = os.path.join(file_path)
#list of different types of encodings
encodings = ["utf-8", "latin1", "iso-8859-1", "cp1252"]
#try each encoding in turn until the file is read into a DataFrame
for encode in encodings:
    try:
        tweet_df = pd.read_csv(path, encoding=encode)
        print(f"File read successfully with encoding: {encode}")
        break
    except UnicodeDecodeError:
        #this encoding cannot decode the file, so move on to the next one
        print(f"Failed to read file with encoding {encode}: ")

Failed to read file with encoding utf-8: 
File read successfully with encoding: latin1


##### Inspecting our data 
* Looking at what the DataFrame looks like (how many rows and columns are there?)
* Checking for null values
* Checking for duplicates
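
The three checks above could be run along these lines. This is a minimal sketch on a small made-up DataFrame (`toy_df` and its rows are assumptions for illustration, not the real tweet data); the same calls would apply to `tweet_df`.

```python
import pandas as pd

# Hypothetical toy data standing in for the tweet DataFrame
toy_df = pd.DataFrame({
    "tweet_text": ["great phone", "great phone", None, "battery died"],
    "label": ["Positive emotion", "Positive emotion",
              "Negative emotion", "Negative emotion"],
})

# Number of rows and columns
n_rows, n_cols = toy_df.shape

# Missing values per column
null_counts = toy_df.isna().sum()

# Fully duplicated rows (every column equal to an earlier row)
n_duplicates = int(toy_df.duplicated().sum())

print(n_rows, n_cols)
print(null_counts)
print(n_duplicates)
```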

In [6]:
tweet_df

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion
...,...,...,...
9088,Ipad everywhere. #SXSW {link},iPad,Positive emotion
9089,"Wave, buzz... RT @mention We interrupt your re...",,No emotion toward brand or product
9090,"Google's Zeiger, a physician never reported po...",,No emotion toward brand or product
9091,Some Verizon iPhone customers complained their...,,No emotion toward brand or product


Looking at the head and tail of the DataFrame, we can see three columns: `tweet_text`, the text of the tweet (including any usernames it mentions); `emotion_in_tweet_is_directed_at`, the product or brand the sentiment is aimed at; and `is_there_an_emotion_directed_at_a_brand_or_product`, the sentiment label, which is one of positive emotion, negative emotion, no emotion toward brand or product, or I can't tell. The DataFrame has 9093 records.

In [7]:
tweet_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


The info of the DataFrame tells us that `tweet_text` has 1 missing value, `emotion_in_tweet_is_directed_at` has the most missing values (5802 records), and `is_there_an_emotion_directed_at_a_brand_or_product` has none. All three columns have the `object` dtype, which in our case means strings.
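
The missing-value counts follow from the `info()` output by subtracting each non-null count from the total number of rows. A quick sketch of that arithmetic, with the numbers copied from the output above (the names `non_null`, `missing`, and `missing_pct` are only for illustration):

```python
# Figures copied from the tweet_df.info() output
total_rows = 9093
non_null = {
    "tweet_text": 9092,
    "emotion_in_tweet_is_directed_at": 3291,
    "is_there_an_emotion_directed_at_a_brand_or_product": 9093,
}

# Missing count and percentage of rows missing, per column
missing = {col: total_rows - count for col, count in non_null.items()}
missing_pct = {col: round(100 * n / total_rows, 2) for col, n in missing.items()}

for col in non_null:
    print(f"{col}: {missing[col]} missing ({missing_pct[col]}%)")
```

On the real DataFrame, `tweet_df.isna().sum()` should give the same counts directly.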

In [8]:
tweet_df.describe()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
count,9092,3291,9093
unique,9065,9,4
top,RT @mention Marissa Mayer: Google Will Connect...,iPad,No emotion toward brand or product
freq,5,946,5389


From the descriptive statistics we see that 9065 of the 9092 non-null tweets are unique, so a small number of tweets appear more than once; the most frequent one, a retweet, occurs 5 times. The most frequently targeted product is `iPad` (946 tweets), and the most common label is `No emotion toward brand or product` (5389 tweets).

#### Dealing with null values
Two columns have NaN values in their records. To deal with this, I will either drop the NaN values or impute them using the mode, since we are working with categorical data. The `emotion_in_tweet_is_directed_at` column has 5802 missing records, which is about 63.81% of the data. This is hard to impute: nearly two thirds of the column is missing, and filling in that many values with the mode could distort the data and lead to wrong insights or decisions. The `tweet_text` column has only one missing value, and since the text of a tweet cannot be inferred from the other columns, we will drop that row.
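
One way this plan might look in code, shown on a small made-up DataFrame (`toy_df` and its rows are hypothetical; only the column names come from the dataset). It drops the row without tweet text and leaves the mostly empty product column unimputed; whether to keep or drop that column is left open here.

```python
import pandas as pd

# Hypothetical rows using the dataset's column names
toy_df = pd.DataFrame({
    "tweet_text": ["Love my new iPad", None, "Google maps got me lost"],
    "emotion_in_tweet_is_directed_at": ["iPad", None, None],
})

# Drop only the rows whose tweet text is missing
cleaned_df = toy_df.dropna(subset=["tweet_text"]).reset_index(drop=True)

# Fraction of values still missing in the product column
remaining_missing = cleaned_df["emotion_in_tweet_is_directed_at"].isna().mean()

print(len(cleaned_df))
print(remaining_missing)
```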

In [26]:
tweet_df["emotion_in_tweet_is_directed_at"].value_counts()

emotion_in_tweet_is_directed_at
iPad                               946
Apple                              661
iPad or iPhone App                 470
Google                             430
iPhone                             297
Other Google product or service    293
Android App                         81
Android                             78
Other Apple product or service      35
Name: count, dtype: int64

#### Text preprocessing

This involves:
* Removing stop words (a, an, and, that, etc.): words that appear many times in our text but have little semantic value.
* Removing punctuation: hyphens, full stops, hashtags, parentheses.
* Lowercasing all the words.
* Removing tokens that are not in the vocabulary (e.g. usernames).

I will use NLTK to perform the above steps.
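
The steps listed above can be sketched with the standard library alone. This is an illustration, not the notebook's final pipeline: the function name `preprocess`, the tiny `STOP_WORDS` set (a stand-in for NLTK's English stop word list), the `@\w+` username pattern, and the sample sentence are all assumptions.

```python
import re
import string

# Hypothetical stand-in for nltk.corpus.stopwords.words("english")
STOP_WORDS = {"a", "an", "and", "that", "the", "i", "have", "is", "at"}

def preprocess(text):
    # Lowercase the text
    text = text.lower()
    # Remove usernames such as @wesley83
    text = re.sub(r"@\w+", "", text)
    # Remove punctuation, including '#' and parentheses
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace and drop stop words
    return [token for token in text.split() if token not in STOP_WORDS]

print(preprocess(".@wesley83 I have a 3G iPhone and a dead battery at #SXSW"))
```

Lowercasing comes first so that the stop word comparison is case-insensitive, and usernames are removed before punctuation so that the `@` marker is still present when the pattern runs.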

In [22]:
#function to help us preprocess our text using regex
def pattern_remover(text, pattern):
    #remove every match of the regex pattern from the text;
    #substituting the pattern directly avoids treating matched text as a new regex
    return re.sub(pattern, "", text)

Removing usernames from the tweets.
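
For instance, usernames could be stripped with a pattern like `@\w+` (an assumption; the notebook does not show the pattern it uses). The sketch below defines its own small helper, the hypothetical `remove_usernames`, so that it runs on its own:

```python
import re

def remove_usernames(text):
    # Delete every @handle: an '@' followed by letters, digits or underscores
    return re.sub(r"@\w+", "", text)

sample = "@jessedee Know about @fludapp ?"
print(remove_usernames(sample))
```

The removal leaves the surrounding whitespace in place, so a later tokenization or `split()` step would be needed to tidy up the extra spaces.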

In [30]:
tweet_df["is_there_an_emotion_directed_at_a_brand_or_product"].value_counts()

is_there_an_emotion_directed_at_a_brand_or_product
No emotion toward brand or product    5389
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: count, dtype: int64