<b>In this notebook, I explain three text-encoding techniques available in sklearn: one-hot encoding, bag-of-words (CountVectorizer), and TF-IDF (term frequency-inverse document frequency).</b>

<b>To evaluate the features built with each of these encoding techniques, I train a MultinomialNB model, which is one of the popular text classification algorithms.</b>

<b>Finally, I measure test accuracy using the accuracy_score metric.</b>

# Importing Libraries

<b>The first step is to import all the libraries needed throughout the notebook.</b>

In [1]:
# general data preprocessing libraries
import pandas as pd           
import numpy as np
from tqdm import tqdm

# for text preprocessing
import spacy
from nltk.corpus import stopwords
from string import punctuation as punctuations
from nltk.tokenize import word_tokenize
import re
from nltk.stem import PorterStemmer
# Run once if the NLTK data is missing:
# import nltk; nltk.download('punkt'); nltk.download('stopwords')

# encoders used to build the training and testing features
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# model
from sklearn.naive_bayes import MultinomialNB

# metrics
from sklearn.metrics import accuracy_score

# to dump model
import pickle

stemmer = PorterStemmer()
stop_words = stopwords.words('english')
nlp = spacy.load('en_core_web_sm')




# Collecting Dataset

<b>After importing the libraries, let's load the dataset using pandas.</b>

In [2]:
df = pd.read_csv('tweets/training.csv', header=None)
df = df.rename(columns={0:'id', 1:'company', 2:'sentiment', 3:'raw_tweet'})
df = df[df['sentiment'] != 'Neutral']
df = df[df['sentiment'] != 'Irrelevant']
df = df.dropna()
train_tweets = df['raw_tweet'].values

In [3]:
test_df = pd.read_csv('tweets/validation.csv', header=None)
test_df = test_df.rename(columns={0:'id', 1:'company', 2:'sentiment', 3:'raw_tweet'})
test_df = test_df[test_df['sentiment'] != 'Neutral']
test_df = test_df[test_df['sentiment'] != 'Irrelevant']

test_tweets = test_df['raw_tweet'].values

# Text Preprocessing

<b>Now, create a helper function that preprocesses the text data using built-in and library functions. Preprocessing includes:</b>

    - removing URLs, punctuation, and stopwords
    - stemming words to shrink the vocabulary and merge different forms of the same word, etc.

In [4]:
def text_preprocessing(sentences):
    preprocessed_text_data = []
    for sentence in tqdm(sentences):
        doc = nlp(sentence)

        # drop tokens that look like URLs, then lowercase the rest
        link_free_doc = [token for token in doc if not token.like_url]
        lowering_docs = [token.text.lower() for token in link_free_doc]

        # strip punctuation characters and keep tokens longer than 2 characters
        without_puns_tokens = []
        for token in lowering_docs:
            word = ''.join(char for char in token if char not in punctuations)
            if len(word) > 2:
                without_puns_tokens.append(word)

        # remove stopwords, then stem what is left
        without_stopwords_tokens = [token for token in without_puns_tokens if token not in stop_words]
        stemmer_tokens = [stemmer.stem(token) for token in without_stopwords_tokens]

        preprocessed_text = ' '.join(stemmer_tokens)
        preprocessed_text_data.append(preprocessed_text)
    return preprocessed_text_data
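
<b>To see the effect of these steps on a single sentence, here is a minimal pure-Python sketch. It is a simplification and not the function above: it splits on whitespace instead of using spaCy, skips URL removal and stemming, and uses a tiny made-up stop-word set (TOY_STOP_WORDS) instead of NLTK's list. The function name simple_clean and the example sentence are also just for illustration.</b>

```python
from string import punctuation

# Hypothetical, tiny stop-word set for illustration only;
# the notebook itself uses NLTK's full English list.
TOY_STOP_WORDS = {"this", "was", "and", "the", "are"}


def simple_clean(sentence):
    """Lowercase, strip punctuation, drop short tokens and stop words."""
    cleaned = []
    for token in sentence.lower().split():
        # remove every punctuation character from the token
        word = "".join(ch for ch in token if ch not in punctuation)
        # same rule as the notebook: keep tokens longer than 2 characters
        if len(word) > 2 and word not in TOY_STOP_WORDS:
            cleaned.append(word)
    return " ".join(cleaned)


example = simple_clean("This game was AMAZING, and the graphics are great!!!")
print(example)  # game amazing graphics great
```

<b>In the real function, the Porter stemmer would additionally reduce each remaining word to its stem.</b>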

In [5]:
train_preprocessed_tweet = text_preprocessing(train_tweets)
test_preprocessed_tweet = text_preprocessing(test_tweets)

100%|██████████| 43013/43013 [08:57<00:00, 79.96it/s] 
100%|██████████| 543/543 [00:06<00:00, 87.11it/s]


In [6]:
def combine_sents(preprocessed_tweet):
    sent_data = []
    for sent in preprocessed_tweet:
        sent_data.append([sent])
    return sent_data

In [7]:
train_tweet = combine_sents(train_preprocessed_tweet)
test_tweet = combine_sents(test_preprocessed_tweet)

<b>The first and simplest encoding technique is one-hot encoding.</b>

<b>So let's start with it.</b>

# 1. One-Hot Encoding

In [8]:
encoder = OneHotEncoder(handle_unknown='infrequent_if_exist')

encoder.fit(train_tweet) # fit the encoder on the training data

one_hot_X_train = encoder.transform(train_tweet)
one_hot_X_test = encoder.transform(test_tweet)

one_hot_y_train = pd.get_dummies(df['sentiment'], dtype='int').values[:, 1:]
one_hot_y_test = pd.get_dummies(test_df['sentiment'], dtype='int').values[:, 1:]
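
<b>The label encoding above can be checked on a toy example. The sketch below uses a made-up three-row Series (toy_sentiment) and assumes only the two classes that remain after filtering, Negative and Positive. It shows that slicing with [:, 1:] keeps the Positive column as an (n, 1) array, and that ravel() flattens it to the 1-D shape sklearn classifiers expect.</b>

```python
import pandas as pd

# Made-up labels for illustration only.
toy_sentiment = pd.Series(["Positive", "Negative", "Positive"])

# get_dummies orders the columns alphabetically: Negative, Positive
dummies = pd.get_dummies(toy_sentiment, dtype="int")

# dropping the first column leaves only "Positive" (1 = Positive, 0 = Negative)
y_2d = dummies.values[:, 1:]   # column vector, shape (3, 1)
y_1d = y_2d.ravel()            # flat vector, shape (3,)

print(list(dummies.columns))
print(y_2d.shape, y_1d.shape)
```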

In [9]:
print(one_hot_X_train.shape, one_hot_y_train.shape)
print(one_hot_X_test.shape, one_hot_y_test.shape)

(43013, 34223) (43013, 1)
(543, 34223) (543, 1)
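
<b>The number of columns here equals the number of distinct preprocessed tweets, because each tweet is passed to the encoder as one categorical value. A minimal sketch with made-up data shows this behaviour. It uses handle_unknown="ignore", which, as far as I know, encodes an unseen category as an all-zero row just like 'infrequent_if_exist' does when no infrequent categories are configured.</b>

```python
from sklearn.preprocessing import OneHotEncoder

# Made-up "tweets" for illustration; each row holds one whole sentence.
toy_train = [["good game"], ["bad game"], ["good game"]]
toy_test = [["good game"], ["great game"]]  # the second sentence is unseen

toy_encoder = OneHotEncoder(handle_unknown="ignore")
toy_encoder.fit(toy_train)

train_matrix = toy_encoder.transform(toy_train).toarray()
test_matrix = toy_encoder.transform(toy_test).toarray()

# one column per distinct sentence, not per word
print(toy_encoder.categories_)
print(train_matrix)
# the unseen sentence becomes a row of zeros
print(test_matrix)
```

<b>So a test tweet only gets a non-zero feature when exactly the same preprocessed text also appears in the training data.</b>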


In [10]:
one_hot_nb_model = MultinomialNB()
# flatten the (n, 1) label column to 1-D to avoid sklearn's DataConversionWarning
one_hot_nb_model.fit(one_hot_X_train, one_hot_y_train.ravel())


In [11]:
one_hot_y_pred = one_hot_nb_model.predict(one_hot_X_test)

In [12]:
accuracy_score(one_hot_y_test, one_hot_y_pred)

0.858195211786372

In [13]:
print('Training accuracy', accuracy_score(one_hot_y_train, one_hot_nb_model.predict(one_hot_X_train)))
print('Testing accuracy', accuracy_score(one_hot_y_test, one_hot_nb_model.predict(one_hot_X_test)))

Training accuracy 0.982656406202776
Testing accuracy 0.858195211786372


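<b>The test accuracy (about 85.8%) is quite good.</b>

<b>But the training accuracy (about 98.3%) is much higher, so we can clearly see that this model is overfitting the data. This is expected: each whole tweet is passed to the encoder as a single category, so the features identify tweets seen during training rather than individual words.</b>

<b>Let's try another method: TF-IDF.</b>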

# 2. TF-IDF

In [14]:
tfidf_vectorizer = TfidfVectorizer()

tfidf_vectorizer.fit(train_preprocessed_tweet)

tfidf_X_train = tfidf_vectorizer.transform(train_preprocessed_tweet)
tfidf_X_test = tfidf_vectorizer.transform(test_preprocessed_tweet)

tfidf_y_train = one_hot_y_train
tfidf_y_test = one_hot_y_test

In [15]:
print(tfidf_X_train.shape, tfidf_y_train.shape)
print(tfidf_X_test.shape, tfidf_y_test.shape)

(43013, 14521) (43013, 1)
(543, 14521) (543, 1)
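
<b>Unlike the one-hot features, the TF-IDF columns correspond to individual words. The sketch below, on a made-up three-sentence corpus, illustrates the main idea: within a sentence, a word that is rare across the corpus gets a larger weight than a word that appears everywhere. It uses TfidfVectorizer with its default settings, as in the cell above.</b>

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus for illustration: "game" is in every sentence, "good" in one.
toy_corpus = ["good game", "bad game", "fun game"]

toy_vectorizer = TfidfVectorizer()
weights = toy_vectorizer.fit_transform(toy_corpus).toarray()
vocab = toy_vectorizer.vocabulary_  # maps each word to its column index

# weights of the two words in the first sentence
good_weight = weights[0, vocab["good"]]
game_weight = weights[0, vocab["game"]]

print(sorted(vocab))              # one column per word
print(good_weight > game_weight)  # the rarer word weighs more
```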


In [16]:
tfidf_nb_model = MultinomialNB()
# flatten the (n, 1) label column to 1-D to avoid sklearn's DataConversionWarning
tfidf_nb_model.fit(tfidf_X_train, tfidf_y_train.ravel())


In [17]:
tfidf_y_pred = tfidf_nb_model.predict(tfidf_X_test)

In [18]:
accuracy_score(tfidf_y_test, tfidf_y_pred)

0.9208103130755064

In [19]:
print('Training accuracy', accuracy_score(tfidf_y_train, tfidf_nb_model.predict(tfidf_X_train)))
print('Testing accuracy', accuracy_score(tfidf_y_test, tfidf_nb_model.predict(tfidf_X_test)))

Training accuracy 0.8934973147653035
Testing accuracy 0.9208103130755064


<b>It gives better test accuracy than the previous method (about 92.1% versus 85.8%), and its training accuracy (about 89.3%) is close to its test accuracy.</b>

<b>This suggests that it performs well on unseen data instead of memorizing the training set.</b>

<b>Let's try the last but not least method, CountVectorizer. It is similar to one-hot encoding, but it uses the bag-of-words technique: it stores how many times each word appears in a sentence, instead of a 1 or 0 that only indicates whether the word is present in the sentence or not.</b>

# 3. Count Vectorizer

In [20]:
count_vectorizer = CountVectorizer()

count_vectorizer.fit(train_preprocessed_tweet)

cv_X_train = count_vectorizer.transform(train_preprocessed_tweet)
cv_X_test = count_vectorizer.transform(test_preprocessed_tweet)

cv_y_train = tfidf_y_train
cv_y_test = tfidf_y_test

In [21]:
print(cv_X_train.shape, cv_y_train.shape)
print(cv_X_test.shape, cv_y_test.shape)

(43013, 14521) (43013, 1)
(543, 14521) (543, 1)
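
<b>The difference between word counts and plain presence flags can be seen in a small sketch on made-up sentences. CountVectorizer's binary=True option is used here only to show the presence/absence variant for comparison; the notebook itself uses the default counts.</b>

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sentences for illustration; "good" appears twice in the first one.
toy_docs = ["good good game", "bad game"]

# columns are ordered alphabetically: bad, game, good
count_matrix = CountVectorizer().fit_transform(toy_docs).toarray()
binary_matrix = CountVectorizer(binary=True).fit_transform(toy_docs).toarray()

print(count_matrix)   # first row stores 2 for "good"
print(binary_matrix)  # first row stores 1 for "good"
```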


In [22]:
cv_nb_model = MultinomialNB()
# flatten the (n, 1) label column to 1-D to avoid sklearn's DataConversionWarning
cv_nb_model.fit(cv_X_train, cv_y_train.ravel())


In [23]:
cv_y_pred = cv_nb_model.predict(cv_X_test)

In [24]:
accuracy_score(cv_y_test, cv_y_pred)

0.9005524861878453

In [25]:
print('Training accuracy', accuracy_score(cv_y_train, cv_nb_model.predict(cv_X_train)))
print('Testing accuracy', accuracy_score(cv_y_test, cv_nb_model.predict(cv_X_test)))

Training accuracy 0.8846395275846837
Testing accuracy 0.9005524861878453


<b>This one also gives better test accuracy than one-hot encoding, but slightly less than TF-IDF.</b>

# Save model

<b>Save the best model, which is the one trained on TF-IDF features.</b>

In [26]:
# use a context manager so the file is closed after writing
with open('best_nb_model.mdl', 'wb') as f:
    pickle.dump(tfidf_nb_model, f)

In [27]:
with open('best_nb_model.mdl', 'rb') as f:
    model = pickle.load(f)
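
<b>Note that the loaded model expects TF-IDF features, so the fitted vectorizer is needed again at prediction time. One possible way to keep both together, sketched below under my own assumptions, is to wrap them in an sklearn Pipeline and pickle that single object. The toy texts and labels are made up, and the sketch serializes to bytes with pickle.dumps only to avoid writing a file; this is not what the notebook above does.</b>

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up, already "preprocessed" training data for illustration only.
toy_texts = ["love game", "great fun game", "hate game", "terribl bug game"]
toy_labels = [1, 1, 0, 0]  # 1 = Positive, 0 = Negative

# the pipeline fits the vectorizer and the classifier together
toy_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_pipeline.fit(toy_texts, toy_labels)

# serialize and restore the whole pipeline as one object
blob = pickle.dumps(toy_pipeline)
restored = pickle.loads(blob)

# the restored object accepts text directly, no separate vectorizer needed
new_texts = ["love fun", "hate bug"]
original_pred = toy_pipeline.predict(new_texts)
restored_pred = restored.predict(new_texts)
print(original_pred, restored_pred)
```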