# Import:

This cell imports everything needed later:

1. pandas as pd: for reading the input files.
2. numpy as np: for array and matrix operations.
3. string, nltk: for tokenizing and preprocessing the data.
4. gensim: to train the Word2Vec model.
5. sklearn: for the SVM classifier, SVD decomposition, and metrics.

In [9]:
import pandas as pd
import numpy as np
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from gensim.models import Word2Vec
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.feature_extraction.text import TfidfVectorizer

# Tokenize:

In this cell, every text file is converted to a list of terms without any punctuation or stop words.

1. Punctuation carries little information for this task, so removing it helps document retrieval.
2. Removing stop words is important because they add little value to a text; dropping them makes the search more efficient and faster.
3. Every term is converted to lower case, and the same is done for queries, which makes searching and comparing easier. In the code below the lowercasing happens inside the stemmer, after the stop-word filter, so a capitalized stop word such as "The" is not filtered out.
4. The stem of each term is stored instead of the term itself.

The terms of each document are also collected as a set and unioned with the terms of the other documents, so we end up with all unique terms in the document collection.

In [10]:
def tokenize(documents):
    # Set the stop words for English
    stop_words = set(stopwords.words('english'))
    porter = PorterStemmer()
    
    # This is the final list of tokenized text in lists
    tokenized_list = []
    terms = set([])
    
    # For each document, use nltk.regexp_tokenize to split the text into tokens
    for doc in documents:
        tokenized_text = nltk.regexp_tokenize(doc, r'\d+,\d+|\w+')
        
        # Strip the ',' inside numbers such as "1,000" (the pattern keeps them as one token)
        tokenized_text = [token.replace(',', '') for token in tokenized_text]
                    
        # Remove the stop words because they add little to a text
        text_without_stop_words = [word for word in tokenized_text if word not in stop_words]
        
        # Then remove any punctuation and store the stem of each term
        text_without_punctuation = [porter.stem(word) for word in text_without_stop_words if word.isalnum()]
        
        tokenized_list.append(text_without_punctuation)
        
        # Getting unique terms till now
        terms = terms.union(set(text_without_punctuation))
    
    return tokenized_list, list(terms)
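
To see what the pattern `\d+,\d+|\w+` produces, here is a small sketch that uses only the standard library. It assumes `re.findall` as a stand-in for `nltk.regexp_tokenize` (for a pattern without capturing groups the two should yield the same tokens). The three-word stop list, the function name `simple_tokenize`, and the sample sentence are made up for illustration, and stemming is left out.

```python
import re

# Hypothetical mini stop list; the notebook uses nltk's English stop words
STOP_WORDS = {"the", "a", "it"}

def simple_tokenize(text):
    # Numbers with a thousands separator stay one token; other text splits on non-word characters
    tokens = re.findall(r'\d+,\d+|\w+', text)
    # Drop the comma inside numbers such as "1,000"
    tokens = [token.replace(',', '') for token in tokens]
    # Case-sensitive stop-word filter, followed by the alphanumeric check
    return [token for token in tokens if token not in STOP_WORDS and token.isalnum()]

sample_tokens = simple_tokenize("The movie cost 1,000 dollars, and it was great!")
print(sample_tokens)
```

In this sketch "it" is removed but "The" is kept, because the stop list is lowercase and the comparison is case-sensitive.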

# Inverted Index:

This function constructs the inverted index: a dictionary holding, for each term, the documents it occurs in, together with the document frequency and collection frequency of each term.

For each term in each document:
1. First check whether it is already in the dictionary.
2. If not, add it to the dictionary, with a value that stores its document and its frequency in that document.
3. If it is, check whether the document is new to the term's dictionary value.
4. If the document is new, add it to the term's documents.
5. If not, increment the term frequency for that document.

- The parts needed for constructing the TF-IDF matrix are commented out, so the function as written returns only the collection frequency.

Meanwhile, update the collection frequency (cf) and document frequency (df) for each term:
1. df is incremented when the term appears in a new document.
2. cf is incremented each time the term is seen.

In [11]:
def inverted_index(n_documents, tokenized_list):
    # This is the dictionary for inverted index that we will construct
    # inv_ind = dict()
    # Dictionaries to store document and collection frequency of each term.
    # df = dict()
    cf = dict()
    
    for doc_id in range(n_documents):
        for token_index in range(len(tokenized_list[doc_id])):
            token = tokenized_list[doc_id][token_index]
            # Checking if the term is already in the dictionary
            # if token in inv_ind:
            if token in cf:
                # If the document has already been added to the term's documents
                # if doc_id in list(inv_ind[token].keys()):
                #     inv_ind[token][doc_id] += 1
                #     cf[token] += 1
                # If this is a new document
                # else:
                #     inv_ind[token][doc_id] = 1
                #     df[token] += 1
                    cf[token] += 1
            # If this is a new term
            else:
                # inv_ind[token] = {doc_id:1}
                # df[token] = 1
                cf[token] = 1
                
    return cf #, df, inv_ind
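
For reference, the full bookkeeping described above (postings, df, and cf together) could be sketched as follows. This is an illustrative variant, not the notebook's function: the name `build_full_index` and the toy documents are assumptions, and df and cf are derived from the postings at the end instead of being updated inside the loop.

```python
from collections import defaultdict

def build_full_index(tokenized_list):
    # inv_ind[term][doc_id] -> term frequency of `term` in document `doc_id`
    inv_ind = defaultdict(dict)
    for doc_id, tokens in enumerate(tokenized_list):
        for token in tokens:
            inv_ind[token][doc_id] = inv_ind[token].get(doc_id, 0) + 1
    # df: number of documents containing the term; cf: total occurrences in the collection
    df = {term: len(postings) for term, postings in inv_ind.items()}
    cf = {term: sum(postings.values()) for term, postings in inv_ind.items()}
    return dict(inv_ind), df, cf

toy_docs = [["good", "movi", "good"], ["bad", "movi"]]
toy_inv_ind, toy_df, toy_cf = build_full_index(toy_docs)
print(toy_inv_ind)
print(toy_df)
print(toy_cf)
```

Here "good" occurs twice in one document (df 1, cf 2), while "movi" occurs once in each of two documents (df 2, cf 2), which shows how the two frequencies differ.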

# Construct tf_idf:

We construct the tf-idf matrix by calculating the tf and idf:

1. For each term, and for each document that contains the term:
2. Get the tf (stored in the inner dictionary, keyed by the document).
3. Get the idf using the formula $$idf_{t} = \log_2\left(\frac{N}{df_{t}}\right)$$ where $N$ is the number of documents and $df_{t}$ is stored in the df dictionary for the term.
4. Finally, divide each column vector (each document embedding) by its norm to normalize the matrix.

$\bullet$ $tf\_idf_{t,d} = tf_{t, d}\cdot\log_2\left(\frac{N}{df_{t}}\right)$ 

In [12]:
def construct_tf_idf_mat(inverted_index, df, terms, n_documents, n_terms):
    # The tf_idf matrix
    tf_idf = np.zeros((n_terms, n_documents), dtype='float')
    
    for i in range(n_terms):
        for j in list(inverted_index[terms[i]].keys()):
            tf = inverted_index[terms[i]][j]
            idf = np.log2(n_documents/df[terms[i]])
            tf_idf[i][j] = tf*idf
            
    for j in range(n_documents):
        # Calculating the Euclidean norm of the column vector
        s = np.linalg.norm(tf_idf[:, j])
        # Normalizing the column; an all-zero column is left as is to avoid dividing by zero
        if s > 0:
            tf_idf[:, j] /= s
            
    return tf_idf
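
As a worked example of the weighting and normalization, here is a vectorized sketch on a made-up term-by-document count matrix. The function name `tf_idf_from_counts` and the toy counts are assumptions for illustration, and the sketch assumes every term occurs in at least one document, so no df is zero.

```python
import numpy as np

def tf_idf_from_counts(counts):
    # counts[i][j] = raw frequency of term i in document j (terms x documents)
    counts = np.asarray(counts, dtype=float)
    n_documents = counts.shape[1]
    # Document frequency: in how many documents each term occurs
    df = np.count_nonzero(counts, axis=1)
    idf = np.log2(n_documents / df)
    weights = counts * idf[:, np.newaxis]
    # Normalize each document column to unit Euclidean length, skipping all-zero columns
    norms = np.linalg.norm(weights, axis=0)
    norms[norms == 0] = 1.0
    return weights / norms

toy_counts = [[2, 0, 1],   # term 0 occurs in documents 0 and 2
              [0, 3, 0],   # term 1 occurs only in document 1
              [1, 1, 0]]   # term 2 occurs in documents 0 and 1
toy_tf_idf = tf_idf_from_counts(toy_counts)
print(toy_tf_idf.round(3))
```

After normalization every non-empty column has length 1, so the dot product of two columns is their cosine similarity.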

# Naive Bayes:

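## Naive Bayes train function:

This function calculates the probabilities needed to score documents with the Naive Bayes algorithm, using add-one (Laplace) smoothing over the vocabulary $V$, where $T_{t,c}$ is the number of occurrences of term $t$ in class $c$:
$$ P(t|c) = \frac{T_{t,c} + 1}{\sum_{t'\in V} \left(T_{t',c} + 1\right)} $$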

In [13]:
def Naive_bayes_trainer(cf, terms):
    sum_of_all = 0
    for term in terms:
        if term in cf:
            sum_of_all += cf[term] + 1
        else:
            sum_of_all += 1
    
    naive_bayes = dict({})
    for term in list(cf.keys()):
        naive_bayes[term] = (cf[term] + 1)/sum_of_all
        
    return naive_bayes, sum_of_all
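
A small worked example of the smoothing may help. This sketch uses made-up counts and a hypothetical helper `smoothed_probabilities`; unlike the function above, it returns a probability for every vocabulary term, including unseen ones, so that the values can be checked to sum to 1.

```python
def smoothed_probabilities(class_counts, vocabulary):
    # Denominator: sum over the whole vocabulary of (count + 1)
    denominator = sum(class_counts.get(term, 0) + 1 for term in vocabulary)
    # Every vocabulary term gets a non-zero probability, including unseen ones
    return {term: (class_counts.get(term, 0) + 1) / denominator for term in vocabulary}

toy_vocabulary = ["good", "bad", "movi"]
toy_pos_counts = {"good": 3, "movi": 2}   # "bad" never occurs in the positive class
toy_probs = smoothed_probabilities(toy_pos_counts, toy_vocabulary)
print(toy_probs)
```

The denominator is (3 + 1) + (0 + 1) + (2 + 1) = 8, so the unseen term "bad" gets 1/8 instead of a zero that would wipe out the whole document score.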

# Predict:

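We score each document for each class with the log-likelihood below and predict the class with the higher score (the prior probabilities of the classes are considered to be equal). A term that was not seen in class $c$ gets the smoothed probability $1/\sum_{t'\in V}\left(T_{t',c} + 1\right)$.

$$ \log P(d|c) = \sum_{t \in d} \log P(t|c) $$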

In [14]:
def predict_naive_bayes(doc, stat_pos, stat_neg, p_pos, p_neg, den_pos, den_neg):
    pos = 0
    neg = 0
    
    for term in doc:
        if term in stat_pos:
            pos += np.log(stat_pos[term])
        else:
            pos -= np.log(den_pos)
            
        if term in stat_neg:
            neg += np.log(stat_neg[term])
        else:
            neg -= np.log(den_neg)
            
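    # Add the log prior of each class (in log space the prior is added, not multiplied)
    pos += np.log(p_pos)
    neg += np.log(p_neg)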
    
    if pos > neg:
        return 1
    elif neg > pos:
        return -1
    return np.random.choice([1, -1])
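
The reason for summing logarithms instead of multiplying probabilities can be shown with a short sketch. The probability values here are made up; the point is only that the direct product underflows for a document of realistic length while the log score stays usable.

```python
import math

toy_term_probs = [1e-5] * 100   # made-up P(t|c) values for a 100-term document

# Multiplying the probabilities directly underflows to 0.0
product = 1.0
for p in toy_term_probs:
    product *= p

# Summing logarithms keeps the score finite and comparable between classes
log_score = sum(math.log(p) for p in toy_term_probs)

print(product, log_score)
```

Because the logarithm is monotonic, comparing log scores picks the same class as comparing the products would, without the underflow.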

What we are doing in this cell:
1. Collect the data.
2. Tokenize each document.
3. Get the collection frequency of each term.

In [15]:
train_pos = pd.read_csv("./train_pos.csv", index_col=0).iloc[:, 1].values
train_neg = pd.read_csv("./train_neg.csv", index_col=0).iloc[:, 1].values
test_pos = pd.read_csv("./test_pos.csv", index_col=0).iloc[:, 1].values
test_neg = pd.read_csv("./test_neg.csv", index_col=0).iloc[:, 1].values

train_pos_doc, train_pos_terms = tokenize(train_pos)
train_neg_doc, train_neg_terms = tokenize(train_neg)
test_pos_doc, test_pos_terms = tokenize(test_pos)
test_neg_doc, test_neg_terms = tokenize(test_neg)

train_pos_cf = inverted_index(len(train_pos), train_pos_doc)
train_neg_cf = inverted_index(len(train_neg), train_neg_doc)
test_pos_cf = inverted_index(len(test_pos), test_pos_doc)
test_neg_cf = inverted_index(len(test_neg), test_neg_doc)

# tf_idf_train_pos = construct_tf_idf_mat(train_pos_inv_ind, train_pos_df, train_pos_terms, len(train_pos), len(train_pos_terms))
# tf_idf_train_neg = construct_tf_idf_mat(train_neg_inv_ind, train_neg_df, train_neg_terms, len(train_neg), len(train_neg_terms))
# tf_idf_test_pos = construct_tf_idf_mat(test_pos_inv_ind, test_pos_df, test_pos_terms, len(test_pos), len(test_pos_terms))
# tf_idf_test_neg = construct_tf_idf_mat(test_neg_inv_ind, test_neg_df, test_neg_terms, len(test_neg), len(test_neg_terms))

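1. Train the Naive Bayes classifier using the train data.
2. Test on the test data.
3. Calculate the accuracy for the positive class, the negative class, and the whole test set.

As we can see, the overall accuracy is 81.5%, which is not very good; the negative class (87.4%) is predicted noticeably better than the positive class (75.6%).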

In [16]:
terms = list(set(train_pos_terms).union(set(train_neg_terms)))

naive_bayes_pos, den_pos = Naive_bayes_trainer(train_pos_cf, terms)
naive_bayes_neg, den_neg = Naive_bayes_trainer(train_neg_cf, terms)

n_train_pos = len(train_pos)
n_train_neg = len(train_neg)
n_train = n_train_pos + n_train_neg
p_pos = n_train_pos/n_train
p_neg = n_train_neg/n_train

right_predict_pos = 0
for doc in test_pos_doc:
    p = predict_naive_bayes(doc, naive_bayes_pos, naive_bayes_neg, p_pos, p_neg, den_pos, den_neg)
    if p == 1:
        right_predict_pos += 1
        
right_predict_neg = 0
for doc in test_neg_doc:
    p = predict_naive_bayes(doc, naive_bayes_pos, naive_bayes_neg, p_pos, p_neg, den_pos, den_neg)
    if p == -1:
        right_predict_neg += 1
        
pos_per = right_predict_pos/len(test_pos)
neg_per = right_predict_neg/len(test_neg)
total_per = (right_predict_pos + right_predict_neg)/(len(test_pos) + len(test_neg))

print(f"accuracy of positive test prediction: {pos_per}")
print(f"accuracy of negative test prediction: {neg_per}")
print(f"total accuracy of test prediction: {total_per}")

accuracy of positive test prediction: 0.7564
accuracy of negative test prediction: 0.87416
total accuracy of test prediction: 0.81528
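
The three printed numbers can be cross-checked with a little arithmetic. The sketch below assumes 12,500 positive and 12,500 negative test documents; these sizes are not stated in the notebook, but they are consistent with the printed values.

```python
# Assumed test-set sizes (not stated in the notebook, but consistent with its output)
n_test_pos = 12500
n_test_neg = 12500

# Per-class accuracies printed above
pos_accuracy = 0.7564
neg_accuracy = 0.87416

# Correctly classified documents per class
correct_pos = round(pos_accuracy * n_test_pos)
correct_neg = round(neg_accuracy * n_test_neg)

total_accuracy = (correct_pos + correct_neg) / (n_test_pos + n_test_neg)
print(correct_pos, correct_neg, total_accuracy)
```

With equally sized classes the total accuracy is simply the mean of the two per-class accuracies.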


# Word2Vec:

# Training:

1. First we set up the train data.
2. Train the Word2Vec model with vector size 200 and window size 20.
3. Then get the model's word vectors.

In [9]:
train_data = list(train_pos_doc) + list(train_neg_doc)
model_w2v = Word2Vec(sentences=train_data, vector_size=200, window=20, min_count=1, workers=8)
word_vectors = model_w2v.wv

1. In this cell, we construct a $25000\times 200$ matrix each for train and test.
2. These matrices are based on the Word2Vec model trained in the previous cell.
3. The embedding of each document is the average of its terms' vectors; test terms missing from the model vocabulary are skipped. 

In [10]:
def average_embedding(doc, word_vectors, size=200):
    # Sum the vectors of the in-vocabulary words of the document
    embedding = np.zeros(size)
    for word in doc:
        if word in word_vectors:
            embedding += word_vectors[word]
    # Divide by the document length (guarding against empty documents)
    return embedding / max(len(doc), 1)


doc_embeddings_train = [average_embedding(doc, word_vectors) for doc in train_pos_doc + train_neg_doc]
doc_embeddings_test = [average_embedding(doc, word_vectors) for doc in test_pos_doc + test_neg_doc]
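
To make the averaging concrete, here is a toy sketch with made-up 3-dimensional vectors. A plain dictionary stands in for the trained model's vectors, and the name `mean_embedding` is hypothetical.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for the trained model's vectors
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "movi": np.array([3.0, 2.0, 0.0]),
}

def mean_embedding(doc, vectors, size):
    # Keep only the words the lookup knows, then sum their vectors
    known = [vectors[word] for word in doc if word in vectors]
    total = np.sum(known, axis=0) if known else np.zeros(size)
    # Divide by the full document length, so unknown words pull the average toward zero
    return total / max(len(doc), 1)

toy_embedding = mean_embedding(["good", "movi", "unseen", "good"], toy_vectors, size=3)
print(toy_embedding)
```

Dividing by the full document length mirrors the `len(doc)` divisor used in this notebook; dividing by the number of known words instead would be an equally reasonable choice.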

1. Set up and train the SVM classifier.
2. Predict the test data.
3. Calculate the accuracy of the test predictions.

In [13]:
train_labels = [1 for i in range(len(train_pos))] + [0 for i in range(len(train_neg))]
test_labels = [1 for i in range(len(test_pos))] + [0 for i in range(len(test_neg))]

# Set the SVM classifier
svm_Word2Vec = SVC()
svm_Word2Vec.fit(doc_embeddings_train, train_labels)

# Predictions on the test set
predictions = svm_Word2Vec.predict(doc_embeddings_test)

# Evaluate accuracy
accuracy = accuracy_score(test_labels, predictions)
print(f"Accuracy on SVM for Word2Vec embeddings with length 200: {accuracy}")

Accuracy on SVM for Word2Vec embeddings with length 200: 0.83668


These are the precision, recall and F1-score metrics. Precision (83.9%), recall (83.3%) and F1 (83.6%) are all close to the accuracy (83.7%).

In [14]:
precision = precision_score(test_labels, predictions)
recall = recall_score(test_labels, predictions)
f1 = f1_score(test_labels, predictions)

print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

Precision: 0.839477292893442
Recall: 0.83256
F1 Score: 0.8360043378720328
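
These metrics all follow from the four confusion counts. The sketch below uses counts inferred from the printed values under the assumption of 12,500 positive and 12,500 negative test documents; the notebook does not print the counts themselves, so treat them as a consistency check and not as reported results.

```python
# Confusion counts inferred from the printed metrics, assuming 12,500 positive
# and 12,500 negative test documents (an assumption, not stated in the notebook)
true_pos = 10407
false_pos = 1990
false_neg = 12500 - true_pos
true_neg = 12500 - false_pos

precision_check = true_pos / (true_pos + false_pos)
recall_check = true_pos / (true_pos + false_neg)
# F1 is the harmonic mean of precision and recall
f1_check = 2 * precision_check * recall_check / (precision_check + recall_check)
accuracy_check = (true_pos + true_neg) / 25000

print(precision_check, recall_check, f1_check, accuracy_check)
```

Since F1 is the harmonic mean of precision and recall, it always lies between the two, which is why all three values sit so close together here.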


# LSA:

In [17]:
data_train = list(train_pos) + list(train_neg)
data_test = list(test_pos) + list(test_neg)

# Set the vectorizer: fit the vocabulary and idf weights on the train data only
vectorizer = TfidfVectorizer(stop_words='english')
tf_idf_train = vectorizer.fit_transform(data_train)
tf_idf_test = vectorizer.transform(data_test)

1. Set up the SVD using TruncatedSVD.
2. Fit the LSA model on the tf-idf matrix constructed in the previous cell.
3. Transform the train and test data with the fitted LSA model.

In [18]:
lsa = TruncatedSVD(n_components=200, random_state=123)
train_lsa = lsa.fit_transform(tf_idf_train)
test_lsa = lsa.transform(tf_idf_test)
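
What the reduction does can be sketched with an exact SVD in NumPy. This is a stand-in for illustration only: the toy matrix is made up, and `np.linalg.svd` is swapped in for `TruncatedSVD`, which uses an approximate solver and may return components with different signs.

```python
import numpy as np

# Made-up 4-document x 3-term matrix standing in for the tf-idf matrix
toy_matrix = np.array([[1.0, 0.0, 1.0],
                       [0.9, 0.1, 1.1],
                       [0.0, 1.0, 0.0],
                       [0.1, 0.9, 0.0]])

k = 2  # number of latent components kept
U, singular_values, Vt = np.linalg.svd(toy_matrix, full_matrices=False)

# Fitted components: the top-k right singular vectors (one row per component)
components = Vt[:k]

# Reduced representation: project each document onto the components
toy_reduced = toy_matrix @ components.T
print(toy_reduced.round(3))
```

Unseen documents are reduced with the same projection (`new_matrix @ components.T`), which is why the model is fitted on the train data only and then applied to the test data.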

1. Set the labels for train and test (in the order in which we concatenated them).
2. Set up the SVM classifier.
3. Train the SVM using the train data.
4. Predict the test data.
5. Calculate the accuracy.

The accuracy of the SVM on the LSA document embeddings is 86.6%, which is better than Naive Bayes (81.5%).

In [19]:
train_labels = [1 for i in range(len(train_pos))] + [0 for i in range(len(train_neg))]
test_labels = [1 for i in range(len(test_pos))] + [0 for i in range(len(test_neg))]

svm_LSA = SVC()
svm_LSA.fit(train_lsa, train_labels)

# Predictions on the test set
predictions_lsa = svm_LSA.predict(test_lsa)

# Evaluate accuracy
accuracy = accuracy_score(test_labels, predictions_lsa)
print(f"Accuracy for SVM on document embeddings gained from LSA: {accuracy}")

Accuracy for SVM on document embeddings gained from LSA: 0.86608


These are the precision, recall and F1-score metrics. Precision (87.1%), recall (86.0%) and F1 (86.5%) are all close to the accuracy (86.6%).

In [21]:
precision_LSA = precision_score(test_labels, predictions_lsa)
recall_LSA = recall_score(test_labels, predictions_lsa)
f1_LSA = f1_score(test_labels, predictions_lsa)

print("Precision:", precision_LSA)
print("Recall:", recall_LSA)
print("F1 Score:", f1_LSA)

Precision: 0.8708866915221267
Recall: 0.8596
F1 Score: 0.8652065383686288


# Conclusion:

All three approaches perform broadly similarly at classifying the test data based on the train data, but the SVM on LSA embeddings gives the best predictions on the test set, with an accuracy of 86.6%. Naive Bayes reaches 81.5% and the SVM on Word2Vec embeddings 83.7%.