<a href="https://colab.research.google.com/github/rabimist/Deep-Learning-for-Natural-Language-Processing/blob/main/Movie_Review.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Name:** Deen Mohammad Abdullah

**Student ID:** 001209782

The purpose of this project is to train and test a sentiment classifier on the NLTK Movie Review dataset using word-count features (a bag-of-words representation). Here, I have used sklearn's CountVectorizer to convert each review into a vector of word counts and Logistic Regression as the model.
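
To make the vectorization step concrete, the sketch below counts words by hand for two toy reviews. It is a minimal pure-Python illustration of the idea, not sklearn's implementation: the helper names `build_vocabulary` and `count_vector` and the toy reviews are hypothetical, and the real CountVectorizer additionally lowercases the text, ignores one-character tokens by default, and returns a sparse matrix.

```python
from collections import Counter


def build_vocabulary(docs):
    """Map each distinct word to a column index, in alphabetical order."""
    words = sorted({word for doc in docs for word in doc.split()})
    return {word: index for index, word in enumerate(words)}


def count_vector(doc, vocabulary):
    """Return one row of the feature matrix: a count per vocabulary word."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


# Hypothetical toy reviews, not taken from the dataset.
toy_reviews = ["great movie great cast", "boring movie"]
vocabulary = build_vocabulary(toy_reviews)
matrix = [count_vector(doc, vocabulary) for doc in toy_reviews]

print(vocabulary)  # {'boring': 0, 'cast': 1, 'great': 2, 'movie': 3}
print(matrix)      # [[0, 1, 2, 1], [1, 0, 0, 1]]
```

Each row has one column per vocabulary word, which is why the feature matrix in this project has as many columns as there are distinct words left after preprocessing.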

**Deep Learning for Natural Language Processing**


Execute the following code to run the project:

In [None]:
####################### Importing Packages ######################
import numpy as np
import nltk
nltk.download('movie_reviews') # -------------------------------- downloading the movie_review dataset
nltk.download('stopwords') # ------------------------------------ downloading the stopwords from NLTK
nltk.download('wordnet')  # ------------------------------------- for lemmatization we need to download wordnet (WordNetLemmatizer)
from nltk.corpus import movie_reviews # ------------------------- importing the movie_review dataset
from nltk.corpus import stopwords # ----------------------------- importing the stopwords
from nltk.stem import WordNetLemmatizer # ----------------------- importing wordNetLemmatizer for lemmatization
import re # ----------------------------------------------------- importing regularExpression to extract only text from the dataset
from sklearn import preprocessing  # ---------------------------- to process the movie categories into vectors as labels
from sklearn.feature_extraction.text import CountVectorizer  # -- to perform the wordToVector operation and create featuresMatrix
from sklearn.model_selection import train_test_split # ---------- splitting the dataset into training and testing sets
from sklearn.linear_model import LogisticRegression # ----------- the regression model to classify the reviews
from sklearn import metrics # ----------------------------------- to print the accuracy, precision and recall scores of the model
#################################################################

#-------- This function keeps only tokens that begin with an ASCII letter (lowercased); punctuation and digit-leading tokens are dropped -----------
def removeSpecialCharacter(word_list):
  cleanWordList = []
  
  for word in word_list:
    if (re.match('[a-zA-Z]+', word)):
      cleanWordList.append(word.lower())
  
  return cleanWordList
#-----------------------------------------------------------------------

#--------- This function removes English stop words from text ----------
def removeStopWords (word_list):
  stop_words = set(stopwords.words('english'))
  
  filteredWords = [] 
  
  for word in word_list:
    if word not in stop_words: 
      filteredWords.append(word)
      
  return filteredWords
#------------------------------------------------------------------------

#---------- This function uses WordNet and lemmatizes the text ----------
def lemmatize (word_list):
  lemmatizer = WordNetLemmatizer()
  
  filteredWords = []
  
  for word in word_list:
    filteredWords.append(lemmatizer.lemmatize(word))
    
  return filteredWords
#------------------------------------------------------------------------

############# The Executable Statements start from here ########################

# taking all the reviews and their corresponding category into document
document = [(movie_reviews.words(file_id),category) for file_id in movie_reviews.fileids() for category in movie_reviews.categories(file_id)] 

review = []
categories = []

# For each review:
#     removing the special characters and stop words
#     lemmatizing the text
for (word_list,category) in document:
  word_list = removeSpecialCharacter (word_list)
  word_list = removeStopWords (word_list)
  word_list = lemmatize (word_list)
  
  # joining the cleaned words back into a single space-separated string
  text = ' '.join(word_list)
  
  review.append(text)
  categories.append(category)

# generating featuresMatrix of word counts (bag-of-words); note that CountVectorizer is not a Word2Vec embedding
vectorizer = CountVectorizer()
featuresMatrix = vectorizer.fit_transform(review)

# processing the categories into labels vector
leb = preprocessing.LabelEncoder()
labels = leb.fit_transform(categories)

print ('Feature Matrix Size: ' + str(featuresMatrix.shape))
print ('Labels Size: ' + str(labels.shape))

# splitting the dataset into training and testing sets (training 90% and testing 10%)
X_train, X_test, y_train, y_test = train_test_split(featuresMatrix, labels, test_size=0.10)

# Train the model
# multi_class is omitted: 'auto' is already the default, and the parameter is deprecated in recent scikit-learn releases
model = LogisticRegression(random_state = 0, solver = 'lbfgs', max_iter = 20000)
model.fit (X_train, y_train)

# Test the model
result = model.predict (X_test)

# Displaying the performance
print ('Precision score of the model: ' + str(metrics.precision_score(y_test, result, average='macro', labels = np.unique (result))))
print('Recall score of the model: ' + str(metrics.recall_score (y_test, result, average = 'macro',labels = np.unique(result))))
print ('Accuracy of the model: ' + str(metrics.accuracy_score(y_test, result)))


[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Feature Matrix Size: (2000, 34442)
Labels Size: (2000,)
Precision score of the model: 0.8243107769423559
Recall score of the model: 0.824668807707748
Accuracy of the model: 0.825
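
The precision and recall above are macro averages: each score is computed per class and the per-class values are then averaged with equal weight. The following is a minimal pure-Python sketch of that calculation, assuming the standard definitions; the function name `macro_precision_recall` and the toy labels are hypothetical, and it averages over every label that appears in either list, whereas the code above restricts the average to the labels found in the predictions.

```python
def macro_precision_recall(y_true, y_pred):
    """Average per-class precision and recall with equal weight per class."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for label in labels:
        # true positives: predicted this label and it was correct
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)  # tp + false positives
        actual = sum(1 for t in y_true if t == label)     # tp + false negatives
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    return sum(precisions) / len(labels), sum(recalls) / len(labels)


# Hypothetical toy labels (0 = neg, 1 = pos), not results from the model.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

precision, recall = macro_precision_recall(y_true, y_pred)
print(round(precision, 3), round(recall, 3))  # 0.633 0.625
```

In this toy case the two classes are the same size, so the macro recall equals the accuracy (5 of 8 correct).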
