# NLP: Text Classification Model on Movie Reviews

This notebook builds a text classification model for sentiment analysis of movie reviews. It cleans the review text, converts it to TF-IDF features, and trains a logistic regression classifier that labels each review as positive or negative. On a held-out 20% test split, the model reaches about 85% accuracy. A classifier like this can help filmmakers, critics, and movie enthusiasts gauge how a film was received.

In [186]:
import numpy as np
import re
import pickle
from sklearn.datasets import load_files
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\freta\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [187]:
cd "C:\Users\freta\Desktop\NLP Data"

C:\Users\freta\Desktop\NLP Data


# Importing the Dataset (for small datasets only)

In [188]:
review = load_files('sent_txt')

In [189]:
X, y = review.data, review.target
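
`load_files` expects one subfolder per class and one text file per document. The sketch below builds a throwaway dataset with that layout to show what comes back. The folder names `neg` and `pos` and the sample sentences are assumptions for illustration; the actual contents of `sent_txt` are not shown in this notebook.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Build a throwaway dataset: one subfolder per class, one text file per review.
# The folder names "neg" and "pos" are assumptions, not taken from the notebook.
root = Path(tempfile.mkdtemp())
samples = {
    "neg": ["a dull, lifeless film", "the plot made no sense"],
    "pos": ["a wonderful, moving story", "brilliant acting throughout"],
}
for label, texts in samples.items():
    (root / label).mkdir()
    for i, text in enumerate(texts):
        (root / label / f"review_{i}.txt").write_text(text, encoding="utf-8")

demo = load_files(str(root), shuffle=False)

print(demo.target_names)    # class names, taken from the sorted folder names
print(type(demo.data[0]))   # raw bytes, because no encoding was requested
print(demo.target)          # one integer label per document
```

Because the documents come back as bytes by default, any later text processing has to either decode them or work on their string representation.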

# Persisting the Data (for large datasets)

In [190]:
# Serialize the data (storing the data as Pickle files)

In [191]:
with open('X.pickle', 'wb') as f:
    pickle.dump(X, f)

In [192]:
with open('y.pickle', 'wb') as f:
    pickle.dump(y, f)

In [193]:
# Deserialize the data

In [194]:
with open('X.pickle', 'rb') as f:
    XX = pickle.load(f)

In [195]:
with open('y.pickle', 'rb') as f:
    yy = pickle.load(f)
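
The serialize and deserialize cells above can be wrapped in two small helpers. This is a minimal sketch using only the standard library; the helper names `save_pickle` and `load_pickle` and the stand-in data are hypothetical and not part of the notebook.

```python
import pickle
import tempfile
from pathlib import Path


def save_pickle(obj, path):
    """Serialize obj to path using pickle."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Load a pickled object. Only use this on files you created yourself."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in data; in the notebook these would be X and y from load_files.
docs = [b"great movie", b"terrible movie"]
labels = [1, 0]

tmp_dir = Path(tempfile.mkdtemp())
save_pickle(docs, tmp_dir / "X.pickle")
save_pickle(labels, tmp_dir / "y.pickle")

docs_restored = load_pickle(tmp_dir / "X.pickle")
labels_restored = load_pickle(tmp_dir / "y.pickle")

# The restored objects should compare equal to the originals.
round_trip_ok = docs_restored == docs and labels_restored == labels
print(round_trip_ok)
```

Unpickling executes code embedded in the file, so pickle files from untrusted sources should not be loaded.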

# Pre-processing The Data

In [196]:
# Creating Corpus

corpus = []
for i in range(len(X)):
    data = re.sub(r'\W', ' ', str(X[i]))  # replace all non-word characters with a space
    data = data.lower()
    data = re.sub(r'\s+[a-z]\s+', ' ', data)  # remove single characters in the middle of a sentence
    data = re.sub(r'^[a-z]\s+', ' ', data)  # remove a single character at the beginning of the text
    data = re.sub(r'\s+', ' ', data)  # collapse runs of whitespace into a single space
    corpus.append(data)
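
The cleaning steps in the loop above can also be written as a reusable function. The sketch below is one possible variant, not the notebook's exact code: it assumes the reviews are UTF-8 bytes and decodes them first, whereas the notebook calls `str()` on the raw bytes (which yields the `b'...'` representation and leaves it to the regexes to strip the leftovers). It also trims leading and trailing spaces. The name `clean_text` is hypothetical.

```python
import re


def clean_text(raw):
    """Apply the notebook's cleaning steps to one document.

    Assumption: bytes input is decoded as UTF-8 first, whereas the notebook
    calls str() on the raw bytes.
    """
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8", errors="replace")
    text = re.sub(r"\W", " ", raw)            # non-word characters -> space
    text = text.lower()
    text = re.sub(r"\s+[a-z]\s+", " ", text)  # single letters inside the text
    text = re.sub(r"^[a-z]\s+", " ", text)    # single letter at the start
    text = re.sub(r"\s+", " ", text)          # collapse whitespace runs
    return text.strip()


sample = b"A GREAT film!\nWatch it again & again."
cleaned = clean_text(sample)
print(cleaned)  # -> great film watch it again again
```

Switching to this variant would change the cleaned text slightly, so the scores reported later in the notebook may not reproduce exactly.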

#### Create Bag of Words Model - (Optional)

In [197]:
#from sklearn.feature_extraction.text import CountVectorizer

In [198]:
#vectorizer = CountVectorizer(max_features = 2000, min_df = 3, max_df = 0.6, stop_words = stopwords.words('english'))

In [199]:
#X=vectorizer.fit_transform(corpus).toarray()

#### Transform Bag of Words Model into TF-IDF Model - (Optional)

In [200]:
#from sklearn.feature_extraction.text import TfidfTransformer 

In [201]:
#transformer = TfidfTransformer()

In [202]:
#X = transformer.fit_transform(X).toarray()

# Creating TF-IDF Vectorizer 

In [234]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [236]:
vectorizer = TfidfVectorizer(max_features=2000, min_df=3, max_df=0.6, stop_words=stopwords.words('english'))
X = vectorizer.fit_transform(corpus).toarray()
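
To see what the document-frequency filters do, here is a small sketch on a toy corpus. The three sentences and the `min_df=2` setting are assumptions chosen so the effect is visible on tiny data; the notebook itself uses `min_df=3` and `max_df=0.6` on the full review corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = [
    "great film great cast",
    "boring film weak cast",
    "great story",
]

# min_df=2 keeps only terms that occur in at least two documents, so
# "boring", "weak" and "story" (one document each) are dropped.
toy_vectorizer = TfidfVectorizer(min_df=2)
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)

vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)             # -> ['cast', 'film', 'great']
print(toy_matrix.shape)  # -> (3, 3): one row per document, one column per term
```

`fit_transform` returns a sparse matrix. The notebook converts it with `.toarray()`, but `LogisticRegression` also accepts the sparse form, which uses less memory on larger corpora.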

## Creating a Training and Testing Set

In [203]:
from sklearn.model_selection import train_test_split

In [204]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
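
The split above is purely random, so the class balance of the test set can differ from that of the full dataset. If equal proportions matter, `train_test_split` accepts a `stratify` argument. The sketch below uses made-up balanced data to show the effect; it is an optional variation, not what the notebook does.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, evenly split between two classes.
features = [[i] for i in range(100)]
labels = [0] * 50 + [1] * 50

# stratify=labels keeps the class proportions the same in both parts.
_, _, _, y_test_demo = train_test_split(
    features, labels, test_size=0.2, random_state=0, stratify=labels
)

test_counts = Counter(y_test_demo)
print(test_counts)
```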

## Training our Classifier 

In [205]:
from sklearn.linear_model import LogisticRegression 

In [206]:
lr = LogisticRegression()

In [207]:
lr.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [208]:
y_pred = lr.predict(X_test)

## Testing Model Performance 

In [209]:
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score

In [218]:
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
# Note: AUROC is computed here from hard 0/1 predictions, not from predicted probabilities
roc = roc_auc_score(y_test, y_pred)

In [233]:
print('CM: \n', cm )
print('Accuracy: \n', acc )
print('AUROC: \n', roc)

CM: 
 [[168  40]
 [ 21 171]]
Accuracy: 
 0.8475
AUROC: 
 0.8491586538461539
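
The reported scores can be checked by hand from the confusion matrix. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels. It also shows that an AUROC computed from hard 0/1 predictions is the same number as the balanced accuracy, i.e. the mean of the two per-class recall values.

```python
# Confusion matrix reported above: rows are true labels, columns are predictions.
tn, fp = 168, 40
fn, tp = 21, 171

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
recall = tp / (tp + fn)        # true positive rate for class 1
specificity = tn / (tn + fp)   # true negative rate for class 0
precision = tp / (tp + fp)     # share of class-1 predictions that are correct
balanced_accuracy = (recall + specificity) / 2

print(f"accuracy          = {accuracy:.4f}")
print(f"recall (class 1)  = {recall:.4f}")
print(f"specificity       = {specificity:.4f}")
print(f"precision         = {precision:.4f}")
print(f"balanced accuracy = {balanced_accuracy:.4f}")
```

To measure ranking quality across all thresholds, `roc_auc_score` would typically be given scores such as `lr.predict_proba(X_test)[:, 1]` instead of `y_pred`; that value is not reported in this notebook.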


# Saving Our Model

In [238]:
# Pickling the classifier
with open('classifier.pickle', 'wb') as f:
    pickle.dump(lr, f)

In [239]:
# Pickling the vectorizer
with open ('vectorizer.pickle', 'wb') as f:
    pickle.dump(vectorizer, f) 
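
Saving the classifier and the vectorizer as two files works, but they must always be loaded and used together. An alternative is to wrap both in a scikit-learn pipeline and pickle a single object. The sketch below trains on a tiny made-up corpus purely to demonstrate the round trip; the texts, labels, and default hyperparameters are assumptions, not the notebook's settings.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny stand-in corpus; the notebook trains on the full review set instead.
texts = [
    "wonderful moving film", "brilliant acting and story",
    "great fun from start to finish", "dull and lifeless film",
    "terrible plot and weak acting", "boring from start to finish",
]
labels = [1, 1, 1, 0, 0, 0]

# One object that vectorizes and classifies, so the two parts cannot drift apart.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Round-trip through pickle in memory; writing to a file works the same way.
restored = pickle.loads(pickle.dumps(model))

same_predictions = list(restored.predict(texts)) == list(model.predict(texts))
print(same_predictions)
```

Pickled scikit-learn objects are generally only safe to load with the same library version that created them.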

# Importing and Using our Model

In [290]:
# Unpickling the classifier and the vectorizer
with open('classifier.pickle', 'rb') as f:
    lr_clf = pickle.load(f)
    


In [289]:
with open('vectorizer.pickle', 'rb') as f:
    tfidf = pickle.load(f)

In [294]:
# New data to be classified
new_data = ["We are lucky to have you, I have never seen anyone like you, you are an amazing person."]

In [295]:
# Vectorize the new data
newdata_vect = tfidf.transform(new_data).toarray()

In [296]:
# Predict new data
print(lr_clf.predict(newdata_vect))

[1]
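
The output `[1]` is a class index. A small helper can turn it into a readable label with a probability. The sketch below is self-contained, so it trains stand-in objects on made-up data rather than loading the pickled `lr_clf` and `tfidf`. The label names `neg` and `pos` are assumptions; in the notebook the real names would come from `review.target_names`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Stand-ins for the unpickled vectorizer and classifier, trained on made-up data.
target_names = ["neg", "pos"]  # assumed names for class indices 0 and 1
train_texts = [
    "dull and lifeless film", "terrible plot and weak acting",
    "boring from start to finish", "wonderful moving film",
    "brilliant acting and story", "great fun from start to finish",
]
train_labels = [0, 0, 0, 1, 1, 1]

tfidf_demo = TfidfVectorizer().fit(train_texts)
clf_demo = LogisticRegression().fit(tfidf_demo.transform(train_texts), train_labels)


def classify(texts):
    """Return (label name, probability of that label) for each input text."""
    features = tfidf_demo.transform(texts)
    indices = clf_demo.predict(features)
    probabilities = clf_demo.predict_proba(features)
    return [(target_names[i], float(p[i])) for i, p in zip(indices, probabilities)]


results = classify(["a wonderful and brilliant film"])
print(results)
```

The probability gives a rough sense of how confident the model is, which a bare class index does not.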
