<a href="https://colab.research.google.com/github/as-bestinclass/anand4MahiGaming/blob/main/Scenario_V.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Dependencies

In [112]:
import requests
import tarfile
import os
import matplotlib.pyplot as pyplt
import seaborn as sborn
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from nltk.corpus import stopwords
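from nltk.stem import WordNetLemmatizer

# The stopword list and WordNet data must be downloaded once before first use
nltk.download("stopwords")
nltk.download("wordnet")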

## 1. Data Collection and Extraction

In [113]:
# URL of the dataset archive and the local filename to save it under
path_of_url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
data_directory = "aclImdb_v1.tar.gz"

# Download the archive and write the response body to a local file
returned_response = requests.get(path_of_url)
returned_response.raise_for_status()  # stop early if the download failed
with open(data_directory, 'wb') as file_to_download:
    file_to_download.write(returned_response.content)

# Use the tarfile library to extract the archive into the current directory
with tarfile.open(data_directory, 'r:gz') as tar:
    tar.extractall()

# checking download status
if os.path.exists("aclImdb"):
    print("The Dataset has been successfully downloaded and extracted.")
else:
    print("Error: Could not locate the extracted dataset directory!")

The Dataset has been successfully downloaded and extracted.


## 2. Data Pre-Processing

#### File and Directory Preprocessor Functions

In [114]:
# Preprocess the text of a single review
def file_preprocessor(data):

    # Split on whitespace to get the raw tokens
    the_tokens = data.split()

    # English stopwords, stored as a set for fast membership checks
    list_stopwords = set(stopwords.words("english"))

    # Keep only the tokens whose lowercase form is not a stopword
    final_tokens = []
    for each in the_tokens:
        lower_word = each.lower()
        if lower_word not in list_stopwords:
            final_tokens.append(each)

    # Lemmatize each remaining token to its base form (treated as a noun by default)
    the_lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [the_lemmatizer.lemmatize(word) for word in final_tokens]

    return " ".join(lemmatized_tokens)

In [115]:
# Preprocess every review in a dataset directory (train or test)
def directory_preprocessor(directory_path):

    data_list = []
    labels_list = []

    # Each class ("pos" and "neg") has its own subdirectory of review files
    for each in ["pos", "neg"]:
        label_directory = os.path.join(directory_path, each)

        for filename in os.listdir(label_directory):

            with open(os.path.join(label_directory, filename), "r", encoding="utf-8") as file:
                output_data = file.read()
                # Clean the review text, then store it with its class label
                preprocessed_text = file_preprocessor(output_data)
                data_list.append(preprocessed_text)
                labels_list.append(each)

    return data_list, labels_list

#### Calling To Preprocess

In [None]:
# Preprocess the train directory (relative to where the archive was extracted)
train_data, train_labels = directory_preprocessor("aclImdb/train")

# Preprocess the test directory
test_data, test_labels = directory_preprocessor("aclImdb/test")

## 3. Feature Extraction

In [None]:
# Define the TF-IDF vectorizer, keeping at most the 10,000 most frequent terms
the_vectorizer = TfidfVectorizer(max_features=10000)

# Learn the vocabulary and IDF weights from the training data, then transform it
vectorized_train_features = the_vectorizer.fit_transform(train_data)

# Transform the test data with the vocabulary learned from the training data
vectorized_test_features = the_vectorizer.transform(test_data)
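
To make the TF-IDF weighting concrete, the sketch below recomputes it by hand for a toy corpus. It assumes scikit-learn's documented defaults: a smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document's weights. The corpus and the helper name `tfidf_row` are hypothetical and not part of the notebook.

```python
import math
from collections import Counter

# Hypothetical toy corpus (already lowercased, tokenized on whitespace)
corpus = ["good movie", "bad movie", "good good plot"]
docs = [text.split() for text in corpus]
n_docs = len(docs)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf_row(doc):
    """Return {term: weight} for one tokenized document, L2-normalized."""
    counts = Counter(doc)
    raw = {
        term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, tf in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = tfidf_row(docs[1])  # the document "bad movie"
print(weights)
```

In this toy corpus "bad" appears in one document and "movie" in two, so "bad" receives the larger weight: rarer terms are treated as more informative.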

## 4. Model Selection

In [None]:
PredictionModel = LogisticRegression()

## 5. Model Training and Evaluation

#### Training

In [None]:
# Fit the model: learn one weight per TF-IDF feature from the training data
PredictionModel.fit(vectorized_train_features, train_labels)
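
As a rough illustration of what fitting produces: a binary logistic regression learns one weight per feature plus an intercept, and passes their weighted sum through the sigmoid function to get a class probability. The weights, intercept, and feature values below are made-up numbers for illustration, not values taken from the trained `PredictionModel`.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned parameters for three TF-IDF features
weights = {"great": 2.1, "boring": -2.6, "plot": 0.1}
intercept = 0.05

def predict_proba_pos(features):
    """Probability of the positive class for a {term: tfidf_value} mapping."""
    score = intercept + sum(weights.get(t, 0.0) * v for t, v in features.items())
    return sigmoid(score)

review = {"great": 0.8, "plot": 0.6}  # hypothetical TF-IDF values for one review
p_pos = predict_proba_pos(review)
label = "pos" if p_pos >= 0.5 else "neg"
print(round(p_pos, 3), label)
```

Terms with positive weights push the prediction toward `pos`, and terms with negative weights push it toward `neg`.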

#### Testing

In [None]:
# Predict labels for the held-out test data to check the quality of the trained model
predicted_results = PredictionModel.predict(vectorized_test_features)

In [None]:
the_accuracy_score = accuracy_score(test_labels, predicted_results)
print("The Accuracy Score for the generated model is ", the_accuracy_score)

In [None]:
# Fix the label order so that rows and columns are [neg, pos]
the_confusion_matrix = confusion_matrix(test_labels, predicted_results, labels=['neg', 'pos'])

# Plot the confusion matrix as a heatmap
pyplt.figure(figsize=(5, 5))
sborn.heatmap(the_confusion_matrix, annot=True, fmt='g', cbar=False,
            xticklabels=['Predicted neg', 'Predicted pos'],
            yticklabels=['Actual neg', 'Actual pos'])

pyplt.xlabel('Predicted Labels')
pyplt.ylabel('True Labels')
pyplt.title('Confusion Matrix')
pyplt.show()

In [None]:
the_classification_report = classification_report(test_labels, predicted_results)
print("Classification Report:\n\n", the_classification_report)
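
The figures in the classification report can be derived directly from the confusion matrix. The sketch below does the arithmetic for a hypothetical 2x2 matrix (the counts are invented, not this notebook's results), using scikit-learn's layout in which rows are true labels and columns are predicted labels, both ordered `neg`, `pos`.

```python
# Hypothetical confusion matrix: rows = true labels, columns = predictions,
# both ordered ["neg", "pos"]
matrix = [
    [90, 10],  # true neg: 90 predicted neg, 10 predicted pos
    [20, 80],  # true pos: 20 predicted neg, 80 predicted pos
]

tn, fp = matrix[0]
fn, tp = matrix[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_pos = tp / (tp + fp)  # of reviews predicted pos, the share that are pos
recall_pos = tp / (tp + fn)     # of truly pos reviews, the share that were found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(f"accuracy={accuracy:.3f} precision={precision_pos:.3f} "
      f"recall={recall_pos:.3f} f1={f1_pos:.3f}")
```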

## Model Interpretation and Improvement

#### Sample Data

In [None]:
# Sample movie reviews taken from the website below, used to demo the model
# https://evolutionwriters.com/samples_and_examples/movie_reviews/

sample_reviews = [
    {
        "title": "Forrest Gump",
        "summary": "The film follows the life of Alabama native Forest Gump, a good man with an ideally low IQ of 75, as historical events occur through his eyes. Jenny Curran was one of Forrest Gump’s few childhood friends and his first and only romantic interest. Throughout the film, Gump regularly experiences adversity but never loses his positive outlook. He may have been physically impaired, but he always retained sight of the big picture. He overcame adversity with his mother’s support and received a football scholarship."
    },
    {
        "title": "Stranger Things",
        "summary": "The following Movie Review On Stranger Things and Its Visual Effects is an important topic for everyone to think about. If you need some great paper writing services to help you craft a similar one, don’t hesitate to address us."
    },
    {
        "title": "Alice in Wonderland",
        "summary": "Tim Burton’s 'Alice in Wonderland' was probably the most anticipated movie of the year 2010. But is it any good? First of all, I should probably mention that hardcore Lewis Carroll fans may find very disappointing. The movie is merely based on the book. It features most of the same characters and a couple of the same locations, but the story is completely different. Alice is no longer a little girl, but an unconventional young woman who has the same dream of finding herself in Wonderland every night, until one day it happens in real life. Her arrival was expected. Alice is supposed to save the inhabitants from the evil Red Queen and get the reign back to her kind sister, the White Queen. She knows nothing about it, but she is meant to be the knight in shining armor, both figuratively and literally speaking."
    },
    {
        "title": "Beautiful Mind",
        "summary": "Beautiful Mind is a biographical movie about John Forbes Nash junior, mathematical genius with hard fate. At the beginning of his career he has made enormous contribution in the field of Game Theory, which was a revolution in this mathematical area and almost brought international fame to the author. The movie was filmed in 2001 by Ron Howard and today it is called modern classics: a true drama, making you think and evoking emotions."
    }
]

#### Sample Predictions

In [None]:
def AnalyzeSentiment(review):
    # Apply the same preprocessing and vectorizer that were used for training
    vectorized_data = the_vectorizer.transform([file_preprocessor(review)])
    # predict returns an array with one label per input; take the single label
    prediction = PredictionModel.predict(vectorized_data)[0]
    if prediction == 'pos':
        return 'Positive Sentiment'
    elif prediction == 'neg':
        return 'Negative Sentiment'
    else:
        return 'Sentiment Not Determined'

#list to store sentiments from prediction
predicted_list = []

#predicting each review from sample reviews
for each in sample_reviews:
    summary_text = each['summary']
    review_sentiment = AnalyzeSentiment(summary_text)
    each['sentiment'] = review_sentiment
    predicted_list.append(each)

#Print out reviews
for each in predicted_list:
    print('Movie:', each['title'])
    print('Sentiment:', each['sentiment'], '\n')

## My Suggestions for Improvement

Building models is a computationally expensive and resource-intensive task. Because I had limited access to computing power, I chose a simple logistic regression model for binary classification.

Additionally, we can tune the maximum number of features (`max_features`) used during feature extraction to suit our text data.
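
One possible way to tune the number of features is a cross-validated grid search over a pipeline, sketched below. The tiny corpus and the candidate values are hypothetical stand-ins; in this notebook the inputs would be `train_data` and `train_labels`, and the grid would cover much larger vocabulary sizes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical stand-in data for train_data / train_labels
texts = [
    "great film loved it", "wonderful acting great story",
    "brilliant and moving", "loved the wonderful cast",
    "terrible film hated it", "boring plot awful acting",
    "dull and awful", "hated the boring script",
]
labels = ["pos"] * 4 + ["neg"] * 4

# Keeping the vectorizer inside the pipeline refits it on each training fold
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

# Candidate vocabulary sizes (illustrative values only)
param_grid = {"tfidf__max_features": [5, 10, 20]}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print("Best max_features:", search.best_params_["tfidf__max_features"])
```

Placing the vectorizer in the pipeline keeps the held-out fold from influencing the vocabulary during each cross-validation split.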

If I had more computing power, I would choose a pre-trained BERT model and fine-tune it on the provided dataset. BERT is a transformer-based NLP model that generally achieves higher scores on classification metrics; however, it takes significantly more time and computing power than logistic regression to train.