### IMDB Sentiment Analysis 
This notebook uses the IMDB dataset from Stanford NLP, loaded via Hugging Face (stanfordnlp/imdb).

#### Dataset Summary
Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. It includes 25,000 highly polar movie reviews for training, and 25,000 for testing. 

Data values:
- text: a string feature.
- label: a classification label, with possible values including neg (0), pos (1).

In [174]:
import torch
import torch.nn.functional as f
from datasets import load_dataset
from transformers import pipeline, AutoModel, AutoTokenizer, BertTokenizer, Trainer, TrainingArguments
from transformers.pipelines.pt_utils import KeyDataset
from tqdm.auto import tqdm

import re
import numpy as np
import pandas as pd 

import pyarrow as pa


In [106]:
import nltk
#nltk.download('punkt') # one time execution
# nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
from nltk.tokenize import word_tokenize

In [202]:
# Loading IMDB Dataset 
dataset = load_dataset("imdb", split='train').to_pandas()
#dataset

# Splitting Text and Labels 
reviews = dataset['text']
labels = dataset['label']


In [220]:
# Preprocessing the data 
# Remove <br /><br /> tags
# Remove ***** 
# Remove stop words 

def clean_text(txt):
    """Remove English stop words and leftover <br /> / ***** tokens from a review."""
    # word_tokenize typically splits "<br />" into '<', 'br', '/', '>', so filter those tokens.
    noise_tokens = {"<", "br", "/", ">", "*****"}
    try:
        word_tokens = word_tokenize(txt)
    except Exception:
        # Fall back to the raw value if tokenization fails (e.g. non-string input).
        return txt
    filtered_words = [
        w for w in word_tokens
        if w.lower() not in stop_words and w.lower() not in noise_tokens
    ]
    return " ".join(filtered_words)



# a = 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, even then it\'s not shot like some cheaply made porno. While my countrymen mind find it shocking, in reality sex and nudity are a major staple in Swedish cinema. Even Ingmar Bergman, arguably their answer to good old boy John Ford, had sex scenes in his films.<br /><br />I do commend the filmmakers for the fact that any sex shown in the film is shown for artistic purposes rather than just to shock people and make money to be shown in pornographic theaters in America. I AM CURIOUS-YELLOW is a good film for anyone wanting to study the meat and potatoes (no pun intended) of Swedish cinema. But really, this film doesn\'t have much of a plot.'

# clean_text(a)

dataset2 = dataset.copy()  # copy so the original DataFrame is left unchanged
dataset2['cleaned_text'] = dataset2['text'].apply(clean_text)

dataset2
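The token-level filter above depends on how the tokenizer splits `<br />` and asterisk runs. A minimal alternative sketch, assuming it is acceptable to strip this markup with regular expressions before tokenizing, is shown below. The helper name `strip_markup` and the three patterns are illustrative and not part of the original notebook.

```python
import re

# Illustrative patterns (assumptions, not from the original notebook).
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)  # <br>, <br/>, <br />
STAR_RUN = re.compile(r"\*(?:\s*\*)+")                  # "*****" or "* * * *"
WHITESPACE = re.compile(r"\s+")


def strip_markup(text):
    """Replace HTML line breaks and runs of asterisks with a single space."""
    text = BR_TAG.sub(" ", text)
    text = STAR_RUN.sub(" ", text)
    # Collapse the whitespace left behind by the substitutions.
    return WHITESPACE.sub(" ", text).strip()


example = "Great start.<br /><br />Rated * * * * out of *****"
print(strip_markup(example))  # Great start. Rated out of
```

A single `*` is left untouched because the pattern requires at least two asterisks; whether that is desirable depends on the reviews.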

In [224]:
from datasets import Dataset
hg_dataset = Dataset.from_pandas(dataset2, preserve_index=False)

In [225]:
hg_dataset

Dataset({
    features: ['text', 'label', 'cleaned_text'],
    num_rows: 25000
})

In [226]:
# Sentiment analysis with the Hugging Face pipeline API
classifier = pipeline('sentiment-analysis')
results = []

# Stream the cleaned reviews through the pipeline in batches of 8,
# truncating inputs that exceed the model's maximum sequence length.
for out in classifier(KeyDataset(hg_dataset, "cleaned_text"), batch_size=8, truncation="only_first"):
    print(out)
    results.append(out)


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'label': 'NEGATIVE', 'score': 0.9922564029693604}
{'label': 'NEGATIVE', 'score': 0.9993064403533936}
{'label': 'NEGATIVE', 'score': 0.6541439890861511}
{'label': 'NEGATIVE', 'score': 0.9501397609710693}
{'label': 'NEGATIVE', 'score': 0.9966205358505249}
{'label': 'NEGATIVE', 'score': 0.9996757507324219}
{'label': 'NEGATIVE', 'score': 0.9996438026428223}
{'label': 'NEGATIVE', 'score': 0.9964983463287354}
{'label': 'NEGATIVE', 'score': 0.9957259893417358}
{'label': 'NEGATIVE', 'score': 0.9967143535614014}
{'label': 'NEGATIVE', 'score': 0.8005483746528625}
{'label': 'NEGATIVE', 'score': 0.9992138147354126}
{'label': 'NEGATIVE', 'score': 0.9994953870773315}
{'label': 'NEGATIVE', 'score': 0.9910323023796082}
{'label': 'NEGATIVE', 'score': 0.9948911666870117}
{'label': 'NEGATIVE', 'score': 0.999721109867096}
{'label': 'NEGATIVE', 'score': 0.9975783228874207}
{'label': 'NEGATIVE', 'score': 0.999669075012207}
{'label': 'NEGATIVE', 'score': 0.9996323585510254}
{'label': 'NEGATIVE', 'score': 0.
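The warning in the output above recommends naming the model and revision explicitly. A minimal sketch of doing so follows; the model name and revision are copied from that warning, while the constants `MODEL_NAME` and `MODEL_REVISION` and the helper `build_classifier` are hypothetical names introduced for illustration.

```python
# Values copied from the warning printed by the default pipeline.
MODEL_NAME = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
MODEL_REVISION = "af0f99b"


def build_classifier():
    """Create the sentiment pipeline with an explicit model and revision."""
    # Imported lazily so that defining this function does not load transformers.
    from transformers import pipeline

    return pipeline(
        "sentiment-analysis",
        model=MODEL_NAME,
        revision=MODEL_REVISION,
    )
```

Calling `build_classifier()` should return a pipeline equivalent to the default one used here, with the added benefit that a future change to the library default would not silently change the results.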

In [233]:
## Writing Model Outputs to csv 

# import csv

# keys = results[0].keys()

# with open('results.csv', 'w', newline='') as output_file:
#     dict_writer = csv.DictWriter(output_file, keys)
#     dict_writer.writeheader()
#     dict_writer.writerows(results)

In [266]:
df = pd.DataFrame(results)
df["actual_label"] = labels
# df["actual_label"] = ['NEGATIVE' if x == 0 else "POSITIVE" for x in labels ]

df["label"] = [0 if x == "NEGATIVE" else 1 for x in df["label"] ]
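The list comprehension above maps every label other than `"NEGATIVE"` to 1, so an unexpected label would be counted as positive without any warning. A small sketch of a stricter mapping, assuming the pipeline only ever returns `NEGATIVE` and `POSITIVE` (the names `LABEL_TO_ID` and `encode_labels` are illustrative):

```python
# Mapping from pipeline label strings to the dataset's integer labels.
LABEL_TO_ID = {"NEGATIVE": 0, "POSITIVE": 1}


def encode_labels(predictions):
    """Convert pipeline outputs to 0/1 labels, failing on unknown labels."""
    encoded = []
    for pred in predictions:
        label = pred["label"]
        if label not in LABEL_TO_ID:
            raise ValueError(f"Unexpected label: {label!r}")
        encoded.append(LABEL_TO_ID[label])
    return encoded


sample = [
    {"label": "NEGATIVE", "score": 0.99},
    {"label": "POSITIVE", "score": 0.68},
]
print(encode_labels(sample))  # [0, 1]
```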


In [275]:
# print(df["label"].value_counts())
# print(df["actual_label"].value_counts())
df

       label     score  actual_label
0          0  0.992256             0
1          0  0.999306             0
2          0  0.654144             0
3          0  0.950140             0
4          0  0.996621             0
...      ...       ...           ...
24995      0  0.996695             1
24996      1  0.677708             1
24997      1  0.999657             1
24998      0  0.996958             1


In [274]:
### Evaluating Model Performance Metrics

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print("Accuracy Score:", round(accuracy_score(df["actual_label"], df["label"]),2))  # ratio of correctly predicted instances to the total instances in the dataset ((TP + TN) / (TP + TN + FP + FN))
print("Precision Score:", round(precision_score(df["actual_label"], df["label"]),2)) #  how many of the predicted positive instances are actually positive ( TP / (TP + FP))
print("Recall Score:", round(recall_score(df["actual_label"], df["label"]),2)) # proportion of actual positive instances that were correctly predicted (TP / (TP + FN))
print("F1 Score:", round(f1_score(df["actual_label"], df["label"]),2)) # the harmonic mean of precision and recall (2 * (Precision * Recall) / (Precision + Recall))


Accuracy Score: 0.84
Precision Score: 0.9
Recall Score: 0.77
F1 Score: 0.83
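The formulas in the comments above can be checked by hand. The sketch below computes the same four metrics from confusion-matrix counts in plain Python; `binary_metrics` is an illustrative helper, not a scikit-learn function, and the toy labels are made up for the example.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy example: TP=2, TN=1, FP=1, FN=1.
metrics = binary_metrics([0, 0, 1, 1, 1], [0, 1, 1, 1, 0])
print(metrics)
```

The reported scores are consistent with each other: 2 * 0.9 * 0.77 / (0.9 + 0.77) is approximately 0.83, which matches the F1 score above.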


In [228]:

# # Store the model we want to use
# MODEL_NAME = "bert-base-cased"

# # We need to create the model and tokenizer
# model = AutoModel.from_pretrained(MODEL_NAME)
# tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# # Testing that the model and tokenizer work as intended 
# tokens = tokenizer.tokenize("This is an input example")
# print("Tokens: {}".format(tokens))


# training_args = TrainingArguments("test-trainer")

# trainer = Trainer(
#     model,
#     training_args,
#     train_dataset = tokenized_datasets["train"]

# )
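The commented-out cell above refers to `tokenized_datasets`, which is never defined, and loads `AutoModel`, which has no classification head. One possible way to complete it is sketched below, assuming the goal is to fine-tune `bert-base-cased` on the IMDB training split with default training arguments. The function name `build_trainer` and its parameters are hypothetical, and the imports are kept inside the function so that defining it does not download anything.

```python
def build_trainer(model_name="bert-base-cased", output_dir="test-trainer"):
    """Assemble a Trainer for binary sentiment classification on IMDB."""
    # Lazy imports: nothing is loaded or downloaded until the function is called.
    from datasets import load_dataset
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        DataCollatorWithPadding,
        Trainer,
        TrainingArguments,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Two output labels: neg (0) and pos (1).
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2
    )

    raw_datasets = load_dataset("imdb")

    def tokenize(batch):
        # Truncate reviews longer than the model's maximum input length.
        return tokenizer(batch["text"], truncation=True)

    tokenized_datasets = raw_datasets.map(tokenize, batched=True)

    return Trainer(
        model=model,
        args=TrainingArguments(output_dir=output_dir),
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["test"],
        # Pad each batch dynamically to its longest sequence.
        data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    )
```

Calling `build_trainer().train()` would start fine-tuning. On 25,000 reviews this is slow without a GPU, so a smaller subset may be preferable for a first run.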