# Sentiment Analysis on IMDB Movie Reviews




## Task Description

> Binary sentiment classification (positive vs. negative) of IMDb movie reviews using a pre-trained transformer model.

# Getting Data

#### Dataset Description

The IMDb dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb), each labeled as positive or negative (the review's polarity). The dataset is balanced: it contains an equal number of positive and negative reviews. Only highly polarized reviews are included. A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. No more than 30 reviews are included per movie. The data is split into 25,000 reviews for training and 25,000 for testing.
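The labeling rule described above can be sketched in a few lines. This is only an illustration of the thresholds stated in the description; the function name `rating_to_label` is hypothetical, and the 0 = negative, 1 = positive encoding is assumed to match the labels used later in this notebook.

```python
from typing import Optional


def rating_to_label(rating: int) -> Optional[int]:
    """Map a 1-10 rating to a polarity label.

    Returns 0 (negative) for ratings <= 4, 1 (positive) for ratings >= 7,
    and None for the neutral ratings 5 and 6, which the dataset excludes.
    """
    if not 1 <= rating <= 10:
        raise ValueError(f"rating must be between 1 and 10, got {rating}")
    if rating <= 4:
        return 0
    if rating >= 7:
        return 1
    return None


# Show how each rating band is treated.
for rating in (1, 4, 5, 6, 7, 10):
    print(rating, rating_to_label(rating))
```

Ratings of 5 and 6 map to `None` because such reviews are not polar enough to be part of the dataset.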

#### Pulling the data from `huggingface/datasets`

We use Hugging Face's awesome datasets library to get the pre-processed version of the original [IMDB dataset](https://ai.stanford.edu/~amaas/data/sentiment/).

The code below pulls the train and test datasets from [huggingface/datasets](https://github.com/huggingface/datasets) using `load_dataset('imdb')` and transforms them into `pandas` dataframes, which the `simpletransformers` library uses to train the model.

In [11]:
import pandas as pd
from datasets import load_dataset

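# Load the training split and convert it to a pandas DataFrame.
dataset_train = load_dataset('imdb', split='train')
train_df = pd.DataFrame(dataset_train)
# simpletransformers expects the label column to be named 'labels'.
train_df = train_df.rename(columns={'label': 'labels'})

# Repeat the same steps for the test split.
dataset_test = load_dataset('imdb', split='test')
test_df = pd.DataFrame(dataset_test)
test_df = test_df.rename(columns={'label': 'labels'})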

Once done, we can look at the `head()` of the training set to check that the data has been retrieved properly.

In [7]:
train_df.head()

| | text | labels |
|---|---|---|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 |
| 2 | If only to avoid making this type of film in t... | 0 |
| 3 | This film was probably inspired by Godard's Ma... | 0 |
| 4 | Oh, brother...after hearing about this ridicul... | 0 |


In [8]:
# Count the reviews per label (rows: 0 = negative, 1 = positive)
# in the train and test splits (columns).
data = [[train_df.labels.value_counts()[0], test_df.labels.value_counts()[0]],
        [train_df.labels.value_counts()[1], test_df.labels.value_counts()[1]]]
pd.DataFrame(data, columns=["Train", "Test"])

| | Train | Test |
|---|---|---|
| 0 | 12500 | 12500 |
| 1 | 12500 | 12500 |


# Training and Testing the Model

In [9]:
train_args = {
    'reprocess_input_data': True,
    'overwrite_output_dir': True,
    'sliding_window': True,
    'max_seq_length': 64,
    'num_train_epochs': 1,
    'learning_rate': 0.00001,
    'weight_decay': 0.01,
    'train_batch_size': 128,
    'fp16': True,
    'output_dir': '/outputs/',
}
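With `sliding_window` enabled and `max_seq_length` set to 64, a review longer than one window is split into several overlapping windows instead of being truncated. The sketch below illustrates the general idea on a plain token list. The function name `sliding_windows` and the `stride_fraction` parameter are hypothetical and chosen for illustration; the library's own windowing (special tokens, its default stride) may differ.

```python
from typing import List


def sliding_windows(tokens: List[str], window_size: int,
                    stride_fraction: float = 0.8) -> List[List[str]]:
    """Split a token list into overlapping windows of at most `window_size` tokens.

    Consecutive windows start `int(window_size * stride_fraction)` tokens apart
    (an assumed convention for this sketch).
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if len(tokens) <= window_size:
        return [tokens]

    step = max(1, int(window_size * stride_fraction))
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + window_size])
        # Stop once the current window reaches the end of the sequence.
        if start + window_size >= len(tokens):
            break
    return windows


# Toy example: 10 tokens, windows of 4 tokens, each starting 2 tokens apart.
toy_tokens = [f"t{i}" for i in range(10)]
for window in sliding_windows(toy_tokens, window_size=4, stride_fraction=0.5):
    print(window)
```

One consequence of this scheme is that the number of training examples seen by the model can be much larger than the number of reviews, since each long review contributes several windows.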

In [10]:
from simpletransformers.classification import ClassificationModel
from sklearn.metrics import accuracy_score
import pandas as pd
import logging

# Keep general logging verbose, but silence the noisier transformers logger.
logging.basicConfig(level=logging.DEBUG)
transformers_logger = logging.getLogger('transformers')
transformers_logger.setLevel(logging.WARNING)

# We use the XLNet base cased pre-trained model.
model = ClassificationModel('xlnet', 'xlnet-base-cased', num_labels=2, args=train_args)

# Train the model. There is no development or validation set for this dataset
# (relevant for early stopping, see
# https://simpletransformers.ai/docs/tips-and-tricks/#using-early-stopping).
model.train_model(train_df)

# Evaluate the model on the test set in terms of accuracy score.
result, model_outputs, wrong_predictions = model.eval_model(test_df, acc=accuracy_score)

Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['logits_proj.bias', 'logits_proj.weight', 'sequence_summary.summary.bias', 'sequence_summary.summary.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



After training for 1 epoch, the model reaches an accuracy of **92.2%** on the test set (`'acc': 0.92156`).
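To make the accuracy figure concrete, the sketch below computes accuracy as the fraction of predictions that match the true labels, and then converts the reported value into a count of correct predictions. The helper `accuracy` and the toy labels are illustrative only, and the conversion assumes the reported accuracy is computed over all 25,000 test reviews.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy example (made-up labels): 3 of 5 predictions are correct.
toy_true = [0, 0, 1, 1, 1]
toy_pred = [0, 1, 1, 1, 0]
print(accuracy(toy_true, toy_pred))  # 0.6

# Under the stated assumption, 0.92156 * 25,000 = 23,039 correct predictions.
reported_acc = 0.92156
test_size = 25_000
print(round(reported_acc * test_size))  # 23039
```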

## Using the Model (Running Inference)

Running the model to do some predictions/inference is as simple as calling `model.predict(input_list)`.

In [12]:
samples = ['The script is nice.Though the casting is absolutely non-watchable.No style. the costumes do not look like some from the High Highbury society. Comparing Gwyneth Paltrow with Kate Beckinsale I can only say that Ms. Beckinsale speaks British English better than Ms. Paltrow, though in Ms. Paltrow\'s acting lies the very nature of Emma Woodhouse. Mr. Northam undoubtedly is the best Mr. Knightley of all versions, he is romantic and not at all sharp-looking and unfeeling like Mr. Knightley in the TV-version. P.S.The spectator cannot see at all Mr. Elton-Ms. Smith relationship\'s development as it was in the motion version, so one cannot understand where was all Emma\'s trying of make a Elton-Smith match (besides of the portrait).']
predictions, _ = model.predict(samples)
label_dict = {0: 'negative', 1: 'positive'}
for idx, sample in enumerate(samples):
  print('{} - {}: {}'.format(idx, label_dict[predictions[idx]], sample))


0 - positive: The script is nice.Though the casting is absolutely non-watchable.No style. the costumes do not look like some from the High Highbury society. Comparing Gwyneth Paltrow with Kate Beckinsale I can only say that Ms. Beckinsale speaks British English better than Ms. Paltrow, though in Ms. Paltrow's acting lies the very nature of Emma Woodhouse. Mr. Northam undoubtedly is the best Mr. Knightley of all versions, he is romantic and not at all sharp-looking and unfeeling like Mr. Knightley in the TV-version. P.S.The spectator cannot see at all Mr. Elton-Ms. Smith relationship's development as it was in the motion version, so one cannot understand where was all Emma's trying of make a Elton-Smith match (besides of the portrait).
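The second value returned by `model.predict` (discarded as `_` above) holds the raw model outputs. Assuming these are unnormalized per-class scores, a softmax turns them into probabilities, which gives a rough sense of how confident a prediction is. The sketch below uses made-up scores rather than real model outputs; with `sliding_window` enabled, the real outputs may contain one row of scores per window rather than one per review.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Convert unnormalized scores into probabilities that sum to 1."""
    # Subtract the maximum for numerical stability before exponentiating.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


label_dict = {0: 'negative', 1: 'positive'}

# Made-up scores for one input, for illustration only (not real model output).
example_scores = [-0.3, 1.2]
probs = softmax(example_scores)
best = max(range(len(probs)), key=lambda i: probs[i])
print('{}: {:.3f}'.format(label_dict[best], probs[best]))
```

A probability close to 0.5 would suggest a borderline review, which can be useful when inspecting mixed reviews like the sample above.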


