<a href="https://colab.research.google.com/github/francoisdoanp/rotten-tomatoes-dataset/blob/master/Projet_text_mining.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **MATH60121 - ROTTEN TOMATOES PROJECT**

---


The goal of this project is to explore two aspects of text mining: sentiment analysis and text summarization. To do so, we will use the context of Rotten Tomatoes, a popular movie review aggregation website. The user is presented with short summaries of reviews by various critics, and each is assigned a score: rotten or fresh.

As such, our project will be divided into two parts. The first part is to train a model to accurately predict whether a movie is fresh or rotten based on the short summary. For this task, we will use Nicolas Gervais' Rotten Tomatoes dataset, available [here](https://github.com/francoisdoanp/rotten-tomatoes-dataset).

In the second part, we will attempt to summarize movie reviews from Rolling Stone magazine, which rates movies on a 5-star scale. Finally, we will use our model to predict whether each summary indicates a fresh or rotten score, which we can verify against the stars. If our summarization is pertinent, it should reflect the critic's sentiment towards that movie.
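
The verification step described above could be sketched as follows. The 3-star cutoff, the label strings `'fresh'` and `'rotten'`, and the helper names `stars_to_freshness` and `agreement_rate` are assumptions made for illustration; the project does not state which star ratings should count as fresh.

```python
# Hypothetical cutoff: ratings at or above this value are treated as fresh.
# This is an assumption for illustration, not a rule taken from the project.
FRESH_THRESHOLD = 3.0


def stars_to_freshness(stars, threshold=FRESH_THRESHOLD):
    """Map a 0-5 star rating to 'fresh' or 'rotten'."""
    if not 0 <= stars <= 5:
        raise ValueError("stars must be between 0 and 5")
    return "fresh" if stars >= threshold else "rotten"


def agreement_rate(predicted, stars):
    """Share of reviews whose predicted label matches the star-based label."""
    expected = [stars_to_freshness(s) for s in stars]
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


# Made-up predictions and ratings, for illustration only
predicted_labels = ["fresh", "rotten", "fresh", "rotten"]
star_ratings = [4.0, 1.5, 2.0, 2.5]
rate = agreement_rate(predicted_labels, star_ratings)
print(rate)
```

A different cutoff can be passed through the `threshold` argument to see how sensitive the agreement is to that choice.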




# Part 1: Sentiment Analysis

---



In [0]:
# Importing libraries

import pandas as pd
import numpy as np
import re

import matplotlib.pyplot as plt
import seaborn as sns



In [0]:
# Loading the dataset

url_base = 'https://raw.githubusercontent.com/francoisdoanp/rotten-tomatoes-dataset/master/small_rotten.txt'

rotten_db = pd.read_csv(url_base)
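
As a side note on the file format, a CSV reader treats a doubled quotation mark inside a quoted field as one literal quote. The sketch below illustrates this with the standard `csv` module on a tiny made-up sample; the two-column layout (`Freshness`, `Review`) and the `rotten` label on the sample row are assumptions, not a copy of the real file.

```python
import csv
import io

# Made-up sample in the layout we assume the file uses:
# a header row, then one label and one quoted review per line.
sample = (
    "Freshness,Review\n"
    'rotten,"Decent enough as an action pic but sorely needing some '
    'Schwarzenegger ""magic."""\n'
)

rows = list(csv.DictReader(io.StringIO(sample)))
review = rows[0]["Review"]

# The doubled quotes in the raw text come back as single quotes
print(review)
```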

### Exploratory analysis

In [0]:
# Structure of the dataset

print(rotten_db.head())

# Examples of Fresh and Rotten movie reviews
print('Fresh review:', rotten_db['Review'][1])
print('Rotten review:', rotten_db['Review'][0])

In [0]:
# Distribution of fresh vs rotten

dist = rotten_db['Freshness'].value_counts()
print(dist)
sns.barplot(x=dist.index, y=dist.values)
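
The bar plot shows raw counts; the class shares are often easier to compare against a model's accuracy later on. Here is a minimal sketch using only the standard library, with a made-up list standing in for `rotten_db['Freshness']` (the label strings are an assumption):

```python
from collections import Counter

# Made-up labels standing in for rotten_db['Freshness']
sample_labels = ["fresh", "rotten", "fresh", "fresh",
                 "rotten", "fresh", "rotten", "fresh"]

counts = Counter(sample_labels)
total = sum(counts.values())

# Share of each class in the sample
shares = {label: count / total for label, count in counts.items()}

# Ratio of the largest class to the smallest; 1.0 means perfectly balanced
imbalance_ratio = max(counts.values()) / min(counts.values())

print(shares, imbalance_ratio)
```

If one class dominates, a model that always predicts that class already reaches an accuracy equal to its share, which is a useful baseline to keep in mind.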

### Preprocessing

In [0]:
# Removing the trailing "Full Content Review" notice

test_c = "Decent enough as an action pic but sorely needing some Schwarzenegger ""magic."" (Full Content Review for Parents also available)"
test2_c = "Aside from a few jokes and some puns related to birds and the eventual invading pig population, this is a pretty lame exercise. (Full Content Review for Parents - Violence, Sexual Content, etc. - Also Available)"
test3_c = "While fans of the original work might get into this filmed adaptation, I came in cold and left even chillier in terms of appreciating what the movie was trying to be and how all of that was executed. (Full Content Review for Pare"

res_c = re.sub(r'\(Full Content.*$', " ", test3_c)

print("Original Sentence:", test3_c, "\nModified Sentence:", res_c)
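
The cell above only checks the truncated variant. A small sketch that runs the same pattern over all three shapes of the notice (complete, with extra detail, and cut off) may help confirm that it covers each of them. The sentences are shortened versions of the test strings, and `strip_full_content_notice` is a hypothetical helper name.

```python
import re

# Same pattern as above: from "(Full Content" to the end of the string
FULL_CONTENT_NOTICE = re.compile(r'\(Full Content.*$')


def strip_full_content_notice(text):
    """Remove the trailing notice and the whitespace left before it."""
    return FULL_CONTENT_NOTICE.sub("", text).rstrip()


# Shortened versions of the three test sentences
samples = [
    "Decent enough as an action pic. (Full Content Review for Parents also available)",
    "This is a pretty lame exercise. (Full Content Review for Parents - Violence, Sexual Content, etc. - Also Available)",
    "I came in cold and left even chillier. (Full Content Review for Pare",
]

cleaned = [strip_full_content_notice(s) for s in samples]
for original, result in zip(samples, cleaned):
    print("Original Sentence:", original, "\nModified Sentence:", result)
```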


In [0]:
# Removing the trailing "Full Review in Spanish" notice

test_s = "Thor: Ragnarock is a party and everyone is invited. [Full Review in Spanish]"

res_s = re.sub(r'\[Full Re.*$', " ", test_s)

print("Original Sentence:", test_s, "\nModified Sentence:", res_s)
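
The pattern above is case-sensitive, so a notice written as `[Full review in Spanish]` would not be matched. The sketch below shows a stricter, case-insensitive variant. It assumes that the notices always follow the form "[Full Review in <language>]", which we have not verified against the whole dataset; `strip_language_notice` is a hypothetical helper name.

```python
import re

# Assumption: notices look like "[Full Review in <Language>]", in any letter case
LANGUAGE_NOTICE = re.compile(r'\s*\[Full Review in [^\]]*\]\s*$', flags=re.IGNORECASE)


def strip_language_notice(text):
    """Remove a trailing language notice together with surrounding whitespace."""
    return LANGUAGE_NOTICE.sub("", text)


with_notice = "Thor: Ragnarock is a party and everyone is invited. [Full Review in Spanish]"
lowercase_notice = "Great fun. [Full review in Spanish]"
no_notice = "No notice here."

print(strip_language_notice(with_notice))
print(strip_language_notice(lowercase_notice))
print(strip_language_notice(no_notice))
```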

In [61]:
# Removing video format notice

test_f = "[VIDEO ESSAY] ""Wild"" is an unsatisfying self-help drama that exposes the limitations of Reece Witherspoon's range."
test2_f = "[VIDEO] There's nothing flashy about David Gelb's serviceable rendering of a man who has achieved an unrivaled mastery of a cuisine he helped invent. You too might come away from the movie dreaming of Jiro Ono's sushi."

res_f = re.sub(r'\[VID.*\]', "", test2_f)

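print("Original Sentence:", test2_f, "\nModified Sentence:", res_f)

Original Sentence: [VIDEO] There's nothing flashy about David Gelb's serviceable rendering of a man who has achieved an unrivaled mastery of a cuisine he helped invent. You too might come away from the movie dreaming of Jiro Ono's sushi. 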
Modified Sentence:  There's nothing flashy about David Gelb's serviceable rendering of a man who has achieved an unrivaled mastery of a cuisine he helped invent. You too might come away from the movie dreaming of Jiro Ono's sushi.
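
One caveat about the pattern `\[VID.*\]`: the `.*` is greedy, so it extends to the last closing bracket in the sentence. If a review contains another bracketed passage after the video tag, everything in between is removed as well. The sketch below compares it with a non-greedy variant on a made-up sentence.

```python
import re

# Made-up sentence with a second bracketed passage after the video tag
sentence = "[VIDEO] A fine film [sic] indeed."

# Greedy: matches from "[VID" up to the LAST "]"
greedy = re.sub(r'\[VID.*\]', "", sentence)

# Non-greedy: matches from "[VID" up to the FIRST "]"
lazy = re.sub(r'\[VID.*?\]', "", sentence)

print("Greedy:", repr(greedy))
print("Non-greedy:", repr(lazy))
```

On the two test sentences above both variants behave identically, since each contains a single pair of brackets.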


In [0]:
# Separating in features and labels

reviews = rotten_db['Review'].values
labels = rotten_db['Freshness'].values

# Preparing test reviews 

test1 = reviews[5]
test2 = reviews[6]
test3 = reviews[7]
test4 = reviews[11]
test5 = reviews[13]

processed_reviews = []

for review in reviews:

  # Removing special instances as demonstrated previously
  processed_review = re.sub(r'\(Full Content.*$', " ", str(review))
  processed_review = re.sub(r'\[Full Re.*$', " ", processed_review)
  processed_review = re.sub(r'\[VID.*\]', "", processed_review)

  # Removing special characters
  processed_review = re.sub(r'\W', ' ', processed_review)

  # Removing single characters (in the middle and at the start of the string)
  processed_review = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_review)
  processed_review = re.sub(r'^[a-zA-Z]\s+', ' ', processed_review)

  # Collapsing multiple spaces and removing leading whitespace
  processed_review = re.sub(r'\s+', ' ', processed_review)
  processed_review = re.sub(r'^\s+', '', processed_review)

  # Transform reviews to lower case
  processed_review = processed_review.lower()

  processed_reviews.append(processed_review)
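
Since the same cleaning will be needed again for the summaries in Part 2, it may be convenient to wrap these steps in a function. The following is a sketch under a few assumptions: `clean_review` is a hypothetical name, the start-of-string patterns use an unescaped `^` anchor, and a final `strip()` is added to trim trailing whitespace, which the loop above does not do.

```python
import re


def clean_review(text):
    """Apply the cleaning steps to one review and return the result."""
    # Drop the trailing notices and the leading video tag
    text = re.sub(r'\(Full Content.*$', " ", str(text))
    text = re.sub(r'\[Full Re.*$', " ", text)
    text = re.sub(r'\[VID.*\]', "", text)

    # Replace every non-word character with a space
    text = re.sub(r'\W', ' ', text)

    # Drop single letters, in the middle and at the start of the string
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)
    text = re.sub(r'^[a-zA-Z]\s+', ' ', text)

    # Collapse whitespace, trim both ends (an addition), and lower-case
    text = re.sub(r'\s+', ' ', text).strip()
    return text.lower()


example = "Thor: Ragnarock is a party and everyone is invited. [Full Review in Spanish]"
cleaned_example = clean_review(example)
print("Original Sentence:", example, "\nModified Sentence:", cleaned_example)
```

Note that the single-letter step also removes the article "a", and that apostrophes are turned into spaces first, so a possessive such as "Gelb's" loses its trailing "s".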

In [0]:
print("Original Sentence:", reviews[5], "\nModified Sentence:", processed_reviews[5], "\n\nOriginal Sentence:", reviews[6], "\nModified Sentence:", processed_reviews[6],
      "\n\nOriginal Sentence:", reviews[7], "\nModified Sentence:", processed_reviews[7], "\n\nOriginal Sentence:", reviews[11], "\nModified Sentence:", processed_reviews[11],
      "\n\nOriginal Sentence:", reviews[13], "\nModified Sentence:", processed_reviews[13])

In [0]:
# Creating list of reviews

X = list(rotten_db['Review'])
# Labels come from the 'Freshness' column used in the cells above
Y = np.array(list(rotten_db['Freshness']))

print(X)
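
Most classifiers expect numeric targets, so the label strings usually need to be encoded before training. Here is a minimal sketch, assuming the labels are the strings `'fresh'` and `'rotten'`; the mapping `LABEL_TO_INT` and the helper `encode_labels` are hypothetical names, and the choice of 1 for fresh is arbitrary.

```python
# Assumption: the label column holds the strings 'fresh' and 'rotten'
LABEL_TO_INT = {"rotten": 0, "fresh": 1}


def encode_labels(labels):
    """Map label strings to integers, failing loudly on unexpected values."""
    unknown = set(labels) - set(LABEL_TO_INT)
    if unknown:
        raise ValueError(f"Unexpected labels: {sorted(unknown)}")
    return [LABEL_TO_INT[label] for label in labels]


# Made-up labels, for illustration only
sample_labels = ["fresh", "rotten", "rotten", "fresh"]
encoded = encode_labels(sample_labels)
print(encoded)
```

Raising an error on unknown values makes typos or missing labels visible early, instead of silently producing a wrong target vector.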