# Exploratory Data Analysis with Rotten Tomatoes Data

In [1]:
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

## Introduction

Rotten Tomatoes gathers movie reviews from critics. An [entry on the website](http://www.rottentomatoes.com/m/primer/reviews/?type=top_critics) typically consists of a short quote, a link to the full review, and a Fresh/Rotten classification which summarizes whether the critic liked/disliked the movie.


When critics give quantitative ratings (say 3/4 stars, Thumbs up, etc.), determining the Fresh/Rotten classification is easy. However, publications like the New York Times don't assign numerical ratings to movies, and thus the Fresh/Rotten classification must be inferred from the text of the review itself.

## The Data

You will start with a database of movies derived from the MovieLens dataset. It covers about 10,000 movies and includes the IMDb ID for each one.

In [2]:
# Load the critic reviews
critics = pd.read_csv('../data/critics.csv')

# Drop rows with missing quotes
critics = critics[critics['quote'].notnull()]

## Exploratory Data Analysis

#### How many reviews, critics, and movies are there in this data set?

#### What is the shape of the data set?

#### Extract the top 5 rows of the data set.
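One possible way to approach the three questions above is sketched here on a small made-up table, so the snippet runs on its own. The column names `critic`, `title`, and `publication` are assumptions for illustration and may not match the real `critics.csv`; only `quote` and `fresh` are mentioned in this notebook.

```python
import pandas as pd

# Toy stand-in for the critics table. The columns 'critic', 'title', and
# 'publication' are assumed names and may differ in critics.csv.
toy = pd.DataFrame({
    'critic': ['A. Reviewer', 'B. Writer', 'A. Reviewer', 'C. Critic'],
    'title': ['Primer', 'Primer', 'Alien', 'Alien'],
    'publication': ['Daily News', 'Weekly Post', 'Daily News', 'Weekly Post'],
    'quote': ['Smart.', 'Confusing.', 'Terrifying.', 'A classic.'],
    'fresh': ['fresh', 'rotten', 'fresh', 'fresh'],
})

n_reviews = len(toy)                 # one row per review
n_critics = toy['critic'].nunique()  # number of distinct critic names
n_movies = toy['title'].nunique()    # number of distinct movie titles

print(n_reviews, n_critics, n_movies)
print(toy.shape)   # (number of rows, number of columns)
print(toy.head())  # first 5 rows by default
```

With the real data you would replace `toy` with `critics` and check `critics.columns` first to confirm the actual column names.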

#### List the 5 publications with the most reviews. Hint: use `.groupby()` and `.count()`

#### List the 5 critics with the most reviews, along with the publication they write for. Hint: use `.groupby()` and `.count()`
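The grouping pattern from the hints might look like the sketch below. It uses a made-up table, and the column names `critic` and `publication` are assumptions that may need adjusting for the real file.

```python
import pandas as pd

# Toy data; 'critic' and 'publication' are assumed column names.
toy = pd.DataFrame({
    'critic': ['Ann', 'Ann', 'Ann', 'Bob', 'Bob', 'Cy'],
    'publication': ['Daily News', 'Daily News', 'Daily News',
                    'Weekly Post', 'Weekly Post', 'Daily News'],
    'quote': ['q1', 'q2', 'q3', 'q4', 'q5', 'q6'],
})

# Reviews per publication, largest first, top 5
top_publications = (
    toy.groupby('publication')['quote']
       .count()
       .sort_values(ascending=False)
       .head(5)
)

# Grouping by both columns keeps each critic's publication in the index
top_critics = (
    toy.groupby(['critic', 'publication'])['quote']
       .count()
       .sort_values(ascending=False)
       .head(5)
)

print(top_publications)
print(top_critics)
```

Counting the `quote` column works here because rows with missing quotes were dropped earlier; `.size()` would be an alternative that counts rows regardless of missing values.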

#### Create a column 'fresh_binary' based on the 'fresh' column. Set it to 1 where the value is 'fresh' and 0 otherwise.
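A minimal sketch of the binary encoding, shown on a toy frame. The values other than 'fresh' are made up for illustration; the real column may contain different labels.

```python
import pandas as pd

# Toy 'fresh' column; labels other than 'fresh' are illustrative guesses.
toy = pd.DataFrame({'fresh': ['fresh', 'rotten', 'fresh', 'none']})

# The comparison yields booleans; astype(int) turns True/False into 1/0
toy['fresh_binary'] = (toy['fresh'] == 'fresh').astype(int)

print(toy)
```

Anything that is not exactly the string 'fresh' maps to 0, which matches the instruction above.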

#### Plot the "fresh" rating proportions as a function of year. Comment on the result: is there a trend? What do you think it means? Note: you must first create a 'year' column.
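One way to build the 'year' column and the plot is sketched below. It assumes the data has a date column called `review_date` holding parseable date strings; that name is a guess, and the exercise may instead intend the movie's release year, in which case the year would come from a different source.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data; 'review_date' is an assumed column name.
toy = pd.DataFrame({
    'review_date': ['1999-05-01', '1999-08-12', '2000-01-20',
                    '2000-03-03', '2000-11-30'],
    'fresh': ['fresh', 'rotten', 'fresh', 'fresh', 'rotten'],
})
toy['fresh_binary'] = (toy['fresh'] == 'fresh').astype(int)

# Parse the dates and keep only the calendar year
toy['year'] = pd.to_datetime(toy['review_date']).dt.year

# The mean of a 0/1 column is the proportion of 1s, i.e. the share of fresh reviews
fresh_by_year = toy.groupby('year')['fresh_binary'].mean()

fig, ax = plt.subplots()
ax.plot(fresh_by_year.index, fresh_by_year.values, marker='o')
ax.set_xlabel('Year')
ax.set_ylabel('Proportion of fresh reviews')
ax.set_title('Fresh proportion by year (toy data)')
```

On the real data it may also help to look at the number of reviews per year, since proportions based on only a few reviews can be noisy.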

## Next Steps
You could look into text data analysis using the `sklearn`, `nltk`, or `spacy` packages. Transform the critic review text into features for a random forest model, train the model on the transformed data, and run cross-validation to see how well it fits. Then you can identify the most important features in your model. Interested in doing so but not sure how to start? Email Anahita at *abahri@bu.edu*.