# Dirty Rotten Tomatoes?

## Goal: a real-world Jupyter notebook tutorial
* Web Scraping
    - Section 1 - Scrape critic reviews from Rotten Tomatoes and load them into a dataframe
    - Exercise 1 - Scrape audience reviews
* Data Enhancement
    - Section 2 - Basic dataframe exploration and data enhancement
    - Exercise 2 - Create new grading column based on original scores
* Analysis and Visualizations
    - Section 3 - Build a story notebook - hypothesis: tomatometer scores are suspicious
    - Exercise 3 - Build a story notebook - hypothesis: some audience scores are bots
* NLP/Machine Learning
    - Section 4 - Build a bot detector
    - Exercise 4 - Enhance the bot detector

In [70]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [52]:
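def calculate_score(original_score):
    """Convert a critic's original score string to a 0-100 scale."""
    if '/' in original_score:
        # Fractional scores such as "4/5" become percentages
        fraction = original_score.split('/')
        score = (float(fraction[0])/float(fraction[1]))*100
    elif original_score != "":
        # Letter grades: a base value per letter, adjusted by 5 for + or -
        score = None  # stays None if no known letter grade is found
        if 'A' in original_score:
            score = 100
        elif 'B' in original_score:
            score = 85
        elif 'C' in original_score:
            score = 75
        elif 'D' in original_score:
            score = 65
        elif 'F' in original_score:
            score = 50
        if score is not None:
            if '+' in original_score:
                score += 5
            elif '-' in original_score:
                score -= 5
    else:
        # Reviews without an original score default to 100
        score = 100
    return score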
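
The conversion rules above can be checked against a few sample strings. The sketch below is a self-contained variant, not the notebook's function: the name `score_from_string` is hypothetical, and unlike `calculate_score` it returns `None` when the score is missing rather than defaulting to 100, so unscored reviews can be told apart later. It also returns `None` for strings it cannot parse.

```python
from typing import Optional

# Base values for letter grades, copied from the mapping in calculate_score.
LETTER_BASE = {'A': 100, 'B': 85, 'C': 75, 'D': 65, 'F': 50}


def score_from_string(original_score: str) -> Optional[float]:
    """Hypothetical variant of calculate_score; None means no usable score."""
    text = original_score.strip()
    if '/' in text:
        # Fractional score such as "4/5"
        numerator, denominator = text.split('/', 1)
        try:
            return float(numerator) / float(denominator) * 100
        except (ValueError, ZeroDivisionError):
            return None
    # Letter grade, checked in the same order as calculate_score
    for letter, base in LETTER_BASE.items():
        if letter in text:
            if '+' in text:
                return float(base + 5)
            if '-' in text:
                return float(base - 5)
            return float(base)
    return None


print(score_from_string('4/5'))
print(score_from_string('B+'))
print(score_from_string(''))
```

Whether a missing score should count as 100 or as "unknown" is a modelling choice; this variant only makes the second option available.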
        

In [53]:
# Collect the first page of critic reviews (single-page test, left commented out)
#page = requests.get('https://www.rottentomatoes.com/m/paddington_2014/reviews/?page=1&sort=')
#soup = BeautifulSoup(page.text, 'html.parser')

In [67]:
def scrape_page(review_array, soup, movie):
    
    # Pull all text from the review table
    review_table = soup.find(class_='review_table')
    reviews = review_table.find_all(class_='row review_table_row')
    for row in reviews:
        review_dict = {}
        
        #print(row)
        critic_thumb_item = row.find(class_='critic_thumb fullWidth')
        critic_thumb_src = critic_thumb_item.get('src')
        critic_name = row.find(class_='articleLink').contents[0].lstrip()
        critic_pub = row.find(class_='subtle').contents[0].lstrip()
        # A non-empty 'small' element marks a top critic
        top_critic = len(row.find(class_='small').contents) > 0
        review_date = row.find(class_='review_date').contents[0].lstrip()
        review = row.find(class_='the_review').contents[0].lstrip()
        original_score_string = row.find(class_='small subtle').contents[2].lstrip()
        if "Original Score: " in original_score_string:
            original_score = original_score_string.split(':')[1].lstrip()
        else:
            original_score = ""
        score = calculate_score(original_score)

        review_dict['Movie'] = movie
        #review_dict['Critic Category'] = critic_category        
        review_dict['Critic Thumbnail'] = critic_thumb_src
        review_dict['Critic Name'] = critic_name
        review_dict['Critic Publication'] = critic_pub
        review_dict['Top Critic'] = top_critic
        review_dict['Review Date'] = review_date
        review_dict['Review'] = review
        review_dict['Original Score String'] = original_score_string
        review_dict['Original Score'] = original_score
        review_dict['Score'] = score
        
        #print("")
        #print(review_dict)
        #print("")
        review_array.append(review_dict)
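
The score extraction inside `scrape_page` relies on splitting the string on `':'`. As an illustration only, the same step can be sketched with a regular expression; `extract_original_score` is a hypothetical helper name, and the sample input is the `Original Score String` value that appears in the notebook's output.

```python
import re

# Captures the text that follows "Original Score:" up to the end of the string.
ORIGINAL_SCORE_RE = re.compile(r'Original Score:\s*(\S.*)')


def extract_original_score(score_string: str) -> str:
    """Return the text after 'Original Score:', or '' if it is absent."""
    match = ORIGINAL_SCORE_RE.search(score_string)
    return match.group(1).strip() if match else ''


print(extract_original_score('| Original Score: 4/5'))
print(repr(extract_original_score('')))
```

Returning an empty string for a missing score keeps the helper compatible with how `calculate_score` treats `""`.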
    

In [71]:
def get_rotten_tomato_reviews(movie):
    review_array = []
    initial_page = requests.get('https://www.rottentomatoes.com/m/{}/reviews/'.format(movie))
    initial_soup = BeautifulSoup(initial_page.text, 'html.parser')
    
    page_info = initial_soup.find(class_='pageInfo').contents[0]
    #print(page_info)
    page_count_info = page_info.split("of ")
    if len(page_count_info) > 1:
        page_count = int(page_count_info[1])
    else:
        page_count = 1      # No "of N" text found; assume a single page
    #print(page_count)
    for i in range(page_count):      # i is 0-based; page numbers are 1-based, hence i+1 below
        page = requests.get('https://www.rottentomatoes.com/m/{}/reviews/?page={}&sort='.format(movie,i+1))
        soup = BeautifulSoup(page.text, 'html.parser')
        scrape_page(review_array, soup, movie)
        
    return pd.DataFrame(review_array)
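
The page count is read from the text of the `pageInfo` element by splitting on `"of "`. A minimal sketch of a more defensive parse follows, assuming the text looks like `Page 1 of 8`; that exact format is an assumption, and `parse_page_count` is a hypothetical helper, not part of the notebook.

```python
import re
from typing import Optional


def parse_page_count(page_info: str) -> Optional[int]:
    """Return N from text such as 'Page 1 of N', or None if no count is present."""
    match = re.search(r'of\s+(\d+)', page_info)
    return int(match.group(1)) if match else None


print(parse_page_count('Page 1 of 8'))
print(parse_page_count(''))
```

A caller can then decide what a `None` result means, for example treating it as a single page or stopping with an error.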

In [72]:
df = get_rotten_tomato_reviews("paddington_2014")

In [74]:
df.head()

| | Critic Name | Critic Publication | Critic Thumbnail | Movie | Original Score | Original Score String | Review | Review Date | Score | Top Critic |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pat Padua | DCist | https://resizing.flixster.com/1n7WvueNeArY6BAq... | paddington_2014 | | | Eventually gets lost in Scooby Doo plot territ... | August 31, 2018 | 100.0 | False |
| 1 | Joseph Walsh | CineVue | https://resizing.flixster.com/nl2LXtB9-_dYgvhQ... | paddington_2014 | 4/5 | \| Original Score: 4/5 | Paddington is delectably twee, yet it's the bi... | August 22, 2018 | 80.0 | False |
| 2 | Felix Vasquez Jr. | Cinema Crazed | https://resizing.flixster.com/6X_sshl1bKskMnfq... | paddington_2014 | | | It's a gem of an adaptation I hope becomes a c... | June 7, 2018 | 100.0 | False |
| 3 | Andre Meadows | Black Nerd Comedy | https://d2a5cgar23scu2.cloudfront.net/static/i... | paddington_2014 | | | A super adorable movie, totally not at all wha... | February 20, 2018 | 100.0 | False |
| 4 | Camilla Long | Sunday Times (UK) | https://d2a5cgar23scu2.cloudfront.net/static/i... | paddington_2014 | 4/5 | \| Original Score: 4/5 | It is quite unlike any other meditation on par... | February 15, 2018 | 80.0 | False |


In [75]:
df.describe()

| | Score |
|---|---|
| count | 148.0 |
| mean | 85.513514 |
| std | 12.795347 |
| min | 40.0 |
| 25% | 75.75 |
| 50% | 82.5 |
| 75% | 100.0 |
| max | 100.0 |
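
Reviews without an original score receive a default of 100 in `calculate_score`, which pulls the summary statistics upward. A minimal sketch of the effect, using only the five rows shown by `df.head()` and plain Python rather than pandas so that it runs on its own:

```python
from statistics import mean

# (Original Score, Score) pairs copied from the five rows of df.head() above.
rows = [('', 100.0), ('4/5', 80.0), ('', 100.0), ('', 100.0), ('4/5', 80.0)]

# Every review, including those that received the default of 100
all_scores = [score for _, score in rows]
# Only reviews where the critic actually supplied a score
rated_scores = [score for original, score in rows if original != '']

print(mean(all_scores))    # 92.0
print(mean(rated_scores))  # 80.0
```

On the full dataframe, a comparable filter would be `df[df['Original Score'] != '']['Score'].describe()`; the figures above cover five rows only and say nothing about the full set of 148 reviews.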
