# Fake Movie Review Generation

To build the dataset, we first scrape professional critics' reviews of Joker (2019), a film whose reviews are abundant and diverse in opinion, from Rotten Tomatoes using Selenium. The collected reviews are labelled as genuine because the critics' identities are verified and highly credible. We then pass the genuine reviews to a text-generation model, namely GPT-2, to produce fake reviews while preserving the ratio of positive to negative reviews. The result is a class-balanced dataset, half genuine and half fake, in which each record contains the review text and an indicator of whether the review is favourable to the movie.
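The balancing step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes each genuine review is a dict carrying an `isFresh` flag (the field name that appears in the scraped data), and the `fake_quota` helper is hypothetical. It only computes how many fake reviews of each sentiment to request from the generator and does not call GPT-2.

```python
from collections import Counter


def fake_quota(genuine_reviews):
    """Return how many fake reviews to generate per sentiment.

    Matching the genuine counts one-for-one gives a dataset that is
    half genuine, half fake, with the same positive-negative ratio.
    """
    counts = Counter(bool(review["isFresh"]) for review in genuine_reviews)
    return {"positive": counts[True], "negative": counts[False]}


# Toy records for illustration only, not real reviews.
genuine = [
    {"quote": "toy positive review 1", "isFresh": True},
    {"quote": "toy positive review 2", "isFresh": True},
    {"quote": "toy negative review", "isFresh": False},
]

quota = fake_quota(genuine)
print(quota)
```

Generating exactly `quota["positive"]` positive and `quota["negative"]` negative fake reviews keeps both the class balance and the sentiment ratio.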

## Scrape Reviews

In [2]:
'''
Source: https://stackoverflow.com/questions/69963743/scraping-all-reviews-of-a-movie-from-rotten-tomato-using-soup
'''
import pandas as pd
import requests
import re
import time

headers = {
    'Referer': 'https://www.rottentomatoes.com/m/notebook/reviews?type=user',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

s = requests.Session()
        
def get_reviews(url):
    """Return all critic reviews for the movie whose reviews page is at `url`."""
    # The movie ID required by the API is embedded in the page source.
    r = s.get(url)
    r.raise_for_status()
    # Non-greedy match so the capture stops at the first '","type'.
    movie_id = re.findall(r'(?<=movieId":")(.*?)(?=","type)', r.text)[0]

    # Use reviews/user instead of criticsReviews/all for user reviews.
    api_url = f"https://www.rottentomatoes.com/napi/movie/{movie_id}/criticsReviews/all"

    payload = {
        'direction': 'next',
        'endCursor': '',
        'startCursor': '',
    }

    review_data = []

    while True:
        r = s.get(api_url, headers=headers, params=payload)
        r.raise_for_status()
        data = r.json()
        page_info = data['pageInfo']
        print(page_info)  # progress indicator

        review_data.extend(data['reviews'])

        if not page_info['hasNextPage']:
            break

        # Advance the cursors so the next request fetches the following page.
        payload['endCursor'] = page_info['endCursor']
        payload['startCursor'] = page_info.get('startCursor') or ''

        time.sleep(1)  # pause between requests to avoid hammering the server

    return review_data

data = get_reviews('https://www.rottentomatoes.com/m/joker_2019/reviews')
df = pd.json_normalize(data)
df.to_csv('critic_reviews.csv')
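
The loop in `get_reviews` follows a cursor-based pagination pattern: request a page, keep its reviews, and pass the returned `endCursor` back until `hasNextPage` is false. The sketch below isolates that pattern so it can be run without network access. The `fetch_all` and `fake_fetch` names and the `PAGES` data are made up for illustration; only the `reviews` / `pageInfo` response shape is taken from the code above.

```python
def fetch_all(fetch_page):
    """Collect every item from a cursor-paginated source.

    `fetch_page(cursor)` is a hypothetical callable returning a dict with
    'reviews' and 'pageInfo' keys, mirroring the response shape used above.
    """
    items = []
    cursor = ""  # an empty cursor requests the first page
    while True:
        page = fetch_page(cursor)
        items.extend(page["reviews"])
        info = page["pageInfo"]
        if not info.get("hasNextPage"):
            return items
        cursor = info["endCursor"]


# Stub data standing in for the HTTP responses (made up for this example).
PAGES = {
    "": {"reviews": ["r1", "r2"],
         "pageInfo": {"hasNextPage": True, "endCursor": "c1"}},
    "c1": {"reviews": ["r3"],
           "pageInfo": {"hasNextPage": False}},
}


def fake_fetch(cursor):
    """Return the stub page for the given cursor."""
    return PAGES[cursor]


all_reviews = fetch_all(fake_fetch)
print(all_reviews)
```

Separating the paging logic from the HTTP call in this way makes the stop condition easy to test before running against the live site.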


In [None]:
import pandas as pd
import html

df = pd.read_csv("critic_reviews_complete.csv")
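# Decode HTML entities (e.g. &#39;) in the review text; missing quotes become
# empty strings. Applying to the column keeps its name, 'quote', in the output.
quotes = df['quote'].fillna('').apply(html.unescape)
# Keep the fresh/rotten flag and the critic's original score.
info = df[['isFresh', 'scoreOri']]

cleaned = pd.concat([info, quotes], axis=1)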
cleaned.to_csv("cleaned.csv")
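
To show what the cleaning step does to a single review, here is a small standalone example of `html.unescape` from the standard library. The sample string is invented, and the `unescape_or_empty` wrapper is a hypothetical helper that guards against non-string values (such as the NaN pandas uses for an empty cell), which `html.unescape` would otherwise reject.

```python
import html

# Invented sample text containing HTML entities.
raw = "It&#39;s &quot;bleak&quot; &amp; brilliant"
clean = html.unescape(raw)
print(clean)  # It's "bleak" & brilliant


def unescape_or_empty(value):
    """Unescape strings; map anything else (e.g. NaN) to an empty string."""
    return html.unescape(value) if isinstance(value, str) else ""


print(repr(unescape_or_empty(float("nan"))))  # ''
```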
