## Web scraping Rotten Tomatoes data, then cleaning and pre-processing it to predict audience scores (on TPU)

In this Jupyter Notebook I scrape Rotten Tomatoes to collect as much review and score data as possible. I then clean and pre-process this data and try to predict audience scores using BERT.

### Initial setup

First I import the relevant libraries and initialize the TPU.

In [2]:
import tensorflow as tf
import logging
from tensorflow.keras.layers import (
    Dense,
    Flatten,
    Conv1D,
    Dropout,
    Input,
)
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import Model
from tensorflow.keras import regularizers
from transformers import BertTokenizer, TFBertModel
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from tqdm import tqdm
tqdm.pandas()
import re
import random

In [3]:
os.environ["WANDB_API_KEY"] = "0" ## to silence warning

In [4]:
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
except ValueError:
    strategy = tf.distribute.get_strategy() # for CPU and single GPU
print('Number of replicas:', strategy.num_replicas_in_sync)

### Web-scraping Rotten Tomatoes data

Before scraping, I first collected as many film URLs (movie_urls) as I could. Initially, I tried to collect at least 4 films for each Tomatometer score. However, frequency checks soon showed that the audience ratings were strongly imbalanced, with around 30% of the ratings being 5 stars and very few 1-star and 2-star ratings. To reduce the imbalance, I replaced some of the top films with bottom films that had a higher number of reviews. I collected a total of 230 films.

In [34]:
movie_urls = {'Wonder Woman 1984': 'https://www.rottentomatoes.com/m/wonder_woman_1984/reviews?type=user', 
              'Soul':'https://www.rottentomatoes.com/m/soul_2020/reviews?type=user', 
              'Mulan': 'https://www.rottentomatoes.com/m/mulan_2020/reviews?type=user',
              'Birds of Prey': 'https://www.rottentomatoes.com/m/birds_of_prey_2020/reviews?type=user',
              'Sonic': 'https://www.rottentomatoes.com/m/sonic_the_hedgehog_2020/reviews?type=user'
             }

movie_urls["Captain Marvel"] = "https://www.rottentomatoes.com/m/captain_marvel/reviews?type=user"
movie_urls["Lion King"] = "https://www.rottentomatoes.com/m/the_lion_king_2019/reviews?type=user"
movie_urls["Aladdin"] = "https://www.rottentomatoes.com/m/aladdin/reviews?type=user"
movie_urls["Joker"] = "https://www.rottentomatoes.com/m/joker_2019/reviews?type=user"
movie_urls["Shazam!"] = "https://www.rottentomatoes.com/m/shazam/reviews?type=user"
movie_urls["Godzilla: King of the Monsters"] = "https://www.rottentomatoes.com/m/godzilla_king_of_the_monsters_2019/reviews?type=user"

movie_urls['Tenet'] = 'https://www.rottentomatoes.com/m/tenet/reviews?type=user'
movie_urls['Scoob'] = 'https://www.rottentomatoes.com/m/scoob/reviews?type=user'
movie_urls['The Marksman'] = 'https://www.rottentomatoes.com/m/the_marksman_2021/reviews?type=user'
movie_urls['Artemis Fowl'] = 'https://www.rottentomatoes.com/m/artemis_fowl/reviews?type=user'
movie_urls['Lovebirds'] = 'https://www.rottentomatoes.com/m/the_lovebirds_2020/reviews?type=user'

movie_urls["DISASTER MOVIE"] ='https://www.rottentomatoes.com/m/disaster_movie/reviews?type=user'
movie_urls['KEEPING UP WITH THE STEINS'] = 'https://www.rottentomatoes.com/m/keeping_up_with_the_steins/reviews?type=user'
movie_urls['DIABOLIQUE'] = 'https://www.rottentomatoes.com/m/1069985-diabolique/reviews?type=user'
movie_urls['THE WHITE HAIRED WITCH OF LUNAR KINGDOM'] = 'https://www.rottentomatoes.com/m/the_white_haired_witch_of_lunar_kingdom_3d/reviews?type=user'
movie_urls['INFERNO'] = 'https://www.rottentomatoes.com/m/1095979-inferno/reviews?type=user'
movie_urls["KICKIN' IT OLD SKOOL"] = 'https://www.rottentomatoes.com/m/kickin_it_old_school/reviews?type=user'
movie_urls["BIG MAMMA'S BOY"] = 'https://www.rottentomatoes.com/m/big_mammas_boy/reviews?type=user'
movie_urls['RENEGADES'] = 'https://www.rottentomatoes.com/m/1017405-renegades/reviews?type=user'
movie_urls['NEVER TALK TO STRANGERS'] = 'https://www.rottentomatoes.com/m/never_talk_to_strangers/reviews?type=user'
movie_urls['DR. HECKYL AND MR. HYPE'] = 'https://www.rottentomatoes.com/m/dr_heckyl_and_mr_hype/reviews?type=user'
movie_urls['WHISPERS IN THE DARK'] = 'https://www.rottentomatoes.com/m/whispers_in_the_dark/reviews?type=user'
movie_urls['LIVE WIRE'] ='https://www.rottentomatoes.com/m/1040204-live_wire/reviews?type=user'
movie_urls['NATURE OF THE BEAST'] ='https://www.rottentomatoes.com/m/1061646-nature_of_the_beast/reviews?type=user'
movie_urls['THE RIVERMAN'] ='https://www.rottentomatoes.com/m/the-riverman/reviews?type=user'
movie_urls['SCANNER COP'] ='https://www.rottentomatoes.com/m/scanner_cop/reviews?type=user'
movie_urls['VAMPIRES SUCK'] ='https://www.rottentomatoes.com/m/vampires_suck/reviews?type=user'
movie_urls['THE LOVE GURU'] ='https://www.rottentomatoes.com/m/love_guru/reviews?type=user'
movie_urls["YOU DON'T MESS WITH THE ZOHAN"] ='https://www.rottentomatoes.com/m/you_dont_mess_with_the_zohan/reviews?type=user'
movie_urls['BLACK AND WHITE'] ='https://www.rottentomatoes.com/m/1090027-black_and_white/reviews?type=user'
movie_urls['POUND OF FLESH'] ='https://www.rottentomatoes.com/m/pound_of_flesh_2015/reviews?type=user'
movie_urls['THE FIRST PURGE'] ='https://www.rottentomatoes.com/m/the_first_purge/reviews?type=user'
movie_urls['CELL'] ='https://www.rottentomatoes.com/m/cell_2016/reviews?type=user'
movie_urls['THE GALLOWS'] ='https://www.rottentomatoes.com/m/the_gallows/reviews?type=user'
movie_urls['A GOOD DAY TO DIE HARD'] ='https://www.rottentomatoes.com/m/a_good_day_to_die_hard/reviews?type=user'
movie_urls['NORM OF THE NORTH'] ='https://www.rottentomatoes.com/m/norm_of_the_north/reviews?type=user'
movie_urls['SHOW DOGS'] ='https://www.rottentomatoes.com/m/show_dogs/reviews?type=user'
movie_urls['SHERLOCK GNOMES'] ='https://www.rottentomatoes.com/m/sherlock_gnomes/reviews?type=user'
movie_urls['THE HOUSE'] ='https://www.rottentomatoes.com/m/the_house_2017/reviews?type=user'
movie_urls['COLLIDE'] ='https://www.rottentomatoes.com/m/collide_2017/reviews?type=user'
movie_urls['THE COLD LIGHT OF DAY'] ='https://www.rottentomatoes.com/m/the_cold_light_of_day/reviews?type=user'
movie_urls['EMMANUELLE'] ='https://www.rottentomatoes.com/m/emmanuelle/reviews?type=user'
movie_urls["ARGENTO'S DRACULA"] ='https://www.rottentomatoes.com/m/dracula_3d/reviews?type=user'
movie_urls['KISS OF THE DAMNED'] ='https://www.rottentomatoes.com/m/kiss_of_the_damned_2012/reviews?type=user'
movie_urls['UNFORGETTABLE'] ='https://www.rottentomatoes.com/m/unforgettable_2017/reviews?type=user'
movie_urls['AFTERMATH'] ='https://www.rottentomatoes.com/m/aftermath_2017/reviews?type=user'
movie_urls['THE WOMAN IN BLACK 2: ANGEL OF DEATH'] ='https://www.rottentomatoes.com/m/the_woman_in_black_2_angel_of_death/reviews?type=user'
movie_urls['THE LAST DAYS ON MARS'] ='https://www.rottentomatoes.com/m/the_last_days_on_mars/reviews?type=user'
movie_urls['EXPOSED'] ='https://www.rottentomatoes.com/m/exposed_2016/reviews?type=user'
movie_urls['REVENGE OF THE GREEN DRAGONS'] ='https://www.rottentomatoes.com/m/revenge_of_the_green_dragons/reviews?type=user'
movie_urls['PAY THE GHOST'] ='https://www.rottentomatoes.com/m/pay_the_ghost/reviews?type=user'
movie_urls['AMERICAN HEIST'] ='https://www.rottentomatoes.com/m/american_heist/reviews?type=user'
movie_urls['THE QUIET ONES'] ='https://www.rottentomatoes.com/m/the_quiet_ones_2013/reviews?type=user'
movie_urls['RETURN TO SENDER'] ='https://www.rottentomatoes.com/m/return_to_sender_2015/reviews?type=user'
movie_urls['HICK'] ='https://www.rottentomatoes.com/m/hick_2012/reviews?type=user'
movie_urls['BAD ASS'] ='https://www.rottentomatoes.com/m/bad_ass_2012/reviews?type=user'
movie_urls['SON OF SARDAAR'] ='https://www.rottentomatoes.com/m/son_of_sardaar_2012/reviews?type=user'
movie_urls['THE CIRCLE'] ='https://www.rottentomatoes.com/m/the_circle_2017/reviews?type=user'
movie_urls['KAZAAM'] ='https://www.rottentomatoes.com/m/kazaam/reviews?type=user'
movie_urls['BENEATH THE DARKNESS'] ='https://www.rottentomatoes.com/m/beneath_the_darkness/reviews?type=user'
movie_urls['THE FALLING'] ='https://www.rottentomatoes.com/m/the_falling_2014/reviews?type=user'
movie_urls['Miss Congeniality'] ='https://www.rottentomatoes.com/m/miss_congeniality/reviews?type=user'
movie_urls['A Night at the Roxbury'] ='https://www.rottentomatoes.com/m/night_at_the_roxbury/reviews?type=user'
movie_urls['Cloverfield'] ='https://www.rottentomatoes.com/m/cloverfield/reviews?type=user'
movie_urls['The Girl Who Played with Fire'] ='https://www.rottentomatoes.com/m/girl_who_played_with_fire/reviews?type=user'
movie_urls['The First Wives Club'] ='https://www.rottentomatoes.com/m/first_wives_club/reviews?type=user'
movie_urls["Nick and Norah's Infinite Playlist"] ='https://www.rottentomatoes.com/m/nick_and_norahs_infinite_playlist/reviews?type=user'
movie_urls['The Road to El Dorado'] ='https://www.rottentomatoes.com/m/road_to_el_dorado/reviews?type=user'
movie_urls['Joe Dirt'] ='https://www.rottentomatoes.com/m/joe_dirt/reviews?type=user'
movie_urls['The Other Boleyn Girl'] ='https://www.rottentomatoes.com/m/other_boleyn_girl/reviews?type=user'
movie_urls['Bangkok Dangerous'] ='https://www.rottentomatoes.com/m/bangkok_dangerous_1999/reviews?type=user'
movie_urls["Everything's Gone Green"] ='https://www.rottentomatoes.com/m/everythings_gone_green/reviews?type=user'
movie_urls['Chasing Liberty'] ='https://www.rottentomatoes.com/m/chasing_liberty/reviews?type=user'
movie_urls['Derailed'] ='https://www.rottentomatoes.com/m/derailed/reviews?type=user'
movie_urls['Bruce Almighty'] ='https://www.rottentomatoes.com/m/bruce_almighty/reviews?type=user'
movie_urls['Good Luck Chuck'] ='https://www.rottentomatoes.com/m/good_luck_chuck/reviews?type=user'
movie_urls['Cheaper by the Dozen'] ='https://www.rottentomatoes.com/m/cheaper_by_the_dozen/reviews?type=user'
movie_urls['Head over Heels'] ='https://www.rottentomatoes.com/m/head_over_heels_2001/reviews?type=user'
movie_urls['8 MIle'] ='https://www.rottentomatoes.com/m/8_mile/reviews?type=user'
movie_urls['Employee of the Month'] ='https://www.rottentomatoes.com/m/1141892_employee_of_the_month/reviews?type=user'
movie_urls['Love & Other Drugs'] ='https://www.rottentomatoes.com/m/love_and_other_drugs/reviews?type=user'
movie_urls['Drive Me Crazy'] ='https://www.rottentomatoes.com/m/drive_me_crazy/reviews?type=user'
movie_urls['Failure to Launch'] ='https://www.rottentomatoes.com/m/failure_to_launch/reviews?type=user'
movie_urls['Shrek the Third'] ='https://www.rottentomatoes.com/m/shrek_3/reviews?type=user'
movie_urls['No Strings Attached'] ='https://www.rottentomatoes.com/m/no_strings_attached_2011/reviews?type=user'
movie_urls['The Bounty Hunter'] ='https://www.rottentomatoes.com/m/1220551_bounty_hunter/reviews?type=user'
movie_urls['Beowulf'] ='https://www.rottentomatoes.com/m/beowulf/reviews?type=user'
movie_urls['The Seventh Sign'] ='https://www.rottentomatoes.com/m/seventh_sign/reviews?type=user'
movie_urls['The Bedroom Window'] ='https://www.rottentomatoes.com/m/bedroom_window/reviews?type=user'
movie_urls['Curly Sue'] ='https://www.rottentomatoes.com/m/curly_sue/reviews?type=user'
movie_urls['The Postman'] ='https://www.rottentomatoes.com/m/postman/reviews?type=user'
movie_urls['Confessions of a Shopaholic'] ='https://www.rottentomatoes.com/m/confessions_of_a_shopaholic/reviews?type=user'
movie_urls['Along Came Polly'] ='https://www.rottentomatoes.com/m/along_came_polly/reviews?type=user'
movie_urls['Lara Croft: Tomb Raider'] ='https://www.rottentomatoes.com/m/lara_croft_tomb_raider/reviews?type=user'
movie_urls['Catch That Kid'] ='https://www.rottentomatoes.com/m/catch_that_kid/reviews?type=user'
movie_urls['Material Girls'] ='https://www.rottentomatoes.com/m/material_girls/reviews?type=user'
movie_urls['In the Army Now'] ='https://www.rottentomatoes.com/m/in_the_army_now/reviews?type=user'
movie_urls['Lawless Range'] ='https://www.rottentomatoes.com/m/lawless_range/reviews?type=user'
movie_urls['Maid in Manhattan'] ='https://www.rottentomatoes.com/m/maid_in_manhattan/reviews?type=user'
movie_urls['You Again'] ='https://www.rottentomatoes.com/m/you_again/reviews?type=user'
movie_urls['Sex and the City 2'] ='https://www.rottentomatoes.com/m/sex_and_the_city_2/reviews?type=user'
movie_urls['Dinner for Schmucks'] ='https://www.rottentomatoes.com/m/dinner_for_schmucks/reviews?type=user'
movie_urls['House of Wax'] ='https://www.rottentomatoes.com/m/house_of_wax_2005/reviews?type=user'
movie_urls['Down to Earth'] ='https://www.rottentomatoes.com/m/1104813-down_to_earth/reviews?type=user'
movie_urls["Charlie's Angels: Full Throttle"] ='https://www.rottentomatoes.com/m/charlies_angels_full_throttle/reviews?type=user'
movie_urls['Blame it on Rio'] ='https://www.rottentomatoes.com/m/blame_it_on_rio/reviews?type=user'
movie_urls['Class Reunion'] ='https://www.rottentomatoes.com/m/national_lampoons_class_reunion/reviews?type=user'
movie_urls["Max Keeble's Big Move"] ='https://www.rottentomatoes.com/m/max_keebles_big_move/reviews?type=user'
movie_urls['Staying Alive'] ='https://www.rottentomatoes.com/m/staying_alive/reviews?type=user'
movie_urls['Coneheads'] ='https://www.rottentomatoes.com/m/coneheads/reviews?type=user'
movie_urls['Firewall'] ='https://www.rottentomatoes.com/m/firewall/reviews?type=user'
movie_urls['Love Happens'] ='https://www.rottentomatoes.com/m/love_happens/reviews?type=user'
movie_urls['The Scorpion King'] ='https://www.rottentomatoes.com/m/scorpion_king/reviews?type=user'
movie_urls['10,000 BC'] ='https://www.rottentomatoes.com/m/10000_bc/reviews?type=user'
movie_urls['Quicksilver'] ='https://www.rottentomatoes.com/m/quicksilver/reviews?type=user'
movie_urls['Space Jam (1996)'] ='https://www.rottentomatoes.com/m/space_jam/reviews?type=user'
movie_urls['The Crew'] ='https://www.rottentomatoes.com/m/1099659-crew/reviews?type=user'
movie_urls['Final Destination (2000)'] ='https://www.rottentomatoes.com/m/final_destination/reviews?type=user'
movie_urls['Little Fockers'] ='https://www.rottentomatoes.com/m/little_fockers/reviews?type=user'
movie_urls['Ace Ventura: When Nature Calls (1995)'] ='https://www.rottentomatoes.com/m/ace_ventura_when_nature_calls/reviews?type=user'
movie_urls['The Butterfly Effect (2004)'] ='https://www.rottentomatoes.com/m/butterfly_effect/reviews?type=user'
movie_urls['Balls of Fury'] ='https://www.rottentomatoes.com/m/balls_of_fury/reviews?type=user'
movie_urls['Wet Hot American Summer (2001)'] ='https://www.rottentomatoes.com/m/wet_hot_american_summer/reviews?type=user'
movie_urls['My Super Ex-Girlfriend'] ='https://www.rottentomatoes.com/m/my_super_ex_girlfriend/reviews?type=user'
movie_urls['White Noise'] ='https://www.rottentomatoes.com/m/white_noise/reviews?type=user'
movie_urls["Say it isn't so"] ='https://www.rottentomatoes.com/m/say_it_isnt_so/reviews?type=user'
movie_urls['Hook (1991)'] ='https://www.rottentomatoes.com/m/hook/reviews?type=user'
movie_urls['Thinner'] ='https://www.rottentomatoes.com/m/thinner/reviews?type=user'
movie_urls['Heavy Weights'] ='https://www.rottentomatoes.com/m/heavyweights/reviews?type=user'
movie_urls['The Stepford Wives'] ='https://www.rottentomatoes.com/m/stepford_wives/reviews?type=user'
movie_urls['The Jungle Book 2'] ='https://www.rottentomatoes.com/m/jungle_book_2/reviews?type=user'
movie_urls['Vampires Suck'] ='https://www.rottentomatoes.com/m/vampires_suck/reviews?type=user'
movie_urls['Johnny Be Good'] ='https://www.rottentomatoes.com/m/johnny_be_good/reviews?type=user'
movie_urls['Home on the Range'] ='https://www.rottentomatoes.com/m/home_on_the_range/reviews?type=user'
movie_urls['Not Another Teen Movie (2001)'] ='https://www.rottentomatoes.com/m/not_another_teen_movie/reviews?type=user'
movie_urls['The Bachelor'] ='https://www.rottentomatoes.com/m/1093976-bachelor/reviews?type=user'
movie_urls['The Bad News Bears Go to Japan'] ='https://www.rottentomatoes.com/m/bad_news_bears_go_to_japan/reviews?type=user'
movie_urls["Look Who's Talking Now"] ='https://www.rottentomatoes.com/m/look_whos_talking_now/reviews?type=user'
movie_urls['Joe somebody'] ='https://www.rottentomatoes.com/m/joe_somebody/reviews?type=user'
movie_urls['Chain Reaction'] ='https://www.rottentomatoes.com/m/1072457_chain_reaction/reviews?type=user'
movie_urls["It's Pat"] ='https://www.rottentomatoes.com/m/its_pat/reviews?type=user'
movie_urls['Supergirl'] ='https://www.rottentomatoes.com/m/supergirl/reviews?type=user'
movie_urls['National Lampoon Presents Dorm Daze'] ='https://www.rottentomatoes.com/m/dorm_daze/reviews?type=user'
movie_urls['Halloween (2007)'] ='https://www.rottentomatoes.com/m/halloween_2018/reviews?type=user'
movie_urls['Home Alone 2: Lost in New York (1992)'] ='https://www.rottentomatoes.com/m/home_alone_2_lost_in_new_york/reviews?type=user'
movie_urls["JOHN CARPENTER'S Ghosts of Mars"] ='https://www.rottentomatoes.com/m/john_carpenters_ghosts_of_mars/reviews?type=user'
movie_urls['Bad Boys II (2003)'] ='https://www.rottentomatoes.com/m/bad_boys_ii/reviews?type=user'
movie_urls['The Boondock Saints (1999)'] ='https://www.rottentomatoes.com/m/boondock_saints/reviews?type=user'
movie_urls['Dungeons & Dragons'] ='https://www.rottentomatoes.com/m/dungeons_and_dragons/reviews?type=user'
movie_urls["Dude, Where's My Car? (2000)"] ='https://www.rottentomatoes.com/m/dude_wheres_my_car/reviews?type=user'
movie_urls['Fantastic Four (2015)'] ='https://www.rottentomatoes.com/m/fantastic_four/reviews?type=user'
movie_urls['Inspector Gadget'] ='https://www.rottentomatoes.com/m/inspector_gadget/reviews?type=user'
movie_urls['Catwoman'] ='https://www.rottentomatoes.com/m/catwoman/reviews?type=user'
movie_urls['Jingle All The Way (1996)'] ='https://www.rottentomatoes.com/m/jingle_all_the_way/reviews?type=user'
movie_urls['Caddyshack II'] ='https://www.rottentomatoes.com/m/caddyshack_2/reviews?type=user'
movie_urls["KING'S RANSOM"] ='https://www.rottentomatoes.com/m/kings_ransom/reviews?type=user'
movie_urls['Jaws: The Revenge'] ='https://www.rottentomatoes.com/m/jaws_the_revenge/reviews?type=user'

movie_urls['REBOUND'] ='https://www.rottentomatoes.com/m/rebound/reviews?type=user'
movie_urls['THE DUKES OF HAZZARD'] ='https://www.rottentomatoes.com/m/dukes_of_hazzard/reviews?type=user'
movie_urls['IN THE MIX'] ='https://www.rottentomatoes.com/m/in_the_mix/reviews?type=user'
movie_urls['THE KING AND I'] ='https://www.rottentomatoes.com/m/1087348-king_and_i/reviews?type=user'
movie_urls['THE LEGEND OF JOHNNY LINGO'] ='https://www.rottentomatoes.com/m/legend_of_johnny_lingo/reviews?type=user'
movie_urls['BEAUTY AND THE BEAST: THE ENCHANTED CHRISTMAS'] ='https://www.rottentomatoes.com/m/beauty-and-the-beast-the-enchanted-christmas/reviews?type=user'
movie_urls['STEALTH'] ='https://www.rottentomatoes.com/m/1146673-1146673-stealth/reviews?type=user'
movie_urls['THE MAN'] ='https://www.rottentomatoes.com/m/the_man/reviews?type=user'
movie_urls['ARE WE THERE YET?'] ='https://www.rottentomatoes.com/m/1141102-are_we_there_yet/reviews?type=user'
movie_urls['VENOM?'] ='https://www.rottentomatoes.com/m/1151780-venom/reviews?type=user'
movie_urls['THE WEDDING DATE'] ='https://www.rottentomatoes.com/m/wedding_date/reviews?type=user'
movie_urls['ELEKTRA'] ='https://www.rottentomatoes.com/m/elektra/reviews?type=user'

movie_urls['HOUSE OF D'] ='https://www.rottentomatoes.com/m/house_of_d/reviews?type=user'
movie_urls["WEEKEND AT BERNIE'S II"] ='https://www.rottentomatoes.com/m/weekend_at_bernies_ii/reviews?type=user'
movie_urls['CROCODILE DUNDEE II'] ='https://www.rottentomatoes.com/m/crocodile_dundee_2/reviews?type=user'
movie_urls['JOE DIRT'] ='https://www.rottentomatoes.com/m/joe_dirt/reviews?type=user'

movie_urls['THE ICE PIRATES'] ='https://www.rottentomatoes.com/m/ice_pirates/reviews?type=user'
movie_urls['THE SKULLS'] ='https://www.rottentomatoes.com/m/skulls/reviews?type=user'
movie_urls['CATWOMAN'] ='https://www.rottentomatoes.com/m/catwoman/reviews?type=user'
movie_urls['MOM AND DAD SAVE THE WORLD'] ='https://www.rottentomatoes.com/m/mom_and_dad_save_the_world/reviews?type=user'

movie_urls['FIRST DAUGHTER'] ='https://www.rottentomatoes.com/m/first_daughter/reviews?type=user'
movie_urls['NIGHT OF THE LEPUS'] ='https://www.rottentomatoes.com/m/night_of_the_lepus/reviews?type=user'
movie_urls['IT TAKES TWO'] ='https://www.rottentomatoes.com/m/1067137-it_takes_two/reviews?type=user'
movie_urls['KING DAVID'] ='https://www.rottentomatoes.com/m/king_david/reviews?type=user'

movie_urls['SURVIVING CHRISTMAS'] ='https://www.rottentomatoes.com/m/surviving_christmas/reviews?type=user'
movie_urls["CAN'T STOP THE MUSIC"] ='https://www.rottentomatoes.com/m/cant_stop_the_music/reviews?type=user'
movie_urls['WHITE NOISE'] ='https://www.rottentomatoes.com/m/white_noise/reviews?type=user'
movie_urls['THE NEW GUY'] ='https://www.rottentomatoes.com/m/1112617-new_guy/reviews?type=user'

movie_urls['IN THE ARMY NOW'] ='https://www.rottentomatoes.com/m/in_the_army_now/reviews?type=user'
movie_urls['SON OF THE MASK'] ='https://www.rottentomatoes.com/m/son_of_the_mask/reviews?type=user'
movie_urls['JOHNSON FAMILY VACATION'] ='https://www.rottentomatoes.com/m/johnson_family_vacation/reviews?type=user'
movie_urls['GLITTER'] ='https://www.rottentomatoes.com/m/glitter/reviews?type=user'

movie_urls['SUPERCROSS: THE MOVIE'] ='https://www.rottentomatoes.com/m/supercross/reviews?type=user'
movie_urls['CHRISTMAS WITH THE KRANKS'] ='https://www.rottentomatoes.com/m/christmas_with_the_kranks/reviews?type=user'
movie_urls['MAJOR LEAGUE II'] ='https://www.rottentomatoes.com/m/major_league_2/reviews?type=user'
movie_urls['YU-GI-OH!'] ='https://www.rottentomatoes.com/m/yu_gi_oh_the_movie/reviews?type=user'

movie_urls['THE FOG'] ='https://www.rottentomatoes.com/m/fog/reviews?type=user'
movie_urls['THE BRIDGE OF SAN LUIS REY'] ='https://www.rottentomatoes.com/m/10002635-bridge_of_san_luis_rey/reviews?type=user'
movie_urls['MORTAL KOMBAT ANNIHILATION'] ='https://www.rottentomatoes.com/m/mortal_kombat_annihilation/reviews?type=user'
movie_urls['BIO-DOME'] ='https://www.rottentomatoes.com/m/biodome/reviews?type=user'

movie_urls["MCHALE'S NAVY"] ='https://www.rottentomatoes.com/m/1076097-mchales_navy/reviews?type=user'
movie_urls['DOWN TO YOU'] ='https://www.rottentomatoes.com/m/down_to_you/reviews?type=user'
movie_urls['BATTLEFIELD EARTH'] ='https://www.rottentomatoes.com/m/battlefield_earth/reviews?type=user'
movie_urls['HALF PAST DEAD'] ='https://www.rottentomatoes.com/m/half_past_dead/reviews?type=user'

movie_urls["BABY GENIUSES"] ='https://www.rottentomatoes.com/m/baby_geniuses/reviews?type=user'
movie_urls['THE IN CROWD'] ='https://www.rottentomatoes.com/m/1098652-in_crowd/reviews?type=user'
movie_urls['TEXAS RANGERS'] ='https://www.rottentomatoes.com/m/1111103-texas_rangers/reviews?type=user'
movie_urls['CROSSOVER'] ='https://www.rottentomatoes.com/m/crossover/reviews?type=user'

movie_urls["THE MASTER OF DISGUISE"] ='https://www.rottentomatoes.com/m/master_of_disguise/reviews?type=user'
movie_urls['ALONE IN THE DARK'] ='https://www.rottentomatoes.com/m/alone_in_the_dark/reviews?type=user'
movie_urls['TWISTED'] ='https://www.rottentomatoes.com/m/twisted/reviews?type=user'
movie_urls['DADDY DAY CAMP'] ='https://www.rottentomatoes.com/m/daddy_day_camp/reviews?type=user'

movie_urls["MULAN II"] ='https://www.rottentomatoes.com/m/mulan_ii/reviews?type=user'
movie_urls['POLAROID'] ='https://www.rottentomatoes.com/m/polaroid_2019/reviews?type=user'
movie_urls['ALL I WANT FOR CHRISTMAS'] ='https://www.rottentomatoes.com/m/1039460-all_i_want_for_christmas/reviews?type=user'
movie_urls['PINOCCHIO'] ='https://www.rottentomatoes.com/m/pinocchio/reviews?type=user'

movie_urls["DRIVE"] ='https://www.rottentomatoes.com/m/drive_2019/reviews?type=user'
movie_urls["WELCOME TO CURIOSITY"] ='https://www.rottentomatoes.com/m/welcome_to_curiosity/reviews?type=user'
movie_urls["THE INFLUENCE (LA INFLUENCIA)"] ='https://www.rottentomatoes.com/m/the_influence_2019/reviews?type=user'
movie_urls["10 MINUTES GONE"] ='https://www.rottentomatoes.com/m/10_minutes_gone/reviews?type=user'

movie_urls["STRANGE WILDERNESS"] ='https://www.rottentomatoes.com/m/strange_wilderness/reviews?type=user'
movie_urls["THE HAUNTING OF MOLLY HARTLEY"] ='https://www.rottentomatoes.com/m/haunting_of_molly_hartley/reviews?type=user'
movie_urls["SEMI-PRO"] ='https://www.rottentomatoes.com/m/semi_pro/reviews?type=user'
movie_urls["MEET THE SPARTANS"] ='https://www.rottentomatoes.com/m/meet_the_spartans/reviews?type=user'

movie_urls["THE VOYEURS"] ='https://www.rottentomatoes.com/m/the_voyeurs_2021/reviews?type=user'
movie_urls["SGT. PEPPER'S LONELY HEARTS CLUB BAND"] ='https://www.rottentomatoes.com/m/sgt-peppers-lonely-hearts-club-band/reviews?type=user'
movie_urls["RHINESTONE"] ='https://www.rottentomatoes.com/m/rhinestone/reviews?type=user'
movie_urls["JOHNNY RENO"] ='https://www.rottentomatoes.com/m/johnny_reno/reviews?type=user'


movie_urls.keys()

After collecting the 230 URLs in a dictionary, I can start the web scraping.
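
Several films appear twice in the dictionary under differently cased titles (for example 'Joe Dirt' and 'JOE DIRT'), so the same URL would be requested twice. A minimal sketch of one way to drop repeated URLs before scraping; the helper name `unique_urls` and the three-entry sample dictionary are illustrative assumptions, not part of the original notebook:

```python
def unique_urls(urls_by_title):
    """Keep the first title seen for each URL (hypothetical helper)."""
    seen = set()
    deduped = {}
    for title, url in urls_by_title.items():
        key = url.strip()  # ignore stray whitespace around the URL
        if key not in seen:
            seen.add(key)
            deduped[title] = key
    return deduped

# Small illustrative sample; the real dictionary is movie_urls
sample = {
    "Joe Dirt": "https://www.rottentomatoes.com/m/joe_dirt/reviews?type=user",
    "Catwoman": "https://www.rottentomatoes.com/m/catwoman/reviews?type=user",
    "JOE DIRT": "https://www.rottentomatoes.com/m/joe_dirt/reviews?type=user",
}
deduped_sample = unique_urls(sample)
print(len(sample), "->", len(deduped_sample))
```

Applied to the full dictionary, this would be `movie_urls = unique_urls(movie_urls)`.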

In [None]:
import ast
import re
import requests


def get_reviews(movie_id, end_cursor):
    # Fetch one page of user reviews; end_cursor is "" for the first page
    r = requests.get(
        f"https://www.rottentomatoes.com/napi/movie/{movie_id}/reviews/user",
        params={"direction": "next", "endCursor": end_cursor, "startCursor": ""},
    )
    return r.json()


reviews = []
for url in movie_urls.values():
    r = requests.get(url)
    try:
        # The review page embeds a JavaScript object that holds the movie ID
        data = re.search(r'movieReview\s=\s(.*);', r.text).group(1)
        # Turn JavaScript literals into Python ones, then parse without eval()
        data = data.replace("true", "True").replace("false", "False").replace("undefined", "None")
        movie_id = ast.literal_eval(data)["movieId"]
        end_cursor = ""
        for _ in range(250):  # at most 250 pages per film
            result = get_reviews(movie_id, end_cursor)
            reviews.extend(result["reviews"])
            end_cursor = result["pageInfo"]["endCursor"]
    except Exception:
        # Move on to the next film on any error (for example, no more pages)
        pass

print(f"got {len(reviews)} reviews")

After web scraping, I obtained 285930 reviews.
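
The scraping loop relies on cursor-based pagination: each response carries an `endCursor` that is sent with the next request. A minimal sketch of that pattern with an explicit stop condition, using a stand-in `fetch_page` function instead of real HTTP requests. The `hasNextPage` field, the helper names, and the fake pages are assumptions made for illustration and do not describe the actual Rotten Tomatoes API:

```python
def collect_all(fetch_page, max_pages=250):
    """Follow cursors until the last page or max_pages is reached."""
    items, cursor = [], ""
    for _ in range(max_pages):
        page = fetch_page(cursor)
        items.extend(page["reviews"])
        info = page.get("pageInfo", {})
        if not info.get("hasNextPage"):  # assumed field name
            break
        cursor = info["endCursor"]
    return items

# Fake two-page "API" keyed by cursor, for demonstration only
FAKE_PAGES = {
    "": {"reviews": ["r1", "r2"], "pageInfo": {"hasNextPage": True, "endCursor": "a"}},
    "a": {"reviews": ["r3"], "pageInfo": {"hasNextPage": False}},
}

def fake_fetch(cursor):
    return FAKE_PAGES[cursor]

all_reviews = collect_all(fake_fetch)
print(all_reviews)
```

Stopping explicitly on the last page avoids relying on an exception to end the loop, which makes genuine errors easier to spot.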


### Cleaning and pre-processing the data

Next, I save the data as CSV, load it into a dataframe, and remove HTML tags, single characters, punctuation, numbers, multiple spaces and emojis. I also undersample every class down to the size of the minority class so that all classes have the same number of observations (perfectly balanced classes). Finally, I tokenize the text into the input IDs that BERT consumes and store the data as TensorFlow datasets.

In [5]:
# hyperparameters
max_length = 200
batch_size = 32
test_size = 0.1
num_class = 5

In [6]:
# Bert Tokenizer
model_name = "bert-base-multilingual-cased"
tokenizer = BertTokenizer.from_pretrained(model_name)

With 10 levels, predicting the exact score is a very hard task. In addition, the audience's choice of words and the scores they award are to a certain extent subjective and inconsistent, which makes score prediction difficult from the start. To make the task more achievable and increase the accuracy of the prediction, I decided to round each score up to the next whole number, subtract 1, and predict the sentiment on only 5 values (0-4), as in the dataset compiled by Kaggle.
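
As a quick check of this mapping, assuming the raw scores run from 0.5 to 5.0 in half-star steps (an assumption that is consistent with the 10 levels mentioned above), rounding up and subtracting 1 gives five labels:

```python
import numpy as np

# Assumed score scale: 0.5, 1.0, ..., 5.0 (ten levels)
scores = np.arange(0.5, 5.5, 0.5)

# Round up to the next whole star, then shift so labels start at 0
labels = (np.ceil(scores) - 1).astype(int)

for score, label in zip(scores, labels):
    print(f"score {score:.1f} -> label {label}")
```

Each label therefore covers two adjacent score levels, for example 4.5 and 5.0 both map to label 4.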

In [7]:
RT_reviews_df = pd.read_csv('../input/rt-reviews/RT_reviews_df_v10.csv')

# Round each score up to a whole number and shift it to the labels 0-4
RT_reviews_df["Sentiment"] = RT_reviews_df["score"].apply(np.ceil) - 1
# Keep only the text and the label, and drop repeated reviews
RT_reviews_df = RT_reviews_df[['review', 'Sentiment']].drop_duplicates(subset=['review'])
RT_reviews_df

In [8]:
import re

def preprocess_text(sen):
  #Removing HTML tags
  sentence=remove_tags(sen)
  #Remove punctuation and numbers
  sentence = re.sub('[^a-zA-Z]', ' ', sentence)
  # Single character removal
  sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)
  # Removing multiple spaces
  sentence = re.sub(r'\s+', ' ', sentence)
  return sentence

TAG_RE=re.compile(r'<[^>]+>')

def remove_tags(sen):
  return TAG_RE.sub(r'',sen)

def deEmojify(sen):
    regrex_pattern = re.compile(pattern = "["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags = re.UNICODE)
    return regrex_pattern.sub(r'',sen)

In [9]:
revs=list(map(preprocess_text,list(RT_reviews_df["review"])))
revs=list(map(remove_tags,revs))
revs=list(map(deEmojify,revs))
print(revs[0:5])
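
To see what the cleaning steps do to a single review, here is a small self-contained sketch that mirrors the regular expressions of `preprocess_text`; the function name `clean` and the sample sentence are made up for illustration. Note that replacing every non-letter with a space already removes digits, punctuation and emojis in one step:

```python
import re

TAG_RE = re.compile(r'<[^>]+>')

def clean(text):
    text = TAG_RE.sub('', text)                  # drop HTML tags
    text = re.sub('[^a-zA-Z]', ' ', text)        # non-letters become spaces
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)  # drop isolated single letters
    return re.sub(r'\s+', ' ', text)             # collapse repeated whitespace

sample = "<b>Great</b> movie!! 10/10 \U0001F600 a must see"
cleaned = clean(sample)
print(cleaned)  # Great movie must see
```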

In [10]:
# Replace the raw reviews with the cleaned text, then split into train and test
RT_reviews_df["review"] = revs
train, test = train_test_split(RT_reviews_df, random_state=1, test_size=test_size)

print(train.shape, test.shape)

In [11]:
train.Sentiment.value_counts()

In [12]:
star_1=train[train["Sentiment"]==0].head(34477)
star_2=train[train["Sentiment"]==1]
star_3=train[train["Sentiment"]==2].head(34477)
star_4=train[train["Sentiment"]==3].head(34477)
star_5=train[train["Sentiment"]==4].head(34477)
train = pd.concat([star_1,star_2,star_3,star_4,star_5],axis=0)
train=train.sample(frac=1).reset_index(drop=True)
train

In [13]:
test.Sentiment.value_counts()

In [14]:
star_1_test=test[test["Sentiment"]==0].head(3772)
star_2_test=test[test["Sentiment"]==1]
star_3_test=test[test["Sentiment"]==2].head(3772)
star_4_test=test[test["Sentiment"]==3].head(3772)
star_5_test=test[test["Sentiment"]==4].head(3772)
test = pd.concat([star_1_test,star_2_test,star_3_test,star_4_test,star_5_test],axis=0)
test=test.sample(frac=1).reset_index(drop=True)
test
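
The two cells above hard-code the minority class size (34477 and 3772). A minimal sketch of the same undersampling idea that reads the size from the data instead, shown on a made-up toy dataframe. Unlike the `head()` calls above, it draws a random sample from each class, so it is an alternative rather than a reproduction of the original cells:

```python
import pandas as pd

# Toy data for illustration: class sizes 4, 2, 3 and 3
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(12)],
    "Sentiment": [0, 0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 3],
})

# Size of the smallest class
n_min = int(toy["Sentiment"].value_counts().min())

balanced = (
    toy.groupby("Sentiment")
    .sample(n=n_min, random_state=1)   # n_min rows from every class
    .sample(frac=1, random_state=1)    # shuffle the rows
    .reset_index(drop=True)
)
print(balanced["Sentiment"].value_counts())
```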

In [15]:
def bert_encode(data):
    # Tokenize the reviews into fixed-length sequences of input IDs
    tokens = tokenizer.batch_encode_plus(
        list(data), max_length=max_length, padding="max_length", truncation=True
    )
    return tf.constant(tokens["input_ids"])

train_encoded = bert_encode(train.review)
test_encoded = bert_encode(test.review)
# One-hot encode the labels 0-4
train_labels = tf.keras.utils.to_categorical(train.Sentiment.values, num_classes=num_class)
test_labels = tf.keras.utils.to_categorical(test.Sentiment.values, num_classes=num_class)
train_dataset = (
    tf.data.Dataset.from_tensor_slices((train_encoded, train_labels))
    .cache()  # cache before shuffling so that every epoch is reshuffled
    .shuffle(100)
    .batch(batch_size)
)
test_dataset = (
    tf.data.Dataset.from_tensor_slices((test_encoded, test_labels))
    .batch(batch_size)  # evaluation data does not need shuffling
    .cache()
)

In [16]:
train_labels.shape

In [17]:
test_labels.shape

In [18]:
!pip install tf-models-official
from official.nlp import optimization

epochs =2 
train_data_size =172385 
validation_data_size=18860
batch_size=32
steps_per_epoch = train_data_size // batch_size
validation_steps = validation_data_size // batch_size
num_train_steps = steps_per_epoch * epochs
num_warmup_steps = int(0.1*num_train_steps)

optimizer = optimization.create_optimizer(init_lr=3e-5,
                                          num_train_steps=num_train_steps,
                                          num_warmup_steps=num_warmup_steps,
                                          optimizer_type='adamw')
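
Working through the numbers in the cell above: the 172385 training rows are the 5 balanced classes of 34477 reviews each, and the 18860 validation rows are 5 classes of 3772. The short check below repeats the same integer arithmetic to show how many optimizer steps and warm-up steps result:

```python
epochs = 2
batch_size = 32
train_data_size = 5 * 34477       # 172385
validation_data_size = 5 * 3772   # 18860

steps_per_epoch = train_data_size // batch_size         # full batches per epoch
validation_steps = validation_data_size // batch_size
num_train_steps = steps_per_epoch * epochs
num_warmup_steps = int(0.1 * num_train_steps)           # first 10% of the steps

print(steps_per_epoch, validation_steps, num_train_steps, num_warmup_steps)
# 5387 589 10774 1077
```

So the learning rate is warmed up over roughly the first 1077 of 10774 training steps.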

### Fine-tuning and running the BERT models

Lastly, I define and fine-tune two BERT models that differ only in the choice of loss function. I assess the accuracy, recall and F1 score for both the model that uses the categorical cross-entropy loss and the model that minimizes the macro F1 loss (defined as 1 - F1 score). I save the best model.

In [19]:
def bert_model():
    bert_encoder = TFBertModel.from_pretrained(model_name, output_attentions=True)
    input_word_ids = Input(
        shape=(max_length,), dtype=tf.int32, name="input_ids"
    )
    last_hidden_states = bert_encoder(input_word_ids)[0]
    clf_output = Flatten()(last_hidden_states)
    output = Dense(num_class, activation="softmax")(clf_output)
    model = Model(inputs=input_word_ids, outputs=output)
    return model

In [20]:
from tensorflow.keras.metrics import Recall
import tensorflow_addons as tfa

with strategy.scope():
    model = bert_model()
    adam_optimizer = Adam(learning_rate=1e-5)
    model.compile(
        loss="categorical_crossentropy", optimizer=optimizer,metrics = ["accuracy",Recall(name='recall'),
                          tfa.metrics.F1Score(num_classes=5,average='macro',name='macro_f1')]
    )
    model.summary()

In [21]:
from numpy.random import seed
seed(1)
tf.random.set_seed(2)

print(f'Training model with {model_name}')

history = model.fit(
    train_dataset,  # already batched, so batch_size is not passed to fit()
    epochs=2,
    validation_data=test_dataset,
    verbose=1,
)

In [22]:
model.save_weights('weights_RT_Bert_model.h5', overwrite=True)

In [23]:
import matplotlib.pyplot as plt

def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()
  
plot_graphs(history, "macro_f1")
plot_graphs(history, "loss")

In [24]:
from tensorflow.keras import backend as K

def macro_f1_loss(true, pred):  # shapes (batch, num_class), here (batch, 5)

    # For a metric, include these two lines; for a loss, leave them out.
    # They round 'pred' to exact zeros and ones, which is not differentiable.
    # predLabels = K.argmax(pred, axis=-1)
    # pred = K.one_hot(predLabels, num_class)

    ground_positives = K.sum(true, axis=0) + K.epsilon()       # = TP + FN
    pred_positives = K.sum(pred, axis=0) + K.epsilon()         # = TP + FP
    true_positives = K.sum(true * pred, axis=0) + K.epsilon()  # = TP
    # all with shape (num_class,)

    precision = true_positives / pred_positives
    recall = true_positives / ground_positives
    # precision = 1 if pred_positives == 0; recall = 1 if ground_positives == 0
    # both with shape (num_class,)

    f1 = 2 * (precision * recall) / (precision + recall + K.epsilon())
    # still with shape (num_class,)

    macro_f1 = tf.reduce_mean(f1)
    return 1 - macro_f1

loss = macro_f1_loss
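
To get a feel for how this loss behaves, here is a NumPy re-implementation of the same formula evaluated on three hand-made prediction matrices. This is an illustrative sketch only: the function name `soft_macro_f1_loss` and the test inputs are assumptions, and `EPS` is set to 1e-7, the default value of `K.epsilon()`:

```python
import numpy as np

EPS = 1e-7  # same role as K.epsilon(): avoids division by zero

def soft_macro_f1_loss(true, pred):
    tp = (true * pred).sum(axis=0) + EPS        # per-class soft true positives
    precision = tp / (pred.sum(axis=0) + EPS)
    recall = tp / (true.sum(axis=0) + EPS)
    f1 = 2 * precision * recall / (precision + recall + EPS)
    return 1 - f1.mean()                        # macro average over classes

true = np.eye(5)  # one example per class, one-hot encoded

perfect = soft_macro_f1_loss(true, true)                      # all correct
wrong = soft_macro_f1_loss(true, np.roll(true, 1, axis=1))    # all incorrect
uniform = soft_macro_f1_loss(true, np.full((5, 5), 0.2))      # undecided model

print(f"perfect={perfect:.3f} wrong={wrong:.3f} uniform={uniform:.3f}")
```

Because the loss works on the predicted probabilities rather than on rounded labels, an undecided model that outputs 0.2 for every class gets a loss of about 0.8, between the two extremes.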

In [25]:
def bert_model_2():
    bert_encoder = TFBertModel.from_pretrained(model_name, output_attentions=True)
    input_word_ids = Input(
        shape=(max_length,), dtype=tf.int32, name="input_ids"
    )
    last_hidden_states = bert_encoder(input_word_ids)[0]
    clf_output = Flatten()(last_hidden_states)
    output = Dense(num_class, activation="softmax")(clf_output)
    model_2 = Model(inputs=input_word_ids, outputs=output)
    return model_2

In [26]:
from tensorflow.keras.metrics import Recall
import tensorflow_addons as tfa

with strategy.scope():
    model_2 = bert_model_2()
    adam_optimizer = Adam(learning_rate=1e-5)
    model_2.compile(
        loss=loss, optimizer=optimizer,metrics = ["accuracy",Recall(name='recall'),
                          tfa.metrics.F1Score(num_classes=5,average='macro',name='macro_f1')]
    )
    model_2.summary()

### Conclusion

The model trained with the categorical cross-entropy loss performed much better, with both accuracy and macro F1 score around 54%. This is well above the Zero Rate classifier, which always predicts the most frequent class and would therefore reach only about 20% accuracy on five balanced classes.