## Scraping IMDb
Reference: https://shravan-kuchkula.github.io/scrape_imdb_movie_reviews/#construct-a-dataframe

In [1]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np
import itertools
from imdbUtils import *
import builtins

pd.options.display.max_colwidth=500

In [2]:
# Search query that selects:
## feature films
## rated at least 4.0
## with at least 50,000 votes
## in the Thriller genre
## sorted by user rating (descending)
## limited to 250 movies
# Adjacent string literals keep the URL on one logical line, with no embedded newline.
url = ('https://www.imdb.com/search/title/?title_type=feature&user_rating=4.0,10.0'
       '&num_votes=50000,&genres=thriller&view=simple&sort=user_rating,desc&count=250')

# get the soup object for main api url
movies_soup = getSoup(url)
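
The filters listed in the comments above correspond to query parameters of the search URL. As a minimal sketch, the same URL can be assembled with the standard library instead of editing one long string by hand. The `build_search_url` helper and its argument names are illustrative assumptions and are not part of `imdbUtils`.

```python
from urllib.parse import urlencode

def build_search_url(min_rating=4.0, min_votes=50000, genre="thriller", count=250):
    """Assemble the IMDb title-search URL (hypothetical helper)."""
    params = {
        "title_type": "feature",
        "user_rating": f"{min_rating},10.0",   # rating range: min..10
        "num_votes": f"{min_votes},",          # open-ended upper bound
        "genres": genre,
        "view": "simple",
        "sort": "user_rating,desc",
        "count": count,
    }
    # safe="," keeps the commas readable instead of encoding them as %2C
    return "https://www.imdb.com/search/title/?" + urlencode(params, safe=",")

url = build_search_url()
print(url)
```

Changing a filter then means changing one argument, for example `build_search_url(genre="horror")`.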

In [4]:
# find all a-tags that have an href but no class attribute
movie_tags = movies_soup.find_all('a', attrs={'class': None}, href=True)

# keep only the links that point to a title page
movie_tags = [tag['href'] for tag in movie_tags
              if tag['href'].startswith('/title') and tag['href'].endswith('/')]
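
The prefix and suffix checks can also be written as a single pattern. The sketch below is slightly stricter than the filter above, because it accepts only `/title/tt<digits>/`. The sample hrefs are illustrative, not scraped data, and `keep_title_links` is a hypothetical helper.

```python
import re

# Matches exactly "/title/tt<digits>/" and nothing longer (e.g. no "reviews" suffix).
TITLE_HREF = re.compile(r"^/title/tt\d+/$")

def keep_title_links(hrefs):
    """Return only the hrefs that point at a title page, in their original order."""
    return [h for h in hrefs if TITLE_HREF.match(h)]

sample = [                      # illustrative hrefs, not scraped data
    "/title/tt0468569/",
    "/title/tt0468569/reviews",
    "/name/nm0634240/",
    "/title/tt1375666/",
]
print(keep_title_links(sample))
```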

In [5]:
print("There are a total of " + str(len(movie_tags)) + " movie titles")
print("Displaying 10 titles")
movie_tags[:10]

There are a total of 500 movie titles
Displaying 10 titles


['/title/tt0468569/',
 '/title/tt0468569/',
 '/title/tt1375666/',
 '/title/tt1375666/',
 '/title/tt6751668/',
 '/title/tt6751668/',
 '/title/tt0114369/',
 '/title/tt0114369/',
 '/title/tt0102926/',
 '/title/tt0102926/']

In [6]:
# remove duplicate links
movie_tags = list(dict.fromkeys(movie_tags))

print("There are a total of " + str(len(movie_tags)) + " movie titles")
print("Displaying 10 titles")
movie_tags[:10]

There are a total of 250 movie titles
Displaying 10 titles


['/title/tt0468569/',
 '/title/tt1375666/',
 '/title/tt6751668/',
 '/title/tt0114369/',
 '/title/tt0102926/',
 '/title/tt0482571/',
 '/title/tt0407887/',
 '/title/tt0114814/',
 '/title/tt0110413/',
 '/title/tt0054215/']
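
`dict.fromkeys` is used here rather than `set` because the order of the links matters: the search results are sorted by rating. A small sketch of the difference, using the first two duplicated links from the output above:

```python
links = ["/title/tt0468569/", "/title/tt0468569/",
         "/title/tt1375666/", "/title/tt1375666/"]

# dict keys are unique and (since Python 3.7) keep insertion order,
# so the ranking order of the search results survives de-duplication.
unique_links = list(dict.fromkeys(links))
print(unique_links)

# set() also removes duplicates but does not guarantee any order.
same_members = set(unique_links) == set(links)
```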

In [7]:
# movie links
base_url = "https://www.imdb.com"
movie_links = [base_url + tag + 'reviews' for tag in movie_tags]
print("There are a total of " + str(len(movie_links)) + " movie user reviews")
print("Displaying 10 user reviews links")
movie_links[:10]

There are a total of 250 movie user reviews
Displaying 10 user reviews links


['https://www.imdb.com/title/tt0468569/reviews',
 'https://www.imdb.com/title/tt1375666/reviews',
 'https://www.imdb.com/title/tt6751668/reviews',
 'https://www.imdb.com/title/tt0114369/reviews',
 'https://www.imdb.com/title/tt0102926/reviews',
 'https://www.imdb.com/title/tt0482571/reviews',
 'https://www.imdb.com/title/tt0407887/reviews',
 'https://www.imdb.com/title/tt0114814/reviews',
 'https://www.imdb.com/title/tt0110413/reviews',
 'https://www.imdb.com/title/tt0054215/reviews']
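
String concatenation works here because every title path ends with a slash. As a hedged alternative, `urllib.parse.urljoin` resolves the pieces explicitly; `review_url` is a hypothetical helper name.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.imdb.com"

def review_url(title_path):
    """Turn '/title/tt0468569/' into the full user-reviews URL."""
    # First resolve the site-relative path against the base, then resolve
    # the relative segment 'reviews' against the trailing-slash title path.
    return urljoin(urljoin(BASE_URL, title_path), "reviews")

print(review_url("/title/tt0468569/"))
```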

In [8]:
# get a list of soup objects
movie_soups = [getSoup(link) for link in movie_links]

# get all 500 movie review links
movie_review_list = [getReviews(movie_soup) for movie_soup in movie_soups]

movie_review_list = list(itertools.chain(*movie_review_list))
print(len(movie_review_list))

print("There are a total of " + str(len(movie_review_list)) + " individual movie reviews")
print("Displaying 10 reviews")
movie_review_list[:10]

500
There are a total of 500 individual movie reviews
Displaying 10 reviews


['https://www.imdb.com/review/rw6513945/',
 'https://www.imdb.com/review/rw6457886/',
 'https://www.imdb.com/review/rw2300362/',
 'https://www.imdb.com/review/rw4692192/',
 'https://www.imdb.com/review/rw5388270/',
 'https://www.imdb.com/review/rw5195256/',
 'https://www.imdb.com/review/rw1097795/',
 'https://www.imdb.com/review/rw0370669/',
 'https://www.imdb.com/review/rw3476006/',
 'https://www.imdb.com/review/rw1198894/']
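
The flattening step turns one list of links per movie into a single list. A minimal sketch with `itertools.chain.from_iterable`, which does the same as `chain(*lists)` without unpacking the outer list. The two-links-per-movie shape is inferred from 250 movies yielding 500 reviews; what `getReviews` actually returns is an assumption.

```python
from itertools import chain

# One inner list per movie, as a getReviews-style helper might return.
per_movie = [
    ["https://www.imdb.com/review/rw6513945/", "https://www.imdb.com/review/rw6457886/"],
    ["https://www.imdb.com/review/rw2300362/", "https://www.imdb.com/review/rw4692192/"],
]

# chain.from_iterable walks the inner lists in order and yields their items.
flat = list(chain.from_iterable(per_movie))
print(len(flat))
```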

In [8]:
# get review text from the review link
review_texts = [getReviewText(url) for url in movie_review_list]

# get movie name from the review link
movie_titles = [getMovieTitle(url) for url in movie_review_list]

# construct a dataframe
df = pd.DataFrame({'movie': movie_titles,
             'user_review': review_texts})

In [None]:
# Next step: build the movie recommender system.
# Personalize the recommendations with cast, director, keywords and genres, taking the
# "soup" column from the Kaggle movie dataset (reference data) and matching on title.


In [9]:
df.head()

Unnamed: 0,movie,user_review
0,The Dark Knight,"If someone else acted as Joker, I would give the movie 7-8 stars. The majority of people ended up loving the villain more than the hero, and that rarely happends in movies.Rest in peace Heath Ledger."
1,The Dark Knight,"Totally one of the greatest movie titles ever made. Everything was great, filming, acting, story. Nothing to complain about"
2,Inception,"I will try not to repeat some of what others have so brilliantly written in some reviews. I just add this in order to contradict the hype that has allowed this movie to be ranked so high in IMDb. The same has been happening with other movies, and that is a shame for IMDb, which is becoming unreliable.I want to stress the fact that the only complexity in this movie is trying to figure out how you can invest so much money in a script that continuously makes a fool of the average critic intelli..."
3,Inception,"My 3rd time watching this movie! Yet, it still stunned my mind, kept me enjoyed its every moment and left me with many thoughts afterward.For someone like me, who've rarely slept without dream, it's so exciting watching how Christopher Nolan had illustrated every single characteristic of dream on the big screen. As it's been done so sophisticatedly, I do believe the rumour that Nolan had spent 10 years to finish the script of Inception. In my opinion, it's been so far the greatest achievemen..."
4,Parasite,"After reading all the glowing reviews, especially about how this film is one of the best of the decade I had to see it for myself- the plot it self wasn't anything ground breaking, and while the technical aspect of the film were flawless and i enjoyed it for the first hour, the ending was a total let down (lots of plot holes that I won't get into here) I agree with some other reviewers, the Park family weren't your typical rich snobs, the father worked for his money and came home home and wa..."


In [12]:
# save the dataframe to a csv file.
df.to_csv('movieReviews_IMDB.csv', index=False)

## Add Sentiment Score

In [13]:
#Create sentiment scores

In [None]:
#pip install transformers

In [None]:
# scipy.special is part of scipy, so it needs no separate pip install

In [None]:
#pip install pandas

In [None]:
#pip install scipy

In [None]:
#pip install torch

In [None]:
#pip install ipywidgets

In [14]:
from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer
import numpy as np
from scipy.special import softmax
import csv
import urllib.request
import torch
import pandas as pd

In [15]:
# Preprocess text (username and link placeholders)
def preprocess(text):
    new_text = []

    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)


print ("hello")

import sys

hello
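
A quick check of what `preprocess` does to a string: mentions become `@user` and anything starting with `http` becomes `http`, while a lone `@` is left alone. The function is repeated so the snippet runs on its own, and the input strings are made up.

```python
def preprocess(text):
    """Same logic as the cell above: mask user mentions and links."""
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

masked = preprocess("@alice loved it, see https://example.com/review !")
print(masked)
```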


In [16]:
# Tasks:
# emoji, emotion, hate, irony, offensive, sentiment
# stance/abortion, stance/atheism, stance/climate, stance/feminist, stance/hillary
task='sentiment'
MODEL = f"cardiffnlp/twitter-roberta-base-{task}"

tokenizer = AutoTokenizer.from_pretrained(MODEL)

In [17]:
# download label mapping
labels=[]
mapping_link = f"https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/{task}/mapping.txt"
with urllib.request.urlopen(mapping_link) as f:
    html = f.read().decode('utf-8').split("\n")
    csvreader = csv.reader(html, delimiter='\t')
labels = [row[1] for row in csvreader if len(row) > 1]

# PT
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.save_pretrained(MODEL)

text = "Good night 😊"
text = preprocess(text)
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
scores = output[0][0].detach().numpy()
scores = softmax(scores)
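
The label-mapping step at the top of this cell can be tried offline. The sketch below parses a sample string in the assumed `mapping.txt` layout (one `<index><TAB><label>` pair per line), so no download is needed.

```python
import csv

# Assumed contents of mapping.txt for the sentiment task.
sample_mapping = "0\tnegative\n1\tneutral\n2\tpositive\n"

rows = csv.reader(sample_mapping.split("\n"), delimiter="\t")
# The trailing empty line yields an empty row, which the length check skips.
labels = [row[1] for row in rows if len(row) > 1]
print(labels)
```

With this ordering, index 0 of the score vector is the negative class.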



In [18]:
# # TF
# model = TFAutoModelForSequenceClassification.from_pretrained(MODEL)
# model.save_pretrained(MODEL)

# text = "Good night 😊"
# encoded_input = tokenizer(text, return_tensors='tf')
# output = model(encoded_input)
# scores = output[0][0].numpy()
# scores = softmax(scores)

ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    l = labels[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")

s1 = 'If someone else acted as Joker, I would give the movie 7-8 stars. The majority of people ended up loving the villain more than the hero, and that rarely happends in movies.Rest in peace Heath Ledger.'
text = preprocess(s1)
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
scores = output[0][0].detach().numpy()
scores = softmax(scores)
print (scores)



1) positive 0.8466
2) neutral 0.1458
3) negative 0.0076
[0.13826811 0.44011867 0.42161328]
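
The softmax and ranking steps can be illustrated without the model. This is a pure-Python sketch: the logits are made-up values rather than real model output, and the local `softmax` stands in for `scipy.special.softmax`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["negative", "neutral", "positive"]
logits = [-1.0, 0.5, 2.0]          # made-up values, not real model output

probs = softmax(logits)
# Sort label indices by probability, highest first (the argsort + [::-1] step).
ranking = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
for rank, idx in enumerate(ranking, start=1):
    print(f"{rank}) {labels[idx]} {round(probs[idx], 4)}")
```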


In [20]:
# load the reviews saved earlier (the file is in the notebook's working directory)
moviesDF = pd.read_csv('movieReviews_IMDB.csv')

numReviews = len(moviesDF)
negScores = []
for i in range(numReviews):
    text = moviesDF.iloc[i]['user_review']
    try:
        encoded_input = tokenizer(text, return_tensors='pt')
        output = model(**encoded_input)
        scores = output[0][0].detach().numpy()
        # index 0 is the "negative" label in the mapping downloaded above
        scores = softmax(scores)[0]
    except Exception:
        # scoring failed, most likely because the review exceeds the model's
        # maximum input length; record -1 as a sentinel
        scores = -1
    print ("For movie ", " the negative sentiment score is ", scores)
    negScores.append(scores)

moviesDF['negative sentiment'] = negScores
# write next to the notebook instead of a machine-specific absolute path
moviesDF.reset_index().to_csv('negativeSentimentOfMovies.csv', index=False)

For movie   the negative sentiment score is  0.13826811
For movie   the negative sentiment score is  0.0034930182
For movie   the negative sentiment score is  0.8947402
For movie   the negative sentiment score is  -1
For movie   the negative sentiment score is  0.039387576
For movie   the negative sentiment score is  -1
For movie   the negative sentiment score is  0.38012147
For movie   the negative sentiment score is  -1
For movie   the negative sentiment score is  -1
For movie   the negative sentiment score is  0.12143889
For movie   the negative sentiment score is  0.13305163
For movie   the negative sentiment score is  0.009651245
For movie   the negative sentiment score is  -1
For movie   the negative sentiment score is  0.0018627209
For movie   the negative sentiment score is  0.057788994
For movie   the negative sentiment score is  0.010476997
For movie   the negative sentiment score is  0.03209362
For movie   the negative sentiment score is  0.029866884
For movie   the negative

For movie   the negative sentiment score is  0.012436267
For movie   the negative sentiment score is  0.007930804
For movie   the negative sentiment score is  0.0043936074
For movie   the negative sentiment score is  -1
For movie   the negative sentiment score is  -1
For movie   the negative sentiment score is  -1
For movie   the negative sentiment score is  0.002755492
For movie   the negative sentiment score is  0.19341359
For movie   the negative sentiment score is  0.0247336
For movie   the negative sentiment score is  0.0062536392
For movie   the negative sentiment score is  -1
For movie   the negative sentiment score is  -1
For movie   the negative sentiment score is  -1
For movie   the negative sentiment score is  0.11001232
For movie   the negative sentiment score is  0.41836548
For movie   the negative sentiment score is  0.0020484694
For movie   the negative sentiment score is  -1
For movie   the negative sentiment score is  0.0106406305
For movie   the negative sentiment sco

For movie   the negative sentiment score is  0.5539435
For movie   the negative sentiment score is  -1
For movie   the negative sentiment score is  0.004178991
For movie   the negative sentiment score is  -1
For movie   the negative sentiment score is  0.91377944
For movie   the negative sentiment score is  0.05752871
For movie   the negative sentiment score is  0.039485414
For movie   the negative sentiment score is  0.0020819132
For movie   the negative sentiment score is  -1
For movie   the negative sentiment score is  0.58735317
For movie   the negative sentiment score is  0.0113778515
For movie   the negative sentiment score is  -1
For movie   the negative sentiment score is  0.2861449
For movie   the negative sentiment score is  0.5174459
For movie   the negative sentiment score is  0.03863177
For movie   the negative sentiment score is  0.39041874
For movie   the negative sentiment score is  -1
For movie   the negative sentiment score is  0.12997602
For movie   the negative sent

For movie   the negative sentiment score is  -1
For movie   the negative sentiment score is  0.02892717
For movie   the negative sentiment score is  0.18212414
For movie   the negative sentiment score is  0.039084334
For movie   the negative sentiment score is  0.0031450868
For movie   the negative sentiment score is  0.011483132
For movie   the negative sentiment score is  -1
For movie   the negative sentiment score is  -1
For movie   the negative sentiment score is  -1
For movie   the negative sentiment score is  0.22177005
For movie   the negative sentiment score is  -1
For movie   the negative sentiment score is  -1
For movie   the negative sentiment score is  -1
For movie   the negative sentiment score is  0.046802055
For movie   the negative sentiment score is  0.20206791
For movie   the negative sentiment score is  0.17761172
For movie   the negative sentiment score is  0.0041124774
For movie   the negative sentiment score is  0.6050204
For movie   the negative sentiment score i
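
The scoring loop records `-1` for reviews it could not score, so any later aggregation has to skip that sentinel. Below is a minimal sketch of a per-movie average that does so. The rows are made-up values, and `mean_negative_by_movie` is a hypothetical helper, not part of the notebook.

```python
FAILED = -1  # sentinel recorded by the scoring loop when a review could not be scored

rows = [  # (movie, negative score) pairs; values are illustrative only
    ("Movie A", 0.10),
    ("Movie A", FAILED),
    ("Movie B", 0.75),
    ("Movie B", 0.25),
    ("Movie C", FAILED),
]

def mean_negative_by_movie(rows):
    """Average the valid scores per movie; movies with no valid score map to None."""
    scores = {}
    for movie, score in rows:
        bucket = scores.setdefault(movie, [])
        if score != FAILED:
            bucket.append(score)
    return {m: (sum(v) / len(v) if v else None) for m, v in scores.items()}

averages = mean_negative_by_movie(rows)
print(averages)
```

Returning `None` for a movie with no valid score keeps it visible in the result, so it is not silently treated as having zero negativity.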