# Sentimental Analysis: Movie reviews (Avengers Edition).

In this notebook, we extract the reviews of the movies from the saga Avangers: **'Avengers: Ultron Age'**, **'Avengers: Infinity War'**, **'Avengers: EndGame'**. 

## Extract data.

For the first step, we will extract the data from the website IMDB. This platform is a popular website for people to reviews the movies.

- Avengers (2012): https://www.imdb.com/title/tt0848228/reviews/?ref_=tt_ql_2
- Avengers: Age of Ultron (2015): https://www.imdb.com/title/tt2395427/reviews/?ref_=tt_ql_2
- Avengers: Infinity War (2018): https://www.imdb.com/title/tt4154756/reviews/?ref_=tt_ql_2
- Avengers: EndGame (2019): https://www.imdb.com/title/tt4154796/reviews/?ref_=tt_ql_2


### Web Scraping
We will do a webscraping of the websites using **selenium** and **BeautifulSoup**.

In [1]:
# Lets import all the required libraries
import selenium
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm
import time

### Web Scraping Avengers (2012): 

In [3]:
# Setup Selenium WebDriver
s = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s) #we will use chrome browser
driver.get("https://www.imdb.com/title/tt4154796/reviews/?ref_=tt_ql_2")
driver.implicitly_wait(5)

# Handle cookie consent if present on the website
try:
    accept_cookies = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.XPATH, "//button[contains(text(), 'Accept')]"))
    )
    accept_cookies.click()
    print("Accepted cookies.")
except Exception as e:
    print(f"An error occurred: {e}")

# Initialize counter for the number of pages loaded. We choose only the first 4500 reviews, if exist.
page_counter = 0
max_pages = 175

# Load reviews
while page_counter < max_pages:
    time.sleep(2)  # Adding a sleep time to ensure the page loads completely
    try:
        load_more = WebDriverWait(driver, 10).until( 
            EC.element_to_be_clickable((By.ID, 'load-more-trigger')) #the element with ID "load-more-trigger" is present
        )
        # Scroll to the load more button
        driver.execute_script("arguments[0].scrollIntoView(true);", load_more)
        load_more.click()
        time.sleep(3)  # Ensure the reviews load completely before moving on
        page_counter += 1
    except Exception:
        break

# Results
number_reviews_avg_2012 = len(driver.find_elements(by=By.CLASS_NAME, value='lister-item-content'))
print("Total review number for Avengers (2012)", number_reviews_avg_2012)


Accepted cookies.
Total review number for Avengers (2012) 4396


In [4]:
# Lets extract review details of the movie extracting the info such as titles, dates, review content and ratings.

titles = [] 
dates = []
contents = []
ratings = []

reviews = driver.find_elements(By.CLASS_NAME, 'lister-item-content')  # extracting from name of the class
for review in reviews:
    try:
        title = review.find_element(By.CLASS_NAME, 'title').text
    except:
        title = None
    try:
        date = review.find_element(By.CLASS_NAME, 'review-date').text
    except:
        date = None
    try:
        content = review.find_element(By.CLASS_NAME, 'text').text
    except:
        content = None
    try:
        rating_element = review.find_element(By.CLASS_NAME, 'rating-other-user-rating')
        rating = rating_element.find_element(By.TAG_NAME, 'span').text
    except:
        rating = None


    titles.append(title) #Lets append the info
    dates.append(date)
    contents.append(content)
    ratings.append(rating)


# Create a DataFrame
Avengers = pd.DataFrame({
    "review_title": titles,
    "date": dates,
    "content": contents,
    "rating": ratings
})

In [5]:
# Checking the composition of the dataframe
Avengers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4396 entries, 0 to 4395
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   review_title  4396 non-null   object
 1   date          4396 non-null   object
 2   content       4396 non-null   object
 3   rating        4295 non-null   object
dtypes: object(4)
memory usage: 137.5+ KB


In [6]:
# Results
Avengers

Unnamed: 0,review_title,date,content,rating
0,Not as good as infinity war..,29 April 2019,But its a pretty good film. A bit of a mess in...,7
1,Emotional but bit messy.,27 April 2019,,7
2,The zenith of the MCU franchise.,24 April 2019,I feel like I'm wasting my time writing down m...,9
3,Crazy in every sense,26 November 2021,This film is an emotional rollercoaster with s...,10
4,They had no right to do that,26 April 2019,First review from me. This film deserves it. A...,10
...,...,...,...,...
4391,Two Endings and About a Half Hour Too Long,30 April 2019,I won't include any spoliers in this review.\n...,7
4392,Lots of bugs,1 May 2019,,4
4393,Whatever it takes,24 April 2019,I wish I could rate this movie in 3.000 stars....,10
4394,The Cinematic Event of a Lifetime,8 July 2019,Marvel Studios' 'Avengers: Endgame' is the end...,10


In [7]:
# Now that we have the info, lets save the DataFrame to a CSV file
Avengers.to_csv('Avengers sentimental Analysis with Web Scraping/Avengers_original.csv', index=False)

# Closing the WebDriver
driver.quit()

### Web Scraping Avengers: Age of Ultron (2015):

In [8]:
# Setup Selenium WebDriver
s = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s) #we will use chrome browser
driver.get("https://www.imdb.com/title/tt2395427/reviews/?ref_=tt_ql_2")
driver.implicitly_wait(5)

# Handle cookie consent if present on the website
try:
    accept_cookies = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.XPATH, "//button[contains(text(), 'Accept')]"))
    )
    accept_cookies.click()
    print("Accepted cookies.")
except Exception as e:
    print(f"An error occurred: {e}")

# Initialize counter for the number of pages loaded. We choose only the first 4500 reviews, if exist.
page_counter = 0
max_pages = 175

# Load reviews
while page_counter < max_pages:
    time.sleep(2)  # Adding a sleep time to ensure the page loads completely
    try:
        load_more = WebDriverWait(driver, 10).until( 
            EC.element_to_be_clickable((By.ID, 'load-more-trigger')) #the element with ID "load-more-trigger" is present
        )
        # Scroll to the load more button
        driver.execute_script("arguments[0].scrollIntoView(true);", load_more)
        load_more.click()
        time.sleep(3)  # Ensure the reviews load completely before moving on
        page_counter += 1
    except Exception:
        break

# Results
number_reviews_avg_ultron = len(driver.find_elements(by=By.CLASS_NAME, value='lister-item-content'))
print("Total review number for Avengers: Age of Ultron (2015):", number_reviews_avg_ultron)

Accepted cookies.
Total review number for Avengers: Age of Ultron (2015): 1449


In [9]:
# Lets extract review details of the movie extracting the info such as titles, dates, review content and ratings.

titles = [] 
dates = []
contents = []
ratings = []

reviews = driver.find_elements(By.CLASS_NAME, 'lister-item-content')  # extracting from name of the class
for review in reviews:
    try:
        title = review.find_element(By.CLASS_NAME, 'title').text
    except:
        title = None
    try:
        date = review.find_element(By.CLASS_NAME, 'review-date').text
    except:
        date = None
    try:
        content = review.find_element(By.CLASS_NAME, 'text').text
    except:
        content = None
    try:
        rating_element = review.find_element(By.CLASS_NAME, 'rating-other-user-rating')
        rating = rating_element.find_element(By.TAG_NAME, 'span').text
    except:
        rating = None


    titles.append(title) #Lets append the info
    dates.append(date)
    contents.append(content)
    ratings.append(rating)


# Create a DataFrame
Avengers_ultron = pd.DataFrame({
    "review_title": titles,
    "date": dates,
    "content": contents,
    "rating": ratings
})

In [10]:
# Checking the composition of the dataframe
Avengers_ultron.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1449 entries, 0 to 1448
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   review_title  1449 non-null   object
 1   date          1449 non-null   object
 2   content       1449 non-null   object
 3   rating        1398 non-null   object
dtypes: object(4)
memory usage: 45.4+ KB


In [11]:
# Results
Avengers_ultron

Unnamed: 0,review_title,date,content,rating
0,The most important film in MCU,2 December 2020,,8
1,Avengers: Age of Ultron (2015) Review - 8.3/10,12 December 2020,This film isn't nearly as bad as some people m...,8
2,An enjoyable ride,18 January 2021,"Cool seeing them all together again, even if I...",7
3,The son of Megatron and Skynet meets Tony Star...,14 December 2015,"In 2012, there came a day unlike any other day...",6
4,One word: Mjollnir!,1 May 2015,I have noticed a trend of negative reviews dir...,9
...,...,...,...,...
1444,Not quite as well made as the first but still ...,25 April 2018,,
1445,Even Better Now... A Masterpiece.,2 May 2021,Given this was the only movie we got to see th...,
1446,The Best Comic Adaptation Sequel Of All-Time! ...,3 May 2015,A lot of talk had been thrown about that the f...,
1447,Best movie in the world,7 November 2015,I think it was inevitable that this sequel wou...,


In [12]:
# Now that we have the info, lets save the DataFrame to a CSV file
Avengers_ultron.to_csv('Avengers sentimental Analysis with Web Scraping/Avengers_ultron_original.csv', index=False)

# Closing the WebDriver
driver.quit()

### Web Scraping Avengers: Infinity War (2018):

In [13]:
# Setup Selenium WebDriver
s = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s) #we will use chrome browser
driver.get("https://www.imdb.com/title/tt4154756/reviews/?ref_=tt_ql_2")
driver.implicitly_wait(5)

# Handle cookie consent if present on the website
try:
    accept_cookies = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.XPATH, "//button[contains(text(), 'Accept')]"))
    )
    accept_cookies.click()
    print("Accepted cookies.")
except Exception as e:
    print(f"An error occurred: {e}")

# Initialize counter for the number of pages loaded. We choose only the first 4500 reviews, if exist.
page_counter = 0
max_pages = 175

# Load reviews
while page_counter < max_pages:
    time.sleep(2)  # Adding a sleep time to ensure the page loads completely
    try:
        load_more = WebDriverWait(driver, 10).until( 
            EC.element_to_be_clickable((By.ID, 'load-more-trigger')) #the element with ID "load-more-trigger" is present
        )
        # Scroll to the load more button
        driver.execute_script("arguments[0].scrollIntoView(true);", load_more)
        load_more.click()
        time.sleep(3)  # Ensure the reviews load completely before moving on
        page_counter += 1
    except Exception:
        break

# Results
number_reviews_avg_infinity = len(driver.find_elements(by=By.CLASS_NAME, value='lister-item-content'))
print("Total review number for Avengers: Infinity:", number_reviews_avg_infinity)

Accepted cookies.
Total review number for Avengers: Age of Ultron (2015): 4393


In [14]:
# Lets extract review details of the movie extracting the info such as titles, dates, review content and ratings.

titles = [] 
dates = []
contents = []
ratings = []

reviews = driver.find_elements(By.CLASS_NAME, 'lister-item-content')  # extracting from name of the class
for review in reviews:
    try:
        title = review.find_element(By.CLASS_NAME, 'title').text
    except:
        title = None
    try:
        date = review.find_element(By.CLASS_NAME, 'review-date').text
    except:
        date = None
    try:
        content = review.find_element(By.CLASS_NAME, 'text').text
    except:
        content = None
    try:
        rating_element = review.find_element(By.CLASS_NAME, 'rating-other-user-rating')
        rating = rating_element.find_element(By.TAG_NAME, 'span').text
    except:
        rating = None


    titles.append(title) #Lets append the info
    dates.append(date)
    contents.append(content)
    ratings.append(rating)


# Create a DataFrame
Avengers_infinity  = pd.DataFrame({
    "review_title": titles,
    "date": dates,
    "content": contents,
    "rating": ratings
})

In [15]:
# Checking the composition of the dataframe
Avengers_infinity.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4393 entries, 0 to 4392
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   review_title  4393 non-null   object
 1   date          4393 non-null   object
 2   content       4393 non-null   object
 3   rating        4343 non-null   object
dtypes: object(4)
memory usage: 137.4+ KB


In [16]:
# Results
Avengers_infinity

Unnamed: 0,review_title,date,content,rating
0,EMOTIONAL ROLLER COASTER,16 February 2021,Avengers infinity war is an emotional roller c...,9
1,A superior avengers sequel,5 February 2021,Although this film has 'Avengers' in the title...,8
2,It all led to this: superhero film at its best,25 October 2021,#MCUrewatch. A confrontation that has been in ...,8
3,A film that pulled off the impossible.,20 January 2021,Avengers: Infinity War is a film that should b...,10
4,"Best movie of the MCU, incredible from start t...",24 January 2021,,10
...,...,...,...,...
4388,Ugh,8 September 2018,Went down the 'event cinema' route. Lots of fl...,5
4389,So looking forward to this movie and was disap...,27 April 2018,,5
4390,Dull,29 April 2018,Boring. I was in the cinema and bored. Want a ...,1
4391,Makes no sense whatsoever,19 April 2019,,1


In [17]:
# Now that we have the info, lets save the DataFrame to a CSV file
Avengers_infinity.to_csv('Avengers sentimental Analysis with Web Scraping/Avengers_infinity_original.csv', index=False)

# Closing the WebDriver
driver.quit()

### Web Scraping Avengers: EndGame (2019): 

In [18]:
# Setup Selenium WebDriver
s = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s) #we will use chrome browser
driver.get("https://www.imdb.com/title/tt4154796/reviews/?ref_=tt_ql_2")
driver.implicitly_wait(5)

# Handle cookie consent if present on the website
try:
    accept_cookies = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.XPATH, "//button[contains(text(), 'Accept')]"))
    )
    accept_cookies.click()
    print("Accepted cookies.")
except Exception as e:
    print(f"An error occurred: {e}")

# Initialize counter for the number of pages loaded. We choose only the first 4500 reviews, if exist.
page_counter = 0
max_pages = 175

# Load reviews
while page_counter < max_pages:
    time.sleep(2)  # Adding a sleep time to ensure the page loads completely
    try:
        load_more = WebDriverWait(driver, 10).until( 
            EC.element_to_be_clickable((By.ID, 'load-more-trigger')) #the element with ID "load-more-trigger" is present
        )
        # Scroll to the load more button
        driver.execute_script("arguments[0].scrollIntoView(true);", load_more)
        load_more.click()
        time.sleep(3)  # Ensure the reviews load completely before moving on
        page_counter += 1
    except Exception:
        break

# Results
number_reviews_avg_endgame = len(driver.find_elements(by=By.CLASS_NAME, value='lister-item-content'))
print("Total review number for Avengers: Endgame:", number_reviews_avg_endgame)


Accepted cookies.
Total review number for Avengers: Age of Ultron (2015): 4396


In [19]:
# Lets extract review details of the movie extracting the info such as titles, dates, review content and ratings.

titles = [] 
dates = []
contents = []
ratings = []

reviews = driver.find_elements(By.CLASS_NAME, 'lister-item-content')  # extracting from name of the class
for review in reviews:
    try:
        title = review.find_element(By.CLASS_NAME, 'title').text
    except:
        title = None
    try:
        date = review.find_element(By.CLASS_NAME, 'review-date').text
    except:
        date = None
    try:
        content = review.find_element(By.CLASS_NAME, 'text').text
    except:
        content = None
    try:
        rating_element = review.find_element(By.CLASS_NAME, 'rating-other-user-rating')
        rating = rating_element.find_element(By.TAG_NAME, 'span').text
    except:
        rating = None


    titles.append(title) #Lets append the info
    dates.append(date)
    contents.append(content)
    ratings.append(rating)


# Create a DataFrame
Avengers_endgame  = pd.DataFrame({
    "review_title": titles,
    "date": dates,
    "content": contents,
    "rating": ratings
})

In [20]:
# Checking the composition of the dataframe
Avengers_endgame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4396 entries, 0 to 4395
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   review_title  4396 non-null   object
 1   date          4396 non-null   object
 2   content       4396 non-null   object
 3   rating        4295 non-null   object
dtypes: object(4)
memory usage: 137.5+ KB


In [21]:
# Results
Avengers_endgame

Unnamed: 0,review_title,date,content,rating
0,Not as good as infinity war..,29 April 2019,But its a pretty good film. A bit of a mess in...,7
1,Emotional but bit messy.,27 April 2019,,7
2,The zenith of the MCU franchise.,24 April 2019,I feel like I'm wasting my time writing down m...,9
3,Crazy in every sense,26 November 2021,This film is an emotional rollercoaster with s...,10
4,They had no right to do that,26 April 2019,First review from me. This film deserves it. A...,10
...,...,...,...,...
4391,Two Endings and About a Half Hour Too Long,30 April 2019,I won't include any spoliers in this review.\n...,7
4392,Lots of bugs,1 May 2019,,4
4393,Whatever it takes,24 April 2019,I wish I could rate this movie in 3.000 stars....,10
4394,The Cinematic Event of a Lifetime,8 July 2019,Marvel Studios' 'Avengers: Endgame' is the end...,10


In [22]:
# Now that we have the info, lets save the DataFrame to a CSV file
Avengers_endgame.to_csv('Avengers sentimental Analysis with Web Scraping/Avengers_endgame_original.csv', index=False)

# Closing the WebDriver
driver.quit()