# Web Scraping - Rotten Tomatoes
General steps for web scraping:
1. Check whether the website allows web scraping
2. Request the page source (HTML) using the website URL
3. Download the page content
4. Parse the content, using tags and class names to locate the elements of interest
5. Extract the relevant data/features
6. Organize the raw data in a structured format (e.g., CSV)
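
Step 1 can be partly automated by reading the site's `robots.txt`. The sketch below uses Python's standard `urllib.robotparser`; the rules in `sample_robots_txt` and the `example.com` URLs are made up so that the example runs offline, and they are not the rules of any real site. A site's terms of service may add restrictions that `robots.txt` does not express.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the example runs without network access.
# For a real check, call parser.set_url("<site>/robots.txt") and parser.read() instead.
sample_robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(sample_robots_txt.splitlines())

# can_fetch(useragent, url) reports whether the rules permit fetching the URL.
allowed_reviews = parser.can_fetch("*", "https://example.com/m/some_movie/reviews")
allowed_private = parser.can_fetch("*", "https://example.com/private/data")
print(allowed_reviews, allowed_private)
```

With these sample rules the first URL is permitted and the second is not.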

### Import Dependencies 

In [1]:
# Install the browser-automation packages; no system package update is needed for pip installs.
!pip install selenium
!pip install webdriver-manager



In [2]:
import random
import time
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from datetime import datetime
from selenium import webdriver
from selenium.webdriver.common.by import By

# This notebook uses the Edge browser, so the Edge driver is imported here; if Chrome is installed instead, switch to the Chrome driver.
from selenium.webdriver.edge.service import Service as EdgeService
from webdriver_manager.microsoft import EdgeChromiumDriverManager

In [3]:
info_df = pd.read_csv("./wiki_movie_plots_deduped.csv")
info_df = info_df.drop_duplicates()
info_df["reviews"] = np.nan
info_df['reviews'] = info_df['reviews'].astype('object')
info_df.tail(15)

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot,reviews
34871,2011,Merry-Go-Round,Turkish,İlksen Başarır,"Mert Fırat, Nergis Öztürk & Sema Çeyrekbaşı",drama,https://en.wikipedia.org/wiki/Merry-Go-Round_(...,Erdem and Sevil live in a small town with thei...,
34872,2011,Zephyr,Turkish,Belma Baş,"Şeyma Uzunlar, Vahide Gördüm & Sevinç Baş",comedy-drama,https://en.wikipedia.org/wiki/Zephyr_(film),"Zephyr is a strong-willed little girl, spendin...",
34873,2011,Toll Booth,Turkish,Tolga Karaçelik,"Serkan Ercan, Zafer Diper & Nur Aysan",drama,https://en.wikipedia.org/wiki/Toll_Booth_(film),Kenan is a 35-year-old toll booth attendant st...,
34874,2011,White as Snow,Turkish,Selim Güneş,"Hakan Korkmaz, Sinem İslamoğlu & Gürkan Piri O...",drama,https://en.wikipedia.org/wiki/White_as_Snow_(f...,Hasan is a twelve-year-old boy living with his...,
34875,2011,Once Upon a Time in Anatolia,Turkish,Nuri Bilge Ceylan,"Yılmaz Erdoğan, Taner Birsel & Ufuk Karaali",drama,https://en.wikipedia.org/wiki/Once_Upon_a_Time...,"Through the night, three cars carry a small gr...",
34876,2013,Selam,Turkish,Levent Demirkale,"Bucin Abdullah, Selma Alispahic, Tina Cvitanov...",drama,https://en.wikipedia.org/wiki/Selam_(film),The film opens with a Senegalese boy named Kha...,
34877,2013,Particle (film),Turkish,Erdem Tepegöz,"Jale Arıkan, Rüçhan Caliskur, Özay Fecht, Remz...",drama film,https://en.wikipedia.org/wiki/Particle_(film),"Zeynep lost her job at weaving factory, and he...",
34878,2014,Mandıra Filozofu,Turkish,Director: Müfit Can Saçıntı,Director: Müfit Can Saçıntı\r\nCast: Rasim Özt...,unknown,https://en.wikipedia.org/wiki/Mand%C4%B1ra_Fil...,Cavit an ambitious industralist in İstanbul pl...,
34879,2014,Winter Sleep,Turkish,Director: Nuri Bilge Ceylan,Director: Nuri Bilge Ceylan\r\nCast: Haluk Bil...,unknown,https://en.wikipedia.org/wiki/Winter_Sleep_(film),"Aydın, a former actor, owns a mountaintop hote...",
34880,2014,Sivas,Turkish,Director: Kaan Müjdeci,Director: Kaan Müjdeci\r\nCast: Dogan Izci,unknown,https://en.wikipedia.org/wiki/Sivas_(film),The film follows an eleven-year-old boy named ...,


### Define url function

In [5]:
def get_url(movie_name):
    url_template = "https://www.rottentomatoes.com/m/{}/reviews"
    url = url_template.format(movie_name)
    return url
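
The title-to-URL step can be sketched as a small helper. `title_to_slug` is a hypothetical name, not part of the notebook. It assumes, without verification against the site, that review pages use lowercase slugs and do not carry Wikipedia-style suffixes such as `(film)`; the space and hyphen replacement is the same substitution the scraping loop applies.

```python
import re


def title_to_slug(title: str) -> str:
    """Hypothetical helper: turn a film title into a candidate URL slug."""
    # Assumption: drop a trailing disambiguator such as " (film)".
    title = re.sub(r"\s*\([^)]*\)\s*$", "", title)
    # Same substitution as the notebook: spaces and hyphens become underscores.
    slug = re.sub(r"[ -]", "_", title.strip())
    # Assumption: slugs are lowercase.
    return slug.lower()


def get_url(movie_name: str) -> str:
    # Same URL template as the notebook's get_url.
    return "https://www.rottentomatoes.com/m/{}/reviews".format(movie_name)


slug_a = title_to_slug("Particle (film)")
slug_b = title_to_slug("Merry-Go-Round")
print(get_url(slug_a))
print(get_url(slug_b))
```

Titles containing non-ASCII characters or punctuation would likely need further normalization, which this sketch does not attempt.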

### Scrape movie reviews from critics

In [6]:
import re

# Create the driver instance.
driver = webdriver.Edge(service=EdgeService(EdgeChromiumDriverManager().install()))
# Implicit wait: element lookups poll for up to 2 seconds. It is a session-wide setting, so set it once.
driver.implicitly_wait(2)

# Iterate through the index in reverse order so the result is easier to inspect and debug.
indexes = info_df.index.tolist()
indexes.reverse()
movie_num = 0
for i in indexes:
    movie_num += 1
    # i is an index label (drop_duplicates can leave gaps), so look it up by label, not by position.
    name_og = info_df.at[i, 'Title']
    movie_name = re.sub("[ -]", '_', name_og)

    driver.get(get_url(movie_name))  # Load the movie's review page.

    # implicitly_wait does not pause between requests, so sleep explicitly to reduce the
    # chance of triggering human verification. The 1-3 second range is an arbitrary choice.
    time.sleep(random.uniform(1, 3))

    reviews = driver.find_elements(By.CLASS_NAME, 'the_review')  # Find all elements with this class name.

    review_collections = []
    for review in reviews:
        result_html = review.get_attribute('innerHTML')
        soup = BeautifulSoup(result_html, 'html.parser')  # Parse the element's inner HTML.
        review_collections.append(str(soup).strip())

    info_df.at[i, 'reviews'] = review_collections
    print("{}: Review added for movie: {}".format(movie_num, movie_name))
driver.quit()

Review added for movie: İstanbul_Kırmızısı
Review added for movie: Non_Transferable
Review added for movie: Olanlar_Oldu
Review added for movie: Çalgı_Çengi_İkimiz
Review added for movie: The_Water_Diviner
Review added for movie: Sivas
Review added for movie: Winter_Sleep
Review added for movie: Mandıra_Filozofu
Review added for movie: Particle_(film)
Review added for movie: Selam
Review added for movie: Once_Upon_a_Time_in_Anatolia
Review added for movie: White_as_Snow
Review added for movie: Toll_Booth
Review added for movie: Zephyr
Review added for movie: Merry_Go_Round
Review added for movie: Press
Review added for movie: Signora_Enrica
Review added for movie: Love_Likes_Coincidences
Review added for movie: Scapegoat
Review added for movie: Paper
Review added for movie: Free_Man
Review added for movie: Eyyvah_Eyvah_2
Review added for movie: Hayde_Bre
Review added for movie: Other_Angels
Review added for movie: Secret_of_the_Sultan
Review added for movie: Jackal
Review added for movie: Five_Minarets_in_New_York
Review added 
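
The scraping loop stores `str(soup)`, which keeps any markup nested inside a review element. A minimal sketch of the difference between that and `get_text()` follows; the HTML string is invented for illustration and is not taken from the site.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for one review element's innerHTML.
sample_inner_html = "  <p>A <em>quiet</em>, patient drama.</p>  "

soup = BeautifulSoup(sample_inner_html, "html.parser")

as_markup = str(soup).strip()      # serialized HTML, tags included
as_text = soup.get_text().strip()  # text content only, tags removed

print(as_markup)
print(as_text)
```

If the review elements contain only plain text, both forms give the same string; `get_text()` matters only when tags are present.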

### Save results

In [9]:
# Convert the dataframe to a csv file
date = datetime.today().strftime('%Y-%m-%d')
info_df.to_csv(f"{date}_movie_info_with_reviews.csv", index=False)

### Data cleaning (unfinished)

In [29]:
# Read the saved results back in (note: this filename differs from the one written in the save step).
info_df = pd.read_csv('movie_with_rottentomatoes_reviews.csv')

In [30]:
len(info_df.loc[0]['reviews'])

2
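
A length of 2 for a row is consistent with the cell holding the two-character string `[]` rather than an empty list: writing a list column to CSV stores its text representation, and `read_csv` returns it as a string. The sketch below reproduces this with a tiny made-up DataFrame and shows one way to recover real lists, using `ast.literal_eval` as a `converters` entry. The names `demo_df`, `raw_df`, and `parsed_df` are illustrative only.

```python
import ast
import io

import pandas as pd

# Tiny stand-in for info_df; the titles and review strings are made up.
demo_df = pd.DataFrame({
    "Title": ["Movie A", "Movie B"],
    "reviews": [[], ["Great film.", "Too long."]],
})

# Round trip through CSV, using an in-memory buffer instead of a file.
buffer = io.StringIO()
demo_df.to_csv(buffer, index=False)

buffer.seek(0)
raw_df = pd.read_csv(buffer)
raw_len = len(raw_df.loc[0, "reviews"])  # length of the string "[]"

# Parse each cell back into a Python list while reading.
buffer.seek(0)
parsed_df = pd.read_csv(buffer, converters={"reviews": ast.literal_eval})
parsed_len = len(parsed_df.loc[1, "reviews"])  # number of reviews for Movie B

print(raw_len, parsed_len)
```

Once the column holds real lists, empty results can be filtered by list length instead of by character count.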

In [31]:
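# After the CSV round trip each 'reviews' cell is a string such as "[]", so the length
# counts characters; fewer than 4 characters means no reviews were scraped.
info_df = info_df[info_df['reviews'].str.len() >= 4]
info_df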

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot,reviews
4,1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,unknown,https://en.wikipedia.org/wiki/Jack_and_the_Bea...,The earliest known adaptation of the classic f...,['[F]reaky and not a little psychedelic...']
10,1906,Dream of a Rarebit Fiend,American,Wallace McCutcheon and Edwin S. Porter,,short,https://en.wikipedia.org/wiki/Dream_of_a_Rareb...,The Rarebit Fiend gorges on Welsh rarebit at a...,['The film pays homage to the long history of ...
16,1908,The Adventures of Dollie,American,D. W. Griffith,"Arthur V. Johnson, Linda Arvidson",drama,https://en.wikipedia.org/wiki/The_Adventures_o...,On a beautiful summer day a father and mother ...,['One of the most remarkable cases of child-st...
28,1910,Frankenstein,American,J. Searle Dawley,"Augustus Phillips, Charles Stanton Ogle, Mary ...",unknown,https://en.wikipedia.org/wiki/Frankenstein_(19...,"Described as ""a liberal adaptation of Mrs. She...",['']
70,1914,Cinderella,American,James Kirkwood,"Mary Pickford, Owen Moore, Isobel Vernon",fantasy drama,https://en.wikipedia.org/wiki/Cinderella_(1914...,Cinderella is a kind young woman who lives wit...,['Some of the Tom and Jerry stuff goes a hair ...
...,...,...,...,...,...,...,...,...,...
34873,2011,Toll Booth,Turkish,Tolga Karaçelik,"Serkan Ercan, Zafer Diper & Nur Aysan",drama,https://en.wikipedia.org/wiki/Toll_Booth_(film),Kenan is a 35-year-old toll booth attendant st...,['Though speckled here and there with uneasy c...
34875,2011,Once Upon a Time in Anatolia,Turkish,Nuri Bilge Ceylan,"Yılmaz Erdoğan, Taner Birsel & Ufuk Karaali",drama,https://en.wikipedia.org/wiki/Once_Upon_a_Time...,"Through the night, three cars carry a small gr...","['Past beyond its duration, the film grows dar..."
34880,2014,Sivas,Turkish,Director: Kaan Müjdeci,Director: Kaan Müjdeci\r\nCast: Dogan Izci,unknown,https://en.wikipedia.org/wiki/Sivas_(film),The film follows an eleven-year-old boy named ...,"[""What's there in terms of story thus often fe..."
34881,2014,The Water Diviner,Turkish,Director: Russell Crowe,Director: Russell Crowe\r\nCast: Russell Crowe...,unknown,https://en.wikipedia.org/wiki/The_Water_Diviner,"The film begins in 1919, just after World War ...",['Crowe should be commended for giving the fil...
