## Imports

In [1]:
# Install requests first if needed (run `pip install requests` in a terminal).
# requests fetches the page HTML for us.
import requests

# Install Beautiful Soup first if needed (run `pip install beautifulsoup4` in a terminal).
# Beautiful Soup parses the HTML so that we can search it.
from bs4 import BeautifulSoup as bs

# Install Selenium first if needed (run `pip install selenium` in a terminal).
# Selenium drives a real browser.
from selenium import webdriver

# time lets us pause between page loads so that we behave more like a human visitor
import time

# pandas gives us the dataframe that stores our results
import pandas as pd

## Let's try to scrape a single web page first (i.e., no Selenium)
We will first try to get just the film titles from a Wikipedia page

In [2]:
# let's make a dataframe to store our data
column_names = ["Film Title"]
df = pd.DataFrame(columns=column_names)
df

Unnamed: 0,Film Title


In [3]:
# Getting page HTML through request
mcu_list = requests.get('https://en.wikipedia.org/wiki/List_of_Marvel_Cinematic_Universe_films')
soup = bs(mcu_list.content, 'html.parser') # Parsing content using beautifulsoup

Let's try to just get our film titles first

In [4]:
soup.find_all("a", title=True)

[<a href="/wiki/Wikipedia:Featured_lists" title="This is a featured list. Click here for more information."><img alt="This is a featured list. Click here for more information." data-file-height="438" data-file-width="462" decoding="async" height="19" src="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/30px-Cscr-featured.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/40px-Cscr-featured.svg.png 2x" width="20"/></a>,
 <a href="/wiki/Wikipedia:Protection_policy#extended" title="This article is extended-protected."><img alt="Page extended-protected" data-file-height="512" data-file-width="512" decoding="async" height="20" src="//upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/20px-Extended-protection-shackle.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/30px-Extend

Whoa, that's far too much HTML to sift through.
Let's narrow it down to just the table of the Infinity Saga movies

In [5]:
table = soup.find("table",{"class":"wikitable plainrowheaders"})
table

<table class="wikitable plainrowheaders" style="text-align: center; width: 99%;">
<caption><style data-mw-deduplicate="TemplateStyles:r1015390333">.mw-parser-output .sr-only{border:0;clip:rect(0,0,0,0);height:1px;margin:-1px;overflow:hidden;padding:0;position:absolute;width:1px;white-space:nowrap}</style><span class="sr-only">Films in the Infinity Saga of the Marvel Cinematic Universe</span>
</caption>
<tbody><tr>
<th scope="col">Film
</th>
<th scope="col">U.S. release date
</th>
<th scope="col">Director(s)
</th>
<th scope="col">Screenwriter(s)
</th>
<th scope="col">Producer(s)
</th></tr>
<tr>
<th colspan="6" style="background-color: #ccccff;"><a href="/wiki/Marvel_Cinematic_Universe:_Phase_One" title="Marvel Cinematic Universe: Phase One">Phase One</a><span style="background-color:white;padding:1px;display:inline-block;line-height:50%"><sup class="reference" id="cite_ref-DigitalSpyPhases_25-1"><a href="#cite_note-DigitalSpyPhases-25">[24]</a></sup></span>
</th></tr>
<tr>
<th scope="ro
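
One detail of the table lookup above may be worth a small sketch. When the whole string "wikitable plainrowheaders" is passed as the class, Beautiful Soup compares it against the exact value of the class attribute, so the lookup would miss a table that lists the same classes in another order. A CSS selector matches each class on its own. The HTML below is made up for illustration and is not taken from the Wikipedia page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: the same two classes, written in the opposite order.
html = '<table class="plainrowheaders wikitable"><tr><th scope="row">Iron Man</th></tr></table>'
demo_soup = BeautifulSoup(html, "html.parser")

# Exact-string match against the class attribute: misses the reordered classes.
exact = demo_soup.find("table", {"class": "wikitable plainrowheaders"})

# CSS selector: each class is matched independently, in any order.
by_css = demo_soup.select_one("table.wikitable.plainrowheaders")

print(exact is None)
print(by_css is not None)
```

If the page's markup ever changes the class order, the selector form is the one more likely to keep working.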

Getting warmer! Unfortunately, this table holds more than just the film titles.
A quick search on HTML tells us that "th" marks a table header cell. The column headings use scope="col", while each film title lives in a row header with scope="row".
Knowing this, let's narrow down the table

In [6]:
headers = table.find_all("th", scope="row")
headers

[<th scope="row"><i><a href="/wiki/Iron_Man_(2008_film)" title="Iron Man (2008 film)">Iron Man</a></i>
 </th>,
 <th scope="row"><i><a href="/wiki/The_Incredible_Hulk_(film)" title="The Incredible Hulk (film)">The Incredible Hulk</a></i>
 </th>,
 <th scope="row"><i><a href="/wiki/Iron_Man_2" title="Iron Man 2">Iron Man 2</a></i>
 </th>,
 <th scope="row"><i><a href="/wiki/Thor_(film)" title="Thor (film)">Thor</a></i>
 </th>,
 <th scope="row"><i><a href="/wiki/Captain_America:_The_First_Avenger" title="Captain America: The First Avenger">Captain America: The First Avenger</a></i>
 </th>,
 <th scope="row"><i><a href="/wiki/The_Avengers_(2012_film)" title="The Avengers (2012 film)">The Avengers</a></i>
 </th>,
 <th scope="row"><i><a href="/wiki/Iron_Man_3" title="Iron Man 3">Iron Man 3</a></i>
 </th>,
 <th scope="row"><i><a href="/wiki/Thor:_The_Dark_World" title="Thor: The Dark World">Thor: The Dark World</a></i>
 </th>,
 <th scope="row"><i><a href="/wiki/Captain_America:_The_Winter_Soldie

In [7]:
for line in headers:
    print(line.get_text())  # each title still carries a trailing newline
print(len(headers), "movies scraped")
# A Google search tells us that there are 23 movies in the Infinity Saga, so the count matches and we are done!

Iron Man

The Incredible Hulk

Iron Man 2

Thor

Captain America: The First Avenger

The Avengers

Iron Man 3

Thor: The Dark World

Captain America: The Winter Soldier

Guardians of the Galaxy

Avengers: Age of Ultron

Ant-Man

Captain America: Civil War

Doctor Strange

Guardians of the Galaxy Vol. 2

Spider-Man: Homecoming

Thor: Ragnarok

Black Panther

Avengers: Infinity War

Ant-Man and the Wasp

Captain Marvel

Avengers: Endgame

Spider-Man: Far From Home

23 movies scraped


Let's collect the titles in a list and add them to our dataframe


In [8]:
film_titles = []
for line in headers:
    film_titles.append(line.get_text().strip('\n')) #there was a newline after each title that we don't need
df["Film Title"] = film_titles
df

Unnamed: 0,Film Title
0,Iron Man
1,The Incredible Hulk
2,Iron Man 2
3,Thor
4,Captain America: The First Avenger
5,The Avengers
6,Iron Man 3
7,Thor: The Dark World
8,Captain America: The Winter Soldier
9,Guardians of the Galaxy
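
Each row header also holds the link to the film's own page, so title and link could be collected in one pass. The sketch below assumes the row headers look like the two shown earlier. The helper name `collect_rows` and the `href` key are choices made for this example only.

```python
from bs4 import BeautifulSoup

# Two row headers shaped like the ones printed earlier in this notebook.
html = """
<table>
<tr><th scope="row"><i><a href="/wiki/Iron_Man_(2008_film)" title="Iron Man (2008 film)">Iron Man</a></i>
</th></tr>
<tr><th scope="row"><i><a href="/wiki/Thor_(film)" title="Thor (film)">Thor</a></i>
</th></tr>
</table>
"""
demo_table = BeautifulSoup(html, "html.parser")

def collect_rows(table):
    """Return one dict per row header with the film title and its relative link."""
    rows = []
    for header in table.find_all("th", scope="row"):
        link = header.find("a", href=True)
        rows.append({
            # strip=True drops the trailing newline without a separate strip call
            "Film Title": header.get_text(strip=True),
            "href": link["href"] if link else None,
        })
    return rows

rows = collect_rows(demo_table)
print(rows)
```

Passing `rows` to `pd.DataFrame` would give the same Film Title column plus the links, ready for the later steps.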


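Let's try to get our film dates now. Notice that each date in this HTML carries a sort key of the form 00000000year-month-day-0000.
That key lives in a distinctive attribute called data-sort-value. A CSS selector passed to select is a convenient way to match on an attribute; find_all("span", attrs={"data-sort-value": True}) would work as well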

In [9]:
dates = table.select("span[data-sort-value]")
dates

[<span data-sort-value="000000002008-05-02-0000" style="white-space:nowrap">May 2, 2008</span>,
 <span data-sort-value="000000002008-06-13-0000" style="white-space:nowrap">June 13, 2008</span>,
 <span data-sort-value="000000002010-05-07-0000" style="white-space:nowrap">May 7, 2010</span>,
 <span data-sort-value="000000002011-05-06-0000" style="white-space:nowrap">May 6, 2011</span>,
 <span data-sort-value="000000002011-07-22-0000" style="white-space:nowrap">July 22, 2011</span>,
 <span data-sort-value="000000002012-05-04-0000" style="white-space:nowrap">May 4, 2012</span>,
 <span data-sort-value="000000002013-05-03-0000" style="white-space:nowrap">May 3, 2013</span>,
 <span data-sort-value="000000002013-11-08-0000" style="white-space:nowrap">November 8, 2013</span>,
 <span data-sort-value="000000002014-04-04-0000" style="white-space:nowrap">April 4, 2014</span>,
 <span data-sort-value="000000002014-08-01-0000" style="white-space:nowrap">August 1, 2014</span>,
 <span data-sort-value="00
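
The sort key itself can also be turned into a date, as an alternative to reading the visible text. This is only a sketch: the pattern is inferred from the values printed above (zero padding, year-month-day, then a trailing "-0000") and is an assumption, not a documented format. The name `parse_sort_key` is made up for this example.

```python
import re
from datetime import date

# Pattern inferred from values such as "000000002008-05-02-0000":
# leading zeros, a four-digit year, month, day, then four trailing digits.
SORT_KEY = re.compile(r"^0*(\d{4})-(\d{2})-(\d{2})-\d{4}$")

def parse_sort_key(value):
    """Turn a data-sort-value string into a datetime.date, or None if it does not fit."""
    match = SORT_KEY.match(value)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)

print(parse_sort_key("000000002008-05-02-0000"))  # 2008-05-02
```

On the real page this would be called with `span["data-sort-value"]` for each span in `dates`.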

In [10]:
for line in dates:
    print(line.text)

May 2, 2008
June 13, 2008
May 7, 2010
May 6, 2011
July 22, 2011
May 4, 2012
May 3, 2013
November 8, 2013
April 4, 2014
August 1, 2014
May 1, 2015
July 17, 2015
May 6, 2016
November 4, 2016
May 5, 2017
July 7, 2017
November 3, 2017
February 16, 2018
April 27, 2018
July 6, 2018
March 8, 2019
April 26, 2019
July 2, 2019


Great! Let's add this to our dataframe

In [11]:
film_dates = []
for line in dates:
    film_dates.append(line.text)
df["U.S. Release Date"] = film_dates
df["U.S. Release Date"] = pd.to_datetime(df["U.S. Release Date"]) # convert the date strings to datetime so they can be sorted and compared
df

Unnamed: 0,Film Title,U.S. Release Date
0,Iron Man,2008-05-02
1,The Incredible Hulk,2008-06-13
2,Iron Man 2,2010-05-07
3,Thor,2011-05-06
4,Captain America: The First Avenger,2011-07-22
5,The Avengers,2012-05-04
6,Iron Man 3,2013-05-03
7,Thor: The Dark World,2013-11-08
8,Captain America: The Winter Soldier,2014-04-04
9,Guardians of the Galaxy,2014-08-01
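
When the date text might not always be clean, the conversion can be made stricter. The sketch below assumes every date is written like "May 2, 2008". The malformed third value is invented to show what happens when a cell does not fit.

```python
import pandas as pd

# Two dates copied from the output above plus one deliberately malformed value.
raw = pd.Series(["May 2, 2008", "June 13, 2008", "not a date"])

# format spells out "Month day, Year"; errors="coerce" turns anything that
# does not fit into NaT instead of raising an exception.
parsed = pd.to_datetime(raw, format="%B %d, %Y", errors="coerce")
print(parsed)
```

Counting the NaT entries afterwards (for example with `parsed.isna().sum()`) gives a quick check that every scraped date was understood.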


## Let's try to incorporate Selenium into our program now
We want to automate the browser so that it follows the hyperlink of each movie and collects information about its box office

In [12]:
# set up webdriver, specifically chromedriver
from selenium.webdriver.chrome.service import Service

home_list = 'https://en.wikipedia.org/wiki/List_of_Marvel_Cinematic_Universe_films'

# Selenium 4 deprecates executable_path; the driver path goes through a Service object instead
driver = webdriver.Chrome(service=Service('Downloads/chromedriver'))
driver.get(home_list)

# a new Chrome window should pop up


In [13]:
base_url = "https://en.wikipedia.org"
# each href already starts with "/wiki/...", so we append it to the base URL
table = soup.find("table",{"class":"wikitable plainrowheaders"})
for header in table.find_all("th", scope="row"):
    href = header.find("a", href=True)["href"]
    url = base_url + href
    driver.get(url)
    smolsoup = bs(driver.page_source, 'html.parser')
    # don't be mean to wikipedia!
    time.sleep(5)

NoSuchWindowException: Message: no such window: target window already closed
from unknown error: web view not found
  (Session info: chrome=107.0.5304.121)
Stacktrace:
0   chromedriver                        0x00000001027492c8 chromedriver + 4752072
1   chromedriver                        0x00000001026c9463 chromedriver + 4228195
2   chromedriver                        0x000000010232cb18 chromedriver + 441112
3   chromedriver                        0x0000000102309210 chromedriver + 295440
4   chromedriver                        0x000000010238ee3d chromedriver + 843325
5   chromedriver                        0x00000001023a2719 chromedriver + 923417
6   chromedriver                        0x000000010238ab33 chromedriver + 826163
7   chromedriver                        0x000000010235b9fd chromedriver + 633341
8   chromedriver                        0x000000010235d051 chromedriver + 639057
9   chromedriver                        0x000000010271630e chromedriver + 4543246
10  chromedriver                        0x000000010271aa88 chromedriver + 4561544
11  chromedriver                        0x00000001027226df chromedriver + 4593375
12  chromedriver                        0x000000010271b8fa chromedriver + 4565242
13  chromedriver                        0x00000001026f12cf chromedriver + 4391631
14  chromedriver                        0x000000010273a5b8 chromedriver + 4691384
15  chromedriver                        0x000000010273a739 chromedriver + 4691769
16  chromedriver                        0x000000010275081e chromedriver + 4782110
17  libsystem_pthread.dylib             0x00007ff8069414e1 _pthread_start + 125
18  libsystem_pthread.dylib             0x00007ff80693cf6b thread_start + 15
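
A side note on the URL building in the loop above: joining a base and a relative href by hand makes it easy to end up with a doubled or missing slash. The standard library's `urljoin` handles that detail. In this sketch, `BASE` and `film_url` are names chosen for the example.

```python
from urllib.parse import urljoin

BASE = "https://en.wikipedia.org/"

def film_url(href):
    """Join the site root with a relative link such as "/wiki/Iron_Man_2"."""
    return urljoin(BASE, href)

print(film_url("/wiki/Iron_Man_2"))
# https://en.wikipedia.org/wiki/Iron_Man_2
```

The result is the same whether or not the base ends with a slash.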


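Great! We can now go from hyperlink to hyperlink. The NoSuchWindowException in the output above reports that the Chrome window was closed before the loop finished, so keep the window open while the loop runs. Now, let's scrape the box office figure from each film's page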

In [None]:
base_url = "https://en.wikipedia.org"
# each href already starts with "/wiki/...", so we append it to the base URL
table = soup.find("table",{"class":"wikitable plainrowheaders"})
for header in table.find_all("th", scope="row"):
    href = header.find("a", href=True)["href"]
    driver.get(base_url + href)
    smolsoup = bs(driver.page_source, 'html.parser')
    # Walk the infobox labels until we reach the one that reads "Box office".
    # Assumption: the figure sits in the <td> right after that label in the same row.
    box_office = None
    for label in smolsoup.find_all(class_="infobox-label"):
        if label.get_text(strip=True) == "Box office":
            value = label.find_next_sibling("td")
            if value is not None:
                box_office = value.get_text(strip=True)
            break
    print(header.get_text(strip=True), "->", box_office)
    # don't be mean to wikipedia!
    time.sleep(10)
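
The infobox lookup can be tried offline before pointing the browser at 23 pages. The fragment below is hypothetical: it assumes each "infobox-label" header is followed by a data cell in the same row, and the dollar amounts are placeholders rather than real figures. The helper name `infobox_value` is made up for this sketch.

```python
from bs4 import BeautifulSoup

# Hypothetical infobox fragment with placeholder values.
html = """
<table class="infobox">
<tr><th class="infobox-label">Budget</th><td class="infobox-data">$1 million</td></tr>
<tr><th class="infobox-label">Box office</th><td class="infobox-data">$2 million</td></tr>
</table>
"""

def infobox_value(soup, wanted):
    """Return the text of the cell next to the label `wanted`, or None if absent."""
    for label in soup.find_all(class_="infobox-label"):
        if label.get_text(strip=True) == wanted:
            cell = label.find_next_sibling("td")
            return cell.get_text(strip=True) if cell else None
    return None

demo = BeautifulSoup(html, "html.parser")
print(infobox_value(demo, "Box office"))
```

Returning None for a missing label, rather than wrapping the lookup in a bare except, keeps real errors such as a closed browser window visible.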