
### From video [Beautiful Soup Tutorial](https://www.youtube.com/watch?v=GjKQ6V_ViQE) by [Keith Galli](https://www.youtube.com/channel/UCq6XkhO5SZ66N04IcPbqNcw)

#### Load the website https://keithgalli.github.io/web-scraping/webpage.html

In [None]:
import requests
from bs4 import BeautifulSoup

In [None]:
url = "https://keithgalli.github.io/web-scraping/"
r = requests.get(url + "webpage.html")
print(r.status_code)

200


In [None]:
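# Name the parser explicitly so the result does not depend on which parsers are installed
webpage = BeautifulSoup(r.content, "html.parser")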
print(webpage.body.prettify())

#### Exercise 1: Grab all of the social links from the webpage

in 3 different ways

In [None]:
social = webpage.find("ul", attrs={"class": "socials"})
links = social.select("li a")
links = [a["href"] for a in links]
for i in links: print(i)

https://www.instagram.com/keithgalli/
https://twitter.com/keithgalli
https://www.linkedin.com/in/keithgalli/
https://www.tiktok.com/@keithgalli


In [None]:
links = social.find_all("li")
for a in links:
  print(a.find("a").string)

https://www.instagram.com/keithgalli/
https://twitter.com/keithgalli
https://www.linkedin.com/in/keithgalli/
https://www.tiktok.com/@keithgalli


In [None]:
import re
links = social.find_all(string=re.compile("https://"))
for i in links: print(i)

https://www.instagram.com/keithgalli/
https://twitter.com/keithgalli
https://www.linkedin.com/in/keithgalli/
https://www.tiktok.com/@keithgalli
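
The second and third approaches print the link text, which only equals the URL because this page happens to display each URL as its own link text. A minimal sketch of a fourth variant that reads the `href` attribute with a single CSS selector; the `sample_html` markup is an assumption that stands in for the real page so the cell runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a <ul class="socials"> whose <li> items each wrap one <a>.
sample_html = """
<ul class="socials">
  <li><b>Instagram: </b><a href="https://www.instagram.com/keithgalli/">https://www.instagram.com/keithgalli/</a></li>
  <li><b>Twitter: </b><a href="https://twitter.com/keithgalli">https://twitter.com/keithgalli</a></li>
</ul>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# "ul.socials a[href]" selects every <a> with an href inside the socials list.
hrefs = [a["href"] for a in soup.select("ul.socials a[href]")]
for href in hrefs:
    print(href)
```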


#### Exercise 2: Scrape the HTML table into a pandas DataFrame

In [None]:
import pandas as pd

table = webpage.find("table")
# Header cells provide the column names
columns = table.thead.find_all("th")
columns_name = [e.string for e in columns]

# Build one list of stripped cell texts per body row
rows = table.tbody.find_all("tr")
trs = [i.find_all("td") for i in rows]
tds = [[i.get_text().strip() for i in r] for r in trs]
df = pd.DataFrame(tds, columns=columns_name)
# Show only the seasons that were actually played
df.loc[df["Team"] != "Did not play", :]

         S                        Team    League  GP  G   A  TP  PIM  +/-     POST  GP  G  A  TP  PIM  +/-
0  2014-15  MIT (Mass. Inst. of Tech.)   ACHA II  17  3   9  12   20       |
1  2015-16  MIT (Mass. Inst. of Tech.)   ACHA II   9  1   1   2    2       |
2  2016-17  MIT (Mass. Inst. of Tech.)   ACHA II  12  5   5  10    8  0.0  |
4  2018-19  MIT (Mass. Inst. of Tech.)  ACHA III   8  5  10  15    8       |


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 17 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   S       5 non-null      object
 1   Team    5 non-null      object
 2   League  5 non-null      object
 3   GP      5 non-null      object
 4   G       5 non-null      object
 5   A       5 non-null      object
 6   TP      5 non-null      object
 7   PIM     5 non-null      object
 8   +/-     5 non-null      object
 9           5 non-null      object
 10  POST    5 non-null      object
 11  GP      5 non-null      object
 12  G       5 non-null      object
 13  A       5 non-null      object
 14  TP      5 non-null      object
 15  PIM     5 non-null      object
 16  +/-     5 non-null      object
dtypes: object(17)
memory usage: 808.0+ bytes
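
Every column comes back as `object` because the scraped cells are strings. A minimal sketch of one way to convert the stat columns to numbers, assuming `pd.to_numeric` with `errors="coerce"` is acceptable; the `demo` frame and the `stat_cols` list are hypothetical stand-ins for the scraped table.

```python
import pandas as pd

# Hypothetical miniature of the scraped frame: all values are strings,
# and a missing stat is an empty string.
demo = pd.DataFrame(
    {"S": ["2014-15", "2015-16"], "GP": ["17", "9"], "+/-": ["", "0"]}
)

stat_cols = ["GP", "+/-"]
# errors="coerce" turns anything that is not a number (here "") into NaN.
demo[stat_cols] = demo[stat_cols].apply(pd.to_numeric, errors="coerce")
print(demo.dtypes)
```

The real table repeats column names such as `GP` for the regular season and the postseason, so selecting by label there returns more than one column. This sketch does not handle that case.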


#### Exercise 3: Grab all fun facts that contain the word "is"

In [None]:
import re
fun_facts = webpage.find("ul", attrs={"class": "fun-facts"})
fun_facts = fun_facts.find_all("li")
# \b word boundaries keep "is" from matching inside words such as "this"
[ff.get_text() for ff in fun_facts if re.search(r"\bis\b", ff.get_text())]

['Middle name is Ronald',
 'Dunkin Donuts coffee is better than Starbucks',
 "A favorite book series of mine is Ender's Game",
 'Current video game of choice is Rocket League',
 "The band that I've seen the most times live is the Zac Brown Band"]

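A plain substring test for "is" also matches words that merely contain those letters. A small sketch of the difference between a substring test and a word-boundary regex; the `facts` sentences are made up for illustration and are not from the webpage.

```python
import re

# Hypothetical sentences: only the first contains "is" as a standalone word.
facts = ["Middle name is Ronald", "This list has no match", "Paris in spring"]

word_is = re.compile(r"\bis\b")

# Substring test: true whenever the letters "is" appear anywhere.
substring_hits = [f for f in facts if "is" in f]
# Word-boundary test: true only when "is" stands alone as a word.
word_hits = [f for f in facts if word_is.search(f)]

print(substring_hits)
print(word_hits)
```
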
#### Exercise 4: Use BeautifulSoup to help download the images from the webpage

In [None]:
images_div = webpage.select("div.row .column")
images_urls = [url + div.img["src"] for div in images_div]
images_urls 

['https://keithgalli.github.io/web-scraping/images/italy/lake_como.jpg',
 'https://keithgalli.github.io/web-scraping/images/italy/pontevecchio.jpg',
 'https://keithgalli.github.io/web-scraping/images/italy/riomaggiore.jpg']

In [None]:
for i, image_url in enumerate(images_urls, start=1):
  with open(f"image{i}.jpg", "wb") as handler:
    handler.write(requests.get(image_url).content)
# The downloaded files now appear in the notebook's file browser
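
Concatenating the base URL and `src` works here because every `src` is relative to the page folder. A minimal sketch, assuming the standard library's `urljoin` is acceptable, that also copes with root-relative and absolute `src` values and keeps the original file name. The helper `resolve_image` is hypothetical, and the sample `src` is inferred from the URLs printed above.

```python
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlparse

def resolve_image(base, src):
    """Return (absolute URL, file name) for an <img> src found on a page at `base`."""
    image_url = urljoin(base, src)
    # The file name is the last component of the URL path.
    filename = PurePosixPath(urlparse(image_url).path).name
    return image_url, filename

base = "https://keithgalli.github.io/web-scraping/"
image_url, filename = resolve_image(base, "images/italy/lake_como.jpg")
print(image_url)
print(filename)
```

Saving under `filename` instead of `image1.jpg`, `image2.jpg`, and so on keeps the downloaded files recognizable.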

#### Exercise 5: Solve the mystery challenge

In [None]:
h2_mystery_challenge = webpage.find("h2", string=re.compile("Mystery"))
h2_next_siblings = h2_mystery_challenge.find_next_sibling("div")
list_file_links = h2_next_siblings.find_all("a")
file_links = [url + a["href"] for a in list_file_links]
file_links


['https://keithgalli.github.io/web-scraping/challenge/file_1.html',
 'https://keithgalli.github.io/web-scraping/challenge/file_2.html',
 'https://keithgalli.github.io/web-scraping/challenge/file_3.html',
 'https://keithgalli.github.io/web-scraping/challenge/file_4.html',
 'https://keithgalli.github.io/web-scraping/challenge/file_5.html',
 'https://keithgalli.github.io/web-scraping/challenge/file_6.html',
 'https://keithgalli.github.io/web-scraping/challenge/file_7.html',
 'https://keithgalli.github.io/web-scraping/challenge/file_8.html',
 'https://keithgalli.github.io/web-scraping/challenge/file_9.html',
 'https://keithgalli.github.io/web-scraping/challenge/file_10.html']

In [None]:
for file_link in file_links:
  file_req = requests.get(file_link).content
  soup = BeautifulSoup(file_req, "html.parser")
  paragraph = soup.find("p", attrs={"id": "secret-word"})
  if paragraph:
    # To also show which file each word came from, use this line instead:
    # print(paragraph.string, "-> found in: " + file_link.split("/")[-1])
    print(paragraph.string)

Make
sure
to
smash
that
like
button
and
subscribe
!!!
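
The words print one per line, so the message still has to be read top to bottom. A minimal sketch that collects the words and joins them into one sentence; the `pages` list is a hypothetical stand-in for the downloaded challenge files so the cell runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical page bodies standing in for the downloaded challenge files.
pages = [
    '<p id="secret-word">Make</p>',
    "<p>No secret here</p>",
    '<p id="secret-word">sure</p>',
]

words = []
for html in pages:
    tag = BeautifulSoup(html, "html.parser").find(id="secret-word")
    if tag is not None:  # pages without the id are skipped
        words.append(tag.get_text(strip=True))

message = " ".join(words)
print(message)
```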


# I solved all of Keith's BeautifulSoup exercises! Thank you, brother, and God bless you!!!