From data scraping to data analysis.<br>
Following this tutorial: https://www.youtube.com/watch?v=GjKQ6V_ViQE&list=PLFCB5Dp81iNVmuoGIqcT5oF4K-7kTI5vp<br>
BeautifulSoup docs: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [117]:
import requests
from bs4 import BeautifulSoup as bs

Loading page


In [118]:
r=requests.get('https://keithgalli.github.io/web-scraping/example.html')

# Convert to bs4 object (naming the parser avoids bs4's "no parser specified" warning)
soup=bs(r.content, 'html.parser')

# print html
print(soup.prettify())

<html>
 <head>
  <title>
   HTML Example
  </title>
 </head>
 <body>
  <div align="middle">
   <h1>
    HTML Webpage
   </h1>
   <p>
    Link to more interesting example:
    <a href="https://keithgalli.github.io/web-scraping/webpage.html">
     keithgalli.github.io/web-scraping/webpage.html
    </a>
   </p>
  </div>
  <h2>
   A Header
  </h2>
  <p>
   <i>
    Some italicized text
   </i>
  </p>
  <h2>
   Another header
  </h2>
  <p id="paragraph-id">
   <b>
    Some bold text
   </b>
  </p>
 </body>
</html>



find & find_all

In [119]:
first_header=soup.find('h2')
first_header

<h2>A Header</h2>

In [120]:
headers=soup.find_all('h2')
headers

[<h2>A Header</h2>, <h2>Another header</h2>]

A list of tag names can be passed as well; a tag matches if its name is any of them: <br>
first_header=soup.find(['h1','h2'])<br>
headers=soup.find_all(['h1','h2'])
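
To make the difference concrete, here is a minimal sketch on an inline HTML snippet (an assumption that mirrors the example page, so no request is needed). With a list, find returns the first tag in document order that matches any of the names, and find_all returns every match.

```python
from bs4 import BeautifulSoup as bs

# Inline HTML standing in for the example page (assumed, trimmed down)
html = "<body><h1>HTML Webpage</h1><h2>A Header</h2><h2>Another header</h2></body>"
soup = bs(html, "html.parser")

# find() returns the first tag in document order whose name is in the list
first_header = soup.find(["h1", "h2"])
print(first_header.name)

# find_all() returns every tag whose name is in the list, in document order
headers = soup.find_all(["h1", "h2"])
print([h.get_text() for h in headers])
```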

Attributes (e.g. id)

In [121]:
paragraph=soup.find_all('p',attrs={'id':'paragraph-id'})
paragraph

[<p id="paragraph-id"><b>Some bold text</b></p>]
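
The same lookup can be written a few ways. The sketch below, on an assumed inline snippet, compares the attrs dictionary, the keyword-argument form, and a CSS id selector; all three should hand back the same tag.

```python
from bs4 import BeautifulSoup as bs

# Assumed snippet: one plain paragraph and one with the id used above
html = '<p>plain</p><p id="paragraph-id"><b>Some bold text</b></p>'
soup = bs(html, "html.parser")

by_attrs = soup.find("p", attrs={"id": "paragraph-id"})  # dictionary form
by_keyword = soup.find("p", id="paragraph-id")           # keyword form
by_css = soup.select_one("p#paragraph-id")               # CSS id selector

# All three point at the same Tag object in the parsed tree
print(by_attrs is by_keyword is by_css)
```

The attrs dictionary is the form to reach for when the attribute name is not a valid Python keyword argument, for example a name containing a hyphen.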

Nested find & find_all


In [122]:
body=soup.find('body')
body

<body>
<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
<h2>A Header</h2>
<p><i>Some italicized text</i></p>
<h2>Another header</h2>
<p id="paragraph-id"><b>Some bold text</b></p>
</body>
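
Once a tag is found, find and find_all can be called on it again to search only inside that tag. A small sketch on an assumed snippet; note that find_all searches all descendants by default, and recursive=False limits it to direct children.

```python
from bs4 import BeautifulSoup as bs

# Assumed snippet shaped like the example page
html = """
<body>
<div><h1>HTML Webpage</h1><p>Link paragraph</p></div>
<h2>A Header</h2>
<p><i>Some italicized text</i></p>
</body>
"""
soup = bs(html, "html.parser")

body = soup.find("body")
div = body.find("div")  # search inside body only

inner_paragraphs = div.find_all("p")                     # paragraphs inside the div
all_paragraphs = body.find_all("p")                      # every paragraph below body
direct_paragraphs = body.find_all("p", recursive=False)  # direct children of body only

print(len(inner_paragraphs), len(all_paragraphs), len(direct_paragraphs))
```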

Searching for specific strings

A plain string only matches a tag whose entire text equals that string, so the re library is needed to match a word inside longer text.

(s|S) matches either a lowercase or an uppercase s.


In [123]:
import re
paragraphs=soup.find_all('p',string=re.compile('(s|S)ome'))
paragraphs

[<p><i>Some italicized text</i></p>,
 <p id="paragraph-id"><b>Some bold text</b></p>]
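
The pattern itself can be tried out with re alone. This sketch uses a few made-up strings and compares the alternation used above with a character class, a case-insensitive flag, and a word-boundary version. It uses search() because, as far as I know, that is how BeautifulSoup applies a compiled pattern.

```python
import re

# Made-up sample strings, not taken from the page
texts = ["Some bold text", "awesome example", "SOME CAPS", "A Header"]

alternation = re.compile("(s|S)ome")          # pattern used above
char_class = re.compile("[sS]ome")            # same meaning, shorter
ignore_case = re.compile("some", re.IGNORECASE)  # any mix of upper/lower case
whole_word = re.compile(r"\b[sS]ome\b")       # only 'some' as a separate word


def matching(pattern):
    """Return the sample strings in which the pattern is found anywhere."""
    return [t for t in texts if pattern.search(t)]


for name, pattern in [("alternation", alternation), ("char_class", char_class),
                      ("ignore_case", ignore_case), ("whole_word", whole_word)]:
    print(name, matching(pattern))
```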

A bit more advanced:<br>
CSS selectors: https://www.w3schools.com/cssref/css_selectors.php

In [124]:
print(soup.body.prettify())

<body>
 <div align="middle">
  <h1>
   HTML Webpage
  </h1>
  <p>
   Link to more interesting example:
   <a href="https://keithgalli.github.io/web-scraping/webpage.html">
    keithgalli.github.io/web-scraping/webpage.html
   </a>
  </p>
 </div>
 <h2>
  A Header
 </h2>
 <p>
  <i>
   Some italicized text
  </i>
 </p>
 <h2>
  Another header
 </h2>
 <p id="paragraph-id">
  <b>
   Some bold text
  </b>
 </p>
</body>



In [125]:
content=soup.select('div p') # every <p> that is a descendant (at any depth) of a <div>
content

[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>]

In [126]:
par=soup.select('h2 ~ p') # every <p> that is a later sibling of an <h2> (same parent, comes after it)
par

[<p><i>Some italicized text</i></p>,
 <p id="paragraph-id"><b>Some bold text</b></p>]

In [127]:
bold_txt=soup.select("p#paragraph-id b") # every <b> inside the <p> whose id is 'paragraph-id'
bold_txt

[<b>Some bold text</b>]

In [128]:
pars=soup.select('body > p') # only <p> tags that are direct children of <body>
pars

[<p><i>Some italicized text</i></p>,
 <p id="paragraph-id"><b>Some bold text</b></p>]

In [129]:
for par in pars:
    print(par.select('i'))

[<i>Some italicized text</i>]
[]
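
The combinators are easy to mix up, so here is a side-by-side sketch on an assumed snippet (the extra trailing paragraph is added only to separate ~ from +, which is not used above).

```python
from bs4 import BeautifulSoup as bs

# Assumed snippet; the last <p> exists only to tell '~' and '+' apart
html = """
<body>
<div><h1>Title</h1><p>inside div</p></div>
<h2>First</h2>
<p>after first h2</p>
<h2>Second</h2>
<p id="paragraph-id"><b>bold</b></p>
<p>trailing</p>
</body>
"""
soup = bs(html, "html.parser")


def texts(selector):
    """Text of every tag matched by a CSS selector, in document order."""
    return [tag.get_text() for tag in soup.select(selector)]


descendants = texts("div p")     # <p> anywhere inside a <div>
children = texts("body > p")     # <p> that is a direct child of <body>
later_siblings = texts("h2 ~ p")  # <p> that follows an <h2> under the same parent
next_sibling = texts("h2 + p")   # <p> that comes immediately after an <h2>

print(descendants, children, later_siblings, next_sibling, sep="\n")
```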


Printing the text without the tags

In [130]:
div=soup.find('div')
print(div.get_text())


HTML Webpage
Link to more interesting example: keithgalli.github.io/web-scraping/webpage.html



In [131]:
link=soup.find('a')
link['href']

'https://keithgalli.github.io/web-scraping/webpage.html'
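
Two small variations that tend to be useful here, sketched on an assumed snippet: get_text can strip and join the pieces of text, and .get returns None for a missing attribute where the square-bracket form raises KeyError.

```python
from bs4 import BeautifulSoup as bs

# Assumed snippet: the second <a> deliberately has no href
html = ('<div><h1>HTML Webpage</h1>'
        '<p>Link: <a href="https://example.com/page">a page</a> and <a>no target</a></p></div>')
soup = bs(html, "html.parser")

div = soup.find("div")

# strip=True trims each piece of text, separator joins the pieces
text = div.get_text(separator=" ", strip=True)
print(text)

# .get() returns None when the attribute is missing
hrefs = [a.get("href") for a in div.find_all("a")]
print(hrefs)

# Square brackets raise KeyError for the same missing attribute
try:
    div.find_all("a")[1]["href"]
    raised = False
except KeyError:
    raised = True
print(raised)
```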

Next page:<br>
Task: find all social media links in three different ways

In [132]:
r=requests.get('https://keithgalli.github.io/web-scraping/webpage.html')

# Convert to bs4 object (naming the parser avoids bs4's "no parser specified" warning)
soup=bs(r.content, 'html.parser')

In [133]:
links=soup.select('ul.socials a')
links=[link['href'] for link in links]
links

['https://www.instagram.com/keithgalli/',
 'https://twitter.com/keithgalli',
 'https://www.linkedin.com/in/keithgalli/',
 'https://www.tiktok.com/@keithgalli']

In [134]:
ul= soup.find('ul',attrs={'class':'socials'})
links=ul.find_all('a')
the_links=[link['href'] for link in links]
the_links

['https://www.instagram.com/keithgalli/',
 'https://twitter.com/keithgalli',
 'https://www.linkedin.com/in/keithgalli/',
 'https://www.tiktok.com/@keithgalli']

In [139]:
links=soup.select('li.social a') # the <ul> has class 'socials' but each <li> has class 'social'; a class selector must match the whole class name, so li.socials finds nothing
links=[link['href'] for link in links]
links

['https://www.instagram.com/keithgalli/',
 'https://twitter.com/keithgalli',
 'https://www.linkedin.com/in/keithgalli/',
 'https://www.tiktok.com/@keithgalli']
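
A sketch of why the class name matters. The markup below is an assumed, simplified version of the list (two of the links from the output above, with class names inferred from which selectors worked). A class selector matches a whole class name, never a prefix of one.

```python
from bs4 import BeautifulSoup as bs

# Assumed, simplified markup: list has class "socials", each item class "social"
html = """
<ul class="socials">
  <li class="social"><a href="https://twitter.com/keithgalli">Twitter</a></li>
  <li class="social"><a href="https://www.tiktok.com/@keithgalli">TikTok</a></li>
</ul>
"""
soup = bs(html, "html.parser")


def hrefs(selector):
    """href of every link matched by the selector."""
    return [a["href"] for a in soup.select(selector)]


via_list = hrefs("ul.socials a")   # class on the <ul>
via_items = hrefs("li.social a")   # class on each <li>
wrong_class = hrefs("li.socials a")  # no <li> carries the class 'socials'

print(via_list == via_items, wrong_class)
```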

Scraping a table

In [149]:
import pandas as pd
table=soup.select('.hockey-stats')[0]

rows=table.select('tr')[1:]
data=[]
for row in rows:
    cells = row.select('td')
    row_data = [cell.get_text(strip=True) for cell in cells]
    data.append(row_data)

columns=['S', 'Team', 'League', 'GP', 'G', 'A', 'TP', 'PIM', '+/-', ' ',
         'Post', 'GP', 'G', 'A', 'TP', 'PIM', '+/-']
df=pd.DataFrame(data, columns=columns)
df.head()

Unnamed: 0,S,Team,League,GP,G,A,TP,PIM,+/-,Unnamed: 10,Post,GP.1,G.1,A.1,TP.1,PIM.1,+/-.1
0,2014-15,MIT (Mass. Inst. of Tech.),ACHA II,17.0,3.0,9.0,12.0,20.0,,|,,,,,,,
1,2015-16,MIT (Mass. Inst. of Tech.),ACHA II,9.0,1.0,1.0,2.0,2.0,,|,,,,,,,
2,2016-17,MIT (Mass. Inst. of Tech.),ACHA II,12.0,5.0,5.0,10.0,8.0,0.0,|,,,,,,,
3,2017-18,Did not play,,,,,,,,|,,,,,,,
4,2018-19,MIT (Mass. Inst. of Tech.),ACHA III,8.0,5.0,10.0,15.0,8.0,,|,,,,,,,
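
Instead of typing the column names by hand, the header row can be scraped too. This is a sketch on a cut-down table: four columns, two rows taken from the output above, and the thead/tbody layout is an assumption. The dedupe helper is hypothetical and only exists to tell repeated names like GP apart.

```python
from bs4 import BeautifulSoup as bs

# Cut-down stand-in for the stats table; structure is assumed
html = """
<table class="hockey-stats">
  <thead><tr><th>S</th><th>Team</th><th>GP</th><th>GP</th></tr></thead>
  <tbody>
    <tr><td>2014-15</td><td>MIT (Mass. Inst. of Tech.)</td><td>17</td><td></td></tr>
    <tr><td>2015-16</td><td>MIT (Mass. Inst. of Tech.)</td><td>9</td><td></td></tr>
  </tbody>
</table>
"""
soup = bs(html, "html.parser")
table = soup.select_one("table.hockey-stats")


def dedupe(names):
    """Hypothetical helper: suffix repeated names with .1, .2, ..."""
    seen = {}
    unique = []
    for name in names:
        count = seen.get(name, 0)
        unique.append(name if count == 0 else f"{name}.{count}")
        seen[name] = count + 1
    return unique


# Header cells become column names, data cells become rows of strings
columns = dedupe([th.get_text(strip=True) for th in table.select("thead th")])
rows = [[td.get_text(strip=True) for td in tr.select("td")]
        for tr in table.select("tbody tr")]

# One dictionary per row, keyed by column name
records = [dict(zip(columns, row)) for row in rows]
print(columns)
print(records[0])
```

The rows and columns lists can be passed to pd.DataFrame(rows, columns=columns) in the same way as the hand-written list above.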


Grabbing all texts under 'fun facts' that have the word 'is' in them

In [164]:
facts=soup.select('ul.fun-facts li')
actual=[fact.find(string=re.compile(r'\bis\b')) for fact in facts] # \b keeps 'is' from matching inside words like 'this'
actual=[fact for fact in actual if fact] # drop None (facts with no match)
actual

['Middle name is Ronald',
 'Dunkin Donuts coffee is better than Starbucks',
 'A favorite book series of mine is ',
 'Current video game of choice is ',
 "The band that I've seen the most times live is the "]

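Some of the results above stop right after 'is' because find(string=...) returns only the one piece of text that matched, not the text of nested tags. A sketch of one way around it, on an assumed list (the italic title is a placeholder, not the real page content): test the pattern against the full text of each li instead.

```python
import re
from bs4 import BeautifulSoup as bs

# Assumed list; 'Placeholder Title' stands in for text nested in another tag
html = """
<ul class="fun-facts">
  <li>Middle name is Ronald</li>
  <li>A favorite book series of mine is <i>Placeholder Title</i></li>
  <li>This entry should not match</li>
</ul>
"""
soup = bs(html, "html.parser")
facts = soup.select("ul.fun-facts li")

word_is = re.compile(r"\bis\b")  # 'is' as a whole word
substring_is = re.compile("is")  # also matches the 'is' inside 'This'

# get_text() includes the text of nested tags, so nothing is cut off
full_text = [li.get_text() for li in facts if word_is.search(li.get_text())]
loose = [li.get_text() for li in facts if substring_is.search(li.get_text())]

print(full_text)
print(len(loose))
```
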
Image downloading

In [165]:
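url='https://keithgalli.github.io/web-scraping/' # base URL; the src values on the page are relative to it
images=soup.select('div.row div.column img')
image_url=images[0]['src'] # relative path of the first image
full_url=url+image_url
# downloading
img_data=requests.get(full_url).content
with open('lake_como.jpg', 'wb') as handler: # 'wb' because the image is binary data
    handler.write(img_data)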
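
Joining the base URL and the src by hand works when the src is a simple relative path. A slightly more general sketch uses urljoin from the standard library, which also copes with absolute src values. The src strings below are hypothetical, and nothing is downloaded here.

```python
import posixpath
from urllib.parse import urljoin, urlparse

page_url = "https://keithgalli.github.io/web-scraping/webpage.html"
src = "images/lake_como.jpg"  # hypothetical relative src, for illustration only

# Resolve the src against the URL of the page it was found on
full_url = urljoin(page_url, src)

# Take the file name from the path part of the URL
filename = posixpath.basename(urlparse(full_url).path)

# An absolute src is returned unchanged
absolute = urljoin(page_url, "https://example.com/pic.png")

print(full_url)
print(filename)
print(absolute)
```

The resulting full_url and filename can then go into requests.get and open in the same way as in the cell above.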