## Scraping 

- Using BeautifulSoup

    The sample HTML document is an excerpt from a story in Alice in Wonderland.

In [1]:
from bs4 import BeautifulSoup

1. Creating a soup (a BeautifulSoup object): it stores the parsed HTML document as a navigable data structure

In [3]:
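# read the sample HTML file into a string
with open('scrap.html', 'r') as f:
    html_doc = f.read()

# parse the string with Python's built-in HTML parser
soup = BeautifulSoup(html_doc, 'html.parser')

# prettify() returns the parse tree as an indented string
print(soup.prettify())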

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
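
If `scrap.html` is not at hand, the same kind of soup can be built from a string. The following is a minimal sketch, assuming an abbreviated copy of the document printed above; the `html_doc` literal is typed in here and is not read from the original file.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the sample document, inlined so the example runs on its own.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

print(type(soup).__name__)       # class of the parsed document object
print(soup.title.string)         # text inside <title>
print(len(soup.find_all("p")))   # number of <p> tags in this abbreviated copy
```

Passing a string instead of a file keeps the example independent of the working directory; the rest of the navigation calls work the same way on either soup.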



2. Navigating the soup object

In [None]:
print(soup.title)                # the whole <title> tag
print(soup.title.name)           # the tag's name
print(f"The title of the story is '{soup.title.string}'.")
print(soup.title.parent.name)    # name of the enclosing tag

link1 = soup.a   # soup.a returns the first <a> tag in the document
print(link1)
print(link1['href'])   # attributes are read like dictionary keys

# find_all() returns every matching tag
all_hlinks = soup.find_all('a')
for link in all_hlinks:
    print(link)

# finding a link by its id
soup.find(id='link3')

<title>The Dormouse's story</title>
title
The title of the story is 'The Dormouse's story'.
head
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
http://example.com/elsie
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>


<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
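
The same lookups can also be written with CSS selectors. This is a small sketch on an inlined fragment of the sample document (the `html_doc` literal is an assumption made so the cell runs on its own); `select()`, `select_one()` and `Tag.get()` are standard BeautifulSoup calls.

```python
from bs4 import BeautifulSoup

# Fragment of the sample document, inlined for a self-contained example.
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# select() takes a CSS selector and returns a list of matching tags.
sisters = soup.select("a.sister")
names = [a.get_text(strip=True) for a in sisters]
hrefs = [a["href"] for a in sisters]

# select_one() returns the first match, or None when nothing matches.
third = soup.select_one("#link3")

# .get() returns None instead of raising KeyError when an attribute is absent.
missing = third.get("title")

print(names)
print(hrefs)
print(third["id"], missing)
```

Using `.get()` rather than `tag['attr']` is the safer choice when an attribute may not be present on every tag.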

3. Extracting all the text from the page/HTML document

In [17]:
content = soup.get_text()
print(f"The story is: \n{content}")

The story is: 
The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
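
The output of `get_text()` keeps the line breaks of the source markup. One possible way to get tidier text is to process each paragraph separately and collapse the whitespace. This is a sketch on an inlined fragment (the `html_doc` literal is an assumption for illustration), not the only way to do it.

```python
from bs4 import BeautifulSoup

# Fragment of the sample document, inlined for a self-contained example.
html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">Elsie</a>,
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Take the text of each <p> separately; " ".join(text.split()) collapses
# newlines and repeated spaces into single spaces.
paragraphs = [" ".join(p.get_text().split()) for p in soup.find_all("p")]

for paragraph in paragraphs:
    print(paragraph)
```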


### Web Scraping

- Scraping GitHub to extract data on the top 5 trending repos as of 19-02-25

1. Retrieving the page's HTML from its URL with the requests module's get()

In [32]:
import requests

url = "https://github.com/trending"
r = requests.get(url)
print(r.text)



<!DOCTYPE html>
<html
  lang="en"
  
  data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"
  data-a11y-animated-images="system" data-a11y-link-underlines="true"
  
  >



  <head>
    <meta charset="utf-8">
  <link rel="dns-prefetch" href="https://github.githubassets.com">
  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">
  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">
  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">
  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>
  <link rel="preconnect" href="https://avatars.githubusercontent.com">

  

  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-7aa84bb7e11e.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-f65db3e8d171.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media=
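
A bare `requests.get(url)` waits indefinitely if the server stalls and does not raise on an error status. A slightly more defensive variant might look like the sketch below. The helper name `fetch_html` and its `get` parameter are hypothetical and introduced only for illustration; `timeout` and `raise_for_status()` are standard requests features.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the page body as text, raising on HTTP error statuses.

    `get` is injectable (an assumption of this sketch) so the function can be
    exercised without network access.
    """
    response = get(url, timeout=timeout)   # give up after `timeout` seconds
    response.raise_for_status()            # raises requests.HTTPError on 4xx/5xx
    return response.text
```

Calling `fetch_html("https://github.com/trending")` would return the same text as `r.text` above; printing only a slice such as `html[:500]` keeps the notebook output short.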

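2. Storing the retrieved HTML in a file

In [33]:
# an explicit encoding avoids errors on non-ASCII text when the platform default is not UTF-8
with open('github.html', 'w', encoding='utf-8') as f:
    f.write(r.text)

with open('github.html', encoding='utf-8') as f:
    github_html = f.read()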

3. Using BeautifulSoup to scrape the data

In [None]:
soup = BeautifulSoup(github_html, 'html.parser')  # parse the saved HTML
print(soup.prettify())

<!DOCTYPE html>
<html data-a11y-animated-images="system" data-a11y-link-underlines="true" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
 <head>
  <meta charset="utf-8"/>
  <link href="https://github.githubassets.com" rel="dns-prefetch"/>
  <link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
  <link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
  <link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
  <link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
  <link href="https://avatars.githubusercontent.com" rel="preconnect"/>
  <link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-7aa84bb7e11e.css" media="all" rel="stylesheet">
   <link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-f65db3e8d171.css" media="all" rel="stylesheet">
    <link crossorigin="anonymous" data-color-theme="dark_dimmed" data-href="https://

- Extracting Data through tags

In [None]:
import pandas as pd

repo_tags = soup.find_all('h2', class_='h3 lh-condensed')

repo_data = []

# only the first five headings are needed
for repo in repo_tags[:5]:

    # extracting basic repo details: name, author and URL
    repo_link = repo.find('a', class_='Link')
    repo_info_tag = repo.find('span', class_='text-normal')
    author_name = repo_info_tag.text.strip().replace(' /', '')
    # string=True replaces the deprecated text=True argument
    repo_name = repo_info_tag.find_next_sibling(string=True).strip()
    repo_url = "https://github.com" + repo_link['href']

    # extracting the repo's description
    repo_desc_tag = repo.find_next('p', class_='col-9 color-fg-muted my-1 pr-4')
    repo_desc = repo_desc_tag.text.strip()

    # storing the extracted data in the repo_data list, one dict per repo
    repo_data.append({
        'Repo_Name': repo_name,
        'Author': author_name,
        'Description': repo_desc,
        'URL': repo_url
    })

# Creating a pandas DataFrame from the scraped data to make it presentable and useful for analysis.
df = pd.DataFrame(repo_data)
# Saving the scraped data as a CSV file
df.to_csv('github_top5_trending_repos_19-02-25.csv', index=False)



Unnamed: 0,Repo_Name,Author,Description,URL
0,OmniParser,microsoft,A simple screen parsing tool towards pure visi...,https://github.com/microsoft/OmniParser
1,MoneyPrinterTurbo,harry0703,利用AI大模型，一键生成高清短视频 Generate short videos with o...,https://github.com/harry0703/MoneyPrinterTurbo
2,cobra,spf13,A Commander for modern Go CLI interactions,https://github.com/spf13/cobra
3,MoneyPrinterV2,FujiwaraChoki,Automate the process of making money online.,https://github.com/FujiwaraChoki/MoneyPrinterV2
4,exo,exo-explore,Run your own AI cluster at home with everyday ...,https://github.com/exo-explore/exo
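
The loop above assumes every repo has a description: `find_next()` searches forward through the whole document, so a repo without one would pick up the next repo's paragraph. One possible defensive variant is sketched below. The `sample_html` markup is hand-written to imitate the class names used above and is an assumption, not a copy of GitHub's page. The `<article>` wrapper, the `parse_repos` helper and the `example-user/example-repo` entry are likewise hypothetical.

```python
from bs4 import BeautifulSoup

# Hand-written markup imitating the class names used above (an assumption for
# illustration; the real page may be structured differently).
sample_html = """
<article>
  <h2 class="h3 lh-condensed">
    <a class="Link" href="/spf13/cobra"><span class="text-normal">spf13 /</span> cobra</a>
  </h2>
  <p class="col-9 color-fg-muted my-1 pr-4">A Commander for modern Go CLI interactions</p>
</article>
<article>
  <h2 class="h3 lh-condensed">
    <a class="Link" href="/example-user/example-repo"><span class="text-normal">example-user /</span> example-repo</a>
  </h2>
</article>
"""


def parse_repos(html, limit=5):
    """Return up to `limit` repo records, tolerating missing descriptions."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for heading in soup.find_all("h2", class_="h3 lh-condensed")[:limit]:
        link = heading.find("a", href=True)
        if link is None:
            continue  # skip headings that carry no link
        # "/author/name" -> ("author", "/", "name")
        author, _, name = link["href"].strip("/").partition("/")
        # look for the description only inside this repo's own container
        container = heading.find_parent("article")
        desc_tag = container.find("p") if container else None
        rows.append({
            "Repo_Name": name,
            "Author": author,
            "Description": desc_tag.get_text(strip=True) if desc_tag else "",
            "URL": "https://github.com" + link["href"],
        })
    return rows


rows = parse_repos(sample_html)
print(rows)
```

Deriving the author and name from the `href` avoids depending on the exact whitespace around the `/` in the link text.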


- Exploring the dataset

In [83]:
df.head()

Unnamed: 0,Repo_Name,Author,Description,URL
0,OmniParser,microsoft,A simple screen parsing tool towards pure visi...,https://github.com/microsoft/OmniParser
1,MoneyPrinterTurbo,harry0703,利用AI大模型，一键生成高清短视频 Generate short videos with o...,https://github.com/harry0703/MoneyPrinterTurbo
2,cobra,spf13,A Commander for modern Go CLI interactions,https://github.com/spf13/cobra
3,MoneyPrinterV2,FujiwaraChoki,Automate the process of making money online.,https://github.com/FujiwaraChoki/MoneyPrinterV2
4,exo,exo-explore,Run your own AI cluster at home with everyday ...,https://github.com/exo-explore/exo
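
Beyond `head()`, a few other quick checks are often useful on a freshly scraped table. The sketch below rebuilds a small frame from the names and authors shown in the output above; the descriptions are left out because they are truncated in that output, and the URL column is reassembled from the other two.

```python
import pandas as pd

# Names and authors copied from the output above; descriptions are omitted
# because they appear truncated there.
df = pd.DataFrame({
    "Repo_Name": ["OmniParser", "MoneyPrinterTurbo", "cobra", "MoneyPrinterV2", "exo"],
    "Author": ["microsoft", "harry0703", "spf13", "FujiwaraChoki", "exo-explore"],
})
df["URL"] = "https://github.com/" + df["Author"] + "/" + df["Repo_Name"]

print(df.shape)                # (rows, columns)
print(df.dtypes)               # column types
print(df["Author"].nunique())  # number of distinct authors
print(df.isna().sum())         # missing values per column
```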


### Scraping data from https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfPMN/pmn.cfm as instructed.

1. Page 1

In [None]:

session = requests.Session()

# browser-like headers, sent to avoid the "too many redirects" error
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'Connection': 'keep-alive'
}

url = 'https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfPMN/denovo.cfm'

response = session.get(url, headers=headers, allow_redirects=True)

if response.status_code == 200:
    print("Request was successful!")
else:
    print(f"Request failed: status code = {response.status_code}")


Request was successful!
b'\r\n\r\n\r\n\r\n\r\n \r\n\r\n\r\n\r\n\r\n\r\n \r\n \r\n\r\n\r\n\t \r\n\t\r\n\t\r\n\r\n\r\n\r\n\r\n\r\n \r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\t\t\r\n\r\n\r\n  \r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\t\r\n\r\n\r\n\r\n\t\r\n\r\n\r\n\r\n\t\r\n\r\n\r\n\r\n\t\t\r\n\r\n\r\n\r\n\t\t\r\n\r\n\r\n\r\n\t\t\r\n\r\n\r\n\r\n\t\t\r\n\r\n\r\n\r\n\r\n\t\t\r\n\r\n\r\n\r\n\t\t\r\n\r\n\r\n\t\r\n\t\r\n\t\r\n\r\n\r\n\r\n\t\r\n\r\n\r\n\r\n\t\t\r\n\r\n\r\n\r\n\t\r\n\r\n\r\n\r\n\t\t\r\n\r\n\r\n\r\n\t\t\r\n\r\n\r\n\r\n\t\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\t\t\r\n\r\n\r\n\r\n\t\r\n\r\n\r\n\r\n\t\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\t\r\n<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3c.org/TR/1999/REC-html401-19991224/loose.dtd">\r\n<html xmlns="http://www.w3.org/1999/xhtml" xmlns:addthis="http://www.addthis.com/help/api-spec">\r\n<head>\r\n\r\n  \r\n  \r\n  \r\n  \r\n  \r\n  \r\n  \r\n  \r\n  \r\n  \r\n  \r\n  \r\n  \r\n  \r\n
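
Headers can also be attached to the session once, so that every later request reuses them, and the session can be told to retry transient failures. The following is an optional sketch, not part of the original cell: `make_session` is a hypothetical helper name and the User-Agent string is a placeholder. `HTTPAdapter` and `Retry` are the standard requests/urllib3 classes.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(user_agent, retries=3, backoff=0.5):
    """Build a Session that sends `user_agent` and retries transient errors."""
    retry = Retry(
        total=retries,                              # maximum number of retries
        backoff_factor=backoff,                     # growing delay between tries
        status_forcelist=(429, 500, 502, 503, 504), # statuses worth retrying
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    # headers set here are sent with every request made through the session
    session.headers.update({"User-Agent": user_agent})
    return session


# Placeholder User-Agent; no request is sent until session.get() is called.
session = make_session("Mozilla/5.0 (compatible; example-scraper)")
```

With a session built this way, `session.get(url, timeout=10)` no longer needs a `headers=` argument on each call.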

In [89]:
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3c.org/TR/1999/REC-html401-19991224/loose.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:addthis="http://www.addthis.com/help/api-spec">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="IE=7" http-equiv="X-UA-Compatible"/>
  <script src="/cf_scripts/scripts/cfform.js" type="text/javascript">
  </script>
  <script src="/cf_scripts/scripts/masks.js" type="text/javascript">
  </script>
  <script src="/scripts/includes/js/wcm.toggle.js" type="text/javascript">
  </script>
  <script src="/scripts/includes/js/ssajax_2012.js" type="text/javascript">
  </script>
  <title>
   Device Classification Under Section 513(f)(2)(De Novo)
  </title>
  <!-- /**** Begin CSS References ****/-->
  <link href="/scripts/cdrh/cfdocs/default_style.css" media="all" rel="stylesheet" type="text/css"/>
  <link href="/scripts/includes/css/css_fda_gov_stylesheet-2013.css" media="screen" r

- Extracting data using BeautifulSoup

In [None]:
title_tag = soup.find('h1', class_='head1')
page_title = title_tag.text.strip()
content_tag = soup.find('div', class_='pmn-intro')
# drop the "learn more..." link text and the surrounding whitespace
content = content_tag.text.replace('learn more...', '').strip()

# saving the extracted data to a text file
with open('page1_data.txt', 'w', encoding='utf-8') as f:
    f.write(f"Title: {page_title}\n")
    f.write(f"Body: {content}\n")
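
`soup.find()` returns `None` when nothing matches, so the cell above raises `AttributeError` if the page layout changes. A guarded version could look like the sketch below. The `sample_html` markup is a hypothetical stand-in that only reuses the two class names from the code above (it is not the real page), and `extract_page` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup reusing the class names from the code above.
sample_html = """
<h1 class="head1">Sample title</h1>
<div class="pmn-intro">
  Sample introductory text. learn more...
</div>
"""


def extract_page(html):
    """Return the page title and intro text, or empty strings if absent."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("h1", class_="head1")
    intro_tag = soup.find("div", class_="pmn-intro")
    # fall back to "" instead of failing when a tag is missing
    title = title_tag.get_text(strip=True) if title_tag else ""
    body = intro_tag.get_text(strip=True) if intro_tag else ""
    body = body.replace("learn more...", "").strip()
    return {"title": title, "body": body}


print(extract_page(sample_html))
print(extract_page("<p>no matching tags</p>"))
```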