#### **S27.7 - WEB SCRAPING**

**What is Web Scraping?**
- Web scraping is an automated method to obtain large amounts of data from websites.
- The data obtained is usually unstructured HTML, which is then converted into structured data (for example, a spreadsheet or a database) for use in various applications.
- Methods include online services, dedicated APIs, or custom scraping code.
- Some large websites provide APIs for structured data access, but not all sites do.
- Web scraping involves two parts: a crawler, an automated program (often called a spider) that browses the web, and a scraper, a tool built to extract data from a page.
- The crawler finds the pages that hold the required data by following links.
- The scraper's design depends on the complexity of the project, so that it can extract the required data accurately.
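
The crawler/scraper split described above can be sketched in a few lines. This is a minimal illustration, not a production crawler: it uses only the standard library (`html.parser` instead of BeautifulSoup) so it runs without any installs, and the `PAGES` dictionary, the `example.test` URLs, and the names `PageParser` and `crawl_and_scrape` are all hypothetical stand-ins for a real website and real HTTP requests.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "website": URL -> HTML (stands in for real HTTP responses)
PAGES = {
    "https://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.test/a": '<h2>Alpha</h2> <a href="/b">B</a>',
    "https://example.test/b": '<h2>Beta</h2> <a href="/">home</a>',
}


class PageParser(HTMLParser):
    """Collects link targets (for the crawler) and <h2> text (for the scraper)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.titles = []
        self._in_h2 = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "h2":
            self._in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2 and data.strip():
            self.titles.append(data.strip())


def crawl_and_scrape(start_url):
    """Crawler part: follow links breadth-first. Scraper part: keep each page's <h2> text."""
    seen, queue, scraped = {start_url}, [start_url], {}
    while queue:
        url = queue.pop(0)
        parser = PageParser()
        parser.feed(PAGES.get(url, ""))
        scraped[url] = parser.titles
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links against the current page
            if absolute in PAGES and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return scraped


result = crawl_and_scrape("https://example.test/")
print(result)
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other, as the hypothetical `/b` page does here.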


**How Do Web Scrapers Work?**
- Web scrapers can extract either all data or specific data from websites based on user preferences.
- It's advisable to specify desired data to streamline the scraping process.
- For example, you might extract only the juicer models from an Amazon page and skip the customer reviews.
- The process starts with one or more URLs. The scraper loads the HTML code of those pages (and possibly the CSS and JavaScript as well).
- The scraper then extracts the required data from the HTML code.
- The output format is chosen by the user, commonly an Excel spreadsheet, a CSV file, or a JSON file.
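
As a small illustration of the last step, the sketch below writes the same scraped rows as CSV and as JSON using only the standard library. The rows are made-up placeholders (the company names and ratings are not real data), and `to_csv` and `to_json` are hypothetical helper names.

```python
import csv
import io
import json

# Placeholder rows, shaped like what a scraper might return (values are illustrative only)
rows = [
    {"name": "Company A", "rating": "3.8", "reviews": "71.9k"},
    {"name": "Company B", "rating": "4.0", "reviews": "45.5k"},
]


def to_csv(records):
    """Serialize a list of dicts to CSV text, using the first record's keys as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json(records):
    """Serialize a list of dicts to pretty-printed JSON text."""
    return json.dumps(records, indent=2)


csv_text = to_csv(rows)
json_text = to_json(rows)
print(csv_text)
print(json_text)
```

To write files instead of strings, the same writer can be pointed at `open("companies.csv", "w", newline="")`, where the file name is again only an example.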

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import numpy as np

In [2]:
# Set a browser-like User-Agent header - https://www.zenrows.com/blog/user-agent-web-scraping#avoid-blocking
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"}
webpage = requests.get('https://www.ambitionbox.com/list-of-companies?page=1', headers=headers).text

In [7]:
# 'html.parser' is Python's built-in parser. If no parser is named, bs4 picks the best one installed (lxml when available).
soup = BeautifulSoup(webpage, 'html.parser')
#print(soup.prettify())

In [8]:
rev_elements = soup.find_all('a', {'class': 'companyCardWrapper__ActionWrapper'})
rev_count_list = [element.text.strip() for element in rev_elements]
all_reviews_concatenated = ' '.join(rev_count_list)

# Display the list of extracted strings
rev_count_list



['71.9k Reviews',
 '854.1k Salaries',
 '5.9k Interviews',
 '626 Jobs',
 '11.4k Benefits',
 '78 Photos',
 '45.5k Reviews',
 '583.2k Salaries',
 '4.1k Interviews',
 '4k Jobs',
 '7k Benefits',
 '39 Photos',
 '41k Reviews',
 '559.4k Salaries',
 '3.5k Interviews',
 '656 Jobs',
 '5.8k Benefits',
 '62 Photos',
 '36.4k Reviews',
 '414.6k Salaries',
 '3.4k Interviews',
 '315 Jobs',
 '4.9k Benefits',
 '73 Photos',
 '33.3k Reviews',
 '131.6k Salaries',
 '1.5k Interviews',
 '247 Jobs',
 '3.2k Benefits',
 '30 Photos',
 '33k Reviews',
 '143.9k Salaries',
 '1.8k Interviews',
 '215 Jobs',
 '3.7k Benefits',
 '39 Photos',
 '31.2k Reviews',
 '463.3k Salaries',
 '4.6k Interviews',
 '914 Jobs',
 '5.1k Benefits',
 '84 Photos',
 '29.1k Reviews',
 '372.1k Salaries',
 '2.4k Interviews',
 '668 Jobs',
 '3.9k Benefits',
 '28 Photos',
 '27.9k Reviews',
 '290.1k Salaries',
 '2.4k Interviews',
 '493 Jobs',
 '4.1k Benefits',
 '35 Photos',
 '27.1k Reviews',
 '251.9k Salaries',
 '2.3k Interviews',
 '1k Jobs',
 '3.6k Be
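
The flat list above appears to repeat the same six labels for every company (Reviews, Salaries, Interviews, Jobs, Benefits, Photos). Under that assumption, a rough sketch of turning it into one numeric record per company could look like this. `parse_count` and `group_by_company` are hypothetical helper names, and the only suffix handled is `k` (thousand), since that is the only one visible in the output.

```python
def parse_count(text):
    """Convert strings such as '71.9k' or '626' to an integer (assumes k = thousand)."""
    text = text.strip().lower()
    if text.endswith("k"):
        return round(float(text[:-1]) * 1000)
    return int(text)


def group_by_company(items, fields_per_company=6):
    """Split a flat list of 'count label' strings into one dict per company.

    Assumes every company contributes exactly `fields_per_company` entries;
    a trailing incomplete group is ignored.
    """
    records = []
    for start in range(0, len(items) - fields_per_company + 1, fields_per_company):
        record = {}
        for item in items[start:start + fields_per_company]:
            count, label = item.split(" ", 1)
            record[label] = parse_count(count)
        records.append(record)
    return records


# First two companies from the output above
sample = [
    '71.9k Reviews', '854.1k Salaries', '5.9k Interviews', '626 Jobs', '11.4k Benefits', '78 Photos',
    '45.5k Reviews', '583.2k Salaries', '4.1k Interviews', '4k Jobs', '7k Benefits', '39 Photos',
]
records = group_by_company(sample)
print(records[0])
```

If a company card is missing one of the six links, the groups shift out of alignment, so extracting per card (rather than from a flat list) is the more robust route.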

In [23]:
from bs4 import BeautifulSoup

html_content = '''
<div class="companyCardWrapper__primaryInformation">
  <a target="_blank" href="https://www.ambitionbox.com/reviews/tcs-reviews" class="companyCardWrapper__ActionWrapper">
    <span class="companyCardWrapper__ActionCount">71.9k</span>
    <span class="companyCardWrapper__ActionTitle">Reviews</span>
  </a>
</div>
<div class="companyCardWrapper__primaryInformation">
  <a target="_blank" href="https://www.example.com" class="companyCardWrapper__ActionWrapper">
    <span class="companyCardWrapper__ActionCount">42.5k</span>
    <span class="companyCardWrapper__ActionTitle">Reviews</span>
  </a>
</div>
'''

soup = BeautifulSoup(html_content, 'html.parser')

# Find all elements with class 'companyCardWrapper__primaryInformation'
company_elements = soup.find_all('div', class_='companyCardWrapper__primaryInformation')

# Create an empty list to store review counts
review_counts = []

# Extract the review counts and append to the list using nested find
for company_element in company_elements:
    # Find the 'a' element with class 'companyCardWrapper__ActionWrapper' within the current 'div'
    action_wrapper = company_element.find('a', class_='companyCardWrapper__ActionWrapper')
    
    # If 'a' element is found, find the 'span' element with class 'companyCardWrapper__ActionCount' within it
    if action_wrapper:
        review_count_element = action_wrapper.find('span', class_='companyCardWrapper__ActionCount')
        
        # If 'span' element is found, extract the text and append to the list
        if review_count_element:
            review_count = review_count_element.text.strip()
            review_counts.append(review_count)

print("Review Counts List:", review_counts)



Review Counts List: ['71.9k', '42.5k']


In [27]:
# Code for scraping a single page
# Find all company cards (class 'companyCardWrapper__primaryInformation')
company_elements = soup.find_all('div', class_='companyCardWrapper__primaryInformation')

names = []
sectors = []
review_counts = []  # review counts are extracted separately in the cells below
emp_counts = []
company_type = []
rating = []

def text_or_none(element):
    # find() returns None when a tag is missing, and calling .text on None raises
    # AttributeError: 'NoneType' object has no attribute 'text', so guard every lookup
    return element.text.strip() if element is not None else None

for card in company_elements:
    names.append(text_or_none(card.find('h2')))
    # The interLinking span holds 'sector | employee count | company type'
    info = text_or_none(card.find('span', class_='companyCardWrapper__interLinking')) or ''
    parts = [part.strip() for part in info.split('|')] + [None] * 3  # pad missing fields
    sectors.append(parts[0] or None)
    emp_counts.append(parts[1])
    company_type.append(parts[2])
    rating.append(text_or_none(card.find('span', class_='companyCardWrapper__companyRatingValue')))

In [14]:
review_count

[]

In [29]:
from bs4 import BeautifulSoup


# Set a browser-like User-Agent header - https://www.zenrows.com/blog/user-agent-web-scraping#avoid-blocking
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"}
webpage = requests.get('https://www.ambitionbox.com/list-of-companies?page=1', headers=headers).text

soup = BeautifulSoup(webpage, 'html.parser')


# Find the cards carrying either class. A dict cannot hold two 'class' keys, so pass a list of class names to class_
company = soup.find_all(class_=['companyCardWrapper__tertiaryInformation', 'companyCardWrapper__primaryInformation'])

# Create an empty list to store review counts
review_counts = []

# Extract the first count in each card and append it to the list
for i in company:
    review_count_element = i.find('span', class_='companyCardWrapper__ActionCount')
    
    if review_count_element:
        review_count = review_count_element.text.strip()
        review_counts.append(review_count)

review_counts

In [30]:
from bs4 import BeautifulSoup

# Set a browser-like User-Agent header - https://www.zenrows.com/blog/user-agent-web-scraping#avoid-blocking
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"}
webpage = requests.get('https://www.ambitionbox.com/list-of-companies?page=1', headers=headers).text

soup = BeautifulSoup(webpage, 'html.parser')


# Use list comprehension to extract review counts
review = [element.text.strip() for element in soup.select('div.companyCardWrapper__primaryInformation span.companyCardWrapper__ActionCount')]

review


[]
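
The cells above all request `?page=1`. One possible way to extend the same logic to several pages is sketched below, assuming the site keeps using the `page` query parameter for later pages (not verified here). `page_url`, `scrape_pages`, `fake_fetch`, and `fake_parse` are hypothetical names. The fetch and parse steps are passed in as functions so the loop can be tried offline.

```python
import time
from urllib.parse import urlencode

BASE_URL = "https://www.ambitionbox.com/list-of-companies"


def page_url(page):
    """Build the listing URL for a given page number."""
    return f"{BASE_URL}?{urlencode({'page': page})}"


def scrape_pages(first, last, fetch, parse, delay_seconds=0.0):
    """Fetch pages first..last, parse each one, and concatenate the results.

    fetch(url) -> HTML string; parse(html) -> list of extracted items.
    """
    results = []
    for page in range(first, last + 1):
        html = fetch(page_url(page))
        results.extend(parse(html))
        if delay_seconds and page < last:
            time.sleep(delay_seconds)  # pause between requests to avoid hammering the site
    return results


# Offline stand-ins so the loop can be exercised without network access
def fake_fetch(url):
    return f"<html>{url}</html>"


def fake_parse(html):
    return [html.split("page=")[1].split("<")[0]]


pages = scrape_pages(1, 3, fake_fetch, fake_parse)
print(pages)  # ['1', '2', '3']
```

In the notebook, `fetch` could be something like `lambda url: requests.get(url, headers=headers).text`, and `parse` a function that wraps the BeautifulSoup extraction from the earlier cells.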