#### **S27.7 - WEB SCRAPING**

**What is Web Scraping?**
- Web scraping is an automated method for collecting large amounts of data from websites.
- The data usually arrives as unstructured HTML and is then converted into structured data (for example, a table) for use in other applications.
- It can be done through online services, through a site's API, or with custom scraping code.
- Some large websites provide APIs for structured data access, but not all sites do.
- Web scraping involves two parts: a crawler (a program that discovers pages) and a scraper (a tool that extracts data from those pages).
- The crawler browses the web, searching for specific data by following links.
- The scraper is designed based on project complexity to accurately extract required data.
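
The split between crawler and scraper described above can be illustrated with a minimal sketch. It assumes a hypothetical three-page site held in a dictionary (`PAGES`), so it runs without network access; the `PageParser` class and `crawl` function are illustrative names, not part of any library. The crawler part follows links, and the scraper part pulls out the `<h2>` text.

```python
from html.parser import HTMLParser

# Hypothetical three-page site held in memory so the sketch needs no network access.
PAGES = {
    "/": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<h2>Alpha Ltd</h2> <a href="/b">B</a>',
    "/b": '<h2>Beta Inc</h2> <a href="/">Home</a>',
}


class PageParser(HTMLParser):
    """Collects link targets (for the crawler) and <h2> text (for the scraper)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.titles = []
        self._in_h2 = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "h2":
            self._in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2 and data.strip():
            self.titles.append(data.strip())


def crawl(start):
    """Follow links breadth-first from start, scraping each page once."""
    seen, queue, found = {start}, [start], []
    while queue:
        path = queue.pop(0)
        parser = PageParser()
        parser.feed(PAGES[path])
        found.extend(parser.titles)   # scraper: extract the data
        for link in parser.links:     # crawler: discover new pages
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return found


print(crawl("/"))  # ['Alpha Ltd', 'Beta Inc']
```

A real crawler would fetch each page over HTTP instead of reading from `PAGES`, but the visit-once bookkeeping stays the same.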


**How Do Web Scrapers Work?**
- Web scrapers can extract either all data or specific data from websites based on user preferences.
- It's advisable to specify desired data to streamline the scraping process.
- For example, you might extract only the juicer models listed on an Amazon page and skip the customer reviews.
- The process starts with one or more URLs; the scraper loads the HTML code of each page (and possibly its CSS and JavaScript as well).
- The scraper then extracts the required data from the HTML code.
- The output format is chosen by the user, commonly an Excel spreadsheet, a CSV file, or a JSON file.
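
As a rough illustration of these steps, the sketch below extracts only the model and price from a small product listing, ignores the reviews, and produces both CSV and JSON output. The HTML snippet and its class names (`product`, `model`, `price`, `review`) are made up for this example and do not come from any real site.

```python
import csv
import io
import json

from bs4 import BeautifulSoup

# Hypothetical product listing; the class names are made up for this sketch.
HTML = """
<div class="product"><span class="model">Juicer A100</span><span class="price">49.99</span>
  <p class="review">Works well.</p></div>
<div class="product"><span class="model">Juicer B200</span><span class="price">79.50</span>
  <p class="review">Too loud.</p></div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Extract only the fields of interest; reviews are deliberately ignored.
rows = [
    {
        "model": card.find("span", class_="model").text.strip(),
        "price": float(card.find("span", class_="price").text),
    }
    for card in soup.find_all("div", class_="product")
]

# JSON output
as_json = json.dumps(rows, indent=2)

# CSV output, written to an in-memory buffer instead of a file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["model", "price"])
writer.writeheader()
writer.writerows(rows)
as_csv = buffer.getvalue()

print(as_csv)
```

To write to disk instead, the same `csv.DictWriter` can be given a file opened with `open("juicers.csv", "w", newline="")`; the file name here is only a placeholder.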

In [7]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import numpy as np

In [8]:
# Send a browser-like User-Agent so the site is less likely to block the request:
# https://www.zenrows.com/blog/user-agent-web-scraping#avoid-blocking
# The header value must be the agent string itself, without a "User-Agent Received:" prefix.
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"}
webpage = requests.get('https://www.ambitionbox.com/list-of-companies?page=1', headers=headers).text
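
It can help to check what a request will look like before sending it. The sketch below builds a request with `requests.Request` and prepares it without contacting any server; the URL and the shortened User-Agent string are placeholders chosen for illustration.

```python
import requests

# Placeholder URL and header value, chosen only for illustration.
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}
request = requests.Request(
    "GET",
    "https://example.com/list-of-companies",
    params={"page": 1},
    headers=headers,
)
prepared = request.prepare()  # nothing is sent over the network here

print(prepared.url)                    # https://example.com/list-of-companies?page=1
print(prepared.headers["User-Agent"])  # Mozilla/5.0 (X11; Linux x86_64)
```

When sending a real request, passing `timeout=10` to `requests.get` and calling `raise_for_status()` on the response are common precautions, so that a blocked or failed request surfaces as an error instead of being parsed as if it were the expected page.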

In [9]:
# With no parser argument, bs4 picks the best parser installed (lxml if available) and emits a warning.
# Naming 'html.parser' explicitly uses Python's built-in parser and needs no extra install.
soup = BeautifulSoup(webpage, 'html.parser')
# print(soup.prettify())

In [11]:
# Code for scraping a single page
company = soup.find_all('div', class_='companyCardWrapper__primaryInformation')

names = []
sectors = []
review_count = []
emp_counts = []
company_type = []
rating = []
location = []
age = []

def text_or_none(tag):
    # find() returns None when nothing matches; calling .text on None raises
    # AttributeError: 'NoneType' object has no attribute 'text'
    return tag.text.strip() if tag is not None else None

for i in company:
    # Parse the pipe-separated info line once instead of once per field
    info = text_or_none(i.find('span', class_='companyCardWrapper__interLinking')) or ''
    parts = [p.strip() for p in info.split('|')]
    parts += [None] * (4 - len(parts))  # pad so cards with fewer fields do not fail

    names.append(text_or_none(i.find('h2')))
    sectors.append(parts[0])
    emp_counts.append(parts[1])
    company_type.append(parts[2])
    age.append(parts[3])
    rating.append(text_or_none(i.find('span', class_='companyCardWrapper__companyRatingValue')))
    # location.append(parts[4])  # would need padding to 5 fields
    # review_count.append(i.find('span', class_='companyCardWrapper__ActionCount'))

d = {
    'Name_of_company': names,
    'Sector': sectors,
    'Employee_count': emp_counts,
    'Company_type': company_type,
    'Rating': rating,
    'Age_of_company': age,
}

df = pd.DataFrame(d)
df
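
BeautifulSoup's `find` returns `None` when no matching tag exists, so chaining `.text` onto the result raises `AttributeError` for any card that lacks the element. A minimal sketch of one way to guard against this is shown below; `safe_text` is a hypothetical helper, and the HTML and class names are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical cards: the second one has no rating element.
HTML = """
<div class="card"><h2>Acme</h2><span class="rating">4.1</span></div>
<div class="card"><h2>Globex</h2></div>
"""


def safe_text(parent, name, class_name=None, default=None):
    """Return the stripped text of the first matching tag, or default if absent."""
    tag = parent.find(name, class_=class_name) if class_name else parent.find(name)
    return tag.get_text(strip=True) if tag is not None else default


soup = BeautifulSoup(HTML, "html.parser")
cards = soup.find_all("div", class_="card")
names = [safe_text(card, "h2") for card in cards]
ratings = [safe_text(card, "span", "rating") for card in cards]

print(names)    # ['Acme', 'Globex']
print(ratings)  # ['4.1', None]
```

Keeping a `None` placeholder for the missing value also keeps every column the same length, which `pd.DataFrame` requires when it is built from a dictionary of lists.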

In [19]:
frames = []  # one DataFrame per page, combined after the loop
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"}

for j in range(1, 21):
    url = 'https://www.ambitionbox.com/list-of-companies?page={}'.format(j)
    webpage = requests.get(url, headers=headers).text
    soup = BeautifulSoup(webpage, 'html.parser')
    company = soup.find_all('div', class_='companyCardWrapper__primaryInformation')

    names = []
    sectors = []
    emp_counts = []
    company_type = []
    rating = []
    for i in company:
        name_tag = i.find('h2')
        info_tag = i.find('span', class_='companyCardWrapper__interLinking')
        rating_tag = i.find('span', class_='companyCardWrapper__companyRatingValue')

        # A card with no info span, or with fewer than three fields, can raise
        # IndexError: list index out of range
        # so split once and pad the result to three entries.
        parts = [p.strip() for p in info_tag.text.split('|')] if info_tag else []
        parts += [None] * (3 - len(parts))

        names.append(name_tag.text.strip() if name_tag else None)
        sectors.append(parts[0])
        emp_counts.append(parts[1])
        company_type.append(parts[2])
        rating.append(rating_tag.text.strip() if rating_tag else None)

    d = {
        'Name_of_company': names,
        'Sector': sectors,
        'Employee_count': emp_counts,
        'Company_type': company_type,
        'Rating': rating,
    }

    df = pd.DataFrame(d)
    frames.append(df)  # without this, each page overwrites the previous one

# Combine all pages into a single table
final = pd.concat(frames, ignore_index=True)
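
When looping over many pages, one malformed page can stop the whole run. The sketch below shows one possible pattern: collect rows page by page and record the pages that fail instead of aborting. `scrape_pages`, `fake_fetch`, and `fake_parse` are hypothetical names; the fake functions stand in for `requests.get(...).text` and the BeautifulSoup parsing step so that the example runs offline.

```python
def scrape_pages(fetch, parse, pages):
    """Collect rows from several pages, recording pages that fail instead of stopping."""
    rows, failed = [], []
    for page in pages:
        try:
            rows.extend(parse(fetch(page)))
        except (AttributeError, IndexError) as exc:
            failed.append((page, type(exc).__name__))
    return rows, failed


# Stand-ins for requests.get(...).text and the BeautifulSoup parsing step.
FAKE_SITE = {1: "Acme|BPO", 2: "Globex", 3: "Initech|IT"}


def fake_fetch(page):
    return FAKE_SITE[page]


def fake_parse(text):
    fields = text.split("|")
    return [{"name": fields[0], "sector": fields[1]}]  # IndexError if a field is missing


rows, failed = scrape_pages(fake_fetch, fake_parse, range(1, 4))
print(rows)    # [{'name': 'Acme', 'sector': 'BPO'}, {'name': 'Initech', 'sector': 'IT'}]
print(failed)  # [(2, 'IndexError')]
```

The collected rows can then be passed to `pd.DataFrame(rows)` in one step, and the `failed` list shows which pages need a closer look.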

In [20]:
df

| | Name_of_company | Sector | Employee_count | Company_type | Rating |
|---|---|---|---|---|---|
| 0 | Statestreet HCL Services | BPO | 5k-10k Employees | 12 years old | 3.2 |
| 1 | TVS Credit | NBFC | 10k-50k Employees | Public | 4.5 |
| 2 | Bajaj Electricals | Consumer Electronics & Appliances | 1k-5k Employees | Public | 4.0 |
| 3 | Tata AIG | Insurance | 5k-10k Employees | 24 years old | 4.2 |
| 4 | Escorts Kubota Limited | Automobile | 1k-5k Employees | Public | 4.1 |
| 5 | Tech Mahindra Business Services | BPO | 5k-10k Employees | Public | 3.5 |
| 6 | IGT Solutions | BPO | 5k-10k Employees | 26 years old | 3.3 |
| 7 | Simplex Infrastructures | Engineering & Construction | 1k-5k Employees | Public | 3.3 |
| 8 | Apollo Tyres | Auto Components | 10k-50k Employees | Public | 4.1 |
| 9 | AFCONS Infrastructure | Engineering & Construction | 5k-10k Employees | 65 years old | 4.3 |
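
In the output above, `Company_type` sometimes holds an age such as `12 years old`, which suggests that not every card lists the same pipe-separated fields, so reading them by position shifts values into the wrong column. One possible remedy is to classify each field by its content. The sketch below assumes that the sector always comes first, that employee counts end in `Employees`, and that ages look like `N years old`; the input strings are reconstructed from the table for illustration, and `classify_fields` is a hypothetical helper.

```python
import re


def classify_fields(info):
    """Assign each pipe-separated field to a column by its content, not its position."""
    record = {"sector": None, "employees": None, "company_type": None, "age": None}
    parts = [p.strip() for p in info.split("|") if p.strip()]
    if parts:
        record["sector"] = parts[0]  # assumption: the sector always comes first
    for part in parts[1:]:
        if part.endswith("Employees"):
            record["employees"] = part
        elif re.fullmatch(r"\d+ years old", part):
            record["age"] = part
        elif record["company_type"] is None:
            record["company_type"] = part  # anything else is treated as the company type
    return record


print(classify_fields("BPO | 5k-10k Employees | 12 years old"))
print(classify_fields("NBFC | 10k-50k Employees | Public"))
```

With this approach a card that omits the company type yields `None` in that column instead of borrowing the age.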
