# Web Scraping

Web scraping is a method of extracting (or scraping) relevant data from a website. It is also referred to as **web data extraction**. Web scraping generally has a specific goal in mind (retrieving a certain piece of information), as opposed to **web crawling**, in which all data is gathered.

Some examples:
- Gathering job information from [rozee.pk](https://rozee.pk)
- Gathering car information from PakWheels.
- Extracting relevant text data for Natural Language Processing (NLP), for learning purposes.


## Challenges of Web Scraping
- Web scraping is structure dependent. Where the relevant information lives depends on how a particular website is designed, so the search process cannot be generalized to all websites.
- The scraper can become outdated if the website is modified and its structure changes.

# Methodology

Before we dive into details of Python code and relevant libraries, let us just understand the process of web scraping.

1. We start by choosing the website we want to scrape.
2. We download the particular web page (using the *requests* library).
3. We parse the downloaded web page and store it as an HTML document (other formats are available too). This way, we can access individual HTML tags.
4. We spend some time understanding the structure of the HTML. This way, we can identify what information is available, how it is stored and how to access it.
5. Once the location is identified, we write Python code to access the relevant data.
6. We can store that data as a DataFrame (other options are available too) and perform data analysis on it.

In [24]:
#required libraries
import requests
from bs4 import BeautifulSoup

# Getting the Web Page
To get a local copy of the webpage, we use the *requests* library. Its `get()` function takes the URL of the webpage and returns a `Response` object; the raw body of the page is available through its *content* attribute.



In [25]:
#request the content from website
page = requests.get("https://www.pakwheels.com/used-cars/search/-/")
page

<Response [200]>

If the `get()` operation is successful, the *status_code* attribute is typically **200**. Other codes signal problems, for example **404** when the page is not found or **403** when access is refused.
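Status codes are grouped into classes by their first digit. The sketch below uses only the standard library's `http.HTTPStatus` to label a few codes; the helper name `describe_status` is made up for illustration and is not part of *requests*.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short human-readable label for an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase  # e.g. "OK", "Not Found"
    except ValueError:
        phrase = "Unknown"  # code is not in the standard registry
    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "other"
    return f"{code} {phrase} ({kind})"


for code in (200, 403, 404, 503):
    print(describe_status(code))
```

With a real response, `page.ok` is `True` for codes below 400, and `page.raise_for_status()` raises an exception for 4xx and 5xx codes.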

Since scraping means sending requests to a website, too many requests may slow down its server. Many websites therefore block requests that appear to come from a robot, or publish a policy on automated access.
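Such a policy is often published in a `robots.txt` file. As a minimal sketch, the standard library's `urllib.robotparser` can check whether a URL may be fetched. The rules, the `example.com` URLs and the user-agent name below are all hypothetical and do not describe any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, NOT the rules of any real website.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())  # parse rules from text instead of the network

AGENT = "my-course-scraper"  # made-up user-agent name
allowed = rp.can_fetch(AGENT, "https://example.com/used-cars/")
blocked = rp.can_fetch(AGENT, "https://example.com/private/data")
delay = rp.crawl_delay(AGENT)  # seconds to wait between requests, if given

print(allowed, blocked, delay)
```

For a live site, `rp.set_url(...)` followed by `rp.read()` downloads the real file instead of parsing a string.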

In [26]:
#see response status
page.status_code

200

We can view the raw content of the page using the *content* attribute.

In [27]:
page.content

b'<div id="safety_precautions" class="modal">\n  <div class="modal-dialog">\n    <div class="modal-content">\n      <div class="modal-header noborder pb0">\n        <button type="button" class="close" data-dismiss="modal" aria-hidden="true">&times;</button>\n      </div>\n      <div class="modal-body nomargin" style="padding: 0px 40px 10px;">\n        <div class="tlc mb25">\n          <img alt="Tips-for-safe-deal" height="70px" src="https://wsa4.pakwheels.com/assets/tips-for-safe-deal-00805d1034ee7600820049c852bd6163.svg" />\n          <div class="mb0 fs18 generic-basic mt10 lhm fwm">Tips for Safe Deal</div>\n        </div>\n        <div class="d-flex align-center mb15">\n          <img alt="Tip-for-safe-deal-1" class="mr20" height="36px" src="https://wsa1.pakwheels.com/assets/tip-for-safe-deal-1-a3f472bbbc5249edca9fd01f449f98c3.svg" />\n          <p class="fs16 nomargin">Never make payments in advance.</p>\n        </div>\n        <div class="d-flex align-center mb15">\n          <img

Not a pretty picture, huh.

# Parsing the Webpage
Once we have a local copy of the webpage, the next step is to parse it for scraping. For this purpose, we use a Python library called **Beautiful Soup**.

## Beautiful Soup
Beautiful Soup is a Python library for extracting structured data from a webpage. It parses HTML and XML files and provides helpers for navigating and searching the result.

The parsed result is much easier to work with than the raw bytes returned by the `requests` library.

There are multiple parsers available (as discussed in the next section); for our exercise, we will stick to Python's built-in `html.parser`.

In [28]:
# create beautifulSoup object
soup = BeautifulSoup(page.content, 'html.parser')


## Different Available parsers
<table class="docutils align-default">
<colgroup>
<col style="width: 18%">
<col style="width: 35%">
<col style="width: 26%">
<col style="width: 21%">
</colgroup>
<tbody>
<tr class="row-odd"><td><p>Parser</p></td>
<td><p>Typical usage</p></td>
<td><p>Advantages</p></td>
<td><p>Disadvantages</p></td>
</tr>
<tr class="row-even"><td><p>Python’s html.parser</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">BeautifulSoup(markup,</span> <span class="pre">"html.parser")</span></code></p></td>
<td><ul class="simple">
<li><p>Batteries included</p></li>
<li><p>Decent speed</p></li>
<li><p>Lenient (As of Python 3.2)</p></li>
</ul>
</td>
<td><ul class="simple">
<li><p>Not as fast as lxml,
less lenient than
html5lib.</p></li>
</ul>
</td>
</tr>
<tr class="row-odd"><td><p>lxml’s HTML parser</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">BeautifulSoup(markup,</span> <span class="pre">"lxml")</span></code></p></td>
<td><ul class="simple">
<li><p>Very fast</p></li>
<li><p>Lenient</p></li>
</ul>
</td>
<td><ul class="simple">
<li><p>External C dependency</p></li>
</ul>
</td>
</tr>
<tr class="row-even"><td><p>lxml’s XML parser</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">BeautifulSoup(markup,</span> <span class="pre">"lxml-xml")</span></code>
<code class="docutils literal notranslate"><span class="pre">BeautifulSoup(markup,</span> <span class="pre">"xml")</span></code></p></td>
<td><ul class="simple">
<li><p>Very fast</p></li>
<li><p>The only currently supported
XML parser</p></li>
</ul>
</td>
<td><ul class="simple">
<li><p>External C dependency</p></li>
</ul>
</td>
</tr>
<tr class="row-odd"><td><p>html5lib</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">BeautifulSoup(markup,</span> <span class="pre">"html5lib")</span></code></p></td>
<td><ul class="simple">
<li><p>Extremely lenient</p></li>
<li><p>Parses pages the same way a
web browser does</p></li>
<li><p>Creates valid HTML5</p></li>
</ul>
</td>
<td><ul class="simple">
<li><p>Very slow</p></li>
<li><p>External Python
dependency</p></li>
</ul>
</td>
</tr>
</tbody>
</table>

<center><a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser">Source</a></center>

For more detail on the differences between parsers, click [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers).
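The differences show up mostly on invalid markup. The sketch below feeds the same broken snippet to each parser; `lxml` and `html5lib` are optional installs, so the loop assumes they may be missing and skips them in that case.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<a></p>"  # invalid: closes a <p> that was never opened

for parser in ("html.parser", "lxml", "html5lib"):
    try:
        print(f"{parser:12} -> {BeautifulSoup(broken, parser)}")
    except FeatureNotFound:
        # lxml and html5lib are third-party packages and may not be installed
        print(f"{parser:12} -> not installed")

# The built-in parser is always available.
builtin_result = str(BeautifulSoup(broken, "html.parser"))
```

Each parser repairs the snippet in its own way, which is why the same scraping code can behave differently after switching parsers.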

### `prettify()`
This is a very useful function that returns the parsed HTML as a nicely formatted string, with each tag on its own line and indented according to its nesting depth.

In [29]:
#View formatted content
print(soup.prettify())

<div class="modal" id="safety_precautions">
 <div class="modal-dialog">
  <div class="modal-content">
   <div class="modal-header noborder pb0">
    <button aria-hidden="true" class="close" data-dismiss="modal" type="button">
     ×
    </button>
   </div>
   <div class="modal-body nomargin" style="padding: 0px 40px 10px;">
    <div class="tlc mb25">
     <img alt="Tips-for-safe-deal" height="70px" src="https://wsa4.pakwheels.com/assets/tips-for-safe-deal-00805d1034ee7600820049c852bd6163.svg"/>
     <div class="mb0 fs18 generic-basic mt10 lhm fwm">
      Tips for Safe Deal
     </div>
    </div>
    <div class="d-flex align-center mb15">
     <img alt="Tip-for-safe-deal-1" class="mr20" height="36px" src="https://wsa1.pakwheels.com/assets/tip-for-safe-deal-1-a3f472bbbc5249edca9fd01f449f98c3.svg"/>
     <p class="fs16 nomargin">
      Never make payments in advance.
     </p>
    </div>
    <div class="d-flex align-center mb15">
     <img alt="Tip-for-safe-deal-2" class="mr20" height="36

Now let's dig in to find the relevant data. Before that, we need to read the HTML to see what we are looking for.

For this, we go to our target webpage and inspect its content.

- Open the webpage in your browser.
- Right-click on the relevant information on the webpage and select *Inspect*. This opens the HTML of the webpage in your browser's developer tools.
- Identify the tag in which the relevant information is stored.

- The most important element in HTML is the *tag*. Each tag can have multiple attributes; *class* and *id* are among the most common and are typically used to locate a tag (an *id* is meant to be unique within a page, while a *class* may be shared by many tags). The relevant data is generally the text between a start tag and an end tag, for example between \<a\> and \</a\>.
- To access the text between tags, the `.text` attribute or the functions `.get_text()` / `.getText()` can be used. All behave similarly, but the function versions give more control; for example, `get_text(strip=True)` removes surrounding whitespace.
- If the relevant data is the value of an attribute, the `get()` function returns that value.
- To search by tag, class or id, two functions can be used (there are others as well):
    - `find()`: returns the first matching tag, or `None` if there is no match.
    - `find_all()`: returns all matching tags as a list.
    

Typical tags are:
  - \<div\>
  - \<a\>
  - \<h3\>
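The search and access functions described above can be tried on a small snippet first. The markup below is invented for illustration and is not taken from the target website; `soup_demo` is a separate object so the notebook's `soup` is left untouched.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration only.
html = """
<div id="listing" class="card">
  <a class="title" href="/item/1"><h3> First item </h3></a>
  <a class="title" href="/item/2"><h3>Second item</h3></a>
</div>
"""
soup_demo = BeautifulSoup(html, "html.parser")

first_link = soup_demo.find("a", class_="title")     # first match only
all_links = soup_demo.find_all("a", class_="title")  # every match, as a list
by_id = soup_demo.find(id="listing")                 # search by id attribute

title = first_link.get_text(strip=True)   # text between the tags, trimmed
href = first_link.get("href")             # value of an attribute
hrefs = [a.get("href") for a in all_links]
classes = by_id.get("class")              # class is multi-valued, so a list

print(title, href, hrefs, classes)
```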

In [30]:
# single search
soup.find('div')

<div class="modal" id="safety_precautions">
<div class="modal-dialog">
<div class="modal-content">
<div class="modal-header noborder pb0">
<button aria-hidden="true" class="close" data-dismiss="modal" type="button">×</button>
</div>
<div class="modal-body nomargin" style="padding: 0px 40px 10px;">
<div class="tlc mb25">
<img alt="Tips-for-safe-deal" height="70px" src="https://wsa4.pakwheels.com/assets/tips-for-safe-deal-00805d1034ee7600820049c852bd6163.svg"/>
<div class="mb0 fs18 generic-basic mt10 lhm fwm">Tips for Safe Deal</div>
</div>
<div class="d-flex align-center mb15">
<img alt="Tip-for-safe-deal-1" class="mr20" height="36px" src="https://wsa1.pakwheels.com/assets/tip-for-safe-deal-1-a3f472bbbc5249edca9fd01f449f98c3.svg"/>
<p class="fs16 nomargin">Never make payments in advance.</p>
</div>
<div class="d-flex align-center mb15">
<img alt="Tip-for-safe-deal-2" class="mr20" height="36px" src="https://wsa3.pakwheels.com/assets/tip-for-safe-deal-2-b8b8dded80b193b4de603dd617fd42db.sv

In [31]:
soup.find_all('div')

#len(soup.find_all('div'))


[<div class="modal" id="safety_precautions">
 <div class="modal-dialog">
 <div class="modal-content">
 <div class="modal-header noborder pb0">
 <button aria-hidden="true" class="close" data-dismiss="modal" type="button">×</button>
 </div>
 <div class="modal-body nomargin" style="padding: 0px 40px 10px;">
 <div class="tlc mb25">
 <img alt="Tips-for-safe-deal" height="70px" src="https://wsa4.pakwheels.com/assets/tips-for-safe-deal-00805d1034ee7600820049c852bd6163.svg"/>
 <div class="mb0 fs18 generic-basic mt10 lhm fwm">Tips for Safe Deal</div>
 </div>
 <div class="d-flex align-center mb15">
 <img alt="Tip-for-safe-deal-1" class="mr20" height="36px" src="https://wsa1.pakwheels.com/assets/tip-for-safe-deal-1-a3f472bbbc5249edca9fd01f449f98c3.svg"/>
 <p class="fs16 nomargin">Never make payments in advance.</p>
 </div>
 <div class="d-flex align-center mb15">
 <img alt="Tip-for-safe-deal-2" class="mr20" height="36px" src="https://wsa3.pakwheels.com/assets/tip-for-safe-deal-2-b8b8dded80b193b4de

In [32]:
#tag with class 

soup.find('div', class_='col-md-9 grid-style')
#soup.find('li',class_="classified-listing featured-listing")


<div class="col-md-9 grid-style">
<div class="">
<div class="search-title-row">
<div class="search-title">
<div class="right">
<div class="price-details generic-dark-grey">

                        PKR 20 <span>lacs</span>
</div>
</div>
<a class="car-name ad-detail-path" current-index="0" href="/used-cars/toyota-corolla-1994-for-sale-in-faisalabad-6916782" target="_blank" title="Toyota Corolla  1994 ">
<h3 style="white-space: normal;">Toyota Corolla  1994  for Sale</h3>
</a>
</div>
</div>
</div>
<div class="row">
<div class="col-md-12 grid-date">
<ul class="list-unstyled search-vehicle-info fs13">
<li>
                    Faisalabad
                  </li>
</ul>
<ul class="list-unstyled search-vehicle-info-2 fs13">
<li>1994</li>
<li>75,071 km</li>
<li>Petrol</li>
<li>1500 cc</li>
<li>Automatic</li>
</ul>
</div>
</div>
<div class="search-bottom clearfix">
<div class="pull-left dated">Updated less than a minute ago</div>
<div class="pull-right">
<button class="btn btn-success phone_numbe

In [33]:
car=soup.find('div', class_='col-md-9 grid-style')
print(car.h3.get_text(strip=True))
print(car.a.get('href'))
print(car.find(class_='price-details generic-dark-grey').get_text(strip=True))
print(car.find(class_='list-unstyled search-vehicle-info fs13').get_text(strip=True))
print(car.find(class_='pull-left dated').get_text())




Toyota Corolla  1994  for Sale
/used-cars/toyota-corolla-1994-for-sale-in-faisalabad-6916782
PKR 20lacs
Faisalabad
Updated less than a minute ago
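One caveat with a multi-class string such as `class_='col-md-9 grid-style'`: Beautiful Soup compares it against the exact value of the class attribute, so the match fails if the site reorders the classes. A CSS selector through `select_one()` does not depend on the order. The snippet below is hypothetical markup with the two classes swapped.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same two classes, but in the opposite order.
html = '<div class="grid-style col-md-9"><h3>Example car</h3></div>'
demo = BeautifulSoup(html, "html.parser")

exact = demo.find("div", class_="col-md-9 grid-style")  # exact-string match
css = demo.select_one("div.col-md-9.grid-style")        # order-independent

print(exact)               # no match because the order differs
print(css.h3.get_text())   # the selector still finds the div
```

`select()` is the counterpart that returns all matches, similar to `find_all()`.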


### Getting List
What if we need to access a list of items (denoted here by the `li` tag)?

In [34]:
x=car.find(class_='list-unstyled search-vehicle-info-2 fs13')
print(x)
list1 = x.find_all('li')
for i in list1:
    print(i.get_text())


<ul class="list-unstyled search-vehicle-info-2 fs13">
<li>1994</li>
<li>75,071 km</li>
<li>Petrol</li>
<li>1500 cc</li>
<li>Automatic</li>
</ul>
1994
75,071 km
Petrol
1500 cc
Automatic
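The five values are easier to use once they are labelled. The sketch below pairs them with field names; the names and their order are assumptions based on this single listing, and other listings may not follow the same layout.

```python
from bs4 import BeautifulSoup

# Markup copied from the output above.
html = """<ul class="list-unstyled search-vehicle-info-2 fs13">
<li>1994</li>
<li>75,071 km</li>
<li>Petrol</li>
<li>1500 cc</li>
<li>Automatic</li>
</ul>"""
ul = BeautifulSoup(html, "html.parser").ul

# Field names are assumptions based on the order seen in this one listing.
FIELDS = ["year", "mileage", "fuel", "engine", "transmission"]
values = [li.get_text(strip=True) for li in ul.find_all("li")]
specs = dict(zip(FIELDS, values))

# Drop the thousands separator and the unit to get a plain number.
specs["mileage_km"] = int(specs["mileage"].replace(",", "").replace("km", ""))

print(specs)
```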


- The `contents` attribute gives a tag's direct children as a list. Strings such as `'\n'` between tags count as children too.
- The `children` attribute provides an iterator over the same direct children.
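Because whitespace between tags shows up as string children, it is often useful to keep only the tags. A minimal sketch on a made-up two-item list:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet with newlines between the items.
html = "<ul>\n<li>a</li>\n<li>b</li>\n</ul>"
ul = BeautifulSoup(html, "html.parser").ul

print(ul.contents)  # the newline strings appear alongside the <li> tags

# Keep only real tags, skipping the whitespace strings.
tag_children = [child for child in ul.children if isinstance(child, Tag)]

# Alternative: search direct children only.
direct_items = ul.find_all("li", recursive=False)

print(len(ul.contents), len(tag_children), len(direct_items))
```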

In [35]:
#children
print(car.contents)
for child in car.children:
    print(child)



['\n', <div class="">
<div class="search-title-row">
<div class="search-title">
<div class="right">
<div class="price-details generic-dark-grey">

                        PKR 20 <span>lacs</span>
</div>
</div>
<a class="car-name ad-detail-path" current-index="0" href="/used-cars/toyota-corolla-1994-for-sale-in-faisalabad-6916782" target="_blank" title="Toyota Corolla  1994 ">
<h3 style="white-space: normal;">Toyota Corolla  1994  for Sale</h3>
</a>
</div>
</div>
</div>, '\n', <div class="row">
<div class="col-md-12 grid-date">
<ul class="list-unstyled search-vehicle-info fs13">
<li>
                    Faisalabad
                  </li>
</ul>
<ul class="list-unstyled search-vehicle-info-2 fs13">
<li>1994</li>
<li>75,071 km</li>
<li>Petrol</li>
<li>1500 cc</li>
<li>Automatic</li>
</ul>
</div>
</div>, '\n', <div class="search-bottom clearfix">
<div class="pull-left dated">Updated less than a minute ago</div>
<div class="pull-right">
<button class="btn btn-success phone_number_btn pull-ri

Let us gather all required data for each car and store it in a DataFrame object for later use.

In [36]:
len(soup.find_all('div', class_='col-md-9 grid-style'))

31

In [37]:
l = []
for car in soup.find_all('div', class_='col-md-9 grid-style'):
    # name, price, city, last-updated text, relative link
    o = [car.h3.get_text(strip=True),
         car.find(class_='price-details generic-dark-grey').get_text(strip=True),
         car.find(class_='list-unstyled search-vehicle-info fs13').get_text(strip=True),
         car.find(class_='pull-left dated').get_text(),
         car.a.get('href')]
    l.append(o)
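A loop like this assumes every card contains every element. If one is missing, `find()` returns `None` and the chained `.get_text()` call raises an `AttributeError`. One possible guard is sketched below; the helper name `text_or_none` and the sample card are hypothetical.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, *args, **kwargs):
    """Return the stripped text of the first match inside parent, or None."""
    node = parent.find(*args, **kwargs)
    return node.get_text(strip=True) if node is not None else None


# Hypothetical card that has a title but no price element.
card = BeautifulSoup("<div><h3>Some car</h3></div>", "html.parser").div

name = text_or_none(card, "h3")
price = text_or_none(card, class_="price-details")

print(name, price)
```

Missing values then end up as `None` in the collected rows instead of stopping the whole run.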
    

In [38]:
l
#len(l)

[['Toyota Corolla  1994  for Sale',
  'PKR 20lacs',
  'Faisalabad',
  'Updated less than a minute ago',
  '/used-cars/toyota-corolla-1994-for-sale-in-faisalabad-6916782'],
 ['Nissan Note  2018 1.2E for Sale',
  'PKR 47lacs',
  'Faisalabad',
  'Updated 2 minutes ago',
  '/used-cars/nissan-note-2018-for-sale-in-faisalabad-6082219'],
 ['KIA Sportage  2021 FWD for Sale',
  'PKR 73.5lacs',
  'Sialkot',
  'Updated 3 minutes ago',
  '/used-cars/kia-sportage-2021-for-sale-in-sialkot-7291543'],
 ['Honda N Wgn  2015 G for Sale',
  'PKR 21lacs',
  'Karachi',
  'Updated 4 minutes ago',
  '/used-cars/honda-n-wgn-2015-for-sale-in-karachi-7362668'],
 ['Toyota Yaris  2023 ATIV X CVT 1.5 for Sale',
  'PKR 57.75lacs',
  'Rawalpindi',
  'Updated 6 minutes ago',
  '/used-cars/toyota-yaris-2023-for-sale-in-rawalpindi-7362952'],
 ['Toyota Yaris  2023 ATIV X CVT 1.5 for Sale',
  'PKR 57.75lacs',
  'Rawalpindi',
  'Updated 6 minutes ago',
  '/used-cars/toyota-yaris-2023-for-sale-in-rawalpindi-7362949'],
 ['Da
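The links collected here are relative paths. To turn one into a full address, it can be joined with the URL of the page that was requested, for example with the standard library's `urljoin`:

```python
from urllib.parse import urljoin

# URL of the page requested earlier in this notebook.
BASE_URL = "https://www.pakwheels.com/used-cars/search/-/"
# First relative link from the scraped output.
href = "/used-cars/toyota-corolla-1994-for-sale-in-faisalabad-6916782"

full_url = urljoin(BASE_URL, href)  # a leading "/" replaces the whole path
print(full_url)
```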

Storing it in a DataFrame:

In [39]:
import pandas as pd

df = pd.DataFrame(l,columns=['Name','Price','Area','Date','Link'])

In [40]:
display(df)

| | Name | Price | Area | Date | Link |
|---|---|---|---|---|---|
| 0 | Toyota Corolla 1994 for Sale | PKR 20lacs | Faisalabad | Updated less than a minute ago | /used-cars/toyota-corolla-1994-for-sale-in-fai... |
| 1 | Nissan Note 2018 1.2E for Sale | PKR 47lacs | Faisalabad | Updated 2 minutes ago | /used-cars/nissan-note-2018-for-sale-in-faisal... |
| 2 | KIA Sportage 2021 FWD for Sale | PKR 73.5lacs | Sialkot | Updated 3 minutes ago | /used-cars/kia-sportage-2021-for-sale-in-sialk... |
| 3 | Honda N Wgn 2015 G for Sale | PKR 21lacs | Karachi | Updated 4 minutes ago | /used-cars/honda-n-wgn-2015-for-sale-in-karach... |
| 4 | Toyota Yaris 2023 ATIV X CVT 1.5 for Sale | PKR 57.75lacs | Rawalpindi | Updated 6 minutes ago | /used-cars/toyota-yaris-2023-for-sale-in-rawal... |
| 5 | Toyota Yaris 2023 ATIV X CVT 1.5 for Sale | PKR 57.75lacs | Rawalpindi | Updated 6 minutes ago | /used-cars/toyota-yaris-2023-for-sale-in-rawal... |
| 6 | Daihatsu Mira 2015 X Limited Smart Drive Pack... | PKR 25lacs | Islamabad | Updated 8 minutes ago | /used-cars/daihatsu-mira-2015-for-sale-in-isla... |
| 7 | Hyundai Sonata 2023 2.5 for Sale | PKR 1.25crore | Lahore | Updated 8 minutes ago | /used-cars/hyundai-sonata-2023-for-sale-in-lah... |
| 8 | Toyota Land Cruiser 2018 ZX for Sale | PKR 6.35crore | Lahore | Updated 9 minutes ago | /used-cars/toyota-land-cruiser-2018-for-sale-i... |
| 9 | BMW X1 2018 for Sale | PKR 1.18crore | Lahore | Updated 9 minutes ago | /used-cars/bmw-x1-2018-for-sale-in-lahore-7255156 |


In [41]:
df.describe()

| | Name | Price | Area | Date | Link |
|---|---|---|---|---|---|
| count | 31 | 31 | 31 | 31 | 31 |
| unique | 30 | 28 | 9 | 11 | 31 |
| top | Toyota Yaris 2023 ATIV X CVT 1.5 for Sale | PKR 57.75lacs | Lahore | Updated 2 minutes ago | /used-cars/toyota-corolla-1994-for-sale-in-fai... |
| freq | 2 | 2 | 9 | 7 | 1 |


In [42]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31 entries, 0 to 30
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    31 non-null     object
 1   Price   31 non-null     object
 2   Area    31 non-null     object
 3   Date    31 non-null     object
 4   Link    31 non-null     object
dtypes: object(5)
memory usage: 1.3+ KB
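The `info()` output shows that every column, including *Price*, is stored as text (`object`). For numeric analysis the price strings need converting. The sketch below is one possible approach, assuming the only units are "lacs" (1 lac = 100,000) and "crore" (1 crore = 10,000,000), as seen in the scraped rows; the function name `parse_price` is made up for illustration.

```python
import re
from decimal import Decimal

# South Asian units: 1 lac (lakh) = 100,000 and 1 crore = 10,000,000.
MULTIPLIERS = {"lac": 100_000, "lacs": 100_000, "crore": 10_000_000}
PRICE_RE = re.compile(r"PKR\s*([\d.,]+)\s*(lacs?|crore)", re.IGNORECASE)


def parse_price(text):
    """Convert e.g. 'PKR 73.5lacs' to an integer number of rupees, or None."""
    match = PRICE_RE.search(text)
    if match is None:
        return None  # unexpected format, left for manual inspection
    # Decimal avoids floating-point rounding on values such as 6.35.
    amount = Decimal(match.group(1).replace(",", ""))
    return int(amount * MULTIPLIERS[match.group(2).lower()])


for sample in ("PKR 20lacs", "PKR 73.5lacs", "PKR 6.35crore"):
    print(sample, "->", parse_price(sample))
```

Applied to the DataFrame, this could look like `df['PriceRupees'] = df['Price'].map(parse_price)`, where `PriceRupees` is a hypothetical column name.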


In [43]:

df.to_csv('DataDC.csv')
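By default, `to_csv()` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. Passing `index=False` avoids this. The sketch below shows the difference on a small made-up DataFrame, using in-memory buffers instead of files.

```python
import io

import pandas as pd

# Small made-up DataFrame standing in for the scraped data.
demo = pd.DataFrame({"Name": ["Car A", "Car B"],
                     "Price": ["PKR 20lacs", "PKR 47lacs"]})

with_index = io.StringIO()
demo.to_csv(with_index)  # default: index written as the first column
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

without_index = io.StringIO()
demo.to_csv(without_index, index=False)  # index left out
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```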