# WebScraping

Web scraping is a method of extracting (or scraping) relevant data from a website. It is also referred to as **web data extraction**. Web scraping generally has a specific goal in mind (retrieving a particular piece of information), as opposed to **web crawling**, in which all available data is gathered.

Some examples:
- Gathering job information from [rozee.pk](https://rozee.pk)
- Gathering car information from PakWheels
- Extracting relevant data for Natural Language Processing (NLP), for learning purposes


## Challenges of WebScraping
- Web scraping is structure dependent. Where the relevant information lives depends on how each website is designed, so the search process cannot be generalized to all types of websites.
- The scraper can become outdated if the website is modified and its structure changes.

# Methodology

Before we dive into details of Python code and relevant libraries, let us just understand the process of web scraping.

1. We start by choosing the website we want to scrape.
2. We download the particular web page (using the *requests* library).
3. We parse the downloaded web page as HTML (other formats are available too). This way, we can access individual HTML tags.
4. We spend some time understanding the structure of the HTML. This way we can identify what information is available, how it is stored and how to access it.
5. Once the location is identified, we write Python code to access the relevant data.
6. Finally, we store that data as a dataframe (other options are available too) and perform data analysis on it.

In [1]:
#required libraries
import requests
from bs4 import BeautifulSoup

# Getting the Web Page
To get a local copy of the webpage, we use Python's *requests* library. Its `get()` function takes the URL of the webpage and returns a response object; the downloaded page is available through that object's *content* attribute.



In [2]:
#request the content from website
page = requests.get("https://washington.craigslist.org/search/cta")
page

<Response [200]>

If the `get()` operation is successful, the *status_code* attribute is **200**. Other values indicate a problem, for example **404** when the page is not found.

Since scraping means sending requests to a website, too many requests may slow down its server. Therefore, many websites block requests that appear to come from a robot (or have some policy in place).
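One common form such a policy takes is a `robots.txt` file. The sketch below shows how Python's standard `urllib.robotparser` can check whether a URL is allowed. The rules and the `example.com` URLs are made up for illustration and are not taken from any real website.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a live download.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch(user_agent, url) reports whether the rules allow the request.
allowed = parser.can_fetch("*", "https://example.com/search/cta")
blocked = parser.can_fetch("*", "https://example.com/private/data")
print(allowed, blocked)
```

For a real site, the same parser can load the file with `set_url()` followed by `read()` before calling `can_fetch()`.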

In [3]:
#see response status
page.status_code

200
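Status codes are not limited to 200 and 404. As a rough sketch, the standard `http.HTTPStatus` enum can turn a code into a readable label; `describe_status` is a hypothetical helper written for this example, not part of *requests*.

```python
from http import HTTPStatus

def describe_status(code: int) -> str:
    """Return a short label for an HTTP status code (hypothetical helper)."""
    phrase = HTTPStatus(code).phrase  # raises ValueError for unknown codes
    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "informational"
    return f"{code} {phrase} ({kind})"

for code in (200, 403, 404, 503):
    print(describe_status(code))
```

In a scraper, a check like this would typically run on `page.status_code` before any parsing is attempted.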

We can view the raw content of the page using the *content* attribute.

In [4]:
page.content

b'<!DOCTYPE html>\n<html>\n<head>\n    \n\t<meta charset="UTF-8">\n\t<meta http-equiv="X-UA-Compatible" content="IE=Edge">\n\t<meta name="viewport" content="width=device-width,initial-scale=1">\n\t<meta property="og:site_name" content="craigslist">\n\t<meta name="twitter:card" content="preview">\n\t<meta property="og:title" content="washington, DC cars &amp; trucks - craigslist">\n\t<meta name="description" content="washington, DC cars &amp; trucks - craigslist">\n\t<meta property="og:description" content="washington, DC cars &amp; trucks - craigslist">\n\t<meta property="og:url" content="https://washingtondc.craigslist.org/search/cta">\n\t<meta name="smartbanner:api" content="true">\n\t<meta name="smartbanner:title" content="the craigslist app">\n\t<meta name="smartbanner:author" content="what&#39;s old is new">\n\t<meta name="smartbanner:icon-apple" content="/images/app_icon.png">\n\t<meta name="smartbanner:icon-google" content="/images/app_icon.png">\n\t<meta name="smartbanner:butto

Not a pretty picture, huh.

# Parsing the Webpage
Once we have a local copy of the webpage, the next step is to parse it for scraping. For this purpose, we use a Python library called **Beautiful Soup**.

## Beautiful Soup
Beautiful Soup is a Python library used to extract structured data from a webpage. It is a helper module for interacting with HTML, and it lets us parse data from HTML and XML files.

This way, the page is more readable than the raw bytes returned by the `requests` library.

There are multiple parsers available (listed in the next section); for our exercise, we will stick to Python's built-in `html.parser`.

In [5]:
# create beautifulSoup object
soup = BeautifulSoup(page.content, 'html.parser')
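To see what the resulting object offers without depending on a live page, here is a minimal sketch that parses a tiny hand-written HTML string. The string and the name `demo_soup` are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page standing in for page.content.
html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

demo_soup = BeautifulSoup(html, "html.parser")

print(type(demo_soup).__name__)   # BeautifulSoup
print(demo_soup.title.string)     # Demo page
print(demo_soup.p.get_text())     # Hello
```

Tags can be reached as attributes of the soup object (`demo_soup.title`, `demo_soup.p`), which returns the first tag with that name.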


## Different Available parsers
<table class="docutils align-default">
<colgroup>
<col style="width: 18%">
<col style="width: 35%">
<col style="width: 26%">
<col style="width: 21%">
</colgroup>
<tbody>
<tr class="row-odd"><td><p>Parser</p></td>
<td><p>Typical usage</p></td>
<td><p>Advantages</p></td>
<td><p>Disadvantages</p></td>
</tr>
<tr class="row-even"><td><p>Python’s html.parser</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">BeautifulSoup(markup,</span> <span class="pre">"html.parser")</span></code></p></td>
<td><ul class="simple">
<li><p>Batteries included</p></li>
<li><p>Decent speed</p></li>
<li><p>Lenient (As of Python 3.2)</p></li>
</ul>
</td>
<td><ul class="simple">
<li><p>Not as fast as lxml,
less lenient than
html5lib.</p></li>
</ul>
</td>
</tr>
<tr class="row-odd"><td><p>lxml’s HTML parser</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">BeautifulSoup(markup,</span> <span class="pre">"lxml")</span></code></p></td>
<td><ul class="simple">
<li><p>Very fast</p></li>
<li><p>Lenient</p></li>
</ul>
</td>
<td><ul class="simple">
<li><p>External C dependency</p></li>
</ul>
</td>
</tr>
<tr class="row-even"><td><p>lxml’s XML parser</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">BeautifulSoup(markup,</span> <span class="pre">"lxml-xml")</span></code>
<code class="docutils literal notranslate"><span class="pre">BeautifulSoup(markup,</span> <span class="pre">"xml")</span></code></p></td>
<td><ul class="simple">
<li><p>Very fast</p></li>
<li><p>The only currently supported
XML parser</p></li>
</ul>
</td>
<td><ul class="simple">
<li><p>External C dependency</p></li>
</ul>
</td>
</tr>
<tr class="row-odd"><td><p>html5lib</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">BeautifulSoup(markup,</span> <span class="pre">"html5lib")</span></code></p></td>
<td><ul class="simple">
<li><p>Extremely lenient</p></li>
<li><p>Parses pages the same way a
web browser does</p></li>
<li><p>Creates valid HTML5</p></li>
</ul>
</td>
<td><ul class="simple">
<li><p>Very slow</p></li>
<li><p>External Python
dependency</p></li>
</ul>
</td>
</tr>
</tbody>
</table>

<center><a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser">Source</a></center>

For more detail on the differences between parsers, see [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers).

### `prettify()`
This is a very useful method: it returns the HTML as a formatted string with indentation, which we can then print.

In [6]:
#View formatted content
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width,initial-scale=1" name="viewport"/>
  <meta content="craigslist" property="og:site_name"/>
  <meta content="preview" name="twitter:card"/>
  <meta content="washington, DC cars &amp; trucks - craigslist" property="og:title"/>
  <meta content="washington, DC cars &amp; trucks - craigslist" name="description"/>
  <meta content="washington, DC cars &amp; trucks - craigslist" property="og:description"/>
  <meta content="https://washingtondc.craigslist.org/search/cta" property="og:url"/>
  <meta content="true" name="smartbanner:api"/>
  <meta content="the craigslist app" name="smartbanner:title"/>
  <meta content="what's old is new" name="smartbanner:author"/>
  <meta content="/images/app_icon.png" name="smartbanner:icon-apple"/>
  <meta content="/images/app_icon.png" name="smartbanner:icon-google"/>
  <meta content="view" name="smartbanner:butt
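Note that `prettify()` itself does not print anything; it returns a string. A small sketch on a made-up fragment (the HTML and variable names are illustrative only):

```python
from bs4 import BeautifulSoup

html = "<div><p>Hello</p><p>World</p></div>"
fragment = BeautifulSoup(html, "html.parser")

pretty = fragment.prettify()       # returns a str; nothing is printed yet
print(isinstance(pretty, str))
print(pretty)                      # one tag or text item per line, indented
```

Because the result is an ordinary string, it can also be written to a file or sliced to preview only the first few hundred characters of a large page.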

Now let's dig in to find the relevant data. Before that, we need to read the HTML to see what we are looking for.

For this, we go to our target webpage and inspect its content.

- Open the webpage in your browser.
- Right-click the relevant information on the page and select *Inspect*. This opens the page's HTML in the browser's developer tools.
- Identify the tag in which the relevant information is stored.

- The most important element in HTML is the *tag*. Each tag can have multiple attributes; *class* and *id* are among the most common and are often used to locate a tag (an *id* is meant to be unique within a page, while a *class* may be shared by many tags). The relevant data is generally the text between a start tag and an end tag, for example between \<a\> and \</a\>.
- To access the text inside a tag, the `.text` attribute or the methods `.get_text()` and `.getText()` can be used. All behave similarly, but the method versions accept arguments (such as `strip=True`) that give more control over the result.
- If the relevant data is the value of an attribute, the `get()` method returns that value.
- To search by tag name, class or id, two methods can be used (there are others as well):
    - `find()`: returns the first matching tag, or `None` if nothing matches.
    - `find_all()`: returns all matching tags as a list of items.
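
The methods listed above can be tried on a small fragment first. The sketch below uses a made-up list of two links; the HTML, class names and attribute values are assumptions for illustration.

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="item"><a href="/cars/1" data-id="1"> Sedan </a></li>
  <li class="item"><a href="/cars/2" data-id="2"> Truck </a></li>
</ul>
"""
demo_soup = BeautifulSoup(html, "html.parser")

first = demo_soup.find("li", class_="item")        # first match only
every = demo_soup.find_all("li", class_="item")    # all matches

print(len(every))                       # 2
print(first.a.get_text(strip=True))     # Sedan
print(repr(first.a.text))               # ' Sedan ' (whitespace kept)
print(first.a.get("href"))              # /cars/1
print(demo_soup.find("table"))          # None: no such tag in the fragment
```

The keyword is spelled `class_` with a trailing underscore because `class` is a reserved word in Python.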
    

In [7]:
# single search
soup.find('div')

<div id="curtain">
<div class="cover"></div>
<div class="content">
<div class="icom-"></div>
<div class="text loading">loading</div>
<div class="text reading">reading</div>
<div class="text writing">writing</div>
<div class="text saving">saving</div>
<div class="text searching">searching</div>
<div class="text unrecoverable">
                There was an error loading the page; please try to
                <a href="#" id="cl-unrecoverable-hard-refresh" onclick="location.reload(true);">refresh the page.</a>
</div>
<div class="text message"></div>
</div>
</div>

In [8]:
soup.find_all('div')

#len(soup.find_all('div'))


[<div id="curtain">
 <div class="cover"></div>
 <div class="content">
 <div class="icom-"></div>
 <div class="text loading">loading</div>
 <div class="text reading">reading</div>
 <div class="text writing">writing</div>
 <div class="text saving">saving</div>
 <div class="text searching">searching</div>
 <div class="text unrecoverable">
                 There was an error loading the page; please try to
                 <a href="#" id="cl-unrecoverable-hard-refresh" onclick="location.reload(true);">refresh the page.</a>
 </div>
 <div class="text message"></div>
 </div>
 </div>,
 <div class="cover"></div>,
 <div class="content">
 <div class="icom-"></div>
 <div class="text loading">loading</div>
 <div class="text reading">reading</div>
 <div class="text writing">writing</div>
 <div class="text saving">saving</div>
 <div class="text searching">searching</div>
 <div class="text unrecoverable">
                 There was an error loading the page; please try to
                 <a href="#" 

In [9]:
#tag with class 

soup.find('div', class_='result-info')


<div class="result-info">
<span class="icon icon-star" role="button">
<span class="screen-reader-text">favorite this post</span>
</span>
<time class="result-date" datetime="2022-04-18 06:48" title="Mon 18 Apr 06:48:25 AM">Apr 18</time>
<h3 class="result-heading">
<a class="result-title hdrlnk" data-id="7472628127" href="https://washingtondc.craigslist.org/nva/ctd/d/washington-2016-ford-explorer-xlt-sport/7472628127.html" id="postid_7472628127">2016 Ford Explorer XLT Sport Utility 4D suv Silver - FINANCE ONLINE</a>
</h3>
<span class="result-meta">
<span class="result-price">$29,590</span>
<span class="result-hood"> (TOUCHLESS DELIVERY TO YOUR HOME)</span>
<span class="result-tags">
<span class="pictag">pic</span>
</span>
<span class="banish icon icon-trash" role="button">
<span class="screen-reader-text">hide this posting</span>
</span>
<span aria-hidden="true" class="unbanish icon icon-trash red" role="button"></span>
<a class="restore-link" href="#">
<span class="restore-narrow-text">

In [10]:
car=soup.find('div', class_='result-info')
print(car.h3.get_text(strip=True))
print(car.a.get('href'))
print(car.find('a').get('href'))
print(car.find('a').get('data-id'))
print(car.find(class_='result-price').get_text())
print(car.find(class_='result-hood').get_text(strip=True))
print(car.find(class_='result-date').get('title'))
print(car.find(class_='result-date').get('datetime'))
print(car.find(class_='result-date').get_text())



2016 Ford Explorer XLT Sport Utility 4D suv Silver - FINANCE ONLINE
https://washingtondc.craigslist.org/nva/ctd/d/washington-2016-ford-explorer-xlt-sport/7472628127.html
https://washingtondc.craigslist.org/nva/ctd/d/washington-2016-ford-explorer-xlt-sport/7472628127.html
7472628127
$29,590
(TOUCHLESS DELIVERY TO YOUR HOME)
Mon 18 Apr 06:48:25 AM
2022-04-18 06:48
Apr 18


- The `contents` attribute holds the direct children of a tag (or of the soup object) as a list.
- The `children` attribute provides an iterator over the same direct children, which include text nodes such as `'\n'` as well as tags.

In [11]:
#children
print(car.contents)
for child in car.children:
    print(child)



['\n', <span class="icon icon-star" role="button">
<span class="screen-reader-text">favorite this post</span>
</span>, '\n', <time class="result-date" datetime="2022-04-18 06:48" title="Mon 18 Apr 06:48:25 AM">Apr 18</time>, '\n', <h3 class="result-heading">
<a class="result-title hdrlnk" data-id="7472628127" href="https://washingtondc.craigslist.org/nva/ctd/d/washington-2016-ford-explorer-xlt-sport/7472628127.html" id="postid_7472628127">2016 Ford Explorer XLT Sport Utility 4D suv Silver - FINANCE ONLINE</a>
</h3>, '\n', <span class="result-meta">
<span class="result-price">$29,590</span>
<span class="result-hood"> (TOUCHLESS DELIVERY TO YOUR HOME)</span>
<span class="result-tags">
<span class="pictag">pic</span>
</span>
<span class="banish icon icon-trash" role="button">
<span class="screen-reader-text">hide this posting</span>
</span>
<span aria-hidden="true" class="unbanish icon icon-trash red" role="button"></span>
<a class="restore-link" href="#">
<span class="restore-narrow-text
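As the output shows, the children include newline strings as well as tags. A minimal sketch of filtering down to tags only, on a made-up fragment (the HTML and the name `box` are illustrative):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<div>\n<span>a</span>\n<span>b</span>\n</div>"
box = BeautifulSoup(html, "html.parser").div

print(len(box.contents))   # 5: three newline strings plus two <span> tags

# Keep only real tags, skipping the whitespace text nodes.
tags_only = [child for child in box.children if isinstance(child, Tag)]
print([t.get_text() for t in tags_only])   # ['a', 'b']
```

Calling `box.find_all(recursive=False)` is another way to get only the direct child tags.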

Let us gather all the required data for each car and store it in a dataframe object for later use.

In [13]:
len(soup.find_all('div', class_='result-info'))

120

In [14]:
l=[]
for car in soup.find_all('div', class_='result-info'):
    # one row per listing: ID, name, price, area, date, link
    o=[car.a.get('data-id'), car.h3.get_text(strip=True),
       car.find('span',class_='result-price').text,
       car.find('span',class_='result-hood').text.replace('(','').replace(')',''),
       car.find(class_='result-date').get('datetime'), car.a.get('href')]
    l.append(o)
    

In [15]:
l
#len(l)

[['7472628127',
  '2016 Ford Explorer XLT Sport Utility 4D suv Silver - FINANCE ONLINE',
  '$29,590',
  ' TOUCHLESS DELIVERY TO YOUR HOME',
  '2022-04-18 06:48',
  'https://washingtondc.craigslist.org/nva/ctd/d/washington-2016-ford-explorer-xlt-sport/7472628127.html'],
 ['7472628110',
  '2019 BMW i3 Base w/Range Extender Hatchback 4D hatchback Black -',
  '$35,590',
  ' TOUCHLESS DELIVERY TO YOUR HOME',
  '2022-04-18 06:48',
  'https://washingtondc.craigslist.org/nva/ctd/d/washington-2019-bmw-i3-base-range/7472628110.html'],
 ['7472628103',
  '2019 Volvo XC90 T6 Inscription Sport Utility 4D suv Silver - FINANCE',
  '$55,590',
  ' TOUCHLESS DELIVERY TO YOUR HOME',
  '2022-04-18 06:48',
  'https://washingtondc.craigslist.org/nva/ctd/d/washington-2019-volvo-xc90-t6-in-ion/7472628103.html'],
 ['7472627916',
  '2021 Lexus RX RX 350 Sport Utility 4D suv Black - FINANCE ONLINE',
  '$51,590',
  ' TOUCHLESS DELIVERY TO YOUR HOME',
  '2022-04-18 06:46',
  'https://washingtondc.craigslist.org/nva
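
The loop above assumes that every listing contains every field. If a field were missing, `find()` would return `None` and the `.text` access would raise an `AttributeError`. The sketch below shows one possible guard, using a hypothetical helper `text_or_none` and a made-up fragment in which the second listing has no `result-hood` span.

```python
from bs4 import BeautifulSoup

def text_or_none(parent, css_class):
    """Return stripped text of the first descendant with css_class, or None.

    Hypothetical helper: it avoids an AttributeError when find() returns None.
    """
    node = parent.find(class_=css_class)
    return node.get_text(strip=True) if node is not None else None

html = """
<div class="result-info"><span class="result-price">$100</span>
  <span class="result-hood"> (Fairfax)</span></div>
<div class="result-info"><span class="result-price">$200</span></div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

rows = []
for card in demo_soup.find_all("div", class_="result-info"):
    hood = text_or_none(card, "result-hood")
    rows.append([text_or_none(card, "result-price"),
                 hood.strip("()") if hood else None])
print(rows)   # [['$100', 'Fairfax'], ['$200', None]]
```

Missing values stored as `None` show up as nulls once the rows are loaded into a dataframe, which makes gaps easy to spot.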

Storing the rows in a dataframe:

In [16]:
import pandas as pd

df = pd.DataFrame(l,columns=['ID','Name','Price','Area','Date','Link'])

In [17]:
display(df)

Unnamed: 0,ID,Name,Price,Area,Date,Link
0,7472628127,2016 Ford Explorer XLT Sport Utility 4D suv Si...,"$29,590",TOUCHLESS DELIVERY TO YOUR HOME,2022-04-18 06:48,https://washingtondc.craigslist.org/nva/ctd/d/...
1,7472628110,2019 BMW i3 Base w/Range Extender Hatchback 4D...,"$35,590",TOUCHLESS DELIVERY TO YOUR HOME,2022-04-18 06:48,https://washingtondc.craigslist.org/nva/ctd/d/...
2,7472628103,2019 Volvo XC90 T6 Inscription Sport Utility 4...,"$55,590",TOUCHLESS DELIVERY TO YOUR HOME,2022-04-18 06:48,https://washingtondc.craigslist.org/nva/ctd/d/...
3,7472627916,2021 Lexus RX RX 350 Sport Utility 4D suv Blac...,"$51,590",TOUCHLESS DELIVERY TO YOUR HOME,2022-04-18 06:46,https://washingtondc.craigslist.org/nva/ctd/d/...
4,7472627915,2021 Chevy Chevrolet Silverado 2500 HD Crew Ca...,"$65,990",TOUCHLESS DELIVERY TO YOUR HOME,2022-04-18 06:46,https://washingtondc.craigslist.org/nva/ctd/d/...
...,...,...,...,...,...,...
115,7472426087,Chevy Express 3500 Cutaway Box Dully,"$5,900",Fairfax,2022-04-17 13:28,https://washingtondc.craigslist.org/nva/cto/d/...
116,7472425842,2007 Toyota Camry,"$5,500",Silver Spring,2022-04-17 13:27,https://washingtondc.craigslist.org/mld/cto/d/...
117,7472425828,2022 Gmc sierra 1500 AT4 New Resigned,"$76,000",northern virginia,2022-04-17 13:27,https://washingtondc.craigslist.org/nva/cto/d/...
118,7472422035,2005 Land Rover LR3 HSE,"$7,900",district of columbia,2022-04-17 13:16,https://washingtondc.craigslist.org/doc/cto/d/...


In [18]:
df.describe()

Unnamed: 0,ID,Name,Price,Area,Date,Link
count,120,120,120,120,120,120
unique,120,118,97,66,102,120
top,7472628127,2014 HYUNDAI VELOSTER Turbo APPROVED!!! APPROV...,$0,TOUCHLESS DELIVERY TO YOUR HOME,2022-04-18 06:48,https://washingtondc.craigslist.org/nva/ctd/d/...
freq,1,3,5,15,3,1


In [62]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   ID      120 non-null    object
 1   Name    120 non-null    object
 2   Price   120 non-null    object
 3   Area    120 non-null    object
 4   Date    120 non-null    object
 5   Link    120 non-null    object
dtypes: object(6)
memory usage: 5.8+ KB
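
Every column is stored as `object` (strings), so numeric summaries of the price are not yet possible. A minimal sketch of one way to convert the types, shown on two hand-written rows in the same shape as the scraped data (the frame `demo` is an assumption for this example):

```python
import pandas as pd

# Two hand-written rows in the same shape as the scraped data.
demo = pd.DataFrame(
    {"Price": ["$29,590", "$0"], "Date": ["2022-04-18 06:48", "2022-04-17 13:28"]}
)

# Strip "$" and "," before converting the price strings to numbers.
demo["Price"] = pd.to_numeric(demo["Price"].str.replace(r"[$,]", "", regex=True))
demo["Date"] = pd.to_datetime(demo["Date"])

print(demo.dtypes)
print(demo["Price"].mean())   # 14795.0
```

After a conversion like this, `describe()` reports statistics such as the mean and quartiles for the price instead of string counts.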


In [19]:

df.to_csv('DataDC.csv')
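
By default, `to_csv()` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below compares both behaviours using in-memory buffers instead of files; the small frame `demo` is made up for illustration.

```python
import io
import pandas as pd

demo = pd.DataFrame({"ID": ["1", "2"], "Price": ["$100", "$200"]})

with_index = io.StringIO()
demo.to_csv(with_index)                    # default: the index is written too
without_index = io.StringIO()
demo.to_csv(without_index, index=False)    # skip the index column

with_index.seek(0)
without_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)
cols_without = list(pd.read_csv(without_index).columns)
print(cols_with)      # ['Unnamed: 0', 'ID', 'Price']
print(cols_without)   # ['ID', 'Price']
```

Whether to keep the index depends on the use case; passing `index=False` is a common choice when the index is just a row counter.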