# Introduction to Web Scraping in Python

Web scraping is the process of collecting and parsing raw data from the Web, and the Python community has come up with some pretty powerful web scraping tools.
Many disciplines, such as data science, business intelligence, and investigative reporting, can benefit enormously from collecting and analyzing data from websites.

## Scrape and Parse Text from Websites
Collecting data from websites using an automated process is known as web scraping. Some websites explicitly forbid users from scraping their data with automated tools.

**Websites have two main reasons for not allowing web scraping:**
1. To protect their data. For example, Google Maps does not allow users to request too many results per minute.
2. To prevent overuse of their servers. When bots send many requests, the website's servers slow down, and other users get a slower connection to the website.
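
Many sites publish their scraping rules in a `robots.txt` file. As a minimal sketch, the standard library's `urllib.robotparser` can check whether a URL may be fetched. The rules, the bot name `my-bot`, and the `example.com` URLs below are made up for illustration, and the rules are parsed from a string, so no network request is made.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real site serves its own at /robots.txt
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# "my-bot" is an illustrative user-agent name
allowed = parser.can_fetch("my-bot", "https://example.com/profiles/aphrodite")
blocked = parser.can_fetch("my-bot", "https://example.com/private/data")
delay = parser.crawl_delay("my-bot")  # seconds to wait between requests, if stated

print(allowed, blocked, delay)
```

A scraper that respects these rules would skip the blocked URL and pause between requests.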

One useful package for web scraping that you can find in Python’s standard library is [urllib](https://docs.python.org/3/library/urllib.html), which contains tools for working with URLs.
**urllib** is for opening and reading URLs.

#### Let's look at the example and use **urllib**

In [29]:
from urllib.request import urlopen
url = "http://olympus.realpython.org/profiles/aphrodite"
page = urlopen(url)

To extract the HTML from the page:
1. Use the **read()** method of `page` to return a sequence of bytes
2. Use the **decode()** method on that result to decode the bytes into a string

In [30]:
html_by = page.read()
html = html_by.decode("utf-8")
print(html)

<html>
<head>
<title>Profile: Aphrodite</title>
</head>
<body bgcolor="yellow">
<center>
<br><br>
<img src="/static/aphrodite.gif" />
<h2>Name: Aphrodite</h2>
<br><br>
Favorite animal: Dove
<br><br>
Favorite color: Red
<br><br>
Hometown: Mount Olympus
</center>
</body>
</html>



#### Let's try to get the title of the webpage
1. Get the index of the opening **\<title>** tag. Because **find()** returns the index where the tag starts, add the length of the tag to get the index where the title text starts.
2. Find the index of the closing **\</title>** tag
3. Get the title by slicing the html between those two indexes

In [33]:
html

'<html>\n<head>\n<title>Profile: Aphrodite</title>\n</head>\n<body bgcolor="yellow">\n<center>\n<br><br>\n<img src="/static/aphrodite.gif" />\n<h2>Name: Aphrodite</h2>\n<br><br>\nFavorite animal: Dove\n<br><br>\nFavorite color: Red\n<br><br>\nHometown: Mount Olympus\n</center>\n</body>\n</html>\n'

In [32]:
html.find("<title>")

14

In [37]:
html[14:21]

'<title>'

In [35]:
title_index = html.find("<title>")
start_index = title_index + len("<title>")

In [36]:
print(start_index)
print(title_index)

21
14


In [38]:
end_index = html.find("</title>")
print(end_index)

39


In [39]:
title = html[start_index:end_index]
print(title)

Profile: Aphrodite
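
The find-and-slice steps above can be wrapped in a small helper. This is only a sketch: `extract_between` and `sample_html` are illustrative names, not part of any library, and the helper returns `None` when a tag is missing, because `find()` returns -1 in that case.

```python
def extract_between(text, start_tag, end_tag):
    """Return the text between start_tag and end_tag, or None if either is missing."""
    start = text.find(start_tag)
    if start == -1:
        return None
    start += len(start_tag)           # skip past the opening tag itself
    end = text.find(end_tag, start)   # search for the closing tag after it
    if end == -1:
        return None
    return text[start:end]

# A shortened copy of the page used above
sample_html = "<html>\n<head>\n<title>Profile: Aphrodite</title>\n</head>\n</html>"
page_title = extract_between(sample_html, "<title>", "</title>")
print(page_title)
```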


That is a lot of work just to get the title of a page, and real-world websites are much more complex. There are many dedicated tools for HTML scraping, and one of the most popular Python libraries is [**Beautiful Soup**](https://www.crummy.com/software/BeautifulSoup/).

Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping.

**Run one of the commands below to install it, depending on your package manager**:
```bash
# With conda:
conda install beautifulsoup4
# Or with pip:
pip install beautifulsoup4
```

In [1]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/aphrodite"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")

#### The example above does three things
1. Opens up a page using **urlopen** from **urllib.request**
2. Reads and decodes the page and saves as a variable
3. Creates a BeautifulSoup object and assigns it to the soup variable 

BeautifulSoup objects have a **.get_text()** method that extracts all the text from the document and automatically removes the HTML tags.

In [41]:
print(soup.get_text())



Profile: Aphrodite





Name: Aphrodite

Favorite animal: Dove

Favorite color: Red

Hometown: Mount Olympus
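
The output above keeps many blank lines. As a small sketch, **.get_text()** also accepts `separator` and `strip` arguments that tidy this up. The HTML below is a shortened, inline copy of the profile page, so the example runs without a network request.

```python
from bs4 import BeautifulSoup

sample_html = """<html>
<head><title>Profile: Aphrodite</title></head>
<body>
<h2>Name: Aphrodite</h2>
<br><br>
Favorite animal: Dove
<br><br>
Favorite color: Red
</body>
</html>"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# strip=True trims each piece of text and drops whitespace-only pieces;
# separator="\n" puts each remaining piece on its own line
clean_text = sample_soup.get_text(separator="\n", strip=True)
print(clean_text)
```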






To get the title tag of the page, use **.title**; to get only its text, add **.string**.

In [36]:
print(soup.title)
print(soup.title.string)

<title>Profile: Aphrodite</title>
Profile: Aphrodite


You can use **find()** to get the first tag that matches and then read its attributes, such as **src**.

In [46]:
image = soup.find("img")

In [47]:
image

<img src="/static/aphrodite.gif"/>

In [50]:
image['src']

'/static/aphrodite.gif'
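
The `src` value above is a path relative to the site, not a full URL. A minimal sketch of turning it into an absolute URL with the standard library's `urljoin` (the variable names are illustrative):

```python
from urllib.parse import urljoin

page_url = "http://olympus.realpython.org/profiles/aphrodite"
image_src = "/static/aphrodite.gif"   # value of image['src'] from the cell above

# urljoin resolves the relative path against the URL of the page it came from
image_url = urljoin(page_url, image_src)
print(image_url)
```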

#### Practice web scraping on Unegui.mn
1. Go to https://www.unegui.mn/avto-mashin/-avtomashin-zarna/ and use your browser's inspection tool to see the HTML tags and attributes.
2. Scrape the **title** and **price** of every listing. Scrape only the first page!
3. Save your listings as a pandas DataFrame.

The example below illustrates the final result.

In [51]:
import pandas as pd
titles = ['Toyota FJ Cruiser, 2012/2020', 'Honda Crossroad, 2009/2019']
prices = ['62 сая', '17 сая']
# Pass a dict of columns so each title lines up with its price
results = pd.DataFrame({'titles': titles, 'prices': prices})

In [2]:
import requests
from bs4 import BeautifulSoup

In [6]:
response = requests.get('https://www.unegui.mn/kompyuter-busad/notebook/')

In [7]:
soup = BeautifulSoup(response.content, "html.parser")

In [8]:
results = soup.find_all("div", {"class": "announcement-block__price _verified"})

In [10]:
len(results)

65

In [28]:
results[0]

<div class="announcement-block__price _verified" itemprop="offers" itemscope="" itemtype="http://schema.org/Offer">
<meta content="Acer aspire e-5 ram-4gb hard-250gb hdd inch-15.6" itemprop="name">
<meta content="Улаанбаатар" itemprop="areaServed">
<meta content="MNT" itemprop="priceCurrency"/>
<meta content="890000.00" itemprop="price"/>
              890,000 <b>₮</b>
<span class="verified" title="Баталгаажсан хэрэглэгч"></span>
</meta></meta></div>

In [42]:
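rows = []
# urls is the list of page URLs (built in a later cell of this notebook)
for url in urls:
    response = requests.get(url, timeout=15)  # fetch each page, not just the first
    soup = BeautifulSoup(response.content, "html.parser")
    results = soup.find_all("div", {"class": "announcement-block__price _verified"})
    for item in results:
        title = item.find("meta", {"itemprop": "name"})['content']
        price = item.find("meta", {"itemprop": "price"})['content']
        rows.append({'title': title, 'price': price})

# Build the DataFrame once; DataFrame.append was removed in pandas 2.0
df = pd.DataFrame(rows, columns=['title', 'price'])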

In [43]:
df

| | title | price |
|---|---|---|
| 0 | Samsung i5 8-р үе ram-8gb hard-128gb ssd+500gb... | 1690000.00 |
| 1 | Dell precision-5520 i7 7820hq ram-16gb hard-25... | 2690000.00 |
| 2 | Hp spectre x360 convertible i5 8-р үе ram-8gb ... | 2490000.00 |
| 3 | Dell gamer g7 i7 9-р үе rtx2060 17.3 | 3890000.00 |
| 4 | Dell latitude e-5270 i3 6-р үе ram-8gb hard-25... | 890000.00 |
| ... | ... | ... |
| 120 | Acer 17.3 i5-10th gen 8gb ram 128gb ssd 500gb hdd | 2000000.00 |
| 121 | Dell i3 8gb ram 128gb ssd | 1750000.00 |
| 122 | Hp omen x i7-8 16gb 128gb+1000gb rtx 2070 15.6... | 4090000.00 |
| 123 | Acer nitro 5 i7 10-р үе 8gb 128gb+1000gb hdd 1... | 3490000.00 |
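
The scraped prices are strings, so they cannot be sorted or averaged as numbers yet. A minimal sketch of converting the column with `pd.to_numeric`, using a two-row stand-in for the real DataFrame (`sample_df` is an illustrative name):

```python
import pandas as pd

# Two rows copied from the output above, with prices still stored as strings
sample_df = pd.DataFrame({
    "title": ["Dell i3 8gb ram 128gb ssd", "Dell gamer g7 i7 9-р үе rtx2060 17.3"],
    "price": ["1750000.00", "3890000.00"],
})

# Convert the text prices to floats so numeric operations work
sample_df["price"] = pd.to_numeric(sample_df["price"])
print(sample_df["price"].max())
```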


In [30]:
results[0].find("meta", {"itemprop":"price"})['content']

'890000.00'

1. Get a list of URLs to scrape
2. Loop through the URLs
3. Inside that loop, loop through the listings (65 per page)
4. Grab the data you need (title and price for 65 listings)
5. Append it to a dataframe
6. Go to the next page
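
The steps above can be sketched as one function. To keep the example runnable offline, it takes a hypothetical `fetch` callable that returns the HTML for a URL, and it reads made-up pages whose listings use an assumed `listing` class rather than the real site's markup.

```python
import pandas as pd
from bs4 import BeautifulSoup

def scrape_listings(urls, fetch):
    """Collect the title and price of every listing on every page.

    fetch is a hypothetical callable that takes a URL and returns its HTML.
    """
    rows = []
    for url in urls:                                          # loop through the URLs
        page_soup = BeautifulSoup(fetch(url), "html.parser")
        for item in page_soup.find_all("div", class_="listing"):  # loop through the listings
            rows.append({                                     # grab the data you need
                "title": item.find("meta", {"itemprop": "name"})["content"],
                "price": item.find("meta", {"itemprop": "price"})["content"],
            })
    # Build the DataFrame once, after all pages are processed
    return pd.DataFrame(rows, columns=["title", "price"])

# Made-up pages standing in for real responses
fake_pages = {
    "page-1": '<div class="listing"><meta itemprop="name" content="Laptop A"/>'
              '<meta itemprop="price" content="100.00"/></div>',
    "page-2": '<div class="listing"><meta itemprop="name" content="Laptop B"/>'
              '<meta itemprop="price" content="200.00"/></div>',
}
listings = scrape_listings(list(fake_pages), fake_pages.get)
print(listings)
```

Against a live site, `fetch` could be something like `lambda url: requests.get(url, timeout=15).content`.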

In [90]:
!python --version

Python 3.8.12


In [18]:
urls = []
for number in range(1,72):
    urls.append(f'https://www.unegui.mn/kompyuter-busad/notebook/?page={number}')

In [28]:
url = 'https://www.zangia.mn/job/_sygeyd3008'

In [29]:
response = requests.get(url, timeout=15)

In [30]:
soup = BeautifulSoup(response.content, "html.parser")

In [35]:
soup.find_all("div", {"class": "details"})[0].find_all("div")

[<div><b>Байршил</b><span>Улаанбаатар хот, Сонгинохайрхан дүүрэг</span></div>,
 <div><b>Салбар</b><span>Банк, санхүү, нягтлан бодох бүртгэл</span></div>,
 <div><b>Түвшин</b><span>Дунд шатны удирдлага</span></div>,
 <div><b>Төрөл</b><span>Бүтэн цагийн</span></div>,
 <div><b>Цалин</b><span>1,500,000 - 1,800,000</span></div>]
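
Each of these divs pairs a label in `<b>` with a value in `<span>`. A minimal sketch of turning them into a dictionary, using a trimmed inline copy of the markup above so it runs without a network request (`snippet`, `detail_soup`, and `details` are illustrative names):

```python
from bs4 import BeautifulSoup

# A trimmed copy of the markup shown in the output above
snippet = """
<div class="details">
<div><b>Байршил</b><span>Улаанбаатар хот, Сонгинохайрхан дүүрэг</span></div>
<div><b>Цалин</b><span>1,500,000 - 1,800,000</span></div>
</div>
"""
detail_soup = BeautifulSoup(snippet, "html.parser")
detail_rows = detail_soup.find("div", {"class": "details"}).find_all("div")

# Map each label (<b>) to its value (<span>)
details = {
    row.b.get_text(strip=True): row.span.get_text(strip=True)
    for row in detail_rows
}
print(details["Цалин"])
```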

In [6]:
other = soup.find_all("div", {"class": "details"})[0].find_all("div")

In [36]:
url = "https://www.zangia.mn/job/list/lmt.100/pg.2"

In [37]:
response = requests.get(url, timeout=15)

In [38]:
soup = BeautifulSoup(response.content, "html.parser")

In [39]:
ads = soup.find_all("div", {"class": "ad"})

In [42]:
job_links = [ad.find("a")['href'] for ad in ads]

In [None]:
# Equivalent to the list comprehension above, written as an explicit loop
job_links = []
for ad in ads:
    job_links.append(ad.find("a")['href'])

In [57]:
len(job_links)

100

In [58]:
links = soup.find_all("a")

In [59]:
entry_links = []
for link in links:
    # Some <a> tags have no href, so use .get() to avoid a KeyError
    href = link.get('href', '')
    if 'job/_' in href:
        entry_links.append(href)

In [None]:
base_url = "https://www.zangia.mn/"  # the entry links are relative to this base URL

In [79]:
entry_links

['job/_vn+z9gfpq2',
 'job/_ev05qlotmk',
 'job/_4aff_s4wpd',
 'job/_wc_k+oh-qw',
 'job/_bb2sf1qm6v',
 'job/_y+81sl4de4',
 'job/_9n3t3icdly',
 'job/_b0+vlvpjtv',
 'job/_1970_zgrm5',
 'job/_l1mnuiic_c',
 'job/_pi0h0ghq_e',
 'job/_w-_8xy-ekk',
 'job/_54_lmpebjv',
 'job/_qq9_ko9qt2',
 'job/_yz8w5387fk',
 'job/_klqlvqctyo',
 'job/_8u-o8mqgp6',
 'job/_23-maw4zz2',
 'job/_7+o8m36t-0',
 'job/_hqdk93m-7d',
 'job/_lbap2gnt53',
 'job/_i1m7zekc42',
 'job/_2nkvgkb6x5',
 'job/_dfuazr7553',
 'job/_ok3+pohb66',
 'job/_5z4yvm4i-0',
 'job/_827c7wv2a3',
 'job/_ejv5rq3ymx',
 'job/_gobpmz0tly',
 'job/_bb0o4xo+fc',
 'job/_be51z+fmu0',
 'job/_l9ea0lf25l',
 'job/_phq7b0v6ni',
 'job/_p__c1cigv3',
 'job/_bxz5va6nj9',
 'job/_ejlarkg9_s',
 'job/_puk-g22_il',
 'job/_axoxr9j8+w',
 'job/_8t4bahd+_3',
 'job/_p3late6z6s',
 'job/_fbkg90ngn3',
 'job/_e180cn88mj',
 'job/_0b-d3+82tn',
 'job/_c-qg77g_su',
 'job/_5chf0vfz7r',
 'job/_ahzvlvre_l',
 'job/_pvwfay-0xb',
 'job/_ti4xdz7f8m',
 'job/_m7if4zw433',
 'job/_vdwxhe2uhy',
