# Web Scraping with Beautiful Soup

### Importing packages

In [None]:
import requests

### How does the Internet Work?

The following is a five-minute video on how the Internet works. A basic understanding of this is necessary before we start web scraping.

https://www.youtube.com/watch?v=7_LPdttKXPc

To perform HTTP requests in Python we will be working with [requests](http://docs.python-requests.org/en/master/). Fun fact: requests is one of the most downloaded Python packages of all time, receiving 400,000 downloads in a single day.

In [None]:
response = requests.get("https://en.wikipedia.org/wiki/List_of_cities_in_Malaysia")

### HTTP Response Status

To check whether our HTTP request was successful, we look at the status code. The following [link](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) explains each of the codes in detail, but generally these are the codes you can expect to see:

- 2xx Success (200 means your query was successful)
- 3xx Redirections
- 4xx Client errors (a familiar code would be error code 404: resource not found)
- 5xx Server errors
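
The class of a status code is given by its first digit, so it can be checked programmatically. A minimal sketch using only the standard library; `describe_status` and `STATUS_CLASSES` are hypothetical names for illustration, not part of `requests`:

```python
from http import HTTPStatus

# Hypothetical lookup: first digit of the status code -> class of response.
STATUS_CLASSES = {
    2: "Success",
    3: "Redirection",
    4: "Client error",
    5: "Server error",
}

def describe_status(code):
    """Return a short description such as '404 Not Found (Client error)'."""
    status_class = STATUS_CLASSES.get(code // 100, "Other")
    phrase = HTTPStatus(code).phrase  # raises ValueError for unknown codes
    return f"{code} {phrase} ({status_class})"

for code in (200, 301, 404, 500):
    print(describe_status(code))
```

With `requests`, `response.ok` is true for codes below 400, and `response.raise_for_status()` raises an exception for 4xx and 5xx codes.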

In [None]:
response.status_code

#You should receive the code '200' here which means that your request was successful

In [None]:
print(response.text) #Calling the text attribute of response allows us to see the HTML text from our query

### Web Scraping 101

To web scrape is to retrieve data from a website in a format usable for further analysis. Webpages are rendered by your web browser from HTML and CSS code. The content that is useful to us is usually stored in the HTML.

In the following section we will perform the following:

- Get the HTML code of a given URL. We can use `urllib` or `requests` for that.
- Create a Beautiful Soup object, which is an interface to the Document Object Model (DOM)

As we know, HTML tags are enclosed in angle brackets '<>'. These tags provide structural information and are useful for selecting the content we want from the entire webpage.

For example, '< a >' tags mark links on the webpage, and by finding all the 'a' tags, one can quickly access all the links on the page. BeautifulSoup enables us to select these HTML elements quickly.


#### Helpful Resources:
- [HTML Basics](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics)


## Beautiful Soup

[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work. The current supported version of Beautiful Soup is version 4.

To install:

```python
!pip install bs4
```

In [None]:
!pip install bs4

### Usage

Right after the installation you can start using BeautifulSoup. At the beginning of your Python script, import the library:

```python
from bs4 import BeautifulSoup
```

Now you have to pass something to BeautifulSoup to create a soup object. That could be a string of markup or an open file handle, but not a URL: BeautifulSoup does not fetch the web page for you, so you have to do that yourself. Libraries such as `urllib` or `requests` can be used.

```python
import requests
```
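
A soup can be built from a string of markup or from a file-like object. A minimal sketch: `io.StringIO` stands in for a real file opened with `open()`, and the built-in `html.parser` is used so that no extra parser needs to be installed.

```python
import io
from bs4 import BeautifulSoup

markup = "<html><body><p>Hello, soup!</p></body></html>"

# 1. Parse markup held in a string (for example, response.text from requests).
soup_from_string = BeautifulSoup(markup, "html.parser")

# 2. Parse markup from a file-like object; StringIO stands in for open(...).
soup_from_file = BeautifulSoup(io.StringIO(markup), "html.parser")

print(soup_from_string.p.get_text())
print(soup_from_file.p.get_text())
```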

**Parser**

Beautiful Soup supports the HTML parser included in Python’s standard library, but it also supports a number of third-party Python parsers, such as lxml and html5lib. Depending on your setup, you might install one of them with one of these commands:

```python
pip install lxml
```
or

```python
pip install html5lib
```
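
Since lxml may not be installed everywhere, one possible pattern is to try it first and fall back to the built-in parser. This is only a sketch: `make_soup` is a hypothetical helper name, and `FeatureNotFound` is the exception Beautiful Soup raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup):
    """Parse markup with lxml if it is installed, else with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not available, so use the parser from the standard library
        return BeautifulSoup(markup, "html.parser")

soup_demo = make_soup("<p>parsed</p>")
print(soup_demo.p.get_text())
```

Different parsers can build slightly different trees from invalid markup, so it is worth sticking to one parser within a project.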
 

### Importing packages

In [None]:
#import requests #We already have this from before
from bs4 import BeautifulSoup

### Filtering

We can pass filters to methods such as `find_all`. These filters can be based on a tag’s name, on its attributes, on the text of a string, or on some combination of these. This enables us to quickly access the HTML elements that we are interested in.

**EXAMPLE**

Suppose we have the following HTML document:

```python
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
```

In [None]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml') # The 'lxml' argument specifies which HTML parser we want to use
print(soup)

A handy method for producing readable output is prettify(). It tidies up the spacing and indentation.

In [None]:
print(soup.prettify()) #It is a lot easier to find elements when things are spaced appropriately

Run the following bits of code line by line to understand how we can access different tags through BS4

In [None]:
#print(soup.title) #This will access the 'title' tag
#print(soup.title.parent) #This will access the parent tag above the title
#print(soup.title.parent.name) #This gives us the name of the parent tag, which is head

Let's try to get the title of the 'webpage' we simulated above

In [None]:
print(soup.title)
print()
print(soup.title.name)
print()
print(soup.title.text)
print()
print(soup.title.parent.name)

We can get the whole content of our soup object with soup.contents. Note that while the output looks similar to just printing 'soup', the two objects have different data types: soup.contents is a list, while soup is a BeautifulSoup object.

In [None]:
print(soup.contents)
print()
print()
print(type(soup.contents))
print() #This just prints empty lines
print()
print(soup)
print()
print()
print(type(soup))

We can specifically access the 'body' tag of our HTML document through soup.body

In [None]:
print(soup.body)

We can access different tags such as the paragraph tag '< p >' and the link tags '< a >' in the following way

In [None]:
print(soup.p)

In [None]:
print(soup.a)

The 'class' above just refers to the class attribute which is used to point to a class in a style sheet.
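
Attributes such as `class`, `id`, and `href` can be read directly from a tag. A minimal sketch on a single link taken from the example document, parsed with the built-in `html.parser`:

```python
from bs4 import BeautifulSoup

snippet = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
link = BeautifulSoup(snippet, "html.parser").a

print(link["href"])       # dictionary-style access; raises KeyError if missing
print(link.get("id"))     # .get() returns the value, or None if missing
print(link.get("title"))  # this link has no title attribute
print(link["class"])      # class can hold several values, so this is a list
print(link.attrs)         # all attributes as a dictionary
```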

If we want to find all the tags of a given type, we can use the 'find_all' method on soup

In [None]:
soup.find_all('a')

In [None]:
ptags = soup.find_all('p')
print(len(ptags))
ptags

In [None]:
# Use find_all to get all 'a' tags with the class 'sister'
soup.find_all('a', {'class':'sister'}) #The attribute is passed in as a dictionary

In [None]:
soup.find_all('a', {'class':'sister', 'id':'link1'}) 

#### Challenge: find the a link with the id: link3

In [None]:
# your code here

soup.find_all('a',{'class':'sister','id':'link3'})
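
Besides the dictionary form, `find_all` also accepts attributes as keyword arguments. Because `class` is a reserved word in Python, Beautiful Soup uses `class_` instead. A short sketch on a reduced copy of the example document (the variable names are illustrative):

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Keyword form of {'class': 'sister'}
sisters = demo_soup.find_all("a", class_="sister")

# find() returns only the first match, or None if nothing matches
third = demo_soup.find("a", id="link3")

print(len(sisters))
print(third.get_text(), third["href"])
```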

A common task when web scraping is extracting all the URLs found within a webpage's '< a >' tags. The for loop below does this.

In [None]:
for link in soup.find_all('a'):
    print(link.get('href'))

We can apply the same logic to extracting the text used as the hyperlink

In [None]:
for link in soup.find_all('a'):
    print(link.get_text())

Another common task is extracting all the text from a page

In [None]:
print(soup.get_text())

In [None]:
# This code only retrieves text from the body tag of our fictitious website

print(soup.find('body').get_text())
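
By default `get_text()` joins text fragments with nothing in between, which can glue words together. It also accepts a separator and a `strip` flag. A small sketch with made-up markup:

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup("<div><h1>Title</h1><p>Body text</p></div>", "html.parser")

plain = demo.get_text()                   # fragments joined with no separator
spaced = demo.get_text(" ", strip=True)   # fragments stripped, joined by a space

print(repr(plain))
print(repr(spaced))
```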

### Regular Expressions (Regex) 


If we pass in a regular expression object, Beautiful Soup will filter against
that regular expression using its search() method, so the pattern can match anywhere in a tag name.

This code finds all the tags whose names start with the letter "b",
in this case, the 'body' tag and the 'b' tag:

In [None]:
import re

In [None]:
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

In [None]:
for tag in soup.find_all(re.compile("t")):
    print(tag.name)

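The code below prints nothing because we passed a plain string instead of a regular expression. A string filter must match the tag name exactly, so Soup searches for a tag < t >, which does not exist in our fictitious HTML. It would work with "b", however, because a < b > tag exists.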

In [None]:
for tag in soup.find_all("t"): #did not use the re package here
    print(tag.name)
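
Regular expressions can also filter on attribute values rather than tag names. A sketch, assuming we only want links whose `href` path starts with "e" or "l"; the markup is a reduced copy of the example document and the variable names are illustrative:

```python
import re
from bs4 import BeautifulSoup

html = """
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
<a href="http://example.com/tillie" id="link3">Tillie</a>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# The pattern is searched for inside each href value,
# so it does not have to match from the start of the URL.
pattern = re.compile(r"example\.com/[el]")
matching = demo_soup.find_all("a", href=pattern)

print([a.get_text() for a in matching])
```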

### List

If we pass in a list, Beautiful Soup will match tags against any
item in that list.

This code finds all the 'p' tags and all the 'b' tags

In [None]:
print(soup.find_all(["p", "b"]))

## Navigating the Parse Tree

If you want to know how to navigate the tree please see the official [documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree)

There you can read about the following things:

**Going down**
* Navigating using tag names
* `.contents` and `.children`
* `.descendants`
* `.string`
* `.strings` and `.stripped_strings`

**Going up**
* `.parent`
* `.parents`

**Going sideways**
* `.next_sibling` and `.previous_sibling`
* `.next_siblings` and `.previous_siblings`

**Going back and forth**
* `.next_element` and `.previous_element`
* `.next_elements` and `.previous_elements`
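
A few of these attributes in action, as a minimal sketch on made-up markup. The markup is written without whitespace between tags so that siblings are tags rather than newline strings.

```python
from bs4 import BeautifulSoup

html = '<div><p id="first">One</p><p id="second">Two <b>bold</b></p></div>'
tree = BeautifulSoup(html, "html.parser")

first = tree.find("p", id="first")
second = first.next_sibling                  # going sideways

parent_name = first.parent.name              # going up
children = [str(child) for child in second.children]  # going down one level
fragments = list(second.stripped_strings)    # all text pieces, whitespace-stripped

print(parent_name)
print(second["id"])
print(children)
print(fragments)
```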

# Web Scraping Workflow

1. Find the data you want on the web.
2. Inspect the webpage and figure out how to select the content you want. This usually involves some combination of
    - Viewing the source code of the page (especially if it is simple), and
    - Figuring out the structure of the HTML parse tree.  This step is much easier with something like __Chrome Developer Tools__.
3.  Write code to get out what you want:
    - If the page is very simple, treat it as a bunch of text => __string manipulation / [regular expressions](https://docs.python.org/2/howto/regex.html)__ in Python.
    - To have a more robust solution, it is better to use the HTML parse tree => __[BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) / [lxml](http://lxml.de/lxmlhtml.html)__ in Python.
4.  Make sure it worked!
5.  If your crawling problem is at all non-trivial, you will now have to go back to Step 2 to zoom in further -- or you'll have parsed the URL of a link you want to follow, in which case you'll go back to Step 1 to figure out how to parse what you want from the new target page.
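
To illustrate the trade-off in step 3, the sketch below extracts link targets twice: once with a regular expression and once with a parser. It uses only the standard library's `html.parser` to stay self-contained; `LinkCollector` is a hypothetical class name, and Beautiful Soup's `find_all('a')` would play the same role as the parser here.

```python
import re
from html.parser import HTMLParser

page = """<a href="/one">One</a> <a href='/two'>Two</a> <A HREF="/three">Three</A>"""

# Approach 1: plain regular expression. Simple, but this pattern only
# handles lowercase, double-quoted href attributes.
regex_links = re.findall(r'href="([^"]+)"', page)

# Approach 2: a real HTML parser, which copes with quoting and letter case.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":  # tag and attribute names arrive lowercased
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

collector = LinkCollector()
collector.feed(page)

print(regex_links)
print(collector.links)
```

The regular expression misses two of the three links, which is why the parse tree is the more robust choice for anything beyond very simple pages.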

## Exercise 1

Using the page at https://en.wikipedia.org/wiki/List_of_cities_in_Malaysia, extract the page title, all the links, and the table of state names

In [None]:
url = "https://en.wikipedia.org/wiki/List_of_cities_in_Malaysia"
    
response1 = requests.get(url)
soup1 = BeautifulSoup(response1.text, 'lxml')
print(soup1)

#### Problem 1

The first thing you should do from here is prettify your soup1 and inspect the general structure 

In [None]:
print(soup1.prettify())

# your code here

#### Problem 2

Retrieve the title in string format of the webpage

In [None]:
# your code here

print(soup1.title.string)

#### Problem 3

Retrieve all the links on the webpage. Remember that these are contained in the 'a' tags

In [None]:
# your code here

for link in soup1.find_all('a'):
    print(link.get('href'))

#### Problem 4 

Find all the tables, and then find the specific table that contains the state names that we are interested in

In [None]:
soup1.find_all('table') #You can actually check this by going 
#to the website and right clicking on the table you are interested in

In [None]:
right_table = soup1.find('table', class_ = 'wikitable sortable')

In [None]:
right_table

In [None]:
list1 = []

for link in right_table.find_all('a'):
    list1.append(link.get('title'))
    
list1

In [None]:
import pandas as pd

df = pd.DataFrame(list1)
df

In [None]:
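df1 = df.iloc[[2,7,10,13,17,20,23,26,29,32,35,38,41,44]] 

# Just pulling out the rows that hold the relevant states.
# These positions are hardcoded, so they will need updating if the Wikipedia table changes.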

In [None]:
df1.rename(columns={0:'State'})
#df1.reset_index(drop = True)

## Do it yourself: Web Scraping

The goal of this mini-project is to scrape data from e-commerce or other websites such as
Lelong, Lazada, Mudah, iProperty, Booking, Expedia etc.

Scrape at least 1000 items from one of the websites mentioned above. The scraped data should include:
- Product Name/Product Title
- Amount/Price 
- Brand
- Comments/Reviews
- Number of views


In addition, you are required to load the scraped data into a pandas DataFrame and also save a
copy in CSV format. 

After successfully loading the data into a DataFrame, you are required to carry out a data
analysis on it.

Your analysis should provide answers to the following questions: 
* What do you think is interesting about this data? 
* Tell a story about some interesting thing you have discovered by looking at the data. 
* Visualize your data with matplotlib or with folium library package.

For example, you might consider whether prices differ at different times
of the day or between cities, or whether other factors influence the prices. Another thing you
might consider is whether there is a relationship between the price and the number of reviews or
comments.

Document your analysis workflow in your Jupyter notebook.

### Time package

These websites have algorithms to detect clients that access large amounts of their data in rapid succession. The `time` module helps us add a human-like pause to our code.

We have to add **sleeps** in order not to be blacklisted by the website we are crawling.

In [None]:
import time
for i in range(10):
    print(i)
    time.sleep(1)
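
A common variation on a fixed pause is to randomize the delay a little. A sketch only: `polite_pause` and its default bounds are illustrative assumptions, not values that any particular website requires.

```python
import random
import time

def polite_pause(min_seconds=1.0, max_seconds=3.0):
    """Sleep for a random duration between the two bounds and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay

for page in range(1, 4):
    waited = polite_pause(0.1, 0.3)  # short bounds so the demo finishes quickly
    print(f"page {page}: waited {waited:.2f}s")
```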

The following code scrapes a Lelong URL and writes the information to a CSV file for us

In [None]:
import csv
import time
import requests
from bs4 import BeautifulSoup


lelong_url = 'https://www.lelong.com.my/catalog/all/list'  # this is the URL we will look at

# The with statement is a context manager: it opens the file for writing
# and closes it automatically when the block ends
with open('phones_lelong.csv', 'w', encoding='utf-8', newline='') as csvfile:
    lelongwriter = csv.writer(csvfile)

    for page in range(1, 11):
        print("Querying page %s..." % page)
        response = requests.get(lelong_url, params={'TheKeyword': 'phone', 'D': page})
        print('Got page %s' % page)
        soup = BeautifulSoup(response.text, 'lxml')
        results = soup.find_all('div', attrs={'class': 'summary'})
        for product in results:
            b_element = product.find('b')
            # Skip entries that lack the expected <b> tag or its price attribute
            if b_element is None or b_element.get('data-price') is None:
                continue
            price = float(b_element.get('data-price'))
            url = b_element.get('data-link')
            title = b_element.text
            lelongwriter.writerow([title, price, url])
        print('Sleeping...')
        time.sleep(1)
        print('Waking up!')
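
The writer above produces a CSV without a header row. A possible variant, sketched here with `csv.DictWriter`, writes named columns so that the file is self-describing. The rows are made-up sample data, the column names are illustrative, and `io.StringIO` stands in for a real file so the example runs without touching disk.

```python
import csv
import io

# Made-up sample rows standing in for scraped products.
rows = [
    {"title": "Phone A", "price": 499.0, "url": "https://example.com/a"},
    {"title": "Phone B", "price": 899.5, "url": "https://example.com/b"},
]

# For a real file, use open('products.csv', 'w', newline='') instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "price", "url"])
writer.writeheader()      # first line holds the column names
writer.writerows(rows)

print(buffer.getvalue())
```

A file written this way can be read with `pd.read_csv` without passing `header=None` or renaming the columns afterwards.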


In [None]:
import pandas as pd

df2 = pd.read_csv('phones_lelong.csv', header = None)

In [None]:
df2.head()

In [None]:
df2.rename(columns = {0:'Summary', 1:'Price', 2:'Hyperlink'}, inplace = True)
df2.head()

In [None]:
#Some useful information about price from the data

print(df2.Price.mean())
print(df2.Price.max())
print(df2.Price.min())
print(len(df2)) # we have this many rows in our data

In [None]:
df2.tail()

In [None]:
brandlist = ['[sS]amsung','[sS]ony','[sS]pigen','[hH]uawei','[cC]elcom',
             '[aA]pple','[lL]G','[mM]otorola','[rR]azer','[pP]anasonic','[xX]iaomi','3M',
             '[oO]nePlus','[oO]ppo']

# The reason why we have [sS] is to enable string matching 
#for either upper or lower case in the name.

brandcounts = []
for brand in brandlist:
    
    brandcounts.append(len(df2[df2.Summary.str.contains(brand)]))
    
dict1 = dict(zip(brandlist,brandcounts))
df3 = pd.DataFrame.from_dict(dict1, orient = 'index')
df3.columns = ['Count']
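
The `[sS]amsung` style of pattern only allows for a different first letter, so a title containing `SAMSUNG` would be missed. A possible alternative, sketched here with the standard library on made-up titles, is to match case-insensitively. In pandas, `str.contains` accepts `case=False` for the same purpose.

```python
import re

# Made-up product titles for illustration only.
titles = ["SAMSUNG Galaxy case", "samsung charger", "Sony headphones", "Generic cable"]
brands = ["Samsung", "Sony", "Apple"]

brand_counts = {}
for brand in brands:
    # re.escape guards against brand names containing regex special characters
    pattern = re.compile(re.escape(brand), re.IGNORECASE)
    brand_counts[brand] = sum(1 for title in titles if pattern.search(title))

print(brand_counts)
```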

In [None]:
print(df3.Count.sum())
len(df2)
# There may be unbranded goods or goods with brands we are not aware of

In [None]:
df3.plot(kind = 'barh', title = 'Brands by Count')

# Note that this graph will look different every day because we are downloading fresh data!