# Web Scraping

## Outline:

#### 1. Motivations
#### 2. HTML and the DOM
#### 3. Parsing HTML in Python: Beautiful Soup
#### 4. Chrome developer tools

### How the Internet works
#### A five-minute video
https://www.youtube.com/watch?v=7_LPdttKXPc

To perform HTTP requests in Python, [requests](http://docs.python-requests.org/en/master/) is an easy-to-use library.

In [1]:
import requests
response = requests.get(
    "https://www.lelong.com.my/catalog/all/list?TheKeyword=oneplus+x%20case"
)

#### HTTP Response status
https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
- 2xx Success (200: OK)
- 3xx Redirection
- 4xx Client errors (404: resource not found)
- 5xx Server errors

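The status classes listed above can be checked programmatically. The following is a minimal sketch that uses only the standard library; the helper name `describe_status` is an assumption made for illustration and is not part of `requests`.

```python
from http import HTTPStatus

# Map the first digit of a status code to its class.
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client error",
    5: "Server error",
}


def describe_status(code):
    """Return a readable description of an HTTP status code (hypothetical helper)."""
    family = STATUS_CLASSES.get(code // 100, "Unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:  # the code is not registered in the standard library
        phrase = "Unregistered"
    return f"{code} {phrase}: {family}"


for code in (200, 301, 404, 503):
    print(describe_status(code))
```

When working with `requests`, calling `response.raise_for_status()` raises an exception for 4xx and 5xx responses, which is often simpler than inspecting `status_code` by hand.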


In [2]:
response.status_code

200

In [3]:
print(response.text)

    <!--Canonical Link-->
    <!--Meta-->
    <!--Chat-->


<!DOCTYPE html>
<html lang="en">
<head>
    <meta http-equiv="content-type" content="text/html; charset=iso-8859-1">
    
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <meta name="google-translate-customization" content="2f26647729bf248b-f2b32987adcfc8b8-gbe489916db8e3e69-11">
    <title>Oneplus x case price, harga in Malaysia - lelong</title>
    <meta name="description" content="Oneplus x case Malaysia price, harga; Price list of Malaysia Oneplus x case products from sellers on Lelong.my">
    <!-- Twitter Card data -->
    <meta name="twitter:card" content="product" />
    <meta name="twitter:site" content="@LelongMy" />
    <meta name="twitter:creator" content="@LelongMy" />
    <meta name="twitter:title" content="Oneplus x case price, harga in Malaysia - lelong" />
    <meta name="twitter:description" content="Oneplus x case Malaysia price, harga; Price list of Malaysia 

In [4]:
requests.post?

In [5]:
response = requests.get("https://www.lelong.com.my/catalog/all/list",
                        params={'TheKeyword':'oneplus x case'})

In [6]:
response.status_code

200

In [7]:
response.encoding

'utf-8'

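Passing a `params` dictionary, as in the request above, avoids encoding the query string by hand. The sketch below shows roughly what that encoding looks like, using `urllib.parse.urlencode` from the standard library as a stand-in; it performs no network request.

```python
from urllib.parse import urlencode

params = {"TheKeyword": "oneplus x case"}

# Form-style encoding: spaces become '+', special characters are percent-encoded.
query = urlencode(params)
print(query)

# Assemble the full URL the same way a client library would.
base_url = "https://www.lelong.com.my/catalog/all/list"
full_url = f"{base_url}?{query}"
print(full_url)
```

After a real request, `response.url` shows the URL that `requests` actually built, which is a quick way to confirm the parameters were encoded as expected.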


## HTML and the DOM

Web scraping:
- Retrieve data that exists on a website
- In a usable format for analysis
- Webpages are rendered by the browser from HTML and CSS code
- The useful content is usually in HTML

We will:
- Get the HTML of a given URL. We can use `urllib` or `requests` for that.
- Create a Beautiful Soup object, which is an interface to the DOM (Document Object Model).

BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

HTML Basics: https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics


Beautiful Soup is essentially a set of wrapper functions that make it simple to select common HTML elements.

In other words, Beautiful Soup is a Python API for navigating the DOM-like tree of a parsed document.


In [8]:
html_doc = """
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>My test page</title>
    <!-- <link href='http://fonts.googleapis.com/css?family=Open+Sans' rel='stylesheet' type='text/css'> -->
     <link href="styles/style.css" rel="stylesheet" type="text/css"> 
  </head>
  <body>
    <h1>Mozilla is cool</h1>
    <img src="img/guidelines-logo.7ea045a4e288.png" alt="The Firefox logo: a flaming fox surrounding the Earth.">

    <p>At Mozilla, we’re a global community of</p>
    
    <ul> <!-- changed to list in the tutorial -->
      <li id='myid' thing='34'>technologists</li>
      <li>thinkers</li>
      <li>builders</li>
    </ul>

    <p>working together to keep the Internet alive and accessible, so people worldwide can be informed contributors and creators of the Web. We believe this act of human collaboration across an open platform is essential to individual growth and our collective future.</p>

    <p>Read the <a href="https://www.mozilla.org/en-US/about/manifesto/">Mozilla Manifesto</a> to learn even more about the values and principles that guide the pursuit of our mission.</p>
  </body>
</html>
"""

In [9]:
with open('test.html', 'w') as f:
    f.write(html_doc)

In [10]:
!rm styles/style.css

rm: styles/style.css: No such file or directory


In [11]:
import webbrowser
from pathlib import Path
# Path.as_uri() builds a valid file:// URL on every platform
url = Path('test.html').resolve().as_uri()
webbrowser.open(url)

True

In [12]:
css_doc = """html {
  font-size: 10px; /* px means 'pixels': the base font size is now 10 pixels high  */
}

html {
  background-color: #00539F;
}

h1 {
  font-size: 60px;
  text-align: center;
}

p, li {
  font-size: 16px;    
  line-height: 2;
  letter-spacing: 1px;
}

body {
  width: 600px;
  margin: 0 auto;
  background-color: #FF9500;
  padding: 0 20px 20px 20px;
  border: 5px solid black;
}

#myid {
color : red;
}

h1 {
  margin: 0;
  padding: 20px 0;    
  color: #00539F;
  text-shadow: 3px 3px 1px black;
}

img {
  display: block;
  margin: 0 auto;
}

a:hover {
    font-size : 60px;
}"""

In [13]:
import os
os.makedirs('styles', exist_ok=True)
with open('styles/style.css', 'w') as f:
    f.write(css_doc)

In [14]:
webbrowser.open(url)

True

Browsers:
- read the HTML document,
- parse it into a DOM (Document Object Model) structure,
- then render the DOM structure.

The DOM is an agreed-upon standard.

Thanks to this tree-like model, we can select specific elements explicitly.

```html
<html>
    <body>
        <h1>Title</h1>
        <p>A <em>world</em></p>
    </body>
</html>
```

<img src="./images/treeStructure.png">

In [15]:
html_doc = """
<html>
    <body>
        <h1>Title</h1>
        <p>A <em>world</em></p>
    </body>
</html>
"""

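To make the tree structure concrete, here is a minimal sketch that prints the small document above as an indented outline. It uses `html.parser.HTMLParser` from the standard library rather than a browser's DOM; the class name `TreePrinter` is an assumption made for illustration, and the sketch assumes every opened tag is also closed.

```python
from html.parser import HTMLParser


class TreePrinter(HTMLParser):
    """Collect an indented outline of the elements and text of a document."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1  # children of this tag are one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only text nodes
            self.lines.append("  " * self.depth + repr(text))


html_doc = """
<html>
    <body>
        <h1>Title</h1>
        <p>A <em>world</em></p>
    </body>
</html>
"""

printer = TreePrinter()
printer.feed(html_doc)
printer.close()
print("\n".join(printer.lines))
```

Each level of indentation corresponds to one level of nesting: `em` is a child of `p`, which is a child of `body`.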
## Beautiful Soup

[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work. The current supported version of Beautiful Soup is version 4.

To install:

```bash
pip install beautifulsoup4
```

### Usage

Right after the installation, you can start using BeautifulSoup. At the beginning of your Python script, import the library:

```python
from bs4 import BeautifulSoup
```

Now you have to pass something to BeautifulSoup to create a soup object. That can be a string of markup or an open file handle, but not a URL: BeautifulSoup does not fetch the web page for you, so you have to do that yourself. Libraries such as `urllib.request` or `requests` can be used.

```python
import requests
```

**Parser**

Beautiful Soup supports the HTML parser included in Python’s standard library (`html.parser`), but it also supports a number of third-party Python parsers, such as lxml and html5lib. Depending on your setup, you might install one of them with:

```bash
pip install lxml
```
or

```bash
pip install html5lib
```
 

In [16]:
import requests
#import urllib3
from bs4 import BeautifulSoup


### Filtering

For example, we have the following HTML doc:

```python
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
```

We can pass filters to methods such as `find_all`. A filter can be based on a tag’s name, on its attributes, on the text of a string, or on some combination of these.

In [17]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')
print(soup)

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>


In [18]:
# for more readable output, use prettify() to tidy up the markup
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


In [19]:
print(soup.title.parent.name)


head


In [20]:
# get title of the webpage

print(soup.title)
print()
print(soup.title.name)
print()
print(soup.title.text)
print()
print(soup.title.parent.name)

<title>The Dormouse's story</title>

title

The Dormouse's story

head


In [21]:
#get the whole content
print(soup.contents)

[<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>]


In [22]:
type(soup.contents)

list

In [23]:
type(soup.contents[0])

bs4.element.Tag

In [24]:
#get the body
print(soup.body)

<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>


In [25]:
print(soup.p)

<p class="title"><b>The Dormouse's story</b></p>


In [26]:
# get the p tag
print(soup.p)
print()
print(soup.p['class'])

<p class="title"><b>The Dormouse's story</b></p>

['title']


In [27]:
# get the a tag
print(soup.a)

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>


In [28]:
# use find_all() to get all the <a> tags

soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [29]:
# use find_all() to get all the <p> tags
ptags = soup.find_all('p')

In [30]:
len(ptags)

3

In [31]:
ptags

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [32]:
# use find_all() to get all the <a> tags with class 'sister'
soup.find_all('a', {'class':'sister'})

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [33]:
# use find_all() to get all the <a> tags with class 'sister' and id 'link1'
soup.find_all('a', {'class':'sister', 'id':'link1'})

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

In [34]:
# find id=link3
print(soup.find(id="link3"))

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>


In [35]:
print(soup.find_all('a'))

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


In [36]:
for link in soup.find_all('a'):
    print(type(link), link)

<class 'bs4.element.Tag'> <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<class 'bs4.element.Tag'> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<class 'bs4.element.Tag'> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>


In [37]:
# One common task is extracting all the URLs found within a page’s <a> tags:
for link in soup.find_all('a'):
    print(link.get('href'))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie


In [38]:
# print the text of each link
for link in soup.find_all('a'):
    print(link.get_text())

Elsie
Lacie
Tillie


In [39]:
# Another common task is extracting all the text from a page:
print(soup.get_text())

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...



In [40]:
print(soup.find('body').get_text())


The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...



## Regular expression

If we pass in a regular expression object, Beautiful Soup filters against
that regular expression using its `search()` method.

This code finds all the tags whose names start with the letter "b",
in this case, the 'body' tag and the 'b' tag:

In [41]:
print(soup.contents)

[<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>]


In [42]:
# find all the tags whose names start with 'b'
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)


body
b


In [43]:
#finds all the tags whose names contain the letter "t"
for tag in soup.find_all(re.compile("t")):
    print(tag.name)

html
title


In [44]:
# a plain string must match the whole tag name, so "t" matches nothing here
for tag in soup.find_all("t"):
    print(tag.name)

In [45]:
soup.find_all?

## List

If we pass in a list, Beautiful Soup matches a tag against any
item in that list.

This code finds all the 'p' tags and all the 'b' tags:

In [46]:
# find all the <p> and <b> tags
print(soup.find_all(["p", "b"]))

[<p class="title"><b>The Dormouse's story</b></p>, <b>The Dormouse's story</b>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]


## Keyword Arguments


In [47]:
# find id='link2'
print(soup.find_all('a', attrs={'id':'link2'}))

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]


Any argument that’s not recognized will be turned into a filter on one of a tag’s attributes.

In [48]:
soup.find_all?

In [49]:
# find the tags whose href ends with 'ie'
#for e in soup.find_all(href=re.compile("ie$")):
for e in soup.find_all(attrs={'href': re.compile("ie$")}):
    print(e)


<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

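The rule stated in this section, that an unrecognized keyword argument becomes a filter on the attribute of the same name, can be sketched as follows. The example rebuilds a shortened copy of the sisters markup and uses the built-in `html.parser` so that it runs without lxml; that parser choice is an assumption made for portability.

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# `id` is not a named parameter of find_all, so it filters on the id attribute.
by_id = soup.find_all(id="link2")

# A keyword filter accepts a compiled regular expression as well as a string.
by_href = soup.find_all(href=re.compile("lacie|tillie"))

# `class` is a reserved word in Python, so Beautiful Soup uses `class_` instead.
by_class = soup.find_all("a", class_="sister")

print([a.get_text() for a in by_id])    # ['Lacie']
print([a.get_text() for a in by_href])  # ['Lacie', 'Tillie']
print(len(by_class))                    # 3
```

The `attrs={...}` form used in the cells above is equivalent and remains necessary for attribute names that are not valid Python identifiers, such as `data-price`.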

## Multi-valued Attributes

HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is `class` (that is, a tag can have more than one CSS class). Others include `rel`, `rev`, `accept-charset`, `headers`, and `accesskey`. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:

In [50]:
print(soup.p['class'])


['title']


In [51]:
#search by CSS class

#print(soup.find_all("a", class_="sister"))
print(soup.find_all("a", attrs={'class':"sister"}))

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

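The sisters document only contains single-class tags, so the list behaviour is easy to miss. The sketch below uses a made-up one-line fragment (an assumption for illustration) to show a tag with two classes, and an `id` attribute that is not multi-valued.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two CSS classes and an id containing a space.
soup = BeautifulSoup('<p class="body strikeout" id="my id">text</p>', "html.parser")

classes = soup.p["class"]  # multi-valued attribute: split into a list
tag_id = soup.p["id"]      # single-valued attribute: kept as one string
print(classes)             # ['body', 'strikeout']
print(tag_id)              # my id

# Searching by class matches if ANY of the tag's classes matches.
matches = soup.find_all("p", class_="strikeout")
print(len(matches))        # 1
```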

## Navigating the Parse Tree

If you want to know how to navigate the tree, please see the official [documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree).

There you can read about the following:

**Going down**
* Navigating using tag names
* `.contents` and `.children`
* `.descendants`
* `.string`
* `.strings` and `.stripped_strings`

**Going up**
* `.parent`
* `.parents`

**Going sideways**
* `.next_sibling` and `.previous_sibling`
* `.next_siblings` and `.previous_siblings`

**Going back and forth**
* `.next_element` and `.previous_element`
* `.next_elements` and `.previous_elements`

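As a quick taste of these attributes, here is a minimal sketch on a compact version of the sisters document. The markup is written without whitespace between tags (an assumption made to keep sibling navigation simple, since whitespace would otherwise show up as text nodes), and it uses the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

html_doc = (
    "<html><head><title>The Dormouse's story</title></head>"
    '<body><p class="story">'
    '<a id="link1">Elsie</a><a id="link2">Lacie</a>'
    "</p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")
first_link = soup.a

# Going down
print(soup.head.contents)             # [<title>The Dormouse's story</title>]
print(soup.title.string)              # The Dormouse's story
print(list(soup.p.stripped_strings))  # ['Elsie', 'Lacie']

# Going up
parent_names = [parent.name for parent in first_link.parents]
print(parent_names)                   # ['p', 'body', 'html', '[document]']

# Going sideways
print(first_link.next_sibling["id"])  # link2
print(first_link.previous_sibling)    # None

# Going back and forth: the next element parsed after <a> is its own text
print(repr(first_link.next_element))  # 'Elsie'
```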

"Scraping" means getting unstructured data and turning it into something usable. We'll primarily focus on _web scraping_.

The basic workflow is:

1. Find the data you want on the web.
2. Inspect the webpage and figure out how to select the content you want. This usually involves some combination of
    - Viewing the source code of the page (especially if it is simple), and
    - Figuring out the structure of the HTML parse tree. This step is much easier with something like __Chrome Developer Tools__.
3.  Write code to get out what you want:
    - If the page is very simple, treat it as a bunch of text => __string manipulation / [regular expressions](https://docs.python.org/2/howto/regex.html)__ in Python.
    - To have a more robust solution, it is better to use the HTML parse tree => __[BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) / [lxml](http://lxml.de/lxmlhtml.html)__ in Python.
4.  Make sure it worked!
5.  If your crawling problem is at all non-trivial, you will now have to go back to Step 2 to zoom in further -- or you'll have parsed the URL of a link you want to follow, in which case you'll go back to Step 1 to figure out how to parse what you want from the new target page.

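For the "very simple page" case of Step 3, plain regular expressions can be enough. The sketch below runs on a made-up fragment (the markup and file names are assumptions for illustration) and extracts each link target together with its text.

```python
import re

# Hypothetical, very regular markup.
page = (
    '<ul><li><a href="/a.pdf">Slides A</a></li>'
    '<li><a href="/b.pdf">Slides B</a></li></ul>'
)

# Capture the href value and the link text of every <a> element.
link_pattern = re.compile(r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>')
links = link_pattern.findall(page)
print(links)  # [('/a.pdf', 'Slides A'), ('/b.pdf', 'Slides B')]
```

This pattern breaks as soon as the attribute order changes, single quotes are used, or tags are nested inside the link, which is why the parse-tree approach is the more robust choice.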
## Exercise
Given the following page from https://thecadsdata.github.io/EDS3-2017/, extract all the file names and their links.


In [52]:
eds_web_page = """<!doctype html>
<html lang="en-US">
  <head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="chrome=1">

<!-- Begin Jekyll SEO tag v2.3.0 -->
<title>EDS 3, term January 2018 | Enterprise Data Science training by the Center of Applied Data Science. You need to be logged in on GitHub and enrolled in the class to have access to the content.</title>
<meta property="og:title" content="EDS 3, term January 2018" />
<meta property="og:locale" content="en_US" />
<meta name="description" content="Enterprise Data Science training by the Center of Applied Data Science. You need to be logged in on GitHub and enrolled in the class to have access to the content." />
<meta property="og:description" content="Enterprise Data Science training by the Center of Applied Data Science. You need to be logged in on GitHub and enrolled in the class to have access to the content." />
<link rel="canonical" href="https://thecadsdata.github.io/EDS3-2017/" />
<meta property="og:url" content="https://thecadsdata.github.io/EDS3-2017/" />
<meta property="og:site_name" content="EDS 3, term January 2018" />
<script type="application/ld+json">
{"name":"EDS 3, term January 2018","description":"Enterprise Data Science training by the Center of Applied Data Science. You need to be logged in on GitHub and enrolled in the class to have access to the content.","author":null,"@type":"WebSite","url":"https://thecadsdata.github.io/EDS3-2017/","image":null,"publisher":null,"headline":"EDS 3, term January 2018","dateModified":null,"datePublished":null,"sameAs":null,"mainEntityOfPage":null,"@context":"http://schema.org"}</script>
<!-- End Jekyll SEO tag -->


    <link rel="stylesheet" href="/EDS3-2017/assets/css/style.css?v=4f4deb47c0b30979dbd649c944714c240bb287f5">
    <meta name="viewport" content="width=device-width">
    <!--[if lt IE 9]>
    <script src="//html5shiv.googlecode.com/svn/trunk/html5.js"></script>
    <![endif]-->
  </head>
  <body>
    <div class="wrapper">
      <header>
        <h1>EDS 3, term January 2018</h1>
        <p>Enterprise Data Science training by the Center of Applied Data Science. You need to be logged in on GitHub and enrolled in the class to have access to the content.</p>

        
          <p class="view"><a href="https://github.com/TheCadsData/EDS3-2017">View the Project on GitHub <small></small></a></p>
        

        

        
      </header>
      <section>

      <p><a href="http://thecads.org/">
<img border="0" alt="The Center of Applied Data Science" src="The_Cads_FullColor_logo.jpg" />
</a></p>

<h1 id="enterprise-data-science-eds">Enterprise Data Science (EDS)</h1>
<p>September 2017 to January 2018</p>

<h2 id="forum">Forum</h2>

<p>Use the <a href="http://piazza.com/the_center_of_applied_data_science/winter2018/eds3">Piazza Q&amp;A if you have any question</a>.</p>

<h2 id="code-share">Code share</h2>

<p><a href="https://codeshare.io/5ZbL1g">https://codeshare.io/5ZbL1g</a></p>

<h2 id="content">Content</h2>

<h3 id="day-1-introduction-to-data-science">Day 1: Introduction to Data Science</h3>
<ul>
  <li><a href="https://github.com/TheCadsData/EDS3-2017/raw/master/Day1-Introduction-to-Data-Science/01-Lec01.pdf">Introduction to Data Science</a></li>
  <li><a href="https://github.com/TheCadsData/EDS3-2017/raw/master/Day1-Introduction-to-Data-Science/03-Intro-to-Anaconda.pdf">Introduction to Anaconda</a></li>
  <li><a href="https://github.com/TheCadsData/EDS3-2017/raw/master/Day1-Introduction-to-Data-Science/04-Intro-to-Jupyter.pdf">Introduction to Jupyter</a>. Jupyter <a href="https://github.com/TheCadsData/EDS3-2017/blob/master/Day1-Introduction-to-Data-Science/Jupyter-Notebook-Tutorial.ipynb">notebook</a> tutorial.</li>
  <li><a href="https://github.com/TheCadsData/EDS3-2017/raw/master/Day1-Introduction-to-Data-Science/intro_to_git.pdf">Introduction to Git</a></li>
</ul>

<h3 id="programming-and-python">Programming and Python</h3>
<h4 id="days-2-and-3">Days 2 and 3</h4>
<ul>
  <li><a href="https://github.com/TheCadsData/EDS3-2017/raw/master/Day2-Python/intro_to_python.pdf">Slides about Python and Programming</a></li>
  <li>Lecture notebooks:
    <ul>
      <li><a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j7942wrn4m85rh">Python01.ipynb</a></li>
      <li><a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j7942x448qh5ro">Python02.ipynb</a></li>
    </ul>
  </li>
  <li>Exercises notebooks:
    <ul>
      <li><a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j79445jtlmx6bo">Exercise01.ipynb</a></li>
      <li><a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j79445tyblz6bv">Exercise02.ipynb</a>. Solution: <a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j7u2s1iyx6ve2">Exercise02-solution.ipynb</a></li>
      <li><a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j7944ivu6v76hu">Challenge questions</a></li>
    </ul>
  </li>
</ul>

<h4 id="days-4-and-5">Days 4 and 5</h4>
<ul>
  <li><a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j7g82a3atvr4ey">Python03-list.ipynb</a></li>
  <li><a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j794h0i4tbe3gm">Exercise03.ipynb</a></li>
  <li><a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j7g83jsgcpj50y">Python04-dict.ipynb</a></li>
  <li><a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j7g83wrrbz556n">Exercise04.ipynb</a>. Data: <a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j7g8471ishz5ap">stocks.json</a></li>
  <li><a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j7g856rfnub5pi">Error handling in Python.ipynb</a></li>
  <li><a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j7k5fzef581cc">Twitter assignment</a>. Example: <a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j7lmxoihav72t5">Ludovic’s analysis, draft!</a></li>
  <li><a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j7lbzaodhhn4k4">Python05-fileIO</a></li>
  <li><a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j7lbza6bker4jv">Python05-b-error-handling</a></li>
</ul>

<h4 id="days-6-and-7">Days 6 and 7</h4>
<ul>
  <li><a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j7romoh9ksg5dm">Python06-NumPy.ipynb</a>
And the <a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j7ron0fagto5iw">broadcating schema image</a> to be saved in a folder images/</li>
  <li><a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j7spzxkxqho6m2">Exercise05-NumPy.ipynb</a></li>
  <li><a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j7t4fvtgt7q37k">Python07-Pandas.ipynb</a></li>
  <li><a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j7t4g1u4ecb3ad">Exercise06-Pandas.ipynb</a>. Solutions: <a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j85bmprjh082gh">Exercise06solutions.ipynb</a></li>
</ul>

<h4 id="days-8-9-and-10">Days 8, 9 and 10</h4>

<ul>
  <li><a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j82wcfagpkp2wd">Python08-Pandas.ipynb</a></li>
  <li><a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j82wcmqzaot2yo">Exercise07-Pandas-olympics.ipynb</a> One solution: <a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j8715w91acd3aw">Exercise07-Pandas-olympics-solutions.ipynb</a></li>
  <li><a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j82wbqt5jc02qu">Data.zip</a> to be uncompressed in same folder.</li>
  <li><a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j85i9jkirey5gs">Exercise08-Pandas.ipynb</a></li>
</ul>

<h5 id="capstone-project-presentation-slides">Capstone project: <a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j86s7tffhk57c8">Presentation slides</a></h5>

<h4 id="days-11-12-visualization">Days 11, 12: Visualization</h4>
<ul>
  <li><a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j8d5qbiufnn27p">Python09-matplolib.ipynb</a></li>
  <li>Exercises:
    <ul>
      <li><a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j8d5rsi0mn02yj">Exercise09-matplotlib.ipynb</a>. One solution: <a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j8fj94xuqsj1h0">Exercise09-matplotlib-solution.ipynb</a></li>
      <li>Weather data for Exercise09: <a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j8d5olnn67rxb">weather.csv</a>, <a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j8d5ogwn8qpui">weather_readme.txt</a></li>
      <li><a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j8d5s0mic4w32i">Exercise10-matplotlib.ipynb</a></li>
      <li>Images for Exercise10: <a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j8d5t8aa2rl6gj">images.zip</a></li>
    </ul>
  </li>
  <li>Slides: <a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j8egfy3skm3155">Data-visualization.pdf</a></li>
</ul>

<h3 id="days-13-14-15-regular-expressions-web-scraping">Days 13, 14, 15: Regular expressions, Web scraping</h3>
<ul>
  <li><a href="https://piazza.com/class_profile/get_resource/j75xpfjh2fw71d/j8mvu28qtd044d">Python10-regular-expression</a></li>
</ul>

<h2 id="schedule">Schedule</h2>
<h4 id="september-10-days">September (10 Days):</h4>
<ul>
  <li>6th - 8th (Wednesday - Friday)</li>
  <li>14th - 15th (Thursday - Friday)</li>
  <li>20th - 21st (Wednesday - Thursday)</li>
  <li>28th - 30th (Thursday - Saturday)</li>
</ul>

<h4 id="october-8-days">October (8 Days):</h4>
<ul>
  <li>5th - 6th (Thursday - Friday)</li>
  <li>12th - 14th (Thursday - Saturday)</li>
  <li>26th - 28th (Thursday - Saturday)</li>
</ul>

<h4 id="november-13-days">November (13 Days):</h4>
<ul>
  <li>2nd - 4th (Thursday - Saturday)</li>
  <li>9th - 10th (Thursday - Friday)</li>
  <li>16th - 18th (Thursday - Saturday)</li>
  <li>22nd - 24th (Wednesday - Friday)</li>
  <li>29th - 30th (Wednesday - Thursday)</li>
</ul>

<h4 id="december-9-days">December (9 Days):</h4>
<ul>
  <li>7th - 9th (Thursday - Saturday)</li>
  <li>14th - 16th (Thursday - Saturday)</li>
  <li>20th - 22nd (Wednesday - Friday)</li>
</ul>

<h4 id="january-2018-2-days">January 2018 (2 Days):</h4>
<ul>
  <li>11th - 12th (Thursday - Friday)</li>
</ul>


      </section>
      <footer>
        
        <p>This project is maintained by <a href="https://github.com/TheCadsData">TheCadsData</a></p>
        
        <p><small>Hosted on GitHub Pages &mdash; Theme by <a href="https://github.com/orderedlist">orderedlist</a></small></p>
      </footer>
    </div>
    <script src="/EDS3-2017/assets/js/scale.fix.js"></script>


  
  </body>
</html>
"""

In [53]:
# your code here


## What about a bigger web page?

https://www.lelong.com.my/

### Chrome Developer Tools

1. Launch Chrome
1. Go to https://www.lelong.com.my/catalog/all/list?TheKeyword=phone
1. To launch Chrome DevTools:
    1. Click on 
        - $\vdots$ in the upper right corner,
        - More Tools,
        - Developer Tools.
    1. Or use the shortcut: Command+Option+I on Mac, or F12 or Control+Shift+I on Windows / Linux.

#### The following code shows web scraping step by step

To find where the information we are interested in sits in the page, we use Chrome Developer Tools, specifically the "Inspect" feature.

#### Step 1
Get the data

In [54]:
url = "https://www.lelong.com.my/catalog/all/list?TheKeyword=phone"

In [55]:
import requests

In [56]:
url='https://www.lelong.com.my/catalog/all/list'
response = requests.get(url, params={'TheKeyword':'phone'})


In [57]:
response.status_code

200

In [58]:
response.text[:500]

'    <!--Canonical Link-->\r\n    <!--Meta-->\r\n    <!--Chat-->\r\n\r\n\r\n<!DOCTYPE html>\r\n<html lang="en">\r\n<head>\r\n    <meta http-equiv="content-type" content="text/html; charset=iso-8859-1">\r\n    \r\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\r\n    <meta name="google-translate-customization" content="2f26647729bf248b-f2b32987adcfc8b8-gbe489916db8e3e69-11">\r\n    <title>Phone price, harga in Malaysia - telefon bimbit</title>\r\n    <meta name="description" content="Phone Malay'

#### Step 2
Parse the data, get the results

In [59]:
soup = BeautifulSoup(response.text, 'lxml')

In [60]:
results = soup.find_all('div', attrs={'class':'summary'})
len(results)

60

#### Step 3
Take one specific result to see how to extract different features from this product. Later, we will loop over all results

In [61]:
product = results[42]

In [62]:
product.find('b').get('data-price')

'21.00'

In [63]:
product.find('b').get('data-link')

'//www.lelong.com.my/-220441052-2023-02-Sale-P.htm'

In [64]:
product.find('b').text

'【B000】(Select 1 colour) Phone Casing (Ori RM 30)'

OK, we are able to extract the price, link and title for this product. We should be able to extract the same information for all the products.
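Note that the extracted link starts with `//`, which means it is protocol-relative: it has a host but no scheme. One possible way to turn it into a full URL is `urllib.parse.urljoin` from the standard library, which borrows the scheme from the URL of the page that was scraped. A minimal sketch (the variable names are illustrative):

```python
from urllib.parse import urljoin

page_url = "https://www.lelong.com.my/catalog/all/list"
data_link = "//www.lelong.com.my/-220441052-2023-02-Sale-P.htm"

# urljoin keeps the host of data_link and takes the scheme from page_url.
absolute_link = urljoin(page_url, data_link)
print(absolute_link)
# https://www.lelong.com.my/-220441052-2023-02-Sale-P.htm
```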

#### Step 4
Loop
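The extraction from Step 3 can be repeated for every result with a `for` loop. The sketch below runs on a small, hypothetical HTML snippet standing in for the live page (the `example.com` links and the `rows` list are assumptions made for illustration); in the notebook, `results` would come from Step 2 instead:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for response.text, with the same tag structure.
sample_html = """
<div class="summary"><b data-price="21.00" data-link="//example.com/item-1">Phone casing</b></div>
<div class="summary"><b data-price="35.50" data-link="//example.com/item-2">Screen protector</b></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")
results = soup.find_all("div", attrs={"class": "summary"})

rows = []
for product in results:
    b_element = product.find("b")               # the <b> tag carries all three fields
    price = float(b_element.get("data-price"))  # attribute values are strings
    link = b_element.get("data-link")
    title = b_element.text
    rows.append([title, price, link])

for row in rows:
    print(row)
```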

#### Step 5 
Save the data in CSV

From the [documentation](https://docs.python.org/3/library/csv.html#csv.writer):

```python
import csv
with open('eggs.csv', 'w', newline='') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=' ',
                            quotechar='|', quoting=csv.QUOTE_MINIMAL)
    spamwriter.writerow(['Spam'] * 5 + ['Baked Beans'])
    spamwriter.writerow(['Spam', 'Lovely Spam', 'Wonderful Spam'])
```

In [65]:
import csv

results = soup.find_all('div', attrs={'class':'summary'})

with open('phones_lelong.csv', 'w', encoding='utf-8', newline='') as csvfile:
    lelongwriter = csv.writer(csvfile)
    for product in results:
        b_element = product.find('b')
        price = float(b_element.get('data-price'))
        url = b_element.get('data-link')
        title = b_element.text
        lelongwriter.writerow([title, price, url])
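Scraped titles often contain non-ASCII characters, commas or quotes. Opening the file with an explicit `encoding` and with `newline=''` (as the `csv` documentation recommends) lets the `csv` module handle these cases. The sketch below writes two rows to a temporary file and reads them back to check that nothing was lost; the second row and the header names are made up for illustration:

```python
import csv
import os
import tempfile

rows = [
    ["【B000】(Select 1 colour) Phone Casing (Ori RM 30)", 21.0,
     "//www.lelong.com.my/-220441052-2023-02-Sale-P.htm"],
    ['Case, "slim" edition', 35.5, "//example.com/item-2"],  # hypothetical row
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "phones_sample.csv")

    # Write a header row followed by the data rows.
    with open(path, "w", encoding="utf-8", newline="") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["title", "price", "url"])
        writer.writerows(rows)

    # Read the file back as a list of dictionaries keyed by the header.
    with open(path, encoding="utf-8", newline="") as csvfile:
        read_back = list(csv.DictReader(csvfile))

print(len(read_back))
print(read_back[1]["title"])
```

Values come back as strings, so the price has to be converted with `float` again after reading.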

We have to add **sleeps** between requests so that we do not overload the website we are crawling and get blacklisted by it.

In [66]:
import time
for i in range(10):
    print(i)
    time.sleep(1)

0
1
2
3
4
5
6
7
8
9
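A fixed `time.sleep(1)` always waits a full second, even when parsing the previous page already took some time. One possible refinement, sketched below, is a small helper that only sleeps for whatever remains of a minimum interval between two requests. The `Throttle` class is a hypothetical helper written for this example, not part of `requests` or the standard library:

```python
import time


class Throttle:
    """Ensure at least `min_interval` seconds between successive calls to wait()."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last_call = None

    def wait(self):
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)  # sleep only for the time still owed
        self._last_call = time.monotonic()


throttle = Throttle(min_interval=0.2)  # short interval, for demonstration only
start = time.monotonic()
for i in range(3):
    throttle.wait()  # a real crawler would send its request right after this
    print(i)
elapsed = time.monotonic() - start
print("elapsed: %.2f s" % elapsed)
```

The first call returns immediately, so three calls with a 0.2 s interval take roughly 0.4 s in total.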


### Result: everything together

In [67]:
lelong_url='https://www.lelong.com.my/catalog/all/list'

with open('phones_lelong.csv', 'w', encoding='utf-8', newline='') as csvfile:
    lelongwriter = csv.writer(csvfile)
    for page in range(1, 11):
        print("Querying page %s..." % page)
        response = requests.get(lelong_url, params={'TheKeyword':'phone', 'D': page})
        print('Got page %s' % page)
        soup = BeautifulSoup(response.text, 'lxml')
        results = soup.find_all('div', attrs={'class':'summary'})
        for product in results:
            b_element = product.find('b')
            price = float(b_element.get('data-price'))
            url = b_element.get('data-link')
            title = b_element.text
            lelongwriter.writerow([title, price, url])
        print('Sleeping...')
        time.sleep(1)
        print('Waking up!')


Querying page 1...
Got page 1
Sleeping...
Waking up!
Querying page 2...
Got page 2
Sleeping...
Waking up!
Querying page 3...
Got page 3
Sleeping...
Waking up!
Querying page 4...
Got page 4
Sleeping...
Waking up!
Querying page 5...
Got page 5
Sleeping...
Waking up!
Querying page 6...
Got page 6
Sleeping...
Waking up!
Querying page 7...
Got page 7
Sleeping...
Waking up!
Querying page 8...
Got page 8
Sleeping...
Waking up!
Querying page 9...
Got page 9
Sleeping...
Waking up!
Querying page 10...
Got page 10
Sleeping...
Waking up!
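The loop above assumes that every result contains a `<b>` tag with a numeric `data-price`. On a live site some entries may not follow this pattern, in which case `product.find('b')` returns `None` or `float(...)` raises a `ValueError`, and the whole crawl stops. A more defensive variant could skip such entries. The sketch below uses a hypothetical `extract_row` helper and a made-up HTML snippet to illustrate the idea:

```python
from bs4 import BeautifulSoup


def extract_row(product):
    """Return [title, price, url] for one result, or None if data is missing."""
    b_element = product.find("b")
    if b_element is None:
        return None
    raw_price = b_element.get("data-price")
    link = b_element.get("data-link")
    if raw_price is None or link is None:
        return None
    try:
        price = float(raw_price)
    except ValueError:
        return None  # price attribute present but not numeric
    return [b_element.text.strip(), price, link]


# Hypothetical results: one valid product and two malformed entries.
sample_html = """
<div class="summary"><b data-price="21.00" data-link="//example.com/item-1">Phone casing</b></div>
<div class="summary"><span>Banner without product data</span></div>
<div class="summary"><b data-price="N/A" data-link="//example.com/item-3">Unpriced item</b></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for product in soup.find_all("div", class_="summary"):
    row = extract_row(product)
    if row is not None:
        rows.append(row)

print(rows)
# [['Phone casing', 21.0, '//example.com/item-1']]
```

In the same spirit, calling `response.raise_for_status()` before parsing makes the crawl fail loudly on 4xx or 5xx responses instead of silently parsing an error page.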


## Do it yourself: Web Scraping

The goal of this mini-project is to scrape data from e-commerce or other websites such as
Lelong, Lazada, Mudah, iProperty, Booking, Expedia etc.

Scrape at least 1000 items from one of the websites mentioned above. The scraped data should include:
- Product Name/Product Title
- Amount/Price 
- Brand
- Comments/Reviews
- Number of views


In addition, you are required to load the scraped data into a dataframe and also save a
copy in CSV format.
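As a starting point, scraped rows can be loaded into a pandas `DataFrame` and written out as CSV. The sketch below assumes pandas is installed and uses made-up rows and column names; it writes to an in-memory buffer, whereas passing a file name to `to_csv` would save to disk:

```python
import io

import pandas as pd

# Hypothetical scraped rows: [title, price, url]
rows = [
    ["Phone casing", 21.0, "//example.com/item-1"],
    ["Screen protector", 35.5, "//example.com/item-2"],
]

df = pd.DataFrame(rows, columns=["title", "price", "url"])

# Write the dataframe as CSV into an in-memory text buffer.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

print(df)
print(buffer.getvalue())
```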

Once the data has been loaded into a dataframe, you are required to carry out a data
analysis on it.

Your analysis should provide answers to the following questions: 
* What do you think is interesting about this data? 
* Tell a story about some interesting thing you have discovered by looking at the data. 
* Visualize your data with matplotlib or with the folium library.

For example, you might consider whether prices differ at different times of the day or
between cities, or whether other factors influence the prices. Another thing you
might consider is whether there is a relationship between the price and the number of reviews or
comments.

Put your analysis workflow in your Jupyter notebook.