<p><a name="sections"></a></p>
<br>
<br>

# Sections

- <a href="#structures">Data Structures Review</a><br>

- <a href="#functions">Functions and Methods</a><br>

- <a href="#intro">Introduction to Beautiful Soup</a><br>
    - <a href="#web">What is Web Scraping?</a><br>
    - <a href="#html">Introduction to HTML</a><br>
    - <a href="#beautiful">Basics of Beautiful Soup</a><br>

- <a href="#example">Example</a><br>
    - <a href="#yelp">Scraping Yelp Reviews</a><br>

<p><a name="structures"></a></p>

## Data Structures Review

**Everything is an object** in Python. That is the beauty of object-oriented programming. Objects are designed in specific ways to _organize and handle_ data.

A **data structure** is an object format that enables efficient access and modification of data. There are many types of data structures in Python, but two are especially important for web scraping: lists and dictionaries.

### List

- A list is an ordered collection of values. Examples: `[1,2,3,4]` and `[7.5, 'hello', None, [1,2,3]]`
- Lists can contain any type of object. This means they can contain integers, strings, and even other lists!
- Elements in a list do not necessarily all need to be the same data type.

We define a list by wrapping a collection of zero or more elements, separated by commas, in square brackets.

In [None]:
my_list = [1,2,3,4]

Note that we can also create a list with no elements. This is extremely useful - sort of like the number 0.

In [None]:
empty_list = []

We can subscript a list to extract specific elements within it. Python uses zero indexing, meaning indices start at 0 and count up.

In [None]:
my_list[0] # First element, returns 1

In [None]:
my_list[3] # Fourth element, returns 4

There are many other operations we can do on lists, but this is enough to get us started.
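
As a small sketch of a few of those other operations (the variable name `numbers` is only for illustration), the snippet below shows `len()`, `append()`, negative indexing, and slicing:

```python
numbers = [1, 2, 3, 4]

print(len(numbers))   # number of elements in the list
numbers.append(5)     # add one element to the end (modifies the list in place)
print(numbers[-1])    # negative indices count from the end of the list
print(numbers[1:3])   # slicing returns a new list holding indices 1 and 2
```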

### Dictionary

- A dictionary in Python is like a list in that it stores values. However, a dictionary stores **key and value pairs**.
- You can think of the keys in a dictionary as the indices of the values, but keys can be data types other than integers.
- The most common data type used for dictionary keys is string.

We define a dictionary by wrapping a collection of key and value pairs, separated by commas, in curly brackets.

In [None]:
my_dict = {'key_one': 1, 'key_two': 2, 'key_three': 'three'}

- The dictionary above contains three keys and each key has a single value, an `int` or a `str`.
- We can extract a value by referencing the key associated with that value.

In [None]:
my_dict['key_two'] # Returns 2

If we try to access a key that doesn't exist in the dictionary, Python will raise a `KeyError`.

In [None]:
my_dict['unknown_key']

We can add a new key by using the syntax above, except we assign a value to the new key.

In [None]:
my_dict['key_four'] = 'new key'
my_dict
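
When a key might be missing, a few built-in tools avoid the `KeyError`. The sketch below rebuilds the same dictionary so that it runs on its own. It shows `get()`, the `in` operator, and looping over `items()`:

```python
my_dict = {'key_one': 1, 'key_two': 2, 'key_three': 'three'}

# get() returns None (or a default we choose) instead of raising KeyError
missing = my_dict.get('unknown_key')
fallback = my_dict.get('unknown_key', 0)

# The in operator checks whether a key exists
has_key = 'key_two' in my_dict

# items() yields (key, value) pairs
for key, value in my_dict.items():
    print(key, value)

print(missing, fallback, has_key)
```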

<p><a name="functions"></a></p>

## Functions and Methods

- Functions in Python let us give names to specific tasks that we often repeat.
- When we name the operation, we can just call the name the next time we want to use it instead of writing the code all over again.
- This saves a lot of time and makes code much cleaner and easier to read.

### Functions

- There are functions everywhere in Python and they are here to help you!
- Very often you can find a function written by someone that does exactly what you're planning to do.
- It is important to note that many built-in Python functions are implemented in C (in CPython, the standard interpreter) and run much faster than equivalent functions written in pure Python. For this reason, it is almost always better to use built-in functions than to "reinvent the wheel" by implementing the code yourself.

We define a function using this syntax:

In [None]:
def my_func(argument1, argument2):
    # Function body
    return argument1 * argument2 # This is what you want back from the function

- A function can take an arbitrary number of arguments (the information we feed into the function).
- A function can have a return statement (what it spits out), but it doesn't need one.
- If a function has no return statement, it returns `None` by default (`None` is a legitimate value in Python, with its own type, `NoneType`).

In [None]:
my_func(2,3) # Returns 6 (2 times 3)
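
To illustrate the point about missing return statements, here is a minimal sketch with a hypothetical function `greet` that prints something but returns nothing:

```python
def greet(name):
    """Print a greeting; note that there is no return statement."""
    print('Hello, ' + name + '!')

result = greet('world')   # the greeting is printed while the function runs
print(result is None)     # the call itself evaluates to None
```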

- Note that **functions in Python are objects!**
- This means that we can pass functions as arguments to other functions.

In [None]:
type(my_func)

In [None]:
def pass_a_func(L, func):
    '''This function takes a list and a function
    as arguments and performs the function on the list
    and returns the result'''
    
    return func(L)

In [None]:
lis = [1,2,3,4,5]

pass_a_func(lis, max) # Returns 5

In [None]:
pass_a_func(lis, sum) # Returns 15

### Methods

- A method is just a special kind of function.
- Python objects are defined by classes.
    - Think of a class like the blueprints of an object. A class contains all the rules and operations that an object of that class type must follow.
- Methods are functions that belong to a specific class. **The methods that belong to a certain class can only be used by objects of that class type!**

- For example, the `str` class has many different methods used for manipulating strings.
- To call a method using an object, follow this form:
    - `object.method(arguments)`

In [None]:
sentence = 'Hello world!'

sentence.upper() # Converts the characters to uppercase

- Note that this does not change the original object; it just returns a modified copy of it.

In [None]:
sentence # The original object hasn't changed

- If we try to call the `upper()` method on a list, we will get an error.

In [None]:
my_list = [1,2,3,4]

my_list.upper()

- There is no method called `upper()` that belongs to the `list` class, so a `list` object is not allowed to call that method!
- To see what methods and attributes belong to a certain object or class, use the `dir()` function.

In [None]:
dir(sentence) # upper is in this list, so we can use it!

Let's try another string method.

In [None]:
sentence.split() # This will split the string on whitespace and return a list

We can change the value of the argument `sep` if our string has different separating characters.

In [None]:
'Hello,this,sentence,is,separated,by,commas.'.split(sep = ',') # By default (sep=None), split() splits on any run of whitespace
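
Calling `split()` with no arguments is not quite the same as calling `split(sep=' ')`. The sketch below uses a made-up string (`messy`) containing repeated spaces, a tab, and a newline to show the difference:

```python
messy = 'Hello   world!\tHow are\nyou?'

default_split = messy.split()        # splits on any run of whitespace
space_split = messy.split(sep=' ')   # splits on every single space character

print(default_split)
print(space_split)
```

With an explicit single-space separator, consecutive spaces produce empty strings in the result, and tabs and newlines are not treated as separators.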

<p><a name="web"></a></p>

## What is Web Scraping?

- HTML is short for **HyperText Markup Language**. It's a language for presenting content on the Web.

- Plain text is turned into an HTML document by **tags** that are then interpreted by a browser.

- **Web scraping** means extracting data from that HTML programmatically. With Beautiful Soup, you can easily extract tag values from HTML source code.

### Beautiful Soup VS Regular Expressions

In [None]:
# the source code of hi.html
!cat data/hi.html
# Windows user
# !type data\hi.html

### Example:
- Extract the characters between the title tags. 


- In this case it's `Hi` (`<title>Hi</title>`).

- **Solution using Regular Expressions**

In [None]:
import re
hi_path = 'data/hi.html'
with open(hi_path, 'r') as f:
    hi = f.read()
    print(re.findall('<title>(.*)</title>', hi))

- **Solution using BeautifulSoup**

In [None]:
from bs4 import BeautifulSoup
with open(hi_path, 'r') as f:
    hi = f.read()
    hi = BeautifulSoup(hi, 'html.parser')
    print(hi.title) # find the title tag
    print(hi.title.string)  # get the text inside the title tag

**Compared with regular expressions:**
    
- Beautiful Soup's syntax is much simpler, while regular expressions are more flexible.
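
One practical consequence of this trade-off is that a simple pattern can be brittle when the markup varies. The sketch below uses a hypothetical variation of the title tag (an added `id` attribute and line breaks, neither of which is in `hi.html`) to show the regular expression from above finding nothing while Beautiful Soup still returns the text:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the title tag has an attribute and spans several lines
html = '''<html><head>
<title id="main">
Hi
</title>
</head><body></body></html>'''

# The pattern expects exactly "<title>" and text on a single line
regex_result = re.findall(r'<title>(.*)</title>', html)

# Beautiful Soup parses the tag regardless of attributes or line breaks
soup_result = BeautifulSoup(html, 'html.parser').title.get_text(strip=True)

print(regex_result)
print(soup_result)
```

The pattern could be adjusted to cope with this, but each new variation needs another adjustment, whereas the parser handles them uniformly.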

<p><a name="html"></a></p>

## Introduction to HTML

### Example html document
```
<!DOCTYPE html>
<html>
    <head>
        <title>Hi</title> <!--Im a comment, ignore me.-->
    </head>
    <body>
        <a href='http://www.crummy.com/software/BeautifulSoup/'>Hello, beautifulsoup!</a>
    </body>
</html>

```

### Tag

- The `<title>` tags in this example designate the enclosed text as the title to be displayed in the head of the browser tab.
![hi](pic/hi.png)
- Tags are always enclosed by `<` and `>` to distinguish them from the content. 
- A pair of tags consists of a start tag and an end tag with the same name; the name in the end tag is preceded by a slash `/`.

### Values

Values are the content between start and end tags.

- **Example:**

`<title>Hi</title>` is a title tag with a value of `Hi`.

### Attributes

Tags have another feature called attributes.

- **Example:**

`<a href='http://www.crummy.com/software/BeautifulSoup/'>Hello, beautifulsoup!</a>`

This is an anchor tag `<a>` with an attribute `href` whose value is the URL http://www.crummy.com/software/BeautifulSoup/. It creates text that links to another web address (a hyperlink).

### Tree structure
- The first tag in the example is the `<html>` tag. 

- Between the `<html>` tags, several tags are opened and closed again: `<head>`, `<title>`, `<body>`, and `<a>`.

    - The `<head>` and `<body>` tags are directly enclosed by the `<html>` tag. 
    - The `<title>` tag is enclosed by the `<head>` tag.
    - The `<a>` tag is enclosed by the `<body>` tag.


- A good way to describe the multiple layers of an HTML document is the tree analogy. 
![html](pic/html.png)

- The `html` tag is the root tag that splits into two branches, `<head>` and `<body>`; `<head>` is followed by another branch called `<title>`; `<body>` is followed by another branch called `<a>`.
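
The tree can also be explored in code. As a minimal sketch, the snippet below parses the example document from an inline string (rather than from `data/hi.html`, so that it runs on its own) and uses Beautiful Soup's `.children` and `.parent` to confirm the branches described above:

```python
from bs4 import BeautifulSoup, Tag

html = """<!DOCTYPE html>
<html>
    <head>
        <title>Hi</title>
    </head>
    <body>
        <a href='http://www.crummy.com/software/BeautifulSoup/'>Hello, beautifulsoup!</a>
    </body>
</html>"""
soup = BeautifulSoup(html, 'html.parser')

# Direct children of <html>; the whitespace between tags is also a child,
# so we keep only the children that are real tags
branches = [child.name for child in soup.html.children if isinstance(child, Tag)]
print(branches)

# Walking upward: each tag knows which tag encloses it
print(soup.title.parent.name)
print(soup.a.parent.name)
```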

<p><a name="beautiful"></a></p>

## Basics of Beautiful Soup

### Parse HTML

- The `prettify()` method adds indentation, which helps you see the tree structure of the HTML document.

In [None]:
from bs4 import BeautifulSoup
# open a local file and parse the plain text by BeautifulSoup directly
with open(hi_path, 'r') as f:
    hi = f.read()
    hi = BeautifulSoup(hi, 'html.parser')
    print(type(hi)) # get a bs4.BeautifulSoup object
    print('\n')
    print(hi.prettify())

### Names, Values, and Attributes

Beautiful Soup can extract the name, value, and attributes of a tag. The corresponding attributes of a tag object are:
- `name`
- `string`
- `attrs`

In [None]:
print("The name of the a tag is: ", hi.a.name)
print("The value of the a tag is: ", hi.a.string)
print("The attributes of the a tag are: ", hi.a.attrs)

### get_text() & get()
- For a tag that contains more than one child, the `string` attribute is ambiguous and returns `None`.

In [None]:
print(hi.html.string)

- Use the `get_text()` method instead. It extracts all the text content of the tag and its child tags.

In [None]:
print(hi.html.get_text())

- `get()` returns the value of one of a tag's attributes. For example, we can get the `href` of the `a` tag using the following code.

- It is the same as running `hi.a.attrs` and then extracting the value of key `href` from the dictionary.

In [None]:
print(hi.a.get('href'))

In [None]:
print(hi.a.attrs)
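
The two lookup styles differ when the attribute is absent. The sketch below parses just the anchor tag from an inline string (the variable names are illustrative) and asks for a `title` attribute that the tag does not have:

```python
from bs4 import BeautifulSoup

# The anchor tag from the example document, parsed on its own
snippet = "<a href='http://www.crummy.com/software/BeautifulSoup/'>Hello, beautifulsoup!</a>"
link = BeautifulSoup(snippet, 'html.parser').a

href_via_get = link.get('href')       # lookup through get()
href_via_index = link['href']         # dictionary-style lookup on the tag
missing_via_get = link.get('title')   # attribute is absent, so get() returns None

try:
    link['title']                     # dictionary-style lookup raises instead
    missing_raises = False
except KeyError:
    missing_raises = True

print(href_via_get == href_via_index, missing_via_get, missing_raises)
```

For scraping, `get()` is usually the safer choice because real pages often omit attributes on some tags.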

### find() & find_all()
- The methods `find()` and `find_all()` offer flexible ways to find tags.

In [None]:
!cat data/article.html
# Windows user
# !type data\article.html

![article](pic/article.png)

In [None]:
article_path = 'data/article.html'
with open(article_path, 'r') as f:
    article = f.read()
    article = BeautifulSoup(article, 'html.parser')

- Return only the first `p` tag.

In [None]:
print(article.p)

- `find()` returns the first `p` tag, which is equivalent to `article.p`.

In [None]:
print(article.find('p'))

- `find_all()` returns all p tags

In [None]:
print(article.find_all('p'))

- To find the tags that have specific attributes, you can pass a dictionary as the `attrs` argument.

In [None]:
print(article.find_all('h1', attrs={'id':'one'}))

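- You can also pass a function to `find_all()`; it returns a list of the tags for which the function returns `True`.
- For example, the following finds every tag whose `id` attribute equals `'one'` (the previous example additionally required the tag to be an `h1`):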

In [None]:
# the tags whose attribute id equals 'one'
print(article.find_all(lambda tag: tag.get('id') == 'one'))
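
`find_all()` also accepts attribute filters as keyword arguments. The sketch below uses a small made-up HTML string (not the contents of `data/article.html`) to compare the `attrs` dictionary with the keyword form, including `class_`, which has a trailing underscore because `class` is a reserved word in Python:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only
html = '''
<h1 id="one">First heading</h1>
<p class="intro">First paragraph</p>
<h1 id="two">Second heading</h1>
<p>Second paragraph</p>
'''
soup = BeautifulSoup(html, 'html.parser')

by_attrs = soup.find_all('h1', attrs={'id': 'one'})   # dictionary form
by_keyword = soup.find_all('h1', id='one')            # keyword form, same filter
by_class = soup.find_all('p', class_='intro')         # filter on the class attribute

print([tag.get_text() for tag in by_attrs])
print([tag.get_text() for tag in by_keyword])
print([tag.get_text() for tag in by_class])
```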

<p><a name="example"></a></p>

## Example

<p><a name="yelp"></a></p>

### Scraping Yelp Reviews
- Let's apply what we have learned to a more complicated example: scraping Yelp reviews.
- Our task is to scrape all the reviews of the ABC Kitchen restaurant on Yelp: https://www.yelp.com/biz/abc-kitchen-new-york
- You can easily extend this code to other restaurants.

#### Step 1: Find the URL pattern

In [None]:
from bs4 import BeautifulSoup
import requests

response = requests.get('https://www.yelp.com/biz/abc-kitchen-new-york')
text = BeautifulSoup(response.text, 'html.parser')

- If you go to the second page, you can see that the URL becomes https://www.yelp.com/biz/abc-kitchen-new-york?start=20
- Similarly, the URL of the third page is https://www.yelp.com/biz/abc-kitchen-new-york?start=40
- But how do we find out the URL of the last page?

In [None]:
import re
num_reviews = text.find('span', attrs={'class': 'review-count rating-qualifier'}).string
# \d+ is a regular expression pattern that matches one or more digits
num_reviews = int(re.findall(r'\d+', num_reviews)[0])
print(num_reviews)

In [None]:
url_list = []
for i in range(0, num_reviews, 20):
    url_list.append('https://www.yelp.com/biz/abc-kitchen-new-york?start='+str(i))
print(url_list[:10])
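
The same pattern can be wrapped in a function so that it is easy to reuse for another business page. This is only a sketch: `build_page_urls` and its `page_size` parameter are hypothetical names, and the page size of 20 is taken from the `?start=` offsets observed above.

```python
def build_page_urls(base_url, num_reviews, page_size=20):
    """Return one URL per page of reviews, using the ?start= offset pattern."""
    return [base_url + '?start=' + str(offset)
            for offset in range(0, num_reviews, page_size)]

# 45 reviews at 20 per page need three pages: offsets 0, 20, and 40
urls = build_page_urls('https://www.yelp.com/biz/abc-kitchen-new-york', 45)
for url in urls:
    print(url)
```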

#### Step 2: Find all the review divs on the page

In [None]:
reviews = text.find_all('div', attrs={'class':'review review--with-sidebar'})
print(len(reviews))

#### Step 3: Scrape the details

For debugging purposes, we usually test the code on one review and then apply it to the others.

In [None]:
review = reviews[0]

# Username
username = review.find('a', attrs={'class': 'user-display-name js-analytics-click'}).string
print(username)

In [None]:
# Location
location = review.find('li', attrs={'class': 'user-location responsive-hidden-small'}).get_text()
print(location)

In [None]:
# Rating
rating = review.find('img', attrs={'class': 'offscreen'}).get('alt')
# \d+ is a regular expression pattern that matches one or more digits
rating = float(re.findall(r'\d+', rating)[0])
print(rating)
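
Note that `\d+` keeps only the first run of digits, so a fractional rating would lose its decimal part. The sketch below assumes, purely for illustration, an `alt` text of the form `'4.5 star rating'` and shows a pattern that also accepts an optional decimal part; `parse_rating` is a hypothetical helper name.

```python
import re

def parse_rating(alt_text):
    """Return the first number in alt_text as a float, or None if there is none."""
    match = re.search(r'\d+(?:\.\d+)?', alt_text)
    return float(match.group()) if match else None

sample = '4.5 star rating'                            # hypothetical alt text
integer_only = float(re.findall(r'\d+', sample)[0])   # keeps only the leading "4"
with_decimals = parse_rating(sample)                  # keeps "4.5"

print(integer_only, with_decimals)
```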

In [None]:
# Date
date = review.find('span', attrs={'class': 'rating-qualifier'}).get_text()
print(date)

In [None]:
# Content
content = review.find('p').get_text()
print(content)

#### Step 4: Apply to all the reviews and save them to a csv file

In [None]:
import csv
# On Windows, open() uses a locale-specific text encoding by default.
# Overriding it with 'utf-8' avoids many encoding issues.
with open('reviews.csv', 'w', encoding='utf-8', newline='') as csvfile:
    review_writer = csv.writer(csvfile)
    for review in reviews:
        # We use a dictionary to hold one review; every review has the same keys
        dic = {}
        # Code copied from the previous steps
        username = review.find('a', attrs={'class': 'user-display-name js-analytics-click'}).string
        location = review.find('li', attrs={'class': 'user-location responsive-hidden-small'}).get_text().strip()
        date = review.find('span', attrs={'class': 'rating-qualifier'}).get_text().strip()
        rating = review.find('img', attrs={'class': 'offscreen'}).get('alt')
        rating = float(re.findall(r'\d+', rating)[0])
        content = review.find('p').text
        # Assign values to the dictionary
        dic['username'] = username
        dic['location'] = location
        dic['date'] = date
        dic['rating'] = rating
        dic['content'] = content
        # Write the values of this review as one row of the local CSV file
        review_writer.writerow(dic.values())
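
Because each review is already stored as a dictionary, the standard library's `csv.DictWriter` is a possible alternative that also writes a header row. The sketch below uses made-up rows and writes to an in-memory buffer instead of `reviews.csv`, so it can be run without scraping anything:

```python
import csv
import io

# Made-up reviews, for illustration only
rows = [
    {'username': 'Alice', 'rating': 5.0, 'content': 'Great food, friendly staff'},
    {'username': 'Bob', 'rating': 3.0, 'content': 'Okay'},
]

buffer = io.StringIO()   # stands in for the open CSV file
writer = csv.DictWriter(buffer, fieldnames=['username', 'rating', 'content'])
writer.writeheader()     # first row holds the column names
writer.writerows(rows)   # one row per dictionary, in fieldnames order

output = buffer.getvalue()
print(output)
```

Naming the columns explicitly means the column order no longer depends on the order in which keys were added to each dictionary.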

#### Step 5: Apply to all the pages

In [None]:
import time
import random


def scrape_single_page(reviews, csvwriter):
    for review in reviews:
        dic = {}
        username = review.find('a', attrs={'class': 'user-display-name js-analytics-click'}).text
        location = review.find('li', attrs={'class': 'user-location responsive-hidden-small'}).text.strip()
        date = review.find('span', attrs={'class': 'rating-qualifier'}).text.strip()
        rating = review.find('img', attrs={'class': 'offscreen'}).get('alt')
        rating = float(re.findall(r'\d+', rating)[0])
        content = review.find('p').text
        dic['username'] = username
        dic['location'] = location
        dic['date'] = date
        dic['rating'] = rating
        dic['content'] = content
        csvwriter.writerow(dic.values())
    

with open('reviews.csv', 'w', encoding='utf-8', newline='') as csvfile:
    review_writer = csv.writer(csvfile)
    for index, url in enumerate(url_list):
        response = requests.get(url).text
        soup = BeautifulSoup(response, 'html.parser')
        reviews = soup.find_all('div', attrs={'class':'review review--with-sidebar'})
        scrape_single_page(reviews, review_writer)
        # Random sleep to avoid getting banned from the server
        time.sleep(random.randint(1,3))
        # Log the progress
        print('Finished page ' + str(index + 1))
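
When scraping many pages, some reviews may lack a field, in which case `find()` returns `None` and calling `.text` on it raises an `AttributeError`. The sketch below shows one way to guard against that. The helper name `text_or_none` and the shortened class names in the sample markup are hypothetical and do not reflect Yelp's actual page structure.

```python
from bs4 import BeautifulSoup

def text_or_none(parent, name, class_name):
    """Return the stripped text of the first matching tag, or None if it is missing."""
    tag = parent.find(name, attrs={'class': class_name})
    return tag.get_text().strip() if tag is not None else None

# Made-up review markup with a username but no location
html = '<div class="review"><a class="user-display-name"> Alice </a></div>'
review = BeautifulSoup(html, 'html.parser').div

username = text_or_none(review, 'a', 'user-display-name')
location = text_or_none(review, 'li', 'user-location')

print(username, location)
```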