<p><a name="sections"></a></p>
<br>
<br>

# Sections

- <a href="#intro">Introduction to Beautiful Soup</a><br>
    - <a href="#web">What is web scraping</a><br>
    - <a href="#html">Introduction to HTML</a><br>
    - <a href="#beautiful">Basics of Beautiful Soup</a><br>

- <a href="#example">Examples</a><br>
    - <a href="#calendar">Python User Group Calendar</a><br>
    - <a href="#yelp">Scrape Yelp Reviews</a><br>

<p><a name="web"></a></p>

## What is web scraping?

- HTML is short for **HyperText Markup Language**. It's a language for presenting content on the Web.

- Plain text is turned into an HTML document by **tags** that are then interpreted by a browser.

- Using BeautifulSoup, you can easily extract the tag values from HTML source code.

### Beautiful Soup vs. Regular Expressions

In [1]:
# the source code of hi.html
# !cat data/hi.html
# Windows user
!type data\hi.html

<!DOCTYPE html>
<html>
    <head>
        <title>Hi</title> <!--Im a comment, ignore me.-->
    </head>
    <body>
        <a href='http://www.crummy.com/software/BeautifulSoup/'>Hello, beautifulsoup!</a>
    </body>
</html>


### Example:
- Extract the characters between the title tags. 


- In this case it's `Hi` (`<title>Hi</title>`).

- **Solution using Regular Expressions**

In [2]:
import re
hi_path = 'data/hi.html'
with open(hi_path, 'r') as f:
    hi = f.read()
    print(re.findall('<title>(.*)</title>', hi))

['Hi']


- **Solution using BeautifulSoup**

In [6]:
from bs4 import BeautifulSoup
with open(hi_path, 'r') as f:
    hi = f.read()
    hi = BeautifulSoup(hi, 'html.parser')
    print(hi.title) # find the title tag
    print(hi.title.string)  # find the value of tag

<title>Hi</title>
Hi


In [8]:
type(hi.title.string)

bs4.element.NavigableString

**Compared with regular expressions:**
    
- Beautiful Soup's syntax is much simpler, while regular expressions are more flexible.
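
To make the trade-off concrete, here is a small sketch that uses a made-up HTML string (not one of the course files) in which the title tag carries an attribute and its text sits on its own line. The regular expression from the example above no longer matches, while Beautiful Soup still finds the title.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML: <title> has an attribute and its text is on a separate line
html = """<html>
<head>
<title lang="en">
Hi
</title>
</head>
</html>"""

# The earlier pattern expects the literal '<title>' and text on a single line
regex_result = re.findall('<title>(.*)</title>', html)
print(regex_result)

# Beautiful Soup parses the structure, so attributes and line breaks do not matter
soup = BeautifulSoup(html, 'html.parser')
bs_result = soup.title.get_text(strip=True)
print(bs_result)
```

The regular expression could be adapted to this case, but each new variation of the markup would need another adjustment.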

<p><a name="html"></a></p>

## Introduction to HTML

### Tag

- The `<title>` tags in this example designate the enclosed text as the title to be displayed in the head of the browser tab.
![hi](pic/hi.png)

- Tags are always enclosed by `<` and `>` to distinguish them from the content. 
- A pair of tags consists of a start tag and an end tag that carry the same name, but the end tag is preceded by a slash `/`.

### Values

Values are the content between start tags and end tags.

- **Example**

`<title>Hi</title>`: It's a title tag with a value of `Hi`.

### Attributes
Tags have another feature called attributes.

- **Example**

`<a href='http://www.crummy.com/software/BeautifulSoup/'>Hello, beautifulsoup!</a>`

The anchor tag `<a>` has an attribute `href` whose value is the hyperlink http://www.crummy.com/software/BeautifulSoup/. It links the enclosed text to another address.

### Tree structure
- The first tag in the example is the `<html>` tag. 

- Between the `<html>` tags, several tags are opened and closed again: `<head>`, `<title>`, `<body>`, and `<a>`.

    - The `<head>` and `<body>` tags are directly enclosed by the `<html>` tag. 
    - The `<title>` tag is enclosed by the `<head>` tag.
    - The `<a>` tag is enclosed by the `<body>` tag.


- A good way to describe the multiple layers of an HTML document is the tree analogy. 
![html](pic/html.png)

- The `html` tag is the root tag that splits into two branches, `<head>` and `<body>`; `<head>` is followed by another branch called `<title>`; `<body>` is followed by another branch called `<a>`.
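
The same tree can be walked in code. As a minimal sketch, assuming markup equivalent to `hi.html` typed in as a string without whitespace between the tags, the `.children` and `.parent` attributes of a tag move one level down or up the tree.

```python
from bs4 import BeautifulSoup

# Same structure as hi.html, written inline so the example is self-contained
html = ("<html><head><title>Hi</title></head>"
        "<body><a href='http://www.crummy.com/software/BeautifulSoup/'>"
        "Hello, beautifulsoup!</a></body></html>")
soup = BeautifulSoup(html, 'html.parser')

# One level down: the direct children of the root <html> tag
top_level = [child.name for child in soup.html.children]
print(top_level)

# One level up: the tag that encloses <title> and the tag that encloses <a>
title_parent = soup.title.parent.name
a_parent = soup.a.parent.name
print(title_parent, a_parent)
```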

<p><a name="beautiful"></a></p>

## Basics of Beautiful Soup

### Parse HTML

- The `prettify()` method adds indentation, which helps you see the tree structure of the HTML document.

In [9]:
from bs4 import BeautifulSoup
# open a local file and parse the plain text by BeautifulSoup directly
with open(hi_path, 'r') as f:
    hi = f.read()
    hi = BeautifulSoup(hi, 'html.parser')
    print(type(hi)) # get a bs4.BeautifulSoup object
    print('\n')
    print(hi.prettify())

<class 'bs4.BeautifulSoup'>


<!DOCTYPE html>
<html>
 <head>
  <title>
   Hi
  </title>
  <!--Im a comment, ignore me.-->
 </head>
 <body>
  <a href="http://www.crummy.com/software/BeautifulSoup/">
   Hello, beautifulsoup!
  </a>
 </body>
</html>



### Names, Values, and Attributes

Beautiful Soup can extract the `name`, `value` and `attributes` of tags. The corresponding attributes of a tag object are:
- name
- string
- attrs

In [10]:
print("The name of a tags is: ", hi.a.name)
print("The value of a tags is: ", hi.a.string)
print("The attribute of a tags is: ", hi.a.attrs)

The name of a tags is:  a
The value of a tags is:  Hello, beautifulsoup!
The attribute of a tags is:  {'href': 'http://www.crummy.com/software/BeautifulSoup/'}


### get_text() & get()
- For tags that have more than one child, `.string` does not work: it returns `None`.

In [13]:
print(hi.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Hi
  </title>
  <!--Im a comment, ignore me.-->
 </head>
 <body>
  <a href="http://www.crummy.com/software/BeautifulSoup/">
   Hello, beautifulsoup!
  </a>
 </body>
</html>



In [14]:
print(hi.html.string)

None


- Use the `get_text()` method instead. It extracts the text of all child tags and joins it into one string.

In [15]:
print(hi.html.get_text())



Hi 


Hello, beautifulsoup!
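
The blank lines in the output above come from the whitespace between tags in the file. As a hedged sketch on an inline copy of the markup (the `' | '` separator is an arbitrary choice for illustration), the `separator` and `strip` arguments of `get_text()` give a tidier result.

```python
from bs4 import BeautifulSoup

html = """<html>
    <head><title>Hi</title></head>
    <body><a href='#'>Hello, beautifulsoup!</a></body>
</html>"""
soup = BeautifulSoup(html, 'html.parser')

# .string is None because <html> has more than one child
print(soup.html.string)

# strip=True trims each piece of text and drops whitespace-only pieces;
# separator is placed between the remaining pieces
joined = soup.html.get_text(separator=' | ', strip=True)
print(joined)

# stripped_strings yields the same pieces one by one
pieces = list(soup.html.stripped_strings)
print(pieces)
```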




- `get()` returns the value of a tag's attribute. For example, we can get the `href` of the `a` tag using the following code.

- It is the same as running `hi.a.attrs` first and then looking up the key `href` in the resulting dictionary.

In [17]:
hi.a.attrs

{'href': 'http://www.crummy.com/software/BeautifulSoup/'}

In [23]:
print(hi.a.get('href'))

http://www.crummy.com/software/BeautifulSoup/


In [22]:
print(hi.a.attrs['href'])

http://www.crummy.com/software/BeautifulSoup/
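
The two approaches differ when the attribute is absent. A small sketch, using an inline `a` tag and a `title` attribute that the tag does not have:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<a href='http://www.crummy.com/software/BeautifulSoup/'>Hello</a>",
    'html.parser',
)
link = soup.a

# get() returns None (or a default you supply) when the attribute is missing
missing = link.get('title')
fallback = link.get('title', 'no title')
print(missing, fallback)

# Indexing the attrs dictionary raises KeyError instead
try:
    link.attrs['title']
    raised = False
except KeyError:
    raised = True
print(raised)

# A tag can also be indexed directly, which is shorthand for attrs[...]
href = link['href']
print(href)
```

This is why `get()` tends to be the safer choice when scraping pages where some tags may lack the attribute.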


### find() & find_all()
The methods `find()` and `find_all()` offer flexible ways to locate tags.

In [24]:
# !cat data/article.html
# Windows user
!type data\article.html

<!DOCTYPE html>
<html>
    <head>
        <title>Article</title>
    </head>
    <body>
        <h1 id='one'>One</h1>
        	<p>This is the first paragraph.</p>
        <h2 id='two'>Two</h2>
        	<p><a href='www.google.com'>Here is the Google website.</a></p>
        <h3 id='three'>Three</h3>
        	<p>This is the third paragraph.</p>
    </body>
</html>


![article](pic/article.png)

In [25]:
article_path = './data/article.html'
with open(article_path, 'r') as f:
    article = f.read()
    article = BeautifulSoup(article, 'html.parser')

- Return only the first `p` tag.

In [26]:
print(article.p)

<p>This is the first paragraph.</p>


- `find()` returns the first `p` tag, which is equivalent to `article.p`.

In [27]:
print(article.find('p'))

<p>This is the first paragraph.</p>


- `find_all()` returns all p tags

In [35]:
print(article.find_all('p'))
type(article.find_all('p'))
result = article.find_all('p')
result.append('new element')

[<p>This is the first paragraph.</p>, <p><a href="www.google.com">Here is the Google website.</a></p>, <p>This is the third paragraph.</p>]


In [37]:
dir(result)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__rmul__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'append',
 'clear',
 'copy',
 'count',
 'extend',
 'index',
 'insert',
 'pop',
 'remove',
 'reverse',
 'sort',
 'source']

In [34]:
isinstance(result, list)

True

- To find the tags that have specific attributes, you can pass a dictionary as the `attrs` argument.

In [38]:
print(article.find_all('h1', attrs={'id':'one'}))

[<h1 id="one">One</h1>]
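
There are a few other ways to express the same search. The sketch below uses an inline excerpt of `article.html`. It assumes a Beautiful Soup version in which `select()` is available for CSS selectors (it relies on the `soupsieve` package that recent versions install alongside `bs4`).

```python
from bs4 import BeautifulSoup

html = """
<h1 id='one'>One</h1>
<p>This is the first paragraph.</p>
<h2 id='two'>Two</h2>
<p><a href='www.google.com'>Here is the Google website.</a></p>
"""
soup = BeautifulSoup(html, 'html.parser')

# Attribute filter as a dictionary, as in the cell above
by_attrs = soup.find_all('h1', attrs={'id': 'one'})

# The same filter written as a keyword argument
by_keyword = soup.find_all('h1', id='one')

# The same filter written as a CSS selector
by_css = soup.select('h1#one')

# A list of names matches any of the listed tags
headings = [tag.get_text() for tag in soup.find_all(['h1', 'h2'])]
print(by_attrs, by_keyword, by_css, headings)
```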


- You can also pass a function to `find_all()`. It is called on every tag and should return a truthy value for the tags you want to keep.
- For example, the following returns every tag that has an `href` attribute:

In [42]:
# the tags that have a non-empty href attribute
print(article.find_all(lambda tag: tag.get('href')))

[<a href="www.google.com">Here is the Google website.</a>]
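
For criteria that do not fit on one line, a named function reads better than a lambda. The sketch below uses a hypothetical helper, `is_heading_with_id`, on an inline excerpt of `article.html`.

```python
from bs4 import BeautifulSoup

html = """
<h1 id='one'>One</h1>
<p>This is the first paragraph.</p>
<h2 id='two'>Two</h2>
<p><a href='www.google.com'>Here is the Google website.</a></p>
<h3 id='three'>Three</h3>
<p>This is the third paragraph.</p>
"""
soup = BeautifulSoup(html, 'html.parser')


def is_heading_with_id(tag):
    """Return True for h1-h6 tags that carry an id attribute."""
    return tag.name in {'h1', 'h2', 'h3', 'h4', 'h5', 'h6'} and tag.has_attr('id')


# find_all calls the function once per tag and keeps the tags that return True
headings = soup.find_all(is_heading_with_id)
ids = [tag['id'] for tag in headings]
print(ids)
```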


<p><a name="example"></a></p>

## Examples

<p><a name="calendar"></a></p>

### Python User Group Calendar

Let's extract the time, location, and event titles from this web page [Python User Group Calendar](https://www.python.org/events/python-user-group/).

<img src=pic/events.png width=800/>

- For the examples we discussed before, we saved the HTML document locally. In a real web scraping project, however, you would not want to download every page by hand before you start scraping.
- The [Requests package](http://docs.python-requests.org/en/master/) we are using here is well designed and very popular in the industry. It makes sending HTTP requests from Python easy.
- The `get` method we are using here sends one type of [HTTP request](https://www.tutorialspoint.com/http/http_methods.htm). It is most often used to retrieve information from the web server. 

In [47]:
import requests
response = requests.get('https://www.python.org/events/python-user-group/')
text = BeautifulSoup(response.text, 'html.parser')

In [48]:
print(text.prettify())

<!DOCTYPE doctype html>
<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->
<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->
<!--[if IE 8]>      <html class="no-js ie8 lt-ie9">                 <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" dir="ltr" lang="en">
 <!--<![endif]-->
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <link href="//ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js" rel="prefetch"/>
  <meta content="Python.org" name="application-name"/>
  <meta content="The official home of the Python Programming Language" name="msapplication-tooltip"/>
  <meta content="Python.org" name="apple-mobile-web-app-title"/>
  <meta content="yes" name="apple-mobile-web-app-capable"/>
  <meta content="black" name="apple-mobile-web-app-status-bar-style"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <meta content="True" name="HandheldFrien

#### Title
Titles are in `h3` tags with an attribute `class="event-title"`.
<img src=pic/title.png width=900/>

In [55]:
titleTags = text.find_all('h3', {'class': "event-title"})
titleTags

[<h3 class="event-title"><a href="/events/python-user-group/911/">Python Meeting Düsseldorf</a></h3>,
 <h3 class="event-title"><a href="/events/python-user-group/874/">PyCC Meetup'19 (Python Cape Coast User Group)</a></h3>,
 <h3 class="event-title"><a href="/events/python-user-group/903/">enterPy</a></h3>,
 <h3 class="event-title"><a href="/events/python-user-group/912/">Python Meeting Düsseldorf</a></h3>,
 <h3 class="event-title"><a href="/events/python-user-group/979/">Python for Signal Processing Algorithms Implementation Workshop</a></h3>]

In [57]:
first_tag = titleTags[0]
first_tag.get_text()

'Python Meeting Düsseldorf'

In [50]:
titleString = [tag.get_text() for tag in titleTags]
titleString

['Python Meeting Düsseldorf',
 "PyCC Meetup'19 (Python Cape Coast User Group)",
 'enterPy',
 'Python Meeting Düsseldorf',
 'Python for Signal Processing Algorithms Implementation Workshop']

#### Time
Times are in `time` tags that have a `datetime` attribute.

![time](pic/time.png)

In [70]:
text.find_all('time')

[<time datetime="2020-09-30T16:00:00+00:00">30 Sept.<span class="say-no-more"> 2020</span> 4pm UTC – 6pm UTC</time>,
 <time datetime="2020-10-26T08:00:00+00:00">26 Oct.<span class="say-no-more"> 2020</span> 8am UTC – 10am UTC</time>,
 <time datetime="2020-11-23T00:00:00+00:00">23 Nov. – 24 Nov. <span class="say-no-more"> 2020</span></time>,
 <time datetime="2020-06-24T16:00:00+00:00">24 June<span class="say-no-more"> 2020</span> 4pm UTC – 6pm UTC</time>,
 <time datetime="2020-06-18T00:00:00+00:00">18 June<span class="say-no-more"> 2020</span></time>]

In [71]:
timeTags = text.find_all(lambda tag: 'datetime' in tag.attrs)
timeTags

[<time datetime="2020-09-30T16:00:00+00:00">30 Sept.<span class="say-no-more"> 2020</span> 4pm UTC – 6pm UTC</time>,
 <time datetime="2020-10-26T08:00:00+00:00">26 Oct.<span class="say-no-more"> 2020</span> 8am UTC – 10am UTC</time>,
 <time datetime="2020-11-23T00:00:00+00:00">23 Nov. – 24 Nov. <span class="say-no-more"> 2020</span></time>,
 <time datetime="2020-06-24T16:00:00+00:00">24 June<span class="say-no-more"> 2020</span> 4pm UTC – 6pm UTC</time>,
 <time datetime="2020-06-18T00:00:00+00:00">18 June<span class="say-no-more"> 2020</span></time>]

In [72]:
timeString = [tag.get('datetime') for tag in timeTags]
timeString

['2020-09-30T16:00:00+00:00',
 '2020-10-26T08:00:00+00:00',
 '2020-11-23T00:00:00+00:00',
 '2020-06-24T16:00:00+00:00',
 '2020-06-18T00:00:00+00:00']
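
The extracted values are still plain strings. If you need to sort or filter by date, one option (a sketch, assuming Python 3.7 or later) is to convert them with `datetime.fromisoformat` from the standard library, which accepts this format.

```python
from datetime import datetime

# Two of the strings scraped above
time_strings = ['2020-09-30T16:00:00+00:00', '2020-06-18T00:00:00+00:00']

# Convert each ISO 8601 string into a timezone-aware datetime object
parsed = [datetime.fromisoformat(s) for s in time_strings]

# datetime objects can be compared, so min() gives the earliest event
earliest = min(parsed)
formatted = earliest.strftime('%Y-%m-%d %H:%M')
print(formatted)
```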

#### Location
Locations are in `span` tags with the attribute `class="event-location"`.

<img src=pic/location.png width=900/>

In [73]:
locationTags = text.find_all("span", {"class": "event-location"})
locationTags

[<span class="event-location">Düsseldorf, Germany</span>,
 <span class="event-location">Cape coast, Ghana</span>,
 <span class="event-location">Mannheim, Germany</span>,
 <span class="event-location">Online</span>,
 <span class="event-location">Erode, Tamilnadu, INDIA</span>]

In [74]:
locationString = [tag.string for tag in locationTags]
locationString

['Düsseldorf, Germany',
 'Cape coast, Ghana',
 'Mannheim, Germany',
 'Online',
 'Erode, Tamilnadu, INDIA']

In [77]:
import pandas as pd
pd.DataFrame(list(zip(titleString, timeString, locationString)), columns=['event_title', 'time', 'location'])

| | event_title | time | location |
|---|---|---|---|
| 0 | Python Meeting Düsseldorf | 2020-09-30T16:00:00+00:00 | Düsseldorf, Germany |
| 1 | PyCC Meetup'19 (Python Cape Coast User Group) | 2020-10-26T08:00:00+00:00 | Cape coast, Ghana |
| 2 | enterPy | 2020-11-23T00:00:00+00:00 | Mannheim, Germany |
| 3 | Python Meeting Düsseldorf | 2020-06-24T16:00:00+00:00 | Online |
| 4 | Python for Signal Processing Algorithms Implem... | 2020-06-18T00:00:00+00:00 | Erode, Tamilnadu, INDIA |


In [76]:
list(zip(titleString, timeString, locationString))

[('Python Meeting Düsseldorf',
  '2020-09-30T16:00:00+00:00',
  'Düsseldorf, Germany'),
 ("PyCC Meetup'19 (Python Cape Coast User Group)",
  '2020-10-26T08:00:00+00:00',
  'Cape coast, Ghana'),
 ('enterPy', '2020-11-23T00:00:00+00:00', 'Mannheim, Germany'),
 ('Python Meeting Düsseldorf', '2020-06-24T16:00:00+00:00', 'Online'),
 ('Python for Signal Processing Algorithms Implementation Workshop',
  '2020-06-18T00:00:00+00:00',
  'Erode, Tamilnadu, INDIA')]

### Web Scraping Project Workflow
- We have been lucky so far because there are no missing values on this page. But what if the location of one event is missing? The three lists would then have different lengths, and we could no longer tell which title, time, and location belong together.
- The general workflow of a web scraping project is the following:
 - Find a unique attribute that locates the **top-level** tags you are interested in.
     - Each tag could be a listing, a review, an item...
     - **one unique tag -> one row in the csv file**
 - Here, we want to locate the event tags whose child tags contain the title, datetime, and location that we want to save as columns in a csv file.
 - Then go one level deeper to find the child tags of each event. If something is missing there, replace it with an empty value (the code below uses `None`).
 - On this page, the events are `li` tags inside `ul` tags that have the attribute **class="list-recent-events menu"**.
 - The next question is: what is the best data structure to represent a single event?

In [83]:
# Save all the events in a list
result = []
# Save all the ul tags, each ul is a section of the page
uls = text.find_all('ul', {'class': 'list-recent-events menu'})
for ul in uls:
    # Save all the li tags, each li is an event
    lis = ul.find_all('li')
#     print(len(lis))
    for li in lis:
        # Initialize an empty dictionary for each event
        event = {}
        # find() returns None when a tag is missing, so calling a method on
        # the result raises AttributeError; catch it to handle missing values
        try:
            title = li.find('h3').get_text()
        except AttributeError:
            # Skip list items that have no title at all
            continue
        try:
            time = li.find('time').get('datetime')
        except AttributeError:
            time = None
        try:
            location = li.find('span', {'class':'event-location'}).string.strip()
        except AttributeError:
            location = None
        
        # Assign the values in the dictionary
        event['location'] = location
        event['time'] = time
        event['title'] = title
        result.append(event)

In [84]:
result

[{'location': 'Düsseldorf, Germany',
  'time': '2020-09-30T16:00:00+00:00',
  'title': 'Python Meeting Düsseldorf'},
 {'location': 'Cape coast, Ghana',
  'time': '2020-10-26T08:00:00+00:00',
  'title': "PyCC Meetup'19 (Python Cape Coast User Group)"},
 {'location': 'Mannheim, Germany',
  'time': '2020-11-23T00:00:00+00:00',
  'title': 'enterPy'},
 {'location': 'Online',
  'time': '2020-06-24T16:00:00+00:00',
  'title': 'Python Meeting Düsseldorf'},
 {'location': 'Erode, Tamilnadu, INDIA',
  'time': '2020-06-18T00:00:00+00:00',
  'title': 'Python for Signal Processing Algorithms Implementation Workshop'}]
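
To see why the per-event approach matters, here is a self-contained sketch on a made-up page fragment in which the second event has no location. The helper `text_or_empty` and the event names are hypothetical. Each event still becomes exactly one row, and `csv.DictWriter` from the standard library writes the dictionaries out (to an in-memory buffer here, so no file is created).

```python
import csv
import io
from bs4 import BeautifulSoup

# Hypothetical fragment: the second event is missing its location
html = """
<ul class="list-recent-events menu">
  <li>
    <h3 class="event-title"><a href="#">Event A</a></h3>
    <time datetime="2020-09-30T16:00:00+00:00">30 Sept.</time>
    <span class="event-location">Mannheim, Germany</span>
  </li>
  <li>
    <h3 class="event-title"><a href="#">Event B</a></h3>
    <time datetime="2020-10-26T08:00:00+00:00">26 Oct.</time>
  </li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')


def text_or_empty(tag):
    """Return the stripped text of a tag, or '' if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else ''


events = []
for li in soup.find_all('li'):
    time_tag = li.find('time')
    events.append({
        'title': text_or_empty(li.find('h3')),
        'time': time_tag.get('datetime', '') if time_tag is not None else '',
        'location': text_or_empty(li.find('span', {'class': 'event-location'})),
    })

# One dictionary per event maps directly onto one csv row
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['title', 'time', 'location'])
writer.writeheader()
writer.writerows(events)
csv_text = buffer.getvalue()
print(csv_text)
```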

<p><a name="yelp"></a></p>

## Scrape Yelp Reviews
- Let's apply what we have learned to a more complicated example - scrape Yelp reviews. 
- Our task is to scrape all the reviews of the ABC Kitchen Restaurant on Yelp. https://www.yelp.com/biz/abc-kitchen-new-york
- You can easily extend this code to all the restaurants.

### Step 1: Find the pattern of url

- Here we add a `User-Agent` field to the headers of our request, because web servers sometimes check header fields to block automated scrapers.
- `User-Agent` is the field most commonly checked because it identifies the browser that sends the request.

In [85]:
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
            }

response = requests.get('https://www.yelp.com/biz/abc-kitchen-new-york', headers=headers)
text = BeautifulSoup(response.text, 'html.parser')
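
When many pages are requested with the same headers, one option is a `requests.Session`, which stores the headers once and applies them to every request. The sketch below only prepares a request to show what would be sent; it does not contact the server.

```python
import requests

# A session keeps headers (and cookies) across requests
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
})

# Build the request for the second page without sending it
request = requests.Request(
    'GET',
    'https://www.yelp.com/biz/abc-kitchen-new-york',
    params={'start': 20},
)
prepared = session.prepare_request(request)

# The query string and the session's User-Agent are both in place
print(prepared.url)
print(prepared.headers['User-Agent'])
```

To actually fetch a page you would call `session.get(url)` in place of `requests.get(url, headers=headers)`.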

- If you go to the second page, you can see the url becomes https://www.yelp.com/biz/abc-kitchen-new-york?start=20
- Similarly, the url of the third page is https://www.yelp.com/biz/abc-kitchen-new-york?start=40
- But how do we find out the url of the last page?

In [102]:
import re

temp = 'lemon--div__373c0__1mboc border-color--default__373c0__3-ifU text-align--center__373c0__2n2yQ'
num_reviews = text.find_all('div', attrs={'class': temp})[0].get_text()
num_reviews = int(re.findall(r'1 of (\d+)', num_reviews)[0])

# url_list = []
# for i in range(num_reviews):
#     url_list.append(f'https://www.yelp.com/biz/abc-kitchen-new-york?start={i*20}')
    
url_list = [f'https://www.yelp.com/biz/abc-kitchen-new-york?start={i*20}' for i in range(num_reviews+1)]

In [103]:
url_list[-1]

'https://www.yelp.com/biz/abc-kitchen-new-york?start=2960'

In [104]:
import re

temp = 'lemon--p__373c0__3Qnnj text__373c0__2Kxyz text-color--mid__373c0__jCeOG text-align--left__373c0__2XGa- text-size--large__373c0__3t60B'
num_reviews = text.find('p', attrs={'class': temp}).string
num_reviews = int(re.findall(r'\d+', num_reviews)[0])
print(num_reviews)

# url_list = []
# for i in range(0, num_reviews, 20):
#     url_list.append('https://www.yelp.com/biz/abc-kitchen-new-york?start='+str(i))
    
url_list = [f'https://www.yelp.com/biz/abc-kitchen-new-york?start={i}' for i in range(0, num_reviews, 20)]
print(url_list[-1])

2972
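
The arithmetic behind the page urls can be checked without touching the site. This sketch takes the review count printed above and assumes 20 reviews per page, as the `start=20` and `start=40` urls suggest.

```python
import math

base_url = 'https://www.yelp.com/biz/abc-kitchen-new-york'
num_reviews = 2972        # the count printed above
reviews_per_page = 20     # assumption based on the start=20, start=40 pattern

# The last page may be only partly filled, so round up
num_pages = math.ceil(num_reviews / reviews_per_page)

# Page k (counting from 0) starts at review k * reviews_per_page
url_list = [f'{base_url}?start={page * reviews_per_page}' for page in range(num_pages)]

print(num_pages)
print(url_list[-1])
```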


### Step 2: Find all the review divs on the page

In [109]:
temp = 'lemon--li__373c0__1r9wz margin-b3__373c0__q1DuY padding-b3__373c0__342DA border--bottom__373c0__3qNtD border-color--default__373c0__3-ifU'
reviews = text.find_all('li', attrs={'class': temp})
print(len(reviews))

20


### Step 3: Scrape the detail information

For debugging purposes, we usually test the code on one review first and then apply it to the others.

In [125]:
review = reviews[0]

# Rating
temp = 'lemon--span__373c0__3997G display--inline__373c0__3JqBP border-color--default__373c0__3-ifU'
rating = review.find('span', attrs={'class': temp}).find('div').get('aria-label')
rating = float(re.findall(r'\d+', rating)[0])
print(rating)

5.0


In [126]:
# Date
temp = 'lemon--span__373c0__3997G text__373c0__2Kxyz text-color--mid__373c0__jCeOG text-align--left__373c0__2XGa-'
date = review.find('span', attrs={'class': temp}).string
print(date)

3/7/2020


In [127]:
# Content
temp = 'lemon--span__373c0__3997G raw__373c0__3rKqk'
content = review.find('span', attrs={'class': temp}).get_text()
print(content)

Honestly I cannot say enough goods things about the food here. I went with a group of 7 for dinner, we were at a lovely round table which was perfect for conversation. I would definitely recommend making a reservation. There are two entrances, we came through the store on the oposite side. The restaurant is very cool, very nice dinner vibe. The staff was on top of everything and made sure everyone's drink was full and mine was never empty. We started off with some appetizers. The chef prefers when guests order appetizers and main courses at the same time so that's what we did. We got a few crab toasts with lemon aioli (SO GOOD, my favorite dish), kale salad with lemon (delicious, the mint and Serrano chilies are a phenomenal addition and make the salad wonderful), and the tomato mozzarella basil whole wheat pizza (very good and fresh, but I would recommend the crab toast over the pizza). For my main course I chose the Black Sea bass. This was so good!! I cannot put into words how much 

### Step 4: Apply to all the reviews and save them to a csv file

In [130]:
import csv
# On Windows, open() uses the system's default text encoding unless told otherwise.
# Setting it to 'utf-8' explicitly avoids many encoding issues.
with open('reviews.csv', 'w', encoding='utf-8', newline='') as csvfile:
    review_writer = csv.writer(csvfile)
    review_writer.writerow(['date','rating','content'])
    for review in reviews:
        dic = {}
        date = review.find('span', attrs={'class': 'lemon--span__373c0__3997G text__373c0__2Kxyz text-color--mid__373c0__jCeOG text-align--left__373c0__2XGa-'})\
                   .get_text().strip()
        rating = review.find('span', attrs={'class': 'lemon--span__373c0__3997G display--inline__373c0__3JqBP border-color--default__373c0__3-ifU'})\
                   .find('div').get('aria-label')
        rating = float(re.findall(r'\d+', rating)[0])
        content = review.find('span', attrs={'class': 'lemon--span__373c0__3997G raw__373c0__3rKqk'})\
                   .get_text().strip()
        dic['date'] = date
        dic['rating'] = rating
        dic['content'] = content
        review_writer.writerow(dic.values())
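
The long class strings used above look auto-generated, so they may change whenever the site is rebuilt, and matching the full string breaks as soon as one class is added or renamed. A possibly more tolerant alternative is to match a single class by a regular expression. The sketch below uses a made-up fragment that reuses two of the class names from this page; whether the `raw__` prefix stays stable on the real site is an assumption.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on one review
html = """
<li class="lemon--li__373c0__1r9wz margin-b3__373c0__q1DuY">
  <span class="lemon--span__373c0__3997G raw__373c0__3rKqk">Great food.</span>
</li>
"""
soup = BeautifulSoup(html, 'html.parser')

# Exact match on the whole class string, as in the cells above
exact = soup.find('span', attrs={'class': 'lemon--span__373c0__3997G raw__373c0__3rKqk'})

# Match any span that has at least one class starting with 'raw__'
by_prefix = soup.find('span', class_=re.compile(r'^raw__'))

print(exact.get_text())
print(by_prefix.get_text())
```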

### Step 5: Apply to all the pages

In [133]:
import time
import random
import csv


def scrape_single_page(reviews, csvwriter):
    for review in reviews:
        dic = {}
        date = review.find('span', attrs={'class': 'lemon--span__373c0__3997G text__373c0__2Kxyz text-color--mid__373c0__jCeOG text-align--left__373c0__2XGa-'})\
                   .get_text().strip()
        rating = review.find('span', attrs={'class': 'lemon--span__373c0__3997G display--inline__373c0__3JqBP border-color--default__373c0__3-ifU'})\
                   .find('div').get('aria-label')
        rating = float(re.findall(r'\d+', rating)[0])
        content = review.find('span', attrs={'class': 'lemon--span__373c0__3997G raw__373c0__3rKqk'})\
                   .get_text().strip()
        dic['date'] = date
        dic['rating'] = rating
        dic['content'] = content
        csvwriter.writerow(dic.values())
    

with open('reviews.csv', 'w', encoding='utf-8', newline='') as csvfile:
    review_writer = csv.writer(csvfile)
    review_writer.writerow(['date','rating','content'])
    
    for index, url in enumerate(url_list, 1):
        response = requests.get(url, headers=headers)
        text = BeautifulSoup(response.text, 'html.parser')
        reviews = text.find_all('li', attrs={'class': 'lemon--li__373c0__1r9wz margin-b3__373c0__q1DuY padding-b3__373c0__342DA border--bottom__373c0__3qNtD border-color--default__373c0__3-ifU'})
        scrape_single_page(reviews, review_writer)
        # Random sleep to avoid getting banned from the server
        time.sleep(random.randint(1,3))
        # Log the progress
        print(f'Finished page: {index}')

Finished page: 1
Finished page: 2
Finished page: 3
Finished page: 4
Finished page: 5
Finished page: 6
Finished page: 7
Finished page: 8
Finished page: 9


KeyboardInterrupt: 

In [134]:
import pandas as pd

df = pd.read_csv('./reviews.csv')

In [136]:
df.rating.mean()

4.122222222222222