In [None]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


# Data Collection

In this lecture, we will walk through examples of extracting data from online resources.  
In particular, we will examine web scraping and API use cases.

## A General Pipeline

*Data Collection*
> The process of gathering and measuring information on variables of interest, in an established systematic fashion that enables one to answer queries, stated research questions, test hypotheses, and evaluate outcomes.

<div>
<img src="https://miro.medium.com/max/1200/1*ZWcBynyugbLpWcU3QWH7Tg.jpeg" alt="project-flow" width="500" height="600"/>
</div>


## Data Sources

#### Central Authorities

- [The U.S. Government's Open Data](http://www.data.gov/)
- [Reddit Open Data](https://www.reddit.com/r/opendata/)
- [Climate Data](http://www.realclimate.org/index.php/data-sources/)
- [The World Bank](https://datacatalog.worldbank.org/)

#### Databases

- [Crawdad](https://crawdad.org/)
- [Radar](https://www.radars.org.uk/)
- Private Databases

#### Web Scraping & APIs

A data scientist doesn’t always get data handed to them in a CSV or an easily accessible database. In those cases, you need to manually extract the data from various resources. To this end, we have specialized tools.

For instance, many web sources, such as IMDb, provide a set of protocols and methods that let outside clients interact with their database. These protocols and methods are collectively exposed as an **API** (Application Programming Interface). APIs appear in many contexts, such as operating systems or, as here, web development. The idea is to offer an outer interface to those who wish to access a set of resources. In our case, the resource is a dataset.

However, in some cases an API does not exist, and the desired data is embedded in the raw HTML file, enclosed by various tags. In those cases, we need to parse the document and extract the desired data. This is where **web scraping** comes in: the HTML file is parsed and stored as a tree to preserve the hierarchical relationship between tags.




## Web Scraping

Web scraping is a technique to automatically access and extract large amounts of information from a website, which can save a huge amount of time and effort.


![](https://pbs.twimg.com/media/EGwqy2OWwAAi6-F?format=jpg&name=small)

## Working with APIs

The term API is an acronym, and it stands for "**Application Programming Interface**". Think of an API like a menu in a restaurant. The menu provides a list of dishes you can order, along with a description of each dish. When you specify what menu items you want, the restaurant’s kitchen does the work and provides you with some finished dishes. You don’t know exactly how the restaurant prepares that food, and you don’t really need to.

![](https://miro.medium.com/max/1200/1*3h95bN2_xe-eitwHh_Ygvw.png)

## HTTP Requests

HTTP stands for Hypertext Transfer Protocol and is used to structure requests and responses over the internet. HTTP requires data to be transferred from one point to another over the network. You may think of it as the command language that the devices on both sides of the connection must follow in order to communicate.


|Command (HTTP Method)|CRUD Operation|Sample Endpoint|Description|
|---|---|---|---|
|get (GET)|Read (Retrieve)|http://example.com/resources/item17|Retrieve a representation of the addressed member of the collection, expressed in an appropriate Internet media type.|
|post (POST)|Create|http://example.com/resources/|Create a new entry in the collection. The new entry's URL is assigned automatically and is usually returned by the operation.|
|put (PUT)|Update|http://example.com/resources/item17|Replace the addressed member of the collection, or if it doesn't exist, create it.|
|delete (DELETE)|Delete (Destroy)|http://example.com/resources/item17|Delete the addressed member of the collection.|
||||**Table 1 Methods and sample endpoints.**|
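To make the mapping in Table 1 concrete, here is a minimal sketch that builds (but never sends) one request per CRUD operation using the standard library's `urllib.request.Request`. The endpoints mirror the sample URLs in the table, and the JSON payloads are made up for illustration.

```python
from urllib.request import Request

# Hypothetical endpoint, mirroring the sample URLs in Table 1.
BASE = "http://example.com/resources/"

# Build (but do not send) one request object per CRUD operation.
crud_requests = {
    "create": Request(BASE, data=b'{"name": "item17"}', method="POST"),
    "read": Request(BASE + "item17", method="GET"),
    "update": Request(BASE + "item17", data=b'{"name": "renamed"}', method="PUT"),
    "delete": Request(BASE + "item17", method="DELETE"),
}

for operation, req in crud_requests.items():
    # get_method() reports the HTTP method, full_url the target endpoint
    print(f"{operation:<6} -> {req.get_method():<6} {req.full_url}")
```

Nothing touches the network here; the point is only that the same URL can be addressed with different methods, and the method decides which operation the server performs.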

Below, you may find a sample request from [Twitter's official API page](https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets).

![](https://pbs.twimg.com/media/EGsWEYwX0AADYm8?format=jpg&name=large)

In return, the API replies to this request with a set of extracted tweets in **JSON** format.

JSON is short for JavaScript Object Notation, a way to store information in an organized, easy-to-access manner. In a nutshell, it gives us a human-readable collection of data that we can access in a logical manner. You may think of a JSON object as a generalized dictionary that is understood across various languages.

<div>
<img src="https://d2tlksottdg9m1.cloudfront.net/uploads/2019/02/JSONSample.jpg" alt="project-flow" width="500" height="400"/>
</div>
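As a quick illustration of the dictionary analogy, the sketch below parses a small JSON string with the standard `json` module and converts it back to text. The sample payload is invented and is not an actual API response.

```python
import json

# A made-up JSON document, similar in shape to what an API might return.
raw = (
    '{"user": "alice", "tweets": ['
    '{"id": 1, "text": "hello", "likes": 3}, '
    '{"id": 2, "text": "world", "likes": null}]}'
)

# json.loads maps JSON types to Python types:
# object -> dict, array -> list, null -> None
parsed = json.loads(raw)
first_text = parsed["tweets"][0]["text"]
missing_likes = parsed["tweets"][1]["likes"]
print(type(parsed).__name__, first_text, missing_likes)

# json.dumps goes the other way, from Python objects back to JSON text
round_trip = json.dumps(parsed, indent=2)
print(round_trip)
```

`json.loads` and `json.dumps` work on strings, while `json.load` and `json.dump` work on file objects.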




In [None]:
# the library to perform I/O operations in json format
import json
# smart path joining
from os.path import join
# pretty printing
from pprint import pprint

path = "/content/gdrive/My Drive/data"

In [None]:
filename = "quiz.json"

# retrieve the file object
with open(join(path, filename), "r") as f:
  # load the json object into a variable
  data = json.load(f)

# now, variable 'data' is just a dictionary object
print("Data Type: ", type(data))
pprint(data)

# accessing deeper levels
pprint(data["quiz"]["maths"]["q1"])

Data Type:  <class 'dict'>
{'quiz': {'maths': {'q1': {'answer': '12',
                           'options': ['10', '11', '12', '13'],
                           'question': '5 + 7 = ?'},
                    'q2': {'answer': '4',
                           'options': ['1', '2', '3', '4'],
                           'question': '12 - 8 = ?'}},
          'sport': {'q1': {'answer': 'Huston Rocket',
                           'options': ['New York Bulls',
                                       'Los Angeles Kings',
                                       'Golden State Warriros',
                                       'Huston Rocket'],
                           'question': 'Which one is correct team name in '
                                       'NBA?'}}}}
{'answer': '12', 'options': ['10', '11', '12', '13'], 'question': '5 + 7 = ?'}


## Goodreads: Collecting Popular Books!

[Goodreads](https://www.goodreads.com/) is a social cataloging website that allows individuals to freely search its database of books, annotations, and reviews.

<img src="https://pbs.twimg.com/media/EGxQ68EXYAIque2?format=png&name=small" alt="project-flow" width="500" height="500"/>

The figure above shows a snapshot from a Goodreads list named [Books That Everyone Should Read At Least Once](https://www.goodreads.com/list/show/264.Books_That_Everyone_Should_Read_At_Least_Once). Now, we are going to scrape the books listed on this page and create a dataframe in which each row represents a book with the following attributes.

---

- *rating*: the average rating on a 1-5 scale achieved by the book
- *review_count*: the number of Goodreads users who reviewed this book
- *booktype*: an internal Goodreads identifier for the book
- *author_url*: the Goodreads (relative) URL for the author of the book
- *rating_count*: the number of ratings for this book (this is different from the number of reviews)
- *name*: the name of the book
---




First of all, we need to understand how these books are placed in the DOM tree. The naive approach is to download the webpage as a regular _html_ file and locate the entries manually. However, this process would take a long time and impose a heavy cognitive load.

Luckily, modern browsers ship with a built-in inspection tool for analyzing webpages. For instance, if you press _F12_ or _Ctrl+Shift+I_, the screen below appears. Here, we are able to observe and control the DOM tree of the webpage.

<img src="https://www.maketecheasier.com/assets/uploads/2016/12/google-chrome-inspect-element-elements-tab-min.png" alt="project-flow" width="400" height="400"/>

In our case, the inspection tool can automatically locate the table in the webpage. The figure below shows the location of the selected book, which is highlighted with a blue rectangle. As we can see, the `table` tag with class `tableList` contains the entire list, and each entry in the table is stored within a `tr` tag.

![](https://pbs.twimg.com/media/EGxex--WwAIkfeX?format=jpg&name=small)

![](https://pbs.twimg.com/media/EGxjw-lW4AYjmeG?format=png&name=small)

At this point, we have located where our data is. Now, we need a set of tools to extract each entry in the table in an automated manner. But, before that, we need another mechanism to retrieve (download) the webpages for us.

### Requests: Making HTTP Requests!

This is the de facto standard library for making HTTP requests in Python. It abstracts the complexities of making requests behind a beautiful, simple API so that you can focus on interacting with services and consuming data in your application.

As mentioned before, the `GET` method indicates that you’re trying to get or retrieve data from a specified resource. To make a GET request with the `requests` library, just call `requests.get(url)` with `url` set to the target webpage.

In [None]:
# the library comes preinstalled on Colab
import requests

In [None]:
url = "http://www.google.com"
# making a GET request
res = requests.get(url)

In [None]:
# success code
res.status_code

200

In [None]:
# returns the raw HTML of the search page (as bytes)
res.content

b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content="Search the world\'s information, including webpages, images, videos and more. Google has many special features to help you find exactly what you\'re looking for." name="description"><meta content="noodp" name="robots"><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="IxN1ijBJ5VAz4OrEacTrmQ">(function(){var _g={kEI:\'Fuz1ZbedFbCMwbkPx5WoqAQ\',kEXPI:\'0,18168,1347300,206,4804,2329821,650,361,379728,35513,9286,24076,12027,17588,4998,23959,29334,2226,2872,2891,3926,4422,4012,58287,2403,15324,2025,1,16916,2652,4,62597,24052,6642,7596,1,42154,2,16395,342,23024,6700,31121,4569,6258,24670,33064,2,2,1,10957,15675,8155,23350,22436,9779,42459,20198,23165,13582,3801,2412,30219,3030,15816,1804,7734,6626,1,11471,21250,1632,8842,868,3786,42866,5203197,69

Now, we know how to retrieve the source of a webpage with the `requests` library. Whenever we download a page, we can pass its raw content to a parser of our choice. In this lecture, we are going to use a library named `Beautiful Soup` to parse and store the HTML files.

### Beautiful Soup: Parsing Structured Documents!

[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python library for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping.


Given an HTML file, our goal is to parse the content and store it in an easily accessible data structure. So, we'll store it as a document tree object. Whenever we pass HTML content to the Beautiful Soup parser as input, it returns the root of the resulting document tree.

``` py
# soup is the root of the DOM tree; the html tag is its top-level element
>>> soup = BeautifulSoup(page.content, 'html.parser')
```

![](https://dab1nmslvvntp.cloudfront.net/wp-content/uploads/2014/10/1413373269crp-1.png)

With the root of the tree in hand, we can extract various tags with different class or id values.

``` py
# returns the first p tag and its content
>>> soup.find("p")
# returns all p tags stored in a list
>>> soup.find_all("p")
# returns all p tags with class attribute set to "tableList"
>>> soup.find_all("p", {"class" : "tableList"})
```

One thing to note in these examples is that the returned values are still nodes of the tree. They can be queried further with the same syntax.

``` py
# returns the first p tag and its content
>>> first_p = soup.find("p")
# now, we can select the span tags placed only in first_p node
>>> first_p.find_all("span")
```
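The queries above can be tried on a small inline document before touching a real page. The sketch below is a minimal example; the HTML snippet and the name `demo_soup` are made up for illustration.

```python
from bs4 import BeautifulSoup

# A tiny made-up document standing in for a downloaded page.
html = """
<html><body>
  <p class="intro">First <span>one</span> and <span>two</span></p>
  <p class="tableList">Second</p>
  <p>Third <span>three</span></p>
</body></html>
"""

demo_soup = BeautifulSoup(html, "html.parser")

# first matching tag
first_p = demo_soup.find("p")
# every matching tag, in document order
all_p = demo_soup.find_all("p")
# filter by attribute values
table_list_p = demo_soup.find_all("p", {"class": "tableList"})
# a returned node can be queried again, restricted to its own subtree
spans_in_first = [span.get_text() for span in first_p.find_all("span")]

print(first_p.get("class"))  # class values come back as a list
print(len(all_p))
print(table_list_p[0].get_text())
print(spans_in_first)
```

Note that the nested query only sees the two `span` tags inside the first paragraph, not the one in the third.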

Now, our goal is to create the dataframe from a list of dictionaries. Each book entry will be converted into a dictionary and then stored as a row in the resulting dataframe.

![](https://pbpython.com/images/pandas-dataframe-shadow.png)

In [None]:
# importing the parser
from bs4 import BeautifulSoup
import re
import time
import pandas as pd
from os.path import join

Now, let's start extracting data from the reading list.

In [None]:
url = "https://www.goodreads.com/list/show/264.Books_That_Everyone_Should_Read_At_Least_Once?page=1"
res = requests.get(url)

Here, we have the response object. We can check whether the request was successful with the status code.

In [None]:
# success
res.status_code

200
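Status codes other than 200 are worth recognizing as well. The sketch below uses the standard library's `http.HTTPStatus` to turn a numeric code into a short description; the helper name `describe_status` is hypothetical and not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable summary for an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes it does not know
    if 200 <= code < 300:
        family = "success"
    elif 300 <= code < 400:
        family = "redirection"
    elif 400 <= code < 500:
        family = "client error"
    elif 500 <= code < 600:
        family = "server error"
    else:
        family = "informational"
    return f"{code} {status.phrase} ({family})"


for code in (200, 404, 503):
    print(describe_status(code))
```

With `requests`, calling `res.raise_for_status()` on a response raises an exception for 4xx and 5xx codes, which is a convenient way to stop early instead of parsing an error page.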

To parse the document into a tree, we need to obtain its content and provide it as an input to Beautiful Soup.

In [None]:
soup = BeautifulSoup(res.content, "html.parser")

In [None]:
type(soup)

Now, to get the table where our data is stored, we need to find it with its class value.

In [None]:
# tag attributes are passed as a dict object
table = soup.find("table", {"class": "tableList"})

We know that book entries are stored in `tr` tags in which each `td` tag contains a property of the selected book, such as title and rating.

In order to retrieve each book entry, we need to find all `tr` tags stored in the table. Then, to extract the book properties, we need to select the third `td`.

In [None]:
entries = table.find_all("tr")

for entry in entries:

  # find the third td
  properties = entry.find_all("td")[2]

  print(properties)

  # not to print all of them
  break

<td valign="top" width="100%">
<a class="bookTitle" href="/book/show/2657.To_Kill_a_Mockingbird" itemprop="url">
<span aria-level="4" itemprop="name" role="heading">To Kill a Mockingbird</span>
</a> <br>
<span class="by">by</span>
<span itemprop="author" itemscope="" itemtype="http://schema.org/Person">
<div class="authorName__container">
<a class="authorName" href="https://www.goodreads.com/author/show/1825.Harper_Lee" itemprop="url"><span itemprop="name">Harper Lee</span></a>
</div>
</span>
<br/>
<div>
<span class="greyText smallText uitext">
<span class="minirating"><span class="stars staticStars notranslate"><span class="staticStar p10" size="12x12"></span><span class="staticStar p10" size="12x12"></span><span class="staticStar p10" size="12x12"></span><span class="staticStar p10" size="12x12"></span><span class="staticStar p3" size="12x12"></span></span> 4.26 avg rating — 6,089,980 ratings</span>
</span>
</div>
<div style="margin-top: 5px">
<span class="smallText uitext">
<a href=

Everything we need for a particular book is up there!

For text values such as title and author name, all we have to do is find the containing tag and then call the `get_text` function on it. For instance, to get the book title, we can execute the cell below.

In [None]:
entries = table.find_all("tr")

for entry in entries:

  # find the third td
  properties = entry.find_all("td")[2]

  print(properties.find("a", {"class": "bookTitle"}).get_text())

  # not to print all of them
  break


To Kill a Mockingbird



Extracting the text values seems easy. However, the table also contains numeric values of varying lengths. We could certainly obtain these values with complex string operations, but we can achieve the same, or even better, results with single-line regex rules.

#### Regex

A regular expression (regex or regexp for short) is a special text string for describing a search pattern. For instance, the following regex rule helps us extract email addresses from a text source.

> [A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}

It looks quite complicated, but in time, you'll get used to the syntax. Below, you may find some of the available characters that help you build regex rules.

![](https://www.optimizesmart.com/wp-content/uploads/2010/06/regex-cheatsheet-for-Google-Analytics1.jpg)

In Python, the `re` library provides regex operations.

In [None]:
# regex library
import re

In [None]:
# character "a" at the start of the line
# then, any 3 characters except the new line
# character "s" at the end of the line
pattern = '^a...s$'
test_string = 'abyss'
result = re.match(pattern, test_string)
result

<re.Match object; span=(0, 5), match='abyss'>

In [None]:
sentence = '\n and \r are escape sequences .'
# 1 or more word characters (letters, digits, underscore), i.e., words
pattern = r'\w+'
result = re.findall(pattern, sentence)
# matching substrings in list
result

['and', 'are', 'escape', 'sequences']
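The email pattern quoted earlier can be tried the same way. One detail: the pattern lists only uppercase letters, so this sketch assumes the `re.IGNORECASE` flag is wanted to match ordinary lowercase addresses. The sample text is made up.

```python
import re

# The email pattern from the text; it lists only uppercase letters,
# so re.IGNORECASE is added to match lowercase addresses as well.
EMAIL_RE = re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}", re.IGNORECASE)

# Made-up sample text.
sample_text = "Contact alice@example.com or Bob.Smith@mail.example.org for details."

# findall returns every non-overlapping match as a list of strings
emails = EMAIL_RE.findall(sample_text)
print(emails)
```

Compiling the pattern once with `re.compile` is optional, but it keeps the flag next to the pattern and lets the same object be reused.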

### Parsing Cont'd

Now, back to Goodreads.

Let's write a regex rule to directly obtain the ratings from a text. First, we need to locate the text, of course.

In [None]:
entries = table.find_all("tr")

for entry in entries:

  # find the third td
  properties = entry.find_all("td")[2]

  print("raw text")
  text = properties.find("span", {"class": "minirating"}).get_text()
  print(text)

  print("result")
  # \d for digits
  # parentheses create capturing groups
  # (\d+\.+\d+) -> one or more digits, then one or more dots, then one or more digits
  # .+? -> . (dot) represents any character except new line,
  # + means one or more characters, and the trailing ? makes the match lazy
  # ((\d+,?)+\d) -> (\d+,?)+ one or more digits followed by zero or one comma
  # with the outer plus, the previous pattern is repeated at least once
  # and lastly, \d is there to make sure that the value ends with a digit, not a comma
  # the inner group (\d+,?) is captured too; it keeps only its last repetition
  search_res = re.search(r"(\d+\.+\d+).+?((\d+,?)+\d)", text).groups()
  # .groups() returns the substrings captured by the parenthesized groups
  print(search_res)


  # not to print all of them
  break

raw text
 4.26 avg rating — 6,089,980 ratings
result
('4.26', '6,089,980', '98')
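The third element, `'98'`, comes from the inner group `(\d+,?)`, which is captured as a side effect. A possible variation, sketched below, uses a non-capturing group `(?:...)` and named groups `(?P<name>...)` so that only the two values of interest are returned, then converts them to numbers. The pattern and the names `NAMED_RATING_RE`, `avg_value`, and `count_value` are illustrative choices, not part of the original notebook.

```python
import re

# Raw text copied from the output above (\u2014 is the long dash in the middle).
rating_text = " 4.26 avg rating \u2014 6,089,980 ratings"

# (?P<name>...) names a group; (?:...) groups without capturing,
# so the repeated ",ddd" part no longer appears as an extra result.
NAMED_RATING_RE = re.compile(r"(?P<avg>\d+\.\d+).+?(?P<count>\d+(?:,\d+)*)")

match = NAMED_RATING_RE.search(rating_text)
print(match.groupdict())

# convert the captured strings into numbers
avg_value = float(match.group("avg"))
count_value = int(match.group("count").replace(",", ""))
print(avg_value, count_value)
```

Named groups make later code less fragile than positional indexing such as `search_res[0]`, since adding a group does not shift the others.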


### Extract All Pages

At this point, we can merge all the extracted properties and form a dataframe. We just repeat the whole process for each page of the list.
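When the same extraction runs over many rows, a single unexpected row (for example, one with fewer than three cells) would raise an error and stop the whole run. One way to guard against that, sketched here under assumptions, is to wrap the per-row logic in a function that returns `None` for rows it cannot parse. The function `parse_entry`, the pattern `ROW_RATING_RE`, and the sample markup (a trimmed copy of the cell printed earlier plus an invented short row) are all hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Trimmed markup modeled on the cell printed earlier,
# plus a made-up row with too few cells to exercise the guard.
sample = """
<table class="tableList">
  <tr>
    <td>1</td><td></td>
    <td>
      <a class="bookTitle" href="/book/show/2657.To_Kill_a_Mockingbird">
        <span>To Kill a Mockingbird</span>
      </a>
      <a class="authorName" href="https://www.goodreads.com/author/show/1825.Harper_Lee"><span>Harper Lee</span></a>
      <span class="minirating"> 4.26 avg rating &mdash; 6,089,980 ratings</span>
    </td>
  </tr>
  <tr><td>unexpected row</td></tr>
</table>
"""

ROW_RATING_RE = re.compile(r"(\d+\.\d+) avg rating\D+([\d,]+) ratings")


def parse_entry(entry):
    """Return a dict for one <tr>, or None if the row lacks the expected parts."""
    cells = entry.find_all("td")
    if len(cells) < 3:
        return None
    info = cells[2]
    title = info.find("a", {"class": "bookTitle"})
    author = info.find("a", {"class": "authorName"})
    rating = info.find("span", {"class": "minirating"})
    if title is None or author is None or rating is None:
        return None
    match = ROW_RATING_RE.search(rating.get_text())
    if match is None:
        return None
    return {
        "book_url": title.get("href"),
        "book_title": title.get_text(strip=True),
        "author_name": author.get_text(strip=True),
        "avg_rating": float(match.group(1)),
        "num_ratings": int(match.group(2).replace(",", "")),
    }


sample_table = BeautifulSoup(sample, "html.parser").find("table", {"class": "tableList"})
parsed_entries = [parse_entry(tr) for tr in sample_table.find_all("tr")]
sample_rows = [row for row in parsed_entries if row is not None]
print(sample_rows)
```

The second row is skipped rather than crashing the run, and `get_text(strip=True)` removes the surrounding whitespace from the title.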

In [None]:
# base url for the website
# at the end of the url we have curly brackets
# to fill with page numbers
base_url = "https://www.goodreads.com/list/show/264.Books_That_Everyone_Should_Read_At_Least_Once?page={}"
# we'll retrieve 10 pages
num_pages = 10

In [None]:
# list of dictionaries
rows = []

for page in range(1, num_pages+1):
  # retrieve the web page
  response = requests.get(base_url.format(page))
  # parse the html content
  soup = BeautifulSoup(response.content, 'html.parser')

  # find the table
  table = soup.find("table", {"class": "tableList"})
  # retrieve all book entries
  book_entries = table.find_all("tr")

  for entry in book_entries:
    # index 2 corresponds to info section
    book_info = entry.find_all("td")[2]

    # select the book title link
    title_sec = book_info.find("a", {"class": "bookTitle"})
    book_url = title_sec.get("href")
    book_title = title_sec.get_text()

    # author info section
    author_sec = book_info.find("a", {"class": "authorName"})
    author_url = author_sec.get("href")
    author_name = author_sec.get_text()

    # ratings with regex selection
    ratings = book_info.find("span", {"class": "minirating"}).get_text()
    search_res = re.search(r"(\d+\.+\d+).+?((\d+,?)+\d)", ratings).groups()
    avg_rating = search_res[0]
    num_ratings = search_res[1]

    # same idea with re.findall: each match is a tuple of groups, so [0] is the full number
    scores = book_info.find_all("div")[-1].get_text()
    score_dist = re.findall(r"((\d+,?)+\d)", scores)
    score = score_dist[0][0]
    score_users = score_dist[1][0]

    # create the row as a dict
    row = {
        "book_url": book_url,
        "book_title": book_title,
        "author_url": author_url,
        "author_name": author_name,
        "avg_rating": avg_rating,
        "num_ratings": num_ratings,
        "score": score,
        "score_users": score_users
    }

    # appending to the list
    rows.append(row)

  print("page {} is parsed!".format(page))

  # idle time between requests
  time.sleep(1)

page 1 is parsed!
page 2 is parsed!
page 3 is parsed!
page 4 is parsed!
page 5 is parsed!
page 6 is parsed!
page 7 is parsed!
page 8 is parsed!
page 9 is parsed!
page 10 is parsed!
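The loop above pauses for a second between requests but stops at the first failed download. A possible extension, sketched below, retries a failing fetch a few times with a growing delay. The helper `fetch_with_retry` is hypothetical; it accepts any callable, so a fake downloader stands in for the network here.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url) until it succeeds or the attempts run out.

    `fetch` is any callable that returns a response or raises on failure.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # deliberately broad for this sketch
            last_error = error
            if attempt < retries:
                # wait a little longer after each failed attempt
                time.sleep(delay * attempt)
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error


# Stand-in for a flaky download: fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return f"content of {url}"


page_html = fetch_with_retry(flaky_fetch, "https://example.com/page/1", delay=0.01)
print(page_html, calls["count"])
```

In the notebook, the callable could be something like `lambda u: requests.get(u, timeout=10)`. Keep in mind that `requests.get` raises only on connection problems, so HTTP error codes would still need a status check.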


In [None]:
df = pd.DataFrame(rows)
df.head()

Unnamed: 0,book_url,book_title,author_url,author_name,avg_rating,num_ratings,score,score_users
0,/book/show/2657.To_Kill_a_Mockingbird,\nTo Kill a Mockingbird\n,https://www.goodreads.com/author/show/1825.Har...,Harper Lee,4.26,6089980,2428585,24617
1,/book/show/1885.Pride_and_Prejudice,\nPride and Prejudice\n,https://www.goodreads.com/author/show/1265.Jan...,Jane Austen,4.29,4246751,1511539,15479
2,/book/show/48855.The_Diary_of_a_Young_Girl,\nThe Diary of a Young Girl\n,https://www.goodreads.com/author/show/3720.Ann...,Anne Frank,4.19,3712297,1441732,14764
3,/book/show/3.Harry_Potter_and_the_Sorcerer_s_S...,\nHarry Potter and the Sorcerer's Stone (Harry...,https://www.goodreads.com/author/show/1077326....,J.K. Rowling,4.47,9994242,1132411,11572
4,/book/show/170448.Animal_Farm,\nAnimal Farm\n,https://www.goodreads.com/author/show/3706.Geo...,George Orwell,3.99,3859504,1086840,11258
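In the preview, the titles still carry surrounding newlines, and `book_url` is relative while `author_url` is absolute. A small cleanup step, sketched below, could normalize each row before the dataframe is built. The helper `clean_row` and the constant `SITE` are hypothetical names; the sample row reuses values from the preview.

```python
from urllib.parse import urljoin

SITE = "https://www.goodreads.com"


def clean_row(row):
    """Return a copy of the row with a stripped title and an absolute book URL."""
    cleaned = dict(row)
    cleaned["book_title"] = row["book_title"].strip()
    # urljoin resolves the relative path against the site root
    cleaned["book_url"] = urljoin(SITE, row["book_url"])
    return cleaned


sample_row = {
    "book_url": "/book/show/2657.To_Kill_a_Mockingbird",
    "book_title": "\nTo Kill a Mockingbird\n",
}
cleaned_row = clean_row(sample_row)
print(cleaned_row)
```

Applied to the scraped data, this would be `pd.DataFrame([clean_row(r) for r in rows])`.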


In [None]:
path = "/content/gdrive/My Drive/data"
df.to_csv(join(path,"books.csv"), index=False)