# Part 1: Reading Datasets 

This is part 1 of our 3-part series on retrieving, analyzing and visualizing georeferenced earthquake data. We start from the list of available datasets published by [Rdatasets](https://vincentarelbundock.github.io/Rdatasets/). We will download this list, which is in CSV format, and read it. Then we will search the HTML documentation page of each dataset for the words `latitude` or `longitude` [(web crawling)](https://en.wikipedia.org/wiki/Web_crawler). Finally, from the resulting list, we will select the dataset that will be used to create the database and the map.

First, we will import all the required Python libraries:

In [1]:
import requests
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
import re
import itertools

## Getting the collection of datasets in CSV format

In this step, we will use the URL provided by [Rdatasets](https://vincentarelbundock.github.io/Rdatasets/) to download a CSV file that lists over 1300 datasets and contains, among other features, the CSV data URL and the documentation (HTML) URL of each one.

In [2]:
!wget 'http://vincentarelbundock.github.com/Rdatasets/datasets.csv'

--2020-03-07 20:53:45--  http://vincentarelbundock.github.com/Rdatasets/datasets.csv
Resolving vincentarelbundock.github.com (vincentarelbundock.github.com)... 185.199.108.153, 185.199.109.153, 185.199.111.153, ...
Connecting to vincentarelbundock.github.com (vincentarelbundock.github.com)|185.199.108.153|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://vincentarelbundock.github.io/Rdatasets/datasets.csv [following]
--2020-03-07 20:53:46--  http://vincentarelbundock.github.io/Rdatasets/datasets.csv
Resolving vincentarelbundock.github.io (vincentarelbundock.github.io)... 185.199.111.153, 185.199.108.153, 185.199.110.153, ...
Reusing existing connection to vincentarelbundock.github.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 329694 (322K) [text/csv]
Saving to: ‘datasets.csv’


2020-03-07 20:53:46 (5.45 MB/s) - ‘datasets.csv’ saved [329694/329694]



### Exploring the collection of datasets

Let's explore the first three rows of the collection dataset. We can see many column names, but we are interested in the `CSV` and `Doc` columns. We can also see that there are 1340 datasets in the collection.

In [3]:
# print the first three rows (the header plus two datasets)
with open('datasets.csv', 'r', newline='') as file:
    csvReader1 = csv.reader(file)
    for row in itertools.islice(csvReader1, 3):
        print(row)

['Package', 'Item', 'Title', 'Rows', 'Cols', 'n_binary', 'n_character', 'n_factor', 'n_logical', 'n_numeric', 'CSV', 'Doc']
['boot', 'acme', 'Monthly Excess Returns', '60', '3', '0', '1', '0', '0', '2', 'https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/boot/acme.csv', 'https://raw.github.com/vincentarelbundock/Rdatasets/master/doc/boot/acme.html']
['boot', 'aids', 'Delay in AIDS Reporting in England and Wales', '570', '6', '1', '0', '0', '0', '6', 'https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/boot/aids.csv', 'https://raw.github.com/vincentarelbundock/Rdatasets/master/doc/boot/aids.html']


In [4]:
# get the number of datasets in the file
with open("datasets.csv") as file:
    numline = len(file.readlines())
print(numline - 1)  # minus the header

1340
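
Since only a few columns matter here, a minimal sketch of reading them by name with `csv.DictReader` may help. The helper name `select_columns` is hypothetical, and the in-memory sample simply reproduces the header and the two rows printed above so that the example runs without the downloaded file.

```python
import csv
import io

# Header and two data rows as printed above, reproduced as an in-memory file
sample = io.StringIO(
    "Package,Item,Title,Rows,Cols,n_binary,n_character,n_factor,"
    "n_logical,n_numeric,CSV,Doc\n"
    "boot,acme,Monthly Excess Returns,60,3,0,1,0,0,2,"
    "https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/boot/acme.csv,"
    "https://raw.github.com/vincentarelbundock/Rdatasets/master/doc/boot/acme.html\n"
    "boot,aids,Delay in AIDS Reporting in England and Wales,570,6,1,0,0,0,6,"
    "https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/boot/aids.csv,"
    "https://raw.github.com/vincentarelbundock/Rdatasets/master/doc/boot/aids.html\n"
)


def select_columns(handle, columns=("Item", "CSV", "Doc")):
    """Return one dict per data row, keeping only the requested columns."""
    reader = csv.DictReader(handle)
    return [{name: row[name] for name in columns} for row in reader]


links = select_columns(sample)
for entry in links:
    print(entry["Item"], entry["Doc"])
```

The same helper should work on the real catalogue by passing a handle opened with `open('datasets.csv', newline='')` instead of the sample.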


## Processing the collection of datasets

Next, we will look for the word latitude or longitude in the HTML documentation of each dataset. To do that, we will create empty lists to store the intermediate and final results. Then we will open the collection file again and append every entry of the `Doc` column, which holds the links to the HTML documents, to a list. Finally, we will perform the web crawling itself.
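
To make the matching rule concrete, here is a small sketch of the regular expression used in the crawling cell, applied to a few made-up strings (the sample strings are illustrative, not taken from any dataset page). It assumes the pattern is applied with search semantics, as `re.search` does. In verbose mode (`re.X`) the spaces around `|` are ignored, and `\b` restricts the match to whole words.

```python
import re

# Same pattern as in the crawling cell: case-insensitive, verbose mode
pattern = re.compile(r'\blatitude\b | \blongitude\b', flags=re.I | re.X)

# Hypothetical sample strings, for illustration only
samples = [
    "Latitude of the event",      # whole word, different case
    "numeric LONGITUDE",          # whole word, upper case
    "latitudes of the stations",  # plural: no word boundary after "latitude"
    "Station identifiers",        # neither word present
]

matches = [bool(pattern.search(text)) for text in samples]
print(matches)  # [True, True, False, False]
```

Note that plural forms such as "latitudes" are not matched, so a page that only uses the plural would be missed by this search.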

In [5]:
# create list to store results
url_list=[]
sel_list = []
sel_list2 = []

Be patient: the crawling process takes time because every documentation page has to be downloaded.

In [6]:
print('Working...\n')
print()

# open the collection dataset
with open('datasets.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    # append in a list all the content of the Doc column
    for row in reader:
        url_list.append(row['Doc'])

# whole words "latitude" or "longitude", case-insensitive;
# in verbose mode (re.X) the spaces around "|" are ignored
pattern = re.compile(r'\blatitude\b | \blongitude\b', flags=re.I | re.X)

for url in url_list:
    # download each page once and skip the ones that cannot be retrieved
    r = requests.get(url)
    if r.status_code != 200:
        continue
    soup = BeautifulSoup(r.text, "html.parser")
    # a single search per page is enough: keep the URL on the first match
    if soup.find(string=pattern):
        sel_list.append(url)

# remove possible duplicates while preserving the original order
for item2 in sel_list:
    if item2 not in sel_list2:
        sel_list2.append(item2)
print('List of datasets containing the terms latitude or longitude: \n', sel_list2)

Working...


List of datasets containing the terms latitude or longitude: 
 ['https://raw.github.com/vincentarelbundock/Rdatasets/master/doc/boot/polar.html', 'https://raw.github.com/vincentarelbundock/Rdatasets/master/doc/carData/Depredations.html', 'https://raw.github.com/vincentarelbundock/Rdatasets/master/doc/carData/MplsStops.html', 'https://raw.github.com/vincentarelbundock/Rdatasets/master/doc/DAAG/aulatlong.html', 'https://raw.github.com/vincentarelbundock/Rdatasets/master/doc/DAAG/dengue.html', 'https://raw.github.com/vincentarelbundock/Rdatasets/master/doc/DAAG/leafshape.html', 'https://raw.github.com/vincentarelbundock/Rdatasets/master/doc/DAAG/leafshape17.html', 'https://raw.github.com/vincentarelbundock/Rdatasets/master/doc/DAAG/possumsites.html', 'https://raw.github.com/vincentarelbundock/Rdatasets/master/doc/datasets/quakes.html', 'https://raw.github.com/vincentarelbundock/Rdatasets/master/doc/HistData/Langren.all.html', 'https://raw.github.com/vincentarelbundock/Rdatase
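
The final loop of the cell above removes repeated links while keeping their order. As an alternative sketch, the same result can be obtained with `dict.fromkeys`, since dictionaries preserve insertion order in Python 3.7 and later. The function name `unique_in_order` and the shortened sample links are hypothetical.

```python
def unique_in_order(items):
    """Drop repeated entries while keeping the first occurrence of each."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(items))


# Hypothetical list with repeated links, for illustration only
hits = ["polar.html", "polar.html", "quakes.html", "polar.html", "dengue.html"]
unique_hits = unique_in_order(hits)
print(unique_hits)  # ['polar.html', 'quakes.html', 'dengue.html']
```

Keeping the order matters here because the dataset is later picked by its position in the list.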

## Selecting a dataset

Out of 1340 datasets, we obtained 20 HTML links whose pages contain the word latitude and/or longitude. Next, we will select one dataset for the following steps.

In [7]:
print('The selected dataset is: ', sel_list2[8])

The selected dataset is:  https://raw.github.com/vincentarelbundock/Rdatasets/master/doc/datasets/quakes.html
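
The selected link points to the documentation page, while the data itself is listed in the `CSV` column of the same catalogue row. One possible way to recover it, sketched below, is to look up the row whose `Doc` value equals the selected link. The function `csv_link_for` is a hypothetical helper, and the two sample rows are the ones printed earlier in this notebook, reduced to the relevant columns.

```python
def csv_link_for(doc_url, rows):
    """Return the CSV link of the row whose Doc link equals doc_url, or None."""
    for row in rows:
        if row["Doc"] == doc_url:
            return row["CSV"]
    return None


base = "https://raw.github.com/vincentarelbundock/Rdatasets/master"

# Two catalogue rows shown earlier, reduced to the CSV and Doc columns
rows = [
    {"CSV": base + "/csv/boot/acme.csv", "Doc": base + "/doc/boot/acme.html"},
    {"CSV": base + "/csv/boot/aids.csv", "Doc": base + "/doc/boot/aids.html"},
]

found = csv_link_for(base + "/doc/boot/aids.html", rows)
print(found)
```

With the full catalogue, `rows` could come from `csv.DictReader` over `datasets.csv`, and `doc_url` would be `sel_list2[8]`.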


***Execute `db_earthquakes.ipynb` to create a database from the selected dataset and perform some spatial analysis.***