# Data Acquisition

### Use Markdown

For tips on text formatting in Markdown, refer to the official [Markdown Guide](https://www.markdownguide.org/basic-syntax).

### All Python Imports should be Sorted

This is a [best practice](https://www.python.org/dev/peps/pep-0008/#imports). Imports should be grouped in the following order:
- Python standard library
  - `os`
  - `json`
  - `datetime`
- third party libraries
  - `numpy`
  - `pandas`
  - `keras`
  - `pyspark`
- your custom library (see [the best practice](https://www.python.org/dev/peps/pep-0008/#package-and-module-names) for how to pick a name for custom Python modules)
  - `from src.load_data import get_data`

In [2]:
# Python standard library
import os
from glob import glob
from io import BytesIO
from time import sleep
from urllib.request import urlopen
from zipfile import ZipFile

# Third party libraries
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

### Use HTML to create a table of contents at the start of the notebook

**Benefits**

1. guides the audience when navigating through the notebook
2. overview of notebook objectives through [**Level 2** headings](https://www.markdownguide.org/basic-syntax/#headings)
   - go directly to a section or sub-section
   - do not need to read the whole notebook

**Usage**

Place an [HTML anchor](https://www.w3schools.com/tags/tag_a.asp) `<a>` before each heading. The value of its `id` attribute is then referenced by links elsewhere in the notebook.
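
Writing the anchors and links by hand is error-prone, since each `id` must match its link exactly. The following is a minimal sketch of generating both from a heading title. The helper names (`slugify`, `anchor_tag`, `toc_entry`) are hypothetical, and the slug rule (lowercase words joined by hyphens) is an assumption chosen to match the `id` values used in this notebook.

```python
import re


def slugify(title):
    """Lowercase the title and join its alphanumeric words with hyphens."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words)


def anchor_tag(title):
    """Build the HTML anchor to place just before the heading."""
    return f"<a id='{slugify(title)}'></a>"


def toc_entry(title):
    """Build the Markdown link that points at the anchor."""
    return f"[{title}](#{slugify(title)})"


print(anchor_tag("Define Web Driver"))
print(toc_entry("Define Web Driver"))
```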

<a id='table-of-contents'></a>

## [Table of Contents](#table-of-contents)

1. [About](#about)
2. [User Inputs](#user-inputs)
3. [Define Web Driver](#define-web-driver)
4. [Get Data](#get-data)
   - 4.1. [Get URLs to download data files](#get-urls-to-download-data-files)
   - 4.2. [Download Bikeshare Ridership Data Files](#download-bikeshare-ridership-data-files)
5. [Close Web Driver](#close-web-driver)

### Introduction / Overview

This should be an overview of what the notebook covers. Everything here should guide the user to what they will encounter as they continue reading the notebook.

Some examples
- important pre-requisites can be included here
- significant project-level assumptions should be included here only if they impact the code in **this notebook only**
  - if there are assumptions that impact all notebooks, then they should be documented in the `README.md` file for the project

<a id='about'></a>

## 1. About

This notebook will cover bikeshare ridership data collection from the [Capital Bikeshare system data page](https://www.capitalbikeshare.com/system-data).

**Prerequisites**
1. This notebook was run with
   - Google Chrome web browser (version 97.0.4692.71)
   - ChromeDriver 97.0.4692.71 from the [ChromeDriver downloads section](https://chromedriver.chromium.org/downloads)

### Input (Python) Variables

These are Python variables that won't be changed later in this notebook.

**Tip**

After this cell is changed, we should be able to run all the following cells of this notebook with no errors.
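
One way to support this tip is to check the inputs right after they are set, so that a bad value fails early instead of several cells later. The sketch below is only an illustration: `check_inputs` is a hypothetical helper, and the two checks (the driver file exists, the year is a four-digit string) are assumptions about what is worth validating.

```python
import os
import re


def check_inputs(webdriver_path, year):
    """Return a list of problems found in the user inputs (empty if none)."""
    problems = []
    if not os.path.isfile(webdriver_path):
        problems.append(f"webdriver not found: {webdriver_path}")
    if not re.fullmatch(r"\d{4}", year):
        problems.append(f"year must be a four-digit string, got {year!r}")
    return problems


# Example with deliberately bad inputs
for problem in check_inputs("/no/such/chromedriver", "21"):
    print(problem)
```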

<a id='user-inputs'></a>

## 2. User Inputs

In [3]:
# Get user name (USERNAME is not always set on Linux, so fall back to USER)
user_name = os.getenv("USERNAME") or os.getenv("USER")

# Link to data files
system_data_url = "https://s3.amazonaws.com/capitalbikeshare-data/index.html"

# Path to the Chrome webdriver on local system
webdriver_path = f"/home/{user_name}/chromedriver_linux64/chromedriver"

# Year of data to use
years_to_use_str = "2021"

In [4]:
# Create ChromeDriver service object
webdriver_service_object = Service(webdriver_path)
chrome_webdriver_options = Options()

### Workflow Pre-Requisites

These are Python variables that use the input variables specified immediately above.

<a id='define-web-driver'></a>

## 3. Define Web Driver

In [5]:
# Create instance of Chrome webdriver
driver = webdriver.Chrome(
    service=webdriver_service_object, options=chrome_webdriver_options
)

### Workflow

This will be the main source code of this notebook.

<a id='get-data'></a>

## 4. Get Data

<a id='get-urls-to-download-data-files'></a>

### 4.1. Get URLs to download data files

Retrieve HTML from the system data webpage

In [6]:
%%time
# Get source HTML from page
driver.get(system_data_url)

CPU times: user 0 ns, sys: 3.44 ms, total: 3.44 ms
Wall time: 362 ms


Wait for HTML on page to be loaded

In [7]:
# simple wait is sufficient; other approaches are possible too
sleep(1)
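
A fixed `sleep` can be too short on a slow connection and wasteful on a fast one. Selenium ships `WebDriverWait` for this purpose; the plain-Python sketch below only illustrates the underlying idea of polling a condition until it holds or a timeout expires. `wait_until` and `table_is_loaded` are hypothetical names, and the condition here is a stand-in rather than a real page check.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it is truthy or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll_interval)


# Stand-in for a page check: becomes true on the third call
calls = {"count": 0}


def table_is_loaded():
    calls["count"] += 1
    return calls["count"] >= 3


wait_until(table_is_loaded, timeout=2.0, poll_interval=0.01)
print(calls["count"])
```

In this notebook, the condition could be something like `lambda: driver.find_elements(By.CSS_SELECTOR, "table tbody tr")`, since `find_elements` returns an empty (falsy) list until matching rows exist. The selector is an assumption about the page.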

Get table object and the table headers

In [8]:
%%time
# Get table object
# (an XPath starting with "//" searches the whole page, not only `container`)
container = driver.find_element(By.XPATH, './/div[@class="container"]')
table_id = container.find_element(
    By.XPATH, "//table[@class='hide-while-loading table table-striped']/tbody"
)

# Get text from table headers
header = container.find_element(
    By.XPATH, "//table[@class='hide-while-loading table table-striped']/thead"
)
headers = [h.text for h in header.find_elements(By.CSS_SELECTOR, "th")]

CPU times: user 3.1 ms, sys: 3.89 ms, total: 6.99 ms
Wall time: 65.1 ms


Extract a `DataFrame` of all available metadata from the table object, including a column with the URLs to the data to be downloaded

In [9]:
%%time
# Extract all rows of metadata from the table
list_dfs_all_rows = []
# Iterate over rows
for row in table_id.find_elements(By.CSS_SELECTOR, "tr"):
    list_single_row = []
    zip_file_urls = []
    # Iterate over columns per row
    for col_idx, cell in enumerate(row.find_elements(By.TAG_NAME, "td")[:2]):
        # Get link to zip data file (in first column only)
        if col_idx == 0:
            data_zip_url = cell.find_element(By.CSS_SELECTOR, "a").get_attribute("href")
            zip_file_urls.append(data_zip_url)
        # Append contents of single row to empty list
        list_single_row.append(cell.text)
    # Get the data URLs from all rows except the last row
    if list_single_row[0] != "index.html":
        # Create single-row DataFrame from nested list
        df_single_row = pd.DataFrame.from_records(
            [dict(zip(headers, list_single_row))]
        )
        # Append column with zip data file links
        df_single_row["zip_file_url"] = zip_file_urls
        # Append single-row DataFrame to empty list
        list_dfs_all_rows.append(df_single_row)

# Combine list of single-row DataFrames into one DataFrame
df = pd.concat(list_dfs_all_rows, ignore_index=True)
df.head(3)

CPU times: user 208 ms, sys: 0 ns, total: 208 ms
Wall time: 2.19 s


|    | Name | Date Modified | zip_file_url |
|---:|:-----|:--------------|:-------------|
| 0 | 2010-capitalbikeshare-tripdata.zip | Mar 15th 2018, 06:33:31 pm | https://s3.amazonaws.com/capitalbikeshare-data... |
| 1 | 2011-capitalbikeshare-tripdata.zip | Mar 15th 2018, 06:45:30 pm | https://s3.amazonaws.com/capitalbikeshare-data... |
| 2 | 2012-capitalbikeshare-tripdata.zip | Mar 15th 2018, 06:55:27 pm | https://s3.amazonaws.com/capitalbikeshare-data... |
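
Building one single-row `DataFrame` per table row and concatenating them works, but a lighter alternative is to collect one dictionary per row and build the `DataFrame` once at the end. The sketch below shows only the record-building step in plain Python. `rows_to_records` is a hypothetical helper, the `example.com` URLs are placeholders, and the `index.html` row uses a made-up date.

```python
def rows_to_records(headers, rows, skip_name="index.html"):
    """Turn (cell_texts, url) pairs into one dictionary per table row."""
    records = []
    for cell_texts, url in rows:
        # Skip the row that links to the index page rather than a data file
        if cell_texts[0] == skip_name:
            continue
        record = dict(zip(headers, cell_texts))
        record["zip_file_url"] = url
        records.append(record)
    return records


headers = ["Name", "Date Modified"]
rows = [
    (
        ["2010-capitalbikeshare-tripdata.zip", "Mar 15th 2018, 06:33:31 pm"],
        "https://example.com/2010.zip",  # placeholder URL
    ),
    (
        ["index.html", "Jan 1st 2020, 12:00:00 am"],  # made-up sample row
        "https://example.com/index.html",
    ),
]
records = rows_to_records(headers, rows)
print(records)
```

Passing such a list to `pd.DataFrame(records)` creates the full table in a single call.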


Select data from the required year (2021)

In [10]:
# Filter by year; copy so that later column assignments avoid SettingWithCopyWarning
df = df[df["Name"].str[:4] == years_to_use_str].copy()
df.head(3)

|    | Name | Date Modified | zip_file_url |
|---:|:-----|:--------------|:-------------|
| 44 | 202101-capitalbikeshare-tripdata.zip | Feb 4th 2021, 04:55:29 pm | https://s3.amazonaws.com/capitalbikeshare-data... |
| 45 | 202102-capitalbikeshare-tripdata.zip | Mar 9th 2021, 07:07:41 pm | https://s3.amazonaws.com/capitalbikeshare-data... |
| 46 | 202103-capitalbikeshare-tripdata.zip | Apr 8th 2021, 10:31:40 am | https://s3.amazonaws.com/capitalbikeshare-data... |
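
The file names start with either a year (`2010-...`) or a year and month (`202101-...`), so the first four characters give the year in both cases. The sketch below shows the same filter in plain Python, extended to accept several years. `select_by_year` is a hypothetical helper name.

```python
def select_by_year(names, years):
    """Keep file names whose first four characters match one of the years."""
    wanted = set(years)
    return [name for name in names if name[:4] in wanted]


names = [
    "2010-capitalbikeshare-tripdata.zip",
    "202101-capitalbikeshare-tripdata.zip",
    "202102-capitalbikeshare-tripdata.zip",
]
print(select_by_year(names, ["2021"]))
```

With pandas, the equivalent for several years would be `df["Name"].str[:4].isin(years)`.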


Convert the `Date Modified` column to `datetime`

In [11]:
df["Date Modified"] = pd.to_datetime(df["Date Modified"])
df.head(3)

|    | Name | Date Modified | zip_file_url |
|---:|:-----|:--------------|:-------------|
| 44 | 202101-capitalbikeshare-tripdata.zip | 2021-02-04 16:55:29 | https://s3.amazonaws.com/capitalbikeshare-data... |
| 45 | 202102-capitalbikeshare-tripdata.zip | 2021-03-09 19:07:41 | https://s3.amazonaws.com/capitalbikeshare-data... |
| 46 | 202103-capitalbikeshare-tripdata.zip | 2021-04-08 10:31:40 | https://s3.amazonaws.com/capitalbikeshare-data... |
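
`pd.to_datetime` infers the format here. To make the parsing explicit, the ordinal suffix (`4th`, `1st`) can be removed and a fixed format applied. The following standard-library sketch shows one way to do that. `parse_date_modified` is a hypothetical helper, and it assumes an English locale for the month abbreviation and the am/pm marker.

```python
import re
from datetime import datetime


def parse_date_modified(text):
    """Parse strings such as 'Feb 4th 2021, 04:55:29 pm' into a datetime."""
    # Drop the ordinal suffix ("4th" -> "4") so strptime can read the day
    cleaned = re.sub(r"(\d)(st|nd|rd|th)\b", r"\1", text)
    return datetime.strptime(cleaned, "%b %d %Y, %I:%M:%S %p")


print(parse_date_modified("Feb 4th 2021, 04:55:29 pm"))
```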


**Notes**
1. The following columns of metadata have been extracted
   - `Name` (string)
     - name of the zip file containing the ridership data
   - `Date Modified` (datetime)
     - date on which the data (zip file) was posted to the public data portal
   - `zip_file_url` (string)
     - full path to download the zip file with data

<a id='download-bikeshare-ridership-data-files'></a>

### 4.2. Download Bikeshare Ridership Data Files

We will now download each of the data files listed in the `zip_file_url` column of the above `DataFrame` and extract their contents without saving the zip file to disk (see [this Stack Overflow answer](https://stackoverflow.com/a/65106410/4057186)).

In [12]:
%%time
for _, row in df.iterrows():
    # Get full url for the zip file from zip_file_url column
    zipurl = row["zip_file_url"]
    # Extract without saving
    with urlopen(zipurl) as zipresp:
        with ZipFile(BytesIO(zipresp.read())) as zfile:
            zfile.extractall("data/raw")
    print(
        f"Saved data from {zipurl.split('-', 1)[0]} to "
        f"data/raw/{os.path.basename(zipurl).replace('.zip', '.csv')}"
    )

Saved data from https://s3.amazonaws.com/capitalbikeshare to data/raw/202101-capitalbikeshare-tripdata.csv
Saved data from https://s3.amazonaws.com/capitalbikeshare to data/raw/202102-capitalbikeshare-tripdata.csv
Saved data from https://s3.amazonaws.com/capitalbikeshare to data/raw/202103-capitalbikeshare-tripdata.csv
Saved data from https://s3.amazonaws.com/capitalbikeshare to data/raw/202104-capitalbikeshare-tripdata.csv
Saved data from https://s3.amazonaws.com/capitalbikeshare to data/raw/202105-capitalbikeshare-tripdata.csv
Saved data from https://s3.amazonaws.com/capitalbikeshare to data/raw/202106-capitalbikeshare-tripdata.csv
Saved data from https://s3.amazonaws.com/capitalbikeshare to data/raw/202107-capitalbikeshare-tripdata.csv
Saved data from https://s3.amazonaws.com/capitalbikeshare to data/raw/202108-capitalbikeshare-tripdata.csv
Saved data from https://s3.amazonaws.com/capitalbikeshare to data/raw/202109-capitalbikeshare-tripdata.csv
Saved data from https://s3.amazonaws.
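
The in-memory extraction pattern can be tried without any network access. The sketch below builds a small archive in memory to stand in for a downloaded file, then extracts it the same way. `extract_zip_bytes`, the sample file name, and its column names are hypothetical.

```python
import os
import tempfile
from io import BytesIO
from zipfile import ZipFile


def extract_zip_bytes(payload, target_dir):
    """Extract a zip archive held in memory and return the member names."""
    with ZipFile(BytesIO(payload)) as zfile:
        zfile.extractall(target_dir)
        return zfile.namelist()


# Build a small archive in memory to stand in for a downloaded file
buffer = BytesIO()
with ZipFile(buffer, "w") as zfile:
    zfile.writestr("sample-tripdata.csv", "ride_id,started_at\n")

with tempfile.TemporaryDirectory() as tmp_dir:
    extracted = extract_zip_bytes(buffer.getvalue(), tmp_dir)
    extracted_exists = os.path.isfile(os.path.join(tmp_dir, extracted[0]))

print(extracted, extracted_exists)
```

Returning `namelist()` makes it possible to report the files that were actually extracted, rather than assuming each archive holds a single CSV named after the zip file.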

<a id='close-web-driver'></a>

## 5. Close Web Driver

In [13]:
# Close the browser window
driver.quit()
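
If an earlier cell raises an error, this cell never runs and the browser stays open. In a script, one option is to wrap the driver in a context manager so that `quit()` is always called. The sketch below is an illustration only: `managed_driver` is a hypothetical helper, and `FakeDriver` stands in for the real Chrome driver so the example runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(create_driver):
    """Yield a driver and always call its quit() method afterwards."""
    driver = create_driver()
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver:
    """Stand-in for a browser driver so the sketch runs without Chrome."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


fake = FakeDriver()
try:
    with managed_driver(lambda: fake) as active_driver:
        raise RuntimeError("scraping failed")
except RuntimeError:
    pass

# quit() was called even though the body raised an error
print(fake.closed)
```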

### Use a Footer to Direct Users to the Previous / Next notebook

---

<span style="float:left;">
    <a href="./01_dont_do_this.ipynb">&lt;&lt; 01_dont_do_this.ipynb</a>
</span>

<span style="float:right;">
    &#169; 2022 | <a href="https://github.com/elsdes3/documentation-tips">@elsdes3</a> (MIT)
</span>