# Data Acquisition

### Use Markdown

For tips on text formatting in Markdown, refer to the official [Markdown Guide](https://www.markdownguide.org/basic-syntax).

### All Python Imports should be Sorted

Sorting imports is a [best practice](https://www.python.org/dev/peps/pep-0008/#imports). Group them in the following order (with some examples of each group):
- Python standard library
  - `os`
  - `json`
  - `datetime`
- third party libraries
  - `numpy`
  - `pandas`
  - `pyspark`
- your custom library (see [the best practice](https://www.python.org/dev/peps/pep-0008/#package-and-module-names) for how to name your library)
  - `from bikesharelib import capital_bikeshare_data_loader`
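
As a rough illustration of this grouping, the sketch below sorts module names into the three groups listed above. It assumes Python 3.10 or later, where `sys.stdlib_module_names` is available. The function `group_imports` and its `local_packages` argument are hypothetical names used only for this example; in practice a formatter such as `isort` can apply the ordering automatically.

```python
import sys


def group_imports(module_names, local_packages=()):
    """Split module names into stdlib, third-party and local groups, each sorted."""
    # sys.stdlib_module_names exists in Python 3.10+; the fallback set is an
    # assumption that only covers the standard library examples listed above
    stdlib_names = getattr(sys, "stdlib_module_names", {"os", "json", "datetime"})
    groups = {"stdlib": [], "third_party": [], "local": []}
    for name in module_names:
        top_level = name.split(".")[0]
        if top_level in local_packages:
            groups["local"].append(name)
        elif top_level in stdlib_names:
            groups["stdlib"].append(name)
        else:
            groups["third_party"].append(name)
    return {group: sorted(names) for group, names in groups.items()}


grouped = group_imports(
    ["pandas", "os", "bikesharelib", "json", "numpy", "datetime", "pyspark"],
    local_packages={"bikesharelib"},
)
```

Here `grouped["stdlib"]` holds the standard library modules, `grouped["third_party"]` the third party libraries and `grouped["local"]` the custom library, each in alphabetical order.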

In [6]:
# Jupyter magics
%load_ext nb_black

In [7]:
import os

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

### Use HTML to create a table of contents at the start of the notebook

**Benefits**

1. guides the audience when navigating through the notebook
2. overview of notebook objectives through [**Level 2** headings](https://www.markdownguide.org/basic-syntax/#headings)
   - go directly to a section or sub-section
   - do not need to read the whole notebook

**Usage**

Place an empty [HTML hyperlink](https://www.w3schools.com/tags/tag_a.asp) tag `<a>` above each heading. The value of its `id` attribute is then referenced by links elsewhere in the notebook, such as the entries in the table of contents.
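
The anchor ids in this notebook follow a lowercase, hyphen-separated pattern. A minimal sketch of generating an anchor tag and its matching table of contents entry from a heading is shown below. The helper names `heading_to_anchor_id` and `anchor_and_toc_entry` are hypothetical and not part of any library.

```python
import re


def heading_to_anchor_id(heading):
    """Convert heading text to a lowercase, hyphen-separated id."""
    slug = re.sub(r"[^a-z0-9]+", "-", heading.lower())
    return slug.strip("-")


def anchor_and_toc_entry(number, heading):
    """Return the anchor tag to place above a heading and its table of contents entry."""
    anchor_id = heading_to_anchor_id(heading)
    anchor_tag = f"<a id='{anchor_id}'></a>"
    toc_entry = f"{number}. [{heading}](#{anchor_id})"
    return anchor_tag, toc_entry


anchor_tag, toc_entry = anchor_and_toc_entry(3, "Define Web Driver")
```

Generating both strings from the same heading text keeps the `id` and the link target in sync, which avoids a table of contents entry that points at a missing or mismatched anchor.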

<a id='table-of-contents'></a>

## [Table of Contents](#table-of-contents)

1. [About](#about)
2. [User Inputs](#user-inputs)
3. [Define Web Driver](#define-web-driver)
4. [Get Data](#get-data)
   - 4.1. [Download Metadata](#download-metadata)
   - 4.2. [Download Bikeshare Ridership Data Files](#download-bikeshare-ridership-data-files)
5. [Close Web Driver](#close-web-driver)
6. [Process Data](#process-data)
7. [Links](#links)

### Introduction / Overview

This should be an overview of what the notebook covers. Everything here should guide the user to what they will encounter as they continue reading the notebook.

Some examples
- if there are important pre-requisites, they can be included here. Some examples are
  - this notebook was run with the
    - Google Chrome web browser (version 97.0.4692.71)
    - ChromeDriver 97.0.4692.71 from the [Chromedriver downloads section](https://chromedriver.chromium.org/downloads)
- significant project-level assumptions that affect the workflow in **this notebook only** should be included here too
  - if there are assumptions affecting all notebooks, then they should be added to the `README.md` file for the project

<a id='about'></a>

## 1. [About](#about)

This notebook will cover bikeshare ridership data collection from the [Capital Bikeshare system data page](https://www.capitalbikeshare.com/system-data).

**Prerequisites**
1. This notebook was run with
   - Google Chrome web browser (version 97.0.4692.71)
   - ChromeDriver 97.0.4692.71 from the [Chromedriver downloads section](https://chromedriver.chromium.org/downloads)

### Input (Python) Variables

These are Python variables that won't be changed later in this notebook.

**Tip**

After this cell is changed, we should be able to run all the following cells of this notebook with no errors.

<a id='user-inputs'></a>

## 2. [User Inputs](#user-inputs)

In [None]:
# Get user name
user_name = os.getenv("USERNAME")

# Link to data files
system_data_url = "https://s3.amazonaws.com/capitalbikeshare-data/index.html"

# Path to the Chrome webdriver on local system
webdriver_path = f"/home/{user_name}/chromedriver_linux64/chromedriver"

In [None]:
# Create ChromeDriver service object
webdriver_service_object = Service(webdriver_path)
chrome_webdriver_options = Options()
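
The `USERNAME` environment variable is not set on every system, in which case the path above would contain `None`. One possible alternative, sketched below, builds the same path from the home directory using `pathlib`. The function name `build_webdriver_path` and the home directory `/home/alice` are hypothetical and used only for illustration; the `chromedriver_linux64/chromedriver` location is taken from the cell above.

```python
from pathlib import Path


def build_webdriver_path(home_dir=None):
    """Return the assumed ChromeDriver location under the given home directory."""
    # Path.home() resolves the current user's home directory without relying
    # on the USERNAME environment variable
    home_dir = Path(home_dir) if home_dir is not None else Path.home()
    return home_dir / "chromedriver_linux64" / "chromedriver"


# Explicit (hypothetical) home directory, to show the resulting path
example_path = build_webdriver_path("/home/alice")
```

Calling `build_webdriver_path()` with no argument uses the current user's home directory. Converting the result with `str(...)` keeps `webdriver_path` a plain string, as in the cell above.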

### Workflow Pre-Requisites

These are Python variables that use the input variables specified immediately above.

<a id='define-web-driver'></a>

## 3. [Define Web Driver](#define-web-driver)

In [None]:
# Create instance of Chrome webdriver
driver = webdriver.Chrome(service=webdriver_service_object, options=chrome_webdriver_options)

### Workflow

This will be the main source code of this notebook.

<a id='get-data'></a>

## 4. [Get Data](#get-data)

<a id='download-metadata'></a>

### 4.1. [Download Metadata](#download-metadata)

Retrieve HTML from the system data webpage

In [None]:
%%time
# Get source HTML from page
driver.get(system_data_url)

Get table object and the table headers

In [None]:
%%time
# Get table object
container = driver.find_element(By.XPATH, './/div[@class="container"]')
table_id = container.find_element(By.XPATH, "//table[@class='hide-while-loading table table-striped']/tbody")

# Get text from table headers
header = container.find_element(By.XPATH, "//table[@class='hide-while-loading table table-striped']/thead")
headers = [h.text for h in header.find_elements(By.CSS_SELECTOR, "th")]

Extract a `DataFrame` of all available metadata from the table object

In [None]:
%%time
# Extract all rows of metadata from the table
list_dfs_all_rows = []
# Iterate over rows
for row in table_id.find_elements(By.CSS_SELECTOR, "tr"):
    list_single_row = []
    zip_file_urls = []
    # Iterate over columns per row
    for col_idx, cell in enumerate(row.find_elements(By.TAG_NAME, "td")):
        # Get link to zip data file (in first column only)
        if col_idx == 0:
            data_zip_url = cell.find_element(By.CSS_SELECTOR, "a").get_attribute("href")
            zip_file_urls.append(data_zip_url)
        # Append contents of single row to empty list
        list_single_row.append(cell.text)
    # Create single-row DataFrame by pairing each header with its cell text
    df_single_row = pd.DataFrame.from_records([dict(zip(headers, list_single_row))])
    # Append column with zip data file links
    df_single_row["zip_file_url"] = zip_file_urls
    list_dfs_all_rows.append(df_single_row)

In [None]:
# Combine list of single-row DataFrames into one DataFrame
df = pd.concat(list_dfs_all_rows, ignore_index=True)
with pd.option_context('display.max_colwidth', 100):
    display(pd.concat([df.head(), df.tail()]))
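
The row-by-row extraction above can also be sketched without a browser. The example below is a simplified illustration, not the notebook's method: it collects one dictionary per row and leaves the `DataFrame` construction to a single call at the end. The function name `rows_to_records`, the file names, dates, sizes and URLs are hypothetical stand-ins for the text and links scraped from the page; only the header names come from the table used in this notebook.

```python
def rows_to_records(headers, rows):
    """Build one dict per table row, keyed by header, plus the zip file link."""
    records = []
    for cell_texts, zip_file_url in rows:
        # Pair each header with the text of the cell in the same column
        record = dict(zip(headers, cell_texts))
        record["zip_file_url"] = zip_file_url
        records.append(record)
    return records


# Hypothetical values standing in for the scraped cell text and links
headers = ["Name", "Date Modified", "Size", "Type"]
rows = [
    (
        ["example-1.zip", "2021-02-08", "1 MB", "ZIP file"],
        "https://example.com/example-1.zip",
    ),
    (
        ["example-2.zip", "2021-03-08", "2 MB", "ZIP file"],
        "https://example.com/example-2.zip",
    ),
]
records = rows_to_records(headers, rows)
```

Passing `records` to `pd.DataFrame.from_records(records)` would then create the full `DataFrame` in one step, which avoids building and concatenating many single-row DataFrames.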

<a id='download-bikeshare-ridership-data-files'></a>

### 4.2. [Download Bikeshare Ridership Data Files](#download-bikeshare-ridership-data-files)

<a id='close-web-driver'></a>

## 5. [Close Web Driver](#close-web-driver)

In [None]:
# Close the browser window
driver.quit()

<a id='process-data'></a>

## 6. [Process Data](#process-data)

Select the files containing data from 2021 and, for each file, calculate the time between the last hour of data in that file and the date on which the file was posted.

In [None]:
%%time
# Drop unnecessary columns
cols_to_drop = ["Size", "Type"]
df = df.drop(columns=cols_to_drop)

# Select single year of data (copy so that new columns are not assigned to a slice)
df = df[df["Name"].str.startswith("2021")].copy()

# Get starting year and month of the zip filename
df["train_data_last_date"] = pd.to_datetime(
    df["Name"].str.split("-capital", expand=True)[0] + "01"
)

# Get the month end date corresponding to data in the zip file
df["train_data_last_date"] = (
    df["train_data_last_date"] + pd.offsets.MonthEnd(0) + pd.DateOffset(hours=23)
)

# Extract the number of days and hours between when the zip file was posted and the last
# hour of data available in the zip file
df["days_diff"] = pd.to_datetime(df["Date Modified"]) - df["train_data_last_date"]

df = df.sort_values(by="train_data_last_date")
with pd.option_context('display.max_colwidth', 100):
    display(df)

**Notes**
1. The following columns of metadata have been extracted
   - `Name`
     - name of the zip file containing the ridership data
   - `Date Modified`
     - date on which the data (zip file) was posted
   - `zip_file_url`
     - full path to download the zip file with data
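   - `train_data_last_date`
     - last hour of ridership data in the zip file (23:00 on the last day of the month)
   - `days_diff`
     - time (days and hours) between the last hour of ridership data in the zip file and when the zip file was posted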
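
To make the date arithmetic concrete, the sketch below repeats the same calculation for a single file using only the standard library. The file name and posting date are hypothetical values chosen for illustration, and `last_hour_of_month` is a hypothetical helper name.

```python
import calendar
from datetime import datetime


def last_hour_of_month(zip_file_name):
    """Return the last hour of the month encoded at the start of the file name."""
    # The year and month precede "-capital", e.g. "202101"
    year_month = zip_file_name.split("-capital")[0]
    year, month = int(year_month[:4]), int(year_month[4:6])
    # monthrange returns (weekday of first day, number of days in month)
    last_day = calendar.monthrange(year, month)[1]
    return datetime(year, month, last_day, hour=23)


# Hypothetical file name and posting date, used for illustration only
train_data_last_date = last_hour_of_month("202101-capitalbikeshare-tripdata.zip")
date_modified = datetime(2021, 2, 8)
days_diff = date_modified - train_data_last_date
```

For this example, `train_data_last_date` is 23:00 on 31 January 2021 and `days_diff` is 7 days and 1 hour.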

### Workflow Resources Used - web links to non-Python Packages

Add useful links to resources used in **this notebook only**.

<a id='links'></a>

## 7. [Links](#links)

1. [Bikeshare System Data Page](https://www.capitalbikeshare.com/system-data)
   - [direct link to HTML table](https://s3.amazonaws.com/capitalbikeshare-data/index.html)
2. [Chrome webdriver download page](https://chromedriver.chromium.org/downloads)

### Use a Footer to Direct Users to the Previous / Next notebook

<span style="float:left;">
    <a href="./01_dont_do_this.ipynb">&lt;&lt; 01_dont_do_this.ipynb</a>
</span>

<span style="float:right;">
    &#169; 2022 | <a href="https://github.com/elsdes3/documentation-tips">@elsdes3</a> (MIT)
</span>