Technical Project 1 - Bike Ride Sharing Data App
---
### Diogo Pessoa
### L00179663@atu.ie

In [14]:
format(0).__doc__

"str(object='') -> str\nstr(bytes_or_buffer[, encoding[, errors]]) -> str\n\nCreate a new string object from the given object. If encoding or\nerrors is specified, then the object must expose a data buffer\nthat will be decoded using the given encoding and error handler.\nOtherwise, returns the result of object.__str__() (if defined)\nor repr(object).\nencoding defaults to sys.getdefaultencoding().\nerrors defaults to 'strict'."

In [32]:
# Crawling index page

"""
This section assumes a file naming convention inferred from a visual review of the index page shared by Divvy: https://divvy-tripdata.s3.amazonaws.com/index.html

The next few lines build the list of zip file URLs, one per month for each year in the list, following the pattern:
    https://divvy-tripdata.s3.amazonaws.com/<year><month>-divvy-tripdata.zip

The endpoint doesn't require authentication, so a simple GET request is enough to download the data.
The download step validates the response status code and wraps the request in try/except to handle request errors.
   
"""
url = "https://divvy-tripdata.s3.amazonaws.com/"
year = [2020, 2021, 2022, 2023]
# One URL per month (01-12) for each year, built from the base url above
zip_files_url_list = [f"{url}{y}{m:02d}-divvy-tripdata.zip" for y in year for m in range(1, 13)]

zip_files_url_list # Resulting list of zip files url.


['https://divvy-tripdata.s3.amazonaws.com/202001-divvy-tripdata.zip',
 'https://divvy-tripdata.s3.amazonaws.com/202002-divvy-tripdata.zip',
 'https://divvy-tripdata.s3.amazonaws.com/202003-divvy-tripdata.zip',
 'https://divvy-tripdata.s3.amazonaws.com/202004-divvy-tripdata.zip',
 'https://divvy-tripdata.s3.amazonaws.com/202005-divvy-tripdata.zip',
 'https://divvy-tripdata.s3.amazonaws.com/202006-divvy-tripdata.zip',
 'https://divvy-tripdata.s3.amazonaws.com/202007-divvy-tripdata.zip',
 'https://divvy-tripdata.s3.amazonaws.com/202008-divvy-tripdata.zip',
 'https://divvy-tripdata.s3.amazonaws.com/202009-divvy-tripdata.zip',
 'https://divvy-tripdata.s3.amazonaws.com/202010-divvy-tripdata.zip',
 'https://divvy-tripdata.s3.amazonaws.com/202011-divvy-tripdata.zip',
 'https://divvy-tripdata.s3.amazonaws.com/202012-divvy-tripdata.zip',
 'https://divvy-tripdata.s3.amazonaws.com/202101-divvy-tripdata.zip',
 'https://divvy-tripdata.s3.amazonaws.com/202102-divvy-tripdata.zip',
 'https://divvy-trip
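
Because the naming pattern above is inferred from the index page rather than documented, it can help to keep the pattern in one place and check file names against it. The following is a minimal sketch under that assumption; `BASE_URL`, `FILE_PATTERN`, `build_monthly_urls` and `parse_year_month` are hypothetical names introduced here for illustration.

```python
import re

# Assumption: every monthly archive follows <year><month>-divvy-tripdata.zip,
# as inferred from the index page; this is not a documented contract.
BASE_URL = "https://divvy-tripdata.s3.amazonaws.com/"
FILE_PATTERN = re.compile(r"^(\d{4})(\d{2})-divvy-tripdata\.zip$")


def build_monthly_urls(years, base_url=BASE_URL):
    """Return one archive URL per month for each year in `years`."""
    return [
        f"{base_url}{year}{month:02d}-divvy-tripdata.zip"
        for year in years
        for month in range(1, 13)
    ]


def parse_year_month(url):
    """Return (year, month) from the file name, or None if it does not match."""
    match = FILE_PATTERN.match(url.rsplit("/", 1)[-1])
    if match is None:
        return None
    year, month = int(match.group(1)), int(match.group(2))
    return (year, month) if 1 <= month <= 12 else None


urls = build_monthly_urls([2020, 2021, 2022, 2023])
print(len(urls))                  # 48
print(parse_year_month(urls[3]))  # (2020, 4)
```

A file name that does not follow the pattern makes `parse_year_month` return `None`, which gives a cheap way to flag entries that need manual handling.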

Pull files from Divvy AWS S3 Bucket
---

In [39]:
import requests
import os

# Local directory where the downloaded zip files are stored
local_data_path = '/Users/macbook/code/BigdataAnalyticsPG2023/TechProject1/data'

for file_url in zip_files_url_list:
    file_name = file_url.split('/')[-1]
    print(f'loading data for file: {file_name}')
    # TODO  Check if file already exists before pulling zip file request
    try:
        # stream=True defers the body download so it can be written in chunks
        r = requests.get(file_url, stream=True)
        if r.status_code != 200:
            # Skip months that are not published under this naming pattern
            continue

        with open(os.path.join(local_data_path, file_name), 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    except requests.exceptions.RequestException as e:
        raise SystemExit(e)

loading data for file: 202004-divvy-tripdata.zip
/Users/macbook/code/BigdataAnalyticsPG2023/TechProject1/Notebooks
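
The download cell leaves "check if file already exists" as a TODO. One possible way to cover it, sketched here with a hypothetical helper `pending_downloads`, is to filter the URL list before requesting anything. Treating a zero-byte file as a failed earlier download is an assumption of this sketch, not something the notebook states.

```python
import os


def pending_downloads(urls, dest_dir):
    """Return the URLs whose target file is missing or empty in dest_dir."""
    pending = []
    for url in urls:
        target = os.path.join(dest_dir, url.split("/")[-1])
        # Assumption: a zero-byte file is a failed earlier download, so retry it.
        if not os.path.isfile(target) or os.path.getsize(target) == 0:
            pending.append(url)
    return pending
```

The download loop could then iterate over `pending_downloads(zip_files_url_list, local_data_path)` instead of the full list, so re-running the cell only fetches what is not yet on disk.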


Extract from ZipFile
----

In [40]:
import os
import zipfile

# Iterate over all files in the directory
for filename in os.listdir(local_data_path):
    if filename.endswith('.zip'):
        # Construct the full path to the file
        file_path = os.path.join(local_data_path, filename)
        # Open the zip file
        with zipfile.ZipFile(file_path, 'r') as zip_ref:
            # Extract all the contents into the directory
            zip_ref.extractall(local_data_path)
            print(f"Extracted: {filename}")
print("All zip files extracted.")

Extracted: 202004-divvy-tripdata.zip
All zip files extracted.
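
The extraction cell unpacks every member of every archive. A slightly more defensive variant, sketched below with a hypothetical helper `extract_csv_members`, first checks that the file really is a zip archive and then extracts only `.csv` members. Skipping entries under `__MACOSX/` assumes that some archives may bundle macOS metadata; that is a guess, not something verified for every Divvy file.

```python
import os
import zipfile


def extract_csv_members(zip_path, dest_dir):
    """Extract only the .csv members of zip_path into dest_dir; return their names."""
    if not zipfile.is_zipfile(zip_path):
        # Covers truncated downloads and non-zip responses saved with a .zip name
        return []
    extracted = []
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        for name in zip_ref.namelist():
            # Assumption: anything under __MACOSX/ is archive metadata, not trip data
            if name.startswith("__MACOSX/") or not name.lower().endswith(".csv"):
                continue
            zip_ref.extract(name, dest_dir)
            extracted.append(name)
    return extracted
```

Returning the extracted names makes it easy to log what each archive contributed, or to spot an archive that produced no CSV at all.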


Load csv into Apache Spark DataFrames
---
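
The load step in this section leaves the schema as a TODO, so every column is read as a string. As a rough sketch of one way to address that, the snippet below builds a DDL-formatted schema string in plain Python. The column names come from the header printed by `df.show()` in this section; the types, `DIVVY_COLUMNS` and `to_ddl` are assumptions introduced for illustration and should be checked against the data (station ids, for example, are kept as strings here as a precaution).

```python
# Column names follow the DataFrame header printed in this notebook;
# the types are assumptions to verify against the actual data.
DIVVY_COLUMNS = [
    ("ride_id", "STRING"),
    ("rideable_type", "STRING"),
    ("started_at", "TIMESTAMP"),
    ("ended_at", "TIMESTAMP"),
    ("start_station_name", "STRING"),
    ("start_station_id", "STRING"),
    ("end_station_name", "STRING"),
    ("end_station_id", "STRING"),
    ("start_lat", "DOUBLE"),
    ("start_lng", "DOUBLE"),
    ("end_lat", "DOUBLE"),
    ("end_lng", "DOUBLE"),
    ("member_casual", "STRING"),
]


def to_ddl(columns):
    """Join (name, type) pairs into a DDL-formatted schema string."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)


schema_ddl = to_ddl(DIVVY_COLUMNS)
print(schema_ddl)
```

The `schema` argument of `spark.read.csv` accepts either a `StructType` or a DDL-formatted string, so `schema_ddl` can be passed as `schema=schema_ddl` alongside `header=True`.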

In [41]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Load CSVs").getOrCreate()
# Read only the extracted CSV files; the directory also holds the downloaded zip archives
df = spark.read.csv(f"{local_data_path}/*.csv", header=True)
# TODO: Build schema based on column data types
df.show()
df.printSchema()


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/01/17 20:14:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
                                                                                

+----------------+-------------+-------------------+-------------------+--------------------+----------------+--------------------+--------------+---------+---------+-------+--------+-------------+
|         ride_id|rideable_type|         started_at|           ended_at|  start_station_name|start_station_id|    end_station_name|end_station_id|start_lat|start_lng|end_lat| end_lng|member_casual|
+----------------+-------------+-------------------+-------------------+--------------------+----------------+--------------------+--------------+---------+---------+-------+--------+-------------+
|A847FADBBC638E45|  docked_bike|2020-04-26 17:45:14|2020-04-26 18:12:03|        Eckhart Park|              86|Lincoln Ave & Div...|           152|  41.8964|  -87.661|41.9322|-87.6586|       member|
|5405B80E996FF60D|  docked_bike|2020-04-17 17:08:54|2020-04-17 17:17:03|Drake Ave & Fulle...|             503|     Kosciuszko Park|           499|  41.9244| -87.7154|41.9306|-87.7238|       member|
|5DD24A79A