In [1]:
import pandas as pd

* Sourcing CSV data

Download 500 records from the Motor Vehicle Collisions - Crashes dataset on the NYC Open
Data website by entering the following URL into your browser: https://data.cityofnewyork.us/resource/h9gi-nx95.csv?$limit=500.
Save the file as `data/h9gi-nx95.csv` so that the next cell can read it.

In [3]:
df_csv = pd.read_csv("data/h9gi-nx95.csv")

In [4]:
df_csv.head()

Unnamed: 0,crash_date,crash_time,borough,zip_code,latitude,longitude,location,on_street_name,off_street_name,cross_street_name,...,contributing_factor_vehicle_2,contributing_factor_vehicle_3,contributing_factor_vehicle_4,contributing_factor_vehicle_5,collision_id,vehicle_type_code1,vehicle_type_code2,vehicle_type_code_3,vehicle_type_code_4,vehicle_type_code_5
0,2021-09-11T00:00:00.000,2:39,,,,,,WHITESTONE EXPRESSWAY,20 AVENUE,,...,Unspecified,,,,4455765,Sedan,Sedan,,,
1,2022-03-26T00:00:00.000,11:45,,,,,,QUEENSBORO BRIDGE UPPER,,,...,,,,,4513547,Sedan,,,,
2,2023-11-01T00:00:00.000,1:29,BROOKLYN,11230.0,40.62179,-73.970024,"\n, \n(40.62179, -73.970024)",OCEAN PARKWAY,AVENUE K,,...,Unspecified,Unspecified,,,4675373,Moped,Sedan,Sedan,,
3,2022-06-29T00:00:00.000,6:55,,,,,,THROGS NECK BRIDGE,,,...,Unspecified,,,,4541903,Sedan,Pick-up Truck,,,
4,2022-09-21T00:00:00.000,13:21,,,,,,BROOKLYN BRIDGE,,,...,Unspecified,,,,4566131,Station Wagon/Sport Utility Vehicle,,,,
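
In the preview above, `zip_code` shows up as a float (`11230.0`) and `crash_date` appears to be left as text. As a minimal sketch of how `pd.read_csv` options could handle this, the example below parses a two-row inline sample with `parse_dates` and `dtype`. The sample values are copied from the preview; the four-column subset and the name `df_csv_sample` are chosen only for illustration.

```python
import io

import pandas as pd

# Two rows copied from the preview above; only four columns are kept for brevity.
sample_csv = io.StringIO(
    "crash_date,crash_time,borough,zip_code\n"
    "2021-09-11T00:00:00.000,2:39,,\n"
    "2023-11-01T00:00:00.000,1:29,BROOKLYN,11230\n"
)

df_csv_sample = pd.read_csv(
    sample_csv,
    parse_dates=["crash_date"],    # convert the ISO timestamps to datetime64
    dtype={"zip_code": "string"},  # keep ZIP codes as text instead of floats
)

print(df_csv_sample.dtypes)
```

The same two arguments could be passed when reading `data/h9gi-nx95.csv`, assuming the full file uses the same column names.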


* Sourcing Parquet

    Like CSV and Excel files, a Parquet file stores tabular data. However, while a CSV file is plain text, Parquet stores data in a binary format that is usually compressed. And unlike CSV files, which store data row by row, Parquet files store data column by column, so a reader can load only the columns it needs instead of scanning every row. This columnar layout also suits vectorized (column-at-a-time) operations, which is why Parquet is a preferred format for large stored datasets: reading and processing them typically takes far less time than with row-oriented text files.

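    Download one of the Parquet files from the [New York Taxi and Limousine Commission](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) data set. The code below reads the January 2022 yellow taxi file, saved as `data/yellow_tripdata_2022-01.parquet`.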

In [5]:
df_parquet = pd.read_parquet("data/yellow_tripdata_2022-01.parquet")  # note: this file is about 36 MB
df_parquet.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,1,2022-01-01 00:35:40,2022-01-01 00:53:29,2.0,3.8,1.0,N,142,236,1,14.5,3.0,0.5,3.65,0.0,0.3,21.95,2.5,0.0
1,1,2022-01-01 00:33:43,2022-01-01 00:42:07,1.0,2.1,1.0,N,236,42,1,8.0,0.5,0.5,4.0,0.0,0.3,13.3,0.0,0.0
2,2,2022-01-01 00:53:21,2022-01-01 01:02:19,1.0,0.97,1.0,N,166,166,1,7.5,0.5,0.5,1.76,0.0,0.3,10.56,0.0,0.0
3,2,2022-01-01 00:25:21,2022-01-01 00:35:23,1.0,1.09,1.0,N,114,68,2,8.0,0.5,0.5,0.0,0.0,0.3,11.8,2.5,0.0
4,2,2022-01-01 00:36:48,2022-01-01 01:14:20,1.0,4.3,1.0,N,68,163,1,23.5,0.5,0.5,3.0,0.0,0.3,30.3,2.5,0.0
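
To make the row-versus-column distinction described earlier more concrete, here is a small conceptual sketch in plain Python. It is not how Parquet is implemented; it only contrasts the two layouts using three trips copied from the preview above. The names `row_store` and `column_store` are made up for this illustration.

```python
# Three trips copied from the preview above (two columns kept for brevity).
row_store = [
    {"trip_distance": 3.8, "fare_amount": 14.5},
    {"trip_distance": 2.1, "fare_amount": 8.0},
    {"trip_distance": 0.97, "fare_amount": 7.5},
]

# The same data laid out column by column, as a columnar format would store it.
column_store = {
    "trip_distance": [3.8, 2.1, 0.97],
    "fare_amount": [14.5, 8.0, 7.5],
}

# Row layout: every record is visited even though only one field is needed.
total_fare_rows = sum(record["fare_amount"] for record in row_store)

# Column layout: the needed values are already stored together.
total_fare_columns = sum(column_store["fare_amount"])

print(total_fare_rows, total_fare_columns)
```

In pandas, the `columns` parameter of `pd.read_parquet` applies the same idea by loading only the listed columns from the file.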


* Sourcing data from APIs

    For this example, we will be using one of the NYC Open Data website’s public HTTP APIs, which publishes data in JSON format: https://data.cityofnewyork.us/resource/h9gi-nx95.json?$limit=500.

    We need to import a Python library, `certifi`, which contains “Root Certificates for validating the trustworthiness of SSL certificates while verifying the identity of TLS hosts”
    (documentation: https://pypi.org/project/certifi/)

    The `urllib3` Python library is a user-friendly HTTP client in Python that helps with some of the “under the hood” processes of retrying requests and dealing with HTTP redirects (documentation: https://pypi.org/project/urllib3/)

Define a `url` variable for the preceding NYC Open Data API URL, and check the API connection
status using `http.request('GET', url).status`, where `http` is the pool manager created in the
code below. The status code tells you what happened to the request. Was it successful? Was it
redirected? If it failed, why did it fail?
Here are the most common status codes for HTTP GET requests:

|Status Code|General Meaning|
|---|---|
|200|Successful connection|
|400|Error/bad request/incorrect data was sent|
|401|Authentication error|
|403|Access forbidden|
|404|Resource not found on the server|
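
As a minimal sketch of how the status codes in the table could be turned into readable messages, the helper below uses `HTTPStatus` from Python's standard library. The function name `describe_status` and the category labels are hypothetical and are not part of `urllib3`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short, human-readable description of an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase  # e.g. "OK", "Not Found"
    except ValueError:
        phrase = "Unknown status"  # code is not a registered HTTP status

    # Group codes by their leading digit, following the usual HTTP convention.
    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirect"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "other"
    return f"{code} {phrase} ({category})"


for status_code in (200, 400, 401, 403, 404):
    print(describe_status(status_code))
```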

The request itself is sent through a pool manager, so the `PoolManager` has to be created before
the status can be checked. A `PoolManager` is the urllib3 object that handles connection pooling:
once a connection to a host has been opened, later requests to that host reuse it instead of
opening a new one. This becomes particularly important when making web requests to services
that limit the number of “allowed” connections per day.
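
A pool manager can also carry retry and timeout settings, which may help when an API is rate limited or slow. The sketch below is one possible configuration, not the one this notebook uses; the retry count, backoff factor, status list, and timeout values are illustrative assumptions, and the name `http_with_retries` is hypothetical. No request is sent when this code runs.

```python
import urllib3

# Retry transient failures up to 3 times, waiting longer between attempts.
# All numbers here are illustrative assumptions, not recommended values.
retry_policy = urllib3.Retry(
    total=3,
    backoff_factor=0.5,
    status_forcelist=[429, 500, 502, 503, 504],
)

# Give up if connecting or reading takes too long (values in seconds).
timeout_policy = urllib3.Timeout(connect=5.0, read=30.0)

# Create the manager once and reuse it for every request so connections are pooled.
http_with_retries = urllib3.PoolManager(retries=retry_policy, timeout=timeout_policy)
```

The `cert_reqs` and `ca_certs` arguments used later in this notebook can be passed to the same constructor.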

In [7]:
import urllib3
import certifi
import json
import pandas as pd

In [8]:
# Define the URL to fetch data
url = 'https://data.cityofnewyork.us/resource/h9gi-nx95.json?$limit=500'

# Create a PoolManager to handle HTTP requests
http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())

In [10]:
# Check if API is available to retrieve the data
response = http.request('GET', url)
api_status = response.status
print(f"API Status: {api_status}")

API Status: 200


In [11]:
if api_status == 200:
    # Check if the response contains data
    try:
        raw_data = response.data.decode('utf-8')
        data = json.loads(raw_data)
        df_api = pd.json_normalize(data)
        print("Data loaded successfully.")
    except json.JSONDecodeError:
        print("Failed to decode JSON. The response may be empty or invalid.")
        df_api = pd.DataFrame()  # Empty DataFrame
else:
    print("Failed to retrieve data from the API.")
    df_api = pd.DataFrame()  # Empty DataFrame

Data loaded successfully.


In [12]:
df_api.head(10)

Unnamed: 0,crash_date,crash_time,on_street_name,off_street_name,number_of_persons_injured,number_of_persons_killed,number_of_pedestrians_injured,number_of_pedestrians_killed,number_of_cyclist_injured,number_of_cyclist_killed,...,contributing_factor_vehicle_3,vehicle_type_code_3,location.latitude,location.longitude,location.human_address,cross_street_name,contributing_factor_vehicle_4,vehicle_type_code_4,contributing_factor_vehicle_5,vehicle_type_code_5
0,2021-09-11T00:00:00.000,2:39,WHITESTONE EXPRESSWAY,20 AVENUE,2,0,0,0,0,0,...,,,,,,,,,,
1,2022-03-26T00:00:00.000,11:45,QUEENSBORO BRIDGE UPPER,,1,0,0,0,0,0,...,,,,,,,,,,
2,2023-11-01T00:00:00.000,1:29,OCEAN PARKWAY,AVENUE K,1,0,0,0,0,0,...,Unspecified,Sedan,40.62179,-73.970024,"{""address"": """", ""city"": """", ""state"": """", ""zip""...",,,,,
3,2022-06-29T00:00:00.000,6:55,THROGS NECK BRIDGE,,0,0,0,0,0,0,...,,,,,,,,,,
4,2022-09-21T00:00:00.000,13:21,BROOKLYN BRIDGE,,0,0,0,0,0,0,...,,,,,,,,,,
5,2023-04-26T00:00:00.000,13:30,WEST 54 STREET,,0,0,0,0,0,0,...,,,,,,,,,,
6,2023-11-01T00:00:00.000,7:12,HUTCHINSON RIVER PARKWAY,,0,0,0,0,0,0,...,,,,,,,,,,
7,2023-11-01T00:00:00.000,8:01,WEST 35 STREET,HENRY HUDSON RIVER,0,0,0,0,0,0,...,,,,,,,,,,
8,2023-04-26T00:00:00.000,22:20,,,0,0,0,0,0,0,...,,,,,,61 Ed Koch queensborough bridge,,,,
9,2021-09-11T00:00:00.000,9:35,,,0,0,0,0,0,0,...,,,40.667202,-73.8665,"{""address"": """", ""city"": """", ""state"": """", ""zip""...",1211 LORING AVENUE,,,,
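
The `location.latitude` and `location.longitude` columns above come from `pd.json_normalize` flattening a nested `location` object. The sketch below reproduces that behavior on a hand-written two-record sample shaped like the output above. It assumes the API returns numbers as strings, which is why it also converts them with `pd.to_numeric`; the name `df_api_sample` is chosen only for illustration.

```python
import json

import pandas as pd

# Hand-written sample shaped like the API output above (not a real response).
raw_sample = """
[
  {"crash_date": "2021-09-11T00:00:00.000",
   "on_street_name": "WHITESTONE EXPRESSWAY",
   "number_of_persons_injured": "2"},
  {"crash_date": "2023-11-01T00:00:00.000",
   "on_street_name": "OCEAN PARKWAY",
   "number_of_persons_injured": "1",
   "location": {"latitude": "40.62179", "longitude": "-73.970024"}}
]
"""

records = json.loads(raw_sample)

# Nested keys become dotted column names; missing keys become NaN.
df_api_sample = pd.json_normalize(records)

# Assumption: numeric fields arrive as strings, so convert them explicitly.
numeric_cols = ["number_of_persons_injured", "location.latitude", "location.longitude"]
df_api_sample[numeric_cols] = df_api_sample[numeric_cols].apply(pd.to_numeric)

print(df_api_sample.dtypes)
```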


* Sourcing Data from RDBMS tables

    There are two main categories of database systems: relational (RDBMS) and non-relational (non-RDBMS). RDBMSs, such as MySQL and Oracle, store structured data in tables of rows and columns, the same structure that a pandas DataFrame represents. Non-relational systems, such as MongoDB, do not enforce a fixed table schema; they store data in other shapes, such as JSON-like documents, which is why they are often described as unstructured or semi-structured. The example below uses SQLite, a relational database that ships with Python's standard library as the `sqlite3` module.

In [14]:
# Use Python's built-in sqlite3 module to build a small example database
import sqlite3


# Create a SQLite database and connect to it
conn = sqlite3.connect("movies.sqlite")

# Define the SQL to create a 'movies' table
create_table_query = """
CREATE TABLE IF NOT EXISTS movies (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    genre TEXT NOT NULL,
    year INTEGER NOT NULL,
    rating REAL
);
"""

# Execute the query to create the table
with conn:
    conn.execute(create_table_query)

In [15]:
# Insert sample data into the 'movies' table
sample_data = [
    (1, "Inception", "Sci-Fi", 2010, 8.8),
    (2, "The Dark Knight", "Action", 2008, 9.0),
    (3, "Interstellar", "Sci-Fi", 2014, 8.6),
    (4, "Pulp Fiction", "Crime", 1994, 8.9),
    (5, "The Shawshank Redemption", "Drama", 1994, 9.3),
]

# INSERT OR IGNORE skips rows whose id already exists, so this cell can be re-run safely
insert_query = "INSERT OR IGNORE INTO movies (id, title, genre, year, rating) VALUES (?, ?, ?, ?, ?)"
with conn:
    conn.executemany(insert_query, sample_data)

# Close the connection
conn.close()

In [16]:
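# Reconnect and query the table into a DataFrame to verify.
# Note: `with sqlite3.connect(...)` only manages the transaction; it does not close the connection.
with sqlite3.connect("movies.sqlite") as conn:
    df = pd.read_sql("SELECT * FROM movies", conn)
conn.close()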

In [17]:
df

Unnamed: 0,id,title,genre,year,rating
0,1,Inception,Sci-Fi,2010,8.8
1,2,The Dark Knight,Action,2008,9.0
2,3,Interstellar,Sci-Fi,2014,8.6
3,4,Pulp Fiction,Crime,1994,8.9
4,5,The Shawshank Redemption,Drama,1994,9.3
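
Queries usually filter rows rather than select the whole table. As a minimal sketch, the example below passes a filter value through the `params` argument of `pd.read_sql` instead of formatting it into the SQL string. It uses an in-memory database (`":memory:"`) with three of the rows above so that it does not touch `movies.sqlite`; the name `df_scifi` is chosen only for illustration.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# closing() makes sure the connection is closed when the block ends.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute(
        "CREATE TABLE movies ("
        "id INTEGER PRIMARY KEY, title TEXT, genre TEXT, year INTEGER, rating REAL)"
    )
    conn.executemany(
        "INSERT INTO movies VALUES (?, ?, ?, ?, ?)",
        [
            (1, "Inception", "Sci-Fi", 2010, 8.8),
            (2, "The Dark Knight", "Action", 2008, 9.0),
            (3, "Interstellar", "Sci-Fi", 2014, 8.6),
        ],
    )

    # The ? placeholder is filled from params, which avoids building SQL by hand.
    df_scifi = pd.read_sql(
        "SELECT title, year, rating FROM movies WHERE genre = ? ORDER BY year",
        conn,
        params=("Sci-Fi",),
    )

print(df_scifi)
```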


* Sourcing data from webpages

    You can also “scrape,” or gather, data from publicly available web pages to feed into ETL pipelines. We will use the following Wikipedia URL, https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal), to scrape the Gross Domestic Product (GDP) economic data for all the countries in the list.

In [20]:
df_html = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)', match='by country')

In [22]:
# Let's see how many tables contain the text 'by country'
print(len(df_html))  # 4 tables when this notebook was run

4


In [23]:
# Let's see the first table
df_html[0]

Unnamed: 0_level_0,Country/Territory,IMF[1][13],IMF[1][13],World Bank[14],World Bank[14],United Nations[15],United Nations[15]
Unnamed: 0_level_1,Country/Territory,Forecast,Year,Estimate,Year,Estimate,Year
0,World,115494312,2025,105435540,2023,100834796,2022
1,United States,30337162,2025,27360935,2023,25744100,2022
2,China,19534894,[n 1]2025,17794782,[n 3]2023,17963170,[n 1]2022
3,Germany,4921563,2025,4456081,2023,4076923,2022
4,Japan,4389326,2025,4212945,2023,4232173,2022
...,...,...,...,...,...,...,...
205,Kiribati,311,2024,279,2023,223,2022
206,Palau,308,2024,263,2023,225,2022
207,Marshall Islands,305,2024,284,2023,279,2022
208,Nauru,161,2024,154,2023,147,2022
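
The scraped table has a two-level header, and some cells carry footnote markers such as `[n 1]2025`. The sketch below shows one possible way to tidy this up: flatten the header into single column names and strip the bracketed markers. It works on a two-row stand-in built by hand from the output above (three columns only), and the helper `flatten`, the pattern `FOOTNOTE_PATTERN`, and the resulting column names are illustrative choices.

```python
import re

import pandas as pd

# Hand-built stand-in for df_html[0], using values from the output above.
header = pd.MultiIndex.from_tuples(
    [
        ("Country/Territory", "Country/Territory"),
        ("IMF[1][13]", "Forecast"),
        ("IMF[1][13]", "Year"),
    ]
)
sample = pd.DataFrame(
    [["China", 19534894, "[n 1]2025"], ["Germany", 4921563, "2025"]],
    columns=header,
)

# Matches bracketed markers such as [1], [13], or [n 1].
FOOTNOTE_PATTERN = r"\[[^\]]*\]"


def flatten(column: tuple) -> str:
    """Join the two header levels into one name, dropping footnote markers."""
    top, bottom = (re.sub(FOOTNOTE_PATTERN, "", level).strip() for level in column)
    return top if top == bottom else f"{top}_{bottom}"


tidy = sample.copy()
tidy.columns = [flatten(column) for column in sample.columns]

# Remove the markers from the year cells, then convert the text to integers.
tidy["IMF_Year"] = (
    tidy["IMF_Year"].str.replace(FOOTNOTE_PATTERN, "", regex=True).astype(int)
)

print(tidy)
```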
