# Week 4 - C2 End-of-Course Project

You’ll complete an end-of-course project by creating a pipeline process to deliver data to a target table and developing reports based on project needs. You’ll also ensure that the pipeline is performing correctly and that there are built-in defenses against data quality issues.

## Learning Objectives

- Identify business needs to determine a design for your portfolio project’s data pipeline.

- Analyze how database systems are designed, how BI tools such as pipelines and ETL systems are built, and how to optimize them for performance, in order to determine the best data pipeline process.

- Develop a data pipeline to deliver necessary data to a target table.

## Workplace Scenario

### Overview of Cyclistic Project

*Background:* 

In this fictitious workplace scenario, the imaginary company Cyclistic has partnered with the city of New York to provide shared bikes. Currently, there are bike stations located throughout Manhattan and neighboring boroughs. Customers are able to rent bikes for easy travel among stations at these locations. 

*Scenario:*

You are a newly hired BI professional at Cyclistic. The company’s Customer Growth Team is creating a business plan for next year. They want to understand how their customers are using their bikes; their top priority is identifying customer demand at different station locations. Previously, you gathered information from your meeting notes and completed important project planning documents. Now you are ready for the next part of your project! 

*Course 2 challenge:*

- Use project planning documents to identify key metrics and dashboard requirements

- Observe stakeholders in action to better understand how they use data

- Gather and combine necessary data

- Design reporting tables that can be uploaded to Tableau to create the final dashboard

### Key Takeaways

In Course 2, The Path to Insights: Data Models and Pipelines, you focused on understanding how data is stored, transformed, and delivered in a BI environment.

*Course 2 skills:*

- Combine and transform data

- Identify key metrics

- Create target tables

- Practice working with BI tools

**Course 2 end-of-course project deliverables:**

- The necessary target tables

## Additional Info on Cyclistic

### The Team at Work

As you learned during your previous meeting with Cyclistic, the product development team has begun planning for the next year of Cyclistic’s bike-sharing program. Cyclistic's Customer Growth Team is creating a business plan for next year. The team wants to understand how their customers are using their bikes; their top priority is identifying customer demand at different station locations. The Cyclistic team posed an important primary question:

- How can we apply customer usage insights to inform new station growth?

Answering this question starts with the data from the Cyclistic bikes themselves, which the team has provided you, and the reporting dashboard the team uses to gain insights. In addition to the explicit requests the stakeholders made, you realize a few key things about the team's current processes.

First, you realize that there are stakeholders from a variety of different departments accessing and using this data with different levels of technical expertise. There are stakeholders from these teams:

- Product development

- Customer data

- Engineering

- Data analytics

- Data warehousing

- API

- IT

- Cyclistic executive

- Project management

For example, you realize that Earnest Cox, the VP of product development, often requests high-level insights and rarely needs detailed views of the data. In contrast, Tessa Blackwell from the data analytics team explores the data in depth and spends much more time reviewing the dashboard views. As you develop your reporting tools, you will want to find a way to meet both of these stakeholders’ needs.

Additionally, one of your coworkers finds out you’re working on this project and shares a dataset they created recently for a project of their own that they think might help you: NYC zip codes. This dataset provides the zip codes for the different neighborhoods and boroughs in New York City; this will let you compare the bike data to the weather data more easily since you will be able to match the locations more accurately. It will also help you develop your map visualization later on.  
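To make the role of the zip code dataset concrete, here is a minimal pandas sketch of how a lookup table can attach borough and neighborhood names to trip records. The two small DataFrames are hypothetical stand-ins, not the real tables. The column names `zip`, `borough`, and `neighborhood` mirror the ones used in the query that follows, and the sketch assumes `zip` is stored as an integer, which is why it is cast to a string before the join.

```python
import pandas as pd

# Hypothetical trip records keyed by zip code (illustrative values only).
trips = pd.DataFrame(
    {
        "zip_code_start": ["10028", "11101", "99999"],
        "trip_count": [32, 52, 7],
    }
)

# Hypothetical slice of the coworker's lookup table.
# Assumption: `zip` is an integer column, so it is cast to a string to match.
zip_lookup = pd.DataFrame(
    {
        "zip": [10028, 11101],
        "borough": ["Manhattan", "Queens"],
        "neighborhood": ["Upper East Side", "Northwest Queens"],
    }
)
zip_lookup["zip"] = zip_lookup["zip"].astype(str)

# An inner join keeps only the trips whose zip code appears in the lookup.
labeled = trips.merge(
    zip_lookup, how="inner", left_on="zip_code_start", right_on="zip"
).drop(columns="zip")

print(labeled)
```

Because the join is an inner join, the row with the unknown zip code `99999` is dropped. That is worth keeping in mind when comparing row counts before and after the join.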

```python
import google.auth
from google.cloud import bigquery
import pandas as pd

# Use Application Default Credentials and the project they resolve to.
credentials, project = google.auth.default()

client = bigquery.Client(project=project, credentials=credentials)

query = """
SELECT
    TRI.usertype,
    ZIPSTART.zip_code AS zip_code_start,
    ZIPSTARTNAME.borough AS borough_start,
    ZIPSTARTNAME.neighborhood AS neighborhood_start,
    ZIPEND.zip_code AS zip_code_end,
    ZIPENDNAME.borough AS borough_end,
    ZIPENDNAME.neighborhood AS neighborhood_end,
    -- Shift the trip dates forward by 5 years
    DATE_ADD(DATE(TRI.starttime), INTERVAL 5 YEAR) AS start_day,
    DATE_ADD(DATE(TRI.stoptime), INTERVAL 5 YEAR) AS stop_day,
    WEA.temp AS day_mean_temperature,
    WEA.wdsp AS day_mean_wind_speed,
    WEA.prcp AS day_total_precipitation,
    -- Group trips into 10-minute intervals to reduce the number of rows
    ROUND(CAST(TRI.tripduration / 60 AS INT64), -1) AS trip_minutes,
    COUNT(TRI.bikeid) AS trip_count
FROM
    `bigquery-public-data.new_york_citibike.citibike_trips` AS TRI
INNER JOIN
    `bigquery-public-data.geo_us_boundaries.zip_codes` AS ZIPSTART
    ON ST_WITHIN(
        ST_GEOGPOINT(TRI.start_station_longitude, TRI.start_station_latitude),
        ZIPSTART.zip_code_geom)
INNER JOIN
    `bigquery-public-data.geo_us_boundaries.zip_codes` AS ZIPEND
    ON ST_WITHIN(
        ST_GEOGPOINT(TRI.end_station_longitude, TRI.end_station_latitude),
        ZIPEND.zip_code_geom)
INNER JOIN
    `bigquery-public-data.noaa_gsod.gsod20*` AS WEA
    ON PARSE_DATE("%Y%m%d", CONCAT(WEA.year, WEA.mo, WEA.da)) = DATE(TRI.starttime)
INNER JOIN
    -- Note! Add your zip code table name, enclosed in backticks: `example_table`
    `cyclistic_nyc_zip.cyclistic_nyc_zip_list` AS ZIPSTARTNAME
    ON ZIPSTART.zip_code = CAST(ZIPSTARTNAME.zip AS STRING)
INNER JOIN
    -- Note! Add your zip code table name, enclosed in backticks: `example_table`
    `cyclistic_nyc_zip.cyclistic_nyc_zip_list` AS ZIPENDNAME
    ON ZIPEND.zip_code = CAST(ZIPENDNAME.zip AS STRING)
WHERE
    -- This takes the weather data from one weather station
    WEA.wban = '94728' -- NEW YORK CENTRAL PARK
    -- Use data from 2014 and 2015
    AND EXTRACT(YEAR FROM DATE(TRI.starttime)) BETWEEN 2014 AND 2015
GROUP BY
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13;
"""

# Submit the query; the job runs asynchronously on BigQuery.
query_job = client.query(query)

query_job
```

```text
QueryJob<project=dataflow-testing-391517, location=US, id=f2483e3f-0077-42cd-9a03-8acae38ca17a>
```
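The `trip_minutes` expression in the query is easy to misread, so the following sketch reproduces its arithmetic in plain Python. It assumes BigQuery's documented behavior that both `CAST(... AS INT64)` and `ROUND` send halfway cases away from zero. The helper names `round_half_away` and `trip_minutes_bucket` are made up for this illustration, and the sketch only handles non-negative durations.

```python
import math


def round_half_away(value: float) -> int:
    """Round a non-negative number to the nearest integer, with halves going up."""
    return math.floor(value + 0.5)


def trip_minutes_bucket(trip_duration_seconds: int) -> int:
    """Mimic ROUND(CAST(tripduration / 60 AS INT64), -1) for non-negative input."""
    # Step 1: seconds to whole minutes (the CAST to INT64).
    minutes = round_half_away(trip_duration_seconds / 60)
    # Step 2: whole minutes to the nearest multiple of 10 (the ROUND with -1).
    return round_half_away(minutes / 10) * 10


for seconds in (200, 300, 840, 1490):
    print(seconds, "->", trip_minutes_bucket(seconds))
```

Under these assumptions, a 200-second trip lands in bucket 0, a 300-second trip in bucket 10, an 840-second trip in bucket 10, and a 1490-second trip in bucket 30. A `trip_minutes` value of 0 in the target table therefore means a very short trip, not a missing duration.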

```python
# Wait for the query to finish and load the result into a DataFrame.
df = query_job.to_dataframe()
df
```

| | usertype | zip_code_start | borough_start | neighborhood_start | zip_code_end | borough_end | neighborhood_end | start_day | stop_day | day_mean_temperature | day_mean_wind_speed | day_total_precipitation | trip_minutes | trip_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Subscriber | 10028 | Manhattan | Upper East Side | 10028 | Manhattan | Upper East Side | 2020-09-26 | 2020-09-26 | 66.5 | 5.2 | 0.0 | 0.0 | 32 |
| 1 | Subscriber | 10024 | Manhattan | Upper West Side | 10028 | Manhattan | Upper East Side | 2020-11-09 | 2020-11-09 | 51.1 | 1.7 | 0.0 | 10.0 | 6 |
| 2 | Customer | 10024 | Manhattan | Upper West Side | 10024 | Manhattan | Upper West Side | 2020-12-05 | 2020-12-05 | 45.7 | 2.0 | 0.0 | 40.0 | 3 |
| 3 | Subscriber | 11101 | Queens | Northwest Queens | 11101 | Queens | Northwest Queens | 2020-08-19 | 2020-08-19 | 80.0 | 4.0 | 0.0 | 10.0 | 52 |
| 4 | Subscriber | 11101 | Queens | Northwest Queens | 11222 | Brooklyn | Greenpoint | 2020-09-15 | 2020-09-15 | 72.0 | 2.1 | 0.0 | 20.0 | 14 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2238198 | Subscriber | 11238 | Brooklyn | Central Brooklyn | 10003 | Manhattan | Lower East Side | 2019-09-03 | 2019-09-03 | 79.4 | 4.6 | 0.0 | 30.0 | 1 |
| 2238199 | Subscriber | 11238 | Brooklyn | Central Brooklyn | 10013 | Manhattan | Greenwich Village and Soho | 2019-02-10 | 2019-02-10 | 25.1 | 5.4 | 0.1 | 40.0 | 1 |
| 2238200 | Subscriber | 11238 | Brooklyn | Central Brooklyn | 10012 | Manhattan | Greenwich Village and Soho | 2019-10-25 | 2019-10-25 | 58.1 | 4.3 | 0.0 | 30.0 | 1 |
| 2238201 | Subscriber | 11238 | Brooklyn | Central Brooklyn | 10019 | Manhattan | Chelsea and Clinton | 2019-11-11 | 2019-11-11 | 56.4 | 2.9 | 0.0 | 50.0 | 1 |
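The stakeholder notes earlier describe two kinds of readers: one who wants high-level insights and one who explores the detail. One possible way to serve both from the same target table is to keep the detailed rows and derive a small summary from them. The sketch below uses the first five rows of the output above, with only the columns it needs. The variable names `detail` and `summary` are hypothetical, and the borough-level rollup is just one reasonable choice of summary.

```python
import pandas as pd

# First five rows of the query output, reduced to the columns used here.
detail = pd.DataFrame(
    {
        "usertype": ["Subscriber", "Subscriber", "Customer", "Subscriber", "Subscriber"],
        "borough_start": ["Manhattan", "Manhattan", "Manhattan", "Queens", "Queens"],
        "neighborhood_start": [
            "Upper East Side",
            "Upper West Side",
            "Upper West Side",
            "Northwest Queens",
            "Northwest Queens",
        ],
        "trip_count": [32, 6, 3, 52, 14],
    }
)

# High-level view: total trips per starting borough, busiest first.
summary = (
    detail.groupby("borough_start", as_index=False)["trip_count"]
    .sum()
    .sort_values("trip_count", ascending=False)
    .reset_index(drop=True)
)

print(summary)
```

For these five rows the summary shows 66 trips starting in Queens and 41 in Manhattan, while `detail` still holds the neighborhood and user type breakdown for deeper exploration.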


```python
# Save the target table as a CSV file for the reporting step.
df.to_csv("doc/1_Cyclistic_Target_DF.csv")
```
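The project brief asks for built-in defenses against data quality issues. A minimal sketch of what such a check could look like on the target table follows. The function `check_target_table` and the specific rules are assumptions chosen for illustration, not requirements from the Cyclistic team. The sample rows come from the first two rows of the output above.

```python
import pandas as pd


def check_target_table(df: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means no issue was detected."""
    problems = []
    # Key columns that every reporting row is assumed to need.
    for column in ["usertype", "zip_code_start", "zip_code_end", "start_day"]:
        if df[column].isna().any():
            problems.append(f"null values in {column}")
    # Each row is an aggregate, so it should count at least one trip.
    if (df["trip_count"] <= 0).any():
        problems.append("non-positive trip_count")
    # The query buckets durations to the nearest 10 minutes.
    if (df["trip_minutes"] % 10 != 0).any():
        problems.append("trip_minutes not a multiple of 10")
    return problems


good = pd.DataFrame(
    {
        "usertype": ["Subscriber", "Subscriber"],
        "zip_code_start": ["10028", "10024"],
        "zip_code_end": ["10028", "10028"],
        "start_day": ["2020-09-26", "2020-11-09"],
        "trip_minutes": [0.0, 10.0],
        "trip_count": [32, 6],
    }
)
# Deliberately broken copy to show what a failed check reports.
bad = good.assign(trip_count=[32, 0], trip_minutes=[0.0, 15.0])

print(check_target_table(good))
print(check_target_table(bad))
```

Running a check like this before `to_csv` would let the pipeline stop, or at least warn, when a row violates an expectation instead of passing it on to the dashboard.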