CIS-4400-Project

For HW1 & HW2

Project Goal

This is as much a data exploratory project as much as a data warehouse and BI solution project. I'm looking to create a data pipeline where users would be able to monitor drinking water quality in NYC. This should allow us to perform analytics to identify trends in turbidity, flouride, coliform & e. coli over time. It should also allow us to localize areas where water quality levels are problematic.

Business Requirements

Being able to detect any deviations from expected levels of turbidity, fluoride, coliform & e.coli.
Being able to detect a trend in water quality over time
Retrieving information for either compliance or operational reasons

Functional Requirements

The system should return the latitude and longitude values for each instance of sampling where water quality is irregular.
The system should categorize instances where the four indicators are outside expected range
Users should be able to filter water quality information based on the reason for sampling.

Data Requirements

Data web-source: https://data.cityofnewyork.us/Environment/Drinking-Water-Quality-Distribution-Monitoring-Dat/bkwf-xfky/about_data
Data download-link:
- Water Quality: https://data.cityofnewyork.us/api/views/bkwf-xfky/rows.csv?fourfour=bkwf-xfky&cacheBust=1729258068&date=20241022&accessType=DOWNLOAD
  - Columns: 10
  - Rows: 149000
- Test Site Codes: https://data.cityofnewyork.us/api/views/bkwf-xfky/files/e93e4856-95f7-48d4-b4c0-fa54989cdbfc?download=true&filename=OpenData_Distribution_Water_Quality_Sampling_Sites_Updated_2021-0618.csv
  - Columns: 4
  - Rows: 401

Data Dictionary

https://data.cityofnewyork.us/api/views/bkwf-xfky/files/e1ea5f79-01fe-474c-b625-84a7cc004c27?download=true&filename=OpenData_DataDictionary_DWQO_Distribution_Chem-Micro_20210622.xlsx

Dimensions & Facts

Dimensions
- Sample Number | Sample Date | Sample Time | Sample Site | Sample class |
- Sample Station (SS) - Location Description
Facts
- Residual Free Chlorine (mg/L) | Turbidity (NTU) | Flouride (mg/L) |
- Coliform (Quanti-Tray)(MPN/100mL) | E.coli(Quanti-Tray)(MPN/100mL)
- X - Coordinate | Y - Coordinate

Architectures

Information Architecture

The following will show the two parts of information architecture: data processing and franchising.

Data Processing
The process begins by acquiring the download links for the drinking water & sampling site excel files. The API will only fetch up to 1000 rows of data.
- There will be a script which uses a function to assign the downloaded csv file to a DataFrame variable.
Data will then be reformatted into appropriate data types that we can work with in Python/Visual Studio.
- This will be done for both sampling site data frame and drinking water data frame
- Ensures that no NaN/Null values exist in fields which should not have them.
- Ensures that all numerical, float, date & string values are appropriately assigned
A script will be written to combine the two data frames in a logical way. The drinking water's site code should be able to refer to the sampling site's code and fetch the location string value alongside the lat & lon values.
Transforming the data here will be similar to cleaning the data.
- There will be a script which will detect values are amiss from what's expected.
- There will be a script to filter out unexpected values & input appropriate ones per field requirements.
Finally the refined data will be stored in local/external drive. Data mart is uncessary due to small size of total data. It's also not live and fast paced data.

Data Franchising
After the first portion of processing data, we'll load the data to be further edited for BI usage.
- There will be a script/function to filter & create subsets of data based on categorical or numerical requirements (according to metrics).
- There will be scripts which transform the base numerical values into proportions/percentages for BI graph or measuring uses.
- Appropriate data such as lon & lat will be aggregated to locate areas of water quality issues.
We will then store the modified data per their variables in the local drive & then run them through our OLAP software(s) to attain reports & graphs.

Data Architecture

Our data architecture provides a general overview of the whole process and its end goal. Our data integration will mainly be done via python/visual studio.
There isn't that many sources and the small size of our data doesn't warrant a complex storage setup.

Dimensional Modeling

Dimensional modeling below is done via DBSchema and will be conceptualized in Azuresynapse.

Data Storage

Azure Blob - Azure storage will be used and all relevant files will be downloaded and uploaded onto a private container. The script for this was written in google collab.
We'll extract raw data via API and clean the data. Then reupload the cleaned version as a separate file onto our blob storage.

Data Warehouse

Snowflake - Usage of online Data Warehouse service called snowflake to fetch and filter the raw data from Azure Blob.

Data Transformation & Cleaning

The goals are follows:

Ensure all fact attributes are either int or floats while removing or adjusting any unique characters.
Combine sample date and sample time into one column and format it into a date time value.
Reduce number of rows which might be skipped due to NaN values (Where reasonable & possible)
Finally, remove remaining rows that contain NaN values

Modeling

Dbt Cloud - A data modeling tool that I use to build out the SQL scripts to create my dimensional and facts tables. All scripts are then committed to GitHub.

Tableau Visual Representations

The visualization was done via Tableau Desktop. Data source is external and accessed via connection to Snowflake Warehouse.

Sample Filtering Dashboard

Below is the dashboard used to filter for samples taken based on date and the five measures: Coliform, Ecoli, Turbidity, Flouride & Residual Free Chlorine.

Coliform Dashboard

Below is the dashboard used to examine the pattern of coliform cases between two years selected. This is measured in a Quanti-Tray by MPN(most probable number)/100mL of water. It represents sewage contamination and should always be 0. Detection means there is insufficient treatment or there's a sewage leakage.

Ecoli Dashboard

Below is the dashboard used to examine the number of Ecoli cases in any given year. Ecoli should is measured by ratio of MPN(most probable number)/100mL of water. This should always be 0.

Turbidity Dashboard

Below is the dashboard used to examine the pattern of Turbidity between two selected years or sum results of years. Turbidity is a measure of water clarity and represented by Nephelometric Turbidity Units (NTU). The acceptable range for drinking water is below 1 NTU.

Flouride Dashboard

Below is a dashboard used to examine pattern of flourine in mg/L of water. An acceptable level is 1 and below. Flouride is added for dental health reasons.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
Data Scripts		Data Scripts
models/waterquality		models/waterquality
DBS Logic Water Quality.dbs		DBS Logic Water Quality.dbs
README.md		README.md
WQTableau.twb		WQTableau.twb
dbt_project.yml		dbt_project.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

CIS-4400-Project

Project Goal

Business Requirements

Functional Requirements

Data Requirements

Data Dictionary

Dimensions & Facts

Architectures

Information Architecture

Data Architecture

Dimensional Modeling

Data Storage

Data Warehouse

Data Transformation & Cleaning

Modeling

Tableau Visual Representations

Sample Filtering Dashboard

Coliform Dashboard

Ecoli Dashboard

Turbidity Dashboard

Flouride Dashboard

About

Uh oh!

Releases

Packages

Languages

YuCDG/Water-Quality-Sampling

Folders and files

Latest commit

History

Repository files navigation

CIS-4400-Project

Project Goal

Business Requirements

Functional Requirements

Data Requirements

Data Dictionary

Dimensions & Facts

Architectures

Information Architecture

Data Architecture

Dimensional Modeling

Data Storage

Data Warehouse

Data Transformation & Cleaning

Modeling

Tableau Visual Representations

Sample Filtering Dashboard

Coliform Dashboard

Ecoli Dashboard

Turbidity Dashboard

Flouride Dashboard

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages