Skip to content

YuCDG/Water-Quality-Sampling

Repository files navigation

CIS-4400-Project

For HW1 & HW2

Project Goal

  • This is as much a data exploratory project as much as a data warehouse and BI solution project. I'm looking to create a data pipeline where users would be able to monitor drinking water quality in NYC. This should allow us to perform analytics to identify trends in turbidity, flouride, coliform & e. coli over time. It should also allow us to localize areas where water quality levels are problematic.

Business Requirements

  • Being able to detect any deviations from expected levels of turbidity, fluoride, coliform & e.coli.
  • Being able to detect a trend in water quality over time
  • Retrieving information for either compliance or operational reasons

Functional Requirements

  • The system should return the latitude and longitude values for each instance of sampling where water quality is irregular.
  • The system should categorize instances where the four indicators are outside expected range
  • Users should be able to filter water quality information based on the reason for sampling.

Data Requirements

Data Dictionary

https://data.cityofnewyork.us/api/views/bkwf-xfky/files/e1ea5f79-01fe-474c-b625-84a7cc004c27?download=true&filename=OpenData_DataDictionary_DWQO_Distribution_Chem-Micro_20210622.xlsx

Dimensions & Facts

  • Dimensions
    • Sample Number | Sample Date | Sample Time | Sample Site | Sample class |
    • Sample Station (SS) - Location Description
  • Facts
    • Residual Free Chlorine (mg/L) | Turbidity (NTU) | Flouride (mg/L) |
    • Coliform (Quanti-Tray)(MPN/100mL) | E.coli(Quanti-Tray)(MPN/100mL)
    • X - Coordinate | Y - Coordinate

Architectures

Information Architecture

  • The following will show the two parts of information architecture: data processing and franchising.

    Data Processing Information Architecture 1
  • The process begins by acquiring the download links for the drinking water & sampling site excel files. The API will only fetch up to 1000 rows of data.
    • There will be a script which uses a function to assign the downloaded csv file to a DataFrame variable.
  • Data will then be reformatted into appropriate data types that we can work with in Python/Visual Studio.
    • This will be done for both sampling site data frame and drinking water data frame
    • Ensures that no NaN/Null values exist in fields which should not have them.
    • Ensures that all numerical, float, date & string values are appropriately assigned
  • A script will be written to combine the two data frames in a logical way. The drinking water's site code should be able to refer to the sampling site's code and fetch the location string value alongside the lat & lon values.
  • Transforming the data here will be similar to cleaning the data.
    • There will be a script which will detect values are amiss from what's expected.
    • There will be a script to filter out unexpected values & input appropriate ones per field requirements.
  • Finally the refined data will be stored in local/external drive. Data mart is uncessary due to small size of total data. It's also not live and fast paced data.

    Data Franchising Information Architecture 2
  • After the first portion of processing data, we'll load the data to be further edited for BI usage.
    • There will be a script/function to filter & create subsets of data based on categorical or numerical requirements (according to metrics).
    • There will be scripts which transform the base numerical values into proportions/percentages for BI graph or measuring uses.
    • Appropriate data such as lon & lat will be aggregated to locate areas of water quality issues.
  • We will then store the modified data per their variables in the local drive & then run them through our OLAP software(s) to attain reports & graphs.

Data Architecture

Data Architecture

  • Our data architecture provides a general overview of the whole process and its end goal. Our data integration will mainly be done via python/visual studio.
  • There isn't that many sources and the small size of our data doesn't warrant a complex storage setup.

Dimensional Modeling

  • Dimensional modeling below is done via DBSchema and will be conceptualized in Azuresynapse. image

Data Storage

  • Azure Blob - Azure storage will be used and all relevant files will be downloaded and uploaded onto a private container. The script for this was written in google collab.
  • We'll extract raw data via API and clean the data. Then reupload the cleaned version as a separate file onto our blob storage.

Data Warehouse

  • Snowflake - Usage of online Data Warehouse service called snowflake to fetch and filter the raw data from Azure Blob.

Data Transformation & Cleaning

The goals are follows:

  • Ensure all fact attributes are either int or floats while removing or adjusting any unique characters.
  • Combine sample date and sample time into one column and format it into a date time value.
  • Reduce number of rows which might be skipped due to NaN values (Where reasonable & possible)
  • Finally, remove remaining rows that contain NaN values

Modeling

  • Dbt Cloud - A data modeling tool that I use to build out the SQL scripts to create my dimensional and facts tables. All scripts are then committed to GitHub.

Tableau Visual Representations

The visualization was done via Tableau Desktop. Data source is external and accessed via connection to Snowflake Warehouse.

Sample Filtering Dashboard

Below is the dashboard used to filter for samples taken based on date and the five measures: Coliform, Ecoli, Turbidity, Flouride & Residual Free Chlorine. Sample Dashboard

Coliform Dashboard

Below is the dashboard used to examine the pattern of coliform cases between two years selected. This is measured in a Quanti-Tray by MPN(most probable number)/100mL of water. It represents sewage contamination and should always be 0. Detection means there is insufficient treatment or there's a sewage leakage. Coliform Dashboard

Ecoli Dashboard

Below is the dashboard used to examine the number of Ecoli cases in any given year. Ecoli should is measured by ratio of MPN(most probable number)/100mL of water. This should always be 0. Ecoli Dashboard

Turbidity Dashboard

Below is the dashboard used to examine the pattern of Turbidity between two selected years or sum results of years. Turbidity is a measure of water clarity and represented by Nephelometric Turbidity Units (NTU). The acceptable range for drinking water is below 1 NTU. Turbidity Dashboard

Flouride Dashboard

Below is a dashboard used to examine pattern of flourine in mg/L of water. An acceptable level is 1 and below. Flouride is added for dental health reasons. Flouride Dashboard

About

NYC Water quality analysis

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published