For HW1 & HW2
- This is as much a data exploratory project as much as a data warehouse and BI solution project. I'm looking to create a data pipeline where users would be able to monitor drinking water quality in NYC. This should allow us to perform analytics to identify trends in turbidity, flouride, coliform & e. coli over time. It should also allow us to localize areas where water quality levels are problematic.
- Being able to detect any deviations from expected levels of turbidity, fluoride, coliform & e.coli.
- Being able to detect a trend in water quality over time
- Retrieving information for either compliance or operational reasons
- The system should return the latitude and longitude values for each instance of sampling where water quality is irregular.
- The system should categorize instances where the four indicators are outside expected range
- Users should be able to filter water quality information based on the reason for sampling.
- Data web-source: https://data.cityofnewyork.us/Environment/Drinking-Water-Quality-Distribution-Monitoring-Dat/bkwf-xfky/about_data
- Data download-link:
- Water Quality: https://data.cityofnewyork.us/api/views/bkwf-xfky/rows.csv?fourfour=bkwf-xfky&cacheBust=1729258068&date=20241022&accessType=DOWNLOAD
- Columns: 10
- Rows: 149000
- Test Site Codes: https://data.cityofnewyork.us/api/views/bkwf-xfky/files/e93e4856-95f7-48d4-b4c0-fa54989cdbfc?download=true&filename=OpenData_Distribution_Water_Quality_Sampling_Sites_Updated_2021-0618.csv
- Columns: 4
- Rows: 401
- Water Quality: https://data.cityofnewyork.us/api/views/bkwf-xfky/rows.csv?fourfour=bkwf-xfky&cacheBust=1729258068&date=20241022&accessType=DOWNLOAD
- Dimensions
- Sample Number | Sample Date | Sample Time | Sample Site | Sample class |
- Sample Station (SS) - Location Description
- Facts
- Residual Free Chlorine (mg/L) | Turbidity (NTU) | Flouride (mg/L) |
- Coliform (Quanti-Tray)(MPN/100mL) | E.coli(Quanti-Tray)(MPN/100mL)
- X - Coordinate | Y - Coordinate
- The following will show the two parts of information architecture: data processing and franchising.
Data Processing - The process begins by acquiring the download links for the drinking water & sampling site excel files. The API will only fetch up to 1000 rows of data.
- There will be a script which uses a function to assign the downloaded csv file to a DataFrame variable.
- Data will then be reformatted into appropriate data types that we can work with in Python/Visual Studio.
- This will be done for both sampling site data frame and drinking water data frame
- Ensures that no NaN/Null values exist in fields which should not have them.
- Ensures that all numerical, float, date & string values are appropriately assigned
- A script will be written to combine the two data frames in a logical way. The drinking water's site code should be able to refer to the sampling site's code and fetch the location string value alongside the lat & lon values.
- Transforming the data here will be similar to cleaning the data.
- There will be a script which will detect values are amiss from what's expected.
- There will be a script to filter out unexpected values & input appropriate ones per field requirements.
- Finally the refined data will be stored in local/external drive. Data mart is uncessary due to small size of total data. It's also not live and fast paced data.
Data Franchising - After the first portion of processing data, we'll load the data to be further edited for BI usage.
- There will be a script/function to filter & create subsets of data based on categorical or numerical requirements (according to metrics).
- There will be scripts which transform the base numerical values into proportions/percentages for BI graph or measuring uses.
- Appropriate data such as lon & lat will be aggregated to locate areas of water quality issues.
- We will then store the modified data per their variables in the local drive & then run them through our OLAP software(s) to attain reports & graphs.
- Our data architecture provides a general overview of the whole process and its end goal. Our data integration will mainly be done via python/visual studio.
- There isn't that many sources and the small size of our data doesn't warrant a complex storage setup.
- Azure Blob - Azure storage will be used and all relevant files will be downloaded and uploaded onto a private container. The script for this was written in google collab.
- We'll extract raw data via API and clean the data. Then reupload the cleaned version as a separate file onto our blob storage.
- Snowflake - Usage of online Data Warehouse service called snowflake to fetch and filter the raw data from Azure Blob.
The goals are follows:
- Ensure all fact attributes are either int or floats while removing or adjusting any unique characters.
- Combine sample date and sample time into one column and format it into a date time value.
- Reduce number of rows which might be skipped due to NaN values (Where reasonable & possible)
- Finally, remove remaining rows that contain NaN values
- Dbt Cloud - A data modeling tool that I use to build out the SQL scripts to create my dimensional and facts tables. All scripts are then committed to GitHub.
The visualization was done via Tableau Desktop. Data source is external and accessed via connection to Snowflake Warehouse.
Below is the dashboard used to filter for samples taken based on date and the five measures: Coliform, Ecoli, Turbidity, Flouride & Residual Free Chlorine.
Below is the dashboard used to examine the pattern of coliform cases between two years selected.
This is measured in a Quanti-Tray by MPN(most probable number)/100mL of water. It represents sewage contamination and should always be 0. Detection means there is insufficient treatment or there's a sewage leakage.
Below is the dashboard used to examine the number of Ecoli cases in any given year. Ecoli should is measured by ratio of MPN(most probable number)/100mL of water. This should always be 0.
Below is the dashboard used to examine the pattern of Turbidity between two selected years or sum results of years. Turbidity is a measure of water clarity and represented by Nephelometric Turbidity Units (NTU). The acceptable range for drinking water is below 1 NTU.
Below is a dashboard used to examine pattern of flourine in mg/L of water. An acceptable level is 1 and below. Flouride is added for dental health reasons.