Water Consumption in a Median Size City Study, Part of Udacity Data Scientist Nano Degree (Term 2).
The project consists of this repository and this Medium post. In this project, I have applied the Cross-industry standard process for data mining (CRISP-DM) methodology on the mentioned dataset to find the answers of some questions.
The method involves: Business Understanding, Data Understanding, Data Preparation, Modeling, and Evaluation.
I have selected this dataset because it is about water consumption, and my major is water and agricultural engineering, so I have a good background of water-related studies, especially about water conservation.
The First Stage: I have explored the data, downloaded the English description of the features (the original dataset is in Spanish), and explored the data set properties through pandas and Jupyter notebook.
I found too many missing values in the dataset, I have specified a criterion to drop or impute the missing values. I have made a small comparative study on imputing time series through the ImputingTimeSeriesData notebook here and published the description in this Medium post.
Next, I have applied the best imputing methodology to impute my data as shown in the WaterConsumtionSonora_Stage1 notebook.
The second stage: Due to the limitation of file upload size in GitHub, I attempted to reformat the datafile to be less than 100 MB, I have divided the data table to one primary and three child tables that will be joined through primary-foreign keys.
- The main table, I removed all the text from it and placed foreign keys instead
- The other three tables I have hot-encoded them, each with its suitable encoding
- The
Usage
table was encoded based on the properties of its values. - The
User
table was encoded based on the properties of its values. - The
Gauge
table was encoded based on the one-hot-encoding method. This stage is saved at the WaterConsumtionSonora_Stage2 notebook.
- The
The third stage: I have applied some ML algorithm to see the possibility of predicting either the reference consumption, or the average monthly consumption, or both.
The following models were applied: {With and without normalizing the data}
- linear_model.LinearRegression()
- linear_model.SGDRegressor()
- linear_model.PassiveAggressiveRegressor()
The models were applied to two shapes of the data:
- The full dataset with all usage features, user's features, and gauge features
- The dataset with all features except the gauge features.
The results showed that all the models have low score (<25% determination coefficient,
This stage was saved at the WaterConsumtionSonora_Stage3 notebook.
the fourth stage: In This stage I have analyzed all the features (except the gauge features) to find the effect of their existence/absence on the consumption of water in the city of Sonora.
The results showed that there are some effects of the month of the year on the monthly consumption. The results also show that there are some apparent effects of some features like industry, and agriculture. The detailed results are at the end of the WaterConsumtionSonora_Stage4 notebook.
The source of the dataset: https://www.kaggle.com/marcomolina/water-consumption-in-a-median-size-city/discussion/23722