

Raising Awareness on Water Availability

"Day Zero" is a term used to refer to a situation of extreme water shortage, one that Cape Town came very close to facing last year. The city has managed to avoid a complete water shortage so far, but the episode is a serious warning for the rest of the world. Many cities around the world experience water shortages throughout the year, disrupting the daily lives of residents, businesses, and the economy. The ability to keep an eye on cities approaching a "Day Zero" event could be incredibly useful, not only for deploying aid effectively but also for taking preventative action early, so that local authorities are not forced to impose extreme water usage laws on short notice.

To help raise awareness of this situation, we used machine learning to develop a model that predicts the water stress level of a country given a particular set of attributes. We divided the raw stress values (percentages that normally fall between 0 and 100 but can exceed 100) into 6 categories:

Stress Value    Stress Level
0 - 20          None
21 - 40         Low
41 - 60         Medium
61 - 80         Alert
81 - 100        High
>100            Critical

Though we do not have enough data to go down to the city level, viewing things from a larger perspective can certainly provide insight into how a nation should work as a whole to prevent reaching a Day Zero situation.

Click here to read a description of the project.

About the Data

Most of our data was gathered from the extensive AQUASTAT database which provides many different water metrics for every country for the past 60 years. Another source was used to get rainwater harvesting information.



  • total land cultivated (%) [AQUASTAT]: Percentage of the total land area of the country that has been cultivated
  • annual precipitation (mm/yr) [AQUASTAT]: Total depth of precipitation per year

Water Resources

  • rainwater harvesting awareness [multiple sources]: True/False value determined by whether or not rainwater harvesting is widely practiced
  • water consumption per capita (m3/year/inhabitant) [AQUASTAT]: Total amount of water withdrawn per capita
  • total renewable water resources per capita (m3/year/inhabitant) [AQUASTAT]: The maximum theoretical yearly amount of water available per person for a country at a given moment
  • desalination capacity (km3/year) [AQUASTAT]: Fresh water produced using brackish or salt water
  • water dependency ratio (%) [AQUASTAT]: Percentage of water that comes from other countries
  • agricultural water withdrawal (%) [AQUASTAT]: Percentage of total water withdrawn used for agriculture
  • industrial water withdrawal (%) [AQUASTAT]: Percentage of total water withdrawn used for industrial purposes
  • municipal water withdrawal (%) [AQUASTAT]: Percentage of total water withdrawn used for municipal purposes
  • water stress level (%) [AQUASTAT]: Water stress level measured by dividing total water withdrawal by the total water available minus any water needed for environmental flow. This was used to determine the class label for each sample (see table above)
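The stress definition and the class table can be sketched together in a few lines. This is an illustrative sketch only, not the project's code: the function names `water_stress` and `stress_level` are hypothetical, and treating each upper bound as inclusive (so that a value such as 20.5 falls into "Low") is an assumption, since the table lists only whole-number ranges.

```python
# Upper bounds for each level, taken from the stress level table.
# Assumption: bounds are inclusive, so 20.0 is "None" and 20.5 is "Low".
LEVELS = [(20, "None"), (40, "Low"), (60, "Medium"), (80, "Alert"), (100, "High")]


def water_stress(withdrawal: float, available: float, environmental_flow: float) -> float:
    """Water stress (%) = withdrawal / (available - environmental flow) * 100."""
    usable = available - environmental_flow
    if usable <= 0:
        raise ValueError("available water must exceed the environmental flow")
    return 100.0 * withdrawal / usable


def stress_level(stress: float) -> str:
    """Map a stress percentage to one of the six class labels."""
    if stress < 0:
        raise ValueError("stress cannot be negative")
    for upper, label in LEVELS:
        if stress <= upper:
            return label
    return "Critical"  # withdrawals exceed the usable supply
```

All three arguments to `water_stress` must be in the same unit (for example km3/year). A country that withdraws more than its usable supply gets a value above 100 and is labelled "Critical".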

Dealing with Missing Attributes

Much of this data (at the time of this project) is difficult to measure and record for several countries, so there were many missing fields to deal with. To see how missing attributes affect the results compared with a complete dataset, we generated two datasets. The filled dataset uses linear regression to fit a simple best-fit model for every attribute of every country; any missing attributes are populated from this model, and any negative values it produces are clamped to 0. The original dataset uses ridge regression to complete only the stress column, the value we use to classify our data; all other missing attributes are left as missing.
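The filling step for the filled dataset can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: it assumes the regression is fitted against the year for one attribute of one country, and the names `fit_line` and `fill_missing` are hypothetical. It uses plain Python so that it runs without extra libraries.

```python
from typing import List, Optional, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:  # only one distinct year: fall back to a flat line
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fill_missing(years: Sequence[int], values: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Fill None entries from a linear fit over the known points, clamped at 0."""
    known = [(y, v) for y, v in zip(years, values) if v is not None]
    if not known:
        return list(values)  # nothing to fit, so the gaps stay as they are
    slope, intercept = fit_line([y for y, _ in known], [v for _, v in known])
    return [
        v if v is not None else max(0.0, slope * y + intercept)
        for y, v in zip(years, values)
    ]
```

For example, `fill_missing([1990, 1995, 2000, 2005], [10.0, None, 20.0, None])` keeps the two known values and fills the gaps from the line through them. Observed values are never overwritten; only the gaps are predicted.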


Data from our sources were presented in several different formats and contained extra information that was not useful for our application, so we developed data-cleaning scripts to keep only what was necessary. parse_csv contains functions to parse raw CSV files from our sources into datasets that are compatible with our machine learning application. All the clean datasets were then merged into one master dataset using master_gen.py. This script also creates the best-fit models (described above) for each country's attributes, categorizes the output stress levels into 6 classes, and splits the whole dataset into testing and training subsets.
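The merge and split steps might look roughly like the sketch below. It is not taken from master_gen.py: the function names, the (country, year) key, the 20% test fraction, and the fixed seed are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, int]  # (country, year)


def merge_attributes(tables: Dict[str, Dict[Key, float]]) -> List[dict]:
    """Combine per-attribute tables into one row per (country, year)."""
    keys = sorted({key for table in tables.values() for key in table})
    rows = []
    for country, year in keys:
        row: Dict[str, Optional[object]] = {"country": country, "year": year}
        for name, table in tables.items():
            row[name] = table.get((country, year))  # None marks a missing value
        rows.append(row)
    return rows


def split_rows(rows: List[dict], test_fraction: float = 0.2, seed: int = 0):
    """Shuffle the rows reproducibly and split them into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

Keeping missing values as `None` in the merged rows leaves the choice between filling them and leaving them empty to a later step.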

AQUASTAT presents nearly all of its values in 5-year increments, but the starting year for one attribute may differ from that of another. This led to instances where half the attributes were present for one year and the remaining attributes were present for the following year or the year after that. As a result, every documented year had large gaps. To fix this issue, we shifted the years of some attributes so that all attributes follow the same 5-year time step for each country. For example, an attribute with a year label of 1991 would be shifted to 1990 unless a value was already present in the 1990 slot. This assumes that values stay relatively steady over the course of a couple of years.
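The year-shifting rule can be sketched as below. The helper names are hypothetical, and two details are assumptions that the description does not settle: each year is moved to the nearest multiple of 5, and a value whose target slot is already taken is dropped.

```python
from typing import Dict


def nearest_step(year: int, step: int = 5) -> int:
    """Round a year to the nearest multiple of `step` (1991 -> 1990, 1993 -> 1995)."""
    return step * ((year + step // 2) // step)


def align_years(series: Dict[int, float], step: int = 5) -> Dict[int, float]:
    """Shift one attribute's values onto the shared time step.

    A value is moved only if its target slot is still empty, so a value
    already recorded on the step is never overwritten.
    """
    aligned: Dict[int, float] = {}
    # Visit years closest to their target first, so on-step values claim the slot.
    for year in sorted(series, key=lambda y: (abs(y - nearest_step(y, step)), y)):
        target = nearest_step(year, step)
        if target not in aligned:
            aligned[target] = series[year]
    return aligned
```

For instance, `align_years({1991: 2.0})` moves the value to 1990, while `align_years({1990: 1.0, 1991: 2.0})` keeps the 1990 value and discards the 1991 one.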

Building the Model

The target attribute used to train the machine learning model is the water stress level of a particular country in a particular year. Using the randomly generated training and testing sets, we ran a series of tests to find the algorithm that best classifies the data. Given the type of inputs and outputs, we suspected that a decision tree or an instance-based learner would be effective. We also predicted that a multilayer perceptron (MLP) might do well with our wide range of attributes because of its flexibility. The results of several machine learning algorithms are shown below. Instances of these models can be found in the models directory.

Algorithm               Accuracy on Filled Dataset    Accuracy on Original Dataset
IBk                     95.05%                        88.26%
KStar                   92.21%                        89.28%
RandomForest            92.04%                        88.26%
RandomTree              90.80%                        84.69%
MultiClassClassifier    90.69%                        86.22%
J48                     88.88%                        84.69%
LogitBoost              88.37%                        84.69%
BayesNet                85.10%                        85.00%
NaiveBayes              79.74%                        67.85%
AdaBoostM1              79.40%                        75.51%
ZeroR                   79.40%                        75.51%
MLP                     74.77%                        75.51%

NOTE: All models were generated using 10-fold cross-validation in Weka. The Details column provides information on what configuration gave the best results for the tested algorithm.

As expected, the best models were either trees or instance-based learners. These algorithms did much better than the ZeroR baseline, showing that a useful model can be generated with our sample size and attributes. On the original dataset, KStar performed the best (89.28%), while IBk fell from first place on the filled dataset (95.05%) to 88.26%, level with RandomForest. This is understandable: with so many missing attributes, a nearest-neighbor approach becomes much less accurate. The MLP model performed much worse than expected, scoring no better than the ZeroR baseline on either dataset, possibly because it was overfitting the training set. Further tuning could potentially improve this model, but there was not enough time to try more configurations because generating even one model takes a long time.
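To make the ZeroR baseline concrete: it ignores every attribute and always predicts the most common class, so its accuracy is roughly the share of samples in the largest class. The sketch below is a plain-Python illustration of that idea under 10-fold cross-validation. It is not the Weka setup used for the table above, and the function names are hypothetical.

```python
import random
from collections import Counter
from typing import List, Sequence


def zero_r(train_labels: Sequence[str]) -> str:
    """ZeroR: always predict the most common label seen in training."""
    return Counter(train_labels).most_common(1)[0][0]


def cross_validated_accuracy(labels: List[str], folds: int = 10, seed: int = 0) -> float:
    """Estimate ZeroR accuracy with k-fold cross-validation."""
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)
    correct = 0
    for fold in range(folds):
        test_idx = indices[fold::folds]  # every k-th shuffled sample is held out
        held_out = set(test_idx)
        train = [labels[i] for i in indices if i not in held_out]
        prediction = zero_r(train)
        correct += sum(labels[i] == prediction for i in test_idx)
    return correct / len(labels)
```

Any classifier that uses the attributes should beat this number. A model that only matches it has effectively learned nothing beyond the class distribution.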

Comparing performance across the two datasets, missing attributes clearly affect how well an algorithm can classify new samples: nearly every algorithm scored lower on the original dataset than on the filled one. Hopefully in the near future (and with enough digging around), this dataset can become more complete as more information on each country's water usage is published.