# Capstone: Forecasting Water Levels in Chennai, India


*By: Asher Lewis* [GitHub](https://github.com/abrahamlewis4867)

<img src="../assets/D9HGx7FVUAEawA-.jpeg" width="1400px">

## Problem Statement

For this project, we forecast the average monthly water level of the four main reservoirs of Chennai, India, using time-series data. The water level is measured in millions of cubic feet (mcft), and we score our predictions with the Mean Squared Error (MSE). Our threshold for success is that the model scores a lower MSE than the baseline model, that is, an MSE closer to zero than the baseline's. Water demand forecasting is hard in general, so this is a deliberately modest goal. The motivation is the 2019 Chennai water crisis, which left millions of people without water and required many trains and trucks to bring water into the city. If we can forecast the monthly level of a given reservoir, we can get an idea of how and when the city's reservoirs run out of water. This information can potentially be used later to predict future water demand.
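
A minimal sketch of the scoring rule described above. It assumes the baseline simply predicts the training mean for every test month (the exact baseline is not spelled out here), and all numbers are made up for illustration:

```python
import numpy as np

def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((actual - predicted) ** 2))

# Hypothetical monthly water levels in mcft (illustrative values only).
train = np.array([900.0, 1100.0, 1000.0, 1200.0])
test = np.array([1000.0, 1100.0, 900.0])

# Assumed baseline: predict the training mean for every test month.
baseline_pred = np.full(len(test), train.mean())

# Hypothetical model forecast for the same three months.
model_pred = np.array([1050.0, 1050.0, 950.0])

baseline_mse = mse(test, baseline_pred)
model_mse = mse(test, model_pred)

# The model "succeeds" only if its MSE is lower than the baseline's.
beats_baseline = model_mse < baseline_mse
print(f"baseline MSE={baseline_mse:.1f}, model MSE={model_mse:.1f}, beats baseline={beats_baseline}")
```

Because MSE squares each error, a single badly missed month can dominate the score, which is worth keeping in mind when comparing against the baseline.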

## Executive Summary

On 19 June 2019, Chennai city officials declared that "Day Zero", the day when almost no water is left, had been reached, as all four main reservoirs supplying water to the city had run dry. In this project, we first combined our two data sets and saved them to a new CSV for analysis and forecasting.

The workflow was then broken up into four separate notebooks, with this fifth one serving as the place where they all come together.

In each notebook, we analyzed trends and the nature of both the water level and rain. 
We then explained some elements of time-series data, such as the potential problem of data not being stationary. 
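
To make the stationarity problem concrete, here is a small sketch using a made-up series (not the reservoir data). A series with a trend has a mean that drifts over time, and first differencing is one common way to remove that drift:

```python
import numpy as np
import pandas as pd

# Hypothetical series with a steady upward trend: its mean changes over
# time, so it is not stationary.
level = pd.Series(np.arange(12, dtype=float) * 10.0 + 100.0)

# The mean of the first half differs from the mean of the second half.
first_half_mean = level.iloc[:6].mean()
second_half_mean = level.iloc[6:].mean()

# First difference: the change from one period to the next.
# The first value has no predecessor, so it becomes NaN and is dropped.
diffed = level.diff().dropna()

print(first_half_mean, second_half_mean)  # the two halves disagree
print(diffed.unique())                    # the differenced series is constant
```

On real data the differenced series will not be constant, so a formal check such as the augmented Dickey-Fuller test is the usual next step.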

After this, we split our data and modeled. We ran a baseline model on each one of the reservoirs. After that, we ran an ARIMA model on each reservoir. 
For the ARIMA model, we looked at the residuals and plotted the predictions.
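
The split step can be sketched as follows. This is an illustration under assumptions, not the notebooks' exact code: the helper `chronological_split` and the 25% test fraction are hypothetical. The point is that time-series data is split by position, never shuffled, so the test set always lies after the training set:

```python
import pandas as pd

def chronological_split(series, test_fraction=0.2):
    """Split a time-ordered series into (train, test) without shuffling."""
    n_test = max(1, int(round(len(series) * test_fraction)))
    return series.iloc[:-n_test], series.iloc[-n_test:]

# Hypothetical monthly series covering two years (illustrative values only).
monthly = pd.Series(
    [float(v) for v in range(24)],
    index=pd.date_range("2018-01-01", periods=24, freq="MS"),
)

train, test = chronological_split(monthly, test_fraction=0.25)

print(len(train), len(test))
print(train.index.max(), "<", test.index.min())
```

The same effect is available from scikit-learn via `train_test_split(..., shuffle=False)`.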



## Table of Contents


1. [Chembarambakkam EDA and modeling](./chembarambakkam_eda_and_modeling.ipynb)
1. [Loading packages and data](#Loading-packages-and-data)
1. [Data Cleaning](#Data-Cleaning)
1. [EDA](#EDA)
1. [Model Preparation](#Model-Preparation)
1. [Modeling](#Modeling)
1. [Model Selection](#Model-Selection)
1. [Model evaluation](#Model-evaluation)
1. [Conclusions and Recommendations](#Conclusions-and-Recommendations)
1. [References](#References)

## Data Dictionary 

|Feature|Type|Dataset|Description|
|---|---|---|---|
|Date|datetime64|chennai_reservoir_levels.csv| The date in year, month and day|
|poondi_water|float64|chennai_reservoir_levels.csv|Water level of Poondi lake in millions of cubic feet|
|cholavaram_water|float64|chennai_reservoir_levels.csv|Water level of Cholavaram lake in millions of cubic feet|
|redhills_water|float64|chennai_reservoir_levels.csv|Water level of Redhills lake in millions of cubic feet|
|chembarambakkam_water|float64|chennai_reservoir_levels.csv|Water level of Chembarambakkam lake in millions of cubic feet|
|cholavaram_rain|float64|chennai_reservoir_rainfall.csv|Rainfall for Cholavaram lake in millimeters|
|poondi_rain|float64|chennai_reservoir_rainfall.csv|Rainfall for Poondi lake in millimeters|
|redhills_rain|float64|chennai_reservoir_rainfall.csv|Rainfall for Redhills lake in millimeters|
|chembarambakkam_rain|float64|chennai_reservoir_rainfall.csv|Rainfall for Chembarambakkam lake in millimeters|


Our data comes from the [Chennai Metropolitan Water Supply and Sewerage Board](https://chennaimetrowater.tn.gov.in/) and was gathered together on Kaggle. It contains daily data from 2004 to the end of 2019.

<img src="../assets/chennai.png" width="1400px">

Red Hills, Cholavaram, Poondi and Chembarambakkam have a combined capacity of 11,057 mcft. These are the major sources of fresh water for the city ([source](https://chennaimetrowater.tn.gov.in/online_water_taxpayment.html)).

---

## Loading packages and data
---

In [1]:
import pandas as pd  # reading and manipulating data
import numpy as np
import matplotlib.pyplot as plt  # plotting
from matplotlib.patches import Rectangle
import seaborn as sns
import statsmodels.api as sm  # statistics and time-series tools
from statsmodels.tsa.stattools import acf, pacf, adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA  # replaces the deprecated statsmodels.tsa.arima_model
from sklearn.metrics import r2_score
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from pmdarima import auto_arima


sns.set_style("darkgrid")  # use seaborn's dark grid theme for all plots

In [2]:
water = pd.read_csv("../data/chennai_reservoir_levels.csv") #reading in data

In [3]:
rain = pd.read_csv("../data/chennai_reservoir_rainfall.csv") #reading in data

---

## Data Cleaning

In [4]:
rain.shape

(5836, 5)

In [5]:
water.shape

(5836, 5)

In [6]:
water.rename(columns={"POONDI": "poondi_water",  # keys are the original column names, values are the new names
                      "CHOLAVARAM": "cholavaram_water",
                      "REDHILLS": "redhills_water",
                      "CHEMBARAMBAKKAM": "chembarambakkam_water",
                      "Date": "date"}, inplace=True)

In [7]:
rain.rename(columns={"POONDI": "poondi_rain",
                     "CHOLAVARAM": "cholavaram_rain",
                     "REDHILLS": "redhills_rain",
                     "CHEMBARAMBAKKAM": "chembarambakkam_rain",
                     "Date": "date"}, inplace=True)

In [8]:
df = pd.merge(water.copy(), rain.copy(), on="date", how="outer")  # outer merge on the shared date column

In [9]:
df.isna().sum()

date                     0
poondi_water             0
cholavaram_water         0
redhills_water           0
chembarambakkam_water    0
poondi_rain              0
cholavaram_rain          0
redhills_rain            0
chembarambakkam_rain     0
dtype: int64
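
An outer merge keeps dates that appear in only one table and fills the other table's columns with NaN, so the all-zero counts above suggest that every date matched. A hedged sketch of how that could be checked explicitly, using two tiny made-up frames (`water_demo` and `rain_demo` are hypothetical, not the real data) and the `validate` and `indicator` options of `pd.merge`:

```python
import pandas as pd

# Tiny made-up frames; one date is deliberately missing from each side.
water_demo = pd.DataFrame({
    "date": ["01-01-2004", "02-01-2004", "03-01-2004"],
    "poondi_water": [3.9, 3.9, 3.9],
})
rain_demo = pd.DataFrame({
    "date": ["01-01-2004", "02-01-2004", "04-01-2004"],
    "poondi_rain": [0.0, 0.0, 1.5],
})

merged = pd.merge(
    water_demo,
    rain_demo,
    on="date",
    how="outer",
    validate="one_to_one",  # raises if a date is duplicated on either side
    indicator=True,         # adds a "_merge" column: both / left_only / right_only
)

# Rows that did not find a partner in the other table.
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["date", "_merge"]])
```

On the real data, an empty `unmatched` frame would confirm that the two files cover exactly the same dates.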

In [10]:
df

Unnamed: 0,date,poondi_water,cholavaram_water,redhills_water,chembarambakkam_water,poondi_rain,cholavaram_rain,redhills_rain,chembarambakkam_rain
0,01-01-2004,3.9,0.0,268.0,0.0,0.0,0.0,0.0,0.0
1,02-01-2004,3.9,0.0,268.0,0.0,0.0,0.0,0.0,0.0
2,03-01-2004,3.9,0.0,267.0,0.0,0.0,0.0,0.0,0.0
3,04-01-2004,3.9,0.0,267.0,0.0,0.0,0.0,0.0,0.0
4,05-01-2004,3.8,0.0,267.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...
5831,19-12-2019,1535.0,139.0,2318.0,1397.0,0.0,0.0,0.0,0.0
5832,20-12-2019,1529.0,131.0,2335.0,1435.0,0.0,0.0,0.0,0.0
5833,21-12-2019,1522.0,123.0,2351.0,1473.0,0.0,0.0,0.0,0.0
5834,22-12-2019,1514.0,115.0,2369.0,1510.0,0.0,0.0,0.0,0.0


In [11]:
df = df[['date', 'poondi_water','poondi_rain','cholavaram_water',
   'cholavaram_rain', 'redhills_water', 'redhills_rain', 'chembarambakkam_water','chembarambakkam_rain' ]]

In [12]:
df

Unnamed: 0,date,poondi_water,poondi_rain,cholavaram_water,cholavaram_rain,redhills_water,redhills_rain,chembarambakkam_water,chembarambakkam_rain
0,01-01-2004,3.9,0.0,0.0,0.0,268.0,0.0,0.0,0.0
1,02-01-2004,3.9,0.0,0.0,0.0,268.0,0.0,0.0,0.0
2,03-01-2004,3.9,0.0,0.0,0.0,267.0,0.0,0.0,0.0
3,04-01-2004,3.9,0.0,0.0,0.0,267.0,0.0,0.0,0.0
4,05-01-2004,3.8,0.0,0.0,0.0,267.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...
5831,19-12-2019,1535.0,0.0,139.0,0.0,2318.0,0.0,1397.0,0.0
5832,20-12-2019,1529.0,0.0,131.0,0.0,2335.0,0.0,1435.0,0.0
5833,21-12-2019,1522.0,0.0,123.0,0.0,2351.0,0.0,1473.0,0.0
5834,22-12-2019,1514.0,0.0,115.0,0.0,2369.0,0.0,1510.0,0.0


In [13]:
df.dtypes

date                      object
poondi_water             float64
poondi_rain              float64
cholavaram_water         float64
cholavaram_rain          float64
redhills_water           float64
redhills_rain            float64
chembarambakkam_water    float64
chembarambakkam_rain     float64
dtype: object

In [14]:
df.sort_index(inplace = True)

In [33]:
df.index.inferred_freq

'D'

In [15]:
df["date"] = pd.to_datetime(df.loc[:, "date"], format='%d-%m-%Y')  # parse day-month-year strings into datetime64 values
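
The explicit `format` matters because the raw dates are written day first. A small sketch on two sample strings taken from the table above shows what the parse produces:

```python
import pandas as pd

# Two date strings in the file's day-month-year layout.
raw = pd.Series(["02-01-2004", "19-12-2019"])

# An explicit format removes any ambiguity between day and month.
parsed = pd.to_datetime(raw, format="%d-%m-%Y")

print(parsed.iloc[0].strftime("%d %B %Y"))  # 02 January 2004, not 1 February
print(parsed.dtype)
```

Without the format, a string such as `02-01-2004` could be read month first, which would silently scramble the time order.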

In [16]:
df.to_csv("../data/chennai_complete.csv",index=False)
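
Since the goal is a monthly average while the saved file is daily, the later notebooks need an aggregation step. One possible way to do it is sketched below on a synthetic frame (the values are made up, and the real notebooks may aggregate differently):

```python
import numpy as np
import pandas as pd

# Synthetic daily data for January and February 2004 (illustrative values):
# the level is 10.0 on every January day and 20.0 on every February day.
dates = pd.date_range("2004-01-01", "2004-02-29", freq="D")
daily = pd.DataFrame({
    "date": dates,
    "poondi_water": np.where(dates.month == 1, 10.0, 20.0),
})

# Index by date, then average each calendar month.
# "MS" labels every month by its first day.
monthly = daily.set_index("date")["poondi_water"].resample("MS").mean()

print(monthly)
```

Resampling requires a datetime index, which is why the `date` column is parsed before the file is saved.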

---

## Conclusions and Recommendations

All of our models managed to exceed our problem statement's goal of an $R^2$ score above 65%. In fact, most of them did quite well.
In our models, significant leading correlations with lags were hard to find, so our regression models may be better at predicting the present than the future. Still, regression models are quite powerful as well as interpretable.

There are many things we can do in the future, such as implementing more complex models like ARIMA and SARIMA. We could also rerun our existing models on differenced data, or apply regularization.

It goes without saying that more data is always better. It would be nice to have features such as temperature and exact water usage.


In terms of the data, it was fascinating to see how much of what it shows is man-made, from the reservoirs themselves to the water scarcity problem. I would suggest better methods of collecting water during the monsoon season, and keeping a better record of how people use the water. This is truly a crisis that unfortunately awaits most cities unless we take the proper action.


## References

1. [Duke University](http://people.duke.edu/~rnau/timereg.html)
1. [Penn State](https://online.stat.psu.edu/stat462/node/188/)
1. [Dataquest](https://www.dataquest.io/blog/tutorial-time-series-analysis-with-pandas/)
1. [The New Indian Express](https://www.newindianexpress.com/cities/chennai/2019/jun/15/water-becomes-a-priced-possession-in-north-chennai-1990378.html)
1. [Our World in Data](https://ourworldindata.org/water-use-stress)
1. [WRI Aqueduct](https://www.wri.org/aqueduct/data)
1. [IHE Delft](https://www.un-ihe.org/water-peace-and-security-partnership)
1. [Analytics India Magazine](https://analyticsindiamag.com/solving-global-water-crisis-with-artificial-intelligence/)
1. [Kaggle](https://www.kaggle.com/sudalairajkumar/exploration-to-quench-chennai-s-thirst)
1. [Towards Data Science](https://towardsdatascience.com/almost-everything-you-need-to-know-about-time-series-860241bdc578)
1. [NPR](https://www.npr.org/sections/goatsandsoda/2019/06/25/734534821/no-drips-no-drops-a-city-of-10-million-is-running-out-of-water)
1. [WBUR](https://www.wbur.org/onpoint/2019/08/01/india-chennai-water-shortage-crisis-infrastructure)
1. [Census of India](https://censusindia.gov.in/maps/Town_maps/chennai_Mun_cor_div.aspx)
1. [Stack Exchange](https://stats.stackexchange.com/questions/142248/difference-between-r-square-and-rmse-in-linear-regression)
1. [Chennai Metro Water](https://chennaimetrowater.tn.gov.in/)