# Rainfall Prediction Using Linear Regression

Rainfall prediction is the application of science and technology to estimate the amount of rainfall over a region. Accurate rainfall estimates matter for the effective use of water resources, for crop productivity, and for the planning of water structures.

In this notebook we use linear regression to predict the amount of rainfall. The model takes the other weather measurements for a day as input and outputs the number of inches of rainfall we can expect.
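As a minimal illustration of what linear regression does, the sketch below fits a straight line to made-up numbers with ordinary least squares. The `humidity` and `rainfall` values and the relationship between them are hypothetical, chosen only so that the recovered slope and intercept are easy to check. They are not taken from the Austin dataset.

```python
import numpy as np

# Hypothetical toy data: rainfall is generated exactly as 0.02 * humidity - 1.0
humidity = np.array([50.0, 60.0, 70.0, 80.0, 90.0])
rainfall = 0.02 * humidity - 1.0

# Design matrix: one feature column plus a column of ones for the intercept
X = np.column_stack([humidity, np.ones_like(humidity)])

# Ordinary least squares: find the coefficients minimising the squared error
coef, *_ = np.linalg.lstsq(X, rainfall, rcond=None)
slope, intercept = coef

print(f"slope={slope:.4f}, intercept={intercept:.4f}")
```

With several input columns the idea is the same: the model learns one coefficient per column plus an intercept.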

The data is a public weather dataset for Austin, Texas, available on Kaggle [here](https://www.kaggle.com/grubenm/austin-weather).

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv("austin_weather.csv")
df.head(10)

Unnamed: 0,Date,TempHighF,TempAvgF,TempLowF,DewPointHighF,DewPointAvgF,DewPointLowF,HumidityHighPercent,HumidityAvgPercent,HumidityLowPercent,...,SeaLevelPressureAvgInches,SeaLevelPressureLowInches,VisibilityHighMiles,VisibilityAvgMiles,VisibilityLowMiles,WindHighMPH,WindAvgMPH,WindGustMPH,PrecipitationSumInches,Events
0,2013-12-21,74,60,45,67,49,43,93,75,57,...,29.68,29.59,10,7,2,20,4,31,0.46,"Rain , Thunderstorm"
1,2013-12-22,56,48,39,43,36,28,93,68,43,...,30.13,29.87,10,10,5,16,6,25,0,
2,2013-12-23,58,45,32,31,27,23,76,52,27,...,30.49,30.41,10,10,10,8,3,12,0,
3,2013-12-24,61,46,31,36,28,21,89,56,22,...,30.45,30.3,10,10,7,12,4,20,0,
4,2013-12-25,58,50,41,44,40,36,86,71,56,...,30.33,30.27,10,10,7,10,2,16,T,
5,2013-12-26,57,48,39,39,36,33,79,63,47,...,30.4,30.34,10,9,7,12,3,17,0,
6,2013-12-27,60,53,45,41,39,37,83,65,47,...,30.39,30.34,10,9,7,7,1,11,T,
7,2013-12-28,62,51,40,43,39,33,92,64,36,...,30.17,30.04,10,10,7,10,2,14,T,
8,2013-12-29,64,50,36,49,41,28,92,76,60,...,30.1,29.99,10,10,4,17,5,24,0,
9,2013-12-30,44,40,35,31,26,21,75,60,45,...,30.33,30.26,10,10,10,13,5,21,0,


In [3]:
df.shape

(1319, 21)

Drop the columns that are not needed for the model.

In [4]:
df.drop(['Events', 'Date', 'SeaLevelPressureHighInches',
         'SeaLevelPressureLowInches'], axis=1, inplace=True)
df.head(10)

Unnamed: 0,TempHighF,TempAvgF,TempLowF,DewPointHighF,DewPointAvgF,DewPointLowF,HumidityHighPercent,HumidityAvgPercent,HumidityLowPercent,SeaLevelPressureAvgInches,VisibilityHighMiles,VisibilityAvgMiles,VisibilityLowMiles,WindHighMPH,WindAvgMPH,WindGustMPH,PrecipitationSumInches
0,74,60,45,67,49,43,93,75,57,29.68,10,7,2,20,4,31,0.46
1,56,48,39,43,36,28,93,68,43,30.13,10,10,5,16,6,25,0
2,58,45,32,31,27,23,76,52,27,30.49,10,10,10,8,3,12,0
3,61,46,31,36,28,21,89,56,22,30.45,10,10,7,12,4,20,0
4,58,50,41,44,40,36,86,71,56,30.33,10,10,7,10,2,16,T
5,57,48,39,39,36,33,79,63,47,30.4,10,9,7,12,3,17,0
6,60,53,45,41,39,37,83,65,47,30.39,10,9,7,7,1,11,T
7,62,51,40,43,39,33,92,64,36,30.17,10,10,7,10,2,14,T
8,64,50,36,49,41,28,92,76,60,30.1,10,10,4,17,5,24,0
9,44,40,35,31,26,21,75,60,45,30.33,10,10,10,13,5,21,0


Some precipitation values are 'T', which denotes trace rainfall. We need to replace all occurrences of 'T' with 0 so that we can use the data in our model.

In [5]:
df.replace('T',0.0, inplace=True)
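Before replacing the trace markers it can be useful to know how many days they affect. The sketch below counts them on a small hypothetical frame (`toy` is made up for illustration), then replaces them and converts the column to floats. It replaces 'T' with the string "0" and calls `pd.to_numeric`, a small variation on the cell above that also makes the resulting dtype explicit.

```python
import pandas as pd

# Hypothetical sample of the precipitation column, stored as strings
toy = pd.DataFrame({"PrecipitationSumInches": ["0.46", "0", "T", "0", "T"]})

# Count the days recorded as trace rainfall
trace_days = int((toy["PrecipitationSumInches"] == "T").sum())

# Treat trace rainfall as zero, then parse the strings as numbers
precip = pd.to_numeric(toy["PrecipitationSumInches"].replace("T", "0"))

print(trace_days)
print(precip.tolist())
```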

The data also contains '-', which marks values that are not available. We need to replace these as well so that every value can be parsed as a number.

In [6]:
df.replace('-',0.0,inplace=True)
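Filling missing readings with 0 is simple, but a dew point or wind speed of 0 is itself a meaningful measurement. One possible alternative, sketched here on a hypothetical two-column frame, is to parse '-' as `NaN` and fill each gap with the column median. This is not what the notebook does, and the median is only one of several reasonable choices.

```python
import pandas as pd

# Hypothetical sample with '-' marking unavailable readings
toy = pd.DataFrame({
    "DewPointAvgF": ["49", "-", "27"],
    "WindGustMPH": ["31", "25", "-"],
})

# errors="coerce" turns anything that is not a number (here '-') into NaN
numeric = toy.apply(pd.to_numeric, errors="coerce")

# How many readings are missing in each column?
missing_per_column = numeric.isna().sum()

# Fill each gap with the median of its own column
filled = numeric.fillna(numeric.median())

print(missing_per_column.to_dict())
print(filled)
```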

We will use scikit-learn's linear regression model and train it on our dataset. Once the model is trained, we can supply our own values for the input columns, such as temperature, dew point and pressure, to predict the rainfall for those conditions.
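Supplying our own inputs to a trained model can be sketched as follows. The training frame, the target formula and the `new_day` values are all hypothetical and use only two of the dataset's column names to keep the example short. The point is that the new row must have the same columns as the training data.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training data with two feature columns
train = pd.DataFrame({
    "TempAvgF": [60, 48, 45, 46, 50, 70],
    "HumidityAvgPercent": [75, 68, 52, 56, 71, 90],
})
# Hypothetical target built from a known linear rule, so the result is checkable
target = 0.01 * train["HumidityAvgPercent"] - 0.005 * train["TempAvgF"]

model = LinearRegression().fit(train, target)

# Our own inputs: a single new day with the same columns as the training data
new_day = pd.DataFrame({"TempAvgF": [55], "HumidityAvgPercent": [80]})
predicted = model.predict(new_day)[0]

print(round(predicted, 3))
```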

In [8]:
from sklearn.linear_model import LinearRegression

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
x = df.drop('PrecipitationSumInches',axis=1)

In [11]:
y = df['PrecipitationSumInches']

In [12]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.3,random_state=1)
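To see what this split does, the sketch below applies the same `test_size` and `random_state` to placeholder arrays with 1319 rows, the row count reported by `df.shape`. The arrays `features` and `target` are stand-ins, not the real data. Fixing `random_state` makes the shuffle reproducible, so repeating the call gives the same rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 1319  # same number of rows as the dataset
features = np.arange(n_rows * 2).reshape(n_rows, 2)  # placeholder features
target = np.arange(n_rows)                           # placeholder target

x_tr, x_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.3, random_state=1)

# Same seed, same split
x_tr2, x_te2, _, _ = train_test_split(
    features, target, test_size=0.3, random_state=1)

print(x_tr.shape, x_te.shape)
print(np.array_equal(x_te, x_te2))
```

With 1319 rows this leaves 923 rows for training and 396 for testing.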

In [13]:
lr = LinearRegression()

In [14]:
lr.fit(x_train,y_train)

LinearRegression()

In [15]:
lr.score(x_train,y_train)

0.30544869828667065

In [16]:
lr.score(x_test,y_test)

0.26642909414007887

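The test score is too low. Note that `score` returns the coefficient of determination (R²) rather than an accuracy: the model explains only about 27% of the variance in rainfall on the test set, and about 31% on the training set.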
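The number returned by `score` for a regression model is R², defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks this on synthetic data. The data and coefficients are made up, and the noise is deliberately large so that R² comes out low, as it does above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: a weak linear signal buried in noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=1.0, size=200)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print(manual_r2, model.score(X, y), r2_score(y, pred))
```

All three printed values agree. An R² near 0 means the model does little better than always predicting the mean rainfall.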