# Exploratory Data Analysis

This notebook offers an example solution for the first milestone of the Rain Prediction project.



## Data 

We are using the [Rainfall prediction dataset from Kaggle](https://www.kaggle.com/jsphyg/weather-dataset-rattle-package), which contains daily weather observations from numerous Australian weather stations.


## Problem Statement
We aim to answer a simple yes/no question: will it rain tomorrow in Australia?

## Importing Libraries

In [34]:
##importing libraries
import numpy as np  #for algebraic operations on arrays
import pandas as pd  #for data exploration and manipulation


##plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')


## Loading the weather data into a DataFrame

1. Loading the CSV file using the pandas `read_csv()` function.
2. Checking the first few rows using `head()`.

In [35]:
data_path = './data/weatherAUS.csv'

##loading the dataset into a dataframe
df = pd.read_csv(data_path)

##preview the dataset
df.head()

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RISK_MM,RainTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,...,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,0.0,No
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,...,25.0,1010.6,1007.8,,,17.2,24.3,No,0.0,No
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,...,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,0.0,No
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,SE,...,16.0,1017.6,1012.8,,,18.1,26.5,No,1.0,No
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,ENE,...,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,0.2,No


## Exploratory Data Analysis

1. Checking the shape of the data at hand.
2. Looking at all the columns after transposing the dataframe using the `T` attribute.
3. Identifying the target variable.
4. Dropping the `RISK_MM` column, as the dataset description instructs (it records the next day's rainfall amount, so keeping it would leak the target).
5. Checking the dataframe information using the `info()` method to identify the numerical and categorical features.

In [36]:
##to check the dimensions of the dataset
df.shape

(142193, 24)

In [37]:
##transposing the data to get a good understanding
df.head().T

Unnamed: 0,0,1,2,3,4
Date,2008-12-01,2008-12-02,2008-12-03,2008-12-04,2008-12-05
Location,Albury,Albury,Albury,Albury,Albury
MinTemp,13.4,7.4,12.9,9.2,17.5
MaxTemp,22.9,25.1,25.7,28,32.3
Rainfall,0.6,0,0,0,1
Evaporation,,,,,
Sunshine,,,,,
WindGustDir,W,WNW,WSW,NE,W
WindGustSpeed,44,44,46,24,41
WindDir9am,W,NNW,W,SE,ENE


There are many features here, but the most important column is the last one, `RainTomorrow`. This is the target variable that our ML model will predict.

It takes two values:
* Yes - It will rain tomorrow.
* No - It will not rain tomorrow.
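
Before modelling a two-class target, it helps to check how balanced the classes are. The sketch below shows one way to do this with `value_counts()`. The `rain_tomorrow` series holds made-up values for illustration only; in the notebook, `df['RainTomorrow']` would take its place.

```python
import pandas as pd

# Hypothetical stand-in for df['RainTomorrow']; the values are invented for illustration.
rain_tomorrow = pd.Series(["No", "No", "Yes", "No", "Yes"], name="RainTomorrow")

# Absolute number of rows per class.
class_counts = rain_tomorrow.value_counts()

# Share of rows per class (fractions that sum to 1).
class_share = rain_tomorrow.value_counts(normalize=True)

print(class_counts)
print(class_share)
```

A strongly skewed share would suggest that plain accuracy is a weak metric for this target.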


In [38]:
##As per the description of the dataset, 
##we have to drop RISK_MM column

df.drop(['RISK_MM'], axis=1, inplace=True)
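
The following is a minimal sketch of why `RISK_MM` is dropped, under a stated assumption: that `RISK_MM` is the next day's rainfall in millimetres and that `RainTomorrow` is `Yes` when it exceeds 1 mm. The threshold is an assumption, not something this notebook states. The `sample` frame copies the `RISK_MM` and `RainTomorrow` values from the five preview rows shown earlier.

```python
import pandas as pd

# RISK_MM and RainTomorrow values copied from the five preview rows above.
sample = pd.DataFrame({
    "RISK_MM": [0.0, 0.0, 0.0, 1.0, 0.2],
    "RainTomorrow": ["No", "No", "No", "No", "No"],
})

# Assumption: the target is "Yes" when more than 1 mm of rain falls the next day.
derived = (sample["RISK_MM"] > 1).map({True: "Yes", False: "No"})

# Fraction of rows where the derived label matches the recorded target.
agreement = (derived == sample["RainTomorrow"]).mean()
print(agreement)

# Non-mutating alternative to inplace=True; errors="ignore" makes the cell safe to re-run.
sample = sample.drop(columns=["RISK_MM"], errors="ignore")
```

If the agreement on the full dataset is close to 1, the column effectively encodes the answer and must not be used as a feature.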

In [39]:
##checking data information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142193 entries, 0 to 142192
Data columns (total 23 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Date           142193 non-null  object 
 1   Location       142193 non-null  object 
 2   MinTemp        141556 non-null  float64
 3   MaxTemp        141871 non-null  float64
 4   Rainfall       140787 non-null  float64
 5   Evaporation    81350 non-null   float64
 6   Sunshine       74377 non-null   float64
 7   WindGustDir    132863 non-null  object 
 8   WindGustSpeed  132923 non-null  float64
 9   WindDir9am     132180 non-null  object 
 10  WindDir3pm     138415 non-null  object 
 11  WindSpeed9am   140845 non-null  float64
 12  WindSpeed3pm   139563 non-null  float64
 13  Humidity9am    140419 non-null  float64
 14  Humidity3pm    138583 non-null  float64
 15  Pressure9am    128179 non-null  float64
 16  Pressure3pm    128212 non-null  float64
 17  Cloud9am       88536 non-null

**Interpreting Data Information**
* We have 142193 rows; any column with fewer non-null values than that has missing values.
* We have 23 columns after dropping `RISK_MM`.
* Numerical features have data type `float64`.
* Categorical features have data type `object`.
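
The observations above can also be computed directly instead of being read off the `info()` output. This is a sketch on a small `sample` frame whose values are copied from the five preview rows. The names `missing_share`, `numerical_cols` and `categorical_cols` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# A few columns copied from the five preview rows shown earlier.
sample = pd.DataFrame({
    "Location": ["Albury"] * 5,
    "MinTemp": [13.4, 7.4, 12.9, 9.2, 17.5],
    "Evaporation": [np.nan] * 5,
    "Cloud9am": [8.0, np.nan, np.nan, np.nan, 7.0],
    "RainToday": ["No"] * 5,
})

# Fraction of missing values per column, largest first.
missing_share = sample.isna().mean().sort_values(ascending=False)

# Split the columns by dtype: numeric versus everything else.
numerical_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(missing_share)
print(numerical_cols)
print(categorical_cols)
```

Applied to the full `df`, the same calls would give the share of missing values in every column and the two feature lists used in the summaries that follow.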

## Summary statistics of numerical features 

In [None]:
##statistical summary of numerical variables
df.describe()

## Summary statistics of categorical features

In [41]:
##summary statistics for categorical columns
df.describe(include=['object'])


Unnamed: 0,Date,Location,WindGustDir,WindDir9am,WindDir3pm,RainToday,RainTomorrow
count,142193,142193,132863,132180,138415,140787,142193
unique,3436,49,16,16,16,2,2
top,2014-02-02,Canberra,W,N,SE,No,No
freq,49,3418,9780,11393,10663,109332,110316
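
The `Date` column appears among the categorical features because it is stored as text. One possible follow-up, sketched here as a suggestion rather than as a step of the original notebook, is to parse it with `pd.to_datetime()` so that calendar parts such as year and month become available. The `dates` series copies the first three dates from the preview.

```python
import pandas as pd

# First three dates from the preview, stored as strings as in the raw file.
dates = pd.Series(["2008-12-01", "2008-12-02", "2008-12-03"], name="Date")

# Parse the strings into datetime values; the format matches YYYY-MM-DD.
parsed = pd.to_datetime(dates, format="%Y-%m-%d")

# Calendar components that can be used for grouping or plotting.
years = parsed.dt.year
months = parsed.dt.month

print(parsed.dtype)
print(months.tolist())
```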
