# Introduction

> **Will it rain tomorrow?**

That is the question this project tries to answer, as a machine learning exercise.

## Context

A group of friends is about to start a vacation and will arrive at Melbourne Airport tomorrow. They rented a place near the airport and, to save money at the beginning of the trip, they want to walk there. However, if it is raining they will need some means of transportation, since they do not want to get wet. So they want to know whether it is going to rain tomorrow around Melbourne Airport, in order to decide whether to book a shuttle bus to the rented place.

---

### Project Objective

> **Find out whether it will rain tomorrow around the Melbourne Airport.**

---

## Process

This project will be divided into 3 parts:

1. **Initial Analysis of the Variables**
2. Baseline
3. Development of the Machine Learning Model


# Part 01 | Initial Analysis of the Variables

The objective of the first part of this project is to do an **initial data cleaning and manipulation** to get the dataset ready for the next part (Baseline).

---

## Specific Objective

> **Reduce the quantity of null values.**

How? Combine the data from each pair of columns, which store the same variable under two different names:
- `windgustdir` and `wind_gustdir`
- `windgustspeed` and `wind_gustspeed`

---

## Premises

- To develop the baseline, the variables that store values measured at a specific time of day (for example, at 9am or 3pm) will not be considered.
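
One way to apply this premise is to drop every column whose name carries a time-of-day suffix. The following is a minimal sketch, assuming that the `9am`/`3pm` naming seen in this dataset is what identifies those variables; the toy frame and the name `baseline_cols` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy frame with a few column names in the style of the dataset (no rows needed)
toy = pd.DataFrame(
    columns=["date", "location", "humidity9am", "humidity3pm", "rainfall", "raintomorrow"]
)

# Assumption: a column is "time-of-day" if its name contains 9am or 3pm
time_of_day = toy.columns.str.contains(r"9am|3pm")

# Keep only the columns that are not tied to a time of day
baseline_cols = toy.columns[~time_of_day].tolist()
print(baseline_cols)
```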


## Setup

### Import Libraries and Adjust Some Settings

In [1]:
# Import libraries
import pandas as pd
import numpy as np

In [2]:
# Change the maximum number of columns that are displayed
pd.set_option('display.max_columns', 50)

### Import Datasets

In [3]:
# Import the datasets
data = pd.read_csv('data/rain_data_aus.csv')
wind1 = pd.read_csv('data/wind_table_01.csv')
wind2 = pd.read_csv('data/wind_table_02.csv')
wind3 = pd.read_csv('data/wind_table_03.csv')
wind4 = pd.read_csv('data/wind_table_04.csv')
wind5 = pd.read_csv('data/wind_table_05.csv')
wind6 = pd.read_csv('data/wind_table_06.csv')
wind7 = pd.read_csv('data/wind_table_07.csv')
wind8 = pd.read_csv('data/wind_table_08.csv')

The dataset `data` has most of the data, while the variables related to the wind are split across 8 different datasets. However, the last two (`wind7` and `wind8`) seem to contain the same data.

In [4]:
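# Check if the datasets 'wind7' and 'wind8' are the same.
# NaN == NaN is False, so for each column the number of equal cells
# is compared with the number of non-null cells.
(wind7.shape[0] - wind7.isna().sum()) == np.sum(wind7 == wind8)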

date             True
location         True
windgustdir      True
windgustspeed    True
winddir9am       True
winddir3pm       True
windspeed9am     True
windspeed3pm     True
dtype: bool

> Therefore, **they are the same dataset**, so `wind8` will not be used.
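
The count-based comparison is needed because `NaN == NaN` evaluates to `False`. As an alternative sketch, `DataFrame.equals` treats missing values in the same position as equal. The toy frames below stand in for `wind7` and `wind8`, and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for wind7 and wind8 (hypothetical values)
a = pd.DataFrame({
    "windgustdir": ["NW", "SE", "E"],
    "windgustspeed": [30.0, np.nan, 26.0],
})
b = a.copy()

# Element-wise comparison: the NaN cell compares as unequal to itself
elementwise_all_equal = bool((a == b).all().all())

# DataFrame.equals considers NaNs in the same location to be equal
same = a.equals(b)

print(elementwise_all_equal)  # False
print(same)                   # True
```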

In [5]:
#Concatenate all dataframes related to 'wind'
wind = pd.concat(objs=[wind1, wind2, wind3, wind4, wind5, wind6, wind7]).reset_index(drop=True)

#Check the results
wind

Unnamed: 0,date,location,wind_gustdir,wind_gustspeed,wind_dir9am,wind_dir3pm,wind_speed9am,wind_speed3pm,windgustdir,windgustspeed,winddir9am,winddir3pm,windspeed9am,windspeed3pm
0,2007-11-01,Canberra,NW,30.0,SW,NW,6.0,20.0,,,,,,
1,2007-11-02,Canberra,ENE,39.0,E,W,4.0,17.0,,,,,,
2,2007-11-03,Canberra,NW,85.0,N,NNE,6.0,6.0,,,,,,
3,2007-11-04,Canberra,NW,54.0,WNW,W,30.0,24.0,,,,,,
4,2007-11-05,Canberra,SSE,50.0,SSE,ESE,20.0,28.0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
142188,2017-06-25,Perth,,,,,,,E,26.0,SE,SE,4.0,11.0
142189,2017-06-25,SalmonGums,,,,,,,SE,15.0,SSE,E,7.0,6.0
142190,2017-06-25,Walpole,,,,,,,SSW,20.0,WNW,SSW,6.0,6.0
142191,2017-06-25,Hobart,,,,,,,NW,50.0,NNW,NNW,17.0,28.0


In [6]:
# Merge dataframes 'data' and 'wind'
df = pd.merge(left=data, right=wind, on=['date', 'location'])

#Check the result
df

Unnamed: 0,date,location,mintemp,maxtemp,rainfall,evaporation,sunshine,humidity9am,humidity3pm,pressure9am,pressure3pm,cloud9am,cloud3pm,temp9am,temp3pm,raintoday,amountOfRain,raintomorrow,temp,humidity,precipitation3pm,precipitation9am,modelo_vigente,wind_gustdir,wind_gustspeed,wind_dir9am,wind_dir3pm,wind_speed9am,wind_speed3pm,windgustdir,windgustspeed,winddir9am,winddir3pm,windspeed9am,windspeed3pm
0,2008-12-01,Albury,13.4,22.9,0.6,,,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,0.0,No,29.48,28.400000,12,5.115360,0.089825,W,44.0,W,WNW,20.0,24.0,,,,,,
1,2008-12-02,Albury,7.4,25.1,0.0,,,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,0.0,No,32.12,2.208569,10,21.497100,0.023477,WNW,44.0,NNW,WSW,4.0,22.0,,,,,,
2,2008-12-03,Albury,12.9,25.7,0.0,,,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,0.0,No,32.84,38.000000,17,20.782859,0.027580,WSW,46.0,W,WSW,19.0,26.0,,,,,,
3,2008-12-04,Albury,9.2,28.0,0.0,,,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,1.0,No,35.60,21.200000,8,12.028646,0.023962,NE,24.0,SE,E,11.0,9.0,,,,,,
4,2008-12-05,Albury,17.5,32.3,1.0,,,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,0.2,No,40.76,41.600000,9,11.883546,0.220164,W,41.0,ENE,NW,7.0,20.0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
142188,2017-06-20,Uluru,3.5,21.8,0.0,,,59.0,27.0,1024.7,1021.2,,,9.4,20.9,No,0.0,No,28.16,34.400000,12,5.848681,0.002556,,,,,,,E,31.0,ESE,E,15.0,13.0
142189,2017-06-21,Uluru,2.8,23.4,0.0,,,51.0,24.0,1024.6,1020.3,,,10.1,22.4,No,0.0,No,30.08,30.800000,10,6.653879,0.002053,,,,,,,E,31.0,SE,ENE,13.0,11.0
142190,2017-06-22,Uluru,3.6,25.3,0.0,,,56.0,21.0,1023.5,1019.1,,,10.9,24.5,No,0.0,No,32.36,27.200000,9,19.715976,0.023350,,,,,,,NNW,22.0,SE,N,13.0,9.0
142191,2017-06-23,Uluru,5.4,26.9,0.0,,,53.0,24.0,1021.0,1016.8,,,12.5,26.1,No,0.0,No,34.28,30.800000,12,0.985551,0.007195,,,,,,,N,37.0,SE,WNW,9.0,9.0
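
`pd.merge` performs an inner join by default, so rows of `data` without a match in `wind` would be dropped silently. The sketch below uses made-up toy frames to show how the `how`, `validate` and `indicator` parameters of `pd.merge` can make such cases visible; it is not part of the original notebook.

```python
import pandas as pd

# Toy frames: the second date has wind data, the first does not (hypothetical values)
data = pd.DataFrame({
    "date": ["2017-06-24", "2017-06-25"],
    "location": ["Perth", "Perth"],
    "rainfall": [0.0, 1.2],
})
wind = pd.DataFrame({
    "date": ["2017-06-25"],
    "location": ["Perth"],
    "windgustspeed": [26.0],
})

merged = pd.merge(
    data,
    wind,
    on=["date", "location"],
    how="left",             # keep every row of 'data'
    validate="one_to_one",  # raise if a (date, location) key is duplicated
    indicator=True,         # add a '_merge' column describing the match
)

# Rows of 'data' that found no wind record
unmatched = int((merged["_merge"] == "left_only").sum())
print(len(merged), unmatched)  # 2 1
```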


## Analyse columns: `windgustdir` and `wind_gustdir`

In [7]:
# Number of nulls in each column
print('Quantity of nulls on wind_gustdir column:',df['wind_gustdir'].isna().sum())
print('Quantity of nulls on windgustdir column : ',df['windgustdir'].isna().sum())

Quantity of nulls on wind_gustdir column: 105774
Quantity of nulls on windgustdir column :  45749


In [8]:
#Create a new column and add wind_gustdir values into it
df["wind_gustdir_complete"] = df["wind_gustdir"]

In [9]:
# Check that the new column was added
df.columns

Index(['date', 'location', 'mintemp', 'maxtemp', 'rainfall', 'evaporation',
       'sunshine', 'humidity9am', 'humidity3pm', 'pressure9am', 'pressure3pm',
       'cloud9am', 'cloud3pm', 'temp9am', 'temp3pm', 'raintoday',
       'amountOfRain', 'raintomorrow', 'temp', 'humidity', 'precipitation3pm',
       'precipitation9am', 'modelo_vigente', 'wind_gustdir', 'wind_gustspeed',
       'wind_dir9am', 'wind_dir3pm', 'wind_speed9am', 'wind_speed3pm',
       'windgustdir', 'windgustspeed', 'winddir9am', 'winddir3pm',
       'windspeed9am', 'windspeed3pm', 'wind_gustdir_complete'],
      dtype='object')

In [10]:
# Replace null values:
# How? Copy the windgustdir values into wind_gustdir_complete (which already has the wind_gustdir values)
# Note: iloc is positional, so this relies on df having the default RangeIndex
indices = df.loc[df['wind_gustdir'].isna(),'wind_gustdir'].index
values = df.iloc[indices,df.columns.get_loc("windgustdir")] # rows: indices, column: windgustdir
df.loc[df['wind_gustdir'].isna(),'wind_gustdir_complete'] = values # fill the gaps in wind_gustdir_complete

In [11]:
# After combining the two columns, 9330 missing values remain
df['wind_gustdir_complete'].isna().sum()

9330

In [12]:
# Check that the number of missing values in the new column is correct
df[((df.wind_gustdir.isna()) & (df.windgustdir.isna()))].shape

(9330, 36)

> So, the column **wind_gustdir_complete** will be used for the wind gust direction analysis. <br>
The other two (`wind_gustdir` and `windgustdir`) can be ignored.
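
The same combination can be written in one step with `Series.fillna`, which takes the value from the second column wherever the first one is null. This is a minimal sketch on a made-up toy frame, offered as an alternative to the index-based steps above rather than as the notebook's own method.

```python
import numpy as np
import pandas as pd

# Toy frame: each row has the direction in at most one of the two columns
toy = pd.DataFrame({
    "wind_gustdir": ["NW", np.nan, np.nan],
    "windgustdir": [np.nan, "E", np.nan],
})

# Take wind_gustdir where available, otherwise fall back to windgustdir
toy["wind_gustdir_complete"] = toy["wind_gustdir"].fillna(toy["windgustdir"])

# Rows that are null in both source columns stay null
remaining = int(toy["wind_gustdir_complete"].isna().sum())
print(toy["wind_gustdir_complete"].tolist())  # ['NW', 'E', nan]
print(remaining)                              # 1
```

The assignment aligns on the index labels, so it does not depend on the frame having a default `RangeIndex`.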

## Analyse columns: `windgustspeed` and `wind_gustspeed`

In [13]:
print('Quantity of nulls on wind_gustspeed column:',df['wind_gustspeed'].isna().sum())
print('Quantity of nulls on windgustspeed column : ',df['windgustspeed'].isna().sum())

Quantity of nulls on wind_gustspeed column: 105767
Quantity of nulls on windgustspeed column :  45696


In [14]:
# Create a new column and add wind_gustspeed values into it
df["wind_gustspeed_complete"] = df["wind_gustspeed"]

In [15]:
# Check that the new column was added
df.columns

Index(['date', 'location', 'mintemp', 'maxtemp', 'rainfall', 'evaporation',
       'sunshine', 'humidity9am', 'humidity3pm', 'pressure9am', 'pressure3pm',
       'cloud9am', 'cloud3pm', 'temp9am', 'temp3pm', 'raintoday',
       'amountOfRain', 'raintomorrow', 'temp', 'humidity', 'precipitation3pm',
       'precipitation9am', 'modelo_vigente', 'wind_gustdir', 'wind_gustspeed',
       'wind_dir9am', 'wind_dir3pm', 'wind_speed9am', 'wind_speed3pm',
       'windgustdir', 'windgustspeed', 'winddir9am', 'winddir3pm',
       'windspeed9am', 'windspeed3pm', 'wind_gustdir_complete',
       'wind_gustspeed_complete'],
      dtype='object')

In [16]:
# Replace null values:
# How? Copy the windgustspeed values into wind_gustspeed_complete (which already has the wind_gustspeed values)
# Note: iloc is positional, so this relies on df having the default RangeIndex
indices = df.loc[df['wind_gustspeed'].isna(),'wind_gustspeed'].index
values = df.iloc[indices,df.columns.get_loc("windgustspeed")] # rows: indices, column: windgustspeed
df.loc[df['wind_gustspeed'].isna(),'wind_gustspeed_complete'] = values # fill the gaps in wind_gustspeed_complete

In [17]:
# After combining the two columns, 9270 missing values remain
df['wind_gustspeed_complete'].isna().sum()

9270

In [18]:
# Check that the number of missing values in the new column is correct
df[((df.wind_gustspeed.isna()) & (df.windgustspeed.isna()))].shape

(9270, 37)

> So, the column **wind_gustspeed_complete** will be used for the wind gust speed analysis. <br>
The other two (`wind_gustspeed` and `windgustspeed`) can be ignored.

## Analyse the relation between the created columns

In [19]:
df['wind_gustspeed_complete'].isna().sum()

9270

In [20]:
df['wind_gustdir_complete'].isna().sum()

9330

In [21]:
(df['wind_gustspeed_complete'].isna() == df['wind_gustdir_complete'].isna()).value_counts()

True     142133
False        60
dtype: int64

In [22]:
df[(df['wind_gustspeed_complete'].isna() != df['wind_gustdir_complete'].isna())]

Unnamed: 0,date,location,mintemp,maxtemp,rainfall,evaporation,sunshine,humidity9am,humidity3pm,pressure9am,pressure3pm,cloud9am,cloud3pm,temp9am,temp3pm,raintoday,amountOfRain,raintomorrow,temp,humidity,precipitation3pm,precipitation9am,modelo_vigente,wind_gustdir,wind_gustspeed,wind_dir9am,wind_dir3pm,wind_speed9am,wind_speed3pm,windgustdir,windgustspeed,winddir9am,winddir3pm,windspeed9am,windspeed3pm,wind_gustdir_complete,wind_gustspeed_complete
12280,2010-02-05,Moree,23.0,32.5,0.2,9.2,3.0,65.0,54.0,1009.0,1005.8,8.0,7.0,25.5,28.5,No,25.2,Yes,41.0,66.8,7,12.853595,0.657475,,43.0,NE,NE,39.0,15.0,,,,,,,,43.0
20884,2009-09-24,NorfolkIsland,17.9,20.9,0.0,5.2,4.4,83.0,77.0,1010.3,1008.4,3.0,8.0,20.1,16.7,No,3.4,Yes,27.08,94.4,5,6.013741,0.782613,,59.0,NNW,W,24.0,28.0,,,,,,,,59.0
26990,2010-03-29,Richmond,19.2,27.7,0.0,12.4,,80.0,59.0,1017.4,1016.3,,,21.8,25.3,No,10.2,Yes,35.24,72.8,11,18.412731,0.518979,,24.0,SW,WSW,9.0,11.0,,,,,,,,24.0
26991,2010-03-30,Richmond,19.3,20.7,10.2,1.8,,99.0,95.0,1019.8,1018.9,,,19.5,20.6,Yes,21.0,Yes,26.84,116.0,10,10.336788,0.967862,,28.0,,WSW,,7.0,,,,,,,,28.0
44544,2008-05-12,Canberra,9.4,19.2,0.0,2.2,7.7,73.0,47.0,1024.2,1020.3,7.0,1.0,12.1,18.8,No,0.0,No,25.04,58.4,6,16.361778,0.010009,,24.0,E,NNW,4.0,15.0,,,,,,,,24.0
45512,2011-01-06,Canberra,10.3,24.1,11.0,6.2,9.0,74.0,54.0,1011.2,1010.3,5.0,6.0,17.6,22.1,Yes,0.0,No,30.92,66.8,7,9.728456,0.14191,,33.0,ESE,NE,11.0,13.0,,,,,,,,33.0
57721,2011-10-16,Bendigo,3.1,18.0,0.0,,,59.0,26.0,1019.4,1022.5,1.0,3.0,11.4,16.1,No,0.0,No,23.6,33.2,10,11.929733,0.028123,,,,,,,,54.0,W,WSW,15.0,33.0,,54.0
86782,2014-06-11,Cairns,19.6,23.9,2.4,4.2,1.3,91.0,76.0,1012.9,1010.8,8.0,7.0,21.0,23.6,Yes,0.4,No,30.68,93.2,14,14.97775,0.406175,,,,,,,,54.0,SE,E,24.0,30.0,,54.0
99018,2014-11-01,MountGambier,8.9,14.4,3.8,10.2,10.4,52.0,48.0,1008.4,1013.6,5.0,6.0,12.5,13.6,Yes,1.2,Yes,19.28,59.6,5,12.340105,0.626378,,,,,,,,83.0,,,37.0,44.0,,83.0
102328,2015-09-11,Nuriootpa,5.7,21.5,0.0,3.3,11.1,66.0,41.0,1026.8,1024.1,2.0,,15.3,21.1,No,0.0,No,27.8,51.2,10,5.993715,0.004022,,,,,,,,37.0,,,,,,37.0


> **In almost all rows, `wind_gustspeed_complete` and `wind_gustdir_complete` are either both present or both missing.** Where both are missing, there might have been a measurement problem on those days, or no wind was detected. However, **in 60 rows there is a value for `wind_gustspeed_complete` but not for `wind_gustdir_complete`** (9330 - 9270 = 60, the same as the number of mismatches), which indicates that there was wind on those days even though its direction was not recorded.
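
A cross-tabulation of the two missingness masks shows directly which column is missing when they disagree. The sketch below uses a made-up toy frame; the labels `dir_missing` and `speed_missing` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy frame: one row has a speed but no direction, one row has neither
toy = pd.DataFrame({
    "wind_gustdir_complete": ["NW", np.nan, np.nan, "E"],
    "wind_gustspeed_complete": [30.0, 43.0, np.nan, 26.0],
})

# Rows: is the direction missing? Columns: is the speed missing?
table = pd.crosstab(
    toy["wind_gustdir_complete"].isna(),
    toy["wind_gustspeed_complete"].isna(),
    rownames=["dir_missing"],
    colnames=["speed_missing"],
)
print(table)
```

The cell at `table.loc[True, False]` counts the rows where the direction is missing but the speed is present, and `table.loc[False, True]` counts the opposite case.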

In [23]:
# Check the minimum values for the wind speed
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
mintemp,141556.0,12.1864,6.403283,-8.5,7.6,12.0,16.8,33.9
maxtemp,141871.0,23.226784,7.117618,-4.8,17.9,22.6,28.2,48.1
rainfall,140787.0,2.349974,8.465173,0.0,0.0,0.0,0.8,371.0
evaporation,81350.0,5.469824,4.188537,0.0,2.6,4.8,7.4,145.0
sunshine,74377.0,7.624853,3.781525,0.0,4.9,8.5,10.6,14.5
humidity9am,140419.0,68.84381,19.051293,0.0,57.0,70.0,83.0,100.0
humidity3pm,138583.0,51.482606,20.797772,0.0,37.0,52.0,66.0,100.0
pressure9am,128179.0,1017.653758,7.105476,980.5,1012.9,1017.6,1022.4,1041.0
pressure3pm,128212.0,1015.258204,7.036677,977.1,1010.4,1015.2,1020.0,1039.6
cloud9am,88536.0,4.437189,2.887016,0.0,1.0,5.0,7.0,9.0


> There is no record with a wind speed or direction equal to 0. So, **the missing values probably mean that no wind was detected**.
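
The absence of zero speeds can also be checked on the column itself instead of reading it off the full `describe()` table. This is a minimal sketch on made-up toy values; `min_speed` and `n_zero` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value and no zeros (hypothetical values)
toy = pd.DataFrame({"wind_gustspeed_complete": [30.0, np.nan, 6.0, 54.0]})
speed = toy["wind_gustspeed_complete"]

# min() skips NaN by default
min_speed = speed.min()

# Number of rows where a speed of exactly 0 was recorded
n_zero = int((speed == 0).sum())

print(min_speed, n_zero)  # 6.0 0
```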

## Export dataset

In [24]:
df.to_csv('exported_df/complete_dataset.csv', index=False)
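
`to_csv` does not create missing directories, so the export fails if the output folder is absent. The sketch below creates the folder first; it writes a made-up toy frame under a temporary base directory, and the file name `complete_dataset_example.csv` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for df (hypothetical values)
toy = pd.DataFrame({"date": ["2017-06-25"], "location": ["Perth"]})

# A temporary base directory keeps the example free of side effects
base = Path(tempfile.mkdtemp())
out_dir = base / "exported_df"

# Create the output folder if it does not exist yet
out_dir.mkdir(parents=True, exist_ok=True)

out_path = out_dir / "complete_dataset_example.csv"
toy.to_csv(out_path, index=False)
print(out_path.exists())  # True
```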