<p style="text-align:center"> 
    <a href="https://www.linkedin.com/in/flavio-aguirre-12784a252/" target="_blank"> 
    <img src="../../assets/logo.png" width="200" alt="Flavio Aguirre Logo"> 
    </a>
</p>

<h1 align="center"><font size="7"><strong>Weather Wise</strong></font></h1>
<hr>

## Data Wrangling

In [16]:
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

As we saw, ``Sunshine`` and ``Evaporation`` seem like important features, but they have too many missing values to impute reliably.
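One way to quantify "too many" is to compute the share of missing values per column. The sketch below uses a small made-up frame (the `toy` data is purely illustrative, not taken from the weather file) to show the idea; the same expression could be applied to the real DataFrame once it is loaded.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'Sunshine': [np.nan, np.nan, 8.8, np.nan],
    'Evaporation': [np.nan, 5.6, np.nan, 7.2],
    'MinTemp': [13.4, 7.4, 12.9, 9.2],
})

# isna() gives a boolean frame; the column mean is the fraction of missing values
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Columns at the top of this ranking are the ones where imputation would invent most of the values.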

In [17]:
# load the dataset
df = pd.read_csv('../../data/raw/weatherAUS-data.csv')

In [23]:
df.head()

,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,...,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,...,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,...,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,SE,...,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,No
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,ENE,...,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No


### Delete all rows with missing values
For simplicity, we'll delete the rows with missing values and see what's left.

In [24]:
df = df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 56420 entries, 6049 to 142302
Data columns (total 23 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           56420 non-null  object 
 1   Location       56420 non-null  object 
 2   MinTemp        56420 non-null  float64
 3   MaxTemp        56420 non-null  float64
 4   Rainfall       56420 non-null  float64
 5   Evaporation    56420 non-null  float64
 6   Sunshine       56420 non-null  float64
 7   WindGustDir    56420 non-null  object 
 8   WindGustSpeed  56420 non-null  float64
 9   WindDir9am     56420 non-null  object 
 10  WindDir3pm     56420 non-null  object 
 11  WindSpeed9am   56420 non-null  float64
 12  WindSpeed3pm   56420 non-null  float64
 13  Humidity9am    56420 non-null  float64
 14  Humidity3pm    56420 non-null  float64
 15  Pressure9am    56420 non-null  float64
 16  Pressure3pm    56420 non-null  float64
 17  Cloud9am       56420 non-null  float64
 18  Cloud3p

Since 56,420 observations remain after dropping the rows with missing values, we may not need to impute anything.
Let's see how far this subset takes us.

In [25]:
df.columns

Index(['Date', 'Location', 'MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation',
       'Sunshine', 'WindGustDir', 'WindGustSpeed', 'WindDir9am', 'WindDir3pm',
       'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm',
       'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am',
       'Temp3pm', 'RainToday', 'RainTomorrow'],
      dtype='object')

## Data Leak Considerations
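Consider the descriptions of the dataset columns above. Are there any practical limitations to predicting whether it will rain tomorrow with the available data? One likely issue is that features such as ``Rainfall``, ``Evaporation``, and ``Sunshine`` summarize the whole day, so they would not be fully known at the moment a forecast for tomorrow has to be made.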

If we adjust our approach and seek to predict today's rainfall using historical weather data up to and including yesterday, we can legitimately use all available features. This change would be especially useful for practical applications, such as deciding whether to bike to work today.

With this new goal, we should update the names of the rainfall columns to avoid confusion.

In [26]:
df = df.rename(columns={'RainToday': 'RainYesterday',
                        'RainTomorrow': 'RainToday'
                        })

## Data Granularity
We need to ask ourselves: Would weather patterns have the same predictability in very different locations in Australia? I think not.
The probability of rain in one location may be much higher than in another.
Using all locations requires a more complex model, as it must adapt to local weather patterns.
Let's look at how many observations we have for each location and see if we can focus on a smaller region.
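A minimal sketch of such a count, using `value_counts` on a made-up `Location` column (the `toy` frame below is illustrative and does not reflect the real counts):

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'Location': ['Melbourne', 'Watsonia', 'Melbourne',
                 'Albury', 'Melbourne', 'Watsonia'],
})

# Number of observations per location, largest first
counts = toy['Location'].value_counts()
print(counts)
```

On the real data, the same call on `df['Location']` would show which locations kept enough complete rows to be worth modelling.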

## Location Selection
One option is to group the cities in the ``"Location"`` column by distance (we used Folium to map them for this step).
We found that Watsonia is only 15 km from Melbourne and Melbourne Airport is only 18 km from Melbourne.
Let's group these three locations together and use only their weather data to build our localized prediction model.
Since there could still be slight variations in weather patterns, we'll keep ``"Location"`` as a categorical variable.
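The distances can also be estimated without a map by applying the haversine formula to latitude/longitude pairs. This is only a sketch: the helper `haversine_km` and the coordinates below are assumptions added for illustration (rough public coordinates, not the actual weather-station positions), so the results may differ somewhat from the figures quoted above.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# Approximate coordinates, assumed for illustration only
stations = {
    'Melbourne': (-37.8136, 144.9631),
    'MelbourneAirport': (-37.6690, 144.8410),
    'Watsonia': (-37.7110, 145.0830),
}

# Distance of every other location from Melbourne
distances = {
    name: haversine_km(*stations['Melbourne'], *coords)
    for name, coords in stations.items()
    if name != 'Melbourne'
}

for name, km in distances.items():
    print(f"Melbourne -> {name}: {km:.1f} km")
```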

In [27]:
df = df[df['Location'].isin(['Melbourne', 'MelbourneAirport', 'Watsonia'])].copy()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7557 entries, 64191 to 80997
Data columns (total 23 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           7557 non-null   object 
 1   Location       7557 non-null   object 
 2   MinTemp        7557 non-null   float64
 3   MaxTemp        7557 non-null   float64
 4   Rainfall       7557 non-null   float64
 5   Evaporation    7557 non-null   float64
 6   Sunshine       7557 non-null   float64
 7   WindGustDir    7557 non-null   object 
 8   WindGustSpeed  7557 non-null   float64
 9   WindDir9am     7557 non-null   object 
 10  WindDir3pm     7557 non-null   object 
 11  WindSpeed9am   7557 non-null   float64
 12  WindSpeed3pm   7557 non-null   float64
 13  Humidity9am    7557 non-null   float64
 14  Humidity3pm    7557 non-null   float64
 15  Pressure9am    7557 non-null   float64
 16  Pressure3pm    7557 non-null   float64
 17  Cloud9am       7557 non-null   float64
 18  Cloud3pm

We still have 7,557 records, which should be enough to build a reasonably good model.

More data can always be collected if needed.

## Extracting a Seasonality Feature

Now let's consider the ``"Date"`` column. We expect weather patterns to be seasonal, with different levels of predictability in winter and summer, for example.

There may also be some variation with ``"Year"``, but we'll ignore that for now.

We'll engineer a ``"Season"`` feature from ``"Date"`` and then drop ``"Date"``, as the exact date is likely less informative than the season.

A simple way to do this is to define a function that assigns seasons to given months and then use that function to transform the ``"Date"`` column.

### Create a function to assign dates to seasons

In [28]:
def date_to_season(date):
    """Map a date to its Southern Hemisphere meteorological season."""
    month = date.month
    if month in (12, 1, 2):
        return 'Summer'
    elif month in (3, 4, 5):
        return 'Autumn'
    elif month in (6, 7, 8):
        return 'Winter'
    elif month in (9, 10, 11):
        return 'Spring'
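As an alternative to a row-wise function, the same mapping can be expressed as a month-to-season lookup and applied to the whole column at once with `Series.dt.month` and `Series.map`. The dictionary name `MONTH_TO_SEASON` and the `sample` frame are made up for this sketch.

```python
import pandas as pd

# Hypothetical lookup table: month number -> Southern Hemisphere season
MONTH_TO_SEASON = {
    12: 'Summer', 1: 'Summer', 2: 'Summer',
    3: 'Autumn', 4: 'Autumn', 5: 'Autumn',
    6: 'Winter', 7: 'Winter', 8: 'Winter',
    9: 'Spring', 10: 'Spring', 11: 'Spring',
}

# Toy dates, invented for illustration only
sample = pd.DataFrame({
    'Date': ['2008-12-01', '2009-04-15', '2009-07-30', '2009-10-05'],
})
sample['Date'] = pd.to_datetime(sample['Date'])

# Extract the month of every date, then translate it through the lookup table
sample['Season'] = sample['Date'].dt.month.map(MONTH_TO_SEASON)
print(sample)
```

Both approaches should give the same labels; the lookup version avoids calling a Python function once per row.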

In [29]:
# Convert the "Date" column to datetime format
df['Date'] = pd.to_datetime(df['Date'])

# Apply the function to the "Date" column
df['Season'] = df['Date'].apply(date_to_season)

# Drop "Date" now that the season has been extracted
df = df.drop(columns='Date')
df

,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,...,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainYesterday,RainToday,Season
64191,MelbourneAirport,11.2,19.9,0.0,5.6,8.8,SW,69.0,W,SW,...,37.0,1005.1,1006.4,7.0,7.0,15.9,18.1,No,Yes,Summer
64192,MelbourneAirport,7.8,17.8,1.2,7.2,12.9,SSE,56.0,SW,SSE,...,43.0,1018.0,1019.3,6.0,7.0,12.5,15.8,Yes,No,Summer
64193,MelbourneAirport,6.3,21.1,0.0,6.2,10.5,SSE,31.0,E,S,...,35.0,1020.8,1017.6,1.0,7.0,13.4,19.6,No,No,Summer
64194,MelbourneAirport,8.1,29.2,0.0,6.4,12.5,SSE,35.0,NE,SSE,...,23.0,1016.2,1012.8,5.0,4.0,16.0,28.2,No,No,Summer
64195,MelbourneAirport,9.7,29.0,0.0,7.4,12.3,SE,33.0,SW,SSE,...,31.0,1011.9,1010.3,6.0,2.0,19.4,27.1,No,No,Summer
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80992,Watsonia,3.6,14.5,0.0,2.4,8.8,NNE,41.0,ENE,NNE,...,66.0,1028.4,1025.0,1.0,7.0,5.2,13.8,No,No,Winter
80994,Watsonia,4.8,13.3,0.4,0.6,0.0,NNW,24.0,NE,NNE,...,63.0,1028.5,1025.1,7.0,7.0,5.6,12.4,No,No,Winter
80995,Watsonia,5.6,13.1,0.0,1.6,6.0,NNW,52.0,NE,N,...,67.0,1019.0,1014.0,1.0,7.0,8.8,11.6,No,Yes,Winter
80996,Watsonia,6.9,12.1,3.2,1.8,5.6,SSW,24.0,WNW,SW,...,61.0,1018.7,1017.3,2.0,7.0,7.9,11.0,Yes,No,Winter


In [30]:
df.describe()

,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm
count,7557.0,7557.0,7557.0,7557.0,7557.0,7557.0,7557.0,7557.0,7557.0,7557.0,7557.0,7557.0,7557.0,7557.0,7557.0,7557.0
mean,10.471589,20.698743,1.705836,4.666905,6.431878,43.741829,16.551145,20.133651,71.933704,52.193992,1017.83336,1015.981051,5.233558,5.270081,14.141352,19.174606
std,4.480357,6.525832,4.99321,3.321487,3.894928,15.606706,10.82158,9.472907,16.612418,17.635123,7.730309,7.544022,2.522287,2.342999,4.979955,6.32023
min,-2.1,8.4,0.0,0.0,0.0,9.0,2.0,2.0,11.0,6.0,988.9,988.2,0.0,0.0,-0.6,6.2
25%,7.3,15.6,0.0,2.2,3.2,31.0,9.0,13.0,62.0,41.0,1012.8,1011.0,3.0,4.0,10.6,14.3
50%,10.1,19.4,0.0,4.0,6.6,41.0,13.0,19.0,72.0,51.0,1018.0,1016.4,7.0,6.0,13.7,18.0
75%,13.6,24.6,1.0,6.4,9.6,54.0,22.0,26.0,84.0,63.0,1023.0,1021.1,7.0,7.0,17.0,22.9
max,30.5,46.8,84.0,23.8,13.9,122.0,67.0,76.0,100.0,100.0,1039.3,1036.0,8.0,8.0,36.4,46.1


It looks like we have a good set of features to work with.

Let's continue building our model.

But wait, let's first check how well balanced our target, ``RainToday``, is.
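A quick way to check balance is to look at the class counts and their relative shares. The sketch below uses a made-up `RainToday` column (the `toy` values are illustrative and say nothing about the real class ratio):

```python
import pandas as pd

# Toy target column, invented for illustration only
toy = pd.DataFrame({'RainToday': ['No', 'No', 'No', 'Yes']})

# Absolute counts per class
counts = toy['RainToday'].value_counts()

# Relative shares per class (sum to 1)
shares = toy['RainToday'].value_counts(normalize=True)

print(counts)
print(shares)
```

If one class dominates, plain accuracy becomes a weak yardstick, since always predicting the majority class would already score as high as that class's share.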

### Save the updated DataFrame

In [31]:
df.to_csv('../../data/processed/weatherAUS-data-clean.csv', index=False)
print("Data saved to ../../data/processed/weatherAUS-data-clean.csv")

Data saved to ../../data/processed/weatherAUS-data-clean.csv
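The `index=False` argument matters when the file is read back. The sketch below writes a made-up frame to in-memory buffers instead of disk (the `toy` data and buffer names are illustrative) and compares the columns that come back with and without the index.

```python
import io
import pandas as pd

# Toy data with a non-default index, invented for illustration only
toy = pd.DataFrame(
    {'MinTemp': [11.2, 3.6], 'Season': ['Summer', 'Winter']},
    index=[64191, 80992],
)

# Written without the index: the columns round-trip unchanged
buf = io.StringIO()
toy.to_csv(buf, index=False)
buf.seek(0)
reloaded = pd.read_csv(buf)

# Written with the default index=True: the unnamed index becomes an extra column
buf_idx = io.StringIO()
toy.to_csv(buf_idx)
buf_idx.seek(0)
reloaded_idx = pd.read_csv(buf_idx)

print(list(reloaded.columns))
print(list(reloaded_idx.columns))
```

Dropping the index on save keeps the processed file limited to the feature and target columns, at the cost of losing the original row labels.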


<hr>

## Author

<a href="https://www.linkedin.com/in/flavio-aguirre-12784a252/">**Flavio Aguirre**</a>
<br>
<a href="https://coursera.org/share/e27ae5af81b56f99a2aa85289b7cdd04">***Data Scientist***</a>