<p style="text-align:center"> 
    <a href="https://www.linkedin.com/in/flavio-aguirre-12784a252/" target="_blank"> 
    <img src="../../assets/logo.png" width="200" alt="Flavio Aguirre Logo"> 
    </a>
</p>

# <h1 align="center"><font size="7"><strong>Weather Wise</strong></font></h1>
<hr>

## Preprocessing and Modeling

Now that the data includes the new features and we have established that the classes are unbalanced, we split it into training and test sets, stratifying on the target so that both sets preserve the original class proportions.

In [1]:
import pandas as pd
import joblib

from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier


import warnings
warnings.filterwarnings("ignore")

We load the engineered dataset, which includes the new features.

In [35]:
df = pd.read_csv('../../data/processed/weatherAUS-data-engineered.csv')
print('load success')
df.head()

load success


| Unnamed: 0 | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | RainToday | Season | TempDiff | TempChange | PressureDiff | HumidityDiff | WindSpeedDiff | AvgHumidity | AvgTemp | RainfallPerSunshine |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MelbourneAirport | 11.2 | 19.9 | 0.0 | 5.6 | 8.8 | SW | 69.0 | W | SW | ... | Yes | Summer | 8.7 | 2.2 | 1.3 | -18.0 | 10.0 | 46.0 | 17.0 | 0.0 |
| 1 | MelbourneAirport | 7.8 | 17.8 | 1.2 | 7.2 | 12.9 | SSE | 56.0 | SW | SSE | ... | No | Summer | 10.0 | 3.3 | 1.3 | -7.0 | -5.0 | 46.5 | 14.15 | 0.092308 |
| 2 | MelbourneAirport | 6.3 | 21.1 | 0.0 | 6.2 | 10.5 | SSE | 31.0 | E | S | ... | No | Summer | 14.8 | 6.2 | -3.2 | -16.0 | 6.0 | 43.0 | 16.5 | 0.0 |
| 3 | MelbourneAirport | 8.1 | 29.2 | 0.0 | 6.4 | 12.5 | SSE | 35.0 | NE | SSE | ... | No | Summer | 21.1 | 12.2 | -3.4 | -44.0 | 18.0 | 45.0 | 22.1 | 0.0 |
| 4 | MelbourneAirport | 9.7 | 29.0 | 0.0 | 7.4 | 12.3 | SE | 33.0 | SW | SSE | ... | No | Summer | 19.3 | 7.7 | -1.6 | -20.0 | 11.0 | 41.0 | 23.25 | 0.0 |


In [36]:
# Define the features and target variable
X = df.drop(columns=['RainToday'])
y = df['RainToday']

### We split the dataset into the training set and the test set.

In [37]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size = 0.2,
    stratify = y,
    random_state = 42
)
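To see what `stratify` does, here is a minimal sketch on a small synthetic target (made up for illustration, not the weather data). It checks that the training and test splits keep the same share of the positive class as the full set. The names `X_demo` and `y_demo` are assumptions introduced only for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, illustrative data: 100 rows, 20% in the "Yes" class
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array(["Yes"] * 20 + ["No"] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Share of the minority class in each split
train_rate = float(np.mean(y_tr == "Yes"))
test_rate = float(np.mean(y_te == "Yes"))
print(train_rate, test_rate)
```

Without `stratify`, the minority share in a small test set can drift away from the share in the full data, which makes scores harder to compare.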

### Define preprocessing transformers for numeric and categorical features
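We can detect numeric and categorical columns automatically from their dtypes and collect them in two separate lists. Note that only float columns are selected as numeric, so integer columns such as `Unnamed: 0` end up in neither list.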

In [38]:
numeric_features = X_train.select_dtypes(include=['float']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()

In [39]:
print(f"Numeric features: {numeric_features}\n")
print(f"Categorical features: {categorical_features}")

Numeric features: ['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am', 'Temp3pm', 'TempDiff', 'TempChange', 'PressureDiff', 'HumidityDiff', 'WindSpeedDiff', 'AvgHumidity', 'AvgTemp', 'RainfallPerSunshine']

Categorical features: ['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainYesterday', 'Season']
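The dtype-based selection can be sketched on a toy frame. The column `row_id` and the values below are invented for illustration; the string column is created with an explicit `object` dtype so the example behaves the same across pandas versions.

```python
import pandas as pd

# Toy frame (illustrative only): one integer, one float, one string column
toy = pd.DataFrame({
    "row_id": pd.Series([0, 1, 2], dtype="int64"),
    "MinTemp": pd.Series([11.2, 7.8, 6.3], dtype="float64"),
    "Season": pd.Series(["Summer", "Summer", "Winter"], dtype="object"),
})

# Same selectors as in the notebook
float_cols = toy.select_dtypes(include=["float"]).columns.tolist()
object_cols = toy.select_dtypes(include=["object", "category"]).columns.tolist()

# "number" would also pick up integer columns
number_cols = toy.select_dtypes(include=["number"]).columns.tolist()

print(float_cols, object_cols, number_cols)
```

Selecting on `float` rather than `number` is a convenient way to leave an integer row identifier out of the feature lists, but it would also skip any genuine integer-valued feature.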


Let's define a separate transformer for each type of feature.

In [40]:
# Scale numeric features
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])

# One-hot encode categorical features
categorical_transformer = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])

We now combine the transformers into a single preprocessing column transformer.

In [42]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)
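As a minimal sketch of how the combined transformer behaves, the example below fits a similar `ColumnTransformer` on a tiny invented frame and then transforms a row whose category was never seen during fitting. The column names `temp` and `wind` are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny invented training frame and one new row with an unseen category
train = pd.DataFrame({"temp": [1.0, 2.0, 3.0], "wind": ["N", "S", "N"]})
new = pd.DataFrame({"temp": [2.0], "wind": ["E"]})  # "E" never seen in training

demo_pre = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["temp"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["wind"]),
])
demo_pre.fit(train)

out = demo_pre.transform(new)
if hasattr(out, "toarray"):  # the result may be a sparse matrix
    out = out.toarray()

# One scaled numeric column followed by one column per known category
row = out[0].tolist()
print(row)
```

With `handle_unknown='ignore'`, the unseen category is encoded as all zeros instead of raising an error. Columns that are not listed in any transformer are dropped, because `remainder` defaults to `'drop'`.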

We create a pipeline that combines preprocessing with a Random Forest classifier.

<details>
    <strong>
        A random forest classification model was chosen because it is a good fit for the technical requirements of the problem to be solved. This does not mean we cannot experiment with other models.
    </strong>
</details>


In [43]:
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

### We define a parameter grid for the cross-validated grid search

In [44]:
param_grid = {
    'classifier__n_estimators': [50, 100],
    'classifier__max_depth': [None, 10, 20],
    'classifier__min_samples_split': [2, 5]
}
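To get a feel for the cost of this grid, the following sketch enumerates its combinations with the standard library. The fold count of 5 is an assumption here, chosen to match the five-fold cross-validation used in this notebook.

```python
from itertools import product

grid = {
    "classifier__n_estimators": [50, 100],
    "classifier__max_depth": [None, 10, 20],
    "classifier__min_samples_split": [2, 5],
}
n_folds = 5  # assumed number of cross-validation folds

# Every combination of one value per hyperparameter
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
n_candidates = len(candidates)
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)  # 12 60
```

The `classifier__` prefix routes each parameter to the pipeline step named `classifier`. By default, `GridSearchCV` also refits the best candidate once on the full training set after the search.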

We perform grid search cross-validation to find the hyperparameters that score best across the validation folds. To do this, we select a cross-validation method that preserves the class proportions in every fold.

In [45]:
cv = StratifiedKFold(n_splits=5, shuffle=True)
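The sketch below illustrates stratified folds on a synthetic target (invented for illustration). It also passes `random_state`, which is an addition not present in the cell above: with `shuffle=True` and no seed, the fold assignment changes between runs, so scores may vary slightly.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic, illustrative target: 20 positives out of 100 rows
X_demo = np.zeros((100, 1))
y_demo = np.array([1] * 20 + [0] * 80)

# random_state makes the shuffled fold assignment repeatable
demo_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_sizes = []
positives_per_fold = []
for _, val_idx in demo_cv.split(X_demo, y_demo):
    fold_sizes.append(len(val_idx))
    positives_per_fold.append(int(y_demo[val_idx].sum()))

print(fold_sizes, positives_per_fold)
```

Each validation fold receives the same number of positives, so every fold sees the minority class at its original rate.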

We instantiate GridSearchCV with the pipeline and fit it to the training data.

In [None]:
model = GridSearchCV(pipeline, param_grid, cv=cv, scoring='accuracy', verbose=2)  
model.fit(X_train, y_train)

### We observe the best parameters and the best cross-validation score

In [47]:
print(f"Best parameters found: {model.best_params_}\n")
print(f"Best cross-validation score: {model.best_score_:.2f}")

Best parameters found: {'classifier__max_depth': None, 'classifier__min_samples_split': 2, 'classifier__n_estimators': 100}

Best cross-validation score: 0.86
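Besides `best_params_` and `best_score_`, a fitted search exposes the refitted pipeline and the per-candidate results. The sketch below shows this on synthetic data with a deliberately tiny grid; the names `demo_pipeline`, `demo_grid` and `search` are assumptions made for the example, not objects from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only
X_demo, y_demo = make_classification(n_samples=120, n_features=5, random_state=0)

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])
demo_grid = {"classifier__n_estimators": [5, 10]}
demo_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

search = GridSearchCV(demo_pipeline, demo_grid, cv=demo_cv, scoring="accuracy")
search.fit(X_demo, y_demo)

# One entry per candidate in the grid
n_candidates = len(search.cv_results_["params"])

# The best pipeline, refitted on all of the data passed to fit()
best_clf = search.best_estimator_.named_steps["classifier"]
print(search.best_params_, round(search.best_score_, 2))
```

`cv_results_` can be wrapped in a `pandas.DataFrame` to compare the mean validation score of every candidate, not only the winner.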


### Now let's look at the score of our model on the test set

In [48]:
test_score = model.score(X_test, y_test)  
print(f"Test set score: {test_score:.2f}")

Test set score: 0.85


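So we have a reasonably accurate classifier: on the held-out test set it correctly predicts whether it will rain today in the Melbourne area approximately 85% of the time. Because the classes are unbalanced, accuracy alone can hide weak performance on the minority class.
Let's analyze the results in more detail.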
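One way to put an accuracy figure in context is to compare it with a majority-class baseline and to look at recall on the minority class. The labels and predictions below are invented purely for illustration; they are not results from this model.

```python
from collections import Counter

# Hypothetical labels and predictions, invented for illustration only
y_true = ["No"] * 8 + ["Yes"] * 2
y_pred = ["No"] * 8 + ["No", "Yes"]

# Overall accuracy
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Accuracy of always predicting the most frequent class
majority_label, majority_count = Counter(y_true).most_common(1)[0]
baseline_accuracy = majority_count / len(y_true)

# Recall on the minority class: share of actual "Yes" rows that were found
yes_total = sum(t == "Yes" for t in y_true)
yes_hits = sum(t == "Yes" and p == "Yes" for t, p in zip(y_true, y_pred))
recall_yes = yes_hits / yes_total

print(accuracy, baseline_accuracy, recall_yes)  # 0.9 0.8 0.5
```

In this toy case the accuracy looks high while half of the rainy days are missed. On the real test split, `sklearn.metrics.classification_report(y_test, model.predict(X_test))` reports precision and recall per class.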

We save the fitted grid search object, which wraps the best model, for evaluation and final conclusions.

In [49]:
joblib.dump(model, '../../models/model_randomforest_precipicheck.pkl')
print("Saved model")

Saved model
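A saved model can be restored later with `joblib.load`. The round trip below is a minimal sketch on synthetic data, writing to a temporary directory; the file name `demo_model.pkl` and the small classifier are assumptions made for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    joblib.dump(clf, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
same_predictions = bool(np.array_equal(clf.predict(X_demo), restored.predict(X_demo)))
print(same_predictions)
```

Pickled models are generally only safe to load with the same scikit-learn version that saved them, and only from sources you trust.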


<hr>

## Author

<a href="https://www.linkedin.com/in/flavio-aguirre-12784a252/">**Flavio Aguirre**</a>
<br>
<a href="https://coursera.org/share/e27ae5af81b56f99a2aa85289b7cdd04">***Data Scientist***</a>