![tower_bridge](tower_bridge.jpeg)

As the climate changes, predicting the weather becomes ever more important for businesses. Because the weather depends on many factors, you will want to run many experiments to find the best prediction approach. In this project, you will run experiments with different regression models that predict the mean temperature, using a combination of `sklearn` and `MLflow`.

You will be working with data stored in `london_weather.csv`, which contains the following columns:
- **date** - recorded date of measurement - (**int**)
- **cloud_cover** - cloud cover measurement in oktas - (**float**)
- **sunshine** - sunshine measurement in hours (hrs) - (**float**)
- **global_radiation** - irradiance measurement in Watt per square meter (W/m2) - (**float**)
- **max_temp** - maximum temperature recorded in degrees Celsius (°C) - (**float**)
- **mean_temp** - mean temperature in degrees Celsius (°C) - (**float**)
- **min_temp** - minimum temperature recorded in degrees Celsius (°C) - (**float**)
- **precipitation** - precipitation measurement in millimeters (mm) - (**float**)
- **pressure** - pressure measurement in Pascals (Pa) - (**float**)
- **snow_depth** - snow depth measurement in centimeters (cm) - (**float**)

In [None]:
# Run this cell to import the modules you require
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import mlflow
import mlflow.sklearn

In [None]:
# Read in the data
weather = pd.read_csv("london_weather.csv")
weather.head()

In [None]:
# Determine the column names, data types, and number of non-null values
weather.info()

In [None]:
# Data cleaning
# Working with the date column
weather['date'] = pd.to_datetime(weather['date'], format='%Y%m%d')
weather.info()
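
The date conversion above relies on the `date` column holding integers in `YYYYMMDD` layout. As a minimal sketch of that step in isolation, the snippet below parses a few hypothetical sample values (not rows from `london_weather.csv`) and extracts the year and month. The `sample` frame and the explicit string cast are assumptions added for illustration.

```python
import pandas as pd

# Hypothetical sample values in the same YYYYMMDD integer layout as the `date` column
sample = pd.DataFrame({"date": [19790101, 19790102, 20201231]})

# Cast to string first so every value is matched against the explicit format
sample["date"] = pd.to_datetime(sample["date"].astype(str), format="%Y%m%d")

# The .dt accessor exposes the calendar components of each timestamp
years = sample["date"].dt.year.tolist()
months = sample["date"].dt.month.tolist()
print(years, months)
```

Passing an explicit `format` avoids ambiguous inference and makes malformed values fail loudly instead of being silently misread.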

In [None]:
# Extracting more date information
weather['year'] = weather['date'].dt.year 
weather['month'] = weather['date'].dt.month

In [None]:
# Exploratory data analysis
weather_numerical = ['cloud_cover', 'sunshine', 'global_radiation', 'max_temp', 'mean_temp', 'min_temp', 'precipitation', 'pressure', 'snow_depth',]
weather_per_month = weather.groupby(['year', 'month'], as_index=False)[weather_numerical].mean()

# Visualizing temperature
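# errorbar=None replaces the deprecated ci=None (seaborn >= 0.12); on older versions use ci=None
sns.lineplot(x='year', y='mean_temp', data=weather_per_month, errorbar=None)
plt.show()
# Drop the datetime 'date' column so the correlation matrix only covers numeric columns
sns.heatmap(weather.drop(columns='date').corr(), annot=True)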
plt.show()

In [None]:
# Feature selection
# Filter features
feature_selection = ['cloud_cover', 'sunshine', 'precipitation', 'pressure', 'global_radiation', 'month']
weather = weather.dropna(subset=['mean_temp'])
X = weather[feature_selection]
y = weather['mean_temp']

In [None]:
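# Preprocess data
# Split first, then fit the imputer and scaler on the training set only,
# so that no information from the test set leaks into preprocessing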
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
imputer = SimpleImputer(strategy='mean')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
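
The same impute, scale, and fit sequence can also be bundled into a single estimator. The sketch below is one possible alternative, not the approach this notebook uses: it builds a scikit-learn pipeline with `make_pipeline` and runs it on synthetic stand-in data. The names `X_demo`, `y_demo`, and `pipe` are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the London weather data): 200 rows, 3 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Inject roughly 5% missing values so the imputer has something to fill
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=1
)

# fit() learns the imputation means and scaling statistics from the training split only;
# predict() reuses those fitted statistics on the test split
pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)
preds = pipe.predict(X_te)
```

A pipeline keeps the preprocessing and the model together, so logging it would capture the fitted imputer and scaler alongside the regressor.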

In [None]:
# Machine learning training and evaluation
# Building a for loop to try different hyperparameters
for idx, depth in enumerate([1, 2, 10]):
    run_name = f"run_{idx}"

    with mlflow.start_run(run_name=run_name):
        
        # LinearRegression has no max_depth, so it is refit unchanged in every run as a baseline
        lr = LinearRegression().fit(X_train, y_train)
        dtr = DecisionTreeRegressor(random_state=42, max_depth=depth).fit(X_train, y_train)
        rfr = RandomForestRegressor(random_state=42, max_depth=depth).fit(X_train, y_train)
    
        # Logging and evaluating
        mlflow.sklearn.log_model(lr, 'Linear Regression')
        mlflow.sklearn.log_model(dtr, 'Decision Tree Regressor')
        mlflow.sklearn.log_model(rfr, 'Random Forest Regressor')

        # Compute RMSE as the square root of the MSE; the squared=False argument
        # was deprecated in scikit-learn 1.4 and later removed
        y_pred_lr = lr.predict(X_test)
        lr_rmse = np.sqrt(mean_squared_error(y_test, y_pred_lr))
        y_pred_dtr = dtr.predict(X_test)
        dtr_rmse = np.sqrt(mean_squared_error(y_test, y_pred_dtr))
        y_pred_rfr = rfr.predict(X_test)
        rfr_rmse = np.sqrt(mean_squared_error(y_test, y_pred_rfr))
        
        mlflow.log_param('max_depth', depth)
        mlflow.log_metric('rmse_lr', lr_rmse)
        mlflow.log_metric('rmse_dtr', dtr_rmse)
        mlflow.log_metric('rmse_rfr', rfr_rmse)
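
The metric logged for each model is the root mean squared error (RMSE). As a minimal sketch of what that number means, the helper below computes it with plain NumPy. The `rmse` function and the example values are hypothetical and only illustrate the formula.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the mean of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up temperatures in degrees Celsius; the residuals are -1, 0, and 2
example_rmse = rmse([10.0, 12.0, 14.0], [11.0, 12.0, 12.0])
print(example_rmse)
```

Because the residuals are squared before averaging, RMSE is expressed in the same unit as the target (here °C) and penalizes large errors more heavily than small ones.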

In [None]:
# Search the logged runs; the results come back as a pandas DataFrame
experiment_results = mlflow.search_runs()
experiment_results
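
`mlflow.search_runs()` returns one row per run, with logged metrics in columns prefixed `metrics.` and parameters in columns prefixed `params.`. The sketch below shows one way such a table could be ranked to find the best run. It uses a hand-built stand-in frame, so the metric values are made up for illustration and are not results from this experiment.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by mlflow.search_runs();
# the metric values below are invented for illustration only
demo_results = pd.DataFrame({
    "tags.mlflow.runName": ["run_0", "run_1", "run_2"],
    "params.max_depth": ["1", "2", "10"],  # MLflow stores parameters as strings
    "metrics.rmse_rfr": [4.7, 4.2, 3.9],
})

# Lower RMSE is better, so sort ascending and take the first row
best = demo_results.sort_values("metrics.rmse_rfr").iloc[0]
best_run_name = best["tags.mlflow.runName"]
print(best_run_name, best["params.max_depth"])
```

The same `sort_values` call should apply to `experiment_results` with any of the `metrics.rmse_lr`, `metrics.rmse_dtr`, or `metrics.rmse_rfr` columns.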