## Machine Learning Experiments

In this section, we will experiment with various machine learning models to analyze and predict weather patterns based on historical data. We aim to build predictive models that capture trends and relationships in the weather dataset, which may help in forecasting or understanding key factors influencing weather conditions.

### Goals
1. **Model Selection**: Test different algorithms (linear regression, KNN) to identify the most suitable models for weather prediction.
2. **Feature Engineering**: Explore ways to transform and optimize features to improve model performance.
3. **Hyperparameter Tuning**: Experiment with tuning parameters for each model to achieve the best results.
4. **Model Evaluation**: Use appropriate metrics to evaluate and compare model performance, ensuring reliable and interpretable outcomes.

Throughout these experiments, we will document our findings, assess each model’s effectiveness, and refine our approach based on performance metrics. The ultimate goal is to establish a model pipeline that consistently performs well on unseen data, making it a reliable tool for weather analysis and prediction.


In [1]:
# Importing the needed libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import pickle

In [2]:
# Reading the modified version of the dataset
df = pd.read_csv('weather_model.csv')
df.head()

Unnamed: 0,Summary,Precip Type,Temp,Apparent Temp,Humidity,Wind Speed,Wind Bearing,Visibility,Pressure
0,Partly Cloudy,rain,9.472222,7.388889,0.89,14.1197,251,15.8263,1015.13
1,Partly Cloudy,rain,9.355556,7.227778,0.86,14.2646,259,15.8263,1015.63
2,Mostly Cloudy,rain,9.377778,9.377778,0.89,3.9284,204,14.9569,1015.94
3,Partly Cloudy,rain,8.288889,5.944444,0.83,14.1036,269,15.8263,1016.41
4,Mostly Cloudy,rain,8.755556,6.977778,0.83,11.0446,259,15.8263,1016.51


In [3]:
# Modifying some column names
new_column_names = {
    'Precip Type': 'Precip_Type',
    'Apparent Temp': 'Apparent_Temp',
    'Wind Speed': 'Wind_Speed',
    'Wind Bearing': 'Wind_Bearing'
}

df=df.rename(columns=new_column_names)

In [4]:
# Encoding the precipitation type as a number: rain -> 0, snow -> 1
df['Precip_Type'] = df['Precip_Type'].replace({
    'rain': 0,
    'snow': 1
})

In [5]:
# Deleting the Summary column
del df['Summary']

## Baseline Experiment
- Predicting the mean of the Temp column for every row gives a baseline mean absolute error (MAE) before any algorithm is tried.
- This baseline is a reference point: a trained model is only useful if its MAE is clearly lower.
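The same baseline can also be expressed with scikit-learn's `DummyRegressor`. The snippet below is a minimal sketch on a handful of made-up temperatures (the array `temps` is hypothetical and only stands in for `df['Temp']`); it shows that a mean-predicting dummy model gives the same MAE as comparing against a constant column of the mean.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Hypothetical temperatures standing in for df['Temp']
temps = np.array([9.5, 9.4, 9.4, 8.3, 8.8])
# DummyRegressor ignores the features, so a placeholder column is enough
placeholder_features = np.zeros((len(temps), 1))

# A "model" that always predicts the mean of the training target
baseline = DummyRegressor(strategy="mean").fit(placeholder_features, temps)
baseline_pred = baseline.predict(placeholder_features)
baseline_mae = mean_absolute_error(temps, baseline_pred)

# The manual version: average absolute distance from the mean
manual_mae = np.mean(np.abs(temps - temps.mean()))

print("DummyRegressor MAE =", baseline_mae)
print("Manual MAE         =", manual_mae)
```

One possible advantage of the dummy model is that it can be passed through the same evaluation code as any other estimator.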

In [6]:
# creating a column for the temp average
df['Avg_Temp'] = df['Temp'].mean()

In [7]:
# printing the mean absolute error to use as a reference for model performance
print('MAE for baseline without ML model =', mean_absolute_error(df['Temp'], df['Avg_Temp']))

MAE for baseline without ML model = 7.94119279385541


In [8]:
# removing the average column as it is no longer needed
del df['Avg_Temp']

## Splitting Data
- In the cell below, the data is split into features (`x`) and the label (`y`).
- Temp is the label, since it is the value we want the model to predict.

In [9]:
x = df.drop('Temp', axis=1)
y = df['Temp']

## Train-Test Split
- The data is split into a training set and a test set.
- 80% of the data is used for training; the remaining 20% is held out and stays unseen by the model until testing.

In [10]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=7)

In [11]:
# separating numerical from categorical data
df.select_dtypes(['int','float']).columns

Index(['Precip_Type', 'Temp', 'Apparent_Temp', 'Humidity', 'Wind_Speed',
       'Wind_Bearing', 'Visibility', 'Pressure'],
      dtype='object')

In [12]:
numerical_features = ['Apparent_Temp', 'Humidity', 'Wind_Speed', 'Wind_Bearing',
       'Visibility', 'Pressure']

## Machine Learning Pipeline

This pipeline streamlines the process of preparing data, training models, and evaluating performance. It includes the following steps:

1. **Data Preprocessing**
   - **Scaling Numerical Features**: The pipeline applies scaling to numerical features to standardize them, which is important for models sensitive to feature magnitudes.
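   - **Encoding Categorical Features**: The only categorical feature, `Precip_Type`, was already encoded as 0/1 in an earlier cell, so no encoder is applied inside the pipeline. Note that the `ColumnTransformer` in the function below lists only the numerical features; with scikit-learn's default `remainder='drop'`, any column not listed (including `Precip_Type`) is excluded from the model input.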

2. **Cross-Validation**
   - To ensure robust performance evaluation, we incorporated **cross-validation** rather than a single train-validation-test split. Cross-validation divides the dataset into multiple subsets, or "folds," training the model on different combinations of these folds and averaging the results. 
   - This approach provides a more reliable estimate of model performance across various data subsets, reducing the risk of overfitting or underfitting and ensuring that our model’s performance is consistent and generalizable.

3. **Model Evaluation**
   - The pipeline evaluates model performance using metrics calculated across cross-validation folds and on the full training set.
   - Metrics used:
     - **R² Score**: Indicates how well the model explains the variance in the target variable.
     - **Mean Absolute Error (MAE)**: Measures the average magnitude of prediction errors, providing an intuitive sense of how far off predictions are from actual values.

By wrapping these steps in a function, the pipeline can take any specified model and scaler, making it efficient to test different configurations and quickly assess their performance on the weather dataset. Cross-validation further enhances this process by providing a more comprehensive evaluation of model performance.
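The function in this notebook uses `cross_val_score` with its default scoring, which for regressors is R². As a minimal sketch of how both metrics listed above could be collected per fold, the snippet below uses `cross_validate` with two scorers. The data comes from `make_regression` and is purely synthetic; names such as `X_demo` and `demo_pipeline` are illustrative and not part of the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the weather features
X_demo, y_demo = make_regression(
    n_samples=200, n_features=6, noise=5.0, random_state=7
)

demo_pipeline = Pipeline(steps=[
    ("scaling", StandardScaler()),
    ("modeling", LinearRegression()),
])

# Evaluate R2 and MAE on each of the 5 folds in a single call
scores = cross_validate(
    demo_pipeline,
    X_demo,
    y_demo,
    cv=5,
    scoring={"r2": "r2", "mae": "neg_mean_absolute_error"},
)

fold_r2 = scores["test_r2"]
# scikit-learn reports MAE as a negative number so that "higher is better"
fold_mae = -scores["test_mae"]

print("R2 per fold :", fold_r2)
print("MAE per fold:", fold_mae)
```

Because the scaler sits inside the pipeline, it is refit on the training portion of each fold, so the held-out fold does not leak into the scaling statistics.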


In [13]:
def experiment(model, scaler):
    preprocessing=ColumnTransformer(transformers=[
        ('scaling', scaler, numerical_features)
    ])
    
    # the pipeline applies the preprocessing steps and then fits the chosen model
    pipeline=Pipeline(steps=[
        ('preprocssing', preprocessing),
        ('modeling', model)
    ])
    
    # fitting the model 
    pipeline.fit(x_train, y_train)
    

    result = cross_val_score(pipeline, x_train, y_train)
    print("""
    *Evaluation for the model performance* :
    --------------------------------------""")
    print(result)
    print('cross val mean =', result.mean())
    print('cross val std =', result.std())

    
    # prediction for the train data
    train_pred = pipeline.predict(x_train)

    # measures for the train data
    print("""
    *Results for the train data* :
    ----------------------------""")
    print('r2 score =', r2_score(y_train,train_pred))
    print('MAE =', mean_absolute_error(y_train, train_pred))
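The `ColumnTransformer` inside `experiment` lists only `numerical_features`. As a minimal sketch on a toy frame with made-up values, the snippet below shows what scikit-learn's default `remainder='drop'` does to unlisted columns and how `remainder='passthrough'` would keep them. The passthrough variant is an option to consider, not what this notebook does.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy frame with hypothetical values; Precip_Type is already encoded as 0/1
toy = pd.DataFrame({
    "Precip_Type": [0, 1, 0, 1],
    "Humidity": [0.89, 0.86, 0.83, 0.80],
    "Pressure": [1015.1, 1015.6, 1016.4, 1016.5],
})
scaled_cols = ["Humidity", "Pressure"]

# Default behaviour: columns that are not listed are dropped
dropping = ColumnTransformer(transformers=[
    ("scaling", StandardScaler(), scaled_cols),
])

# Alternative: keep unlisted columns unchanged, appended after the scaled ones
keeping = ColumnTransformer(
    transformers=[("scaling", StandardScaler(), scaled_cols)],
    remainder="passthrough",
)

dropped_shape = dropping.fit_transform(toy).shape
kept_shape = keeping.fit_transform(toy).shape

print("remainder='drop'        ->", dropped_shape)
print("remainder='passthrough' ->", kept_shape)
```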

In [14]:
# EXP 1 (LinearRegression)
experiment(LinearRegression(), StandardScaler())


    *Evaluation for the model performance* :
    --------------------------------------
[0.99043734 0.99061349 0.99027145 0.99036336 0.99031834]
cross val mean = 0.9904007971539798
cross val std = 0.0001195845557328748

    *Results for the train data* :
    ----------------------------
r2 score = 0.9904046137832546
MAE = 0.7328181269540707


In [16]:
# EXP 2 (KNN)
experiment(KNeighborsRegressor(), StandardScaler())


    *Evaluation for the model performance* :
    --------------------------------------
[0.99042488 0.99021604 0.9900409  0.98997099 0.99008898]
cross val mean = 0.9901483552262483
cross val std = 0.0001597372534250494

    *Results for the train data* :
    ----------------------------
r2 score = 0.9942050296064215
MAE = 0.5484878232793082


## Hyperparameter Tuning

After experimenting with both **linear regression** and **K-Nearest Neighbors**, **KNN** scored slightly better on the training data: a lower mean absolute error **(MAE)** (0.548 vs. 0.733) and a higher **R² score** (0.9942 vs. 0.9904; closer to 1 is better). The cross-validation means were nearly identical (0.9901 for KNN vs. 0.9904 for linear regression). Tuning will therefore be performed for KNN.

To optimize the **K-Nearest Neighbors (KNN)** model, we will tune the `n_neighbors` parameter, which controls how many nearest neighbors are used to make each prediction:

- **`n_neighbors`**: A smaller value makes the model more sensitive to local structure in the data but risks overfitting, while a larger value gives smoother predictions but may underfit. We will use cross-validation to find the `n_neighbors` value that performs best.
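The trade-off described above can be made visible by comparing training and cross-validation scores over a few values of `n_neighbors`. The snippet below is a minimal sketch using `validation_curve` on synthetic data from `make_regression`; the data, the chosen `k_values`, and the variable names are assumptions for illustration and are not taken from the weather dataset.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the weather features
X_demo, y_demo = make_regression(
    n_samples=300, n_features=6, noise=10.0, random_state=7
)

knn_pipeline = Pipeline(steps=[
    ("scaling", StandardScaler()),
    ("modeling", KNeighborsRegressor()),
])

k_values = [1, 5, 15, 45]

# R2 on the training folds and on the held-out folds for each k
train_scores, cv_scores = validation_curve(
    knn_pipeline,
    X_demo,
    y_demo,
    param_name="modeling__n_neighbors",
    param_range=k_values,
    cv=5,
)

mean_train = train_scores.mean(axis=1)
mean_cv = cv_scores.mean(axis=1)

for k, train_r2, cv_r2 in zip(k_values, mean_train, mean_cv):
    print(f"k={k:>2}  train R2={train_r2:.3f}  cv R2={cv_r2:.3f}")
```

With `k=1` each training point is its own nearest neighbor, so the training score is essentially perfect while the cross-validation score shows how well that memorization generalizes.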




In [17]:
preprocessing=ColumnTransformer(transformers=[
    ('scaling', StandardScaler(), numerical_features),
])

# the pipeline applies the preprocessing steps and then fits the chosen model
pipeline=Pipeline(steps=[
    ('preprocssing', preprocessing),
    ('modeling', KNeighborsRegressor())
])

# hyperparameter tuning
search_space = {
    'modeling__n_neighbors': range(5, 22)
}


# fitting the model 
grid = GridSearchCV(pipeline, param_grid=search_space)
grid.fit(x_train, y_train)


train_pred=grid.predict(x_train)
test_pred = grid.predict(x_test)
print("""
*Results for the test data* :
----------------------------""")
print('r2 score =', r2_score(y_test,test_pred))
print('MAE =', mean_absolute_error(y_test, test_pred))


*Results for the test data* :
----------------------------
r2 score = 0.9913674765998518
MAE = 0.666544472060994


In [18]:
grid.best_params_

{'modeling__n_neighbors': 6}
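Beyond `best_params_`, a fitted `GridSearchCV` also stores the score of every candidate in `cv_results_`, which helps judge how much the best value actually stands out. The snippet below is a minimal sketch on synthetic data; `demo_grid`, `X_demo`, and the smaller search range are illustrative assumptions rather than the notebook's own objects.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the weather features
X_demo, y_demo = make_regression(
    n_samples=300, n_features=6, noise=10.0, random_state=7
)

demo_pipeline = Pipeline(steps=[
    ("scaling", StandardScaler()),
    ("modeling", KNeighborsRegressor()),
])

demo_grid = GridSearchCV(
    demo_pipeline,
    param_grid={"modeling__n_neighbors": range(3, 8)},
    cv=5,
)
demo_grid.fit(X_demo, y_demo)

# One row per candidate value, with its mean and spread across the folds
summary = pd.DataFrame(demo_grid.cv_results_)[[
    "param_modeling__n_neighbors",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]]

print(summary.sort_values("rank_test_score"))
print("best params:", demo_grid.best_params_)
```

If the top candidates differ by less than their standard deviation across folds, the choice among them is unlikely to matter much.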

In [19]:
# Save the fitted grid search as 'weather_api.pkl'; the context manager closes the file
with open('weather_api.pkl', 'wb') as f:
    pickle.dump(grid, f)

## Model Deployment with Flask

In this section, we save the trained model and deploy it as a web service using Flask. This allows us to serve the model and make predictions based on new input data through a REST API.

Steps:
1. **Save the Model**: The trained model is saved as a pickle file (`weather_api.pkl`) to easily reload it for predictions.
2. **Load the Model**: Reload the saved model to ensure it works as expected.
3. **Create Flask API**: Using Flask, we set up an endpoint (`/predict`) that accepts POST requests. The API:
   - Receives input data in JSON format, processes it, and makes predictions using the loaded model.
   - Returns the prediction as JSON, making it simple to integrate with other applications.

This deployment approach enables real-time predictions and makes the model accessible to other systems.
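To illustrate the request format the `/predict` endpoint expects, the snippet below sketches a small client built only on the standard library. The URL assumes the local development address printed by Flask, the helper names `build_predict_request` and `request_prediction` are hypothetical, and the sample values are taken from the first row of `df.head()` with rain encoded as 0.

```python
import json
import urllib.request

# Assumed address of the locally running Flask development server
API_URL = "http://127.0.0.1:5000/predict"


def build_predict_request(features, url=API_URL):
    """Build a POST request whose body is the feature dict encoded as JSON."""
    body = json.dumps(features).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def request_prediction(features, url=API_URL):
    """Send the request and decode the JSON reply (requires a running server)."""
    with urllib.request.urlopen(build_predict_request(features, url)) as response:
        return json.loads(response.read().decode("utf-8"))


# Feature names match the keys read by the Flask endpoint
sample = {
    "Precip_Type": 0,
    "Apparent_Temp": 7.388889,
    "Humidity": 0.89,
    "Wind_Speed": 14.1197,
    "Wind_Bearing": 251,
    "Visibility": 15.8263,
    "Pressure": 1015.13,
}

demo_request = build_predict_request(sample)
print(demo_request.get_method(), demo_request.full_url)
print(demo_request.data.decode("utf-8"))
```

With the server running, `request_prediction(sample)` would send the request and return the decoded JSON reply.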


In [20]:
import pickle

# Save the fitted grid search to disk; the context manager closes the file
with open('weather_api.pkl', 'wb') as f:
    pickle.dump(grid, f)

In [21]:
# Reload the model to confirm the file can be read back
with open('weather_api.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
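One way to check that a reloaded model "works as expected" is to compare its predictions with those of the original object. The snippet below is a minimal sketch of such a round-trip check; it uses a small KNN model fitted on random data and a temporary file, all of which are illustrative assumptions and not the notebook's `grid` or `weather_api.pkl`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Small stand-in model fitted on random data
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo.sum(axis=1)
demo_model = KNeighborsRegressor(n_neighbors=6).fit(X_demo, y_demo)

# Write the model to a temporary file and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")
    with open(model_path, "wb") as f:
        pickle.dump(demo_model, f)
    with open(model_path, "rb") as f:
        restored_model = pickle.load(f)

# The restored model should reproduce the original predictions exactly
predictions_match = np.array_equal(
    demo_model.predict(X_demo), restored_model.predict(X_demo)
)
print("predictions identical after reload:", predictions_match)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.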

In [81]:
from flask import Flask, request, jsonify
import pandas as pd
import pickle
import numpy as np

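app = Flask(__name__)

# Load the trained model once at startup instead of on every request
with open('weather_api.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

@app.route('/predict', methods=['POST'])
def predict():
    # Parse the JSON request body
    data=request.get_json(force=True)
    print(data)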
    input_data = {
        'Precip_Type' : [int(data['Precip_Type'])],
        'Apparent_Temp': [float(data['Apparent_Temp'])],
        'Humidity': [float(data['Humidity'])],
        'Wind_Speed': [float(data['Wind_Speed'])],
        'Wind_Bearing': [float(data['Wind_Bearing'])],
        'Visibility': [float(data['Visibility'])],
        'Pressure': [float(data['Pressure'])]
    }

    # Build a one-row frame with the column names used during training
    input_df=pd.DataFrame(input_data)
    prediction=loaded_model.predict(input_df)

    return jsonify({'prediction':prediction.tolist()})

if __name__=='__main__':
    app.run(debug=False)
    

 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on http://127.0.0.1:5000
Press CTRL+C to quit
