## Problem
This project treats rainfall prediction as a supervised binary classification problem: given daily meteorological measurements (pressure, temperature, dew point, humidity, cloud cover, sunshine, wind direction, and wind speed), predict whether it will rain. By training machine learning models on historical weather data, we aim to provide accurate, actionable predictions for industries and individuals who rely on weather insights.



### Use Cases
The project aligns with business goals by transforming weather data into predictive insights. Potential use cases include:

- Agriculture: Enabling farmers to optimize irrigation schedules and crop protection measures.
- Logistics: Assisting delivery companies in route planning to avoid rain-affected delays.
- Event Planning: Providing actionable weather predictions to reduce disruptions in outdoor events.
- Energy Management: Helping utility companies forecast demand fluctuations caused by weather changes.

### Measurable Success Criteria

- Accurate and timely rainfall predictions that directly inform operational decisions.
- Reduction in costs associated with weather-related disruptions.

## Project Goals and Key Performance Indicators (KPIs)

### Project Goals
1. Develop a machine learning model capable of achieving high prediction accuracy for rainfall events.
2. Ensure the model’s predictions are interpretable and actionable for stakeholders.
3. Optimize the model to provide predictions in near-real-time.

### Key Performance Indicators (KPIs)
- **Model Accuracy:** Achieve a prediction accuracy of at least 90% on the validation dataset.
- **False Positive Rate (FPR):** Maintain an FPR below 10% to avoid unnecessary alarms.
- **Processing Time:** Ensure predictions are generated in less than 1 second per query.
- **Stakeholder Alignment:** Collect periodic feedback and ensure at least 85% satisfaction with the model’s usability and insights.
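
The accuracy and FPR targets above can be checked with a few lines of NumPy once a model produces predictions. The sketch below is only illustrative: the function name `rainfall_kpis` and the label arrays are hypothetical, and it assumes the encoding used later in this notebook (1 = rain, 0 = no rain).

```python
import numpy as np

def rainfall_kpis(y_true, y_pred):
    """Return accuracy and false positive rate for binary labels (1 = rain)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Confusion-matrix cells needed for the two KPIs
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # predicted rain, stayed dry
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # predicted dry, stayed dry
    accuracy = float(np.mean(y_true == y_pred))
    # FPR = FP / (FP + TN); reported as 0.0 when there are no actual negatives
    fpr = fp / (fp + tn) if (fp + tn) > 0 else 0.0
    return {"accuracy": accuracy, "fpr": fpr}

# Made-up labels purely for illustration, not results from this project
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
kpis = rainfall_kpis(y_true, y_pred)
print(kpis)
print("meets accuracy KPI:", kpis["accuracy"] >= 0.90)
print("meets FPR KPI:", kpis["fpr"] < 0.10)
```

Tracking FPR alongside accuracy matters here because the sample rows shown later are mostly rainy days; if the classes are imbalanced, a high accuracy alone could hide a poor false-alarm rate.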

## Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
rainpath = "/content/drive/MyDrive/datasets/Rainfall.csv"

In [None]:
df = pd.read_csv(rainpath)

## Data Preparation

In [None]:
df.head()

Unnamed: 0,day,pressure,maxtemp,temparature,mintemp,dewpoint,humidity,cloud,rainfall,sunshine,winddirection,windspeed
0,1,1025.9,19.9,18.3,16.8,13.1,72,49,yes,9.3,80.0,26.3
1,2,1022.0,21.7,18.9,17.2,15.6,81,83,yes,0.6,50.0,15.3
2,3,1019.7,20.3,19.3,18.0,18.4,95,91,yes,0.0,40.0,14.2
3,4,1018.9,22.3,20.6,19.1,18.8,90,88,yes,1.0,50.0,16.9
4,5,1015.9,21.3,20.7,20.2,19.9,95,81,yes,0.0,40.0,13.7


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 366 entries, 0 to 365
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   day                     366 non-null    int64  
 1   pressure                366 non-null    float64
 2   maxtemp                 366 non-null    float64
 3   temparature             366 non-null    float64
 4   mintemp                 366 non-null    float64
 5   dewpoint                366 non-null    float64
 6   humidity                366 non-null    int64  
 7   cloud                   366 non-null    int64  
 8   rainfall                366 non-null    object 
 9   sunshine                366 non-null    float64
 10           winddirection  365 non-null    float64
 11  windspeed               365 non-null    float64
dtypes: float64(8), int64(3), object(1)
memory usage: 34.4+ KB


In [None]:
df.columns

Index(['day', 'pressure ', 'maxtemp', 'temparature', 'mintemp', 'dewpoint',
       'humidity ', 'cloud ', 'rainfall', 'sunshine', '         winddirection',
       'windspeed'],
      dtype='object')

In [None]:
# Several column names carry stray whitespace (e.g. 'pressure ', '         winddirection'),
# so strip leading and trailing spaces from every column name
df.rename(str.strip, axis='columns', inplace=True)
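
As a quick sanity check of the whitespace fix, the same idea can be tried on a toy frame. This is a minimal sketch: the `demo` frame is hypothetical and only mimics two of the padded headers printed above.

```python
import pandas as pd

# Toy frame whose headers mimic the padded names printed by df.columns
demo = pd.DataFrame({"pressure ": [1025.9], "         winddirection": [80.0]})

# Same technique as above, but returning a new frame instead of mutating in place
cleaned = demo.rename(columns=str.strip)

# Equivalent alternative using the string accessor on the column index
alt_columns = list(demo.columns.str.strip())

print(list(cleaned.columns))
print(alt_columns)
```

Both forms give the same names; the non-in-place variant leaves the original frame untouched, which can make a cell safer to re-run.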

In [None]:
# Count the missing values in each column
df.isnull().sum()

Unnamed: 0,0
day,0
pressure,0
maxtemp,0
temparature,0
mintemp,0
dewpoint,0
humidity,0
cloud,0
rainfall,0
sunshine,0


In [None]:
# Show every row that contains at least one missing value
df[df.isnull().any(axis=1)]

Unnamed: 0,day,pressure,maxtemp,temparature,mintemp,dewpoint,humidity,cloud,rainfall,sunshine,winddirection,windspeed
160,9,1005.7,31.7,28.2,26.6,25.7,86,79,yes,6.5,,


In [None]:
# Inspect the rows adjacent to the one with missing values
df.loc[155:167]

Unnamed: 0,day,pressure,maxtemp,temparature,mintemp,dewpoint,humidity,cloud,rainfall,sunshine,winddirection,windspeed
155,4,1007.9,33.8,28.7,24.7,25.4,83,75,yes,5.5,220.0,20.8
156,5,1008.8,30.4,26.9,25.0,24.3,86,87,yes,0.7,20.0,9.8
157,6,1008.8,29.1,26.2,24.8,24.7,91,80,yes,2.2,20.0,11.2
158,7,1008.1,30.7,28.1,26.3,25.4,86,75,yes,5.7,20.0,9.5
159,8,1006.3,30.0,27.1,24.1,25.1,89,85,yes,3.1,190.0,12.6
160,9,1005.7,31.7,28.2,26.6,25.7,86,79,yes,6.5,,
161,10,1005.7,31.1,27.9,26.6,25.8,89,80,yes,4.5,220.0,14.6
162,11,1005.9,27.8,26.6,25.4,25.4,93,85,yes,0.0,230.0,20.0
163,12,1005.7,29.2,27.1,25.4,25.8,93,96,yes,0.0,220.0,29.8
164,13,1005.0,31.5,29.7,28.5,26.7,84,90,yes,1.3,220.0,24.3


In [None]:
# Summary statistics for the numeric columns
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
day,366.0,15.756831,8.823592,1.0,8.0,16.0,23.0,31.0
pressure,366.0,1013.742623,6.414776,998.5,1008.5,1013.0,1018.1,1034.6
maxtemp,366.0,26.191257,5.978343,7.1,21.2,27.75,31.2,36.3
temparature,366.0,23.747268,5.632813,4.9,18.825,25.45,28.6,32.4
mintemp,366.0,21.894536,5.594153,3.1,17.125,23.7,26.575,30.0
dewpoint,366.0,19.989071,5.997021,-0.4,16.125,21.95,25.0,26.7
humidity,366.0,80.177596,10.06247,36.0,75.0,80.5,87.0,98.0
cloud,366.0,71.128415,21.798012,0.0,58.0,80.0,88.0,100.0
sunshine,366.0,4.419399,3.934398,0.0,0.5,3.5,8.2,12.1
winddirection,365.0,101.506849,81.723724,10.0,40.0,70.0,190.0,350.0


In [None]:
#encode rainfall values
#first, check the unique values in the column
df['rainfall'].unique()

array(['yes', 'no'], dtype=object)

In [None]:
# map all no -> 0 and yes -> 1 and create a new column for the encoded format
rainfall_mapping = {'no':0, 'yes':1}
df['rainfall_encoded'] = df['rainfall'].map(rainfall_mapping)
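
One caveat with `Series.map` and a dictionary is that any value missing from the dictionary silently becomes `NaN`. The check above found only 'yes' and 'no', so the mapping is safe for this file; the sketch below shows a defensive variant for messier inputs. The `labels` series is hypothetical and deliberately includes a differently cased, padded value.

```python
import pandas as pd

# Hypothetical labels, including a variant the plain mapping would miss
labels = pd.Series(["yes", "no", "Yes ", "no"])
mapping = {"no": 0, "yes": 1}

# Without normalisation, 'Yes ' is not a key in the mapping and maps to NaN
raw_encoded = labels.map(mapping)
print(int(raw_encoded.isna().sum()))

# Normalise case and whitespace first, then verify nothing was left unmapped
normalized = labels.str.strip().str.lower()
encoded = normalized.map(mapping)
assert encoded.notna().all(), "unexpected rainfall label found"
encoded = encoded.astype(int)
print(encoded.tolist())
```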

In [None]:
df.head()

Unnamed: 0,day,pressure,maxtemp,temparature,mintemp,dewpoint,humidity,cloud,rainfall,sunshine,winddirection,windspeed,rainfall_encoded
0,1,1025.9,19.9,18.3,16.8,13.1,72,49,yes,9.3,80.0,26.3,1
1,2,1022.0,21.7,18.9,17.2,15.6,81,83,yes,0.6,50.0,15.3,1
2,3,1019.7,20.3,19.3,18.0,18.4,95,91,yes,0.0,40.0,14.2,1
3,4,1018.9,22.3,20.6,19.1,18.8,90,88,yes,1.0,50.0,16.9,1
4,5,1015.9,21.3,20.7,20.2,19.9,95,81,yes,0.0,40.0,13.7,1


In [None]:
# Drop the original string-valued rainfall column now that rainfall_encoded exists
new_df = df.drop('rainfall', axis=1)

In [None]:
new_df.head()

Unnamed: 0,day,pressure,maxtemp,temparature,mintemp,dewpoint,humidity,cloud,sunshine,winddirection,windspeed,rainfall_encoded
0,1,1025.9,19.9,18.3,16.8,13.1,72,49,9.3,80.0,26.3,1
1,2,1022.0,21.7,18.9,17.2,15.6,81,83,0.6,50.0,15.3,1
2,3,1019.7,20.3,19.3,18.0,18.4,95,91,0.0,40.0,14.2,1
3,4,1018.9,22.3,20.6,19.1,18.8,90,88,1.0,50.0,16.9,1
4,5,1015.9,21.3,20.7,20.2,19.9,95,81,0.0,40.0,13.7,1


In [None]:
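# Replace each missing value with the mean of its column.
# Only winddirection and windspeed (row 160) have missing values, and every
# remaining column is numeric, so a single vectorized fillna is sufficient.
new_df = new_df.fillna(new_df.mean())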
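
Mean imputation is reasonable for `windspeed`, but `winddirection` is an angle, so an arithmetic mean can point the wrong way (for example, averaging 350° and 10° gives 180° rather than roughly north). A possible refinement, not used in this notebook, is a circular mean. The helper `circular_mean_deg` and the sample values below are hypothetical and shown only as a sketch.

```python
import numpy as np

def circular_mean_deg(angles_deg):
    """Mean direction in degrees within [0, 360), ignoring NaN entries."""
    angles = np.asarray(angles_deg, dtype=float)
    radians = np.deg2rad(angles[~np.isnan(angles)])
    # Average the unit vectors, then convert the resultant back to an angle
    mean_angle = np.arctan2(np.sin(radians).mean(), np.cos(radians).mean())
    return float(np.rad2deg(mean_angle) % 360.0)

# Made-up directions clustered around north, with one missing reading
sample = [350.0, 10.0, 30.0, np.nan]
arithmetic = float(np.nanmean(sample))  # pulled far away from the cluster
circular = circular_mean_deg(sample)    # stays near the cluster
print(arithmetic, circular)
```

With a single missing row out of 366, either choice is unlikely to change the model noticeably; the distinction matters more if wind direction is later used as a raw numeric feature.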

## Exploratory Data Analysis