<a href="https://colab.research.google.com/github/abhishekmishra-bareilly/Machine-learning/blob/main/Whether_Humidity_Predictation(DecisionTreeClassifier).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Weather Humidity Prediction**
**Problem: Use morning sensor signals as features to predict whether the humidity will be high at 3pm**

## **Import the dependencies**


In [79]:
# Import the dependencies
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

## **Import data**

In [80]:
# Import data
data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Machine Learning/lecture/data/Copy of daily_weather.csv')

In [81]:
data.head()

| | air_pressure_9am | air_temp_9am | avg_wind_direction_9am | avg_wind_speed_9am | max_wind_direction_9am | max_wind_speed_9am | rain_accumulation_9am | rain_duration_9am | relative_humidity_9am | high_humidity_3pm |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 918.06 | 74.822 | 271.1 | 2.080354 | 295.4 | 2.863283 | 0.0 | 0.0 | 42.42 | 1 |
| 1 | 917.347688 | 71.403843 | 101.935179 | 2.443009 | 140.471549 | 3.533324 | 0.0 | 0.0 | 24.328697 | 0 |
| 2 | 923.04 | 60.638 | 51.0 | 17.067852 | 63.7 | 22.100967 | 0.0 | 20.0 | 8.9 | 0 |
| 3 | 920.502751 | 70.138895 | 198.832133 | 4.337363 | 211.203341 | 5.190045 | 0.0 | 0.0 | 12.189102 | 0 |
| 4 | 921.16 | 44.294 | 277.8 | 1.85666 | 136.5 | 2.863283 | 8.9 | 14730.0 | 92.41 | 1 |


In [82]:
data.shape

(1095, 10)

## **EDA**

### **Data cleaning**

In [83]:
data.isnull().sum()

air_pressure_9am          3
air_temp_9am              5
avg_wind_direction_9am    4
avg_wind_speed_9am        3
max_wind_direction_9am    3
max_wind_speed_9am        4
rain_accumulation_9am     6
rain_duration_9am         3
relative_humidity_9am     0
high_humidity_3pm         0
dtype: int64

In [84]:
# Removing all null values from the data
data.dropna(inplace=True)

In [85]:
data.isnull().sum()

air_pressure_9am          0
air_temp_9am              0
avg_wind_direction_9am    0
avg_wind_speed_9am        0
max_wind_direction_9am    0
max_wind_speed_9am        0
rain_accumulation_9am     0
rain_duration_9am         0
relative_humidity_9am     0
high_humidity_3pm         0
dtype: int64



*   The data is now clean: dropping every row with a missing value leaves 1,064 of the original 1,095 rows.
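
To make the effect of `dropna` concrete, here is a minimal sketch on a small made-up DataFrame. The frame `toy` and its values are hypothetical and are not taken from the weather data; only the column names are borrowed from it.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; NOT the weather dataset.
toy = pd.DataFrame({
    "air_temp_9am": [74.8, np.nan, 60.6, 70.1],
    "rain_accumulation_9am": [0.0, 0.0, np.nan, 0.0],
    "high_humidity_3pm": [1, 0, 0, 0],
})

rows_before = len(toy)
# dropna() removes every row that has at least one missing value.
toy_clean = toy.dropna()
rows_dropped = rows_before - len(toy_clean)

print(f"Dropped {rows_dropped} of {rows_before} rows")
```

Comparing the row count before and after is a cheap way to confirm that dropping rows does not discard a large share of the data.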



### **Descriptive statistics**

In [86]:
data.describe()

| | air_pressure_9am | air_temp_9am | avg_wind_direction_9am | avg_wind_speed_9am | max_wind_direction_9am | max_wind_speed_9am | rain_accumulation_9am | rain_duration_9am | relative_humidity_9am | high_humidity_3pm |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 1064.0 | 1064.0 | 1064.0 | 1064.0 | 1064.0 | 1064.0 | 1064.0 | 1064.0 | 1064.0 | 1064.0 |
| mean | 918.90318 | 65.022609 | 142.306756 | 5.485793 | 148.480424 | 6.999714 | 0.182023 | 266.393697 | 34.07744 | 0.49718 |
| std | 3.17904 | 11.168033 | 69.149472 | 4.534427 | 67.154911 | 5.59079 | 1.534493 | 1503.092216 | 25.356668 | 0.500227 |
| min | 907.99 | 36.752 | 15.5 | 0.693451 | 28.9 | 1.185578 | 0.0 | 0.0 | 6.09 | 0.0 |
| 25% | 916.595376 | 57.398 | 65.979244 | 2.245529 | 76.335351 | 3.064608 | 0.0 | 0.0 | 15.093365 | 0.0 |
| 50% | 918.942281 | 65.778479 | 165.937461 | 3.869906 | 176.35 | 4.943637 | 0.0 | 0.0 | 23.135 | 0.0 |
| 75% | 921.169054 | 73.530872 | 191.1 | 7.264463 | 201.125 | 8.747888 | 0.0 | 0.0 | 44.66 | 1.0 |
| max | 929.32 | 98.906 | 343.4 | 23.554978 | 312.2 | 29.84078 | 24.02 | 17704.0 | 92.62 | 1.0 |


## **Data processing** 

### **Check the distribution of the target label**

In [87]:
data['high_humidity_3pm'].value_counts()

0    535
1    529
Name: high_humidity_3pm, dtype: int64

* **The two classes are nearly balanced (535 vs. 529)**
* **0 --> humidity is not high at 3pm**
* **1 --> humidity is high at 3pm**
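
Because the classes are nearly balanced, a model that always predicts the more common class would be right only about half the time, which gives a useful floor for judging the classifier's accuracy. The sketch below works that baseline out from the counts printed above; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 535, 1: 529}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
# Accuracy obtained by always predicting the majority class.
baseline_accuracy = counts[majority_class] / total

print(f"Majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.3f}")
```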

### **Split features and target into x and y**

In [88]:
x = data.drop(columns='high_humidity_3pm')  # columns= already implies axis=1
y = data['high_humidity_3pm']

In [89]:
x.columns

Index(['air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am',
       'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am',
       'rain_accumulation_9am', 'rain_duration_9am', 'relative_humidity_9am'],
      dtype='object')

In [90]:
print(x)

      air_pressure_9am  air_temp_9am  avg_wind_direction_9am  \
0           918.060000     74.822000              271.100000   
1           917.347688     71.403843              101.935179   
2           923.040000     60.638000               51.000000   
3           920.502751     70.138895              198.832133   
4           921.160000     44.294000              277.800000   
...                ...           ...                     ...   
1090        918.900000     63.104000              192.900000   
1091        918.710000     49.568000              241.600000   
1092        916.600000     71.096000              189.300000   
1093        912.600000     58.406000              172.700000   
1094        921.530000     77.702000               97.100000   

      avg_wind_speed_9am  max_wind_direction_9am  max_wind_speed_9am  \
0               2.080354              295.400000            2.863283   
1               2.443009              140.471549            3.533324   
2              17.067852               63.700000           22.100967   
[output truncated]

In [91]:
print(y)

0       1
1       0
2       0
3       0
4       1
       ..
1090    1
1091    1
1092    1
1093    1
1094    0
Name: high_humidity_3pm, Length: 1064, dtype: int64


## **Split the data into training and test sets**

In [92]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.33, random_state = 324)

In [93]:
print(x.shape, x_train.shape, x_test.shape)

(1064, 9) (712, 9) (352, 9)
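
The split above is purely random. One optional variation, not used in this notebook, is to pass `stratify=y` so that both splits keep the same class ratio. The sketch below shows this on synthetic stand-in data (`x_demo` and `y_demo` are hypothetical) whose shape and class counts mirror the ones reported above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical); only the shape and class counts mirror the notebook.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(1064, 9))
y_demo = np.array([0] * 535 + [1] * 529)

# stratify=y_demo keeps the 0/1 ratio nearly identical in both splits.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.33, random_state=324, stratify=y_demo
)

print(x_tr.shape, x_te.shape)
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

With classes this balanced the difference from a plain random split is likely small, but stratifying removes one source of split-to-split variation.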


## **Model**

### **Decision Tree Classifier**

In [94]:
model = DecisionTreeClassifier(criterion='entropy', max_leaf_nodes=10, random_state=0)

**Fit the model**

In [95]:
# Fit the model on the training data
model.fit(x_train, y_train)

DecisionTreeClassifier(criterion='entropy', max_leaf_nodes=10, random_state=0)
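
A fitted decision tree can be inspected directly, which helps show which morning signals drive the prediction. The sketch below uses scikit-learn's `export_text` and the `feature_importances_` attribute on synthetic stand-in data. The data, the feature names (`feat_a`, `feat_b`, `feat_c`) and `demo_model` are hypothetical; in the notebook the same calls could be applied to `model` and `x.columns`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (hypothetical): the label depends only on the first feature.
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 100, size=(200, 3))
y_demo = (x_demo[:, 0] > 50).astype(int)
feature_names = ["feat_a", "feat_b", "feat_c"]  # hypothetical names

demo_model = DecisionTreeClassifier(criterion="entropy", max_leaf_nodes=10, random_state=0)
demo_model.fit(x_demo, y_demo)

# Text rendering of the learned split rules.
print(export_text(demo_model, feature_names=feature_names))

# Importances sum to 1; larger values mean the feature contributed more to the splits.
for name, importance in zip(feature_names, demo_model.feature_importances_):
    print(f"{name}: {importance:.3f}")
```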

## **Model Evaluation**

### **Score on training data**

In [96]:
train_data_pred = model.predict(x_train)
score_training = accuracy_score(y_train, train_data_pred)

In [97]:
print('Score on training data:-',score_training)

Score on training data:- 0.8834269662921348


### **Score on testing data**

In [98]:
test_data_pred = model.predict(x_test)
score_testing = accuracy_score(y_test, test_data_pred)

In [99]:
print('Score on testing data:-',score_testing)

Score on testing data:- 0.9090909090909091
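
A single accuracy number does not show which kind of error the model makes, and it depends on one particular train/test split. As a possible follow-up, the sketch below computes a confusion matrix and a 5-fold cross-validated accuracy. It runs on synthetic stand-in data (`x_demo`, `y_demo` and `demo_model` are hypothetical), so its numbers say nothing about the weather model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical), not the weather dataset.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(500, 4))
# Noisy label driven mainly by the first feature.
y_demo = (x_demo[:, 0] + 0.3 * rng.normal(size=500) > 0).astype(int)

x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.33, random_state=324)
demo_model = DecisionTreeClassifier(criterion="entropy", max_leaf_nodes=10, random_state=0)
demo_model.fit(x_tr, y_tr)
y_pred = demo_model.predict(x_te)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_te, y_pred)
print(cm)
print("hold-out accuracy:", accuracy_score(y_te, y_pred))

# 5-fold cross-validation gives a less split-dependent accuracy estimate.
cv_scores = cross_val_score(demo_model, x_demo, y_demo, cv=5, scoring="accuracy")
print("cv accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))
```

In the notebook, the same calls could be made with `model`, `x`, `y`, `y_test` and `test_data_pred` in place of the stand-in names.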
