## Classification of Weather Data using scikit-learn
In this notebook, we use scikit-learn to train a decision tree classifier on weather data.

**Importing necessary libraries**

In [1]:
import pandas as pd 
from sklearn.metrics import accuracy_score 
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

Read the CSV file into a Pandas DataFrame

In [2]:
data = pd.read_csv("../input/daily_weather.csv")
data.head()

Unnamed: 0,number,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am,relative_humidity_9am,relative_humidity_3pm
0,0,918.06,74.822,271.1,2.080354,295.4,2.863283,0.0,0.0,42.42,36.16
1,1,917.347688,71.403843,101.935179,2.443009,140.471548,3.533324,0.0,0.0,24.328697,19.426597
2,2,923.04,60.638,51.0,17.067852,63.7,22.100967,0.0,20.0,8.9,14.46
3,3,920.502751,70.138895,198.832133,4.337363,211.203341,5.190045,0.0,0.0,12.189102,12.742547
4,4,921.16,44.294,277.8,1.85666,136.5,2.863283,8.9,14730.0,92.41,76.74


## Daily Weather Data Description


The file **daily_weather.csv** is a comma-separated file that contains weather data. This data comes from a weather station located in San Diego, California. The weather station is equipped with sensors that capture weather-related measurements such as air temperature, air pressure, and relative humidity. Data was collected for a period of three years, from September 2011 to September 2014, to ensure that sufficient data for different seasons and weather conditions is captured.


In [3]:
# Let's look at the columns in the dataset 
data.columns

Index(['number', 'air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am',
       'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am',
       'rain_accumulation_9am', 'rain_duration_9am', 'relative_humidity_9am',
       'relative_humidity_3pm'],
      dtype='object')

Each row in daily_weather.csv captures weather data for a separate day.

Sensor measurements from the weather station were captured at one-minute intervals. These measurements were then processed to generate values that describe daily weather. Since this dataset was created to classify low-humidity days vs. non-low-humidity days (that is, days with normal or high humidity), the variables included are weather measurements in the morning, with one measurement, namely relative humidity, in the afternoon. The idea is to use the morning weather values to predict whether the day will be low-humidity or not, based on the afternoon measurement of relative humidity.

Each row, or sample, consists of the following variables:

* **number:** unique number for each row
* **air_pressure_9am:** air pressure averaged over a period from 8:55am to 9:04am (*Unit: hectopascals*)
* **air_temp_9am:** air temperature averaged over a period from 8:55am to 9:04am (*Unit: degrees Fahrenheit*)
* **avg_wind_direction_9am:** wind direction averaged over a period from 8:55am to 9:04am (*Unit: degrees, with 0 meaning wind coming from the North, and increasing clockwise*)
* **avg_wind_speed_9am:** wind speed averaged over a period from 8:55am to 9:04am (*Unit: miles per hour*)
* **max_wind_direction_9am:** wind gust direction averaged over a period from 8:55am to 9:10am (*Unit: degrees, with 0 being North and increasing clockwise*)
* **max_wind_speed_9am:** wind gust speed averaged over a period from 8:55am to 9:04am (*Unit: miles per hour*)
* **rain_accumulation_9am:** amount of rain accumulated in the 24 hours prior to 9am (*Unit: millimeters*)
* **rain_duration_9am:** amount of time rain was recorded in the 24 hours prior to 9am (*Unit: seconds*)
* **relative_humidity_9am:** relative humidity averaged over a period from 8:55am to 9:04am (*Unit: percent*)
* **relative_humidity_3pm:** relative humidity averaged over a period from 2:55pm to 3:04pm (*Unit: percent*)
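
The exact processing used to turn one-minute sensor readings into the daily values is not shown in this notebook. As a rough sketch only, assuming the readings are held in a Pandas Series with a datetime index, a "9am" value averaged from 8:55am to 9:04am could be computed like this. The date, the `readings` series, and its values are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical one-minute readings for a single day (synthetic values:
# each reading is just the minute-of-day number).
minutes = pd.date_range("2011-09-10 00:00", periods=24 * 60, freq="min")
readings = pd.Series(np.arange(len(minutes), dtype=float), index=minutes)

# Select the ten readings from 8:55am through 9:04am (both ends inclusive)
# and average them to get a single "9am" value.
window = readings.between_time("08:55", "09:04")
value_9am = window.mean()

print(len(window), value_9am)
```

The window covers ten one-minute samples, which matches the 8:55am to 9:04am periods listed above.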


### Data Cleaning 

We look for null values and remove the affected rows to obtain a clean dataset.

In [4]:
data.isnull().any().any()

True

This means that there are some null values. Let's look at which columns contain them.

In [5]:
data.isnull().sum() 

number                    0
air_pressure_9am          3
air_temp_9am              5
avg_wind_direction_9am    4
avg_wind_speed_9am        3
max_wind_direction_9am    3
max_wind_speed_9am        4
rain_accumulation_9am     6
rain_duration_9am         3
relative_humidity_9am     0
relative_humidity_3pm     0
dtype: int64

In [6]:
# Print the rows with missing values 
data[data.isnull().any(axis = 1)]

Unnamed: 0,number,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am,relative_humidity_9am,relative_humidity_3pm
16,16,917.89,,169.2,2.192201,196.8,2.930391,0.0,0.0,48.99,51.19
111,111,915.29,58.82,182.6,15.613841,189.0,,0.0,0.0,21.5,29.69
177,177,915.9,,183.3,4.719943,189.9,5.346287,0.0,0.0,29.26,46.5
262,262,923.596607,58.380598,47.737753,10.636273,67.145843,13.671423,0.0,,17.990876,16.461685
277,277,920.48,62.6,194.4,2.751436,,3.869906,0.0,0.0,52.58,54.03
334,334,916.23,75.74,149.1,2.751436,187.5,4.183078,,1480.0,31.88,32.9
358,358,917.44,58.514,55.1,10.021491,,12.705819,0.0,0.0,13.88,25.93
361,361,920.444946,65.801845,49.823346,21.520177,61.886944,25.549112,,40.364018,12.278715,7.618649
381,381,918.48,66.542,90.9,3.467257,89.4,4.406772,,0.0,20.64,14.35
409,409,,67.853833,65.880616,4.328594,78.570923,5.216734,0.0,0.0,18.487385,20.356594
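
Note that the number of missing cells and the number of rows containing a missing cell are different quantities. The following sketch uses a small made-up frame (not the weather data) to show both counts side by side.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration: row 1 has two missing cells, row 2 has one.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [1.0, np.nan, np.nan, 4.0],
})

# Total number of missing cells across all columns.
missing_cells = int(toy.isnull().sum().sum())

# Number of rows with at least one missing cell (the rows dropna() would remove).
rows_with_missing = int(toy.isnull().any(axis=1).sum())

print(missing_cells, rows_with_missing)
```

In the weather data, the per-column counts above add up to 31, and each of the rows displayed has a single missing value, so the two counts appear to coincide here.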


#### Data Cleaning Steps 

In [7]:
# We do not need the 'number' column because Pandas provides its own index
del data['number']
data.columns

Index(['air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am',
       'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am',
       'rain_accumulation_9am', 'rain_duration_9am', 'relative_humidity_9am',
       'relative_humidity_3pm'],
      dtype='object')

Next, we drop the rows that contain null values.

In [8]:
before_rows = data.shape[0]
data = data.dropna()
after_rows = data.shape[0]

In [9]:
print("The number of dropped rows is {}".format(before_rows - after_rows))

The number of dropped rows is 31
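
One detail worth knowing is that `dropna()` removes rows but keeps the original index labels, which is why some labels (such as 16) are absent from later outputs. A minimal sketch on a made-up frame, with an optional `reset_index` step that this notebook does not use:

```python
import numpy as np
import pandas as pd

# Toy frame for illustration: the row labelled 1 has a missing value.
toy = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [4.0, 5.0, 6.0]})

# dropna() removes that row; the remaining rows keep their original labels.
cleaned = toy.dropna()
print(list(cleaned.index))

# Optional: relabel the rows 0..n-1 if gaps in the index are unwanted.
renumbered = cleaned.reset_index(drop=True)
print(list(renumbered.index))
```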


## Convert to a Classification task

**Binarize relative_humidity_3pm to 0 or 1**

We add a new column, 'high_humidity_label', which turns this into a binary classification problem. Using 24.99 as the threshold, a day is labelled high humidity (1) if its 3pm relative humidity is above the threshold, and low humidity (0) otherwise.

In [10]:
clean_data = data.copy() # New data frame to avoid confusion 
clean_data['high_humidity_label'] = (clean_data['relative_humidity_3pm'] > 24.99) * 1
print(clean_data['high_humidity_label'])

0       1
1       0
2       0
3       0
4       1
5       1
6       0
7       1
8       0
9       1
10      1
11      1
12      1
13      1
14      0
15      0
17      0
18      1
19      0
20      0
21      1
22      0
23      1
24      0
25      1
26      1
27      1
28      1
29      1
30      1
       ..
1064    1
1065    1
1067    1
1068    1
1069    1
1070    1
1071    1
1072    0
1073    1
1074    1
1075    0
1076    0
1077    1
1078    0
1079    1
1080    0
1081    0
1082    1
1083    1
1084    1
1085    1
1086    1
1087    1
1088    1
1089    1
1090    1
1091    1
1092    1
1093    1
1094    0
Name: high_humidity_label, Length: 1064, dtype: int64
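
As a small sketch of the thresholding step, the comparison can also be converted to integers with `astype(int)`, which is equivalent to multiplying by 1. The first five values below are the 3pm humidity readings from the first rows of the data; the last value is added here only to show what happens exactly at the threshold.

```python
import pandas as pd

# First five 3pm humidity values from the data, plus one value at the threshold.
humidity_3pm = pd.Series([36.16, 19.426597, 14.46, 12.742547, 76.74, 24.99])

threshold = 24.99

# Strictly greater than the threshold -> 1, everything else (including equal) -> 0.
labels = (humidity_3pm > threshold).astype(int)

print(labels.tolist())
```

Because the comparison is strict, a reading of exactly 24.99 is labelled 0.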


Target is now stored as y. Here, target is the label - 'high_humidity_label'

In [11]:
y = clean_data[['high_humidity_label']].copy()
y

Unnamed: 0,high_humidity_label
0,1
1,0
2,0
3,0
4,1
5,1
6,0
7,1
8,0
9,1


In [12]:
clean_data['relative_humidity_3pm'].head()

0    36.160000
1    19.426597
2    14.460000
3    12.742547
4    76.740000
Name: relative_humidity_3pm, dtype: float64

In [13]:
y.head()

Unnamed: 0,high_humidity_label
0,1
1,0
2,0
3,0
4,1


### Use 9am Sensor signals to predict Humidity at 3PM

In [14]:
time = '9am'
features = list(clean_data.columns[clean_data.columns.str.contains(time)])

# we do not need relative humidity at 9am 
features.remove('relative_humidity_9am')

features

['air_pressure_9am',
 'air_temp_9am',
 'avg_wind_direction_9am',
 'avg_wind_speed_9am',
 'max_wind_direction_9am',
 'max_wind_speed_9am',
 'rain_accumulation_9am',
 'rain_duration_9am']

In [15]:
# Store the selected feature columns as X
X = clean_data[features].copy()
#X

## Perform the Train/Test Split

### REMINDER: Training and Testing Phases

* In the **training phase**, the learning algorithm uses the training data to adjust the model’s parameters to minimize errors.  At the end of the training phase, you get the trained model.


* In the **testing phase**, the trained model is applied to test data.  Test data is separate from the training data, and is previously unseen by the model.  The model is then evaluated on how it performs on the test data.  The goal in building a classifier model is to have the model perform well on training as well as test data.


In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 324)
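
To see how `test_size = 0.33` translates into row counts, here is a sketch that splits placeholder arrays with the same number of rows as the cleaned data (1064). The arrays `X_demo` and `y_demo` are stand-ins filled with zeros; only their shapes matter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 1064  # number of rows left after dropping nulls

# Placeholder arrays: 8 feature columns and one label per row.
X_demo = np.zeros((n_rows, 8))
y_demo = np.zeros(n_rows, dtype=int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=324
)

# The test size is rounded up, and the training set gets the remainder.
print(len(X_tr), len(X_te))
```

This gives 712 training rows and 352 test rows.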

Let us inspect these sets using the commands below.

In [17]:
# type(X_train)
# type(X_test)
# type(y_train)
# type(y_test)
# X_train.head()
# #y_train.describe()

In [18]:
y_train.describe()


Unnamed: 0,high_humidity_label
count,712.0
mean,0.494382
std,0.50032
min,0.0
25%,0.0
50%,0.0
75%,1.0
max,1.0


In [19]:
X_train.describe()

Unnamed: 0,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am
count,712.0,712.0,712.0,712.0,712.0,712.0,712.0,712.0
mean,918.913897,65.194366,142.12333,5.568288,147.789373,7.089629,0.202399,282.884615
std,3.147923,11.210412,68.773699,4.467828,67.186609,5.486311,1.628988,1584.404987
min,907.99,36.752,15.5,0.782929,31.8,1.185578,0.0,0.0
25%,916.727792,57.3755,65.282862,2.304048,75.978656,3.168182,0.0,0.0
50%,919.0,65.967556,166.225018,3.958486,176.8,5.077854,0.0,0.0
75%,921.134993,73.9085,190.5,7.504934,201.4,9.104346,0.0,0.0
max,929.32,91.112,343.4,21.541732,299.2,26.351153,24.02,17704.0


### Fit the model on the training set

We build a model with the Decision Tree Classifier and train it using its fit method.

In [20]:
humidity_classifier = DecisionTreeClassifier(max_leaf_nodes = 10, random_state = 0)
humidity_classifier.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=10,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best')

In [21]:
type(humidity_classifier)

sklearn.tree.tree.DecisionTreeClassifier
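
A fitted tree can also be inspected directly. The sketch below is not part of the original analysis: it fits a tree with the same settings on synthetic data from `make_classification` and prints the learned rules with `export_text`, which is available in more recent scikit-learn releases than the one that produced the output above. The feature names are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data, used only to keep the example self-contained.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
names = [f"feature_{i}" for i in range(X_demo.shape[1])]

# Same hyperparameters as the humidity classifier.
demo_tree = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0)
demo_tree.fit(X_demo, y_demo)

# max_leaf_nodes caps the size of the tree, so there are at most 10 leaves.
n_leaves = demo_tree.get_n_leaves()
rules = export_text(demo_tree, feature_names=names)

print(n_leaves)
print(rules)
```

With the weather data, passing `features` as `feature_names` would show which 9am measurements the tree splits on.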

### Test the model on the testing set



In [22]:
predictions = humidity_classifier.predict(X_test)
type(predictions)

numpy.ndarray

In [23]:
predictions[:10]
#predictions[:len(predictions)]

array([0, 0, 1, 1, 1, 1, 0, 0, 0, 1])

In [24]:
y_test[['high_humidity_label']][:10]


Unnamed: 0,high_humidity_label
456,0
845,0
693,1
259,1
723,1
224,1
300,1
442,0
585,1
1057,1


We now have the actual labels in y_test and the predicted labels in predictions, so we can compare them to see how often the classifier was correct. Let's compute the accuracy using accuracy_score.

### Measure the accuracy of the classifier

In [25]:
accuracy_score(y_test, y_pred = predictions)

0.8153409090909091
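
Accuracy is simply the fraction of predictions that match the actual labels. As a sketch, here is the same calculation done by hand on the first ten test rows shown earlier, compared with `accuracy_score`.

```python
import math

import numpy as np
from sklearn.metrics import accuracy_score

# First ten predictions and the corresponding actual labels from the test set.
predicted = np.array([0, 0, 1, 1, 1, 1, 0, 0, 0, 1])
actual = np.array([0, 0, 1, 1, 1, 1, 1, 0, 1, 1])

# Fraction of positions where prediction and label agree.
manual = float(np.mean(predicted == actual))
library = accuracy_score(actual, predicted)

print(manual, library)
assert math.isclose(manual, library)
```

Eight of these ten predictions are correct, so both values are 0.8 for this small sample.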

### Measuring the mean squared error

In [26]:
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred = predictions)

0.1846590909090909
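
For 0/1 labels, the squared error of a single prediction is 0 when it is correct and 1 when it is wrong, so the mean squared error is the same as the misclassification rate. The sketch below checks this on the first ten test rows shown earlier.

```python
import math

import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# First ten predictions and the corresponding actual labels from the test set.
predicted = np.array([0, 0, 1, 1, 1, 1, 0, 0, 0, 1])
actual = np.array([0, 0, 1, 1, 1, 1, 1, 0, 1, 1])

mse = mean_squared_error(actual, predicted)
error_rate = 1 - accuracy_score(actual, predicted)

print(mse, error_rate)
assert math.isclose(mse, error_rate)
```

The same relationship holds for the full test set: 0.8153 + 0.1847 = 1.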

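We have predicted whether the 3pm relative humidity is above the threshold from the 9am measurements with about 81.5% accuracy. The mean squared error of about 18.5% is just the complement of the accuracy for 0/1 labels, so it carries no additional information. The training labels are almost evenly balanced (mean 0.494), so if the test labels are similarly balanced, this is well above the roughly 50% that always predicting a single class would achieve, which is a reasonable result for a tree with at most 10 leaves.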
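
One way to judge an accuracy figure is to compare it against a trivial baseline. The sketch below is an illustration on synthetic data, not a result for the weather dataset: it compares a `DummyClassifier` that always predicts the most frequent training class with a decision tree configured like the one above.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 8 features, used only for illustration.
X_demo, y_demo = make_classification(n_samples=1000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=324
)

# Baseline: always predict the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)

# Decision tree with the same hyperparameters as the humidity classifier.
tree = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
tree_acc = accuracy_score(y_te, tree.predict(X_te))

print(baseline_acc, tree_acc)
```

Running the same comparison with `X_train`, `X_test`, `y_train`, and `y_test` from this notebook would show how much the tree improves on the baseline for the weather data.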