**Machine learning** is a branch of artificial intelligence that focuses on creating algorithms that **allow computers to learn and make decisions from data, instead of following explicit instructions from programmers.**

There are three main types of machine learning: Supervised Learning, Unsupervised Learning, and Reinforcement Learning. Today, we will take a closer look at Supervised Learning algorithms, focusing specifically on Classification models.

**Classification means training a model on a dataset so that it learns the boundaries that separate categories of data. Depending on the dataset and the type of model, these boundaries can take many different shapes. Afterwards, a new data point can be categorized simply by where it falls relative to those boundaries.**
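As a toy illustration of "categorizing by location", the sketch below learns a single boundary on one feature and then classifies new points by which side they fall on. The data, the helper names `fit_threshold` and `predict`, and the midpoint rule are all made up for this example; real models such as the decision tree used later learn more flexible boundaries.

```python
# Toy illustration (hypothetical data): learn a 1-D boundary between two classes.
def fit_threshold(values, labels):
    """Return the midpoint between the mean of class 0 and the mean of class 1."""
    class0 = [v for v, lab in zip(values, labels) if lab == 0]
    class1 = [v for v, lab in zip(values, labels) if lab == 1]
    mean0 = sum(class0) / len(class0)
    mean1 = sum(class1) / len(class1)
    return (mean0 + mean1) / 2


def predict(value, threshold):
    """Classify a point by its side of the boundary (assumes class 1 has the larger values)."""
    return 1 if value > threshold else 0


# made-up humidity readings and rainy/not-rainy labels
humidity = [0.10, 0.20, 0.30, 0.70, 0.80, 0.90]
rainy = [0, 0, 0, 1, 1, 1]

threshold = fit_threshold(humidity, rainy)
print(threshold)
print(predict(0.25, threshold), predict(0.85, threshold))
```

Once the boundary is learned, the training data is no longer needed: a new point is labeled only by comparing it with `threshold`.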

# Importing Libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Data Preprocessing

## Loading the Dataset

In [3]:
# load the data
df = pd.read_csv('Temperature_predictions.csv')

# inspect the raw data
print(df)

# drop the first column ('Unnamed: 0'), which is just a saved copy of the row index
df = df.iloc[:, 1:]
print(df)

print(df.columns)
print(df.shape)

    Unnamed: 0  Temperature  Humidity  WindSpeed  Rainy
0            0     0.374540  0.031429          7      0
1            1     0.950714  0.636410          3      0
2            2     0.731994  0.314356          0      1
3            3     0.598658  0.508571          7      1
4            4     0.156019  0.907566          3      1
..         ...          ...       ...        ...    ...
95          95     0.493796  0.349210          3      0
96          96     0.522733  0.725956          3      1
97          97     0.427541  0.897110          0      1
98          98     0.025419       NaN          7      1
99          99     0.107891  0.779876          2      0

[100 rows x 5 columns]
    Temperature  Humidity  WindSpeed  Rainy
0      0.374540  0.031429          7      0
1      0.950714  0.636410          3      0
2      0.731994  0.314356          0      1
3      0.598658  0.508571          7      1
4      0.156019  0.907566          3      1
..          ...       ...        ...    

## Splitting Dataset into Input and Output Features

In [4]:
# decide which columns are the inputs (features) and which is the output (target):
# every column except the last is an input, and the last column (Rainy) is the output
X = df.iloc[:,:-1]
y = df.iloc[:,-1]

print(X)
print(y)

    Temperature  Humidity  WindSpeed
0      0.374540  0.031429          7
1      0.950714  0.636410          3
2      0.731994  0.314356          0
3      0.598658  0.508571          7
4      0.156019  0.907566          3
..          ...       ...        ...
95     0.493796  0.349210          3
96     0.522733  0.725956          3
97     0.427541  0.897110          0
98     0.025419       NaN          7
99     0.107891  0.779876          2

[100 rows x 3 columns]
0     0
1     0
2     1
3     1
4     1
     ..
95    0
96    1
97    1
98    1
99    0
Name: Rainy, Length: 100, dtype: int64


## Missing data

**What can we do with missing data?** <br>
-> ignore (drop) rows with missing data <br>
-> replace the missing values (mean, median, most frequent value, a value predicted by another algorithm, ...)
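The two options above can be sketched with pandas on a small made-up table. The `toy` frame and its values are hypothetical and are not taken from `Temperature_predictions.csv`; the median is used here only as one possible replacement strategy.

```python
import numpy as np
import pandas as pd

# hypothetical toy frame with one missing Humidity value
toy = pd.DataFrame({
    "Temperature": [0.2, 0.4, 0.6, 0.8],
    "Humidity": [0.1, np.nan, 0.5, 0.9],
})

# option 1: ignore (drop) every row that contains a missing value
dropped = toy.dropna()

# option 2: replace each missing value with the median of its column
filled = toy.fillna(toy.median())

print(dropped.shape)
print(filled)
```

Dropping rows is simplest but throws away data, which matters on a dataset of only 100 rows; replacing values keeps every row at the cost of inserting an estimate.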

In [5]:
from sklearn.impute import SimpleImputer

# replace NaN values in the first two columns (Temperature, Humidity) with the column mean
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit(X.iloc[:, 0:2])
X.iloc[:, 0:2] = imp_mean.transform(X.iloc[:, 0:2])

print(X)

    Temperature  Humidity  WindSpeed
0      0.374540  0.031429          7
1      0.950714  0.636410          3
2      0.731994  0.314356          0
3      0.598658  0.508571          7
4      0.156019  0.907566          3
..          ...       ...        ...
95     0.493796  0.349210          3
96     0.522733  0.725956          3
97     0.427541  0.897110          0
98     0.025419  0.478728          7
99     0.107891  0.779876          2

[100 rows x 3 columns]


## Splitting data into training and testing dataset

In [6]:
# randomly assign 25% of the rows to the test set and the rest to the training set;
# random_state fixes the shuffle so the split is reproducible
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

print(X_test)
print(y_test)

    Temperature  Humidity  WindSpeed
83     0.063558  0.877339          5
53     0.894827  0.489453          4
70     0.772245  0.478728          7
45     0.662522  0.036887          3
44     0.258780  0.284840          2
39     0.440152  0.971782          1
22     0.292145  0.318003          4
80     0.863103  0.341066          7
10     0.020584  0.289751          2
0      0.374540  0.031429          7
18     0.431945  0.892559          3
30     0.607545  0.417411          2
73     0.815461  0.226496          5
33     0.948886  0.337615          5
90     0.119594  0.093103          8
4      0.156019  0.907566          3
76     0.771270  0.690938          9
77     0.074045  0.386735          1
12     0.832443  0.929698          1
31     0.170524  0.222108          5
55     0.921874  0.242055          0
88     0.887213  0.529651          2
26     0.199674  0.818015          6
42     0.034389  0.497249          0
69     0.986887  0.590893          2
83    0
53    0
70    0
45    1
44    

# Training the Decision Tree Classification Model 

**How does the Decision Tree algorithm work?** <br>
Here we have an example: the decision tree algorithm cuts up our dataset over several iterations.  
**How are the splits made?** 
If you take a closer look, each split is chosen so that the resulting regions contain, as far as possible, points from a single category. That is a very basic explanation; in reality the algorithm uses entropy and some more involved math to score each split. If you want to know more, I will leave a link to explanation videos you can watch after this tutorial.
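To make the entropy idea a bit more concrete, here is a minimal sketch of how a candidate split could be scored. The helper names `entropy` and `information_gain` and the toy labels are made up for illustration; this shows only the underlying idea and is not scikit-learn's actual implementation.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting `parent` into `left` and `right`."""
    weight_left = len(left) / len(parent)
    weight_right = len(right) / len(parent)
    children = weight_left * entropy(left) + weight_right * entropy(right)
    return entropy(parent) - children


# hypothetical labels: four points of class 0 and four of class 1
parent = [0, 0, 0, 0, 1, 1, 1, 1]

# a split that separates the classes perfectly
pure_gain = information_gain(parent, [0, 0, 0, 0], [1, 1, 1, 1])

# a split that leaves both sides as mixed as the parent
mixed_gain = information_gain(parent, [0, 0, 1, 1], [0, 0, 1, 1])

print(pure_gain, mixed_gain)
```

A split that produces purer groups removes more entropy, so a tree that scores splits this way would prefer the first split over the second.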

In [7]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

# Testing the Model with Predictions

In [8]:
# print(classifier.predict([[0.034389,0.497249,0]]))

# Assuming your feature names are 'Temperature', 'Humidity', and 'WindSpeed'
input_data = pd.DataFrame([[0.034389, 0.497249, 0]], columns=['Temperature', 'Humidity', 'WindSpeed'])

# Make the prediction
print(classifier.predict(input_data))

[1]


In [9]:
y_pred = classifier.predict(X_test)
print(y_pred)
y_test = np.array(y_test)
# show predictions (left column) next to the true labels (right column)
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1))

[1 0 0 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 1 1 0 0 1 1]
[[1 0]
 [0 0]
 [0 0]
 [0 1]
 [0 0]
 [1 1]
 [0 1]
 [0 1]
 [1 0]
 [0 0]
 [1 1]
 [0 0]
 [0 1]
 [0 1]
 [0 0]
 [1 1]
 [0 1]
 [0 1]
 [0 0]
 [1 1]
 [1 1]
 [0 0]
 [0 1]
 [1 0]
 [1 1]]


In [10]:
# explained in the video: rows are the true classes, columns are the predicted classes
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))

[[8 3]
 [8 6]]
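
As a rough guide to reading the matrix above, the sketch below copies its four counts by hand and derives a few common summary metrics. The variable names are illustrative; the layout assumed is scikit-learn's convention of true classes in rows and predicted classes in columns.

```python
# counts copied from the confusion matrix printed above
# (rows = true class, columns = predicted class)
tn, fp = 8, 3   # true class 0: predicted 0, predicted 1
fn, tp = 8, 6   # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total    # share of all predictions that were correct
precision = tp / (tp + fp)      # of the points predicted rainy, how many were rainy
recall = tp / (tp + fn)         # of the truly rainy points, how many were found

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# prints: accuracy=0.56 precision=0.67 recall=0.43
```

With 14 of 25 test rows predicted correctly, this tree does only slightly better than guessing on this small dataset.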
