# FUTURE PREDICTION MODEL

First, we import the pandas and NumPy modules. pandas is a fast, powerful, flexible, and easy-to-use open-source tool for data analysis and manipulation. NumPy (Numerical Python) provides array objects and numerical routines.

In [1]:
import pandas as pd
import numpy as np

# Reading the dataset

We read the CSV file and store it as a DataFrame.

In [2]:
df=pd.read_csv("Dataset - Sheet3.csv")

# Data pre-processing

head() method - Returns the first 5 rows of the dataframe.

In [3]:
df.head()

       Date     Day      Time  Empty level in cm(Total size=100cm)
0  03/01/21  Monday   7:00 AM                                   80
1  03/01/21  Monday   9:00 AM                                   60
2  03/01/21  Monday  11:00 AM                                   60
3  03/01/21  Monday   1:00 PM                                   50
4  03/01/21  Monday   3:00 PM                                   90


tail() method -  Returns the last 5 rows of the dataframe.

In [4]:
df.tail()

         Date     Day     Time  Empty level in cm(Total size=100cm)
239  03/28/21  Sunday  3:00 PM                                   60
240  03/28/21  Sunday  4:00 PM                                   40
241  03/28/21  Sunday  5:00 PM                                   35
242  03/28/21  Sunday  6:00 PM                                  100
243  03/28/21  Sunday  7:00 PM                                   80


The columns attribute returns the column labels of the DataFrame.

In [5]:
df.columns

Index(['Date', 'Day', 'Time', 'Empty level in cm(Total size=100cm)'], dtype='object')

The dtypes attribute returns a Series with the data type of each column.

In [6]:
df.dtypes

Date                                   object
Day                                    object
Time                                   object
Empty level in cm(Total size=100cm)     int64
dtype: object

The describe() method shows basic summary statistics (count, mean, standard deviation, minimum, percentiles, and maximum) for the numeric columns of a DataFrame.

In [7]:
df.describe()

       Empty level in cm(Total size=100cm)
count                           244.000000
mean                             67.860656
std                              24.060731
min                              15.000000
25%                              50.000000
50%                              70.000000
75%                              90.000000
max                             100.000000


The info() method prints a concise summary of a DataFrame, including the index dtype, the column dtypes, the non-null counts, and the memory usage.

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 4 columns):
 #   Column                               Non-Null Count  Dtype 
---  ------                               --------------  ----- 
 0   Date                                 244 non-null    object
 1   Day                                  244 non-null    object
 2   Time                                 244 non-null    object
 3   Empty level in cm(Total size=100cm)  244 non-null    int64 
dtypes: int64(1), object(3)
memory usage: 7.8+ KB


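The snippet below uses LabelEncoder to convert the 'Date', 'Day', and 'Time' columns from strings (object) to integer codes. The codes follow the sorted order of the strings rather than calendar or clock order; for example, '7:00 AM' is encoded as 9 and '11:00 AM' as 1.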

In [9]:
from sklearn.preprocessing import LabelEncoder
category= ['Date','Day','Time'] 
encoder= LabelEncoder()
for i in category:   
    df[i] = encoder.fit_transform(df[i]) 
df.dtypes

Date                                   int32
Day                                    int32
Time                                   int32
Empty level in cm(Total size=100cm)    int64
dtype: object
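
How LabelEncoder chooses its codes can be checked on a few values. The following is a minimal sketch, assuming a short hand-written list of time strings (illustrative only, not read from the dataset file); it shows that each code is the position of the string in the sorted classes_ array.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative sample values, listed in clock order
times = ["7:00 AM", "9:00 AM", "11:00 AM", "1:00 PM", "3:00 PM"]

encoder = LabelEncoder()
codes = encoder.fit_transform(times)

# classes_ holds the distinct labels in sorted (lexicographic) order;
# the code of a label is its index in this array
print(list(encoder.classes_))  # ['11:00 AM', '1:00 PM', '3:00 PM', '7:00 AM', '9:00 AM']
print(list(codes))             # [3, 4, 0, 1, 2]

# inverse_transform maps the codes back to the original strings
restored = list(encoder.inverse_transform(codes))
```

Because the codes are lexicographic, the distance between two codes does not reflect the distance between two times of day.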

corr() computes the pairwise correlation (Pearson by default) of all columns in the DataFrame. NaN values are excluded automatically. Non-numeric columns are ignored in older pandas versions; from pandas 2.0 they raise an error unless numeric_only=True is passed. Note: the correlation of a variable with itself is 1.

In [10]:
df.corr()

                                          Date       Day      Time  Empty level in cm(Total size=100cm)
Date                                  1.000000 -0.056907 -0.020373                            -0.032041
Day                                  -0.056907  1.000000  0.018148                            -0.104016
Time                                 -0.020373  0.018148  1.000000                             0.200872
Empty level in cm(Total size=100cm)  -0.032041 -0.104016  0.200872                             1.000000


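All off-diagonal correlations above are close to zero (the largest in magnitude is about 0.20, between Time and the empty level), so no pair of variables has a strong linear relationship. A low Pearson correlation does not by itself prove that the variables are independent, but it does suggest that linear regression would fit this data poorly.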
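
Pearson correlation only measures linear association. As a minimal sketch on synthetic data (the arrays below are made up for illustration), a variable that is fully determined by another can still have a correlation of essentially zero:

```python
import numpy as np

# x is symmetric around zero; y depends on x exactly, but not linearly
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2

# Off-diagonal entry of the 2x2 correlation matrix
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson correlation: {r:.6f}")
```

This is one reason a non-linear model such as a decision tree can still be worth trying when the correlations are small.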

The values attribute returns the NumPy representation of the DataFrame.
'Date', 'Day', and 'Time' are stored in X; these are the independent variables.

In [11]:
X = df[['Date', 'Day', 'Time']].values    # convert the selected columns to a NumPy array
X[0:5]

array([[ 0,  1,  9],
       [ 0,  1, 12],
       [ 0,  1,  1],
       [ 0,  1,  3],
       [ 0,  1,  5]])

As before, the values attribute returns the NumPy representation of the data.
'Empty level in cm(Total size=100cm)' is stored in Y; it is the dependent variable.

In [12]:
Y = df['Empty level in cm(Total size=100cm)'].values    # convert the column to a NumPy array
Y[0:5]

array([80, 60, 60, 50, 90], dtype=int64)

The snippet below standardizes X with StandardScaler so that each column has zero mean and unit variance. The integer codes are cast to float (int --> float) before scaling.

In [13]:
from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(X).transform(X.astype(float))
X[0:5]

array([[-1.72883387, -1.04653253,  0.69851356],
       [-1.72883387, -1.04653253,  1.50755142],
       [-1.72883387, -1.04653253, -1.45892073],
       [-1.72883387, -1.04653253, -0.91956216],
       [-1.72883387, -1.04653253, -0.38020358]])
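
StandardScaler replaces each value x in a column with (x - mean) / std, where std is the population standard deviation of that column. The same arithmetic can be sketched with NumPy alone; the small array below is made up for illustration and is not the dataset.

```python
import numpy as np

# Hypothetical encoded rows: columns stand for Date, Day, Time codes
X_demo = np.array([[0, 1, 9],
                   [0, 1, 12],
                   [1, 5, 1],
                   [2, 6, 3]], dtype=float)

col_mean = X_demo.mean(axis=0)   # per-column mean
col_std = X_demo.std(axis=0)     # per-column population std (ddof=0)

X_scaled = (X_demo - col_mean) / col_std

# After scaling, every column has mean 0 and standard deviation 1
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```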

# Data Splitting

The snippet below splits the dataset into two parts, a training set and a test set; the test set is 20% of the dataset.
For example, if a dataset contains 100 data points, 80 are used as the training set and 20 as the test set.

In [14]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

Train set: (195, 3) (195,)
Test set: (49, 3) (49,)
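
The split sizes can be reproduced without the real data. This is a minimal sketch, assuming placeholder arrays with the same number of rows (244); it also shows that a fixed random_state gives the same split on every call and that no row ends up in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 244 rows and 3 features (not the real dataset)
X_demo = np.arange(244 * 3).reshape(244, 3)
y_demo = np.arange(244)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=4)
print(X_tr.shape, X_te.shape)   # (195, 3) (49, 3)

# The same random_state yields the same split again
_, _, _, y_te_again = train_test_split(X_demo, y_demo, test_size=0.2, random_state=4)
same_split = np.array_equal(y_te, y_te_again)

# Training and test rows are disjoint
overlap = set(y_tr) & set(y_te)
```

With 244 rows, 20% is 48.8, which is rounded up to 49 test rows; the remaining 195 rows form the training set.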


# Model Training

We use DecisionTreeClassifier to train the model.
Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation.
DecisionTreeClassifier takes two arrays as input: an array X, sparse or dense, of shape (n_samples, n_features) holding the training samples, and an array Y of integer values, of shape (n_samples,), holding the class labels for the training samples. Because a classifier is used here, each distinct empty-level value in Y is treated as a separate class label.

In [15]:
from sklearn.tree import DecisionTreeClassifier

We create a DecisionTreeClassifier object named model.

In [16]:
model = DecisionTreeClassifier()

# fit the model with the training data
model.fit(X_train,y_train)

DecisionTreeClassifier()
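
The "piecewise constant" behaviour is easy to see on a toy problem. The following is a minimal sketch, assuming a made-up one-feature dataset with two level values used as class labels; the fitted tree can only return labels it saw during training, and it returns the same label for every input on the same side of its split.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: one feature, two distinct labels (illustrative values only)
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([80, 80, 40, 40])

clf = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Inputs between or beyond the training points still map to a seen label
preds = clf.predict(np.array([[0.4], [2.6], [10.0]]))
print(list(clf.classes_))  # [40, 80]
print(list(preds))         # [80, 40, 40]
```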

We use the trained model to predict the labels of the test set.

In [17]:
y_pred = model.predict(X_test)

The snippet below computes the accuracy, R2 score, and mean squared error (MSE) of the model on the test set. mean_squared_error already averages over the test samples, so its result is reported without any further division.

In [18]:
import sklearn.metrics as metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy : %s" % "{0:.3%}".format(accuracy))            #Accuracy of the model
print("r2_score:",r2_score(y_test, y_pred))                      #R2 value
print("mse: %.3f" % mean_squared_error(y_test, y_pred))          #MSE (Mean Squared Error) of the model

Accuracy : 77.551%
r2_score: 0.7691828120397397
mse: 114.878


The trained model predicts the exact empty level for about 77.6% of the test samples.
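
What the three metrics measure can be sketched by computing them by hand. The arrays below are made-up values for illustration, not the notebook's test set; accuracy counts exact matches, MSE averages the squared errors, and R2 compares the squared errors with the spread of the true values.

```python
import numpy as np

# Illustrative true and predicted levels (not the real test set)
y_true = np.array([80, 60, 60, 50, 90])
y_hat = np.array([80, 60, 70, 50, 100])

# Accuracy: fraction of predictions that match exactly (3 of 5 here)
accuracy = np.mean(y_true == y_hat)

# MSE: mean of the squared errors, (0 + 0 + 100 + 0 + 100) / 5
mse = np.mean((y_true - y_hat) ** 2)

# R2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(accuracy, mse, round(r2, 4))
```

Accuracy treats a prediction that is off by 10 cm the same as one that is off by 60 cm, whereas MSE and R2 take the size of the error into account.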

# Predicting future values with the trained model

In [19]:
predict=pd.read_csv("prediction.csv")

In [20]:
from sklearn.preprocessing import LabelEncoder
category= ['Date','Day','Time'] 
encoder= LabelEncoder()
for i in category:   
    predict[i] = encoder.fit_transform(predict[i]) 
predict.dtypes

Date    int32
Day     int32
Time    int32
dtype: object

In [21]:
N=predict[['Date','Day','Time']].values
N[0:5]

array([[ 0,  1,  9],
       [ 0,  1, 12],
       [ 0,  1,  1],
       [ 0,  1,  3],
       [ 0,  1,  5]])

In [22]:
from sklearn import preprocessing
N = preprocessing.StandardScaler().fit(N).transform(N.astype(float))
N[0:5]

array([[-1.69378677, -1.04653253,  0.69851356],
       [-1.69378677, -1.04653253,  1.50755142],
       [-1.69378677, -1.04653253, -1.45892073],
       [-1.69378677, -1.04653253, -0.91956216],
       [-1.69378677, -1.04653253, -0.38020358]])
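
In cells 20 and 22 the encoder and the scaler are fitted again on the prediction data, so the scaled Date values differ slightly from the training ones (-1.6938 here versus -1.7288 in cell 13). A common alternative is to fit each transformer once on the training data and only call transform on new data. The following is a minimal sketch of that pattern, assuming two tiny hypothetical frames (train and new) in place of the real CSV files:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical miniature stand-ins for the training and prediction data
train = pd.DataFrame({"Day": ["Monday", "Tuesday", "Friday", "Monday"],
                      "Time": ["7:00 AM", "9:00 AM", "7:00 AM", "1:00 PM"]})
new = pd.DataFrame({"Day": ["Tuesday", "Friday"],
                    "Time": ["1:00 PM", "7:00 AM"]})
columns = ["Day", "Time"]

# One encoder per column, fitted on the training data only
encoders = {col: LabelEncoder().fit(train[col]) for col in columns}
train_codes = pd.DataFrame({col: encoders[col].transform(train[col]) for col in columns})
new_codes = pd.DataFrame({col: encoders[col].transform(new[col]) for col in columns})

# The scaler is also fitted on the training codes and reused for new data
scaler = StandardScaler().fit(train_codes.values.astype(float))
new_scaled = scaler.transform(new_codes.values.astype(float))

print(new_codes)
print(new_scaled)
```

With this pattern a given string always maps to the same code and the same scaled value. Note that LabelEncoder.transform raises a ValueError for a label it did not see during fitting, such as a date outside the training period.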

In [23]:
yhat = model.predict(N)

In [24]:
print(yhat)

[ 80  60  60  50  90  80  80  70  70  40 100  90  70  70  70  70  40 100
  75  65  35 100  85  52  35  28 100  90  85  76  53  32 100  85  80  80
  65  38  22 100  95  80  65  42  15 100  90  70  90  75  45  25 100  68
  65  65  60  40  35 100  80]


In [25]:
predict["Empty level in cm(Total size=100cm)"]=yhat

In [26]:
predict.head()

   Date  Day  Time  Empty level in cm(Total size=100cm)
0     0    1     9                                   80
1     0    1    12                                   60
2     0    1     1                                   60
3     0    1     3                                   50
4     0    1     5                                   90


In [27]:
# Map the label-encoded day codes (alphabetical order) back to day names
predict["Day"] = predict["Day"].replace({1:"Monday",5:"Tuesday",6:"Wednesday",4:"Thursday",0:"Friday",2:"Saturday",3:"Sunday"})

In [28]:
predict.to_csv("Prediction_output.csv")