<a href="https://colab.research.google.com/github/abdulkham1d0v/hackathon-projects/blob/main/CASE1_Airkickers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# OperaBuild Case1

## About Project

The dataset contains a plan of construction work with completion deadlines, together with the actual results of the work performed.

`Main Objective`: Develop a predictive algorithm that estimates **delays in time**, taking *previous experience* into account.

## Project plan

1.  **Download and prepare the data.**

2. **Train and test the various models:**

    2.1. Getting features and target from prepared dataset
    
    2.2. Splitting features and target into train and test
    
    2.3. Training and testing various Regression algorithms
    
    2.4. Saving the model

3. **General Conclusion**

### Downloading and preparing the data

**WARNING**

The cell below installs the `catboost` library, which is used later in this notebook. If the library is already installed, skip the cell or leave it commented out.

In [None]:
#pip install catboost
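
A minimal sketch of how the install step could be made conditional, using only the standard library: `importlib.util.find_spec` reports whether a package can be imported. The helper name `is_installed` is hypothetical and not part of the original notebook.

```python
from importlib.util import find_spec


def is_installed(package_name: str) -> bool:
    """Return True if the top-level package can be imported in this environment."""
    return find_spec(package_name) is not None


if is_installed("catboost"):
    print("catboost is already installed, the install cell can be skipped")
else:
    print("catboost is missing, run: pip install catboost")
```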

In [None]:
# Libraries needed to work with the dataset
from json import load
from datetime import datetime

import numpy as np

# Used to display the data as a table
import pandas as pd

In [None]:
# Prediction models
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

import catboost as cb

In [None]:
# Tools needed to work with the models
# Train/test splitter
from sklearn.model_selection import train_test_split

# Metrics for evaluating model scores
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_squared_log_error
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error

# For saving the model
import joblib

In [None]:
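ds_case_1 = []  # The prepared dataset rows are stored here
# Lookup list used to derive a season-like feature from a month number (index = month).
# Index 0 is a placeholder. Note: months 1-4 map to 1, 5-8 to 2 and 9-12 to 3,
# so the value 4 (indices 13-16) is never reached for a valid month.
seasons = [-1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4]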

In [None]:
# days_between returns the absolute number of days between two "YYYY-MM-DD" dates
def days_between(d1, d2):
    d1 = datetime.strptime(d1, "%Y-%m-%d")
    d2 = datetime.strptime(d2, "%Y-%m-%d")
    return abs((d2 - d1).days)

# get_month returns the month number of a "YYYY-MM-DD" date
def get_month(d1):
    d1 = datetime.strptime(d1, "%Y-%m-%d")
    return d1.month

# Open the dataset and load it into ds_1
with open("wbs.task1.json", 'r') as file:
    ds_1 = load(file)

# Go through the rows one by one, keeping existing fields and deriving new ones
for item in ds_1:
    # The dataset contains missing values, so rows without end_date or duration are skipped
    if item['end_date'] is not None and item.get('duration') is not None:
        days = days_between(item['start_date'].split('T')[0], item['end_date'].split('T')[0])
        ds_case_1.append({
            'plan': item.get('plan'),
            'fact': item.get('fact'),
            'start_date': item.get('start_date'),
            'end_date': item.get('end_date'),
            'start_season': seasons[get_month(item.get('start_date').split('T')[0])],  # season of the start date, via get_month and the seasons list
            'end_season': seasons[get_month(item.get('end_date').split('T')[0])],  # season of the end date, via get_month and the seasons list
            'days_between': days,  # number of days between start_date and end_date
            'duration': item.get('duration')
        })
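
The date handling above splits each timestamp on `'T'` before parsing. As an alternative sketch (not the notebook's method), `datetime.fromisoformat` can parse the full timestamp directly. The function name `days_between_iso` is hypothetical, and the sample timestamps are taken from the first and fifth rows of the table displayed in this notebook.

```python
from datetime import datetime


def days_between_iso(start: str, end: str) -> int:
    """Absolute number of calendar days between two ISO 8601 timestamps."""
    start_day = datetime.fromisoformat(start).date()
    end_day = datetime.fromisoformat(end).date()
    return abs((end_day - start_day).days)


print(days_between_iso("2020-11-11T10:49:23", "2020-11-29T18:00:00"))  # 18
print(days_between_iso("2020-09-29T09:04:56", "2020-11-29T18:00:00"))  # 61
```

Both values agree with the `days_between` column for those rows.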

In [None]:
data = pd.DataFrame(ds_case_1)
display(data.head())
display(data.shape)

| | plan | fact | start_date | end_date | start_season | end_season | days_between | duration |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 15.0 | 15.0 | 2020-11-11T10:49:23 | 2020-11-29T18:00:00 | 3 | 3 | 18 | 90 |
| 1 | 105.59 | 52.9453 | 2020-12-18T10:49:41 | 2020-11-29T18:00:00 | 3 | 3 | 19 | 90 |
| 2 | 100.0 | 100.0 | 2020-11-12T03:27:10 | 2020-11-29T18:00:00 | 3 | 3 | 17 | 90 |
| 3 | 90.12 | 90.12 | 2020-12-25T03:03:54 | 2020-11-29T18:00:00 | 3 | 3 | 26 | 90 |
| 4 | 575.87 | 575.87 | 2020-09-29T09:04:56 | 2020-11-29T18:00:00 | 3 | 3 | 61 | 60 |


(1222, 8)

In [None]:
print('-' * 100)
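# numeric_only=True skips the non-numeric columns; newer pandas versions raise an error on them otherwise
display(data.corr(numeric_only=True))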
print('-' * 100)

----------------------------------------------------------------------------------------------------


| | plan | start_season | end_season | days_between | duration |
| --- | --- | --- | --- | --- | --- |
| plan | 1.0 | -0.015395 | 0.099298 | -0.05321 | 0.086259 |
| start_season | -0.015395 | 1.0 | 0.280688 | -0.362701 | 0.091525 |
| end_season | 0.099298 | 0.280688 | 1.0 | -0.353988 | 0.487933 |
| days_between | -0.05321 | -0.362701 | -0.353988 | 1.0 | -0.156686 |
| duration | 0.086259 | 0.091525 | 0.487933 | -0.156686 | 1.0 |


----------------------------------------------------------------------------------------------------


### Conclusion:

Our first step is finished. Here we have *downloaded, opened and prepared* the data for further use. As you can see above, we have a reasonable amount of data (1222 rows). One more thing: we have `extracted new information` (the start and end seasons and the number of days between the dates) from the existing fields. We did this because the data has very **few features** for training, which would otherwise lead to `lower metric scores`.
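
For comparison, here is a sketch of a calendar-season mapping. It is an assumption and not what the notebook uses: it follows the meteorological convention (December to February counts as one season), and the function name `month_to_season` is hypothetical. The `seasons` list is copied from the notebook so the two can be printed side by side.

```python
# Lookup list copied from the notebook (index = month number, index 0 is a placeholder)
seasons = [-1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4]


def month_to_season(month: int) -> int:
    """Meteorological seasons (assumed convention): 1 = Dec-Feb, 2 = Mar-May, 3 = Jun-Aug, 4 = Sep-Nov."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    return (month % 12) // 3 + 1


# Columns: month, value from the notebook's list, value from the alternative mapping
for month in range(1, 13):
    print(month, seasons[month], month_to_season(month))
```

With the notebook's list, months 1 to 12 only produce the values 1, 2 and 3. Switching to another mapping would change the features, so the scores reported in this notebook would likely change as well.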

## Train and test the various models:

### Getting `features` and `target` from *prepared dataset*

In [None]:
X = []
y = []

for item in ds_case_1:
    try:
        X.append([
            float(item['plan']),
            float(item['fact']),
            float(item['days_between']),
            float(item['start_season']),
            float(item['end_season'])
        ])
        y.append(float(item['duration']))
    except (TypeError, ValueError):
        # Print rows whose values cannot be converted to float
        print(item)

In [None]:
display(len(X))
display(len(y))

1222

1222

Here we have saved our `features` into the **X** list and the `target` into **y**.

### Splitting `features` and `target` into *train* and *test*

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print('X_train size:',len(X_train))
print('X_test size:',len(X_test))
print('y_train size:',len(y_train))
print('y_test size:',len(y_test))

X_train size: 916
X_test size: 306
y_train size: 916
y_test size: 306


Here we have **split** our `features` and `target` into a train set and a test set.
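
The printed sizes can be reproduced with a little arithmetic. This is a sketch that assumes scikit-learn rounds the test share up to a whole number of samples, which is consistent with the sizes shown above.

```python
import math

n_samples = 1222   # number of rows in the prepared dataset
test_size = 0.25   # fraction passed to train_test_split

n_test = math.ceil(test_size * n_samples)   # 305.5 is rounded up to 306
n_train = n_samples - n_test                # the remaining rows go to training

print(n_train, n_test)  # 916 306
```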

### Training and testing various Regression algorithms

#### To score all models and choose the best one we will use **R2_SCORE**.

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)

pred = lr.predict(X_test)

rmse = (np.sqrt(mean_squared_error(y_test, pred)))
r2 = lr.score(X_test,y_test)
mae = mean_absolute_error(y_test, pred)
print("Testing performance")
print("RMSE: {:.2f}".format(rmse))
print("MAE: {:.2f}".format(mae))
print("R2: {:.2f}".format(r2))

Testing performance
RMSE: 34.20
MAE: 20.50
R2: 0.22


*First of all*, we have `trained` and `tested` a **Linear Regression model**. The **linear model** `did not do this task well` (R2 of 0.22). Let's see whether any other model can do better than *Linear*.
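
To make the three reported metrics concrete, here is a sketch of how RMSE, MAE and R2 can be computed by hand in plain Python. The function names and the toy values are made up for illustration; the notebook itself relies on the scikit-learn implementations.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more strongly."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error: average size of the error, in the target's units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination; assumes y_true is not constant."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values, made up for illustration
y_true = [30.0, 60.0, 90.0]
y_pred = [40.0, 60.0, 80.0]

print("RMSE: {:.2f}".format(rmse(y_true, y_pred)))
print("MAE: {:.2f}".format(mae(y_true, y_pred)))
print("R2: {:.2f}".format(r2(y_true, y_pred)))
```

An R2 of 1 means perfect predictions, while a model that always predicts the mean of the targets gets an R2 of 0.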

In [None]:
neigh = KNeighborsRegressor(n_neighbors=3)
neigh.fit(X_train, y_train)

pred = neigh.predict(X_test)

rmse = (np.sqrt(mean_squared_error(y_test, pred)))
r2 = neigh.score(X_test,y_test)
mae = mean_absolute_error(y_test, pred)
print("Testing performance")
print("RMSE: {:.2f}".format(rmse))
print("MAE: {:.2f}".format(mae))
print("R2: {:.2f}".format(r2))

Testing performance
RMSE: 22.96
MAE: 9.96
R2: 0.65


Wow, **KNeighbors** did about `three times` better than **Linear** on R2 (0.65 vs 0.22). Are there any models more effective than **KNeighbors**? Hmmm....
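
A rough sketch of the idea behind k-nearest-neighbors regression, in plain Python: the prediction is the average target of the `k` closest training rows. It assumes Euclidean distance and equal weights for all neighbors. The function name `knn_predict` and the toy data are made up for illustration.

```python
import math


def knn_predict(train_X, train_y, query, k=3):
    """Predict by averaging the targets of the k training rows closest to query."""
    distances = [
        (math.dist(row, query), target)
        for row, target in zip(train_X, train_y)
    ]
    distances.sort(key=lambda pair: pair[0])  # closest rows first
    nearest = distances[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Toy data, made up for illustration: [days_between, start_season]
train_X = [[10, 1], [12, 1], [11, 1], [60, 3], [62, 3]]
train_y = [30.0, 30.0, 60.0, 90.0, 90.0]

print(knn_predict(train_X, train_y, [11, 1], k=3))  # 40.0
```

Because the distance mixes all features, a feature with a large numeric range can dominate the choice of neighbors.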

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

In [None]:
for depth in range(1, 51, 1):
  model = DecisionTreeRegressor(random_state=12345, max_depth=depth)
  model.fit(X_train, y_train)
  print('In a {} depth score is: {}'.format(depth, model.score(X_test, y_test)))

In a 1 depth score is: 0.23737883812384108
In a 2 depth score is: 0.33212628991743043
In a 3 depth score is: 0.4736155860137039
In a 4 depth score is: 0.660432171450133
In a 5 depth score is: 0.6786124272858381
In a 6 depth score is: 0.7675999972161068
In a 7 depth score is: 0.753520777413524
In a 8 depth score is: 0.7865826873162671
In a 9 depth score is: 0.7937795948034531
In a 10 depth score is: 0.8024662126369155
In a 11 depth score is: 0.7931164600000418
In a 12 depth score is: 0.7783497696857703
In a 13 depth score is: 0.7909000008737048
In a 14 depth score is: 0.7859312839495701
In a 15 depth score is: 0.8018425313400461
In a 16 depth score is: 0.7971993350683325
In a 17 depth score is: 0.8001958296084694
In a 18 depth score is: 0.7919820779512531
In a 19 depth score is: 0.7919820779512531
In a 20 depth score is: 0.7919820779512531
In a 21 depth score is: 0.7919820779512531
In a 22 depth score is: 0.7919820779512531
In a 23 depth score is: 0.7919820779512531
In a 24 depth score 

**DecisionTreeRegressor** does noticeably better than the **KNeighbors** algorithm. It showed its best result at `depth = 10`.
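
Instead of reading the best depth off the printed list, the scores can be collected in a dictionary and the maximum picked programmatically. The sketch below uses the scores printed above, rounded to four decimals, up to depth 18, after which the printed score no longer changes.

```python
# R2 scores copied from the output above (rounded to 4 decimals), keyed by max_depth
scores = {
    1: 0.2374, 2: 0.3321, 3: 0.4736, 4: 0.6604, 5: 0.6786, 6: 0.7676,
    7: 0.7535, 8: 0.7866, 9: 0.7938, 10: 0.8025, 11: 0.7931, 12: 0.7783,
    13: 0.7909, 14: 0.7859, 15: 0.8018, 16: 0.7972, 17: 0.8002, 18: 0.7920,
}

best_depth = max(scores, key=scores.get)  # key with the highest score
print(best_depth, scores[best_depth])     # 10 0.8025
```

Note that the depth is chosen here using the test set itself, so the reported score may be slightly optimistic; a separate validation split is a common alternative.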

In [None]:
dtr = DecisionTreeRegressor(random_state=12345, max_depth = 10)
dtr.fit(X_train, y_train)

pred = dtr.predict(X_test)

r2 = dtr.score(X_test,y_test)
mae = mean_absolute_error(y_test, pred)
rmse = (np.sqrt(mean_squared_error(y_test, pred)))
print("Testing performance")
print("RMSE: {:.2f}".format(rmse))
print("MAE: {:.2f}".format(mae))
print("R2: {:.2f}".format(r2))

Testing performance
RMSE: 17.19
MAE: 5.90
R2: 0.80


In [None]:
# Note: the printed label says "depth", but the value shown is n_estimators
for estimator in range(10, 301, 10):
  model = RandomForestRegressor(random_state=12345, n_estimators=estimator)
  model.fit(X_train, y_train)
  print('In a {} depth score is: {}'.format(estimator, model.score(X_test, y_test)))

In a 10 depth score is: 0.8434825672609527
In a 20 depth score is: 0.8452063912219023
In a 30 depth score is: 0.8494415712579961
In a 40 depth score is: 0.8459005340238712
In a 50 depth score is: 0.8459623175390188
In a 60 depth score is: 0.8491980653906437
In a 70 depth score is: 0.8507955379939665
In a 80 depth score is: 0.848708903229766
In a 90 depth score is: 0.8507789301628194
In a 100 depth score is: 0.8504155097143017
In a 110 depth score is: 0.8478809638930076
In a 120 depth score is: 0.8481941363939132
In a 130 depth score is: 0.8484276377566016
In a 140 depth score is: 0.8494747207632252
In a 150 depth score is: 0.8501247489967625
In a 160 depth score is: 0.8503505545911764
In a 170 depth score is: 0.8500891177279315
In a 180 depth score is: 0.8506279539755741
In a 190 depth score is: 0.8506981822188318
In a 200 depth score is: 0.8506749455593097
In a 210 depth score is: 0.8498926945682811
In a 220 depth score is: 0.8503259701288919
In a 230 depth score is: 0.850961356023753

At every `n_estimators` setting shown, **RandomForestRegressor** scored a higher R2 than the best **DecisionTreeRegressor**. Its best result was at `n_estimators = 260`.

In [None]:
rfr = RandomForestRegressor(random_state=12345, n_estimators=260)
rfr.fit(X_train, y_train)

pred = rfr.predict(X_test)

rmse = (np.sqrt(mean_squared_error(y_test, pred)))
r2 = rfr.score(X_test,y_test)
mae = mean_absolute_error(y_test, pred)
print("Testing performance")
print("RMSE: {:.2f}".format(rmse))
print("MAE: {:.2f}".format(mae))
print("R2: {:.2f}".format(r2))

Testing performance
RMSE: 14.90
MAE: 6.30
R2: 0.85


Now we are going to use **CatBoost**, a `gradient boosting` library from **Yandex**. Gradient boosting builds its trees one after another, and each new tree is fitted to the errors left by the previous ones, so the model learns from its own mistakes.
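
To illustrate what "learning from its own mistakes" means, here is a toy gradient boosting loop in plain Python. It is only a sketch of the general idea with one-split trees on a single feature, not CatBoost's actual implementation. All names and the toy data are made up for illustration.

```python
def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def fit_stump(xs, residuals):
    """Best single-threshold split of 1-D inputs, judged by squared error on the residuals.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:  # the largest value would leave the right side empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the current errors."""
    preds = [sum(ys) / len(ys)] * len(ys)
    history = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # the model's current mistakes
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        preds = [
            p + learning_rate * (left_mean if x <= threshold else right_mean)
            for x, p in zip(xs, preds)
        ]
        history.append(mse(ys, preds))
    return preds, history


# Toy data, made up for illustration: days_between -> duration
xs = [5, 10, 20, 40, 60, 80]
ys = [30.0, 30.0, 60.0, 60.0, 90.0, 90.0]

baseline = mse(ys, [sum(ys) / len(ys)] * len(ys))
preds, history = boost(xs, ys)
print(baseline, history[0], history[-1])
```

The training error drops from the baseline after the first round and keeps shrinking as more stumps are added; the `learning_rate` controls how much of each correction is applied.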

In [None]:
train_dataset = cb.Pool(X_train, y_train) 
test_dataset = cb.Pool(X_test, y_test)

model = cb.CatBoostRegressor(loss_function = 'RMSE',eval_metric = 'R2')
grid = {'iterations': [250, 300, 400],
        'learning_rate': [0.09,0.2],
        'depth': [2, 4, 6, 8],
        'l2_leaf_reg': [0.2, 0.5, 1, 3]}
model.grid_search(grid, train_dataset)

Output was truncated to the last 5000 lines.
166:	learn: 0.9866050	test: 0.9396917	best: 0.9399098 (161)	total: 336ms	remaining: 267ms
167:	learn: 0.9866509	test: 0.9398393	best: 0.9399098 (161)	total: 338ms	remaining: 265ms
168:	learn: 0.9866828	test: 0.9398170	best: 0.9399098 (161)	total: 340ms	remaining: 263ms
169:	learn: 0.9867590	test: 0.9397924	best: 0.9399098 (161)	total: 341ms	remaining: 261ms
170:	learn: 0.9867765	test: 0.9398514	best: 0.9399098 (161)	total: 343ms	remaining: 259ms
171:	learn: 0.9868088	test: 0.9398283	best: 0.9399098 (161)	total: 345ms	remaining: 257ms
172:	learn: 0.9868503	test: 0.9398062	best: 0.9399098 (161)	total: 347ms	remaining: 255ms
173:	learn: 0.9869137	test: 0.9398130	best: 0.9399098 (161)	total: 349ms	remaining: 253ms
174:	learn: 0.9869799	test: 0.9397994	best: 0.9399098 (161)	total: 351ms	remaining: 251ms
175:	learn: 0.9870190	test: 0.9397648	best: 0.9399098 (161)	total: 353ms	remaining: 248ms
176:	learn: 0.9870

{'cv_results': defaultdict(list,
             {'iterations': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
               10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
               20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
               30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
               40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
             

In [None]:
model.is_fitted()

True

Now our model is *fitted*. Let's **test** it `in action!`
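
To get a feeling for how much work the grid search does, the parameter combinations can be enumerated with `itertools.product`. This is a sketch that assumes the search tries every combination of the grid; the grid itself is copied from the `grid_search` cell.

```python
from itertools import product

# Grid copied from the grid_search cell
grid = {
    'iterations': [250, 300, 400],
    'learning_rate': [0.09, 0.2],
    'depth': [2, 4, 6, 8],
    'l2_leaf_reg': [0.2, 0.5, 1, 3],
}

# One dictionary per combination of parameter values
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

print(len(combinations))  # 3 * 2 * 4 * 4 = 96
print(combinations[0])
```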

In [None]:
pred = model.predict(X_test)

rmse = (np.sqrt(mean_squared_error(y_test, pred)))
r2 = model.score(X_test,y_test)
mae = mean_absolute_error(y_test, pred)
print("Testing performance")
print("RMSE: {:.2f}".format(rmse))
print("MAE: {:.2f}".format(mae))
print("R2: {:.2f}".format(r2))

Testing performance
RMSE: 14.64
MAE: 6.64
R2: 0.86


In [None]:
model.get_params()

{'depth': 8,
 'eval_metric': 'R2',
 'iterations': 400,
 'l2_leaf_reg': 0.2,
 'learning_rate': 0.09,
 'loss_function': 'RMSE'}
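
Since the grid search is the slow part, one option is to store the parameters it found and reuse them later. The sketch below saves them as JSON with the standard library; the file name is hypothetical, and the values are copied from the `get_params()` output above.

```python
import json
import os
import tempfile

# Parameters copied from the get_params() output above
best_params = {
    'depth': 8,
    'eval_metric': 'R2',
    'iterations': 400,
    'l2_leaf_reg': 0.2,
    'learning_rate': 0.09,
    'loss_function': 'RMSE',
}

# Hypothetical file name, written to a temporary directory for this example
path = os.path.join(tempfile.gettempdir(), 'case_1_best_params.json')
with open(path, 'w') as file:
    json.dump(best_params, file, indent=2)

with open(path) as file:
    restored_params = json.load(file)

print(restored_params == best_params)
# The restored dictionary could then be passed to the model,
# e.g. cb.CatBoostRegressor(**restored_params), without rerunning the grid search.
```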

In [None]:
y_test[:10]

[90.0, 90.0, 60.0, 30.0, 60.0, 30.0, 29.0, 90.0, 90.0, 60.0]

In [None]:
pred[:10]

array([89.96658332, 92.92080255, 58.28576067, 30.57027193, 61.03433752,
       31.87620112, 74.98018436, 89.82246063, 87.70059224, 57.31031553])
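
The two outputs above are easier to compare side by side. A small sketch, using the printed values, that lists the absolute error per sample and finds the largest one:

```python
# Values copied from the two outputs above
y_true = [90.0, 90.0, 60.0, 30.0, 60.0, 30.0, 29.0, 90.0, 90.0, 60.0]
y_pred = [89.96658332, 92.92080255, 58.28576067, 30.57027193, 61.03433752,
          31.87620112, 74.98018436, 89.82246063, 87.70059224, 57.31031553]

abs_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
worst = max(range(len(abs_errors)), key=abs_errors.__getitem__)  # index of the largest error

for i, (t, p, e) in enumerate(zip(y_true, y_pred, abs_errors)):
    print(f"{i}: actual={t:.0f} predicted={p:.2f} error={e:.2f}")
print("largest error at index", worst)
```

Nine of the ten predictions are within 3 days of the actual value; the sample at index 6 (actual 29, predicted about 75) accounts for most of the error.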

Toooodaaay'sss chammmpiiiionnn is the **CatBoost gradient boosting algorithm** from Yandex. This model not only shows the best `r2_score` (0.86); its `other scores` are good as well!

### Saving the model

#### Use the cell below to save your model.

In [None]:
# Path for saving the model file, e.g. './name_for_file.joblib'
#path = "./case_1_final_86.joblib"
#joblib.dump(model, path)

['./case_1_final_86.joblib']

## General Conclusion

We downloaded and prepared the data for making predictions, then separated the `features` and the `target` and split both into `train` and `test` sets. After that we made predictions with various **Regression models**. According to the results, **CatBoostRegressor** was the `best` (R2 of 0.86, RMSE of 14.64). **KNeighborsRegressor** was about **1.5 times** `less accurate` than **CatBoost** by RMSE (22.96 vs 14.64), and **LinearRegression** was `less accurate` by more than **3 times** by R2 (0.22 vs 0.86). **RandomForest** and **DecisionTree** came very close to **CatBoost**, but learning from its own mistakes gave **CatBoost** the edge. **CatBoost** was also the `slowest one`, but we can handle this by reusing the best parameters found here in future work with our dataset.
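
The comparison can be summarized in a few lines of Python. This sketch only tabulates the test-set metrics already printed in this notebook and computes each model's RMSE relative to the best model by R2.

```python
# Test-set metrics copied from the outputs in this notebook
results = {
    'LinearRegression':      {'rmse': 34.20, 'mae': 20.50, 'r2': 0.22},
    'KNeighborsRegressor':   {'rmse': 22.96, 'mae': 9.96,  'r2': 0.65},
    'DecisionTreeRegressor': {'rmse': 17.19, 'mae': 5.90,  'r2': 0.80},
    'RandomForestRegressor': {'rmse': 14.90, 'mae': 6.30,  'r2': 0.85},
    'CatBoostRegressor':     {'rmse': 14.64, 'mae': 6.64,  'r2': 0.86},
}

best = max(results, key=lambda name: results[name]['r2'])  # model with the highest R2

for name, m in sorted(results.items(), key=lambda kv: kv[1]['r2'], reverse=True):
    rmse_ratio = m['rmse'] / results[best]['rmse']
    print(f"{name:<22} R2={m['r2']:.2f} RMSE={m['rmse']:.2f} "
          f"MAE={m['mae']:.2f} RMSE ratio={rmse_ratio:.2f}")
```

By MAE the ranking is slightly different: the decision tree has the lowest value (5.90), ahead of the random forest (6.30) and CatBoost (6.64).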

In [None]:
model1 = joblib.load('case1.joblib')
model1.get_params()

{'depth': 2,
 'eval_metric': 'R2',
 'iterations': 400,
 'l2_leaf_reg': 0.5,
 'learning_rate': 0.4,
 'loss_function': 'RMSE'}

In [None]:
#model.predict([246.53,246.53,12,3,3])

91.78926725449733