# Homework 1: ML pipeline

This lab assignment consists of several parts. You are expected to apply data transformations, train models, estimate their quality and explain your results.

Several comments:
* Don't hesitate to ask questions; it's good practice
* No private or public sharing, please. Copied assignments will be graded with 0 points

# Reading the data

Today we work with the [dataset](https://archive.ics.uci.edu/ml/datasets/Statlog+%28Vehicle+Silhouettes%29) describing different cars for a multiclass ($k=4$) classification problem. The data is loaded below.

In [3]:
!pip install ucimlrepo

Collecting ucimlrepo
  Using cached ucimlrepo-0.0.3-py3-none-any.whl.metadata (5.2 kB)
Using cached ucimlrepo-0.0.3-py3-none-any.whl (7.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.3


In [4]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
statlog_vehicle_silhouettes = fetch_ucirepo(id=149)

**Abstract**

In [5]:
print(statlog_vehicle_silhouettes.metadata.abstract)

3D objects within a 2D image by application of an ensemble of shape feature extractors to the 2D silhouettes of the objects.


**Summary**

In [6]:
print(''.join(statlog_vehicle_silhouettes.metadata.additional_info.summary))

The purpose is to classify a given silhouette as one of four types of vehicle, using  a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.  

HISTORY:

This data was originally gathered at the TI in 1986-87 by JP Siebert. It was partially financed by Barr and Stroud Ltd. The original purpose was to find a method of distinguishing 3D objects within a 2D image by application of an ensemble of shape feature extractors to the 2D silhouettes of the objects. Measures of shape features extracted from example silhouettes of objects to be discriminated were used to generate a classification rule tree by means of computer induction.

This object recognition strategy was successfully used to discriminate between silhouettes of model cars, vans and buses viewed from constrained elevation but all angles of rotation.

The rule tree classification performance compared favourably to MDC (Minimum Distance Classifier) and k-NN (k-Nearest Neighbour

**Features**

In [7]:
statlog_vehicle_silhouettes.variables

Unnamed: 0,name,role,type,demographic,description,units,missing_values
0,COMPACTNESS,Feature,Integer,,,,no
1,CIRCULARITY,Feature,Integer,,,,no
2,DISTANCE CIRCULARITY,Feature,Integer,,,,no
3,RADIUS RATIO,Feature,Integer,,,,no
4,PR.AXIS ASPECT RATIO,Feature,Integer,,,,no
5,MAX.LENGTH ASPECT RATIO,Feature,Integer,,,,no
6,SCATTER RATIO,Feature,Integer,,,,no
7,ELONGATEDNESS,Feature,Integer,,,,no
8,PR.AXIS RECTANGULARITY,Feature,Integer,,,,no
9,MAX.LENGTH RECTANGULARITY,Feature,Integer,,,,no


**Full info**

Very verbose

In [8]:
# import pprint

# pp = pprint.PrettyPrinter(compact=True)
# pp.pprint(statlog_vehicle_silhouettes.metadata)

**Data**

In [9]:
data = statlog_vehicle_silhouettes.data.features 
target = statlog_vehicle_silhouettes.data.targets 

In [10]:
data

Unnamed: 0,COMPACTNESS,CIRCULARITY,DISTANCE CIRCULARITY,RADIUS RATIO,PR.AXIS ASPECT RATIO,MAX.LENGTH ASPECT RATIO,SCATTER RATIO,ELONGATEDNESS,PR.AXIS RECTANGULARITY,MAX.LENGTH RECTANGULARITY,SCALED VARIANCE ALONG MAJOR AXIS,SCALED VARIANCE ALONG MINOR AXIS,SCALED RADIUS OF GYRATION,SKEWNESS ABOUT MAJOR AXIS,SKEWNESS ABOUT MINOR AXIS,KURTOSIS ABOUT MINOR AXIS,KURTOSIS ABOUT MAJOR AXIS,HOLLOWS RATIO
0,95.0,48,83,178,72,10,162,42,20,159,176,379,184,70,6,16,187,197
1,91.0,41,84,141,57,9,149,45,19,143,170,330,158,72,9,14,189,199
2,104.0,50,106,209,66,10,207,32,23,158,223,635,220,73,14,9,188,196
3,93.0,41,82,159,63,9,144,46,19,143,160,309,127,63,6,10,199,207
4,85.0,44,70,205,103,52,149,45,19,144,241,325,188,127,9,11,180,183
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
841,93.0,39,87,183,64,8,169,40,20,134,200,422,149,72,7,25,188,195
842,89.0,46,84,163,66,11,159,43,20,159,173,368,176,72,1,20,186,197
843,106.0,54,101,222,67,12,222,30,25,173,228,721,200,70,3,4,187,201
844,86.0,36,78,146,58,7,135,50,18,124,155,270,148,66,0,25,190,195


In [11]:
target

Unnamed: 0,class
0,van
1,van
2,saab
3,van
4,bus
...,...
841,saab
842,van
843,saab
844,saab


## Task 1: Split data for evaluation

Store the training and validation data in the corresponding variables.

In [18]:
# your code here
from sklearn.model_selection import train_test_split


X_train, X_val, y_train, y_val = train_test_split(data, target, test_size=0.25)

In [19]:
print(X_train.shape, y_train.shape)
print(X_val.shape, y_val.shape)

(634, 18) (634, 1)
(212, 18) (212, 1)
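A split made without a fixed seed changes on every run, and a purely random split may leave the classes slightly unbalanced between the two parts. The sketch below shows one possible way to address both, assuming synthetic stand-in data (the arrays `X_demo` and `y_demo` and the seed value are hypothetical, not part of the assignment).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and labels (assumption: 4 balanced classes).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 5))
y_demo = np.repeat(["bus", "opel", "saab", "van"], 100)

# random_state fixes the shuffle; stratify keeps the class proportions
# the same in the training and validation parts.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Count the samples per class in each part.
_, train_counts = np.unique(y_tr, return_counts=True)
_, val_counts = np.unique(y_va, return_counts=True)
print(train_counts, val_counts)
```

Whether stratification matters for the real data depends on how balanced the four classes are, which is worth checking with `target.value_counts()`.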


The `describe` and `info` methods give a quick overview of the columns: value ranges, data types and the number of non-null entries.

In [20]:
X_train.describe()

Unnamed: 0,COMPACTNESS,CIRCULARITY,DISTANCE CIRCULARITY,RADIUS RATIO,PR.AXIS ASPECT RATIO,MAX.LENGTH ASPECT RATIO,SCATTER RATIO,ELONGATEDNESS,PR.AXIS RECTANGULARITY,MAX.LENGTH RECTANGULARITY,SCALED VARIANCE ALONG MAJOR AXIS,SCALED VARIANCE ALONG MINOR AXIS,SCALED RADIUS OF GYRATION,SKEWNESS ABOUT MAJOR AXIS,SKEWNESS ABOUT MINOR AXIS,KURTOSIS ABOUT MINOR AXIS,KURTOSIS ABOUT MAJOR AXIS,HOLLOWS RATIO
count,633.0,634.0,634.0,634.0,634.0,634.0,634.0,634.0,634.0,634.0,634.0,634.0,634.0,634.0,634.0,634.0,634.0,634.0
mean,93.791469,44.884858,81.899054,169.121451,61.94795,8.65142,168.339117,41.146688,20.589905,147.635647,188.340694,438.219243,174.501577,72.476341,6.421136,12.654574,188.801262,195.687697
std,8.167595,6.496882,15.91444,33.907915,9.614799,5.507709,33.432776,9.173538,2.658372,15.295966,30.743304,173.852576,33.304865,7.748428,5.59685,9.146979,9.143145,7.458804
min,73.0,33.0,36.0,73.0,47.0,2.0,6.0,26.0,17.0,20.0,127.0,184.0,109.0,59.0,0.0,0.0,19.0,181.0
25%,88.0,40.0,70.0,140.25,57.0,6.25,147.0,33.0,19.0,136.0,168.0,319.0,148.0,67.0,2.0,5.0,184.25,191.0
50%,93.0,44.0,80.0,168.5,61.0,8.0,157.0,43.0,20.0,146.0,179.0,365.0,173.0,71.0,6.0,11.0,189.0,197.0
75%,100.0,49.0,98.0,195.75,65.0,10.0,198.0,46.0,23.0,159.0,216.75,586.75,198.0,75.0,9.0,19.0,193.0,201.0
max,117.0,100.0,112.0,322.0,199.0,73.0,265.0,162.0,40.0,188.0,285.0,1018.0,401.0,127.0,72.0,41.0,206.0,211.0


In [21]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 634 entries, 608 to 794
Data columns (total 18 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   COMPACTNESS                       633 non-null    float64
 1   CIRCULARITY                       634 non-null    int64  
 2   DISTANCE CIRCULARITY              634 non-null    int64  
 3   RADIUS RATIO                      634 non-null    int64  
 4   PR.AXIS ASPECT RATIO              634 non-null    int64  
 5   MAX.LENGTH ASPECT RATIO           634 non-null    int64  
 6   SCATTER RATIO                     634 non-null    int64  
 7   ELONGATEDNESS                     634 non-null    int64  
 8   PR.AXIS RECTANGULARITY            634 non-null    int64  
 9   MAX.LENGTH RECTANGULARITY         634 non-null    int64  
 10  SCALED VARIANCE ALONG MAJOR AXIS  634 non-null    int64  
 11  SCALED VARIANCE ALONG MINOR AXIS  634 non-null    int64  
 12  SCALED RADI

# Machine Learning pipeline
Here you are supposed to perform the desired transformations. Please explain your results briefly after each task.

## Task 2: Data preprocessing

Apply the transformations of the dataset that you think are relevant. Briefly explain each transformation.
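The `describe` output above reports 633 of 634 non-null values for `COMPACTNESS`, and the feature scales differ widely (maxima range from about 40 to over 1000). One possible preprocessing, sketched here on synthetic stand-in arrays, is median imputation followed by standardization. The step names and the choice of the median are assumptions, not a prescribed solution.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 3 numeric features on a large scale.
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=100.0, scale=20.0, size=(200, 3))
X_va = rng.normal(loc=100.0, scale=20.0, size=(50, 3))
X_tr[0, 0] = np.nan  # mimic a single missing value

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill gaps with the train median
    ("scale", StandardScaler()),                   # zero mean, unit variance
])

# Statistics (medians, means, stds) are estimated on the train part only
# and then reused unchanged for the validation part.
X_tr_scaled = preprocess.fit_transform(X_tr)
X_va_scaled = preprocess.transform(X_va)
print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```

Fitting on `train` only keeps information from `val` out of the preprocessing, which would otherwise make the validation scores optimistic.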

In [None]:
# your code here

## Task 3: Basic logistic regression

* Find optimal hyperparameters for logistic regression with cross-validation on the `train` data (small grid search is enough, no need to find the *best* parameters)
* Estimate the model quality with `f1` and `accuracy` scores
* Plot a ROC-curve for the trained model

Hint: For the multiclass case you might use [`scikitplot` library](https://scikit-plot.readthedocs.io/en/stable/metrics.html) (e.g. `scikitplot.metrics.plot_roc(true_labels, predicted_proba)`)
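The three bullets above can be sketched as follows, on synthetic stand-in data. The grid over `C`, macro-averaged `f1` and the one-vs-rest ROC computation with plain `sklearn` (instead of `scikitplot`) are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, auc, f1_score, roc_curve
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, label_binarize

# Synthetic stand-in data (assumption): 4 classes, 10 features.
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Small grid over the regularization strength, 5-fold CV on train only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(model, grid, cv=5, scoring="f1_macro")
search.fit(X_tr, y_tr)

# Quality on the held-out validation part.
pred = search.predict(X_va)
proba = search.predict_proba(X_va)
acc = accuracy_score(y_va, pred)
f1 = f1_score(y_va, pred, average="macro")

# One-vs-rest ROC: one curve (and AUC) per class.
y_bin = label_binarize(y_va, classes=search.classes_)
roc_auc = {}
for i, cls in enumerate(search.classes_):
    fpr, tpr, _ = roc_curve(y_bin[:, i], proba[:, i])
    roc_auc[cls] = auc(fpr, tpr)

print(search.best_params_, round(acc, 3), round(f1, 3), roc_auc)
```

Each `(fpr, tpr)` pair can be passed to `matplotlib.pyplot.plot` to draw the per-class curves.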

In [None]:
# your code here

In [None]:
# Install scikit-plot
# Warning: if you are running locally, it's better to call it from a terminal in the corresponding
# virtual environment instead of calling pip from within Jupyter

# ! pip install scikit-plot

## Task 4: PCA explained variance plot

Apply PCA to the train part of the data. Build the explained variance plot.
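A minimal sketch of the quantity behind an explained variance plot, assuming synthetic data generated from 3 latent factors (the data and variable names are hypothetical). It computes the cumulative share of variance captured by the first $k$ components.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 8 features driven by 3 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 8))
X_tr = latent @ mixing + 0.1 * rng.normal(size=(300, 8))

# PCA is scale-sensitive, so standardize first, then keep all components.
X_std = StandardScaler().fit_transform(X_tr)
pca = PCA().fit(X_std)

# Cumulative explained variance ratio: entry k-1 is the share for k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
for k, share in enumerate(cumulative, start=1):
    print(f"{k} components: {share:.3f}")
```

Plotting `cumulative` against `range(1, len(cumulative) + 1)` with `matplotlib` gives the explained variance curve.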

In [None]:
# your code here

## Task 5: PCA transformation

Select the appropriate number of components. Briefly explain your choice.

*Use `fit` and `transform` methods to transform the `train` and `val` parts.*
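One possible way to follow the `fit`/`transform` pattern is sketched below on synthetic stand-in data. The 95% variance threshold is an assumption chosen for illustration; the appropriate number of components for the real data should come from your own explained variance plot.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 8 features driven by 3 latent factors.
rng = np.random.default_rng(1)
mixing = rng.normal(size=(3, 8))
X_tr = rng.normal(size=(300, 3)) @ mixing + 0.1 * rng.normal(size=(300, 8))
X_va = rng.normal(size=(100, 3)) @ mixing + 0.1 * rng.normal(size=(100, 8))

# Fit the scaler and PCA on train only. A float n_components keeps the
# smallest number of components whose explained variance reaches that share.
scaler = StandardScaler().fit(X_tr)
pca = PCA(n_components=0.95, svd_solver="full").fit(scaler.transform(X_tr))

# Apply the already fitted transformations to both parts.
X_tr_pca = pca.transform(scaler.transform(X_tr))
X_va_pca = pca.transform(scaler.transform(X_va))
print(pca.n_components_, X_tr_pca.shape, X_va_pca.shape)
```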

In [None]:
# your code here

**Note**:
From this point `sklearn` [composite estimators](https://scikit-learn.org/stable/modules/compose.html) might be useful to perform transformations on the data.\
Refer to the [Pipeline docs](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) for more information.

## Task 6: Logistic regression on PCA-preprocessed data

Find optimal hyperparameters for logistic regression with cross-validation on the PCA-transformed `train` data.

* Estimate the model quality with `f1` and `accuracy` scores
* Plot a ROC-curve for the trained model

*Note: please use the following subset of technical hyperparameters for logistic regression: `multi_class='multinomial'`, `solver='saga'` and `tol=1e-3`*
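A hedged sketch of how scaling, PCA and logistic regression could be chained in a single `Pipeline` and tuned together, on synthetic stand-in data. The number of components and the grid over `C` are placeholders. The `multi_class` argument is left out here because, to my understanding, recent scikit-learn releases deprecate it and already use the multinomial loss for multiclass problems with `saga`; on older versions, pass `multi_class='multinomial'` explicitly as requested.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 4 classes, 10 features.
X_tr, y_tr = make_classification(n_samples=400, n_features=10, n_informative=6,
                                 n_classes=4, n_clusters_per_class=1,
                                 random_state=0)

# Every step is refitted inside each CV fold, so the scaler and PCA
# never see the fold that is used for scoring.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=6)),  # placeholder value
    ("clf", LogisticRegression(solver="saga", tol=1e-3, max_iter=2000)),
])
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)
print(search.best_params_, round(search.best_score_, 3))
```

The `step__parameter` naming also allows `pca__n_components` to be added to the grid if you want to tune it jointly.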

In [None]:
# your code here

## Task 7: Decision tree

Now train a decision tree on the same data. Find optimal tree hyperparameters using cross-validation.

Measure the model quality using the same metrics you used above.
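A minimal sketch of a cross-validated search over tree hyperparameters, on synthetic stand-in data. Which hyperparameters to tune and the candidate values are assumptions; `max_depth` and `min_samples_leaf` are used here because both limit how closely the tree can fit the training set.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 4 classes, 10 features.
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Small grid over two complexity controls, 5-fold CV on train only.
param_grid = {"max_depth": [3, 5, 8, None], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                      cv=5, scoring="f1_macro")
search.fit(X_tr, y_tr)

# Same metrics as for logistic regression, on the validation part.
pred = search.predict(X_va)
tree_acc = accuracy_score(y_va, pred)
tree_f1 = f1_score(y_va, pred, average="macro")
print(search.best_params_, round(tree_acc, 3), round(tree_f1, 3))
```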

In [None]:
# your code here

## Task 8: Bagging

Here starts the ensembling part.

First we will use the _Bagging_ approach. Build ensembles of $N$ models, varying $N$ from $N_{min}=2$ to $N_{max}=100$ (with step 5).

We will build two ensembles: of logistic regressions and of decision trees.

*Comment: each ensemble should be constructed from models of the same family, so logistic regressions should not be mixed up with decision trees.*


*Hint 1: To build a _Bagging_ ensemble efficiently while varying its size, you might generate $N_{max}$ subsets of the `train` data (each of the same size as the original dataset) using the bootstrap procedure once. Then train a new instance of logistic regression/decision tree, with the optimal hyperparameters you estimated before, on each subset (so each model is trained from scratch). Finally, to get an ensemble of $N$ models, average the predictions of $N$ out of the $N_{max}$ models.*

*Hint 2: sklearn might help you with this task. Some appropriate function/class might be out there.*

* Plot the `f1` and `accuracy` scores against the size of the ensemble.
* Briefly analyse the plot. What is the optimal number of algorithms? Explain your answer.
* Do you think the hyperparameters you found above for the decision trees are also optimal for trees used in an ensemble?
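The procedure from Hint 1 can be sketched as follows, here for decision trees on synthetic stand-in data. The tree depth, the seeds and the use of averaged class probabilities are assumptions; the sketch also assumes that every bootstrap sample contains all classes, so each model returns the same probability columns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 4 classes, 10 features.
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

N_MAX = 100
rng = np.random.default_rng(0)
classes = np.unique(y_tr)

# Train N_MAX models once, each from scratch on its own bootstrap sample
# (drawn with replacement, same size as the train part).
member_proba = []
for seed in range(N_MAX):
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeClassifier(max_depth=5, random_state=seed)
    tree.fit(X_tr[idx], y_tr[idx])
    member_proba.append(tree.predict_proba(X_va))
member_proba = np.stack(member_proba)  # shape: (N_MAX, n_val, n_classes)

# An ensemble of size n averages the probabilities of the first n models.
sizes = list(range(2, N_MAX + 1, 5))  # 2, 7, ..., 97
scores = {}
for n in sizes:
    avg = member_proba[:n].mean(axis=0)
    pred = classes[avg.argmax(axis=1)]
    scores[n] = (accuracy_score(y_va, pred),
                 f1_score(y_va, pred, average="macro"))

print(scores[sizes[0]], scores[sizes[-1]])
```

Because the models are trained once and only the averaging is repeated, the cost of scanning all ensemble sizes is dominated by the $N_{max}$ fits. `sklearn.ensemble.BaggingClassifier` is a ready-made alternative in the spirit of Hint 2.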

In [None]:
# your code here

## Task 9: Random Forest

Now we will work with the Random Forest (its `sklearn` implementation).

Plot the `f1` and `accuracy` scores against the number of trees in the Random Forest.

* What is the optimal number of trees you've got? Is it different from the optimal number of logistic regressions/decision trees in the previous part? Explain the results briefly.
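A hedged sketch of scanning the number of trees, on synthetic stand-in data. It uses the `warm_start` option of `RandomForestClassifier`, which keeps the already fitted trees and only adds new ones when `n_estimators` is increased. The list of forest sizes is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 4 classes, 10 features.
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

forest = RandomForestClassifier(warm_start=True, random_state=0)
rf_scores = {}
for n in [5, 10, 25, 50, 100]:
    # With warm_start=True, refitting after raising n_estimators
    # trains only the additional trees.
    forest.set_params(n_estimators=n)
    forest.fit(X_tr, y_tr)
    pred = forest.predict(X_va)
    rf_scores[n] = (accuracy_score(y_va, pred),
                    f1_score(y_va, pred, average="macro"))

print(rf_scores)
```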

In [None]:
# your code here

## Task 10: Learning curve

Your goal is to estimate how the model behaviour changes as the `train` dataset size increases.

Split the training data into 10 (almost) equal parts. Then train the models from above (Logistic Regression, Decision Tree, Random Forest), with the optimal hyperparameters you have selected, on 1 part, then on 2 parts combined (so the train size doubles), then on 3 parts, and so on.

Plot the `accuracy` and `f1` scores on the `val` part against the `train` dataset size (score on one axis, dataset size on the other).

Analyse the final plot. Can you draw any conclusions from it?
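The cumulative-parts procedure can be sketched as follows for a single model, on synthetic stand-in data. The shuffling seed and the logistic regression settings are assumptions; the same loop applies to the decision tree and the random forest.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 4 classes, 10 features.
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Shuffle the train indices once and cut them into 10 (almost) equal parts.
rng = np.random.default_rng(0)
parts = np.array_split(rng.permutation(len(X_tr)), 10)

curve = []  # (train size, accuracy, f1) per step
for k in range(1, len(parts) + 1):
    idx = np.concatenate(parts[:k])  # first k parts combined
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_tr[idx], y_tr[idx])  # a fresh model at every size
    pred = model.predict(X_va)
    curve.append((len(idx),
                  accuracy_score(y_va, pred),
                  f1_score(y_va, pred, average="macro")))

for size, acc, f1 in curve:
    print(size, round(acc, 3), round(f1, 3))
```

The validation part stays fixed throughout, so the points of the curve are directly comparable.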

In [22]:
# your code here

## Task 11: Boosting

Your goal is to build a boosting ensemble using the CatBoost, LightGBM or XGBoost package.
Please do not use the sklearn API for these models.

Find the optimal number of decision trees in the boosting ensemble using grid search.
Please, explain your answer.

In [23]:
# your code here