# Assignment 5 - Regression models

In [1]:
import numpy as np
import pandas as pd
import sklearn

print('Versions used when this notebook was written:')
print('\tNumPy Version = 1.20.2')
print('\tscikit-learn Version = 0.24.1')
print('\tPandas Version = 1.2.3\n')

print(f'Current NumPy Version = {np.__version__}')
print(f'Current Sklearn Version = {sklearn.__version__}')
print(f'Current Pandas Version = {pd.__version__}')

Versions used when this notebook was written:
	NumPy Version = 1.20.2
	scikit-learn Version = 0.24.1
	Pandas Version = 1.2.3

Current NumPy Version = 1.20.2
Current Sklearn Version = 0.24.1
Current Pandas Version = 1.2.4


In [3]:
zeros: np.ndarray = np.zeros(10)
CPU_model_df: pd.DataFrame = pd.DataFrame({
    'Model':list(zeros), 
    'Search_strategy': list(zeros), 
    'test_neg_mean_absolute_error': list(zeros), 
    'test_neg_mean_squared_error': list(zeros), 
    'test_neg_median_absolute_error': list(zeros),
    'train_neg_mean_absolute_error': list(zeros),
    'train_neg_mean_squared_error': list(zeros),
    'train_neg_median_absolute_error': list(zeros),
    'fit_time': list(zeros),
    'score_time': list(zeros)
})

## CPU Computer Hardware

* CPU Computer Hardware: exclude the columns vendor name, model name, and estimated relative performance from the dataset; the column to be estimated is "published relative performance".

In [4]:
# The file has no header row, so the columns are labelled 0..9.
CPU_data: pd.DataFrame = pd.read_csv("data/machine.data", delimiter=",", header=None)
# Drop vendor name (0), model name (1) and estimated relative performance (9).
CPU_data = CPU_data.drop([0, 1, 9], axis=1)
CPU_data.head()

Unnamed: 0,2,3,4,5,6,7,8
0,125,256,6000,256,16,128,198
1,29,8000,32000,32,8,32,269
2,29,8000,32000,32,8,32,220
3,29,8000,32000,32,8,32,172
4,29,8000,16000,32,8,16,132
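
For orientation, the integer column labels above can be replaced with descriptive names. The sketch below assumes the attribute names given in the UCI description of this dataset (MYCT, MMIN, MMAX, CACH, CHMIN, CHMAX, PRP, ERP). One sample row is inlined so the snippet runs without `data/machine.data`; its vendor and model fields are placeholders, not values from the file.

```python
import io

import pandas as pd

# Column names are an assumption based on the UCI dataset description;
# the data file itself has no header row.
CPU_COLUMNS = ["vendor", "model", "MYCT", "MMIN", "MMAX", "CACH", "CHMIN", "CHMAX", "PRP", "ERP"]

# One inlined sample row (vendor, model and ERP fields are placeholders).
sample_csv = "vendor_a,model_a,125,256,6000,256,16,128,198,199\n"

cpu_named = pd.read_csv(io.StringIO(sample_csv), header=None, names=CPU_COLUMNS)

# Drop the identifier columns and the estimated performance; PRP is the target.
cpu_named = cpu_named.drop(columns=["vendor", "model", "ERP"])
features = cpu_named.drop(columns=["PRP"])
target = cpu_named["PRP"]
print(list(features.columns), target.iloc[0])
```

Named columns make later steps such as `drop` and the feature/target split easier to audit than positional indices.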


In [15]:
CPU_X: np.ndarray = CPU_data.iloc[:, :-1].values
CPU_y: np.ndarray = CPU_data.iloc[:, -1].values

In [19]:
from sklearn.linear_model import LinearRegression, LogisticRegression, LassoLars, ARDRegression, PassiveAggressiveRegressor, TheilSenRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, DotProduct
from sklearn.svm import SVR

from sklearn.model_selection import cross_validate, GridSearchCV, RandomizedSearchCV

from typing import List, Dict, Union, Any

In [20]:
RegressionModel = Union[LinearRegression, LogisticRegression, LassoLars, ARDRegression, PassiveAggressiveRegressor, TheilSenRegressor,
                       DecisionTreeRegressor, KernelRidge, GaussianProcessRegressor, SVR]
SearchModel = Union[GridSearchCV, RandomizedSearchCV]

In [22]:
def populate_dataFrame(dataFrame: pd.DataFrame, scores: Dict[str, np.ndarray], regression_model: str, grid_search: str, index: int) -> None:
    '''
    Populate one row of a pandas DataFrame with the mean of each cross-validation score.
    :param dataFrame: pandas DataFrame to be populated (modified in place)
    :param scores: dictionary returned by cross_validate, mapping each score name to a NumPy array of per-fold values
    :param regression_model: name of the regression model that produced the scores
    :param grid_search: name of the search strategy used to find the best hyperparameters
    :param index: position of the row of dataFrame to populate
    :returns: None
    '''
    dataFrame.iloc[index, 0] = regression_model
    dataFrame.iloc[index, 1] = grid_search
    dataFrame.iloc[index, 2] = scores['test_neg_mean_absolute_error'].mean()
    dataFrame.iloc[index, 3] = scores['test_neg_mean_squared_error'].mean()
    dataFrame.iloc[index, 4] = scores['test_neg_median_absolute_error'].mean()
    dataFrame.iloc[index, 5] = scores['train_neg_mean_absolute_error'].mean()
    dataFrame.iloc[index, 6] = scores['train_neg_mean_squared_error'].mean()
    dataFrame.iloc[index, 7] = scores['train_neg_median_absolute_error'].mean()
    dataFrame.iloc[index, 8] = scores['fit_time'].mean()
    dataFrame.iloc[index, 9] = scores['score_time'].mean()
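
An alternative to pre-filling the frame with zeros and writing cells by position is to collect one dictionary per experiment and build the frame at the end. The sketch below uses synthetic data, and `summarize_scores` is a hypothetical helper that is not part of this notebook.

```python
from typing import Any, Dict, List

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate


def summarize_scores(scores: Dict[str, np.ndarray], model: str, strategy: str) -> Dict[str, Any]:
    """Average every per-fold array returned by cross_validate into one row."""
    row: Dict[str, Any] = {"Model": model, "Search_strategy": strategy}
    row.update({name: float(np.mean(values)) for name, values in scores.items()})
    return row


# Synthetic stand-in for the real dataset.
X, y = make_regression(n_samples=60, n_features=4, noise=5.0, random_state=0)
scores = cross_validate(
    LinearRegression(), X, y, cv=5,
    scoring=["neg_mean_absolute_error", "neg_mean_squared_error"],
    return_train_score=True,
)

rows: List[Dict[str, Any]] = [summarize_scores(scores, "LinearRegression", "GridSearchCV")]
results = pd.DataFrame(rows)
print(results.columns.tolist())
```

Building the frame from rows keeps string columns as strings and avoids depending on a fixed row index for each model.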


### LinearRegression

In [6]:
from sklearn import metrics

In [23]:
regression: RegressionModel = LinearRegression()
# n_jobs only controls parallelism, so different values are expected to give the same scores.
parameter_grid: Dict[str, List[Any]] = {"fit_intercept":[True, False], "normalize":[True, False], "n_jobs":[None, 2, 4, 6]}

In [25]:
grid_search: SearchModel = GridSearchCV(estimator=regression, param_grid=parameter_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(CPU_X, CPU_y)
best_parameters: Dict[str, Any] = grid_search.best_params_
print(best_parameters)

{'fit_intercept': True, 'n_jobs': None, 'normalize': False}


In [26]:
tuned_regression: RegressionModel = LinearRegression(**best_parameters)
grid_scores: Dict[str, List[float]] = cross_validate(tuned_regression, CPU_X, CPU_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(CPU_model_df, grid_scores, 'LinearRegression', 'GridSearchCV', 0)

In [27]:
random_search: SearchModel = RandomizedSearchCV(estimator=regression, param_distributions=parameter_grid, cv=5, scoring='neg_mean_squared_error')
random_search.fit(CPU_X, CPU_y)
best_parameters: Dict[str, Any] = random_search.best_params_
print(best_parameters)

{'normalize': False, 'n_jobs': 6, 'fit_intercept': True}


In [28]:
tuned_regression: RegressionModel = LinearRegression(**best_parameters)
random_scores: Dict[str, List[float]] = cross_validate(tuned_regression, CPU_X, CPU_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(CPU_model_df, random_scores, 'LinearRegression', 'RandomizedSearchCV', 1)
CPU_model_df.head(n=10)

Unnamed: 0,Model,Search_strategy,test_neg_mean_absolute_error,test_neg_mean_squared_error,test_neg_median_absolute_error,train_neg_mean_absolute_error,train_neg_mean_squared_error,train_neg_median_absolute_error,fit_time,score_time
0,LinearRegression,GridSearchCV,-43.378262,-6383.656697,-27.053792,-36.695674,-3243.698611,-25.581576,0.000836,0.002523
1,LinearRegression,RandomizedSearchCV,-43.378262,-6383.656697,-27.053792,-36.695674,-3243.698611,-25.581576,0.001215,0.001262
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
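
The search-then-evaluate pattern used in this section repeats for every model, so it can be factored into one function. The following is a minimal sketch on synthetic data. `tune_and_score` is a hypothetical helper name, and the grid omits `normalize` because that parameter is not available in every scikit-learn release.

```python
from typing import Any, Dict, List, Tuple

import numpy as np
from sklearn.base import BaseEstimator, clone
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, cross_validate

SCORING = ["neg_mean_absolute_error", "neg_mean_squared_error", "neg_median_absolute_error"]


def tune_and_score(estimator: BaseEstimator, param_grid: Dict[str, List[Any]],
                   X: np.ndarray, y: np.ndarray, cv: int = 5
                   ) -> Tuple[Dict[str, Any], Dict[str, np.ndarray]]:
    """Grid-search the estimator, then cross-validate a fresh copy with the best parameters."""
    search = GridSearchCV(estimator=estimator, param_grid=param_grid, cv=cv,
                          scoring="neg_mean_squared_error")
    search.fit(X, y)
    # clone() keeps the estimator's fixed settings; set_params() applies the tuned ones.
    tuned = clone(estimator).set_params(**search.best_params_)
    scores = cross_validate(tuned, X, y, cv=cv, scoring=SCORING, return_train_score=True)
    return search.best_params_, scores


# Synthetic stand-in for the real dataset.
X, y = make_regression(n_samples=80, n_features=6, noise=10.0, random_state=0)
best_params, scores = tune_and_score(LinearRegression(), {"fit_intercept": [True, False]}, X, y)
print(best_params, scores["test_neg_mean_squared_error"].mean())
```

Scoring the tuned model on the same folds that selected its parameters tends to give an optimistic estimate; a held-out set or nested cross-validation is a common remedy.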


### DecisionTreeRegressor

In [29]:
regression: RegressionModel = DecisionTreeRegressor()
parameter_grid: Dict[str, List[Any]] = {'splitter':['best', 'random'], 'max_depth': [1, 3, 5, 7], 'min_samples_leaf':[1, 2, 3, 4, 5],
                 'min_weight_fraction_leaf':[0.1, 0.2, 0.3, 0.4, 0.5], 'max_features':['auto', 'log2', 'sqrt', None],
                 'max_leaf_nodes':[None, 10, 20, 30, 40, 50]}

In [31]:
grid_search: SearchModel = GridSearchCV(estimator=regression, param_grid=parameter_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(CPU_X, CPU_y)
best_parameters: Dict[str, Any] = grid_search.best_params_
print(best_parameters)

{'max_depth': 5, 'max_features': 'sqrt', 'max_leaf_nodes': 10, 'min_samples_leaf': 2, 'min_weight_fraction_leaf': 0.1, 'splitter': 'best'}


In [32]:
tuned_regression: RegressionModel = DecisionTreeRegressor(**best_parameters)
grid_scores: Dict[str, List[float]] = cross_validate(tuned_regression, CPU_X, CPU_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(CPU_model_df, grid_scores, 'DecisionTreeRegressor', 'GridSearchCV', 2)

In [33]:
random_search: SearchModel = RandomizedSearchCV(estimator=regression, param_distributions=parameter_grid, cv=5, scoring='neg_mean_squared_error')
random_search.fit(CPU_X, CPU_y)
best_parameters: Dict[str, Any] = random_search.best_params_
print(best_parameters)

{'splitter': 'best', 'min_weight_fraction_leaf': 0.3, 'min_samples_leaf': 3, 'max_leaf_nodes': 50, 'max_features': 'sqrt', 'max_depth': 5}


In [34]:
tuned_regression: RegressionModel = DecisionTreeRegressor(**best_parameters)
random_scores: Dict[str, List[float]] = cross_validate(tuned_regression, CPU_X, CPU_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(CPU_model_df, random_scores, 'DecisionTreeRegressor', 'RandomizedSearchCV', 3)

CPU_model_df.head(n=10)

Unnamed: 0,Model,Search_strategy,test_neg_mean_absolute_error,test_neg_mean_squared_error,test_neg_median_absolute_error,train_neg_mean_absolute_error,train_neg_mean_squared_error,train_neg_median_absolute_error,fit_time,score_time
0,LinearRegression,GridSearchCV,-43.378262,-6383.656697,-27.053792,-36.695674,-3243.698611,-25.581576,0.000836,0.002523
1,LinearRegression,RandomizedSearchCV,-43.378262,-6383.656697,-27.053792,-36.695674,-3243.698611,-25.581576,0.001215,0.001262
2,DecisionTreeRegressor,GridSearchCV,-54.243176,-14087.920619,-19.656434,-49.061471,-12064.236053,-16.83751,0.000587,0.002497
3,DecisionTreeRegressor,RandomizedSearchCV,-66.16732,-18731.135958,-27.235594,-68.544491,-18385.452782,-26.036149,0.000994,0.001433
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
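
The two strategies visit very different fractions of this grid. `GridSearchCV` evaluates every combination, while `RandomizedSearchCV` samples `n_iter` candidates (10 by default), which is one plausible reason the randomized run above reports worse scores. The sketch below only counts the combinations of the grid defined in this section and fits no model.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the DecisionTreeRegressor cell above.
parameter_grid = {
    "splitter": ["best", "random"],
    "max_depth": [1, 3, 5, 7],
    "min_samples_leaf": [1, 2, 3, 4, 5],
    "min_weight_fraction_leaf": [0.1, 0.2, 0.3, 0.4, 0.5],
    "max_features": ["auto", "log2", "sqrt", None],
    "max_leaf_nodes": [None, 10, 20, 30, 40, 50],
}

# Number of candidates an exhaustive grid search has to evaluate.
n_combinations = len(ParameterGrid(parameter_grid))

# Default number of candidates sampled by RandomizedSearchCV.
default_n_iter = 10
sampled_fraction = default_n_iter / n_combinations
print(n_combinations, sampled_fraction)
```

Raising `n_iter` and fixing `random_state` on `RandomizedSearchCV` would make the randomized results both more competitive and reproducible.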


### KernelRidge

In [35]:
regression: RegressionModel = KernelRidge()
parameter_grid: Dict[str, List[Any]] = {'kernel':['linear', 'rbf'], 'alpha':[1, 2, 3, 5], 'gamma':[None, 0.1, 0.001]}

In [36]:
grid_search: SearchModel = GridSearchCV(estimator=regression, param_grid=parameter_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(CPU_X, CPU_y)
best_parameters: Dict[str, Any] = grid_search.best_params_
print(best_parameters)

{'alpha': 5, 'gamma': None, 'kernel': 'linear'}


In [37]:
tuned_regression: RegressionModel = KernelRidge(**best_parameters)
grid_scores: Dict[str, List[float]] = cross_validate(tuned_regression, CPU_X, CPU_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(CPU_model_df, grid_scores, 'KernelRidge', 'GridSearchCV', 4)

In [38]:
random_search: SearchModel = RandomizedSearchCV(estimator=regression, param_distributions=parameter_grid, cv=5, scoring='neg_mean_squared_error')
random_search.fit(CPU_X, CPU_y)
best_parameters: Dict[str, Any] = random_search.best_params_
print(best_parameters)

{'kernel': 'linear', 'gamma': 0.1, 'alpha': 5}


In [39]:
tuned_regression: RegressionModel = KernelRidge(**best_parameters)
random_scores: Dict[str, List[float]] = cross_validate(tuned_regression, CPU_X, CPU_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(CPU_model_df, random_scores, 'KernelRidge', 'RandomizedSearchCV', 5)

CPU_model_df.head(n=10)

Unnamed: 0,Model,Search_strategy,test_neg_mean_absolute_error,test_neg_mean_squared_error,test_neg_median_absolute_error,train_neg_mean_absolute_error,train_neg_mean_squared_error,train_neg_median_absolute_error,fit_time,score_time
0,LinearRegression,GridSearchCV,-43.378262,-6383.656697,-27.053792,-36.695674,-3243.698611,-25.581576,0.000836,0.002523
1,LinearRegression,RandomizedSearchCV,-43.378262,-6383.656697,-27.053792,-36.695674,-3243.698611,-25.581576,0.001215,0.001262
2,DecisionTreeRegressor,GridSearchCV,-54.243176,-14087.920619,-19.656434,-49.061471,-12064.236053,-16.83751,0.000587,0.002497
3,DecisionTreeRegressor,RandomizedSearchCV,-66.16732,-18731.135958,-27.235594,-68.544491,-18385.452782,-26.036149,0.000994,0.001433
4,KernelRidge,GridSearchCV,-47.222355,-8067.967074,-30.918495,-39.472751,-3987.727118,-24.955848,0.002393,0.000801
5,KernelRidge,RandomizedSearchCV,-47.222355,-8067.967074,-30.918495,-39.472751,-3987.727118,-24.955848,0.00309,0.002053
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
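
Rows 4 and 5 are identical even though the two searches report different `gamma` values. This is expected: a linear kernel has no `gamma` parameter, so `KernelRidge` ignores it when `kernel='linear'`. A small check on synthetic data (the data and coefficients are made up for illustration):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# Synthetic linear data, used only to compare the two configurations.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=40)

# Same kernel and alpha, different gamma.
pred_default = KernelRidge(kernel="linear", alpha=5, gamma=None).fit(X, y).predict(X)
pred_gamma = KernelRidge(kernel="linear", alpha=5, gamma=0.1).fit(X, y).predict(X)

# gamma is not used by the linear kernel, so the predictions coincide.
same = np.allclose(pred_default, pred_gamma)
print(same)
```

Splitting the grid into one dictionary per kernel, with `gamma` listed only for `rbf`, would avoid evaluating these redundant combinations.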


### GaussianProcessRegressor

In [41]:
from sklearn.preprocessing import MinMaxScaler
scaler: MinMaxScaler = MinMaxScaler()
CPU_X_scaled: np.ndarray = scaler.fit_transform(CPU_X)
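
Fitting the scaler on the full dataset before cross-validation lets each training fold use the feature ranges of its own test rows. A common way to avoid this is to place the scaler in a pipeline, so that it is refit on every training fold. The following is a minimal sketch on synthetic data; the kernel choice and grid are assumptions, and the step name `gaussianprocessregressor` is the one `make_pipeline` generates from the class name.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic features on a large scale, roughly mimicking unscaled hardware attributes.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1000, size=(40, 3))
y = X @ np.array([0.02, 0.01, 0.03]) + rng.normal(scale=0.5, size=40)

# The scaler is part of the estimator, so it is fit on the training folds only.
pipeline = make_pipeline(MinMaxScaler(), GaussianProcessRegressor(kernel=DotProduct()))

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {"gaussianprocessregressor__alpha": [1e-2, 1e-3]}
search = GridSearchCV(pipeline, param_grid=param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_)
```

The same pipeline can be passed to `cross_validate` unchanged, which keeps the scaling step inside each fold there as well.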

In [45]:
regression: RegressionModel = GaussianProcessRegressor()
parameter_grid: List[Dict[str, List[Any]]] = [
    {"alpha": [1e-2, 1e-3], "kernel": [RBF(l) for l in np.logspace(-1, 1, 2)]},
    {"alpha": [1e-2, 1e-3], "kernel": [DotProduct(sigma_0) for sigma_0 in np.logspace(-1, 1, 2)]},
]

In [46]:
grid_search: SearchModel = GridSearchCV(estimator=regression, param_grid=parameter_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(CPU_X_scaled, CPU_y)
best_parameters: Dict[str, Any] = grid_search.best_params_
print(best_parameters)

ABNORMAL_TERMINATION_IN_LNSRCH.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  _check_optimize_result("lbfgs", opt_res)

{'alpha': 0.01, 'kernel': DotProduct(sigma_0=0.1)}



In [47]:
tuned_regression: RegressionModel = GaussianProcessRegressor(**best_parameters)
grid_scores: Dict[str, List[float]] = cross_validate(tuned_regression, CPU_X_scaled, CPU_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(CPU_model_df, grid_scores, 'GaussianProcessRegressor', 'GridSearchCV', 6)

ABNORMAL_TERMINATION_IN_LNSRCH.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  _check_optimize_result("lbfgs", opt_res)


In [48]:
random_search: SearchModel = RandomizedSearchCV(estimator=regression, param_distributions=parameter_grid, cv=5, scoring='neg_mean_squared_error')
random_search.fit(CPU_X_scaled, CPU_y)
best_parameters: Dict[str, Any] = random_search.best_params_
print(best_parameters)

ABNORMAL_TERMINATION_IN_LNSRCH.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  _check_optimize_result("lbfgs", opt_res)

{'kernel': DotProduct(sigma_0=0.1), 'alpha': 0.01}



In [49]:
tuned_regression: RegressionModel = GaussianProcessRegressor(**best_parameters)
random_scores: Dict[str, List[float]] = cross_validate(tuned_regression, CPU_X_scaled, CPU_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(CPU_model_df, random_scores, 'GaussianProcessRegressor', 'RandomizedSearchCV', 7)

CPU_model_df.head(n=10)

ABNORMAL_TERMINATION_IN_LNSRCH.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  _check_optimize_result("lbfgs", opt_res)

Unnamed: 0,Model,Search_strategy,test_neg_mean_absolute_error,test_neg_mean_squared_error,test_neg_median_absolute_error,train_neg_mean_absolute_error,train_neg_mean_squared_error,train_neg_median_absolute_error,fit_time,score_time
0,LinearRegression,GridSearchCV,-43.378262,-6383.656697,-27.053792,-36.695674,-3243.698611,-25.581576,0.000836,0.002523
1,LinearRegression,RandomizedSearchCV,-43.378262,-6383.656697,-27.053792,-36.695674,-3243.698611,-25.581576,0.001215,0.001262
2,DecisionTreeRegressor,GridSearchCV,-54.243176,-14087.920619,-19.656434,-49.061471,-12064.236053,-16.83751,0.000587,0.002497
3,DecisionTreeRegressor,RandomizedSearchCV,-66.16732,-18731.135958,-27.235594,-68.544491,-18385.452782,-26.036149,0.000994,0.001433
4,KernelRidge,GridSearchCV,-47.222355,-8067.967074,-30.918495,-39.472751,-3987.727118,-24.955848,0.002393,0.000801
5,KernelRidge,RandomizedSearchCV,-47.222355,-8067.967074,-30.918495,-39.472751,-3987.727118,-24.955848,0.00309,0.002053
6,GaussianProcessRegressor,GridSearchCV,-43.327007,-6367.059231,-27.07793,-36.664309,-3243.095266,-25.468359,0.120663,0.001997
7,GaussianProcessRegressor,RandomizedSearchCV,-43.327007,-6367.059231,-27.07793,-36.664309,-3243.095266,-25.468359,0.121615,0.000597
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### LogisticRegression

In [50]:
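# Note: LogisticRegression is a classifier, so each distinct target value is treated as a separate class label.
regression: RegressionModel = LogisticRegression()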
parameter_grid: Dict[str, List[Any]] = {'penalty': ['l1', 'l2'], 'C': np.logspace(-4, 4, 20), 'solver':['liblinear'], 'max_iter':[1e4]}

In [51]:
grid_search: SearchModel = GridSearchCV(estimator=regression, param_grid=parameter_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(CPU_X_scaled, CPU_y)
best_parameters: Dict[str, Any] = grid_search.best_params_
print(best_parameters)



{'C': 1438.44988828766, 'max_iter': 10000.0, 'penalty': 'l2', 'solver': 'liblinear'}


In [52]:
tuned_regression: RegressionModel = LogisticRegression(**best_parameters)
grid_scores: Dict[str, List[float]] = cross_validate(tuned_regression, CPU_X_scaled, CPU_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(CPU_model_df, grid_scores, 'LogisticRegression', 'GridSearchCV', 8)



In [53]:
random_search: SearchModel = RandomizedSearchCV(estimator=regression, param_distributions=parameter_grid, cv=5, scoring='neg_mean_squared_error')
random_search.fit(CPU_X_scaled, CPU_y)
best_parameters: Dict[str, Any] = random_search.best_params_
print(best_parameters)



{'solver': 'liblinear', 'penalty': 'l1', 'max_iter': 10000.0, 'C': 1438.44988828766}


In [54]:
tuned_regression: RegressionModel = LogisticRegression(**best_parameters)
random_scores: Dict[str, List[float]] = cross_validate(tuned_regression, CPU_X_scaled, CPU_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(CPU_model_df, random_scores, 'LogisticRegression', 'RandomizedSearchCV', 9)

CPU_model_df.head(n=10)



Unnamed: 0,Model,Search_strategy,test_neg_mean_absolute_error,test_neg_mean_squared_error,test_neg_median_absolute_error,train_neg_mean_absolute_error,train_neg_mean_squared_error,train_neg_median_absolute_error,fit_time,score_time
0,LinearRegression,GridSearchCV,-43.378262,-6383.656697,-27.053792,-36.695674,-3243.698611,-25.581576,0.000836,0.002523
1,LinearRegression,RandomizedSearchCV,-43.378262,-6383.656697,-27.053792,-36.695674,-3243.698611,-25.581576,0.001215,0.001262
2,DecisionTreeRegressor,GridSearchCV,-54.243176,-14087.920619,-19.656434,-49.061471,-12064.236053,-16.83751,0.000587,0.002497
3,DecisionTreeRegressor,RandomizedSearchCV,-66.16732,-18731.135958,-27.235594,-68.544491,-18385.452782,-26.036149,0.000994,0.001433
4,KernelRidge,GridSearchCV,-47.222355,-8067.967074,-30.918495,-39.472751,-3987.727118,-24.955848,0.002393,0.000801
5,KernelRidge,RandomizedSearchCV,-47.222355,-8067.967074,-30.918495,-39.472751,-3987.727118,-24.955848,0.00309,0.002053
6,GaussianProcessRegressor,GridSearchCV,-43.327007,-6367.059231,-27.07793,-36.664309,-3243.095266,-25.468359,0.120663,0.001997
7,GaussianProcessRegressor,RandomizedSearchCV,-43.327007,-6367.059231,-27.07793,-36.664309,-3243.095266,-25.468359,0.121615,0.000597
8,LogisticRegression,GridSearchCV,-47.049245,-8983.801394,-15.8,-12.37346,-659.393449,-3.6,0.09186,0.00244
9,LogisticRegression,RandomizedSearchCV,-45.088153,-10453.114402,-17.7,-8.923154,-437.374137,-0.2,0.943974,0.001049
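
Despite its name, `LogisticRegression` is a classifier: it treats every distinct target value as a class label and can only predict values it saw during training. That is consistent with the small training errors and much larger test errors in rows 8 and 9. A small illustration on synthetic data (the targets are made up for this sketch):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 2))
# Integer-valued "regression" targets; each distinct value becomes a class.
y = np.round(100 * X[:, 0] + 20 * X[:, 1]).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

# Predict on unseen points.
X_new = rng.uniform(0, 1, size=(10, 2))
predictions = model.predict(X_new)

# Every prediction is one of the labels seen during training.
only_seen_labels = set(predictions) <= set(y)
print(len(model.classes_), only_seen_labels)
```

For a continuous target such as published relative performance, a regressor from the list imported earlier is usually the more appropriate choice.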


In [55]:
CPU_model_df.iloc[:, 2:] = np.abs(CPU_model_df.iloc[:, 2:])
CPU_model_df: pd.DataFrame = CPU_model_df.rename(columns={'test_neg_mean_absolute_error': 'test_mean_absolute_error', 'test_neg_mean_squared_error': 'test_mean_squared_error', 
                                            'test_neg_median_absolute_error': 'test_median_absolute_error', 'train_neg_mean_absolute_error': 'train_mean_absolute_error', 
                                            'train_neg_mean_squared_error': 'train_mean_squared_error', 'train_neg_median_absolute_error': 'train_median_absolute_error'})
CPU_model_df.head(n=10)

Unnamed: 0,Model,Search_strategy,test_mean_absolute_error,test_mean_squared_error,test_median_absolute_error,train_mean_absolute_error,train_mean_squared_error,train_median_absolute_error,fit_time,score_time
0,LinearRegression,GridSearchCV,43.378262,6383.656697,27.053792,36.695674,3243.698611,25.581576,0.000836,0.002523
1,LinearRegression,RandomizedSearchCV,43.378262,6383.656697,27.053792,36.695674,3243.698611,25.581576,0.001215,0.001262
2,DecisionTreeRegressor,GridSearchCV,54.243176,14087.920619,19.656434,49.061471,12064.236053,16.83751,0.000587,0.002497
3,DecisionTreeRegressor,RandomizedSearchCV,66.16732,18731.135958,27.235594,68.544491,18385.452782,26.036149,0.000994,0.001433
4,KernelRidge,GridSearchCV,47.222355,8067.967074,30.918495,39.472751,3987.727118,24.955848,0.002393,0.000801
5,KernelRidge,RandomizedSearchCV,47.222355,8067.967074,30.918495,39.472751,3987.727118,24.955848,0.00309,0.002053
6,GaussianProcessRegressor,GridSearchCV,43.327007,6367.059231,27.07793,36.664309,3243.095266,25.468359,0.120663,0.001997
7,GaussianProcessRegressor,RandomizedSearchCV,43.327007,6367.059231,27.07793,36.664309,3243.095266,25.468359,0.121615,0.000597
8,LogisticRegression,GridSearchCV,47.049245,8983.801394,15.8,12.37346,659.393449,3.6,0.09186,0.00244
9,LogisticRegression,RandomizedSearchCV,45.088153,10453.114402,17.7,8.923154,437.374137,0.2,0.943974,0.001049
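
The sign flip and the renaming can also be written without listing each column. This sketch uses a one-row frame with values copied from row 0 above, and assumes that every numeric column should be made non-negative.

```python
import pandas as pd

# One-row stand-in for the results table (values from row 0 above).
scores = pd.DataFrame({
    "Model": ["LinearRegression"],
    "test_neg_mean_absolute_error": [-43.378262],
    "train_neg_mean_absolute_error": [-36.695674],
    "fit_time": [0.000836],
})

# Flip the sign of the numeric columns, then drop the "neg_" marker from the names.
numeric_columns = scores.select_dtypes("number").columns
scores[numeric_columns] = scores[numeric_columns].abs()
scores = scores.rename(columns=lambda name: name.replace("neg_", ""))
print(scores.columns.tolist())
```

Passing a function to `rename` keeps the code correct if further `neg_` metrics are added to the scoring list later.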


In [62]:
styler: pd.io.formats.style.Styler = CPU_model_df.head(10).style.highlight_max(color='tomato').highlight_min(color='lightgreen')
styler

Unnamed: 0,Model,Search_strategy,test_mean_absolute_error,test_mean_squared_error,test_median_absolute_error,train_mean_absolute_error,train_mean_squared_error,train_median_absolute_error,fit_time,score_time
0,LinearRegression,GridSearchCV,43.378262,6383.656697,27.053792,36.695674,3243.698611,25.581576,0.000836,0.002523
1,LinearRegression,RandomizedSearchCV,43.378262,6383.656697,27.053792,36.695674,3243.698611,25.581576,0.001215,0.001262
2,DecisionTreeRegressor,GridSearchCV,54.243176,14087.920619,19.656434,49.061471,12064.236053,16.83751,0.000587,0.002497
3,DecisionTreeRegressor,RandomizedSearchCV,66.16732,18731.135958,27.235594,68.544491,18385.452782,26.036149,0.000994,0.001433
4,KernelRidge,GridSearchCV,47.222355,8067.967074,30.918495,39.472751,3987.727118,24.955848,0.002393,0.000801
5,KernelRidge,RandomizedSearchCV,47.222355,8067.967074,30.918495,39.472751,3987.727118,24.955848,0.00309,0.002053
6,GaussianProcessRegressor,GridSearchCV,43.327007,6367.059231,27.07793,36.664309,3243.095266,25.468359,0.120663,0.001997
7,GaussianProcessRegressor,RandomizedSearchCV,43.327007,6367.059231,27.07793,36.664309,3243.095266,25.468359,0.121615,0.000597
8,LogisticRegression,GridSearchCV,47.049245,8983.801394,15.8,12.37346,659.393449,3.6,0.09186,0.00244
9,LogisticRegression,RandomizedSearchCV,45.088153,10453.114402,17.7,8.923154,437.374137,0.2,0.943974,0.001049


In [67]:
html: str = styler.render()
# A context manager closes the file even if writing fails.
with open("CPU_Computer_Hardware_DataSet.html", "w") as file:
    file.write(html)

## Boston Housing

In [69]:
zeros: np.ndarray = np.zeros(10)
BostonHousing_df: pd.DataFrame = pd.DataFrame({
    'Model':list(zeros), 
    'Search_strategy': list(zeros), 
    'test_neg_mean_absolute_error': list(zeros), 
    'test_neg_mean_squared_error': list(zeros), 
    'test_neg_median_absolute_error': list(zeros),
    'train_neg_mean_absolute_error': list(zeros),
    'train_neg_mean_squared_error': list(zeros),
    'train_neg_median_absolute_error': list(zeros),
    'fit_time': list(zeros),
    'score_time': list(zeros)
})
BostonHousing_df.head()

Unnamed: 0,Model,Search_strategy,test_neg_mean_absolute_error,test_neg_mean_squared_error,test_neg_median_absolute_error,train_neg_mean_absolute_error,train_neg_mean_squared_error,train_neg_median_absolute_error,fit_time,score_time
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [70]:
BH_data: pd.DataFrame = pd.read_csv("data/housing.data", delimiter = r'\s+', header = None)
BH_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2
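
As with the CPU data, descriptive column names make the feature/target split easier to follow. The names below are the ones commonly documented for the Boston Housing data (MEDV being the target) and are an assumption with respect to this notebook. The first two rows shown above are inlined so the snippet runs without `data/housing.data`.

```python
import io

import pandas as pd

# Column names as commonly documented for the Boston Housing data (assumption).
HOUSING_COLUMNS = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                   "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]

# First two rows of the table above, whitespace-separated like the original file.
sample = (
    "0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296.0 15.3 396.90 4.98 24.0\n"
    "0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242.0 17.8 396.90 9.14 21.6\n"
)

housing = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None, names=HOUSING_COLUMNS)

# MEDV is the target; the remaining 13 columns are features.
X_housing = housing.drop(columns=["MEDV"]).values
y_housing = housing["MEDV"].values
print(X_housing.shape, y_housing)
```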


In [71]:
housing_X: np.ndarray = BH_data.iloc[:, :-1].values
housing_y: np.ndarray = BH_data.iloc[:, -1].values

### SupportVectorRegression

In [72]:
regression: RegressionModel = SVR(epsilon=0.01)
parameter_grid: List[Dict[str, List[Any]]] = [{'kernel': ['rbf'], 'gamma': [1e-4, 1e-3, 0.01, 0.1, 0.2, 0.5, 0.6, 0.9], 'C': [1, 10, 100, 1000, 10000]}]

In [73]:
grid_search: SearchModel = GridSearchCV(estimator=regression, param_grid=parameter_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(housing_X, housing_y)
best_parameters: Dict[str, Any] = grid_search.best_params_
print(best_parameters)

{'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}


In [74]:
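# Note: only the searched parameters are passed on, so epsilon=0.01 from the search estimator is not reused here (SVR's default is epsilon=0.1).
tuned_regression: RegressionModel = SVR(**best_parameters)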
grid_scores: Dict[str, List[float]] = cross_validate(tuned_regression, housing_X, housing_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(BostonHousing_df, grid_scores, 'SupportVectorRegression', 'GridSearchCV', 0)

In [75]:
random_search: SearchModel = RandomizedSearchCV(estimator=regression, param_distributions=parameter_grid, cv=5, scoring='neg_mean_squared_error')
random_search.fit(housing_X, housing_y)
best_parameters: Dict[str, Any] = random_search.best_params_
print(best_parameters)

{'kernel': 'rbf', 'gamma': 0.001, 'C': 10}


In [76]:
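# Note: as above, epsilon falls back to SVR's default of 0.1 because it is not part of best_parameters.
tuned_regression: RegressionModel = SVR(**best_parameters)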
random_scores: Dict[str, List[float]] = cross_validate(tuned_regression, housing_X, housing_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(BostonHousing_df, random_scores, 'SupportVectorRegression', 'RandomizedSearchCV', 1)

BostonHousing_df.head(n=10)

Unnamed: 0,Model,Search_strategy,test_neg_mean_absolute_error,test_neg_mean_squared_error,test_neg_median_absolute_error,train_neg_mean_absolute_error,train_neg_mean_squared_error,train_neg_median_absolute_error,fit_time,score_time
0,SupportVectorRegression,GridSearchCV,-5.312487,-60.626008,-3.509054,-2.564362,-22.106605,-1.344217,0.047225,0.011929
1,SupportVectorRegression,RandomizedSearchCV,-5.763953,-70.397232,-4.090828,-2.248393,-21.164556,-0.776284,0.031496,0.009669
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
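
Two details of the cells above are worth a sketch. First, `SVR(**best_parameters)` keeps only the searched parameters, so settings fixed on the search estimator, such as `epsilon=0.01`, are not carried over; `best_estimator_` retains them. Second, RBF kernels are sensitive to feature scale, so standardizing inside a pipeline is a common choice. The example below uses synthetic data and a reduced grid, both of which are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the housing data.
X, y = make_regression(n_samples=80, n_features=5, noise=10.0, random_state=0)

# Scaling happens inside the pipeline, so it is refit on each training fold.
pipeline = make_pipeline(StandardScaler(), SVR(kernel="rbf", epsilon=0.01))
param_grid = {"svr__gamma": [1e-3, 1e-2, 1e-1], "svr__C": [1, 10, 100]}

search = GridSearchCV(pipeline, param_grid=param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

# best_estimator_ is refit with the best parameters and keeps the fixed epsilon.
best_svr = search.best_estimator_.named_steps["svr"]
print(search.best_params_, best_svr.epsilon)
```

Passing `search.best_estimator_` to `cross_validate` would then evaluate exactly the configuration the search selected.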


### LassoLars

In [77]:
regression: RegressionModel = LassoLars()
parameter_grid: List[Dict[str, List[Any]]] = [{'alpha':[0.02, 0.024, 0.015, 0.025, 0.026, 0.03, 0.023, 0.017, 0.033, 0.014, 0.019]}]

In [78]:
grid_search: SearchModel = GridSearchCV(estimator=regression, param_grid=parameter_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(housing_X, housing_y)
best_parameters: Dict[str, Any] = grid_search.best_params_
print(best_parameters)

{'alpha': 0.014}


In [79]:
tuned_regression: RegressionModel = LassoLars(**best_parameters)
grid_scores: Dict[str, List[float]] = cross_validate(tuned_regression, housing_X, housing_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(BostonHousing_df, grid_scores, 'LassoLars', 'GridSearchCV', 2)

In [80]:
random_search: SearchModel = RandomizedSearchCV(estimator=regression, param_distributions=parameter_grid, cv=5, scoring='neg_mean_squared_error')
random_search.fit(housing_X, housing_y)
best_parameters: Dict[str, Any] = random_search.best_params_
print(best_parameters)

{'alpha': 0.014}


In [81]:
tuned_regression: RegressionModel = LassoLars(**best_parameters)
random_scores: Dict[str, List[float]] = cross_validate(tuned_regression, housing_X, housing_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(BostonHousing_df, random_scores, 'LassoLars', 'RandomizedSearchCV', 3)

BostonHousing_df.head(n=10)

Unnamed: 0,Model,Search_strategy,test_neg_mean_absolute_error,test_neg_mean_squared_error,test_neg_median_absolute_error,train_neg_mean_absolute_error,train_neg_mean_squared_error,train_neg_median_absolute_error,fit_time,score_time
0,SupportVectorRegression,GridSearchCV,-5.312487,-60.626008,-3.509054,-2.564362,-22.106605,-1.344217,0.047225,0.011929
1,SupportVectorRegression,RandomizedSearchCV,-5.763953,-70.397232,-4.090828,-2.248393,-21.164556,-0.776284,0.031496,0.009669
2,LassoLars,GridSearchCV,-4.071615,-36.322984,-2.922391,-3.30038,-23.026079,-2.353271,0.006296,0.001486
3,LassoLars,RandomizedSearchCV,-4.071615,-36.322984,-2.922391,-3.30038,-23.026079,-2.353271,0.004523,0.001297
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
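
The search, refit, and cross-validation steps repeated for each model in this notebook could be gathered into one helper. The following is a minimal sketch, not the notebook's own code: `tune_and_score` is a hypothetical name, synthetic data from `make_regression` stands in for `housing_X` and `housing_y`, and averaging each score array is an assumption about how the fold results might be summarized.

```python
from typing import Any, Dict

import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLars
from sklearn.model_selection import GridSearchCV, cross_validate

SCORING = ['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error']


def tune_and_score(estimator, param_grid, X, y, cv: int = 5) -> Dict[str, Any]:
    """Hypothetical helper: grid-search the estimator, then cross-validate the best configuration."""
    search = GridSearchCV(estimator=estimator, param_grid=param_grid, cv=cv,
                          scoring='neg_mean_squared_error')
    search.fit(X, y)
    # An unfitted copy of the best estimator keeps every constructor argument,
    # including those that were not part of the grid.
    tuned = clone(search.best_estimator_)
    scores = cross_validate(tuned, X, y, cv=cv, scoring=SCORING, return_train_score=True)
    # Assumption: each per-fold array is summarized by its mean.
    summary: Dict[str, Any] = {key: float(np.mean(values)) for key, values in scores.items()}
    summary['best_params'] = search.best_params_
    return summary


# Synthetic stand-in data (assumption), not the housing dataset.
X, y = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)
summary = tune_and_score(LassoLars(), {'alpha': [0.014, 0.02, 0.03]}, X, y)
print(summary)
```

The same call could be repeated with `RandomizedSearchCV` by adding a parameter for the search class.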


### ARDRegression

In [82]:
regression: RegressionModel = ARDRegression()
parameter_grid: Dict[str, List[Any]] = [{'tol':[1e-3, 1e-4], 'alpha_1':[1e-6, 1e-5, 1e-7], 'alpha_2':[1e-6, 1e-5, 1e-7]}]

In [83]:
grid_search: SearchModel = GridSearchCV(estimator=regression, param_grid=parameter_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(housing_X, housing_y)
best_parameters: Dict[str, List[Any]] = grid_search.best_params_
print(best_parameters)

{'alpha_1': 1e-07, 'alpha_2': 1e-05, 'tol': 0.0001}


In [84]:
tuned_regression: RegressionModel = ARDRegression(**best_parameters)
grid_scores: Dict[str, List[float]] = cross_validate(tuned_regression, housing_X, housing_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(BostonHousing_df, grid_scores, 'ARDRegression', 'GridSearchCV', 4)

In [85]:
random_search: SearchModel = RandomizedSearchCV(estimator=regression, param_distributions=parameter_grid, cv=5, scoring='neg_mean_squared_error')
random_search.fit(housing_X, housing_y)
best_parameters: Dict[str, List[Any]] = random_search.best_params_
print(best_parameters)

{'tol': 0.0001, 'alpha_2': 1e-07, 'alpha_1': 1e-07}


In [86]:
tuned_regression: RegressionModel = ARDRegression(**best_parameters)
random_scores: Dict[str, List[float]] = cross_validate(tuned_regression, housing_X, housing_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(BostonHousing_df, random_scores, 'ARDRegression', 'RandomizedSearchCV', 5)

BostonHousing_df.head(n=10)

Unnamed: 0,Model,Search_strategy,test_neg_mean_absolute_error,test_neg_mean_squared_error,test_neg_median_absolute_error,train_neg_mean_absolute_error,train_neg_mean_squared_error,train_neg_median_absolute_error,fit_time,score_time
0,SupportVectorRegression,GridSearchCV,-5.312487,-60.626008,-3.509054,-2.564362,-22.106605,-1.344217,0.047225,0.011929
1,SupportVectorRegression,RandomizedSearchCV,-5.763953,-70.397232,-4.090828,-2.248393,-21.164556,-0.776284,0.031496,0.009669
2,LassoLars,GridSearchCV,-4.071615,-36.322984,-2.922391,-3.30038,-23.026079,-2.353271,0.006296,0.001486
3,LassoLars,RandomizedSearchCV,-4.071615,-36.322984,-2.922391,-3.30038,-23.026079,-2.353271,0.004523,0.001297
4,ARDRegression,GridSearchCV,-4.251395,-37.438832,-3.268956,-3.256953,-21.315617,-2.325255,0.007939,0.000907
5,ARDRegression,RandomizedSearchCV,-4.251395,-37.438832,-3.268956,-3.256953,-21.315617,-2.325255,0.005185,0.002054
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### PassiveAggressiveRegressor

In [87]:
regression: RegressionModel = PassiveAggressiveRegressor()
parameter_grid: Dict[str, List[Any]] = [{'C': np.logspace(-4, 4, 20), 'tol':[1e-3, 1e-4]}]

In [88]:
grid_search: SearchModel = GridSearchCV(estimator=regression, param_grid=parameter_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(housing_X, housing_y)
best_parameters: Dict[str, List[Any]] = grid_search.best_params_
print(best_parameters)

{'C': 0.004832930238571752, 'tol': 0.0001}


In [89]:
tuned_regression: RegressionModel = PassiveAggressiveRegressor(**best_parameters)
grid_scores: Dict[str, List[float]] = cross_validate(tuned_regression, housing_X, housing_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(BostonHousing_df, grid_scores, 'PassiveAggressiveRegressor', 'GridSearchCV', 6)

In [90]:
random_search: SearchModel = RandomizedSearchCV(estimator=regression, param_distributions=parameter_grid, cv=5, scoring='neg_mean_squared_error')
random_search.fit(housing_X, housing_y)
best_parameters: Dict[str, List[Any]] = random_search.best_params_
print(best_parameters)

{'tol': 0.001, 'C': 0.23357214690901212}


In [91]:
tuned_regression: RegressionModel = PassiveAggressiveRegressor(**best_parameters)
random_scores: Dict[str, List[float]] = cross_validate(tuned_regression, housing_X, housing_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(BostonHousing_df, random_scores, 'PassiveAggressiveRegressor', 'RandomizedSearchCV', 7)

BostonHousing_df.head(n=10)

Unnamed: 0,Model,Search_strategy,test_neg_mean_absolute_error,test_neg_mean_squared_error,test_neg_median_absolute_error,train_neg_mean_absolute_error,train_neg_mean_squared_error,train_neg_median_absolute_error,fit_time,score_time
0,SupportVectorRegression,GridSearchCV,-5.312487,-60.626008,-3.509054,-2.564362,-22.106605,-1.344217,0.047225,0.011929
1,SupportVectorRegression,RandomizedSearchCV,-5.763953,-70.397232,-4.090828,-2.248393,-21.164556,-0.776284,0.031496,0.009669
2,LassoLars,GridSearchCV,-4.071615,-36.322984,-2.922391,-3.30038,-23.026079,-2.353271,0.006296,0.001486
3,LassoLars,RandomizedSearchCV,-4.071615,-36.322984,-2.922391,-3.30038,-23.026079,-2.353271,0.004523,0.001297
4,ARDRegression,GridSearchCV,-4.251395,-37.438832,-3.268956,-3.256953,-21.315617,-2.325255,0.007939,0.000907
5,ARDRegression,RandomizedSearchCV,-4.251395,-37.438832,-3.268956,-3.256953,-21.315617,-2.325255,0.005185,0.002054
6,PassiveAggressiveRegressor,GridSearchCV,-9.391022,-159.440775,-7.199038,-6.525557,-83.220718,-4.666393,0.00169,0.000914
7,PassiveAggressiveRegressor,RandomizedSearchCV,-10.001846,-164.468879,-9.06097,-8.750915,-125.685045,-7.975561,0.002853,0.001972
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
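
`RandomizedSearchCV` evaluates only `n_iter` sampled candidates (10 by default), so without a fixed seed two runs can select different parameters. That is one possible reason the random-search rows differ from the grid-search rows here. A minimal sketch of a reproducible random search, assuming synthetic data and `Ridge` as a stand-in estimator; `run_search` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption), not the housing dataset.
X, y = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)
param_grid = {'alpha': list(np.logspace(-4, 4, 20)), 'tol': [1e-3, 1e-4]}


def run_search(seed: int) -> dict:
    """Hypothetical helper: one random search with a fixed seed."""
    search = RandomizedSearchCV(estimator=Ridge(), param_distributions=param_grid,
                                n_iter=5, cv=5, scoring='neg_mean_squared_error',
                                random_state=seed)
    search.fit(X, y)
    return search.best_params_


first = run_search(0)
second = run_search(0)
# With the same seed, the same candidates are sampled and the same one wins.
print(first == second)
```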


### TheilSenRegressor

In [92]:
regression: RegressionModel = TheilSenRegressor()
# Note: n_jobs only sets the degree of parallelism, so varying it mainly changes run time.
parameter_grid: Dict[str, List[Any]] = [{'tol':[1e-3, 1e-4], 'n_jobs':[None, 1, 2, 3, 4]}]

In [93]:
grid_search: SearchModel = GridSearchCV(estimator=regression, param_grid=parameter_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(housing_X, housing_y)
best_parameters: Dict[str, List[Any]] = grid_search.best_params_
print(best_parameters)

{'n_jobs': None, 'tol': 0.001}


In [94]:
tuned_regression: RegressionModel = TheilSenRegressor(**best_parameters)
grid_scores: Dict[str, List[float]] = cross_validate(tuned_regression, housing_X, housing_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(BostonHousing_df, grid_scores, 'TheilSenRegressor', 'GridSearchCV', 8)

In [95]:
random_search: SearchModel = RandomizedSearchCV(estimator=regression, param_distributions=parameter_grid, cv=5, scoring='neg_mean_squared_error')
random_search.fit(housing_X, housing_y)
best_parameters: Dict[str, List[Any]] = random_search.best_params_
print(best_parameters)

{'tol': 0.0001, 'n_jobs': None}


In [96]:
tuned_regression: RegressionModel = TheilSenRegressor(**best_parameters)
random_scores: Dict[str, List[float]] = cross_validate(tuned_regression, housing_X, housing_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(BostonHousing_df, random_scores, 'TheilSenRegressor', 'RandomizedSearchCV', 9)

BostonHousing_df.head(n=10)

Unnamed: 0,Model,Search_strategy,test_neg_mean_absolute_error,test_neg_mean_squared_error,test_neg_median_absolute_error,train_neg_mean_absolute_error,train_neg_mean_squared_error,train_neg_median_absolute_error,fit_time,score_time
0,SupportVectorRegression,GridSearchCV,-5.312487,-60.626008,-3.509054,-2.564362,-22.106605,-1.344217,0.047225,0.011929
1,SupportVectorRegression,RandomizedSearchCV,-5.763953,-70.397232,-4.090828,-2.248393,-21.164556,-0.776284,0.031496,0.009669
2,LassoLars,GridSearchCV,-4.071615,-36.322984,-2.922391,-3.30038,-23.026079,-2.353271,0.006296,0.001486
3,LassoLars,RandomizedSearchCV,-4.071615,-36.322984,-2.922391,-3.30038,-23.026079,-2.353271,0.004523,0.001297
4,ARDRegression,GridSearchCV,-4.251395,-37.438832,-3.268956,-3.256953,-21.315617,-2.325255,0.007939,0.000907
5,ARDRegression,RandomizedSearchCV,-4.251395,-37.438832,-3.268956,-3.256953,-21.315617,-2.325255,0.005185,0.002054
6,PassiveAggressiveRegressor,GridSearchCV,-9.391022,-159.440775,-7.199038,-6.525557,-83.220718,-4.666393,0.00169,0.000914
7,PassiveAggressiveRegressor,RandomizedSearchCV,-10.001846,-164.468879,-9.06097,-8.750915,-125.685045,-7.975561,0.002853,0.001972
8,TheilSenRegressor,GridSearchCV,-4.725228,-51.811434,-3.354929,-3.225483,-26.305485,-2.018425,1.666712,0.000206
9,TheilSenRegressor,RandomizedSearchCV,-4.571813,-49.269049,-3.098548,-3.203193,-25.640233,-2.014435,1.739819,0.003031


In [97]:
BostonHousing_df.iloc[:, 2:] = np.abs(BostonHousing_df.iloc[:, 2:])
BostonHousing_df: pd.DataFrame = BostonHousing_df.rename(columns={'test_neg_mean_absolute_error': 'test_mean_absolute_error', 'test_neg_mean_squared_error': 'test_mean_squared_error', 
                                            'test_neg_median_absolute_error': 'test_median_absolute_error', 'train_neg_mean_absolute_error': 'train_mean_absolute_error', 
                                            'train_neg_mean_squared_error': 'train_mean_squared_error', 'train_neg_median_absolute_error': 'train_median_absolute_error'})
BostonHousing_df.head(n=10)

Unnamed: 0,Model,Search_strategy,test_mean_absolute_error,test_mean_squared_error,test_median_absolute_error,train_mean_absolute_error,train_mean_squared_error,train_median_absolute_error,fit_time,score_time
0,SupportVectorRegression,GridSearchCV,5.312487,60.626008,3.509054,2.564362,22.106605,1.344217,0.047225,0.011929
1,SupportVectorRegression,RandomizedSearchCV,5.763953,70.397232,4.090828,2.248393,21.164556,0.776284,0.031496,0.009669
2,LassoLars,GridSearchCV,4.071615,36.322984,2.922391,3.30038,23.026079,2.353271,0.006296,0.001486
3,LassoLars,RandomizedSearchCV,4.071615,36.322984,2.922391,3.30038,23.026079,2.353271,0.004523,0.001297
4,ARDRegression,GridSearchCV,4.251395,37.438832,3.268956,3.256953,21.315617,2.325255,0.007939,0.000907
5,ARDRegression,RandomizedSearchCV,4.251395,37.438832,3.268956,3.256953,21.315617,2.325255,0.005185,0.002054
6,PassiveAggressiveRegressor,GridSearchCV,9.391022,159.440775,7.199038,6.525557,83.220718,4.666393,0.00169,0.000914
7,PassiveAggressiveRegressor,RandomizedSearchCV,10.001846,164.468879,9.06097,8.750915,125.685045,7.975561,0.002853,0.001972
8,TheilSenRegressor,GridSearchCV,4.725228,51.811434,3.354929,3.225483,26.305485,2.018425,1.666712,0.000206
9,TheilSenRegressor,RandomizedSearchCV,4.571813,49.269049,3.098548,3.203193,25.640233,2.014435,1.739819,0.003031
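
scikit-learn's `neg_*` scorers return negated errors, which is why the cell above takes absolute values before renaming. The explicit rename mapping could also be derived from the column names. A minimal sketch on a one-row stand-in frame (the values are copied from the LassoLars row above; the column subset is an assumption made for brevity):

```python
import pandas as pd

df = pd.DataFrame({
    'Model': ['LassoLars'],
    'Search_strategy': ['GridSearchCV'],
    'test_neg_mean_squared_error': [-36.322984],
    'train_neg_mean_absolute_error': [-3.30038],
    'fit_time': [0.006296],
})

# Select only the negated score columns, flip their sign, and drop the 'neg_' marker.
neg_cols = [column for column in df.columns if '_neg_' in column]
df[neg_cols] = df[neg_cols].abs()
df = df.rename(columns={column: column.replace('_neg_', '_') for column in neg_cols})
print(df.columns.tolist())
```

Selecting the columns by name leaves `fit_time` and `score_time` untouched instead of relying on their position.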


In [99]:
styler: pd.io.formats.style.Styler = BostonHousing_df.head(10).style.highlight_max(color='tomato').highlight_min(color='lightgreen')
styler

Unnamed: 0,Model,Search_strategy,test_mean_absolute_error,test_mean_squared_error,test_median_absolute_error,train_mean_absolute_error,train_mean_squared_error,train_median_absolute_error,fit_time,score_time
0,SupportVectorRegression,GridSearchCV,5.312487,60.626008,3.509054,2.564362,22.106605,1.344217,0.047225,0.011929
1,SupportVectorRegression,RandomizedSearchCV,5.763953,70.397232,4.090828,2.248393,21.164556,0.776284,0.031496,0.009669
2,LassoLars,GridSearchCV,4.071615,36.322984,2.922391,3.30038,23.026079,2.353271,0.006296,0.001486
3,LassoLars,RandomizedSearchCV,4.071615,36.322984,2.922391,3.30038,23.026079,2.353271,0.004523,0.001297
4,ARDRegression,GridSearchCV,4.251395,37.438832,3.268956,3.256953,21.315617,2.325255,0.007939,0.000907
5,ARDRegression,RandomizedSearchCV,4.251395,37.438832,3.268956,3.256953,21.315617,2.325255,0.005185,0.002054
6,PassiveAggressiveRegressor,GridSearchCV,9.391022,159.440775,7.199038,6.525557,83.220718,4.666393,0.00169,0.000914
7,PassiveAggressiveRegressor,RandomizedSearchCV,10.001846,164.468879,9.06097,8.750915,125.685045,7.975561,0.002853,0.001972
8,TheilSenRegressor,GridSearchCV,4.725228,51.811434,3.354929,3.225483,26.305485,2.018425,1.666712,0.000206
9,TheilSenRegressor,RandomizedSearchCV,4.571813,49.269049,3.098548,3.203193,25.640233,2.014435,1.739819,0.003031


In [100]:
html: str = styler.render()
# A context manager closes the file even if the write fails.
with open("BostonHousing_DataSet.html", "w") as file:
    file.write(html)

## Wisconsin Breast Cancer

In [101]:
zeros: np.ndarray = np.zeros(10)
WBC_df: pd.DataFrame = pd.DataFrame({
    'Model':list(zeros), 
    'Search_strategy': list(zeros), 
    'test_neg_mean_absolute_error': list(zeros), 
    'test_neg_mean_squared_error': list(zeros), 
    'test_neg_median_absolute_error': list(zeros),
    'train_neg_mean_absolute_error': list(zeros),
    'train_neg_mean_squared_error': list(zeros),
    'train_neg_median_absolute_error': list(zeros),
    'fit_time': list(zeros),
    'score_time': list(zeros)
})
WBC_df.head()

Unnamed: 0,Model,Search_strategy,test_neg_mean_absolute_error,test_neg_mean_squared_error,test_neg_median_absolute_error,train_neg_mean_absolute_error,train_neg_mean_squared_error,train_neg_median_absolute_error,fit_time,score_time
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [102]:
WBC_data: pd.DataFrame = pd.read_csv("data/r_wpbc.data", delimiter=",", header=None)
# Fail early on missing values: a single NaN makes the overall sum NaN.
assert not np.isnan(WBC_data.values.sum())
WBC_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,23,24,25,26,27,28,29,30,31,32
0,5,18.02,27.6,117.5,1013.0,0.09489,0.1036,0.1086,0.07055,0.1865,...,139.7,1436.0,0.1195,0.1926,0.314,0.117,0.2677,0.08113,5.0,31
1,2,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,3.0,61
2,0,21.37,17.44,137.5,1373.0,0.08836,0.1189,0.1255,0.0818,0.2333,...,159.1,1949.0,0.1188,0.3449,0.3414,0.2032,0.4334,0.09067,2.5,116
3,0,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,2.0,123
4,0,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,3.5,27
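
The assertion in the cell above detects missing values through the sum of all entries, which says that something is missing but not where. A per-column count gives more detail. A minimal sketch on a small stand-in frame (the data here is invented for illustration and is not the WPBC file):

```python
import numpy as np
import pandas as pd

# Invented stand-in data with one missing value in column 'a'.
frame = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, 6.0]})

# Count missing entries per column and flag whether any exist at all.
missing_per_column = frame.isna().sum()
has_missing = bool(frame.isna().values.any())

# Show only the columns that actually contain missing values.
print(missing_per_column[missing_per_column > 0])
```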


In [103]:
WBC_X: np.ndarray = WBC_data.iloc[:, :-1].values
WBC_y: np.ndarray = WBC_data.iloc[:, -1].values

### LinearRegression

In [104]:
regression: RegressionModel = LinearRegression()
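# Note: n_jobs only affects computation speed, not the fitted model, so it adds no information to the search.
parameter_grid: Dict[str, List[Any]] = {"fit_intercept":[True, False], "normalize":[True, False], "n_jobs":[None, 2, 4, 6]}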

In [105]:
grid_search: SearchModel = GridSearchCV(estimator=regression, param_grid=parameter_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(WBC_X, WBC_y)
best_parameters: Dict[str, List[Any]] = grid_search.best_params_
print(best_parameters)

{'fit_intercept': False, 'n_jobs': None, 'normalize': True}


In [106]:
tuned_regression: RegressionModel = LinearRegression(**best_parameters)
grid_scores: Dict[str, List[float]] = cross_validate(tuned_regression, WBC_X, WBC_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(WBC_df, grid_scores, 'LinearRegression', 'GridSearchCV', 0)

In [107]:
random_search: SearchModel = RandomizedSearchCV(estimator=regression, param_distributions=parameter_grid, cv=5, scoring='neg_mean_squared_error')
random_search.fit(WBC_X, WBC_y)
best_parameters: Dict[str, List[Any]] = random_search.best_params_
print(best_parameters)

{'normalize': False, 'n_jobs': None, 'fit_intercept': False}


In [108]:
tuned_regression: RegressionModel = LinearRegression(**best_parameters)
random_scores: Dict[str, List[float]] = cross_validate(tuned_regression, WBC_X, WBC_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(WBC_df, random_scores, 'LinearRegression', 'RandomizedSearchCV', 1)
WBC_df.head(n=10)

Unnamed: 0,Model,Search_strategy,test_neg_mean_absolute_error,test_neg_mean_squared_error,test_neg_median_absolute_error,train_neg_mean_absolute_error,train_neg_mean_squared_error,train_neg_median_absolute_error,fit_time,score_time
0,LinearRegression,GridSearchCV,-29.727065,-1339.66409,-28.793519,-22.279819,-745.178079,-19.407585,0.00111,0.001957
1,LinearRegression,RandomizedSearchCV,-29.727065,-1339.66409,-28.793519,-22.279819,-745.178079,-19.407585,0.001103,0.002906
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### DecisionTreeRegressor

In [110]:
regression: RegressionModel = DecisionTreeRegressor()
parameter_grid: Dict[str, List[Any]] = {'splitter':['best', 'random'], 'max_depth': [1, 3, 5, 7], 'min_samples_leaf':[1, 2, 3, 4, 5],
                 'min_weight_fraction_leaf':[0.1, 0.2, 0.3, 0.4, 0.5], 'max_features':['auto', 'log2', 'sqrt', None],
                 'max_leaf_nodes':[None, 10, 20, 30, 40, 50]}

In [111]:
grid_search: SearchModel = GridSearchCV(estimator=regression, param_grid=parameter_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(WBC_X, WBC_y)
best_parameters: Dict[str, List[Any]] = grid_search.best_params_
print(best_parameters)

{'max_depth': 7, 'max_features': 'log2', 'max_leaf_nodes': None, 'min_samples_leaf': 2, 'min_weight_fraction_leaf': 0.1, 'splitter': 'random'}


In [112]:
tuned_regression: RegressionModel = DecisionTreeRegressor(**best_parameters)
grid_scores: Dict[str, List[float]] = cross_validate(tuned_regression, WBC_X, WBC_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(WBC_df, grid_scores, 'DecisionTreeRegressor', 'GridSearchCV', 2)

In [113]:
random_search: SearchModel = RandomizedSearchCV(estimator=regression, param_distributions=parameter_grid, cv=5, scoring='neg_mean_squared_error')
random_search.fit(WBC_X, WBC_y)
best_parameters: Dict[str, List[Any]] = random_search.best_params_
print(best_parameters)

{'splitter': 'random', 'min_weight_fraction_leaf': 0.3, 'min_samples_leaf': 5, 'max_leaf_nodes': 50, 'max_features': None, 'max_depth': 3}


In [114]:
tuned_regression: RegressionModel = DecisionTreeRegressor(**best_parameters)
random_scores: Dict[str, List[float]] = cross_validate(tuned_regression, WBC_X, WBC_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(WBC_df, random_scores, 'DecisionTreeRegressor', 'RandomizedSearchCV', 3)
WBC_df.head(n=10)

Unnamed: 0,Model,Search_strategy,test_neg_mean_absolute_error,test_neg_mean_squared_error,test_neg_median_absolute_error,train_neg_mean_absolute_error,train_neg_mean_squared_error,train_neg_median_absolute_error,fit_time,score_time
0,LinearRegression,GridSearchCV,-29.727065,-1339.66409,-28.793519,-22.279819,-745.178079,-19.407585,0.00111,0.001957
1,LinearRegression,RandomizedSearchCV,-29.727065,-1339.66409,-28.793519,-22.279819,-745.178079,-19.407585,0.001103,0.002906
2,DecisionTreeRegressor,GridSearchCV,-30.501704,-1339.205146,-29.053978,-26.994509,-1026.889172,-24.905828,0.000321,0.001921
3,DecisionTreeRegressor,RandomizedSearchCV,-31.237957,-1342.842846,-27.200967,-27.866783,-1078.577552,-27.123853,0.000574,0.001381
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### KernelRidge

In [115]:
regression: RegressionModel = KernelRidge()
parameter_grid: Dict[str, List[Any]] = {'kernel':['linear', 'rbf'], 'alpha':[1, 2, 3, 5], 'gamma':[None, 0.1, 0.001]}

In [116]:
grid_search: SearchModel = GridSearchCV(estimator=regression, param_grid=parameter_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(WBC_X, WBC_y)
best_parameters: Dict[str, List[Any]] = grid_search.best_params_
print(best_parameters)

{'alpha': 5, 'gamma': None, 'kernel': 'linear'}


In [117]:
tuned_regression: RegressionModel = KernelRidge(**best_parameters)
grid_scores: Dict[str, List[float]] = cross_validate(tuned_regression, WBC_X, WBC_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(WBC_df, grid_scores, 'KernelRidge', 'GridSearchCV', 4)

In [118]:
random_search: SearchModel = RandomizedSearchCV(estimator=regression, param_distributions=parameter_grid, cv=5, scoring='neg_mean_squared_error')
random_search.fit(WBC_X, WBC_y)
best_parameters: Dict[str, List[Any]] = random_search.best_params_
print(best_parameters)

{'kernel': 'linear', 'gamma': 0.1, 'alpha': 5}


In [119]:
tuned_regression: RegressionModel = KernelRidge(**best_parameters)
random_scores: Dict[str, List[float]] = cross_validate(tuned_regression, WBC_X, WBC_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(WBC_df, random_scores, 'KernelRidge', 'RandomizedSearchCV', 5)
WBC_df.head(n=10)

Unnamed: 0,Model,Search_strategy,test_neg_mean_absolute_error,test_neg_mean_squared_error,test_neg_median_absolute_error,train_neg_mean_absolute_error,train_neg_mean_squared_error,train_neg_median_absolute_error,fit_time,score_time
0,LinearRegression,GridSearchCV,-29.727065,-1339.66409,-28.793519,-22.279819,-745.178079,-19.407585,0.00111,0.001957
1,LinearRegression,RandomizedSearchCV,-29.727065,-1339.66409,-28.793519,-22.279819,-745.178079,-19.407585,0.001103,0.002906
2,DecisionTreeRegressor,GridSearchCV,-30.501704,-1339.205146,-29.053978,-26.994509,-1026.889172,-24.905828,0.000321,0.001921
3,DecisionTreeRegressor,RandomizedSearchCV,-31.237957,-1342.842846,-27.200967,-27.866783,-1078.577552,-27.123853,0.000574,0.001381
4,KernelRidge,GridSearchCV,-29.579619,-1253.368318,-29.167524,-24.902682,-907.346397,-22.537228,0.002962,0.002215
5,KernelRidge,RandomizedSearchCV,-29.579619,-1253.368318,-29.167524,-24.902682,-907.346397,-22.537228,0.002659,0.000812
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### SupportVectorRegression

In [120]:
regression: RegressionModel = SVR(epsilon=0.01)
parameter_grid: Dict[str, List[Any]] = [{'kernel': ['rbf'], 'gamma': [1e-4, 1e-3, 0.01, 0.1, 0.2, 0.5, 0.6, 0.9],'C': [1, 10, 100, 1000, 10000]}]

In [121]:
grid_search: SearchModel = GridSearchCV(estimator=regression, param_grid=parameter_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(WBC_X, WBC_y)
best_parameters: Dict[str, List[Any]] = grid_search.best_params_
print(best_parameters)

{'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}


In [122]:
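# Note: epsilon=0.01 from the searched estimator is not carried over here; SVR falls back to its default epsilon=0.1.
tuned_regression: RegressionModel = SVR(**best_parameters)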
grid_scores: Dict[str, List[float]] = cross_validate(tuned_regression, WBC_X, WBC_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(WBC_df, grid_scores, 'SupportVectorRegression', 'GridSearchCV', 6)

In [123]:
random_search: SearchModel = RandomizedSearchCV(estimator=regression, param_distributions=parameter_grid, cv=5, scoring='neg_mean_squared_error')
random_search.fit(WBC_X, WBC_y)
best_parameters: Dict[str, List[Any]] = random_search.best_params_
print(best_parameters)

{'kernel': 'rbf', 'gamma': 0.9, 'C': 1000}


In [124]:
tuned_regression: RegressionModel = SVR(**best_parameters)
random_scores: Dict[str, List[Any]] = cross_validate(tuned_regression, WBC_X, WBC_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(WBC_df, random_scores, 'SupportVectorRegression', 'RandomizedSearchCV', 7)
WBC_df.head(n=10)

Unnamed: 0,Model,Search_strategy,test_neg_mean_absolute_error,test_neg_mean_squared_error,test_neg_median_absolute_error,train_neg_mean_absolute_error,train_neg_mean_squared_error,train_neg_median_absolute_error,fit_time,score_time
0,LinearRegression,GridSearchCV,-29.727065,-1339.66409,-28.793519,-22.279819,-745.178079,-19.407585,0.00111,0.001957
1,LinearRegression,RandomizedSearchCV,-29.727065,-1339.66409,-28.793519,-22.279819,-745.178079,-19.407585,0.001103,0.002906
2,DecisionTreeRegressor,GridSearchCV,-30.501704,-1339.205146,-29.053978,-26.994509,-1026.889172,-24.905828,0.000321,0.001921
3,DecisionTreeRegressor,RandomizedSearchCV,-31.237957,-1342.842846,-27.200967,-27.866783,-1078.577552,-27.123853,0.000574,0.001381
4,KernelRidge,GridSearchCV,-29.579619,-1253.368318,-29.167524,-24.902682,-907.346397,-22.537228,0.002962,0.002215
5,KernelRidge,RandomizedSearchCV,-29.579619,-1253.368318,-29.167524,-24.902682,-907.346397,-22.537228,0.002659,0.000812
6,SupportVectorRegression,GridSearchCV,-31.196103,-1368.839019,-29.596207,-0.09993,-0.009991,-0.100004,0.007265,0.003468
7,SupportVectorRegression,RandomizedSearchCV,-31.779892,-1386.7938,-30.447742,-0.099898,-0.009982,-0.099827,0.007891,0.002999
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
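
The searches in this subsection run on `SVR(epsilon=0.01)`, but `SVR(**best_parameters)` builds a fresh estimator, so `epsilon` reverts to the library default of 0.1. The training median absolute error of about 0.1 in the table is consistent with that. A minimal sketch of the difference, using `clone` and `set_params` as one possible way to keep the original constructor arguments (the variable names are illustrative):

```python
from sklearn.base import clone
from sklearn.svm import SVR

base = SVR(epsilon=0.01)
best = {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}

# Rebuilding from the best parameters alone drops epsilon=0.01.
rebuilt = SVR(**best)
# Cloning the searched estimator and updating it keeps epsilon=0.01.
carried = clone(base).set_params(**best)

print(rebuilt.epsilon, carried.epsilon)
```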


### ARDRegression

In [125]:
regression: RegressionModel = ARDRegression()
parameter_grid: Dict[str, List[Any]] = [{'tol':[1e-3, 1e-4], 'alpha_1':[1e-6, 1e-5, 1e-7], 'alpha_2':[1e-6, 1e-5, 1e-7]}]

In [126]:
grid_search: SearchModel = GridSearchCV(estimator=regression, param_grid=parameter_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(WBC_X, WBC_y)
best_parameters: Dict[str, List[Any]] = grid_search.best_params_
print(best_parameters)

{'alpha_1': 1e-07, 'alpha_2': 1e-05, 'tol': 0.0001}


In [127]:
tuned_regression: RegressionModel = ARDRegression(**best_parameters)
grid_scores: Dict[str, List[float]] = cross_validate(tuned_regression, WBC_X, WBC_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(WBC_df, grid_scores, 'ARDRegression', 'GridSearchCV', 8)

In [128]:
random_search: SearchModel = RandomizedSearchCV(estimator=regression, param_distributions=parameter_grid, cv=5, scoring='neg_mean_squared_error')
random_search.fit(WBC_X, WBC_y)
best_parameters: Dict[str, List[Any]] = random_search.best_params_
print(best_parameters)

{'tol': 0.0001, 'alpha_2': 1e-05, 'alpha_1': 1e-07}


In [129]:
tuned_regression: RegressionModel = ARDRegression(**best_parameters)
random_scores: Dict[str, List[float]] = cross_validate(tuned_regression, WBC_X, WBC_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(WBC_df, random_scores, 'ARDRegression', 'RandomizedSearchCV', 9)
WBC_df.head(n=10)

Unnamed: 0,Model,Search_strategy,test_neg_mean_absolute_error,test_neg_mean_squared_error,test_neg_median_absolute_error,train_neg_mean_absolute_error,train_neg_mean_squared_error,train_neg_median_absolute_error,fit_time,score_time
0,LinearRegression,GridSearchCV,-29.727065,-1339.66409,-28.793519,-22.279819,-745.178079,-19.407585,0.00111,0.001957
1,LinearRegression,RandomizedSearchCV,-29.727065,-1339.66409,-28.793519,-22.279819,-745.178079,-19.407585,0.001103,0.002906
2,DecisionTreeRegressor,GridSearchCV,-30.501704,-1339.205146,-29.053978,-26.994509,-1026.889172,-24.905828,0.000321,0.001921
3,DecisionTreeRegressor,RandomizedSearchCV,-31.237957,-1342.842846,-27.200967,-27.866783,-1078.577552,-27.123853,0.000574,0.001381
4,KernelRidge,GridSearchCV,-29.579619,-1253.368318,-29.167524,-24.902682,-907.346397,-22.537228,0.002962,0.002215
5,KernelRidge,RandomizedSearchCV,-29.579619,-1253.368318,-29.167524,-24.902682,-907.346397,-22.537228,0.002659,0.000812
6,SupportVectorRegression,GridSearchCV,-31.196103,-1368.839019,-29.596207,-0.09993,-0.009991,-0.100004,0.007265,0.003468
7,SupportVectorRegression,RandomizedSearchCV,-31.779892,-1386.7938,-30.447742,-0.099898,-0.009982,-0.099827,0.007891,0.002999
8,ARDRegression,GridSearchCV,-29.745827,-1282.080226,-27.580134,-24.105382,-853.213228,-21.831935,0.260768,0.001656
9,ARDRegression,RandomizedSearchCV,-29.745827,-1282.080226,-27.580134,-24.105382,-853.213228,-21.831935,0.213841,0.001902


In [130]:
WBC_df.iloc[:, 2:] = np.abs(WBC_df.iloc[:, 2:])
WBC_df: pd.DataFrame = WBC_df.rename(columns={'test_neg_mean_absolute_error': 'test_mean_absolute_error', 'test_neg_mean_squared_error': 'test_mean_squared_error', 
                                            'test_neg_median_absolute_error': 'test_median_absolute_error', 'train_neg_mean_absolute_error': 'train_mean_absolute_error', 
                                            'train_neg_mean_squared_error': 'train_mean_squared_error', 'train_neg_median_absolute_error': 'train_median_absolute_error'})
WBC_df.head(n=10)

Unnamed: 0,Model,Search_strategy,test_mean_absolute_error,test_mean_squared_error,test_median_absolute_error,train_mean_absolute_error,train_mean_squared_error,train_median_absolute_error,fit_time,score_time
0,LinearRegression,GridSearchCV,29.727065,1339.66409,28.793519,22.279819,745.178079,19.407585,0.00111,0.001957
1,LinearRegression,RandomizedSearchCV,29.727065,1339.66409,28.793519,22.279819,745.178079,19.407585,0.001103,0.002906
2,DecisionTreeRegressor,GridSearchCV,30.501704,1339.205146,29.053978,26.994509,1026.889172,24.905828,0.000321,0.001921
3,DecisionTreeRegressor,RandomizedSearchCV,31.237957,1342.842846,27.200967,27.866783,1078.577552,27.123853,0.000574,0.001381
4,KernelRidge,GridSearchCV,29.579619,1253.368318,29.167524,24.902682,907.346397,22.537228,0.002962,0.002215
5,KernelRidge,RandomizedSearchCV,29.579619,1253.368318,29.167524,24.902682,907.346397,22.537228,0.002659,0.000812
6,SupportVectorRegression,GridSearchCV,31.196103,1368.839019,29.596207,0.09993,0.009991,0.100004,0.007265,0.003468
7,SupportVectorRegression,RandomizedSearchCV,31.779892,1386.7938,30.447742,0.099898,0.009982,0.099827,0.007891,0.002999
8,ARDRegression,GridSearchCV,29.745827,1282.080226,27.580134,24.105382,853.213228,21.831935,0.260768,0.001656
9,ARDRegression,RandomizedSearchCV,29.745827,1282.080226,27.580134,24.105382,853.213228,21.831935,0.213841,0.001902


In [132]:
styler: pd.io.formats.style.Styler = WBC_df.head(10).style.highlight_max(color='tomato').highlight_min(color='lightgreen')
styler

Unnamed: 0,Model,Search_strategy,test_mean_absolute_error,test_mean_squared_error,test_median_absolute_error,train_mean_absolute_error,train_mean_squared_error,train_median_absolute_error,fit_time,score_time
0,LinearRegression,GridSearchCV,29.727065,1339.66409,28.793519,22.279819,745.178079,19.407585,0.00111,0.001957
1,LinearRegression,RandomizedSearchCV,29.727065,1339.66409,28.793519,22.279819,745.178079,19.407585,0.001103,0.002906
2,DecisionTreeRegressor,GridSearchCV,30.501704,1339.205146,29.053978,26.994509,1026.889172,24.905828,0.000321,0.001921
3,DecisionTreeRegressor,RandomizedSearchCV,31.237957,1342.842846,27.200967,27.866783,1078.577552,27.123853,0.000574,0.001381
4,KernelRidge,GridSearchCV,29.579619,1253.368318,29.167524,24.902682,907.346397,22.537228,0.002962,0.002215
5,KernelRidge,RandomizedSearchCV,29.579619,1253.368318,29.167524,24.902682,907.346397,22.537228,0.002659,0.000812
6,SupportVectorRegression,GridSearchCV,31.196103,1368.839019,29.596207,0.09993,0.009991,0.100004,0.007265,0.003468
7,SupportVectorRegression,RandomizedSearchCV,31.779892,1386.7938,30.447742,0.099898,0.009982,0.099827,0.007891,0.002999
8,ARDRegression,GridSearchCV,29.745827,1282.080226,27.580134,24.105382,853.213228,21.831935,0.260768,0.001656
9,ARDRegression,RandomizedSearchCV,29.745827,1282.080226,27.580134,24.105382,853.213228,21.831935,0.213841,0.001902


In [133]:
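# Styler.render() matches the pandas version used here; pandas 1.4+ deprecates it in favour of Styler.to_html().
html: str = styler.render()
# A context manager closes the file even if the write fails.
with open("WisconsinBreastCancer_DataSet.html", "w") as file:
    file.write(html)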

## Communities and Crime

In [139]:
zeros: np.ndarray = np.zeros(10)
com_crime_df: pd.DataFrame = pd.DataFrame({
    'Model':list(zeros), 
    'Search_strategy': list(zeros), 
    'test_neg_mean_absolute_error': list(zeros), 
    'test_neg_mean_squared_error': list(zeros), 
    'test_neg_median_absolute_error': list(zeros),
    'train_neg_mean_absolute_error': list(zeros),
    'train_neg_mean_squared_error': list(zeros),
    'train_neg_median_absolute_error': list(zeros),
    'fit_time': list(zeros),
    'score_time': list(zeros)
})
com_crime_df.head()

Unnamed: 0,Model,Search_strategy,test_neg_mean_absolute_error,test_neg_mean_squared_error,test_neg_median_absolute_error,train_neg_mean_absolute_error,train_neg_mean_squared_error,train_neg_median_absolute_error,fit_time,score_time
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [140]:
# Load the raw data; the file has no header row.
com_crime_data: pd.DataFrame = pd.read_csv("data/communities.data", delimiter=",", header=None)
# Drop the first five columns, then discard every row that contains a missing value (encoded as "?").
com_crime_data = com_crime_data.drop(range(5), axis=1)
com_crime_data = com_crime_data.replace("?", np.nan)
com_crime_data = com_crime_data.dropna()
com_crime_data.head()

Unnamed: 0,5,6,7,8,9,10,11,12,13,14,...,118,119,120,121,122,123,124,125,126,127
0,0.19,0.33,0.02,0.9,0.12,0.17,0.34,0.47,0.29,0.32,...,0.12,0.26,0.2,0.06,0.04,0.9,0.5,0.32,0.14,0.2
16,0.15,0.31,0.4,0.63,0.14,0.06,0.58,0.72,0.65,0.47,...,0.06,0.39,0.84,0.06,0.06,0.91,0.5,0.88,0.26,0.49
20,0.25,0.54,0.05,0.71,0.48,0.3,0.42,0.48,0.28,0.32,...,0.09,0.46,0.05,0.09,0.05,0.88,0.5,0.76,0.13,0.34
21,1.0,0.42,0.47,0.59,0.12,0.05,0.41,0.53,0.34,0.33,...,1.0,0.07,0.15,1.0,0.35,0.73,0.0,0.31,0.21,0.69
23,0.11,0.43,0.04,0.89,0.09,0.06,0.45,0.48,0.31,0.46,...,0.16,0.12,0.07,0.04,0.01,0.81,1.0,0.56,0.09,0.63


In [141]:
com_crime_X: np.ndarray = com_crime_data.iloc[:, :-1].values
com_crime_y: np.ndarray = com_crime_data.iloc[:, -1].values

### SupportVectorRegression

In [142]:
regression: RegressionModel = SVR(epsilon=0.01)
parameter_grid: Dict[str, List[Any]] = [{'kernel': ['rbf'], 'gamma': [1e-4, 1e-3, 0.01, 0.1, 0.2, 0.5, 0.6, 0.9],'C': [1, 10, 100, 1000, 10000]}]

In [143]:
grid_search: SearchModel = GridSearchCV(estimator=regression, param_grid=parameter_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(com_crime_X, com_crime_y)
best_parameters: Dict[str, List[Any]] = grid_search.best_params_
print(best_parameters)

{'C': 1, 'gamma': 0.01, 'kernel': 'rbf'}


In [144]:
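# NOTE: only the searched parameters are passed here, so epsilon falls back to SVR's default (0.1) instead of the 0.01 used during the search.
tuned_regression: RegressionModel = SVR(**best_parameters)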
grid_scores: Dict[str, List[float]] = cross_validate(tuned_regression, com_crime_X, com_crime_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(com_crime_df, grid_scores, 'SupportVectorRegression', 'GridSearchCV', 0)

In [145]:
random_search: SearchModel = RandomizedSearchCV(estimator=regression, param_distributions=parameter_grid, cv=5, scoring='neg_mean_squared_error')
random_search.fit(com_crime_X, com_crime_y)
best_parameters: Dict[str, List[Any]] = random_search.best_params_
print(best_parameters)

{'kernel': 'rbf', 'gamma': 0.01, 'C': 1}


In [146]:
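# NOTE: only the searched parameters are passed here, so epsilon falls back to SVR's default (0.1) instead of the 0.01 used during the search.
tuned_regression: RegressionModel = SVR(**best_parameters)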
random_scores: Dict[str, List[float]] = cross_validate(tuned_regression, com_crime_X, com_crime_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(com_crime_df, random_scores, 'SupportVectorRegression', 'RandomizedSearchCV', 1)

com_crime_df.head(n=10)

Unnamed: 0,Model,Search_strategy,test_neg_mean_absolute_error,test_neg_mean_squared_error,test_neg_median_absolute_error,train_neg_mean_absolute_error,train_neg_mean_squared_error,train_neg_median_absolute_error,fit_time,score_time
0,SupportVectorRegression,GridSearchCV,-0.129862,-0.027126,-0.1016,-0.119847,-0.023508,-0.099283,0.009369,0.006354
1,SupportVectorRegression,RandomizedSearchCV,-0.129862,-0.027126,-0.1016,-0.119847,-0.023508,-0.099283,0.011629,0.005639
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
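
The search, refit, and cross-validation steps above are repeated for every model in this notebook. The following is a minimal sketch of how they could be wrapped in a single helper. The function name `tune_and_score`, the synthetic data, and the reduced grid are assumptions made for illustration and are not part of the original notebook. The sketch uses `clone(...).set_params(...)` so that constructor arguments outside the grid (such as `epsilon=0.01`) are kept in the tuned estimator.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_validate
from sklearn.svm import SVR

SCORING = ['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error']


def tune_and_score(estimator, param_grid, X, y, strategy='grid', cv=5):
    """Hypothetical helper: search hyperparameters, then cross-validate the tuned model."""
    if strategy == 'grid':
        search = GridSearchCV(estimator, param_grid, cv=cv, scoring='neg_mean_squared_error')
    else:
        # n_iter=10 is scikit-learn's default; random_state makes the draw repeatable.
        search = RandomizedSearchCV(estimator, param_grid, n_iter=10, cv=cv,
                                    scoring='neg_mean_squared_error', random_state=0)
    search.fit(X, y)
    # clone + set_params keeps the estimator's other arguments (e.g. epsilon).
    tuned = clone(estimator).set_params(**search.best_params_)
    scores = cross_validate(tuned, X, y, cv=cv, scoring=SCORING, return_train_score=True)
    return search.best_params_, tuned, scores


# Synthetic stand-in data (an assumption, not the Communities and Crime set).
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 1, size=(80, 3))
y_demo = X_demo @ np.array([0.5, -0.2, 0.1]) + rng.normal(0, 0.05, size=80)

grid = {'kernel': ['rbf'], 'gamma': [0.01, 0.1, 0.5], 'C': [1, 10]}
best_params, tuned_model, demo_scores = tune_and_score(SVR(epsilon=0.01), grid, X_demo, y_demo)
print(best_params)
print(tuned_model.epsilon)
```

The returned `demo_scores` dictionary has the same keys that `populate_dataFrame` receives in the cells above, so the helper could feed it directly.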


### LassoLars

In [147]:
regression: RegressionModel = LassoLars()
parameter_grid: Dict[str, List[Any]] = [{'alpha':[0.02, 0.024, 0.015, 0.025, 0.026, 0.03, 0.023, 0.017, 0.033, 0.014, 0.019]}]

In [148]:
grid_search: SearchModel = GridSearchCV(estimator=regression, param_grid=parameter_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(com_crime_X, com_crime_y)
best_parameters: Dict[str, List[Any]] = grid_search.best_params_
print(best_parameters)

{'alpha': 0.02}


In [150]:
tuned_regression: RegressionModel = LassoLars(**best_parameters)
grid_scores: Dict[str, List[float]] = cross_validate(tuned_regression, com_crime_X, com_crime_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(com_crime_df, grid_scores, 'LassoLars', 'GridSearchCV', 2)

In [151]:
random_search: SearchModel = RandomizedSearchCV(estimator=regression, param_distributions=parameter_grid, cv=5, scoring='neg_mean_squared_error')
random_search.fit(com_crime_X, com_crime_y)
best_parameters: Dict[str, List[Any]] = random_search.best_params_
print(best_parameters)

{'alpha': 0.019}


In [152]:
tuned_regression: RegressionModel = LassoLars(**best_parameters)
random_scores: Dict[str, List[float]] = cross_validate(tuned_regression, com_crime_X, com_crime_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(com_crime_df, random_scores, 'LassoLars', 'RandomizedSearchCV', 3)

com_crime_df.head(n=10)

Unnamed: 0,Model,Search_strategy,test_neg_mean_absolute_error,test_neg_mean_squared_error,test_neg_median_absolute_error,train_neg_mean_absolute_error,train_neg_mean_squared_error,train_neg_median_absolute_error,fit_time,score_time
0,SupportVectorRegression,GridSearchCV,-0.129862,-0.027126,-0.1016,-0.119847,-0.023508,-0.099283,0.009369,0.006354
1,SupportVectorRegression,RandomizedSearchCV,-0.129862,-0.027126,-0.1016,-0.119847,-0.023508,-0.099283,0.011629,0.005639
2,LassoLars,GridSearchCV,-0.233331,-0.076662,-0.223616,-0.232308,-0.076076,-0.223195,0.007156,0.001755
3,LassoLars,RandomizedSearchCV,-0.233331,-0.076662,-0.223616,-0.232308,-0.076076,-0.223195,0.006383,0.001791
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### ARDRegression

In [153]:
regression: RegressionModel = ARDRegression()
parameter_grid: Dict[str, List[Any]] = [{'tol':[1e-3, 1e-4], 'alpha_1':[1e-6, 1e-5, 1e-7], 'alpha_2':[1e-6, 1e-5, 1e-7]}]

In [154]:
grid_search: SearchModel = GridSearchCV(estimator=regression, param_grid=parameter_grid, cv=5, scoring='neg_mean_squared_error', n_jobs = -1)
grid_search.fit(com_crime_X, com_crime_y)
best_parameters: Dict[str, List[Any]] = grid_search.best_params_
print(best_parameters)

{'alpha_1': 1e-07, 'alpha_2': 1e-05, 'tol': 0.001}


In [155]:
tuned_regression: RegressionModel = ARDRegression(**best_parameters)
grid_scores: Dict[str, List[float]] = cross_validate(tuned_regression, com_crime_X, com_crime_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(com_crime_df, grid_scores, 'ARDRegression', 'GridSearchCV', 4)

In [156]:
random_search: SearchModel = RandomizedSearchCV(estimator=regression, param_distributions=parameter_grid, cv=5, scoring='neg_mean_squared_error', n_jobs = -1)
random_search.fit(com_crime_X, com_crime_y)
best_parameters: Dict[str, List[Any]] = random_search.best_params_
print(best_parameters)

{'tol': 0.001, 'alpha_2': 1e-05, 'alpha_1': 1e-06}


In [157]:
tuned_regression: RegressionModel = ARDRegression(**best_parameters)
random_scores: Dict[str, List[float]] = cross_validate(tuned_regression, com_crime_X, com_crime_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(com_crime_df, random_scores, 'ARDRegression', 'RandomizedSearchCV', 5)

com_crime_df.head(n=10)

Unnamed: 0,Model,Search_strategy,test_neg_mean_absolute_error,test_neg_mean_squared_error,test_neg_median_absolute_error,train_neg_mean_absolute_error,train_neg_mean_squared_error,train_neg_median_absolute_error,fit_time,score_time
0,SupportVectorRegression,GridSearchCV,-0.129862,-0.027126,-0.1016,-0.119847,-0.023508,-0.099283,0.009369,0.006354
1,SupportVectorRegression,RandomizedSearchCV,-0.129862,-0.027126,-0.1016,-0.119847,-0.023508,-0.099283,0.011629,0.005639
2,LassoLars,GridSearchCV,-0.233331,-0.076662,-0.223616,-0.232308,-0.076076,-0.223195,0.007156,0.001755
3,LassoLars,RandomizedSearchCV,-0.233331,-0.076662,-0.223616,-0.232308,-0.076076,-0.223195,0.006383,0.001791
4,ARDRegression,GridSearchCV,-0.13194,-0.028597,-0.107857,-0.107035,-0.018688,-0.085781,0.587935,0.003093
5,ARDRegression,RandomizedSearchCV,-0.13194,-0.028597,-0.107857,-0.107035,-0.018688,-0.085781,0.586385,0.003657
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### PassiveAggressiveRegressor

In [158]:
regression: RegressionModel = PassiveAggressiveRegressor()
parameter_grid: Dict[str, List[Any]] = [{'C': np.logspace(-4, 4, 20), 'tol':[1e-3, 1e-4]}]

In [159]:
grid_search: SearchModel = GridSearchCV(estimator=regression, param_grid=parameter_grid, cv=5, scoring='neg_mean_squared_error', n_jobs = -1)
grid_search.fit(com_crime_X, com_crime_y)
best_parameters: Dict[str, List[Any]] = grid_search.best_params_
print(best_parameters)

{'C': 0.0006951927961775605, 'tol': 0.0001}


In [160]:
tuned_regression: RegressionModel = PassiveAggressiveRegressor(**best_parameters)
grid_scores: Dict[str, List[float]] = cross_validate(tuned_regression, com_crime_X, com_crime_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(com_crime_df, grid_scores, 'PassiveAggressiveRegressor', 'GridSearchCV', 6)

In [161]:
random_search: SearchModel = RandomizedSearchCV(estimator=regression, param_distributions=parameter_grid, cv=5, scoring='neg_mean_squared_error', n_jobs = -1)
random_search.fit(com_crime_X, com_crime_y)
best_parameters: Dict[str, List[Any]] = random_search.best_params_
print(best_parameters)

{'tol': 0.0001, 'C': 0.0018329807108324356}


In [162]:
tuned_regression: RegressionModel = PassiveAggressiveRegressor(**best_parameters)
random_scores: Dict[str, List[float]] = cross_validate(tuned_regression, com_crime_X, com_crime_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(com_crime_df, random_scores, 'PassiveAggressiveRegressor', 'RandomizedSearchCV', 7)

com_crime_df.head(n=10)

Unnamed: 0,Model,Search_strategy,test_neg_mean_absolute_error,test_neg_mean_squared_error,test_neg_median_absolute_error,train_neg_mean_absolute_error,train_neg_mean_squared_error,train_neg_median_absolute_error,fit_time,score_time
0,SupportVectorRegression,GridSearchCV,-0.129862,-0.027126,-0.1016,-0.119847,-0.023508,-0.099283,0.009369,0.006354
1,SupportVectorRegression,RandomizedSearchCV,-0.129862,-0.027126,-0.1016,-0.119847,-0.023508,-0.099283,0.011629,0.005639
2,LassoLars,GridSearchCV,-0.233331,-0.076662,-0.223616,-0.232308,-0.076076,-0.223195,0.007156,0.001755
3,LassoLars,RandomizedSearchCV,-0.233331,-0.076662,-0.223616,-0.232308,-0.076076,-0.223195,0.006383,0.001791
4,ARDRegression,GridSearchCV,-0.13194,-0.028597,-0.107857,-0.107035,-0.018688,-0.085781,0.587935,0.003093
5,ARDRegression,RandomizedSearchCV,-0.13194,-0.028597,-0.107857,-0.107035,-0.018688,-0.085781,0.586385,0.003657
6,PassiveAggressiveRegressor,GridSearchCV,-0.128225,-0.027286,-0.100526,-0.118693,-0.023896,-0.093731,0.008769,0.003662
7,PassiveAggressiveRegressor,RandomizedSearchCV,-0.136498,-0.029677,-0.117013,-0.125962,-0.024801,-0.108881,0.010287,0.002146
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### TheilSenRegressor

In [163]:
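regression: RegressionModel = TheilSenRegressor()
# NOTE: n_jobs only controls parallelism and should not change the fitted model; no random_state is fixed here,
# so score differences between runs most likely come from TheilSenRegressor's random subsampling.
parameter_grid: Dict[str, List[Any]] = [{'tol':[1e-3, 1e-4], 'n_jobs':[None, 1, 2, 3, 4]}]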

In [164]:
grid_search: SearchModel = GridSearchCV(estimator=regression, param_grid=parameter_grid, cv=5, scoring='neg_mean_squared_error', n_jobs = -1)
grid_search.fit(com_crime_X, com_crime_y)
best_parameters: Dict[str, List[Any]] = grid_search.best_params_
print(best_parameters)

{'n_jobs': None, 'tol': 0.0001}


In [165]:
tuned_regression: RegressionModel = TheilSenRegressor(**best_parameters)
grid_scores: Dict[str, List[float]] = cross_validate(tuned_regression, com_crime_X, com_crime_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(com_crime_df, grid_scores, 'TheilSenRegressor', 'GridSearchCV', 8)

In [166]:
random_search: SearchModel = RandomizedSearchCV(estimator=regression, param_distributions=parameter_grid, cv=5, scoring='neg_mean_squared_error', n_jobs = -1)
random_search.fit(com_crime_X, com_crime_y)
best_parameters: Dict[str, List[Any]] = random_search.best_params_
print(best_parameters)

{'tol': 0.001, 'n_jobs': 3}


In [167]:
tuned_regression: RegressionModel = TheilSenRegressor(**best_parameters)
random_scores: Dict[str, List[float]] = cross_validate(tuned_regression, com_crime_X, com_crime_y, cv=5, scoring=['neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_median_absolute_error'], return_train_score=True)
populate_dataFrame(com_crime_df, random_scores, 'TheilSenRegressor', 'RandomizedSearchCV', 9)

com_crime_df.head(n=10)

Unnamed: 0,Model,Search_strategy,test_neg_mean_absolute_error,test_neg_mean_squared_error,test_neg_median_absolute_error,train_neg_mean_absolute_error,train_neg_mean_squared_error,train_neg_median_absolute_error,fit_time,score_time
0,SupportVectorRegression,GridSearchCV,-0.129862,-0.027126,-0.1016,-0.119847,-0.023508,-0.099283,0.009369,0.006354
1,SupportVectorRegression,RandomizedSearchCV,-0.129862,-0.027126,-0.1016,-0.119847,-0.023508,-0.099283,0.011629,0.005639
2,LassoLars,GridSearchCV,-0.233331,-0.076662,-0.223616,-0.232308,-0.076076,-0.223195,0.007156,0.001755
3,LassoLars,RandomizedSearchCV,-0.233331,-0.076662,-0.223616,-0.232308,-0.076076,-0.223195,0.006383,0.001791
4,ARDRegression,GridSearchCV,-0.13194,-0.028597,-0.107857,-0.107035,-0.018688,-0.085781,0.587935,0.003093
5,ARDRegression,RandomizedSearchCV,-0.13194,-0.028597,-0.107857,-0.107035,-0.018688,-0.085781,0.586385,0.003657
6,PassiveAggressiveRegressor,GridSearchCV,-0.128225,-0.027286,-0.100526,-0.118693,-0.023896,-0.093731,0.008769,0.003662
7,PassiveAggressiveRegressor,RandomizedSearchCV,-0.136498,-0.029677,-0.117013,-0.125962,-0.024801,-0.108881,0.010287,0.002146
8,TheilSenRegressor,GridSearchCV,-0.179553,-0.061347,-0.143869,-0.097217,-0.020838,-0.072623,27.708493,0.001389
9,TheilSenRegressor,RandomizedSearchCV,-0.177982,-0.058108,-0.143902,-0.097009,-0.019726,-0.072421,13.015588,0.000999


In [168]:
com_crime_df.iloc[:, 2:] = np.abs(com_crime_df.iloc[:, 2:])
com_crime_df: pd.DataFrame = com_crime_df.rename(columns={'test_neg_mean_absolute_error': 'test_mean_absolute_error', 'test_neg_mean_squared_error': 'test_mean_squared_error', 
                                            'test_neg_median_absolute_error': 'test_median_absolute_error', 'train_neg_mean_absolute_error': 'train_mean_absolute_error', 
                                            'train_neg_mean_squared_error': 'train_mean_squared_error', 'train_neg_median_absolute_error': 'train_median_absolute_error'})
com_crime_df.head(n=10)

Unnamed: 0,Model,Search_strategy,test_mean_absolute_error,test_mean_squared_error,test_median_absolute_error,train_mean_absolute_error,train_mean_squared_error,train_median_absolute_error,fit_time,score_time
0,SupportVectorRegression,GridSearchCV,0.129862,0.027126,0.1016,0.119847,0.023508,0.099283,0.009369,0.006354
1,SupportVectorRegression,RandomizedSearchCV,0.129862,0.027126,0.1016,0.119847,0.023508,0.099283,0.011629,0.005639
2,LassoLars,GridSearchCV,0.233331,0.076662,0.223616,0.232308,0.076076,0.223195,0.007156,0.001755
3,LassoLars,RandomizedSearchCV,0.233331,0.076662,0.223616,0.232308,0.076076,0.223195,0.006383,0.001791
4,ARDRegression,GridSearchCV,0.13194,0.028597,0.107857,0.107035,0.018688,0.085781,0.587935,0.003093
5,ARDRegression,RandomizedSearchCV,0.13194,0.028597,0.107857,0.107035,0.018688,0.085781,0.586385,0.003657
6,PassiveAggressiveRegressor,GridSearchCV,0.128225,0.027286,0.100526,0.118693,0.023896,0.093731,0.008769,0.003662
7,PassiveAggressiveRegressor,RandomizedSearchCV,0.136498,0.029677,0.117013,0.125962,0.024801,0.108881,0.010287,0.002146
8,TheilSenRegressor,GridSearchCV,0.179553,0.061347,0.143869,0.097217,0.020838,0.072623,27.708493,0.001389
9,TheilSenRegressor,RandomizedSearchCV,0.177982,0.058108,0.143902,0.097009,0.019726,0.072421,13.015588,0.000999
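
scikit-learn's `neg_*` scorers return negated errors so that a higher score is always better, which is why the cell above takes absolute values and renames the columns. A minimal sketch of the same conversion as a reusable function follows. The name `to_positive_errors` is hypothetical, and the one-row frame only reuses a value from the table above.

```python
import pandas as pd


def to_positive_errors(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: flip the sign of every '*_neg_*' column and drop 'neg_' from its name."""
    neg_columns = [column for column in df.columns if '_neg_' in column]
    result = df.copy()  # leave the caller's frame untouched
    result[neg_columns] = -result[neg_columns]
    renamed = {column: column.replace('_neg_', '_') for column in neg_columns}
    return result.rename(columns=renamed)


demo = pd.DataFrame({
    'Model': ['SupportVectorRegression'],
    'test_neg_mean_squared_error': [-0.027126],
    'fit_time': [0.009369],
})
converted = to_positive_errors(demo)
print(converted)
```

Selecting columns by name rather than by position (`iloc[:, 2:]`) avoids touching `fit_time` and `score_time`, which are already positive.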


In [170]:
styler: pd.io.formats.style.Styler = com_crime_df.head(10).style.highlight_max(color='tomato').highlight_min(color='lightgreen')
styler

Unnamed: 0,Model,Search_strategy,test_mean_absolute_error,test_mean_squared_error,test_median_absolute_error,train_mean_absolute_error,train_mean_squared_error,train_median_absolute_error,fit_time,score_time
0,SupportVectorRegression,GridSearchCV,0.129862,0.027126,0.1016,0.119847,0.023508,0.099283,0.009369,0.006354
1,SupportVectorRegression,RandomizedSearchCV,0.129862,0.027126,0.1016,0.119847,0.023508,0.099283,0.011629,0.005639
2,LassoLars,GridSearchCV,0.233331,0.076662,0.223616,0.232308,0.076076,0.223195,0.007156,0.001755
3,LassoLars,RandomizedSearchCV,0.233331,0.076662,0.223616,0.232308,0.076076,0.223195,0.006383,0.001791
4,ARDRegression,GridSearchCV,0.13194,0.028597,0.107857,0.107035,0.018688,0.085781,0.587935,0.003093
5,ARDRegression,RandomizedSearchCV,0.13194,0.028597,0.107857,0.107035,0.018688,0.085781,0.586385,0.003657
6,PassiveAggressiveRegressor,GridSearchCV,0.128225,0.027286,0.100526,0.118693,0.023896,0.093731,0.008769,0.003662
7,PassiveAggressiveRegressor,RandomizedSearchCV,0.136498,0.029677,0.117013,0.125962,0.024801,0.108881,0.010287,0.002146
8,TheilSenRegressor,GridSearchCV,0.179553,0.061347,0.143869,0.097217,0.020838,0.072623,27.708493,0.001389
9,TheilSenRegressor,RandomizedSearchCV,0.177982,0.058108,0.143902,0.097009,0.019726,0.072421,13.015588,0.000999


In [171]:
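# Styler.render() matches the pandas version used here; pandas 1.4+ deprecates it in favour of Styler.to_html().
html: str = styler.render()
# A context manager closes the file even if the write fails.
with open("CommunitiesCrime_DataSet.html", "w") as file:
    file.write(html)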

# Documentation

## Linear Regression

***Link to the official documentation***: [https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

&emsp;&emsp;Regression looks for relationships between variables.

&emsp;&emsp;For example, you can observe several employees of a company and try to understand how their salaries depend on features such as experience, level of education, role, the city they work in, and so on.

&emsp;&emsp;Similarly, you can try to establish a mathematical dependence of house prices on their floor areas, number of bedrooms, distance to the city centre, and so on.

&emsp;&emsp;In general, in regression you consider some phenomenon of interest and have a number of observations. Each observation has two or more features. Assuming that (at least) one of the features depends on the others, you try to establish a relationship between them.

&emsp;&emsp;In other words, you need to find a function that maps some features or variables to others sufficiently well.

&emsp;&emsp;The dependent features are called dependent variables, outputs, or responses.

&emsp;&emsp;The independent features are called independent variables, inputs, or predictors.

&emsp;&emsp;Regression problems usually have a continuous, unbounded dependent variable. The inputs, however, can be continuous, discrete, or even categorical data such as gender, nationality, brand, and so on.

&emsp;&emsp;Regression is one of the most important fields in statistics and machine learning. Many regression methods are available; ***linear regression*** is one of them.

&emsp;&emsp;***Linear regression*** is probably one of the most important and widely used regression techniques. It is among the simplest regression methods, and one of its main advantages is how easy its results are to interpret.

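&emsp;&emsp;**Simple linear regression**

&emsp;&emsp;Simple (single-variable) linear regression is the simplest case of linear regression, with a single independent variable, 𝐱 = 𝑥. The fitted model is a straight line, $f(x) = b_0 + b_1 x$, where $b_0$ is the intercept and $b_1$ is the slope.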

<img src="./images/fig-lin-reg.webp" alt="drawing" width="600px"/>
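
&emsp;&emsp;As a minimal sketch of simple linear regression with scikit-learn, the snippet below fits `LinearRegression` to a handful of synthetic, noise-free points on the line $y = 2x + 1$. The data values are an assumption chosen for illustration, so the recovered slope and intercept can be checked by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free points on the line y = 2x + 1 (illustrative values).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)  # one feature, five observations
y = 2.0 * x.ravel() + 1.0

model = LinearRegression().fit(x, y)
slope = model.coef_[0]        # estimate of b_1
intercept = model.intercept_  # estimate of b_0
prediction = model.predict(np.array([[5.0]]))[0]

print(slope, intercept, prediction)
```

&emsp;&emsp;Note that scikit-learn expects the inputs as a two-dimensional array, which is why `x` is reshaped into a single column.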

***Linear Regression in Python***: [https://realpython.com/linear-regression-in-python/#what-is-regression](https://realpython.com/linear-regression-in-python/#what-is-regression)

***Linear Regression for Machine Learning***: [https://machinelearningmastery.com/linear-regression-for-machine-learning/](https://machinelearningmastery.com/linear-regression-for-machine-learning/)

## Decision Tree Regression

***Link to the official documentation***: [https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)

&emsp;&emsp;A decision tree is a decision-making tool that uses a flowchart-like tree structure; equivalently, it is a model of decisions and all of their possible consequences, including outcomes, input costs, and utility.

&emsp;&emsp;The decision tree algorithm belongs to the family of supervised learning algorithms. It works for both continuous and categorical output variables.

&emsp;&emsp;The branches (edges) represent the outcome of a node, and each node holds either:

&emsp;&emsp;&emsp;&emsp;1. Conditions (decision nodes)

&emsp;&emsp;&emsp;&emsp;2. Results (leaf nodes)

&emsp;&emsp;The branches (edges) represent the truth or falsity of the condition. The picture below shows an example: a decision tree that finds the smallest of three numbers:

<img src="./images/decision-tree.jpg" alt="drawing" width="500px"/>


&emsp;&emsp;Decision tree regression observes the features of an object and trains a tree-structured model to predict future data, producing a meaningful continuous output. Continuous output means that the result is not discrete, that is, it is not drawn only from a known, discrete set of numbers or values.

&emsp;&emsp;In the figure below, decision trees are used to fit a sine curve with added noise. As a result, they learn local, piecewise-constant approximations of the sine curve.

&emsp;&emsp;We can see that if the maximum depth of the tree (controlled by the *max_depth* parameter) is set too high, the decision trees learn overly fine details of the training data and learn from the noise, that is, they *overfit*.

<img src="./images/sphx_glr_plot_tree_regression_001.png" alt="drawing" width="500px"/>
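
&emsp;&emsp;The effect of *max_depth* can be sketched in a few lines. The snippet below is a hedged illustration on synthetic data (a noisy sine curve generated here, not one of the notebook's datasets): it fits a shallow and a deeper tree and compares their errors on the training set. A much lower training error for the deeper tree is the symptom of fitting the noise described above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy sine curve (an assumption made for illustration).
rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))  # add noise to every fifth point

shallow = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
deep = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)

# Mean squared error on the training data itself.
shallow_mse = np.mean((shallow.predict(X) - y) ** 2)
deep_mse = np.mean((deep.predict(X) - y) ** 2)

print(shallow.get_n_leaves(), shallow_mse)
print(deep.get_n_leaves(), deep_mse)
```

&emsp;&emsp;Training error alone cannot show overfitting; comparing it with a cross-validated error, as done with `cross_validate` earlier in this notebook, is what reveals the gap.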

***Python | Decision Tree Regression using sklearn***: [https://www.geeksforgeeks.org/python-decision-tree-regression-using-sklearn/](https://www.geeksforgeeks.org/python-decision-tree-regression-using-sklearn/)

***Visualize a Decision Tree in 4 Ways with Scikit-Learn and Python***: [https://mljar.com/blog/visualize-decision-tree/](https://mljar.com/blog/visualize-decision-tree/)

## KernelRidge

***Link to the official documentation***: [https://scikit-learn.org/stable/modules/generated/sklearn.kernel_ridge.KernelRidge.html](https://scikit-learn.org/stable/modules/generated/sklearn.kernel_ridge.KernelRidge.html)

&emsp;&emsp;Kernel ridge regression (KRR) combines ridge regression and classification (linear least squares with l2-norm regularization) with the *kernel trick*.

&emsp;&emsp;It thus learns a linear function in the space induced by the respective kernel and the data. For non-linear kernels, this corresponds to a non-linear function in the original space.

&emsp;&emsp;The form of the model learned by KernelRidge is identical to that of Support Vector Regression (SVR). However, different loss functions are used: KRR uses squared error loss, while support vector regression uses $\epsilon$-insensitive loss, both combined with l2 regularization. In contrast to SVR, fitting KernelRidge can be done in closed form and is typically faster for medium-sized datasets. On the other hand, the learned model is non-sparse and therefore slower at prediction time than SVR, which learns a sparse model for $\epsilon > 0$.

&emsp;&emsp;The following figures compare KernelRidge and SVR on an artificial dataset that consists of a sinusoidal target function with strong noise added to every fifth data point. The learned models of KernelRidge and SVR are plotted, where both the complexity/regularization and the bandwidth of the RBF kernel have been optimized using grid search. The learned functions are very similar; however, fitting KernelRidge is approximately seven times faster than fitting SVR (both with grid search). Predicting 100000 target values, on the other hand, is more than three times faster with SVR, since it has learned a sparse model using only about 1/3 of the 100 training points as support vectors.

<img src="./images/sphx_glr_plot_kernel_ridge_regression_001.png" alt="drawing" width="400px"/>

<img src="./images/sphx_glr_plot_kernel_ridge_regression_002.png" alt="drawing" width="400px"/>

<img src="./images/sphx_glr_plot_kernel_ridge_regression_003.png" alt="drawing" width="400px"/>
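
&emsp;&emsp;The difference between a dense and a sparse model can be inspected directly on the fitted estimators. The sketch below uses synthetic data and hand-picked hyperparameters (both are assumptions, not the tuned values from the figures): `KernelRidge` stores one dual coefficient per training sample, while `SVR` keeps only its support vectors.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVR

# Synthetic sinusoidal target with noise on every fifth point (illustrative data).
rng = np.random.RandomState(0)
X = 5 * rng.rand(100, 1)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(20))

# Hand-picked hyperparameters, not the result of a search.
krr = KernelRidge(kernel='rbf', alpha=1.0, gamma=0.1).fit(X, y)
svr = SVR(kernel='rbf', C=1.0, gamma=0.1, epsilon=0.1).fit(X, y)

n_krr_coefficients = krr.dual_coef_.shape[0]  # one coefficient per training sample
n_support_vectors = svr.support_.shape[0]     # only the samples on or outside the epsilon tube

print(n_krr_coefficients, n_support_vectors)
```

&emsp;&emsp;The fewer support vectors SVR keeps, the fewer kernel evaluations each prediction needs, which is the source of the prediction-time advantage mentioned above.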

***Kernel Ridge Regression – Python Tutorial***: [https://www.mdelcueto.com/blog/kernel-ridge-regression-tutorial/](https://www.mdelcueto.com/blog/kernel-ridge-regression-tutorial/)

***What is the kernel trick? Why is it important?***: [https://medium.com/@zxr.nju/what-is-the-kernel-trick-why-is-it-important-98a98db0961d](https://medium.com/@zxr.nju/what-is-the-kernel-trick-why-is-it-important-98a98db0961d)


## Gaussian Process Regressor

***Link to the official documentation***: [https://scikit-learn.org/stable/modules/gaussian_process.html](https://scikit-learn.org/stable/modules/gaussian_process.html)

&emsp;&emsp;Gaussian processes (GP) are a generic supervised learning method designed to solve regression and probabilistic classification problems.

&emsp;&emsp;The advantages of Gaussian processes are:

* The prediction interpolates the observations (at least for regular kernels).

* The prediction is probabilistic (Gaussian), so that one can compute empirical confidence intervals and decide, based on those, whether to refit (online fitting, adaptive fitting) the prediction in some region of interest.

* Versatility: different kernels can be specified. Common kernels are provided, but it is also possible to specify custom kernels.

&emsp;&emsp;The disadvantages of Gaussian processes include:

* They are not sparse, i.e., they use the whole sample/feature information to perform the prediction.

* They lose efficiency in high-dimensional spaces, namely when the number of features exceeds a few dozen.

&emsp;&emsp;GaussianProcessRegressor implements Gaussian processes (GP) for regression purposes. For this, the prior of the GP needs to be specified. The prior mean is assumed to be constant and zero (for *normalize_y = False*) or the training data's mean (for *normalize_y = True*). The prior's covariance is specified by passing a kernel object. The hyperparameters of the kernel are optimized during fitting of GaussianProcessRegressor by maximizing the log-marginal likelihood (LML) with the chosen optimizer. As the LML may have multiple local optima, the optimizer can be restarted repeatedly by specifying *n_restarts_optimizer*. The first run always starts from the kernel's initial hyperparameter values; subsequent runs start from hyperparameter values drawn at random from the range of allowed values. If the initial hyperparameters should be kept fixed, `None` can be passed as the optimizer.

&emsp;&emsp;The noise level in the targets can be specified via the *alpha* parameter, either globally as a scalar or per datapoint. Note that a moderate noise level can also help with numerical issues during fitting, since it is effectively implemented as Tikhonov regularization, i.e., by adding it to the diagonal of the kernel matrix. An alternative to specifying the noise level explicitly is to include a WhiteKernel component in the kernel, which can estimate the global noise level from the data.

<img src="./images/sphx_glr_plot_gpr_noisy_001.png" alt="drawing" width="400px"/>

<img src="./images/sphx_glr_plot_gpr_noisy_002.png" alt="drawing" width="400px"/>

<img src="./images/sphx_glr_plot_gpr_noisy_003.png" alt="drawing" width="400px"/>
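
&emsp;&emsp;The options discussed above can be combined in a short, hedged sketch. The data below are synthetic (a noisy sine curve generated here), and the initial kernel values are arbitrary starting points, not recommendations. The kernel includes a `WhiteKernel` term so the noise level is estimated from the data, and `return_std=True` yields the standard deviation used for an approximate 95% interval.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Synthetic noisy observations of a sine curve (illustrative data).
rng = np.random.RandomState(0)
X = rng.uniform(0, 5, size=(30, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=30)

# Signal term (scaled RBF) plus a WhiteKernel that models the global noise level.
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)

# Two extra optimizer restarts help against local optima of the log-marginal likelihood.
gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=2,
                               normalize_y=False, random_state=0)
gpr.fit(X, y)

X_new = np.linspace(0, 5, 50).reshape(-1, 1)
mean, std = gpr.predict(X_new, return_std=True)
lower, upper = mean - 1.96 * std, mean + 1.96 * std  # approximate 95% interval

print(gpr.kernel_)  # kernel with the optimized hyperparameters
print(gpr.log_marginal_likelihood_value_)
```

&emsp;&emsp;Passing `optimizer=None` instead would skip the optimization and keep the initial kernel hyperparameters fixed.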

***Comparison of kernel ridge and Gaussian process regression***: [https://scikit-learn.org/stable/auto_examples/gaussian_process/plot_compare_gpr_krr.html#sphx-glr-auto-examples-gaussian-process-plot-compare-gpr-krr-py](https://scikit-learn.org/stable/auto_examples/gaussian_process/plot_compare_gpr_krr.html#sphx-glr-auto-examples-gaussian-process-plot-compare-gpr-krr-py)

***An Introduction to Gaussian Process Regression***: [https://juanitorduz.github.io/gaussian_process_reg/](https://juanitorduz.github.io/gaussian_process_reg/)

## Logistic Regression

***Link catre documentatia oficiala***: [https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

&emsp;&emsp;Logistic regression can be used for a variety of classification problems, such as spam detection, diabetes prediction, whether a given customer will buy a particular product, or whether a user will click on a given advertising link.

&emsp;&emsp;Logistic regression is one of the simplest and most commonly used machine learning algorithms for two-class classification. It is easy to implement and can serve as a baseline for any binary classification problem. Its basic concepts also carry over to deep learning. Logistic regression describes and estimates the relationship between one binary dependent variable and the independent variables.

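&emsp;&emsp;It can be viewed as a variant of linear regression for a categorical target variable: the linear combination of the features models the log of the odds rather than the target itself. Logistic regression predicts the probability that a binary event occurs by passing this log-odds value through the logistic (sigmoid) function, which is the inverse of the logit function.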
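&emsp;&emsp;The relationship between the log-odds and the predicted probability can be checked numerically. The sketch below uses a synthetic binary dataset (an assumption, generated with `make_classification`) and illustrative variable names; it only reproduces by hand what `predict_proba` returns for a fitted binary model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (assumption, for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X_demo, y_demo)

# The log-odds are a linear function of the features
log_odds = X_demo @ clf.coef_.ravel() + clf.intercept_[0]

# The logistic (sigmoid) function maps log-odds to a probability in (0, 1)
proba_manual = 1.0 / (1.0 + np.exp(-log_odds))
proba_sklearn = clf.predict_proba(X_demo)[:, 1]

print(np.allclose(proba_manual, proba_sklearn))
```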

<img src="./images/linear_vs_logistic_regression_edxw03.png" alt="drawing" width="600px"/>

* Advantages

&emsp;&emsp;Because it is efficient and straightforward, logistic regression does not require much computing power, is easy to implement and to interpret, and is widely used by data analysts and scientists. Without regularization it does not strictly require feature scaling, although scaling generally helps when a penalty is applied (scikit-learn's LogisticRegression applies an L2 penalty by default). Logistic regression also provides a probability score for each observation.

* Disadvantages

&emsp;&emsp;Logistic regression does not handle a large number of categorical features/variables well. It is vulnerable to overfitting. It also cannot solve nonlinear problems directly, which is why nonlinear features need to be transformed first. Logistic regression will not perform well with independent variables that are uncorrelated with the target variable, or that are very similar to or highly correlated with each other.

***Understanding Logistic Regression in Python***: [https://www.datacamp.com/community/tutorials/understanding-logistic-regression-python](https://www.datacamp.com/community/tutorials/understanding-logistic-regression-python)

***Logistic Regression in Python***: [https://realpython.com/logistic-regression-python/#logistic-regression-in-python](https://realpython.com/logistic-regression-python/#logistic-regression-in-python)

## SupportVectorRegression

***Link to the official documentation***: [https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html)

&emsp;&emsp;Support vector machines (SVM) are popular and widely used for classification problems in machine learning.
&emsp;&emsp;The regression problem is a generalization of the classification problem in which the model returns a continuous-valued output, as opposed to an output from a finite set. In other words, a regression model estimates a continuous-valued multivariate function.

&emsp;&emsp;SVMs solve binary classification problems by formulating them as convex optimization problems. The optimization problem entails finding the maximum-margin separating hyperplane while correctly classifying as many training points as possible. SVMs represent this optimal hyperplane with support vectors. The sparse solution and good generalization of the SVM lend themselves to adaptation to regression problems. SVM is generalized to SVR by introducing an ε-insensitive region around the function, called the ε-tube. This tube reformulates the optimization problem as finding the tube that best approximates the continuous-valued function while balancing model complexity and prediction error. More specifically, SVR is formulated as an optimization problem by first defining a convex ε-insensitive loss function to be minimized and then finding the flattest tube that contains most of the training instances. Hence, a multi-objective function is constructed from the loss function and the geometrical properties of the tube. The convex optimization problem, which has a unique solution, is then solved with appropriate numerical optimization algorithms. The hyperplane is represented in terms of support vectors, which are the training samples that lie on or outside the boundary of the tube. As in SVM, the support vectors in SVR are the most influential instances affecting the shape of the tube, and, in a supervised learning context, the training and test data are assumed to be independent and identically distributed (i.i.d.), drawn from the same fixed but unknown probability distribution.

&emsp;&emsp;The SVR problem formulation is often best derived from a geometrical perspective, using the one-dimensional example in the figure below.

<img src="./images/SVM_1.png" alt="drawing" width="600px"/>

&emsp;&emsp;What we are essentially trying to do here is choose decision boundaries at a distance ε from the original hyperplane, such that the data points closest to the hyperplane, the support vectors, lie within those boundary lines.

<img src="./images/SVM_2.jpeg" alt="drawing" width="600px"/>
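&emsp;&emsp;The ε-tube can also be inspected on a fitted model. The sketch below is a minimal illustration on synthetic data; the data, the variable names and the values chosen for `C` and `epsilon` are assumptions. It computes the ε-insensitive loss by hand and looks at the residuals of the support vectors, which should sit on or outside the tube up to the solver tolerance.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic noisy sine data (assumption, for illustration only)
rng = np.random.RandomState(0)
X_demo = np.sort(rng.uniform(0, 5, size=(60, 1)), axis=0)
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.1, size=60)

epsilon = 0.2  # half-width of the tube
svr = SVR(kernel="rbf", C=10.0, epsilon=epsilon).fit(X_demo, y_demo)

# Absolute residual of every training sample
residuals = np.abs(y_demo - svr.predict(X_demo))

# epsilon-insensitive loss: zero inside the tube, linear outside it
loss = np.maximum(0.0, residuals - epsilon)

# Residuals of the support vectors (indices are stored in svr.support_)
sv_residuals = residuals[svr.support_]

print(f"{len(svr.support_)} support vectors out of {len(X_demo)} samples")
print(f"smallest support-vector residual: {sv_residuals.min():.3f}")
print(f"samples with zero loss: {(loss == 0).sum()}")
```

&emsp;&emsp;Increasing `epsilon` widens the tube, which typically leaves fewer support vectors and a flatter, simpler function.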

***Support Vector Regression***: [https://link.springer.com/chapter/10.1007/978-1-4302-5990-9_4](https://link.springer.com/chapter/10.1007/978-1-4302-5990-9_4)

***Support Vector Regression Tutorial for Machine Learning***: [https://www.analyticsvidhya.com/blog/2020/03/support-vector-regression-tutorial-for-machine-learning/](https://www.analyticsvidhya.com/blog/2020/03/support-vector-regression-tutorial-for-machine-learning/)

## LassoLars

***Link to the official documentation***: [https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoLars.html](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoLars.html)

&emsp;&emsp;In statistics, least-angle regression (LARS) is an algorithm for fitting linear regression models to high-dimensional data, developed by Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani.

&emsp;&emsp;Suppose we expect a response variable to be determined by a linear combination of a subset of potential covariates. The LARS algorithm then provides a means of estimating which variables to include, as well as their coefficients.

&emsp;&emsp;Instead of giving a single vector as a result, the LARS solution consists of a curve denoting the solution for each value of the L1 norm of the parameter vector. The algorithm is similar to forward stepwise regression, but instead of including variables at each step, the estimated parameters are increased in a direction equiangular to each one's correlation with the residual.

&emsp;&emsp;A simple explanation of the Lasso and Least Angle Regression

&emsp;&emsp;Given a set of input measurements x1, x2, ..., xp and an outcome measurement y, the lasso fits a linear model

&emsp;&emsp;yhat = b0 + b1 * x1 + b2 * x2 + ... + bp * xp

&emsp;&emsp;The criterion it uses is:

&emsp;&emsp;Minimize sum((y - yhat)^2) subject to sum(|bj|) <= s

&emsp;&emsp;The first sum is taken over the observations (cases) in the dataset. The bound "s" is a tuning parameter. When "s" is large enough, the constraint has no effect and the solution is just the usual multiple linear least squares regression of y on x1, x2, ..., xp.

&emsp;&emsp;However, for smaller values of s (s >= 0) the solutions are shrunken versions of the least squares estimates. Often, some of the coefficients bj are exactly zero. Choosing "s" is like choosing the number of predictors to use in a regression model, and cross-validation is a good tool for estimating the best value for "s".
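&emsp;&emsp;A minimal sketch of choosing the amount of shrinkage by cross-validation follows. Note that scikit-learn does not expose the bound "s" directly: it uses the equivalent penalized form of the criterion, where a penalty weight `alpha` multiplies the sum of absolute coefficients, so a larger `alpha` corresponds to a smaller "s". The synthetic data and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsCV

# Synthetic data: 10 predictors, only 3 of them informative (assumption)
X_demo, y_demo = make_regression(
    n_samples=100, n_features=10, n_informative=3, noise=5.0, random_state=0
)

# 5-fold cross-validation over the penalty values found along the LARS path
model = LassoLarsCV(cv=5).fit(X_demo, y_demo)

print("selected alpha:", model.alpha_)
print("indices of non-zero coefficients:", np.flatnonzero(model.coef_))
```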


* Computing the Lasso solutions

&emsp;&emsp;Computing the lasso solutions is a quadratic programming problem and can be tackled with standard numerical analysis algorithms. However, the least angle regression procedure is a better approach. This algorithm exploits the special structure of the lasso problem and provides an efficient way to compute the solutions simultaneously for all values of "s".

&emsp;&emsp;Least angle regression is like a more "democratic" version of forward stepwise regression. Recall how forward stepwise regression works:

* Forward stepwise regression algorithm:

&emsp;&emsp;Start with all coefficients bj equal to zero.

&emsp;&emsp;Find the predictor xj most correlated with y and add it to the model. Take residuals r = y - yhat.

&emsp;&emsp;Continue, at each stage adding to the model the predictor most correlated with r.

&emsp;&emsp;Stop when all predictors are in the model.

&emsp;&emsp;The least angle regression procedure follows the same general scheme, but does not add a predictor fully into the model. The coefficient of that predictor is increased only until that predictor is no longer the one most correlated with the residual r. Then some other competing predictor is invited to "join the club".


* Least angle regression algorithm:

&emsp;&emsp;Start with all coefficients bj equal to zero.

&emsp;&emsp;Find the predictor xj most correlated with y.

&emsp;&emsp;Increase the coefficient bj in the direction of the sign of its correlation with y. Take residuals r = y - yhat along the way. Stop when some other predictor xk has as much correlation with r as xj has.

&emsp;&emsp;Increase (bj, bk) in their joint least squares direction, until some other predictor xm has as much correlation with the residual r.

&emsp;&emsp;Continue until all predictors are in the model.

&emsp;&emsp;Surprisingly, it can be shown that, with one modification, this procedure gives the entire path of lasso solutions as s varies from 0 to infinity. The modification needed is: if a non-zero coefficient hits zero, remove it from the active set of predictors and recompute the joint direction.
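&emsp;&emsp;The whole solution path can be computed in one call with scikit-learn's `lars_path` function. The sketch below is an illustration on synthetic data (the data and variable names are assumptions); `method="lasso"` selects the modified procedure in which coefficients that hit zero leave the active set.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lars_path

# Synthetic data: 6 predictors, 3 of them informative (assumption)
X_demo, y_demo = make_regression(
    n_samples=100, n_features=6, n_informative=3, noise=5.0, random_state=0
)

# Center the data so that no intercept is needed
X_demo = X_demo - X_demo.mean(axis=0)
y_demo = y_demo - y_demo.mean()

# alphas: penalty value at each knot of the path (decreasing)
# active: indices of the predictors in the model at the end of the path
# coefs:  one column of coefficients per knot, starting from all zeros
alphas, active, coefs = lars_path(X_demo, y_demo, method="lasso")

# L1 norm of the coefficient vector at each knot (plays the role of "s")
l1_norms = np.abs(coefs).sum(axis=0)

for alpha, l1, column in zip(alphas, l1_norms, coefs.T):
    print(f"alpha={alpha:10.4f}  L1 norm={l1:10.4f}  non-zero={np.count_nonzero(column)}")
```

&emsp;&emsp;Each printed row corresponds to one breakpoint of the piecewise-linear path: as the penalty decreases, the L1 norm grows and more predictors enter the model.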

<img src="./images/LassoLars.png" alt="drawing" width="600px"/>

***A simple explanation of the Lasso and Least Angle Regression***: [https://statweb.stanford.edu/~tibs/lasso/simple.html](https://statweb.stanford.edu/~tibs/lasso/simple.html)

***Least-angle regression***: [https://en.wikipedia.org/wiki/Least-angle_regression](https://en.wikipedia.org/wiki/Least-angle_regression)

## ARDRegression

***Link to the official documentation***: [https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ARDRegression.html](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ARDRegression.html)

&emsp;&emsp;Automatic Relevance Determination (ARD) is based on Bayesian inference. The scikit-learn API provides the ARDRegression class for fitting a regression model with the ARD method. ARDRegression treats the model weights as Gaussian distributed and iteratively estimates the parameters lambda (the precisions of the weights) and alpha (the precision of the noise).

&emsp;&emsp;An Automatic Relevance Determination (ARD) regression system is a Bayesian regression system that implements an ARD regression algorithm to solve an ARD regression task.

&emsp;&emsp;To motivate Automatic Relevance Determination (ARD), an intuition is first built for the problem of choosing between a complex model that fits the data well and a simple model that generalizes well. The idea behind Occam's razor is then presented as a way of balancing bias and variance. This leads to the mathematical framework of Bayesian interpolation and model selection, which is used to choose between different models based on the data.

&emsp;&emsp;To reach ARD as quickly as possible, the mathematical basics of a simple linear model are reviewed, along with the idea of regularization to prevent overfitting. Building on this, Bayesian Ridge Regression (BayesianRidge in scikit-learn) is introduced. Generalizing the concept of Bayesian Ridge Regression then leads to the idea behind ARD (ARDRegression in scikit-learn).

&emsp;&emsp;With the help of a practical example, ARD is compared with an ordinary least squares model to consolidate what has been covered so far. The mathematics of ARD and the algorithm that solves the ARD minimization problem come next, followed by some details of scikit-learn's ARD implementation.
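&emsp;&emsp;A small comparison of this kind can be sketched as follows. The synthetic data (20 features, of which only the first 3 influence the target) and all variable names are assumptions made for illustration; the point is only to show how ARD shrinks irrelevant weights compared with ordinary least squares.

```python
import numpy as np
from sklearn.linear_model import ARDRegression, LinearRegression

# Synthetic data with a sparse true weight vector (assumption)
rng = np.random.RandomState(0)
n_samples, n_features = 100, 20
X_demo = rng.normal(size=(n_samples, n_features))
true_w = np.zeros(n_features)
true_w[:3] = [5.0, -3.0, 2.0]  # only the first three features are relevant
y_demo = X_demo @ true_w + rng.normal(scale=0.5, size=n_samples)

ard = ARDRegression().fit(X_demo, y_demo)
ols = LinearRegression().fit(X_demo, y_demo)

# lambda_ holds one estimated precision per weight: a large precision
# means the weight is pinned close to zero (feature judged irrelevant)
print("ARD weights:", np.round(ard.coef_, 2))
print("OLS weights:", np.round(ols.coef_, 2))
print("estimated noise precision alpha_:", ard.alpha_)
print("largest weight precisions lambda_:", np.sort(ard.lambda_)[-3:])
```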

<img src="./images/ARD.png" alt="drawing" width="600px"/>

***Explaining the Idea behind ARD and Bayesian Interpolation***: [https://florianwilhelm.info/2016/03/explaining_the_idea_behind_ard/](https://florianwilhelm.info/2016/03/explaining_the_idea_behind_ard/)

***Automatic Relevance Determination (ARD) Regression System***: [http://www.gabormelli.com/RKB/Automatic_Relevance_Determination_(ARD)_Regression_System](http://www.gabormelli.com/RKB/Automatic_Relevance_Determination_(ARD)_Regression_System)

## Passive Aggressive Regressor

***Link to the official documentation***: [https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.PassiveAggressiveRegressor.html](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.PassiveAggressiveRegressor.html)

&emsp;&emsp;Passive-aggressive algorithms are generally used for large-scale learning. They belong to the family of "online learning algorithms". In online machine learning, the input data arrives sequentially and the model is updated step by step, as opposed to batch learning, where the entire training dataset is used at once. This is very useful when there is a huge amount of data and training on the entire dataset is computationally infeasible because of its sheer size. Put simply, an online learning algorithm receives a training example, updates the model, and then discards the example.

&emsp;&emsp;A good example is detecting fake news on a social media site such as Twitter, where new data is added every second. Reading Twitter data dynamically and continuously would produce a huge amount of data, so an online learning algorithm would be ideal.

&emsp;&emsp;Passive-aggressive algorithms are somewhat similar to a Perceptron model in that they do not require a learning rate. However, they do include a regularization parameter.

&emsp;&emsp;Passive-aggressive algorithms get their name from how they work:

&emsp;&emsp;Passive: if the prediction is correct, keep the model and make no changes, i.e., the data in the example is not enough to cause any change in the model. For the regressor, "correct" means that the absolute error stays within the tolerance *epsilon*.

&emsp;&emsp;Aggressive: if the prediction is incorrect, change the model, i.e., some change to the model may correct it.

&emsp;&emsp;The mathematics behind this algorithm is not very simple and is beyond the scope of this overview. To learn more about it, the video lecture on the algorithm by Dr. Victor Lavrenko is a recommended starting point.

&emsp;&emsp;Important parameters:

&emsp;&emsp;**C**: The regularization parameter. It caps the step size of each update, so a larger value makes the model react more aggressively to an incorrect prediction.

&emsp;&emsp;**max_iter**: The maximum number of passes (epochs) the model makes over the training data.

&emsp;&emsp;**tol**: The stopping criterion. If it is not None, training stops when (loss > previous_loss - tol). By default it is set to 1e-3.
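&emsp;&emsp;The passive and aggressive cases can be made concrete with a hand-written update rule. The sketch below implements the PA-I variant for regression with an ε-insensitive loss in plain NumPy; the helper `pa_regression_update`, the simulated data stream and the parameter values are hypothetical and serve only as an illustration, not as scikit-learn's internal implementation.

```python
import numpy as np

def pa_regression_update(w, x, y, C=1.0, epsilon=0.1):
    """One PA-I update for regression (hypothetical helper for illustration)."""
    error = y - w @ x
    loss = max(0.0, abs(error) - epsilon)  # epsilon-insensitive loss
    if loss == 0.0:
        return w                           # passive: prediction is good enough
    tau = min(C, loss / (x @ x))           # aggressive: step size capped by C
    return w + np.sign(error) * tau * x

# Simulated stream of examples from a noiseless linear model (assumption)
rng = np.random.RandomState(0)
true_w = np.array([2.0, -1.0, 0.5])
w = np.zeros(3)

for _ in range(500):
    x = rng.normal(size=3)
    y = true_w @ x
    w = pa_regression_update(w, x, y)  # each example is seen once, then discarded

print("learned weights:", np.round(w, 2))
```

&emsp;&emsp;In scikit-learn the same online workflow is available through the `partial_fit` method of PassiveAggressiveRegressor, which updates the model with one batch of examples at a time.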

<img src="./images/PAR.png" alt="drawing" width="600px"/>

***ML Algorithms addendum: Passive Aggressive Algorithms***: [https://www.bonaccorso.eu/2017/10/06/ml-algorithms-addendum-passive-aggressive-algorithms/](https://www.bonaccorso.eu/2017/10/06/ml-algorithms-addendum-passive-aggressive-algorithms/)

***Passive Aggressive Classifiers***: [https://www.geeksforgeeks.org/passive-aggressive-classifiers/](https://www.geeksforgeeks.org/passive-aggressive-classifiers/)

## TheilSenRegressor

***Link to the official documentation***: [https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.TheilSenRegressor.html](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.TheilSenRegressor.html)

&emsp;&emsp;In non-parametric statistics, the Theil-Sen estimator is a method for robustly fitting a line to sample points in the plane (simple linear regression) by choosing the median of the slopes of all lines through pairs of points. It has also been called Sen's slope estimator, slope selection, the single median method, the Kendall robust line-fit method, and the Kendall-Theil robust line. It is named after Henri Theil and Pranab K. Sen, who published papers on this method in 1950 and 1968 respectively, and after Maurice Kendall because of its relation to the Kendall tau rank correlation coefficient.

&emsp;&emsp;This estimator can be computed efficiently and is insensitive to outliers. It can be significantly more accurate than non-robust simple linear regression (least squares) for skewed and heteroskedastic data, and competes well against least squares even for normally distributed data in terms of statistical power. It has been called "the most popular nonparametric technique for estimating a linear trend".

&emsp;&emsp;The median slope of a set of n sample points can be computed exactly by computing all $O(n^2)$ lines through pairs of points and then applying a linear-time median-finding algorithm. Alternatively, it can be estimated by sampling pairs of points. This problem is equivalent, under projective duality, to finding the crossing point in an arrangement of lines that has the median x-coordinate among all such crossing points.
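&emsp;&emsp;The brute-force computation described above fits in a few lines. The sketch below is a minimal illustration for the one-dimensional case; the helper `theil_sen_slope` and the toy data (a perfect line with a single gross outlier) are assumptions made for this example, and `np.median` stands in for a dedicated linear-time selection algorithm.

```python
from itertools import combinations

import numpy as np

def theil_sen_slope(x, y):
    """Median of the slopes over all pairs of points with distinct x values."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[j] != x[i]
    ]
    return float(np.median(slopes))

# Points on the line y = 2x + 1, with the last one replaced by an outlier
x_demo = np.arange(10, dtype=float)
y_demo = 2.0 * x_demo + 1.0
y_demo[-1] = 100.0

ts_slope = theil_sen_slope(x_demo, y_demo)
ols_slope = np.polyfit(x_demo, y_demo, 1)[0]  # ordinary least squares slope

print("Theil-Sen slope:    ", ts_slope)
print("least squares slope:", ols_slope)
```

&emsp;&emsp;Only 9 of the 45 pairwise slopes involve the outlier, so the median stays at the true slope of 2, while the least squares slope is pulled far above it.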

&emsp;&emsp;The problem of performing slope selection exactly but more efficiently than the brute-force quadratic-time algorithm has been studied extensively in computational geometry. Several different methods are known for computing the Theil-Sen estimator exactly in $O(n \log n)$ time, either deterministically or using randomized algorithms. Siegel's repeated median estimator can also be constructed within the same time bound. In models of computation in which the input coordinates are integers and bitwise operations on integers take constant time, the Theil-Sen estimator can be constructed even more quickly, in randomized expected time $O(n \sqrt{\log n})$.

&emsp;&emsp;An estimator for the slope with approximately median rank, having the same breakdown point as the Theil-Sen estimator, can be maintained in the data stream model (in which the sample points are processed one by one by an algorithm that does not have enough persistent storage to represent the entire dataset) using an algorithm based on ε-nets.

&emsp;&emsp;In the R statistics package, both the Theil-Sen estimator and Siegel's repeated median estimator are available through the mblm library. A standalone Visual Basic application for Theil-Sen estimation, KTRLine, has been made available by the US Geological Survey. The Theil-Sen estimator has also been implemented in Python as part of the SciPy and scikit-learn libraries.

<img src="./images/TSE.jpg" alt="drawing" width="600px"/>

***Theil–Sen estimator***: [https://en.wikipedia.org/wiki/Theil%E2%80%93Sen_estimator](https://en.wikipedia.org/wiki/Theil%E2%80%93Sen_estimator)

***Theil-Sen Estimators in a Multiple Linear Regression Model***: [http://home.olemiss.edu/~xdang/papers/MTSE.pdf](http://home.olemiss.edu/~xdang/papers/MTSE.pdf)