**Data preparation** is one of the essential processes in machine learning projects workflow: with well-prepared input even simple algorithm can achieve great result, and without it --- it’s hard to get something meaningful even using the most sophisticated models (remember concept of "[garbage in — garbage out](https://en.wikipedia.org/wiki/Garbage_in,_garbage_out)").

Usually, specific preparation of data for ML modeling can be considered as part of [ETL](https://en.wikipedia.org/wiki/Extract,_transform,_load) process and consists of following steps:

* **feature engineering**: transformation of raw data into proper features, that can be useful for modeling; sometimes, when original data is complex enough (e. g. text, images) this process is also called *feature extraction, feature preparation*.
* **feature selection**: removing unnecessary features (usually it can help to improve model quality/performance/etc).


In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OrdinalEncoder, OneHotEncoder
from sklearn.decomposition import PCA

from sklearn.feature_selection import VarianceThreshold, SelectFromModel, RFECV, SequentialFeatureSelector

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.metrics import accuracy_score

from sklearn.datasets import make_classification, load_wine, load_breast_cancer, load_diabetes

ImportError: cannot import name 'SequentialFeatureSelector' from 'sklearn.feature_selection' (C:\Users\annra\anaconda3\lib\site-packages\sklearn\feature_selection\__init__.py)

In [None]:
plt.style.use('seaborn-darkgrid')

In [None]:
def plot_scatter(x, y, auto_scaled=True, title=None, clusters=None):
    plt.figure(figsize=(4, 4))
    plt.scatter(x, y)
    
    if not auto_scaled:
        plt.axis('square')
    
    plt.grid(True)
    plt.title(title)
    
    plt.show()
    
def return_X_y(data, target_column):
    return data.drop(target_column, axis=1), data[target_column]

# Feature Engineering

## Missing Values Preprocessing

In [None]:
housing_data = pd.read_csv('Melbourne_housing_FULL.csv')
# prepare dataset for price regression
housing_data = housing_data[~housing_data['Price'].isnull()]

Missing values are one of the most common problems you can encounter when you try to prepare your data for machine learning. The reason for the missing values might be human errors, interruptions in the data flow, privacy concerns, and so on. Whatever is the reason, missing values affect the performance of the machine learning models (most of the algorithms do not even accept datasets with missing values).

First let's check the amount of missing value in each column of our dataset:

In [None]:
housing_data.isnull().mean() # housing_data.isnull().sum() to get absolute numbers

The most simple strategy is to drop entire rows and/or columns containing missing values based on some threshold (for example, if column contains more than *30%* --- drop it, then drop all rows that still contains some NaN's).

In [None]:
threshold = 0.3
housing_data_dropped = housing_data[housing_data.columns[housing_data.isnull().mean() < threshold]]
housing_data_dropped = housing_data_dropped.dropna(axis=0, how='any') # params is optinal here (matching defaults)
print(f'Original dataset shape (rows, cols): {housing_data.shape}')
print(f'Dataset shape (rows, cols) after dropna: {housing_data_dropped.shape}')

In general dropping data without additional investigation is not a good approach in most cases since you lose a lot of potentially useful information. For this particular dataset we've fully dropped `Landsize`, `BuildingArea` columns (which actually seem like strong features from common sense). 

Usually a better strategy is to impute the missing values, i.e., to infer them from the known part of the data. However, there is an important selection of what you impute to the missing values. You can use default value of missing values in the column. For example, if you have a column that only has `1` and `N\A`, then it is likely that the `N\A` rows may be considered as `0`. 
Another way is to use basic statistics (like *mean* and *medians* of the columns) for imputation.

In [None]:
# const imputing
housing_data_const = housing_data.fillna(value=0)

# mean imputing
housing_data_mean = housing_data.fillna(housing_data.mean())

There are also some advanced technics [KNN Imputation](), [Multivariate imputation]().

But commonly the most beneficial way is to dig deeper in available data, understand root cases of the problem and develop mixed strategy (for separate features based on investigation results). **Subject matter expertise rules!**

For example, one of the questions you may ask yourself to help figure this out is this: 

`Is this value missing because it wasn't recorded or because it doesn’t exist?`

If the value is missing because it doesn’t exist (like the height of the oldest child of someone who doesn't have any children) then it doesn't make sense to try and guess what it might be. These values you probably do want to mark this value using some special tag (or create separate bool feature). On the other hand, if a value is missing because it wasn't recorded, then you may probably use some of the imputation technics mentioned above or even more sophisticated ones.


## Feature scaling

In [3]:
wine_sklearn = load_wine(as_frame=True)
wine_data, wine_labels = wine_sklearn['data'], wine_sklearn['target']
wine_data

NameError: name 'load_wine' is not defined

In real world datasets you can often see multiple features spanning varying degrees of magnitude, range, and units. This is a significant obstacle as a lot of machine learning algorithms are highly sensitive to such things.

To make it simple: algorithm just sees number and does not know what that number represents --- if there is a vast difference in the range say few ranging in thousands and few ranging in dozens, it makes the underlying assumption that higher ranging numbers have superiority of some sort. So, these more significant number starts playing a more decisive role while training the model.

For example, you might be looking at the prices of some products in both Yen and US Dollars. One US Dollar is worth about 100 Yen, but if you don't scale your prices methods like SVM or KNN will consider a difference in price of 1 Yen as important as a difference of 1 US Dollar! This clearly doesn't fit with our intuitions of the world. With currency, you can convert between currencies. But what about if you're looking at something like height and weight? It's not entirely clear how many pounds should equal one inch (or how many kilograms should equal one meter).

By scaling your variables, you can help compare different variables on equal footing (scale).

### Standartization

**Standardization** of datasets is a common requirement for many machine learning models. The idea is to transform the data to the center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.

$$scaled\_X = \frac{X - mean(X)}{std(X)}$$, where $X$ is **feature column** (not dataset itself!)

A common approach is to use `StandardScaler` from `sklearn`:


In [None]:
scaler = StandardScaler()
wine_data_scaled = scaler.fit_transform(wine_data)
wine_data_scaled

Let's illustrate the influence of scaling on [PCA](https://en.wikipedia.org/wiki/Principal_component_analysis):

In [None]:
pca = PCA(n_components=2)

wine_data_pca = pca.fit_transform(wine_data)
wine_data_scaled_pca = pca.fit_transform(wine_data_scaled)

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(18, 10))

for l, c, m in zip(range(0, 3), ('blue', 'red', 'green'), ('^', 's', 'o')):
    ax1.scatter(wine_data_pca[wine_labels == l, 0], wine_data_pca[wine_labels == l, 1], 
                color=c, label=f'class {l}', alpha=0.5, marker=m)

for l, c, m in zip(range(0, 3), ('blue', 'red', 'green'), ('^', 's', 'o')):
    ax2.scatter(wine_data_scaled_pca[wine_labels == l, 0], wine_data_scaled_pca[wine_labels == l, 1], 
                color=c, label=f'class {l}', alpha=0.5, marker=m)
    
ax1.set_title('Dataset after PCA')
ax2.set_title('Standardized dataset after PCA')

for ax in (ax1, ax2):
    ax.set_xlabel('1st principal component')
    ax.set_ylabel('2nd principal component')
    ax.legend(loc='upper right')

### Normalization

An alternative standardization is scaling features to lie between a given minimum and maximum value, often between zero and one, or so that the maximum absolute value of each feature is scaled to unit size (also known as **Normalization**.  This can be achieved using `MinMaxScaler` or `MaxAbsScaler` from `sklearn`, respectively.

The motivation to use this scaling include robustness to very small standard deviations of features and preserving zero entries in sparse data.

$$normalised\_X = \frac{X - min(X)}{max(X) - min(X)}$$, where $X$ is **feature column** (not dataset itself!)

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit_transform(wine_data)

## Log/Power Transform

Log transformation is a data transformation method in which it replaces each variable $x$ with a $log(x)$. The choice of the logarithm base is usually left up to the analyst and it would depend on the purposes of statistical modeling.

When our original continuous data do not follow the bell curve, we can log transform this data to make it as “normal” as possible so that the statistical analysis results from this data become more valid. In other words, the log transformation reduces or removes the skewness of our original data. The important caveat here is that the original data has to approximately follow a *log-normal distribution*. Otherwise, you can't expect any guarantees that result distribution will be close to normal (but even in such cases log transform can help to improve you scores).

In [None]:
mu, sigma = 5, 1
lognorm_data = np.random.lognormal(mu, sigma, 1000)

In [None]:
plt.figure(figsize=(16,8))
sns.histplot(lognorm_data, stat='probability')
plt.show()

In [None]:
plt.figure(figsize=(16,8))
sns.histplot(np.log(lognorm_data), stat='probability')
plt.show()

This may sound a bit odd: is it even possible to meet something specific like "log-normal distribution" in real life?

Well, let's plot the price column from Melbourne housing dataset, that we used previously:

In [None]:
plt.figure(figsize=(16,8))
sns.histplot(housing_data['Price'], stat='probability')
plt.show()

Seems familiar!

Eventually, lognormal distribution of some value in real world is quite common (just like normal distribution). It is suitable for describing length of comments, posted in the internet; the salaries amount; the population of cities and many other things. You may find some [more](https://en.wikipedia.org/wiki/Log-normal_distribution#Occurrence_and_applications) examples just on the wikipedia page.

However, to get some profit from this transformation, the distribution does not necessarily have to be *exactly* lognormal; you can try to apply it to any distribution with a heavy right tail. Furthermore, one can try to use other similar transformations, formulating their own hypotheses on how to approximate the available distribution to a normal. Examples of such transformations are Box-Cox transformation (log is a special case of the Box-Cox transformation) or Yeo-Johnson transformation (extends the range of applicability to negative numbers). Some information about these transformations and their implementations in `sklean` can be found [here](https://scikit-learn.org/stable/modules/preprocessing.html#non-linear-transformation).

## Categorical Features Encoding

Quite often features are not given as continuous values but categorical. For example a person could have features `["male", "female"], ["from Europe", "from US", "from Asia"], ["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]`. Such features can be efficiently coded as integers, for instance `["male", "from US", "uses Internet Explorer"]` could be expressed as `[0, 1, 3]` while `["female", "from Asia", "uses Chrome"]` would be `[1, 2, 1]`.

To convert categorical features to such integer codes, we can use the *ordinal encoding*. It transforms each categorical feature to a range of integers (0 to number of categories - 1).

In [None]:
X = [['male', 'US', 'Safari'], ['female', 'Europe', 'Firefox'], ['male', 'Europe', 'Opera']]
pd.DataFrame(X, columns=['gender', 'place', 'browser'])

In [None]:
encoder = OrdinalEncoder()
ordinal_encoded_X = encoder.fit_transform(X)

Such integer representation can, however, can be unsuitable, for a lot of models: these expect continuous input, and would interpret the categories as being ordered, which is often not desired.

Another possibility to convert categorical features to features that can be used with scikit-learn estimators is to use *one-hot* encoding. The idea is to transforms each categorical feature, that has $n$ different possible categories, into $n$ separate binary features (whether the object belongs to specific category or not).

In [None]:
encoder = OneHotEncoder()
ohe_encoded_X = encoder.fit_transform(X).toarray()

In [None]:
pd.DataFrame(ohe_encoded_X, columns=encoder.get_feature_names())

It is also possible to encode each column into  $n - 1$ columns instead of $n$ columns by using the drop parameter (also called *dummy encoding*). This is useful to avoid co-linearity in the input matrix in some classifiers. Such functionality is useful, for example, when using non-regularized regression, since co-linearity would cause the covariance matrix to be non-invertible.

You can read about some advanced technics [here](https://www.analyticsvidhya.com/blog/2020/08/types-of-categorical-data-encoding). However, most of them equivalent to one hot encoding to some degree.


# Feature Selection

Why is it sometimes necessary to select just subset of features and not all at once? The idea of removing features may seem a little counterintuitive, but there is some import motivation here:

1) First is more connected to engeneering side: the more data, the higher the computational complexity. Removing some unimportant and noisy features can help a lot here.    
2) The second reason is related to algorithms side: some models can be unstable when data have highly correlated features ([multicolinearity](https://datascience.stackexchange.com/questions/24452/in-supervised-learning-why-is-it-bad-to-have-correlated-features)), some --- when data is noisy. 

In [None]:
cancer_sklearn = load_breast_cancer(as_frame=True)
cancer_data, cancer_labels = cancer_sklearn['data'], cancer_sklearn['target']
cancer_data_scaled = StandardScaler().fit_transform(cancer_data)
cancer_data

## Statistical Approaches

The most obvious candidate for removal is a feature whose value remains unchanged, i.e., it contains no information at all. If we build on this thought, it is reasonable to say that features with low variance are worse than those with high variance. So, one can consider cutting features with variance below a certain threshold.

In [None]:
X_generated, y_generated = make_classification(n_samples=1000, n_features=25, n_informative=3,
                                                         n_redundant=2, n_repeated=0)
X_generated.shape

In [None]:
print(VarianceThreshold(0.9).fit_transform(X_generated).shape)
print(VarianceThreshold(1).fit_transform(X_generated).shape)
print(VarianceThreshold(1.1).fit_transform(X_generated).shape)

Keep in mind that we are using absolute value as threshold, so in real world scenario it is necessary to bring all the features to same scale (perform scaling before thresholding).

Personally, I won't recommend using `VarianceTreshold` unless you are completely sure that it's needed and won't make things worse: the low variance does not necessarily mean that feature is not informative. You can also try [other](https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection) a little bit more advanced statistical approaches.

## Selection From Modeling


Basically, the idea is to use some model as an feature importance estimator: for example, we can use linear model with `Lasso` regularization (and feature weights from it) or some tree based models (which have natural ability to compute feature importance). Then, based on received importance/weights we can choose some threshold and take features, that have importance above this value.

In [None]:
selection_model = RandomForestClassifier(random_state=42)
selector = SelectFromModel(selection_model).fit(cancer_data, cancer_labels)
cancer_data_pruned = selector.transform(cancer_data)
print(cancer_data.columns[selector.get_support()])
print(f'Original shape: {cancer_data.shape}')
print(f'Shape after selection: {cancer_data_pruned.shape}')

In [None]:
main_model = LogisticRegression(solver='liblinear', penalty='l1')
pipe_baseline = make_pipeline(StandardScaler(), main_model)
pipe_selection = make_pipeline(StandardScaler(), SelectFromModel(selection_model), main_model) # fix to select only once

print('Result on original data: {:f}'.format(cross_val_score(pipe_baseline, cancer_data, cancer_labels, 
                      scoring='accuracy', cv=5).mean()))

print('Result after selection {:f}'.format(cross_val_score(pipe_selection, cancer_data, cancer_labels, 
                      scoring='accuracy', cv=5).mean()))

We were able to reduce the number of features significantly, but, as you can see, stable performance is not guaranteed.

It's also possible to use same model as an importance estimator and actual classifier (regressor).
As a development of this approach we can consider recursive feature elimination: first, the model is trained on the initial set of features and the importance of each feature is obtained. Then, the least important features are pruned from current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.

In [None]:
min_features_to_select = 1 
rfecv = RFECV(estimator=main_model, step=1, cv=KFold(3), 
              scoring='accuracy', min_features_to_select=min_features_to_select)
rfecv.fit(cancer_data_scaled, cancer_labels)

print("Optimal number of features : %d" % rfecv.n_features_)


In [None]:
plt.figure(figsize=(16,8))
plt.plot(range(min_features_to_select,
               len(rfecv.grid_scores_) + min_features_to_select),
         rfecv.grid_scores_)
plt.show()

## Greedy (Sequential) Feature Selection 

Finally, we get to the most reliable method --- trivial brute force: just test all possible subsets of features (train a model on a subset of features, store results, repeat for different subsets, and compare the quality of models to identify the best feature set). This approach is called [Exhaustive Feature Selection](http://rasbt.github.io/mlxtend/user_guide/feature_selection/ExhaustiveFeatureSelector).

However, usually this method is too computationally complex to use for some real word dataset (it's even not available in scikit-learn). To reduce complexity one can the following *greedy* heuristic:  tart with zero feature and find the one feature that maximizes a cross-validated score when the model is trained on this single feature. Once that first feature is selected, we repeat the procedure by adding a new feature to the set of selected features. It is possible to iterate until we hit (preselected) maximum number of features or until the quality of the model ceases to increase significantly between iterations.

This algorithm can work in the opposite direction: instead of starting with no feature and greedily adding features, we start with all the features and greedily remove features from the set.

In [None]:
selector = SequentialFeatureSelector(main_model, scoring='accuracy', n_jobs=-1).fit(cancer_data_scaled, cancer_labels)
cancer_data_scaled_pruned = selector.transform(cancer_data_scaled)

print(cancer_data.columns[selector.get_support()])
print(f'Original shape: {cancer_data.shape}')
print(f'Shape after selection: {cancer_data_pruned.shape}\n')

print('Result on original data: {:f}'.format(cross_val_score(main_model, cancer_data_scaled, 
                                                           cancer_labels, scoring='accuracy', cv=5).mean()))

print('Result after selection {:f}'.format(cross_val_score(main_model, cancer_data_scaled_pruned, 
                                                        cancer_labels, scoring='accuracy', cv=5).mean()))

# Homework

## Exercise  1 - Scaling (3 points)

Perform standardization for wine dataset (`wine_data`) using only basic python, numpy and pandas (without using `StandardScaler` and sklearn at all). Implementation of function (or class) that can get dataset as input and return standardized dataset as output is preferrable, but not necessary.

Compare you results (output) with `StandardScaler`.

**NOTE:**

1) 1.5 points is for correct wine dataset standardization and another 1.5 points is for implementation of standardization function, that is working in more general case.

2) "General case" doesn't mean, that you need to handle some/all really "specific" cases (datasets with missing/categorial variables, very large dataset, etc). Let's assume that it should work with numeric datasets of reasonable shape: showing the output for one or two randomly generated 10x10 dataset and comparing the results with `StandardScaler` should be enough (or you can be more creative).



In [None]:
## your code
from pandas._testing import assert_frame_equal

def scaling_func(df):
     return df.apply(lambda x: (x - np.mean(x))/np.std(x), axis=0)

wine_data_func_scaled = scaling_func(wine_data)
wine_data_standart_scaled = pd.DataFrame(wine_data_scaled, columns = wine_data_func_scaled.columns)
np.mean(np.abs(wine_data_func_scaled - wine_data_standart_scaled))

Average differences between results of scaling show that results differ by a small number.

Showing the output randomly generated 10x10 dataset:

In [None]:
np.random.seed(123)
random_matr = np.column_stack(np.random.rand(10, 10)*np.random.randint(0, 10, size=[10, 10]))
random_df = pd.DataFrame(data=random_matr,
            index=[ it_col for it_col in np.array(range(0, random_matr.shape[0]))],
            columns=[ it_col for it_col in np.array(range(0, random_matr.shape[1]))])
random_df

In [None]:
random_df_func_scaled = scaling_func(random_df)
random_df_standart_scaled =  pd.DataFrame(scaler.fit_transform(random_df))
np.mean(np.abs(random_df_func_scaled - random_df_standart_scaled))

On random dataframe results difference is more major.

In [None]:
cancer_data, cancer_labels = cancer_sklearn['data'], cancer_sklearn['target']
cancer_data_scaled = StandardScaler().fit_transform(cancer_data)
cancer_data_func_scaled = scaling_func(cancer_data)
cancer_data_standart_scaled = pd.DataFrame(cancer_data_scaled, columns = cancer_data_func_scaled.columns)
np.mean(np.abs(cancer_data_func_scaled - cancer_data_standart_scaled))

## Exercise  2 - Visualization (4 points)

As noted earlier, standardization/normalization of data can be crucial for some distance-based ML methods.

Let’s generate some toy example of unnormalized data and visualize the importance of this process once more:

In [None]:
feature_0 = np.random.randn(1000) * 10   
feature_1 = np.concatenate([np.random.randn(500), np.random.randn(500) + 5])
data = np.column_stack([feature_0, feature_1])
data 

In [None]:
plot_scatter(data[:, 0], data[:, 1], auto_scaled=True, title='Data (different axes units!)')

**NOTE:** on the plot above axes are scaled differently and we can clearly see two potential *classes/clusters*. In fact `matplotlib` performed `autoscaling` (which is basically can be considered as `MinMaxScaling` of original data) just for better visualization purposes.

Let's turn this feature off and visualize the original data on the plot with equally scaled axes:

In [None]:
plot_scatter(data[:, 0], data[:, 1], auto_scaled=False , title='Data (equal axes units!)')

This picture is clearly less interpretable, but much closer to "how distance-based algorithm see the original data": separability of data is hardly noticable only because the variation (std) of x-feature is much bigger in absolute numbers.

Perform `StandardScaling` and `MinMaxScaling` of original data; visualize results for each case (**use `plot_scatter` with `auto_scaled=False`**):

### MinMaxScaling (1 point)

In [None]:
## your code
minmaxscaler = MinMaxScaler()
data_minmax_scaled = pd.DataFrame(scaler.fit_transform(data))

plot_scatter(data_minmax_scaled[0], data_minmax_scaled[1], auto_scaled=False , title='Data (equal axes units!)')

### StandardScaler (1 point)

In [None]:
## your code
data_standart_scaled = pd.DataFrame(scaler.fit_transform(data))

plot_scatter(data_standart_scaled[0], data_standart_scaled[1], auto_scaled=False , title='Data (equal axes units!)')

### (Bonus) K-means (2 points)

Illustrate the impact of scaling on basic distance-based clustering algorithm [K-means](https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1) using `data` generated above.

**NOTE:** basically, you don't need understanding K-means algorithm here, you just need to:

1) run algorithm (with k=2, k - number of clusters/classes) on unscaled data    
2) run algorithm (with k=2) on scaled data    
3) plot results: highlight different clusters using different colors.

You can use this [question](https://stats.stackexchange.com/questions/89809/is-it-important-to-scale-data-before-clustering/89813) as a hint, but I recommend you to plot results using `plot_scatter` with `equal_scaled=True`: it might help you to intuitively understand the reasons of such scaling impact.


In [None]:
## your code
from sklearn.cluster import KMeans

Kmean = KMeans(n_clusters=2)
Kmean.fit(data)
Kmean.predict(data)
y_kmeans = Kmean.predict(data)

plt.scatter(data[:, 0], data[:, 1], c=y_kmeans, s=50, cmap='viridis')

centers = Kmean.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5);

In [None]:
Kmean.fit(data_minmax_scaled)
Kmean.predict(data_minmax_scaled)
y_kmeans_minmax = Kmean.predict(data_minmax_scaled)

fig = plt.figure(figsize = (10,10))
ax.autoscale(False, tight=False)

plt.scatter(data_minmax_scaled[0], data_minmax_scaled[1], c=y_kmeans_minmax, s=50, cmap='viridis')

centers = Kmean.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5);

In [None]:
Kmean.fit(data_minmax_scaled)
Kmean.predict(data_standart_scaled)
y_kmeans_standart = Kmean.predict(data_standart_scaled)

fig = plt.figure(figsize = (10,10))
ax.autoscale(False, tight=False)

plt.scatter(data_standart_scaled[0], data_standart_scaled[1], c=y_kmeans_standart, s=50, cmap='viridis')

centers = Kmean.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5);

## Exercise  3 - Preprocessing Pipeline (3 points)

In [None]:
wine_train, wine_val, wine_labels_train, wine_labels_val = train_test_split(wine_data, wine_labels, 
                                                                            test_size=0.4, random_state=42)

Train model (for example, `LogisticRegression(solver='liblinear', penalty='l1')` on raw `wine_train` data; then train same model after data scaling; then add feature selection (and train model again on scaled data).

Measure `accuracy` of all 3 approaches on `wine_val` dataset. Describe and explain results.

In [1]:
## your code
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

LR = LogisticRegression(solver='liblinear', penalty='l1')

LR.fit(wine_train, wine_labels_train)
print(LR.predict(wine_val))
print('Accuracy for raw data: ' + str(accuracy_score(LR.predict(wine_val), wine_labels_val)))

wine_train_standart_scaled = pd.DataFrame(scaler.fit_transform(wine_train))
wine_val_standart_scaled = pd.DataFrame(scaler.fit_transform(wine_val))
LR.fit(wine_train_standart_scaled, wine_labels_train)
print('Accuracy for data with Standart Scaling: ' + str(accuracy_score(LR.predict(wine_val_standart_scaled), wine_labels_val)))

wine_train_minmax_scaled = pd.DataFrame(minmaxscaler.fit_transform(wine_train))
wine_val_minmax_scaled = pd.DataFrame(minmaxscaler.fit_transform(wine_val))
LR.fit(wine_train_minmax_scaled, wine_labels_train)
print('Accuracy for data with MinMax Scaling: ' + str(accuracy_score(LR.predict(wine_val_minmax_scaled), wine_labels_val)))


NameError: name 'wine_train' is not defined

In [267]:
selection_model = RandomForestClassifier(random_state=42)
selector = SelectFromModel(selection_model).fit(wine_train, wine_labels_train)
wine_train_pruned = selector.transform(wine_train)
cols = wine_train.columns[selector.get_support()]
LR.fit(wine_train[cols], wine_labels_train)
print('Accuracy for raw data with feature selection: ' + str(accuracy_score(LR.predict(wine_val[cols]), wine_labels_val)))

selector = SelectFromModel(selection_model).fit(wine_train_standart_scaled, wine_labels_train)
wine_train_standart_scaled_pruned = selector.transform(wine_train_standart_scaled)
cols = wine_train_standart_scaled.columns[selector.get_support()]
LR.fit(wine_train_standart_scaled[cols], wine_labels_train)
print('Accuracy for data with Standart Scaling with feature selection: ' + str(accuracy_score(LR.predict(wine_val_standart_scaled[cols]), wine_labels_val)))

selector = SelectFromModel(selection_model).fit(wine_train_minmax_scaled, wine_labels_train)
wine_train_minmax_scaled_pruned = selector.transform(wine_train_minmax_scaled)
cols = wine_train_minmax_scaled.columns[selector.get_support()]
LR.fit(wine_train_minmax_scaled[cols], wine_labels_train)
print('Accuracy for data with MinMax Scaling with feature selection: ' + str(accuracy_score(LR.predict(wine_val_minmax_scaled[cols]), wine_labels_val)))

Accuracy for raw data with feature selection: 0.9583333333333334
Accuracy for data with Standart Scaling with feature selection: 0.9583333333333334
Accuracy for data with MinMax Scaling with feature selection: 0.9583333333333334


# Materials & References

1. General article about feature engineering and selection (main reference):
https://github.com/Yorko/mlcourse.ai/blob/master/jupyter_english/topic06_features_regression/topic6_feature_engineering_feature_selection.ipynb


2. Feature engineering/preprocessing, using scikit-learn API (great code examples, but really brief explanation):    
https://scikit-learn.org/stable/modules/preprocessing


3. Feature scaling/normalization:     
https://towardsdatascience.com/all-about-feature-scaling-bcc0ad75cb35


4. Log Transform/power transform:    
https://medium.com/@kyawsawhtoon/log-transformation-purpose-and-interpretation-9444b4b049c9


6. Missing values preprocessing using scikit-learn API (great code examples, great explanation):    
https://scikit-learn.org/stable/modules/impute.html


7. Feature selection scikit-learn API (great code examples, great explanation):   
https://scikit-learn.org/stable/modules/feature_selection.html


8. Melbourne housing dataset source:    
https://www.kaggle.com/anthonypino/melbourne-housing-market