# Feature engineering and feature selection

In [1]:
import os
import string

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import RFE, RFECV
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression, Ridge, RidgeCV
from sklearn.metrics import make_scorer
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    ShuffleSplit,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import (
    OneHotEncoder,
    OrdinalEncoder,
    PolynomialFeatures,
    StandardScaler,
)
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVC, SVR

%matplotlib inline

## 1: Feature engineering 
<hr>

One of the most important factors influencing the performance of machine learning models is the set of features used to represent the problem. If the underlying representation is bad, no fancy model is going to help. With a better feature representation, a simpler and more interpretable model is likely to perform reasonably well.



**Feature engineering** is the process of transforming raw data into features that better represent the underlying problem to the predictive models. 

### 1.1 The data

In this exercise we'll engineer our own features on [the Disaster Tweets dataset](https://www.kaggle.com/vstepanenko/disaster-tweets). We are trying to predict whether a tweet is about a real disaster or not.

Note that coming up with features is difficult, time-consuming, and requires expert knowledge. Since we'll be using simplistic features, we might not get better scores with our engineered features. The purpose here is to become familiar with the process of feature engineering rather than to get the best scores.


In [2]:
df = pd.read_csv("../downloads/tweets.csv", usecols=["keyword", "text", "target", "location"])
train_df, test_df = train_test_split(df, test_size=0.2, random_state=2)
train_df.head()

Unnamed: 0,keyword,location,text,target
3289,debris,,"Unfortunately, both plans fail as the 3 are im...",0
2672,crash,SLC,I hope this causes Bernie to crash and bern. S...,0
2436,collide,,—pushes himself up from the chair beneath to r...,0
9622,suicide%20bomb,,Widow of CIA agent killed in 2009 Afghanistan ...,1
8999,screaming,Azania,As soon as God say yes they'll be screaming we...,0


In [3]:
X_train, y_train = train_df.drop(columns=["target"]), train_df["target"]
X_test, y_test = test_df.drop(columns=["target"]), test_df["target"]

- The class proportions computed below show a class imbalance: about 81% of the training tweets are labeled 0 (not a disaster) and about 19% are labeled 1 (disaster). A gap this large needs to be dealt with.
- Assuming the model is used in a real situation (where we want to send out forces to help victims before it is too late), high precision is very important: of the tweets that we label "disaster", we want as many as possible to be reporting a disaster in reality. I will also be using the F1 score to make sure I don't get really low recall scores.
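
To make the two metrics concrete, here is a minimal sketch on ten made-up labels (not the tweet data). It computes precision, recall, and F1 by hand and shows that a classifier which always predicts the majority class can reach high accuracy while its F1 score is zero. The helper name `precision_recall_f1` and the toy labels are illustrative assumptions.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels with an 80% / 20% split between the classes
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A model that always predicts the majority class
always_negative = [0] * len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
majority_scores = precision_recall_f1(y_true, always_negative)

# A model with one true positive, one false positive and one false negative
some_model = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
model_scores = precision_recall_f1(y_true, some_model)

print(accuracy, majority_scores, model_scores)
```

This is why accuracy alone is a poor guide on imbalanced data, and why precision and F1 are tracked here.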

In [4]:
y_train.value_counts(normalize=True)

0    0.812995
1    0.187005
Name: target, dtype: float64

#### 1.1.1 Setting scoring metrics to be used

In [5]:
scoring_metrics = ["precision", "f1"]

<br><br>

### 1.2 The location feature

The location feature seems quite messy. I see two challenges in encoding it.

1. The ratio of NAs is high, and the values look like free text without any structure. For example, some observations give the name of a city while others give a country or a state. Some locations also contain emojis, which would have to be translated to text.
2. More than a third of the values are unique. To reduce the number of unique values, I would use some wrangling and the built-in string-matching functionality of pandas to replace the names of cities and states with countries (using a list of cities and countries from another dataframe). I would also use a library like spacymoji to translate emojis to text, to see whether any flags can be translated to country names. After extracting as many countries as I could, I would use `OneHotEncoder` and provide only the categories of the 192 countries. Alternatively, instead of the last step, I could use `CountVectorizer` with an `ngram_range` of 1 to 3 (because the names of some countries consist of multiple words) and use `min_df` to ignore locations whose frequency falls below a threshold that I specify, removing the ones that appear only once or twice.
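
The wrangling idea can be sketched roughly as follows. The lookup table `PLACE_TO_COUNTRY` and the helper `to_country` are hypothetical; a real table would come from a proper cities-and-countries dataset, and emoji handling is left out.

```python
import pandas as pd

# Hypothetical lookup table (an assumption for illustration only)
PLACE_TO_COUNTRY = {
    "united states": "United States",
    "usa": "United States",
    "ca": "United States",
    "uk": "United Kingdom",
    "england": "United Kingdom",
    "ghana": "Ghana",
    "nigeria": "Nigeria",
}


def to_country(location):
    """Map a free-text location to a country, or "unknown" if nothing matches."""
    if pd.isna(location):
        return "unknown"
    # Check comma-separated parts from right to left,
    # so "Accra, Ghana" is matched on "ghana" first.
    for part in reversed(str(location).split(",")):
        country = PLACE_TO_COUNTRY.get(part.strip().lower())
        if country is not None:
            return country
    return "unknown"


locations = pd.Series(
    [
        None,
        "Accra, Ghana",
        "Lagos, Nigeria",
        "Rohnert Park, CA",
        "London, England",
        "UK",
        "Hell,Hades,Mictlan,Tartarus",
    ]
)
countries = locations.apply(to_country)
print(countries.value_counts())
```

Collapsing many spellings into a few country labels keeps the number of one-hot columns manageable, at the cost of maintaining the lookup table.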

In [6]:
X_train['location'].nunique()

3746

In [7]:
X_train['location'].unique()[:10]

array([nan, 'SLC', 'Azania', 'United States',
       'Amphoe Mueang Nakhon Ratchasim', 'Accra, Ghana', 'Lagos, Nigeria',
       'Rohnert Park, CA', 'Brighton', 'Hell,Hades,Mictlan,Tartarus'],
      dtype=object)

In [8]:
X_train['location'].value_counts()

United States                     80
Australia                         68
London, England                   66
UK                                62
India                             60
                                  ..
Arizona City, AZ                   1
Yorkshire & Scotland               1
th: hakuna matata                  1
Tacloban City, Eastern Visayas     1
Greater Manchester                 1
Name: location, Length: 3746, dtype: int64

In [9]:
print(f"{round(100 * sum(X_train['location'].isna()) / X_train.shape[0], 1)}% of the values in the location column are missing.")

30.0% of the values in the location column are missing.


<br><br>

### 1.3 Identifying feature types

In preparation for building a classifier, we identify the different feature types and set up a column transformer that applies a sensible transformation to each of them.

> Note: for `CountVectorizer` transformer, we need to pass a 1-D array or a pandas.Series. So in a column transformer, we pass a string rather than a list of features for this transformer.

- I am dropping the `location` column, as it is in a free-text format and, without the cleaning described above, does not provide useful information.
- I will make sure that the number of features does not grow beyond the number of observations. I will use a `CountVectorizer` for the `text` column and set `max_features` to limit the number of columns. If left unspecified, it would give me more than 23,000 feature columns.
- For the `keyword` column I will use a `OneHotEncoder`, as it is a categorical column with about 200 unique values.
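
The following toy example sketches these choices on a made-up three-row dataframe (an assumption, not the tweet data). The text column is given as a string so that `CountVectorizer` receives a 1-D Series, the categorical column is given as a list, and `max_features` caps the vocabulary size. The unlisted `location` column is dropped because `remainder="drop"` is the default.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

toy_df = pd.DataFrame(
    {
        "text": ["fire in the forest", "forest fire spreading", "love this song"],
        "keyword": ["fire", "fire", "music"],
        "location": ["SLC", None, "Accra, Ghana"],
    }
)

toy_preprocessor = make_column_transformer(
    (CountVectorizer(), "text"),  # string: the transformer gets a 1-D Series
    (OneHotEncoder(handle_unknown="ignore"), ["keyword"]),  # list: 2-D input
)
toy_X = toy_preprocessor.fit_transform(toy_df)

# Same setup, but keep only the 3 most frequent words
capped_preprocessor = make_column_transformer(
    (CountVectorizer(max_features=3), "text"),
    (OneHotEncoder(handle_unknown="ignore"), ["keyword"]),
)
toy_X_capped = capped_preprocessor.fit_transform(toy_df)

print(toy_X.shape, toy_X_capped.shape)
```

The first matrix has 8 word columns plus 2 keyword columns. The capped version keeps 3 word columns plus the same 2 keyword columns.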

In [10]:
X_train.head()

Unnamed: 0,keyword,location,text
3289,debris,,"Unfortunately, both plans fail as the 3 are im..."
2672,crash,SLC,I hope this causes Bernie to crash and bern. S...
2436,collide,,—pushes himself up from the chair beneath to r...
9622,suicide%20bomb,,Widow of CIA agent killed in 2009 Afghanistan ...
8999,screaming,Azania,As soon as God say yes they'll be screaming we...


In [11]:
drop = ["location"]
text_feature = "text"  # Passed as a string rather than a list
categorical_feature = ["keyword"]

In [12]:
preprocessor = make_column_transformer(
    (CountVectorizer(stop_words="english", max_features=3000), text_feature),
    (OneHotEncoder(handle_unknown="ignore",sparse=False), categorical_feature)
)

In [13]:
preprocessor.fit_transform(X_train)

<9096x3219 sparse matrix of type '<class 'numpy.float64'>'
	with 68911 stored elements in Compressed Sparse Row format>

In [14]:
preprocessor.named_transformers_["countvectorizer"].get_feature_names_out()

array(['00', '000', '01', ..., 'zip', 'zone', '하윤빈'], dtype=object)

<br><br>

### 1.4 DummyClassifier - Baseline score


In [15]:
results = {}

In [16]:
def mean_std_cross_val_scores(model, X_train, y_train, **kwargs):
    """
    Returns mean and std of cross validation

    Parameters
    ----------
    model :
        scikit-learn model
    X_train : numpy array or pandas DataFrame
        X in the training data
    y_train :
        y in the training data

    Returns
    ----------
        pandas Series with "mean (+/- std)" strings for each cross_validate output
    """

    scores = pd.DataFrame(cross_validate(model, X_train, y_train, **kwargs))

    mean_scores = scores.mean()
    std_scores = scores.std()

    # Pair the values positionally; integer indexing on a labelled Series
    # (mean_scores[i]) is deprecated in recent pandas versions.
    out_col = [
        "%0.3f (+/- %0.3f)" % (mean, std) for mean, std in zip(mean_scores, std_scores)
    ]

    return pd.Series(data=out_col, index=mean_scores.index)

In [17]:
dummy_pipe = make_pipeline(preprocessor, DummyClassifier())

results["Dummy Classifier"] = mean_std_cross_val_scores(
    dummy_pipe, X_train, y_train, scoring=scoring_metrics
)
pd.DataFrame(results)


Unnamed: 0,Dummy Classifier
fit_time,0.194 (+/- 0.016)
score_time,0.045 (+/- 0.004)
test_precision,0.000 (+/- 0.000)
test_f1,0.000 (+/- 0.000)


<br><br>

### 1.5 Logistic regression

Now we try a logistic regression classifier with the same scoring metrics and default hyperparameters.

In [18]:
lr_pipe = make_pipeline(preprocessor, LogisticRegression())
results["Logistic Regression"] = mean_std_cross_val_scores(
    lr_pipe, X_train, y_train, scoring=scoring_metrics
)
pd.DataFrame(results)

Unnamed: 0,Dummy Classifier,Logistic Regression
fit_time,0.194 (+/- 0.016),0.258 (+/- 0.016)
score_time,0.045 (+/- 0.004),0.039 (+/- 0.007)
test_precision,0.000 (+/- 0.000),0.771 (+/- 0.027)
test_f1,0.000 (+/- 0.000),0.619 (+/- 0.026)


<br><br>

### 1.6 Hyperparameter optimization 

We jointly tune the hyperparameters of logistic regression and `CountVectorizer`, and report the best hyperparameter values and the best cross-validation score.
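
Hyperparameters of nested steps are addressed with double-underscore names of the form `step__substep__parameter`. The sketch below rebuilds a pipeline of the same shape (the variable name `demo_pipe` is an assumption) and lists the matching names from `get_params()`, which is a convenient way to check the keys before writing a search space.

```python
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

demo_pipe = make_pipeline(
    make_column_transformer(
        (CountVectorizer(), "text"),
        (OneHotEncoder(handle_unknown="ignore"), ["keyword"]),
    ),
    LogisticRegression(),
)

# make_pipeline / make_column_transformer name each step after its
# lowercased class; nested parameters are joined with "__".
suffixes = ("__C", "__class_weight", "__max_features")
tunable = sorted(name for name in demo_pipe.get_params() if name.endswith(suffixes))

for name in tunable:
    print(name)
```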


In [19]:
param_distributions = {
    "logisticregression__C": np.logspace(-2, 2, 5),
    "logisticregression__class_weight": ["balanced", None],
    "columntransformer__countvectorizer__max_features": np.linspace(
        1000, 5000, 9, dtype=int
    ),
}

hyper_opt = RandomizedSearchCV(
    estimator=lr_pipe,
    param_distributions=param_distributions,
    scoring="f1",
    n_iter=20,
    cv=5,
    n_jobs=-1,
)

hyper_opt.fit(X_train, y_train)

RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('columntransformer',
                                              ColumnTransformer(transformers=[('countvectorizer',
                                                                               CountVectorizer(max_features=3000,
                                                                                               stop_words='english'),
                                                                               'text'),
                                                                              ('onehotencoder',
                                                                               OneHotEncoder(handle_unknown='ignore',
                                                                                             sparse=False),
                                                                               ['keyword'])])),
                                             ('logisticregression',
    

In [20]:
hyper_opt.best_params_

{'logisticregression__class_weight': 'balanced',
 'logisticregression__C': 0.1,
 'columntransformer__countvectorizer__max_features': 5000}

In [21]:
print(f"best f1 score is {round(hyper_opt.best_score_, 3)}.")

best f1 score is 0.643.


<br><br>

### 1.7: Feature engineering

In this section we will be extracting our own features that might be useful for this prediction task.

The code below adds some very basic length-related and sentiment features. We will be using a popular library called `nltk` for this section.

It creates three new features:

- Relative character length of the tweet.
- Number of words in the tweet.
- Sentiment of the tweet. In particular, we'll be using a metric called "compound score" representing the sentiment in the given tweet. (A score of -1 corresponds to most extreme negative and a score of +1 corresponds to most extreme positive.) This score is extracted using the [Vader lexicon](https://github.com/cjhutto/vaderSentiment), a pre-trained model for extracting the sentiment expressed in the tweets.

Below are a couple of examples of using this pre-trained model to get sentiment information for some sample sentences.

In [22]:
import nltk

nltk.download("vader_lexicon")
nltk.download("punkt")
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\artan\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\artan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [23]:
s = "Not bad at all."
print(sid.polarity_scores(s))

{'neg': 0.0, 'neu': 0.513, 'pos': 0.487, 'compound': 0.431}


In [24]:
s = "The plot was good, but the characters are uncompelling and the dialog is not great."
print(sid.polarity_scores(s))

{'neg': 0.327, 'neu': 0.579, 'pos': 0.094, 'compound': -0.7042}


In [25]:
def get_relative_length(text, TWITTER_ALLOWED_CHARS=280.0):
    """
    Returns the relative length of text.

    Parameters:
    ------
    text: (str)
    the input text

    Keyword arguments:
    ------
    TWITTER_ALLOWED_CHARS: (float)
    the denominator for finding relative length

    Returns:
    -------
    relative length of text: (float)

    """
    return len(text) / TWITTER_ALLOWED_CHARS


def get_length_in_words(text):
    """
    Returns the length of the text in words.

    Parameters:
    ------
    text: (str)
    the input text

    Returns:
    -------
    length of tokenized text: (int)

    """
    return len(nltk.word_tokenize(text))


def get_sentiment(text):
    """
    Returns the compound score representing the sentiment of the given text: -1 (most extreme negative) and +1 (most extreme positive)
    The compound score is a normalized score calculated by summing the valence scores of each word in the lexicon.

    Parameters:
    ------
    text: (str)
    the input text

    Returns:
    -------
    sentiment of the text: (float)
    """
    scores = sid.polarity_scores(text)
    return scores["compound"]

In [26]:
train_df = train_df.assign(n_words=train_df["text"].apply(get_length_in_words))
train_df = train_df.assign(vader_sentiment=train_df["text"].apply(get_sentiment))
train_df = train_df.assign(rel_char_len=train_df["text"].apply(get_relative_length))

test_df = test_df.assign(n_words=test_df["text"].apply(get_length_in_words))
test_df = test_df.assign(vader_sentiment=test_df["text"].apply(get_sentiment))
test_df = test_df.assign(rel_char_len=test_df["text"].apply(get_relative_length))

As an experiment, I am also creating some extra features. I flag whether a news agency might be involved by checking whether the tweet contains a link (`http`) or the word "news". I also check whether a hashtag has been used, since newsworthy events are often turned into hashtags. Finally, as an Iranian, I have seen the name Iran associated with disasters throughout recent history, so I will flag tweets that mention Iran to see if I can find something sensible.
> Note: As long as the created features are derived by applying a function to each individual row, without using the rest of the column, we are not violating the golden rule. In this section, `n_words`, `vader_sentiment`, and `rel_char_len` are computed from single values without considering the rest of the observations, so no information leaks from the test set into training.
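
The distinction can be illustrated with a small sketch on made-up tweets (the texts and variable names are assumptions). A row-wise feature can be computed on any split independently, whereas a feature that needs column statistics, such as a standardized length, must take those statistics from the training split only.

```python
import numpy as np

train_texts = [
    "Forest fire near the highway",
    "I love this song",
    "Flood warning issued for the valley tonight",
]
test_texts = ["Earthquake reported downtown"]


def relative_length(text, max_chars=280.0):
    """Row-wise feature: depends only on the tweet itself."""
    return len(text) / max_chars


# Safe on any split: no information is shared between rows
train_rel = [relative_length(t) for t in train_texts]
test_rel = [relative_length(t) for t in test_texts]

# Column-wise feature: the mean and std are learned from the training split
train_len = np.array([len(t) for t in train_texts], dtype=float)
test_len = np.array([len(t) for t in test_texts], dtype=float)
mean, std = train_len.mean(), train_len.std()

train_z = (train_len - mean) / std
test_z = (test_len - mean) / std  # reuse the training statistics, do not refit

print(train_rel, test_rel, train_z, test_z)
```

In a scikit-learn pipeline, transformers such as `StandardScaler` take care of the second case by fitting only on the training folds.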

In [27]:
train_df["news_agency"] = train_df["text"].apply(lambda x: "http" in x or "news" in x)
train_df["hashtag"] = train_df["text"].apply(lambda x: "#" in x)
train_df["Iran"] = train_df["text"].apply(lambda x: "iran" in x.lower())

test_df["news_agency"] = test_df["text"].apply(lambda x: "http" in x or "news" in x)  # same definition as for train_df
test_df["hashtag"] = test_df["text"].apply(lambda x: "#" in x)
test_df["Iran"] = test_df["text"].apply(lambda x: "iran" in x.lower())

train_df.head()

Unnamed: 0,keyword,location,text,target,n_words,vader_sentiment,rel_char_len,news_agency,hashtag,Iran
3289,debris,,"Unfortunately, both plans fail as the 3 are im...",0,22,-0.765,0.425,False,False,False
2672,crash,SLC,I hope this causes Bernie to crash and bern. S...,0,18,-0.5697,0.267857,True,False,False
2436,collide,,—pushes himself up from the chair beneath to r...,0,21,0.0,0.439286,True,False,False
9622,suicide%20bomb,,Widow of CIA agent killed in 2009 Afghanistan ...,1,20,-0.946,0.428571,True,False,False
8999,screaming,Azania,As soon as God say yes they'll be screaming we...,0,14,0.296,0.203571,False,False,False


<br><br>

### 1.8 Pipeline with all features


In [28]:
X_train_new = train_df.drop("target", axis=1)
y_train_new = train_df["target"]

X_test_new = test_df.drop("target", axis=1)
y_test_new = test_df["target"]

In [29]:
text_feature = "text"
binary_features = ["news_agency", "hashtag", "Iran"]
categorical_features = ["keyword"]
numeric_feature = ["n_words"]
passthrough = ["vader_sentiment", "rel_char_len"]
drop_feature = ["location"]


In [30]:
NLP_preprocessor = make_column_transformer(
    (
        CountVectorizer(
            stop_words="english",
            max_features=hyper_opt.best_params_[
                "columntransformer__countvectorizer__max_features"
            ],
        ),
        text_feature,
    ),
    (OneHotEncoder(drop="if_binary"), binary_features),
    (OneHotEncoder(handle_unknown="ignore"), categorical_features),
    (StandardScaler(), numeric_feature),
    ("passthrough", passthrough),
)

In [31]:
hyper_opt.best_params_

{'logisticregression__class_weight': 'balanced',
 'logisticregression__C': 0.1,
 'columntransformer__countvectorizer__max_features': 5000}

In [32]:
NLP_pipeline = make_pipeline(
    NLP_preprocessor,
    LogisticRegression(
        C=hyper_opt.best_params_["logisticregression__C"],
        class_weight=hyper_opt.best_params_["logisticregression__class_weight"],
        max_iter=300
    ),
)

In [33]:
mean_std_cross_val_scores(
    NLP_pipeline,
    X_train_new,
    y_train_new,
    scoring=scoring_metrics,
    cv=5,
)

fit_time          0.246 (+/- 0.008)
score_time        0.042 (+/- 0.007)
test_precision    0.589 (+/- 0.011)
test_f1           0.654 (+/- 0.010)
dtype: object

<br><br>

### 1.9 Interpretation



The F1 score improves only slightly with the engineered features, from 0.643 (the best score from the hyperparameter search) to 0.654.

In [34]:
NLP_pipeline.fit(X_train_new, y_train_new)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('countvectorizer',
                                                  CountVectorizer(max_features=5000,
                                                                  stop_words='english'),
                                                  'text'),
                                                 ('onehotencoder-1',
                                                  OneHotEncoder(drop='if_binary'),
                                                  ['news_agency', 'hashtag',
                                                   'Iran']),
                                                 ('onehotencoder-2',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['keyword']),
                                                 ('standardscaler',
                                                  StandardScaler(),
           

In [35]:
text_features = (
    NLP_pipeline.named_steps["columntransformer"]
    .named_transformers_["countvectorizer"]
    .get_feature_names_out()
    .tolist()
)
keyword_features = (
    NLP_pipeline.named_steps["columntransformer"]
    .named_transformers_["onehotencoder-2"]
    .get_feature_names_out()
    .tolist()
)

column_names = (
    text_features + binary_features + keyword_features + numeric_feature + passthrough
)

coefficients = NLP_pipeline.named_steps["logisticregression"].coef_.flatten()

In [36]:
coef_df = pd.DataFrame({"feature": column_names, "coefficient": coefficients}).sort_values("coefficient", ascending=False)
coef_df[:10]

Unnamed: 0,feature,coefficient
5224,rel_char_len,1.269627
4463,thunderstorm,1.020694
5216,keyword_windstorm,0.918887
3623,rescued,0.900307
1144,died,0.89698
3678,road,0.876581
776,collision,0.869327
4625,ukrainian,0.83822
4295,survived,0.834366
1948,hit,0.815764


In [37]:
coef_df[-10:]

Unnamed: 0,feature,coefficient
4801,want,-0.493764
644,cause,-0.505274
5069,keyword_demolition,-0.530712
5022,keyword_blazing,-0.532947
5048,keyword_collapse,-0.55475
1897,heart,-0.563255
2551,love,-0.607777
2480,like,-0.765922
5223,vader_sentiment,-0.822291
1216,don,-0.857552


Most of the top-ranking positive and negative coefficients above make sense.
Below are the coefficients for the three columns that I added. Two of them (`Iran` and `hashtag`) have fairly high positive coefficients, but the feature capturing news agencies has a negative coefficient. This is probably because not all news is disaster-related (or because I have not used a proper method to extract news-related tweets!).

_Resultant coefficient from added feature `Iran`:_

In [38]:
coef_df.query("feature == 'Iran'")

Unnamed: 0,feature,coefficient
5002,Iran,0.495746


_Resultant coefficient from added feature `hashtag`:_

In [39]:
coef_df.query("feature == 'hashtag'")

Unnamed: 0,feature,coefficient
5001,hashtag,0.404291


_Resultant coefficient from added feature `news_agency`:_

In [40]:
coef_df.query("feature == 'news_agency'")

Unnamed: 0,feature,coefficient
5000,news_agency,-0.225947
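
One way to read these numbers: in logistic regression a coefficient is the change in the log-odds of the positive class for a one-unit increase in the feature, with all other features held fixed. For a 0/1 feature, exponentiating the coefficient therefore gives the factor by which the odds of "disaster" are multiplied when the feature is present. The sketch below applies this to the three coefficients reported above; the dictionary simply copies those values.

```python
import math

# Coefficients copied from the outputs above
added_feature_coefs = {
    "Iran": 0.495746,
    "hashtag": 0.404291,
    "news_agency": -0.225947,
}

# exp(coefficient) = multiplicative change in the odds of the positive class
odds_ratios = {name: math.exp(coef) for name, coef in added_feature_coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds of 'disaster' multiplied by {ratio:.2f}")
```

A ratio above 1 pushes the prediction towards "disaster" and a ratio below 1 pushes it away. These are associations learned by the model, not causal effects.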


<br><br>

### 1.10 Test results


In [41]:
from sklearn.metrics import f1_score
test_score = f1_score(y_test_new, NLP_pipeline.predict(X_test_new))
print(f"The f1 score for the test data is {round(test_score, 2)}.")

The f1 score for the test data is 0.68.


<br><br><br><br>

## 2: Dataset for Feature Selection
<hr>

In the following exercises, we'll be using [`sklearn`'s Boston housing dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html) to do some feature selection.


In [42]:
from sklearn.datasets import load_boston

boston_housing = load_boston()
print(boston_housing.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu


    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_h

In [43]:
boston_df = pd.DataFrame(boston_housing.data, columns=boston_housing.feature_names)
boston_df["target"] = boston_housing.target
train_df, test_df = train_test_split(boston_df, test_size=0.2, random_state=2)

X_train, y_train = train_df.drop(columns=["target"]), train_df["target"]
X_test, y_test = test_df.drop(columns=["target"]), test_df["target"]
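
`load_boston` was removed in scikit-learn 1.2, so the cell above only runs on older versions. The deprecation message captured in the output suggests rebuilding the arrays from the original source file, in which each record spans two rows. The sketch below wraps that reshaping in a helper and exercises it on a synthetic array with the same layout. The function names are assumptions, and the download function is defined but not called because it needs network access.

```python
import numpy as np
import pandas as pd


def split_raw_boston(values):
    """Rebuild (data, target) from the raw layout: two rows per record."""
    values = np.asarray(values)
    # Even rows hold 11 features; odd rows hold 2 more features and the target
    data = np.hstack([values[::2, :], values[1::2, :2]])
    target = values[1::2, 2]
    return data, target


def fetch_boston_from_source(data_url="http://lib.stat.cmu.edu/datasets/boston"):
    """Download and parse the original file (needs network access)."""
    raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
    return split_raw_boston(raw_df.values)


# Synthetic stand-in with the same layout: 2 records -> 4 rows of 11 columns
fake_raw = np.arange(44, dtype=float).reshape(4, 11)
fake_data, fake_target = split_raw_boston(fake_raw)
print(fake_data.shape, fake_target)
```

The same message also points to the California housing dataset (`fetch_california_housing`) as an alternative.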

<br><br><br><br>

## 3: Feature importances and feature selection 
<hr>

We'll briefly explore feature importances, recursive feature elimination, adding polynomial features in a pipeline, and forward/backward selection.

### 3.1 Feature importance

The following code shows the coefficients learned by `RidgeCV` on the Boston housing dataset. 

In [44]:
lrcv = RidgeCV()
lrcv.fit(X_train, y_train)
pd.DataFrame(data=lrcv.coef_, index=X_train.columns, columns=["coefficient"])

Unnamed: 0,coefficient
CRIM,-0.107895
ZN,0.039087
INDUS,-0.020005
CHAS,3.138096
NOX,-15.339384
RM,3.640792
AGE,0.008323
DIS,-1.367897
RAD,0.321224
TAX,-0.011728
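
A caution when reading such a table as feature importance: the size of a coefficient depends on the units of its feature. The sketch below uses synthetic data (an assumption, not the housing data) with two features that contribute equally to the target but live on very different scales, and compares `Ridge` coefficients before and after standardization.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
small_scale = rng.normal(size=200)  # standard deviation around 1
large_scale = rng.normal(size=200) * 1000.0  # standard deviation around 1000
X_toy = np.column_stack([small_scale, large_scale])

# Both features contribute about equally to the target
y_toy = 3.0 * small_scale + 0.003 * large_scale + rng.normal(scale=0.1, size=200)

raw_coefs = Ridge().fit(X_toy, y_toy).coef_
scaled_coefs = Ridge().fit(StandardScaler().fit_transform(X_toy), y_toy).coef_

print("raw:   ", raw_coefs)
print("scaled:", scaled_coefs)
```

On the raw data the second coefficient looks negligible, while on standardized data the two coefficients are of similar size. Coefficients are easier to compare when the features have been scaled first.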


<br><br>

### 3.2 `RFECV` 

We'll explore recursive feature elimination for feature selection. The code below defines a pipeline with feature selection incorporated in it. `RFECV` is used for feature selection in the pipeline; it uses cross-validation to decide how many features to select. The selected features are passed to `RidgeCV`, which has built-in cross-validation to tune the `alpha` hyperparameter.  

We will see how many features have been selected by `RFECV` using the `n_features_` attribute of the `RFECV` step from the pipeline. 
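
To show what happens inside recursive feature elimination, here is a hand-written sketch of the elimination loop on synthetic data. It is a simplified illustration, not scikit-learn's implementation: the helper `manual_rfe` is hypothetical, the number of features to keep is fixed by hand, and `RFECV` would additionally use cross-validation to choose that number.

```python
import numpy as np
from sklearn.linear_model import Ridge


def manual_rfe(X, y, n_features_to_keep):
    """Repeatedly drop the feature with the smallest absolute coefficient."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_keep:
        model = Ridge().fit(X[:, remaining], y)
        weakest = int(np.argmin(np.abs(model.coef_)))
        del remaining[weakest]  # remove the least important feature
    return remaining


rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
# Only features 0 and 3 influence the target
y_toy = 4.0 * X_toy[:, 0] - 3.0 * X_toy[:, 3] + rng.normal(scale=0.1, size=200)

kept = manual_rfe(X_toy, y_toy, n_features_to_keep=2)
print(kept)
```

Because the ranking is based on coefficient magnitudes, the features should be on comparable scales, which is why `StandardScaler` comes first in the pipeline.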

In [45]:
pipe_rfe_ridgecv = make_pipeline(StandardScaler(), RFECV(Ridge(), cv=10), RidgeCV())

In [46]:
rfe_score = mean_std_cross_val_scores(
    pipe_rfe_ridgecv, X_train, y_train, cv=5, return_train_score=True
)
rfe_score

fit_time       0.160 (+/- 0.023)
score_time     0.006 (+/- 0.009)
test_score     0.704 (+/- 0.077)
train_score    0.730 (+/- 0.017)
dtype: object

In [47]:
rfe_score["test_score"]

'0.704 (+/- 0.077)'

In [48]:
pipe_rfe_ridgecv.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('rfecv', RFECV(cv=10, estimator=Ridge())),
                ('ridgecv', RidgeCV(alphas=array([ 0.1,  1. , 10. ])))])

#### Before feature selection

In [49]:
pipe_rfe_ridgecv.named_steps['rfecv'].n_features_in_

13

#### After feature selection

In [50]:
pipe_rfe_ridgecv.named_steps["rfecv"].n_features_

11

<br><br>

### 3.3: `PolynomialFeatures` + `RFECV`

We add one more step to the pipeline above, `PolynomialFeatures()`, for extracting polynomial features. We then carry out cross-validation using the pipeline and report the mean validation scores.   

In [51]:
poly_pipe = make_pipeline(
    StandardScaler(), PolynomialFeatures(degree=3), RFECV(Ridge(), cv=10), RidgeCV()
)

In [52]:
poly_score = mean_std_cross_val_scores(
    poly_pipe, X_train, y_train, cv=5, return_train_score=True
)
poly_score

fit_time       21.906 (+/- 1.359)
score_time      0.001 (+/- 0.001)
test_score      0.845 (+/- 0.058)
train_score     0.925 (+/- 0.017)
dtype: object

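Our mean cross-validation R² score improved a lot, from 0.70 to 0.84. This is mainly because the polynomial expansion gives the model many more features to choose from: some features appear to have a nonlinear relationship with the target, and the third-degree polynomial and interaction terms let a linear model fit them better. The gap between the train score (0.925) and the validation score (0.845) is also wider than before, so this model overfits somewhat more.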

<br><br>

### 3.4 Selected Features with Polynomial 

We will see how many features there are after applying the `PolynomialFeatures` transformation, both before and after feature selection.

In [53]:
poly_pipe.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('polynomialfeatures', PolynomialFeatures(degree=3)),
                ('rfecv', RFECV(cv=10, estimator=Ridge())),
                ('ridgecv', RidgeCV(alphas=array([ 0.1,  1. , 10. ])))])

#### Before feature selection

In [54]:
poly_pipe.named_steps["rfecv"].n_features_in_

560
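
This count can be checked by hand. With the default settings of `PolynomialFeatures` (bias column included, not interaction-only), the number of output features for n inputs and degree d is the binomial coefficient C(n + d, d). A small sketch, with a hypothetical helper name:

```python
from math import comb


def n_polynomial_features(n_inputs, degree):
    """Number of monomials of total degree <= `degree`, bias term included."""
    return comb(n_inputs + degree, degree)


n_degree_3 = n_polynomial_features(13, 3)  # 13 input features, degree 3
n_degree_2 = n_polynomial_features(13, 2)  # the default degree is 2

print(n_degree_3, n_degree_2)
```

The count grows quickly with the degree, which is why feature selection becomes useful after a polynomial expansion.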

#### After feature selection

In [55]:
poly_pipe.named_steps["rfecv"].n_features_

40

In [56]:
poly_pipe.named_steps["rfecv"].get_feature_names_out()

array(['x6', 'x7', 'x8', 'x35', 'x63', 'x64', 'x74', 'x76', 'x84', 'x89',
       'x90', 'x95', 'x99', 'x134', 'x145', 'x167', 'x169', 'x187',
       'x195', 'x251', 'x284', 'x336', 'x403', 'x420', 'x421', 'x423',
       'x424', 'x453', 'x463', 'x484', 'x496', 'x505', 'x514', 'x515',
       'x518', 'x524', 'x525', 'x541', 'x542', 'x544'], dtype=object)

<br><br>

### 3.5: `PolynomialFeatures` + backward selection

We now define a pipeline with `StandardScaler()`, `PolynomialFeatures()`, and backward feature selection (`SequentialFeatureSelector` with `Ridge` and direction `backward`) and report cross-validation scores.  
> In `SequentialFeatureSelector` (SFS) with backward direction, we start with the set of all features. At each step we find the one feature whose removal hurts the cross-validation score the least and greedily remove it from the set. Recursive feature elimination (RFE), on the other hand, relies on the model's own feature importances rather than on cross-validation scores. First, the estimator is trained on the initial set of features and the importance of each feature is obtained through a specific attribute (such as `coef_` or `feature_importances_`). Then, the least important features are removed from the current set of features. That procedure is repeated recursively on the pruned set until the desired number of features is reached. Because SFS has to cross-validate a model for every candidate feature at every step, while RFE fits a single model per step, SFS is much slower than RFE, which is consistent with the fit times reported below.
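
A rough count of model fits helps compare the cost of the two procedures. The sketch below is a back-of-envelope estimate under stated assumptions: 105 candidate features (degree-2 polynomial features of 13 inputs), half of them kept (52), 5-fold cross-validation inside SFS, and one feature removed per RFE round. The helper names are hypothetical.

```python
def backward_sfs_fits(n_features, n_to_select, cv=5):
    """With m features left, each of the m candidate removals is cross-validated."""
    return cv * sum(range(n_to_select + 1, n_features + 1))


def rfe_fits(n_features, n_to_select, step=1):
    """One fit per elimination round, plus a final fit on the kept features."""
    rounds = -(-(n_features - n_to_select) // step)  # ceiling division
    return rounds + 1


sfs_count = backward_sfs_fits(105, 52)
rfe_count = rfe_fits(105, 52)
print(sfs_count, rfe_count)
```

Under these assumptions a single backward SFS run needs tens of thousands of fits, against a few dozen for a single RFE run. `RFECV` repeats the RFE procedure on each cross-validation fold, so its cost is higher than this count but still far below that of SFS.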

In [57]:
poly_seq_pipe = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(),
    SequentialFeatureSelector(Ridge(), direction="backward", n_jobs=-1),
    RidgeCV(),
)

In [58]:
mean_std_cross_val_scores(poly_seq_pipe, X_train, y_train, return_train_score=True)

fit_time       52.778 (+/- 2.778)
score_time      0.000 (+/- 0.000)
test_score      0.796 (+/- 0.064)
train_score     0.922 (+/- 0.006)
dtype: object

In [59]:
poly_seq_pipe.named_steps['sequentialfeatureselector']

SequentialFeatureSelector(direction='backward', estimator=Ridge(), n_jobs=-1)

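The mean cross-validation score from this pipeline with backward selection (0.796) is lower than the one from the `RFECV` pipeline with polynomial features (0.845). Note that the comparison is not like-for-like: this pipeline uses the default `PolynomialFeatures()` (degree 2), whereas the `RFECV` pipeline used `degree=3`.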