# DSCI 573 - Feature and Model Selection

# Lab 2: Feature engineering, feature selection

## Table of contents
- [Submission instructions](#si)
- [Exercise 1: Feature engineering](#1)
- [(optional) Exercise 2: Change of basis](#2)
- [Exercise 3: Feature importances and feature selection](#3)
- [(optional) Exercise 4: Implement forward selection](#4)

## Submission instructions <a name="si"></a>
<hr>
rubric={mechanics:2}

You will receive marks for correctly submitting this assignment. 

To correctly submit this assignment follow the instructions below:

- Push your assignment to your GitHub repository. 
- Add a link to your GitHub repository here: LINK TO YOUR GITHUB REPO 
- Upload an HTML render of your assignment to Canvas. The last cell of this notebook will help you do that.
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions/).

[Here](https://github.com/UBC-MDS/public/tree/master/rubric) you will find the description of each rubric used in MDS.

**NOTE: The data you download for use in this lab SHOULD NOT BE PUSHED TO YOUR REPOSITORY. You might be penalised for pushing datasets to your repository. I have seeded the repository with a `.gitignore` file, which should prevent you from pushing CSVs.**

In [None]:
import os

%matplotlib inline
import string
from collections import deque

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# data
from sklearn import datasets
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.feature_extraction.text import CountVectorizer

# Feature selection
from sklearn.feature_selection import RFE, RFECV
from sklearn.impute import SimpleImputer

# classifiers / models
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, RidgeCV

# other
from sklearn.metrics import accuracy_score, log_loss, make_scorer, mean_squared_error
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    ShuffleSplit,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import (
    OneHotEncoder,
    OrdinalEncoder,
    PolynomialFeatures,
    StandardScaler,
)
from sklearn.svm import SVC, SVR

## Exercise 1: Feature engineering <a name="1"></a>
<hr>

One of the most important factors influencing the performance of machine learning models is the set of features used to represent the problem. If your underlying representation is bad, no fancy model is going to help, whereas with a good representation, a simple and interpretable model is likely to perform reasonably well. 

**Feature engineering** is the process of transforming raw data into features that better represent the underlying problem to the predictive models. 

In this exercise we'll engineer our own features on [the Disaster Tweets dataset](https://www.kaggle.com/vstepanenko/disaster-tweets). 

Note that coming up with features is difficult, time-consuming, and requires expert knowledge. The purpose of this exercise is to give you a little taste of feature engineering, which you are likely to be doing in your career as a data scientist or a machine learning practitioner. In this exercise, since we'll be using simplistic features, you might not get better scores with your engineered features, and that's fine. The purpose here is to make you familiar with the process of feature engineering rather than getting the best scores. 

As usual, download the dataset, unzip it and save it in your lab folder. Do not push it into the repository. 

In [None]:
### BEGIN STARTER CODE

df = pd.read_csv("tweets.csv", usecols=["keyword", "text", "target", "location"])
train_df, test_df = train_test_split(df, test_size=0.2, random_state=2)
train_df.head()

### END STARTER CODE

### 1.1 Preliminary analysis
rubric={reasoning:5}

**Your tasks:**

1. State in your own words what problem you are trying to solve here. (One sentence is enough.) 
2. Do you have class imbalance? If yes, do we need to deal with it? What metric would be appropriate in this case? 
3. I am defining `text_feature` and `target` in the starter code below. Identify other feature types and the transformations you want to apply on features. Note that "location" feature could be a potentially useful feature but is a bit complicated to encode and we are going to exclude it in this assignment. 

In [None]:
### BEGIN STARTER CODE

text_feature = "text"
target = "target"

X_train, y_train = train_df.drop(columns=["target", "location"]), train_df[target]
X_test, y_test = test_df.drop(columns=["target", "location"]), test_df[target]

### END STARTER CODE

**solution_1_1_1**

### YOUR ANSWER HERE

**solution_1_1_2**

### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

In [None]:
# solution_1_1_3
### YOUR ANSWER HERE

### (optional) 1.2 
rubric={reasoning:1}

**Your tasks:**
1. Here we are dropping the `location` feature. How might you encode it if you decided to include it?

**solution_1_2_1**

### YOUR ANSWER HERE

In [None]:
### YOUR ANSWER HERE

### 1.3 DummyClassifier
rubric={accuracy:1}

**Your tasks:**

1. Report the mean cross-validation f1 score and accuracy for `DummyClassifier` with `strategy="stratified"`.


In [None]:
# solution_1_3_1

### YOUR ANSWER HERE

### 1.4 `CountVectorizer`
rubric={accuracy:3,reasoning:1}

So far for text data, we have been using bag of words representation as features. 
Let's examine the scores with a bag of words model. 
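
As a reminder of what the bag of words representation looks like, here is a minimal sketch on two made-up tweets (the sentences and the `toy_`-prefixed names are hypothetical and not part of the lab data). Each column corresponds to one word in the learned vocabulary and each entry counts how often that word occurs in the tweet.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy tweets, not taken from the lab data.
toy_tweets = [
    "forest fire near the town",
    "the town is on fire",
]

toy_vec = CountVectorizer()
toy_counts = toy_vec.fit_transform(toy_tweets)  # sparse matrix, one row per tweet

# The vocabulary is stored as {word: column index}; columns are in sorted word order.
toy_vocab = sorted(toy_vec.vocabulary_)
toy_dense = toy_counts.toarray()

print(toy_vocab)
print(toy_dense)
```

Word order is discarded: only the counts per word survive, which is why additional hand-crafted features can sometimes carry information that this representation misses.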

**Your tasks:**
1. Define a pipeline with `CountVectorizer` on the "text" column and `LogisticRegression` classifier. Set `max_features` of `CountVectorizer` to 20_000 and `class_weight` of `LogisticRegression` to "balanced". (These are  optimized hyperparameter values. We won't carry out hyperparameter optimization here in the interest of time.)
2. Report mean cross-validation f1 score and accuracy. Compare it with the baseline model in 1.3. 

In [None]:
# solution_1_4_1

### YOUR ANSWER HERE

In [None]:
# solution_1_4_2

### YOUR ANSWER HERE

**solution_1_4_2 (reasoning)**

### YOUR ANSWER HERE

### 1.5 Include _keyword_ feature
rubric={accuracy:4,reasoning:1}

The _keyword_ feature seems relevant for predicting whether or not the tweet is about a disaster. 

**Your tasks:**

1. Build a column transformer for transforming _keyword_ feature and _text_ feature. Set `max_features` of `CountVectorizer` to 20_000. So far you haven't used `CountVectorizer` with other transformations. Below is an example column transformer which shows how to use `CountVectorizer` with other transformers. Unlike transformers such as `StandardScaler()` for numeric features, for `CountVectorizer` transformer, you pass the feature name as a string rather than a list of features. (So if you have multiple text columns, you'll have to define multiple `CountVectorizer` transformers.) 
```
preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features), # scale for numeric features
    (CountVectorizer(), "text") # bag of words for text feature
)
```
2. Build a pipeline with the column transformer in 1. and `LogisticRegression` classifier. Set `class_weight` of `LogisticRegression` to "balanced". 
3. Report mean cross-validation f1 scores and accuracy. 
4. Are you getting better scores than 1.3 and 1.4? 
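
To make the string-versus-list point from task 1 concrete, here is a small sketch on a made-up data frame. The column `n_chars`, the toy sentences, and the `toy_`-prefixed names are assumptions for illustration only. `StandardScaler` receives a list of columns (a 2D input), while `CountVectorizer` receives a single column name as a string (a 1D sequence of documents).

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; n_chars is simply the character length of each text.
toy_df = pd.DataFrame(
    {
        "n_chars": [21, 19, 18],
        "text": [
            "forest fire near town",
            "river flood warning",
            "my mixtape is fire",
        ],
    }
)

toy_numeric_features = ["n_chars"]  # list -> the transformer sees a 2D array

toy_preprocessor = make_column_transformer(
    (StandardScaler(), toy_numeric_features),  # scaling for numeric features
    (CountVectorizer(), "text"),  # string -> the vectorizer sees 1D raw documents
)

toy_X = toy_preprocessor.fit_transform(toy_df)
print(toy_X.shape)  # 1 scaled column + 10 distinct words
```

Passing `["text"]` (a list) to `CountVectorizer` instead would hand it a 2D object, which is not the input it expects.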

In [None]:
# solution_1_5_1

### YOUR ANSWER HERE

In [None]:
# solution_1_5_2

### YOUR ANSWER HERE

In [None]:
# solution_1_5_3

### YOUR ANSWER HERE

**solution_1_5_4**

### YOUR ANSWER HERE

### Exercise 1.6: Adding new features
rubric={reasoning:5}

Is it possible to further improve the scores? How about adding new features based on our intuitions? 

**Your tasks:**

1. Name 3 to 4 additional features you think would be helpful in predicting the target. An example would be a binary feature "has_emoticons" indicating whether the tweet has emoticons or not. Explain your intuition behind the features and discuss how hard it would be to engineer these features. 
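
For instance, the "has_emoticons" feature mentioned above could be approximated with a regular expression. The sketch below is deliberately crude: the pattern only covers a handful of Western-style emoticons such as `:)`, `;-(` or `=D`, and both the pattern and the function name are assumptions made for illustration.

```python
import re

# Crude, hypothetical pattern: eyes [:;=], optional nose "-", then a mouth character.
EMOTICON_PATTERN = re.compile(r"[:;=]-?[)(DPp]")


def has_emoticons(text):
    """Return True if the text contains at least one simple emoticon."""
    return bool(EMOTICON_PATTERN.search(text))


print(has_emoticons("so happy today :)"))
print(has_emoticons("Flood warning issued for the valley"))
```

Even this simple feature shows why feature engineering takes effort: emoji, unusual emoticons, and false positives inside ordinary text would all need extra handling.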

**solution_1_6_1**

### YOUR ANSWER HERE

### Exercise 1.7: Extracting your own features 
rubric={accuracy:4,reasoning:4}

In this exercise, we will be adding some very basic length-related and sentiment features.  

You will need to install a popular library called `nltk` for this exercise. To do so, run the following command in your `conda` environment. 

```
conda install -c anaconda nltk 
```

The `nltk.download("vader_lexicon")` and `nltk.download("punkt")` calls are Python, not shell commands; the starter code below runs them for you.

The starter code below creates three new features: 
- Relative character length in the tweet. 
- Number of words in the tweet.
- Sentiment of the tweet (positive (pos), negative (neg), neutral (neu), compound (mixture of different sentiments)). In 571, you carried out sentiment analysis on the IMDB data set. Here we are using some pre-trained machine learning model to extract sentiment expressed in the tweets. 
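
The sentiment feature is a single label obtained by taking the key with the highest score from a dictionary of sentiment scores. The sketch below illustrates that step with made-up score dictionaries (the numbers are hypothetical, not real analyzer output), so it runs without downloading any lexicon.

```python
# Hypothetical score dictionaries with the keys neg, neu, pos and compound.
toy_scores_a = {"neg": 0.12, "neu": 0.58, "pos": 0.30, "compound": 0.44}
toy_scores_b = {"neg": 0.00, "neu": 0.40, "pos": 0.60, "compound": 0.80}


def top_label(scores):
    """Return the key whose score is largest."""
    return max(scores, key=lambda k: scores[k])


label_a = top_label(toy_scores_a)
label_b = top_label(toy_scores_b)
print(label_a, label_b)
```

Note that the compound score competes with the three other scores here, so "compound" can win whenever it is the largest value in the dictionary.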

**Your tasks:**

1. Extract at least two more features that you think might be relevant for prediction and store them as new columns in the train and test sets. Briefly explain your intuition on why these features might help the prediction task. 
2. Would it have been OK to create new columns directly in the original `df` instead of creating them separately for train and test splits? Would that be a violation of the golden rule? 

In [None]:
### BEGIN STARTER CODE

import nltk

nltk.download("vader_lexicon")
nltk.download("punkt")
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

### END STARTER CODE

In [None]:
### BEGIN STARTER CODE


def get_relative_length(text, TWITTER_ALLOWED_CHARS=280.0):
    """
    Returns the relative length of text.

    Parameters:
    ------
    text: (str)
    the input text

    Keyword arguments:
    ------
    TWITTER_ALLOWED_CHARS: (float)
    the denominator for finding relative length

    Returns:
    -------
    relative length of text: (float)

    """
    return len(text) / TWITTER_ALLOWED_CHARS


def get_length_in_words(text):
    """
    Returns the length of the text in words.

    Parameters:
    ------
    text: (str)
    the input text

    Returns:
    -------
    length of tokenized text: (int)

    """
    return len(nltk.word_tokenize(text))


def get_sentiment(text):
    """
    Returns the maximum scoring sentiment of the text

    Parameters:
    ------
    text: (str)
    the input text

    Returns:
    -------
    sentiment of the text: (str)
    """
    scores = sid.polarity_scores(text)
    return max(scores, key=lambda x: scores[x])


### YOUR ANSWER HERE

In [None]:
### BEGIN STARTER CODE

train_df = train_df.assign(n_words=train_df["text"].apply(get_length_in_words))
train_df = train_df.assign(sentiment=train_df["text"].apply(get_sentiment))
train_df = train_df.assign(rel_char_len=train_df["text"].apply(get_relative_length))

test_df = test_df.assign(n_words=test_df["text"].apply(get_length_in_words))
test_df = test_df.assign(sentiment=test_df["text"].apply(get_sentiment))
test_df = test_df.assign(rel_char_len=test_df["text"].apply(get_relative_length))

### END STARTER CODE

In [None]:
# solution_1_7_1

### YOUR ANSWER HERE

**solution_1_7_1 (reasoning)**

### YOUR ANSWER HERE

**solution_1_7_2**

### YOUR ANSWER HERE

### 1.8 Pipeline with all features
rubric={accuracy:4,reasoning:2}

**Your tasks:**
1. Identify different feature types in your new data set with the features you created above, and separate features and targets from your new dataset. 
2. Define a column transformer for your mixed feature types. Again, set `max_features` of `CountVectorizer` to 20_000.  
3. Define a pipeline with the column transformer and `LogisticRegression` with `class_weight` of `LogisticRegression` set to "balanced" and report mean cross-validation f1 scores.


In [None]:
# solution_1_8_1

### YOUR ANSWER HERE

In [None]:
# solution_1_8_2

### YOUR ANSWER HERE

In [None]:
# solution_1_8_3

### YOUR ANSWER HERE

### 1.9 Interpretation
rubric={accuracy:4,reasoning:2}

1. Do you see any improvements with the new features compared to when you used only `CountVectorizer` features? Note that feature engineering is hard and requires domain expertise. If you do not see big improvements in scores with new features, that's OK. Do not get discouraged. The purpose of this exercise is to make you familiar with the process of extracting new features rather than getting the best scores. 
2. Show the first 20 coefficients with largest magnitudes and corresponding features. 
3. Examine the coefficients of the features we have extracted above. Do they make sense? 

**solution_1_9_1**

### YOUR ANSWER HERE

In [None]:
# solution_1_9_2

### YOUR ANSWER HERE

**solution_1_9_3**

### YOUR ANSWER HERE

### 1.10 Test results
rubric={accuracy:2, reasoning:2}

**Your tasks:**

1. Report f1 score on the test set with the model trained with all features. 
2. What additional time, other than prediction time, do we need if we are to use this model with our engineered features on the deployment data?  

In [None]:
# solution_1_10_1

### YOUR ANSWER HERE

**solution_1_10_2**

### YOUR ANSWER HERE

## Dataset for next exercises
<hr>

In the following exercises, we'll be using [`sklearn`'s boston housing dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html). 

In [None]:
### BEGIN STARTER CODE

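# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn version.
from sklearn.datasets import load_boston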

boston_housing = load_boston()
print(boston_housing.keys())
print(boston_housing.DESCR)

### END STARTER CODE

In [None]:
### BEGIN STARTER CODE

boston_df = pd.DataFrame(boston_housing.data, columns=boston_housing.feature_names)
boston_df["target"] = boston_housing.target
train_df, test_df = train_test_split(boston_df, test_size=0.2, random_state=2)

X_train, y_train = train_df.drop(columns=["target"]), train_df["target"]
X_test, y_test = test_df.drop(columns=["target"]), test_df["target"]

### END STARTER CODE

## (optional) Exercise 2: Change of basis <a name="2"></a>
<hr>

The linear model is problematic when the target is a non-linear function of the input. With high-dimensional data we cannot really know whether the target is a linear or non-linear function of the input. One way to examine this is by using _polynomial features_. Suppose you have a single feature in your original data, with values $x_1, \dots, x_n$ for the $n$ examples. You can think of transforming the data into the following matrix $X_{poly}$, where row $i$ contains the values $(x_i)^j$ for $j=0$ up to some maximum degree. E.g., for degree 3: 

$$
X_{poly} = \left[\begin{array}{cccc}
1 & x_1 & (x_1)^2 & (x_1)^3\\
1 & x_2 & (x_2)^2 & (x_2)^3\\
\vdots\\
1 & x_n & (x_n)^2 & (x_n)^3\\
\end{array}
\right],
$$

We can then fit a least squares model as if the above were our data set. You can think of this as "changing the model by changing the data" since we are still using a linear model but making the fit nonlinear by inventing new features. 
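
The following sketch illustrates this idea with plain NumPy on synthetic data. The cubic target function and all variable names are assumptions chosen for illustration. The model stays linear in its weights, but the fitted curve is nonlinear in $x$ because the columns of the design matrix are powers of $x$.

```python
import numpy as np

# Synthetic single-feature data with a known cubic relationship (noise-free for clarity).
x = np.linspace(-2, 2, 50)
y = 1.0 - 2.0 * x + 0.5 * x**3

# Change of basis: columns are 1, x, x^2, x^3 (the matrix X_poly for degree 3).
X_poly = np.vander(x, N=4, increasing=True)

# Ordinary least squares on the transformed data.
w, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
print(np.round(w, 3))  # weights for 1, x, x^2, x^3
```

Fitting the same data with only the columns $1$ and $x$ would force a straight line, which cannot capture the cubic term.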

### (optional) 2.1 Polynomial feature transformations  
rubric={reasoning:1}

**Your tasks:**
1. Is it possible to visualize our Boston housing data and examine whether or not a linear fit is a good fit for this dataset? 
2. Carry out cross-validation using `DummyRegressor` on the train portion. 
3. Define a pipeline with `PolynomialFeatures` and `RidgeCV`. 
4. Examine the train and validation scores for three values for `degree` hyperparameter of `PolynomialFeatures`: 1, 2, and 3. Use either negative MAPE or `neg_root_mean_squared_error` for scoring. 
5. Which value of `degree` is giving you the best results? How many new features do you have with this degree?

**solution_2_1_1**

### YOUR ANSWER HERE

In [None]:
# solution_2_1_2

### YOUR ANSWER HERE

In [None]:
# solution_2_1_3

### YOUR ANSWER HERE

In [None]:
# solution_2_1_4

### YOUR ANSWER HERE

In [None]:
# solution_2_1_5

### YOUR ANSWER HERE

## Exercise 3: Feature importances and feature selection <a name="3"></a>
<hr>

In this exercise we'll explore feature importances, recursive feature elimination, adding polynomial features, and forward selection. You could use the scoring method of your choice. The default $R^2$ is fine too.  

### Exercise 3.1 Adding random noise
rubric={reasoning:2}

The following code shows the coefficients learned by `RidgeCV` on the Boston housing dataset. It then adds a column of random noise to `X_train` and re-trains and examines the coefficients again. We see that the model has assigned a non-zero coefficient to the noise feature. But wait, we know this feature can't possibly be useful. 

**Your tasks:**

1. Why is the importance of the random noise feature non-zero (and in fact larger than for some real features)? Maximum 2 sentences.

In [None]:
lrcv = RidgeCV()
lrcv.fit(X_train, y_train)
pd.DataFrame(data=lrcv.coef_, index=X_train.columns, columns=["coefficient"])

In [None]:
random_noise = np.random.randn(X_train.shape[0], 1)
X_train_noise = pd.concat(
    (X_train, pd.DataFrame(random_noise, columns=["noise"], index=X_train.index)),
    axis=1,
)
X_train_noise.head()

In [None]:
lrcv = RidgeCV()
lrcv.fit(X_train_noise, y_train)
pd.DataFrame(data=lrcv.coef_, index=X_train_noise.columns, columns=["coefficient"])

**solution_3_1_1**

### YOUR ANSWER HERE

### 3.2 `RFECV` 
rubric={accuracy:4,reasoning:2}

In this exercise, you'll explore recursive feature elimination for feature selection. 
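
As a quick illustration of the basic mechanism, the sketch below runs plain `RFE` (without cross-validation) on synthetic data where only three of six features influence the target. The data, coefficients, and variable names are assumptions made for this example. `RFE` repeatedly fits the model and drops the feature with the smallest coefficient magnitude until the requested number of features remains.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: only columns 0, 2 and 4 affect the target.
X_rfe = rng.normal(size=(200, 6))
y_rfe = (
    3.0 * X_rfe[:, 0]
    - 2.0 * X_rfe[:, 2]
    + 0.5 * X_rfe[:, 4]
    + 0.1 * rng.normal(size=200)
)

rfe = RFE(LinearRegression(), n_features_to_select=3)
rfe.fit(X_rfe, y_rfe)

selected = np.flatnonzero(rfe.support_).tolist()  # indices of the kept features
ranking = rfe.ranking_.tolist()  # 1 = kept; larger = eliminated earlier
print(selected)
print(ranking)
```

`RFECV` follows the same elimination procedure but uses cross-validation to choose how many features to keep instead of requiring `n_features_to_select` up front.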

**Your tasks:**
1. Define a pipeline with the following steps and report mean cross-validation scores with the pipeline on the Boston housing dataset.  
    - `StandardScaler` with default parameters
    - `RidgeCV`
2. Now add `RFECV` with `Ridge` to the pipeline and report mean cross-validation scores with the pipeline.     
3. Why are we using `RFECV` and `RidgeCV` in the pipeline? 
4. How many features have been selected by `RFECV`? You can access this using the `n_features_` attribute of the `RFECV` object. 

In [None]:
# solution_3_2_1

### YOUR ANSWER HERE

**solution_3_2_3**

### YOUR ANSWER HERE

In [None]:
# solution_3_2_4

### YOUR ANSWER HERE

**solution_3_2_4**

### YOUR ANSWER HERE

### 3.3: `PolynomialFeatures` + [`RFECV`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html)
rubric={accuracy:3,reasoning:3}

**Your tasks:**
1. Add one more step to the pipeline above: **`PolynomialFeatures()`** 
2. Carry out cross-validation using the pipeline, and report the mean validation scores. 
3. What's the effect of adding the `PolynomialFeatures` step to the pipeline? How many total features will there be after applying the `PolynomialFeatures` transformation? How many features have been selected by `RFECV`?
Are you getting better scores compared to 3.2? 

In [None]:
# solution_3_3_1

### YOUR ANSWER HERE

In [None]:
# solution_3_3_2

### YOUR ANSWER HERE

**solution_3_3_3**

### YOUR ANSWER HERE

In [None]:
# solution_3_3_3
### YOUR ANSWER HERE

### 3.4: `PolynomialFeatures` + Forward selection
rubric={accuracy:4,reasoning:2}

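Forward selection was not implemented in `sklearn` when this lab was written (versions 0.24 and later provide `sklearn.feature_selection.SequentialFeatureSelector`). In this lab we use the implementation in the `mlxtend` package, which is compatible with `sklearn` pipelines. 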

So first, let's install `mlxtend` in our environment. 
```
conda install -c conda-forge mlxtend
```

**Your tasks:**
1. Define a pipeline with forward search instead of `RFECV`. So add the following step in the pipeline instead of `RFECV` and report mean cross-validation scores. 
    - `SequentialFeatureSelector` with `Ridge`, and `k_features` = 20. 
2. Are you getting comparable scores? Is there any overlap between the features selected by `RFECV` and forward selection? 

In [None]:
from mlxtend.feature_selection import SequentialFeatureSelector

In [None]:
# solution_3_4_1

### YOUR ANSWER HERE

In [None]:
# solution_3_4_2

### YOUR ANSWER HERE

**solution_3_4_2 (reasoning)**

### YOUR ANSWER HERE

### 3.5
rubric={reasoning:2}

**Your tasks:**

1. Discuss advantages and disadvantages of recursive feature elimination (RFE) and forward selection. 

**solution_3_5_1**

### YOUR ANSWER HERE

### (optional) Exercise 4: Implement forward selection <a name="4"></a>
<hr>

### 4.1 (optional) Implement your own forward selection
rubric={reasoning:2}

**Your tasks:**
1. Implement the `fit` method of the forward selection algorithm using the starter code below. This algorithm works iteratively. At each step, add the feature that most reduces the validation error. Stop adding features once the validation error stops decreasing. Feel free to adapt the `__init__` and `transform` methods as you see fit. You are welcome to hard-code a particular choice of model (e.g., `Ridge` or `SVR` with a linear kernel). Optionally, abstract away the model so that your forward selection function can be called with any model, as long as it implements `fit`, `predict`, and `score` like most `sklearn` models.
2. Carry out feature selection using your method on the Boston housing dataset above. 
3. Discuss your results. Are you getting similar results to `mlxtend`? 
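
The greedy loop described in task 1 can be sketched in a few lines of NumPy. This is only an illustration of the idea under simplifying assumptions: it uses ordinary least squares and a single hold-out validation set instead of an `sklearn` model with cross-validation, it is a standalone function rather than the `ForwardSelection` class below, and the function name and synthetic data are hypothetical.

```python
import numpy as np


def forward_select_sketch(X_tr, y_tr, X_val, y_val):
    """Greedy forward selection with least squares and a hold-out validation set."""

    def val_error(cols):
        # Fit least squares with an intercept on the chosen columns,
        # then return the mean squared error on the validation data.
        A_tr = np.column_stack([np.ones(len(X_tr)), X_tr[:, cols]])
        A_val = np.column_stack([np.ones(len(X_val)), X_val[:, cols]])
        w, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
        return float(np.mean((A_val @ w - y_val) ** 2))

    selected = []
    best_err = val_error(selected)  # intercept-only baseline
    remaining = list(range(X_tr.shape[1]))

    while remaining:
        # Try adding each remaining feature and keep the best candidate.
        errs = {j: val_error(selected + [j]) for j in remaining}
        j_best = min(errs, key=errs.get)
        if errs[j_best] >= best_err:
            break  # validation error stopped decreasing
        selected.append(j_best)
        remaining.remove(j_best)
        best_err = errs[j_best]

    return selected, best_err


# Synthetic data: only columns 1 and 3 affect the target.
rng = np.random.default_rng(0)
X_fs = rng.normal(size=(300, 5))
y_fs = 4.0 * X_fs[:, 1] - 3.0 * X_fs[:, 3] + 0.1 * rng.normal(size=300)

fs_selected, fs_err = forward_select_sketch(
    X_fs[:200], y_fs[:200], X_fs[200:], y_fs[200:]
)
print(fs_selected, fs_err)
```

The two informative features are picked first, in order of how much they reduce the validation error. An uninformative feature may still be added afterwards if it happens to lower the validation error slightly by chance, which is one reason to think about a minimum-improvement threshold or a `max_features` limit.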

In [None]:
class ForwardSelection:
    def __init__(
        self,
        model,
        min_features=None,
        max_features=None,
        scoring=None,
        cv=None,
        mode="regression",
    ):
        """
        Initializes the ForwardSelection object

        Parameters
        ----------
        model -- sklearn regressor or classifier object
            The sklearn regression or classification model

        Keyword arguments
        ----------
        min_features -- (int)
            the minimum number of features the model must select (default None)
        max_features -- (int)
            the maximum number of features that model may select (default None)
        scoring -- (str)
            the scoring that will be used for feature selection (default None)
        cv -- (int)
            the number of folds in the cross validation (default None)
        mode -- (str)
            Whether you are carrying out feature selection for regression or classification

        Returns
        -------
            None

        """
        self.max_features = max_features
        if min_features is None:
            self.min_features = 1
        else:
            self.min_features = min_features

        self.model = model
        self.scoring = scoring
        self.cv = cv
        self.score_ = None
        self.ftr_ = []
        self.mode = mode

    def fit(self, X, y):
        """
        Finds the best feature set using the `Forward Selection` algorithm.

        Parameters:
        -------
        X -- (numpy array)
            Feature vector
        y -- (numpy array)
            target vector

        Returns
        -------
            None

        """
        # solution_4_1_1
        ### YOUR ANSWER HERE

    def transform(self, X, y=None):
        """
        Return the features selected using the `Forward Selection` algorithm.

        Parameters:
        -------
        X -- (numpy array)
            Feature vector
        y -- (numpy array)
            target vector (default = None)

        Return:
        -------
        selected features from X

        """
        return X[:, self.ftr_]

In [None]:
# solution_4_1_2

### YOUR ANSWER HERE

**solution_4_1_3**

### YOUR ANSWER HERE

### Submission to Canvas

**PLEASE READ: When you are ready to submit your assignment do the following:**

- Run all cells in your notebook to make sure there are no errors by doing Kernel -->  Restart Kernel and Run All Cells...
- If you are using the "573" `conda` environment, make sure to select it before running all cells. 
- Convert your notebook to .html format using the `convert_notebook()` function below or by File -> Export Notebook As... -> Export Notebook to HTML
- Run the code `submit()` below to go through an interactive submission process to Canvas.
- After submission, be sure to do a final push of all your work to GitHub (including the rendered html file).

In [None]:
# from canvasutils.submit import convert_notebook, submit

# convert_notebook("lab2.ipynb", "html")  # uncomment and run when you want to try converting your notebook (or you can convert manually from the File menu)
# submit(course_code=59091, token=False)  # uncomment and run when ready to submit to Canvas

Congratulations on finishing the lab!! 