# SLU15 - Feature Engineering (aka Real World Data): Exercises notebook

## 1 About the data

In this exercise we will be using a dataset with Google Play Store apps, adapted from [here](https://www.kaggle.com/lava18/google-play-store-apps).

In [1]:
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
%matplotlib inline

data = pd.read_csv('data/googleplaystore.csv')
data.head()

The fields in this dataset have the following meaning:
* **App** - name of the app.
* **Category** - category the app belongs to.
* **Rating** - overall user rating of the app (as when scraped).
* **Reviews** - number of user reviews for the app (as when scraped).
* **Size** - size of the app in MB (as when scraped).
* **Installs** - number of user installs for the app (as when scraped).
* **Type** - Paid or Free.
* **Price** - price of the app (as when scraped).
* **Content Rating** - age group the app is targeted at: Children / Mature 21+ / Adult.
* **Genre** - an app can belong to multiple genres (apart from its main category).
* **Last Updated** - date when the app was last updated on Play Store (as when scraped).
* **Current Ver** - current version of the app available on Play Store (as when scraped).
* **Android Ver** - min required Android version (as when scraped).

The first thing we want to do is to check the dtypes of our features.

In [None]:
data.dtypes

## 2 Category dtype in pandas

### Exercise 1: Convert fields into category dtype (graded)

The fields `Category` and `Content Rating` are of dtype `object` but can be converted into dtype `category`, as explained in the Learning Notebook. Moreover:
* `Category` is a *nominal* categorical field, that is, without any meaningful order;
* `Content Rating` is an *ordinal* categorical field, as its values have a natural order.

In the following exercise, convert both fields into dtype `category` and, in the case of the field `Content Rating`, assign a natural order for its categories.

_Note:_ For the "natural order" of the field `Content Rating`, go from less restrictive to more restrictive. If a given category does not fit, place it at the end of the ordering.
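
As a reminder of the difference between the two conversions, here is a minimal sketch on hypothetical toy data (T-shirt sizes, not part of the Play Store dataset). It uses `pd.CategoricalDtype` to attach an order; this is one possible way to do it, not necessarily the one shown in the Learning Notebook.

```python
import pandas as pd

# Hypothetical toy data: T-shirt sizes (not from the Play Store dataset).
sizes = pd.Series(['M', 'S', 'XL', 'M', 'L'])

# Nominal conversion: the categories carry no order.
sizes_nominal = sizes.astype('category')

# Ordinal conversion: list the categories from smallest to largest.
size_dtype = pd.CategoricalDtype(categories=['S', 'M', 'L', 'XL'], ordered=True)
sizes_ordinal = sizes.astype(size_dtype)

print(sizes_nominal.cat.ordered)                  # False
print(sizes_ordinal.min(), sizes_ordinal.max())   # S XL
```

Once an order is attached, `min()` and `max()` follow that order rather than alphabetical order, which is what the checks for this exercise rely on.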

In [None]:
def convert_categorical_features(X, nominal_feat='Category', ordinal_feat='Content Rating'):

    X_s = X.copy()
    
    ## convert nominal feature to dtype 'category'
    # ...
    ## create list of ordered categories for ordinal feature
    # ordered_cats = ...
    ## convert ordinal feature to dtype 'category'
    # ...
    ## Assign natural order to ordinal feature
    # ...
    
    # YOUR CODE HERE
    raise NotImplementedError()

    return X_s

In [None]:
"""Check that the solution is correct."""
X_cat_conv = convert_categorical_features(data)

assert X_cat_conv['Category'].dtype == 'category'
assert X_cat_conv['Content Rating'].dtype == 'category'
assert X_cat_conv['Content Rating'].min() == 'Everyone'
assert X_cat_conv['Content Rating'].max() == 'Unrated'

### Exercise 2: Encode binary field (graded)

In this exercise, use the `map` method to encode the target variable (the `Type` field) so that it is `1` when an app is `Paid` and `0` when it is `Free`.
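
A minimal sketch of `Series.map` with a dictionary, on hypothetical toy data (yes/no answers rather than the `Type` field):

```python
import pandas as pd

# Hypothetical toy data with one missing value.
answers = pd.Series(['yes', 'no', 'no', None, 'yes'])

# Dictionary from the current values to integer codes.
toy_map = {'yes': 1, 'no': 0}

# Values that are not keys of the dictionary (including missing ones) become NaN.
answers_encoded = answers.map(toy_map)
print(answers_encoded)
```

Because unmapped and missing values turn into `NaN`, the result is a float Series, which is presumably why the check below calls `fillna(0)` before summing.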

In [None]:
def encode_binary_field(f):

    f_e = f.copy()
    
    ## create a dictionary mapping the current values to int values
    # encoding_map = ...
    ## change target using the mapping
    # f_e = ...
    
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return f_e

In [None]:
"""Check that the solution is correct."""
f_encoded = encode_binary_field(data.Type)

assert f_encoded[123] == 0
assert f_encoded[995] == 1
assert sum(f_encoded.fillna(0)) == 800

### Exercise 3: Discretize `Reviews` field (graded)

The field `Reviews` is a continuous field, with a distribution which is, not surprisingly, very skewed to the right (remember *skewness* from SLU04?).
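
As a quick refresher, a right-skewed distribution has a long tail of large values, so its mean sits above its median and its skewness is positive. A small sketch on hypothetical toy counts:

```python
import pandas as pd

# Hypothetical toy counts: mostly small values plus one very large one.
toy_counts = pd.Series([1, 2, 2, 3, 3, 4, 250])

print("mean:", toy_counts.mean())
print("median:", toy_counts.median())
print("skewness:", toy_counts.skew())  # positive for a right-skewed sample
```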

In [None]:
data.Reviews.plot.hist(bins=100, figsize=(10,6));
plt.xlim(0);
plt.xlabel('Reviews');
print("The field 'Reviews' ranges from", data.Reviews.min(), "to", data.Reviews.max())

We will deal with the skewness in a bit. Let's first discretize this field in two ways:
* create a new field called `discrete_reviews`, which is the discretization of the `Reviews` field, such that the values range from 0 to 99 and the original instances are uniformly distributed across the bins;
* create a new field called `binary_reviews`, which is the binarization of the `Reviews` field, such that amounts smaller than `100000` become `0` and amounts equal to or greater than `100000` become `1`.

Use `sklearn` transformers in this exercise.
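
A minimal sketch of how the two transformers behave, on a hypothetical toy array. The number of bins and the threshold are chosen only for illustration and are not the values needed in the exercise.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, Binarizer

# Hypothetical skewed toy values; sklearn transformers expect a 2D array.
values = np.array([1, 2, 3, 4, 50, 60, 700, 8000], dtype=float).reshape(-1, 1)

# Quantile strategy: bin edges are chosen so that each bin
# receives roughly the same number of samples.
discretizer = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='quantile')
bins = discretizer.fit_transform(values).ravel()

# Binarizer maps values strictly greater than the threshold to 1
# and values less than or equal to the threshold to 0.
binarizer = Binarizer(threshold=50)
flags = binarizer.fit_transform(values).ravel()

print(bins)
print(flags)
```

Note that the `Binarizer` comparison is strict: a value exactly equal to the threshold becomes `0`. Keep this in mind when a value equal to the cut-off must become `1`.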

In [None]:
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.preprocessing import Binarizer

def discretize_reviews(X):

    X_a = X.copy()
    
    ## create new column `discrete_reviews` using suitable transformer
    # discretizer = ...
    # ...
    ## create new column `binary_reviews` using suitable transformer
    # binarizer = ...
    # ...
    
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return X_a

In [None]:
"""Check that the solution is correct."""
X_reviews = discretize_reviews(data)

assert X_reviews.discrete_reviews.nunique() == 32
assert X_reviews.discrete_reviews.max() == 99
assert X_reviews.loc[123, 'discrete_reviews'] == 0
assert X_reviews.binary_reviews.nunique() == 2
assert X_reviews.binary_reviews.max() == 1
assert X_reviews.loc[123, 'binary_reviews'] == 0

Check the distribution of the two new fields you just calculated:

In [None]:
X_reviews.discrete_reviews.plot.hist(bins=100, figsize=(10,6));
plt.xlim(0,99);
plt.xlabel('discrete_reviews');
plt.title('Reviews after discretization');

In [None]:
X_reviews.binary_reviews.plot.hist(figsize=(4,4));
plt.xlim(0,1);
plt.xlabel('binary_reviews');
plt.title('Reviews after binarization');

### Exercise 4: Scale `Reviews` field (graded)

In the Learning Notebook, you also learned that numerical data can be scaled. 

In this exercise, let's scale the field `Reviews` in three different ways and compare the results:
* create a new field called `minmaxscaled_reviews`, which uniformly scales the `Reviews` field such that the values range from 0 to 1;
* create a new field called `standardscaled_reviews`, which scales the `Reviews` field such that the *mean* is 0 and the standard deviation is 1;
* create a new field called `robustscaled_reviews`, which scales the `Reviews` field such that the *median* is 0 and the spread is scaled by the interquartile range (IQR).
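
To see how the three scalers differ in the presence of an outlier, here is a minimal sketch on a hypothetical toy array (all scalers are used with their default settings):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Hypothetical toy values with one large outlier; sklearn expects a 2D array.
values = np.array([1, 2, 3, 4, 100], dtype=float).reshape(-1, 1)

# (x - min) / (max - min): values end up in [0, 1].
minmax = MinMaxScaler().fit_transform(values).ravel()

# (x - mean) / std: mean 0 and standard deviation 1.
standard = StandardScaler().fit_transform(values).ravel()

# (x - median) / IQR: the median maps to 0 and the outlier
# does not affect the scale.
robust = RobustScaler().fit_transform(values).ravel()

print(minmax)
print(standard)
print(robust)
```

With min-max and standard scaling, the outlier squeezes the remaining values into a narrow range. With robust scaling, the bulk of the data keeps a spread of about one IQR and only the outlier lands far away.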

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler

def scale_reviews(X):

    X_s = X.copy()
    
    ## create new column `minmaxscaled_reviews` using suitable transformer
    # ...
    ## create new column `standardscaled_reviews` using suitable transformer
    # ...
    ## create new column `robustscaled_reviews` using suitable transformer
    # ...
    
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return X_s

In [None]:
"""Check that the solution is correct."""
X_scaled = scale_reviews(data)

assert X_scaled.minmaxscaled_reviews.min() == 0
assert X_scaled.minmaxscaled_reviews.max() == 1
assert math.isclose(X_scaled.minmaxscaled_reviews.mean(), 0.0057, abs_tol = 0.0001)
assert math.isclose(X_scaled.loc[1234, 'minmaxscaled_reviews'], 0.00022, abs_tol = 0.00001)
assert math.isclose(X_scaled.standardscaled_reviews.min(), -0.152, abs_tol = 0.001)
assert math.isclose(X_scaled.standardscaled_reviews.max(), 26.55, abs_tol = 0.01)
assert math.isclose(X_scaled.standardscaled_reviews.mean(), 0, abs_tol = 1e-9)
assert math.isclose(X_scaled.loc[1234, 'standardscaled_reviews'], -0.146, abs_tol = 0.01)
assert math.isclose(X_scaled.robustscaled_reviews.min(), -0.0383, abs_tol = 0.0001)
assert math.isclose(X_scaled.robustscaled_reviews.max(), 1427.84, abs_tol = 0.01)
assert math.isclose(X_scaled.robustscaled_reviews.mean(), 8.076, abs_tol = 0.001)
assert math.isclose(X_scaled.loc[1234, 'robustscaled_reviews'], 0.274, abs_tol = 0.001)

Plot the distributions for the new fields you just calculated:

In [None]:
X_scaled.minmaxscaled_reviews.plot.hist(bins=30, figsize=(10,6));
plt.xlim(0,1);
plt.xlabel('minmaxscaled_reviews');
plt.title('Reviews after min-max scaling');

In [None]:
X_scaled.standardscaled_reviews.plot.hist(bins=30, figsize=(10,6));
plt.xlabel('standardscaled_reviews');
plt.title('Reviews after standard scaling');

In [None]:
X_scaled.robustscaled_reviews.plot.hist(bins=30, figsize=(10,6));
plt.xlabel('robustscaled_reviews');
plt.title('Reviews after robust scaling');

### Exercise 5: Ordinal encode `Content Rating` feature

Now let's deal with the categorical features.

First, create a new field called `content_rating_encoded` which is the result of ordinal encoding of the `Content Rating` feature.
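
Ordinal encoding replaces each category with an integer. The sketch below illustrates the idea with plain pandas on hypothetical toy data. It is not the `category_encoders` transformer this exercise expects. The numbering from 1 is an assumption made here to mirror the check below, which expects a minimum of 1.

```python
import pandas as pd

# Hypothetical toy data (not from the Play Store dataset).
levels = pd.Series(['low', 'high', 'medium', 'low', 'high'])

# factorize numbers the categories from 0, in order of first appearance.
codes, uniques = pd.factorize(levels)

# Shift by one so that the codes start at 1 (assumption, see above).
levels_encoded = pd.Series(codes + 1, index=levels.index)

print(list(uniques))
print(levels_encoded.tolist())
```

Numbering by order of appearance does not reflect any natural order of the categories. If the order matters, the mapping has to be provided explicitly.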

In [None]:
import category_encoders as ce

def encode_content_rating(X):

    X_r = X.copy()
    
    # create new column using suitable transformer
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return X_r

In [None]:
"""Check that the solution is correct."""
X_content_rating = encode_content_rating(data)

assert X_content_rating.content_rating_encoded.dtype == int
assert X_content_rating.content_rating_encoded.min() == 1
assert X_content_rating.content_rating_encoded.max() == 6
assert X_content_rating.loc[1234, 'content_rating_encoded'] == 1

### Exercise 6: One-hot encode `Category` feature

Finally, perform a one-hot encoding of the `Category` feature. Pay attention to the following points:
* return the original DataFrame `X`, but with the `Category` feature replaced by the new ones resulting from the one-hot encoding;
* make sure the new features have names of the form `Category_<value>`, where `<value>` is the category being indicated by that feature.
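
The expected output shape can be sketched with `pd.get_dummies` on a hypothetical toy DataFrame. This is a pandas alternative shown only to illustrate the result. The check below looks for a `Category_-1` column, which suggests an encoder that adds an indicator for unknown values, and `get_dummies` does not do that.

```python
import pandas as pd

# Hypothetical toy DataFrame (column names are illustrative only).
toy = pd.DataFrame({
    'App': ['a', 'b', 'c'],
    'Kind': ['GAME', 'TOOLS', 'GAME'],
})

# Replace `Kind` with one indicator column per value, named `Kind_<value>`.
toy_encoded = pd.get_dummies(toy, columns=['Kind'], prefix='Kind', dtype=int)

print(toy_encoded)
```

The original `Kind` column is dropped and the remaining columns are left untouched, which matches the two requirements listed above.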

In [None]:
def encode_category(X):

    X_t = X.copy()
    
    # perform one-hot encoding in X_t
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return X_t

In [None]:
"""Check that the solution is correct."""
X_category = encode_category(data)

assert X_category.shape[1] > 10
assert X_category.Category_WEATHER.sum() == 82
assert X_category['Category_-1'].sum() == 0
assert X_category.loc[1234, 'Category_VIDEO_PLAYERS'] == 0
assert X_category.loc[4322, 'Category_SHOPPING'] == 1