# Participation: Feature Engineering

Assignment built using SKLearn Samples:
 - https://scikit-learn.org/stable/auto_examples/applications/plot_cyclical_feature_engineering.html

For this activity, we will explore a multi-faceted dataset and implement regression. We'll use built-in SKLearn tools to understand the dataset, transform it to suit our needs, and attempt regression with KNN and Random Forests.

In [1]:
#Imports

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

**0) Use your unique random seed (last 5 digits of your BuffOne card) in the train_test_split**

In [7]:
# Initialize our dataset

db_data = load_diabetes(scaled=False)
X_train, X_test, y_train, y_test = train_test_split(db_data.data, db_data.target, test_size = 0.3, random_state=78578)

Freely explore the dataset to answer the following questions:

**1) How many features are there? (Do not include "target", which we will be using as our outcome measure)**

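There are 8 features (excluding the target, `class`) in the OpenML dataset explored below (`data_id=37`). Note that the `load_diabetes` data used for the train/test split above is a different dataset with 10 features.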

**2) For each feature, figure out whether it is numeric (either continuous or discrete), ordinal, or categorical (including binary, e.g. True/False). Report the breakdown of types for the features.**

Preg (number of times pregnant): numeric and discrete (integer counts only).

Plas (plasma glucose concentration at 2 hours in an oral glucose tolerance test): numeric. The underlying quantity is continuous, but it is stored as a rounded integer, so it could also be treated as discrete.

Pres (blood pressure): numeric. Continuous in principle, but stored as an integer (same reasoning as plas), so it could also be treated as discrete.

Skin (skin fold thickness): numeric. Continuous in principle, but stored as an integer, so it could also be treated as discrete.

Insu (serum insulin): numeric. Insulin levels are continuous in principle, but the values are stored as integers, so it could also be treated as discrete.

Mass: numeric and continuous (stored as a decimal).

Pedi (pedigree): numeric and continuous (stored as a decimal).

Age: numeric. Age is continuous in principle, but it is stored as an integer, so it could also be treated as discrete.

Overall, all 8 features are numeric; none are ordinal or categorical. Only the target, `class`, is categorical.

**3) How many samples are in our train set? Our test set?**

There are 309 samples in the training set and 133 samples in the test set.
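These counts can be checked by hand. The short sketch below assumes the `load_diabetes` dataset has 442 samples and that `train_test_split` rounds a fractional `test_size` up to the next whole sample; the variable names are only illustrative.

```python
import math

n_samples = 442   # assumed size of the load_diabetes dataset
test_size = 0.3   # same fraction passed to train_test_split above

# A fractional test_size is rounded up to a whole number of samples;
# the training set receives whatever remains.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

The two numbers should match the values printed by `len(X_train)` and `len(X_test)`.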

In [22]:
#Your exploration workspace
from sklearn.datasets import fetch_openml
features = fetch_openml(data_id=37, as_frame=True) #id from the website
features.frame.info()

print(len(X_train))
print(len(X_test))

#HINT Explore the SKLearn fetch_openml documentation to learn more about getting the data: https://scikit-learn.org/dev/modules/generated/sklearn.datasets.fetch_openml.html

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   preg    768 non-null    int64   
 1   plas    768 non-null    int64   
 2   pres    768 non-null    int64   
 3   skin    768 non-null    int64   
 4   insu    768 non-null    int64   
 5   mass    768 non-null    float64 
 6   pedi    768 non-null    float64 
 7   age     768 non-null    int64   
 8   class   768 non-null    category
dtypes: category(1), float64(2), int64(6)
memory usage: 49.0 KB
309
133


Now that we better understand the diabetes dataset, let's look at trying to model it. First, we can run regression on the unaltered features and see how they perform. Remember, for regression we need a measure of distance between the real and predicted values. For today, we will use the Mean Squared Error (MSE).
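As a reminder of what the metric computes, here is a minimal hand-written version of MSE on three made-up values. The function name `mse_by_hand` and the numbers are hypothetical and only for illustration; the notebook itself uses `sklearn.metrics.mean_squared_error`.

```python
def mse_by_hand(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)

# Hypothetical targets and predictions (not taken from the dataset)
y_true_demo = [100.0, 150.0, 200.0]
y_pred_demo = [110.0, 140.0, 230.0]

# Squared errors are 100, 100 and 900, so the mean is 1100 / 3
mse_demo = mse_by_hand(y_true_demo, y_pred_demo)
print(mse_demo)
```

Because the errors are squared, one large miss (the 30-unit error here) dominates the score, which is worth keeping in mind when comparing models by MSE.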

In [24]:
# KNN Regressor

knn = KNeighborsRegressor()
knn.fit(X_train, y_train)
knn_pred = knn.predict(X_test)
knn_mse = mean_squared_error(y_test, knn_pred)
print(knn_mse)

#TODO Display resulting mse

5369.800902255639


In [25]:
#Random Forest Regressor
forest = RandomForestRegressor()
forest.fit(X_train, y_train)
forest_pred = forest.predict(X_test)
forest_mse = mean_squared_error(y_test, forest_pred)
print(forest_mse)
#TODO Display resulting MSE

3423.705415789474


**3) Which regressor performs better on our dataset? What is its MSE?**

The random forest regressor performs better because its MSE is smaller than the KNN regressor's. Its MSE is about 3423.7, compared with about 5369.8 for KNN.

Next, let's implement Min-Max Scaling and Standardization (often loosely called normalization) for our features in X. Again, we'll use SKLearn's implementations to accomplish this task.
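To make the two transformations concrete, the sketch below applies them by hand to one made-up feature column. Min-max scaling maps values to the range [0, 1] via (x - min) / (max - min), and `StandardScaler` subtracts the mean and divides by the population standard deviation. The helper names and the sample values are hypothetical.

```python
import math

def min_max_column(values):
    """Rescale one feature column to the range [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

def standardize_column(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]

column = [2.0, 4.0, 6.0, 8.0]  # hypothetical feature values

scaled_column = min_max_column(column)
standardized_column = standardize_column(column)

print(scaled_column)
print(standardized_column)
```

In the notebook the statistics (min, max, mean, standard deviation) are learned from `X_train` only and then reused for `X_test`, which is why `fit` is called once and `transform` twice.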

In [26]:
# Min Max Scaler

scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Standard Scaler (standardizes each feature to zero mean and unit variance)

normer = StandardScaler()
normer.fit(X_train)
X_train_normed = normer.transform(X_train)
X_test_normed = normer.transform(X_test)



**4) How does the KNN Regressor perform on each transformed feature set?**

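With the scaler (Min-Max), the MSE was $3464.97$. With the normer (StandardScaler), the MSE was $3394.30$. KNN therefore performed better with the normer, and both results are a large improvement over the unscaled MSE of about $5369.80$.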
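KNN is sensitive to rescaling because it picks neighbors by distance, and a feature with a large numeric range can dominate that distance. The toy sketch below uses made-up two-feature points (labels, values, and helper names are all hypothetical) to show the nearest neighbor changing once both features are min-max scaled. For brevity it scales the query together with the other points, which is a simplification of the fit-on-train workflow.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_max_scale(rows):
    """Min-max scale every column of a list of rows to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

def nearest_label(query, rows, labels):
    """Label of the row closest to the query point."""
    return min(zip(labels, rows), key=lambda pair: euclidean(query, pair[1]))[0]

# Hypothetical points: (small-range feature, large-range feature)
labels = ["B", "C", "D", "E"]
query = (30.0, 200.0)
rows = [(32.0, 300.0), (70.0, 210.0), (20.0, 0.0), (80.0, 1000.0)]

# On raw values the large-range second feature dominates the distance
raw_nearest = nearest_label(query, rows, labels)

# After scaling, both features contribute on the same footing
scaled = min_max_scale([query] + rows)
scaled_nearest = nearest_label(scaled[0], scaled[1:], labels)

print(raw_nearest, scaled_nearest)
```

On the raw values the neighbor is chosen almost entirely by the second feature; after scaling, the point that is close in both features wins instead.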

**5) How does the Random Forest Regressor perform on each transformed feature set?**

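With the scaler, the MSE was $3463.24$. With the normer, the MSE was $3446.20$. The normer was slightly better, but neither beat the unscaled MSE of about $3423.71$. These differences are small enough to be run-to-run noise: no `random_state` is set on the forest, so its MSE changes between runs, and tree splits are not affected by rescaling a feature.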

In [32]:
#YOUR EXPLORATION SPACE
knn_scaler = KNeighborsRegressor()
knn_scaler.fit(X_train_scaled, y_train)
knn_scaler_pred = knn_scaler.predict(X_test_scaled)
knn_scaler_mse = mean_squared_error(y_test, knn_scaler_pred)
print(knn_scaler_mse)

knn_normer = KNeighborsRegressor()
knn_normer.fit(X_train_normed, y_train)
knn_normer_pred = knn_normer.predict(X_test_normed)
knn_normer_mse = mean_squared_error(y_test, knn_normer_pred)
print(knn_normer_mse)

forest_scaler = RandomForestRegressor()
forest_scaler.fit(X_train_scaled, y_train)
forest_scaler_pred = forest_scaler.predict(X_test_scaled)
forest_scaler_mse = mean_squared_error(y_test, forest_scaler_pred)
print(forest_scaler_mse)

forest_normer = RandomForestRegressor()
forest_normer.fit(X_train_normed, y_train)
forest_normer_pred = forest_normer.predict(X_test_normed)
forest_normer_mse = mean_squared_error(y_test, forest_normer_pred)
print(forest_normer_mse)




# HINT - replace "X_train" in the fit() and "X_test" in the predict()


3464.9702255639095
3394.298345864662
3463.2381007518793
3446.2006691729325


### BONUS (50 Participation Points)

Play with the different feature transformations and tune the Regressors' hyperparameters to get the best results you can on MSE. Report the results of your exploration - which model / hyperparameters and which feature transformation combined to get the best results?

In [39]:
#YOUR EXPLORATION HERE

def sweep(make_model, values, X_tr, X_te):
  # Fit one model per hyperparameter value and record its test MSE.
  # Keys are the actual hyperparameter values, so no index offset is needed.
  scores = {}
  for value in values:
    model = make_model(value)
    model.fit(X_tr, y_train)
    scores[value] = mean_squared_error(y_test, model.predict(X_te))
  return scores

def make_knn(k):
  return KNeighborsRegressor(n_neighbors=k)

def make_forest(n):
  return RandomForestRegressor(n_estimators=n)

# name -> (hyperparameter name, {hyperparameter value: MSE})
results = {
  "scaler KNN": ("n_neighbors", sweep(make_knn, range(1, 21), X_train_scaled, X_test_scaled)),
  "normer KNN": ("n_neighbors", sweep(make_knn, range(1, 21), X_train_normed, X_test_normed)),
  "scaler forest": ("n_estimators", sweep(make_forest, range(1, 300), X_train_scaled, X_test_scaled)),
  "normer forest": ("n_estimators", sweep(make_forest, range(1, 300), X_train_normed, X_test_normed)),
}

# Best hyperparameter value for each model / transformation pair
for name, (param, scores) in results.items():
  best = min(scores, key=scores.get)
  print(f"The best hyperparameter ({param}) for {name} is {best} with an MSE of {scores[best]}")

# Overall winner: the pair whose best MSE is lowest
best_name = min(results, key=lambda name: min(results[name][1].values()))
best_param, best_scores = results[best_name]
best_value = min(best_scores, key=best_scores.get)
print(f"Therefore, the best combination is {best_name} with {best_param} {best_value} and an MSE of {best_scores[best_value]}")

The best hyperparameter (n_neighbors) for scaler KNN is 17 with an MSE of 3351.9017353071254
The best hyperparameter (n_neighbors) for normer KNN is 16 with an MSE of 3274.1811266447367
The best hyperparameter (n_estimators) for scaler forest is 78 with an MSE of 3221.5451523285533
The best hyperparameter (n_estimators) for normer forest is 13 with an MSE of 3225.258664412511
Therefore, the best combination is scaler forest with n_estimators 78 and an MSE of 3221.5451523285533
