# Feature Engineering Challenge

<a href="https://colab.research.google.com/github/coding-dojo-data-science/week-10-lecture-2-feature-engineering/blob/11-7-22/Challenge%20Feature%20Engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


In this notebook you will perform feature engineering to try to improve model performance.

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.metrics import (mean_absolute_error, r2_score, mean_squared_error,
                             precision_score, recall_score, accuracy_score,
                             f1_score, ConfusionMatrixDisplay,
                             classification_report)

import warnings
warnings.filterwarnings('ignore')

# Useful Functions

In [None]:
def eval_regression(true, pred, name='Model'):
  scores = pd.DataFrame()
  scores['Model Name'] = [name]
  scores['RMSE'] = [np.sqrt(mean_squared_error(true, pred))]
  scores['MAE'] = [mean_absolute_error(true, pred)]
  scores['R2'] = [r2_score(true, pred)]
  return scores

def eval_classification(true, pred, name='Model'):
  """shows classification_report and confusion matrix
  for the model predictions"""
  
  print(name, '\n')
  print(classification_report(true, pred))
  ConfusionMatrixDisplay.from_predictions(true, pred)
  plt.show()

  scores = pd.DataFrame()
  scores['Model Name'] = [name]
  scores['Precision'] = [precision_score(true, pred)]
  scores['Recall'] = [recall_score(true, pred)]
  scores['F1 Score'] = [f1_score(true, pred)]
  scores['Accuracy'] = [accuracy_score(true, pred)]

  return scores

## Data

Today we will use data about housing sales in Melbourne, Australia. 

## Your job is to predict the sale price of the house.

In [None]:
# load data
df = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vQnwAtoM6edkuZ4Xncjx_wnZjN6zcWRtBZdK9wfQwW6AzXCGhOdjvTQrtbsEU5-LxKdOmz5FAtw66tc/pub?gid=1132845715&single=true&output=csv')
df_original = df.copy()

display(df.head())
print(df.shape)

## Explore and clean the data
1. Drop the 'Address' column; it's too specific.
2. Drop any duplicates
3. Look for missing values.  If you want to drop rows or columns, now is the time.  Wait on imputing until after the split.
4. Check summary statistics to look for outliers.
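
The four steps above can be sketched on a small stand-in DataFrame. This is only an illustration: the rows and the `Rooms` column are made up, and only `Address` and `Price` are column names taken from this notebook. On the real data you would run the same calls on `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data; all values are invented for illustration.
toy = pd.DataFrame({
    'Address': ['1 Example St', '1 Example St', '2 Sample Rd', '3 Demo Ave'],
    'Rooms': [2, 2, 3, 4],  # hypothetical column
    'Price': [850000.0, 850000.0, 1200000.0, np.nan],
})

# 1. Drop the overly specific column.
clean = toy.drop(columns='Address')

# 2. Drop exact duplicate rows.
clean = clean.drop_duplicates()

# 3. Count missing values per column (impute later, after the split).
print(clean.isna().sum())

# 4. Summary statistics: compare min/max against the quartiles to spot outliers.
print(clean.describe())
```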

In [None]:
df.info()

In [None]:
# check for duplicates
df.duplicated().sum()

Notice the values in the 'unique' row of the summary statistics. Which categorical columns have high cardinality (many different categories)?

In [None]:
# check summary statistics
df.describe(include='all')

In [None]:
df['Price'].describe()

In [None]:
# explore numeric distributions
for col in df.select_dtypes('number'):
  print('\n', col, '\n')
  df[col].plot(kind='box')
  plt.show()

# Feature Engineering

What would you do to improve this dataset?

### Some Ideas:
1. Remove outliers
2. Change the distribution with np.log, np.sqrt, np.cbrt
3. Bin features or target with .replace or .apply
4. Combine features
5. Extract hour, day, or month from datetime
6. Encode data: one-hot encoding, ordinal encoding, target encoding
7. Parse strings
8. Try different imputation strategies

In [None]:
# check for missing values
df.isna().sum()

# 1. Handle Missing Values
Ideas:
1. Drop columns
2. Drop rows
3. Wait and impute later
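
One possible way to apply the first two ideas is sketched below, assuming a 50% missing-value threshold for dropping columns. The threshold, the toy values, and the `BuildingArea` and `Car` column names are assumptions for illustration, not part of the exercise. Anything still missing afterwards is left for an imputer inside the pipeline.

```python
import numpy as np
import pandas as pd

# Toy data with invented values; column names other than 'Price' are hypothetical.
toy = pd.DataFrame({
    'Rooms': [2, 3, 4, 3, 5],
    'BuildingArea': [np.nan, np.nan, 150.0, np.nan, np.nan],
    'Car': [1.0, np.nan, 2.0, 1.0, 2.0],
    'Price': [800000.0, 950000.0, np.nan, 1100000.0, 1500000.0],
})

# Fraction of missing values in each column.
missing_frac = toy.isna().mean()

# 1. Drop columns that are mostly empty (assumed threshold: more than 50% missing).
threshold = 0.5
cols_to_drop = missing_frac[missing_frac > threshold].index.tolist()
reduced = toy.drop(columns=cols_to_drop)

# 2. Drop rows where the target is missing; a row without a price cannot be scored.
reduced = reduced.dropna(subset=['Price'])

# 3. Remaining gaps (here in 'Car') wait for an imputer fit on the training split.
print(reduced.isna().sum())
```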

In [None]:
# 



# 2. Engineer Categorical Features

Ideas:
1. Extract day, month, and/or year from the 'Date' column
2. Remove columns with high cardinality
3. Bin categories to reduce cardinality
4. Combine categorical features
5. Split categorical features
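
A minimal sketch of ideas 1 and 3 follows. It assumes the dates are day/month/year strings and uses a hypothetical `Suburb` column with made-up values; check the real format of 'Date' before choosing a `format` string. The rarity cutoff of 2 is arbitrary.

```python
import pandas as pd

# Toy data; the date format and the 'Suburb' column are assumptions.
toy = pd.DataFrame({
    'Date': ['03/12/2016', '04/02/2017', '17/09/2016', '23/09/2017', '08/07/2017'],
    'Suburb': ['A', 'A', 'A', 'B', 'C'],
})

# 1. Parse the date strings, then pull out the year and month as new features.
dates = pd.to_datetime(toy['Date'], format='%d/%m/%Y')
toy['Year'] = dates.dt.year
toy['Month'] = dates.dt.month
toy = toy.drop(columns='Date')

# 3. Bin rare categories: anything seen fewer than 2 times becomes 'Other'.
counts = toy['Suburb'].value_counts()
rare = counts[counts < 2].index
toy['Suburb_binned'] = toy['Suburb'].where(~toy['Suburb'].isin(rare), 'Other')

print(toy)
```

Strictly speaking, which categories count as rare is learned from the data, so on the real dataset it is safer to compute `counts` from the training split only.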


In [None]:
for col in df.select_dtypes('object'):
  print(col, df[col].nunique())

# 3. Engineer Numeric Features

Possible Options:
1. Remove outliers
2. Reshape distributions with np.log, np.sqrt, or np.cbrt
3. Bin a numeric feature to make it nominal or ordinal
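
The three options might look like the sketch below. It uses the common 1.5 × IQR rule for outliers, which is one convention among several, and a hypothetical `Landsize` column with invented values and bin edges.

```python
import numpy as np
import pandas as pd

# Toy data with one extreme value; the column name and values are hypothetical.
toy = pd.DataFrame({'Landsize': [150.0, 200.0, 250.0, 300.0, 350.0, 400.0, 50000.0]})

# 1. Remove outliers: keep rows within 1.5 * IQR of the quartiles.
q1, q3 = toy['Landsize'].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = toy['Landsize'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
trimmed = toy[in_range].copy()

# 2. Reshape the distribution: log1p handles zeros, unlike a plain log.
trimmed['Landsize_log'] = np.log1p(trimmed['Landsize'])

# 3. Bin the feature into ordered categories (bin edges are arbitrary here).
trimmed['Landsize_bin'] = pd.cut(trimmed['Landsize'],
                                 bins=[0, 200, 300, np.inf],
                                 labels=['small', 'medium', 'large'])

print(trimmed)
```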



# (optional) 4. Engineer the Target

Options:
1. Transform the target with np.log, np.sqrt, np.cbrt
2. Bin the target to make this a classification problem

**Do NOT leak data!**
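
Both options can be sketched without leaking information from the test set. The example below uses synthetic data and scikit-learn's `TransformedTargetRegressor` for the log transform, which is one convenient way to do it rather than the required one. The binning cutoff is computed from the training target only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic, right-skewed prices; purely illustrative.
rng = np.random.default_rng(42)
X = pd.DataFrame({'Rooms': rng.integers(1, 6, size=200)})
y = pd.Series(np.exp(12 + 0.3 * X['Rooms'] + rng.normal(0, 0.1, size=200)),
              name='Price')

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Option 1: fit on log1p(y); predictions are mapped back to prices with expm1.
log_model = TransformedTargetRegressor(regressor=LinearRegression(),
                                       func=np.log1p, inverse_func=np.expm1)
log_model.fit(X_tr, y_tr)
preds = log_model.predict(X_te)

# Option 2: bin the target. The cutoff comes from the training split only,
# so nothing about the test prices leaks into the labels.
cutoff = y_tr.mean()
y_tr_bin = (y_tr > cutoff).astype(int)
y_te_bin = (y_te > cutoff).astype(int)
```

Because the wrapper inverts the transform, metrics such as RMSE and MAE stay in the original price units and remain comparable with an untransformed base model.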

## Validation Split

## Bin the Target

# 5. Modeling:

1. Create a Base Model on the original data
2. Copy the model type and fit it on your engineered data
3. Compare the performance of each model.

### Original features from before feature engineering

In [None]:
X_og = df_original.drop(columns='Price')
y_og = df_original['Price']

X_train_og, X_test_og, y_train_og, y_test_og = train_test_split(X_og, y_og, 
                                                                random_state=42)

### Uncomment and run the cell below if you binned the target above

In [None]:
# mean_price = y_train_og.mean()
# y_train_og = y_train_og.apply(lambda x: 1 if x > mean_price else 0)
# y_test_og = y_test_og.apply(lambda x: 1 if x > mean_price else 0)

In [None]:
# Create Preprocessing
scaler = StandardScaler()
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')  # 'sparse' was renamed to 'sparse_output' in scikit-learn 1.2

median_imputer = SimpleImputer(strategy='median')
missing_imputer = SimpleImputer(strategy='constant', fill_value='missing')

cat_cols = make_column_selector(dtype_include='object')
num_cols = make_column_selector(dtype_include='number')

num_pipe = make_pipeline(median_imputer, scaler)
cat_pipe = make_pipeline(missing_imputer, ohe)

processor = make_column_transformer((num_pipe, num_cols), (cat_pipe, cat_cols))
print(processor.fit_transform(X_train_og, y_train_og).shape)

In [None]:
# Instantiate and fit a model
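# Choose a model of the appropriate type (regression or classification).
# LinearRegression() is a placeholder default; use LogisticRegression() if you binned the target.
base_model = LinearRegression()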

base_pipe = make_pipeline(processor, base_model)
base_pipe.fit(X_train_og, y_train_og)

train_pred = base_pipe.predict(X_train_og)
test_pred = base_pipe.predict(X_test_og)

In [None]:
# Evaluate model
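# Use eval_classification instead if you binned the target
train_scores = eval_regression(y_train_og, train_pred, name='Base Model (train)')

test_scores = eval_regression(y_test_og, test_pred, name='Base Model (test)')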

scores = pd.concat([train_scores, test_scores])
scores

# Modeling: Engineered Data

Use your engineered data to fit a new model of the same type as your base model.

You might also do some more engineering here as well, if you want.

Ideas:
1. Different encoders
2. Different imputation strategies
3. Different preprocessing, like scaling, PCA, or PolynomialFeatures
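
As one possible variation on the base preprocessor, the sketch below swaps in mean and most-frequent imputation, an `OrdinalEncoder`, and degree-2 `PolynomialFeatures`. The toy columns (`Rooms`, `Distance`, `Type`) and their values are hypothetical, and the columns are listed by name here rather than chosen with `make_column_selector`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, PolynomialFeatures, StandardScaler

# Toy features with gaps; names and values are invented for illustration.
toy = pd.DataFrame({
    'Rooms': [2, 3, np.nan, 4],
    'Distance': [2.5, 10.0, 7.3, np.nan],
    'Type': pd.Series(['h', 'u', np.nan, 'h'], dtype=object),
})

# Numeric branch: mean imputation, squared and interaction terms, then scaling.
num_pipe = make_pipeline(SimpleImputer(strategy='mean'),
                         PolynomialFeatures(degree=2, include_bias=False),
                         StandardScaler())

# Categorical branch: fill with the mode, then encode categories as integers.
# Categories unseen during fit are encoded as -1 instead of raising an error.
cat_pipe = make_pipeline(SimpleImputer(strategy='most_frequent'),
                         OrdinalEncoder(handle_unknown='use_encoded_value',
                                        unknown_value=-1))

alt_processor = make_column_transformer((num_pipe, ['Rooms', 'Distance']),
                                        (cat_pipe, ['Type']))

out = alt_processor.fit_transform(toy)
print(out.shape)  # 2 numeric columns expand to 5 polynomial terms, plus 1 encoded column
```

Ordinal encoding imposes an order on the categories, which a linear model will treat as meaningful, so it tends to suit tree-based models or genuinely ordered categories better.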

In [None]:
X = df.drop(columns=['Price'])
y = df['Price']

X_train, X_test, y_train, y_test =  train_test_split(X, y, random_state=42)

In [None]:
X_train.head()

In [None]:
# Create Preprocessor


In [None]:
# Create and fit model

# Make Predictions

In [None]:
# evaluate model
# train_scores = Your Code

# test_scores = Your Code

scores = pd.concat([train_scores, test_scores])
scores

# If you have extra time:

Try other feature engineering strategies