# Feature Engineering

*In which we boost, combine, split, or otherwise manipulate the features of crabs.*


### Define Constants


In [15]:
CACHE_FILE = '../cache/normlcrabs.feather'
NEXT_CACHE_FILE = '../cache/designrcrabs.feather'
NEXT_NOTEBOOK = '../3-evaluation/evaluation.ipynb'

PREDICTION_TARGET = 'Age'    # 'Age' is predicted
DATASET_COLUMNS = ['Sex','Length','Diameter','Height','Weight','Shucked Weight','Viscera Weight','Shell Weight',PREDICTION_TARGET]
REQUIRED_COLUMNS = [PREDICTION_TARGET]


### Importing Libraries


In [21]:
from notebooks.time_for_crab.mlutils import display_df

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

pd.set_option('mode.copy_on_write', True)


### Load Data from Cache

In the [previous section](../1-models/models.ipynb), we saved the normalized training data to the cache.


In [17]:
crabs = pd.read_feather(CACHE_FILE)

display_df(crabs, show_distinct=True)
# split features from target
X = crabs.drop([PREDICTION_TARGET], axis=1)
y = crabs[PREDICTION_TARGET]


DataFrame shape: (3114, 10)
First 5 rows:
        Length  Diameter    Height    Weight  Shucked Weight  Viscera Weight  \
1698  0.500977  0.394531 -0.725586 -0.199707       -0.126953       -0.445801   
1361  0.743164  0.713867 -0.645996  0.489258        0.507812       -0.045898   
1972  0.013672 -0.025391 -0.787598 -0.706543       -0.755859       -0.750977   
960   0.163086  0.126953 -0.813965 -0.537109       -0.616211       -0.527344   
2639  0.716797  0.748047 -0.690430  0.099609       -0.041504       -0.026367   

      Shell Weight  Sex_F  Sex_I  Sex_M  
1698     -0.362305  False  False   True  
1361      0.100586  False  False   True  
1972     -0.701660   True  False  False  
960      -0.579102  False  False   True  
2639      0.159180   True  False  False  
<class 'pandas.core.frame.DataFrame'>
Index: 3114 entries, 1698 to 645
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Length          3114 non-n

### Feature Significance

Remove features with low variance. A feature that barely changes from one crab to the next gives the model little to distinguish them by, so it is likely to be less important.


In [18]:
from sklearn.feature_selection import VarianceThreshold
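
As a minimal sketch of how this selector could be applied, the snippet below fits `VarianceThreshold` on a small made-up frame and keeps only the columns whose variance exceeds the threshold. The toy values, the `Constant` column, and the `0.01` threshold are illustrative assumptions, not values taken from this notebook.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy data (hypothetical): 'Constant' never changes, so its variance is 0.
toy = pd.DataFrame({
    'Length':   [0.50, 0.74, 0.01, 0.16],
    'Weight':   [-0.20, 0.49, -0.71, -0.54],
    'Constant': [1.0, 1.0, 1.0, 1.0],
})

# Keep only features whose variance is strictly greater than the threshold.
selector = VarianceThreshold(threshold=0.01)  # assumed threshold, tune as needed
selector.fit(toy)

# get_support() returns a boolean mask over the input columns.
kept_columns = toy.columns[selector.get_support()]
toy_selected = toy[kept_columns]
print(list(kept_columns))
```

Selecting by column mask rather than using the array returned by `transform` keeps the result a DataFrame with its original column names.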


### Data Augmentation

Crabs are complex creatures. Let's engineer some features to help our model find the best crabs for harvest.

We'll need to use domain knowledge to extract more features from our dataset's columns.

![This kills the crab.](https://i.kym-cdn.com/photos/images/newsfeed/000/112/843/killcrab.jpg)

For example, we can find the edible weight of the crab by subtracting the viscera weight from the shucked weight.  
However, we need to be careful not to overfit the model by adding collinear features.


In [19]:
def data_augmentation(df: pd.DataFrame) -> pd.DataFrame:
    """Add new features to the DataFrame.

    Driven by domain knowledge.

    :param df: The data.
    :return: A copy of the data with new features.
    """
    # work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    # edible weight: shucked weight minus viscera weight
    df['Edible Weight'] = df['Shucked Weight'] - df['Viscera Weight']
    return df
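
To illustrate the collinearity concern, the following sketch builds the same `Edible Weight` difference on a small made-up frame and compares the rank of the feature matrix before and after. The values are hypothetical and not taken from the crab dataset. Because the new column is an exact linear combination of two existing ones, it adds a column without adding an independent direction.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the crab dataset).
toy = pd.DataFrame({
    'Shucked Weight': [0.5, 0.25, 1.0, 0.75],
    'Viscera Weight': [0.125, 0.125, 0.25, 0.5],
    'Shell Weight':   [0.25, 0.5, 0.125, 0.375],
})
rank_before = np.linalg.matrix_rank(toy.to_numpy())

# Same combination as data_augmentation: shucked minus viscera weight.
augmented = toy.assign(
    **{'Edible Weight': toy['Shucked Weight'] - toy['Viscera Weight']}
)
rank_after = np.linalg.matrix_rank(augmented.to_numpy())

# One more column, but the rank of the feature matrix is unchanged.
print(augmented.shape[1], rank_before, rank_after)
```

A rank check like this only detects exact linear dependence. Strong but imperfect correlation between engineered and original features would need a different diagnostic, such as a correlation matrix.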


### Save the Data

Save the data to the cache so we can pick this back up in the [next step](../3-evaluation/evaluation.ipynb).


In [20]:
crabs.to_feather(NEXT_CACHE_FILE)


### Onwards to Final Evaluation

See the [next section](../3-evaluation/evaluation.ipynb) for the final evaluation.
