# Processing and Transforming Features

## Introduction

To understand how linear regression works, we've stuck to using features from the training dataset that contained no missing values and were already in a convenient numeric representation. In this mission, we'll explore how to transform some of the remaining features so we can use them in our model. Broadly, the process of processing and creating new features is known as **[feature engineering](https://en.wikipedia.org/wiki/Feature_engineering)**.
* Feature engineering is a bit of an art, and knowledge of the specific domain (in this case, real estate) can help you **create better features**. In this mission, we'll focus on some domain-independent strategies that apply to most problems.<br>

In the first half of this mission, we'll focus only on columns that contain no missing values but still aren't in the proper format to use in a linear regression model. In the latter half of this mission, we'll explore some ways to deal with missing values.<br>

Amongst the columns that don't contain missing values, some of the common issues include:

* the column is not numerical (e.g. a zoning code represented using text)
* the column is numerical but not ordinal (e.g. zip code values)
* the column is numerical but isn't representative of the type of relationship with the target column (e.g. year values)

Let's start by filtering the training set to just the columns containing no missing values.


In [30]:
import pandas as pd

data = pd.read_csv('data/AmesHousing.txt', delimiter="\t")
# Copy each slice so that later column assignments don't write to a view of `data`.
train = data[0:1460].copy()
test = data[1460:].copy()

train_null_counts = train.isnull().sum()
#print(train_null_counts)

In [31]:
zero_null_features = train_null_counts[train_null_counts == 0].index
df_no_mv = train[zero_null_features]
df_no_mv.head()

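| | Order | PID | MS SubClass | MS Zoning | Lot Area | Street | Lot Shape | Land Contour | Utilities | Lot Config | ... | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Misc Val | Mo Sold | Yr Sold | Sale Type | Sale Condition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 526301100 | 20 | RL | 31770 | Pave | IR1 | Lvl | AllPub | Corner | ... | 0 | 0 | 0 | 0 | 0 | 5 | 2010 | WD | Normal | 215000 |
| 1 | 2 | 526350040 | 20 | RH | 11622 | Pave | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | 120 | 0 | 0 | 6 | 2010 | WD | Normal | 105000 |
| 2 | 3 | 526351010 | 20 | RL | 14267 | Pave | IR1 | Lvl | AllPub | Corner | ... | 0 | 0 | 0 | 0 | 12500 | 6 | 2010 | WD | Normal | 172000 |
| 3 | 4 | 526353030 | 20 | RL | 11160 | Pave | Reg | Lvl | AllPub | Corner | ... | 0 | 0 | 0 | 0 | 0 | 4 | 2010 | WD | Normal | 244000 |
| 4 | 5 | 527105010 | 60 | RL | 13830 | Pave | IR1 | Lvl | AllPub | Inside | ... | 0 | 0 | 0 | 0 | 0 | 3 | 2010 | WD | Normal | 189900 |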


## Categorical Features

You'll notice that some of the columns in the data frame `df_no_mv` contain string values. If these columns contain only a limited set of unique values, they're known as **categorical features**. As the name suggests, a categorical feature places each training example into a specific category. Here are some examples from the dataset:

```python
>>> train['Utilities'].value_counts()
AllPub    1457
NoSewr       2
NoSeWa       1
Name: Utilities, dtype: int64

>>> train['Street'].value_counts()
Pave    1455
Grvl       5

>>> train['House Style'].value_counts()
1Story    743
2Story    440
1.5Fin    160
SLvl       60
SFoyer     35
2.5Unf     11
1.5Unf      8
2.5Fin      3
```

To use these features in our model, we need to transform them into numerical representations. Thankfully, pandas makes this easy because the library has a special [categorical data type](https://pandas.pydata.org/pandas-docs/stable/categorical.html). We can convert a column to the categorical data type using the `pandas.Series.astype()` method. For now we'll only convert columns that contain no missing values, because pandas doesn't treat a missing value as a category (it receives the code `-1` instead):

```python
>>> train['Utilities'] = train['Utilities'].astype('category')
```

When a column is converted to the categorical data type, pandas assigns an integer code to each unique value in the column. The codes stay hidden unless we access them directly, and most of the pandas operations that work for string columns work for categorical ones as well.

```python
>>> train['Utilities']
0       AllPub
1       AllPub
2       AllPub
3       AllPub
4       AllPub
5       AllPub
...
```

We need to use the `.cat` accessor followed by the `.codes` property to actually access the underlying numerical representation of a column:

```python
>>> train['Utilities'].cat.codes
```
Let's convert all of the text columns that contain no missing values into the categorical data type.

In [32]:
text_cols = df_no_mv.select_dtypes(include=['object']).columns

for col in text_cols:
    print(col+":", len(train[col].unique()))
    train[col] = train[col].astype('category')

MS Zoning: 6
Street: 2
Lot Shape: 4
Land Contour: 4
Utilities: 3
Lot Config: 5

Land Slope: 3
Neighborhood: 26
Condition 1: 9
Condition 2: 6
Bldg Type: 5
House Style: 8
Roof Style: 6
Roof Matl: 5
Exterior 1st: 14
Exterior 2nd: 16
Exter Qual: 4
Exter Cond: 5
Foundation: 6
Heating: 6
Heating QC: 4
Central Air: 2
Electrical: 4
Kitchen Qual: 5
Functional: 7
Paved Drive: 3
Sale Type: 9
Sale Condition: 5


In [33]:
train['Utilities'].value_counts()

AllPub    1457
NoSewr       2
NoSeWa       1
Name: Utilities, dtype: int64

In [34]:
train['Utilities'].cat.codes.value_counts()

0    1457
2       2
1       1
dtype: int64
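
To see how the codes map back to the categories, here is a minimal sketch on a toy Series (the rows are made up for illustration and are not taken from the dataset). By default pandas sorts the unique values to build the category list, so a code reflects a value's position in that sorted list, not any meaningful ranking.

```python
import pandas as pd

# Toy column for illustration only; these rows are not from the Ames data.
utilities = pd.Series(['AllPub', 'NoSewr', 'AllPub', 'NoSeWa'])
utilities = utilities.astype('category')

# The distinct categories, in sorted order.
print(utilities.cat.categories.tolist())
# ['AllPub', 'NoSeWa', 'NoSewr']

# Each row's code is the position of its category in that list.
print(utilities.cat.codes.tolist())
# [0, 2, 0, 1]
```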

## Dummy Coding

When we convert a column to the categorical data type, pandas assigns a number from `0` to `n-1` (where `n` is the number of unique values in the column) to each value. The drawback with this approach is that it violates one of the assumptions of linear regression. Linear regression assumes that each feature is linearly related to the target column. For a categorical feature, however, the codes that pandas assigned have no numerical meaning. An increase in the `Utilities` code from `1` to `2` doesn't correspond to any consistent change in the target column; the codes only serve to tell the categories apart (the category associated with `0` is different from the one associated with `1`).<br>

The common solution is to use a technique called [dummy coding](https://en.wikipedia.org/wiki/Dummy_variable_%28statistics%29). Instead of having a single column with `n` integer codes, we have `n` **binary columns**. Here's what that would look like for the `Utilities` column:

In [27]:
pd.get_dummies(train['Utilities']).head()

| | AllPub | NoSeWa | NoSewr |
|---|---|---|---|
| 0 | 1 | 0 | 0 |
| 1 | 1 | 0 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 |
| 4 | 1 | 0 | 0 |


Because the original values for the first 5 rows were `AllPub`, in the new scheme they contain the binary value for true (`1`) in the `AllPub` column and `0` in the other 2 columns.<br>
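
As a small sketch of the same idea on a toy Series (made-up rows, not the real data), the snippet below dummy-codes one column. The `prefix` argument is optional and is used here only to get names such as `Utilities_AllPub`; `dtype=int` is passed because newer pandas versions return booleans by default.

```python
import pandas as pd

# Toy column for illustration only; these rows are not from the Ames data.
utilities = pd.Series(['AllPub', 'NoSewr', 'AllPub', 'NoSeWa'], name='Utilities')

# One 0/1 column per category, named "<prefix>_<category>".
dummies = pd.get_dummies(utilities, prefix='Utilities', dtype=int)

print(dummies.columns.tolist())
# ['Utilities_AllPub', 'Utilities_NoSeWa', 'Utilities_NoSewr']

# Each row belongs to exactly one category, so every row sums to 1.
print(dummies.sum(axis=1).tolist())
# [1, 1, 1, 1]
```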

Thankfully, pandas has a convenience function called `pandas.get_dummies()` that can apply this transformation to all of the text columns at once:

```python
dummy_cols = pd.get_dummies(train[text_cols])
```

In [28]:
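# Option 1: dummy-code one column at a time. The new columns are named after
# the category values only, with no column-name prefix.
for text_col in text_cols:
    train = pd.concat([train, pd.get_dummies(train[text_col])], axis=1)
    train.drop([text_col], axis=1, inplace=True)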

In [29]:
train.head()

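| | Order | PID | MS SubClass | Lot Frontage | Lot Area | Alley | Overall Qual | Overall Cond | Year Built | Year Remod/Add | ... | ConLI | ConLw | New | Oth | WD | Abnorml | Alloca | Family | Normal | Partial |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 526301100 | 20 | 141.0 | 31770 |  | 6 | 5 | 1960 | 1960 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | 2 | 526350040 | 20 | 80.0 | 11622 |  | 5 | 6 | 1961 | 1961 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 3 | 526351010 | 20 | 81.0 | 14267 |  | 6 | 6 | 1958 | 1958 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 3 | 4 | 526353030 | 20 | 93.0 | 11160 |  | 7 | 5 | 1968 | 1968 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 4 | 5 | 527105010 | 60 | 74.0 | 13830 |  | 5 | 5 | 1997 | 1998 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |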


In [35]:
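# Option 2 (an alternative to the loop above, not a follow-up step): dummy-code
# every text column in one call. Each new column is prefixed with the name of
# its source column, e.g. `Sale Type_WD`.
train = pd.concat([train, pd.get_dummies(train[text_cols])], axis=1)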
train.drop(text_cols, axis=1, inplace=True)
train.head()

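| | Order | PID | MS SubClass | Lot Frontage | Lot Area | Alley | Overall Qual | Overall Cond | Year Built | Year Remod/Add | ... | Sale Type_ConLI | Sale Type_ConLw | Sale Type_New | Sale Type_Oth | Sale Type_WD | Sale Condition_Abnorml | Sale Condition_Alloca | Sale Condition_Family | Sale Condition_Normal | Sale Condition_Partial |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 526301100 | 20 | 141.0 | 31770 |  | 6 | 5 | 1960 | 1960 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | 2 | 526350040 | 20 | 80.0 | 11622 |  | 5 | 6 | 1961 | 1961 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 3 | 526351010 | 20 | 81.0 | 14267 |  | 6 | 6 | 1958 | 1958 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 3 | 4 | 526353030 | 20 | 93.0 | 11160 |  | 7 | 5 | 1968 | 1968 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 4 | 5 | 527105010 | 60 | 74.0 | 13830 |  | 5 | 5 | 1997 | 1998 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |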


## Transforming Improper Numerical Features

In the last few screens, we focused on categorical values that were represented as text columns. **Some of the numerical columns in the data set are also categorical and only have a limited set of unique values**. We won't explicitly explore those columns in this mission, but the feature transformation process is the same if the numbers used in those categories have no numerical meaning.<br>
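
As a rough sketch of how that could look, the snippet below dummy-codes a numeric column on a toy data frame with made-up values. It assumes `MS SubClass` is one of these columns, that is, that its numbers are class codes and not quantities. Listing a column in the `columns` argument of `pd.get_dummies()` dummy-codes it regardless of its data type.

```python
import pandas as pd

# Toy frame for illustration only; the values are made up.
homes = pd.DataFrame({
    'MS SubClass': [20, 60, 20, 120],   # assumed to be category codes
    'Lot Area': [9000, 11000, 8500, 4000],
})

# Columns listed in `columns` are dummy-coded even when their dtype is numeric.
# Unlisted columns such as `Lot Area` are left untouched.
encoded = pd.get_dummies(homes, columns=['MS SubClass'], dtype=int)

print(sorted(encoded.columns))
```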

Let's now look at numerical features that aren't categorical, but whose numerical representation needs to be improved. We'll focus on the `Year Remod/Add` and `Year Built` columns:

```python
>>> train[['Year Remod/Add', 'Year Built']]
   Year Remod/Add  Year Built
0            1960        1960
1            1961        1961
2            1958        1958
3            1968        1968
4            1998        1997
...
```

The two main issues with these features are:

* Year values aren't representative of how old a house is
* The `Year Remod/Add` column doesn't actually provide useful information for a linear regression model

The challenge with year values like `1960` and `1961` is that they don't do a good job of capturing how old a house is. For example, a house that was built in `1960` and sold in `1980` was half as old at the time of sale as one built in `1960` and sold in `2000`. Instead of the years in which certain events happened, we want the difference between those years. For the remodel information, we can create a new column that's the difference between the `Year Remod/Add` and `Year Built` columns.<br>

For this particular piece of information (years until remodeled), this is a sensible approach. Domain knowledge can help you understand how to best transform features to represent information well for a linear model. If you're ever confused about a feature or how it should be represented, reading scientific papers or posts by researchers in the specific domain is critical. Many winners of [Kaggle data science competitions](https://www.import.io/post/how-to-win-a-kaggle-competition/), for example, claim that their focus on data preparation and feature engineering combined with common machine learning models helped them win.

In [38]:
train['years_until_remod'] = train['Year Remod/Add'] - train['Year Built']
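
The same subtraction can be tried on a toy data frame with made-up years. The second feature, `age_when_sold`, is a hypothetical addition that is not part of the original notebook; it turns the "built in 1960, sold in 1980" example into a column using `Yr Sold`.

```python
import pandas as pd

# Toy frame for illustration only; the years are made up.
homes = pd.DataFrame({
    'Year Built': [1960, 1960, 1997],
    'Year Remod/Add': [1960, 1975, 1998],
    'Yr Sold': [1980, 2000, 2010],
})

# Years between construction and remodel (0 when the two years match).
homes['years_until_remod'] = homes['Year Remod/Add'] - homes['Year Built']

# Hypothetical extra feature: the age of the house when it was sold.
homes['age_when_sold'] = homes['Yr Sold'] - homes['Year Built']

print(homes[['years_until_remod', 'age_when_sold']].values.tolist())
# [[0, 20], [15, 40], [1, 13]]
```

A negative difference in either column would point to a data entry problem worth checking before training.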

## Missing Values

In the next few screens, we'll focus on handling columns with missing values. When values are missing in a column, there are two main approaches we can take:

* **Remove** rows containing missing values for specific columns
  * Pro: Rows containing missing values are removed, leaving only clean data for modeling
  * Con: Entire observations from the training set are removed, which can reduce overall prediction accuracy
* **Impute (or replace)** missing values using a descriptive statistic from the column
  * Pro: Missing values are replaced with potentially similar estimates, preserving the rest of the observation in the model.
  * Con: Depending on the approach, we may be adding noisy data for the model to learn
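
The two approaches can be compared on a toy data frame (made-up values). This is only a sketch: it uses the column mean as the replacement statistic, which is one choice among several.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; the values are made up.
homes = pd.DataFrame({
    'Lot Frontage': [80.0, np.nan, 74.0, 93.0],
    'SalePrice': [105000, 172000, 189900, 244000],
})

# Approach 1: remove rows that are missing a value in the chosen column.
dropped = homes.dropna(subset=['Lot Frontage'])

# Approach 2: impute the missing value with the column mean.
imputed = homes.fillna({'Lot Frontage': homes['Lot Frontage'].mean()})

# Dropping loses an observation; imputing keeps all of them.
print(len(dropped), len(imputed))
# 3 4
```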

Given that we only have 1460 training examples (with ~80 potentially useful features), we don't want to remove any of these rows from the dataset. Let's instead focus on **imputation** techniques.

We'll focus on columns that contain at least 1 missing value but fewer than 365 missing values (25% of the number of rows in the training set). There's no strict threshold, and many people instead use a 50% cutoff (if half the values in a column are missing, the column is automatically dropped). Having some domain knowledge can help with determining an acceptable cutoff value.
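
A cutoff can also be expressed as a fraction of rows in place of an absolute count. The sketch below uses a toy data frame and a hypothetical 50% cutoff; the column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; names and values are made up.
homes = pd.DataFrame({
    'complete': [1, 2, 3, 4],
    'few_missing': [1.0, np.nan, 3.0, 4.0],
    'mostly_missing': [np.nan, np.nan, np.nan, 4.0],
})

# Fraction of missing values per column (the mean of a boolean mask).
missing_fraction = homes.isnull().mean()

# Hypothetical cutoff: keep columns with some, but under 50%, missing values.
cutoff = 0.5
to_impute = missing_fraction[(missing_fraction > 0) & (missing_fraction < cutoff)].index

print(to_impute.tolist())
# ['few_missing']
```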


In [41]:
# Keep columns with at least 1 but fewer than 365 missing values (25% of the rows).
df_missing_values = train[
    train_null_counts[(0 < train_null_counts) & (train_null_counts < 365)].index
]
df_missing_values.head()

| | Lot Frontage | Mas Vnr Type | Mas Vnr Area | Bsmt Qual | Bsmt Cond | Bsmt Exposure | BsmtFin Type 1 | BsmtFin SF 1 | BsmtFin Type 2 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | Bsmt Full Bath | Bsmt Half Bath | Garage Type | Garage Yr Blt | Garage Finish | Garage Qual | Garage Cond |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 141.0 | Stone | 112.0 | TA | Gd | Gd | BLQ | 639.0 | Unf | 0.0 | 441.0 | 1080.0 | 1.0 | 0.0 | Attchd | 1960.0 | Fin | TA | TA |
| 1 | 80.0 |  | 0.0 | TA | TA | No | Rec | 468.0 | LwQ | 144.0 | 270.0 | 882.0 | 0.0 | 0.0 | Attchd | 1961.0 | Unf | TA | TA |
| 2 | 81.0 | BrkFace | 108.0 | TA | TA | No | ALQ | 923.0 | Unf | 0.0 | 406.0 | 1329.0 | 0.0 | 0.0 | Attchd | 1958.0 | Unf | TA | TA |
| 3 | 93.0 |  | 0.0 | TA | TA | No | ALQ | 1065.0 | Unf | 0.0 | 1045.0 | 2110.0 | 1.0 | 0.0 | Attchd | 1968.0 | Fin | TA | TA |
| 4 | 74.0 |  | 0.0 | Gd | TA | No | GLQ | 791.0 | Unf | 0.0 | 137.0 | 928.0 | 0.0 | 0.0 | Attchd | 1997.0 | Fin | TA | TA |


In [46]:
for col in df_missing_values.columns:
    print(col, end=' / ')
    print(df_missing_values[col].isnull().sum(), end=' / ')
    print(df_missing_values[col].dtype)

Lot Frontage / 249 / float64
Mas Vnr Type / 11 / object
Mas Vnr Area / 11 / float64
Bsmt Qual / 40 / object
Bsmt Cond / 40 / object
Bsmt Exposure / 41 / object
BsmtFin Type 1 / 40 / object
BsmtFin SF 1 / 1 / float64
BsmtFin Type 2 / 41 / object
BsmtFin SF 2 / 1 / float64
Bsmt Unf SF / 1 / float64
Total Bsmt SF / 1 / float64
Bsmt Full Bath / 1 / float64
Bsmt Half Bath / 1 / float64
Garage Type / 74 / object
Garage Yr Blt / 75 / float64
Garage Finish / 75 / object
Garage Qual / 75 / object
Garage Cond / 75 / object


## Imputing Missing Values

It looks like about half of the columns in `df_missing_values` are string columns (`object` data type), while about half are `float64` columns. For numerical columns with missing values, a common strategy is to compute the mean, median, or mode of each column and replace all missing values in that column with that value.<br>

Because imputation is a common task, pandas provides the [`pandas.DataFrame.fillna()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html) method for it. If we pass in a single value, all of the missing values (`NaN`) in the data frame are replaced by that value. If we pass in a Series indexed by column name, such as the per-column means, each column's missing values are replaced by that column's entry.
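
To make the mean, median, and mode options concrete, here is a minimal sketch on a toy data frame with made-up values. Each statistic is computed per column and handed to `fillna()` as a Series. For the mode, the first row of `DataFrame.mode()` is used, which picks the smallest value when several values tie.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; the values are made up.
homes = pd.DataFrame({
    'Lot Frontage': [80.0, np.nan, 74.0, 92.0],
    'Mas Vnr Area': [0.0, 108.0, np.nan, 0.0],
})

# Each call fills a column's gaps with that column's own statistic.
mean_filled = homes.fillna(homes.mean())
median_filled = homes.fillna(homes.median())
mode_filled = homes.fillna(homes.mode().iloc[0])

print(mean_filled.loc[1, 'Lot Frontage'])    # mean of 80, 74, 92
print(median_filled.loc[2, 'Mas Vnr Area'])  # median of 0, 108, 0
print(mode_filled.loc[2, 'Mas Vnr Area'])    # most frequent value
```

The mean is sensitive to outliers, so the median can be a safer default for skewed columns such as areas.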



In [54]:
float_cols = df_missing_values.select_dtypes(include=['float'])
# Assign the result instead of using inplace=True, which would modify a slice
# of another data frame and trigger a SettingWithCopyWarning.
float_cols = float_cols.fillna(float_cols.mean())

float_cols.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 9 columns):
Lot Frontage      1460 non-null float64
Mas Vnr Area      1460 non-null float64
BsmtFin SF 1      1460 non-null float64
BsmtFin SF 2      1460 non-null float64
Bsmt Unf SF       1460 non-null float64
Total Bsmt SF     1460 non-null float64
Bsmt Full Bath    1460 non-null float64
Bsmt Half Bath    1460 non-null float64
Garage Yr Blt     1460 non-null float64
dtypes: float64(9)
memory usage: 102.7 KB
