To understand how linear regression works, we've stuck to using features from the training dataset that contained no missing values and were already in a convenient numeric representation.

In this project, we'll explore how to transform some of the remaining features so we can use them in our model. Broadly, the process of transforming existing features and creating new ones is known as **feature engineering**.
* Feature engineering is a bit of an art, and knowledge of the specific domain (in this case, real estate) can help us create better features.

In this project, we'll focus on some domain-independent strategies that apply to a wide range of problems.

In the first half of this project, we'll focus only on columns that contain no missing values but still aren't in the proper format to use in a linear regression model. In the latter half of this project, we'll explore some ways to deal with missing values.

Amongst the columns that don't contain missing values, some of the common issues include:

* the column is not numerical (e.g. a zoning code represented using text)
* the column is numerical but not ordinal (e.g. zip code values)
* the column is numerical but its raw values don't reflect its relationship with the target column (e.g. year values)

In [76]:
import pandas as pd
import numpy as np

data = pd.read_csv('AmesHousing.txt', delimiter="\t")
train = data[0:1460]
test = data[1460:]

In [46]:
# Select just the columns from the train data frame that contain no missing values.
train_null_counts = train.isnull().sum()

cols_no_mv = train_null_counts[(train_null_counts == 0)].index
df_no_mv = train[cols_no_mv]
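
The selection above can be checked on a small example. This is a minimal sketch on a made-up data frame (the column names `a`, `b`, and `c` are hypothetical); it also shows that `DataFrame.dropna(axis=1)` gives the same result as filtering on the null counts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column "b" has one missing value.
toy = pd.DataFrame({
    "a": [1, 2],
    "b": [np.nan, 3.0],
    "c": ["x", "y"],
})

# Same approach as the cell above: count nulls, keep columns with zero.
null_counts = toy.isnull().sum()
by_mask = toy[null_counts[null_counts == 0].index]

# Equivalent shortcut: drop every column that contains a missing value.
by_dropna = toy.dropna(axis=1)

print(list(by_mask.columns))
print(by_mask.equals(by_dropna))
```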

Some of the columns in the data frame **df_no_mv** contain string values. If these columns contain only a limited set of unique values, they're known as **categorical features**. As the name suggests, a categorical feature assigns each training example to a specific category.

In [47]:
# Here are some examples from the dataset:

print(train['Utilities'].value_counts())
print()
print(train['Street'].value_counts())
print()
print(train['House Style'].value_counts())

AllPub    1457
NoSewr       2
NoSeWa       1
Name: Utilities, dtype: int64

Pave    1455
Grvl       5
Name: Street, dtype: int64

1Story    743
2Story    440
1.5Fin    160
SLvl       60
SFoyer     35
2.5Unf     11
1.5Unf      8
2.5Fin      3
Name: House Style, dtype: int64


To use these features in our model, we need to transform them into numerical representations. Thankfully, pandas makes this easy because the library has a special [categorical data type](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html). We can convert a column to the categorical data type using the `pandas.Series.astype()` method. Missing values don't raise an error here (pandas leaves them uncategorized and gives them the code `-1`), but for now we'll only convert columns that contain no missing values:

`train['Utilities'] = train['Utilities'].astype('category')`

When a column is converted to the categorical data type, pandas assigns a code to each unique value in the column. These codes stay hidden unless we ask for them directly, so most of the pandas operations that work for string columns will work for categorical ones as well.

We need to use the `.cat` accessor followed by the `.codes` property to actually access the underlying numerical representation of a column:

`train['Utilities'].cat.codes`
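
To make the conversion concrete, here is a minimal sketch on a toy series that stands in for the `Utilities` column (the five values are made up for illustration). It shows the labels pandas finds and the integer code stored for each row.

```python
import pandas as pd

# Toy stand-in for the Utilities column (values chosen for illustration).
utilities = pd.Series(["AllPub", "NoSewr", "AllPub", "NoSeWa", "AllPub"])

as_category = utilities.astype("category")

# The distinct labels pandas found, in sorted order.
print(list(as_category.cat.categories))

# The integer code stored for each row; a code is the position of the
# row's label in `cat.categories`.
print(as_category.cat.codes.tolist())
```

Each code is just the position of a label in the sorted list of categories, which is why `AllPub` ends up as `0`.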

In [48]:
text_cols = df_no_mv.select_dtypes(include = ["object"]).columns
text_cols

Index(['MS Zoning', 'Street', 'Lot Shape', 'Land Contour', 'Utilities',
       'Lot Config', 'Land Slope', 'Neighborhood', 'Condition 1',
       'Condition 2', 'Bldg Type', 'House Style', 'Roof Style', 'Roof Matl',
       'Exterior 1st', 'Exterior 2nd', 'Exter Qual', 'Exter Cond',
       'Foundation', 'Heating', 'Heating QC', 'Central Air', 'Electrical',
       'Kitchen Qual', 'Functional', 'Paved Drive', 'Sale Type',
       'Sale Condition'],
      dtype='object')

In [49]:
for col in text_cols:
    train[col] = train[col].astype("category")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [50]:
train["Utilities"].value_counts()

AllPub    1457
NoSewr       2
NoSeWa       1
Name: Utilities, dtype: int64

In [51]:
train["Utilities"].cat.codes.value_counts()

0    1457
2       2
1       1
dtype: int64

When we convert a column to the categorical data type, pandas assigns a number from `0` to `n-1` (where `n` is the number of unique values in the column) to each value. The drawback of using these codes as a feature is that they violate one of the assumptions of linear regression.
* Linear regression operates under the assumption that the features are linearly correlated with the target column. For a categorical feature, however, the codes that pandas assigned carry no actual numerical meaning.
* An increase in the Utilities code from `1` to `2` doesn't correspond to any change in the target column. The codes only keep the categories unique and mutually exclusive (the category associated with `0` is different from the one associated with `1`).
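
A small sketch can show why the codes carry no numerical meaning. It assumes a toy quality column that uses the labels `Ex`, `Gd`, `TA`, and `Fa` (read here as excellent, good, typical, and fair; that reading is an assumption for illustration). Pandas orders the categories alphabetically, so the codes don't follow the quality ranking.

```python
import pandas as pd

# Toy quality ratings; the intended ranking is Fa < TA < Gd < Ex.
quality = pd.Series(["TA", "Gd", "Ex", "Fa", "TA"]).astype("category")

# Map each label to the code pandas assigned to it.
labels = quality.cat.categories
code_for_label = dict(zip(labels, range(len(labels))))

print(code_for_label)
```

The best rating (`Ex`) gets the lowest code and the middle rating (`TA`) gets the highest, so a linear model fit on these codes would be fitting an arbitrary ordering.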

The common solution is to use a technique called [dummy coding](https://en.wikipedia.org/wiki/Dummy_variable_%28statistics%29). Instead of having a single column with `n` integer codes, we have `n` binary columns. Here's what that would look like for the `Utilities` column:

![image.png](attachment:image.png)

Because the original value in each of the first 4 rows was `AllPub`, those rows contain `1` (true) in the `Utilities_AllPub` column and `0` in the other 2 columns.

Thankfully, pandas has a convenience function, `pandas.get_dummies()`, that applies this transformation to a text column.
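
Here is a minimal sketch of dummy coding on a toy `Utilities` column (four made-up rows). The `prefix` argument produces column names like `Utilities_AllPub`, and `dtype=int` is passed so the result holds `0`/`1` values, since some pandas versions return booleans by default.

```python
import pandas as pd

# Toy data frame with a single text column (values made up for illustration).
toy = pd.DataFrame({"Utilities": ["AllPub", "AllPub", "NoSewr", "NoSeWa"]})

# One binary column per unique value, named "<prefix>_<value>".
dummies = pd.get_dummies(toy["Utilities"], prefix="Utilities", dtype=int)

print(dummies)
```

Every row has exactly one `1` across the dummy columns, which is what makes the categories mutually exclusive.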

In [52]:
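# Convert all of the columns in text_cols from the train data frame into dummy columns.
# Prefix each dummy column with its source column name so that columns sharing
# the same labels (e.g. 'Exterior 1st' and 'Exterior 2nd') don't collide.

for col in text_cols:
    dummy = pd.get_dummies(train[col], prefix=col)
    train = pd.concat([train, dummy], axis=1)
    del train[col]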

In the last few cells, we focused on categorical values that were represented as text columns. Some of the numerical columns in the data set are also categorical and only have a limited set of unique values. We won't explicitly explore those columns in this project, but the feature transformation process is the same if the numbers used in those categories have no numerical meaning.
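
As a sketch of that same process on a numerical column, the example below uses a hypothetical `building_class` column whose numbers are labels rather than quantities. The column name and its values are assumptions for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical numeric codes that identify a class, not a quantity.
toy = pd.DataFrame({"building_class": [20, 60, 20, 120]})

# Passing the Series directly dummy-codes it even though it is numeric.
class_dummies = pd.get_dummies(
    toy["building_class"], prefix="building_class", dtype=int
)

print(class_dummies)
```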

Let's now look at numerical features that aren't categorical, but whose numerical representation needs to be improved. We'll focus on the `Year Remod/Add` and `Year Built` columns:

In [53]:
print(train[['Year Remod/Add', 'Year Built']].head())

   Year Remod/Add  Year Built
0            1960        1960
1            1961        1961
2            1958        1958
3            1968        1968
4            1998        1997


The two main issues with these features are:

* Year values aren't representative of how old a house is
* The `Year Remod/Add` column doesn't actually provide useful information for a linear regression model

The challenge with year values like 1960 and 1961 is that they don't do a good job of capturing how old a house is. For example, a house built in 1960 and sold in 1980 was 20 years old at the time of sale, half the age of one built in 1960 and sold in 2000, yet both have the same `Year Built` value. Instead of the years in which certain events happened, we want the difference between those years. We'll create a new column that's the difference between these two columns.

For this particular piece of information (years until remodeled), this is a sensible approach. Domain knowledge can help us understand how to best transform features to represent information well for a linear model. If we're ever confused about a feature or how it should be represented, reading scientific papers or posts by researchers in the specific domain can help.
* Many winners of [Kaggle data science competitions](https://www.import.io/post/how-to-win-a-kaggle-competition/), for example, claim that their focus on data preparation and feature engineering combined with common machine learning models helped them win.

In [54]:
train["years_until_remod"] = train["Year Remod/Add"] - train["Year Built"]
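
The subtraction can be checked on a few made-up rows. This is a small sketch with hypothetical year values; it also looks for negative differences, which would point to a data entry problem (a remodel recorded before the build year).

```python
import pandas as pd

# Made-up year values for illustration.
toy = pd.DataFrame({
    "Year Built": [1960, 1961, 1997],
    "Year Remod/Add": [1960, 1975, 1998],
})

toy["years_until_remod"] = toy["Year Remod/Add"] - toy["Year Built"]

# Rows where the remodel year precedes the build year would be suspicious.
negative_rows = toy[toy["years_until_remod"] < 0]

print(toy["years_until_remod"].tolist())
print(len(negative_rows))
```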

Now, we'll focus on handling columns with missing values. When values are missing in a column, there are two main approaches we can take:

* Remove rows containing missing values for specific columns
  * `Pro`: Rows containing missing values are removed, leaving only clean data for modeling
  * `Con`: Entire observations from the training set are removed, which can reduce overall prediction accuracy

* Impute (or replace) missing values using a descriptive statistic from the column
  * `Pro`: Missing values are replaced with potentially similar estimates, preserving the rest of the observation in the model
  * `Con`: Depending on the approach, we may be adding noisy data for the model to learn
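
The two approaches can be compared side by side. This is a minimal sketch on a made-up data frame (the values are hypothetical): `dropna()` removes the whole observation, while `fillna()` keeps it and substitutes the column mean.

```python
import numpy as np
import pandas as pd

# Made-up rows; the second row is missing its Lot Frontage value.
toy = pd.DataFrame({
    "Lot Frontage": [60.0, np.nan, 80.0, 70.0],
    "Lot Area": [8000, 9600, 11000, 9000],
})

# Approach 1: remove rows with a missing value in the chosen column.
dropped = toy.dropna(subset=["Lot Frontage"])

# Approach 2: impute the missing value with the column mean.
imputed = toy.fillna({"Lot Frontage": toy["Lot Frontage"].mean()})

print(len(dropped), len(imputed))
print(imputed["Lot Frontage"].tolist())
```

Dropping loses the `Lot Area` value of the second row as well, while imputing keeps it at the cost of a guessed `Lot Frontage`.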

Given that we only have 1460 training examples (with ~80 potentially useful features), we don't want to remove any of these rows from the dataset. Let's instead focus on **imputation** techniques.

We'll focus on columns that contain at least 1 missing value but fewer than 584 missing values (584/1460, or 40% of the rows in the training set).
* There's no strict threshold, and many people instead use a 50% cutoff (if half the values in a column are missing, it's automatically dropped). Having some domain knowledge can help with determining an acceptable cutoff value.
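
One way to express a cutoff is as a fraction of rows rather than a raw count. The sketch below uses a made-up data frame and a hypothetical `max_missing_fraction` of 0.5; both are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up columns with 0%, 25%, and 75% of their values missing.
toy = pd.DataFrame({
    "complete": [1, 2, 3, 4],
    "few_missing": [1.0, np.nan, 3.0, 4.0],
    "mostly_missing": [np.nan, np.nan, np.nan, 4.0],
})

# The mean of a boolean mask is the fraction of missing values per column.
missing_fraction = toy.isnull().mean()

max_missing_fraction = 0.5  # hypothetical cutoff
keep = missing_fraction[
    (missing_fraction > 0) & (missing_fraction < max_missing_fraction)
].index

print(list(keep))
```

Working with a fraction keeps the cutoff meaningful if the number of training rows changes.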

In [72]:
# Select only the columns from train that contain more than 0 missing values but less than 584 missing values

cols = train_null_counts[(train_null_counts > 0) & (train_null_counts < 584)].index

df_missing_values = train[cols]
df_missing_values.isnull().sum()

Lot Frontage      249
Mas Vnr Type       11
Mas Vnr Area       11
Bsmt Qual          40
Bsmt Cond          40
Bsmt Exposure      41
BsmtFin Type 1     40
BsmtFin SF 1        1
BsmtFin Type 2     41
BsmtFin SF 2        1
Bsmt Unf SF         1
Total Bsmt SF       1
Bsmt Full Bath      1
Bsmt Half Bath      1
Garage Type        74
Garage Yr Blt      75
Garage Finish      75
Garage Qual        75
Garage Cond        75
dtype: int64

In [73]:
df_missing_values.dtypes

Lot Frontage      float64
Mas Vnr Type       object
Mas Vnr Area      float64
Bsmt Qual          object
Bsmt Cond          object
Bsmt Exposure      object
BsmtFin Type 1     object
BsmtFin SF 1      float64
BsmtFin Type 2     object
BsmtFin SF 2      float64
Bsmt Unf SF       float64
Total Bsmt SF     float64
Bsmt Full Bath    float64
Bsmt Half Bath    float64
Garage Type        object
Garage Yr Blt     float64
Garage Finish      object
Garage Qual        object
Garage Cond        object
dtype: object

It looks like about half of the columns in `df_missing_values` are **string columns** (object data type), while about half are **float64 columns**. 
* For numerical columns with missing values, a common strategy is to compute the mean, median, or mode of each column and replace all missing values in that column with that value.

Because imputation is a common task, pandas provides a `pandas.DataFrame.fillna()` method that we can use for this. If we pass in a value, all of the missing values (`NaN`) in the data frame are replaced by that value.

Select only the float columns:

`missing_floats = df_missing_values.select_dtypes(include=['float'])`

Return a data frame with the missing values replaced with `0`:

`fill_with_zero = missing_floats.fillna(0)`

We can also pass in the result of a column-wise summary method, such as the mean of each column. Each column's missing values are then replaced with that column's own mean:

`fill_with_mean = missing_floats.fillna(missing_floats.mean())`
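
The choice of statistic matters when a column has outliers. This is a small sketch on made-up values (the numbers are hypothetical) that fills the same gaps with the column mean and with the column median.

```python
import numpy as np
import pandas as pd

# Made-up float columns; 250.0 is a deliberately large outlier.
toy = pd.DataFrame({
    "Lot Frontage": [50.0, 60.0, np.nan, 250.0],
    "Mas Vnr Area": [0.0, np.nan, 100.0, 200.0],
})

# toy.mean() and toy.median() return one value per column, and fillna()
# matches them to columns by name.
filled_mean = toy.fillna(toy.mean())
filled_median = toy.fillna(toy.median())

print(filled_mean.loc[2, "Lot Frontage"])
print(filled_median.loc[2, "Lot Frontage"])
```

The outlier pulls the mean well above the median here, so the two strategies impute noticeably different values for the same row.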

In [78]:
# Impute the missing values in the float columns with each column's mean.

float_cols = df_missing_values.select_dtypes(include = ["float"])
float_cols = float_cols.fillna(float_cols.mean())

In [79]:
float_cols.isnull().sum()

Lot Frontage      0
Mas Vnr Area      0
BsmtFin SF 1      0
BsmtFin SF 2      0
Bsmt Unf SF       0
Total Bsmt SF     0
Bsmt Full Bath    0
Bsmt Half Bath    0
Garage Yr Blt     0
dtype: int64

In this project, we explored a few different techniques for transforming features into appropriate representations for a linear regression model.