# Creating Features
  
Every day you read about amazing breakthroughs in how the newest applications of machine learning are changing the world. Often this reporting glosses over the fact that a huge amount of data munging and feature engineering must be done before any of these fancy models can be used. In this course, you will learn how to do just that. You will work with the Stack Overflow Developer Survey and historic US presidential inauguration addresses to understand how best to preprocess and engineer features from categorical, continuous, and unstructured data. This course will give you hands-on experience in preparing data for your own machine learning models.
  
In this chapter, you will explore what feature engineering is and how to get started with applying it to real-world data. You will load, explore and visualize a survey response dataset, and in doing so you will learn about its underlying data types and why they have an influence on how you should engineer your features. Using the pandas package you will create new features from both categorical and continuous columns.

In [20]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Why generate features?
  
  
**Feature Engineering**
  
Feature engineering is the act of taking raw data and extracting features from it that are suitable for tasks like machine learning. Most machine learning algorithms work with tabular data. When we talk about features, we are referring to the information stored in the columns of these tables. For example, if we were looking at information on houses, the features would be things like square footage, number of rooms, etc. This course is designed for data scientists who want to expand their knowledge of how to incorporate feature engineering into their data science workflow.
  
<img src='../_images/what-are-features-what-do-they-look-like.png' text='alt text' width='500'>
  
**Different types of data**
  
Most machine learning algorithms require their input data to be represented as a vector or a matrix, and many assume that the data is distributed normally. In the real world, more often than not you will receive data that is not in this format. You will also need to work with many different types of data. Some that you will often encounter are continuous variables, categorical data, ordinal data, boolean values, and dates and times. Dealing with these is manageable but requires a well-thought-out approach. Feature engineering is often overlooked in machine learning discussions, but many real-world practitioners will confirm that data manipulation and feature engineering are among the most important parts of a project.
  
<img src='../_images/common-different-types-of-data-in-ml.png' text='alt text' width='500'>
  
**Course structure**
  
Over the span of this course, we will be addressing how to deal with many different types of data and how to convert them into a format that can be easily used for machine learning. In the first chapter, you will ingest and create basic features from tabular data. In the second chapter, you will learn how to deal with data that has missing values. You will then move on to transforming your data so that it conforms to statistical assumptions often necessary for machine learning models, and finally, you will convert free form text into tabular data so it can be used with machine learning models.
  
**Pandas**
  
Now let's jump straight in with some examples. During this course we will use the pandas package extensively, as it is very useful when working with data in tabular form. It is common practice to import pandas using the `pd` alias. You can use the `pd.read_csv()` function to import a CSV file and use the `.head()` method to quickly look at the first few rows of the DataFrame.
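
As a minimal sketch of these two calls, the example below reads a tiny, made-up CSV from an in-memory string instead of a real file. The column names and values are placeholders chosen for illustration, not the course dataset.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for a real file path or URL
csv_text = """Country,Age,ConvertedSalary
France,31,52000.0
India,27,
USA,45,98000.0
"""

# pd.read_csv accepts a path, a URL, or a file-like object
toy_df = pd.read_csv(io.StringIO(csv_text))

# Look at the first two rows; the empty salary field is read as NaN
print(toy_df.head(2))
```

Calling `.head()` with no argument returns the first five rows, which is why it is a common first step after loading data.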
  
**Dataset**
  
For the first three chapters of this course, you will be working with a modified subset of the Stack Overflow survey response data. This dataset records the details and preferences of hundreds of users of the Stack Overflow website.
  
**Column names**
  
To see the features used in this subset, you can use the DataFrame columns attribute to print the names of all the columns in the DataFrame.
  
**Column types**
  
To print the data type of each column, you can use the `df.dtypes` attribute. Here you can see three different data types: integers, floats, and objects. In pandas, object columns are typically columns that contain strings.
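
A short sketch of both attributes on a small, made-up DataFrame (the columns below are illustrative assumptions, not the survey data):

```python
import pandas as pd

# Hypothetical toy data with one text, one integer, and one float column
toy_df = pd.DataFrame({
    "Country": ["France", "India", "USA"],
    "Age": [31, 27, 45],
    "ConvertedSalary": [52000.0, None, 98000.0],
})

# Names of all columns in the DataFrame
print(toy_df.columns)

# Data type of each column
print(toy_df.dtypes)
```

Note that the missing salary forces `ConvertedSalary` to be stored as a float column, since `NaN` is a floating-point value.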
  
**Selecting specific data types**
  
Knowing the types of each column can be very useful if you are performing analysis based on a subset of specific data types. To do this, you can use the `.select_dtypes()` method and pass a list of relevant data types to the `include` argument. For example, if you want to select only the integer columns, call the `.select_dtypes()` method on `df` and set `include=['int']`.
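
The sketch below applies `.select_dtypes()` to a made-up DataFrame. It uses the `'number'` alias, which covers both integer and float columns, together with the complementary `exclude` argument; the toy columns are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data with mixed column types
toy_df = pd.DataFrame({
    "Country": ["France", "India", "USA"],
    "Age": [31, 27, 45],
    "ConvertedSalary": [52000.0, None, 98000.0],
})

# Keep only the numeric (integer and float) columns
numeric_df = toy_df.select_dtypes(include="number")

# Keep everything that is not numeric
non_numeric_df = toy_df.select_dtypes(exclude="number")

print(numeric_df.columns.tolist())
print(non_numeric_df.columns.tolist())
```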

### Getting to know your data
  
Pandas is one of the most popular packages used to work with tabular data in Python. It is generally imported using the alias `pd` and can be used to load a CSV (or other delimited file) using `pd.read_csv()`.
  
You will be working with a modified subset of the [Stack Overflow survey response data](https://insights.stackoverflow.com/survey/2018/#overview) in the first three chapters of this course. This dataset records the details and preferences of hundreds of users of the Stack Overflow website.
  
1. Import the pandas library as pd.
2. `so_survey_csv` contains the URL to a CSV file. Import it using `pd.read_csv()` into `so_survey_df`.
3. Print the first five rows of `so_survey_df`.
4. Print the data type of each column in `so_survey_df`.
5. Question  
What type of data is the `ConvertedSalary` column?  
*Possible answers*
  
- [ ] Datetime
- [x] Numeric
- [ ] String
- [ ] Boolean
  
Correct! `ConvertedSalary` contains floats which are numeric.

In [21]:
# Importing the data
so_survey_df = pd.read_csv('../_datasets/Combined_DS_v10.csv')

# Print the first five rows of the DataFrame
so_survey_df.head()

Unnamed: 0,SurveyDate,FormalEducation,ConvertedSalary,Hobby,Country,StackOverflowJobsRecommend,VersionControl,Age,Years Experience,Gender,RawSalary
0,2/28/18 20:20,Bachelor's degree (BA. BS. B.Eng.. etc.),,Yes,South Africa,,Git,21,13,Male,
1,6/28/18 13:26,Bachelor's degree (BA. BS. B.Eng.. etc.),70841.0,Yes,Sweeden,7.0,Git;Subversion,38,9,Male,70841.00
2,6/6/18 3:37,Bachelor's degree (BA. BS. B.Eng.. etc.),,No,Sweeden,8.0,Git,45,11,,
3,5/9/18 1:06,Some college/university study without earning ...,21426.0,Yes,Sweeden,,Zip file back-ups,46,12,Male,21426.00
4,4/12/18 22:41,Bachelor's degree (BA. BS. B.Eng.. etc.),41671.0,Yes,UK,8.0,Git,39,7,Male,"£41,671.00"


In [22]:
# Print the data type of each column
print(so_survey_df.dtypes)

SurveyDate                     object
FormalEducation                object
ConvertedSalary               float64
Hobby                          object
Country                        object
StackOverflowJobsRecommend    float64
VersionControl                 object
Age                             int64
Years Experience                int64
Gender                         object
RawSalary                      object
dtype: object


### Selecting specific data types
  
Often a dataset will contain columns with several different data types (like the one you are working with). The majority of machine learning models require you to have a consistent data type across features. Similarly, most feature engineering techniques are applicable to only one type of data at a time. For these reasons among others, you will often want to be able to access just the columns of certain types when working with a DataFrame.
  
The DataFrame (`so_survey_df`) from the previous exercise is available in your workspace.
  
1. Create a subset of `so_survey_df` consisting of only the numeric (int and float) columns, and assign it to `so_numeric_df`.
2. Print the column names contained in `so_numeric_df`.

In [23]:
# Create subset of only the numeric columns
so_numeric_df = so_survey_df.select_dtypes(include=['int', 'float'])

# Print the column names contained in so_numeric_df
print(so_numeric_df.columns)

Index(['ConvertedSalary', 'StackOverflowJobsRecommend', 'Age',
       'Years Experience'],
      dtype='object')


In the next lesson, you will learn the most common ways of dealing with categorical data.

## Dealing with Categorical Variables
  
Categorical variables are used to represent groups that are qualitative in nature. Some examples are colors, such as blue, red, black etc. or country of birth, such as Ireland, England or USA. While these can easily be understood by a human, you will need to encode categorical features as numeric values to use them in your machine learning models.
  
**Encoding categorical features**
  
As an example, here is a table which consists of the country of residence of different respondents in the Stackoverflow survey. To get from qualitative inputs to quantitative features, one may naively think that assigning every category in a column a number would suffice, for example India could be 1, USA 2 etc. But these categories are unordered, so assigning this order may greatly penalize the effectiveness of your model. Thus, you cannot allocate arbitrary numbers to each category as that would imply some form of ordering in the categories.
  
<img src='../_images/encoding-categorical-features-example.png' text='alt text' width='500'>
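
To make the problem with arbitrary numbering concrete, here is a small sketch that assigns integer codes to a made-up country column using pandas' categorical codes. The data is hypothetical; the point is only that the resulting numbers carry an order that the categories themselves do not have.

```python
import pandas as pd

# Hypothetical country values
countries = pd.Series(["India", "USA", "France", "India"])

# Naive encoding: one integer per category (assigned alphabetically here)
codes = countries.astype("category").cat.codes
print(codes.tolist())

# Arithmetic on the codes is meaningless: the "average country" is not a country
print(codes.mean())
```

A model fed these codes could treat USA (2) as "greater than" France (0), which is exactly the spurious ordering described above.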
  
**Encoding categorical features**
  
Instead, values can be encoded by creating additional binary features corresponding to whether each value was picked or not as shown in the table on the right. In doing so your model can leverage the information of what country is given, without inferring any order between the different options.
  
<img src='../_images/encoding-categorical-features-example1.png' text='alt text' width='500'>
  
**Encoding categorical features**
  
There are two main approaches when representing categorical columns in this way:
  
- one-hot-encoding
- dummy encoding
  
These are very similar and often confused. In fact, by default, pandas performs one-hot encoding when you use the `pd.get_dummies()` function.
  
**One-hot encoding**
  
One-hot encoding converts n categories into n features, as shown here. You can use the `pd.get_dummies()` function to one-hot encode columns. The function takes a DataFrame and a list of categorical columns you want converted into one-hot encoded columns, and returns an updated DataFrame with these columns included. Specifying a prefix with the `prefix` argument can improve readability; here, the letter C has been used for country.
  
<img src='../_images/encoding-categorical-features-example2.png' text='alt text' width='500'>
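
A minimal sketch of one-hot encoding on a made-up DataFrame follows. The data is hypothetical, and `dtype=int` is an addition here so that the new columns print as 0/1 (recent pandas versions otherwise return booleans).

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    "Country": ["France", "India", "UK", "UK", "USA"],
    "Age": [31, 27, 45, 38, 52],
})

# One-hot encode Country: one new column per category, prefixed with "C"
one_hot = pd.get_dummies(toy_df, columns=["Country"], prefix="C", dtype=int)

print(one_hot.columns.tolist())
print(one_hot)
```

Each row has exactly one 1 across the `C_` columns, and the original `Country` column is replaced by them.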
  
**Dummy encoding**
  
On the other hand, dummy encoding creates $N-1$ features for $N$ categories, omitting the first category. Notice that this time there is no feature for France, the first category. In dummy encoding, the base value, France in this case, is encoded by the absence of all other countries, as you can see in the last row here, and its value is represented by the model's intercept. For dummy encoding, you can use the same `pd.get_dummies()` function with the additional argument `drop_first=True`, as shown here.
  
<img src='../_images/encoding-categorical-features-example3.png' text='alt text' width='500'>
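
The same idea can be sketched on a made-up DataFrame. The data is hypothetical, and `dtype=int` is again an addition so the output is shown as 0/1.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    "Country": ["France", "India", "UK", "UK", "USA"],
    "Age": [31, 27, 45, 38, 52],
})

# Dummy encode Country: the first category (France) gets no column
dummy = pd.get_dummies(
    toy_df, columns=["Country"], prefix="C", drop_first=True, dtype=int
)

print(dummy.columns.tolist())

# The France row (index 0) is encoded as all zeros in the remaining columns
print(dummy.loc[0, ["C_India", "C_UK", "C_USA"]].tolist())
```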
  
**One-hot vs. dummies**
  
Both these methods have different advantages. One-hot-encoding generally creates much more explainable features, as each country will have its own weight that can be observed after training. But one must be aware that one hot encoding may create features that are entirely collinear due to the same information being represented multiple times.
  
- One-hot-encoding: Generally creates much more explainable features
- Dummy-encoding: Necessary information without duplication
  
**One-hot vs. dummies**
  
Take for example a simpler categorical column recording the sex of the survey takers. By recording a 1 for male the information of whether the person is female is already known when the male column is 0. This double representation can lead to instability in your models and dummy values would be more appropriate.
  
<img src='../_images/encoding-categorical-features-example4.png' text='alt text' width='380'>
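
One way to see this duplication numerically is to compare the rank of the feature matrix (with an intercept column added) under both encodings. This is an illustrative sketch on made-up data, assuming a model with an intercept such as linear regression.

```python
import numpy as np
import pandas as pd

# Hypothetical responses
sex = pd.Series(["Male", "Female", "Female", "Male", "Female"], name="Sex")

one_hot = pd.get_dummies(sex, dtype=int)                  # Female, Male
dummy = pd.get_dummies(sex, drop_first=True, dtype=int)   # Male only

# Add an intercept column of ones, as a linear model would
X_one_hot = np.column_stack([np.ones(len(sex)), one_hot.to_numpy()])
X_dummy = np.column_stack([np.ones(len(sex)), dummy.to_numpy()])

# One-hot: 3 columns but rank 2, because Female + Male equals the intercept
print(X_one_hot.shape[1], np.linalg.matrix_rank(X_one_hot))

# Dummy: 2 columns and rank 2, so no redundant column
print(X_dummy.shape[1], np.linalg.matrix_rank(X_dummy))
```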
  
**Limiting your columns**
  
However, both one-hot encoding and dummy encoding may result in a huge number of columns being created if there are too many different categories in a column. In these cases, you may want to only create columns for the most common values. You can check the number of occurrences of each value in a column using the `.value_counts()` method on that column.
  
**Limiting your columns**
  
Once you have the counts of occurrences, you can use them to limit which values you include by first creating a mask of the values that occur fewer than $N$ times. A mask is a series of booleans indicating which values in a column should be affected. First, select the categories that occur fewer than $N$ times by filtering the counts and taking the `index` attribute of the result, then pass these categories to the `.isin()` method. After you create the mask, you can use it to replace the categories that occur fewer than $N$ times with a value of your choice, as shown here.
  
<img src='../_images/encoding-categorical-features-example5.png' text='alt text' width='500'>
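
These steps can be sketched on a small, made-up series. The values and the cutoff of 2 are assumptions chosen so the result is easy to check by eye.

```python
import pandas as pd

# Hypothetical country column with two rare categories
countries = pd.Series(
    ["USA"] * 4 + ["India"] * 3 + ["Ireland", "Ukraine"], name="Country"
)

# Count how often each category occurs
country_counts = countries.value_counts()

# Categories that occur fewer than 2 times
rare_categories = country_counts[country_counts < 2].index

# Boolean mask: True for rows whose category is rare
mask = countries.isin(rare_categories)

# Replace the rare categories with a catch-all label
countries.loc[mask] = "Other"

print(countries.value_counts())
```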

### One-hot encoding and dummy variables
  
To use categorical variables in a machine learning model, you first need to represent them in a quantitative way. The two most common approaches are to one-hot encode the variables or to use dummy variables. In this exercise, you will create both types of encoding and compare the created column sets. We will continue using the same DataFrame from the previous lesson, loaded as `so_survey_df`, focusing on its `Country` column.
  
1. One-hot encode the Country column, adding "OH" as a prefix for each column.
2. Create dummy variables for the Country column, adding "DM" as a prefix for each column.

In [24]:
# Convert the Country column to a one hot encoded DataFrame
one_hot_encoded = pd.get_dummies(so_survey_df, columns=['Country'], prefix='OH')

# Print the column names
print(one_hot_encoded.columns)

Index(['SurveyDate', 'FormalEducation', 'ConvertedSalary', 'Hobby',
       'StackOverflowJobsRecommend', 'VersionControl', 'Age',
       'Years Experience', 'Gender', 'RawSalary', 'OH_France', 'OH_India',
       'OH_Ireland', 'OH_Russia', 'OH_South Africa', 'OH_Spain', 'OH_Sweeden',
       'OH_UK', 'OH_USA', 'OH_Ukraine'],
      dtype='object')


In [25]:
# Create dummy variables for the Country column
dummy = pd.get_dummies(so_survey_df, columns=['Country'], drop_first=True, prefix='DM')

# Print the column names
print(dummy.columns)

Index(['SurveyDate', 'FormalEducation', 'ConvertedSalary', 'Hobby',
       'StackOverflowJobsRecommend', 'VersionControl', 'Age',
       'Years Experience', 'Gender', 'RawSalary', 'DM_India', 'DM_Ireland',
       'DM_Russia', 'DM_South Africa', 'DM_Spain', 'DM_Sweeden', 'DM_UK',
       'DM_USA', 'DM_Ukraine'],
      dtype='object')


Did you notice that the column for France was missing when you created dummy variables? Now you can choose to use one-hot encoding or dummy variables where appropriate.

### Dealing with uncommon categories
  
Some features can have many different categories but a very uneven distribution of their occurrences. Take, for example, data scientists' favorite languages to code in: some common choices are Python, R, and Julia, but there can be individuals with bespoke choices, like FORTRAN or C. In these cases, you may not want to create a feature for each value, but only for the more common occurrences.
  
1. Extract the Country column of `so_survey_df` as a series and assign it to `countries`.
2. Find the counts of each category in the newly created `countries` series.
3. Create a mask for values occurring less than 10 times in `country_counts`.
4. Print the first 5 rows of the mask.
5. Label values occurring less than the mask cutoff as 'Other'.
6. Print the new category counts in `countries`.

In [26]:
# Create a series out of the Country column
countries = so_survey_df.Country.copy()

# Get the counts of each category
country_counts = countries.value_counts()

# Print the count values for each category
print(country_counts)

South Africa    166
USA             164
Spain           134
Sweeden         119
France          115
Russia           97
UK               95
India            95
Ukraine           9
Ireland           5
Name: Country, dtype: int64


In [27]:
# Create a mask for only categories that occur less than 10 times
mask = countries.isin(country_counts[country_counts < 10].index)

# Print the top 5 rows in the mask series
print(mask.head())

0    False
1    False
2    False
3    False
4    False
Name: Country, dtype: bool


In [28]:
# Label all other categories as Other
countries.loc[mask] = 'Other'

# Print the updated category counts
print(countries.value_counts())

South Africa    166
USA             164
Spain           134
Sweeden         119
France          115
Russia           97
UK               95
India            95
Other            14
Name: Country, dtype: int64


NOTE: Calling `.copy()` when assigning the `countries` series creates an explicit copy of the column, which avoids pandas' `SettingWithCopyWarning`. Using `countries.loc[mask]` instead of `countries[mask]` then updates that copy in place, so the `Country` column in `so_survey_df` itself is left unchanged.

Good work, now you can work with large datasets while grouping low frequency categories.

## Numeric variables
  
As mentioned in the previous lesson, most machine learning models will require your data to be in numeric format. However, even if your raw data is all numeric, there is still a lot you can do to improve your features.
  
**Types of numeric features**
  
Numeric features can be used to represent a huge array of different characteristics and measurements. Pretty much anything that can be quantitatively measured can be recorded as numeric data. For example, age, the price of an item, counts, and even spatial data such as coordinates. Depending on the use case, numeric features can be treated in several different ways. We will work through a few of the considerations and possible feature engineering steps to keep in mind when dealing with numeric data.
  
**Does size matter?**
  
One of the first questions you should ask when working with numeric features is whether the magnitude of the feature is its most important trait, or just its direction. For example, if you had a dataset of restaurant health and safety ratings containing the number of times a restaurant had major violations, you might care far more about whether the restaurant had any major violations at all (as you would rather not take any chances), over whether it was a repeat offender. Looking at this toy dataset containing restaurant IDs and the number of times they had major violations, we can see that some restaurants have no major violations but many have one or more. We will be creating a new binary column representing whether or not a restaurant committed any violation.
  
<img src='../_images/does-size-matter-restaurant-violations.png' text='alt text' width='500'>
  
**Binarizing numeric variables**
  
Here we first create a new column `Binary_Violation` and set it to zero. Then, we use the `.loc[]` notation to find all rows where `Number_of_Violations` is greater than zero and set the `Binary_Violation` column to 1.
  
`df['Binary_Violation'] = 0`  
`df.loc[df['Number_of_Violations'] > 0, 'Binary_Violation'] = 1`  
  
**Binarizing numeric variables**
  
As you can see here, all rows where `Number_of_Violations` is equal to 0 are also 0 in `Binary_Violation`, while all rows where `Number_of_Violations` is greater than zero are 1 in `Binary_Violation`.
  
<img src='../_images/does-size-matter-restaurant-violations1.png' text='alt text' width='500'>
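
The two statements above can be tried on a made-up violations table. The restaurant IDs and counts below are hypothetical, and the one-line alternative at the end is an equivalent formulation rather than the approach used in the course.

```python
import pandas as pd

# Hypothetical restaurant data
df = pd.DataFrame({
    "Restaurant_ID": [101, 102, 103, 104, 105],
    "Number_of_Violations": [0, 2, 0, 1, 4],
})

# Start with zeros, then flag every row with at least one violation
df["Binary_Violation"] = 0
df.loc[df["Number_of_Violations"] > 0, "Binary_Violation"] = 1

# Equivalent one-liner: convert the boolean comparison to 0/1
alternative = (df["Number_of_Violations"] > 0).astype(int)

print(df)
```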
  
**Binning numeric variables**
  
An extension of this is to group a numeric variable into more than two bins. This is often useful for variables such as age or wage brackets, where exact numbers are less relevant than the general magnitude of the value. Consider the same dataset of restaurant health and safety ratings containing the number of times a restaurant has had major violations. This time we will be creating three groups:  
- Group 1 for restaurants with no offenses  
- Group 2 for restaurants with one or two offenses  
- Group 3 for all restaurants with three or more offenses  
  
Bins are created by using the `pd.cut()` function. You can define the intervals using the bins argument as shown here, which in this case is a list of 4 values. You can also pass a list of labels like so.
  
<img src='../_images/binning-numeric-variables-encoding.png' text='alt text' width='500'>
  
**Binning numeric variables**
  
Note that because `pd.cut()` intervals exclude their left edge by default, including 0 in the first bin requires setting the leftmost edge below 0. With the edges used here, all values between $-\infty$ and 0 are labeled 1, values of 1 or 2 are labeled 2, and values greater than 2 are labeled 3.
  
<img src='../_images/binning-numeric-variables-encoding1.png' text='alt text' width='500'>
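
A runnable sketch of this three-group binning on made-up violation counts is shown below. The column name `Binned_Group` and the data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical violation counts
df = pd.DataFrame({"Number_of_Violations": [0, 1, 2, 3, 7]})

# Four edges define three right-inclusive bins: (-inf, 0], (0, 2], (2, inf)
df["Binned_Group"] = pd.cut(
    df["Number_of_Violations"],
    bins=[-np.inf, 0, 2, np.inf],
    labels=[1, 2, 3],
)

print(df)
```

Note that the number of labels must be one fewer than the number of bin edges.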

### Binarizing columns
  
While numeric values can often be used without any feature engineering, there will be cases when some form of manipulation can be useful. For example on some occasions, you might not care about the magnitude of a value but only care about its direction, or if it exists at all. In these situations, you will want to binarize a column. In the `so_survey_df` data, you have a large number of survey respondents that are working voluntarily (without pay). You will create a new column titled `Paid_Job` indicating whether each person is paid (their salary is greater than zero).
  
1. Create a new column called `Paid_Job` filled with zeros.
2. Replace all the `Paid_Job` values with a 1 where the corresponding `ConvertedSalary` is greater than 0.

In [29]:
# Create the Paid_Job column filled with zeros
so_survey_df['Paid_Job'] = 0

# Set Paid_Job to 1 in every row where ConvertedSalary > 0, using .loc[row_condition, column]
so_survey_df.loc[so_survey_df['ConvertedSalary'] > 0, 'Paid_Job'] = 1

# Print the first five rows of the columns
so_survey_df[['Paid_Job', 'ConvertedSalary']].head()

Unnamed: 0,Paid_Job,ConvertedSalary
0,0,
1,1,70841.0
2,0,
3,1,21426.0
4,1,41671.0


Good work, binarizing columns can also be useful for your target variables.

### Binning values
  
For many continuous values you will care less about the exact value of a numeric column, but instead care about the bucket it falls into. This can be useful when plotting values, or simplifying your machine learning models. It is mostly used on continuous variables where accuracy is not the biggest concern e.g. age, height, wages.
  
Bins are created using `pd.cut(df['column_name'], bins)` where bins can be an integer specifying the number of evenly spaced bins, or a list of bin boundaries.
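
As a small sketch of the integer form, the example below splits made-up salary values into four equal-width bins. Passing `labels=False` is an optional extra that returns the bin index instead of the interval, which makes the assignment easy to inspect.

```python
import pandas as pd

# Hypothetical salaries
salaries = pd.Series([0, 20000, 45000, 80000, 100000])

# Four equal-width bins spanning the observed range of the data
equal_width = pd.cut(salaries, bins=4)

# Same bins, but returning the integer position of each bin
bin_index = pd.cut(salaries, bins=4, labels=False)

print(equal_width)
print(bin_index.tolist())
```

Because the bins have equal width rather than equal counts, skewed data can leave some bins empty, as the third bin is here.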
  
1. Bin the value of the `ConvertedSalary` column in `so_survey_df` into 5 equal bins, in a new column called `equal_binned`.
2. Bin the `ConvertedSalary` column using the boundaries in the list `bins` and label the bins using `labels=`, in a new column called `boundary_binned`.

In [30]:
# Bin the continuous variable ConvertedSalary into 5 bins
so_survey_df['equal_binned'] = pd.cut(so_survey_df['ConvertedSalary'], bins=5)

# Print the first 5 rows of the equal_binned column
so_survey_df[['equal_binned', 'ConvertedSalary']].head()

Unnamed: 0,equal_binned,ConvertedSalary
0,,
1,"(-2000.0, 400000.0]",70841.0
2,,
3,"(-2000.0, 400000.0]",21426.0
4,"(-2000.0, 400000.0]",41671.0


In [31]:
# Specify the boundaries of the bins
bins = [-np.inf, 10000, 50000, 100000, 150000, np.inf]

# List of bin labels
labels = ['Very low', 'Low', 'Medium', 'High', 'Very high']

# Bin the continuous variable ConvertedSalary using these boundaries
so_survey_df['boundary_binned'] = pd.cut(so_survey_df['ConvertedSalary'], bins=bins, labels=labels)

# Print the first 5 rows of the boundary_binned column
so_survey_df[['boundary_binned', 'ConvertedSalary']].head()

Unnamed: 0,boundary_binned,ConvertedSalary
0,,
1,Medium,70841.0
2,,
3,Low,21426.0
4,Low,41671.0


Correct, now you can bin columns with equal spacing and predefined boundaries.