# Manual Feature Selection

The simplest way to select features with predictive power for our target variable is to rely on external and domain knowledge. In this notebook, we will recap simple methods that you've seen throughout the course, and recap when it is appropriate to use them.

These manual methods will become less necessary as you explore more advanced techniques, but they remain a useful way to get to know your features.

### Import Basic Packages & Data

In [None]:
# Load the required libraries

# Data manipulation libraries
import pandas as pd
import numpy as np

# Data visualisation libraries
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Import data to a pandas dataframe
df_cars = pd.read_csv('indian cars dataset incnulls.csv')

df_cars.head()

# We will be working with the Indian cars dataset. The target variable in this instance is the starting price of a vehicle.

### Removing Features Using Domain Knowledge

Most importantly, we can use our knowledge of the domain and of the scenario we are working with to decide which features to select.

- Some data may simply be unavailable in a real world scenario.
- Some data might be available, but deemed too expensive to collect by the business. 
- Some features may have a scientifically established or commonly known connection.

**In this case** we are trying to predict the starting price of each car. The ending price will therefore not be available to help us with the prediction (the ending price cannot exist if the starting price is still unknown). For this reason, the feature cannot be used when predicting our target variable and can be removed.

In [None]:
# Remove the feature that will not be available at prediction time
df_cars = df_cars.drop(columns=['ending_price'])
df_cars
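
When several features are ruled out by domain knowledge, it can help to keep them in one list and drop only those that are actually present. The following is a minimal sketch on made-up toy data, not the real cars dataset; the `resale_value` name and all values are hypothetical and used only for illustration.

```python
import pandas as pd

# Toy data standing in for the cars dataset (values are made up)
toy = pd.DataFrame({
    "starting_price": [5.0, 7.5, 12.0],
    "ending_price": [6.5, 9.0, 15.0],
    "seating_capacity": [5, 5, 7],
})

# Hypothetical list of features we would not have at prediction time
unavailable_at_prediction = ["ending_price", "resale_value"]

# Keep only the names that exist, so a missing column does not raise an error
to_drop = [col for col in unavailable_at_prediction if col in toy.columns]
toy_reduced = toy.drop(columns=to_drop)

print("Dropped:", to_drop)
print("Remaining:", list(toy_reduced.columns))
```

Keeping the list explicit also documents why each feature was removed, which is useful when the decision is revisited later.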

### Removing Features Based on Logical Reasoning

When we look at our dataset, we can use our understanding of our data to make some obvious decisions about predictive power. 

For example, the name of the car is unlikely to be helpful when predicting its starting price. There are 199 distinct car names across 203 rows of data, so almost every row has its own value and there is hardly any predictive power in that feature.

Conversely, if a feature in our dataset has only one unique value, there is no reason to keep it either, because a value that never varies provides no predictive power.

In [None]:
# Use the nunique method to count the unique values in each column.
df_cars.nunique()

With this knowledge, we will choose to drop the car name feature from our data.

In [None]:
# Drop the near-unique car name feature
df_cars = df_cars.drop(columns=['car_name'])
df_cars
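
The two checks described in this section (a single unique value, or nearly one unique value per row) can be sketched with `nunique`. This is an illustrative example on made-up toy data; the 0.95 threshold is an assumption, not a rule, and the near-unique check is restricted to non-numeric columns because continuous numeric features are naturally close to unique.

```python
import pandas as pd

# Toy data standing in for the cars dataset (values are made up)
toy = pd.DataFrame({
    "car_name": ["A1", "B2", "C3", "D4"],
    "fuel_type": ["Petrol", "Diesel", "Petrol", "Petrol"],
    "country": ["India", "India", "India", "India"],
    "engine_cc": [998, 1197, 1498, 1999],
})

# Columns with at most one distinct value carry no information
unique_counts = toy.nunique()
constant_cols = unique_counts[unique_counts <= 1].index.tolist()

# For non-numeric columns, flag those where almost every row is distinct
non_numeric = toy.select_dtypes(exclude="number")
unique_ratio = non_numeric.nunique() / len(toy)
near_unique_threshold = 0.95  # hypothetical cut-off, adjust per dataset
near_unique_cols = unique_ratio[unique_ratio > near_unique_threshold].index.tolist()

print("Constant:", constant_cols)
print("Near-unique:", near_unique_cols)
```

The flagged columns are candidates for review rather than automatic removal; an identifier-like column may still be worth keeping for joining or reporting.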

### Removing Features Based on Incomplete or Missing Data

In the data cleaning chapter, we identified and populated null values where possible using a technique called imputation.

However, this may not always be possible.

In [None]:
# Count the null values in each column
df_cars.isnull().sum()

If, for whatever reason, a feature still has a high number of nulls, we may consider dropping it entirely. The decision depends on how important or powerful the feature is likely to be in our model. We should also consider whether null values are to be expected in the scenario.

- For example, in a feature such as Marriage_Date, we may expect a significant percentage of nulls.
- If a feature that is important from a domain knowledge perspective has many nulls, we may ask the business to provide more data.
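
To compare features of different sizes, it is often easier to look at the fraction of nulls than at raw counts. The following is a minimal sketch on made-up toy data; the 0.5 cut-off and the column names are assumptions chosen for illustration, and the right threshold depends on the feature and the scenario.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the cars dataset (values are made up)
toy = pd.DataFrame({
    "starting_price": [5.0, 7.5, 12.0, 9.0],
    "boot_space": [np.nan, np.nan, np.nan, 300.0],
    "fuel_type": ["Petrol", None, "Diesel", "Petrol"],
})

# The mean of a boolean mask is the fraction of nulls in each column
null_fraction = toy.isnull().mean()

# Hypothetical cut-off: flag features where more than half the values are missing
max_null_fraction = 0.5
high_null_cols = null_fraction[null_fraction > max_null_fraction].index.tolist()

toy_reduced = toy.drop(columns=high_null_cols)

print(null_fraction)
print("Dropped:", high_null_cols)
```

A feature flagged this way is only a candidate for removal; if domain knowledge says it matters, collecting more data may be the better option.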

Note: There are no black-and-white rules for removing features.