# Manual Feature Selection

The simplest way to select features that offer predictive power for our target variable is to apply our external and domain knowledge. In this notebook, we will recap simple methods that you've seen throughout the course, and when it is appropriate to use them.

These manual methods will become more and more redundant as you explore more advanced techniques, but they serve as a useful way to explore your features by hand.

### Import Basic Packages & Data

In [21]:
# Load the required libraries

# Data manipulation libraries
import pandas as pd
import numpy as np

# Data visualisation libraries
import matplotlib.pyplot as plt
import seaborn as sns

In [22]:
# Import data to a pandas dataframe
df_cars = pd.read_csv('indian cars dataset incnulls.csv')
df_cars.head()

# We will be working with the Indian cars dataset. The target variable in this instance is the starting price of a vehicle.

Unnamed: 0,ending_price,starting_price,reviews_count,max_torque_nm,max_torque_rpm,max_power_bhp,max_power_rp,fuel_tank_capacity,no_cylinder,rating,seating_capacity,fuel_type,engine_displacement,transmission_type,body_type,car_name
0,583000,399000,51,89.0,3500,65.71,5500,27.0,3,4.5,5.0,Petrol,998,Automatic,Hatchback,Maruti Alto K10
1,1396000,799000,86,136.8,4400,101.65,6000,48.0,4,4.5,5.0,Petrol,1462,Automatic,SUV,Maruti Brezza
2,1603000,1353000,242,300.0,2800,130.0,3750,57.0,4,4.5,4.0,Diesel,2184,Automatic,SUV,Mahindra Thar
3,2458000,1318000,313,450.0,2800,182.38,3500,60.0,4,4.5,7.0,Diesel,2198,Automatic,SUV,Mahindra XUV700
4,2390000,1199000,107,400.0,2750,172.45,3500,57.0,4,4.5,7.0,Diesel,2198,Automatic,SUV,Mahindra Scorpio-N


### Removing Features Using Domain Knowledge

Most importantly, we can use our knowledge of the domain and of the scenario we are working in to decide which features to select.

- Some data may simply be unavailable in a real world scenario.
- Some data might be available, but deemed too expensive to collect by the business. 
- Some features may have a scientifically established or commonly known connection to the target.

**In this case**, we are trying to predict the starting price of each car. We therefore know that the ending price will not be available to help us with the prediction (the ending price cannot exist if the starting price is still unknown). For this reason, the feature cannot be used when predicting our target variable and can be removed.

In [23]:
# Removing irrelevant feature
df_cars = df_cars.drop(columns=['ending_price'])
df_cars

Unnamed: 0,starting_price,reviews_count,max_torque_nm,max_torque_rpm,max_power_bhp,max_power_rp,fuel_tank_capacity,no_cylinder,rating,seating_capacity,fuel_type,engine_displacement,transmission_type,body_type,car_name
0,399000,51,89.0,3500,65.71,5500,27.0,3,4.5,5.0,Petrol,998,Automatic,Hatchback,Maruti Alto K10
1,799000,86,136.8,4400,101.65,6000,48.0,4,4.5,5.0,Petrol,1462,Automatic,SUV,Maruti Brezza
2,1353000,242,300.0,2800,130.00,3750,57.0,4,4.5,4.0,Diesel,2184,Automatic,SUV,Mahindra Thar
3,1318000,313,450.0,2800,182.38,3500,60.0,4,4.5,7.0,Diesel,2198,Automatic,SUV,Mahindra XUV700
4,1199000,107,400.0,2750,172.45,3500,57.0,4,4.5,7.0,Diesel,2198,Automatic,SUV,Mahindra Scorpio-N
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
198,659000,35,500.0,5250,415.71,6750,0.0,4,4.5,5.0,Petrol,1991,Automatic,Hatchback,Mercedes-Benz AMG A 45 S
199,1041000,3,400.0,4400,254.79,5000,59.0,4,4.5,5.0,Petrol,1998,Automatic,Sedan,BMW 3 Series Gran Limousine
200,1615000,2,350.0,2500,167.67,3750,60.0,4,4.5,7.0,Diesel,1956,Manual,SUV,MG Hector Plus
201,21700000,9,800.0,4500,591.39,6000,85.0,8,3.5,5.0,Petrol,3998,Automatic,SUV,Audi RS Q8
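
The domain-knowledge step above can also be written so that the reason for each exclusion is recorded next to the column name. The following is a minimal sketch on a toy frame built from the first three rows shown earlier, not the notebook's own method; the names `toy`, `excluded_features` and `toy_selected` are hypothetical.

```python
import pandas as pd

# Toy frame: three columns taken from the first three rows of the dataset preview.
toy = pd.DataFrame({
    "starting_price": [399000, 799000, 1353000],
    "ending_price": [583000, 1396000, 1603000],
    "max_power_bhp": [65.71, 101.65, 130.0],
})

# Hypothetical record of features excluded on domain grounds, with the reason.
excluded_features = {
    "ending_price": "not available when the starting price is being predicted",
}

# errors="ignore" lets the cell be re-run after the column has already been dropped.
toy_selected = toy.drop(columns=list(excluded_features), errors="ignore")

for name, reason in excluded_features.items():
    print(f"Dropped {name}: {reason}")
print(toy_selected.columns.tolist())  # ['starting_price', 'max_power_bhp']
```

Keeping the reasons in one place makes it easier to revisit a decision later, for example if the business starts collecting a feature that was previously unavailable.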


### Removing Features Based on Logical Reasoning

When we look at our dataset, we can use our understanding of our data to make some obvious decisions about predictive power. 

For example, we can say that the name of the car is unlikely to be helpful when predicting its starting price. There are 199 distinct car names out of 203 rows of data, so almost every row has its own value and there is hardly any pattern for a model to generalise from.

Conversely, if a feature in our dataset has only one unique value, there is no reason to keep it, as it does not provide any predictive power either.

In [24]:
# Use nunique() to count the unique values in each column.
df_cars.nunique()

starting_price         190
reviews_count           97
max_torque_nm           82
max_torque_rpm          50
max_power_bhp          143
max_power_rp            36
fuel_tank_capacity      47
no_cylinder              9
rating                   5
seating_capacity         6
fuel_type                4
engine_displacement     76
transmission_type        2
body_type               11
car_name               199
dtype: int64

With this knowledge, we will choose to drop the car name feature from our data.

In [25]:
df_cars = df_cars.drop(columns=['car_name'])
df_cars

Unnamed: 0,starting_price,reviews_count,max_torque_nm,max_torque_rpm,max_power_bhp,max_power_rp,fuel_tank_capacity,no_cylinder,rating,seating_capacity,fuel_type,engine_displacement,transmission_type,body_type
0,399000,51,89.0,3500,65.71,5500,27.0,3,4.5,5.0,Petrol,998,Automatic,Hatchback
1,799000,86,136.8,4400,101.65,6000,48.0,4,4.5,5.0,Petrol,1462,Automatic,SUV
2,1353000,242,300.0,2800,130.00,3750,57.0,4,4.5,4.0,Diesel,2184,Automatic,SUV
3,1318000,313,450.0,2800,182.38,3500,60.0,4,4.5,7.0,Diesel,2198,Automatic,SUV
4,1199000,107,400.0,2750,172.45,3500,57.0,4,4.5,7.0,Diesel,2198,Automatic,SUV
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
198,659000,35,500.0,5250,415.71,6750,0.0,4,4.5,5.0,Petrol,1991,Automatic,Hatchback
199,1041000,3,400.0,4400,254.79,5000,59.0,4,4.5,5.0,Petrol,1998,Automatic,Sedan
200,1615000,2,350.0,2500,167.67,3750,60.0,4,4.5,7.0,Diesel,1956,Manual,SUV
201,21700000,9,800.0,4500,591.39,6000,85.0,8,3.5,5.0,Petrol,3998,Automatic,SUV
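
The two checks described in this section (features that are nearly unique per row, and features with a single value) can be combined into one helper. This is a sketch under stated assumptions: the function name `flag_by_cardinality`, the `high_ratio` threshold of 0.95 and the `country` column are all hypothetical, and the threshold is a judgment call rather than a rule.

```python
import pandas as pd

def flag_by_cardinality(df, target, high_ratio=0.95):
    """Return (constant, near_unique) lists of feature names based on unique counts."""
    features = df.drop(columns=[target])
    n_unique = features.nunique()  # missing values are not counted
    unique_ratio = n_unique / len(features)

    # Features with at most one distinct value carry no information.
    constant = n_unique[n_unique <= 1].index.tolist()

    # Only check non-numeric columns: continuous measurements are often
    # legitimately different on every row.
    non_numeric = features.select_dtypes(exclude="number").columns
    near_unique = [c for c in non_numeric if unique_ratio[c] >= high_ratio]
    return constant, near_unique

# Toy frame; "country" is a hypothetical constant column added for illustration.
toy = pd.DataFrame({
    "starting_price": [399000, 799000, 1353000, 1318000],
    "car_name": ["Maruti Alto K10", "Maruti Brezza", "Mahindra Thar", "Mahindra XUV700"],
    "body_type": ["Hatchback", "SUV", "SUV", "SUV"],
    "country": ["India"] * 4,
})

constant, near_unique = flag_by_cardinality(toy, target="starting_price")
print(constant)     # ['country']
print(near_unique)  # ['car_name']
```

The helper only flags candidates. Whether a flagged feature is actually dropped remains a manual decision.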


### Removing Features Based on Incomplete or Missing Data

In our data cleaning chapter, we identified and populated null values where possible using a technique called imputation.

However, this may not always be possible.

In [26]:
df_cars.isnull().sum()

starting_price         0
reviews_count          0
max_torque_nm          0
max_torque_rpm         0
max_power_bhp          0
max_power_rp           0
fuel_tank_capacity     0
no_cylinder            0
rating                 0
seating_capacity       1
fuel_type              0
engine_displacement    0
transmission_type      0
body_type              0
dtype: int64
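
When only a handful of values are missing, as with `seating_capacity` above, imputation is usually enough. The sketch below fills a single missing value with the column median on a toy column. The values are hypothetical, and the median is an assumed choice, not necessarily the strategy used in the cleaning chapter.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (hypothetical values).
toy = pd.DataFrame({"seating_capacity": [5.0, 5.0, np.nan, 7.0, 5.0]})

# median() skips missing values by default.
median_seats = toy["seating_capacity"].median()

# Assign the result back instead of modifying the column in place.
toy["seating_capacity"] = toy["seating_capacity"].fillna(median_seats)

print(toy["seating_capacity"].isnull().sum())  # 0
```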

If, for whatever reason, a feature still has a high number of nulls, we may consider dropping it entirely. The decision to do this depends on how important or powerful the feature is likely to be in our model. We should also consider whether null values are to be expected in the scenario.

- For example, in a feature containing Marriage_Date, we may expect a significant percentage of nulls.
- If a feature that is important from a domain knowledge perspective has many nulls, we may ask the business to provide more data.
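
To make "a high number of nulls" concrete, the share of missing values per feature can be compared against a cut-off. This is a minimal sketch on a toy frame: the `warranty_years` column and the 0.5 value of `max_null_fraction` are hypothetical, and the cut-off should be chosen for the scenario at hand.

```python
import numpy as np
import pandas as pd

# Toy frame; "warranty_years" is a hypothetical, mostly empty feature.
toy = pd.DataFrame({
    "starting_price": [399000, 799000, 1353000, 1318000, 1199000],
    "seating_capacity": [5.0, 5.0, np.nan, 7.0, 7.0],
    "warranty_years": [np.nan, np.nan, 3.0, np.nan, np.nan],
})

# Fraction of missing values in each column (the mean of a boolean mask).
null_fraction = toy.isnull().mean()

# Hypothetical cut-off: flag features with more than half their values missing.
max_null_fraction = 0.5
to_drop = null_fraction[null_fraction > max_null_fraction].index.tolist()

toy_reduced = toy.drop(columns=to_drop)
print(to_drop)  # ['warranty_years']
```

Here `seating_capacity` is kept because only one value in five is missing, which makes it a candidate for imputation instead.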

Note: There are no black-and-white rules for removing features.