# Feature Engineering - Practical Walkthrough 

### Problem Statement
Your task is to perform feature engineering in order to transform both numerical and categorical data. The dataset can be found at "./datasets/survey_data.csv".

There are a total of 18 columns and 1000 rows.



### 1) Let's start by importing the necessary libraries and data set.

In [None]:
#importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns   

#importing the dataset
df = pd.read_csv('./datasets/survey_data.csv')

#checking the data (only the last expression in a cell is displayed, so the shape is printed explicitly)
print(df.shape)
df.head(10)

### Now let's go through the numerical data
But always check for missing values first.

In [None]:
#Checking for missing values
df.isnull().sum()
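
If the check turns up gaps, one possible follow-up is sketched below on a small made-up frame (the `toy` data and its values are invented for illustration and are not taken from the survey dataset). It reports the share of missing values per column and then fills numeric gaps with the median and categorical gaps with the most frequent value. This is only one simple strategy; dropping rows or using a model-based imputer may suit your data better.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the survey data (values are made up)
toy = pd.DataFrame({
    "age": [25, np.nan, 47, 61],
    "income": [30000, 52000, np.nan, np.nan],
    "city": ["A", "B", None, "A"],
})

# Share of missing values per column, as a percentage
missing_pct = toy.isnull().mean() * 100
print(missing_pct)

# Simple imputation: median for numeric columns, mode for everything else
filled = toy.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        filled[col] = filled[col].fillna(filled[col].median())
    else:
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(filled)
```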

**Let's start by taking a look at the numerical data**

Before that, let's separate our data into numerical and categorical columns.


In [None]:
## Assigning variables to distinguish numerical and categorical variables
numerical_data = ["age", "weight", "height", "hours_of_exercise","avg_daily_calories", "city_temperature", "income"]
categorical_data = ["marriage_status", "gender", "diet_type", "if_smokes", "if_drinks", "if_drugs", "city", "diseases_or_conditions","education_level"]


numerical_df = df.loc[:,numerical_data]
categorical_df = df.loc[:,categorical_data] 

In [None]:
#summarizing numerical data
numerical_df.describe()

Let's start with age. We can create a new feature called age group from the age, which can help add value to our model. To do this, we'll create a new column, `age_group`, and group people based on their age.

With weight and height we can calculate the BMI, which is weight in kilograms divided by the square of height in metres (kg/m²). We can apply this formula to create a new BMI feature and also group people based on their BMI.



In [None]:

#create age_group based on age (1: <=30, 2: <=40, 3: <=50, 4: <=60, 5: above 60)
df['age_group'] = np.where(df.age <= 30, 1, np.where(df.age <= 40, 2, np.where(df.age <= 50, 3, np.where(df.age <= 60, 4, 5))))

# Calculating the bmi (kg/m^2) and assigning it as a new column
# height is divided by 100, which assumes it is recorded in centimetres
df['bmi'] = df.weight/(df.height/100)**2

# creating weight class based on bmi (1: <=18.5, 2: <=25, 3: <=30, 4: <=35, 5: above 35)
df["weight_class"] = np.where(df.bmi <= 18.5, 1, np.where(df.bmi <= 25, 2, np.where(df.bmi <= 30, 3, np.where(df.bmi <= 35, 4, 5))))

df
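
The nested `np.where` calls above can get hard to read as the number of groups grows. A possible alternative is `pd.cut`, sketched here on a made-up frame (the `toy` values are invented, and weight in kilograms and height in centimetres are assumptions). The bin edges mirror the thresholds used above, and `pd.cut` includes the right edge by default, so a value of exactly 30 still lands in group 1. One difference to keep in mind: `pd.cut` leaves missing values as missing, whereas the nested `np.where` puts them in the last group, so the `astype(int)` step below assumes there are no missing values.

```python
import numpy as np
import pandas as pd

# Toy data (made up); weight assumed in kg, height assumed in cm
toy = pd.DataFrame({
    "age": [22, 30, 31, 45, 58, 73],
    "weight": [50, 68, 80, 95, 110, 60],
    "height": [175, 170, 180, 175, 170, 160],
})

toy["bmi"] = toy["weight"] / (toy["height"] / 100) ** 2

# Same thresholds as the nested np.where version; intervals are (left, right]
age_bins = [-np.inf, 30, 40, 50, 60, np.inf]
toy["age_group"] = pd.cut(toy["age"], bins=age_bins, labels=[1, 2, 3, 4, 5]).astype(int)

bmi_bins = [-np.inf, 18.5, 25, 30, 35, np.inf]
toy["weight_class"] = pd.cut(toy["bmi"], bins=bmi_bins, labels=[1, 2, 3, 4, 5]).astype(int)

print(toy)
```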

### Now let's see how you can deal with some of the categorical data


In [None]:
categorical_df.info()

Let's start with the nominal data.

- Let's do some one-hot encoding. The "if_smokes", "if_drinks" and "if_drugs" features already hold only two values (yes/no), so we just have to convert them into binary (1/0).

- We'll build a function to automate the one-hot encoding process for "marriage_status", "gender", "diet_type", "city" and "diseases_or_conditions".

- We can also add data from outside sources that we think is useful, for example the population of each city, in order to get more perspective.
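
For the last point, a minimal sketch of joining outside data onto the survey rows is shown below. Everything in it is hypothetical: the city names, the population figures and the `city_population` lookup table are invented for illustration, and in practice you would load real figures from a trusted source. A left merge keeps every survey row, and cities without a match simply get a missing population that you can inspect afterwards.

```python
import pandas as pd

# Toy survey rows (city names are made up)
toy = pd.DataFrame({"city": ["Springfield", "Shelbyville", "Springfield", "Ogdenville"]})

# Hypothetical lookup table with one row per city (figures are invented)
city_population = pd.DataFrame({
    "city": ["Springfield", "Shelbyville"],
    "city_population": [120_000, 85_000],
})

# Left merge keeps all survey rows; validate guards against duplicate cities in the lookup
enriched = toy.merge(city_population, on="city", how="left", validate="many_to_one")

# Count rows whose city had no match in the lookup table
unmatched = enriched["city_population"].isna().sum()
print(enriched)
print(f"rows without a population value: {unmatched}")
```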


In [None]:
#Converting "if_smokes", "if_drinks" and "if_drugs" to binary
df["if_smokes"] = df['if_smokes'].replace({"yes": 1, "no": 0})
df["if_drinks"] = df['if_drinks'].replace({"yes": 1, "no": 0})
df["if_drugs"] = df['if_drugs'].replace({"yes": 1, "no": 0})


#printing the value counts for each of them
print(f"""
smoker: 
{df["if_smokes"].value_counts()}
drinkers:
{df["if_drinks"].value_counts()}
drug_users: 
{df["if_drugs"].value_counts()}
{df[["if_smokes","if_drinks","if_drugs"]].head(10)}""")

df.shape

Now let's build a function to one-hot encode all of the nominal data.

In [None]:

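def one_hot_encoder(df, column_name):
    """
    One-hot encode a single categorical column.

    Returns a new DataFrame of dummy columns. Each one is prefixed with the
    column name so that identical labels in different columns cannot
    collide when the results are joined back onto the original frame.
    """
    return pd.get_dummies(df[column_name], prefix=column_name)

city_encoded = one_hot_encoder(df, "city")
diseases_encoded = one_hot_encoder(df, "diseases_or_conditions")
diet_type_encoded = one_hot_encoder(df, "diet_type")
gender_encoded = one_hot_encoder(df, "gender")
marriage_status_encoded = one_hot_encoder(df, "marriage_status")

# join the encoded columns back onto the original frame (aligned on the index)
df = df.join([city_encoded, diseases_encoded, diet_type_encoded, gender_encoded, marriage_status_encoded])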
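
As a compact alternative to encoding one column at a time, `pd.get_dummies` can also take the whole frame together with a `columns` list. The sketch below uses a made-up `toy` frame (its values are invented for illustration). In this form the original categorical columns are dropped and replaced by prefixed indicator columns, and `dtype=int` gives 0/1 values instead of booleans.

```python
import pandas as pd

# Toy frame (values are made up)
toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "marriage_status": ["single", "married", "single"],
    "age": [25, 40, 33],
})

# Encode several columns in one call; untouched columns are kept as they are
encoded = pd.get_dummies(toy, columns=["gender", "marriage_status"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Whether to keep or drop the original text columns depends on the model you plan to use; the join-based approach keeps them, while this form removes them.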

For education_level, which is ordinal, we'll build a value map with a dict and then map those values.

In [None]:
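# print explicitly, since only the last expression in a cell is displayed
print(df.education_level.value_counts())

scale_mapper = {"high_school":1,
                "college":3,
                "university":4,
                "other":5}

# assign the result back; calling replace with inplace=True on a selected column is unreliable in recent pandas
df["education_level"] = df["education_level"].replace(scale_mapper)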
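
One thing worth checking after an ordinal mapping is whether every category was actually covered by the dictionary. The sketch below uses `Series.map` on a made-up frame: the `toy` values are invented, and the extra "phd" label is a hypothetical category added only to show what an uncovered value looks like. Unlike `replace`, `map` turns anything missing from the dictionary into NaN, which makes gaps easy to spot.

```python
import pandas as pd

# Toy data; "phd" is a hypothetical label that is not in the mapping
toy = pd.DataFrame({"education_level": ["high_school", "college", "university", "other", "phd"]})

# Same mapping as in the walkthrough
scale_mapper = {"high_school": 1, "college": 3, "university": 4, "other": 5}

# map() returns NaN for labels that are missing from the dictionary
toy["education_level_encoded"] = toy["education_level"].map(scale_mapper)

# List the labels that were not covered by the mapping
unmapped = toy.loc[toy["education_level_encoded"].isna(), "education_level"].unique()
print(toy)
print(f"unmapped labels: {list(unmapped)}")
```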


### There is much more you can work on here. These are just a few examples and scenarios of how you can solve these problems yourself.