# Feature Engineering

1. Missing data
2. Feature Normalization
3. Categorical Encoding
4. Transformations
5. Discretization
6. Outliers
7. Optional: Date and Time

---
## 1. Missing data

In [None]:
import pylab 
import datetime
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt

warnings.filterwarnings('ignore')

In [None]:
titanic = pd.read_csv('https://raw.githubusercontent.com/anyoneai/notebooks/main/datasets/titanic.csv')

In [None]:
data0 = titanic.copy()

In [None]:
data0.isna().sum()

In [None]:
# np.float was removed from NumPy; plain division already returns a float.
print(f'Percentage of data without missing values: {data0.dropna().shape[0] / data0.shape[0]:.2%}')

In [None]:
data1 = titanic.copy()

In [None]:
data1.isna().mean()

In [None]:
data1.info()

**TODO:** `Age` is a continuous variable. First, check the distribution of the `Age` variable.

In [None]:
# create a histogram (missing ages are dropped so they do not affect the bins)
plt.hist(data1['Age'].dropna(), bins=20)

# add labels and title
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Distribution of Age')

# show the plot
plt.show()

We create a histogram with 20 bins that shows the distribution of the `Age` variable. The x-axis represents the age range, and the y-axis represents the frequency (number of passengers) in each age bin. The resulting plot helps us understand the central tendency, spread, and shape of the `Age` distribution.

**TODO:** We can see that the `Age` distribution is skewed, so we will use median imputation, which is less sensitive to the long tail than the mean.

In [None]:
# Calculating the median of 'Age':
median_age = data1['Age'].median()

# Filling the missing values with the median age:
# (assign the result back; in-place fillna on a selected column is deprecated in recent pandas)
data1['Age'] = data1['Age'].fillna(median_age)

# Showing the results:
print('We show the replaced column: ', data1['Age'])

---
## 2. Feature Normalization

In [None]:
data2 = titanic.copy()
median = data2.Age.median()
data2['Age'] = data2['Age'].fillna(median)
data2.head()

**TODO:** Normalize `Age` in two ways: with `MinMaxScaler` and with `StandardScaler`.

In [None]:
# Importing needed libraries:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Start from data2 so that 'Age' no longer has missing values:
data3 = data2.copy()

# Normalizing 'Age' using MinMax Scaler:
minmax_scaler = MinMaxScaler()
data3['Age_MinMax'] = minmax_scaler.fit_transform(data3[['Age']])

# Now, we normalize 'Age' using Standard Scaler:
std_scaler = StandardScaler()
data3['Age_Standard'] = std_scaler.fit_transform(data3[['Age']])

# Showing the first few rows of the new dataframe
data3.head()

---
## 3. Categorical Encoding

### One-Hot Encoding

In [None]:
data3 = titanic.copy()

In [None]:
data3['Sex'].head()

In [None]:
data3_oh = pd.get_dummies(data3['Sex'])
data3_oh.head()

In [None]:
data3 = data3.join(data3_oh)
data3.head()

We can see that we only need 1 of the 2 dummy variables to represent the original categorical variable `Sex`. Any of the 2 will do the job, and it doesn't matter which one we select, since they are equivalent. Therefore, to encode a categorical variable with 2 labels, we need only 1 dummy variable.

To extend this concept: to encode a categorical variable with k labels, we need k-1 dummy variables. We can achieve this task as follows:

**TODO:** Obtain k-1 dummy variables for the __Sex__ and __Embarked__ features
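One possible way to get k-1 dummies is the `drop_first=True` option of `pd.get_dummies`. The sketch below runs on a small made-up frame (`toy` is a hypothetical stand-in, not the real Titanic data), so the same call should be checked against the actual columns.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Titanic columns used above.
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', 'Q', 'S'],
})

# drop_first=True drops the first level of each variable,
# leaving k-1 dummy columns per categorical variable.
dummies = pd.get_dummies(toy[['Sex', 'Embarked']], drop_first=True)
print(dummies.columns.tolist())
```

Here `Sex` (2 labels) yields 1 column and `Embarked` (3 labels) yields 2 columns.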

**TODO:** Investigate the scikit-learn API for [one-hot encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).

---
## 4. Transformations

In [None]:
data4 = pd.read_csv('https://raw.githubusercontent.com/anyoneai/notebooks/main/datasets/titanic.csv', usecols=['Age', 'Fare', 'Survived'])
data4.head()

In [None]:
data4['Age'] = data4['Age'].fillna(data4.Age.median())
data4.isna().sum()

### Example: Logarithmic transformation

In [None]:
data4['Age_log'] = np.log(data4.Age)

**TODO:** Convert Age to months
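A minimal sketch of the conversion, assuming `Age` is expressed in years (fractional values are ages under one year). The frame `toy` and the column name `Age_months` are hypothetical.

```python
import pandas as pd

# Hypothetical ages in years.
toy = pd.DataFrame({'Age': [0.5, 22.0, 38.0]})

# A linear transformation: one year is 12 months.
toy['Age_months'] = toy['Age'] * 12
print(toy)
```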

---
## 5. Discretization

**TODO:** Apply binning to __Age__ and plot Age count per bin
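One way to approach this is `pd.cut` with explicit bin edges, followed by a bar plot of the counts. The edges and labels below are arbitrary choices for illustration, and `ages` is a hypothetical sample rather than the Titanic column.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sample of ages.
ages = pd.Series([0.42, 5, 17, 22, 28, 35, 47, 61, 80], name='Age')

# Assumed bin edges and labels; intervals are right-inclusive, e.g. (0, 12].
edges = [0, 12, 18, 40, 60, 100]
labels = ['child', 'teen', 'adult', 'middle-aged', 'senior']
age_bin = pd.cut(ages, bins=edges, labels=labels)

# Count per bin, kept in bin order rather than sorted by frequency.
counts = age_bin.value_counts().sort_index()

# Bar plot of the number of passengers in each bin.
fig, ax = plt.subplots()
ax.bar(counts.index.astype(str), counts.values)
ax.set_xlabel('Age bin')
ax.set_ylabel('Count')
ax.set_title('Age count per bin')
```

`pd.qcut` is an alternative when bins with roughly equal counts are preferred over fixed edges.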

---
## 6. Outliers

**TODO:** Load the numerical variables of the Titanic Dataset
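A possible way to keep only the numerical variables is `select_dtypes`. The sketch uses a hypothetical toy frame; on the real data the same call would be applied after `pd.read_csv`.

```python
import pandas as pd

# Hypothetical toy frame with a mix of text and numeric columns.
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Sex': ['male', 'female', 'female'],
    'Age': [22.0, 38.0, 26.0],
    'Fare': [7.25, 71.28, 7.92],
    'Survived': [0, 1, 1],
})

# Keep only integer and float columns.
numeric = toy.select_dtypes(include='number')
print(numeric.columns.tolist())
```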

**TODO:** We can see that `Age` and `Fare` are continuous variables. So, you'll need to limit outliers on those variables.

**TODO:** Plot histograms on __Age__ and __Fare__

**TODO:** __Age__ is roughly Gaussian and __Fare__ is skewed, so you will use the Gaussian assumption for __Age__ and the interquartile range for __Fare__.

### Find outliers
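The two rules mentioned above can be sketched as follows. The helper names `gaussian_limits` and `iqr_limits`, the multipliers 3 and 1.5 (common conventions, not taken from this notebook), and the toy values are all assumptions for illustration.

```python
import pandas as pd

def gaussian_limits(s, k=3.0):
    """Return (lower, upper) as mean minus/plus k standard deviations."""
    return s.mean() - k * s.std(), s.mean() + k * s.std()

def iqr_limits(s, k=1.5):
    """Return (lower, upper) as Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy values, not the real Titanic columns.
age = pd.Series(list(range(20, 40)) + [95], dtype=float)
fare = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 512.33])

age_low, age_high = gaussian_limits(age)
fare_low, fare_high = iqr_limits(fare)

# Values outside the limits are flagged as outliers.
age_outliers = age[(age < age_low) | (age > age_high)]
fare_outliers = fare[(fare < fare_low) | (fare > fare_high)]

# One way to limit outliers is to cap values at the limits.
age_capped = age.clip(lower=age_low, upper=age_high)
fare_capped = fare.clip(lower=fare_low, upper=fare_high)
```

Capping keeps every row, whereas dropping the flagged rows shrinks the dataset; which is preferable depends on the problem.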

---
## 7. Optional: Date and Time

In some machine learning problems, temporal features appear, such as dates and times. That type of data must be treated in a particular way.

**NOTE:** There is an area of machine learning where time data becomes critical: time series.

In [None]:
data7 = pd.read_csv('https://raw.githubusercontent.com/anyoneai/notebooks/main/datasets/stock_prices.csv')
data7.head()

**TODO:** Parse the dates, currently coded as strings, into datetime.

**TIP:** Investigate pandas [to_datetime](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html) and take care of date format!
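A minimal sketch of parsing with an explicit format. The column name `Date` and the day-first format `%d/%m/%Y %H:%M` are assumptions made for illustration; the actual file may use a different column name and layout, so inspect `data7.head()` first.

```python
import pandas as pd

# Hypothetical strings in day/month/year order.
raw = pd.DataFrame({'Date': ['03/01/2021 09:30', '04/01/2021 16:00']})

# An explicit format avoids ambiguity: '03/01/2021' is read as 3 January here.
raw['Date'] = pd.to_datetime(raw['Date'], format='%d/%m/%Y %H:%M')
print(raw.dtypes)
```

Without `format`, an ambiguous string such as `03/01/2021` may be interpreted month-first, which silently changes the dates.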

**TODO:** Extract Month from Date

**TODO:** Extract the day of the month as a number from 1 to 31

**TODO:** Convert Day of the week to numeric from 0 to 6

**TODO:** Convert Day of the week to name

**TODO:** Weekend flag: generate a binary feature that indicates whether the date falls on a weekend day.

**TODO:** Extract year 

**TODO:** Extract hour
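Once a column has a datetime dtype, the features listed above are available through the `.dt` accessor. The sketch below uses two hypothetical timestamps, and the output column names are illustrative only.

```python
import pandas as pd

# Hypothetical timestamps; the real values would come from the parsed date column.
dates = pd.to_datetime(pd.Series(['2021-01-02 09:30:00', '2021-01-04 16:00:00']))

features = pd.DataFrame({
    'month': dates.dt.month,
    'day': dates.dt.day,                  # day of the month, 1-31
    'day_of_week': dates.dt.dayofweek,    # Monday=0 ... Sunday=6
    'day_name': dates.dt.day_name(),
    # Saturday (5) and Sunday (6) count as weekend.
    'is_weekend': (dates.dt.dayofweek >= 5).astype(int),
    'year': dates.dt.year,
    'hour': dates.dt.hour,
})
print(features)
```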