# Feature Engineering Practice Notebook
This notebook contains a sample dataset and questions to practice feature engineering tasks.

## Dataset Overview
Here is a preview of the dataset you will work with:

In [1]:
import pandas as pd
df_orig = pd.read_csv('/Users/dinaeshs/Desktop/MLOps/Practice/Python_practice/Feature_Engineering/sample_feature_engineering_data.csv')
df_orig

Unnamed: 0,CustomerID,Age,AnnualIncome,PurchaseHistory,LastPurchaseDate,City
0,1,23.0,60000,"Electronics, Fashion",2025-01-01,New York
1,2,45.0,80000,"Groceries, Fashion",2024-12-15,San Francisco
2,3,31.0,50000,Electronics,2025-01-10,Chicago
3,4,35.0,75000,Fashion,2024-11-25,New York
4,5,,40000,Groceries,,Los Angeles



## Feature Engineering Questions

1. **Handling Missing Values**: Fill the missing values in the `Age` and `LastPurchaseDate` columns with appropriate values.
2. **Feature Extraction**: Extract the year, month, and day from the `LastPurchaseDate` column.
3. **One-Hot Encoding**: Convert the `City` column into one-hot encoded features.
4. **Frequency Encoding**: Perform frequency encoding on the `PurchaseHistory` column.
5. **Bucketization**: Divide customers into age groups (e.g., 18-30, 31-50, 51+) based on the `Age` column.
6. **Date Transformation**: Calculate the number of days since the last purchase for each customer.
7. **Feature Scaling**: Scale the `AnnualIncome` column using Min-Max scaling.
8. **Derived Features**: Create a new feature indicating the total number of categories in `PurchaseHistory`.
9. **Interaction Features**: Combine `City` and `PurchaseHistory` to create a new interaction feature.
10. **Outlier Detection**: Identify any outliers in the `AnnualIncome` column.


## Start Your Feature Engineering Tasks Below

1. **Handling Missing Values**: Fill the missing values in the `Age` and `LastPurchaseDate` columns with appropriate values.

In [72]:
working_df = df_orig.copy()

In [73]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection:
# the selection behaves as a copy, and pandas flags the inplace form as chained assignment.
working_df['Age'] = working_df['Age'].fillna(working_df['Age'].mean())


In [74]:
working_df

Unnamed: 0,CustomerID,Age,AnnualIncome,PurchaseHistory,LastPurchaseDate,City
0,1,23.0,60000,"Electronics, Fashion",2025-01-01,New York
1,2,45.0,80000,"Groceries, Fashion",2024-12-15,San Francisco
2,3,31.0,50000,Electronics,2025-01-10,Chicago
3,4,35.0,75000,Fashion,2024-11-25,New York
4,5,33.5,40000,Groceries,,Los Angeles
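
Task 1 also asks for `LastPurchaseDate` to be filled, and the output above still shows it missing for customer 5. A minimal sketch of one option, assuming the median of the observed dates is an acceptable stand-in. The names `dates`, `median_date`, and `filled` are illustrative, and the values are copied from the dataset preview:

```python
import pandas as pd

# Dates copied from the dataset preview; the last customer has no purchase date.
dates = pd.Series(
    ["2025-01-01", "2024-12-15", "2025-01-10", "2024-11-25", None],
    name="LastPurchaseDate",
)

# Parse to datetime64 so the missing entry becomes NaT.
dates = pd.to_datetime(dates)

# Assumption: impute with the median observed date, truncated to midnight.
median_date = dates.median().normalize()
filled = dates.fillna(median_date)

print(filled)
```

Other choices, such as the most frequent date or a sentinel value plus a "was missing" flag, may suit a real dataset better. The median is used here only because it is easy to reason about.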


In [75]:
# Round the imputed mean (33.5) to a whole number before casting to an integer dtype.
working_df['Age'] = working_df['Age'].round(0)

In [None]:
working_df['Age'] = working_df['Age'].convert_dtypes(convert_integer=True)

2. **Feature Extraction**: Extract the year, month, and day from the `LastPurchaseDate` column.

In [108]:
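working_df['LastPurchaseyear'] = pd.to_datetime(working_df['LastPurchaseDate']).dt.year.convert_dtypes(convert_integer=True)
working_df['LastPurchasemonth'] = pd.to_datetime(working_df['LastPurchaseDate']).dt.month.convert_dtypes(convert_integer=True)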

In [59]:
working_df

Unnamed: 0,CustomerID,Age,AnnualIncome,PurchaseHistory,LastPurchaseDate,City,LastPurchaseyear,LastPurchasemonth
0,1,23,60000,"Electronics, Fashion",2025-01-01,New York,2025.0,1.0
1,2,45,80000,"Groceries, Fashion",2024-12-15,San Francisco,2024.0,12.0
2,3,31,50000,Electronics,2025-01-10,Chicago,2025.0,1.0
3,4,35,75000,Fashion,2024-11-25,New York,2024.0,11.0
4,5,34,40000,Groceries,,Los Angeles,,


In [107]:
working_df['LastPurchaseday'] = pd.to_datetime(working_df['LastPurchaseDate']).dt.day.convert_dtypes(convert_integer=True)

In [61]:
working_df

Unnamed: 0,CustomerID,Age,AnnualIncome,PurchaseHistory,LastPurchaseDate,City,LastPurchaseyear,LastPurchasemonth,LastPurchaseday
0,1,23,60000,"Electronics, Fashion",2025-01-01,New York,2025.0,1.0,1.0
1,2,45,80000,"Groceries, Fashion",2024-12-15,San Francisco,2024.0,12.0,15.0
2,3,31,50000,Electronics,2025-01-10,Chicago,2025.0,1.0,10.0
3,4,35,75000,Fashion,2024-11-25,New York,2024.0,11.0,25.0
4,5,34,40000,Groceries,,Los Angeles,,,
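
The year, month, and day can also be extracted in one pass after parsing the dates once. The sketch below is a self-contained illustration, not the notebook's own cell: `dates` and `parts` are hypothetical names, and the values are copied from the dataset preview.

```python
import pandas as pd

# Dates copied from the dataset preview; the last row is missing.
dates = pd.to_datetime(
    pd.Series(["2025-01-01", "2024-12-15", "2025-01-10", "2024-11-25", None])
)

# The .dt accessors return floats when a NaT is present; casting to the
# nullable Int64 dtype keeps whole numbers and marks the missing row as <NA>.
parts = pd.DataFrame(
    {
        "LastPurchaseyear": dates.dt.year.astype("Int64"),
        "LastPurchasemonth": dates.dt.month.astype("Int64"),
        "LastPurchaseday": dates.dt.day.astype("Int64"),
    }
)

print(parts)
```

Parsing once avoids repeating `pd.to_datetime` for every component and keeps the three columns consistent with each other.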


3. **One-Hot Encoding**: Convert the `City` column into one-hot encoded features.

In [77]:
working_df = pd.get_dummies(working_df, columns=['City'])
working_df

Unnamed: 0,CustomerID,Age,AnnualIncome,PurchaseHistory,LastPurchaseDate,City_Chicago,City_Los Angeles,City_New York,City_San Francisco
0,1,23.0,60000,"Electronics, Fashion",2025-01-01,False,False,True,False
1,2,45.0,80000,"Groceries, Fashion",2024-12-15,False,False,False,True
2,3,31.0,50000,Electronics,2025-01-10,True,False,False,False
3,4,35.0,75000,Fashion,2024-11-25,False,False,True,False
4,5,34.0,40000,Groceries,,False,True,False,False
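
The output above shows boolean indicator columns. If 0/1 integers are preferred for downstream models, `pd.get_dummies` accepts a `dtype` argument. A small sketch on a stand-alone frame (`cities` and `one_hot` are illustrative names, with values copied from the preview):

```python
import pandas as pd

# City values copied from the dataset preview.
cities = pd.DataFrame(
    {"City": ["New York", "San Francisco", "Chicago", "New York", "Los Angeles"]}
)

# dtype=int yields 0/1 columns instead of True/False.
one_hot = pd.get_dummies(cities, columns=["City"], dtype=int)

print(one_hot)
```

Each row has exactly one indicator set, because every customer belongs to a single city.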


4. **Frequency Encoding**: Perform frequency encoding on the `PurchaseHistory` column.

In [78]:
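# MultiLabelBinarizer yields one 0/1 indicator column per category (multi-hot encoding);
# it does not compute how frequently each value occurs.
from sklearn.preprocessing import MultiLabelBinarizer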

In [79]:
mlb = MultiLabelBinarizer()

In [89]:
data_split = [history.split(', ') for history in working_df['PurchaseHistory']]

In [96]:
data_split

[['Electronics', 'Fashion'],
 ['Groceries', 'Fashion'],
 ['Electronics'],
 ['Fashion'],
 ['Groceries']]

In [90]:
encoded_result = mlb.fit_transform(data_split)

In [91]:
encoded_result

array([[1, 1, 0],
       [0, 1, 1],
       [1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]])

In [92]:
encoded_df = pd.DataFrame(encoded_result, columns=mlb.classes_)

In [93]:
encoded_df

Unnamed: 0,Electronics,Fashion,Groceries
0,1,1,0
1,0,1,1
2,1,0,0
3,0,1,0
4,0,0,1


In [109]:
# Run this cell once: re-running it appends the indicator columns a second time.
working_df = pd.concat([working_df, encoded_df], axis=1)

In [110]:
working_df

Unnamed: 0,CustomerID,Age,AnnualIncome,PurchaseHistory,LastPurchaseDate,City_Chicago,City_Los Angeles,City_New York,City_San Francisco,Electronics,Fashion,Groceries,LastPurchaseday,LastPurchasemonth,Electronics.1,Fashion.1,Groceries.1
0,1,23.0,60000,"Electronics, Fashion",2025-01-01,False,False,True,False,1,1,0,1.0,1.0,1,1,0
1,2,45.0,80000,"Groceries, Fashion",2024-12-15,False,False,False,True,0,1,1,15.0,12.0,0,1,1
2,3,31.0,50000,Electronics,2025-01-10,True,False,False,False,1,0,0,10.0,1.0,1,0,0
3,4,35.0,75000,Fashion,2024-11-25,False,False,True,False,0,1,0,25.0,11.0,0,1,0
4,5,34.0,40000,Groceries,,False,True,False,False,0,0,1,,,0,0,1
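
The cells above produce per-category indicator columns. Frequency encoding in the usual sense replaces each value with how often it occurs. A minimal sketch of that technique on the same `PurchaseHistory` values follows; the names `history`, `history_freq`, and `category_counts` are illustrative, and treating the whole string as one value is an assumption about what the task intends.

```python
import pandas as pd

# PurchaseHistory values copied from the dataset preview.
history = pd.Series(
    ["Electronics, Fashion", "Groceries, Fashion", "Electronics", "Fashion", "Groceries"],
    name="PurchaseHistory",
)

# Relative frequency of each distinct PurchaseHistory string,
# mapped back onto the rows (the frequency-encoded feature).
value_freq = history.value_counts(normalize=True)
history_freq = history.map(value_freq)

# Alternative view: how often each individual category appears across customers.
category_counts = history.str.split(", ").explode().value_counts()

print(history_freq)
print(category_counts)
```

In this five-row sample every `PurchaseHistory` string is unique, so the encoded column is constant. The per-category counts are more informative here, and either can be joined back to the frame as a feature.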


5. **Bucketization**: Divide customers into age groups (e.g., 18-30, 31-50, 51+) based on the `Age` column.

In [112]:
working_df_tmp = working_df.copy()

In [113]:
# Left-closed bins [18, 31), [31, 51), [51, inf) so the edges match the labels.
working_df_tmp['Age Cat'] = pd.cut(working_df_tmp['Age'], bins=[18, 31, 51, float('inf')], labels=['18-30', '31-50', '51+'], right=False)
working_df_tmp

Unnamed: 0,CustomerID,Age,AnnualIncome,PurchaseHistory,LastPurchaseDate,City_Chicago,City_Los Angeles,City_New York,City_San Francisco,Electronics,Fashion,Groceries,LastPurchaseday,LastPurchasemonth,Electronics.1,Fashion.1,Groceries.1,Age Cat
0,1,23.0,60000,"Electronics, Fashion",2025-01-01,False,False,True,False,1,1,0,1.0,1.0,1,1,0,18-30
1,2,45.0,80000,"Groceries, Fashion",2024-12-15,False,False,False,True,0,1,1,15.0,12.0,0,1,1,31-50
2,3,31.0,50000,Electronics,2025-01-10,True,False,False,False,1,0,0,10.0,1.0,1,0,0,31-50
3,4,35.0,75000,Fashion,2024-11-25,False,False,True,False,0,1,0,25.0,11.0,0,1,0,31-50
4,5,34.0,40000,Groceries,,False,True,False,False,0,0,1,,,0,0,1,31-50
