<a href="https://colab.research.google.com/github/Uzma-Jawed/python-class_work-and-practice/blob/main/26_FeatureEngineering_BrandData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

📘 Uzma Jawed

📅 Class Work - August 09


---



### 🚗 Feature Engineering with Brand Data  
This notebook covers:
- Definitions of important concepts (Feature Engineering, Machine Learning, Deep Learning)
- Creating and manipulating Pandas DataFrames
- Splitting text into multiple columns
- Converting categorical data into numerical data
- Working with date ranges in Pandas




---


### 📌 Definitions

**Feature Engineering**  
The art of transforming raw data into features that better represent the problem to predictive models.  
*Best Practice:* Always ask: "Would this feature help a human expert make the decision?"

**Categorical → Numerical Conversion**  
Essential because most ML algorithms can't process text directly. Two main approaches:
- **Label Encoding:** Assigns each category an integer. Best for ordinal data (categories with a natural rank order but uneven or unknown distances between them), provided the integers follow that order. Note that scikit-learn's `LabelEncoder` numbers categories in sorted order, which may not match the intended rank.
- **One-Hot Encoding:** Creates one binary column per category. Best for nominal data (categories that have no inherent order).
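
The two approaches can be sketched on a small example. The `size` and `color` columns and the `size_order` mapping below are hypothetical, chosen only to show one ordinal and one nominal feature; this is a minimal illustration, not the only way to encode such data.

```python
import pandas as pd

# Hypothetical example data: 'size' is ordinal, 'color' is nominal
sizes = pd.DataFrame({
    'size': ['small', 'large', 'medium', 'small'],
    'color': ['red', 'blue', 'green', 'red'],
})

# Ordinal: map categories to integers that preserve the rank order
size_order = {'small': 0, 'medium': 1, 'large': 2}
sizes['size_encoded'] = sizes['size'].map(size_order)

# Nominal: one 0/1 column per category, with no implied order
encoded = pd.get_dummies(sizes, columns=['color'], dtype=int)
print(encoded)
```

An explicit mapping is used for `size` so that the integers follow the intended rank rather than alphabetical order.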

**Machine Learning**  
Three main paradigms:
- Supervised Learning (Labeled data)
- Unsupervised Learning (Pattern discovery)
- Reinforcement Learning (Reward-based)

**Deep Learning**  
A specialized branch of ML characterized by:
- Neural networks with ≥3 layers
- Automatic feature extraction
- A typical need for large datasets


---



In [229]:
# Importing required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

### 🛠 Create a Dictionary

We will create a dictionary with one key `'brand'` and pass multiple values in the form of a list.  
Each value will be in the format: **BrandName_Year**.


---



In [230]:
brand_data = {
    'brand': [
        'Lamborghini_1998', 'Porsche_2014', 'Volkswagen_2015',
        'Chrysler_2009', 'Volvo_1927', 'Audi_2019',
        'BMW_2020', 'Mercedes_2021', 'Ferrari_2022'
    ]
}

In [231]:
df = pd.DataFrame(brand_data)

In [232]:
df

Unnamed: 0,brand
0,Lamborghini_1998
1,Porsche_2014
2,Volkswagen_2015
3,Chrysler_2009
4,Volvo_1927
5,Audi_2019
6,BMW_2020
7,Mercedes_2021
8,Ferrari_2022




---


### ✂ Splitting Brand Column

Split the `brand` column into:
- `brand_name`
- `year`

This is an example of **feature engineering**.


---



In [233]:
# Splitting 'brand' column into two new columns
df[['brand_name', 'year']] = df['brand'].str.split('_', expand=True)

In [234]:
df['year'] = df['year'].astype(int)

In [235]:
# Drop original column
df.drop('brand', axis=1, inplace=True)

In [236]:
df

Unnamed: 0,brand_name,year
0,Lamborghini,1998
1,Porsche,2014
2,Volkswagen,2015
3,Chrysler,2009
4,Volvo,1927
5,Audi,2019
6,BMW,2020
7,Mercedes,2021
8,Ferrari,2022
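
The split above works because every value contains exactly one underscore. If a brand name could itself contain an underscore, splitting from the right with a limit of one is a possible safeguard. The sketch below uses a hypothetical value (`Aston_Martin_2018`) that is not part of the class data.

```python
import pandas as pd

# Hypothetical values: the first brand name itself contains an underscore
cars = pd.DataFrame({'brand': ['Aston_Martin_2018', 'Volvo_1927']})

# rsplit with n=1 splits only at the last underscore
cars[['brand_name', 'year']] = cars['brand'].str.rsplit('_', n=1, expand=True)
cars['year'] = cars['year'].astype(int)
print(cars)
```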




---


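### Extracting with Indexing

Instead of expanding the split into several columns at once, `.str[i]` picks a single element from each split list.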


---



In [237]:
brand_data = {
    'brand': [
        'Lamborghini_1998', 'Porsche_2014', 'Volkswagen_2015',
        'Chrysler_2009', 'Volvo_1927', 'Audi_2019',
        'BMW_2020', 'Mercedes_2021', 'Ferrari_2022'
    ]
}

In [238]:
df = pd.DataFrame(brand_data)

In [239]:
df

Unnamed: 0,brand
0,Lamborghini_1998
1,Porsche_2014
2,Volkswagen_2015
3,Chrysler_2009
4,Volvo_1927
5,Audi_2019
6,BMW_2020
7,Mercedes_2021
8,Ferrari_2022


In [240]:
# Extract brand name only
df['brand_name_only'] = df['brand'].str.split("_").str[0]

In [241]:
df

Unnamed: 0,brand,brand_name_only
0,Lamborghini_1998,Lamborghini
1,Porsche_2014,Porsche
2,Volkswagen_2015,Volkswagen
3,Chrysler_2009,Chrysler
4,Volvo_1927,Volvo
5,Audi_2019,Audi
6,BMW_2020,BMW
7,Mercedes_2021,Mercedes
8,Ferrari_2022,Ferrari


In [242]:
df = pd.DataFrame(brand_data)

In [243]:
# Extract year only
df['year_only'] = df['brand'].str.split("_").str[1].astype(int)

In [None]:
df

Unnamed: 0,brand,year_only
0,Lamborghini_1998,1998
1,Porsche_2014,2014
2,Volkswagen_2015,2015
3,Chrysler_2009,2009
4,Volvo_1927,1927
5,Audi_2019,2019
6,BMW_2020,2020
7,Mercedes_2021,2021
8,Ferrari_2022,2022
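
As an alternative to splitting and indexing, both parts can be captured in one step with a regular expression. This is a sketch that assumes every value ends in an underscore followed by a four-digit year; the group names `brand_name` and `year` are chosen here for illustration.

```python
import pandas as pd

cars = pd.DataFrame({'brand': ['Lamborghini_1998', 'Porsche_2014', 'Volvo_1927']})

# Named groups in the pattern become the column names of the result
parts = cars['brand'].str.extract(r'^(?P<brand_name>.+)_(?P<year>\d{4})$')
parts['year'] = parts['year'].astype(int)
print(parts)
```

Rows that do not match the pattern would come back as missing values, so the integer conversion should only be applied once the format has been checked.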




---


### 🔢 Converting Categorical Data into Numerical Data

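We can use **Label Encoding** (scikit-learn's `LabelEncoder`) to convert each distinct value of the `brand` column into an integer. Codes are assigned in sorted order of the unique values.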


---



In [244]:
# Initialize label encoder
le = LabelEncoder()

In [245]:
brand_data = {
    'brand': [
        'Lamborghini_1998', 'Porsche_2014', 'Volkswagen_2015',
        'Chrysler_2009', 'Volvo_1927', 'Audi_2019',
        'BMW_2020', 'Mercedes_2021', 'Ferrari_2022'
    ]
}

In [246]:
df = pd.DataFrame(brand_data)

In [247]:
df

Unnamed: 0,brand
0,Lamborghini_1998
1,Porsche_2014
2,Volkswagen_2015
3,Chrysler_2009
4,Volvo_1927
5,Audi_2019
6,BMW_2020
7,Mercedes_2021
8,Ferrari_2022


In [248]:
# Fit and transform the combined 'brand' strings (name and year together)
df['brand_encoded'] = le.fit_transform(df['brand'])

In [249]:
df

Unnamed: 0,brand,brand_encoded
0,Lamborghini_1998,4
1,Porsche_2014,6
2,Volkswagen_2015,7
3,Chrysler_2009,2
4,Volvo_1927,8
5,Audi_2019,0
6,BMW_2020,1
7,Mercedes_2021,5
8,Ferrari_2022,3
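
To see which integer belongs to which label, the fitted encoder can be inspected. The sketch below uses a shortened, hypothetical list of brand names; `classes_` and `inverse_transform` are part of scikit-learn's `LabelEncoder`.

```python
from sklearn.preprocessing import LabelEncoder

brands = ['Lamborghini', 'Porsche', 'Audi', 'BMW']

le = LabelEncoder()
codes = le.fit_transform(brands)

# classes_ holds the sorted unique labels; each position is the integer code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform maps integer codes back to the original labels
decoded = le.inverse_transform(codes)
print(list(decoded))
```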




---
### 📅 Working with Dates in Pandas

Create date ranges and extract Year, Month, and Day.


---






In [250]:
# Creating a date range for January 2025
days = pd.date_range(start="2025-01-01", end="2025-01-31")

In [251]:
# Convert to DataFrame
df_dates = pd.DataFrame({'Date': days})

In [252]:
# Extract year, month, day into separate columns
df_dates['Year'] = df_dates['Date'].dt.year
df_dates['Month'] = df_dates['Date'].dt.month
df_dates['Day'] = df_dates['Date'].dt.day

In [None]:
df_dates

Unnamed: 0,Date,Year,Month,Day
0,2025-01-01,2025,1,1
1,2025-01-02,2025,1,2
2,2025-01-03,2025,1,3
3,2025-01-04,2025,1,4
4,2025-01-05,2025,1,5
5,2025-01-06,2025,1,6
6,2025-01-07,2025,1,7
7,2025-01-08,2025,1,8
8,2025-01-09,2025,1,9
9,2025-01-10,2025,1,10




---


### Advanced Date Handling


---



In [253]:
# Create business-day aware date range
date_rng = pd.date_range(start='2025-01-01', end='2025-01-31', freq='B')  # B = business days

In [254]:
# Create DataFrame with enhanced features
df_dates = pd.DataFrame({'date': date_rng})
df_dates['year'] = df_dates['date'].dt.year
df_dates['month'] = df_dates['date'].dt.month_name()
df_dates['day_of_week'] = df_dates['date'].dt.day_name()
# Always False here: freq='B' already excludes Saturdays and Sundays
df_dates['is_weekend'] = df_dates['date'].dt.dayofweek > 4

In [255]:
print("\nEnhanced Date Features:")
df_dates


Enhanced Date Features:


Unnamed: 0,date,year,month,day_of_week,is_weekend
0,2025-01-01,2025,January,Wednesday,False
1,2025-01-02,2025,January,Thursday,False
2,2025-01-03,2025,January,Friday,False
3,2025-01-06,2025,January,Monday,False
4,2025-01-07,2025,January,Tuesday,False
5,2025-01-08,2025,January,Wednesday,False
6,2025-01-09,2025,January,Thursday,False
7,2025-01-10,2025,January,Friday,False
8,2025-01-13,2025,January,Monday,False
9,2025-01-14,2025,January,Tuesday,False
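
Because a business-day range contains no Saturdays or Sundays, the `is_weekend` flag only varies on a daily range. A minimal sketch with daily frequency for the same month (the names `all_days` and `weekend_count` are illustrative):

```python
import pandas as pd

# Daily frequency keeps Saturdays and Sundays in the range
all_days = pd.DataFrame({'date': pd.date_range('2025-01-01', '2025-01-31', freq='D')})

# dayofweek: Monday=0 ... Sunday=6, so values above 4 are weekend days
all_days['is_weekend'] = all_days['date'].dt.dayofweek > 4

weekend_count = int(all_days['is_weekend'].sum())
print(weekend_count)  # 8 weekend days in January 2025
```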




---


### 🏠 Homework:
Study more about **feature engineering** and why it is important in machine learning.


---




### 🎯 One-Hot Encoding

**One-Hot Encoding** converts categorical values into separate binary (0/1) columns for each category.

Example:  
If brands are `Audi`, `BMW`, `Volvo`, then One-Hot Encoding creates columns:  
`Audi`, `BMW`, `Volvo` — each containing 0 or 1 depending on the row.


---




In [260]:
brand_data = {
    'brand': [
        'Lamborghini_1998', 'Porsche_2014', 'Volkswagen_2015',
        'Chrysler_2009', 'Volvo_1927', 'Audi_2019',
        'BMW_2020', 'Mercedes_2021', 'Ferrari_2022'
    ]
}

In [261]:
df = pd.DataFrame(brand_data)
df

Unnamed: 0,brand
0,Lamborghini_1998
1,Porsche_2014
2,Volkswagen_2015
3,Chrysler_2009
4,Volvo_1927
5,Audi_2019
6,BMW_2020
7,Mercedes_2021
8,Ferrari_2022


In [263]:
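# Splitting 'brand' column into two new columns
df[['brand_name', 'year']] = df['brand'].str.split('_', expand=True)
# Drop the original combined column so it does not appear in the encoded output
df = df.drop(columns='brand')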

In [264]:
# Perform One-Hot Encoding
df_one_hot = pd.get_dummies(df, columns=['brand_name'], prefix='brand')

In [208]:
df_one_hot

Unnamed: 0,year,brand_Audi,brand_BMW,brand_Chrysler,brand_Ferrari,brand_Lamborghini,brand_Mercedes,brand_Porsche,brand_Volkswagen,brand_Volvo
0,1998,False,False,False,False,True,False,False,False,False
1,2014,False,False,False,False,False,False,True,False,False
2,2015,False,False,False,False,False,False,False,True,False
3,2009,False,False,True,False,False,False,False,False,False
4,1927,False,False,False,False,False,False,False,False,True
5,2019,True,False,False,False,False,False,False,False,False
6,2020,False,True,False,False,False,False,False,False,False
7,2021,False,False,False,False,False,True,False,False,False
8,2022,False,False,False,True,False,False,False,False,False
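
The output above uses `True`/`False`. If 0/1 integers are preferred, or one redundant column should be dropped, `pd.get_dummies` accepts `dtype` and `drop_first`. The sketch below uses a short, hypothetical list of brand names.

```python
import pandas as pd

cars = pd.DataFrame({'brand_name': ['Audi', 'BMW', 'Volvo', 'Audi']})

# dtype=int yields 0/1 instead of True/False
one_hot = pd.get_dummies(cars, columns=['brand_name'], prefix='brand', dtype=int)
print(one_hot)

# drop_first=True removes the first category's column (brand_Audi here)
reduced = pd.get_dummies(cars, columns=['brand_name'], prefix='brand',
                         dtype=int, drop_first=True)
print(reduced)
```

With `drop_first=True`, a row of all zeros stands for the dropped category, which avoids perfectly correlated columns in linear models.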


### 🔍 Difference Between Label Encoding & One-Hot Encoding

- **Label Encoding** assigns each category a unique integer value.  
  Example:  
  Audi → 0  
  BMW → 1  
  Volvo → 2  

- **One-Hot Encoding** creates a new column for each category and assigns binary values (0 or 1).  

In [265]:
# Example
# Create the data as a dictionary
data = {
    "Audi": [1, 0, 0],
    "BMW": [0, 1, 0],
    "Volvo": [0, 0, 1]
}

In [266]:
# Convert dictionary to DataFrame
df = pd.DataFrame(data)

In [267]:
df

Unnamed: 0,Audi,BMW,Volvo
0,1,0,0
1,0,1,0
2,0,0,1
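
The same matrix can be produced from the labels instead of being typed by hand. The sketch below derives both encodings from one Series; using pandas category codes for the label-encoded version is one possible choice, not the only one.

```python
import pandas as pd

brands = pd.Series(['Audi', 'BMW', 'Volvo'], name='brand')

# Label encoding: a single integer column (codes follow sorted category order)
label_encoded = brands.astype('category').cat.codes
print(label_encoded.tolist())

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(brands, dtype=int)
print(one_hot)
```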




---


### Using Custom Row Labels Instead of Numeric Indexes


---



In [268]:
# Row labels (one per car)
index_labels = ["Car 1", "Car 2", "Car 3"]

In [269]:
# Convert dictionary to DataFrame with row labels
df = pd.DataFrame(data, index=index_labels)

In [270]:
df

Unnamed: 0,Audi,BMW,Volvo
Car 1,1,0,0
Car 2,0,1,0
Car 3,0,0,1
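
With labeled rows, a car can be looked up by name, and the one-hot columns can be collapsed back into a single brand per row. This is a small sketch on the same three-car table; the variable names are illustrative.

```python
import pandas as pd

one_hot = pd.DataFrame(
    {'Audi': [1, 0, 0], 'BMW': [0, 1, 0], 'Volvo': [0, 0, 1]},
    index=['Car 1', 'Car 2', 'Car 3'],
)

# Select a row by its label rather than its position
car_2 = one_hot.loc['Car 2']
print(car_2)

# idxmax(axis=1) returns, for each row, the column that holds the 1
brand_per_car = one_hot.idxmax(axis=1)
print(brand_per_car)
```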
