<a href="https://colab.research.google.com/github/Uzma-Jawed/python-class_work-and-practice/blob/main/26_FeatureEngineering_BrandData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

📘 Uzma Jawed

📅 Class Work - August 09


---



### 🚗 Feature Engineering with Brand Data  
This notebook covers:
- Definitions of important concepts (Feature Engineering, Machine Learning, Deep Learning)
- Creating and manipulating Pandas DataFrames
- Splitting text into multiple columns
- Converting categorical data into numerical data
- Working with date ranges in Pandas




---


### 📌 Definitions

**Feature Engineering**  
The art of transforming raw data into features that better represent the problem to predictive models.  
*Best Practice:* Always ask: "Would this feature help a human expert make the decision?"

**Categorical → Numerical Conversion**  
Essential because most ML algorithms can't process text directly. Two main approaches:
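- **Label Encoding:** Assigns one integer per category. It suits ordinal data only when the integers follow the rank order. scikit-learn's `LabelEncoder` assigns codes in sorted order and is intended for target labels, so the order must be supplied explicitly for ordinal features.
(Ordinal data is categorical data with a natural rank order, but uneven or unknown distances between categories.)
- **One-Hot Encoding:** Creates one binary column per category (best for nominal data).
(Nominal data is categorical data with no natural order.)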
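
The two approaches can be sketched side by side. This is a minimal illustration, not part of the class work: the `size` and `color` columns are made-up data, and `OrdinalEncoder` is used in place of `LabelEncoder` so that the rank order can be stated explicitly.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: 'size' is ordinal, 'color' is nominal
df_demo = pd.DataFrame({
    'size': ['small', 'large', 'medium', 'small'],
    'color': ['red', 'blue', 'green', 'red'],
})

# Ordinal feature: pass the rank order so the integers reflect it
size_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
df_demo['size_encoded'] = (
    size_encoder.fit_transform(df_demo[['size']]).ravel().astype(int)
)

# Nominal feature: one binary column per category
color_dummies = pd.get_dummies(df_demo['color'], prefix='color', dtype=int)
df_demo = pd.concat([df_demo, color_dummies], axis=1)

print(df_demo)
```

Here `small`, `medium`, and `large` map to 0, 1, and 2, while `color` becomes three 0/1 columns with no implied order.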

**Machine Learning**  
Three main paradigms:
- Supervised Learning (Labeled data)
- Unsupervised Learning (Pattern discovery)
- Reinforcement Learning (Reward-based)

**Deep Learning**  
A specialized branch of ML characterized by:
- Neural networks with multiple layers (commonly described as three or more)
- Automatic feature extraction from raw data
- A need for large datasets


---



In [285]:
# Importing required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

### 🛠 Create a Dictionary

We will create a dictionary with one key `'brand'` and pass multiple values in the form of a list.  
Each value will be in the format: **BrandName_Year**.


---



In [286]:
brand_data = {
    'brand': [
        'Lamborghini_1998', 'Porsche_2014', 'Volkswagen_2015',
        'Chrysler_2009', 'Volvo_1927', 'Audi_2019',
        'BMW_2020', 'Mercedes_2021', 'Ferrari_2022'
    ]
}

In [287]:
df = pd.DataFrame(brand_data)

In [288]:
df

Unnamed: 0,brand
0,Lamborghini_1998
1,Porsche_2014
2,Volkswagen_2015
3,Chrysler_2009
4,Volvo_1927
5,Audi_2019
6,BMW_2020
7,Mercedes_2021
8,Ferrari_2022




---


### ✂ Splitting Brand Column

Split the `brand` column into:
- `brand_name`
- `year`

This is an example of **feature engineering**.


---



In [289]:
# Splitting 'brand' column into two new columns
df[['brand_name', 'year']] = df['brand'].str.split('_', expand=True)

In [290]:
df['year'] = df['year'].astype(int)

In [291]:
# Drop the original column now that it has been split
df = df.drop(columns='brand')

In [292]:
df

Unnamed: 0,brand_name,year
0,Lamborghini,1998
1,Porsche,2014
2,Volkswagen,2015
3,Chrysler,2009
4,Volvo,1927
5,Audi,2019
6,BMW,2020
7,Mercedes,2021
8,Ferrari,2022
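
If a brand name could itself contain an underscore, splitting on every `_` would produce more than two pieces. One possible alternative, sketched here under that assumption, is `str.extract` with a regular expression. The `Aston_Martin_2020` value is hypothetical and does not appear in the class data.

```python
import pandas as pd

# 'Aston_Martin_2020' is a made-up value with an extra underscore
raw = pd.DataFrame({
    'brand': ['Lamborghini_1998', 'Porsche_2014', 'Aston_Martin_2020']
})

# Named groups become column names. The greedy (.+) keeps earlier
# underscores inside the brand name; the last four digits form the year.
parts = raw['brand'].str.extract(r'^(?P<brand_name>.+)_(?P<year>\d{4})$')
parts['year'] = parts['year'].astype(int)

print(parts)
```

For values with a single underscore this gives the same two columns as `str.split('_', expand=True)`.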




---


### Extracting with Indexing


---



In [293]:
brand_data = {
    'brand': [
        'Lamborghini_1998', 'Porsche_2014', 'Volkswagen_2015',
        'Chrysler_2009', 'Volvo_1927', 'Audi_2019',
        'BMW_2020', 'Mercedes_2021', 'Ferrari_2022'
    ]
}

In [294]:
df = pd.DataFrame(brand_data)

In [295]:
df

Unnamed: 0,brand
0,Lamborghini_1998
1,Porsche_2014
2,Volkswagen_2015
3,Chrysler_2009
4,Volvo_1927
5,Audi_2019
6,BMW_2020
7,Mercedes_2021
8,Ferrari_2022


In [296]:
# Extract brand name only
df['brand_name_only'] = df['brand'].str.split("_").str[0]

In [297]:
df

Unnamed: 0,brand,brand_name_only
0,Lamborghini_1998,Lamborghini
1,Porsche_2014,Porsche
2,Volkswagen_2015,Volkswagen
3,Chrysler_2009,Chrysler
4,Volvo_1927,Volvo
5,Audi_2019,Audi
6,BMW_2020,BMW
7,Mercedes_2021,Mercedes
8,Ferrari_2022,Ferrari


In [298]:
df = pd.DataFrame(brand_data)

In [299]:
# Extract year only
df['year_only'] = df['brand'].str.split("_").str[1].astype(int)

In [300]:
df

Unnamed: 0,brand,year_only
0,Lamborghini_1998,1998
1,Porsche_2014,2014
2,Volkswagen_2015,2015
3,Chrysler_2009,2009
4,Volvo_1927,1927
5,Audi_2019,2019
6,BMW_2020,2020
7,Mercedes_2021,2021
8,Ferrari_2022,2022
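
The indexing used above works because `str.split` without `expand=True` returns one list per row, and `.str[i]` then picks element `i` from each list. A small sketch with two of the values from the class data:

```python
import pandas as pd

s = pd.Series(['Volvo_1927', 'Audi_2019'])

pieces = s.str.split('_')           # each element is a list of parts
first = pieces.str[0]               # element-wise index 0
last = pieces.str[-1]               # negative indices work too
same_as_first = pieces.str.get(0)   # .str.get(i) is equivalent to .str[i]

print(first.tolist(), last.tolist())
```

The extracted years are still strings at this point, which is why the notebook follows the indexing with `.astype(int)`.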




---


### 🔢 Converting Categorical Data into Numerical Data

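We can use label encoding (scikit-learn's `LabelEncoder`) to convert text categories into integer codes. In the cells below the encoder is fit on the full `brand` string (name and year together), so every row gets its own code. The codes follow sorted order and do not imply any ranking between brands.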


---



In [301]:
# Initialize label encoder
le = LabelEncoder()

In [302]:
brand_data = {
    'brand': [
        'Lamborghini_1998', 'Porsche_2014', 'Volkswagen_2015',
        'Chrysler_2009', 'Volvo_1927', 'Audi_2019',
        'BMW_2020', 'Mercedes_2021', 'Ferrari_2022'
    ]
}

In [303]:
df = pd.DataFrame(brand_data)

In [304]:
df

Unnamed: 0,brand
0,Lamborghini_1998
1,Porsche_2014
2,Volkswagen_2015
3,Chrysler_2009
4,Volvo_1927
5,Audi_2019
6,BMW_2020
7,Mercedes_2021
8,Ferrari_2022


In [305]:
# Fit and transform the full 'brand' strings (name and year together)
df['brand_encoded'] = le.fit_transform(df['brand'])

In [None]:
df

Unnamed: 0,brand,brand_encoded
0,Lamborghini_1998,4
1,Porsche_2014,6
2,Volkswagen_2015,7
3,Chrysler_2009,2
4,Volvo_1927,8
5,Audi_2019,0
6,BMW_2020,1
7,Mercedes_2021,5
8,Ferrari_2022,3
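
To see where codes like these come from, the fitted encoder can be inspected. The sketch below uses a short, made-up list of brand names rather than the notebook's DataFrame: `classes_` holds the sorted unique labels, and `inverse_transform` maps codes back to them.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative list with one repeated brand
names = ['Volvo', 'Audi', 'BMW', 'Audi']

encoder = LabelEncoder()
codes = encoder.fit_transform(names)

# classes_ is sorted; a label's position in it is its integer code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}

# inverse_transform recovers the original labels from the codes
decoded = encoder.inverse_transform(codes)

print(mapping)
print(codes.tolist(), decoded.tolist())
```

Repeated labels share a code, which is the usual reason to encode the brand name alone rather than the combined name and year.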




---
### 📅 Working with Dates in Pandas

Create date ranges and extract Year, Month, and Day.


---






In [306]:
# Creating a date range for January 2025
days = pd.date_range(start="2025-01-01", end="2025-01-31")

In [307]:
# Convert to DataFrame
df_dates = pd.DataFrame({'Date': days})

In [308]:
# Extract year, month, day into separate columns
df_dates['Year'] = df_dates['Date'].dt.year
df_dates['Month'] = df_dates['Date'].dt.month
df_dates['Day'] = df_dates['Date'].dt.day

In [309]:
df_dates

Unnamed: 0,Date,Year,Month,Day
0,2025-01-01,2025,1,1
1,2025-01-02,2025,1,2
2,2025-01-03,2025,1,3
3,2025-01-04,2025,1,4
4,2025-01-05,2025,1,5
5,2025-01-06,2025,1,6
6,2025-01-07,2025,1,7
7,2025-01-08,2025,1,8
8,2025-01-09,2025,1,9
9,2025-01-10,2025,1,10
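
The `.dt` accessor only works on datetime columns. When dates arrive as text, they need to be parsed first. A minimal sketch, assuming string input with one deliberately invalid value (the sample strings are made up):

```python
import pandas as pd

raw_dates = pd.Series(['2025-01-01', '2025-01-15', 'not a date'])

# errors='coerce' turns unparseable values into NaT instead of raising
parsed = pd.to_datetime(raw_dates, errors='coerce')

features = pd.DataFrame({
    'Date': parsed,
    'Year': parsed.dt.year,
    'Month': parsed.dt.month,
    'Day': parsed.dt.day,
})

print(features)
```

Rows that failed to parse show up as missing values in every derived column, so they are easy to find and handle afterwards.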




---


### Advanced Date Handling


---



In [310]:
# Create business-day aware date range
date_rng = pd.date_range(start='2025-01-01', end='2025-01-31', freq='B')  # B = business days

In [311]:
# Create DataFrame with enhanced features
df_dates = pd.DataFrame({'date': date_rng})
df_dates['year'] = df_dates['date'].dt.year
df_dates['month'] = df_dates['date'].dt.month_name()
df_dates['day_of_week'] = df_dates['date'].dt.day_name()
# With freq='B' the range contains no Saturdays or Sundays, so this flag is always False here
df_dates['is_weekend'] = df_dates['date'].dt.dayofweek > 4

In [312]:
print("\nEnhanced Date Features:")
df_dates


Enhanced Date Features:


Unnamed: 0,date,year,month,day_of_week,is_weekend
0,2025-01-01,2025,January,Wednesday,False
1,2025-01-02,2025,January,Thursday,False
2,2025-01-03,2025,January,Friday,False
3,2025-01-06,2025,January,Monday,False
4,2025-01-07,2025,January,Tuesday,False
5,2025-01-08,2025,January,Wednesday,False
6,2025-01-09,2025,January,Thursday,False
7,2025-01-10,2025,January,Friday,False
8,2025-01-13,2025,January,Monday,False
9,2025-01-14,2025,January,Tuesday,False
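
Because a business-day range never includes Saturday or Sunday, a weekend flag only varies on a daily range. The sketch below applies the same idea to the first week of January 2025 with `freq='D'`. The `daily` name is introduced here for illustration.

```python
import pandas as pd

# freq='D' includes every calendar day, weekends too
daily = pd.DataFrame({
    'date': pd.date_range(start='2025-01-01', end='2025-01-07', freq='D')
})
daily['day_of_week'] = daily['date'].dt.day_name()

# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend
daily['is_weekend'] = daily['date'].dt.dayofweek >= 5

print(daily)
```

In this range, 4 January (Saturday) and 5 January (Sunday) are flagged `True`.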




---


### 🏠 Homework:
Study more about **feature engineering** and why it is important in machine learning.


---

