# Feature Construction

---

🔹 What is Feature Construction?

Feature Construction = Creating new features (variables) from the existing data to make patterns easier for models to detect.

It can improve model performance by providing more informative, higher-level, or domain-specific features.

Think of it as transforming raw data into better inputs for ML models.

🔹 Why is it important?

Models often underperform on raw data because the relationships that matter are not explicit in the original columns.

Constructed features can:

Capture nonlinear relationships.

Encode domain knowledge.

Reduce noise and improve interpretability.

Help simple models (like Linear/Logistic Regression) compete with complex ones.

🔹 Types of Feature Construction
1. Mathematical Transformations

Apply math functions on features:

Log, Square root, Exponential

Ratios, Differences, Products

Example: BMI = weight / (height^2)
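
As a rough sketch of the transformations listed above, the snippet below applies a log transform and builds a ratio feature. The column names (`income`, `debt`) and the values are made up for illustration and are not part of any dataset used in these notes.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a long right tail in "income"
df_math = pd.DataFrame({
    "income": [20_000, 35_000, 50_000, 1_000_000],
    "debt": [5_000, 7_000, 25_000, 100_000],
})

# log1p computes log(1 + x): it compresses large values and is defined at 0
df_math["log_income"] = np.log1p(df_math["income"])

# Ratio of two columns as a single, scale-free feature
df_math["debt_to_income"] = df_math["debt"] / df_math["income"]

print(df_math)
```

The log transform is mainly useful for skewed, non-negative columns; the ratio expresses two raw columns as one quantity that is comparable across rows.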

2. Polynomial Features

Create interaction terms and higher-order features:

Example: $x_1^2,\ x_2^2,\ x_1 \times x_2$

Captures nonlinear relationships.
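
To make the three terms concrete, here is a minimal hand-written sketch with two hypothetical columns `x1` and `x2`. It only illustrates which columns a degree-2 expansion adds.

```python
import pandas as pd

# Two hypothetical input features
df_p = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [4.0, 5.0, 6.0]})

# Squared terms
df_p["x1^2"] = df_p["x1"] ** 2
df_p["x2^2"] = df_p["x2"] ** 2

# Interaction term
df_p["x1*x2"] = df_p["x1"] * df_p["x2"]

print(df_p)
```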

3. Discretization / Binning

Convert continuous values into bins (categories).

Example:

Age → [0–18 = "Child", 19–35 = "Young Adult", 36–60 = "Adult", 61+ = "Senior"].
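
One possible way to express these age bins in pandas is `pd.cut`. The sketch below assumes that age 60 belongs to "Adult" and uses an arbitrary upper edge of 120; the sample ages are invented.

```python
import pandas as pd

ages = pd.Series([5, 18, 19, 35, 36, 60, 61, 80], name="Age")

# Bins are right-inclusive by default: (-1, 18], (18, 35], (35, 60], (60, 120]
# The lower edge -1 makes sure age 0 lands in the first bin.
age_group = pd.cut(
    ages,
    bins=[-1, 18, 35, 60, 120],
    labels=["Child", "Young Adult", "Adult", "Senior"],
)

print(pd.DataFrame({"Age": ages, "Age_Group": age_group}))
```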

4. Encoding Categorical Features

Combine or transform categorical variables:

One-hot encoding

Frequency encoding

Target encoding
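
The three encodings can be sketched with plain pandas as follows. The `city` and `target` columns are hypothetical, and the target encoding is computed on all rows only to keep the example short.

```python
import pandas as pd

df_cat = pd.DataFrame({
    "city": ["London", "Paris", "London", "Rome", "Paris", "London"],
    "target": [1, 0, 1, 0, 1, 0],
})

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(df_cat["city"], prefix="city")

# Frequency encoding: replace each category by its relative frequency
freq = df_cat["city"].value_counts(normalize=True)
df_cat["city_freq"] = df_cat["city"].map(freq)

# Target encoding: replace each category by the mean target of that category
# (in a real project this mean should be learned from training rows only)
target_mean = df_cat.groupby("city")["target"].mean()
df_cat["city_target_enc"] = df_cat["city"].map(target_mean)

print(one_hot)
print(df_cat)
```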

5. Datetime Feature Construction

Extract useful features from timestamps:

Day, Month, Year

Day of week, Is weekend, Quarter

Time since event

Example: Transaction_Date → DayOfWeek, Month, Year, Holiday_Flag
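
A small sketch of the less obvious items in this list (quarter, time since event, holiday flag). The reference date and the holiday list are assumptions chosen for illustration; a real project would use its own calendar.

```python
import pandas as pd

tx = pd.DataFrame({
    "Transaction_Date": pd.to_datetime(["2024-01-01", "2024-02-14", "2024-06-22"]),
})

tx["Quarter"] = tx["Transaction_Date"].dt.quarter
tx["DayName"] = tx["Transaction_Date"].dt.day_name()

# Time since event: days elapsed up to a fixed reference date (assumed here)
reference_date = pd.Timestamp("2024-12-31")
tx["Days_Since"] = (reference_date - tx["Transaction_Date"]).dt.days

# Holiday flag from a hand-written, hypothetical holiday list
holidays = pd.to_datetime(["2024-01-01", "2024-12-25"])
tx["Holiday_Flag"] = tx["Transaction_Date"].isin(holidays).astype(int)

print(tx)
```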

6. Text Feature Construction

From raw text:

Word count, Character count

TF-IDF features

Sentiment scores

N-grams
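
The simpler text features can be built with pandas string methods alone, as in the sketch below. The review strings and the `bigrams` helper are hypothetical; TF-IDF and sentiment scores need additional libraries and are left out here.

```python
import pandas as pd

reviews = pd.Series(["great product", "not great at all", "bad"])

# Character and word counts
char_count = reviews.str.len()
word_count = reviews.str.split().str.len()

def bigrams(text):
    """Return the list of adjacent word pairs (2-grams) in a string."""
    words = text.split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

review_bigrams = reviews.apply(bigrams)

print(pd.DataFrame({
    "review": reviews,
    "char_count": char_count,
    "word_count": word_count,
    "bigrams": review_bigrams,
}))
```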

7. Domain-specific Feature Construction

Features based on domain knowledge:

Finance: Debt-to-Income ratio

Healthcare: BMI, Risk score

Retail: Sales per customer, Discount percentage

8. Aggregation Features

Group-based summaries:

Mean, Max, Min, Count within groups

Example: "Average spending per customer ID"
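
Group-based summaries like this are commonly attached to each row with `groupby(...).transform(...)`. The `orders` table below is invented for illustration.

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "amount": [10.0, 30.0, 5.0, 15.0, 40.0, 100.0],
})

by_customer = orders.groupby("customer_id")["amount"]

# transform broadcasts the group statistic back to every row of that group
orders["cust_mean_amount"] = by_customer.transform("mean")
orders["cust_order_count"] = by_customer.transform("count")

# How large is this order relative to the customer's average order?
orders["amount_vs_cust_mean"] = orders["amount"] / orders["cust_mean_amount"]

print(orders)
```

Using `transform` instead of `agg` keeps the original row count, so the new columns can be used directly as model inputs.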

🔹 Example in Python
import pandas as pd
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Example dataset
df = pd.DataFrame({
    "Weight": [60, 72, 90, 45],
    "Height": [1.65, 1.70, 1.80, 1.55],
    "Salary": [40000, 52000, 60000, 35000],
    "Date": pd.to_datetime(["2024-01-01", "2024-02-14", "2024-06-20", "2024-09-15"])
})

print("Original Data:")
print(df)

# 1. Construct BMI feature
df["BMI"] = df["Weight"] / (df["Height"] ** 2)

# 2. Salary per kg
df["Salary_per_kg"] = df["Salary"] / df["Weight"]

# 3. Polynomial features (Weight & Height)
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[["Weight", "Height"]])
df_poly = pd.DataFrame(
    poly_features,
    columns=poly.get_feature_names_out(["Weight", "Height"]),
    index=df.index,
)

# Merge only the new terms; the degree-1 columns (Weight, Height) already exist in df
df = pd.concat([df, df_poly.drop(columns=["Weight", "Height"])], axis=1)

# 4. Date-based features
df["Month"] = df["Date"].dt.month
df["DayOfWeek"] = df["Date"].dt.dayofweek
df["IsWeekend"] = df["DayOfWeek"].isin([5, 6]).astype(int)

print("\nData after Feature Construction:")
print(df)

🔹 Best Practices

✅ Use domain knowledge – the best features often come from understanding the problem.
✅ Avoid too many irrelevant features (risk of overfitting).
✅ Standardize new features if scales differ a lot.
✅ Try feature selection later to keep only useful ones.
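✅ Fit any feature that learns from the data (target encoding, group aggregates, scaling) on the training split only, then apply it to the test split, to avoid data leakage. Row-wise features such as BMI or date parts are safe to compute before splitting.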
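
The leakage point can be illustrated with a tiny target-encoding sketch. The data, the fixed 4/2 split, and the fallback to the training mean for unseen categories are all assumptions made for the example.

```python
import pandas as pd

data = pd.DataFrame({
    "city": ["A", "A", "B", "B", "A", "C"],
    "target": [1, 0, 1, 1, 1, 0],
})

# Fixed split for illustration: first four rows train, last two rows test
train = data.iloc[:4].copy()
test = data.iloc[4:].copy()

# Learn the encoding from the training rows only
city_means = train.groupby("city")["target"].mean()
global_mean = train["target"].mean()

train["city_enc"] = train["city"].map(city_means)
# Categories unseen during training fall back to the training global mean
test["city_enc"] = test["city"].map(city_means).fillna(global_mean)

print(train)
print(test)
```

The test targets never influence `city_means`, so the encoded test column contains no information about the labels it will be evaluated on.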

✅ In summary:
Feature Construction = turning raw data into powerful, informative features.
It includes mathematical transformations, interaction terms, date-time splits, aggregations, text features, and domain-specific knowledge.

# Feature Splitting


---

🔹 What is Feature Splitting?

Feature Splitting means dividing one feature (column) into multiple meaningful sub-features.
It is used when a feature is composite (contains more than one piece of information) or when splitting improves model interpretability and performance.

🔹 Why Do We Use Feature Splitting?

Extract more information – a single feature may hide patterns.

Improve model accuracy – many ML algorithms can only use a composite value once its parts are exposed as separate columns.

Handle categorical or text data – splitting can convert unstructured/combined data into structured form.

Enable feature engineering – allows creation of domain-specific features.

🔹 Examples of Feature Splitting
1. Date/Time Feature

Original feature: "2025-08-22 21:00:00"

After splitting:

Year = 2025

Month = 08

Day = 22

Hour = 21

DayOfWeek = Friday

📌 Useful for sales, traffic, weather, and seasonal predictions.

2. Full Name

Original feature: "John Smith"

After splitting:

First_Name = John

Last_Name = Smith

📌 Useful for customer data analysis.

3. Address or Location

Original feature: "221B Baker Street, London"

After splitting:

Street = Baker Street

House_Number = 221B

City = London

📌 Useful in geospatial analysis, delivery systems.
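
One way to split an address like this is a regular expression with named groups. The pattern below is only a sketch: it assumes every address follows the shape "house number, street, comma, city", and the second sample address is invented.

```python
import pandas as pd

addresses = pd.Series(["221B Baker Street, London", "10 Downing Street, London"])

# Named groups become column names; rows that do not match the assumed
# "<house number> <street>, <city>" shape would get NaN in every column
parts = addresses.str.extract(
    r"^(?P<House_Number>\d+\w*)\s+(?P<Street>[^,]+),\s*(?P<City>.+)$"
)

print(parts)
```

Real-world addresses are far less regular, so a pattern like this usually needs to be adapted to the data at hand.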

4. Combined Categorical Values

Original feature: "Red_Blue" (combined info)

After splitting:

Color1 = Red

Color2 = Blue

📌 Used in product datasets.

5. Numeric Ranges

Original feature: "10-20"

After splitting:

Lower = 10

Upper = 20

📌 Useful in age groups, salary brackets.

🔹 Example in Python (Pandas)
import pandas as pd

# Example dataset
data = {
    "Name": ["John Smith", "Alice Brown"],
    "DOB": ["1999-06-15", "2001-12-05"],
    "Salary_Range": ["30000-40000", "40000-50000"]
}

df = pd.DataFrame(data)

# Split Name into First and Last Name
# n must be passed by keyword in pandas 2.x; n=1 splits on the first space only
df[['First_Name', 'Last_Name']] = df['Name'].str.split(" ", n=1, expand=True)

# Split DOB into Year, Month, Day (parse the dates once, then reuse)
dob = pd.to_datetime(df['DOB'])
df['Year'] = dob.dt.year
df['Month'] = dob.dt.month
df['Day'] = dob.dt.day

# Split Salary_Range into Min and Max
df[['Salary_Min', 'Salary_Max']] = df['Salary_Range'].str.split("-", expand=True)
df[['Salary_Min', 'Salary_Max']] = df[['Salary_Min', 'Salary_Max']].astype(int)

print(df)

🔹 Output
          Name         DOB Salary_Range First_Name Last_Name  Year  Month  Day  Salary_Min  Salary_Max
0   John Smith  1999-06-15   30000-40000       John     Smith  1999      6   15       30000       40000
1  Alice Brown  2001-12-05   40000-50000      Alice     Brown  2001     12    5       40000       50000

🔹 Key Points

Feature splitting increases dimensionality, so apply carefully.

Too much splitting can lead to the curse of dimensionality.

Always check if the split features are useful and relevant for prediction.

In [65]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

In [66]:
df = pd.read_csv('titanic.csv')[['Age','Pclass','SibSp','Parch','Survived']]

In [67]:
df.head()

    Age  Pclass  SibSp  Parch  Survived
0  22.0       3      1      0         0
1  38.0       1      1      0         1
2  26.0       3      0      0         1
3  35.0       1      1      0         1
4  35.0       3      0      0         0


In [68]:
df.dropna(inplace=True)

In [69]:
x = df.iloc[:, 0:4].copy()  # copy so new columns can be added without a SettingWithCopyWarning
y = df.iloc[:, -1]

In [70]:
x.head()

    Age  Pclass  SibSp  Parch
0  22.0       3      1      0
1  38.0       1      1      0
2  26.0       3      0      0
3  35.0       1      1      0
4  35.0       3      0      0


In [71]:
np.mean(cross_val_score(LogisticRegression(),x,y,scoring='accuracy',cv=20))

np.float64(0.6933333333333332)

Applying Feature Construction

In [72]:
x['family_size'] = x['SibSp']+x['Parch']+1

In [73]:
x.head()

    Age  Pclass  SibSp  Parch  family_size
0  22.0       3      1      0            2
1  38.0       1      1      0            2
2  26.0       3      0      0            1
3  35.0       1      1      0            2
4  35.0       3      0      0            1


In [74]:
def myfunc(num):
    # Alone
    if num == 1:
        return 0
    # Small family (2 to 4 members)
    elif 1 < num <= 4:
        return 1
    # Large family (5 or more members)
    else:
        return 2

In [75]:
x['Family_type'] = x['family_size'].apply(myfunc)

In [76]:
x.head()

    Age  Pclass  SibSp  Parch  family_size  Family_type
0  22.0       3      1      0            2            1
1  38.0       1      1      0            2            1
2  26.0       3      0      0            1            0
3  35.0       1      1      0            2            1
4  35.0       3      0      0            1            0
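
The mapping done by `myfunc` could also be written without a row-by-row `apply`, for example with `pd.cut`. This is only an alternative sketch on a few hand-picked family sizes, not the notebook's own approach; it assumes `family_size` is always at least 1, which holds when it is built as `SibSp + Parch + 1`.

```python
import pandas as pd

family_size = pd.Series([1, 2, 4, 5, 11], name="family_size")

# (0, 1] -> 0 (alone), (1, 4] -> 1 (small family), (4, inf) -> 2 (large family)
family_type = pd.cut(
    family_size,
    bins=[0, 1, 4, float("inf")],
    labels=[0, 1, 2],
).astype(int)

print(pd.DataFrame({"family_size": family_size, "Family_type": family_type}))
```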


In [77]:
x.drop(columns=['SibSp','Parch','family_size'],inplace=True)

In [78]:
x.head()

    Age  Pclass  Family_type
0  22.0       3            1
1  38.0       1            1
2  26.0       3            0
3  35.0       1            1
4  35.0       3            0


In [79]:
np.mean(cross_val_score(LogisticRegression(),x,y,scoring='accuracy',cv=20))

np.float64(0.7003174603174602)

Feature Splitting

In [80]:
df = pd.read_csv('titanic.csv')

In [81]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [82]:
df['Name']

Unnamed: 0,Name
0,"Braund, Mr. Owen Harris"
1,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,"Heikkinen, Miss. Laina"
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
4,"Allen, Mr. William Henry"
...,...
886,"Montvila, Rev. Juozas"
887,"Graham, Miss. Margaret Edith"
888,"Johnston, Miss. Catherine Helen ""Carrie"""
889,"Behr, Mr. Karl Howell"


In [83]:
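# Text between the comma and the first period; strip() removes the leading space (" Mr" -> "Mr")
df['Title'] = df['Name'].str.split(',', expand=True)[1].str.split('.', expand=True)[0].str.strip()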

In [84]:
df[['Title','Name']]

Unnamed: 0,Title,Name
0,Mr,"Braund, Mr. Owen Harris"
1,Mrs,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,Miss,"Heikkinen, Miss. Laina"
3,Mrs,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
4,Mr,"Allen, Mr. William Henry"
...,...,...
886,Rev,"Montvila, Rev. Juozas"
887,Miss,"Graham, Miss. Margaret Edith"
888,Miss,"Johnston, Miss. Catherine Helen ""Carrie"""
889,Mr,"Behr, Mr. Karl Howell"
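
The same title extraction can be sketched with a single regular expression instead of two splits. The snippet uses three names copied from the output above so that it runs without `titanic.csv`; the pattern itself is an assumption about the "Surname, Title. Given names" layout.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Montvila, Rev. Juozas",
])

# Capture the text between the comma and the first period, then trim whitespace
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(titles)
```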


In [88]:
# Mean survival rate per title. Select 'Survived' before aggregating so that
# non-numeric columns such as Name and Ticket are not averaged.
df.groupby('Title')['Survived'].mean().sort_values(ascending=False)

In [90]:
df


In [92]:
df['Is_Married'] = 0
# Single-step .loc assignment avoids chained assignment; strip() guards against
# whitespace left around the title by the split
df.loc[df['Title'].str.strip() == 'Mrs', 'Is_Married'] = 1

In [93]:
df['Is_Married']
