<a href="https://colab.research.google.com/github/faisu6339-glitch/Machine-learning/blob/main/F10_Feature_Construction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Feature Construction Explained

Feature construction (also called feature creation, and a core part of feature engineering) is the process of deriving new features from raw data so that they better represent the underlying problem to predictive models, which can improve model accuracy on unseen data. It is often one of the most impactful steps in applied machine learning.

#### What are Features?

In machine learning, features are individual measurable properties or characteristics of a phenomenon being observed. They are the input variables that a model uses to make predictions. For example, in a house price prediction model, features might include 'number of bedrooms', 'square footage', 'location', 'year built', etc.

#### Why is Feature Construction Important?

1.  **Improved Model Performance:** Well-engineered features can significantly boost the performance of a machine learning model, often more than complex algorithms or hyperparameter tuning. It helps the model 'see' patterns that might be hidden in the raw data.
2.  **Domain Knowledge Integration:** Feature construction is where domain expertise truly shines. Understanding the problem context allows you to create features that are meaningful and relevant.
3.  **Simplification:** Sometimes, complex relationships can be simplified into a single, highly informative feature, making the model easier to train and interpret.
4.  **Handling Data Issues:** It can help address issues like missing values, outliers, or skewed distributions by creating new, robust features.

#### Common Techniques and Strategies:

1.  **Binning/Discretization:**
    *   **Concept:** Converting continuous numerical features into categorical features (bins).
    *   **Example:** Grouping 'age' into 'child', 'teenager', 'adult', 'senior'. This can help capture non-linear relationships.

2.  **One-Hot Encoding/Label Encoding:**
    *   **Concept:** Converting categorical features into a numerical format that machine learning algorithms can understand.
    *   **One-Hot Encoding:** Creates new binary columns for each category (e.g., 'Color_Red', 'Color_Blue').
    *   **Label Encoding:** Assigns a unique integer to each category (e.g., 'Red': 0, 'Blue': 1). Use with care on nominal categories: the integers imply an order ('Blue' > 'Red') that many models will treat as meaningful.

3.  **Interaction Features:**
    *   **Concept:** Combining two or more features to create a new feature that captures their combined effect.
    *   **Example:** `Age * Income` or `Number_of_Bedrooms / Square_Footage` (density).

4.  **Polynomial Features:**
    *   **Concept:** Creating new features by raising existing features to a certain power.
    *   **Example:** For a feature 'X', creating `X^2`, `X^3`, etc. This helps capture non-linear relationships.

5.  **Lag Features (Time Series Data):**
    *   **Concept:** Using past values of a time series as features to predict future values.
    *   **Example:** In sales forecasting, `Sales_Yesterday`, `Sales_Last_Week`.

6.  **Window Features (Time Series Data):**
    *   **Concept:** Calculating statistics over a moving window of time.
    *   **Example:** `Average_Sales_Last_7_Days`, `Max_Temperature_Last_24_Hours`.

7.  **Feature Scaling (Standardization/Normalization):**
    *   **Concept:** Rescaling numerical features to a standard range or distribution. While not strictly *creating* a new feature, it transforms existing ones into a more suitable format for many algorithms.
    *   **Standardization:** Transforms data to have a mean of 0 and standard deviation of 1.
    *   **Normalization:** Scales data to a fixed range, usually 0 to 1.

8.  **Datetime Features:**
    *   **Concept:** Extracting components from date/time columns to create new features.
    *   **Example:** `Year`, `Month`, `Day`, `Day_of_Week`, `Day_of_Year`, `Is_Weekend`, `Hour`.

9.  **Text Features (NLP):**
    *   **Concept:** Transforming raw text into numerical representations.
    *   **Examples:** Bag-of-Words, TF-IDF, Word Embeddings (e.g., Word2Vec, BERT embeddings), character counts, word counts.

10. **Geospatial Features:**
    *   **Concept:** Deriving features from location data.
    *   **Examples:** Distance to nearest landmark, population density of an area, proximity to a specific point of interest.

11. **Aggregations:**
    *   **Concept:** Summarizing information from related rows or groups.
    *   **Example:** For a customer, `Total_Purchases`, `Average_Order_Value`, `Time_Since_Last_Purchase`.
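
Several of the tabular techniques above can be sketched with pandas alone. The snippet below is a minimal illustration on a made-up DataFrame: the column names, values, and bin edges are assumptions chosen for the example and are not taken from the Titanic data used later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
toy = pd.DataFrame({
    "customer": ["a", "a", "b", "b", "b"],
    "date": pd.to_datetime(
        ["2024-01-05", "2024-01-06", "2024-01-06", "2024-01-07", "2024-01-08"]
    ),
    "age": [8, 8, 35, 35, 35],
    "color": ["Red", "Blue", "Red", "Red", "Blue"],
    "sales": [10.0, 20.0, 5.0, 15.0, 25.0],
})

# Binning: continuous age -> ordered categories (bin edges are assumptions).
toy["age_group"] = pd.cut(
    toy["age"],
    bins=[0, 12, 19, 64, np.inf],
    labels=["child", "teenager", "adult", "senior"],
)

# One-hot encoding: one binary column per category.
toy = pd.concat([toy, pd.get_dummies(toy["color"], prefix="Color")], axis=1)

# Interaction feature: product of two numeric columns.
toy["age_x_sales"] = toy["age"] * toy["sales"]

# Lag and window features, computed per customer in date order.
toy = toy.sort_values(["customer", "date"])
toy["sales_prev"] = toy.groupby("customer")["sales"].shift(1)
# Note: this window includes the current row; shift first if the
# current value is what you are trying to predict.
toy["sales_mean_2"] = toy.groupby("customer")["sales"].transform(
    lambda s: s.rolling(window=2, min_periods=1).mean()
)

# Datetime features extracted from the date column.
toy["day_of_week"] = toy["date"].dt.dayofweek  # Monday = 0
toy["is_weekend"] = toy["date"].dt.dayofweek >= 5

# Aggregation: per-customer total broadcast back to every row.
toy["customer_total_sales"] = toy.groupby("customer")["sales"].transform("sum")

print(toy.head())
```

Using `transform` for the aggregation keeps the result aligned with the original rows, so the summary can be used directly as a model input.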

#### The Process of Feature Construction:

1.  **Understand the Domain:** Deep knowledge of the problem and the data is crucial. What are the key drivers? What relationships might exist?
2.  **Explore Data:** Use EDA (Exploratory Data Analysis) to identify patterns, distributions, correlations, and potential issues.
3.  **Brainstorm Features:** Based on domain knowledge and EDA, hypothesize new features that might be useful.
4.  **Create Features:** Implement the new features using programming libraries (e.g., pandas in Python).
5.  **Evaluate Features:** Test the impact of new features on model performance. Use techniques like feature importance, correlation analysis, or simply train and evaluate models with and without the new features.
6.  **Iterate:** Feature construction is an iterative process. You might go back to step 1 or 2 based on evaluation results.
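
The "create, then evaluate" steps can be sketched as a with/without comparison. The example below uses synthetic data in which the target depends on the product of two inputs, so an interaction feature should help; the data, the helper name `mean_cv_accuracy`, and the fold count are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: the label depends on the sign of x1 * x2.
rng = np.random.default_rng(0)
X_base = pd.DataFrame({"x1": rng.normal(size=400), "x2": rng.normal(size=400)})
y_toy = (X_base["x1"] * X_base["x2"] > 0).astype(int)


def mean_cv_accuracy(features, target, folds=5):
    """Mean cross-validated accuracy of a logistic regression (hypothetical helper)."""
    model = LogisticRegression(max_iter=1000)
    scores = cross_val_score(model, features, target, scoring="accuracy", cv=folds)
    return scores.mean()


# Construct the candidate feature, then score both feature sets the same way.
X_new = X_base.assign(x1_x2=X_base["x1"] * X_base["x2"])
score_without = mean_cv_accuracy(X_base, y_toy)
score_with = mean_cv_accuracy(X_new, y_toy)
print(f"without: {score_without:.3f}, with: {score_with:.3f}")
```

Scoring both feature sets with the same model and the same folds keeps the comparison fair; a small difference on real data may still be within cross-validation noise.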

#### Tools and Libraries:

*   **Pandas:** For data manipulation and creating new columns.
*   **Scikit-learn:** Provides transformers for scaling, encoding, and creating polynomial features.
*   **Featuretools:** An automated feature engineering library.
*   **Numpy:** For numerical operations.
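
As a rough illustration of the scikit-learn transformers mentioned above, the sketch below applies standardization, min-max normalization, and polynomial expansion to a tiny made-up column. The input values are arbitrary and chosen only to make the results easy to check by hand.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures, StandardScaler

# A single numeric feature with four made-up observations.
values = np.array([[1.0], [2.0], [3.0], [4.0]])

# Standardization: mean 0, standard deviation 1.
standardized = StandardScaler().fit_transform(values)

# Normalization: rescale to the range [0, 1].
normalized = MinMaxScaler().fit_transform(values)

# Polynomial features: columns are X, X^2, X^3 (bias column omitted).
powers = PolynomialFeatures(degree=3, include_bias=False).fit_transform(values)

print(standardized.ravel())
print(normalized.ravel())
print(powers)
```

In a real workflow these transformers would be fitted on the training split only (for example inside a `Pipeline`) so that statistics from the test data do not leak into training.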

In summary, feature construction is an art and a science. It requires creativity, domain expertise, and a systematic approach to unlock the full potential of your data for machine learning models.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression



In [2]:
df=pd.read_csv('Titanic.csv')[['Age','Pclass','SibSp','Parch','Survived']]

In [3]:
df.head()

Unnamed: 0,Age,Pclass,SibSp,Parch,Survived
0,22.0,3,1,0,0
1,38.0,1,1,0,1
2,26.0,3,0,0,1
3,35.0,1,1,0,1
4,35.0,3,0,0,0


In [4]:
df.isnull().mean()*100

Unnamed: 0,0
Age,19.86532
Pclass,0.0
SibSp,0.0
Parch,0.0
Survived,0.0


In [5]:
df.dropna(inplace=True)

In [6]:
df.isnull().mean()*100

Unnamed: 0,0
Age,0.0
Pclass,0.0
SibSp,0.0
Parch,0.0
Survived,0.0


In [7]:
df.head()

Unnamed: 0,Age,Pclass,SibSp,Parch,Survived
0,22.0,3,1,0,0
1,38.0,1,1,0,1
2,26.0,3,0,0,1
3,35.0,1,1,0,1
4,35.0,3,0,0,0


In [9]:
X = df.drop(columns=['Survived'])
y = df['Survived']

In [10]:
np.mean(cross_val_score(LogisticRegression(),X,y,scoring='accuracy',cv=20))

np.float64(0.6933333333333332)

### Applying Feature Construction

In [11]:
X['Family_size']=X['SibSp']+X['Parch']+1

In [12]:
X.head()

Unnamed: 0,Age,Pclass,SibSp,Parch,Family_size
0,22.0,3,1,0,2
1,38.0,1,1,0,2
2,26.0,3,0,0,1
3,35.0,1,1,0,2
4,35.0,3,0,0,1


In [13]:
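def myfunc(num):
    """Map a family size to a coarse family-type code."""
    if num == 1:
        return 0  # travelling alone
    elif 1 < num <= 4:
        return 1  # small family (2 to 4 members)
    else:
        return 2  # large family (5 or more members)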

In [14]:
myfunc(4)

1

In [15]:
X['Family_type']=X['Family_size'].apply(myfunc)

In [16]:
X.head()

Unnamed: 0,Age,Pclass,SibSp,Parch,Family_size,Family_type
0,22.0,3,1,0,2,1
1,38.0,1,1,0,2,1
2,26.0,3,0,0,1,0
3,35.0,1,1,0,2,1
4,35.0,3,0,0,1,0


In [17]:
X.drop(columns=['SibSp','Parch','Family_size'],inplace=True)

In [18]:
X.head()

Unnamed: 0,Age,Pclass,Family_type
0,22.0,3,1
1,38.0,1,1
2,26.0,3,0
3,35.0,1,1
4,35.0,3,0


In [19]:
np.mean(cross_val_score(LogisticRegression(),X,y,scoring='accuracy',cv=20))

np.float64(0.7003174603174602)
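
The family-type mapping above is a form of binning, so it can also be written without a row-wise `apply`. The following is a sketch of one vectorized alternative using `pd.cut` with the same thresholds; the sample values are made up, and this is not the approach the notebook itself uses.

```python
import numpy as np
import pandas as pd

# Hypothetical family sizes for illustration.
family_size = pd.Series([1, 2, 4, 5, 11], name="Family_size")

# Intervals are (0, 1], (1, 4], (4, inf) -> codes 0 (alone), 1 (small), 2 (large).
family_type = pd.cut(
    family_size,
    bins=[0, 1, 4, np.inf],
    labels=[0, 1, 2],
).astype(int)

print(family_type.tolist())
```

Keeping the bin edges in one list makes the thresholds easy to change when experimenting with different groupings.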

### Feature Splitting

In [20]:
df=pd.read_csv("Titanic.csv")

In [21]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [22]:
df['Name']

Unnamed: 0,Name
0,"Braund, Mr. Owen Harris"
1,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,"Heikkinen, Miss. Laina"
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
4,"Allen, Mr. William Henry"
...,...
886,"Montvila, Rev. Juozas"
887,"Graham, Miss. Margaret Edith"
888,"Johnston, Miss. Catherine Helen ""Carrie"""
889,"Behr, Mr. Karl Howell"


In [23]:
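# Take the text between the comma and the first period, then strip the leading space
df['Title'] = df['Name'].str.split(',', expand=True)[1].str.split('.', expand=True)[0].str.strip()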

In [26]:
df['Name'].str.split(',',expand=True)[1]

Unnamed: 0,1
0,Mr. Owen Harris
1,Mrs. John Bradley (Florence Briggs Thayer)
2,Miss. Laina
3,Mrs. Jacques Heath (Lily May Peel)
4,Mr. William Henry
...,...
886,Rev. Juozas
887,Miss. Margaret Edith
888,"Miss. Catherine Helen ""Carrie"""
889,Mr. Karl Howell


In [27]:
df['Name'].str.split(',',expand=True)[1].str.split('.',expand=True)

Unnamed: 0,0,1,2
0,Mr,Owen Harris,
1,Mrs,John Bradley (Florence Briggs Thayer),
2,Miss,Laina,
3,Mrs,Jacques Heath (Lily May Peel),
4,Mr,William Henry,
...,...,...,...
886,Rev,Juozas,
887,Miss,Margaret Edith,
888,Miss,"Catherine Helen ""Carrie""",
889,Mr,Karl Howell,


In [25]:
df[['Title','Name']]

Unnamed: 0,Title,Name
0,Mr,"Braund, Mr. Owen Harris"
1,Mrs,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,Miss,"Heikkinen, Miss. Laina"
3,Mrs,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
4,Mr,"Allen, Mr. William Henry"
...,...,...
886,Rev,"Montvila, Rev. Juozas"
887,Miss,"Graham, Miss. Margaret Edith"
888,Miss,"Johnston, Miss. Catherine Helen ""Carrie"""
889,Mr,"Behr, Mr. Karl Howell"
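
The same title can be pulled out in one step with a regular expression. This is a sketch of an alternative to the two chained splits, shown on a few of the names that appear in the output above; the pattern assumes every name follows the "Surname, Title. Given names" layout.

```python
import pandas as pd

# A few names copied from the Titanic output above.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Montvila, Rev. Juozas",
])

# Capture everything between the comma and the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(titles.tolist())
```

Because the pattern consumes the whitespace after the comma, the extracted titles carry no leading space, which matters for later equality checks such as `Title == 'Mrs'`.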


In [29]:
df.groupby('Title')['Survived'].mean().sort_values(ascending=False)

Title,Survived
Lady,1.0
Ms,1.0
Sir,1.0
Mme,1.0
the Countess,1.0
Mlle,1.0
Mrs,0.792
Miss,0.697802
Master,0.575
Major,0.5


In [30]:
# Flag passengers titled 'Mrs' in a single assignment (avoids chained assignment);
# str.strip() guards against stray whitespace left in Title by the split
df['IS_Married'] = (df['Title'].str.strip() == 'Mrs').astype(int)

In [32]:
df['IS_Married']

Unnamed: 0,IS_Married
0,0
1,1
2,0
3,1
4,0
...,...
886,0
887,0
888,0
889,0
