### **Feature Construction and Feature Splitting**  

Both **Feature Construction** and **Feature Splitting** are techniques used in machine learning and data preprocessing to improve the quality of data for predictive modeling.  

---

### **1. Feature Construction**  
Feature Construction (also called Feature Creation) is the process of **creating new features** from existing ones to improve model performance. It is one part of the broader practice of feature engineering.  

#### **Why is Feature Construction Important?**  
- Helps capture complex relationships in data.  
- Enhances the predictive power of a model.  
- Can reduce dimensionality when one constructed feature replaces several original ones.  

#### **Examples of Feature Construction**  
1. **Combining Features**  
   - Example: If you have "height" (in metres) and "weight" (in kilograms) features, you can create a new feature **BMI (Body Mass Index)** = `weight / height²`.  
2. **Extracting Information**  
   - Example: From a "Date" column, you can extract new features like **Year, Month, Day, or Weekday**.  
3. **Transforming Features**  
   - Example: Applying a **log transformation** to right-skewed numerical data to reduce the skew and bring the distribution closer to normal.  
4. **Creating Polynomial Features**  
   - Example: If you have a feature `X`, you can create `X²` and `X³` to capture non-linear relationships.  
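
The four examples above can be sketched in pandas as follows. This is a minimal illustration on made-up data: the `people` DataFrame and all of its column names are assumptions, not part of any dataset used later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration only.
people = pd.DataFrame({
    "height_m": [1.60, 1.75, 1.82],
    "weight_kg": [55.0, 70.0, 95.0],
    "income": [20_000.0, 50_000.0, 400_000.0],
    "signup_date": pd.to_datetime(["2024-01-15", "2024-02-17", "2024-03-09"]),
})

# 1. Combining features: BMI = weight (kg) / height (m) squared
people["bmi"] = people["weight_kg"] / people["height_m"] ** 2

# 2. Extracting information from a date column
people["signup_year"] = people["signup_date"].dt.year
people["signup_month"] = people["signup_date"].dt.month
people["signup_weekday"] = people["signup_date"].dt.day_name()

# 3. Transforming a skewed feature; log1p computes log(1 + x), so zeros are safe
people["log_income"] = np.log1p(people["income"])

# 4. Polynomial features of a single column
people["height_sq"] = people["height_m"] ** 2
people["height_cu"] = people["height_m"] ** 3

print(people.head())
```

Whether any of these constructed columns actually helps depends on the model and the data, so it is worth checking each one with cross-validation before keeping it.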

---

### **2. Feature Splitting**  
Feature Splitting (also known as Feature Decomposition) is the process of **breaking a feature into multiple smaller features** to capture more information and improve model performance.  

#### **Why is Feature Splitting Useful?**  
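- Isolates the informative part of a compound value (for example, the title inside a name), so the model does not have to treat the whole value as one opaque string.  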
- Helps machine learning models understand categorical or text-based data better.  
- Improves interpretability of features.  

#### **Examples of Feature Splitting**  
1. **Splitting Text Features**  
   - Example: A "Full Name" column (`"John Doe"`) can be split into `"First Name"` (`"John"`) and `"Last Name"` (`"Doe"`).  
2. **Splitting Date/Time Features**  
   - Example: A "Timestamp" column (`"2024-02-17 10:30:00"`) can be split into `"Year"`, `"Month"`, `"Day"`, `"Hour"`, `"Minute"`.  
3. **Splitting Categorical Features**  
   - Example: A "Location" column (`"New York, USA"`) can be split into `"City"` (`"New York"`) and `"Country"` (`"USA"`).  
4. **Splitting Numerical Ranges**  
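   - Example: Instead of using raw income values (`50,000`), you can bin them into an `"Income Category"` feature (`"Low"`, `"Medium"`, `"High"`). Strictly speaking this is binning (discretization) rather than a split, since it replaces one numeric value with one coarser category.  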
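
As a rough sketch, the four splitting examples above could look like this in pandas. The `records` DataFrame, its column names, and the income thresholds are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data; names and values are assumptions for illustration only.
records = pd.DataFrame({
    "full_name": ["John Doe", "Jane Smith"],
    "timestamp": pd.to_datetime(["2024-02-17 10:30:00", "2023-11-05 08:15:00"]),
    "location": ["New York, USA", "Paris, France"],
    "income": [50_000, 120_000],
})

# 1. Text: split on the first space only, so longer surnames stay together
records[["first_name", "last_name"]] = records["full_name"].str.split(" ", n=1, expand=True)

# 2. Date/time: pull each component out through the .dt accessor
records["year"] = records["timestamp"].dt.year
records["month"] = records["timestamp"].dt.month
records["day"] = records["timestamp"].dt.day
records["hour"] = records["timestamp"].dt.hour
records["minute"] = records["timestamp"].dt.minute

# 3. Categorical: split "City, Country" on the comma-plus-space separator
records[["city", "country"]] = records["location"].str.split(", ", n=1, expand=True)

# 4. Numerical ranges: bin income with assumed thresholds (right edge inclusive)
records["income_category"] = pd.cut(
    records["income"],
    bins=[0, 40_000, 100_000, float("inf")],
    labels=["Low", "Medium", "High"],
)

print(records.drop(columns=["timestamp"]))
```

Real name and location strings are rarely this clean, so a split like this usually needs a check for rows that did not match the expected pattern.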

---

### **Key Differences**  
| Aspect  | Feature Construction | Feature Splitting |
|---------|----------------------|-------------------|
| **Definition** | Creating new features by combining or transforming existing ones. | Breaking an existing feature into multiple smaller features. |
| **Purpose** | Helps capture complex relationships and interactions in data. | Makes individual features more interpretable and specific. |
| **Example** | Creating a `"BMI"` feature from `"Height"` and `"Weight"`. | Splitting `"Full Name"` into `"First Name"` and `"Last Name"`. |

Both techniques are commonly used together to improve the quality of features for machine learning models. 🚀

# With the theory covered, the examples below apply both techniques to a different dataset: the Titanic passenger data.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
import warnings
warnings.filterwarnings('ignore')

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/train-or-house-train-dataset/house-train.csv
/kaggle/input/train-or-house-train-dataset/train (1).csv


In [2]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
import seaborn as sns

In [3]:
df = pd.read_csv('/kaggle/input/train-or-house-train-dataset/train (1).csv')[['Age','Pclass','SibSp','Parch','Survived']]

In [4]:
df.head()

Unnamed: 0,Age,Pclass,SibSp,Parch,Survived
0,22.0,3,1,0,0
1,38.0,1,1,0,1
2,26.0,3,0,0,1
3,35.0,1,1,0,1
4,35.0,3,0,0,0


In [5]:
df.dropna(inplace=True)

In [6]:
x = df.iloc[:,0:4].copy()  # copy, so new columns can be added to x without a SettingWithCopyWarning
y = df.iloc[:,-1]

In [7]:
x.head()

Unnamed: 0,Age,Pclass,SibSp,Parch
0,22.0,3,1,0
1,38.0,1,1,0
2,26.0,3,0,0
3,35.0,1,1,0
4,35.0,3,0,0


In [8]:
np.mean(cross_val_score(LogisticRegression(),x,y,scoring='accuracy',cv=20))

0.6933333333333332

# Applying Feature Construction

In [9]:
x['Family_size'] = x['SibSp'] + x['Parch'] + 1

In [10]:
def myfunc(num):
    # Bucket family size into 0 (alone), 1 (small family, 2 to 4) or 2 (large family, 5+)
    if num == 1:
        #alone
        return 0
    elif 1 < num <= 4:
        #small family
        return 1
    else:
        #large family
        return 2

In [11]:
myfunc(7)

2

In [12]:
x['Family_type'] = x['Family_size'].apply(myfunc)
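
A vectorized alternative to `apply(myfunc)` is `pd.cut`, which assigns each value to a bin in one call. The sketch below uses a made-up `family_size` Series rather than the notebook's `x`, and the bin edges are chosen to reproduce the same three buckets as `myfunc`.

```python
import pandas as pd

# Hypothetical family sizes covering every branch of the bucketing rule
family_size = pd.Series([1, 2, 4, 5, 7], name="Family_size")

# Bins are right-inclusive: (0, 1] -> 0 (alone), (1, 4] -> 1 (small), (4, inf] -> 2 (large)
family_type = pd.cut(
    family_size,
    bins=[0, 1, 4, float("inf")],
    labels=[0, 1, 2],
).astype(int)

print(family_type.tolist())
```

On a dataset this small the speed difference is negligible; the main benefit is that the bin edges sit in one place and are easy to change.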

In [13]:
x.head()

Unnamed: 0,Age,Pclass,SibSp,Parch,Family_size,Family_type
0,22.0,3,1,0,2,1
1,38.0,1,1,0,2,1
2,26.0,3,0,0,1,0
3,35.0,1,1,0,2,1
4,35.0,3,0,0,1,0


In [14]:
x.drop(columns=['SibSp','Parch','Family_size'],inplace=True)

In [15]:
x.head()

Unnamed: 0,Age,Pclass,Family_type
0,22.0,3,1
1,38.0,1,1
2,26.0,3,0
3,35.0,1,1
4,35.0,3,0


In [16]:
np.mean(cross_val_score(LogisticRegression(),x,y,scoring='accuracy',cv=20))

0.7003174603174602

# Feature Splitting

In [17]:
df = pd.read_csv('/kaggle/input/train-or-house-train-dataset/train (1).csv')

In [18]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [19]:
df['Name']

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

In [20]:
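# Take the text between the comma and the first period, then strip the leading space (' Mr' -> 'Mr')
df['Title'] = df['Name'].str.split(',',expand=True)[1].str.split('.',expand=True)[0].str.strip()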

In [21]:
df[['Title','Name']]

Unnamed: 0,Title,Name
0,Mr,"Braund, Mr. Owen Harris"
1,Mrs,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,Miss,"Heikkinen, Miss. Laina"
3,Mrs,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
4,Mr,"Allen, Mr. William Henry"
...,...,...
886,Rev,"Montvila, Rev. Juozas"
887,Miss,"Graham, Miss. Margaret Edith"
888,Miss,"Johnston, Miss. Catherine Helen ""Carrie"""
889,Mr,"Behr, Mr. Karl Howell"
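
The same title can also be pulled out with a single regular expression through `str.extract`, which avoids the two chained splits and handles the surrounding whitespace in one step. This is only a sketch on three names copied from the output above; the pattern assumes every name follows the `Surname, Title. Given names` layout.

```python
import pandas as pd

# Three names taken from the Name column shown above
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Montvila, Rev. Juozas",
])

# Capture the text between the comma and the first period, ignoring surrounding spaces
titles = names.str.extract(r",\s*([^.]+?)\s*\.", expand=False)

print(titles.tolist())
```

Names that do not match the pattern come back as `NaN`, which makes malformed rows easy to spot.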


In [22]:
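# Select 'Survived' before aggregating: calling mean() on every column raises a TypeError
# in recent pandas versions because columns such as Name and Ticket are not numeric
df.groupby('Title')['Survived'].mean().sort_values(ascending=False)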

In [None]:
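# Flag passengers titled 'Mrs' in one step, without chained assignment;
# strip() guards against any whitespace left around the title by the split
df['Is_Married'] = (df['Title'].str.strip() == 'Mrs').astype(int)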

In [None]:
df['Is_Married']