## Feature Construction

Feature construction, a core step in feature engineering, is the process of creating new features from existing ones, or transforming existing features, to improve the performance of machine learning models. Features are the individual measurable properties of the data that serve as input to a learning algorithm. A well-constructed feature exposes structure that the model could not easily learn from the raw columns; for example, a passenger's sibling/spouse count and parent/child count can be combined into a single family-size feature.

## Feature Splitting

Feature splitting in machine learning refers to the process of dividing a single feature into multiple features based on certain criteria or conditions. This technique is commonly used to extract more information from a single feature, especially when the feature contains multiple pieces of information or when different parts of the feature have varying levels of importance for the predictive task at hand.

Feature splitting can be applied in various ways, depending on the nature of the data and the specific problem being addressed. Some common approaches to feature splitting include:

1. **Textual feature splitting**: In natural language processing tasks, textual features such as sentences or paragraphs can be split into individual words or tokens. This allows the model to capture the semantics and relationships between different words more effectively.

2. **Temporal feature splitting**: Time-related features, such as dates or timestamps, can be split into components such as year, month, day, hour, etc. This can help the model identify temporal patterns and seasonality in the data.

3. **Spatial feature splitting**: Spatial features, such as geographic coordinates, can be split into components like latitude and longitude. This allows the model to capture spatial relationships and geographic patterns more accurately.

4. **Numeric feature splitting**: Numeric features with a wide range of values can be split into bins or intervals. This discretization process can help the model handle non-linear relationships and improve interpretability.

5. **Categorical feature splitting**: Categorical features with multiple levels can be split into binary or multi-level indicator variables using techniques like one-hot encoding or binary encoding. This allows the model to handle categorical variables more efficiently.

6. **Feature splitting based on domain knowledge**: Features can also be split based on domain-specific knowledge or insights about the data. For example, in financial modeling, a single financial metric such as revenue can be split into components like gross revenue, net revenue, recurring revenue, etc., to better capture the underlying business dynamics.

By splitting features into meaningful components, machine learning models can often achieve better performance and generalization on complex datasets. The trade-offs still need to be weighed: splitting increases dimensionality (one-hot encoding a high-cardinality column can add many sparse columns), and some splits discard information (binning a numeric feature hides the differences between values inside the same bin).
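As a rough illustration of three of the approaches listed above (temporal, numeric, and categorical splitting), the following pandas sketch works on a small made-up DataFrame. The column names, values, and bin edges are assumptions chosen for the example and are not part of the dataset used later.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2021-01-15 08:30", "2021-07-04 17:45", "2022-12-31 23:59"]
    ),
    "income": [18_000, 52_000, 130_000],
    "city": ["Pune", "Delhi", "Pune"],
})

# Temporal splitting: one timestamp becomes several calendar components.
toy["year"] = toy["timestamp"].dt.year
toy["month"] = toy["timestamp"].dt.month
toy["hour"] = toy["timestamp"].dt.hour

# Numeric splitting: discretize a wide-ranging value into labelled bins.
# The bin edges are arbitrary assumptions for this sketch.
toy["income_band"] = pd.cut(
    toy["income"],
    bins=[0, 25_000, 75_000, float("inf")],
    labels=["low", "mid", "high"],
)

# Categorical splitting: one-hot encode the city into indicator columns.
toy = pd.get_dummies(toy, columns=["city"], dtype=int)

print(toy.drop(columns=["timestamp"]))
```

Each original column has been replaced or supplemented by several narrower ones, which is also where the dimensionality cost mentioned above comes from.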

Data = https://raw.githubusercontent.com/campusx-official/100-days-of-machine-learning/main/day45-feature-construction-and-feature-splitting/train.csv

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

In [2]:
# Load only the columns needed for this experiment
df = pd.read_csv('https://raw.githubusercontent.com/campusx-official/100-days-of-machine-learning/main/day45-feature-construction-and-feature-splitting/train.csv',
                 usecols=['Age', 'Pclass', 'SibSp', 'Parch', 'Survived'])

In [3]:
df.head()

| | Survived | Pclass | Age | SibSp | Parch |
|---|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 1 | 0 |
| 1 | 1 | 1 | 38.0 | 1 | 0 |
| 2 | 1 | 3 | 26.0 | 0 | 0 |
| 3 | 1 | 1 | 35.0 | 1 | 0 |
| 4 | 0 | 3 | 35.0 | 0 | 0 |


In [4]:
df.isnull().sum()

Survived      0
Pclass        0
Age         177
SibSp         0
Parch         0
dtype: int64

In [5]:
df.dropna(inplace=True)

In [6]:
df.sample(5)

| | Survived | Pclass | Age | SibSp | Parch |
|---|---|---|---|---|---|
| 43 | 1 | 2 | 3.0 | 1 | 2 |
| 210 | 0 | 3 | 24.0 | 0 | 0 |
| 327 | 1 | 2 | 36.0 | 0 | 0 |
| 192 | 1 | 3 | 19.0 | 1 | 0 |
| 873 | 0 | 3 | 47.0 | 0 | 0 |


In [7]:
df.isnull().sum()

Survived    0
Pclass      0
Age         0
SibSp       0
Parch       0
dtype: int64

In [8]:
x = df.drop(columns=['Survived'])
y = df['Survived']

In [9]:
x.head()

| | Pclass | Age | SibSp | Parch |
|---|---|---|---|---|
| 0 | 3 | 22.0 | 1 | 0 |
| 1 | 1 | 38.0 | 1 | 0 |
| 2 | 3 | 26.0 | 0 | 0 |
| 3 | 1 | 35.0 | 1 | 0 |
| 4 | 3 | 35.0 | 0 | 0 |


In [10]:
np.mean(cross_val_score(LogisticRegression(), x, y, scoring='accuracy', cv=20))

0.6933333333333332



- `np.mean`: This function calculates the mean (average) of a set of values. It's from the NumPy library.

- `cross_val_score`: This function is used for performing cross-validation on a machine learning model. It evaluates the model's performance by splitting the data into multiple subsets (folds), training the model on some of the folds, and then testing it on the remaining fold. This process is repeated multiple times, and the scores from each iteration are returned.

- `LogisticRegression()`: This is the logistic regression model used for classification. It was imported from scikit-learn in the first cell (`from sklearn.linear_model import LogisticRegression`) and is instantiated here with its default hyperparameters.

- `x`: This is the input features (independent variables) of the dataset.

- `y`: This is the target variable (dependent variable) of the dataset.

- `scoring='accuracy'`: This parameter specifies the evaluation metric. Here it is `'accuracy'`, the proportion of correctly classified instances.

- `cv=20`: This parameter specifies the number of folds (partitions) used in cross-validation. With an integer `cv` and a classifier, scikit-learn uses stratified folds, so the data is split into 20 folds that each preserve the class proportions of `y`. The model is trained on 19 folds and scored on the remaining one, and this is repeated 20 times.

Putting it all together, the line of code calculates the mean accuracy score of a logistic regression model using 20-fold cross-validation on the dataset defined by the input features `x` and the target variable `y`.
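To make the mechanics of that one-liner concrete, here is a minimal sketch that performs the same procedure by hand and compares it with `cross_val_score`. It runs on synthetic data from `make_classification` (an assumption made so the example is self-contained; it is not the Titanic data) and uses 5 folds instead of 20 to keep the output short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption; not the dataset used in this notebook).
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

# For a classifier and an integer cv, cross_val_score uses stratified,
# unshuffled folds, so we build the same splitter explicitly.
folds = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in folds.split(X_demo, y_demo):
    model = LogisticRegression()
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # score() returns accuracy for classifiers
    manual_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

library_scores = cross_val_score(
    LogisticRegression(), X_demo, y_demo, scoring="accuracy", cv=5
)

print("manual :", np.mean(manual_scores))
print("library:", np.mean(library_scores))
```

Both means should agree, since the loop fits a fresh model on each training split and scores it on the held-out fold, which is what `cross_val_score` does internally.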

### Applying Feature Construction

In [11]:
x['Family_size'] = x['SibSp'] + x['Parch'] + 1

In [12]:
x.head()

| | Pclass | Age | SibSp | Parch | Family_size |
|---|---|---|---|---|---|
| 0 | 3 | 22.0 | 1 | 0 | 2 |
| 1 | 1 | 38.0 | 1 | 0 | 2 |
| 2 | 3 | 26.0 | 0 | 0 | 1 |
| 3 | 1 | 35.0 | 1 | 0 | 2 |
| 4 | 3 | 35.0 | 0 | 0 | 1 |


In [13]:
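def familyfunc(num):
    # Map a family size to a category:
    # 0 = travelling alone, 1 = small family, 2 = large family
    if num == 1:
        return 0  # alone
    elif 1 < num <= 4:
        return 1  # small family (2 to 4 members)
    else:
        return 2  # large family (5 or more members)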

In [14]:
familyfunc(1)

0

In [15]:
familyfunc(2)

1

In [16]:
familyfunc(5)

2

In [17]:
x['Family_type'] = x['Family_size'].apply(familyfunc)

In [18]:
x.head()

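| | Pclass | Age | SibSp | Parch | Family_size | Family_type |
|---|---|---|---|---|---|---|
| 0 | 3 | 22.0 | 1 | 0 | 2 | 1 |
| 1 | 1 | 38.0 | 1 | 0 | 2 | 1 |
| 2 | 3 | 26.0 | 0 | 0 | 1 | 0 |
| 3 | 1 | 35.0 | 1 | 0 | 2 | 1 |
| 4 | 3 | 35.0 | 0 | 0 | 1 | 0 |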
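The same three-way grouping can also be expressed without a Python-level function. The sketch below is an alternative, not the approach used in this notebook: it uses `pd.cut` with bin edges chosen to reproduce the thresholds in `familyfunc`, and checks the two against each other on a few sample sizes (the sample values are made up for the example).

```python
import pandas as pd

def familyfunc(num):
    # Same rule as in the notebook: 0 = alone, 1 = small, 2 = large
    if num == 1:
        return 0
    elif 1 < num <= 4:
        return 1
    return 2

# Hypothetical sample of family sizes covering every category
family_size = pd.Series([1, 2, 4, 5, 8], name="Family_size")

# Right-closed bins: (0, 1] -> 0, (1, 4] -> 1, (4, inf) -> 2
family_type = pd.cut(
    family_size,
    bins=[0, 1, 4, float("inf")],
    labels=[0, 1, 2],
).astype(int)

print(family_type.tolist())
print(family_size.apply(familyfunc).tolist())
```

On a column this small the difference is cosmetic; `pd.cut` mainly helps when the thresholds need to be tuned, since they live in one list rather than in branching logic.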


In [19]:
x.drop(columns=['SibSp', 'Parch', 'Family_size'], inplace=True)

In [20]:
np.mean(cross_val_score(LogisticRegression(), x, y, scoring='accuracy', cv=20))

0.7003174603174602

### Feature Splitting

In [21]:
# Reload the full dataset, including the text columns
df = pd.read_csv('https://raw.githubusercontent.com/campusx-official/100-days-of-machine-learning/main/day45-feature-construction-and-feature-splitting/train.csv')

In [22]:
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [23]:
df['Name']

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

In [24]:
df['Title'] = df['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0]

In [25]:
df['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0]

0        Mr
1       Mrs
2      Miss
3       Mrs
4        Mr
       ... 
886     Rev
887    Miss
888    Miss
889      Mr
890      Mr
Name: 0, Length: 891, dtype: object
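The chained `str.split` calls above can also be written as a single regular-expression extraction. The following is a sketch of that alternative, run on three names taken from the output above; the pattern itself is an assumption about the name layout (`Surname, Title. Given names`) and has only been checked against these samples.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Montvila, Rev. Juozas",
])

# Capture the text between the first ", " and the following "."
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# The split-based version from the notebook, for comparison
titles_split = names.str.split(", ", expand=True)[1].str.split(".", expand=True)[0]

print(titles.tolist())
print(titles_split.tolist())
```

One practical difference: `str.extract` yields `NaN` for a name that does not match the pattern, whereas the split-based version can raise a `KeyError` if no name contains a comma at all.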

In [26]:
df[['Title','Name']]

| | Title | Name |
|---|---|---|
| 0 | Mr | Braund, Mr. Owen Harris |
| 1 | Mrs | Cumings, Mrs. John Bradley (Florence Briggs Th... |
| 2 | Miss | Heikkinen, Miss. Laina |
| 3 | Mrs | Futrelle, Mrs. Jacques Heath (Lily May Peel) |
| 4 | Mr | Allen, Mr. William Henry |
| ... | ... | ... |
| 886 | Rev | Montvila, Rev. Juozas |
| 887 | Miss | Graham, Miss. Margaret Edith |
| 888 | Miss | Johnston, Miss. Catherine Helen "Carrie" |
| 889 | Mr | Behr, Mr. Karl Howell |


In [27]:
# Select 'Survived' before aggregating so that non-numeric columns are not averaged
df.groupby('Title')['Survived'].mean().sort_values(ascending=False)


Title
the Countess    1.000000
Mlle            1.000000
Sir             1.000000
Ms              1.000000
Lady            1.000000
Mme             1.000000
Mrs             0.792000
Miss            0.697802
Master          0.575000
Col             0.500000
Major           0.500000
Dr              0.428571
Mr              0.156673
Jonkheer        0.000000
Rev             0.000000
Don             0.000000
Capt            0.000000
Name: Survived, dtype: float64
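Several of the survival rates above are exactly 0 or 1, which usually indicates a title held by very few passengers. A hedged sketch of how to show the group size next to the mean follows; it uses a tiny hand-made table (hypothetical values, not the real Titanic counts) so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy data; these are not the real Titanic counts.
toy = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Mr", "Mrs", "Mrs", "Lady"],
    "Survived": [0, 0, 1, 0, 1, 1, 1],
})

# Named aggregation: compute the mean and the number of rows per title
summary = (
    toy.groupby("Title")["Survived"]
    .agg(survival_rate="mean", passengers="count")
    .sort_values("survival_rate", ascending=False)
)

print(summary)
```

Applied to the real `df`, the same call would show how many passengers stand behind each rate, which helps when deciding whether a title is informative enough to keep as its own category.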