<div style="background:#67FFF0; color:#000; display: flex; justify-content:space-between;">
    <img src="https://lh3.googleusercontent.com/a/ACg8ocJ2Kso9dHD6qSdvkKkBE5_t0E20Sqa_DCTSSfRH53dl-sPyBZE=s576-c-no" style="width:100px; flex:end" alt="DATAIDEA">
    <h1 style="padding-left: 15px; font-weight:bold">Programming For Data Science Course
        <br>
        Week 7: Data Wrangling and Feature Engineering
    </h1>
</div>

## Feature Selection
Feature selection is a process where you automatically select those features in your data that contribute most to the prediction variable or output in which you are interested.

Having irrelevant features in your data can decrease the accuracy of many models, especially linear algorithms like linear and logistic regression.

Three benefits of performing feature selection before modeling your data are:

- Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
- Improves Accuracy: Less misleading data means modeling accuracy improves.
- Reduces Training Time: Less data means that algorithms train faster.

You can learn more about feature selection with scikit-learn in the article Feature selection.

In [1]:
from dataidea.tabular import *
from sklearn.model_selection import train_test_split

In [2]:
data = load('demo')

In [3]:
?load

Signature: load(name: str = None, inbuilt=True, file_type: str = None)
Docstring:
Easily load datasets that are inbuit in DATAIDEA

parameters:
name: this is the name of the dataset, eg demo, fpl, music, titanic etc
inbuilt: boolean to specify whether data is from DATAIDEA or custom data
type: specifies the type of the dataset eg 'csv', 'excel' etc
File:      ~/venvs/dataanalysis/lib/python3.10/site-packages/dataidea/datasets.py
Type:      function

In [8]:
data = load('../assets/demo_cleaned.csv', inbuilt=False, file_type='csv')

In [9]:
data

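|     | age | gender | marital_status | address | income | income_category | job_category |
|-----|-----|--------|----------------|---------|--------|-----------------|--------------|
| 0   | 55  | f      | 1              | 12      | 72     | 3               | 3            |
| 1   | 56  | m      | 0              | 29      | 153    | 4               | 3            |
| 2   | 24  | m      | 1              | 4       | 26     | 2               | 1            |
| 3   | 45  | m      | 0              | 9       | 76     | 4               | 2            |
| 4   | 44  | m      | 1              | 17      | 144    | 4               | 3            |
| ... | ... | ...    | ...            | ...     | ...    | ...             | ...          |
| 188 | 45  | f      | 0              | 3       | 86     | 4               | 3            |
| 189 | 23  | f      | 1              | 2       | 27     | 2               | 1            |
| 190 | 66  | f      | 1              | 32      | 11     | 1               | 2            |
| 191 | 49  | m      | 0              | 4       | 30     | 2               | 1            |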


#### Univariate Selection
Statistical tests can be used to select those features that have the strongest relationship with the output variable.

The scikit-learn library provides the SelectKBest class that can be used with a suite of different statistical tests to select a specific number of features.

Many different statistical tests can be used with this selection method. For example, the ANOVA F-value is appropriate for numerical inputs and a categorical target, as in the demo dataset used here, where marital_status is the target. It is available through the f_classif() function. The example below scores the features with this method and keeps the k highest-scoring ones.
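
To make the ANOVA F-value concrete, here is a minimal sketch on made-up data (the arrays `feature` and `y_toy` are hypothetical, not part of the demo dataset). It checks that the score reported by `f_classif` for one feature matches a one-way ANOVA computed across the class groups with `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)

# Hypothetical data: one numeric feature and a binary target
y_toy = np.array([0] * 50 + [1] * 50)
feature = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(1.0, 1.0, 50)])

# f_classif expects a 2-D feature matrix and returns (F-scores, p-values)
f_scores, p_values = f_classif(feature.reshape(-1, 1), y_toy)

# The same statistic from a one-way ANOVA across the two class groups
f_manual, p_manual = f_oneway(feature[y_toy == 0], feature[y_toy == 1])

print(f_scores[0], f_manual)
```

A larger F-value means the feature's mean differs more between classes relative to its spread within each class, which is why SelectKBest ranks such features higher.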

In [3]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

In [4]:
data.head(n=5)

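|     | age | gender | marital_status | address | income | income_category | job_category |
|-----|-----|--------|----------------|---------|--------|-----------------|--------------|
| 0   | 55  | f      | 1              | 12      | 72     | 3               | 3            |
| 1   | 56  | m      | 0              | 29      | 153    | 4               | 3            |
| 2   | 24  | m      | 1              | 4       | 26     | 2               | 1            |
| 3   | 45  | m      | 0              | 9       | 76     | 4               | 2            |
| 4   | 44  | m      | 1              | 17      | 144    | 4               | 3            |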


In [5]:
dummed_data = pd.get_dummies(data, 
                             columns=['gender'], 
                             drop_first=True,
                            dtype='int')
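
As a small illustration of what this encoding step produces, here is a sketch on a hypothetical three-row frame (`toy` is made up for this example and only mirrors two columns of the demo data). With `drop_first=True`, the first category is dropped and a single 0/1 column remains.

```python
import pandas as pd

# Hypothetical three-row frame mirroring two columns of the demo data
toy = pd.DataFrame({"age": [55, 56, 24], "gender": ["f", "m", "m"]})

# drop_first=True drops the first category ('f'), leaving one 0/1 column
encoded = pd.get_dummies(toy, columns=["gender"], drop_first=True, dtype="int")

print(encoded.columns.tolist())       # ['age', 'gender_m']
print(encoded["gender_m"].tolist())   # [0, 1, 1]
```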

In [6]:
X = dummed_data.drop('marital_status', axis=1)
y = dummed_data.marital_status

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

In [8]:
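# feature extraction: score every feature with the ANOVA F-test
# X has six columns, so k=6 keeps them all; lower k to actually filter
test = SelectKBest(score_func=f_classif, k=6)
fit = test.fit(X, y)
# summarize scores
test_scores = fit.scores_
# summarize selected features
features = fit.transform(X)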

In [9]:
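from sklearn.feature_selection import f_regression

# Note: f_regression is designed for continuous targets; for a categorical
# target such as marital_status, f_classif is the usual choice
test = SelectKBest(score_func=f_regression, k=5) # Select top 5 features, adjust k as needed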

# Fit the selector to the data
fit = test.fit(X, y)

# get scores
test_scores = fit.scores_

# summarize selected features
features = fit.transform(X)

# Get the selected feature indices
selected_indices = fit.get_support(indices=True)
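
The scores and the transformed array come back without column names. The following is a minimal sketch of how they could be mapped back to names. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the demo dataset), and the names `X_demo`, `y_demo` and `feature_0` to `feature_5` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for X and y (assumption: not the demo dataset)
X_arr, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

selector = SelectKBest(score_func=f_classif, k=3).fit(X_demo, y_demo)

# Pair each ANOVA F-score with its column name, highest first
scores = pd.Series(selector.scores_, index=X_demo.columns).sort_values(ascending=False)
print(scores)

# get_support() returns a boolean mask over the original columns
selected_columns = X_demo.columns[selector.get_support()].tolist()
print(selected_columns)
```

The same pattern should work with the notebook's own `X` and `y` in place of the synthetic data.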

#### Recursive Feature Elimination
Recursive Feature Elimination (RFE) works by recursively removing attributes and building a model on those that remain.

At each step it uses the fitted model's coefficients or feature importances to rank the attributes and discards the weakest, until the requested number of features is left.

You can learn more about the RFE class in the scikit-learn documentation.

In [36]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

In [118]:
# feature extraction
model = LogisticRegression()
rfe = RFE(model)
fit = rfe.fit(X, y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)

Num Features: 3
Selected Features: [False False False  True  True  True]
Feature Ranking: [2 3 4 1 1 1]
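
The boolean mask and ranking are easier to read next to column names. Below is a minimal sketch on synthetic data (an assumption; `X_demo`, `y_demo` and the `feature_*` names are hypothetical). It sets `n_features_to_select` explicitly instead of relying on the default, which keeps half of the features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for X and y (assumption: not the demo dataset)
X_arr, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

# Ask explicitly for 3 features; max_iter is raised to help convergence
rfe_demo = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe_demo.fit(X_demo, y_demo)

# Rank 1 marks a kept feature; higher ranks were eliminated earlier
ranking = pd.Series(rfe_demo.ranking_, index=X_demo.columns).sort_values()
print(ranking)

kept = X_demo.columns[rfe_demo.support_].tolist()
print(kept)
```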


#### Feature Importance
Bagged decision trees like Random Forest and Extra Trees can be used to estimate the importance of features.

In the example below we construct an ExtraTreesClassifier for the demo dataset. You can learn more about the ExtraTreesClassifier class in the scikit-learn API.

In [77]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier

In [120]:
# feature extraction
model = ExtraTreesClassifier(n_estimators=100)
model.fit(X, y)
print(model.feature_importances_)

[0.29681826 0.24594035 0.25805697 0.07274302 0.08510347 0.04133793]
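
The importances are printed in column order, with no labels. Here is a minimal sketch of labelling and sorting them, again on synthetic data (an assumption; the names `X_demo`, `y_demo` and `feature_*` are hypothetical).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for X and y (assumption: not the demo dataset)
X_arr, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Importances are normalized to sum to 1; label them and sort, largest first
importances = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

Fixing `random_state` makes the numbers repeatable between runs. Without it, the importances change slightly each time the forest is refit.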


In [115]:
rfc = RandomForestClassifier()

In [116]:
rfc.fit(X_train, y_train)
rfc.score(X_test, y_test)

0.4897959183673469

In [91]:
X.head(n=3)

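|     | age | address | income | income_category | job_category | gender_m |
|-----|-----|---------|--------|-----------------|--------------|----------|
| 0   | 55  | 12      | 72     | 3               | 3            | 0        |
| 1   | 56  | 29      | 153    | 4               | 3            | 1        |
| 2   | 24  | 4       | 26     | 2               | 1            | 1        |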


In [117]:
# refit using only the three features that RFE selected
selected_columns = ['income_category', 'job_category', 'gender_m']
rfc.fit(X_train[selected_columns], y_train)
rfc.score(X_test[selected_columns], y_test)

0.4897959183673469
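
A single train/test split gives one accuracy figure that depends on which rows land in the test set. One possible way to compare the full feature set against a subset more steadily is k-fold cross-validation. This is a swapped-in technique, not what the cells above do. The sketch uses synthetic data (an assumption), and the choice of the first three columns as the subset is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X and y (assumption: not the demo dataset)
X_arr, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

# 5-fold cross-validated accuracy with all six features
full_scores = cross_val_score(
    RandomForestClassifier(random_state=0), X_arr, y_demo, cv=5
)

# Same model on an arbitrary three-column subset, for illustration only
subset_scores = cross_val_score(
    RandomForestClassifier(random_state=0), X_arr[:, :3], y_demo, cv=5
)

print("all features:", full_scores.mean())
print("subset:      ", subset_scores.mean())
```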

<div style="font-style: futura; background:#67FFF0; color:#000;
    padding:15px">
    <h1>Do you seriously want to learn Programming and Data Analysis with Python?</h1>
    <p>
If you’re serious about learning Programming, Data Analysis with Python and getting prepared for Data Science roles, I highly encourage you to enroll in my Programming for Data Science Course, which I've taught to hundreds of students. Don’t waste your time following disconnected, outdated tutorials.
    </p>
    <p>
    My Complete Programming for Data Science Course has everything you need in one place. 
    </p>
    <ul>
        The course offers:
        <li>Duration: Usually 3-4 months</li>
        <li>Sessions: Four times a week (one on one)</li>
        <li>Location: Online or/and at UMF House, Sir Apollo Kagwa Road</li>
    </ul>
    <ul>
        What you'll learn:
        <li>Fundamentals of programming</li>
        <li>Data manipulation and analysis </li>
        <li>Visualization techniques</li>
        <li>Introduction to machine learning</li>
        <li>Database Management with SQL (optional)</li>
        <li>Web Development with Django (optional)</li>
    </ul>
    <ul style="list-style: none">
    <li>Best</li>
    <li>Juma Shafara</li>
    <li>Data Scientist, Instructor</li>
    <li>
    <a href="mailto:jumashafara0@gmail.com">jumashafara0@gmail.com</a>
    </li>
        <li>
    <a href="tel:+256701520768">+256701520768</a> or <a href="tel:+256771754118">+256771754118</a> 
        </li>
    </ul>
    <div>
        <img src='../assets/profile.jpg' style="width:100px" alt="Juma Shafara">
        <img src='../assets/dataidea-logo.png' style="width:100px" alt="DATAIDEA">
    </div>
</div>
