Dimensionality Reduction Using Feature Selection	
->Feature selection is the process of reducing the number of input variables when developing a predictive model.
->Feature selection techniques fall into two main types:
    1.supervised
    2.unsupervised
->Supervised methods may be further divided into:
    a.wrapper
    b.filter
    c.intrinsic
->Filter-based feature selection methods use statistical measures to score the correlation or dependence between each input variable and the target variable; the scores are then used to filter (select) the most relevant features.
->Statistical measures for feature selection must be carefully chosen based on the data type of the input variable and the output or response variable.


->Unsupervised: Do not use the target variable (e.g. remove redundant variables).
    a.Correlation
->Supervised: Use the target variable (e.g. remove irrelevant variables).
    a.Wrapper: Search for well-performing subsets of features.
        ->RFE (Recursive Feature Elimination)
    b.Filter: Select subsets of features based on their relationship with the target.
        ->Statistical Methods
        ->Feature Importance Methods
    c.Intrinsic: Algorithms that perform automatic feature selection during training.
        ->Decision Trees
->Dimensionality Reduction: Project input data into a lower-dimensional feature space.
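
As a minimal sketch of the intrinsic category above, a decision tree assigns an importance to every feature as a by-product of training, and those importances can be used to keep a subset. The synthetic dataset and the "mean" threshold are assumptions chosen for illustration, not part of the original notes.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (illustrative assumption): 10 features, only 3 informative.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# Training the tree also produces one importance score per feature.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
importances = tree.feature_importances_

# Keep only the features whose importance is at least the mean importance.
selector = SelectFromModel(tree, prefit=True, threshold="mean")
X_selected = selector.transform(X)

print(importances.round(3))
print(X.shape, "->", X_selected.shape)
```

Other thresholds (for example "median" or a fixed number) are equally reasonable; the right choice depends on the data.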

Common data types include numerical (such as height) and categorical (such as a label).


Common input variable data types:

a.Numerical Variables
    ->Integer Variables
    ->Floating Point Variables
b.Categorical Variables
    ->Boolean Variables (dichotomous)
    ->Ordinal Variables
    ->Nominal Variables


    Numerical Feature Variance:
    ->With numerical input variables and a numerical target, this is a regression predictive modeling problem.

    ->The most common techniques are to use a correlation coefficient, such as Pearson’s for a linear correlation, or rank-based methods for a nonlinear correlation.

        a.Pearson’s correlation coefficient (linear).
        b.Spearman’s rank coefficient (nonlinear, monotonic).
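
The difference between the two coefficients can be sketched on synthetic data. The example below assumes SciPy is available and uses made-up variables (`y_linear`, `y_monotonic`) purely for illustration: Pearson's coefficient captures the linear relationship well, while Spearman's rank coefficient also captures a nonlinear but monotonic one.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=200)

# A noisy linear relationship and a nonlinear, strictly increasing one.
y_linear = 2.0 * x + rng.normal(0, 0.5, size=200)
y_monotonic = np.exp(x)

r_linear, _ = pearsonr(x, y_linear)          # linear correlation
r_monotonic, _ = pearsonr(x, y_monotonic)    # understates the dependence
rho_monotonic, _ = spearmanr(x, y_monotonic) # rank-based correlation

print("Pearson  (linear):    %.3f" % r_linear)
print("Pearson  (monotonic): %.3f" % r_monotonic)
print("Spearman (monotonic): %.3f" % rho_monotonic)
```

Because Spearman's coefficient works on ranks, any strictly increasing relationship scores 1 regardless of its shape.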

    Binary Feature Variance
    Highly Correlated Features
    Correlation Statistics
        ->The scikit-learn library provides implementations of most of the useful statistical measures.
    Removing Irrelevant Features
    Recursively Eliminating Features
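
The first topics in the outline above can be sketched with scikit-learn and pandas. This is a minimal illustration on the iris dataset; the thresholds (0.5 for variance, 0.95 for correlation) and `k=2` are assumptions picked for the example, not recommended defaults.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

X, y = load_iris(return_X_y=True, as_frame=True)

# 1. Feature variance: drop features whose variance is not above the threshold.
vt = VarianceThreshold(threshold=0.5)
X_high_var = vt.fit_transform(X)

# 2. Highly correlated features: inspect the upper triangle of the absolute
#    correlation matrix and drop one feature from each highly correlated pair.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_uncorrelated = X.drop(columns=to_drop)

# 3. Removing irrelevant features: keep the k features with the highest
#    ANOVA F-value against the (categorical) target.
X_best = SelectKBest(f_classif, k=2).fit_transform(X, y)

print(X.shape, X_high_var.shape, X_uncorrelated.shape, X_best.shape)
print("Dropped as highly correlated:", to_drop)
```

Variance thresholding is unsupervised (it ignores `y`), whereas `SelectKBest` is a supervised filter method.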
   

In [2]:
#Recursively Eliminating Features (RFE)
from pandas import read_csv
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# Load the Pima Indians diabetes dataset (8 input columns, 1 class column)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]  # input features
Y = array[:,8]    # target class
# RFE refits the model and drops the weakest feature (step=1) until 5 remain
model = LogisticRegression(solver='lbfgs')
rfe = RFE(model, n_features_to_select=5, step=1)
fit = rfe.fit(X, Y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)  # True marks a selected feature
print("Feature Ranking: %s" % fit.ranking_)    # rank 1 marks a selected feature

Num Features: 5
Selected Features: [ True  True False False False  True  True  True]
Feature Ranking: [1 1 2 3 4 1 1 1]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
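
The warning above suggests raising `max_iter` or scaling the data. A minimal sketch of both is shown below; it uses the built-in breast cancer dataset as a stand-in (an assumption, so that the example runs without downloading the CSV), and `max_iter=1000` is an illustrative value. Note that scaling can change which features RFE selects.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (assumption): 30 numerical features, binary target.
X, y = load_breast_cancer(return_X_y=True)

# Standardize features to zero mean and unit variance so lbfgs converges faster.
X_scaled = StandardScaler().fit_transform(X)

# Give the solver a larger iteration budget as well.
model = LogisticRegression(solver="lbfgs", max_iter=1000)
rfe = RFE(model, n_features_to_select=5, step=1).fit(X_scaled, y)

print("Num Features: %d" % rfe.n_features_)
print("Selected Features: %s" % rfe.support_)
print("Feature Ranking: %s" % rfe.ranking_)
```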


In [3]:
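# RFE on a synthetic regression problem (Friedman #1) with a linear SVR estimator
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = SVR(kernel="linear")
# Eliminate one feature per iteration (step=1) until 5 remain
selector = RFE(estimator, n_features_to_select=5, step=1)
selector = selector.fit(X, y)
print(selector.support_)  # True marks a selected feature
print(selector.ranking_)  # rank 1 marks a selected feature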

[ True  True  True  True  True False False False False False]
[1 1 1 1 1 6 4 3 2 5]


Model Evaluation
->Evaluating a model quantitatively, using performance metrics computed on data not used for training
    Introduction
    Cross-Validating Models
    Creating a Baseline Regression Model
    Creating a Baseline Classification Model
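
One common reading of "baseline" in the outline above is a trivial model that ignores the inputs, against which real models are compared. The sketch below assumes that reading and uses scikit-learn's `DummyRegressor` and `DummyClassifier`; the datasets and strategies are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer, load_diabetes
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.model_selection import train_test_split

# Regression baseline: always predict the mean of the training targets.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
baseline_reg = DummyRegressor(strategy="mean").fit(X_train, y_train)
baseline_r2 = baseline_reg.score(X_test, y_test)  # R^2 on held-out data

# Classification baseline: always predict the most frequent training class.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
baseline_clf = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = baseline_clf.score(X_test, y_test)  # accuracy on held-out data

print("Baseline R^2:      %.3f" % baseline_r2)
print("Baseline accuracy: %.3f" % baseline_acc)
```

A trained model is only useful if it clearly beats these numbers on the same test split.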

In [6]:
#Cross-Validating Models
#Overfitting: a model can score well on its training data yet generalize poorly
    #eg: train on 1k cat images -> model (trained for n steps) -> test on 200 unseen cat images -> poor predictions if the model overfit

#(Grid Search)

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

X, y = datasets.load_iris(return_X_y=True)
X.shape, y.shape  # (150, 4), (150,)
# Hold out 40% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_train.shape, y_train.shape  # (90, 4), (90,)
X_test.shape, y_test.shape  # (60, 4), (60,)
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
print(clf.score(X_test, y_test))

# 5-fold cross-validation: fit and score the model on 5 different splits
from sklearn.model_selection import cross_val_score
clf = svm.SVC(kernel='linear', C=1, random_state=42)
scores = cross_val_score(clf, X, y, cv=5)
print(scores)

#The mean score and the standard deviation across the 5 folds
print(scores.mean(), scores.std())

0.9666666666666667
[0.96666667 1.         0.96666667 0.96666667 1.        ]
0.9800000000000001 0.016329931618554516
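
The cell above mentions grid search without showing it. A minimal sketch with `GridSearchCV` follows: each parameter combination is scored with 5-fold cross-validation on the training split only, and the held-out test split is used once at the end. The parameter grid is an illustrative assumption.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

# Candidate hyperparameters (illustrative values, not tuned recommendations).
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Every combination is evaluated with 5-fold cross-validation.
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)            # best combination found
print(search.best_score_)             # its mean cross-validated accuracy
print(search.score(X_test, y_test))   # accuracy on the held-out test set
```

Keeping the test split out of the search avoids an optimistic estimate of the final score.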


In [11]:
#Creating a Baseline Model with Logistic Regression (a classifier, despite the name, so this is a classification baseline)
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
cancer = datasets.load_breast_cancer()
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y = cancer.target
print(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
clf = LogisticRegression(max_iter=10000, random_state=42)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)



     mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0          17.99         10.38          122.80     1001.0          0.11840   
1          20.57         17.77          132.90     1326.0          0.08474   
2          19.69         21.25          130.00     1203.0          0.10960   
3          11.42         20.38           77.58      386.1          0.14250   
4          20.29         14.34          135.10     1297.0          0.10030   
..           ...           ...             ...        ...              ...   
564        21.56         22.39          142.00     1479.0          0.11100   
565        20.13         28.25          131.20     1261.0          0.09780   
566        16.60         28.08          108.30      858.1          0.08455   
567        20.60         29.33          140.10     1265.0          0.11780   
568         7.76         24.54           47.92      181.0          0.05263   

     mean compactness  mean concavity  mean concave points  mea

In [12]:
#Creating a Baseline Classification Model

import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import svm
cancer = datasets.load_breast_cancer()
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y = cancer.target
print(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
classifier = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
print(classifier.score(X_test, y_test))

     mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0          17.99         10.38          122.80     1001.0          0.11840   
1          20.57         17.77          132.90     1326.0          0.08474   
2          19.69         21.25          130.00     1203.0          0.10960   
3          11.42         20.38           77.58      386.1          0.14250   
4          20.29         14.34          135.10     1297.0          0.10030   
..           ...           ...             ...        ...              ...   
564        21.56         22.39          142.00     1479.0          0.11100   
565        20.13         28.25          131.20     1261.0          0.09780   
566        16.60         28.08          108.30      858.1          0.08455   
567        20.60         29.33          140.10     1265.0          0.11780   
568         7.76         24.54           47.92      181.0          0.05263   

     mean compactness  mean concavity  mean concave points  mea