<a href="https://colab.research.google.com/github/angelaaaateng/ftw_python/blob/main/Week_09_05_techniques.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

05 Modeling Techniques
===
Now that we've gone through the basics, we'll showcase a few techniques that can further improve your model development workflow.


# Model-Based Feature Selection

##### **Feature selection** is a process where you automatically select those features in your data that contribute most to the prediction variable or output in which you are interested.


Irrelevant features in your data can decrease the accuracy of many models, especially linear algorithms like linear and logistic regression.

Three benefits of performing feature selection before modeling your data are:

1. **Reduces Overfitting**: Less redundant data means less opportunity to make decisions based on noise.
2. **Improves Accuracy**: Less misleading data means modeling accuracy improves.
3. **Reduces Training Time**: Less data means that algorithms train faster.
    
### Univariate Selection


### SelectKBest 
Selects the K best features according to the strength of their relationship with the output variable.
Different statistical tests can be used to measure that strength. The example below uses the chi-squared test (`chi2`), which requires non-negative feature values.


In [1]:
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, chi2
X, y = load_digits(return_X_y=True)
print(X.shape)
X_new = SelectKBest(chi2, k=20).fit_transform(X, y)
print(X_new.shape)

(1797, 64)
(1797, 20)


Each digit image originally has 64 features (one per pixel of the 8x8 image). After selection, only the 20 features with the highest chi-squared scores remain.
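To see which features were kept, the selector can be fitted on its own and inspected before transforming. This is a minimal sketch on the same digits data; the variable names `selector` and `kept_columns` are introduced here for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_digits(return_X_y=True)

# Fit the selector first so its fitted attributes can be inspected.
selector = SelectKBest(chi2, k=20).fit(X, y)

# get_support(indices=True) returns the column indices of the kept features.
kept_columns = selector.get_support(indices=True)
print(kept_columns)

# scores_ holds the chi-squared statistic of every original feature;
# here we only look at the scores of the features that were kept.
print(selector.scores_[kept_columns])

# transform() applies the selection and keeps only those columns.
X_new = selector.transform(X)
print(X_new.shape)
```

Keeping the fitted selector around also lets you apply exactly the same column selection to new data with `selector.transform`.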

### Recursive Feature Elimination (RFE)
Recursively removes attributes and builds a model on those that remain.

In [3]:
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = SVR(kernel="linear")
selector = RFE(estimator, n_features_to_select=5, step=1)
selector = selector.fit(X, y)
selector.support_
selector.ranking_

array([1, 1, 1, 1, 1, 6, 4, 3, 2, 5])
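In `ranking_`, a value of 1 marks a selected feature, and larger values mark features that were eliminated earlier. The sketch below repeats the same setup and uses `support_` (a boolean mask) together with `transform` to obtain the reduced data; `selected_columns` and `X_reduced` are names chosen here for illustration.

```python
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
selector = RFE(SVR(kernel="linear"), n_features_to_select=5, step=1).fit(X, y)

# support_ is a boolean mask: True marks a feature that was kept.
selected_columns = [i for i, keep in enumerate(selector.support_) if keep]
print(selected_columns)

# Every kept feature has rank 1 in ranking_.
print(selector.ranking_[selector.support_])

# transform() drops the eliminated columns.
X_reduced = selector.transform(X)
print(X_reduced.shape)
```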

### Principal Component Analysis

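Reduces the number of features by projecting the data onto a smaller set of new axes (principal components), ordered by how much of the variance they explain. Unlike the selection methods above, the resulting features are combinations of the original ones rather than a subset of them.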

In [4]:
import numpy as np
from sklearn.decomposition import PCA
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_ratio_)
print(pca.singular_values_)


[0.99244289 0.00755711]
[6.30061232 0.54980396]


In [5]:
pca = PCA(n_components=2, svd_solver='full')
pca.fit(X)
print(pca.explained_variance_ratio_)
print(pca.singular_values_)


[0.99244289 0.00755711]
[6.30061232 0.54980396]


In [6]:
pca = PCA(n_components=1, svd_solver='arpack')
pca.fit(X)
print(pca.explained_variance_ratio_)
print(pca.singular_values_)


[0.99244289]
[6.30061232]
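The cells above only fit the model. To actually reduce the data, call `fit_transform` (or `transform` on an already fitted model). The following is a small sketch on the same toy array; `X_reduced` and `X_restored` are illustrative names.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

# Keep only the first principal component: 2 features become 1.
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)

# inverse_transform maps the reduced data back to the original space.
# The difference from X is the information lost by dropping a component.
X_restored = pca.inverse_transform(X_reduced)
print(np.abs(X - X_restored).max())
```

Because the first component explains about 99% of the variance here, the reconstruction error is small.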


# Hold out strategies

### Train Test **Validation** Split

In [7]:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1) # 0.25 x 0.8 = 0.2
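A quick way to confirm that the two splits give roughly 60% train, 20% validation, and 20% test is to print the size of each part. This sketch repeats the splits so that it runs on its own.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# First split: hold out 20% of all samples as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Second split: 25% of the remaining 80% becomes the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=1)

total = len(X)
for name, part in [("train", X_train), ("val", X_val), ("test", X_test)]:
    print(f"{name}: {len(part)} samples ({len(part) / total:.0%})")
```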

### Cross Validation (F1 Macro Score)

In [8]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
scores = cross_val_score(clf, X, y, cv=5, scoring='f1_macro')
scores

# scores holds one value per fold; use scores.mean() to get a single value

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

array([0.92315119, 0.86956104, 0.94078712, 0.93790736, 0.89780237])
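The warning above suggests scaling the data or raising `max_iter`. The sketch below does both and then averages the fold scores into a single number. The use of `make_pipeline`, `StandardScaler`, and `max_iter=1000` are choices made here for illustration, not part of the original cell, and the scores will differ from the ones printed above.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# Scaling inside a pipeline means the scaler is refit on the
# training part of every fold, so no test-fold data leaks in.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(clf, X, y, cv=5, scoring='f1_macro')
print(scores)

# Summarize the five fold scores with their mean and spread.
mean_score = scores.mean()
print(f"{mean_score:.3f} +/- {scores.std():.3f}")
```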

# Pipeline 

In [9]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# add your data here
X_train,y_train = load_digits(return_X_y=True)

# it takes a list of tuples as parameter
pipeline = Pipeline([
    ('scaler',StandardScaler()),
    ('clf', LogisticRegression())
])

# use the pipeline object as you would
# a regular classifier
pipeline.fit(X_train,y_train)

Pipeline(steps=[('scaler', StandardScaler()), ('clf', LogisticRegression())])
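Once fitted, the pipeline applies every step in order when predicting or scoring, so held-out data is scaled with the statistics learned from the training data. The sketch below assumes a held-out split (the cell above fits on the full dataset) and sets `max_iter=1000` to give the solver room to converge; both are choices made here for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000))
])
pipeline.fit(X_train, y_train)

# score() scales X_test with the fitted scaler, then evaluates accuracy.
test_accuracy = pipeline.score(X_test, y_test)
print(test_accuracy)

# Individual steps stay accessible by the name given in the tuple.
scaler = pipeline.named_steps['scaler']
print(scaler.mean_.shape)
```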

# Grid Search

In [10]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
gsc = GridSearchCV(
        estimator=SVR(kernel='rbf'),
        param_grid={
            'C': [0.1, 1, 100],
            'epsilon': [0.0001, 0.0005, 0.001],
            'gamma': [0.0001, 0.001, 0.005]
        },
        cv=5, scoring='neg_mean_squared_error', verbose=0, n_jobs=-1)

grid_result = gsc.fit(X, y)
best_params = grid_result.best_params_
best_params

{'C': 100, 'epsilon': 0.0001, 'gamma': 0.001}
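Besides `best_params_`, a fitted search exposes the best cross-validated score, the full table of results, and a model refit on all the data. The sketch below is illustrative only: to keep it fast and self-contained it swaps in a small synthetic regression dataset (`make_friedman1`) and a smaller grid, so its numbers are unrelated to the output above.

```python
from sklearn.datasets import make_friedman1
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Small synthetic regression problem (an assumption for this example).
X, y = make_friedman1(n_samples=100, n_features=10, random_state=0)

search = GridSearchCV(
    estimator=SVR(kernel='rbf'),
    param_grid={'C': [0.1, 1, 100], 'gamma': [0.001, 0.01, 0.1]},
    cv=5, scoring='neg_mean_squared_error')
search.fit(X, y)

print(search.best_params_)
# The score is a negated MSE, so flip the sign to read it as an error.
print(-search.best_score_)

# cv_results_ lists every parameter combination with its mean score.
results = search.cv_results_
for params, mean in zip(results['params'], results['mean_test_score']):
    print(params, round(-mean, 3))

# best_estimator_ is refit on all of X, y and ready for prediction.
print(search.best_estimator_.predict(X[:3]))
```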

# Why not combine them?

In [11]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# add your data here
X_train,y_train = load_digits(return_X_y=True)

gsc = GridSearchCV(
        estimator=LogisticRegression(),
        param_grid={
            'C': [0.1, 1, 100, 1000]
        },
        # accuracy suits a classifier; mean squared error is a regression metric
        cv=5, scoring='accuracy', verbose=0, n_jobs=-1)

# it takes a list of tuples as parameter
pipeline = Pipeline([
    ('scaler',StandardScaler()),
    ('clf', gsc)
])

# use the pipeline object as you would
# a regular classifier
pipeline.fit(X_train,y_train)

Pipeline(steps=[('scaler', StandardScaler()),
                ('clf',
                 GridSearchCV(cv=5, estimator=LogisticRegression(), n_jobs=-1,
                              param_grid={'C': [0.1, 1, 100, 1000]},
                              scoring='accuracy'))])
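A common alternative arrangement is to put the pipeline inside the grid search rather than the other way around. The scaler is then refit on the training part of every fold, and parameters of any step can be tuned with the `<step name>__<parameter>` naming convention. The sketch below is one way this might look; the grid values, `max_iter=1000`, and the `accuracy` scoring are assumptions chosen for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train, y_train = load_digits(return_X_y=True)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000))
])

# 'clf__C' addresses the C parameter of the step named 'clf'.
search = GridSearchCV(
    estimator=pipeline,
    param_grid={'clf__C': [0.01, 0.1, 1]},
    cv=5, scoring='accuracy')
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_score_)

# best_estimator_ is the whole pipeline, refit with the best C.
print(search.best_estimator_.predict(X_train[:5]))
```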

# Model Persistence

In [12]:
from sklearn import svm
from sklearn import datasets
clf = svm.SVC()
X, y= datasets.load_iris(return_X_y=True)
clf.fit(X, y)


from joblib import dump, load
dump(clf, 'trained.mdl') 

['trained.mdl']

In [13]:
clf = load('trained.mdl') 
clf.predict(X[0:1])

array([0])
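A simple sanity check after saving is to confirm that the reloaded model gives the same predictions as the one in memory. The sketch below writes to a temporary directory so it leaves no files behind; the `tempfile` usage and the names `restored` and `same` are additions for illustration.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn import datasets, svm

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'trained.mdl')
    dump(clf, path)        # serialize the fitted model to disk
    restored = load(path)  # read it back into a new object

# The restored model should predict exactly like the original.
same = np.array_equal(clf.predict(X), restored.predict(X))
print(same)
```

Files written by `joblib` are pickle-based, so only load model files from sources you trust, and ideally load them with the same library versions that were used to save them.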