In [38]:
import numpy as np
import pandas as pd
from sklearn import metrics
from scipy.spatial.distance import cdist
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import Isomap


1. Take one of the supervised learning models you have built recently and apply at least
three dimensionality reduction techniques to it (separately). Be sure to create a short
summary of each technique you use. Indicate how each changed the model
performance. Reference:
https://machinelearningmastery.com/dimensionality-reduction-algorithms-with-python/

In [18]:
diabetes_df = pd.read_csv("../week_13/diabetes.csv")

X = diabetes_df.drop('Outcome', axis=1)
y = diabetes_df['Outcome']

# Standardize the inputs so each feature has zero mean and unit variance
scaler = StandardScaler()
X = scaler.fit_transform(X)
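
Fitting the scaler on all of `X` before cross-validation lets every validation fold influence the scaling statistics. A minimal sketch of one alternative is to put the scaler inside the pipeline so it is refit on each training fold. The data here is a synthetic stand-in (an assumption: 8 numeric features and a binary target), and the names `X_raw`, `y_demo`, `scaled_model`, `demo_cv`, and `demo_scores` are illustrative only, so the printed accuracy is not comparable to the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes data (assumed shape: 8 features, 2 classes).
X_raw, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_redundant=0, random_state=42
)

# The scaler is a pipeline step, so cross_val_score refits it on each training fold.
scaled_model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("logistic", LogisticRegression(random_state=42)),
])

demo_cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
demo_scores = cross_val_score(scaled_model, X_raw, y_demo, scoring="accuracy", cv=demo_cv)
print("Accuracy: %.3f" % demo_scores.mean())
```

A pipeline like this, without any reduction step, could also serve as a baseline when judging how each technique changes performance.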

### PCA

PCA is a statistical procedure that transforms a set of observations of possibly correlated variables into a set of linearly uncorrelated variables called principal components. It is frequently used to simplify data, reduce noise, and identify unmeasured "latent variables." In other words, PCA helps us find a smaller number of features that compress the original dataset, capturing a share of its variance that depends on how many components we keep.
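
To make the link between the number of components and the variance captured concrete, here is a small sketch on synthetic data (an assumption, not the diabetes dataset): 8 observed features generated from 2 latent variables plus a little noise. The names `latent`, `mixing`, `X_demo`, `pca_demo`, `cumulative`, and `n_keep` are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Two hidden ("latent") variables drive all eight observed features.
latent = rng.normal(size=(500, 2))
mixing = np.array([
    [1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5],
    [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
])
X_demo = latent @ mixing + 0.1 * rng.normal(size=(500, 8))
X_demo = StandardScaler().fit_transform(X_demo)

# Fit PCA with all components, then inspect how much variance each one explains.
pca_demo = PCA(n_components=8).fit(X_demo)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance reaches 95%.
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(cumulative.round(3))
print("Components needed for 95% of the variance:", n_keep)
```

Inspecting `explained_variance_ratio_` this way is one common way to choose `n_components` instead of fixing it in advance.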

In [40]:
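# Build a pipeline that applies PCA and then fits a logistic regression classifier.
# Note: if the dataset has 8 input features, n_components=8 keeps every component,
# so PCA only rotates the data here and does not reduce its dimensionality.
pca = PCA(n_components=8, random_state=42)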
logistic = LogisticRegression(random_state=42)
pca_model = Pipeline(steps=[("pca", pca), ("logistic", logistic)])

# evaluate model
pca_cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
pca_scores = cross_val_score(pca_model, X, y, scoring='accuracy', cv=pca_cv, n_jobs=-1)
print('Accuracy: %.3f' % (np.mean(pca_scores)))

Accuracy: 0.776


### LDA

Whereas PCA chooses new axes that preserve the variance (and hence the ‘shape’) of the data, LDA chooses new axes that maximize the separability between the classes.

Used as a classifier, linear discriminant analysis applies Bayes' theorem to estimate the probability that a new set of inputs belongs to each class, and it predicts the class with the highest probability.
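
LDA therefore plays two roles: it projects the data onto discriminant axes, and it classifies through class probabilities. A brief sketch of both follows, using synthetic two-class data as a stand-in (an assumption); the names `X_demo`, `y_demo`, `lda_demo`, `projected`, `proba`, and `pred` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in: 8 features, 2 classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_redundant=0, random_state=42
)

lda_demo = LinearDiscriminantAnalysis().fit(X_demo, y_demo)

# As a reducer: with 2 classes there is at most n_classes - 1 = 1 discriminant axis.
projected = lda_demo.transform(X_demo)
print(projected.shape)

# As a classifier: per-class probabilities, and the class with the highest one.
proba = lda_demo.predict_proba(X_demo[:3])
pred = lda_demo.predict(X_demo[:3])
print(proba.round(3))
print(pred)
```

Because the number of discriminant axes is capped at the number of classes minus one, a binary target always reduces to a single feature.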

In [42]:
# With two classes, LDA can produce at most n_classes - 1 = 1 component.
lda = LinearDiscriminantAnalysis(n_components=1)
lda_model = Pipeline(steps=[("lda", lda), ("logistic", logistic)])

# evaluate model
lda_cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
lda_scores = cross_val_score(lda_model, X, y, scoring='accuracy', cv=lda_cv, n_jobs=-1)
print('Accuracy: %.3f' % (np.mean(lda_scores)))

Accuracy: 0.772


### Isomap Embedding

Unlike principal component analysis, Isomap (isometric feature mapping) is a non-linear dimensionality reduction technique.
It first finds each data point's k nearest neighbors and builds a neighborhood graph in which points are connected if they are neighbors. It then computes the shortest path through that graph between each pair of data points (nodes), which approximates the geodesic distance between them. Finally, it computes a lower-dimensional embedding of those distances via multidimensional scaling (MDS).
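
The three steps can be sketched by hand as follows. This is an illustrative version under stated assumptions, not scikit-learn's own implementation: the function name `isomap_sketch` is hypothetical, the MDS step uses classical MDS via an eigendecomposition, and the toy data is a 2-D spiral that is unrolled onto a single axis.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph


def isomap_sketch(X, n_neighbors=6, n_components=1):
    """Hypothetical helper: kNN graph -> shortest paths -> classical MDS."""
    # Step 1: neighborhood graph, edges weighted by Euclidean distance.
    graph = kneighbors_graph(X, n_neighbors=n_neighbors, mode="distance")

    # Step 2: shortest paths through the graph approximate geodesic distances.
    geodesic = shortest_path(graph, method="D", directed=False)
    if not np.all(np.isfinite(geodesic)):
        raise ValueError("Neighborhood graph is disconnected; increase n_neighbors.")

    # Step 3: classical MDS, i.e. double-center the squared distances and
    # keep the eigenvectors with the largest eigenvalues.
    n = geodesic.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (geodesic ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))


# Toy data: points along a spiral, ordered by the parameter t.
t = np.linspace(0, 3 * np.pi, 100)
spiral = np.column_stack([t * np.cos(t), t * np.sin(t)])

embedding = isomap_sketch(spiral, n_neighbors=6, n_components=1)
print(embedding.shape)
```

On this spiral the single embedding coordinate tracks position along the curve, which a linear method such as PCA cannot recover from the raw x and y coordinates.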

In [47]:
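# Note: if the dataset has 8 input features, n_components=10 produces more
# dimensions than the original data, so this embedding is not a reduction.
iso = Isomap(n_components=10)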
iso_model = Pipeline(steps=[("iso", iso), ("logistic", logistic)])

iso_cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
iso_scores = cross_val_score(iso_model, X, y, scoring='accuracy', cv=iso_cv, n_jobs=-1)
print('Accuracy: %.3f' % (np.mean(iso_scores)))

Accuracy: 0.723


2. Write a function that will indicate if an inputted IPv4 address is accurate or not.
IP addresses are valid if they have 4 values between 0 and 255 (inclusive), punctuated
by periods.

In [11]:
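def IP_addr(IP_address):
    """Return True if IP_address is four dot-separated integers in 0..255."""
    values = IP_address.split('.')
    if len(values) != 4:
        return False
    for num in values:
        # Reject empty, signed, or non-numeric parts (e.g. '1..2.3', '-1', 'a')
        # before converting, since int() would raise ValueError or accept a sign.
        if not (num.isascii() and num.isdigit()):
            return False
        if int(num) > 255:
            return False
    return True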

In [12]:
IP_addr('2.33.245.5')

True

In [13]:
IP_addr('12.345.67.89')

False
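
As a cross-check, Python's standard-library `ipaddress` module can perform the same validation. This is a minimal sketch, and the function name `is_valid_ipv4` is illustrative only. Its rules are not identical to the hand-written check: recent Python versions, for example, reject octets with leading zeros such as `'01.2.3.4'`.

```python
import ipaddress


def is_valid_ipv4(text):
    """Return True if the standard library accepts text as an IPv4 address."""
    try:
        ipaddress.IPv4Address(text)
    except ValueError:
        # AddressValueError is a subclass of ValueError.
        return False
    return True


print(is_valid_ipv4('2.33.245.5'))
print(is_valid_ipv4('12.345.67.89'))
```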