1. Take one of the supervised learning models you have built recently and apply at least three dimensionality reduction techniques to it (separately). Be sure to create a short summary of each technique you use. Indicate how each changed the model performance. Reference: https://machinelearningmastery.com/dimensionality-reduction-algorithms-with-python/

In [40]:
import numpy as np
import pandas as pd
from sklearn import metrics
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt

In [41]:
import warnings
warnings.filterwarnings("ignore")

In [56]:
arrhythmia_df = pd.read_csv("arrhythmiaNEW.csv") 
arrhythmia_df

Unnamed: 0,age,sex,height,weight,QRSduration,PRinterval,Q-Tinterval,Tinterval,Pinterval,QRS,...,chV2_TwaveAmp,chV2_QRSTA,chV3_SwaveAmp,chV3_PwaveAmp,chV4_RwaveAmp,chV4_PwaveAmp,chV5_JJwaveAmp,chV5_SPwaveAmp,chV6_SPwaveAmp,class
0,75.0,0.0,190.0,80.0,91.0,193.0,371.0,174.0,121.0,-16.0,...,2.9,15.2,-10.0,0.6,15.2,0.9,-0.4,0.0,0.0,1.0
1,56.0,1.0,165.0,64.0,81.0,174.0,401.0,149.0,39.0,25.0,...,2.0,1.2,-7.7,0.9,9.5,0.5,-0.4,0.0,0.0,1.0
2,54.0,0.0,172.0,95.0,138.0,163.0,386.0,185.0,102.0,96.0,...,-2.4,-2.6,-4.1,0.4,10.0,0.5,1.3,0.0,0.0,1.0
3,55.0,0.0,175.0,94.0,100.0,202.0,380.0,179.0,143.0,28.0,...,2.9,18.0,-7.9,0.1,15.0,0.1,0.1,0.0,0.0,0.0
4,75.0,0.0,190.0,80.0,88.0,181.0,360.0,177.0,103.0,-16.0,...,2.1,8.6,-10.2,-1.0,15.2,-0.1,-0.2,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
447,53.0,1.0,160.0,70.0,80.0,199.0,382.0,154.0,117.0,-37.0,...,1.5,-4.8,-9.2,-0.1,2.9,0.8,0.1,0.0,0.0,0.0
448,37.0,0.0,190.0,85.0,100.0,137.0,361.0,201.0,73.0,86.0,...,8.8,75.4,-5.4,-0.3,22.5,0.7,-0.7,0.0,0.0,1.0
449,36.0,0.0,166.0,68.0,108.0,176.0,365.0,194.0,116.0,-85.0,...,-7.0,12.7,-34.1,1.4,20.6,1.0,0.2,0.0,0.0,1.0
450,32.0,1.0,155.0,55.0,93.0,106.0,386.0,218.0,63.0,54.0,...,6.5,63.8,-7.7,0.9,11.9,0.6,0.1,0.0,0.0,0.0


In [57]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split


X = arrhythmia_df.drop('class', axis=1).values
y = arrhythmia_df['class'].values

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42, stratify=y)

#Standardize
sc= StandardScaler()
X_train_sc=sc.fit_transform(X_train)
X_test_sc=sc.fit_transform(X_test)

## Logistic Regression

In [53]:
#logistic regression
clr = LogisticRegression(random_state=42).fit(X_train_sc,y_train)

#predict
y_predicted = clr.predict(X_test_sc)
# print(y_predicted)

print(clr.score(X_test_sc,y_test))

0.7647058823529411


## Three dimensionality reduction techniques


### Singular Value Decomposition:
 
This type of dimensionality reduction is used with sparse data. Sparse date means that the rows of data have many values of zero.

SVD is like a projection method where data with some number of features is projected into a subspace with fewer features/columns, but the main aspects of the original data is still retained. 

### Principle Component Analysis:

PCA is a type of feature extraction instead of feature elimination. In feature elimination, some features are removed and by extension the way in which these variables effect the target is also lost. But with PCA, a type of feature extraction, the "best" parts of all features are retainied. 

In PCA, "new" feature variables are created which are each a combination of each of the original features. Then the "new" variables are ordered by how well they predict the target variable. And the least important ones are removed. However, since the new independent variables still have some combination of the removed variables, the effect that the removed variables would have still is retained. 

An added benift of PCA is that the new feature variables are independent of one another as the linear model requires it.  

PCA is great to use if you are not able to identity important variables in the dataset. 


### Isomap Embedding:

This dimensionality reduction approach is the Non-linear approach. Isomap Embedding is used when data isn't in a linear subspace. Since the most relevant information in high dimensional space is found in a few low dimensional space, we want to reduce the complexity and as we do that (meaning remove attributes), we will see natural clusters of our data.

A 3 dimensional swiss roll (values or data points on a 3 dimensional plane in the shape of a swiss roll) is reduced to 2 dimensions through the Isomap embedding technique so that the values are now lying on a flat 2D plane. 

### Model Performance Comparison

The highest score was attained by using PCA. This may be because all the components of the features are retained. Also SVD performed well too. This could be due to the sparse data in the dataset. 

### Singular Value Decomposition

In [49]:
# Singular Value Decomposition
from sklearn.decomposition import TruncatedSVD

for i in range(2,136): 
    svd = TruncatedSVD(n_components=i)
    X_train_svd=svd.fit_transform(X_train)
    X_test_svd=svd.transform(X_test)
    model_1 = LogisticRegression(random_state=42).fit(X_train_svd, y_train)
    SCORE = model_1.score(X_test_svd, y_test)
    if SCORE >= 0.78:
        print(i,SCORE) 

37 0.7867647058823529
39 0.7867647058823529
51 0.7867647058823529
52 0.7941176470588235
53 0.7867647058823529
54 0.7941176470588235
55 0.7867647058823529


#### SVD Highest Score: 79% at iteration 52

### Principal Component Analysis

In [55]:
from sklearn.decomposition import PCA

for i in range(2,136):
    
    pca_model = PCA(n_components=i)
    X_train_pca = pca_model.fit_transform(X_train)
    X_test_pca = pca_model.transform(X_test)
    model_2 = LogisticRegression(random_state=42).fit(X_train_pca, y_train)
    SCORE = model_2.score(X_test_pca, y_test)
    
    if SCORE >= 0.78:
        print(i,SCORE)

25 0.7867647058823529
38 0.7941176470588235
39 0.8014705882352942
40 0.7941176470588235
41 0.8014705882352942
42 0.8014705882352942
43 0.8014705882352942
44 0.7867647058823529
45 0.7941176470588235
46 0.7867647058823529
48 0.7867647058823529
51 0.8235294117647058
54 0.7867647058823529


#### PCA Highest Score: 82% at iteration 51

### PCA

In [30]:
#EXTRA
from sklearn.decomposition import PCA

for n in range(1,20):
    pca = PCA(n_components=n)

    X_train_pca = pca.fit_transform(X_train)
    X_test_pca = pca.transform(X_test)
    classifier = LogisticRegression(random_state=42).fit(X_train_pca, y_train)


print(n, (classifier.score(X_test_pca, y_test)))

19 0.7647058823529411


### Isomap Embedding

In [33]:
from sklearn.manifold import Isomap

In [58]:
for i in range(2,136):
    
    Isomap_model = Isomap(n_components=i)
    X_train_isomap = Isomap_model.fit_transform(X_train)
    X_test_isomap = Isomap_model.transform(X_test)
    model_3 = LogisticRegression(random_state=42).fit(X_train_isomap, y_train)
    SCORE = model_3.score(X_test_isomap, y_test)
    
    if SCORE >= 0.7:
        print(i,SCORE)

18 0.7132352941176471
19 0.7205882352941176
20 0.7205882352941176
21 0.7279411764705882
22 0.7132352941176471
23 0.7352941176470589
24 0.7205882352941176
25 0.7132352941176471
26 0.7132352941176471
27 0.7132352941176471
28 0.7058823529411765


#### Isomap Highest Score: 74% at iteration 23

2. Write a function that will indicate if an inputted IPv4 address is accurate or not.
IP addresses are valid if they have 4 values between 0 and 255 (inclusive), punctuated by periods.

Input 1:

2.33.245.5

Output 1:

True

Input 2:

12.345.67.89

Output 2:

False

In [35]:
def IPA(address):
    try:
        numbers = address.split('.')
        if len(numbers) != 4:
            return False
        for number in numbers:
            if int(number) < 0 or int(number) > 255:
                return False
            else:
                return True
    except Exception as e:
        return False

In [37]:
IPA('2.33.245.5')

True

In [59]:
IPA('12.345.67.89')

False