##### Week 19 - Group Activity

###### Angela Spencer - February 2, 2022

##### 1. Take one of the supervised learning models you have built recently and apply at least three dimensionality reduction techniques to it (separately). Be sure to create a short summary of each technique you use. Indicate how each changed the model performance. Reference:
        https://machinelearningmastery.com/dimensionality-reduction-algorithms-with-python/


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

#### Diabetes Dataset with Logistic Regression Model

In [2]:
diabetes_df = pd.read_csv("../Datasets/diabetes.csv")

X = diabetes_df.drop('Outcome', axis=1).values
y = diabetes_df['Outcome'].values

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=42, stratify=y)

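sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
# scale the test set with the training statistics (transform, not fit_transform);
# the score recorded below came from refitting on X_test, so a re-run may differ
X_test_sc = sc.transform(X_test)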

#LogisticRegression

clr = LogisticRegression(random_state=42).fit(X_train_sc, y_train)

#predict
y_predicted = clr.predict(X_test_sc)

#accuracy
clr.score(X_test_sc, y_test)

0.7359307359307359
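
One way to keep the scaler from ever being refit on test data is to wrap it and the model in a scikit-learn pipeline. The sketch below is illustrative only: it uses synthetic data from `make_classification` as a stand-in for the diabetes file, and the names `X_demo`, `baseline`, and `fitted_scaler` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 8 features, binary target.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the data passed to fit() and only
# applies (never refits) it when score() or predict() is called.
baseline = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
baseline.fit(X_tr, y_tr)
baseline_accuracy = baseline.score(X_te, y_te)

# The scaler's learned means come from the training split only.
fitted_scaler = baseline.named_steps["standardscaler"]
```

Because the scaling statistics live inside the fitted pipeline, the same object can be scored on any later split without touching `fit_transform` again.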

##### Singular Value Decomposition
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
https://machinelearningmastery.com/singular-value-decomposition-for-machine-learning/

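SVD factorizes a matrix into three parts: left singular vectors, singular values, and right singular vectors. `TruncatedSVD` keeps only the components with the largest singular values (6 here) and uses them as the reduced feature set. On the diabetes dataset, logistic regression on the 6 SVD components gave an accuracy score of 0.71, slightly below the 0.74 baseline.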

In [3]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=6)

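X_train_svd = svd.fit_transform(X_train)
# reuse the components fit on the training data; the score recorded below
# came from refitting on X_test, so a re-run may differ
X_test_svd = svd.transform(X_test)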

clr = LogisticRegression(random_state=42).fit(X_train_svd, y_train)

clr.score(X_test_svd, y_test)

0.70995670995671
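
To make the factorization behind SVD concrete, here is a minimal NumPy sketch on a small made-up matrix (the values in `A` are arbitrary and not taken from the diabetes data). It shows the three factors, an exact reconstruction, and the rank-`k` projection that a truncated SVD keeps.

```python
import numpy as np

# Small hypothetical matrix: 4 samples, 3 features.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 10.0],
              [2.0, 1.0, 0.0]])

# Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_rebuilt = U @ np.diag(s) @ Vt

# Keep the top k components by projecting onto the first k right singular vectors.
k = 2
A_reduced = A @ Vt[:k].T        # reduced features, shape (4, 2)
A_approx = A_reduced @ Vt[:k]   # rank-k approximation of A

# The approximation error equals the singular value that was dropped.
approx_error = np.linalg.norm(A - A_approx)
```

The rows of `Vt` play the role of the fitted components: they are learned once and then applied to any new rows, which is why the test set should only be projected, not refit.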

##### Linear Discriminant Analysis
https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html
https://machinelearningmastery.com/linear-discriminant-analysis-for-dimensionality-reduction-in-python/

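LDA is a supervised technique: it uses the class labels to project the inputs onto at most (number of classes - 1) axes chosen to separate the classes. It is often used for multi-class problems, but it also applies to binary ones.

LDA on the diabetes dataset - Because the diabetes dataset has 2 classes, the maximum number of components to use is 1. The recorded run gave an accuracy score of 0.37, far below the 0.74 baseline. A score under 0.5 on a two-class problem is worse than guessing, which points to a problem in the workflow rather than a limit of LDA. The most likely cause is that the run called `fit_transform` on the test set, which refits the projection on the test data; the orientation of the refit axis need not match the one the classifier was trained on.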

In [8]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(n_components=1)

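X_train_lda = lda.fit_transform(X_train, y_train)
# reuse the projection fit on the training data and never fit on y_test; the
# score recorded below came from refitting on the test set, so a re-run may differ
X_test_lda = lda.transform(X_test)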

clr = LogisticRegression(random_state=42).fit(X_train_lda, y_train)

clr.score(X_test_lda, y_test)

0.36796536796536794
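
LDA can serve as a one-component reducer on a two-class problem when the projection is fit on the training split and only applied to the test split. The sketch below illustrates that pattern on synthetic, well-separated data from `make_classification`; the dataset and the names `lda_demo` and `clf_demo` are assumptions for illustration, so its score says nothing about the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic two-class data (assumption), one well-separated cluster per class.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    n_clusters_per_class=1, class_sep=2.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

lda_demo = LinearDiscriminantAnalysis(n_components=1)
X_tr_lda = lda_demo.fit_transform(X_tr, y_tr)  # fit on the training split only
X_te_lda = lda_demo.transform(X_te)            # reuse the fitted projection

clf_demo = LogisticRegression(random_state=0).fit(X_tr_lda, y_tr)
lda_accuracy = clf_demo.score(X_te_lda, y_te)
```

With the projection shared between the two splits, the single discriminant axis keeps the same orientation for training and scoring.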

##### Principal Component Analysis
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
https://machinelearningmastery.com/calculate-principal-component-analysis-scratch-python/
https://www.datacamp.com/community/tutorials/principal-component-analysis-in-python

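PCA reduces the number of features by finding the orthogonal directions (principal components) along which the data varies most. It projects the high-dimensional original data onto the leading components and discards the low-variance ones. Because PCA is driven by variance, it is sensitive to feature scale; note that it is applied to the unscaled features here.

PCA on the diabetes dataset - A PCA with 6 components gave an accuracy score of 0.75, slightly above the 0.74 baseline.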

In [5]:
from sklearn.decomposition import PCA

pca = PCA(n_components = 6, random_state=42)

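X_train_pca = pca.fit_transform(X_train)
# reuse the components fit on the training data; the score recorded below
# came from refitting on X_test, so a re-run may differ
X_test_pca = pca.transform(X_test)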

clr = LogisticRegression(random_state = 42).fit(X_train_pca, y_train)

clr.score(X_test_pca, y_test)

0.7489177489177489
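
The link between PCA and SVD can be sketched in a few lines of NumPy: PCA amounts to an SVD of the mean-centered data, and the squared singular values give the variance each component explains. The data below is randomly generated for illustration (an assumption, not the diabetes features), and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical data: 200 samples, 4 features with different spreads.
X_demo = rng.normal(size=(200, 4)) * np.array([5.0, 2.0, 1.0, 0.5])

# PCA centers the data first; a plain truncated SVD does not.
X_centered = X_demo - X_demo.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Variance explained by each component, and its share of the total.
explained_variance = s**2 / (X_demo.shape[0] - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# Project onto the leading components to reduce dimensionality.
n_keep = 2
X_projected = X_centered @ Vt[:n_keep].T
```

Inspecting `explained_variance_ratio` is a common way to decide how many components to keep, rather than fixing the count in advance.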

##### 2. Write a function that will indicate if an inputted IPv4 address is accurate or not. IP addresses are valid if they have 4 values between 0 and 255 (inclusive), punctuated by periods.
        Input 1:
        2.33.245.5
        Output 1:
        True
        Input 2:
        12.345.67.89
        Output 2:
        False

In [24]:
# define a function that validates an IPv4 address given as a string

def IPv4_address(address):

    # split the address on periods; a valid address has exactly 4 groups
    # (an empty string yields a single empty group and is rejected here)
    groups = address.split(".")
    if len(groups) != 4:
        return False

    for group in groups:

        # each group must be non-empty and contain only the ASCII digits 0-9;
        # isnumeric() also accepts characters such as '²' that int() cannot parse
        if not (group.isascii() and group.isdigit()):
            return False

        # each group must be between 0 and 255 (inclusive)
        if int(group) > 255:
            return False

    # all 4 groups passed both checks
    return True

In [20]:
IPv4_address('12.256.67.89')

False

In [23]:
IPv4_address('2.66.245.5')

True
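
As a cross-check on a hand-written validator, the standard library's `ipaddress` module can parse dotted-quad strings. The helper below is a minimal sketch; the name `is_valid_ipv4_stdlib` is hypothetical and not part of the assignment.

```python
import ipaddress

def is_valid_ipv4_stdlib(address):
    """Return True if `address` is a string that parses as an IPv4 address."""
    # IPv4Address also accepts integers and bytes, so restrict to strings.
    if not isinstance(address, str):
        return False
    try:
        ipaddress.IPv4Address(address)
    except ValueError:  # AddressValueError is a subclass of ValueError
        return False
    return True

sample_results = {
    "2.33.245.5": is_valid_ipv4_stdlib("2.33.245.5"),
    "12.345.67.89": is_valid_ipv4_stdlib("12.345.67.89"),
}
```

The two approaches may disagree on edge cases: for example, recent Python versions reject groups with leading zeros such as `01`, which a simple digit-and-range check accepts.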