# CS559 - F20 Project #1

## Task Description
You are provided with an anonymized dataset containing numeric feature variables, the binary `target` column, and a string `ID_code` column.

The task is to predict the value of the `target` column in the test set using either **Logistic Regression** or **SVM**. You are welcome to use **regularization**.
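
As a rough illustration of the two allowed model families, the sketch below fits a regularized logistic regression and a linear SVM with scikit-learn. The data is synthetic and only stands in for the anonymized set; the sizes, the `C` values, and the 300/100 split are assumptions chosen for the example, not part of the assignment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Synthetic stand-in for the anonymized data (assumption: not the project data)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

# In scikit-learn, C is the inverse regularization strength:
# a smaller C means a stronger penalty. Both models default to an L2 penalty.
models = {
    "logistic_l2": LogisticRegression(C=1.0, max_iter=1000),
    "linear_svm_l2": LinearSVC(C=1.0, max_iter=10000),
}

scores = {}
for name, model in models.items():
    model.fit(X[:300], y[:300])                    # fit on the first 300 rows
    scores[name] = model.score(X[300:], y[300:])   # accuracy on the other 100 rows
    print(name, scores[name])
```

Lowering `C` in either model strengthens the penalty, which is one way to trade variance for bias when many features are kept.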

## File descriptions
- train.csv - the training set (202 columns)

- test.csv - the test set. The test set contains some rows which are not included in scoring.

## Rules
- The data does not have descriptive column names, so you will not know what the data is about.
- However, you can still solve the classification problem without clustering the training set. **No unsupervised learning techniques in this project**.
- There are 202 columns. This means that the key to high accuracy is **EDA** and **feature engineering**.
- There are no rules on EDA and feature engineering.
- In your model, make sure you reduce the columns to at most 25%. Using all columns can lead to high computational cost and to bias-variance tradeoff and underfitting vs. overfitting problems.
- The project is out of 100. 
    - 50 points will come from your EDA and any pre-processing work. 
    - 30 points will come from your model: Accuracy + overcoming any ML challenges. 
    - 10 points will come from in-class competition. 
        - Ranked by accuracy with fewer features. 
    - 10 points will come from a report describing your workflow and model evaluations.
        - Must be submitted as a separate file (e.g., pdf, docx). 
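
The overfitting concern in the rules is usually checked by scoring on rows the model never saw. The following is a minimal sketch of a hold-out split combined with a reduced column set. The toy frame, the `var_i` column names, the 20% validation share, and the choice of which columns to keep are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv (assumption: column names are hypothetical)
rng = np.random.default_rng(1)
toy = pd.DataFrame(rng.normal(size=(1000, 8)), columns=[f"var_{i}" for i in range(8)])
toy["target"] = (toy["var_0"] > 1.0).astype(int)

# Hold out 20% of the rows for validation, shuffled reproducibly
valid = toy.sample(frac=0.2, random_state=0)
train = toy.drop(valid.index)

# Keep a quarter of the feature columns (here simply the first two, for illustration)
feature_cols = [c for c in toy.columns if c != "target"]
kept_cols = feature_cols[: len(feature_cols) // 4]

print(len(train), len(valid), kept_cols)
```

A model fitted on `train[kept_cols]` can then be scored on `valid[kept_cols]`, so that a gap between training and validation accuracy becomes visible.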
        
## Recommended Before Preprocessing
- You can split the set based on the data distribution. 
- You can make multiple new data frames by randomly selecting columns. 
- You can do the same with rows. 
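
The random column and row selections suggested above can be sketched with `DataFrame.sample`. The toy frame, the subset sizes, and the seeds are assumptions for the example only.

```python
import numpy as np
import pandas as pd

# Toy frame with 12 hypothetical feature columns (assumption: not the project data)
rng = np.random.default_rng(2)
toy = pd.DataFrame(rng.normal(size=(100, 12)), columns=[f"var_{i}" for i in range(12)])

# Several new frames, each holding a random quarter of the columns
column_subsets = [
    toy.sample(n=toy.shape[1] // 4, axis=1, random_state=seed) for seed in range(3)
]

# The same idea by rows: a random half of the rows
row_subset = toy.sample(frac=0.5, random_state=0)

for subset in column_subsets:
    print(list(subset.columns))
print(row_subset.shape)
```

Fixing `random_state` keeps each subset reproducible, which helps when comparing models trained on different subsets.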

## Recommended Before Training the Model
- Make sure any feature you delete is removed for a well-supported reason. 

Project DUE: Friday, 10/23/2020, 11:59 PM. 




In [3]:
#Handle imports
import pandas as pd
import numpy as np
import random

#Read in dataset
dataset = pd.read_csv("Train.csv")


In [4]:
#Separate features and target, drop unnecessary ID_code column
features = dataset.drop(["ID_code", "target"], axis = 1)
target = dataset["target"]

In [None]:
#Rank features by the absolute value of their correlation with the target (ascending)
correlations = features.corrwith(target)
correlations = correlations.reindex(correlations.abs().sort_values().index)


#Use correlations to select the 25% most correlated features
n_features = len(correlations.index)
selected_features = list(correlations.index[3*n_features//4:])

#Keep the list of selected feature names
new_features = selected_features


In [None]:
#Personal implementation of LogisticRegression, allowing for learning rate to be modified
class LogisticRegressionPersonal:
    def __init__(self, lr = 0.01, n = 100000, features_list_index = None):
        self.lr = lr
        self.num_iter = n
        self.features_list_index = features_list_index
    
    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))    
    
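    def fit(self, X, y):
        #Convert to NumPy arrays; note that column 0 of X is dropped here (and again in predict)
        X = np.delete(X.to_numpy(), 0, 1)
        y = y.to_numpy()

        # weights initialization
        self.theta = np.zeros(X.shape[1])
        
        #Batch gradient descent on the mean log-loss
        for i in range(self.num_iter):
            try:
                z = np.array(np.dot(X, self.theta), dtype = np.float64)
                h = self.sigmoid(z)
                gradient = np.array(np.dot(X.T, (h - y)) / y.size, dtype = np.float64)
                self.theta -= self.lr * gradient
            except Exception as err:
                #Report the failure once and stop instead of printing on every iteration
                print("Iteration " + str(i) + " failed: " + str(err))
                break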
        
    def predict_prob(self, X):
        z = np.array(np.dot(X, self.theta), dtype = np.float64)
        return self.sigmoid(z)
    
    def predict(self, X, threshold = .5):
        X = np.delete(X.to_numpy(), 0, 1)
        return self.predict_prob(X) >= threshold
    
    def __str__(self):
        return "Learning Rate: " + str(self.lr) + '\n' + "Number of Iterations: " + str(self.num_iter) +'\n' + "Features list index: " + str(self.features_list_index)
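
One thing to watch in a hand-written logistic regression is that `np.exp(-z)` can overflow when `z` is a large negative number, which may happen with unscaled features. The sketch below shows one possible numerically safer formulation. It is an alternative offered for illustration, and `stable_sigmoid` is a hypothetical helper name, not part of the class above.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that only ever exponentiates non-positive numbers."""
    z = np.asarray(z, dtype=np.float64)
    e = np.exp(-np.abs(z))   # always in [0, 1], so it cannot overflow
    # For z >= 0: 1 / (1 + exp(-z)); for z < 0: exp(z) / (1 + exp(z))
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))
```

Both branches are algebraically equal to `1 / (1 + np.exp(-z))`, so the function could be swapped in without changing the model, under the assumption that only the overflow behavior matters.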

In [2]:
#List of learning rates to create varied models of LogisticRegressionPersonal
learning_rates = [.003, .005, .01]

#List of iteration limits to create varied models for both implementations of Logistic Regression
num_iterations = [500, 1000, 5000]

#List to hold different feature lists, starting with the one that was selected using EDA to find the most correlated features
features_list = [new_features]

#Select two random batches of 50 features and add them as feature lists, to compare against the correlation-based one
for i in range(2):
    indices = random.sample(range(1, 200), 50)
    selected_features = [features.columns[col] for col in indices]
    features_list += [selected_features]

#List to store the top 10 performing models
best_lr_models = []



    


NameError: name 'new_features' is not defined

In [None]:
#Create varied models for LogisticRegressionPersonal
for i in range(len(features_list)):
    for rate in learning_rates:
        for num in num_iterations:
            print("Learning Rate: " + str(rate))
            print("Number of Iterations: " + str(num))
            print("Features list index: " + str(i))
            lr = LogisticRegressionPersonal(rate, num, i)
            print("Fitting Model...")
            lr.fit(features[features_list[i]], target)
            preds = lr.predict(features[features_list[i]])
            #Training-set accuracy: fraction of predictions that match the target
            perc = float(np.mean(preds == target.to_numpy()))
            print("Model Accuracy: " + str(perc) + '\n')
            #Keep only the 10 most accurate models, sorted by ascending accuracy
            best_lr_models += [("Personal", lr, perc)]
            best_lr_models.sort(key = lambda x: x[2])
            if len(best_lr_models) > 10:
                best_lr_models = best_lr_models[1:]


In [3]:
#Create varied models for Sklearn's Logistic Regression, whose solvers do not allow for 
#user input learning rates
from sklearn.linear_model import LogisticRegression

#Iteration cap for the solver (the largest value in num_iterations)
max_iterations = num_iterations[-1]

for i in range(len(features_list)):
    print("Learning Rate: N/A")
    print("Max Iterations: " + str(max_iterations))
    print("Features list index: " + str(i))
    lr = LogisticRegression(max_iter = max_iterations)
    print("Fitting Model...")
    lr.fit(features[features_list[i]], target.values)
    preds = lr.predict(features[features_list[i]])
    #Training-set accuracy: fraction of predictions that match the target
    perc = float(np.mean(preds == target.values))
    print("Model Accuracy: " + str(perc) + '\n')
    #Keep only the 10 most accurate models, sorted by ascending accuracy
    best_lr_models += [("Sklearn", lr, perc)]
    best_lr_models.sort(key = lambda x: x[2])
    if len(best_lr_models) > 10:
        best_lr_models = best_lr_models[1:]


NameError: name 'features_list' is not defined

In [None]:
#Sort the top ten Logistic Regression models by training-set accuracy (ascending, so the best model prints last)
best_lr_models.sort(key = lambda x: x[2])
for model in best_lr_models:
    print(model[0])
    print(model[1])
    print("Accuracy: " + str(model[2]) + '\n')
    print("---------------------------------")


0.8997
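
Accuracy on its own can be hard to interpret for a binary target. If the classes are imbalanced (an assumption; the class balance of `target` is not stated in this notebook), a model that always predicts the majority class already scores high. The sketch below uses made-up labels with roughly 10% positives to show how such a baseline could be computed for comparison.

```python
import numpy as np

# Hypothetical labels with roughly 10% positives (assumption: illustrative only)
rng = np.random.default_rng(3)
y_true = (rng.random(1000) < 0.1).astype(int)

# A "model" that always predicts the most common class
majority_class = np.bincount(y_true).argmax()
baseline_accuracy = np.mean(y_true == majority_class)

print(majority_class, baseline_accuracy)
```

Running the same two lines on the real `target` column would show how much of a reported accuracy is explained by class balance alone.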
