## Perceptron Algorithm


In [None]:
# First, import the required Python packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Perceptron
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, confusion_matrix, classification_report,
)
from sklearn.decomposition import PCA

In [None]:
# Then, make sure data is loaded
df = load_and_prepare_data("adult.csv")

The core of the machine learning workflow is training the model, which is done by the function below (its code is in perceptron.py in the src folder). It trains a perceptron on the adult income dataset, adjusting the weight of each feature whenever the current weights misclassify an observation's binary income marker.

In [None]:
model, X_test, y_test = train_perceptron(df)
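
To make the weight adjustment concrete, here is a minimal sketch of the classic perceptron update rule on a toy dataset. It is not the `train_perceptron` code from perceptron.py, whose internals are not shown here. The names `perceptron_fit`, `perceptron_predict`, `X_toy`, and `y_toy` are hypothetical, and the labels are assumed to be coded as -1 and +1.

```python
import numpy as np

def perceptron_fit(X, y, epochs=10, lr=1.0):
    """Classic perceptron rule; y must contain the labels -1 or +1."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Update only when the current weights misclassify the sample
            if yi * (xi @ w + b) <= 0:
                w += lr * yi * xi
                b += lr * yi
    return w, b

def perceptron_predict(X, w, b):
    """Predict +1 when the weighted sum exceeds the threshold, else -1."""
    return np.where(X @ w + b > 0, 1, -1)

# Tiny linearly separable toy set (hypothetical, not the adult income data)
X_toy = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y_toy = np.array([1, 1, -1, -1])

w, b = perceptron_fit(X_toy, y_toy)
toy_pred = perceptron_predict(X_toy, w, b)
print(w, b, toy_pred)
```

Because the weights change only on mistakes, training stops changing anything once every sample is classified correctly, which happens only when the classes are linearly separable.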

Once the model has been trained, we can evaluate how well it predicts adult income. The evaluation reports the accuracy, precision, and recall of the perceptron. Here, accuracy reaches 0.81 and precision reaches 0.75.

In [None]:
evaluate_model(model, X_test, y_test)
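
As a reminder of what these three numbers mean, the sketch below computes them by hand from a small set of made-up labels. The arrays `y_true` and `y_pred` are hypothetical and unrelated to the adult income results; the `accuracy_score`, `precision_score`, and `recall_score` functions imported earlier compute the same quantities.

```python
import numpy as np

# Hypothetical labels: 1 = higher income, 0 = lower income
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 0])

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # true positives
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false positives
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # false negatives
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # true negatives

accuracy = (tp + tn) / len(y_true)  # share of all predictions that are correct
precision = tp / (tp + fp)          # share of predicted positives that are correct
recall = tp / (tp + fn)             # share of actual positives that were found

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```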

## Visualizations
Two visualizations help interpret the perceptron model: a PCA projection and a confusion matrix heat map.

The first graph is a Principal Component Analysis (PCA) projection. PCA reduces the many input features to a few components that capture most of the variance in the data, which makes it possible to plot the observations in two dimensions and see how well the two income classes separate. In this case, a small cluster of lower-income cases sits apart from the rest and is easy to identify, but the bulk of the data is clumped closely together. Although one side of the clump appears higher income and the other lower income, the overlap indicates that it is difficult to find factors that cleanly separate income status.

In [None]:
plot_decision_boundary(model, X_test, y_test)
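
For intuition about the projection itself, here is a minimal sketch of PCA computed with a singular value decomposition on synthetic data. It does not reproduce `plot_decision_boundary`, whose implementation is not shown here; `X_demo` is a hypothetical stand-in for the scaled feature matrix. The `PCA` class imported earlier performs a comparable centering and projection.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))  # hypothetical stand-in for scaled features

# Center each feature, then decompose the centered data
X_centered = X_demo - X_demo.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# The first two rows of Vt are the directions of greatest variance
components = Vt[:2]
X_2d = X_centered @ components.T

# Fraction of total variance captured by each component
explained = S**2 / np.sum(S**2)
print(X_2d.shape, explained[:2])
```

If the first two components capture only a small share of the variance, a two-dimensional plot hides much of the structure, which is one reason classes can look more overlapped in the plot than they are in the full feature space.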

The second plot is a confusion matrix heat map. It compares the model's predictions of the binary income variable against the actual values. The map shows that the model predicted the lower-income category quite accurately, although the majority of observations fall into that category. Nevertheless, it is a helpful visual summary of where the model's predictions are right and wrong.

In [None]:
plot_confusion_matrix(model, X_test, y_test)
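
The effect of class imbalance on such a matrix can be illustrated with a small sketch. The labels below are made up and unrelated to the adult income results. Rows are actual classes and columns are predicted classes, the same layout that scikit-learn's `confusion_matrix` uses.

```python
import numpy as np

# Hypothetical imbalanced labels: 0 = lower income (majority), 1 = higher income
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0])

# Count each (actual, predicted) pair; rows = actual, columns = predicted
cm = np.zeros((2, 2), dtype=int)
np.add.at(cm, (y_true, y_pred), 1)

# Fraction of each actual class that was predicted correctly
per_class_recall = cm.diagonal() / cm.sum(axis=1)
overall_accuracy = cm.diagonal().sum() / cm.sum()

print(cm)
print(per_class_recall, overall_accuracy)
```

In this toy case the overall accuracy looks reasonable because the majority class dominates, while the minority class is recovered far less often, so the per-class view is worth checking alongside the heat map.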