## 1. Compare and Contrast Classifiers 

### *Perceptrons*
Perceptrons are a rudimentary implementation of artificial neurons with a learning rule that learns the weight coefficients automatically.

* Function - Used for classification. Creates a linear separation in the data by computing a weighted sum of each sample's features, thresholding that sum to predict a class, and adjusting the weights whenever a prediction is wrong.

* Data type - Only guaranteed to converge if the data is linearly separable, but can be used with any data type for which that holds (text, images, or numerical).

* Best Use Case - (?) With the proliferation of machine learning techniques I don't think these are ever really useful in practice, but they are a good introduction to ANNs.
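
The learning rule described above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names `predict` and `train_perceptron`, the learning rate `eta`, the epoch count, and the toy AND-gate data are all assumptions chosen for the example.

```python
def predict(weights, bias, x):
    """Return 1 if the weighted sum of the inputs reaches the threshold, else 0."""
    net_input = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if net_input >= 0.0 else 0


def train_perceptron(samples, labels, eta=1.0, epochs=20):
    """Fit weights with the perceptron learning rule (illustrative sketch)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            # the update is zero whenever the prediction is already correct
            update = eta * (target - predict(weights, bias, x))
            weights = [w + update * xi for w, xi in zip(weights, x)]
            bias += update
    return weights, bias


# hypothetical toy data: logical AND, which is linearly separable
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]

weights, bias = train_perceptron(X, y)
predictions = [predict(weights, bias, x) for x in X]
print(predictions)
```

On data that is not linearly separable (XOR, for example) the same loop would keep updating the weights without ever settling, which is the limitation noted under "Data type".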
 
    
### *SVMs*

SVMs create hyperplanes to classify data into appropriate bins.

* Function - Used for classification. Optimizes the placement of hyperplanes to accurately separate classes of input.

* Data type - SVMs are very good at linearly separable data, such as images (MNIST), text (sentiment), and even protein classification.

* Best Use Case - Work best when there is a wide margin between the classes; however, kernel-based variants allow "bendy" (nonlinear) decision boundaries that can work around closely related data.
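
To make the idea of "good distance" concrete, the sketch below scores two hand-picked hyperplanes by their geometric margin, the smallest distance from any point to the boundary. It does not train an SVM (a real SVM solves an optimization problem to find the maximum-margin hyperplane); the function name `geometric_margin`, the toy points, and both candidate hyperplanes are assumptions made for illustration.

```python
import math


def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    Labels are +1 or -1; a negative result means some point is misclassified.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        label * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, label in zip(points, labels)
    )


# hypothetical toy data: two classes separated along the first feature
points = [(1.0, 1.0), (2.0, 3.0), (5.0, 1.0), (6.0, 2.0)]
labels = [-1, -1, 1, 1]

# two candidate boundaries: the vertical lines x1 = 3.5 and x1 = 2.5
centered = geometric_margin((1.0, 0.0), -3.5, points, labels)
off_center = geometric_margin((1.0, 0.0), -2.5, points, labels)
print(centered, off_center)
```

Both candidates classify every point correctly, but the centered one leaves more room on each side, which is the property an SVM optimizes for.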

### *Decision Trees*

Decision Trees are a supervised learning model that builds a binary tree to filter the data into categories, with each branch representing a decision based on a single feature (like a flow chart).

* Function - Good for classification and regression. Breaks the data into smaller and smaller branches of classes or probabilities.

* Data type - For use with multi-dimensional data, but could be used with uni-dimensional data. Works well with probabilistic data, such as predicting the most probable best response in a chatbot. Can also be used with image data.

* Best Use Case - Works best when the data features fall into categories that are easy to separate (ones a non-lazy human could separate). Decision Trees often overfit, so we need to be aware of that possibility (which can be mitigated through the use of Random Forests).
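
A single branch decision can be sketched as a search for the threshold on one feature that best separates the classes. The sketch below uses Gini impurity as the splitting criterion, which is one common choice and an assumption here, since the text above does not name a criterion. The function names and the toy data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))


def best_threshold(values, labels):
    """Try each midpoint between sorted values; return (threshold, weighted impurity)."""
    best = (None, float("inf"))
    ordered = sorted(set(values))
    for low, high in zip(ordered, ordered[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# hypothetical toy data: one numerical feature and a two-class label
feature_values = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]
classes = ["a", "a", "a", "b", "b", "b"]

threshold, impurity = best_threshold(feature_values, classes)
print(threshold, impurity)
```

A full tree would repeat this search recursively on each side of the split, which is also how it can end up overfitting if it keeps splitting until every leaf is pure.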

### *Random Forests*

Random Forest algorithms create a collection of Decision Trees which produce classifications collectively (the classification with the most "votes" wins).

* Function - Classification and regression. The classification can be thought of as the mode of the classifications produced by the ensemble, and the regression output as the mean of the values predicted by the ensemble.

* Data type - Same as Decision Trees.

* Best Use Case - Random forests are good when we observe overfitting from a single Decision Tree, so long as we take steps to ensure diversity among the ensemble (which can be accomplished by simply sampling the training set differently for each of the trees).
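
The aggregation step can be sketched without training any trees. The per-tree outputs below are made-up stand-ins for what five fitted trees might return for one sample, and the helper names are hypothetical; `bootstrap_sample` shows one common way (sampling with replacement) to give each tree a different view of the training set.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so each tree sees different data."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(class_predictions):
    """Classification: the most common prediction among the trees wins."""
    return Counter(class_predictions).most_common(1)[0][0]


def average_prediction(numeric_predictions):
    """Regression: the ensemble output is the mean of the tree outputs."""
    return mean(numeric_predictions)


# hypothetical outputs from five already-trained trees for a single sample
tree_classes = ["setosa", "versicolor", "setosa", "setosa", "virginica"]
tree_values = [0.70, 0.80, 0.75, 0.65, 0.85]

winning_class = majority_vote(tree_classes)
ensemble_value = average_prediction(tree_values)
print(winning_class, ensemble_value)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = bootstrap_sample(list(range(10)), rng)
print(sample)
```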

__References:__

[1] Wikipedia for each

[2] Raschka, Sebastian, and Vahid Mirjalili. Python Machine Learning : Machine Learning and Deep Learning with Python, Scikit-learn, and Tensorflow 2, 3rd Edition. 3rd ed. Birmingham: Packt Publishing, Limited, 2019.

[3] https://towardsdatascience.com/understanding-random-forest-58381e0602d2

[4] https://towardsdatascience.com/the-complete-guide-to-decision-trees-28a4e3c7be14

[5] https://towardsdatascience.com/perceptron-learning-algorithm-d5db0deab975

## 2. Define Feature Types

* *Numerical* - Numerical data can be represented as any numerical datatype, such as integers, floats, or doubles. The Iris dataset from scikit-learn/UCI (https://archive.ics.uci.edu/ml/datasets/iris) has four float-type numerical features: sepal length, sepal width, petal length, and petal width.


* *Nominal* - Nominal data can be represented as strings, or the strings can be mapped to integers to reduce storage size. Nominal data is used to categorize data, as in the Iris dataset above, whose 'class' feature is an integer that maps to the string for the species of Iris flower.


* *Date* - Dates may be stored either as numbers (1982.300 for the 300th day in 1982) or as strings ("9-10-2020"). I found a dataset of international football results that features the date of the match as a string here: https://www.kaggle.com/martj42/international-football-results-from-1872-to-2017


* *Text* - Text data may be stored as strings if the text is short or non-repeating (for example, name data), but this is usually not particularly useful if the text is large, such as a TV script for The Simpsons. Frequently it is necessary to distribute text data as a text file and then preprocess it into a list of individual words and/or punctuation marks. Words may then be mapped to integers to reduce total memory usage (this is frequently useful). An example of text data can be found in the football results dataset mentioned above, which records the home team, away team, and match location as strings. Another example, for longer text, would be a collection of text files for the Harry Potter books.


* *Image* - Image data should be represented as multidimensional arrays of numbers. For a color image, the array has height and width dimensions plus a channel dimension that holds the RGB values needed to represent each pixel. If the image is black-and-white, we can drop the channel dimension and store a single intensity value for each pixel.


* *Dependent Variable* - The dependent variable or target may be any of the above feature types. Any labeled dataset (supervised learning datasets) will feature this ("Class" in the Iris dataset or "Chance of Admit" in the Graduate Admissions dataset).
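
The representations described above can be sketched with the standard library alone. All of the sample values are made up for illustration, and the month-day-year reading of "9-10-2020" is an assumption, since the string by itself does not say which number is the month.

```python
from datetime import datetime

# Nominal: map each category string to an integer code (the codes are arbitrary)
species = ["setosa", "virginica", "setosa", "versicolor"]
species_to_code = {name: code for code, name in enumerate(sorted(set(species)))}
encoded_species = [species_to_code[name] for name in species]

# Date: parse the string into a date object (month-day-year order is assumed)
match_date = datetime.strptime("9-10-2020", "%m-%d-%Y").date()

# Text: split into lowercase words, then map each distinct word to an integer
text = "the cat sat on the mat"
words = text.lower().split()
vocabulary = {word: idx for idx, word in enumerate(dict.fromkeys(words))}
encoded_text = [vocabulary[word] for word in words]

# Image: a 2x2 RGB image as a height x width x channel nested list
rgb_image = [[[255, 0, 0], [0, 255, 0]],
             [[0, 0, 255], [255, 255, 255]]]

print(encoded_species, match_date, encoded_text)
```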

## 3. Accuracy Metrics


### Confusion Matrix Metrics
There are four definable metrics other than accuracy that come from the confusion matrix.
* _Precision_ - Precision is the ratio of true positives to all predicted positives (true positives + false positives). This tells us/the algo how many of the samples classified into one category were correctly identified, as opposed to accuracy, which tells us how many of all classifications were made to the correct categories. You could also do the same thing with the negatives.

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

* _Recall or Sensitivity_ - Recall is the ratio of true positives to all actual positives (true positives + false negatives). This metric tells us the proportion of hits to possible hits, or how well the model identifies positives.

$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$

* _Specificity_ - Specificity is the counterpart of Sensitivity for the negative class. That is to say, it is the ratio of true negatives to all actual negatives (true negatives + false positives), which tells us how well the model identifies negatives.

$$\text{Specificity} = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}}$$

* _F1 Score_ - The F1 Score is a combination of Precision and Recall using the Harmonic Mean of the two. This prevents a false sense of accomplishment when the model has a high recall but low precision (most positives are found, but with many false positives).

$$\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
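
As a worked example, the four metrics can be computed directly from confusion-matrix counts, with F1 taken as the harmonic mean of precision and recall. The function name `confusion_metrics` and the counts passed to it are hypothetical, chosen only to give round numbers.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute confusion-matrix metrics from raw counts (illustrative sketch)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "accuracy": accuracy,
    }


# hypothetical counts: 40 true positives, 10 false positives,
# 45 true negatives, 5 false negatives
metrics = confusion_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The sketch does not guard against empty denominators (for example, a model that never predicts the positive class), which a more careful version would need to handle.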

### Other Metrics

These metrics tend to work with values produced by a model rather than the absolute correctness of the model (no confusion matrix).

* _Logarithmic Loss_ - Log Loss penalizes false classifications and is especially useful for classifications on multiple classes (that is, not just positive or negative, as in the Confusion Matrix Metrics). This metric averages the negative log of the probability the model assigned to the class actually observed in the data, so a lower log loss means the model placed more probability on the correct outcomes. In the formula, $N$ is the number of samples, $M$ is the number of classes, $y_{ij}$ is 1 if sample $i$ belongs to class $j$ (0 otherwise), and $p_{ij}$ is the predicted probability that sample $i$ belongs to class $j$.

$$\text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log(p_{ij})$$

* _Mean Absolute Error_ - Mean Absolute Error is simply the mean of the absolute differences between the predicted values and the observed values. As predicted values get closer to observed values this metric approaches zero.

$$\text{Mean Absolute Error} = \frac{1}{N} \sum_{j=1}^{N} |\text{predicted}_j - \text{observed}_j|$$

* _Mean Squared Error_ - MSE is similar to Mean Absolute Error, but instead averages the square of the difference between predicted and observed.

$$\text{Mean Squared Error} = \frac{1}{N} \sum_{j=1}^{N} (\text{predicted}_j - \text{observed}_j)^2$$
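
The three formulas can be sketched in plain Python as follows. The function names and sample values are hypothetical, and the log loss sketch assumes one-hot rows for the true classes and skips zero entries so that it never evaluates the log of a probability it does not need.

```python
import math


def log_loss(y_true, y_prob):
    """y_true: one-hot rows; y_prob: predicted class-probability rows."""
    total = 0.0
    for truth_row, prob_row in zip(y_true, y_prob):
        total += sum(t * math.log(p) for t, p in zip(truth_row, prob_row) if t)
    return -total / len(y_true)


def mean_absolute_error(predicted, observed):
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


def mean_squared_error(predicted, observed):
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)


# hypothetical two-class example where the model is maximally unsure
loss = log_loss([[1, 0], [0, 1]], [[0.5, 0.5], [0.5, 0.5]])

# hypothetical regression example
predicted = [2.0, 5.0, 4.0]
observed = [3.0, 5.0, 2.0]
mae = mean_absolute_error(predicted, observed)
mse = mean_squared_error(predicted, observed)
print(loss, mae, mse)
```

Because MSE squares each difference, the single error of 2 contributes four times as much as the error of 1, whereas MAE weights them in proportion to their size.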


__References:__

[1] https://medium.com/@MohammedS/performance-metrics-for-classification-problems-in-machine-learning-part-i-b085d432082b

[2] https://towardsdatascience.com/metrics-to-evaluate-your-machine-learning-algorithm-f10ba6e38234

[3] https://www.kaggle.com/dansbecker/what-is-log-loss

## 4. Correlation in Admission Prediction Data

I tried using df.equals() to check the equality between the original dataframe's corr() and the correlation I made, but that does not seem to work despite the dataframes looking the same. My guess is that this is due to some rounding difference.

The code cell will first output my correlation dataframe and then pandas's correlation dataframe for easy comparison (they match 1:1).
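
If rounding is indeed the cause, a tolerance-based comparison is one way to confirm that the two results agree. The sketch below uses randomly generated stand-in data rather than the admissions CSV, and the names `demo_df`, `manual_corr`, and `pandas_corr` are hypothetical.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in data; the admissions CSV is not needed for this sketch
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])

# correlation matrix computed with numpy, labeled the same way as pandas's output
manual_corr = pd.DataFrame(
    np.corrcoef(demo_df.to_numpy(), rowvar=False),
    index=demo_df.columns,
    columns=demo_df.columns,
)
pandas_corr = demo_df.corr()

# exact comparison: can be False if any element differs in its last bits
print(manual_corr.equals(pandas_corr))

# tolerance-based comparison: True when the values agree up to rounding error
matches = np.allclose(manual_corr.to_numpy(), pandas_corr.to_numpy())
print(matches)
```

`DataFrame.equals` also requires the row and column labels to match, so relabeling the rows before comparing matters as much as the tolerance does.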

### Questions
* The diagonal of the matrix is all ones because it holds the correlation of each feature with itself (which must be exact, as the values are the same).

* I found it interesting to note that research has a relatively low correlation with the rest of the values, as it seems to me a student with research experience would be more likely to want to get into grad school, and would thus score better on better-correlating metrics such as the GRE (though I could see research lowering CGPA, which may cause an issue for admissions).

* Based on the information available, CGPA is the strongest predictor of admission to grad school, followed by GRE. This seems valid, as a student who applies themselves well before grad school is probably more likely to apply themselves well in both grad school and on the GRE.
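
The all-ones diagonal noted in the first point above follows from the definition of the Pearson correlation coefficient. Writing $\operatorname{cov}$ for covariance and $\sigma_x$ for the standard deviation of a feature $x$ (notation introduced here for illustration), and assuming the feature is not constant so that $\sigma_x > 0$:

$$r_{xy} = \frac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y} \quad\Rightarrow\quad r_{xx} = \frac{\operatorname{cov}(x, x)}{\sigma_x \sigma_x} = \frac{\sigma_x^2}{\sigma_x^2} = 1$$

The same definition is symmetric in $x$ and $y$, which is why the matrix mirrors itself across the diagonal.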

```python
import itertools as it
from os.path import isfile

import numpy as np
import pandas as pd

# path to downloaded data
data_path = './datasets/Admission_Predict.csv'

# stop early if the data_path is wrong, rather than continuing with an empty dataframe
if not isfile(data_path):
    raise FileNotFoundError("Please make sure the data_path is correct and that "
                            "the data is named appropriately")

# load the admission data
df = pd.read_csv(data_path)
df = df.drop(['Serial No.'], axis=1)  # remove Serial No. as it is only a row identifier

# dict for holding calculated correlations
corr_dict = {}

# for each column pair possible
for f1, f2 in it.product(df.columns, repeat=2):
    # get the pearson correlation
    # np.corrcoef produces a 2x2 matrix with the correlation we want at [0][1] and [1][0]
    corr = np.corrcoef(df[f1], df[f2])[0][1]

    # append corr to the list for feature 1, creating the list on first use
    corr_dict.setdefault(f1, []).append(corr)

# make a dataframe from the correlation dictionary
corr_df = pd.DataFrame(corr_dict)

# label the rows with the feature names (rows are in the same order as the columns)
corr_df.index = corr_df.columns

# output my results and pandas corr()
print(f"My correlation:\n{corr_df}")
print("\n\n")
print(f"Pandas's correlation:\n{df.corr()}")
```
  

My correlation:
                   GRE Score  TOEFL Score  University Rating       SOP  \
GRE Score           1.000000     0.835977           0.668976  0.612831   
TOEFL Score         0.835977     1.000000           0.695590  0.657981   
University Rating   0.668976     0.695590           1.000000  0.734523   
SOP                 0.612831     0.657981           0.734523  1.000000   
LOR                 0.557555     0.567721           0.660123  0.729593   
CGPA                0.833060     0.828417           0.746479  0.718144   
Research            0.580391     0.489858           0.447783  0.444029   
Chance of Admit     0.802610     0.791594           0.711250  0.675732   

                       LOR       CGPA  Research  Chance of Admit   
GRE Score          0.557555  0.833060  0.580391          0.802610  
TOEFL Score        0.567721  0.828417  0.489858          0.791594  
University Rating  0.660123  0.746479  0.447783          0.711250  
SOP                0.729593  0.718144  0.4440