#### Andrew Taylor
#### EN705.601.83 Applied Machine Learning
#### September 8, 2023

### Homework #2 Notebook

### Question #1: Classifiers

Let's answer this question with some descriptions, then in another section I'll compare and contrast.

#### Description of ML Techniques:

**Perceptron:**
The perceptron is one of the earliest and simplest artificial neural network architectures. It was introduced by Frank Rosenblatt in 1957. The perceptron consists of a single layer of neurons that make binary decisions. It takes a vector of inputs, multiplies them with its weights, sums the products, and then passes the sum through a step function (typically a unit step function) to produce an output of either 0 or 1.
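
The forward pass described above, together with the classic perceptron learning rule, can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the function names, the learning rate, the epoch limit, and the toy AND/XOR data are all assumptions made for the example.

```python
def predict(weights, bias, x):
    """Weighted sum followed by a unit step: returns 0 or 1."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total >= 0 else 0


def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Classic perceptron learning rule; stops early once an epoch has no errors."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    errors = 0
    for _ in range(epochs):
        errors = 0
        for x, target in zip(samples, labels):
            delta = target - predict(weights, bias, x)  # -1, 0, or +1
            if delta != 0:
                errors += 1
                weights = [w + lr * delta * xi for w, xi in zip(weights, x)]
                bias += lr * delta
        if errors == 0:
            break
    return weights, bias, errors


inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [0, 0, 0, 1]   # linearly separable
xor_labels = [0, 1, 1, 0]   # not linearly separable

w_and, b_and, and_errors = train_perceptron(inputs, and_labels)
w_xor, b_xor, xor_errors = train_perceptron(inputs, xor_labels)

and_predictions = [predict(w_and, b_and, x) for x in inputs]
print("AND predictions:", and_predictions, "errors in last epoch:", and_errors)
print("XOR errors in last epoch:", xor_errors)
```

On the AND data the rule reaches an error-free epoch, while on XOR it never does, because no single line separates the two classes.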

**Support Vector Machines (SVM):**
SVM is a supervised machine learning algorithm used for classification or regression. Introduced in the 1990s, it works by finding the hyperplane that best divides a dataset into classes. The primary principle is to maximize the margin between the closest data points (support vectors) of two classes. SVMs can be linear or non-linear, depending on the kernel used.

**Decision Tree:**
A decision tree is a flowchart-like structure in which each internal node tests a feature (or attribute), each branch represents an outcome of that test (a decision rule), and each leaf node represents a predicted outcome. The topmost node is known as the root node. The tree is learned by recursively partitioning the data on feature values. Decision trees can be used for both classification and regression.

**Random Forest:**
Random Forest is an ensemble learning method that constructs a multitude of decision trees at training time and outputs the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. It is particularly effective in avoiding overfitting by training on various subsets of the data and averaging the results.
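
The two ingredients mentioned above, training each tree on a resampled subset and then combining the trees' outputs, can be sketched as follows. The tree training itself is left out; the helper names and the per-tree predictions are hypothetical values chosen only to show the aggregation step.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the 'bagging' step)."""
    return [rng.choice(rows) for _ in rows]


def aggregate_classification(tree_predictions):
    """Mode of the per-tree class predictions (majority vote)."""
    return Counter(tree_predictions).most_common(1)[0][0]


def aggregate_regression(tree_predictions):
    """Mean of the per-tree numeric predictions."""
    return mean(tree_predictions)


rng = random.Random(0)
rows = list(range(10))               # stand-in for ten training rows
sample = bootstrap_sample(rows, rng)

votes = ["setosa", "versicolor", "setosa", "setosa", "virginica"]  # hypothetical tree outputs
forest_class = aggregate_classification(votes)
forest_value = aggregate_regression([0.70, 0.80, 0.75, 0.85])
print(len(sample), forest_class, forest_value)
```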

---

#### Comparison and Contrast:

1. **Nature**:
   - *Perceptron:* Neural network-based.
   - *SVM:* Margin-based classifier.
   - *Decision Tree:* Rule-based.
   - *Random Forest:* Ensemble of decision trees.

2. **Model Complexity**:
   - *Perceptron:* Simple with a single layer.
   - *SVM:* Can be complex especially with non-linear kernels.
   - *Decision Tree:* Complexity varies with depth and branching.
   - *Random Forest:* More complex due to multiple trees.

3. **Handling Non-linear Data**:
   - *Perceptron:* Struggles with non-linearly separable data.
   - *SVM:* Can handle non-linearity using kernels.
   - *Decision Tree:* Can handle non-linear data inherently.
   - *Random Forest:* Naturally handles non-linearity due to the ensemble nature.

4. **Overfitting**:
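   - *Perceptron:* As a linear model it tends to underfit rather than overfit; on non-linearly separable data the learning rule never converges, so the final weights depend on when training stops.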
   - *SVM:* Less prone due to margin optimization, but choice of kernel and parameters matters.
   - *Decision Tree:* Can easily overfit if not pruned.
   - *Random Forest:* Reduces overfitting through ensemble learning.

5. **Training Speed**:
   - *Perceptron:* Fast as it is a simple model.
   - *SVM:* Slower, especially for large datasets or with complex kernels.
   - *Decision Tree:* Moderate, depends on the depth and branching.
   - *Random Forest:* Slower due to multiple trees, but can be parallelized.

6. **Interpretability**:
   - *Perceptron:* Moderately interpretable due to weights.
   - *SVM:* Less interpretable, especially with non-linear kernels.
   - *Decision Tree:* Highly interpretable as rules can be visualized.
   - *Random Forest:* Less interpretable than a single decision tree but provides feature importance.

---

Now let's answer the specific questions:

---

**Optimization Problem and Cost Function:**

1. **Perceptron:**
   - **Optimization Problem:** Yes.
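   - **Cost Function:** The goal is to drive the number of misclassified training samples to zero. The misclassification count itself is not differentiable, so the update rule instead corresponds to minimizing the perceptron criterion: the sum of $-y_i(\mathbf{w} \cdot \mathbf{x}_i + b)$ over the misclassified samples, with labels $y_i \in \{-1, +1\}$.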
   
2. **Support Vector Machines (SVM):**
   - **Optimization Problem:** Yes.
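   - **Cost Function:** The soft-margin SVM minimizes the regularized hinge loss, $\frac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_i \max(0, 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b))$. Minimizing $\lVert \mathbf{w} \rVert$ is equivalent to maximizing the margin between the two classes; the hard-margin form instead imposes the margin as constraints.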
   
3. **Decision Tree:**
   - **Optimization Problem:** Yes, but it is solved greedily, one split at a time.
   - **Cost Function:** Decision trees don't have a single global cost function like the models above. Instead, each split is chosen to minimize an impurity measure such as entropy, Gini impurity, or classification error in the resulting child nodes.
   
4. **Random Forest:**
   - **Optimization Problem:** Yes, but at the individual tree level.
   - **Cost Function:** Like decision trees, random forests use metrics like entropy or Gini impurity for their individual trees.
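
To make the cost functions above concrete, here is a small sketch of the hinge loss used by SVMs and the impurity measures used by tree-based models. The function names and the toy inputs are assumptions for illustration; labels for the hinge loss are assumed to be encoded as -1 and +1.

```python
import math
from collections import Counter


def hinge_loss(weights, bias, samples, labels):
    """Mean hinge loss max(0, 1 - y * (w.x + b)) for labels in {-1, +1}."""
    total = 0.0
    for x, y in zip(samples, labels):
        score = sum(w * xi for w, xi in zip(weights, x)) + bias
        total += max(0.0, 1.0 - y * score)
    return total / len(samples)


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum p * log2(p)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def weighted_impurity(left, right, impurity=gini):
    """Impurity of a candidate split, weighted by child node size."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)


parent = ["a", "a", "b", "b"]
print(gini(parent), entropy(parent))              # 0.5 1.0
print(weighted_impurity(["a", "a"], ["b", "b"]))  # 0.0 (a perfect split)
print(hinge_loss([1.0, 0.0], 0.0, [(2.0, 0.0), (0.5, 0.0)], [1, 1]))  # 0.25
```

A tree learner would evaluate `weighted_impurity` for every candidate split and keep the one with the lowest value, which is the greedy step behind each node.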

---

**Speed, Strength, Robustness, and Statistical considerations:**

1. **Perceptron:**
   - **Speed:** Fast.
   - **Strength:** Good for linearly separable data.
   - **Robustness:** Sensitive to noisy data and outliers.
   - **Statistical:** High bias: it can only represent linear decision boundaries, and training does not converge on non-linearly separable data.
   
2. **SVM:**
   - **Speed:** Moderate to slow, depending on kernel and dataset size.
   - **Strength:** Effective for both linear and certain non-linear patterns.
   - **Robustness:** Robust against overfitting, especially in high-dimensional space.
   - **Statistical:** Effective, but can be sensitive to the choice of kernel and parameters.
   
3. **Decision Tree:**
   - **Speed:** Fast to moderate.
   - **Strength:** Can capture non-linear relationships.
   - **Robustness:** Prone to overfitting if not pruned.
   - **Statistical:** Can be unstable (high variance): small changes in the data can lead to very different trees.
   
4. **Random Forest:**
   - **Speed:** Slower due to multiple trees, but can be parallelized.
   - **Strength:** Can capture complex patterns and relationships.
   - **Robustness:** More robust against overfitting compared to individual decision trees.
   - **Statistical:** Provides a measure of feature importance and reduces variance.

---

**Feature Type Classifier Naturally Uses:**

1. **Perceptron:**
   - Linear combinations of features.
   
2. **SVM:**
   - Linear or non-linear transformations based on kernels.
   
3. **Decision Tree:**
   - Uses raw feature values directly, splitting on one feature at a time; the split is chosen using entropy, Gini impurity, or classification error.
   
4. **Random Forest:**
   - Uses features directly like decision trees but across multiple trees.

---

**Which One to Try First on a Dataset?**

The choice of which model to try first on a dataset depends on the nature and size of the dataset, as well as the specific problem at hand. However, as a general guideline:

- For linearly separable data, starting with a perceptron or linear SVM can be a good choice.
- For datasets with complex non-linear patterns but not too large in size, SVM with non-linear kernels can be effective.
- Decision trees can be a good starting point due to their interpretability and ability to handle non-linear data.
- Random Forest is often a good default choice for many datasets due to its robustness and ability to handle both linear and non-linear patterns.

---

In conclusion, the ideal model often varies with the nature of the data and problem. It's beneficial to start with a simpler model to establish a baseline and then explore more complex models as needed.

### Question #2: Definitions of Feature Types

##### 1. Numerical
**Definition:** Numerical features represent measurable quantities and can take any value within a range. They can be further divided into continuous (can take any value in a range) and discrete (can only take certain specific values).

**Example from Iris dataset:** 
- sepal length: This is a continuous numerical feature as it can take any value within a range to represent the length of the sepal in centimeters.

##### 2. Nominal
**Definition:** Nominal features are categorical features that don’t have a natural order or ranking. They can take two or more categories, but there's no intrinsic ordering to the categories.

**Example from Iris dataset:** 
- species: This is a nominal feature as it can take values like "setosa", "versicolor", or "virginica". There's no inherent order to these species names.
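
Because nominal values have no order, they are commonly turned into indicator columns before being fed to a numeric model. A minimal one-hot encoding sketch, where the `one_hot` helper is a hypothetical name used only for this example:

```python
def one_hot(value, categories):
    """Return a 0/1 indicator list with a single 1 at the value's category."""
    return [1 if value == category else 0 for category in categories]


species_categories = ["setosa", "versicolor", "virginica"]
encoded = [one_hot(s, species_categories) for s in ["versicolor", "setosa"]]
print(encoded)  # [[0, 1, 0], [1, 0, 0]]
```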

##### 3. Date
**Definition:** Date features represent specific days, months, years, or even timestamps. They can be used to track the progression of time.

**Example from an Air Quality dataset:** 
- **Air Quality Dataset**: This dataset contains daily readings of the air quality values from 2004 to 2005. A feature like Date in this dataset would indicate the specific day when the air quality was recorded.
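
A date usually has to be broken into numeric parts before a classifier can use it. The sketch below assumes, purely for illustration, that dates are stored as day/month/year strings; the helper name and the sample date are not taken from the dataset.

```python
from datetime import datetime


def date_features(date_string, fmt="%d/%m/%Y"):
    """Parse a date string and derive simple numeric features from it."""
    parsed = datetime.strptime(date_string, fmt)
    return {
        "year": parsed.year,
        "month": parsed.month,
        "day": parsed.day,
        "weekday": parsed.weekday(),               # Monday = 0 ... Sunday = 6
        "day_of_year": parsed.timetuple().tm_yday,
    }


features = date_features("10/03/2004")  # assumed format: 10 March 2004
print(features)
```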

##### 4. Text
**Definition:** Text features consist of words, sentences, or paragraphs. These are typically unstructured and require special preprocessing techniques to extract meaningful information.

**Example from Newsgroups dataset:** 
- **20 Newsgroups**: This is a dataset for text classification, containing newsgroup documents, organized into 20 different newsgroups. Each document is a collection of text, representing the content of a post or an article.
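
One simple form of the preprocessing mentioned in the definition is a bag-of-words count vector. This sketch uses a made-up post and a tiny hypothetical vocabulary; a real pipeline would build the vocabulary from the training documents.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


vocabulary = ["graphics", "hockey", "space"]             # hypothetical vocabulary
post = "Space probes and space telescopes, not hockey."  # made-up post
vector = bag_of_words(post, vocabulary)
print(vector)  # [0, 1, 2]
```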

##### 5. Image
**Definition:** Image features are typically represented as matrices of pixel values. Each pixel can have one (for grayscale images) or multiple values (for color images).

**Example from the CIFAR-10 dataset:** 
- **CIFAR-10**: This dataset consists of 60,000 32x32 color images in 10 different classes, representing objects like 'airplane', 'automobile', 'bird', etc. Each image is represented as a 3-dimensional array of pixel values (32x32 pixels and 3 channels for RGB).
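
The 32x32x3 layout can be pictured with plain nested lists. The image below is synthetic (all black with one red pixel), not loaded from CIFAR-10, and the flattening step is just one possible way to hand pixels to a classifier that expects a flat feature vector.

```python
HEIGHT, WIDTH, CHANNELS = 32, 32, 3

# A made-up all-black image stored as image[row][column][channel].
image = [[[0 for _ in range(CHANNELS)] for _ in range(WIDTH)] for _ in range(HEIGHT)]
image[0][0] = [255, 0, 0]  # set the top-left pixel to pure red

# Flatten to a single feature vector, as a non-convolutional classifier would need.
flat = [value for row in image for pixel in row for value in pixel]
print(len(flat))  # 3072
```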


### Question 3 - Performance Metrics

Here are some common machine learning classifier performance metrics:

1. **Accuracy**:
   - **Definition**: It measures the proportion of correctly predicted classification assignments among the total instances in the dataset.
   - **Formula**:
     $$
     \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions made}}
     $$
   - **Jargon**:
     - No specific jargon used in this definition.  

2. **Precision**:
   - **Definition**: Precision measures the proportion of correctly predicted positive instances out of all instances that were predicted as positive.
   - **Formula**:
     $$
     \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}
     $$
   - **Jargon**:
     - **True Positives (TP)**: Number of positive instances correctly predicted as positive.
     - **False Positives (FP)**: Number of negative instances wrongly predicted as positive.  

3. **Recall (or Sensitivity)**:
   - **Definition**: Recall measures the proportion of correctly predicted positive instances out of all actual positive instances.
   - **Formula**:
     $$
     \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
     $$
   - **Jargon**:
     - **False Negatives (FN)**: Number of positive instances wrongly predicted as negative.  

4. **F1-Score**:
   - **Definition**: It is the harmonic mean of precision and recall, providing a balance between the two when there's an uneven class distribution.
   - **Formula**:
     $$
     F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
     $$
   - **Jargon**: 
     - No additional jargon apart from precision and recall.  

5. **Specificity**:
   - **Definition**: Measures the proportion of correctly predicted negative instances out of all actual negative instances.
   - **Formula**:
     $$
     \text{Specificity} = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}}
     $$
   - **Jargon**:
     - **True Negatives (TN)**: Number of negative instances correctly predicted as negative.  

6. **Area Under the Receiver Operating Characteristic Curve (AUC-ROC)**:
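   - **Definition**: AUC-ROC is the probability that the model assigns a higher score to a randomly chosen positive instance than to a randomly chosen negative instance. The ROC curve plots the true positive rate against the false positive rate across all decision thresholds, and the area under it summarizes how well the model separates the two classes (0.5 is chance level, 1.0 is perfect separation).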
   - **Formula**: AUC-ROC doesn't have a simple formula; it's computed by plotting True Positive Rate (Recall) vs False Positive Rate at various threshold settings and computing the area under this curve.
   - **Jargon**:
     - **True Positive Rate (TPR)**: Same as Recall.
     - **False Positive Rate (FPR)**: $ \frac{\text{False Positives}}{\text{False Positives} + \text{True Negatives}} $.  

7. **Area Under the Precision-Recall Curve (AUC-PR)**:
   - **Definition**: Similar to AUC-ROC, but it focuses on the performance of a classifier on the positive (minority) class. It's useful when the classes are imbalanced.
   - **Formula**: It's computed by plotting Precision vs Recall at various threshold settings and computing the area under this curve.
   - **Jargon**: No additional jargon apart from precision and recall.  

8. **Logarithmic Loss (Log Loss)**:
   - **Definition**: Measures the performance of a classification model where the prediction is a probability value between 0 and 1. Lower log loss indicates better performance.
   - **Formula**:
     $$
     \text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} [y_i \log(p_i) + (1 - y_i) \log(1 - p_i)]
     $$
     where $N$ is the number of samples, $y_i$ is the actual class (0 or 1), and $p_i$ is the predicted probability that instance $i$ belongs to class 1.
   - **Jargon**:
     - $y_i$: Actual class label (0 or 1).
     - $p_i$: Predicted probability of the instance being in class 1.

9. **Matthews Correlation Coefficient (MCC)**:
   - **Definition**: It is a measure of the quality of binary classifications. It returns a value between -1 and 1. A coefficient of +1 represents a perfect prediction, 0 no better than random prediction, and -1 indicates total disagreement between prediction and observation.
   - **Formula**:
     $$
     MCC = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
     $$
   - **Jargon**:
     - We've already defined TP, TN, FP, FN in the previous metrics.  

10. **Cohen's Kappa**:
   - **Definition**: It measures the agreement between two raters (in this context, the actual and the predicted labels). It considers the agreement occurring by chance. A Kappa of 1 indicates perfect agreement, while a Kappa of 0 indicates agreement equivalent to chance.
   - **Formula**: It's a bit more complex as it involves computing expected and observed agreement. The basic formula is:
     $$
     Kappa = \frac{P_o - P_e}{1 - P_e}
     $$
     where $P_o$ is the observed agreement and $P_e$ is the agreement expected by chance.
   - **Jargon**:
     - $P_o$: Observed agreement.
     - $P_e$: Expected (chance) agreement.

11. **Balanced Accuracy**:
   - **Definition**: It calculates the arithmetic mean of sensitivity (recall) and specificity. It's especially useful for imbalanced datasets.
   - **Formula**:
     $$
     \text{Balanced Accuracy} = \frac{\text{Sensitivity} + \text{Specificity}}{2}
     $$
   - **Jargon**:
     - Sensitivity and Specificity were defined previously.  

12. **Hamming Loss**:
   - **Definition**: It is used for multilabel classification and represents the fraction of individual labels that are incorrectly predicted.
   - **Formula**:
     $$
     \text{Hamming Loss} = \frac{1}{N L} \sum_{i=1}^{N} \sum_{j=1}^{L} \text{xor}(y_{ij}, \hat{y}_{ij})
     $$
     where $N$ is the number of samples, $L$ is the number of labels, $y_{ij}$ is 1 if sample $i$ actually has label $j$ (0 otherwise), and $\hat{y}_{ij}$ is the corresponding prediction.
   - **Jargon**:
     - $\text{xor}$: Exclusive or; it is 1 when the two binary values differ and 0 when they match.

13. **Zero-One Loss**:
   - **Definition**: It represents the number of misclassifications. It's often normalized to be between 0 and 1, with 0 being a perfect classifier.
   - **Formula**:
     $$
     \text{Zero-One Loss} = \frac{\text{Number of misclassifications}}{\text{Total number of predictions}}
     $$
   - **Jargon**:
     - No specific jargon used in this definition.  

14. **Brier Score**:
   - **Definition**: Measures the mean squared difference between predicted probabilities and the actual outcomes. It's appropriate for binary classification tasks.
   - **Formula**:
     $$
     \text{Brier Score} = \frac{1}{N} \sum_{i=1}^{N} (p_i - o_i)^2
     $$
     where $N$ is the number of samples, $p_i$ is the predicted probability, and $o_i$ is the actual outcome (0 or 1).
   - **Jargon**:
     - *o_i*: Actual outcome for the i-th sample (either 0 or 1).
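
Several of the formulas above can be checked by hand on a small example. The sketch below is one possible from-scratch implementation for binary labels; the function names and the toy labels and probabilities are made up for illustration, and it does not guard against division by zero (for example, when no instance is predicted positive).

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def threshold_metrics(y_true, y_pred):
    """Metrics that depend only on the confusion-matrix counts."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": 2 * precision * recall / (precision + recall),
        "balanced_accuracy": (recall + specificity) / 2,
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


def log_loss(y_true, probs):
    """Mean negative log-likelihood of the true labels."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, probs)) / len(y_true)


def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


def auc_roc(y_true, probs):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [p for y, p in zip(y_true, probs) if y == 1]
    negatives = [p for y, p in zip(y_true, probs) if y == 0]
    wins = sum(1.0 if pos > neg else 0.5 if pos == neg else 0.0
               for pos in positives for neg in negatives)
    return wins / (len(positives) * len(negatives))


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.1]   # made-up model outputs
y_pred = [1 if p >= 0.5 else 0 for p in probs]

metrics = threshold_metrics(y_true, y_pred)
print(metrics)
print(log_loss(y_true, probs), brier_score(y_true, probs), auc_roc(y_true, probs))
```

With these toy values the counts are TP = 2, TN = 4, FP = 1, FN = 1, so accuracy is 6/8 while precision and recall are both 2/3, which shows how accuracy can look better than the positive-class metrics on imbalanced data.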



### Question 4: Correlation Program

The formula for the Pearson correlation coefficient between two variables $X$ and $Y$ is:

$$
r_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}
$$

Where:
- $r_{XY}$ is the correlation coefficient between $X$ and $Y$  
- $X_i$ and $Y_i$ are individual data points  
- $\bar{X}$ and $\bar{Y}$ are the means of $X$ and $Y$ respectively  
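
As a quick hand check of the formula, take the made-up data $X = (1, 2, 3)$ and $Y = (1, 3, 2)$, chosen only because the arithmetic is easy to follow:

$$
\bar{X} = 2, \qquad \bar{Y} = 2
$$

$$
\sum_{i=1}^{3} (X_i - \bar{X})(Y_i - \bar{Y}) = (-1)(-1) + (0)(1) + (1)(0) = 1
$$

$$
\sum_{i=1}^{3} (X_i - \bar{X})^2 = 1 + 0 + 1 = 2, \qquad \sum_{i=1}^{3} (Y_i - \bar{Y})^2 = 1 + 1 + 0 = 2
$$

$$
r_{XY} = \frac{1}{\sqrt{2 \times 2}} = 0.5
$$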


In [1]:
# Question 4 Correlation Program from Scratch not using Numpy

import math

# Load the dataset using built-in Python functions.
with open("Admission_Predict_Ver1.1.csv", "r") as file:
    lines = file.readlines()
    header = lines[0].strip().split(",")
    # Skip blank lines (e.g. a trailing newline) so float() does not fail on them.
    data = [list(map(float, line.strip().split(","))) for line in lines[1:] if line.strip()]

# Drop 'Serial No.' (index 0) and keep the remaining eight fields (indices 1-8) of each row.
columns = [row[1:9] for row in data]

# Define a function to compute the mean.
def mean(column):
    return sum(column) / len(column)

# Define a function to extract a specific column from the data.
def extract_column(data, col_idx):
    return [row[col_idx] for row in data]

# Define a function to compute the Pearson correlation coefficient between two columns.
def correlation(col1, col2):
    mean1, mean2 = mean(col1), mean(col2)
    numerator = sum((x - mean1) * (y - mean2) for x, y in zip(col1, col2))
    denominator = (sum((x - mean1) ** 2 for x in col1) *
                   sum((y - mean2) ** 2 for y in col2)) ** 0.5

    # Handle the zero-variance case (a constant column), where r is undefined.
    # abs_tol is needed: with the default abs_tol=0, isclose(x, 0) is true only for x == 0.
    if math.isclose(denominator, 0, abs_tol=1e-12):
        return float('nan')

    return numerator / denominator

# Create the correlation matrix: entry [i][j] is the correlation of column i with column j.
num_cols = len(columns[0])
correlation_matrix = [
    [correlation(extract_column(columns, i), extract_column(columns, j))
     for j in range(num_cols)]
    for i in range(num_cols)]

# Display the manually computed correlation matrix.
print("Manual Correlation Matrix:")
for row in correlation_matrix:
    print(row)


# For verification using pandas:
import pandas as pd

data_df = pd.read_csv("Admission_Predict_Ver1.1.csv")
subset_data_df = data_df.iloc[:, 1:9]
correlation_matrix_df = subset_data_df.corr()

print("\nDataFrame.corr() Matrix:")
print(correlation_matrix_df)


Manual Correlation Matrix:
[1.0, 0.827200403531722, 0.6353762113239018, 0.6134976734624114, 0.5246793925817081, 0.8258779536403567, 0.5633981217777579, 0.8103506354632608]
[0.827200403531722, 1.0, 0.6497991951468062, 0.6444103878875822, 0.5415632950080242, 0.8105735363036228, 0.46701206060973394, 0.7922276143050834]
[0.6353762113239018, 0.6497991951468062, 1.0, 0.7280235718785817, 0.6086507072838143, 0.705254345086195, 0.4270474518133488, 0.6901323687886892]
[0.6134976734624114, 0.6444103878875822, 0.7280235718785817, 1.0, 0.663706852514935, 0.7121543243652508, 0.40811584579179266, 0.6841365241316726]
[0.5246793925817081, 0.5415632950080242, 0.6086507072838143, 0.663706852514935, 1.0, 0.6374692057544721, 0.3725256035105942, 0.6453645135280112]
[0.8258779536403567, 0.8105735363036228, 0.705254345086195, 0.7121543243652508, 0.6374692057544721, 1.0, 0.501311000534699, 0.8824125749045746]
[0.5633981217777579, 0.46701206060973394, 0.4270474518133488, 0.40811584579179266, 0.3725256035105942,

1. **Should we use 'Serial no'?**
   - The 'Serial no' is typically a unique identifier for each row or observation in the dataset. From a data analysis or modeling perspective, it does not carry any meaningful information about the variables of interest. Including it in the analysis or model could introduce noise without providing any genuine insights or predictive power. Therefore, it's recommended not to use 'Serial no' for analysis or modeling purposes.

2. **Observe that the diagonal of this matrix should have all 1's and explain why?**
   - The diagonal of the correlation matrix contains the correlation of each variable with itself. The Pearson correlation coefficient of a variable with itself is always 1. This is because the correlation coefficient measures the linear relationship between two variables, and a variable is perfectly linearly related to itself. Mathematically, the numerator and the denominator of the correlation formula will be identical when correlating a variable with itself, resulting in a value of 1.

3. **Since the last column can be used as the target (dependent) variable, what do you think about the correlations between all the variables?**
   - Observing the last column (or row, since the matrix is symmetric) gives the correlation values of each variable with the 'Chance of Admit' (our potential target variable).
   - The printed matrix shows strong positive correlations with the 'Chance of Admit' for 'CGPA' (0.88), 'GRE Score' (0.81), and 'TOEFL Score' (0.79), and moderate ones for 'University Rating' (0.69), 'SOP' (0.68), and 'LOR' (0.65). As these variables increase, the chance of admission tends to increase, and vice versa.
   - The predictors are also strongly correlated with each other (for example, 'GRE Score' with 'TOEFL Score' at 0.83 and with 'CGPA' at 0.83), so they carry overlapping information about the target.
   - It's worth noting that even if variables are correlated with the target variable, it doesn't necessarily imply causation. It simply means there's a linear association.

4. **Which variable should be the most important for prediction of 'Chance of Admit'?**
   - The importance of a variable in predicting the 'Chance of Admit' can be initially gauged by its correlation magnitude with the 'Chance of Admit'. The variable with the highest absolute correlation value (closest to 1 or -1) would be the most linearly related to the 'Chance of Admit'.
   - From the computed correlation matrix, the variable with the highest correlation with the 'Chance of Admit' (apart from itself) is at index 5 (0-indexed) of the matrix, which corresponds to 'CGPA' (index 6 in the original file, before 'Serial No.' is dropped), with a correlation of about 0.88. This suggests that 'CGPA' might be the most important variable for predicting the 'Chance of Admit'.
   - However, it's essential to understand that correlation only captures linear relationships. There might be other non-linear associations or interactions between variables that can also be important. Advanced modeling techniques can further elucidate the importance and contribution of each variable to the prediction.
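
The ranking described in this answer can be reproduced from the last-column values of the manual matrix printed above. The column names are an assumption based on the CSV header order after 'Serial No.' is dropped, and 'Research' is left out because its row is cut off in the printed output.

```python
# Last-column entries of the first six rows of the manual matrix printed above.
# The names assume the CSV header order after 'Serial No.' is dropped.
corr_with_target = {
    "GRE Score": 0.8103506354632608,
    "TOEFL Score": 0.7922276143050834,
    "University Rating": 0.6901323687886892,
    "SOP": 0.6841365241316726,
    "LOR": 0.6453645135280112,
    "CGPA": 0.8824125749045746,
}

# Rank by absolute correlation so strongly negative predictors would also surface.
ranked = sorted(corr_with_target.items(), key=lambda item: abs(item[1]), reverse=True)
for name, r in ranked:
    print(f"{name:<18} {r:.3f}")

best_feature = ranked[0][0]
```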
