<p align="center">
    <img src="JHU.png" width="200" alt="Johns Hopkins University logo">
</p>

# Hands-on Lab: Data Features & Model Evaluation

Estimated time needed: **60** minutes

### Overview:

The primary objective of this assignment is to explore and compare various machine learning classifiers and implement a correlation analysis program from scratch.

In this lab, we will:

- **Classify and Compare:** Discuss and compare classifiers such as Perceptron, SVM, Decision Tree, and Random Forest based on criteria like speed, robustness, and the type of features they utilize.

- **Correlation Analysis:** Implement a correlation matrix from scratch for the Admission_Predict.csv dataset, analyzing the relationships between features to determine which are most predictive of the target variable.

This lab is designed to deepen your understanding of machine learning concepts through practical application and comparison of different models and techniques.


### Learning Objectives:

In this lab, we aim to achieve the following objectives:

- **Explore and Compare Classifiers:** Provide a high-level overview of various machine learning classifiers including Perceptron, SVM, Decision Tree, and Random Forest. We will compare these classifiers based on criteria such as speed, robustness, and their suitability for different types of features.

- **Implement Correlation Analysis:** Develop a correlation matrix from scratch for the Admission_Predict.csv dataset. This analysis will help identify the relationships between different features and determine which variables are most predictive of the target variable, 'Chance of Admit'.

These objectives are designed to enhance your understanding of key machine learning concepts and their practical application in data analysis.

### Introduction:
Let us first explore and compare four essential machine learning classifiers: 
- Perceptron
- Support Vector Machine (SVM)
- Decision Tree
- Random Forest

The focus will be on providing a high-level understanding of these classifiers, highlighting their strengths, weaknesses, and typical use cases.

As we progress, we will guide you through the following aspects for each classifier:

- Speed: We will discuss how quickly each classifier processes data, particularly in large datasets.
- Strength: We’ll evaluate the effectiveness of each classifier in terms of its accuracy and ability to handle complex data.
- Robustness: We'll consider how well each classifier performs when dealing with noisy or previously unseen data.
- Feature Type: We'll identify the types of features each classifier naturally works with, such as numerical or categorical.
- Statistical Nature: We’ll explore whether the classifier is based on statistical principles.
- Optimization Problem: Finally, we’ll examine whether each method solves an optimization problem and, if so, what the associated cost function is.

**Perceptron:**
- Speed: Fast to train, especially on linearly separable data.
- Strength: Effective for simple, linearly separable problems.
- Robustness: Not robust; struggles with non-linearly separable data and noise.
- Feature Type: Works naturally with numerical features, as it relies on calculating weighted sums (essentially a linear combination of inputs).
- Statistical: Not typically considered a statistical method.
- Optimization Problem: Yes, it solves an optimization problem by minimizing classification errors. The cost function is the perceptron criterion, which sums −y(w·x + b) over the misclassified points only, so correctly classified points contribute zero loss and the weights change only when a point is misclassified.
- When to Use: It’s a good starting point for very simple problems or as a baseline in scenarios where you expect a linear relationship.
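
As a rough illustration of the update rule described above, here is a minimal sketch of perceptron training in plain Python. The function names, the `learning_rate` and `epochs` parameters, and the toy data (a logical AND with labels in {-1, +1}) are assumptions made for this example and are not part of the lab's dataset.

```python
def train_perceptron(samples, labels, epochs=10, learning_rate=1.0):
    """Train a perceptron; samples are lists of numbers, labels are -1 or +1."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * activation <= 0:  # misclassified, or exactly on the boundary
                # Move the boundary toward classifying this point correctly
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
                mistakes += 1
        if mistakes == 0:  # every point classified correctly: stop early
            break
    return weights, bias


def predict(weights, bias, x):
    """Return +1 or -1 depending on the side of the decision boundary."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1


# Toy, linearly separable data (hypothetical): logical AND
samples = [[0, 0], [0, 1], [1, 0], [1, 1]]
labels = [-1, -1, -1, 1]

weights, bias = train_perceptron(samples, labels, epochs=50)
predictions = [predict(weights, bias, x) for x in samples]
print(predictions == labels)
```

Because the toy data is linearly separable, the loop stops as soon as an epoch produces no mistakes; on non-separable data it would simply run for all `epochs` without settling.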

**Support Vector Machine (SVM):**
- Speed: Slower to train compared to the Perceptron, especially on large datasets or when using non-linear kernels.
- Strength: Highly effective for both linear and non-linear problems, especially in high-dimensional spaces.
- Robustness: Robust, particularly with well-separated data. It handles outliers better with the soft-margin version.
- Feature Type: Naturally uses numerical features, especially those that are scaled appropriately. With kernel tricks, it can handle more complex relationships.
- Statistical: Not primarily statistical, but it can be connected to statistical learning theory.
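- Optimization Problem: Yes, it solves a quadratic optimization problem. The soft-margin cost function combines the hinge loss, which penalizes points that fall inside the margin or on the wrong side of it, with a regularization term on the weights that keeps the margin between classes as wide as possible.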
- When to Use: Ideal when you expect complex boundaries between classes or when you have a high-dimensional dataset.
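
To make the hinge loss concrete, the sketch below evaluates one common form of the soft-margin objective for a linear classifier: the average hinge loss plus an L2 penalty on the weights. The function name, the `reg_strength` parameter, and the toy points are assumptions for illustration; real SVM solvers optimize this objective rather than just evaluating it.

```python
def hinge_objective(weights, bias, samples, labels, reg_strength=0.01):
    """Average hinge loss plus an L2 penalty; labels must be -1 or +1."""
    total_hinge = 0.0
    for x, y in zip(samples, labels):
        margin = y * (sum(w * xi for w, xi in zip(weights, x)) + bias)
        # Zero loss once a point is on the correct side with margin >= 1
        total_hinge += max(0.0, 1.0 - margin)
    l2_penalty = reg_strength * sum(w * w for w in weights)
    return total_hinge / len(samples) + l2_penalty


# Two hypothetical one-dimensional points, one per class
samples = [[2.0], [-2.0]]
labels = [1, -1]

wide_margin_loss = hinge_objective([1.0], 0.0, samples, labels)
narrow_margin_loss = hinge_objective([0.25], 0.0, samples, labels)
print(wide_margin_loss, narrow_margin_loss)
```

With weight 1.0 both points sit outside the margin, so only the penalty term remains; with weight 0.25 both points fall inside the margin and the hinge term dominates.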

**Decision Tree:**
- Speed: Fast to train and make predictions, especially for smaller datasets.
- Strength: Intuitive and interpretable; can handle both numerical and categorical data naturally.
- Robustness: Prone to overfitting, especially with deep trees, but robustness can be improved with pruning or ensemble methods.
- Feature Type: Can handle both numerical and categorical features well. It splits data based on feature values that best separate classes.
- Statistical: Not typically statistical but can be interpreted in a probabilistic way.
- Optimization Problem: Implicitly solves an optimization problem by maximizing information gain (or minimizing impurity) at each split.
- When to Use: A good first choice when interpretability is important or when you have a mix of feature types.
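
The "minimizing impurity at each split" idea can be sketched for a single numerical feature as follows. This example uses Gini impurity, which is one of several impurity measures; the function names and the toy values are assumptions for illustration only.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(values, labels, threshold):
    """Size-weighted impurity of the two children of the split value <= threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


def best_threshold(values, labels):
    """Try every observed value as a threshold and keep the purest split."""
    return min(sorted(set(values)), key=lambda t: split_impurity(values, labels, t))


# Hypothetical feature values and class labels
values = [1.0, 2.0, 3.0, 4.0]
labels = ["a", "a", "b", "b"]

threshold = best_threshold(values, labels)
print(threshold, split_impurity(values, labels, threshold))
```

A full decision tree repeats this search over every feature and then recurses into each child node until a stopping rule is met.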

**Random Forest:**
- Speed: Slower to train than a single decision tree, but the trees can be trained in parallel. Often faster at prediction time than kernel SVMs on large datasets.
- Strength: Very strong performance across various datasets, reducing overfitting compared to individual decision trees.
- Robustness: Highly robust due to averaging the results of many decision trees, which reduces variance and overfitting.
- Feature Type: Naturally handles both numerical and categorical features, similar to decision trees.
- Statistical: Can be considered statistical in the sense that it aggregates multiple models (decision trees) to improve prediction stability.
- Optimization Problem: Does not solve a traditional optimization problem in the way SVM or Perceptron does. It uses bagging (Bootstrap Aggregating) and random feature selection to build and aggregate multiple decision trees.
- When to Use: A great first choice for most real-world datasets due to its versatility, robustness, and strong performance.
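
The three ingredients named above (bootstrap sampling, random feature selection, and aggregation by voting) can be sketched without any tree-building code. The helper names, the fixed random seed, and the toy inputs are assumptions for this illustration; a real random forest would train one decision tree per bootstrap sample.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (some repeat, some are left out)."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices to consider at a split."""
    return sorted(rng.sample(range(n_features), k))


def majority_vote(predictions):
    """Aggregate the class predictions of the individual trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = list(range(10))  # stand-in for ten training rows

sample = bootstrap_sample(rows, rng)
features = random_feature_subset(n_features=7, k=3, rng=rng)
vote = majority_vote(["admit", "reject", "admit"])  # hypothetical tree outputs
print(sample, features, vote)
```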

**Comparison Summary:**
- Speed: Perceptron > Decision Tree > Random Forest > SVM
- Strength: Random Forest ≈ SVM > Decision Tree > Perceptron
- Robustness: Random Forest > SVM > Decision Tree > Perceptron
- Feature Type: Perceptron and SVM prefer numerical; Decision Tree and Random Forest handle both types well.
- Statistical Nature: Random Forest and Decision Trees can be viewed as more statistical in nature, while Perceptron and SVM are more optimization-focused.
- Optimization Problem: Perceptron and SVM explicitly solve optimization problems, while Decision Trees and Random Forests do not in a traditional sense.

**Which One to Try First?**

Random Forest is generally a strong first choice for most datasets. It is robust, handles both numerical and categorical data, and tends to perform well across a wide range of problems with minimal tuning. If you need something quick and simple for linearly separable data, you might start with a Perceptron, and if interpretability matters most, a single Decision Tree is easier to explain. For most real-world scenarios, however, Random Forest offers a good balance of performance, robustness, and ease of use.

### Understanding Feature Types in Datasets:

In this section, we will delve into the various types of features that are commonly encountered in datasets. Understanding these feature types is crucial for selecting appropriate preprocessing techniques and machine learning models.

We will cover the following feature types:

- Numerical 
- Nominal 
- Date
- Text
- Image
- Dependent Variable

For each feature type, we will provide examples drawn from existing datasets, such as the Iris dataset, or create examples ourselves to illustrate how these features appear in practice.

**1. Numerical:** Numerical features are quantitative and represent measurable quantities. They can be continuous (e.g., any real number) or discrete (e.g., integers).

Example Values:

From the Iris dataset: Sepal Length (e.g., 5.1, 4.9, 4.7), Petal Width (e.g., 0.2, 0.4, 0.1).

A hypothetical dataset might include Age (e.g., 25, 42, 37) or Income (e.g., 45,000, 60,000, 30,000).


**2. Nominal:** Nominal features are categorical and represent distinct categories or labels without any inherent order.

Example Values:

From the Iris dataset: Species (e.g., Setosa, Versicolor, Virginica).

Another example might include Color (e.g., Red, Blue, Green) or City (e.g., New York, London, Tokyo).

**3. Date:**
Date features represent dates or times, typically in a standardized format.

Example Values:

In a sales dataset: Purchase Date (e.g., 2024-08-29, 2023-05-14).

In a weather dataset: Observation Time (e.g., 2024-08-29 14:30:00).

**4. Text:**
Text features consist of free-form text data, often used for natural language processing (NLP) tasks.

Example Values:

In a customer review dataset: Review Text (e.g., "Great product, highly recommend!", "The service was terrible.").

In a news dataset: Article Body (e.g., "The stock market experienced significant gains today...").

**5. Image:**
Image features are visual data, usually represented as arrays of pixel values. In machine learning, images are often used in computer vision tasks.

Example Values:

From the MNIST dataset: Handwritten digit images (e.g., an image representing the digit '5').

In a medical imaging dataset: X-ray Image (e.g., an image of a chest X-ray).

**6. Dependent Variable:**
The dependent variable, also known as the target or response variable, is the outcome that the model aims to predict.

Example Values:

From the Iris dataset: Species (e.g., Setosa, Versicolor, Virginica), where the goal is to predict the species based on other features.

In a housing price dataset: House Price (e.g., 300,000, 450,000, 250,000), where the model predicts the price based on features like size, location, etc.

**Example Dataset Overview:**

To illustrate all these feature types, here’s a hypothetical combined dataset:

| ID | Sepal Length (cm) | Species    | Purchase Date | Review Text     | Product Image           | Price (USD) |
|----|-------------------|------------|---------------|-----------------|-------------------------|-------------|
| 1  | 5.1               | Setosa     | 2024-08-29    | Great product   | Image of product (file) | 300,000     |
| 2  | 4.9               | Versicolor | 2023-05-14    | Not satisfied   | Image of product (file) | 450,000     |
| 3  | 4.7               | Virginica  | 2022-03-22    | Would buy again | Image of product (file) | 250,000     |


In this example:

- Sepal Length is a numerical feature.
- Species is a nominal feature.
- Purchase Date is a date feature.
- Review Text is a text feature.
- Product Image is an image feature.
- Price is a numerical feature and would be the dependent variable if the goal were to predict prices.
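
Feature types matter because each one usually needs a different conversion before a model can use it. The sketch below, using only the Python standard library, shows one possible treatment of the first row of the hypothetical table: parsing the numerical value, one-hot encoding the nominal value, and splitting the date into parts. The variable names and the choice of one-hot encoding are assumptions for illustration.

```python
from datetime import date

# First row of the hypothetical table, as raw strings
record = {
    "Sepal Length (cm)": "5.1",
    "Species": "Setosa",
    "Purchase Date": "2024-08-29",
}

# Numerical: convert the text to a float
sepal_length = float(record["Sepal Length (cm)"])

# Nominal: one-hot encode against the known categories (no order implied)
categories = ["Setosa", "Versicolor", "Virginica"]
species_one_hot = [1 if record["Species"] == name else 0 for name in categories]

# Date: parse the ISO string and derive numerical parts
purchase_date = date.fromisoformat(record["Purchase Date"])
date_parts = {
    "year": purchase_date.year,
    "month": purchase_date.month,
    "weekday": purchase_date.weekday(),  # Monday is 0
}

print(sepal_length, species_one_hot, date_parts)
```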

### Problem statement

**Implement a correlation program from scratch to analyze the correlations between the features in the Admission_Predict.csv dataset file. This dataset contains 9 features and 500 data points, and it is sourced from Kaggle. Remember, you are not allowed to use library functions such as NumPy's mean(), std(), or cov(). You may use DataFrame.corr() only to verify the correctness of your manually implemented correlation matrix.**

### Implementation:

To implement a correlation program from scratch, you will need to calculate the correlation coefficients between the features of the dataset without relying on functions like mean(), std(), or cov() from libraries like NumPy.

**Steps to Calculate Pearson Correlation Coefficient:**
- Mean Calculation: Compute the mean of each feature.
- Standard Deviation Calculation: Compute the standard deviation of each feature.
- Covariance Calculation: Compute the covariance between each pair of features.
- Correlation Coefficient Calculation: Use the formula for Pearson correlation.

**Pearson Correlation Coefficient Formula:**

$$ \rho(X, Y) = \frac{\text{cov}(X, Y)}{\sigma_X \sigma_Y} $$

where 

**$ \text{cov}(X, Y) $** is the covariance of **$ X \text{ and } Y $**

**$ \sigma_X \text{ and } \sigma_Y $** are the standard deviations of **$ X \text{ and } Y $**
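
One way to spell out the two ingredients of this formula is shown below. It assumes the population convention (dividing by $n$), and the symbols $n$ (number of data points), $x_i$ and $y_i$ (the $i$-th values), and $\bar{x}$ and $\bar{y}$ (the means) are notation introduced here for illustration.

$$ \text{cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \qquad \sigma_X = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2} $$

If the sample convention (dividing by $n - 1$) is used for both the covariance and the standard deviations, the factor cancels in the ratio, so the resulting correlation coefficient is the same.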



Let’s start by loading the dataset and implementing these steps in Python.

```python
import pandas as pd

# Load the dataset
file_path = 'Admission_Predict.csv'
df = pd.read_csv(file_path)

# Display the first few rows of the dataset to understand its structure
df.head()
```


The dataset contains the following features:

- Serial No.
- GRE Score
- TOEFL Score
- University Rating
- SOP (Statement of Purpose)
- LOR (Letter of Recommendation)
- CGPA
- Research
- Chance of Admit

**Step 1: We'll exclude the Serial No. column as it’s just an identifier and doesn't contribute to the analysis.**

```python
# Write your code here
```


<details><summary>Click here for the solution</summary>
 
```python
df = df.drop(columns=['Serial No.'])
```
 
</details>

**Step 2: Calculate and print the mean for each feature**

```python
# Write your code here
def calculate_mean(column):
    pass  # TODO: return the mean of the values in column


# Calculate means for each column


# Print the means of each feature
```


<details><summary>Click here for the solution</summary>
 
```python
def calculate_mean(column):
    total_sum = sum(column)
    mean_value = total_sum / len(column)
    return mean_value

# Calculate means for each column
means = {column: calculate_mean(df[column]) for column in df.columns}

# Print the means of each feature
print("Means for each feature:")
for column, mean_value in means.items():
    print(f"{column}: {mean_value}")
```
 
</details>


**Step 3: Calculate and print the standard deviation for each feature**

```python
# Write your code here
def calculate_std_dev(column, mean):
    pass  # TODO: return the standard deviation of column, given its mean


# Calculate standard deviations for each column


# Print the standard deviations for each feature
```


<details><summary>Click here for the solution</summary>
 
```python
def calculate_std_dev(column, mean):
    # Population standard deviation: divide by n rather than n - 1
    variance = sum((x - mean) ** 2 for x in column) / len(column)
    std_dev = variance ** 0.5
    return std_dev

# Calculate standard deviations for each column
std_devs = {column: calculate_std_dev(df[column], means[column]) for column in df.columns}

# Print the standard deviations for each feature
print("Standard deviations for each feature:")
for column, std_dev_value in std_devs.items():
    print(f"{column}: {std_dev_value}")

```
 
</details>

**Step 4: Calculate and print the covariance between each pair of features**

```python
# Write your code here
def calculate_covariance(col1, col2, mean1, mean2):
    pass  # TODO: return the covariance of col1 and col2, given their means


# Calculate covariance matrix


# Print the covariance matrix
```


<details><summary>Click here for the solution</summary>
 
```python
def calculate_covariance(col1, col2, mean1, mean2):
    # Population covariance: divide by n, consistent with calculate_std_dev
    covariance = sum((x - mean1) * (y - mean2) for x, y in zip(col1, col2)) / len(col1)
    return covariance

# Calculate covariance matrix
cov_matrix = {}
for col1 in df.columns:
    cov_matrix[col1] = {}
    for col2 in df.columns:
        cov_matrix[col1][col2] = calculate_covariance(df[col1], df[col2], means[col1], means[col2])

# Print the covariance matrix
print("Covariance matrix:")
for col1, covs in cov_matrix.items():
    row = [f"{covs[col2]:.4f}" for col2 in df.columns]  # Format to 4 decimal places
    print(f"{col1}: {', '.join(row)}")

```
 
</details>

Now, let's proceed to compute the correlation matrix from scratch.

**Step 5: Calculate the correlation matrix**

```python
# Calculate correlation matrix
corr_matrix = {}
for col1 in df.columns:
    pass  # TODO: fill corr_matrix[col1] with the correlation for every col2


# Convert correlation matrix to DataFrame for readability


# Print the correlation matrix DataFrame
```


<details><summary>Click here for the solution</summary>
 
```python
corr_matrix = {}
for col1 in df.columns:
    corr_matrix[col1] = {}
    for col2 in df.columns:
        if std_devs[col1] == 0 or std_devs[col2] == 0:
            corr_matrix[col1][col2] = 0  # Undefined for a constant column; 0 is a placeholder that avoids division by zero
        else:
            corr_matrix[col1][col2] = cov_matrix[col1][col2] / (std_devs[col1] * std_devs[col2])

# Convert correlation matrix to DataFrame for readability
corr_df = pd.DataFrame(corr_matrix)

# Print the correlation matrix DataFrame
print("Correlation matrix:")
print(corr_df)

```
 
</details>

The correlation matrix has been calculated from scratch and presented above. Each value in the matrix represents the Pearson correlation coefficient between two features in the dataset. The values range between -1 and 1, where:

- 1 indicates a perfect positive correlation.
- −1 indicates a perfect negative correlation.
- 0 indicates no linear correlation (the features may still be related in a non-linear way).
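
To build intuition for these three reference values, here is a small sketch on made-up lists (not drawn from the dataset). The `pearson` helper is a compact, assumed restatement of the steps above using the population formulas.

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (population formulas)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = (sum((x - mean_x) ** 2 for x in xs) / n) ** 0.5
    std_y = (sum((y - mean_y) ** 2 for y in ys) / n) ** 0.5
    return cov / (std_x * std_y)


xs = [1, 2, 3, 4, 5]

r_positive = pearson(xs, [2, 4, 6, 8, 10])  # y rises in step with x
r_negative = pearson(xs, [10, 8, 6, 4, 2])  # y falls in step with x
r_symmetric = pearson(xs, [1, 2, 3, 2, 1])  # y rises, then falls

print(round(r_positive, 6), round(r_negative, 6), round(r_symmetric, 6))
```

The third case is worth noting: `y` clearly depends on `x`, yet the coefficient is essentially zero because the rising and falling halves cancel out. A coefficient near zero rules out a linear relationship, not every relationship.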

### Verification:

To ensure the correctness of our implementation, we will compare this from-scratch correlation matrix with the one generated by DataFrame.corr().

```python
# Verify the results using DataFrame.corr()
verification_corr_df = df.corr()
verification_corr_df
```

The correlation matrix calculated from scratch matches the one generated by DataFrame.corr(), up to floating-point rounding. This confirms that the from-scratch implementation is correct.
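
Comparing two printed matrices by eye is error-prone, so a numerical check can help. The sketch below is one possible approach: it builds a from-scratch matrix on a tiny toy DataFrame (hypothetical values standing in for `df`) and reports the largest absolute difference from `DataFrame.corr()`. The `pearson` helper, the variable names, and the `1e-9` tolerance are assumptions for illustration.

```python
import pandas as pd


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (population formulas)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov / (var_x * var_y) ** 0.5


# Toy stand-in for df (hypothetical values, not from Admission_Predict.csv)
toy_df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 1.0, 4.0, 3.0]})

# From-scratch matrix as a dict of dicts, then as a DataFrame
manual_corr = pd.DataFrame(
    {c1: {c2: pearson(toy_df[c1], toy_df[c2]) for c2 in toy_df.columns}
     for c1 in toy_df.columns}
)
reference_corr = toy_df.corr()

# Largest element-wise gap between the two matrices
max_abs_diff = (manual_corr - reference_corr).abs().max().max()
matches = bool(max_abs_diff < 1e-9)
print(f"Largest absolute difference: {max_abs_diff:.2e}, matches: {matches}")
```

The same comparison can be applied to the lab's `corr_df` and `verification_corr_df`. Comparing with a tolerance rather than `==` avoids false alarms caused by tiny rounding differences.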

### Here are some self-check questions based on the above analysis:

**1. Should we use 'Serial No.'? Why or why not?**

<details><summary>Click here for the answer</summary>

No, we should not use 'Serial No.' in the correlation analysis. 'Serial No.' is simply an identifier for each observation in the dataset and does not carry any meaningful information that could contribute to predicting the target variable, 'Chance of Admit'. Including it would not add value and could mislead: any correlation it shows with other columns would reflect only the order in which the rows were recorded, not a real relationship.
 
</details>

**2. Observe that the diagonal of this matrix should have all 1's; why is this?**

<details><summary>Click here for the answer</summary>

The diagonal of the correlation matrix represents the correlation of each feature with itself. The covariance of a feature with itself is its variance, so $ \rho(X, X) = \sigma_X^2 / (\sigma_X \sigma_X) = 1 $ for any feature that is not constant. Therefore, all diagonal entries in the correlation matrix are 1 by definition. This is a standard property of correlation matrices.
 
</details>

**3. Since the last column can be used as the target (dependent) variable, what do you think about the correlations between all the variables?**

<details><summary>Click here for the answer</summary>

The correlations between the features and the target variable ('Chance of Admit') reveal how strongly each feature is related to the probability of admission. For example, features like 'CGPA' and 'GRE Score' show strong positive correlations with 'Chance of Admit', indicating that higher scores in these areas are associated with a higher chance of admission. Conversely, features with lower correlations, like 'Research', may still contribute to the prediction but are less strongly associated with the target variable.
 
</details>

**4. Which variable should be the most important to try to predict 'Chance of Admit'?**

<details><summary>Click here for the answer</summary>

Based on the correlation matrix, 'CGPA' appears to be the most important variable for predicting 'Chance of Admit', as it has the highest correlation coefficient with the target variable (0.882). This indicates that CGPA has a strong positive linear relationship with the likelihood of admission, which makes it the natural first feature to include when developing a model to predict 'Chance of Admit'. Keep in mind that correlation measures association, not causation.
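
Rather than reading the ranking off the printed matrix, it can be computed. The sketch below assumes a correlation matrix stored as a dict of dicts, like `corr_matrix` in this lab; the function name and the toy values are hypothetical and are not results from the dataset.

```python
def rank_by_target_correlation(corr_matrix, target):
    """Sort features by the absolute value of their correlation with target."""
    scores = {
        feature: abs(row[target])
        for feature, row in corr_matrix.items()
        if feature != target
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy correlation matrix (hypothetical values, for illustration only)
toy_corr = {
    "x1": {"x1": 1.0, "x2": 0.3, "y": 0.8},
    "x2": {"x1": 0.3, "x2": 1.0, "y": -0.5},
    "y": {"x1": 0.8, "x2": -0.5, "y": 1.0},
}

ranking = rank_by_target_correlation(toy_corr, "y")
print(ranking)
```

The absolute value is used so that a strong negative correlation ranks as high as a strong positive one. When applying this to the lab's `corr_matrix`, check `df.columns` for the exact spelling of the target column name, since some copies of CSV files carry stray spaces in their headers.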

 
</details>

### Summary:
This lab covered key machine learning concepts, including classifier comparisons, feature types, and correlation analysis. We implemented a Pearson correlation matrix from scratch and found that 'CGPA' has the strongest correlation with 'Chance of Admit', making it the most important single predictor. The exercises reinforced the importance of understanding data and feature relationships in building effective predictive models.