#### Andrew Taylor
#### EN705.601.83 Applied Machine Learning
#### September 8, 2023

### Homework #2 Notebook

#### Question #1: Classifiers

Let's answer this question with some descriptions, then in another section I'll compare and contrast.

#### Description of ML Techniques:

**Perceptron:**
The perceptron is one of the earliest and simplest artificial neural network architectures. It was introduced by Frank Rosenblatt in 1957. The perceptron consists of a single layer of neurons that make binary decisions. It takes a vector of inputs, multiplies them with its weights, sums the products, and then passes the sum through a step function (typically a unit step function) to produce an output of either 0 or 1.
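
To make the weighted-sum-then-step description concrete, here is a minimal sketch in plain Python. The function names, the learning rate, and the toy AND dataset are my own illustrative choices, and the update rule is the standard perceptron rule rather than anything specific to this assignment.

```python
def step(z):
    """Unit step activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0


def predict(weights, bias, x):
    """Weighted sum of the inputs, passed through the step function."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return step(z)


def train_perceptron(samples, labels, lr=1.0, epochs=20):
    """Standard perceptron rule: weights change only on misclassified samples."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = y - predict(weights, bias, x)  # -1, 0, or +1
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


# Toy data (assumed for illustration): logical AND, which is linearly separable.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]

weights, bias = train_perceptron(X, y)
predictions = [predict(weights, bias, x) for x in X]
```

Because AND is linearly separable, the updates stop once every sample is classified correctly.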

**Support Vector Machines (SVM):**
SVM is a supervised machine learning algorithm used for classification or regression. Introduced in the 1990s, it works by finding the hyperplane that best divides a dataset into classes. The primary principle is to maximize the margin between the closest data points (support vectors) of two classes. SVMs can be linear or non-linear, depending on the kernel used.
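
The idea of "maximizing the margin" can be illustrated without training anything. The sketch below only measures the margin of two candidate hyperplanes on hypothetical toy points; it does not solve the SVM optimization problem itself. The function name and the data are assumptions for illustration.

```python
import math


def geometric_margin(w, b, points):
    """Smallest distance from any point to the hyperplane w.x + b = 0."""
    norm = math.hypot(*w)
    return min(
        abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x in points
    )


# Hypothetical toy points from two classes (illustrative only).
points = [(2, 0), (3, 1), (-2, 0), (-3, -1)]

margin_a = geometric_margin((1, 0), 0, points)  # boundary x1 = 0
margin_b = geometric_margin((1, 1), 0, points)  # boundary x1 + x2 = 0
```

Both boundaries separate the two classes, but the first leaves a wider gap to the closest points, so it is the one a maximum-margin classifier would prefer between these two.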

**Decision Tree:**
A decision tree is a flowchart-like structure in which each internal node represents a test on a feature (or attribute), each branch represents an outcome of that test, and each leaf node represents a prediction. The topmost node is known as the root node. The tree is learned by recursively partitioning the data on feature values. Decision trees can be used for both classification and regression.
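
The flowchart structure can be sketched as a nested dictionary. The tree below is hand-written, and its feature names and thresholds are hypothetical values chosen for illustration, not splits learned from data.

```python
# Hypothetical hand-written tree; thresholds are illustrative, not learned.
tree = {
    "feature": "petal_length",
    "threshold": 2.5,
    "left": {"label": "setosa"},
    "right": {
        "feature": "petal_width",
        "threshold": 1.7,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, testing one feature per internal node."""
    while "label" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


sample = {"petal_length": 4.8, "petal_width": 1.5}
prediction = predict(tree, sample)
```

Each prediction is a path of simple threshold tests, which is what makes a single tree easy to read as a set of rules.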

**Random Forest:**
Random Forest is an ensemble learning method that constructs a multitude of decision trees at training time and outputs the class that is the mode of the individual trees' predictions (classification) or the mean of their predictions (regression). Each tree is trained on a bootstrap sample of the data and considers a random subset of features at each split, and aggregating the trees' outputs reduces overfitting compared with a single tree.
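
The two ingredients described above, resampling the training data and aggregating the trees' outputs, can be sketched with the standard library. The helper names and the example votes are hypothetical; no trees are actually trained here.

```python
import random
from statistics import mean, mode


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, one such sample per tree."""
    return [rng.choice(rows) for _ in rows]


def aggregate_classification(tree_votes):
    """Classification output: the most common class among the trees."""
    return mode(tree_votes)


def aggregate_regression(tree_outputs):
    """Regression output: the mean of the trees' predictions."""
    return mean(tree_outputs)


rng = random.Random(0)
rows = list(range(10))  # stand-in for ten training examples
sample = bootstrap_sample(rows, rng)

# Hypothetical per-tree predictions for one input.
class_prediction = aggregate_classification(["cat", "dog", "cat", "cat", "dog"])
value_prediction = aggregate_regression([2.0, 3.0, 4.0])
```

Since sampling is done with replacement, a bootstrap sample usually repeats some rows and omits others, which is what makes the individual trees differ.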

---

#### Comparison and Contrast:

1. **Nature**:
   - *Perceptron:* Neural network-based.
   - *SVM:* Margin-based classifier.
   - *Decision Tree:* Rule-based.
   - *Random Forest:* Ensemble of decision trees.

2. **Model Complexity**:
   - *Perceptron:* Simple with a single layer.
   - *SVM:* Can be complex especially with non-linear kernels.
   - *Decision Tree:* Complexity varies with depth and branching.
   - *Random Forest:* More complex due to multiple trees.

3. **Handling Non-linear Data**:
   - *Perceptron:* Struggles with non-linearly separable data.
   - *SVM:* Can handle non-linearity using kernels.
   - *Decision Tree:* Can handle non-linear data inherently.
   - *Random Forest:* Naturally handles non-linearity because its base learners are decision trees.

4. **Overfitting**:
   - *Perceptron:* More likely to underfit than overfit: on non-linearly separable data the learning rule never converges. It also has no built-in regularization.
   - *SVM:* Less prone due to margin optimization, but choice of kernel and parameters matters.
   - *Decision Tree:* Can easily overfit if not pruned.
   - *Random Forest:* Reduces overfitting through ensemble learning.

5. **Training Speed**:
   - *Perceptron:* Fast as it is a simple model.
   - *SVM:* Slower, especially for large datasets or with complex kernels.
   - *Decision Tree:* Moderate, depends on the depth and branching.
   - *Random Forest:* Slower due to multiple trees, but can be parallelized.

6. **Interpretability**:
   - *Perceptron:* Moderately interpretable due to weights.
   - *SVM:* Less interpretable, especially with non-linear kernels.
   - *Decision Tree:* Highly interpretable as rules can be visualized.
   - *Random Forest:* Less interpretable than a single decision tree but provides feature importance.
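
The point about handling non-linear data can be checked on XOR, the classic non-linearly separable problem. This sketch is an illustration under my own assumptions: it brute-forces a small grid of linear boundaries rather than training a perceptron, and the tree rule is written by hand.

```python
from itertools import product

X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y_xor = [0, 1, 1, 0]


def linear_predict(w1, w2, b, x):
    """A single linear boundary followed by a step function."""
    return 1 if w1 * x[0] + w2 * x[1] + b >= 0 else 0


def tree_predict(x):
    """Two nested threshold tests, as a depth-2 decision tree would use."""
    if x[0] <= 0.5:
        return 1 if x[1] > 0.5 else 0
    return 0 if x[1] > 0.5 else 1


# Try every (w1, w2, b) on a coarse grid from -3.0 to 3.0 in steps of 0.5.
grid = [v / 2 for v in range(-6, 7)]
best_linear = max(
    sum(linear_predict(w1, w2, b, x) == t for x, t in zip(X, y_xor))
    for w1, w2, b in product(grid, repeat=3)
)
tree_correct = sum(tree_predict(x) == t for x, t in zip(X, y_xor))
```

No linear boundary on the grid classifies all four points, while two axis-aligned tests do, which matches the comparison in the list above.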

---

Now let's answer the specific questions:

---

**Optimization Problem and Cost Function:**

1. **Perceptron:**
   - **Optimization Problem:** Yes.
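   - **Cost Function:** The perceptron minimizes the perceptron criterion: the sum of $-y_i(\mathbf{w} \cdot \mathbf{x}_i + b)$ over misclassified samples (with labels $y_i \in \{-1, +1\}$). Correctly classified samples contribute zero, so the updates drive the number of misclassified samples down.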
   
2. **Support Vector Machines (SVM):**
   - **Optimization Problem:** Yes.
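   - **Cost Function:** The soft-margin SVM minimizes the hinge loss plus a regularization term on the weights, $\frac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_i \max(0,\, 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b))$. Minimizing $\lVert\mathbf{w}\rVert$ is equivalent to maximizing the margin between the two classes.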
   
3. **Decision Tree:**
   - **Optimization Problem:** Yes.
   - **Cost Function:** Decision trees have no single global cost function that is minimized directly. Instead, each split is chosen greedily to minimize an impurity measure such as entropy, Gini impurity, or classification error.
   
4. **Random Forest:**
   - **Optimization Problem:** Yes, but at the individual tree level.
   - **Cost Function:** Like decision trees, random forests use metrics like entropy or Gini impurity for their individual trees.
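
The quantities named in this list are short enough to write out. The sketch below uses the usual textbook definitions, with labels in {-1, +1} for the two margin-based losses; the function names are my own.

```python
import math
from collections import Counter


def perceptron_loss(y, score):
    """Perceptron criterion: zero if correct, -y * score if misclassified."""
    return max(0.0, -y * score)


def hinge_loss(y, score):
    """SVM hinge loss: also penalizes correct predictions inside the margin."""
    return max(0.0, 1.0 - y * score)


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


# A correctly classified sample that lies inside the margin (score 0.5).
inside_margin = (perceptron_loss(+1, 0.5), hinge_loss(+1, 0.5))

# A perfectly mixed node versus a pure node.
mixed = ["a", "a", "b", "b"]
pure = ["a", "a", "a", "a"]
impurities = (gini(mixed), entropy(mixed), gini(pure))
```

The first pair shows the difference between the two losses: the perceptron is satisfied as soon as the sign is right, while the hinge loss keeps pushing the sample out past the margin.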

---

**Speed, Strength, Robustness, and Statistical considerations:**

1. **Perceptron:**
   - **Speed:** Fast.
   - **Strength:** Good for linearly separable data.
   - **Robustness:** Sensitive to noisy data and outliers.
   - **Statistical:** Converges only if the data are linearly separable; otherwise the weights keep changing and the model underfits.
   
2. **SVM:**
   - **Speed:** Moderate to slow, depending on kernel and dataset size.
   - **Strength:** Effective for both linear and certain non-linear patterns.
   - **Robustness:** Robust against overfitting, especially in high-dimensional space.
   - **Statistical:** Effective, but can be sensitive to the choice of kernel and parameters.
   
3. **Decision Tree:**
   - **Speed:** Fast to moderate.
   - **Strength:** Can capture non-linear relationships.
   - **Robustness:** Prone to overfitting if not pruned.
   - **Statistical:** High variance: small changes in the training data can produce very different trees.
   
4. **Random Forest:**
   - **Speed:** Slower due to multiple trees, but can be parallelized.
   - **Strength:** Can capture complex patterns and relationships.
   - **Robustness:** More robust against overfitting compared to individual decision trees.
   - **Statistical:** Provides a measure of feature importance and reduces variance.

---

**Feature Type Classifier Naturally Uses:**

1. **Perceptron:**
   - Numerical features, which it combines linearly through its weights. Categorical features must be encoded numerically first.
   
2. **SVM:**
   - Numerical features, used directly (linear kernel) or implicitly transformed by a non-linear kernel. Feature scaling usually matters.
   
3. **Decision Tree:**
   - Numerical or categorical features, used directly: each split tests a single feature, and the split is chosen by entropy, Gini impurity, or classification error.
   
4. **Random Forest:**
   - Uses features directly like decision trees but across multiple trees.

---

**Which One to Try First on a Dataset?**

The choice of which model to try first on a dataset depends on the nature and size of the dataset, as well as the specific problem at hand. However, as a general guideline:

- For linearly separable data, starting with a perceptron or linear SVM can be a good choice.
- For datasets with complex non-linear patterns but not too large in size, SVM with non-linear kernels can be effective.
- Decision trees can be a good starting point due to their interpretability and ability to handle non-linear data.
- Random Forest is often a good default choice for many datasets due to its robustness and ability to handle both linear and non-linear patterns.
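
One way to act on these guidelines is to cross-validate all four models with default-ish settings and use the results as a baseline. This is a minimal sketch that assumes scikit-learn is installed; the choice of the Iris dataset, 5-fold cross-validation, and the hyperparameters shown are mine, not part of the assignment.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Largely default settings; random_state is fixed only for repeatability.
models = {
    "perceptron": Perceptron(random_state=0),
    "svm_rbf": SVC(kernel="rbf"),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean accuracy over 5 cross-validation folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The numbers depend on the dataset and library version, so they are best read as a relative baseline rather than a ranking of the algorithms in general.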

---

In conclusion, the ideal model often varies with the nature of the data and problem. It's beneficial to start with a simpler model to establish a baseline and then explore more complex models as needed.

#### Question #2: Definitions of Feature Types

##### 1. Numerical
**Definition:** Numerical features represent measurable quantities expressed as numbers. They can be further divided into continuous (can take any value in a range) and discrete (can only take certain specific values, such as counts).

**Example from Iris dataset:** 
- sepal length: This is a continuous numerical feature as it can take any value within a range to represent the length of the sepal in centimeters.

##### 2. Nominal
**Definition:** Nominal features are categorical features that don’t have a natural order or ranking. They can take two or more categories, but there's no intrinsic ordering to the categories.

**Example from Iris dataset:** 
- species: This is a nominal feature as it can take values like "setosa", "versicolor", or "virginica". There's no inherent order to these species names.
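
Because nominal categories have no order, a common way to hand them to a numerical model is one-hot encoding, which avoids implying that one category is "larger" than another. The sketch below is a minimal illustration with a hypothetical helper name.

```python
def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    return [1 if value == category else 0 for category in categories]


species_categories = ["setosa", "versicolor", "virginica"]

# Two example values encoded against the fixed category list.
encoded = [one_hot(s, species_categories) for s in ["virginica", "setosa"]]
```

Encoding the species as 0, 1, 2 instead would impose an ordering that the feature does not have.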

##### 3. Date
**Definition:** Date features represent specific days, months, years, or even timestamps. They can be used to track the progression of time.

**Example from an Air Quality dataset:** 
- **Air Quality Dataset**: This dataset contains daily readings of the air quality values from 2004 to 2005. A feature like Date in this dataset would indicate the specific day when the air quality was recorded.
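
A raw date is rarely fed to a classifier as-is; it is usually broken into numerical or categorical parts first. The sketch below assumes an ISO-style `YYYY-MM-DD` string, and the example date and the chosen derived fields are hypothetical, not taken from the dataset.

```python
from datetime import datetime


def date_features(date_string):
    """Split a YYYY-MM-DD date into simple model-ready features."""
    parsed = datetime.strptime(date_string, "%Y-%m-%d")
    return {
        "year": parsed.year,
        "month": parsed.month,
        "day": parsed.day,
        "weekday": parsed.weekday(),  # Monday is 0, Sunday is 6
        "is_weekend": parsed.weekday() >= 5,
    }


# Hypothetical example date within the 2004-2005 range mentioned above.
features = date_features("2004-03-10")
```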

##### 4. Text
**Definition:** Text features consist of words, sentences, or paragraphs. These are typically unstructured and require special preprocessing techniques to extract meaningful information.

**Example from Newsgroups dataset:** 
- **20 Newsgroups**: This is a dataset for text classification, containing newsgroup documents, organized into 20 different newsgroups. Each document is a collection of text, representing the content of a post or an article.
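
One simple form of the preprocessing mentioned above is a bag-of-words representation, where each document becomes a vector of word counts. This is a minimal standard-library sketch; the two sentences are made-up stand-ins, not actual posts from the dataset, and the tokenizer is deliberately crude.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


# Hypothetical documents standing in for newsgroup posts.
documents = [
    "The shuttle launch was delayed.",
    "The launch window reopens after the delay.",
]

# Shared vocabulary: every distinct token, in a fixed (sorted) order.
vocabulary = sorted({token for doc in documents for token in tokenize(doc)})


def vectorize(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[token] for token in vocabulary]


vectors = [vectorize(doc) for doc in documents]
```

Every document ends up with the same vector length, which is what lets a classifier treat unstructured text as ordinary numerical features.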

##### 5. Image
**Definition:** Image features are typically represented as matrices of pixel values. Each pixel can have one (for grayscale images) or multiple values (for color images).

**Example from the CIFAR-10 dataset:** 
- **CIFAR-10**: This dataset consists of 60,000 32x32 color images in 10 different classes, representing objects like 'airplane', 'automobile', 'bird', etc. Each image is represented as a 3-dimensional array of pixel values (32x32 pixels and 3 channels for RGB).
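
The 32x32x3 layout can be sketched with nested lists. The image below is synthetic (a made-up red gradient), not loaded from CIFAR-10, and the channel-average grayscale conversion is just one simple choice.

```python
HEIGHT, WIDTH, CHANNELS = 32, 32, 3

# Hypothetical synthetic image: red increases left to right, no green or blue.
image = [
    [[col * 8, 0, 0] for col in range(WIDTH)]
    for _ in range(HEIGHT)
]

# Rows x columns x channels, the same layout described above.
shape = (len(image), len(image[0]), len(image[0][0]))

# Flatten to one long feature vector, as a non-convolutional model would need.
flat = [value for row in image for pixel in row for value in pixel]

# Collapse the three channels into one by averaging (a simple grayscale).
grayscale = [[sum(pixel) // CHANNELS for pixel in row] for row in image]
```

Flattening gives 3,072 numerical features per image, which shows why image data is high-dimensional even at this small resolution.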
