# Homework 2

- Welcome to your second homework assignment! In this task, you will apply Support Vector Machines (SVM) along with any other model of your choice.

- You will compare the performance of these models, utilizing some of the techniques you've learned so far.

- For this homework, you will be using the 'Steel Plates Faults' dataset, and your goal will be to detect the faults and classify them by type.

## Steel Plates Faults Dataset

- **Total Fields**: 34
- **Input Features**: 27 Fields (unknown specifics)
- **Class Labels**: 7 Fields (one-hot encoded)

### Features and Labels

#### Input Features (First 27 Fields)
- These fields describe various characteristics or indicators related to the geometric shape and outline of defects seen in images of stainless steel plates.
- The specific nature of these indicators is not disclosed in the provided information. They are likely derived from image processing techniques and may encompass metrics related to contour, area, shape, or texture.

#### Class Labels (Last 7 Fields)
- The last seven columns of the dataset represent the one-hot encoded classes of the defects.
- Specifically, the classes are Pastry, Z_Scratch, K_Scatch, Stains, Dirtiness, Bumps, and Other_Faults.
- In one-hot encoding:
  - Each class is represented by a binary vector where only one element is '1' (True) indicating the presence of that class, and the others are '0' (False).
  - For example, if a steel plate has a "Stains" defect, the corresponding class column for "Stains" will have a 1, while all other class columns will have 0s.
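
Many classifiers expect a single target column rather than seven binary ones. The following is a minimal sketch of collapsing the one-hot columns into one label per row. The three toy rows are hypothetical and only stand in for the last seven columns of the real file.

```python
import pandas as pd

# Class column names as listed above (spelling follows the dataset).
label_cols = ["Pastry", "Z_Scratch", "K_Scatch", "Stains",
              "Dirtiness", "Bumps", "Other_Faults"]

# Hypothetical toy rows standing in for the last seven columns of the real file.
toy = pd.DataFrame(
    [[0, 0, 0, 1, 0, 0, 0],
     [1, 0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 1, 0]],
    columns=label_cols,
)

# Each row should contain exactly one 1 if the encoding is truly one-hot.
assert (toy[label_cols].sum(axis=1) == 1).all()

# idxmax(axis=1) returns the name of the column holding the row maximum (the 1).
fault_type = toy[label_cols].idxmax(axis=1)
print(fault_type.tolist())  # ['Stains', 'Pastry', 'Bumps']
```

Calling `fault_type.value_counts()` on the decoded column is a quick way to see how many plates fall into each class.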

### Considerations for Machine Learning

- **Preprocessing**: Understanding the nature of the first 27 fields is crucial. Feature scaling or normalization may be necessary depending on the distribution of these features.
  
- **Model Selection**: Different classification algorithms can be experimented with, such as:
  - Logistic Regression
  - Decision Trees
  - Random Forests
  - Support Vector Machines
  - Neural Networks
  
- **Evaluation**: Metrics such as accuracy, precision, recall, F1-score, and confusion matrices can be useful for assessing model performance. Per-class metrics matter in particular if the classes are imbalanced, because accuracy alone can hide poor performance on classes with few instances.
  
- **Cross-validation**: Implementing techniques like k-fold cross-validation can help ensure that the model generalizes well across unseen data.
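
As a rough illustration of how preprocessing, model selection, and cross-validation fit together, the sketch below places the scaler inside a scikit-learn pipeline so that it is re-fitted on each training fold. The data is synthetic (`make_classification`), chosen only to mirror the 27-feature, 7-class shape of the task. The kernel, the number of folds, and the scoring metric are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 27 features and 7 classes, mirroring only the shape of the task.
X, y = make_classification(n_samples=700, n_features=27, n_informative=10,
                           n_classes=7, n_clusters_per_class=1, random_state=0)

# Scaling lives inside the pipeline, so it is re-fitted on each training fold
# and never sees the corresponding validation fold.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# For classifiers, an integer cv uses stratified k-fold splitting.
scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")
print(scores.mean(), scores.std())
```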

This dataset sits at the intersection of image processing and machine learning for defect classification. Although the 27 input features are not documented in detail, the dataset is well suited to implementing and testing various machine learning strategies for classification tasks.


### Acknowledgements

- Buscema, M., Terzi, S., & Tastle, W. (2010). Steel Plates Faults [Dataset]. UCI Machine Learning Repository. [https://doi.org/10.24432/C5J88N](https://doi.org/10.24432/C5J88N).

- Lichman, M. (2013). UCI Machine Learning Repository. Retrieved from [http://archive.ics.uci.edu/ml](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.

- The CSV version of the data has been sourced from the Kaggle profile of the UCI Machine Learning Repository, with contributions from collaborator Chris Crawford.

## Important Notice for Students

As you work on this homework assignment, it is imperative to complete your tasks independently and engage with the material actively. While tools like AI can provide assistance and guidance, relying too heavily on AI-generated content or directly copying code and solutions from the internet poses several risks:

1. **Academic Integrity**: Submitting work that is not your own can lead to serious consequences, including loss of credit or disciplinary action, if detected. Our department utilizes plagiarism detection tools that can identify content that is not original.

2. **Learning and Understanding**: This homework is designed to deepen your understanding of the subject matter. Engaging with the material personally fosters critical thinking and problem-solving skills that are essential for your academic and professional success.

3. **Mastery of Skills**: Completing assignments on your own allows you to practice and master the skills being taught. This understanding will benefit you far more than merely replicating answers or code generated by AI or found online.

Therefore, we strongly encourage you to approach this assignment with integrity, using resources to enhance your learning while ensuring that all work submitted reflects your own understanding and capabilities. Your educational journey is ultimately about growth and mastery—embrace it!

### Submitting Your Homework

_Please write your solutions **only** in this notebook, focusing on the related fields, without deleting any previously written content. **DO NOT** use any other formats such as ".pdf", ".docx" etc._

Do your coding in the related areas and provide comments **in Markdown**. If you're unfamiliar with Markdown, don’t worry—it's very easy! You can learn more about it [here](https://www.markdownguide.org/basic-syntax/).

Once you have completed your work, kindly **rename the file with your school number** (*e.g.* `10MECT1234.ipynb`) and upload it accordingly.

**Good luck**, and enjoy the process of exploring and applying your learning!

## Steel Plates Faults Detection

***Visualization is crucial for making meaningful comparisons, so do not rely solely on numerical values.***

***It's essential to interpret and report insights directly from the graphs to grasp the true performance of your models.***

### 0. Data Preparation
- Import the dataset into your working environment, usually with a library such as pandas.

You can find the dataset in the data folder under the name 'steel-plates-faults.csv'.
> ```python
> import pandas as pd
> data = pd.read_csv('data/steel-plates-faults.csv')
> ```
- Handle any missing values through appropriate imputation techniques or removal.
- Verify that data types are accurate, with a specific focus on dates and categorical variables.
- Normalize or standardize features as needed, particularly for Support Vector Classification (SVC).
- Implement any additional methods you've learned so far, if necessary.
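
The checks above could look roughly like the sketch below. It uses a hypothetical toy frame: `feat_a` and `feat_b` are placeholder names, not columns of the real dataset, and median imputation is just one possible choice.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame; the column names are placeholders, not real feature names.
df = pd.DataFrame({
    "feat_a": [1.0, 2.0, np.nan, 4.0],
    "feat_b": ["10", "20", "30", "40"],   # numeric values stored as text
})

print(df.isna().sum())   # missing values per column
print(df.dtypes)         # feat_b is stored as text and needs conversion

# Convert mis-typed columns; errors="coerce" turns unparseable entries into NaN.
df["feat_b"] = pd.to_numeric(df["feat_b"], errors="coerce")

# Median imputation followed by standardization (zero mean, unit variance).
filled = SimpleImputer(strategy="median").fit_transform(df)
scaled = StandardScaler().fit_transform(filled)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

In a real workflow the imputer and scaler would be fitted on the training portion only and then applied to the test portion, so that no information leaks from the test data.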

### 1. Support Vector Machines
#### 1.1. Train/Test Split 
- Perform a train/test split (e.g., 70/30 or 80/20)
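
A minimal sketch of a 70/30 split on synthetic data follows. The `stratify=y` argument and the fixed `random_state` are choices made for this example, not requirements of the assignment.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 27 features, an imbalanced 3-class target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 27))
y = np.array([0] * 60 + [1] * 30 + [2] * 10)

# stratify=y keeps the class proportions close to the original in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print(np.bincount(y_train), np.bincount(y_test))
```

Stratification is worth considering when some fault types are rare, since a purely random split can leave a small class nearly absent from the test set.
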
#### 1.2. Select Support Vector Classification (SVC) Model
- Justify the choice of SVC for the task at hand.
#### 1.3. Explore Different SVC Kernels
- Train SVC models using different kernels (linear, polynomial, Gaussian RBF).
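
One way to organize the kernel comparison is a loop over kernel names, sketched below on synthetic data with default hyperparameters. In scikit-learn, `"poly"` and `"rbf"` are the names of the polynomial and Gaussian RBF kernels. The dataset and the dictionary name `test_accuracy` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real data (27 features, 7 classes).
X, y = make_classification(n_samples=600, n_features=27, n_informative=10,
                           n_classes=7, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Train one scaled SVC per kernel and record its accuracy on the held-out data.
test_accuracy = {}
for kernel in ["linear", "poly", "rbf"]:
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    clf.fit(X_train, y_train)
    test_accuracy[kernel] = clf.score(X_test, y_test)

print(test_accuracy)
```
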
#### 1.4. Evaluation Metrics
- Define and explain chosen metrics (accuracy, precision, recall, F1 score, ROC-AUC).
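
For a multi-class problem, precision, recall, and F1 need an averaging strategy, and ROC-AUC needs class probabilities rather than hard predictions. The sketch below computes macro-averaged versions on a hypothetical 3-class example. The labels and probabilities are made up for illustration.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
                             roc_auc_score)

# Hypothetical 3-class example: true labels and predicted class probabilities.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_proba = np.array([
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.4, 0.1],   # a class-1 sample that gets predicted as class 0
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
])
y_pred = y_proba.argmax(axis=1)

accuracy = accuracy_score(y_true, y_pred)
# Macro averaging: compute the metric per class, then take the unweighted mean.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
# One-vs-rest ROC-AUC, averaged over classes; requires probabilities.
auc = roc_auc_score(y_true, y_proba, multi_class="ovr", average="macro")

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f} roc_auc={auc:.3f}")
```

In this toy case one of six samples is misclassified, so accuracy is 5/6, yet the ROC-AUC is 1.0 because each class's probability still ranks its own samples above all others. The two metrics answer different questions. Note that `SVC` only exposes `predict_proba` when created with `probability=True`.
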
#### 1.5. Performance Analysis
- Evaluate the performance of both the SVC and the additional model using the test data.
- Utilize confusion matrices to visualize the classification performance.
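
A confusion matrix can be computed as sketched below. The six true and predicted labels are hypothetical, and only three of the seven class names are used to keep the example small.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted fault types for six plates.
labels = ["Pastry", "Stains", "Bumps"]
y_true = ["Pastry", "Pastry", "Stains", "Stains", "Bumps", "Bumps"]
y_pred = ["Pastry", "Bumps", "Stains", "Stains", "Bumps", "Pastry"]

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# normalize="true" divides each row by its total, giving per-class recall on the diagonal.
cm_norm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")
print(cm_norm)
```

To draw the matrix as a heatmap, scikit-learn provides `ConfusionMatrixDisplay`, which takes the matrix and the label names and plots them with matplotlib.
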
#### 1.6. Compare Models
- Compare and contrast the performance of the SVC models and the additional model based on metrics.

### 2. Additional Classification Model
#### 2.1. Cross-Validation Setup
- Explain the concept of cross-validation and its significance in assessing model performance.
- Choose a cross-validation technique (e.g., k-fold) and provide a justification for selecting a specific number of folds (e.g., 5-fold or 10-fold) for obtaining reliable performance metrics.
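
To see what a k-fold scheme actually does with the data, the sketch below builds stratified folds on a synthetic imbalanced target and prints the class counts in each validation fold. The 50/30/20 class sizes and the choice of 5 folds are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced target: 50 / 30 / 20 samples per class.
y = np.array([0] * 50 + [1] * 30 + [2] * 20)
X = np.zeros((len(y), 1))   # features are irrelevant for showing how folds are built

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_counts = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Each validation fold keeps the 50/30/20 class ratio of the full target.
    counts = np.bincount(y[val_idx])
    fold_counts.append(counts.tolist())
    print(fold, counts)
```

With more folds, each model trains on more data but each validation fold gets smaller, which matters most for the rarest class.
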
#### 2.2. Selecting and Justifying a Different Classification Model
- Select and justify a different classification model (e.g., Decision Trees, Random Forest, or Gradient Boosting) based on the characteristics of the dataset and the problem domain.
#### 2.3. Hyperparameter Tuning for the Selected Model
- Identify key hyperparameters that could improve the model’s performance and explain their importance.
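
A grid search with cross-validation is one common way to tune hyperparameters. The sketch below assumes a Random Forest as the additional model and uses a deliberately small, hypothetical grid on synthetic data. The values are illustrative, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training portion of the data.
X, y = make_classification(n_samples=300, n_features=27, n_informative=10,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

# Hypothetical small grid; the values are illustrative, not recommended settings.
param_grid = {
    "n_estimators": [50, 100],   # number of trees in the forest
    "max_depth": [None, 5],      # None lets each tree grow until its leaves are pure
}

# Every parameter combination is scored with 3-fold cross-validation.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=3, scoring="f1_macro")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The search should be fitted on the training data only, so that the test set remains untouched for the final evaluation.
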
#### 2.4. Evaluation Metrics
- Define and explain the evaluation metrics you will use (accuracy, precision, recall, F1 score, ROC-AUC).
#### 2.5. Performance Analysis of the Selected Model
- Evaluate the performance of the selected classification model using the test data.
- Generate confusion matrices to visualize the classification outcomes, breaking down the true positives, true negatives, false positives, and false negatives.
#### 2.6. Compare Models
- Present the evaluation metrics for both models in a side-by-side table or with visualizations, facilitating a direct comparison.
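
A side-by-side table can be assembled by collecting each model's metrics in a dictionary and converting it to a DataFrame, as sketched below. The synthetic data, the two models, and the two metrics are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real data (27 features, 7 classes).
X, y = make_classification(n_samples=600, n_features=27, n_informative=10,
                           n_classes=7, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "SVC (RBF)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Evaluate every model on the same held-out test set.
rows = {}
for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    rows[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "macro_f1": f1_score(y_test, y_pred, average="macro"),
    }

# Transpose so that each model is a row and each metric is a column.
comparison = pd.DataFrame(rows).T
print(comparison.round(3))
```

With matplotlib installed, `comparison.plot(kind="bar")` turns the same table into a grouped bar chart.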

### 3. Conclusion

- Conduct a comprehensive comparison of the performance between the chosen model and the Support Vector Classification (SVC) models.
- Discuss the strengths and weaknesses of each model based on performance metrics, indicating scenarios where one model may outperform the other.
- Summarize the findings from the performance analysis and the comparative evaluation of the models.
- Provide insights into practical considerations when selecting a classification model for similar classification tasks.