# Wine Quality Classification Notebook

## Dataset Description
This dataset consists of 1,000 wine samples. Each sample is described by four chemical features and assigned a quality label (`low`, `medium`, or `high`). The dataset is intended for classification tasks: the objective is to build a model that predicts a wine's `quality_label` from its chemical attributes.

Wine quality can be influenced by various physicochemical factors (e.g., sugar content, acidity, alcohol). Understanding these factors can help producers and researchers assess wine characteristics and potentially improve production quality.

## Data Collection
The dataset is hosted on Kaggle by user sahideseker and is described there simply as containing 1,000 labeled wine samples with key attributes.
[Wine Quality Classification](https://www.kaggle.com/datasets/sahideseker/wine-quality-classification)


The column names closely match a subset of those in the [UCI Wine Quality dataset](https://archive.ics.uci.edu/dataset/186/wine%2Bquality), which derives from lab-based physicochemical testing of Portuguese Vinho Verde wines, along with sensory (expert) quality scores.


The Kaggle dataset is likely a sample or curated excerpt of the UCI dataset, though the source does not state whether the data come from red or white varieties, or how the quality labels were assigned.

### Potential Implications of Data Collection
* Label Origin Ambiguity: It is unknown whether the labels derive from expert scores or from rule-based thresholds, and this affects model interpretability and reliability.
* Sampling Bias: Without clear provenance, it's unclear whether all wine types or regions are equally represented.
* Limited Feature Scope: Only four chemical attributes are included, so models trained on this dataset may miss nuance present in broader feature sets.



## Structure of the Data
This is a structured, tabular dataset with:
* Rows: Each row corresponds to one wine sample.
* Columns: Each column is either a measurable physicochemical feature or the class label.
* Total Observations: 1,000 rows (wine samples)
* Total Columns: 5 (4 features + 1 label)
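
The layout described above can be checked programmatically. The following is a minimal sketch, assuming pandas is available. `check_structure` is a hypothetical helper, the three-row sample frame contains made-up values, and the CSV file name in the commented lines is an assumption rather than something stated by the source.

```python
import pandas as pd

# Column names as documented in this notebook.
EXPECTED_COLUMNS = [
    "fixed_acidity",
    "residual_sugar",
    "alcohol",
    "density",
    "quality_label",
]


def check_structure(df: pd.DataFrame, expected_rows: int) -> dict:
    """Compare a DataFrame's shape and column names with the documented layout."""
    return {
        "rows_match": len(df) == expected_rows,
        # Compare as sets so that column order does not matter.
        "columns_match": set(df.columns) == set(EXPECTED_COLUMNS),
        # Every column except the label counts as a feature.
        "n_features": df.shape[1] - 1,
    }


# Tiny stand-in frame with made-up values, used only to exercise the helper.
sample = pd.DataFrame(
    {
        "fixed_acidity": [7.4, 6.8, 8.1],
        "residual_sugar": [1.9, 6.2, 2.4],
        "alcohol": [9.4, 11.0, 12.8],
        "density": [0.9978, 0.9951, 0.9932],
        "quality_label": ["low", "medium", "high"],
    }
)
report = check_structure(sample, expected_rows=3)
print(report)

# With the real data (the file name below is an assumption):
# df = pd.read_csv("wine_quality_classification.csv")
# print(check_structure(df, expected_rows=1000))
```

If any entry in the report is unexpected, the downloaded file likely differs from the layout described here and should be inspected before cleaning.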

## Column-by-Column Breakdown
| Column Name      | Data Type            | Description                                                                                                                     |
| ---------------- | -------------------- | ------------------------------------------------------------------------------------------------------------------------------- |
| `fixed_acidity`  | Float                | Refers to acids that do not evaporate readily. Affects flavor and stability. Higher values often mean higher tartness.          |
| `residual_sugar` | Float                | The amount of sugar left after fermentation. Impacts sweetness.                                                                 |
| `alcohol`        | Float                | Alcohol percentage by volume. Higher alcohol is often associated with higher perceived quality.                                 |
| `density`        | Float                | Density of the wine, related to sugar and alcohol content. Important for quality and fermentation monitoring.                   |
| `quality_label`  | Object (Categorical) | The assigned quality class for each wine: `low`, `medium`, or `high`. This is the **target variable** for classification tasks. |
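
The column types and label values in the table can be verified with a short check. This is a sketch under stated assumptions: pandas is available, `summarize_columns` is a hypothetical helper, and the four-row sample frame uses made-up values in place of the real file.

```python
import pandas as pd
from pandas.api.types import is_float_dtype

FEATURE_COLUMNS = ["fixed_acidity", "residual_sugar", "alcohol", "density"]
LABEL_COLUMN = "quality_label"
EXPECTED_LABELS = {"low", "medium", "high"}


def summarize_columns(df: pd.DataFrame) -> dict:
    """Check feature dtypes and label values against the column table."""
    observed = set(df[LABEL_COLUMN].dropna().unique())
    return {
        # True only if every feature column is stored as a float.
        "features_are_float": all(is_float_dtype(df[col]) for col in FEATURE_COLUMNS),
        # Any label outside low/medium/high is reported here.
        "unexpected_labels": sorted(observed - EXPECTED_LABELS),
        # Class counts give a first look at class balance.
        "label_counts": df[LABEL_COLUMN].value_counts().to_dict(),
    }


# Stand-in frame with made-up values, used only to exercise the helper.
sample = pd.DataFrame(
    {
        "fixed_acidity": [7.4, 6.8, 8.1, 7.0],
        "residual_sugar": [1.9, 6.2, 2.4, 3.1],
        "alcohol": [9.4, 11.0, 12.8, 10.5],
        "density": [0.9978, 0.9951, 0.9932, 0.9960],
        "quality_label": ["low", "medium", "high", "medium"],
    }
)
summary = summarize_columns(sample)
print(summary)
```

The label counts are worth a look on the real data as well, since a strongly imbalanced target would affect how a classifier should be evaluated.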





## Data Cleaning