# Wine Quality Prediction Project
## Group 22 

___
### INTRODUCTION
___

This project seeks to answer the following question:   
**"Can we predict the quality of wine using its physicochemical properties, such as acidity, sugar content, and alcohol level?"**

To answer this question, we use a [data set](https://archive.ics.uci.edu/dataset/186/wine+quality) of red and white wines from the Vinho Verde region in Portugal. The data, publicly available through the UCI Machine Learning Repository, include 11 physicochemical attributes for each wine sample (such as fixed acidity, volatile acidity, pH, and alcohol content), along with a sensory-based quality score ranging from 0 to 10. The red wine data contain 1,599 samples and the white wine data 4,898 samples, enabling a comprehensive analysis across both types of wine.

The ability to accurately classify wine color using machine learning could offer several advantages. For instance, winemakers and researchers could efficiently analyze large datasets, identify trends, and optimize production processes. Furthermore, consumers or wine retailers might use such tools to assess or verify wine characteristics without requiring advanced laboratory equipment. By developing a robust classification model, we aim to contribute to the wine industry’s growing adoption of data-driven methods, enhancing efficiency and accuracy in identifying and categorizing wines. Ultimately, this approach could lead to scalable, reproducible, and cost-effective methods for wine analysis.

___  
### METHODS
___


The dataset was initially loaded into a single dataframe, and an exploratory data analysis was conducted to understand the distribution of features. This included checking for class imbalance in the target variable, examining collinearity between input features, and identifying the types of features present in the dataset. This analysis informed decisions regarding feature encoding in subsequent steps.
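The three checks described above can be sketched with plain pandas. This is only an illustration: the `toy` dataframe and its values are synthetic stand-ins (not the real wine data), and the `color` label column is an assumed name for the red/white target.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the wine dataframe (hypothetical values, not the real data).
rng = np.random.default_rng(0)
n_red, n_white = 160, 490
alcohol = rng.normal(10.5, 1.2, n_red + n_white)
toy = pd.DataFrame({
    "alcohol": alcohol,
    # density is constructed to fall as alcohol rises, so the two are collinear
    "density": 1.0 - 0.002 * alcohol + rng.normal(0, 0.0005, n_red + n_white),
    "pH": rng.normal(3.2, 0.16, n_red + n_white),
    "color": ["red"] * n_red + ["white"] * n_white,
})

# Class balance of the target: share of each wine color.
class_share = toy["color"].value_counts(normalize=True)

# Collinearity: pairwise Pearson correlation between the numeric features.
corr = toy.select_dtypes("number").corr()

# Feature types present in the dataframe.
feature_types = toy.dtypes

print(class_share)
print(corr.round(2))
print(feature_types)
```

A strongly skewed `class_share` would argue for an imbalance-aware metric, and large off-diagonal entries in `corr` flag features that carry overlapping information.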

A logistic regression model was used to classify wine samples as either red or white. All variables from the original dataset were included in the model. The dataset was split into training and test sets, with 70% of the data allocated to training and 30% to testing. Prior to model fitting, all features were standardized, except for wine quality, which was encoded as an ordinal variable to preserve its inherent order.
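A minimal sketch of this setup with scikit-learn is shown below. It is not the project's actual code: the data are synthetic, only two numeric features are used, the `random_state` values are arbitrary, and passing the integer quality score through unscaled is one assumed way of keeping its order.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical values, not the real wine measurements).
rng = np.random.default_rng(522)
n = 600
is_red = rng.integers(0, 2, n)
X = pd.DataFrame({
    "volatile_acidity": rng.normal(0.28 + 0.25 * is_red, 0.1),
    "total_sulfur_dioxide": rng.normal(138 - 90 * is_red, 30),
    "quality": rng.integers(3, 10, n),  # integer scores from 3 to 9
})
y = pd.Series(np.where(is_red == 1, "red", "white"), name="color")

# 70% of the rows for training, 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=522
)

# Standardize the numeric features; pass quality through unchanged so its
# integer scores keep their natural order (assumed encoding).
numeric_features = ["volatile_acidity", "total_sulfur_dioxide"]
preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features),
    ("passthrough", ["quality"]),
)

model = make_pipeline(preprocessor, LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy on synthetic data: {test_accuracy:.3f}")
```

Wrapping the scaler and the classifier in one pipeline means the scaling statistics are learned from the training rows only, so nothing from the test set leaks into the fit.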

The regularization hyperparameter `C` was optimized using 20-fold cross-validation with the F1 score as the evaluation metric. The F1 score was chosen because it balances precision and recall, and false positives and false negatives are about equally costly in this application. The optimal value of `C` found by this process was `1.85`.
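One way such a search could look is sketched below, using scikit-learn's `GridSearchCV`. The report does not state which search strategy or candidate values were used, so the grid search, the `np.logspace` grid, and the synthetic data from `make_classification` are all assumptions. The target is coded as 0/1 so that the built-in `"f1"` scorer can treat 1 as the positive class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (stand-in for the wine features).
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_redundant=0,
    random_state=522,
)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hypothetical candidate values for C, evenly spaced on a log scale.
param_grid = {"logisticregression__C": np.logspace(-2, 2, 9)}

# 20-fold cross-validation, scoring each candidate by its mean F1.
search = GridSearchCV(pipe, param_grid, cv=20, scoring="f1")
search.fit(X, y)

best_C = search.best_params_["logisticregression__C"]
best_f1 = search.best_score_
print(f"Best C: {best_C:.3g}, mean cross-validated F1: {best_f1:.3f}")
```

Smaller values of `C` mean stronger regularization in scikit-learn's `LogisticRegression`, so the search trades off underfitting at the low end of the grid against overfitting at the high end.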


The analysis was conducted using the Python programming language (Van Rossum and Drake, 2009) and the following libraries:  
- requests (Reitz, 2011) for data retrieval,
- zipfile (Van Rossum and Drake, 2009) for handling compressed files,
- numpy (Harris et al., 2020) for numerical operations,
- pandas (McKinney, 2010) for data manipulation,
- altair (VanderPlas, 2018) for data visualization, and
- scikit-learn (Pedregosa et al., 2011) for model implementation and evaluation.

The code used for analysis and to generate this report is available on GitHub: <https://github.com/Farhan-Faisal/wine-quality-522.git>.

In [1]:
import pandas as pd 
import numpy as np 
import altair as alt
from ucimlrepo import fetch_ucirepo 

In [2]:
# Fetch the UCI Wine Quality dataset (id 186) and keep the combined red and white table
wine_quality = fetch_ucirepo(id=186)
wine = pd.DataFrame(wine_quality.data.original)

In [3]:
wine.describe()

| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 |
| mean | 7.215307 | 0.339666 | 0.318633 | 5.443235 | 0.056034 | 30.525319 | 115.744574 | 0.994697 | 3.218501 | 0.531268 | 10.491801 | 5.818378 |
| std | 1.296434 | 0.164636 | 0.145318 | 4.757804 | 0.035034 | 17.7494 | 56.521855 | 0.002999 | 0.160787 | 0.148806 | 1.192712 | 0.873255 |
| min | 3.8 | 0.08 | 0.0 | 0.6 | 0.009 | 1.0 | 6.0 | 0.98711 | 2.72 | 0.22 | 8.0 | 3.0 |
| 25% | 6.4 | 0.23 | 0.25 | 1.8 | 0.038 | 17.0 | 77.0 | 0.99234 | 3.11 | 0.43 | 9.5 | 5.0 |
| 50% | 7.0 | 0.29 | 0.31 | 3.0 | 0.047 | 29.0 | 118.0 | 0.99489 | 3.21 | 0.51 | 10.3 | 6.0 |
| 75% | 7.7 | 0.4 | 0.39 | 8.1 | 0.065 | 41.0 | 156.0 | 0.99699 | 3.32 | 0.6 | 11.3 | 6.0 |
| max | 15.9 | 1.58 | 1.66 | 65.8 | 0.611 | 289.0 | 440.0 | 1.03898 | 4.01 | 2.0 | 14.9 | 9.0 |


In [4]:
import altair_ally as aly
alt.data_transformers.enable("vegafusion")
# The combined dataframe has no `type` field; the wine type is stored in the `color` column
aly.dist(wine, color="color")

alt.ConcatChart(...)

In [5]:
aly.corr(wine)