![image](img/wine2.jpg)



# TITLE
## Wine Quality Prediction

# SUMMARY


This project analyzes patterns in wine data through exploratory data analysis (EDA) and develops a predictive model that classifies wines by their quality score. The analysis summarizes the physicochemical features, visualizes their distributions, and identifies the features the model relies on most. A decision tree classifier is compared against a dummy baseline and tuned with a cross-validated grid search over its hyperparameters. Model performance is evaluated with accuracy and with per-class precision, recall, and F1-score. The results offer a data-driven view of how wine characteristics relate to quality, which can inform decision-making in winemaking and marketing.



# INTRODUCTION

### Background Information:

The quality of wine plays a crucial role in the wine industry, as it directly affects consumer satisfaction, pricing, and demand. Traditionally, wine quality is determined through sensory analysis by trained experts, who evaluate factors such as taste, aroma, and texture. However, these evaluations are inherently subjective, costly, and time-consuming. With advancements in data analysis and machine learning, it is now possible to model and predict wine quality using objective, measurable features. These features include chemical and physical attributes such as acidity, sugar levels, alcohol content, and more, which directly influence the sensory properties of wine.<br>
Understanding the factors that contribute to wine quality can help winemakers optimize production processes, improve quality control, and make data-driven decisions to meet consumer expectations. Moreover, predictive modeling allows for faster assessments of wine quality, potentially reducing reliance on labor-intensive sensory evaluations.

---






### Research Question:

The primary question we sought to answer in this project is: "Can the quality of wine be effectively predicted based on its measurable physicochemical properties? Additionally, which features are most influential in determining wine quality?" <br>

This project aimed to explore whether measurable data about wine's chemical and physical properties could provide a reliable means of assessing its quality. By identifying the most important predictors of wine quality, we can gain insights into the production processes that have the greatest impact on consumer satisfaction. Furthermore, the study explores whether predictive models can achieve high accuracy and how they can be applied in real-world scenarios.

---

### Dataset Description:

To answer these questions, we utilized the Wine Quality Dataset, a publicly available dataset from the UCI Machine Learning Repository. This dataset contains information about red and white variants of Portuguese "Vinho Verde" wine. It includes 4,898 observations of white wines and 1,599 observations of red wines, with each record corresponding to a single wine sample.

The dataset consists of 11 numerical input features representing physicochemical attributes of the wine:

- **Fixed Acidity:** Measures non-volatile acids that do not evaporate.
- **Volatile Acidity:** Measures volatile acids that can impact wine aroma.
- **Citric Acid:** A component that can add freshness and flavor to wine.
- **Residual Sugar:** The sugar content left after fermentation.
- **Chlorides:** Indicates salt content.
- **Free Sulfur Dioxide and Total Sulfur Dioxide:** Measures preservatives that can affect taste and shelf life.
- **Density:** Relates to sugar and alcohol content.
- **pH:** Measures acidity/alkalinity.
- **Sulphates:** Relates to bitterness and antioxidant properties.
- **Alcohol:** Affects body and sweetness.

The target variable, `quality`, is a score between 0 and 10 assigned by wine tasters based on sensory evaluations. In this dataset, the observed scores range from 3 to 9.

This dataset provides a comprehensive foundation for both exploratory data analysis (EDA) and predictive modeling, allowing us to understand patterns, correlations, and the predictive potential of these features in determining wine quality. By analyzing this dataset, we aim to provide actionable insights into the factors influencing wine quality and to develop models that predict quality ratings.

# METHODS AND RESULTS


In [104]:
import pandas as pd
import numpy as np
import altair as alt
import janitor
from ucimlrepo import fetch_ucirepo
import os
alt.data_transformers.enable("vegafusion")

DataTransformerRegistry.enable('vegafusion')

In [105]:
# Directory to store the dataset
data_dir = "data/"
csv_file_path = os.path.join(data_dir, "wine_quality_combined.csv")

try:
    # Ensure the directory exists
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)
        print(f"Directory '{data_dir}' created successfully.")
    else:
        print(f"Directory '{data_dir}' already exists.")
except OSError as e:
    print(f"Error creating directory '{data_dir}': {e}")
    raise

Directory 'data/' already exists.


In [106]:
# Check if the CSV file already exists
if not os.path.isfile(csv_file_path):
    try:
        print("CSV file not found. Fetching dataset...")

        # Fetch the dataset
        wine_quality = fetch_ucirepo(id=186)
        
        # Features (X) and Targets (y)
        X = wine_quality.data.features
        y = wine_quality.data.targets

        # Combine features and targets into a single DataFrame
        wine_df = pd.concat([X, y], axis=1)

        # Save the DataFrame to a CSV file
        wine_df.to_csv(csv_file_path, index=False)
        print(f"Dataset saved as '{csv_file_path}'.")
    except Exception as e:
        print(f"Error fetching or saving the dataset: {e}")
        raise
else:
    # Reuse the path defined above instead of repeating the literal
    wine_df = pd.read_csv(csv_file_path)
    print(f"Dataset already exists at '{csv_file_path}'. Skipping fetch.")

Dataset already exists at 'data/wine_quality_combined.csv'. Skipping fetch.


In [107]:
# Uses janitor to clean column names
wine_df = wine_df.clean_names()
wine_df.head()

| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | ph | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [108]:
# This gives the shape of the dataframe
wine_df.shape

(6497, 12)

In [109]:
# Count the columns that contain at least one null value
wine_df.isna().any().sum()

np.int64(0)
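
Note that `isna().any().sum()` counts the columns that contain at least one missing value, not the missing cells themselves. The sketch below contrasts the two counts on a small made-up frame; the `toy` data is hypothetical and only stands in for `wine_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for wine_df; the values are made up.
toy = pd.DataFrame({
    "alcohol": [9.4, np.nan, 9.8, np.nan],
    "ph": [3.51, 3.20, np.nan, 3.16],
    "quality": [5, 5, 5, 6],
})

# Number of columns that contain at least one missing value
cols_with_na = int(toy.isna().any().sum())

# Total number of missing cells across the whole frame
cells_with_na = int(toy.isna().sum().sum())

print(cols_with_na, cells_with_na)
```

For the wine data both counts are zero, since every column reports 6497 non-null entries, so either check leads to the same conclusion here.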

In [110]:
# Checks the summary of the dataframe
wine_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed_acidity         6497 non-null   float64
 1   volatile_acidity      6497 non-null   float64
 2   citric_acid           6497 non-null   float64
 3   residual_sugar        6497 non-null   float64
 4   chlorides             6497 non-null   float64
 5   free_sulfur_dioxide   6497 non-null   float64
 6   total_sulfur_dioxide  6497 non-null   float64
 7   density               6497 non-null   float64
 8   ph                    6497 non-null   float64
 9   sulphates             6497 non-null   float64
 10  alcohol               6497 non-null   float64
 11  quality               6497 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 609.2 KB


In [111]:
# Gets statistics of numerical columns
wine_df.describe()

| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | ph | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 |
| mean | 7.215307 | 0.339666 | 0.318633 | 5.443235 | 0.056034 | 30.525319 | 115.744574 | 0.994697 | 3.218501 | 0.531268 | 10.491801 | 5.818378 |
| std | 1.296434 | 0.164636 | 0.145318 | 4.757804 | 0.035034 | 17.7494 | 56.521855 | 0.002999 | 0.160787 | 0.148806 | 1.192712 | 0.873255 |
| min | 3.8 | 0.08 | 0.0 | 0.6 | 0.009 | 1.0 | 6.0 | 0.98711 | 2.72 | 0.22 | 8.0 | 3.0 |
| 25% | 6.4 | 0.23 | 0.25 | 1.8 | 0.038 | 17.0 | 77.0 | 0.99234 | 3.11 | 0.43 | 9.5 | 5.0 |
| 50% | 7.0 | 0.29 | 0.31 | 3.0 | 0.047 | 29.0 | 118.0 | 0.99489 | 3.21 | 0.51 | 10.3 | 6.0 |
| 75% | 7.7 | 0.4 | 0.39 | 8.1 | 0.065 | 41.0 | 156.0 | 0.99699 | 3.32 | 0.6 | 11.3 | 6.0 |
| max | 15.9 | 1.58 | 1.66 | 65.8 | 0.611 | 289.0 | 440.0 | 1.03898 | 4.01 | 2.0 | 14.9 | 9.0 |
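
The summary statistics show that `quality` has a median of 6 and an interquartile range of 5 to 6, which suggests that most wines fall into a few middle scores. One way to inspect this directly is to tabulate the class proportions. The sketch below uses a small hypothetical series in place of `wine_df['quality']`, so the printed proportions are illustrative only.

```python
import pandas as pd

# Hypothetical quality scores standing in for wine_df["quality"].
toy_quality = pd.Series([5, 6, 6, 5, 7, 6, 4, 6, 5, 8], name="quality")

# Share of samples per quality score, ordered by score
class_share = toy_quality.value_counts(normalize=True).sort_index()

print(class_share)
```

Running the same two calls on the real target column would show how strongly the middle scores dominate before any model is trained.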


In [112]:
# Visualizing feature distribution using a histogram
columns = wine_df.columns.to_list()

alt.Chart(wine_df).mark_bar().encode(
    x=alt.X(alt.repeat('repeat'),bin=alt.Bin(maxbins=40)),
    y=alt.Y('count()')
).repeat(
    repeat=columns,
    columns=3
)

In [113]:
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier

In [114]:
#Splitting the data with 20% of the data as test set
train_df, test_df = train_test_split(wine_df, test_size=0.2, random_state=123)
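
The split above is purely random, so rare quality scores can end up unevenly distributed between the training and test sets. A possible alternative, not used in this analysis, is a stratified split via the `stratify` argument of `train_test_split`. The sketch below uses made-up labels to show the effect.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 samples of class 0 and 20 of class 1.
toy = pd.DataFrame({"feature": range(100), "label": [0] * 80 + [1] * 20})

# stratify keeps the class proportions (80/20) the same in both subsets
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=123, stratify=toy["label"]
)

print(toy_train["label"].value_counts().to_dict())
print(toy_test["label"].value_counts().to_dict())
```

Stratification requires at least two samples of every class, which is worth checking when some scores are very rare.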

In [115]:
#Separating feature vectors and targets
X_train = train_df.drop(columns='quality')
y_train = train_df['quality']
X_test = test_df.drop(columns='quality')
y_test = test_df['quality']

In [116]:
#Creating a DummyClassifier Model as a baseline
dummy_model = DummyClassifier(random_state=123)
#Fitting the dummy model on Training data
dummy_model.fit(X_train, y_train)
#Scoring the dummy model on Test data
dummy_score = dummy_model.score(X_test, y_test)
print(f"Dummy Classifier score is {dummy_score}")

Dummy Classifier score is 0.43615384615384617
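
With the `prior` strategy, which is the default in recent scikit-learn releases, `DummyClassifier` ignores the features and always predicts the most frequent class seen during training. Its test accuracy therefore equals the share of that class in the test set. The following sketch illustrates this on made-up labels; the arrays are hypothetical, not the wine data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels; the features are ignored by the dummy model.
y_tr = np.array([5, 6, 6, 6, 7, 5, 6])
y_te = np.array([6, 5, 6, 7, 6])
X_tr = np.zeros((len(y_tr), 1))
X_te = np.zeros((len(y_te), 1))

# "prior" predicts the most frequent training class for every sample.
baseline = DummyClassifier(strategy="prior").fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
majority_share = float(np.mean(y_te == 6))  # 6 is the training majority

print(baseline_acc, majority_share)
```

By the same reasoning, the baseline score of about 0.436 above matches the 567 wines rated 6 among the 1,300 test samples listed in the classification report.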


In [117]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

# Define the hyperparameter grid
param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5],
    'max_features': [None, 'sqrt', 'log2']
}

# Create the base decision tree; its hyperparameters are tuned by the grid search below
tree_model = DecisionTreeClassifier(random_state=16)

# Set up GridSearchCV
grid_search = GridSearchCV(
    estimator=tree_model,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

# Perform grid search on the training data
grid_search.fit(X_train, y_train)

# Get the best model
best_tree_model = grid_search.best_estimator_

# Display the best hyperparameters
print("best hyperparameters:")
print(grid_search.best_params_)


Fitting 5 folds for each of 108 candidates, totalling 540 fits
best hyperparameters:
{'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2}
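
The log line reports 108 candidates because the grid contains 4 × 3 × 3 × 3 hyperparameter combinations, each evaluated with 5-fold cross-validation (108 × 5 = 540 fits). This can be checked with scikit-learn's `ParameterGrid`, as in the short sketch below.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": [None, "sqrt", "log2"],
}

n_candidates = len(ParameterGrid(param_grid))  # 4 * 3 * 3 * 3
n_fits = n_candidates * 5                      # 5 cross-validation folds

print(n_candidates, n_fits)
```

Counting the grid this way before launching a search gives a quick estimate of its cost when more hyperparameters are added.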


In [118]:
from sklearn.metrics import classification_report, accuracy_score

# Predictions on the test set
y_test_pred = best_tree_model.predict(X_test)

# Classification report
print("Classification report:")
print(classification_report(y_test, y_test_pred))

# Accuracy score
test_accuracy = accuracy_score(y_test, y_test_pred)
print(f"Test Accuracy: {test_accuracy:.4f}")

Classification report:
              precision    recall  f1-score   support

           3       0.00      0.00      0.00         4
           4       0.29      0.24      0.26        51
           5       0.63      0.66      0.65       413
           6       0.66      0.65      0.65       567
           7       0.61      0.58      0.60       228
           8       0.36      0.43      0.40        37
           9       0.00      0.00      0.00         0

    accuracy                           0.62      1300
   macro avg       0.37      0.37      0.36      1300
weighted avg       0.62      0.62      0.62      1300

Test Accuracy: 0.6169
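
The two average rows in the report differ because of class imbalance: the macro average is the unweighted mean of the per-class scores, while the weighted average weights each class by its support. The sketch below reproduces both from per-class F1-scores on made-up labels. It passes `zero_division=0` so that metrics that are undefined for a never-predicted class are reported as 0 without a warning. The labels are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical true and predicted quality scores; class 3 is never predicted.
y_true = np.array([5, 5, 5, 5, 6, 6, 6, 7, 7, 3])
y_pred = np.array([5, 5, 5, 6, 6, 6, 5, 7, 6, 5])
labels = [3, 5, 6, 7]

per_class = f1_score(y_true, y_pred, labels=labels, average=None, zero_division=0)
support = np.array([(y_true == label).sum() for label in labels])

macro = f1_score(y_true, y_pred, labels=labels, average="macro", zero_division=0)
weighted = f1_score(y_true, y_pred, labels=labels, average="weighted", zero_division=0)

# macro: plain mean of per-class scores; weighted: mean weighted by support
print(macro, per_class.mean())
print(weighted, np.average(per_class, weights=support))
```

In the report above, the macro averages (about 0.37) are much lower than the weighted averages (0.62) because the rare scores are predicted poorly but count as much as the common scores in the macro average.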




In [119]:
from sklearn.metrics import confusion_matrix
import pandas as pd
import altair as alt

# Dynamically determine class labels from both y_test and y_test_pred
class_labels = sorted(set(y_test).union(set(y_test_pred)))

# Compute confusion matrix
cm = confusion_matrix(y_test, y_test_pred, labels=class_labels)

# Create a DataFrame with dynamic labels for multi-class
cm_df = pd.DataFrame(cm, columns=[f'Predicted {label}' for label in class_labels],
                     index=[f'Actual {label}' for label in class_labels]).reset_index()

# Convert to long format for Altair
cm_melted = cm_df.melt(id_vars='index', var_name='Predicted', value_name='Count')

# Plot confusion matrix using Altair
confusion_chart = alt.Chart(cm_melted).mark_rect().encode(
    x=alt.X('Predicted:N', title='Predicted Label'),
    y=alt.Y('index:N', title='Actual Label'),
    color=alt.Color('Count:Q', scale=alt.Scale(scheme='blues'), title='Count'),
    tooltip=['index:N', 'Predicted:N', 'Count:Q']
).properties(
    title='Confusion Matrix',
    width=400,
    height=400
)

confusion_chart.display()
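
Because the classes are imbalanced, raw counts in the confusion matrix are dominated by the common scores. One option, not part of the original analysis, is to normalize each row so that cells show the fraction of each actual class; `confusion_matrix` supports this through its `normalize` argument. A minimal sketch on hypothetical labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted quality scores.
y_true = np.array([5, 5, 5, 5, 6, 6, 6, 7, 7, 7])
y_pred = np.array([5, 5, 6, 5, 6, 6, 5, 7, 6, 6])
labels = [5, 6, 7]

# normalize="true" divides each row by the number of actual samples in that class
cm_norm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")

print(np.round(cm_norm, 2))
```

With row normalization, the diagonal holds the recall of each class, which makes rare and common scores directly comparable.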

In [120]:
# Feature importance
feature_importances = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': best_tree_model.feature_importances_
}).sort_values(by='Importance', ascending=False)

# Plot feature importance using Altair
importance_chart = alt.Chart(feature_importances).mark_bar().encode(
    x=alt.X('Importance:Q', title='Importance'),
    y=alt.Y('Feature:N', sort='-x', title='Feature'),
    tooltip=['Feature', 'Importance']
).properties(
    title='Feature Importance',
    width=600,
    height=400
)

importance_chart.display()
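
The importances plotted above are the tree's impurity-based importances, which scikit-learn normalizes to sum to 1. The sketch below shows this on synthetic data in which only one feature determines the label; the data and the feature names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data: the label depends only on "informative"; "noise" is unrelated.
X_toy = pd.DataFrame({
    "informative": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
y_toy = (X_toy["informative"] > 0).astype(int)

toy_tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

importances = pd.Series(toy_tree.feature_importances_, index=X_toy.columns)
print(importances)
```

Impurity-based importances describe how the fitted tree uses the features, so they are best read as a ranking for this model rather than as causal effects on wine quality.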

# DISCUSSION

In this project, we explored the physicochemical properties of wines and their relationship to quality ratings, using both exploratory data analysis (EDA) and predictive modeling. Our findings indicated that certain features, such as alcohol content and volatile acidity, are strong predictors of wine quality, while others, such as residual sugar and pH, showed weaker associations. The tuned decision tree classifier reached a test accuracy of about 0.62, compared with about 0.44 for the dummy baseline, but its performance was limited by imbalances in the dataset and the subjective nature of quality ratings. The model performed best on the common quality scores of 5 and 6 (F1-scores of 0.65) and worst on the rare scores: it never correctly predicted a wine rated 3, and its F1-score for wines rated 4 was only 0.26.

When we started the project, we were not sure what we would find, but we expected characteristics such as `alcohol` and `residual_sugar` to affect quality. The results align with these expectations to some extent. For example, the importance of alcohol content in predicting quality is consistent with its known role in influencing a wine's taste and balance. However, the weaker associations for residual sugar and pH were somewhat surprising, given their theoretical importance in wine chemistry. This suggests that other, unmeasured factors, such as sensory attributes, may play a critical role in determining wine quality.

The findings have significant implications for both winemakers and consumers. Understanding the key drivers of wine quality could help winemakers optimize production processes and improve the consistency of their products. For consumers, predictive models might provide insights into selecting wines that align with their preferences, potentially revolutionizing wine recommendations.

Future research could address several questions raised by this study. For instance, how do sensory attributes, such as aroma and taste, interact with physicochemical properties to influence quality ratings? Could combining machine learning with sensory data improve prediction accuracy? Additionally, further exploration into addressing dataset imbalances and incorporating expert ratings could yield a more nuanced understanding of wine quality. These avenues could lead to a more holistic framework for evaluating and improving wines.






# REFERENCES
1. Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), 547–553. [https://doi.org/10.1016/j.dss.2009.05.016]

   This paper discusses the original dataset used in this project and presents a comparative analysis of various data mining techniques for predicting wine quality.

2. Boulesteix, A.-L., & Strimmer, K. (2007). Partial least squares: A versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics, 8(1), 32–44. [https://doi.org/10.1093/bib/bbm007]

   Highlights the use of statistical models in high-dimensional data, relevant for understanding the relationship between multiple wine features and quality.

3. Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. New York, NY: Springer. [https://doi.org/10.1007/978-1-4614-6849-3]

   Offers foundational concepts in predictive modeling, which were applied during the analysis and model development stages of the project.

4. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). New York, NY: Springer. [https://doi.org/10.1007/978-0-387-84858-7]

   Provides advanced insights into machine learning methods, including regression and classification techniques applied in wine quality prediction.