<a href="https://www.kaggle.com/code/yarlagaddasaimanoj/wine-quality-ml-models?scriptVersionId=143486639" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# **Wine Quality Prediction**

## Introduction

The **Wine Quality Prediction** project analyzes and predicts the quality of red variants of the Portuguese "Vinho Verde" wine from physicochemical attributes. The dataset shows how a wine's chemical composition relates to its perceived quality.


![image.png](https://scopelliti1887.com/wp-content/uploads/2022/03/baccarat-wine-glasses.jpg.webp)

## Dataset

The dataset consists of the following attributes:

- **Input Variables (based on physicochemical tests):**
  1. Fixed acidity
  2. Volatile acidity
  3. Citric acid
  4. Residual sugar
  5. Chlorides
  6. Free sulfur dioxide
  7. Total sulfur dioxide
  8. Density
  9. pH
  10. Sulphates
  11. Alcohol

- **Output Variable (based on sensory data):**
  12. Quality (score between 0 and 10)

The project involves building regression models that predict the wine quality score from these attributes. The task is challenging because the dataset is relatively small and its quality scores are imbalanced.
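
One way to make the imbalance concrete is to tabulate how often each quality score occurs. The sketch below uses a small made-up series, `toy_quality`, as a stand-in for `df['quality']`; its values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical quality scores standing in for df['quality']; not real data
toy_quality = pd.Series([5, 5, 6, 5, 6, 7, 5, 6, 4, 8], name="quality")

# Share of rows per score, ordered by score
class_share = toy_quality.value_counts(normalize=True).sort_index()
print(class_share)

# Ratio between the most and least frequent score; large values signal imbalance
imbalance_ratio = class_share.max() / class_share.min()
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

On the real data, the same two lines can be applied to `df['quality']` once the CSV has been loaded.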

## **Importing DataSet:**


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df=pd.read_csv('/kaggle/input/wine-quality-dataset/WineQT.csv')

In [None]:
df

In [None]:
df.head(10)

In [None]:
df.tail(10)

In [None]:
df.dtypes

In [None]:
df.info()

In [None]:
df.shape

In [None]:
df.columns

In [None]:
set(df['quality'])

In [None]:
df.describe()

In [None]:
df.describe(include='all')

In [None]:
df.isnull()

In [None]:
df.isnull().sum()

In [None]:
df.corr()

In [None]:
df.groupby('quality').mean()

Each column in the dataset represents one attribute related to wine quality. `df.isnull().sum()` reports 0 missing values for every column, so the dataset is ready for analysis and modeling.

Due to the absence of missing data in the dataset's attributes, we have skipped the following steps:

1. **Data Imputation:** There was no need to fill or impute missing values since all columns are complete, saving time and complexity in data preprocessing.

2. **Missing Data Analysis:** The step of analyzing patterns or causes of missing data was unnecessary, as the dataset was devoid of any missing values.

3. **Handling Missing Values:** With no null values to address, there was no requirement for strategies such as imputation, removal of incomplete rows, or advanced techniques to manage missing data.
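
If the data source might change later, the "no missing values" assumption can be turned into an explicit check. The following is a minimal sketch; `assert_no_missing` is a hypothetical helper name, and the two small frames are toy data for illustration only.

```python
import numpy as np
import pandas as pd


def assert_no_missing(frame: pd.DataFrame) -> None:
    """Raise a ValueError naming every column that contains missing values."""
    missing = frame.isnull().sum()
    bad = missing[missing > 0]
    if not bad.empty:
        raise ValueError(f"Missing values found: {bad.to_dict()}")


# Toy frames for illustration only
clean = pd.DataFrame({"alcohol": [9.4, 9.8], "pH": [3.51, 3.20]})
dirty = pd.DataFrame({"alcohol": [9.4, np.nan], "pH": [3.51, 3.20]})

assert_no_missing(clean)  # passes silently
try:
    assert_no_missing(dirty)
except ValueError as err:
    print(err)
```

Calling `assert_no_missing(df)` right after loading would make the notebook fail early if a future version of the file contains gaps.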

## **Exploratory Data Analysis (EDA):**

### 1. Summary Statistics:

Function used: `DataFrame.describe()`

In [None]:
# Display summary statistics for all columns:

summary_stats=df.describe(include='all')
print(summary_stats)

### 2. Histograms:

Function used: `seaborn.histplot()`

Histogram with KDE for the 'density' attribute from the provided DataFrame (df), visualizing its distribution. The x-axis represents the density values, the y-axis represents the frequency of those values, and a smooth KDE curve is overlaid on the histogram. The plot is presented with appropriate labels and a title for better interpretation.

In [None]:
plt.figure(figsize=(12, 6))
sns.histplot(data=df, x='density', kde=True, color='blue')
plt.xlabel('Density', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Distribution of Density', fontsize=14)
plt.show()

In [None]:
plt.figure(figsize=(12, 7))

# Set the color palette to 'PiYG'
sns.set_palette("PiYG")

sns.histplot(data=df, x='alcohol', kde=True)
plt.xlabel('Alcohol Content', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Distribution of Alcohol Content', fontsize=15)
plt.show()

In [None]:
# Create a new figure for plotting with specified dimensions
plt.figure(figsize=(12, 7))

# Set the color palette to 'Set2' (a specific Seaborn color palette)
sns.set_palette("Set2")

# Create a histogram with KDE (Kernel Density Estimation) for the 'quality' attribute
sns.histplot(data=df, x='quality', kde=True)

# Add label for the x-axis
plt.xlabel('Quality', fontsize=12)

# Add label for the y-axis
plt.ylabel('Frequency', fontsize=12)

# Add a title to the plot
plt.title('Distribution of Quality', fontsize=15)

# Display the plot
plt.show()

In [None]:
df.hist(figsize=(12,12), bins=45)
plt.show()

### 3. Box Plots

Function: `seaborn.boxplot()`

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='quality', y='alcohol', palette='Set2')
plt.xlabel('Wine Quality', fontsize=12)
plt.ylabel('Alcohol Content', fontsize=12)
plt.title('Alcohol Content vs. Wine Quality', fontsize=14)
plt.show()

### 4. Correlation Matrix (Heatmap)

Function: `seaborn.heatmap()`

In [None]:
cm=df.corr()
plt.figure(figsize=(12, 7))
sns.heatmap(cm, annot=True, cmap='viridis', linewidths=0.8)
plt.title('Correlation Heatmap', fontsize=16)
plt.show()

### 5. Pair Plots

Function: `seaborn.pairplot()`

In [None]:
# Create a pair plot with specified variables and color palette
# (pairplot creates its own figure, so the size is set through `height`)
pair_plot = sns.pairplot(data=df, vars=['alcohol', 'volatile acidity', 'residual sugar'], hue='quality', palette='viridis', height=3)

# Add a title above the grid of plots
pair_plot.fig.suptitle('Pairwise Relationships', y=1.02, fontsize=17)

# Display the pair plot
plt.show()

In [None]:
sns.pairplot(df)

In [None]:
# Set a default color palette
sns.set_palette('viridis')

# Create a pair plot with all columns from the DataFrame df
sns.pairplot(data=df, diag_kind='kde')

# Add a title to the pair plot
plt.suptitle("Pairwise Relationships of All Columns", y=1.02, fontsize=16)

# Display the pair plot
plt.show()

### 6. Countplots and Bar Charts

Function: `seaborn.countplot()`

In [None]:
# Create a new figure for the plot with specified dimensions
plt.figure(figsize=(8, 5))

# Create the count plot using Seaborn
sns.countplot(data=df, x='quality', palette='viridis')

# Add a label for the x-axis
plt.xlabel('Wine Quality', fontsize=12)

# Add a label for the y-axis
plt.ylabel('Count', fontsize=12)

# Add a title to the plot
plt.title('Distribution of Wine Quality', fontsize=14)

# Display the plot
plt.show()

In [None]:
plt.figure(figsize=(12,5))
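# Name the column to count explicitly instead of passing the whole DataFrame
sns.countplot(data=df, x='quality')
plt.show()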

In [None]:
# Create subplots for each column in the DataFrame
fig, axes = plt.subplots(1, len(df.columns), figsize=(25, 15))

# Iterate through the columns and create count plots
for i, column in enumerate(df.columns):
    sns.countplot(data=df, x=column, ax=axes[i])
    axes[i].set_title(f'Count Plot of {column}')
    axes[i].set_xlabel('')
    axes[i].set_ylabel('Count')

# Adjust spacing between subplots
plt.tight_layout()

# Show the count plots
plt.show()

## **Ridge and Lasso Regression Model:**

#### Load and Prepare the Data

In [None]:
# Separate the features (X) from the target (y)
X = df.drop(columns='quality')
y = df['quality']

In [None]:
print(X.shape,y.shape)

In [None]:
X.columns

#### Split the Data into Training and Testing Sets

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

We use `train_test_split` from scikit-learn to split the data into training and testing sets. Here, 80% of the data is used for training and 20% for testing; `random_state=42` makes the split reproducible.
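
Because the quality scores are imbalanced, a plain random split can leave rare scores under-represented in the test set. One option is the `stratify` argument of `train_test_split`, which keeps class proportions equal in both parts. The sketch below uses synthetic arrays (`X_toy`, `y_toy`) whose sizes and labels are assumptions for illustration; whether stratification is worthwhile for this project is a judgment call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))

# Imbalanced toy labels: 150 rows of score 5, 40 of score 6, 10 of score 7
y_toy = np.array([5] * 150 + [6] * 40 + [7] * 10)

# stratify=y_toy preserves the 75% / 20% / 5% label mix in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
labels, counts = np.unique(y_te, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))
```

Stratification requires at least two rows for every distinct label, so very rare scores may need to be merged or handled separately.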

#### Standardize the Features

In [None]:
from sklearn.preprocessing import StandardScaler

# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
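
The scaler above is fitted on the training data only and then applied to the test data, which avoids leaking test-set statistics into training. A scikit-learn `Pipeline` bundles the two steps so they cannot drift apart and the original `X_train` is not overwritten. This is a minimal sketch on synthetic data from `make_regression`; the toy arrays and the name `ridge_pipeline` are assumptions, not part of the original notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data for illustration only
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# fit() scales with statistics from X_tr only; predict()/score() reuse them
ridge_pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge_pipeline.fit(X_tr, y_tr)

held_out_r2 = ridge_pipeline.score(X_te, y_te)
print("R^2 on held-out data:", held_out_r2)
```

With this pattern the unscaled split stays available for models that do not need scaling, such as decision trees.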

### Ridge Regression:


In [None]:
from sklearn.linear_model import Ridge

# Ridge Regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
y_pred_ridge = ridge.predict(X_test)

### Lasso Regression

In [None]:
from sklearn.linear_model import Lasso

# Lasso Regression
lasso = Lasso(alpha=1.0)  # You can adjust the alpha (regularization strength) as needed
lasso.fit(X_train, y_train)
y_pred_lasso = lasso.predict(X_test)
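
With standardized features, `alpha=1.0` can be a strong penalty relative to a target that spans only a few score points, and Lasso may then shrink every coefficient to zero. A hedged sketch of choosing the penalty by cross-validation with `LassoCV` follows. It runs on synthetic data from `make_regression`; the toy arrays, their sizes, and the rescaling of the target are assumptions for illustration, not the wine data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic data: 8 features, only 4 of which carry signal
X_toy, y_toy = make_regression(
    n_samples=200, n_features=8, n_informative=4, noise=5.0, random_state=0
)
X_toy = StandardScaler().fit_transform(X_toy)

# Rescale the target to unit variance so that alpha=1.0 is large relative to it
y_toy = (y_toy - y_toy.mean()) / y_toy.std()

# Fixed, strong penalty
strong = Lasso(alpha=1.0).fit(X_toy, y_toy)
print("Non-zero coefficients at alpha=1.0:", np.count_nonzero(strong.coef_))

# Penalty selected by 5-fold cross-validation over an automatic alpha grid
tuned = LassoCV(cv=5).fit(X_toy, y_toy)
print("Alpha chosen by cross-validation:", tuned.alpha_)
print("Non-zero coefficients after tuning:", np.count_nonzero(tuned.coef_))
```

`RidgeCV` offers the same idea for Ridge Regression. Inspecting `lasso.coef_` after fitting is a quick way to see whether the fixed penalty removed all features.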

In [None]:
lasso_predictions=y_pred_lasso
ridge_predictions=y_pred_ridge

### Evaluate Ridge Regression

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

# Evaluate Ridge Regression
ridge_mse = mean_squared_error(y_test, ridge_predictions)
ridge_rmse = np.sqrt(mean_squared_error(y_test, y_pred_ridge))
ridge_r2 = r2_score(y_test, y_pred_ridge)
print("Ridge Regression MSE:", ridge_mse)
print("Ridge Regression RMSE:", ridge_rmse)
print("Ridge Regression R^2:", ridge_r2)

### Evaluate Lasso Regression

In [None]:
# Evaluate Lasso Regression
lasso_rmse = np.sqrt(mean_squared_error(y_test, y_pred_lasso))
lasso_r2 = r2_score(y_test, y_pred_lasso)
lasso_mse = mean_squared_error(y_test, lasso_predictions)
print("Lasso Regression MSE:", lasso_mse)
print("Lasso Regression RMSE:", lasso_rmse)
print("Lasso Regression R^2:", lasso_r2)
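
The three metrics are closely related: RMSE is the square root of MSE, and R² equals one minus MSE divided by the variance of the true values. The sketch below recomputes them with NumPy on a few made-up numbers (`y_true` and `y_hat` are hypothetical) and compares the results with scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true scores and predictions, for illustration only
y_true = np.array([5.0, 6.0, 5.0, 7.0, 6.0])
y_hat = np.array([5.2, 5.8, 5.4, 6.5, 6.1])

mse = np.mean((y_true - y_hat) ** 2)   # mean of squared errors
rmse = np.sqrt(mse)                    # same units as the target
r2 = 1.0 - mse / np.var(y_true)        # np.var uses the population variance

print("MSE matches scikit-learn:", np.isclose(mse, mean_squared_error(y_true, y_hat)))
print("R^2 matches scikit-learn:", np.isclose(r2, r2_score(y_true, y_hat)))
print("MSE:", mse, "RMSE:", rmse, "R^2:", r2)
```

Since the variance of the true values is fixed for a given test set, a lower MSE on that set always corresponds to a higher R².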

In [None]:
# Actual target values (predictions are already stored in y_pred_lasso and y_pred_ridge)
y_actual = y_test

In [None]:
plt.figure(figsize=(8, 6))

# Plot Lasso predictions
plt.scatter(y_actual, y_pred_lasso, color='blue', label='Lasso Regression', alpha=0.5)

# Plot Ridge predictions
plt.scatter(y_actual, y_pred_ridge, color='red', label='Ridge Regression', alpha=0.5)

# Add a reference line for perfect predictions (y_actual = y_pred)
plt.plot([min(y_actual), max(y_actual)], [min(y_actual), max(y_actual)], linestyle='--', color='gray', label='Perfect Predictions')

# Customize the plot
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs. Predicted Values for Lasso and Ridge Regression')
plt.legend()

# Show the plot
plt.show()

## Decision Tree Model

In [None]:
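from sklearn.model_selection import train_test_split
# Re-split to recover unscaled features; same test_size and random_state as before, so every model sees the same test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)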

#### Data Modeling:

In [None]:
from sklearn.tree import DecisionTreeRegressor
# Fix random_state so the fitted tree is reproducible between runs
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

In [None]:
# Fix random_state so the fitted tree is reproducible between runs
decision_tree_model = DecisionTreeRegressor(random_state=42)
decision_tree_model.fit(X_train, y_train)
decision_tree_predictions = decision_tree_model.predict(X_test)

# Calculate MSE and R-squared for Decision Tree Regressor
dt_mse = mean_squared_error(y_test, decision_tree_predictions)
dt_r2 = r2_score(y_test, decision_tree_predictions)

#### Model Predictions:

In [None]:
y_pred = model.predict(X_test)

In [None]:
y_pred

#### Decision Tree:

In [None]:
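from sklearn import tree
plt.figure(figsize=(20,20))
# Draw only the top levels with feature names; the full tree is too large to read
tree.plot_tree(model, filled=True, feature_names=list(X.columns), max_depth=3)
plt.show()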

## SVM Model:

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

#### Create and Train the SVM Model

In [None]:
svm_regressor = SVR(kernel='linear', C=1.0)  # You can choose a different kernel if needed
svm_regressor.fit(X_train, y_train)

In [None]:
y_pred = svm_regressor.predict(X_test)

In [None]:
svm_mse = mean_squared_error(y_test, y_pred)
svm_r2 = r2_score(y_test, y_pred)


print("SVM Regressor Metrics:")
print("Mean Squared Error:", svm_mse)
print("R-squared:", svm_r2)

In [None]:
# Print the metrics for all models
print("Lasso Regression Metrics:")
print("Mean Squared Error:", lasso_mse)
print("R-squared:", lasso_r2)
print()

print("Ridge Regression Metrics:")
print("Mean Squared Error:", ridge_mse)
print("R-squared:", ridge_r2)
print()

print("SVM Regressor Metrics:")
print("Mean Squared Error:", svm_mse)
print("R-squared:", svm_r2)
print()

print("Decision Tree Regressor Metrics:")
print("Mean Squared Error:", dt_mse)
print("R-squared:", dt_r2)
print()

In [None]:
# Compare models and select the best one based on MSE and R-squared
models = ['Lasso Regression', 'Ridge Regression', 'SVM Regressor', 'Decision Tree Regressor']
mse_scores = [lasso_mse, ridge_mse, svm_mse, dt_mse]
r2_scores = [lasso_r2, ridge_r2, svm_r2, dt_r2]

best_model_index = mse_scores.index(min(mse_scores))
best_model_name = models[best_model_index]

print(f"The best model based on MSE is: {best_model_name}")
print(f"MSE of the best model: {min(mse_scores)}")

best_model_index = r2_scores.index(max(r2_scores))
best_model_name = models[best_model_index]

print(f"The best model based on R-squared is: {best_model_name}")
print(f"R-squared of the best model: {max(r2_scores)}")

## Comparison based on MSE and R-squared values:

In [None]:
data = {
    'Model': models,
    'MSE': mse_scores,
    'R-squared': r2_scores
}

results = pd.DataFrame(data)

results = results.sort_values(by='MSE', ascending=True)

results = results.set_index('Model')

print(results)

After evaluating several regression models on the *Wine Quality dataset*, we have determined the best model based on two key metrics: Mean Squared Error (MSE) and R-squared (R²).

1. **Best Model for MSE:** Ridge Regression
   - The Ridge Regression model achieved the lowest Mean Squared Error (MSE) of approximately 0.376. This indicates that Ridge Regression provides the most accurate predictions among the models we tested in terms of minimizing prediction errors.

2. **Best Model for R-squared:** SVM Regressor
   - The SVM Regressor model achieved the highest R-squared (R²) value of approximately 0.686. R-squared measures the proportion of variance in the target variable that is explained by the model. A higher R² indicates that the SVM Regressor explains a larger portion of the variance in wine quality.

The choice of the "best" model depends on the goals of the prediction task, but MSE and R² should not disagree when models are compared fairly. Because R² = 1 − MSE / Var(y_test), both metrics rank models identically whenever every model is scored on the same test set. A split verdict such as the one above (Ridge Regression best on MSE, SVM Regressor best on R²) can therefore only arise when the models were evaluated on different test splits, so the final selection should rest on scores computed on one shared split or on cross-validation.
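
A fair comparison scores every model on the same folds. The sketch below does this with `cross_validate` and a shared `KFold` object. It runs on synthetic data from `make_regression`, and the Lasso penalty `alpha=0.1` is an illustrative value rather than the notebook's setting; the names `candidates` and `cv_results` are likewise assumptions.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data for illustration only
X_toy, y_toy = make_regression(n_samples=300, n_features=6, noise=15.0, random_state=0)

# Scale-sensitive models get a scaler inside the pipeline; the tree does not need one
candidates = {
    "Lasso Regression": make_pipeline(StandardScaler(), Lasso(alpha=0.1)),
    "Ridge Regression": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "SVM Regressor": make_pipeline(StandardScaler(), SVR(kernel="linear", C=1.0)),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=42),
}

# One KFold object with a fixed seed gives every model identical folds
folds = KFold(n_splits=5, shuffle=True, random_state=42)

rows = []
for name, estimator in candidates.items():
    scores = cross_validate(
        estimator, X_toy, y_toy, cv=folds,
        scoring=("neg_mean_squared_error", "r2"),
    )
    rows.append({
        "Model": name,
        "MSE": -scores["test_neg_mean_squared_error"].mean(),
        "R-squared": scores["test_r2"].mean(),
    })

cv_results = pd.DataFrame(rows).set_index("Model").sort_values("MSE")
print(cv_results)
```

Replacing `X_toy` and `y_toy` with the wine features and quality scores would give fold-averaged metrics for each model on identical data.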