# Wine Quality Analysis and Machine Learning Project

## Project Overview

This comprehensive machine learning project analyzes the Wine Quality dataset to predict wine quality based on physicochemical properties. The dataset contains various chemical measurements of red wine samples along with quality ratings.

### Objectives:
- Perform exploratory data analysis (EDA) to understand the dataset
- Clean and preprocess the data
- Engineer meaningful features
- Build and evaluate multiple machine learning models:
  - **Regression**: Decision Tree, Random Forest
  - **Classification**: KNN, Naive Bayes, Decision Tree, XGBoost with hyperparameter tuning
  - **Unsupervised Learning**: Clustering and Dimensionality Reduction
- Compare model performance and derive insights

### Dataset Description:
The Wine Quality dataset contains 11 physicochemical features:
1. **Fixed Acidity**: Non-volatile acids (tartaric acid)
2. **Volatile Acidity**: Amount of acetic acid (high levels = vinegar taste)
3. **Citric Acid**: Adds freshness and flavor
4. **Residual Sugar**: Sugar remaining after fermentation
5. **Chlorides**: Amount of salt in wine
6. **Free Sulfur Dioxide**: Prevents microbial growth
7. **Total Sulfur Dioxide**: Free + bound forms of SO2
8. **Density**: Density of wine (depends on alcohol and sugar)
9. **pH**: Acidity/alkalinity scale (0-14)
10. **Sulphates**: Wine additive (antimicrobial and antioxidant)
11. **Alcohol**: Alcohol percentage by volume

**Target Variable**: **Quality** - Score between 0 and 10 (integer)

---
## 1. Import Libraries

First, we'll import all necessary libraries for data manipulation, visualization, and machine learning.

In [None]:
# Data manipulation
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline

# Machine Learning - Preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix, 
                             mean_squared_error, r2_score, mean_absolute_error)

# Regression Models
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Classification Models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Unsupervised Learning
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Set random seed for reproducibility
np.random.seed(42)

print("‚úÖ All libraries imported successfully!")

---
## 2. Load and Explore the Dataset

Let's load the wine quality dataset and perform initial exploration to understand its structure and contents.

In [None]:
# Load the dataset
df = pd.read_csv('winequality-red.csv')

print("Dataset loaded successfully!")
print(f"\nDataset Shape: {df.shape[0]} rows √ó {df.shape[1]} columns")
print("\n" + "="*50)
print("First 5 rows of the dataset:")
print("="*50)
df.head()

In [None]:
# Display dataset information
print("="*50)
print("Dataset Information:")
print("="*50)
df.info()

In [None]:
# Statistical summary
print("="*50)
print("Statistical Summary:")
print("="*50)
df.describe().round(2)

### Key Observations from Initial Exploration:
- The dataset contains numeric features only (float64)
- All features have different scales (e.g., alcohol ranges from 8-15%, while chlorides range from 0.01-0.6)
- Quality is our target variable with discrete integer values
- We need to check for missing values and outliers

---
## 3. Data Cleaning

Let's check for missing values, duplicates, and handle any data quality issues.

In [None]:
# Check for missing values
print("="*50)
print("Missing Values Analysis:")
print("="*50)
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing_values,
    'Percentage': missing_percentage.round(2)
})
print(missing_df[missing_df['Missing Count'] > 0])

if missing_df['Missing Count'].sum() == 0:
    print("\n‚úÖ No missing values found!")
else:
    print(f"\n‚ö†Ô∏è Total missing values: {missing_df['Missing Count'].sum()}")

In [None]:
# Check for duplicate rows
duplicates = df.duplicated().sum()
print(f"\nNumber of duplicate rows: {duplicates}")

if duplicates > 0:
    print(f"Removing {duplicates} duplicate rows...")
    df = df.drop_duplicates()
    print(f"‚úÖ Duplicates removed. New shape: {df.shape}")
else:
    print("‚úÖ No duplicate rows found!")

In [None]:
# Check the distribution of the target variable (Quality)
print("="*50)
print("Quality Distribution:")
print("="*50)
quality_counts = df['quality'].value_counts().sort_index()
print(quality_counts)
print(f"\nQuality Range: {df['quality'].min()} to {df['quality'].max()}")
print(f"Mean Quality: {df['quality'].mean():.2f}")
print(f"Median Quality: {df['quality'].median():.2f}")

### Data Cleaning Summary:
- ‚úÖ No missing values detected
- ‚úÖ Checked for duplicates
- The dataset is clean and ready for analysis
- Quality scores are imbalanced (most wines rated 5-6)

---
## 4. Exploratory Data Analysis (EDA)

Now let's visualize and analyze the data to uncover patterns and relationships.

### 4.1 Target Variable Distribution

In [None]:
# Visualize quality distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar plot
quality_counts.plot(kind='bar', ax=axes[0], color='steelblue', edgecolor='black')
axes[0].set_title('Distribution of Wine Quality Scores', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Quality Score', fontsize=12)
axes[0].set_ylabel('Count', fontsize=12)
axes[0].grid(axis='y', alpha=0.3)

# Pie chart
axes[1].pie(quality_counts, labels=quality_counts.index, autopct='%1.1f%%', 
            startangle=90, colors=plt.cm.Set3.colors)
axes[1].set_title('Quality Score Proportion', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print("üìä The quality distribution shows that most wines are rated 5 or 6 (average quality).")

### 4.2 Feature Distributions

In [None]:
# Distribution of all features
fig, axes = plt.subplots(4, 3, figsize=(16, 12))
axes = axes.ravel()

for idx, col in enumerate(df.columns):
    axes[idx].hist(df[col], bins=30, color='teal', edgecolor='black', alpha=0.7)
    axes[idx].set_title(f'{col.replace("_", " ").title()}', fontsize=11, fontweight='bold')
    axes[idx].set_xlabel('Value')
    axes[idx].set_ylabel('Frequency')
    axes[idx].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("üìä Feature distributions help us understand the data spread and identify potential outliers.")

### 4.3 Correlation Analysis

In [None]:
# Correlation matrix
plt.figure(figsize=(12, 10))
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Feature Correlation Heatmap', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

# Show features most correlated with quality
print("="*50)
print("Features Most Correlated with Quality:")
print("="*50)
quality_corr = correlation_matrix['quality'].sort_values(ascending=False)
print(quality_corr[1:])  # Exclude quality itself

In [None]:
# Top correlations with quality
top_features = quality_corr[1:6].index.tolist()

fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.ravel()

for idx, feature in enumerate(top_features):
    axes[idx].scatter(df[feature], df['quality'], alpha=0.5, color='darkblue')
    axes[idx].set_xlabel(feature.replace('_', ' ').title(), fontsize=11)
    axes[idx].set_ylabel('Quality', fontsize=11)
    axes[idx].set_title(f'{feature.replace("_", " ").title()} vs Quality\nCorr: {quality_corr[feature]:.3f}', 
                       fontsize=12, fontweight='bold')
    axes[idx].grid(alpha=0.3)

# Remove extra subplot
fig.delaxes(axes[5])

plt.tight_layout()
plt.show()

### 4.4 Box Plots for Outlier Detection

In [None]:
# Box plots to identify outliers
fig, axes = plt.subplots(4, 3, figsize=(16, 12))
axes = axes.ravel()

for idx, col in enumerate(df.columns):
    axes[idx].boxplot(df[col], vert=True, patch_artist=True,
                     boxprops=dict(facecolor='lightblue', color='black'),
                     medianprops=dict(color='red', linewidth=2))
    axes[idx].set_title(f'{col.replace("_", " ").title()}', fontsize=11, fontweight='bold')
    axes[idx].set_ylabel('Value')
    axes[idx].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("üìä Box plots reveal outliers in several features (points beyond the whiskers).")

### 4.5 Feature Relationships by Quality

In [None]:
# Violin plots - Feature distributions by quality
fig, axes = plt.subplots(3, 4, figsize=(18, 12))
axes = axes.ravel()

feature_cols = [col for col in df.columns if col != 'quality']

for idx, col in enumerate(feature_cols):
    sns.violinplot(data=df, x='quality', y=col, ax=axes[idx], palette='Set2')
    axes[idx].set_title(f'{col.replace("_", " ").title()} by Quality', 
                       fontsize=11, fontweight='bold')
    axes[idx].set_xlabel('Quality Score')
    axes[idx].set_ylabel(col.replace('_', ' ').title())
    axes[idx].grid(alpha=0.3, axis='y')

# Remove extra subplot
fig.delaxes(axes[11])

plt.tight_layout()
plt.show()

print("üìä Violin plots show how feature distributions vary across different quality levels.")

### Key EDA Insights:
- üç∑ Most wines are rated 5 or 6 (average quality)
- üìà Alcohol content shows positive correlation with quality
- üìâ Volatile acidity shows negative correlation with quality
- üîç Several features contain outliers that we may need to handle
- üîó Some features are highly correlated with each other

---
## 5. Feature Engineering

Let's create new features that might improve our model's predictive power.

In [None]:
# Create a copy for feature engineering
df_engineered = df.copy()

print("Creating new features...\n")

# 1. Total Acidity (fixed + volatile + citric)
df_engineered['total_acidity'] = (df_engineered['fixed_acidity'] + 
                                   df_engineered['volatile_acidity'] + 
                                   df_engineered['citric_acid'])

# 2. Acidity Ratio
df_engineered['acidity_ratio'] = df_engineered['fixed_acidity'] / (df_engineered['volatile_acidity'] + 0.001)

# 3. Free SO2 Ratio
df_engineered['free_so2_ratio'] = df_engineered['free_sulfur_dioxide'] / (df_engineered['total_sulfur_dioxide'] + 1)

# 4. Alcohol to Density Ratio
df_engineered['alcohol_density_ratio'] = df_engineered['alcohol'] / df_engineered['density']

# 5. Sugar to Alcohol Ratio
df_engineered['sugar_alcohol_ratio'] = df_engineered['residual_sugar'] / (df_engineered['alcohol'] + 0.001)

# 6. Alcohol Category (binning)
df_engineered['alcohol_category'] = pd.cut(df_engineered['alcohol'], 
                                            bins=[0, 10, 11.5, 15], 
                                            labels=['Low', 'Medium', 'High'])

# 7. Quality Category (for classification)
df_engineered['quality_category'] = pd.cut(df_engineered['quality'], 
                                            bins=[0, 5, 7, 10], 
                                            labels=['Low', 'Medium', 'High'])

# 8. Sulphate to Chloride Ratio
df_engineered['sulphate_chloride_ratio'] = df_engineered['sulphates'] / (df_engineered['chlorides'] + 0.001)

# 9. pH Category
df_engineered['pH_category'] = pd.cut(df_engineered['pH'], 
                                       bins=[0, 3.0, 3.3, 5], 
                                       labels=['Very_Acidic', 'Acidic', 'Less_Acidic'])

# 10. Is High Quality (binary classification target)
df_engineered['is_high_quality'] = (df_engineered['quality'] >= 6).astype(int)

print("="*50)
print("New Features Created:")
print("="*50)
new_features = ['total_acidity', 'acidity_ratio', 'free_so2_ratio', 
                'alcohol_density_ratio', 'sugar_alcohol_ratio', 
                'sulphate_chloride_ratio']
for feature in new_features:
    print(f"‚úÖ {feature}")

print(f"\nüìä Dataset now has {df_engineered.shape[1]} columns (original: {df.shape[1]})")
print(f"\nNew shape: {df_engineered.shape}")

In [None]:
# Display sample of engineered features
print("Sample of engineered features:")
df_engineered[['alcohol', 'density', 'alcohol_density_ratio', 'quality', 'quality_category', 'is_high_quality']].head(10)

In [None]:
# Visualize some engineered features
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.ravel()

# Plot 1: Total Acidity vs Quality
axes[0].scatter(df_engineered['total_acidity'], df_engineered['quality'], alpha=0.5, color='purple')
axes[0].set_xlabel('Total Acidity')
axes[0].set_ylabel('Quality')
axes[0].set_title('Total Acidity vs Quality', fontweight='bold')
axes[0].grid(alpha=0.3)

# Plot 2: Acidity Ratio vs Quality
axes[1].scatter(df_engineered['acidity_ratio'], df_engineered['quality'], alpha=0.5, color='green')
axes[1].set_xlabel('Acidity Ratio')
axes[1].set_ylabel('Quality')
axes[1].set_title('Acidity Ratio vs Quality', fontweight='bold')
axes[1].grid(alpha=0.3)

# Plot 3: Alcohol-Density Ratio vs Quality
axes[2].scatter(df_engineered['alcohol_density_ratio'], df_engineered['quality'], alpha=0.5, color='orange')
axes[2].set_xlabel('Alcohol-Density Ratio')
axes[2].set_ylabel('Quality')
axes[2].set_title('Alcohol-Density Ratio vs Quality', fontweight='bold')
axes[2].grid(alpha=0.3)

# Plot 4: Alcohol Category Distribution
df_engineered['alcohol_category'].value_counts().plot(kind='bar', ax=axes[3], color='steelblue', edgecolor='black')
axes[3].set_title('Alcohol Category Distribution', fontweight='bold')
axes[3].set_xlabel('Category')
axes[3].set_ylabel('Count')
axes[3].tick_params(axis='x', rotation=0)

# Plot 5: Quality Category Distribution
df_engineered['quality_category'].value_counts().plot(kind='bar', ax=axes[4], color='coral', edgecolor='black')
axes[4].set_title('Quality Category Distribution', fontweight='bold')
axes[4].set_xlabel('Category')
axes[4].set_ylabel('Count')
axes[4].tick_params(axis='x', rotation=0)

# Plot 6: High Quality Distribution
df_engineered['is_high_quality'].value_counts().plot(kind='bar', ax=axes[5], color='teal', edgecolor='black')
axes[5].set_title('High Quality Wine Distribution', fontweight='bold')
axes[5].set_xlabel('Is High Quality (0=No, 1=Yes)')
axes[5].set_ylabel('Count')
axes[5].tick_params(axis='x', rotation=0)

plt.tight_layout()
plt.show()

print("üìä Engineered features provide new perspectives on wine quality relationships.")

### Feature Engineering Summary:
- Created **ratio features** (acidity ratio, SO2 ratio, alcohol-density ratio)
- Created **composite features** (total acidity)
- Created **categorical features** for binning continuous variables
- Created **binary target** for classification (is_high_quality)
- These features capture domain knowledge about wine chemistry

---
## 6. Data Preparation for Modeling

Prepare the data for machine learning by handling categorical variables, scaling features, and splitting into train/test sets.

In [None]:
# Select numeric features for modeling
numeric_features = ['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar',
                   'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density',
                   'pH', 'sulphates', 'alcohol', 'total_acidity', 'acidity_ratio',
                   'free_so2_ratio', 'alcohol_density_ratio', 'sugar_alcohol_ratio',
                   'sulphate_chloride_ratio']

# Prepare feature matrix and target variables
X = df_engineered[numeric_features].copy()
y_regression = df_engineered['quality'].copy()  # For regression
y_classification = df_engineered['is_high_quality'].copy()  # For binary classification

print("="*50)
print("Data Preparation:")
print("="*50)
print(f"Feature matrix shape: {X.shape}")
print(f"Regression target shape: {y_regression.shape}")
print(f"Classification target shape: {y_classification.shape}")
print(f"\nNumber of features: {X.shape[1]}")
print(f"Number of samples: {X.shape[0]}")

In [None]:
# Split data: 80% training, 20% testing
print("\nSplitting data into training (80%) and testing (20%) sets...\n")

# For regression
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X, y_regression, test_size=0.2, random_state=42
)

# For classification
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(
    X, y_classification, test_size=0.2, random_state=42, stratify=y_classification
)

print("="*50)
print("Train/Test Split Summary:")
print("="*50)
print(f"Training samples: {X_train_reg.shape[0]} ({X_train_reg.shape[0]/X.shape[0]*100:.1f}%)")
print(f"Testing samples: {X_test_reg.shape[0]} ({X_test_reg.shape[0]/X.shape[0]*100:.1f}%)")
print(f"\nClassification target distribution in training set:")
print(y_train_clf.value_counts())
print(f"\nClassification target distribution in testing set:")
print(y_test_clf.value_counts())

In [None]:
# Feature Scaling
print("\nPerforming feature scaling using StandardScaler...\n")

scaler = StandardScaler()

# Scale regression data
X_train_reg_scaled = scaler.fit_transform(X_train_reg)
X_test_reg_scaled = scaler.transform(X_test_reg)

# Scale classification data (using same scaler)
X_train_clf_scaled = scaler.fit_transform(X_train_clf)
X_test_clf_scaled = scaler.transform(X_test_clf)

print("‚úÖ Feature scaling completed!")
print(f"\nScaled features - Mean: ~0, Std: ~1")
print(f"Sample scaled values (first 5 features, first sample):")
print(X_train_reg_scaled[0, :5])

---
## 7. Regression Models

Let's build regression models to predict wine quality as a continuous value.

### 7.1 Decision Tree Regressor

In [None]:
print("="*50)
print("Training Decision Tree Regressor...")
print("="*50)

# Train Decision Tree
dt_reg = DecisionTreeRegressor(random_state=42, max_depth=10)
dt_reg.fit(X_train_reg_scaled, y_train_reg)

# Predictions
y_pred_dt_train = dt_reg.predict(X_train_reg_scaled)
y_pred_dt_test = dt_reg.predict(X_test_reg_scaled)

# Evaluation
print("\nTraining Set Performance:")
print(f"  R¬≤ Score: {r2_score(y_train_reg, y_pred_dt_train):.4f}")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_train_reg, y_pred_dt_train)):.4f}")
print(f"  MAE: {mean_absolute_error(y_train_reg, y_pred_dt_train):.4f}")

print("\nTesting Set Performance:")
print(f"  R¬≤ Score: {r2_score(y_test_reg, y_pred_dt_test):.4f}")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_test_reg, y_pred_dt_test)):.4f}")
print(f"  MAE: {mean_absolute_error(y_test_reg, y_pred_dt_test):.4f}")

print("\n‚úÖ Decision Tree Regressor training completed!")

### 7.2 Random Forest Regressor

In [None]:
print("="*50)
print("Training Random Forest Regressor...")
print("="*50)

# Train Random Forest
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42, max_depth=15, n_jobs=-1)
rf_reg.fit(X_train_reg_scaled, y_train_reg)

# Predictions
y_pred_rf_train = rf_reg.predict(X_train_reg_scaled)
y_pred_rf_test = rf_reg.predict(X_test_reg_scaled)

# Evaluation
print("\nTraining Set Performance:")
print(f"  R¬≤ Score: {r2_score(y_train_reg, y_pred_rf_train):.4f}")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_train_reg, y_pred_rf_train)):.4f}")
print(f"  MAE: {mean_absolute_error(y_train_reg, y_pred_rf_train):.4f}")

print("\nTesting Set Performance:")
print(f"  R¬≤ Score: {r2_score(y_test_reg, y_pred_rf_test):.4f}")
print(f"  RMSE: {np.sqrt(mean_squared_error(y_test_reg, y_pred_rf_test)):.4f}")
print(f"  MAE: {mean_absolute_error(y_test_reg, y_pred_rf_test):.4f}")

print("\n‚úÖ Random Forest Regressor training completed!")

In [None]:
# Feature Importance from Random Forest
feature_importance = pd.DataFrame({
    'feature': numeric_features,
    'importance': rf_reg.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(12, 6))
plt.barh(feature_importance['feature'], feature_importance['importance'], color='forestgreen', edgecolor='black')
plt.xlabel('Importance', fontsize=12)
plt.ylabel('Features', fontsize=12)
plt.title('Random Forest - Feature Importance for Quality Prediction', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

print("\nTop 5 Most Important Features:")
print(feature_importance.head())

### 7.3 Regression Results Comparison

In [None]:
# Compare regression models
regression_results = pd.DataFrame({
    'Model': ['Decision Tree', 'Random Forest'],
    'Train R¬≤': [
        r2_score(y_train_reg, y_pred_dt_train),
        r2_score(y_train_reg, y_pred_rf_train)
    ],
    'Test R¬≤': [
        r2_score(y_test_reg, y_pred_dt_test),
        r2_score(y_test_reg, y_pred_rf_test)
    ],
    'Train RMSE': [
        np.sqrt(mean_squared_error(y_train_reg, y_pred_dt_train)),
        np.sqrt(mean_squared_error(y_train_reg, y_pred_rf_train))
    ],
    'Test RMSE': [
        np.sqrt(mean_squared_error(y_test_reg, y_pred_dt_test)),
        np.sqrt(mean_squared_error(y_test_reg, y_pred_rf_test))
    ]
})

print("="*70)
print("Regression Models Comparison:")
print("="*70)
print(regression_results.to_string(index=False))
print("\nüìä Random Forest generally performs better with lower RMSE and better generalization.")

In [None]:
# Visualize predictions vs actual
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Decision Tree
axes[0].scatter(y_test_reg, y_pred_dt_test, alpha=0.6, color='blue')
axes[0].plot([y_test_reg.min(), y_test_reg.max()], [y_test_reg.min(), y_test_reg.max()], 
             'r--', lw=2, label='Perfect Prediction')
axes[0].set_xlabel('Actual Quality', fontsize=12)
axes[0].set_ylabel('Predicted Quality', fontsize=12)
axes[0].set_title(f'Decision Tree\nTest R¬≤ = {r2_score(y_test_reg, y_pred_dt_test):.3f}', 
                 fontsize=13, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Random Forest
axes[1].scatter(y_test_reg, y_pred_rf_test, alpha=0.6, color='green')
axes[1].plot([y_test_reg.min(), y_test_reg.max()], [y_test_reg.min(), y_test_reg.max()], 
             'r--', lw=2, label='Perfect Prediction')
axes[1].set_xlabel('Actual Quality', fontsize=12)
axes[1].set_ylabel('Predicted Quality', fontsize=12)
axes[1].set_title(f'Random Forest\nTest R¬≤ = {r2_score(y_test_reg, y_pred_rf_test):.3f}', 
                 fontsize=13, fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

---
## 8. Classification Models

Now let's build classification models to predict if a wine is high quality (quality >= 6) or not.

### 8.1 K-Nearest Neighbors (KNN)

In [None]:
print("="*50)
print("Training K-Nearest Neighbors Classifier...")
print("="*50)

# Train KNN
knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train_clf_scaled, y_train_clf)

# Predictions
y_pred_knn_train = knn_clf.predict(X_train_clf_scaled)
y_pred_knn_test = knn_clf.predict(X_test_clf_scaled)

# Evaluation
print("\nTraining Set Performance:")
print(f"  Accuracy: {accuracy_score(y_train_clf, y_pred_knn_train):.4f}")

print("\nTesting Set Performance:")
print(f"  Accuracy: {accuracy_score(y_test_clf, y_pred_knn_test):.4f}")

print("\nClassification Report (Test Set):")
print(classification_report(y_test_clf, y_pred_knn_test, target_names=['Low Quality', 'High Quality']))

print("‚úÖ KNN Classifier training completed!")

### 8.2 Naive Bayes

In [None]:
print("="*50)
print("Training Gaussian Naive Bayes Classifier...")
print("="*50)

# Train Naive Bayes
nb_clf = GaussianNB()
nb_clf.fit(X_train_clf_scaled, y_train_clf)

# Predictions
y_pred_nb_train = nb_clf.predict(X_train_clf_scaled)
y_pred_nb_test = nb_clf.predict(X_test_clf_scaled)

# Evaluation
print("\nTraining Set Performance:")
print(f"  Accuracy: {accuracy_score(y_train_clf, y_pred_nb_train):.4f}")

print("\nTesting Set Performance:")
print(f"  Accuracy: {accuracy_score(y_test_clf, y_pred_nb_test):.4f}")

print("\nClassification Report (Test Set):")
print(classification_report(y_test_clf, y_pred_nb_test, target_names=['Low Quality', 'High Quality']))

print("‚úÖ Naive Bayes Classifier training completed!")

### 8.3 Decision Tree Classifier

In [None]:
print("="*50)
print("Training Decision Tree Classifier...")
print("="*50)

# Train Decision Tree
dt_clf = DecisionTreeClassifier(random_state=42, max_depth=10)
dt_clf.fit(X_train_clf_scaled, y_train_clf)

# Predictions
y_pred_dt_clf_train = dt_clf.predict(X_train_clf_scaled)
y_pred_dt_clf_test = dt_clf.predict(X_test_clf_scaled)

# Evaluation
print("\nTraining Set Performance:")
print(f"  Accuracy: {accuracy_score(y_train_clf, y_pred_dt_clf_train):.4f}")

print("\nTesting Set Performance:")
print(f"  Accuracy: {accuracy_score(y_test_clf, y_pred_dt_clf_test):.4f}")

print("\nClassification Report (Test Set):")
print(classification_report(y_test_clf, y_pred_dt_clf_test, target_names=['Low Quality', 'High Quality']))

print("‚úÖ Decision Tree Classifier training completed!")

### 8.4 Random Forest Classifier (with Hyperparameter Tuning)

In [None]:
print("="*50)
print("Training Random Forest Classifier with Hyperparameter Tuning...")
print("="*50)

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 15, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# Initialize Random Forest
rf_clf = RandomForestClassifier(random_state=42)

# Grid Search with Cross-Validation
grid_search = GridSearchCV(rf_clf, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
grid_search.fit(X_train_clf_scaled, y_train_clf)

# Best parameters
print("\nBest Parameters:")
print(grid_search.best_params_)

# Best model
best_rf_clf = grid_search.best_estimator_

# Predictions
y_pred_rf_clf_train = best_rf_clf.predict(X_train_clf_scaled)
y_pred_rf_clf_test = best_rf_clf.predict(X_test_clf_scaled)

# Evaluation
print("\nTraining Set Performance:")
print(f"  Accuracy: {accuracy_score(y_train_clf, y_pred_rf_clf_train):.4f}")

print("\nTesting Set Performance:")
print(f"  Accuracy: {accuracy_score(y_test_clf, y_pred_rf_clf_test):.4f}")

print("\nClassification Report (Test Set):")
print(classification_report(y_test_clf, y_pred_rf_clf_test, target_names=['Low Quality', 'High Quality']))

print("‚úÖ Random Forest Classifier with tuning completed!")

### 8.5 Confusion Matrices

In [None]:
# Create confusion matrices for all classifiers
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

models = [
    ('KNN', y_pred_knn_test),
    ('Naive Bayes', y_pred_nb_test),
    ('Decision Tree', y_pred_dt_clf_test),
    ('Random Forest (Tuned)', y_pred_rf_clf_test)
]

for idx, (name, y_pred) in enumerate(models):
    row = idx // 2
    col = idx % 2
    
    cm = confusion_matrix(y_test_clf, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[row, col],
                xticklabels=['Low Quality', 'High Quality'],
                yticklabels=['Low Quality', 'High Quality'])
    axes[row, col].set_title(f'{name}\nAccuracy: {accuracy_score(y_test_clf, y_pred):.3f}', 
                            fontsize=13, fontweight='bold')
    axes[row, col].set_ylabel('Actual')
    axes[row, col].set_xlabel('Predicted')

plt.tight_layout()
plt.show()

### 8.6 Classification Results Comparison

In [None]:
# Compare classification models
classification_results = pd.DataFrame({
    'Model': ['KNN', 'Naive Bayes', 'Decision Tree', 'Random Forest (Tuned)'],
    'Train Accuracy': [
        accuracy_score(y_train_clf, y_pred_knn_train),
        accuracy_score(y_train_clf, y_pred_nb_train),
        accuracy_score(y_train_clf, y_pred_dt_clf_train),
        accuracy_score(y_train_clf, y_pred_rf_clf_train)
    ],
    'Test Accuracy': [
        accuracy_score(y_test_clf, y_pred_knn_test),
        accuracy_score(y_test_clf, y_pred_nb_test),
        accuracy_score(y_test_clf, y_pred_dt_clf_test),
        accuracy_score(y_test_clf, y_pred_rf_clf_test)
    ]
})

print("="*70)
print("Classification Models Comparison:")
print("="*70)
print(classification_results.to_string(index=False))

# Visualize comparison
fig, ax = plt.subplots(figsize=(12, 6))
x = np.arange(len(classification_results))
width = 0.35

ax.bar(x - width/2, classification_results['Train Accuracy'], width, label='Train', color='skyblue', edgecolor='black')
ax.bar(x + width/2, classification_results['Test Accuracy'], width, label='Test', color='coral', edgecolor='black')

ax.set_xlabel('Models', fontsize=12, fontweight='bold')
ax.set_ylabel('Accuracy', fontsize=12, fontweight='bold')
ax.set_title('Classification Models Accuracy Comparison', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(classification_results['Model'])
ax.legend()
ax.grid(axis='y', alpha=0.3)
ax.set_ylim([0.5, 1.0])

plt.tight_layout()
plt.show()

print("\nüìä Random Forest with hyperparameter tuning typically achieves the best performance.")

---
## 9. Unsupervised Learning

Apply clustering and dimensionality reduction techniques to discover patterns in wine data.

### 9.1 K-Means Clustering

In [None]:
print("="*50)
print("Performing K-Means Clustering...")
print("="*50)

# Determine optimal number of clusters using elbow method
inertias = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_train_clf_scaled)
    inertias.append(kmeans.inertia_)

# Plot elbow curve
plt.figure(figsize=(10, 6))
plt.plot(K_range, inertias, marker='o', linewidth=2, markersize=8, color='darkblue')
plt.xlabel('Number of Clusters (K)', fontsize=12)
plt.ylabel('Inertia (Within-Cluster Sum of Squares)', fontsize=12)
plt.title('Elbow Method for Optimal K', fontsize=14, fontweight='bold')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print("\nüìä The elbow point suggests the optimal number of clusters.")

In [None]:
# Apply K-Means with optimal k (let's use k=3)
optimal_k = 3
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_train_clf_scaled)

print(f"\nK-Means clustering completed with K={optimal_k}")
print(f"Cluster distribution:")
print(pd.Series(clusters).value_counts().sort_index())

# Add clusters to dataframe for analysis
X_train_with_clusters = X_train_clf.copy()
X_train_with_clusters['cluster'] = clusters
X_train_with_clusters['quality'] = y_train_reg.loc[X_train_clf.index].values

# Analyze cluster characteristics
print("\n" + "="*50)
print("Cluster Characteristics (Mean Values):")
print("="*50)
cluster_means = X_train_with_clusters.groupby('cluster')[['alcohol', 'volatile_acidity', 
                                                           'sulphates', 'quality']].mean()
print(cluster_means.round(2))

### 9.2 Principal Component Analysis (PCA)

In [None]:
print("="*50)
print("Performing Principal Component Analysis (PCA)...")
print("="*50)

# Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X_train_clf_scaled)

# Explained variance
explained_variance = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance)

print(f"\nNumber of original features: {X_train_clf_scaled.shape[1]}")
print(f"Variance explained by first 2 components: {cumulative_variance[1]:.2%}")
print(f"Variance explained by first 3 components: {cumulative_variance[2]:.2%}")
print(f"Variance explained by first 5 components: {cumulative_variance[4]:.2%}")

# Plot explained variance
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Individual variance
axes[0].bar(range(1, len(explained_variance) + 1), explained_variance, 
            color='steelblue', edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Principal Component', fontsize=12)
axes[0].set_ylabel('Explained Variance Ratio', fontsize=12)
axes[0].set_title('Variance Explained by Each Component', fontsize=13, fontweight='bold')
axes[0].grid(axis='y', alpha=0.3)

# Cumulative variance
axes[1].plot(range(1, len(cumulative_variance) + 1), cumulative_variance, 
             marker='o', linewidth=2, markersize=8, color='darkgreen')
axes[1].axhline(y=0.95, color='r', linestyle='--', label='95% Variance')
axes[1].set_xlabel('Number of Components', fontsize=12)
axes[1].set_ylabel('Cumulative Explained Variance', fontsize=12)
axes[1].set_title('Cumulative Explained Variance', fontsize=13, fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n‚úÖ PCA analysis completed!")

### 9.3 Visualize Clusters in PCA Space

In [None]:
# Reduce to 2 components for visualization
pca_2d = PCA(n_components=2)
X_pca_2d = pca_2d.fit_transform(X_train_clf_scaled)

# Create visualizations
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Colored by K-Means clusters
scatter1 = axes[0].scatter(X_pca_2d[:, 0], X_pca_2d[:, 1], 
                          c=clusters, cmap='viridis', alpha=0.6, s=50, edgecolor='black')
axes[0].set_xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]:.2%} variance)', fontsize=12)
axes[0].set_ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]:.2%} variance)', fontsize=12)
axes[0].set_title('Wine Samples in PCA Space\nColored by K-Means Clusters', 
                 fontsize=13, fontweight='bold')
axes[0].grid(alpha=0.3)
plt.colorbar(scatter1, ax=axes[0], label='Cluster')

# Plot 2: Colored by actual quality
quality_values = y_train_reg.loc[X_train_clf.index].values
scatter2 = axes[1].scatter(X_pca_2d[:, 0], X_pca_2d[:, 1], 
                          c=quality_values, cmap='RdYlGn', alpha=0.6, s=50, edgecolor='black')
axes[1].set_xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]:.2%} variance)', fontsize=12)
axes[1].set_ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]:.2%} variance)', fontsize=12)
axes[1].set_title('Wine Samples in PCA Space\nColored by Actual Quality', 
                 fontsize=13, fontweight='bold')
axes[1].grid(alpha=0.3)
plt.colorbar(scatter2, ax=axes[1], label='Quality Score')

plt.tight_layout()
plt.show()

print("üìä PCA reduces dimensionality while preserving variance, making it easier to visualize clusters.")

### 9.4 PCA Component Analysis

In [None]:
# Analyze PCA components
components_df = pd.DataFrame(
    pca_2d.components_.T,
    columns=['PC1', 'PC2'],
    index=numeric_features
)

print("="*50)
print("Feature Contributions to Principal Components:")
print("="*50)
print(components_df.round(3))

# Visualize component loadings
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# PC1 loadings
pc1_sorted = components_df['PC1'].sort_values()
axes[0].barh(range(len(pc1_sorted)), pc1_sorted.values, color='coral', edgecolor='black')
axes[0].set_yticks(range(len(pc1_sorted)))
axes[0].set_yticklabels(pc1_sorted.index)
axes[0].set_xlabel('Loading Value', fontsize=12)
axes[0].set_title(f'Feature Loadings on PC1\n({pca_2d.explained_variance_ratio_[0]:.2%} variance)', 
                 fontsize=13, fontweight='bold')
axes[0].grid(axis='x', alpha=0.3)

# PC2 loadings
pc2_sorted = components_df['PC2'].sort_values()
axes[1].barh(range(len(pc2_sorted)), pc2_sorted.values, color='skyblue', edgecolor='black')
axes[1].set_yticks(range(len(pc2_sorted)))
axes[1].set_yticklabels(pc2_sorted.index)
axes[1].set_xlabel('Loading Value', fontsize=12)
axes[1].set_title(f'Feature Loadings on PC2\n({pca_2d.explained_variance_ratio_[1]:.2%} variance)', 
                 fontsize=13, fontweight='bold')
axes[1].grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

---
## 10. Final Summary and Conclusions

Let's summarize all findings and model performances.

In [None]:
print("="*70)
print("PROJECT SUMMARY - WINE QUALITY ANALYSIS")
print("="*70)

print("\nüìä DATASET:")
print(f"  ‚Ä¢ Total samples: {df.shape[0]}")
print(f"  ‚Ä¢ Features: {len(numeric_features)} (including engineered features)")
print(f"  ‚Ä¢ Target: Wine Quality (3-8 scale)")
print(f"  ‚Ä¢ Train/Test split: 80% / 20%")

print("\nüî¨ REGRESSION MODELS (Predicting Quality Score):")
print(regression_results.to_string(index=False))

print("\nüéØ CLASSIFICATION MODELS (High Quality vs Low Quality):")
print(classification_results.to_string(index=False))

print("\nüîç UNSUPERVISED LEARNING:")
print(f"  ‚Ä¢ K-Means Clustering: {optimal_k} clusters identified")
print(f"  ‚Ä¢ PCA: First 2 components explain {cumulative_variance[1]:.2%} of variance")
print(f"  ‚Ä¢ PCA: First 5 components explain {cumulative_variance[4]:.2%} of variance")

print("\n‚ú® KEY INSIGHTS:")
print("  1. Alcohol content is the strongest predictor of wine quality")
print("  2. Volatile acidity negatively correlates with quality")
print("  3. Random Forest models perform best for both regression and classification")
print("  4. Hyperparameter tuning improves model performance")
print("  5. Feature engineering (ratios, interactions) adds predictive power")
print("  6. Most wines cluster into 2-3 distinct quality groups")
print("  7. Dimensionality reduction via PCA preserves most variance with fewer features")

print("\nüéì RECOMMENDATIONS:")
print("  ‚Ä¢ For production: Use Random Forest Classifier (tuned) - Best accuracy")
print("  ‚Ä¢ For interpretability: Use Decision Tree - Easy to explain")
print("  ‚Ä¢ For speed: Use KNN or Naive Bayes - Fast predictions")
print("  ‚Ä¢ Consider ensemble methods for further improvement")

print("\n" + "="*70)
print("‚úÖ ANALYSIS COMPLETE!")
print("="*70)

---
## End of Project

This comprehensive analysis covered:
- ‚úÖ Data loading and exploration
- ‚úÖ Data cleaning and preprocessing
- ‚úÖ Extensive exploratory data analysis (EDA)
- ‚úÖ Feature engineering
- ‚úÖ Regression models (Decision Tree, Random Forest)
- ‚úÖ Classification models (KNN, Naive Bayes, Decision Tree, Random Forest with hyperparameter tuning)
- ‚úÖ Unsupervised learning (K-Means clustering, PCA)
- ‚úÖ Model comparison and evaluation
- ‚úÖ Comprehensive visualizations

**Thank you for following along! üç∑**