# 🍷 Predicting Wine Quality with AWS SageMaker  
*An End-to-End Machine Learning Workflow*  

## 📌 Overview  
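This project demonstrates a **complete ML pipeline** using AWS SageMaker to predict wine quality. Quality is an integer score, but the model is trained as a regression task (`reg:squarederror`) and evaluated with RMSE and R². Key steps include:  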
- **Data preprocessing**: loading the CSV, splitting it into train/test sets, and writing label-first files.  
- **Model training** with SageMaker's built-in XGBoost.  
- **Deployment** of a real-time inference endpoint.  
- **Integration** with AWS S3 for data storage.  

**Tools Used**:  
<p align="left">  
  <img src="https://img.shields.io/badge/Python-3776AB?logo=python&logoColor=white" alt="Python">  
  <img src="https://img.shields.io/badge/Amazon_SageMaker-FF9900?logo=amazonaws&logoColor=white" alt="SageMaker">  
  <img src="https://img.shields.io/badge/AWS_S3-569A31?logo=amazons3&logoColor=white" alt="S3">  
</p>  

## 🗂️ Dataset  
**Source**: [UCI Wine Quality Dataset](https://archive.ics.uci.edu/ml/datasets/wine+quality)  
- **11 Features**: Fixed acidity, volatile acidity, citric acid, ... (*chemical properties*).  
- **Target**: Wine quality (0 to 10). 

## 🛠️ Environment Setup  
### **Python Libraries** 

In [1]:
import boto3
import numpy as np
import pandas as pd
import sagemaker
from sagemaker import get_execution_role  
from sagemaker import image_uris
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split  
from sagemaker.inputs import TrainingInput





### Load credentials

In [2]:
from dotenv import load_dotenv
import os

load_dotenv()

# Named s3_bucket because the later cells reference it in lowercase
s3_bucket = os.getenv("S3_BUCKET")
SAGEMAKER_ROLE = os.getenv("SAGEMAKER_ROLE")
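
If `.env` is missing or a key is misspelled, `os.getenv` silently returns `None` and the failure only surfaces later, during the S3 upload. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper (not part of `python-dotenv` or this project), and the bucket name is a placeholder.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Placeholder value for the demo only; in the notebook it would come from .env
os.environ.setdefault("S3_BUCKET", "my-example-bucket")

bucket = require_env("S3_BUCKET")
print(bucket)
```

Calling this right after `load_dotenv()` would turn a missing setting into an immediate, readable error.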

## 🔄 Workflow
### 1️⃣ Data Loading & Preprocessing

In [3]:
# Load data
df = pd.read_csv("winequality-red.csv", delimiter=";")

df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [4]:
# Process data
target = "quality"
features = [col for col in df.columns if col != target]

In [5]:
# Split data
X = df[features]
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Save datasets
pd.concat([y_train, X_train], axis=1).to_csv("train.csv", index=False, header=False)
pd.concat([y_test, X_test], axis=1).to_csv("test.csv", index=False, header=False)
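
The files are written with the target first and without a header because SageMaker's built-in XGBoost reads CSV input that way: label in the first column, no header row. A minimal sketch of a layout check follows, using only the standard library. `check_xgboost_csv` is a hypothetical helper, and the sample rows are truncated to three features for brevity.

```python
import csv
import io


def check_xgboost_csv(text: str, n_features: int) -> int:
    """Validate a label-first, header-free CSV and return its row count."""
    rows = 0
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if len(row) != n_features + 1:
            raise ValueError(
                f"line {line_no}: expected {n_features + 1} fields, got {len(row)}"
            )
        for field in row:
            float(field)  # raises ValueError on a header or non-numeric cell
        rows += 1
    return rows


# quality first, then three of the feature columns (illustrative subset)
sample = "5,7.4,0.7,0.0\n6,11.2,0.28,0.56\n"
print(check_xgboost_csv(sample, n_features=3))
```

For the real files the call would use `n_features=len(features)` on the contents of `train.csv`.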

In [6]:
X.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 |


### 2️⃣ Upload Data to S3
The train and test files are stored in S3 so that SageMaker can access them:

In [7]:
# Initialize S3 client
s3 = boto3.client('s3')

# Define data subpaths
train_subpath = "data/train/train.csv"
test_subpath = "data/test/test.csv"

# Upload data to s3
s3.upload_file("train.csv", s3_bucket, train_subpath)  
s3.upload_file("test.csv", s3_bucket, test_subpath)  
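
`upload_file` takes the bucket and object key as separate arguments, while the training step addresses the same objects as `s3://bucket/key` URIs. A small sketch of converting between the two forms follows. `s3_uri` and `split_s3_uri` are hypothetical helpers, and the bucket name is a placeholder.

```python
from urllib.parse import urlparse


def s3_uri(bucket: str, key: str) -> str:
    """Join a bucket name and object key into an s3:// URI."""
    return f"s3://{bucket}/{key.lstrip('/')}"


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI back into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


uri = s3_uri("my-example-bucket", "data/train/train.csv")
print(uri)
print(split_s3_uri(uri))
```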

### 3️⃣ Model Training with SageMaker
**Algorithm**: XGBoost (built-in SageMaker container).

In [8]:
session = sagemaker.Session()  
role = get_execution_role()  
region = session.boto_region_name

# Define data paths
train_path = f"s3://{s3_bucket}/{train_subpath}"
test_path = f"s3://{s3_bucket}/{test_subpath}"

# Get XGBoost image  
image_uri = image_uris.retrieve(framework="xgboost", region=region, version="1.5-1")  

# Configure estimator  
xgb = sagemaker.estimator.Estimator(  
    image_uri=image_uri,  
    role=role,  
    instance_count=1,  
    instance_type="ml.m5.large",  
    output_path=f"s3://{s3_bucket}/output",  
    sagemaker_session=session
)  

xgb.set_hyperparameters(
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.8,
    verbosity=0,
    objective="reg:squarederror",
    num_round=500,
)

xgb.fit(
    inputs={
        "train": TrainingInput(
            train_path,
            content_type="text/csv"  # tells the XGBoost container the channel is CSV; it is not inferred from the file extension
        )
    }
)

2025-03-21 01:32:12 Starting - Starting the training job...
2025-03-21 01:32:26 Starting - Preparing the instances for training.
2025-03-21 01:32:49 Downloading - Downloading input data.
2025-03-21 01:33:35 Downloading - Downloading the training image.
2025-03-21 01:34:41 Training - Training image download completed. Training in progress.
  from pandas import MultiIndex, Int64Index
[2025-03-21 01:34:31.044 ip-10-2-230-194.ec2.internal:7 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2025-03-21 01:34:31.071 ip-10-2-230-194.ec2.internal:7 INFO profiler_config_parser.py:111] User has disabled profiler.
[2025-03-21:01:34:31:INFO] Imported framework sagemaker_xgboost_container.training
[2025-03-21:01:34:31:INFO] Failed to parse hyperparameter objective value reg:squarederror to Json.
Returning the value itself
[2025-03-21:01:34:31:INFO] No GPUs detected (normal if no gpus installed)
[2025-03-21:01:34:31:INFO] Running XGB

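To give a feel for what `eta` and `num_round` control under the `reg:squarederror` objective, here is a deliberately simplified toy: boosting with constant learners instead of trees. It is not XGBoost's algorithm (no tree splits, no `gamma`, `min_child_weight`, or `subsample`), and `boost_constant` with its `base_score` starting value is a hypothetical name chosen for illustration. Each round fits the mean residual and adds a fraction `eta` of it to the prediction.

```python
def boost_constant(y, eta, num_round, base_score=0.5):
    """Toy squared-error boosting where every learner is a single constant."""
    pred = base_score
    history = []  # training RMSE after each round
    for _ in range(num_round):
        # For squared error the negative gradient is the residual,
        # and the best constant fit to the residuals is their mean.
        residual_mean = sum(v - pred for v in y) / len(y)
        pred += eta * residual_mean  # shrink each step by the learning rate
        rmse = (sum((v - pred) ** 2 for v in y) / len(y)) ** 0.5
        history.append(rmse)
    return pred, history


# Quality values from the five rows shown by df.head()
y = [5, 5, 5, 6, 5]
pred, history = boost_constant(y, eta=0.2, num_round=500)
print(round(pred, 4), round(history[0], 4), round(history[-1], 4))
```

A smaller `eta` approaches the best fit more slowly and therefore needs more rounds, which is the trade-off behind pairing `eta=0.2` with `num_round=500` in the training cell.
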
### 4️⃣ Model Deployment
**Endpoint**: `wine-quality-prediction-<timestamp>` (the timestamp suffix keeps each deployment's name unique).

**Instance**: ml.m5.large

In [9]:
from datetime import datetime

timestamp = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")

# Deploy the model
predictor = xgb.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name = f"wine-quality-prediction-{timestamp}"
)

----------!
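
The timestamp suffix also has to keep the endpoint name valid. The sketch below assumes the constraints as I understand them from the `CreateEndpoint` API (at most 63 characters, alphanumerics and hyphens, starting and ending with an alphanumeric); treat the pattern as an assumption to verify against the AWS documentation. `make_endpoint_name` is a hypothetical helper and the date is an arbitrary example.

```python
import re
from datetime import datetime

# Assumed naming rule; check the CreateEndpoint reference before relying on it
ENDPOINT_NAME_RE = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")
MAX_ENDPOINT_NAME_LEN = 63


def make_endpoint_name(prefix: str, now: datetime) -> str:
    """Build '<prefix>-<timestamp>' and reject names that break the assumed rule."""
    name = f"{prefix}-{now.strftime('%Y-%m-%d-%H-%M-%S')}"
    if len(name) > MAX_ENDPOINT_NAME_LEN or not ENDPOINT_NAME_RE.match(name):
        raise ValueError(f"invalid endpoint name: {name!r}")
    return name


name = make_endpoint_name("wine-quality-prediction", datetime(2025, 3, 21, 1, 40, 0))
print(name, len(name))
```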

## 📊 Results & Evaluation
### Model Performance

In [10]:
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

# Attach serializers
predictor.serializer = CSVSerializer()
predictor.deserializer = JSONDeserializer()

# Predict directly with numpy array/DataFrame
test_preds = predictor.predict(X_test.values)
y_pred = [entry["score"] for entry in test_preds["predictions"]]

# Calculate metrics
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.2f}")
print(f"R² Score: {r2:.2f}")

RMSE: 0.59
R² Score: 0.47
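
To make the two metrics concrete, the sketch below recomputes them from first principles on a tiny made-up example. The response shape (`predictions` holding a list of `score` entries) mirrors what the evaluation cell reads. The scores and true values are illustrative, not taken from the endpoint, and the function names are hypothetical.

```python
import json
import math


def scores_from_response(body: str) -> list[float]:
    """Extract predicted scores from a JSON response body."""
    payload = json.loads(body)
    return [float(entry["score"]) for entry in payload["predictions"]]


def rmse(y_true, y_pred) -> float:
    """Root mean squared error: typical size of a prediction error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred) -> float:
    """Share of the target's variance explained by the predictions."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up response and labels, for illustration only
body = '{"predictions": [{"score": 5.5}, {"score": 6.5}, {"score": 4.5}, {"score": 6.5}]}'
y_true = [5, 6, 5, 7]

y_hat = scores_from_response(body)
print(f"RMSE: {rmse(y_true, y_hat):.2f}")
print(f"R² Score: {r2(y_true, y_hat):.2f}")
```

Read this way, the reported RMSE of 0.59 means predictions are typically off by a bit more than half a quality point, and an R² of 0.47 means the model explains a little under half of the variance in the test labels.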
