# Wine Quality Prediction - Linear Regression

This notebook uses the red wine subset (1,599 samples) of the Wine Quality dataset from the UCI Machine Learning Repository.  
The goal is to build a linear regression model that predicts wine quality (a score from 0 to 10)  
from 11 physicochemical features such as acidity, residual sugar, pH, and alcohol.  

Steps in this notebook:
1. Import and explore the dataset.
2. Train/test split and build a linear regression model.
3. Evaluate performance with R² and mean squared error (MSE).
4. Save the trained model and generate the files for a Flask app deployed on Heroku.


In [1]:
# Import core libraries
import pandas as pd
import numpy as np

# Scikit-learn for ML
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# For saving the model
import pickle


In [9]:
# Download directly from UCI repo (red wine)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"

# Load into pandas
data = pd.read_csv(url, sep=";")

print("Dataset shape:", data.shape)
data.head()


Dataset shape: (1599, 12)


,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [10]:
# Basic exploration
print("Shape of dataset:", data.shape)  # Expect (1599, 12) for red wine
print("\nMissing values per column:\n", data.isnull().sum())
print("\nSummary statistics:\n", data.describe())


Shape of dataset: (1599, 12)

Missing values per column:
 fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

Summary statistics:
        fixed acidity  volatile acidity  citric acid  residual sugar  \
count    1599.000000       1599.000000  1599.000000     1599.000000   
mean        8.319637          0.527821     0.270976        2.538806   
std         1.741096          0.179060     0.194801        1.409928   
min         4.600000          0.120000     0.000000        0.900000   
25%         7.100000          0.390000     0.090000        1.900000   
50%         7.900000          0.520000     0.260000        2.200000   
75%         9.200000          0.640000     0.420000        2.600000   
max        15.900000       

In [11]:
# Features = all 11 columns except the target
X = data.drop("quality", axis=1)

# Target = wine quality
y = data["quality"]

X.head(), y.head()


(   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
 0            7.4              0.70         0.00             1.9      0.076   
 1            7.8              0.88         0.00             2.6      0.098   
 2            7.8              0.76         0.04             2.3      0.092   
 3           11.2              0.28         0.56             1.9      0.075   
 4            7.4              0.70         0.00             1.9      0.076   
 
    free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
 0                 11.0                  34.0   0.9978  3.51       0.56   
 1                 25.0                  67.0   0.9968  3.20       0.68   
 2                 15.0                  54.0   0.9970  3.26       0.65   
 3                 17.0                  60.0   0.9980  3.16       0.58   
 4                 11.0                  34.0   0.9978  3.51       0.56   
 
    alcohol  
 0      9.4  
 1      9.8  
 2      9.8  
 3      9.8  
 4

In [12]:
# Split data: 80% training, 20% testing
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training samples:", X_train.shape[0])
print("Testing samples:", X_test.shape[0])


Training samples: 1279
Testing samples: 320
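
The reported sizes follow from how `train_test_split` rounds: with 1,599 rows and `test_size=0.2`, the test set gets ceil(319.8) = 320 rows and the training set the remaining 1,279. A minimal sketch, using a stand-in index array instead of the real data (an assumption made so the snippet runs offline), also shows that a fixed `random_state` makes the split repeatable:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 1,599 rows of the red wine data (indices only, no download)
rows = np.arange(1599)

train_a, test_a = train_test_split(rows, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(rows, test_size=0.2, random_state=42)

print(len(train_a), len(test_a))        # sizes of the two partitions
print(np.array_equal(test_a, test_b))   # same seed, same split
```

Changing or omitting `random_state` would give a different partition and therefore slightly different metrics.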


In [13]:
from sklearn.linear_model import LinearRegression

# Create and train model
model = LinearRegression()
model.fit(X_train, y_train)

print("Model training complete.")


Model training complete.
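
Once fitted, a `LinearRegression` model exposes one coefficient per feature in `coef_` plus an `intercept_`. The sketch below uses a small synthetic frame (hypothetical columns `x1` and `x2`, not the wine data) so the recovered values can be checked against a known relationship. On the notebook's model the same pattern would be `pd.Series(model.coef_, index=X_train.columns)`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with a known relationship: y = 2*x1 - 3*x2 + 1
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
demo_y = 2 * demo_X["x1"] - 3 * demo_X["x2"] + 1

demo_model = LinearRegression().fit(demo_X, demo_y)

# Pair each coefficient with its column name for readability
coefs = pd.Series(demo_model.coef_, index=demo_X.columns)
print(coefs)
print("intercept:", demo_model.intercept_)
```

Because the wine features are on different scales, their coefficient magnitudes are not directly comparable unless the features are standardized first.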


In [14]:
from sklearn.metrics import mean_squared_error, r2_score

# Predictions
y_pred = model.predict(X_test)

# Metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)
print("R-squared (R²):", r2)


Mean Squared Error (MSE): 0.3900251439639543
R-squared (R²): 0.40318034127962277
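
To put these numbers in context: R² compares the model's squared error with that of a baseline that always predicts the mean of `y_test`, and the square root of the MSE gives an error in quality points. A minimal sketch with made-up arrays (`y_true_demo` and `y_pred_demo` are illustrative, not notebook outputs) checks the definition against `r2_score`, then applies the arithmetic to the MSE printed above:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative values only, not taken from the wine data
y_true_demo = np.array([5, 6, 5, 7, 4, 6])
y_pred_demo = np.array([5.2, 5.7, 5.4, 6.3, 4.9, 5.8])

ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)         # model's squared error
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)  # mean-baseline squared error
r2_manual = 1 - ss_res / ss_tot

# Both checks compare the hand-computed value with scikit-learn's
print(np.isclose(r2_manual, r2_score(y_true_demo, y_pred_demo)))
print(np.isclose(ss_res / len(y_true_demo),
                 mean_squared_error(y_true_demo, y_pred_demo)))

# Root mean squared error for the MSE reported by the notebook
reported_mse = 0.3900251439639543
rmse = np.sqrt(reported_mse)
print(rmse)
```

An RMSE of roughly 0.62 means predictions are typically off by a bit more than half a quality point, and an R² of about 0.40 means the model removes about 40% of the mean baseline's squared error on this test split.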


In [15]:
import pickle

# Save trained model to file
with open("wine_quality_model.pkl", "wb") as f:
    pickle.dump(model, f)

print("Model saved as wine_quality_model.pkl")


Model saved as wine_quality_model.pkl
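
Before deploying, it can be worth confirming that a restored model predicts exactly what the original did. The following sketch performs the round trip in memory on a small synthetic model (a stand-in for the wine model, so it runs without the dataset or the `.pkl` file):

```python
import pickle
import numpy as np
from sklearn.linear_model import LinearRegression

# Small synthetic model standing in for the wine model
rng = np.random.default_rng(1)
demo_X = rng.normal(size=(30, 3))
demo_y = demo_X @ np.array([1.0, -2.0, 0.5]) + 4.0
original = LinearRegression().fit(demo_X, demo_y)

# Serialize to bytes and restore, as pickle.dump / pickle.load do with a file
restored = pickle.loads(pickle.dumps(original))

same = np.allclose(original.predict(demo_X), restored.predict(demo_X))
print("Predictions match after round trip:", same)
```

Pickle files can execute arbitrary code when loaded, so only unpickle files from a trusted source, and load them with the same scikit-learn version that was used for training.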


In [16]:
app_code = """
import pickle
from flask import Flask, request, jsonify

# Load trained model
with open("wine_quality_model.pkl", "rb") as f:
    model = pickle.load(f)

app = Flask(__name__)

@app.route('/')
def home():
    return "Wine Quality Prediction API is running!"

@app.route('/predict', methods=['POST'])
def predict():
    try:
        # Input features from JSON request
        data = request.get_json(force=True)
        features = [
            data['fixed_acidity'],
            data['volatile_acidity'],
            data['citric_acid'],
            data['residual_sugar'],
            data['chlorides'],
            data['free_sulfur_dioxide'],
            data['total_sulfur_dioxide'],
            data['density'],
            data['pH'],
            data['sulphates'],
            data['alcohol']
        ]
        prediction = model.predict([features])
        return jsonify({'predicted_quality': float(prediction[0])})
    except Exception as e:
        # Report bad or missing input as a client error rather than HTTP 200
        return jsonify({'error': str(e)}), 400

# Local development only; on Heroku the app is served by gunicorn
if __name__ == "__main__":
    app.run(debug=True)
"""

with open("app.py", "w") as f:
    f.write(app_code)

print("✅ app.py created")


✅ app.py created
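
The model was fitted on a DataFrame whose column names contain spaces (`fixed acidity`), while the API expects underscore keys (`fixed_acidity`) and passes a bare list to `predict`. Recent scikit-learn versions warn when a model fitted with feature names receives unnamed input. One possible way to avoid relying on list order, sketched here with a hypothetical helper `payload_to_frame` that is not part of the generated `app.py`, is to rebuild a one-row DataFrame with the training column names:

```python
import pandas as pd

# Column names in training order, as shown in the dataset preview
FEATURE_COLUMNS = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol",
]

def payload_to_frame(payload: dict) -> pd.DataFrame:
    """Map underscore-style JSON keys to a one-row frame in training order.

    Raises KeyError if a required feature is missing from the payload.
    """
    row = {col: float(payload[col.replace(" ", "_")]) for col in FEATURE_COLUMNS}
    return pd.DataFrame([row], columns=FEATURE_COLUMNS)

# Example request body using the first row of the dataset preview
sample_payload = {
    "fixed_acidity": 7.4, "volatile_acidity": 0.7, "citric_acid": 0.0,
    "residual_sugar": 1.9, "chlorides": 0.076, "free_sulfur_dioxide": 11.0,
    "total_sulfur_dioxide": 34.0, "density": 0.9978, "pH": 3.51,
    "sulphates": 0.56, "alcohol": 9.4,
}
frame = payload_to_frame(sample_payload)
print(frame.shape)  # (1, 11)
```

Inside `predict()`, `model.predict(payload_to_frame(data))` would then take the place of `model.predict([features])`.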


In [17]:
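# Unpinned for brevity; pinning scikit-learn to the training version avoids pickle incompatibilities
requirements = """flask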
pandas
scikit-learn
gunicorn
"""
with open("requirements.txt", "w") as f:
    f.write(requirements)

print("✅ requirements.txt created")


✅ requirements.txt created


In [18]:
with open("runtime.txt", "w") as f:
    f.write("python-3.9.18")

print("✅ runtime.txt created")


✅ runtime.txt created
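
Heroku reads the command for the web process from a `Procfile`, which none of the cells above create. Assuming the Flask object is named `app` inside `app.py` (as in the generated file) and that `gunicorn` from `requirements.txt` serves it, a cell in the same style could look like this:

```python
# Assumption: gunicorn serves the Flask object `app` defined in app.py
procfile = "web: gunicorn app:app\n"

with open("Procfile", "w") as f:
    f.write(procfile)

print("Procfile created")
```

The file name has no extension, and the `app:app` part means "module `app`, object `app`".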
