# **Red Wine Quality Data Pipeline**
## Summary
In this project I:
- Learned how to use AWS S3, Databricks, and PySpark.
- Set up a data pipeline from AWS S3 to Databricks.
- Created a histogram to explore the distribution of wine quality.
- Used PySpark MLlib to build and evaluate Linear Regression, Random Forest, and Gradient Boosted Trees (GBT) models.
- Investigated the worse-than-expected performance of GBT and tuned its parameters.
- Interpreted the models and the results.

## 1. Set Up AWS S3, Databricks, and Data
1. Created an AWS Account, an S3 Bucket, and Uploaded Data to the Bucket
2. Created a Databricks Account and Connected It to AWS and GitHub
3. Imported Libraries and Loaded Data from S3

In [0]:
# Import Libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.regression import LinearRegression, RandomForestRegressor, GBTRegressor
from pyspark.ml.evaluation import RegressionEvaluator



In [0]:
# Getting Data from S3
df = spark.read.csv("s3://red-wine-quality-data-pipeline/raw/winequality/winequality-red.csv",header=True,inferSchema=True)
display(df)

## 2. Exploratory Data Analysis
1. Displayed the Data in a Dataframe
2. Created a Histogram to Visualize Wine Quality

In [0]:
# Display the Data
display(df)

In [0]:
# Creating a Histogram to Visualize the Distribution of Wine Quality.
plt.hist(df.select('quality').toPandas()['quality'], bins=np.arange(0.5,10.5,1), edgecolor='black')
plt.title('Distribution of Wine Quality')
plt.xlabel('Quality')
plt.ylabel('Count')
plt.show()
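
The bin edges `np.arange(0.5, 10.5, 1)` place each integer quality score in the middle of its own bar. As a rough illustration, the same binning can be sketched in plain Python. The list `sample_ratings` and the helper `bin_index` are hypothetical names used only for this example, and the ratings are made up rather than taken from the wine data.

```python
from collections import Counter

# Same edges as np.arange(0.5, 10.5, 1): 0.5, 1.5, ..., 9.5
edges = [0.5 + i for i in range(10)]

# Hypothetical ratings for illustration only (not taken from the dataset)
sample_ratings = [5, 6, 5, 7, 6, 5, 4, 8, 6, 5]


def bin_index(value, edges):
    """Return the index i of the bin [edges[i], edges[i + 1]) that contains value."""
    for i in range(len(edges) - 1):
        if edges[i] <= value < edges[i + 1]:
            return i
    raise ValueError(f"{value} is outside the bin range")


# Count how many ratings land in each bin, keyed by the bin's center
counts = Counter(edges[bin_index(r, edges)] + 0.5 for r in sample_ratings)
for center in sorted(counts):
    print(f"quality {center:.0f}: {counts[center]}")
```

Because every edge sits halfway between two integers, an integer score can never fall on a bin boundary, so each bar corresponds to exactly one quality score.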

## 3. Build Machine Learning Models
1. Used an Assembler to Prepare the Data for PySpark
2. Split the Data into Training and Test
3. Built a Linear Regression, Random Forest, and Gradient Boosted Trees Model using PySpark MLlib.
4. Evaluated and Printed the Results
5. Experimented with Different `maxIter` Values for GBT

**Notes:**
- Linear Regression: Y = B_0 + B_1\*X_1 + B_2\*X_2 + ... (no interaction terms).
- Random Forest: Takes the average prediction of multiple decision trees, each trained on a different random subset of the data.
- Gradient Boosted Trees: Trains an initial tree, then trains each additional tree on the errors (residuals) left by the trees before it.
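
To make the "train additional trees on the error" idea concrete, here is a minimal pure-Python sketch of gradient boosting for squared error, using one-split trees (stumps). It is an illustration under simplifying assumptions, not how Spark's `GBTRegressor` is implemented: the functions `fit_stump` and `boost`, the toy data, and the `learning_rate` value are all invented for this example.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (stump) to the residuals by least squares."""
    best = None
    # Try a split after every distinct x except the largest, so both sides are non-empty
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def mse(ys, preds):
    """Mean squared error between labels and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def boost(xs, ys, n_rounds, learning_rate=0.5):
    """Return the training MSE of the initial guess and after each boosting round."""
    base = sum(ys) / len(ys)          # initial prediction: the mean label
    preds = [base] * len(ys)
    history = [mse(ys, preds)]
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]   # what is still unexplained
        stump = fit_stump(xs, residuals)                 # next tree targets the residuals
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        history.append(mse(ys, preds))
    return history


# Toy data, invented for illustration
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3.0, 3.5, 5.0, 5.5, 5.0, 6.5, 7.0, 6.0]

history = boost(xs, ys, n_rounds=5)
for i, err in enumerate(history):
    print(f"round {i}: training MSE = {err:.4f}")
```

In this sketch the training error never increases from one round to the next, because each stump is fit directly to what the current ensemble still gets wrong. That is also why a boosted model can look very strong on training data while doing worse on unseen data.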


In [0]:
# Collapse the Features into a Vector so PySpark can use it.
assembler = VectorAssembler(inputCols=df.drop('quality').columns, outputCol='raw_features')
df2 = assembler.transform(df)

# Standardize the Data so LR's Coefficients are Comparable Intra-Model
scaler = StandardScaler(inputCol='raw_features', outputCol='features', withStd=True, withMean=True)
scaler_model = scaler.fit(df2)
df2 = scaler_model.transform(df2)

# Train Test Split
train_df, test_df = df2.randomSplit([0.8, 0.2], seed=9)

# Initialize Evaluator
rmse_evaluator = RegressionEvaluator(labelCol='quality', predictionCol='prediction', metricName='rmse')
r2_evaluator = RegressionEvaluator(labelCol='quality', predictionCol='prediction', metricName='r2')

# 1. Linear Regression
lr = LinearRegression(featuresCol='features', labelCol='quality')
lr_model = lr.fit(train_df)
lr_predictions = lr_model.transform(test_df)
lr_rmse = rmse_evaluator.evaluate(lr_predictions)
lr_r2 = r2_evaluator.evaluate(lr_predictions)

# 2. Random Forest
rf = RandomForestRegressor(featuresCol='features', labelCol='quality', numTrees=20)
rf_model = rf.fit(train_df)
rf_predictions = rf_model.transform(test_df)
rf_rmse = rmse_evaluator.evaluate(rf_predictions)
rf_r2 = r2_evaluator.evaluate(rf_predictions)

# 3. Gradient Boosted Trees
gbt = GBTRegressor(featuresCol='features', labelCol='quality', maxIter=10)
gbt_model = gbt.fit(train_df)
gbt_predictions = gbt_model.transform(test_df)
gbt_rmse = rmse_evaluator.evaluate(gbt_predictions)
gbt_r2 = r2_evaluator.evaluate(gbt_predictions)

# Compare Results
print("\n=== Model Comparison ===")
print("[RMSE]")
print(f"Linear Regression: {lr_rmse:.4f}")
print(f"Random Forest: {rf_rmse:.4f}")
print(f"GBT: {gbt_rmse:.4f}")
print("[R2]")
print(f"Linear Regression: {lr_r2:.4f}")
print(f"Random Forest: {rf_r2:.4f}")
print(f"GBT: {gbt_r2:.4f}")
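
For reference, the two metrics above can be computed by hand. The sketch below uses the standard textbook definitions of RMSE and R2, which is what I understand the `rmse` and `r2` metric names to mean. The labels and predictions are made-up numbers, not model output.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error: typical size of a prediction error, in rating points."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(labels))


def r2(labels, predictions):
    """R2: share of the label variance explained, relative to always predicting the mean."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1 - ss_res / ss_tot


# Made-up labels and predictions for illustration only
labels = [5, 6, 5, 7]
predictions = [5.5, 5.5, 5.0, 6.0]

print(f"RMSE: {rmse(labels, predictions):.4f}")
print(f"R2:   {r2(labels, predictions):.4f}")

# Always predicting the mean label gives R2 = 0, the baseline a model has to beat
mean_only = [sum(labels) / len(labels)] * len(labels)
print(f"R2 of mean-only baseline: {r2(labels, mean_only):.4f}")
```

Lower RMSE is better and higher R2 is better. An R2 below zero means the model does worse than predicting the average rating for every wine.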


Note: GBT performed significantly worse than expected on the test data, so I decided to look at how each model performs on the training data.

In [0]:
# How do the models perform on the training data?
# 1. Linear Regression
lr_predictions_train = lr_model.transform(train_df)
lr_r2_train = r2_evaluator.evaluate(lr_predictions_train)

# 2. Random Forest
rf_predictions_train = rf_model.transform(train_df)
rf_r2_train = r2_evaluator.evaluate(rf_predictions_train)

# 3. Gradient Boosted Trees
gbt_predictions_train = gbt_model.transform(train_df)
gbt_r2_train = r2_evaluator.evaluate(gbt_predictions_train)

# Compare Results
print("\n=== Model Training Data Performance ===")
print("[R2]")
print(f"Linear Regression: {lr_r2_train:.4f}")
print(f"Random Forest: {rf_r2_train:.4f}")
print(f"GBT: {gbt_r2_train:.4f}")


Comment: It looks like GBT may have overfit the training data. What if I changed the number of trees?

In [0]:
# maxIter = 20
gbt20 = GBTRegressor(featuresCol='features', labelCol='quality', maxIter=20)
gbt20_model = gbt20.fit(train_df)
gbt20_predictions = gbt20_model.transform(test_df)
gbt20_r2 = r2_evaluator.evaluate(gbt20_predictions)

# maxIter = 5
gbt5 = GBTRegressor(featuresCol='features', labelCol='quality', maxIter=5)
gbt5_model = gbt5.fit(train_df)
gbt5_predictions = gbt5_model.transform(test_df)
gbt5_r2 = r2_evaluator.evaluate(gbt5_predictions)


print("\n=== Model Comparison ===")
print("[R2]")
print(f"GBT: {gbt_r2:.4f}")
print(f"GBT20: {gbt20_r2:.4f}")
print(f"GBT5: {gbt5_r2:.4f}")

Of the three settings, the original model (`maxIter=10`) was the best. A fuller search could find the best value for `maxIter` (the number of boosting iterations, one tree each). However, based on what we've seen so far, I believe the other models are more suitable for predicting wine quality.

## 4. Results
The models all point to alcohol being the most useful predictor of wine quality.

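The Linear Regression model expects a one-standard-deviation increase in alcohol to raise the rating by 0.30. The coefficient is per standard deviation, not per percentage point of alcohol, because the features were standardized.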
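
Because the coefficient is expressed per standard deviation, it has to be divided by the feature's standard deviation to read it in raw units. The sketch below shows that arithmetic. The value of `alcohol_std` is a hypothetical placeholder, not a number computed from the dataset, and `to_raw_units` is a helper name invented for this example.

```python
def to_raw_units(std_coef, feature_std):
    """Convert a per-standard-deviation coefficient to a per-raw-unit coefficient."""
    return std_coef / feature_std


std_coef = 0.30      # from the text: rating change per one standard deviation of alcohol
alcohol_std = 1.1    # HYPOTHETICAL standard deviation of alcohol; replace with the real value

per_unit_coef = to_raw_units(std_coef, alcohol_std)
print(f"Rating change per one raw unit of alcohol: {per_unit_coef:.3f}")
```

In the notebook, the real standard deviations should be available from the fitted scaler (I believe as `scaler_model.std`), in the same order as `feature_names`.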

The Random Forest and Gradient Boosted Trees models agree: alcohol is their most important feature, accounting for 33% and 15% of total feature importance, respectively.

While GBT has the highest predictive power on the training data, it performs poorly on the test data. After examining the models and tuning the parameters, I believe it is overfitting the training data due to the following factors:
- Small Dataset Size -> Easy for GBT to memorize individual wines.
- More Noise and Less Signal -> Subjective Ratings
- Imbalanced Classes -> GBT memorizes the edge cases

In [0]:
# Look at Model Coefficients
feature_names = df.drop('quality').columns

# Linear Regression Coefficients
lr_coef_df = pd.DataFrame({
    'Feature': feature_names,
    'Linear Regression Coefficient': [round(c, 4) for c in lr_model.coefficients]
})
print('Linear Regression Coefficients')
display(lr_coef_df)

# Random Forest Feature Importances
rf_imp_df = pd.DataFrame({
    'Feature': feature_names,
    'Random Forest Importance': [round(f, 4) for f in rf_model.featureImportances.toArray()]
})
print('Random Forest Feature Importances')
display(rf_imp_df)

# GBT Feature Importances
gbt_imp_df = pd.DataFrame({
    'Feature': feature_names,
    'GBT Importance': [round(f, 4) for f in gbt_model.featureImportances.toArray()]
})
print('GBT Feature Importances')
display(gbt_imp_df)

Data Source: https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009?resource=download

AI Disclosure: Used Databricks' automated code completion, Claude for code suggestions, and ChatGPT to brainstorm ideas.