<a href="https://colab.research.google.com/github/dharaa12/Red-Wine-Quality-ML-Project/blob/main/Red_Wine_Quality_ML_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Predicting Red Wine Quality**


### **Dataset Description:**

The dataset contains the following attributes:

* `Fixed Acidity`: The amount of tartaric acid in the wine (g/dm³).
* `Volatile Acidity`: The amount of acetic acid in the wine (g/dm³).
* `Citric Acid`: The amount of citric acid in the wine (g/dm³).
* `Residual Sugar`: The amount of residual sugar in the wine (g/dm³).
* `Chlorides`: The amount of chlorides in the wine (g/dm³).
* `Free Sulfur Dioxide`: The amount of free sulfur dioxide (mg/dm³).
* `Total Sulfur Dioxide`: The total amount of sulfur dioxide (mg/dm³).
* `Density`: The density of the wine (g/cm³).
* `pH`: The pH of the wine.
* `Sulphates`: The amount of sulphates in the wine (g/dm³).
* `Alcohol`: The alcohol content of the wine (%).
* `Quality`: The quality rating of the wine (an integer score; values in this dataset range from 3 to 8, with higher values indicating better quality).


#### **Data Download**

In [None]:
!wget https://raw.githubusercontent.com/aniruddhachoudhury/Red-Wine-Quality/master/winequality-red.csv

#### **Data preprocessing steps**

In [None]:
# Import the libraries and read the CSV into a DataFrame
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

wine_df = pd.read_csv('winequality-red.csv')
wine_df


In [None]:
wine_df.info()
# The dataset description says every column should be numeric, and the Dtype column confirms it: only floats and ints. Good sign.


In [None]:
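wine_df.drop_duplicates(inplace=True)  # duplicate rows can skew the statistics
wine_df
# inplace=True modifies the original DataFrame instead of returning a new one.
# If there were null values, wine_df.dropna() would drop the rows that contain them (pass axis=1 to drop columns instead).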
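
Before dropping anything, it can help to count how many duplicate rows and null values there actually are. The sketch below uses a small made-up DataFrame (`toy_df` and its values are illustrative assumptions, not rows from the wine dataset) so that it runs on its own:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only: one duplicated row and one null value.
toy_df = pd.DataFrame({
    'alcohol': [9.4, 9.8, 9.4, 10.5],
    'pH': [3.51, 3.20, 3.51, np.nan],
    'quality': [5, 5, 5, 6],
})

n_duplicates = int(toy_df.duplicated().sum())  # rows identical to an earlier row
n_nulls = toy_df.isna().sum()                  # null count per column

deduped = toy_df.drop_duplicates()  # returns a new DataFrame, toy_df is unchanged
cleaned = deduped.dropna()          # drops every row that contains a null

print(f'duplicate rows: {n_duplicates}')
print(n_nulls)
print(cleaned)
```

The same two counts on the notebook's data would be `wine_df.duplicated().sum()` and `wine_df.isna().sum()`.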

In [None]:
wine_df.describe()  # summary statistics for each column

# To be thorough, look at each statistic (except for quality, because it is a discrete value) and check that it makes sense: are the numbers plausible, and are they in line with other datasets?
# This is where the statistics learned in college this semester come in.
# For discrete values, check for misspelled or invalid entries; in this case, a value outside the specified range of 3-8.

In [None]:
wine_df['quality'].unique()
# No value falls outside the expected range.
# unique() also exposes entries that mean the same thing but are written differently (for example, 3 and 'three'), which would otherwise be tracked as different values.
# It is most useful on discrete columns (no decimals); continuous columns (decimals) have too many distinct values to inspect this way.
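
The range check can also be done programmatically instead of by eye. This is a minimal sketch on an invented `quality` Series (the values, including the deliberately invalid 11, are assumptions for illustration):

```python
import pandas as pd

# Toy quality column, invented for illustration; 11 is a deliberately invalid entry.
quality = pd.Series([5, 6, 5, 7, 3, 8, 11, 5], name='quality')

valid = quality.between(3, 8)              # inclusive on both ends by default
invalid_values = quality[~valid].tolist()  # entries outside the 3-8 range

# How many wines received each valid score, ordered by score.
counts = quality[valid].value_counts().sort_index()

print(f'invalid entries: {invalid_values}')
print(counts)
```

`value_counts()` also shows how unevenly the scores are spread, which `unique()` alone does not.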

#### **EDA findings, including visualizations**

In [None]:
# Goal: build an ML model that predicts quality from the wines' chemical attributes.
corr_matrix = wine_df.corr()
plt.figure(figsize=(15,10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()  # we are interested in how each attribute correlates with quality
# Values close to 1 are strong positive correlations; values close to -1 are strong negative correlations.
# Higher alcohol percentages go with higher quality scores, while higher volatile acidity goes with lower quality scores.
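
Since only the `quality` row of the heatmap matters for this goal, that single column can be pulled out and sorted. The sketch below runs on synthetic data (the generating formula and `synthetic_df` are assumptions made up so the example is self-contained, not a model of real wine):

```python
import numpy as np
import pandas as pd

# Synthetic data, invented for illustration: quality rises with alcohol
# and falls with volatile acidity, plus some noise.
rng = np.random.default_rng(0)
n = 200
alcohol = rng.normal(10.0, 1.0, n)
volatile_acidity = rng.normal(0.5, 0.1, n)
quality = 5.0 + 0.8 * (alcohol - 10.0) - 5.0 * (volatile_acidity - 0.5) + rng.normal(0.0, 0.3, n)

synthetic_df = pd.DataFrame({
    'alcohol': alcohol,
    'volatile acidity': volatile_acidity,
    'quality': quality,
})

# Correlation of every feature with quality, strongest positive first.
corr_with_quality = (
    synthetic_df.corr()['quality']
    .drop('quality')
    .sort_values(ascending=False)
)
print(corr_with_quality)
```

On the notebook's data, the equivalent call would be `wine_df.corr()['quality'].drop('quality').sort_values(ascending=False)`.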





#### **Data Splitting**

In [None]:
from sklearn.model_selection import train_test_split  # splits the data into training and test sets
# Don't train on the target value (in this case, quality).
features = wine_df.drop('quality', axis=1)  # features = the columns the model learns from
target = wine_df['quality']
# X = features, y = the target we are predicting
# Test with 20% of the data, train with the other 80%.
# random_state makes sure we get the same train/test split every time we run this (people usually put 42, an inside joke).
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
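
One optional variation, not used in this notebook, is a stratified split, which keeps the proportion of each quality score the same in the training and test sets. The sketch below uses an invented, imbalanced target (`toy_target` and `toy_features` are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced target, invented for illustration: 60 fives, 30 sixes, 10 sevens.
toy_target = pd.Series([5] * 60 + [6] * 30 + [7] * 10, name='quality')
toy_features = pd.DataFrame({'alcohol': np.linspace(8.0, 14.0, len(toy_target))})

# stratify=toy_target keeps the 60/30/10 class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_features, toy_target, test_size=0.2, random_state=42, stratify=toy_target
)

print(y_tr.value_counts().sort_index())
print(y_te.value_counts().sort_index())
```

Stratifying requires every class to have at least two members, so it is worth checking the score counts first.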

#### **Model selection, training, and evaluation details**

In [None]:
# Fit a linear regression on the training set, then predict on both splits
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

data = pd.concat([pd.DataFrame(y_test.values), pd.DataFrame(y_pred_test)], axis=1)
data.columns = ['Actual', 'Predicted']
data
# The predictions are continuous because linear regression is a regression model, not a classifier, while the actual quality scores are discrete.
# To compare them directly, round each prediction to the nearest integer.

In [None]:
data['Predicted'] = data['Predicted'].round().astype(int)  # nearest integer quality score
data
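
Rounding alone can still produce scores outside the 3-8 range, and once predictions are integers they can be compared with the actual scores for an exact-match rate. A small sketch with invented numbers (the arrays below are assumptions, not model output):

```python
import numpy as np

# Invented values for illustration only.
actual = np.array([5, 6, 3, 8, 6])
raw_pred = np.array([5.4, 6.6, 2.2, 8.9, 5.6])

rounded = np.rint(raw_pred).astype(int)  # nearest integer
clipped = np.clip(rounded, 3, 8)         # force predictions into the 3-8 range

# Share of wines whose rounded prediction equals the actual score.
accuracy = float(np.mean(clipped == actual))

print(rounded)
print(clipped)
print(f'exact-match rate: {accuracy:.2f}')
```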

#### **Model performance metrics and insights**

In [None]:
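# Mean squared error is the natural choice when the errors are roughly normally distributed.
# Taking its square root (RMSE) puts the error back in quality-score units.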
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import mean_absolute_percentage_error as mape
from math import sqrt
print(f'train RMSE: {round(sqrt(mse(y_train, y_pred_train)),3)}')
print(f'test RMSE: {round(sqrt(mse(y_test, y_pred_test)),3)}')
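
To make the number easier to interpret, here is the root mean squared error worked out by hand and compared against a baseline that always predicts the mean score. The values and the `rmse` helper are invented for illustration:

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error: sqrt of the average squared difference."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Invented values for illustration only.
actual = [5, 6, 5, 7]
predicted = [5.5, 5.5, 5.0, 6.0]

# Baseline: predict the mean of the actual scores for every wine.
mean_score = sum(actual) / len(actual)
baseline = [mean_score] * len(actual)

model_rmse = rmse(actual, predicted)
baseline_rmse = rmse(actual, baseline)

print(f'model RMSE:    {model_rmse:.3f}')
print(f'baseline RMSE: {baseline_rmse:.3f}')
```

A model is only useful if its error is clearly lower than that of such a baseline.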

In [None]:
# Plot the distribution of the residuals (actual minus predicted)
plt.figure(figsize=(15,5))
sns.histplot(y_test - y_pred_test, bins = 10, kde = True)
plt.title('Residuals Distribution')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.show()
# Most of the residuals are close to 0, which is good.

In [None]:

print(f'train MAPE: {round(mape(y_train, y_pred_train) * 100, 1)}%')
print(f'test MAPE: {round(mape(y_test, y_pred_test) * 100, 1)}%')
# Rule of thumb: anything below 10% is an excellent error level,
# and anything over 30% is really bad.
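
scikit-learn's `mean_absolute_percentage_error` returns a fraction rather than a percentage, which is why the result is multiplied by 100. The hand calculation below shows what is being averaged; the values and the `mape_percent` helper are invented for illustration:

```python
def mape_percent(actual, predicted):
    """Mean absolute percentage error, expressed as a percentage."""
    ratios = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100 * sum(ratios) / len(ratios)

# Invented values for illustration only.
actual = [5, 6, 5, 8]
predicted = [5.5, 5.4, 5.0, 7.2]

# Per-wine relative errors are 10%, 10%, 0% and 10%, so the mean is 7.5%.
result = mape_percent(actual, predicted)
print(f'MAPE: {result:.1f}%')
```

Each error is divided by the actual score, so the metric is only well behaved when the actual values are never zero, which holds here because quality is at least 3.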

In [None]:
from sklearn.metrics import r2_score
print(f'train R^2: {round(r2_score(y_train, y_pred_train),3)}')
print(f'test R^2: {round(r2_score(y_test, y_pred_test),3)}')
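
Unlike the two metrics before it, the R² score is not an error: higher is better, 1 is a perfect fit, and 0 means the model does no better than always predicting the mean. A hand calculation with invented values (the `r_squared` helper is an illustrative assumption, not part of the notebook):

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total variation
    return 1 - ss_res / ss_tot

# Invented values for illustration only.
actual = [5, 6, 5, 7]
predicted = [5.5, 5.5, 5.0, 6.0]

model_r2 = r_squared(actual, predicted)

# Always predicting the mean gives exactly 0.
mean_score = sum(actual) / len(actual)
baseline_r2 = r_squared(actual, [mean_score] * len(actual))

print(f'model R^2:    {model_r2:.3f}')
print(f'baseline R^2: {baseline_r2:.3f}')
```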


In [None]:
# A model other than linear regression might do better.
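
One way to test that idea is to fit several models on the same split and compare their test RMSE. The sketch below shows the pattern with a random forest, which is only one possible alternative and is an assumption here, not a model the notebook uses. The data is synthetic and deliberately nonlinear, so the forest wins by construction; results on the wine data may look different.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data, invented for illustration: the target depends on the
# square of the first feature, which a straight line cannot capture.
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(400, 2))
y = X[:, 0] ** 2 + rng.normal(0.0, 0.2, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

candidates = [
    ('linear', LinearRegression()),
    ('random forest', RandomForestRegressor(n_estimators=50, random_state=42)),
]

# Fit each candidate on the same training split and record its test RMSE.
results = {}
for name, candidate in candidates:
    candidate.fit(X_tr, y_tr)
    test_mse = mean_squared_error(y_te, candidate.predict(X_te))
    results[name] = float(np.sqrt(test_mse))

for name, score in results.items():
    print(f'{name}: test RMSE = {score:.3f}')
```

Swapping in the notebook's `X_train`, `X_test`, `y_train`, and `y_test` would give a like-for-like comparison with the linear regression above.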