
## **Predicting Red Wine Quality**

### **Problem Statement:**

The dataset contains various chemical attributes of red wines, such as acidity, alcohol content, and pH, along with a quality rating. Your task is to build a machine learning model to predict the quality of red wines based on their chemical attributes.

### **Dataset Description:**

The dataset contains the following attributes:

* `Fixed Acidity`: The amount of tartaric acid in the wine (g/dm³).
* `Volatile Acidity`: The amount of acetic acid in the wine (g/dm³).
* `Citric Acid`: The amount of citric acid in the wine (g/dm³).
* `Residual Sugar`: The amount of residual sugar in the wine (g/dm³).
* `Chlorides`: The amount of chlorides in the wine (g/dm³).
* `Free Sulfur Dioxide`: The amount of free sulfur dioxide (mg/dm³).
* `Total Sulfur Dioxide`: The total amount of sulfur dioxide (mg/dm³).
* `Density`: The density of the wine (g/cm³).
* `pH`: The pH of the wine.
* `Sulphates`: The amount of sulphates in the wine (g/dm³).
* `Alcohol`: The alcohol content of the wine (%).
* `Quality`: The quality rating of the wine (on a scale from 3 to 8, with higher values indicating better quality).
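
As a quick sanity check on the attribute list above, a sketch like the following could confirm that a loaded table has the expected columns and that every quality rating falls in the stated 3 to 8 range. The lowercase column names are an assumption about how the CSV header spells the attributes, and `check_schema` is a hypothetical helper, not part of the original notebook.

```python
# Assumed CSV header names (lowercase, space-separated); adjust if the file differs.
EXPECTED_COLUMNS = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol", "quality",
]


def check_schema(columns, quality_values, low=3, high=8):
    """Return (missing_columns, out_of_range_quality_values)."""
    present = set(columns)
    # Expected attributes that the table does not contain.
    missing = [c for c in EXPECTED_COLUMNS if c not in present]
    # Ratings outside the documented scale.
    out_of_range = [q for q in quality_values if not low <= q <= high]
    return missing, out_of_range
```

Once the data is loaded into `df`, a call such as `check_schema(df.columns, df['quality'])` returning two empty lists would indicate that the table matches the description.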

### **Expectations:**

- Data preprocessing steps.
- EDA findings, including visualizations.
- Data splitting.
- Model selection, training, and evaluation details.
- Model performance metrics and insights.
- Conclusion and recommendations.



#### **Data Download**

In [None]:
!wget https://raw.githubusercontent.com/aniruddhachoudhury/Red-Wine-Quality/master/winequality-red.csv

#### **Data preprocessing steps**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from math import sqrt
import seaborn as sns

In [None]:
df = pd.read_csv('winequality-red.csv')

In [None]:
df.head()

In [None]:
df.info()  # column dtypes and non-null counts (reveals missing values)

In [None]:
df.describe()  # summary statistics: mean, std, min/max, and quartiles per column

In [None]:
# remove exact duplicate rows and renumber the index
df = df.drop_duplicates().reset_index(drop=True)
df

#### **EDA findings, including visualizations**

In [None]:
# pairwise scatter plots of all attributes, colored by quality rating
sns.pairplot(df, hue='quality', palette='Set2')
plt.show()

In [None]:
#correlation matrix and heatmap
correlation_matrix = df.corr()
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()
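
Each cell of the heatmap is a Pearson correlation coefficient, which is the default method of `DataFrame.corr()`. The sketch below spells out that calculation in plain Python for two columns. The numbers are made-up toy values, not rows from the dataset, and `pearson` is a hypothetical helper written only for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances up to the same factor).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy values for illustration only.
alcohol = [9.4, 9.8, 10.5, 11.2, 12.0]
quality = [5, 5, 6, 6, 7]
r = pearson(alcohol, quality)
print(f"r = {r:.3f}")
```

A value near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship; the coefficient says nothing about nonlinear patterns.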

In [None]:
#boxplot for quality vs attributes
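plt.figure(figsize=(20, 20))
# one panel per attribute (every column except the last, 'quality')
for i, col in enumerate(df.columns[:-1], 1):
    plt.subplot(4, 3, i)
    sns.boxplot(x='quality', y=col, data=df, palette='Set3')
    plt.title(f'{col} vs quality')
plt.tight_layout()  # keep titles and axis labels from overlapping
plt.show()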

#### **Data Splitting**

In [None]:
from sklearn.model_selection import train_test_split

features = df.drop(['quality'], axis=1)
target = df['quality']
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.25, random_state=42)

#### **Model selection, training, and evaluation details**

In [None]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(x_train, y_train)

y_train_pred = lr.predict(x_train)
y_test_pred = lr.predict(x_test)

In [None]:
# compare actual ratings with predictions rounded to the nearest integer rating
data = pd.DataFrame({
    'actuals': y_test.values,
    'predicted': y_test_pred.round().astype(int),
})
data
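
Since the rounded predictions are compared with integer ratings, one way to summarize the comparison is the share of wines whose rounded prediction matches the rating exactly or lands within one point of it. The sketch below is a hypothetical helper, not part of the original notebook. It also clips predictions to the 3 to 8 scale, which is an added assumption, and the sample values are made up.

```python
def rounded_agreement(actuals, predictions, low=3, high=8):
    """Return (exact_share, within_one_share) for rounded, clipped predictions."""
    exact = 0
    within_one = 0
    for actual, pred in zip(actuals, predictions):
        # Round to the nearest integer, then clip to the rating scale.
        rating = min(max(round(pred), low), high)
        exact += rating == actual
        within_one += abs(rating - actual) <= 1
    n = len(actuals)
    return exact / n, within_one / n


# Toy values for illustration only.
exact, within_one = rounded_agreement([5, 6, 7, 5], [5.2, 5.4, 6.8, 6.7])
print(exact, within_one)  # 0.5 0.75
```

With the notebook's variables, the call would be `rounded_agreement(y_test.values, y_test_pred)`.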

#### **Model performance metrics and insights**

In [None]:
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import mean_absolute_percentage_error as mape

# root mean squared error (RMSE), in quality-rating points
print(f"Train RMSE: {round(sqrt(mse(y_train, y_train_pred)), 3)}")
print(f"Test RMSE: {round(sqrt(mse(y_test, y_test_pred)), 3)}")

In [None]:
plt.figure(figsize=(12, 10))
# distribution of test-set residuals
sns.histplot(y_test - y_test_pred, kde=True, bins=30)
plt.xlabel('Residual (actual - predicted quality)')
plt.show()

In [None]:
# mape() returns a fraction; convert to percent before formatting to avoid float artifacts
print(f"Train MAPE: {mape(y_train, y_train_pred) * 100:.1f}%")
print(f"Test MAPE: {mape(y_test, y_test_pred) * 100:.1f}%")

In [None]:
from sklearn.metrics import r2_score

print(f"Train r2_score: {round(r2_score(y_train, y_train_pred), 3)}")
print(f"Test r2_score: {round(r2_score(y_test, y_test_pred), 3)}")
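
To make the three reported metrics concrete, the sketch below writes out RMSE, MAPE, and R² in plain Python and applies them to a tiny made-up example. The `*_manual` function names are hypothetical, chosen so they do not clash with the scikit-learn imports used above. The definitions follow the standard formulas and assume no actual value is zero.

```python
from math import sqrt


def rmse_manual(y_true, y_pred):
    """Root mean squared error: square root of the average squared residual."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape_manual(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction (0.08 means 8%)."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values for illustration only.
y_true = [5, 6, 7]
y_pred = [5.5, 6.0, 6.0]
print(f"RMSE: {rmse_manual(y_true, y_pred):.3f}")      # RMSE: 0.645
print(f"MAPE: {mape_manual(y_true, y_pred) * 100:.1f}%")  # MAPE: 8.1%
print(f"R2:   {r2_manual(y_true, y_pred):.3f}")        # R2:   0.375
```

An R² of 0 corresponds to always predicting the mean of the actual values, so the score can be read as the fraction of variance the model explains beyond that constant baseline.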