Importing libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Importing dataset

In [3]:
df = pd.read_csv("wine_data.csv")

In [4]:
print(df.head())

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076   
1            7.8              0.88         0.00             2.6      0.098   
2            7.8              0.76         0.04             2.3      0.092   
3           11.2              0.28         0.56             1.9      0.075   
4            7.4              0.70         0.00             1.9      0.076   

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56   
1                 25.0                  67.0   0.9968  3.20       0.68   
2                 15.0                  54.0   0.9970  3.26       0.65   
3                 17.0                  60.0   0.9980  3.16       0.58   
4                 11.0                  34.0   0.9978  3.51       0.56   

   alcohol  quality  
0      9.4        5  
1      9.8        5  
2      9.8        5 

Check for Missing Values

In [5]:
print("\nMissing values in each column:\n", df.isnull().sum())


Missing values in each column:
 fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64


Check for Duplicates

In [6]:
print("\nNumber of duplicate rows:", df.duplicated().sum())



Number of duplicate rows: 240


Remove Duplicate Rows

In [7]:
df_cleaned = df.drop_duplicates()
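
To illustrate what `duplicated()` counts and what `drop_duplicates()` keeps, here is a minimal sketch on a small made-up frame (the values are illustrative and not taken from wine_data.csv). With the default `keep='first'`, the first occurrence of a row is treated as the original, so only the later repeats are flagged and removed.

```python
import pandas as pd

# Toy frame with hypothetical values: rows 0, 2 and 3 are identical.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.4, 9.4],
    "quality": [5, 5, 5, 5],
})

# duplicated() marks every repeat after the first occurrence
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence; reset the index so it is contiguous again
toy_cleaned = toy.drop_duplicates().reset_index(drop=True)

print("Duplicate rows:", n_dupes)
print("Rows before:", len(toy), "| rows after:", len(toy_cleaned))
```

The same check (rows before minus duplicate count equals rows after) can be run on `df` and `df_cleaned` to confirm the cleaning step did what was intended.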

In [8]:
print("\nStatistical Summary:\n", df_cleaned.describe())


Statistical Summary:
        fixed acidity  volatile acidity  citric acid  residual sugar  \
count    1359.000000       1359.000000  1359.000000     1359.000000   
mean        8.310596          0.529478     0.272333        2.523400   
std         1.736990          0.183031     0.195537        1.352314   
min         4.600000          0.120000     0.000000        0.900000   
25%         7.100000          0.390000     0.090000        1.900000   
50%         7.900000          0.520000     0.260000        2.200000   
75%         9.200000          0.640000     0.430000        2.600000   
max        15.900000          1.580000     1.000000       15.500000   

         chlorides  free sulfur dioxide  total sulfur dioxide      density  \
count  1359.000000          1359.000000           1359.000000  1359.000000   
mean      0.088124            15.893304             46.825975     0.996709   
std       0.049377            10.447270             33.408946     0.001869   
min       0.012000       

Most frequently occurring wine quality



In [9]:
most_freq_quality = df_cleaned['quality'].mode()[0]
print("Most Frequent Wine Quality:", most_freq_quality)


Most Frequent Wine Quality: 5
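
The mode reports only the single most common score. A frequency table shows how the scores are spread around it, which matters later when judging classifier accuracy. The sketch below uses a short hypothetical series; on the real data the same calls would be applied to `df_cleaned['quality']`.

```python
import pandas as pd

# Hypothetical quality scores, for illustration only.
quality = pd.Series([5, 6, 5, 7, 5, 6, 4], name="quality")

# Frequency of each score, ordered by score rather than by count
counts = quality.value_counts().sort_index()

# The same table expressed as a fraction of all rows
shares = (counts / counts.sum()).round(3)

print(counts)
print(shares)
print("Mode:", quality.mode()[0])
```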


Highest and lowest wine quality

In [10]:
max_quality = df_cleaned['quality'].max()
min_quality = df_cleaned['quality'].min()
print("Highest Quality:", max_quality)
print("Lowest Quality:", min_quality)


Highest Quality: 8
Lowest Quality: 3


Correlation of features with quality

In [11]:
correlations = df_cleaned.corr(numeric_only=True)['quality'].sort_values(ascending=False)

print("\nCorrelations with Quality:")
print("Alcohol:", correlations['alcohol'])
print("Fixed Acidity:", correlations['fixed acidity'])
print("Free Sulfur Dioxide:", correlations['free sulfur dioxide'])
print("Volatile Acidity:", correlations['volatile acidity'])



Correlations with Quality:
Alcohol: 0.48034289800155505
Fixed Acidity: 0.11902366561349675
Free Sulfur Dioxide: -0.050462766805025684
Volatile Acidity: -0.39521368900984055
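
By default `DataFrame.corr` computes the Pearson coefficient, which measures linear association. The sketch below writes the formula out by hand on a small hypothetical frame and checks it against pandas. Because quality is an ordinal score, a rank-based (Spearman) coefficient is also shown as one possible alternative; it is computed here as Pearson on the ranks, and it is not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical values, for illustration only.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0],
    "quality": [5, 5, 6, 6, 7],
})

x = toy["alcohol"].to_numpy()
y = toy["quality"].to_numpy()

# Pearson r = sum((x - mean_x) * (y - mean_y)) / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2))
xc, yc = x - x.mean(), y - y.mean()
r_manual = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

# The same quantity via pandas
r_pandas = toy["alcohol"].corr(toy["quality"])

# Spearman: Pearson correlation of the (average) ranks
r_spearman = toy["alcohol"].rank().corr(toy["quality"].rank())

print("Pearson (manual):", r_manual)
print("Pearson (pandas):", r_pandas)
print("Spearman (ranks):", r_spearman)
```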


Average residual sugar for the best- and worst-quality wines


In [12]:
avg_sugar_best = df_cleaned[df_cleaned['quality'] == max_quality]['residual sugar'].mean()
avg_sugar_worst = df_cleaned[df_cleaned['quality'] == min_quality]['residual sugar'].mean()

print("\nAverage Residual Sugar:")
print(f"Best Quality (Quality = {max_quality}): {avg_sugar_best:.2f} g/L")
print(f"Worst Quality (Quality = {min_quality}): {avg_sugar_worst:.2f} g/L")



Average Residual Sugar:
Best Quality (Quality = 8): 2.58 g/L
Worst Quality (Quality = 3): 2.64 g/L
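
The two filtered means above can also be read off a single grouped table, which covers every quality level at once. This is a sketch on hypothetical values. The `count` column is included because a mean over only a few rows is noisy; whether that applies to the extreme quality levels here would need to be checked on the real data.

```python
import pandas as pd

# Hypothetical values, for illustration only.
toy = pd.DataFrame({
    "quality":        [3, 3, 5, 5, 8, 8],
    "residual sugar": [2.0, 3.0, 2.5, 2.1, 1.8, 3.2],
})

# Mean residual sugar and number of wines for each quality level
sugar_by_quality = toy.groupby("quality")["residual sugar"].agg(["mean", "count"])
print(sugar_by_quality)

# Pick out the rows for the highest and lowest quality levels
best = sugar_by_quality.loc[toy["quality"].max(), "mean"]
worst = sugar_by_quality.loc[toy["quality"].min(), "mean"]
print(f"Best: {best:.2f} g/L | Worst: {worst:.2f} g/L")
```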


Does volatile acidity affect wine quality?

In [13]:
print("\nVolatile Acidity vs Quality Correlation:", correlations['volatile acidity'])



Volatile Acidity vs Quality Correlation: -0.39521368900984055
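
One way to read this coefficient is to square it: for a simple linear relationship, r² is the share of the variance in quality that a straight-line fit on volatile acidity alone would account for. The short calculation below uses the value printed above. It describes a moderate negative linear association and does not by itself establish that volatile acidity causes lower quality.

```python
# Correlation between volatile acidity and quality, copied from the output above
r = -0.39521368900984055

# Coefficient of determination for a one-variable linear fit
r_squared = r ** 2

print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
print(f"Share of variance in quality associated with volatile acidity: {r_squared:.1%}")
```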


Build Decision Tree and Random Forest models and compare accuracy


In [14]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Features and target
X = df_cleaned.drop('quality', axis=1)
y = df_cleaned['quality']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_preds = dt.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_preds)

# Random Forest
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
rf_preds = rf.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_preds)

print("\nModel Accuracies:")
print(f"Decision Tree: {dt_accuracy * 100:.2f}%")
print(f"Random Forest: {rf_accuracy * 100:.2f}%")



Model Accuracies:
Decision Tree: 50.74%
Random Forest: 65.44%
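
A single 80/20 split yields one accuracy figure per model, and without a reference point it is hard to judge whether 50.74% or 65.44% is good. The sketch below shows one possible way to add context: stratified 5-fold cross-validation, with a majority-class baseline that always predicts the most frequent label. It runs on synthetic data from `make_classification` (an assumption made so the block is self-contained), so its numbers say nothing about the wine data; to apply it, replace `X_demo` and `y_demo` with `X` and `y` from the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced 3-class stand-in with 11 features (NOT the wine data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=5,
    n_classes=3, weights=[0.5, 0.35, 0.15], random_state=42,
)

# Stratified folds keep the class proportions similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "Majority-class baseline": DummyClassifier(strategy="most_frequent"),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

cv_scores = {}
for name, model in models.items():
    # One accuracy value per fold
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: {scores.mean() * 100:.2f}% (std {scores.std() * 100:.2f})")
```

A model is only useful to the extent that it beats the baseline row, and the per-fold standard deviation gives a rough sense of how much a single split could move the reported accuracy.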
