# White Wine classification

In [39]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV,cross_val_score

In [2]:
df=pd.read_csv('winequality-white.csv',delimiter=';')

In [3]:
df.head(5)  #First 5 rows

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6


In [4]:
df.describe()  #Understanding the data 

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0
mean,6.854788,0.278241,0.334192,6.391415,0.045772,35.308085,138.360657,0.994027,3.188267,0.489847,10.514267,5.877909
std,0.843868,0.100795,0.12102,5.072058,0.021848,17.007137,42.498065,0.002991,0.151001,0.114126,1.230621,0.885639
min,3.8,0.08,0.0,0.6,0.009,2.0,9.0,0.98711,2.72,0.22,8.0,3.0
25%,6.3,0.21,0.27,1.7,0.036,23.0,108.0,0.991723,3.09,0.41,9.5,5.0
50%,6.8,0.26,0.32,5.2,0.043,34.0,134.0,0.99374,3.18,0.47,10.4,6.0
75%,7.3,0.32,0.39,9.9,0.05,46.0,167.0,0.9961,3.28,0.55,11.4,6.0
max,14.2,1.1,1.66,65.8,0.346,289.0,440.0,1.03898,3.82,1.08,14.2,9.0


## Gathering train & testing dataset

In [8]:
#To make it easier for us and the model, lets map the quality ratings to low(-1), mid(0) and high(2)
bins=[0,5.5,7.5,10]
labels=[-1,0,1]
df['quality']=pd.cut(df['quality'],bins=bins,labels=labels)

In [9]:
df.head(5)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,0
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,0
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,0
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,0
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,0


In [11]:
x=df[df.columns[:-1]] #features
y=df['quality']   #target

In [15]:
sc=StandardScaler() #This creates a StandardScaler object
#standardising the data helps in ensuring that the classifier perfomrs optimally and efficiently

In [16]:
'''Usually when we use Standardisation: whens its sensitive to scale like k-NNm SVM, PCA
Also, important in  Linear regression, logistic regression and neural networks.Standardization typically applies to the features (input variables) in your dataset. 
Algorithms that include regularization terms, such as Lasso (L1) Regression and Ridge (L2) Regression, benefit from standardization because it ensures that the regularization term penalizes each feature equally.'''

'Usually when we use Standardisation: whens its sensitive to scale like k-NNm SVM, PCA\nAlso, important in  Linear regression, logistic regression and neural networks.Standardization typically applies to the features (input variables) in your dataset. \nAlgorithms that include regularization terms, such as Lasso (L1) Regression and Ridge (L2) Regression, benefit from standardization because it ensures that the regularization term penalizes each feature equally.'

In [20]:
x=sc.fit_transform(x) #Fit to data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
#We are training with 80% of data and testing with 20% of data
#the random_state parameter ensures that the split of the data into training and testing sets is the same every time you run the code.

In [21]:
for data in [y_train, y_test]:
    print(data.describe())


count     3918
unique       3
top          0
freq      2454
Name: quality, dtype: int64
count     980
unique      3
top         0
freq      624
Name: quality, dtype: int64


### KNN- looks at the closest point and then decides

In [24]:
n3=KNeighborsClassifier(n_neighbors=3)   #Uses Knn keeping in mind the 3 neighbors
n3.fit(x_train,y_train)                  #fit the model with the training data
pred_n3=n3.predict(x_test)                #predict labels using the x test
print(classification_report(y_test,pred_n3)) #comparing ytest (think of this as you already have answers to the test and comparing it with someone who just did the test) 


              precision    recall  f1-score   support

          -1       0.62      0.62      0.62       321
           0       0.77      0.79      0.78       624
           1       0.39      0.26      0.31        35

    accuracy                           0.72       980
   macro avg       0.60      0.56      0.57       980
weighted avg       0.71      0.72      0.71       980



In [25]:
cross_val=cross_val_score(estimator=n3, X=x_train, y=y_train,cv=10) 
print(cross_val.mean())
#Cross validation randomly divides trained and evaluated 10 times,and aggregates scores 

0.7292016806722689


#One can say the acuracy is 72.9%
**Precision** measures the accuracy of the positive predictions. It is the ratio of true positive predictions to the total positive predictions (both true positives and false positives).
 **Intuition** Precision tells us, out of all the instances that the model predicted as positive, how many were actually positive. High precision means that there are very few false positives.
 **Recall** measures the ability of the model to capture all the positive instances. It is the ratio of true positive predictions to the total actual positives (both true positives and false negatives).
 **Intuition**Recall tells us, out of all the actual positive instances, how many were correctly predicted by the model. High recall means that there are very few false negatives.
 The**F1 score** is the harmonic mean of precision and recall. It provides a single metric that balances both precision and recall.
 The F1 score is useful when you need a balance between precision and recall and when you have an uneven class distribution. A high F1 score indicates that both precision and recall are reasonably high.
 **Support** refers to the number of actual occurrences of each class in the dataset. It is the number of true instances for each label in the dataset.
  **Intuition** Support helps us understand the distribution of the different classes in the dataset. It is not a performance metric but provides context for interpreting precision, recall, and F1 score.
  


Think of difference bw precison and recall as:
Out of all the emails, predicted as spam, 91% were actually spam.
Out of all the actual spam emails, the model correctly identified 83% of them.

In [30]:
n5=KNeighborsClassifier(n_neighbors=5)   #Uses Knn keeping in mind the 3 neighbors
n5.fit(x_train,y_train)                  #fit the model with the training data
pred_n5=n5.predict(x_test)                #predict labels using the x test
print(classification_report(y_test,pred_n5)) #comparing ytest (think of this as you already have answers to the test and comparing it with someone who just did the test) 
cross_val=cross_val_score(estimator=n3, X=x_train, y=y_train,cv=10) 
print(cross_val.mean())

              precision    recall  f1-score   support

          -1       0.64      0.59      0.62       321
           0       0.76      0.82      0.79       624
           1       0.33      0.11      0.17        35

    accuracy                           0.72       980
   macro avg       0.58      0.51      0.52       980
weighted avg       0.71      0.72      0.71       980

0.7292016806722689


### Tree Classifiers: Random Forest Classifiers

Tree classifier asks us a bunch of questions and based on that predicts. its only a single decison classifier

In [36]:
rf = RandomForestClassifier()
rf.fit(x_train,y_train)                  
pred_rf=rf.predict(x_test)                
print(classification_report(y_test,pred_rf))
cross_val=cross_val_score(estimator=rf, X=x_train, y=y_train,cv=10) 
print(cross_val.mean())

              precision    recall  f1-score   support

          -1       0.78      0.73      0.75       321
           0       0.84      0.89      0.86       624
           1       0.88      0.40      0.55        35

    accuracy                           0.82       980
   macro avg       0.83      0.67      0.72       980
weighted avg       0.82      0.82      0.81       980

0.806785322824782


### Decision Tree Classifier

In [37]:
dt = DecisionTreeClassifier()
dt.fit(x_train,y_train)                  
pred_dt=dt.predict(x_test)                
print(classification_report(y_test,pred_dt))
cross_val=cross_val_score(estimator=dt, X=x_train, y=y_train,cv=10) 
print(cross_val.mean())

              precision    recall  f1-score   support

          -1       0.67      0.68      0.67       321
           0       0.81      0.80      0.80       624
           1       0.40      0.49      0.44        35

    accuracy                           0.75       980
   macro avg       0.63      0.65      0.64       980
weighted avg       0.75      0.75      0.75       980

0.7332839657602171


### Stochastic Classsifier

In [40]:
sgd= SGDClassifier()
sgd.fit(x_train,y_train)
pred_sgd=sgd.predict(x_test)
print(classification_report(y_test,pred_sgd))
cross_val=cross_val_score(estimator=sgd, X=x_train, y=y_train,cv=10) 
print(cross_val.mean())

              precision    recall  f1-score   support

          -1       0.63      0.37      0.47       321
           0       0.70      0.89      0.79       624
           1       0.33      0.03      0.05        35

    accuracy                           0.69       980
   macro avg       0.56      0.43      0.44       980
weighted avg       0.67      0.69      0.66       980

0.7021471632131113


Random forest is very prone to overfitting.

A **Decision Tree classifier** is a tree-like structure where each internal node represents a decision based on a feature, each branch represents the outcome of that decision, and each leaf node represents a class label (in classification tasks).The tree is constructed by recursively splitting the data into subsets based on the feature that results in the best spli

A **Random Forest classifier** is an ensemble method that uses multiple decision trees to make predictions. Each tree is trained on a random subset of the data (with replacement, known as bootstrapping), and a random subset of features is considered for splitting at each node.
The final prediction is made by aggregating the predictions of all the individual trees (usually by majority voting in classification tasks).

**Stochastic Gradient Descent** is an optimization algorithm used to minimize the loss function in various machine learning models, including linear models (like linear regression and logistic regression) and neural networks.
Unlike traditional gradient descent, which computes the gradient of the loss function over the entire dataset, SGD updates the model parameters for each training example or a small batch of examples, making it faster for large datasets.

A loss function (also known as a cost function or objective function) quantifies how well a model's predictions match the actual outcomes. 

In [41]:
#Clearly the best result we have received is from Random Forest at 80.6% 