# sklearn tutorials by Aditya

scikit-learn (sklearn) is a machine learning library that can be used for the following purposes:
*   Regression
*   Classification
*   Clustering
*   Prediction



In [1]:
#Importing required packages.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
#from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
%matplotlib inline

In [4]:
#Loading dataset
wine = pd.read_csv('WineQT.csv')
wine.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,Id
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,0
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,1
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,2
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,3
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,4


In [5]:
wine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1143 entries, 0 to 1142
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1143 non-null   float64
 1   volatile acidity      1143 non-null   float64
 2   citric acid           1143 non-null   float64
 3   residual sugar        1143 non-null   float64
 4   chlorides             1143 non-null   float64
 5   free sulfur dioxide   1143 non-null   float64
 6   total sulfur dioxide  1143 non-null   float64
 7   density               1143 non-null   float64
 8   pH                    1143 non-null   float64
 9   sulphates             1143 non-null   float64
 10  alcohol               1143 non-null   float64
 11  quality               1143 non-null   int64  
 12  Id                    1143 non-null   int64  
dtypes: float64(11), int64(2)
memory usage: 116.2 KB


In [6]:
wine.isnull().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
Id                      0
dtype: int64

In [7]:
wine['quality'].unique()

array([5, 6, 7, 4, 8, 3])

In [8]:
#Now let's assign labels to our quality variable
label_quality = LabelEncoder()

In [9]:
'''
In scikit-learn (sklearn), LabelEncoder is a utility class that helps encode categorical labels into numerical labels. Many machine learning algorithms require numerical input, and LabelEncoder provides a simple way to convert categorical labels into numeric labels.

Here's a brief overview of how LabelEncoder works:

Fit: First, you instantiate a LabelEncoder object and then call its fit method, passing in the array of categorical labels you want to encode. This method computes the unique classes in the input data and assigns a unique integer to each class.

Transform: Once fitted, you can use the transform method to transform the original categorical labels into their corresponding numerical labels.
'''
wine['quality'] = label_quality.fit_transform(wine['quality'])
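To make the fit and transform steps described above concrete, here is a minimal sketch that encodes the six quality scores returned by `wine['quality'].unique()`. The names `raw_quality` and `encoder` are illustrative and not part of the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Quality scores in the order returned by wine['quality'].unique()
raw_quality = [5, 6, 7, 4, 8, 3]

encoder = LabelEncoder()
encoded = encoder.fit_transform(raw_quality)

# classes_ holds the sorted unique labels; each encoded value is an index into it
print(encoder.classes_)  # [3 4 5 6 7 8]
print(encoded)           # [2 3 4 1 5 0]
```

Because the classes are sorted before they are numbered, the encoded label 2 stands for the original quality score 5, and 3 stands for 6.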

In [10]:
wine['quality'].value_counts()

2    483
3    462
4    143
1     33
5     16
0      6
Name: quality, dtype: int64

In [31]:
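#Now separate the dataset into the response variable and the feature variables
#Note: X still contains the 'Id' column, which is a row identifier rather than a wine property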
X = wine.drop(['quality'], axis = 1)
y = wine['quality']

In [32]:
#Train and Test splitting of data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
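The quality classes are heavily imbalanced (the smallest has only 6 rows), so a purely random split can leave a rare class out of the test set. One option, sketched below on hypothetical toy data, is the `stratify` parameter of `train_test_split`, which keeps class proportions the same in both splits. The arrays `X_demo` and `y_demo` are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 samples of class 0 and 10 of class 1
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Both splits keep the 90/10 class ratio
print(np.bincount(y_tr))  # [72  8]
print(np.bincount(y_te))  # [18  2]
```

For the wine data this would correspond to passing `stratify=y` in the call above, which would change the split and therefore the scores reported later.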

In [33]:
'''
In scikit-learn (sklearn), standard scaling refers to a preprocessing technique used to standardize features by removing the mean and scaling them to unit variance. This technique is also known as z-score normalization.

The StandardScaler class in scikit-learn provides a convenient way to perform standard scaling on features. Here's how it works:

Fit: First, you instantiate a StandardScaler object and then call its fit method, passing in the array of features you want to scale. This method computes the mean and standard deviation of each feature in the training data.

Transform: Once fitted, you can use the transform method to transform the original features into standardized features. This involves subtracting the mean and dividing by the standard deviation for each feature.
'''
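#Applying standard scaling so that all features are on a comparable scale

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
#Reuse the mean and standard deviation learned from the training data;
#refitting the scaler on the test set would leak test statistics into the evaluation
X_test = sc.transform(X_test)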
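The fit and transform steps described above can be checked by hand on a tiny example. This is a minimal sketch with made-up numbers: the scaler learns its statistics from the training values only, and the test value is then scaled with those same statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data, chosen only for illustration
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training statistics

# Manual z-score using the training mean and (population) standard deviation
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(scaler.mean_)   # [2.]
print(test_scaled)    # matches `manual`
```

Note that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also the default of `np.std`.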

# Random Forest Classifier

In [34]:
rfc = RandomForestClassifier(n_estimators=200)
rfc.fit(X_train, y_train)
pred_rfc = rfc.predict(X_test)


In [35]:
#Let's see how our model performed
print(classification_report(y_test, pred_rfc))

              precision    recall  f1-score   support

           1       0.00      0.00      0.00         6
           2       0.73      0.77      0.75        96
           3       0.63      0.74      0.68        99
           4       0.85      0.42      0.56        26
           5       0.00      0.00      0.00         2

    accuracy                           0.69       229
   macro avg       0.44      0.39      0.40       229
weighted avg       0.68      0.69      0.67       229
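The 0.00 precision rows above belong to classes that the model never predicted. Precision is undefined in that case (0 divided by 0), so scikit-learn reports 0.0 and emits an `UndefinedMetricWarning`. A minimal sketch on hypothetical labels shows how the `zero_division` parameter sets that value explicitly and silences the warning:

```python
from sklearn.metrics import classification_report, precision_score

# Hypothetical labels: class 2 occurs in y_true_demo but is never predicted
y_true_demo = [0, 0, 1, 1, 2]
y_pred_demo = [0, 0, 1, 1, 1]

# zero_division=0 reports 0.0 for undefined metrics without a warning
print(classification_report(y_true_demo, y_pred_demo, zero_division=0))

per_class_precision = precision_score(
    y_true_demo, y_pred_demo, average=None, zero_division=0
)
print(per_class_precision)
```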





In [36]:
#Confusion matrix for the random forest classification
print(confusion_matrix(y_test, pred_rfc))

[[ 0  3  3  0  0]
 [ 0 74 22  0  0]
 [ 0 24 73  2  0]
 [ 0  0 15 11  0]
 [ 0  0  2  0  0]]
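Rows of the confusion matrix are the actual classes and columns are the predicted classes, so the numbers in the classification report can be recomputed from it. The sketch below copies the matrix printed above and derives recall, precision and accuracy with NumPy. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: actual, columns: predicted)
cm = np.array([
    [0, 3, 3, 0, 0],
    [0, 74, 22, 0, 0],
    [0, 24, 73, 2, 0],
    [0, 0, 15, 11, 0],
    [0, 0, 2, 0, 0],
])

true_positives = np.diag(cm)
support = cm.sum(axis=1)     # number of actual samples per class
predicted = cm.sum(axis=0)   # number of predictions per class

recall = true_positives / support
# Classes that were never predicted get precision 0 instead of a division by zero
precision = np.divide(
    true_positives, predicted, out=np.zeros(len(cm)), where=predicted > 0
)
accuracy = true_positives.sum() / cm.sum()

print(np.round(recall, 2))     # [0.   0.77 0.74 0.42 0.  ]
print(np.round(precision, 2))  # [0.   0.73 0.63 0.85 0.  ]
print(round(accuracy, 2))      # 0.69
```

These values agree with the classification report: the model is reasonably good on the two large classes and never predicts the two rarest ones.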


In [37]:
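#New sample with the same 12 feature columns as X (the last value is Id)
Xnew = [[7.4,0.70,0.00,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,1]]
#The model was trained on scaled features, so scale the new sample with the fitted scaler
ynew = rfc.predict(sc.transform(Xnew))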

In [38]:
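#ynew holds the encoded label; label_quality.inverse_transform(ynew) maps it back to the original score
print('The encoded quality label of the wine with the given parameters is:')
print(ynew)

The encoded quality label of the wine with the given parameters is:
[2]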
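When predicting for new samples, two things are easy to forget: the sample must be scaled the same way as the training data, and the predicted value is an encoded label rather than the original quality score. One way to handle both, sketched here on hypothetical toy data rather than the wine dataset, is to bundle the scaler and the classifier with `make_pipeline` and to decode predictions with `inverse_transform`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical toy data: two well-separated groups with quality scores 5 and 7
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 1.0, (50, 3)),
    rng.normal(10.0, 1.0, (50, 3)),
])
quality_demo = np.array([5] * 50 + [7] * 50)

encoder = LabelEncoder()
y_demo = encoder.fit_transform(quality_demo)  # 5 -> 0, 7 -> 1

# The pipeline scales inputs automatically in both fit and predict
model = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
model.fit(X_demo, y_demo)

new_sample = [[9.5, 10.2, 10.1]]                        # raw, unscaled values
encoded_pred = model.predict(new_sample)                # encoded label
quality_pred = encoder.inverse_transform(encoded_pred)  # original score
print(encoded_pred, quality_pred)
```

With a pipeline, raw feature values can be passed straight to `predict`, so there is no separate scaling step to remember.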
