* For each of the below Classifiers, do the following:

1. Evaluate its performance on the sample 'wine' dataset built into scikit-learn
2. Learn and explain how the model works, and if it is binary / multi classification.
3. Attempt to explain why the model performed how it did with the given dataset.

* Logistic Regression
* Decision Trees
* Random Forest
* Support Vector Machines (SVM) (Both with linear kernels and non-linear kernels!)
* Naive Bayes
* K-Nearest Neighbors (KNN)
* Gradient Boosting Machines (GBM)
* Linear Discriminant Analysis (LDA)

In [23]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix
from sklearn.metrics.pairwise import distance_metrics
from sklearn.linear_model import LogisticRegression 
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
import pandas as pd

In [6]:
data = load_wine()

X = data.data   #input parameters 
y = data.target #classification output
df = pd.DataFrame(X, columns = data.feature_names)
df["target"] = y

#outputs is 0, 1, 2, no other specification
#13 input parameters 
print(df['target'].unique())
print(data.target_names)
df.head()


[0 1 2]
['class_0' 'class_1' 'class_2']


Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline,target
0,14.23,1.71,2.43,15.6,127.0,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065.0,0
1,13.2,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050.0,0
2,13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0,0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0,0
4,13.24,2.59,2.87,21.0,118.0,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735.0,0


In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 16)

In [24]:
classifiers = [LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, SVC, SVC, GaussianNB, KNeighborsClassifier,GradientBoostingClassifier,LinearDiscriminantAnalysis]
names = ["logistic", "decision tree", "random forest", "SVM linear", "SVM nonlinear", "naive bayes", "k neighbors", "gradient boosting", "LDA"]
model_stats = pd.DataFrame(columns = ['model', 'accuracy', 'precision', 'recall', 'sum'])

In [25]:
for i in range(len(classifiers)):
    print(names[i])
    match names[i]:
        case "logistic":
            model = classifiers[i](max_iter = 5000)
        
        case "SVM linear":
            model = classifiers[i](kernel = 'linear')
        
        case "SVM nonlinear":
            model = classifiers[i](kernel = 'rbf')
        
        case _: 
            print(f"Making {classifiers[i]}")
            model = classifiers[i]()
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    a = accuracy_score(y_test, pred) #% of correct predictions (True Pos + True Neg)/Total   
    p = precision_score(y_test, pred, average = 'weighted') #how many true positives 
    r= recall_score(y_test, pred, average = 'weighted')
    model_stats.loc[len(model_stats.index)] = [names[i], a, p, r, a + p + r]

logistic
decision tree
Making <class 'sklearn.tree._classes.DecisionTreeClassifier'>
random forest
Making <class 'sklearn.ensemble._forest.RandomForestClassifier'>
SVM linear
SVM nonlinear
naive bayes
Making <class 'sklearn.naive_bayes.GaussianNB'>
k neighbors
Making <class 'sklearn.neighbors._classification.KNeighborsClassifier'>
gradient boosting
Making <class 'sklearn.ensemble._gb.GradientBoostingClassifier'>


  _warn_prf(average, modifier, msg_start, len(result))


LDA
Making <class 'sklearn.discriminant_analysis.LinearDiscriminantAnalysis'>


In [44]:
model_stats = model_stats.sort_values('sum')
max_sum = model_stats["sum"].max()
sum_max = model_stats.index[model_stats['sum'] == max_sum].tolist()
pre_max = model_stats["precision"].idxmax()
acc_max = model_stats["accuracy"].idxmax()
re_max = model_stats["recall"].idxmax()
best = model_stats.iloc[sum_max]
best_p = model_stats.iloc[pre_max]
best_a = model_stats.iloc[acc_max]
best_r = model_stats.iloc[re_max]
print(f"The  model with best overall performace is \n{best}")
print(f"The  model with best precision is \n{best_p}")
print(f"The  model with best accuracy is \n{best_a}")
print(f"The  model with best recall is \n{best_r}")

for i in range(len(model_stats.index)):
    print(model_stats.iloc[i])

The  model with best overall performace is 
           model  accuracy  precision    recall       sum
2  random forest  0.977778   0.979012  0.977778  2.934568
8            LDA  1.000000   1.000000  1.000000  3.000000
The  model with best precision is 
model        random forest
accuracy          0.977778
precision         0.979012
recall            0.977778
sum               2.934568
Name: 2, dtype: object
The  model with best accuracy is 
model        random forest
accuracy          0.977778
precision         0.979012
recall            0.977778
sum               2.934568
Name: 2, dtype: object
The  model with best recall is 
model        random forest
accuracy          0.977778
precision         0.979012
recall            0.977778
sum               2.934568
Name: 2, dtype: object
model        SVM nonlinear
accuracy          0.577778
precision         0.411207
recall            0.577778
sum               1.566762
Name: 4, dtype: object
model        k neighbors
accuracy        0.688889

* Logistic Regression
1. Evaluate its performance on the sample 'wine' dataset built into scikit-learn
Name: 1, dtype: object
model        logistic
accuracy     0.933333
precision     0.93287
recall       0.933333
2. Learn and explain how the model works, and if it is binary / multi classification.
    * Fits the data along a logistic curve
    * For multinomial data, uses multiple regressions
3. Attempt to explain why the model performed how it did with the given dataset.

* Decision Tree
1. Evaluate its performance on the sample 'wine' dataset built into scikit-learn
2. Learn and explain how the model works, and if it is binary / multi classification.
    * Uses trees to classify data, where each node is a state and each branch is a condition
3. Attempt to explain why the model performed how it did with the given dataset.

* Random Forest
1. Evaluate its performance on the sample 'wine' dataset built into scikit-learn
2. Learn and explain how the model works, and if it is binary / multi classification.
    * 
3. Attempt to explain why the model performed how it did with the given dataset.

* Support Vector Machines (SVM) (Both with linear kernels and non-linear kernels!)
1. Evaluate its performance on the sample 'wine' dataset built into scikit-learn
2. Learn and explain how the model works, and if it is binary / multi classification.
    * 
3. Attempt to explain why the model performed how it did with the given dataset.

* Naive Bayes
1. Evaluate its performance on the sample 'wine' dataset built into scikit-learn
2. Learn and explain how the model works, and if it is binary / multi classification.
    * Assumes all input features are independent and of equal importance
3. Attempt to explain why the model performed how it did with the given dataset.

* K-Nearest Neighbors (KNN)
1. Evaluate its performance on the sample 'wine' dataset built into scikit-learn
2. Learn and explain how the model works, and if it is binary / multi classification.
    * 
3. Attempt to explain why the model performed how it did with the given dataset.

* Gradient Boosting Machines (GBM)
1. Evaluate its performance on the sample 'wine' dataset built into scikit-learn
2. Learn and explain how the model works, and if it is binary / multi classification.
    * 
3. Attempt to explain why the model performed how it did with the given dataset.

* Linear Discriminant Analysis (LDA)
1. Evaluate its performance on the sample 'wine' dataset built into scikit-learn
2. Learn and explain how the model works, and if it is binary / multi classification.
    * 
3. Attempt to explain why the model performed how it did with the given dataset.

In [40]:
test = LinearDiscriminantAnalysis()
test.fit(X_train, y_train)
pred = test.predict(X_test)
a = accuracy_score(y_test, pred) #% of correct predictions (True Pos + True Neg)/Total   
p = precision_score(y_test, pred, average = 'weighted') #how many true positives 
r= recall_score(y_test, pred, average = 'weighted')


[2 2 0 0 2 2 0 2 2 2 0 2 0 0 1 0 2 1 1 2 0 2 0 1 2 2 1 1 2 1 2 1 2 0 1 1 1
 2 1 0 0 0 0 1 0]
[2 2 0 0 2 2 0 2 2 2 0 2 0 0 1 0 2 1 1 2 0 2 0 1 2 2 1 1 2 1 2 1 2 0 1 1 1
 2 1 0 0 0 0 1 0]
