3-Classification
================

**Author:** Cumhur Erkut



## 1. Introduction to classification



Best resources for ML Algorithms:

-   [https://saturncloud.io/glossary/](https://saturncloud.io/glossary/)
-   [2 Machine Learning General - Google Slides](https://docs.google.com/presentation/d/1qSOwBrjEmZTXQqNqB9XRAV7QsB6SJrLZ4pZBCkpvzyA/edit#slide=id.gf297669038_0_129)

This section is all about balancing and resaving the data.



### Pre-lecture quiz



Q: Multiclass. Given ingredients, which cuisine?

-   [X] Classification, regression relationship
-   [X] First step: analyze and balance your data



### Hello classifier



-   **multiclass:** Given a batch of ingredients, which of these cuisines (multiple classes) will the data fit?

-   [ ] Define classification
-   [ ] Take a moment to imagine a dataset about cuisines
    -   [ ] What can a binary model answer?
        
            Given a grocery bag full of star anise, artichokes, cauliflower, and horseradish, can we create a typical Indian dish?
    -   [ ] What can a multiclass model answer?
        
            Which cuisine is likely to use fenugreek?
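
The difference between the two framings lies in how the target column is defined. A minimal sketch, assuming a hypothetical list of cuisine labels (`cuisines` is not part of the dataset code in these notes):

```python
# Toy labels standing in for the `cuisine` column (hypothetical values).
cuisines = ["indian", "thai", "indian", "korean", "japanese", "chinese"]

# Multiclass framing: the target keeps every cuisine as its own class.
multiclass_target = cuisines

# Binary framing: collapse the target to "indian" vs. "not indian".
binary_target = [int(c == "indian") for c in cuisines]

print(sorted(set(multiclass_target)))  # the distinct classes
print(binary_target)                   # 1 where the recipe is Indian
```

A binary model trained on `binary_target` can only answer "Indian or not?", while a multiclass model trained on `multiclass_target` picks one cuisine out of all of them.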



### Exercise: (Clean and balance) your data



In [None]:
    # ! pip install imbalanced-learn
    import pandas as pd
    import matplotlib as mpl
    import matplotlib.pyplot as plt
    import numpy as np
    from imblearn.over_sampling import SMOTE

    df = pd.read_csv('../data/cuisines.csv')
    df.head()

### Exercise: Learning about cuisines



In [None]:
    df.cuisine.value_counts().plot.barh()

In [None]:
    thai_df = df[(df.cuisine == "thai")]
    japanese_df = df[(df.cuisine == "japanese")]
    chinese_df = df[(df.cuisine == "chinese")]
    indian_df = df[(df.cuisine == "indian")]
    korean_df = df[(df.cuisine == "korean")]
    
    print(f'thai df: {thai_df.shape}')
    print(f'japanese df: {japanese_df.shape}')
    print(f'chinese df: {chinese_df.shape}')
    print(f'indian df: {indian_df.shape}')
    print(f'korean df: {korean_df.shape}')
    

What are the typical ingredients per cuisine? 
First, clean out recurrent data that creates confusion between cuisines.

In [None]:
    def create_ingredient_df(df):
        ingredient_df = df.T.drop(['cuisine','Unnamed: 0']).sum(axis=1).to_frame('value')
        ingredient_df = ingredient_df[(ingredient_df.T != 0).any()]
        ingredient_df = ingredient_df.sort_values(by='value', ascending=False, inplace=False)
        return ingredient_df

    thai_ingredient_df = create_ingredient_df(thai_df)
    thai_ingredient_df.head(10).plot.barh()
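
To see what `create_ingredient_df` computes, here is a minimal sketch on a hypothetical three-recipe table (`toy_df` and its ingredient columns are made up for illustration). It uses a shorter chain of pandas calls that should give the same ranking as the function above:

```python
import pandas as pd

# Hypothetical miniature of cuisines.csv: one row per recipe, 0/1 per ingredient.
toy_df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "cuisine": ["thai", "thai", "thai"],
    "rice": [1, 1, 1],
    "coconut": [1, 1, 0],
    "lime": [0, 1, 0],
    "cumin": [0, 0, 0],
})

# Sum each ingredient column, drop ingredients never used, sort by frequency.
counts = (
    toy_df.drop(columns=["cuisine", "Unnamed: 0"])
    .sum()
    .loc[lambda s: s != 0]
    .sort_values(ascending=False)
    .to_frame("value")
)
print(counts)
```

Ingredients that never occur (here `cumin`) disappear, and the rest are ranked by how many recipes use them.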


In [None]:

    japanese_ingredient_df = create_ingredient_df(japanese_df)
    # japanese_ingredient_df.head(10).plot.barh()
    chinese_ingredient_df = create_ingredient_df(chinese_df)
    # chinese_ingredient_df.head(10).plot.barh()
    indian_ingredient_df = create_ingredient_df(indian_df)
    # indian_ingredient_df.head(10).plot.barh()
    korean_ingredient_df = create_ingredient_df(korean_df)
    # korean_ingredient_df.head(10).plot.barh()

Drop the `Unnamed: 0` index column and the most frequent ingredients (rice, garlic, ginger), which are shared across cuisines.

In [None]:

    feature_df = df.drop(['cuisine', 'Unnamed: 0','rice', 'garlic', 'ginger'], axis=1)
    labels_df = df.cuisine #.unique()
    feature_df.head()


### Balance the dataset using SMOTE



Balance data with SMOTE oversampling to the highest class. Read more here: [https://imbalanced-learn.org/dev/references/generated/imblearn.over_sampling.SMOTE.html](https://imbalanced-learn.org/dev/references/generated/imblearn.over_sampling.SMOTE.html)


In [None]:

    oversample = SMOTE()
    transformed_feature_df, transformed_label_df = oversample.fit_resample(feature_df, labels_df)

    print(f'new label count: {transformed_label_df.value_counts()}')
    print(f'old label count: {df.cuisine.value_counts()}')
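
The core idea behind SMOTE is to create new minority-class rows by interpolating between an existing row and one of its nearest neighbours. The following is a simplified NumPy sketch of that idea only, not imbalanced-learn's implementation; `minority` and `synthetic_sample` are hypothetical names:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class samples with two features each.
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [3.0, 3.0]])

def synthetic_sample(X, k=2):
    """Interpolate between a random sample and one of its k nearest neighbours."""
    i = rng.integers(len(X))
    dist = np.linalg.norm(X - X[i], axis=1)
    neighbours = np.argsort(dist)[1:k + 1]  # position 0 is the sample itself
    j = rng.choice(neighbours)
    gap = rng.random()                      # position along the segment, in [0, 1)
    return X[i] + gap * (X[j] - X[i])

new_points = np.array([synthetic_sample(minority) for _ in range(5)])
print(new_points)
```

Because the new rows are interpolated, synthetic recipes in a 0/1 ingredient table can contain fractional values, which is worth keeping in mind when inspecting `transformed_feature_df`.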


In [None]:

    transformed_df = pd.concat([transformed_label_df,transformed_feature_df],axis=1, join='outer')
    transformed_df.head()
    transformed_df.info()
    transformed_df.to_csv("../data/cleaned_cuisines.csv")

### 🚀Challenge, Post-lecture quiz, Review & Self Study



## 2. More classifiers: [Logistic Regression](https://paperswithcode.com/method/logistic-regression) (and [Support Vector](https://paperswithcode.com/method/svm) Classifiers)

    Assumption: a cleaned_cuisines.csv file exists in the root /data folder for these four lessons.



### Exercise: predict a national cuisine



In [None]:
    import pandas as pd
    cuisines_df = pd.read_csv("../data/cleaned_cuisines.csv")
    cuisines_df.head()

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.metrics import accuracy_score,precision_score,confusion_matrix,classification_report, precision_recall_curve
    from sklearn.svm import SVC
    import numpy as np

    cuisines_label_df = cuisines_df['cuisine']
    cuisines_feature_df = cuisines_df.drop(['Unnamed: 0', 'cuisine'], axis=1)

    X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)
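
`cross_val_score` is imported above but not used in these notes. A minimal sketch of how it could complement a single train/test split, assuming the bundled iris data as a stand-in for the cuisine features:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Iris stands in for the cuisine features here (an assumption for brevity).
X, y = load_iris(return_X_y=True)

# Five-fold cross-validation: fit on four folds, score on the held-out fold.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"fold accuracies: {scores}")
print(f"mean accuracy: {scores.mean():.3f}")
```

Averaging over folds gives an accuracy estimate that depends less on one particular random split.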


In [None]:
    # Logistic Regression
    lr = LogisticRegression(multi_class='ovr',solver='liblinear')
    model = lr.fit(X_train, np.ravel(y_train))
    
    accuracy = model.score(X_test, y_test)
    print ("Accuracy is {}".format(accuracy))
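
The `multi_class='ovr'` setting fits one binary classifier per cuisine. If your installed scikit-learn version warns about or rejects that argument, an explicit one-vs-rest wrapper is one possible alternative. A sketch, assuming iris as stand-in data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Iris (three classes) stands in for the cuisine data; an assumption for brevity.
X, y = load_iris(return_X_y=True)

# One binary liblinear model per class, wrapped explicitly.
ovr = OneVsRestClassifier(LogisticRegression(solver="liblinear"))
ovr.fit(X, y)

print(len(ovr.estimators_))   # one fitted binary model per class
print(ovr.predict(X[:3]))
```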


In [None]:
    # Test classification instance 
    print(f'ingredients: {X_test.iloc[50][X_test.iloc[50]!=0].keys()}')
    print(f'cuisine: {y_test.iloc[50]}')

    # Select row 50 as a one-row DataFrame so the feature names are kept
    test = X_test.iloc[[50]]
    proba = model.predict_proba(test)
    classes = model.classes_
    resultdf = pd.DataFrame(data=proba, columns=classes)
    
    topPrediction = resultdf.T.sort_values(by=[0], ascending = [False])
    topPrediction.head()
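
The ranking step can also be done directly on the probability array. A minimal sketch, assuming hypothetical values for `predict_proba` output and `classes_`:

```python
import numpy as np

# Hypothetical output of predict_proba for one recipe, with the matching classes.
classes = np.array(["chinese", "indian", "japanese", "korean", "thai"])
proba = np.array([[0.05, 0.08, 0.15, 0.12, 0.60]])

# Indices of the classes sorted from most to least probable.
order = np.argsort(proba[0])[::-1]
for cuisine, p in zip(classes[order][:3], proba[0][order][:3]):
    print(f"{cuisine}: {p:.2f}")
```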


In [None]:
    y_pred = model.predict(X_test)
    print(classification_report(y_test,y_pred))
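
The per-class columns of `classification_report` follow from simple counts. A minimal sketch with hypothetical labels (`y_true`, `y_hat` and `report_for` are illustrative names, not part of the notebook):

```python
# Hypothetical true and predicted labels for six recipes.
y_true = ["thai", "thai", "indian", "indian", "thai", "korean"]
y_hat = ["thai", "indian", "indian", "indian", "thai", "thai"]

def report_for(label):
    """Precision, recall and F1 for one class, treated as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_hat))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_hat))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_hat))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

for label in sorted(set(y_true)):
    print(label, report_for(label))
```

Precision asks "of the recipes predicted as this cuisine, how many were right?", while recall asks "of the recipes that truly are this cuisine, how many were found?".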

### 🚀Challenge, Post-lecture quiz, Review & Self Study



## 3. Yet other classifiers



We assume that you have completed the previous lessons and have a cleaned dataset in your `data` folder called `cleaned_cuisines.csv`.



### A classification map ([interactive version in your browser](https://scikit-learn.org/stable/tutorial/machine_learning_map/)): The plan, hacked code



![img](../3-Classifiers-2/images/map.png)

-   [ ] Linear SVC
-   [ ] KNN
-   [ ] SVC
-   [ ] Ensemble (Random Forest, AdaBoost)


In [None]:
    import pandas as pd
    cuisines_df = pd.read_csv("../data/cleaned_cuisines.csv")
    cuisines_label_df = cuisines_df['cuisine']
    cuisines_feature_df = cuisines_df.drop(['Unnamed: 0', 'cuisine'], axis=1)
    
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.metrics import accuracy_score,precision_score,confusion_matrix,classification_report, precision_recall_curve
    import numpy as np
    
    X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)


In [None]:
    C = 10
    # Create different classifiers.
    classifiers = {
        'Linear SVC': SVC(kernel='linear', C=C, probability=True,random_state=0),
        'KNN classifier': KNeighborsClassifier(n_neighbors=C),  # C is reused here as the neighbour count
        'SVC': SVC(),
        'RFST': RandomForestClassifier(n_estimators=100),
        'ADA': AdaBoostClassifier(n_estimators=100)  
    }

    import warnings
    # Filter out the UserWarnings raised by SVC and KNN
    warnings.filterwarnings("ignore", category=UserWarning)
    
    n_classifiers = len(classifiers)
    
    for index, (name, classifier) in enumerate(classifiers.items()):
        classifier.fit(X_train, np.ravel(y_train))
        X_test = np.ascontiguousarray(X_test) # fixes KNN c_contiguous array error
        y_pred = classifier.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        print("Accuracy (test) for %s: %0.1f%% " % (name, accuracy * 100))
        print(classification_report(y_test,y_pred))
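
Of the classifiers compared above, KNN is the easiest to write out by hand: a point receives the majority label of its k closest training points. A minimal NumPy sketch with hypothetical data (`X_known`, `y_known` and `knn_predict` are illustrative names), not scikit-learn's implementation:

```python
import numpy as np
from collections import Counter

# Hypothetical training points (two features) and their labels.
X_known = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_known = np.array(["thai", "thai", "thai", "korean", "korean", "korean"])

def knn_predict(x, k=3):
    """Label a point by majority vote among its k nearest training points."""
    dist = np.linalg.norm(X_known - x, axis=1)   # Euclidean distance to each point
    nearest = np.argsort(dist)[:k]               # indices of the k closest points
    return Counter(y_known[nearest]).most_common(1)[0][0]

print(knn_predict(np.array([0.5, 0.5])))
print(knn_predict(np.array([5.5, 5.5])))
```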

### 🚀Challenge, Post-lecture quiz, Review & Self Study



There's a lot of jargon in these lessons, so take a minute to review [this list](https://docs.microsoft.com/dotnet/machine-learning/resources/glossary?WT.mc_id=academic-77952-leestott) of useful terminology!



## 4. Applied ML: build a web app



-   [ ] How to build a model and save it as an ONNX model
-   [ ] How to use Netron to inspect the model
-   [ ] How to use your model in a web app for inference



### [Pre-lecture quiz](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/25/)



### Build your model



*Build a basic JavaScript-based system for inference. First, however, you need to train a model and convert it for use with ONNX*.

[Open Neural Network Exchange (ONNX)](https://onnx.ai/) is an open standard format for representing machine learning models. See the basic tutorials at [https://github.com/onnx/tutorials](https://github.com/onnx/tutorials).



#### Exercise: Train classification model



In [None]:
    # ! mamba install skl2onnx==1.15.0
    import pandas as pd

    data = pd.read_csv('../data/cleaned_cuisines.csv')
    X = data.iloc[:,2:]
    y = data[['cuisine']]

    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score
    from sklearn.metrics import accuracy_score,precision_score,confusion_matrix,classification_report
    
    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3)


In [None]:
    # SVC
    model = SVC(kernel='linear', C=10, probability=True,random_state=0)
    model.fit(X_train,y_train.values.ravel())

In [None]:
    y_pred = model.predict(X_test)
    print(classification_report(y_test,y_pred))

In [None]:
    # ONNX EXPORT

    from skl2onnx import convert_sklearn
    from skl2onnx.common.data_types import FloatTensorType
    
    # 380 should match the number of feature columns (X.shape[1])
    initial_type = [('float_input', FloatTensorType([None, 380]))]
    # Options: nocl=True embeds no class info (smaller file size);
    # zipmap=False returns probabilities as a tensor instead of a list of dictionaries
    options = {id(model): {'nocl': True, 'zipmap': False}}
    
    onx = convert_sklearn(model, initial_types=initial_type, options=options)
    with open("./model.onnx", "wb") as f:
        f.write(onx.SerializeToString())



### OPTIONAL View your model using [netron](https://github.com/lutzroeder/netron)



#### macOS
    brew install netron
    open model.onnx
#### Windows
    winget install -s winget netron

![img](../4-Applied/images/netron.png)



### Build a recommender web application (by writing index.html and js, then running http-server) then test



A better way is FastAPI, which comes later (4-Web-App).
For now, it is best to check [../4-Applied/solution/index.html](../4-Applied/solution/index.html).



### 🚀Challenge, Post-lecture quiz, Review & Self Study



## 5. Convolutional Neural Nets



### Interactively: [https://poloclub.github.io/cnn-explainer/](https://poloclub.github.io/cnn-explainer/)



### Using scikit-learn MLP as a proxy to CNN



After [Training CNN with Images in Sklearn Neural Net: A Step-by-Step Guide | Saturn Cloud Blog](https://saturncloud.io/blog/training-cnn-with-images-in-sklearn-neural-net-a-stepbystep-guide/)


In [None]:

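    from sklearn.datasets import load_digits
    # Only needed if the optional preprocessing below is enabled:
    # from skimage import color
    # from skimage.transform import resize
    # Load the digits dataset as flattened 8x8 images (64 features per sample);
    # load_iris() is another small dataset to try
    X, y = load_digits(return_X_y=True)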
    # X = dataset.images
    # y = dataset.target
    
    # Commented parts for possible preprocessing using scikit-image
    # Preprocessing: resize, grayscale, normalize pixel values etc
    # import numpy as np
    # from sklearn.preprocessing import StandardScaler
    
    # Resize images
    # X_resized = np.array([resize(image, (64, 64), anti_aliasing=True) for image in X])
    # Convert to grayscale
    # X_gray = np.array([color.rgb2gray(image) for image in X_resized])
    
    # Normalize pixel values
    # scaler = StandardScaler()
    # X_scaled = scaler.fit_transform(X_gray.reshape(-1, 64 * 64))
    
    # Train - test split
    from sklearn.model_selection import train_test_split
    
    # Split data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



#### Model: an MLP with ReLU activation as a stand-in for a CNN



In [None]:
    from sklearn.neural_network import MLPClassifier
    
    # MLPClassifier as a CNN stand-in: the 8x8 image is flattened to 64 inputs,
    # then the hidden layers progressively shrink (32, then 16)
    mlp = MLPClassifier(hidden_layer_sizes=(32, 16), activation='relu', solver='adam', max_iter=500)
    
    # Fit the model, then score it on the held-out test set
    mlp.fit(X_train, y_train)
    score = mlp.score(X_test, y_test)
    print(f"Accuracy: {score}")
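
To make the "flattened, then progressively shrunk" structure concrete, here is a minimal NumPy sketch of a forward pass through layers of the same sizes. The weights are random and purely illustrative, so the output is not a meaningful prediction, and the softmax output layer is how a multiclass MLP is usually finished:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

# Layer sizes matching the cell above: 64 inputs -> 32 -> 16 -> 10 digit classes.
sizes = [64, 32, 16, 10]
# Random weights for illustration only; a trained model would have learned values.
weights = [rng.normal(size=(m, n)) * 0.1 for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

a = rng.random((1, 64))              # one flattened 8x8 image
for W, b in zip(weights[:-1], biases[:-1]):
    a = relu(a @ W + b)              # hidden layers use ReLU
    print(a.shape)

logits = a @ weights[-1] + biases[-1]
shifted = np.exp(logits - logits.max())
probs = shifted / shifted.sum()      # softmax over the ten classes
print(probs.shape, probs.sum())
```

Unlike a real CNN, every layer here is fully connected: there are no convolution or pooling steps, which is why the MLP is only a proxy.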



#### Improve the Model Performance?

In [None]:
    # Increase model complexity
    mlp = MLPClassifier(hidden_layer_sizes=(64, 32), activation='relu', solver='adam', max_iter=1000)
    mlp.fit(X_train, y_train)
    score = mlp.score(X_test, y_test)
    print(f"Accuracy: {score}")


In [None]:

    # Use SGD optimizer
    mlp = MLPClassifier(hidden_layer_sizes=(64, 32), activation='relu', solver='sgd', max_iter=500)
    mlp.fit(X_train, y_train)
    score = mlp.score(X_test, y_test)
    print(f"Accuracy: {score}")



In [None]:


    # Use the SGD optimizer with an explicit initial learning rate
    # (0.001 is also scikit-learn's default, so this should behave like the previous cell)
    mlp = MLPClassifier(hidden_layer_sizes=(64, 32), activation='relu', solver='sgd', learning_rate_init=0.001, max_iter=500)
    mlp.fit(X_train, y_train)
    score = mlp.score(X_test, y_test)
    print(f"Accuracy: {score}")

#### Data augmentation (requires skimage)
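
A minimal sketch of one possible augmentation, assuming that shifting a digit by one pixel does not change its label. It uses plain NumPy rather than skimage, and `shift_image`, `X_small` and `y_small` are hypothetical names standing in for the training data:

```python
import numpy as np

def shift_image(flat, dx, dy, side=8):
    """Shift a flattened side x side image by (dx, dy) pixels, padding with zeros."""
    img = flat.reshape(side, side)
    out = np.zeros_like(img)
    # Source and destination slices for the overlapping region.
    src_rows = slice(max(0, -dy), side - max(0, dy))
    dst_rows = slice(max(0, dy), side - max(0, -dy))
    src_cols = slice(max(0, -dx), side - max(0, dx))
    dst_cols = slice(max(0, dx), side - max(0, -dx))
    out[dst_rows, dst_cols] = img[src_rows, src_cols]
    return out.ravel()

# Hypothetical stand-in for two rows of X_train (64 features each).
X_small = np.arange(128, dtype=float).reshape(2, 64)
y_small = np.array([3, 7])

# Original images plus four one-pixel shifts of each.
shifts = [(1, 0), (-1, 0), (0, 1), (0, -1)]
shifted_sets = [np.array([shift_image(x, dx, dy) for x in X_small]) for dx, dy in shifts]
X_aug = np.vstack([X_small] + shifted_sets)
y_aug = np.concatenate([y_small] * (len(shifts) + 1))
print(X_aug.shape, y_aug.shape)
```

The augmented arrays could then be passed to `mlp.fit` in place of the original training data.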



## Assignment: Parametrization of classification algorithms



There are a lot of parameters that are set by default when working with these classifiers. Intellisense in VS Code can help you dig into them. Adopt one of the ML Classification Techniques in this lesson and retrain models tweaking various parameter values. 

Build a notebook explaining why some changes help the model quality while others degrade it. Be detailed in your answer. 

For example, 

-   In [linear SVC](https://scikit-learn.org/stable/modules/svm.html#classification), 
    -   Increasing C results in better model quality
    
    -   Increasing max iterations results in better model quality
    
    -   Increasing tolerance results in worse model quality

-   In [KNN](https://saturncloud.io/glossary/knn/), 
    -   Increasing n\_neighbors results in better model quality
    
    -   Increasing p results in better model quality
    
    -   Increasing tolerance results in worse model quality
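
Whether a claim such as "increasing n_neighbors improves quality" holds can be checked empirically. A minimal sketch of such a parameter sweep, assuming a subset of the bundled digits data as a stand-in for the cuisine dataset:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Digits stand in for the cuisine data (an assumption); a subset keeps this quick.
X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]

# Mean 3-fold accuracy for several neighbourhood sizes.
results = {}
for k in (1, 3, 5, 15, 50):
    model = KNeighborsClassifier(n_neighbors=k)
    results[k] = cross_val_score(model, X, y, cv=3).mean()

for k, acc in results.items():
    print(f"n_neighbors={k}: {acc:.3f}")
```

The same loop pattern works for other parameters (for example `C` in a linear SVC); the notebook for the assignment should then explain the trend that the numbers show.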

