# Learner selection

Having analysed and modelled the coffee data, it is time to select the appropriate learner for it.

**Learning objective:** Based on my analysis, I want to select the learner that most accurately fits the data and predicts the brewing method. The priority is avoiding false positives (which could be brand-damaging for the client), while also keeping false negatives in check (so as not to miss out on cheaper raw materials).


To choose the learner I take the following factors into consideration: all my data is labelled categorical text data, and I have only 851 samples after data cleaning and removal of outliers.

Because of the small volume of data I will try the *SGD Classifier*, which relies on a simple stochastic gradient descent learning routine and is easy to implement.
Because the data is categorical text, I will also test *Naive Bayes*, which works well with small training sets and copes well with the curse of dimensionality (though that does not seem to be a problem in my dataset).
Another classifier I will try is *KNN*, which works well where the decision boundary is very irregular.
Finally, I am considering *Decision Trees*; however, their implementation in sklearn requires numerical data only, so I would have to encode all of my text variables.

My initial assumption is that Naive Bayes will best suit my data and the problem I am trying to solve.

## Data split

Before I can start testing learners, I need to split my data. I will use random seed 42 to make sure I always get the same random split.

In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [14]:
coffee_df = pd.read_csv('data/coffee_desk_dataset_ead_selected.csv', index_col=0)  # forward slash works on every OS and avoids backslash escapes
coffee_df.head(5)

| idx | origin_region | natural | fermented_traditional | fermented_closed_tank | brewing_method_binary_num |
|---|---|---|---|---|---|
| 1 | Latam | True | False | False | 1 |
| 2 | Africa | True | False | False | 1 |
| 3 | Africa | False | False | False | 1 |
| 4 | Asia | True | False | False | 1 |
| 5 | Latam | True | False | False | 1 |


In [16]:
X_df = coffee_df.drop('brewing_method_binary_num', axis=1) # defining predictors
y_df = coffee_df['brewing_method_binary_num'] # defining target variable

In [17]:
# Hold out 20% of the rows; the fixed random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.2, random_state=42)
# Halve the held-out rows into a validation set and a final test set
X_validation, X_test, y_validation, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=42)
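
As a sanity check on the split sizes, the sketch below repeats the two-step split on a placeholder frame with 851 rows. The `dummy_X` and `dummy_y` names and their contents are made up for illustration and are not the coffee data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the cleaned dataset (851)
dummy_X = pd.DataFrame({"feature": np.arange(851)})
dummy_y = pd.Series(np.arange(851) % 2)

# Step 1: hold out 20% of the rows
X_tr, X_hold, y_tr, y_hold = train_test_split(
    dummy_X, dummy_y, test_size=0.2, random_state=42
)
# Step 2: halve the held-out rows into validation and test
X_val, X_te, y_val, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=42
)

split_sizes = {"train": len(X_tr), "validation": len(X_val), "test": len(X_te)}
print(split_sizes)
```

For 851 rows this gives 680 training, 85 validation and 86 test rows, so roughly an 80/10/10 split.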

## Data encoding

Since my data is categorical text data, I will encode it into binary vectors with a one-hot encoder.

In [18]:
from sklearn.preprocessing import OneHotEncoder

In [19]:
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(X_train) # all variables are categorical

OneHotEncoder(handle_unknown='ignore')

In [20]:
X_train = encoder.transform(X_train)
X_validation = encoder.transform(X_validation)
X_test = encoder.transform(X_test)
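
To illustrate what `handle_unknown="ignore"` does, here is a minimal sketch on a toy column. The `toy_*` names are hypothetical and "Oceania" is an invented category that the encoder never sees during fitting.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy training column; the values mirror the regions shown in the data preview
toy_train = pd.DataFrame({"origin_region": ["Latam", "Africa", "Asia"]})
# "Oceania" is a made-up category that was not present during fit
toy_new = pd.DataFrame({"origin_region": ["Africa", "Oceania"]})

toy_encoder = OneHotEncoder(handle_unknown="ignore")
toy_encoder.fit(toy_train)

# Categories are learned from the training data only, in sorted order
learned_categories = list(toy_encoder.categories_[0])
# transform() returns a sparse matrix; toarray() makes it easy to inspect
toy_encoded = toy_encoder.transform(toy_new).toarray()

print(learned_categories)
print(toy_encoded)
```

The known category gets a single 1 in its column, while the unseen one is encoded as a row of zeros instead of raising an error. This is why the encoder is fitted on the training set only and then reused for the validation and test sets.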

## Model selection

Having encoded the data (the target is already encoded as 1 for specialty brewing and 0 for espresso), I can now fit different models on it and decide which one to use, possibly with further optimization and hyperparameter tuning.

In [21]:
from sklearn.naive_bayes import CategoricalNB
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier

In [35]:
model = CategoricalNB()
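# CategoricalNB needs dense input; toarray() returns an ndarray (recent scikit-learn rejects the np.matrix from todense())
X_train_dense = X_train.toarray()
X_validation_dense = X_validation.toarray()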
model.fit(X_train_dense, y_train)
y_train_pred = model.predict(X_train_dense)
y_validation_pred = model.predict(X_validation_dense)

In [37]:
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_validation, y_validation_pred).ravel()
print(f'True negatives: {tn}')
print(f'True positives: {tp}')
print(f'False negatives: {fn}')
print(f'False positives: {fp}')

True negatives: 28
True positives: 31
False negatives: 12
False positives: 14
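
Since the learning objective puts false positives first, precision is the most relevant number to read off these counts. The sketch below derives precision, recall and an F-beta score from the validation counts above. The choice of `beta = 0.5` is my assumption for weighting precision more heavily, not something fixed by the analysis.

```python
# Validation counts reported above for CategoricalNB
tn, fp, fn, tp = 28, 14, 12, 31

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found

# F-beta with beta < 1 weights precision more heavily than recall
beta = 0.5
f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F{beta}: {f_beta:.3f}")
```

With these counts precision is about 0.689 and recall about 0.721, so roughly three in ten positive predictions on the validation set are false positives.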


In [38]:
tn, fp, fn, tp = confusion_matrix(y_train, y_train_pred).ravel()
print(f'True negatives: {tn}')
print(f'True positives: {tp}')
print(f'False negatives: {fn}')
print(f'False positives: {fp}')

True negatives: 209
True positives: 258
False negatives: 103
False positives: 110
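
A quick way to check for overfitting is to compare accuracy on the two sets. This is a minimal sketch using only the counts printed in the two cells above; the `counts` dictionary is a hypothetical helper structure.

```python
# Confusion-matrix counts reported above, in the order (tn, fp, fn, tp)
counts = {
    "train": (209, 110, 103, 258),
    "validation": (28, 14, 12, 31),
}

accuracy = {}
for split_name, (tn, fp, fn, tp) in counts.items():
    # accuracy = correct predictions / all predictions
    accuracy[split_name] = (tn + tp) / (tn + fp + fn + tp)
    print(f"{split_name}: accuracy = {accuracy[split_name]:.3f}")

gap = accuracy["train"] - accuracy["validation"]
print(f"train - validation gap: {gap:.3f}")
```

Both accuracies come out close to 0.69, which suggests the model is not overfitting, although with only 85 validation rows the estimate is noisy.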


In [None]:
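models = {
    'CategoricalBayes' : CategoricalNB(),
    'SGDClassifier' : SGDClassifier(),
    'KNNs' : KNeighborsClassifier(n_neighbors=9)
}

# CategoricalNB rejects sparse input, so all models are given dense arrays
X_train_array = X_train.toarray()
X_validation_array = X_validation.toarray()

predictions_by_model = {}

for name, model in models.items():

    model.fit(X_train_array, y_train)

    y_train_pred = model.predict(X_train_array)
    y_validation_pred = model.predict(X_validation_array)

    # store validation predictions; the test set stays untouched until the final model is chosen
    predictions_by_model[name] = y_validation_pred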
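
To compare the learners against the learning objective, the stored predictions can be ranked by precision. The helper below is a hypothetical sketch, and `rank_by_precision` is not part of scikit-learn. It is shown on made-up toy labels, but it could be called as `rank_by_precision(y_validation, predictions_by_model)`, assuming the dictionary holds validation predictions.

```python
from sklearn.metrics import precision_score, recall_score


def rank_by_precision(y_true, predictions):
    """Return (name, precision, recall) tuples sorted by precision, best first."""
    rows = []
    for name, y_pred in predictions.items():
        rows.append((
            name,
            precision_score(y_true, y_pred, zero_division=0),
            recall_score(y_true, y_pred, zero_division=0),
        ))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Toy labels and predictions, made up purely to show the call
toy_y_true = [1, 0, 1, 0, 1, 0]
toy_predictions = {
    "model_a": [1, 0, 1, 1, 0, 0],
    "model_b": [1, 1, 1, 1, 1, 1],
}

ranking = rank_by_precision(toy_y_true, toy_predictions)
for name, precision, recall in ranking:
    print(f"{name}: precision={precision:.2f}, recall={recall:.2f}")
```

Sorting by precision reflects the priority on avoiding false positives; recall is reported alongside so that a model which rarely predicts the positive class does not win by default.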