<a href="https://colab.research.google.com/github/chocobearz/SASSA-Sklean/blob/main/sassa_sklean_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# In this workshop we will use scikit-learn to do some classification

Import the dataset

In [None]:
import pandas as pd

# Load the iris dataset from the seaborn-data repository
dataset = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
dataset

Check out the dataset

In [None]:
dataset.info()

See what classes we have

In [None]:
dataset['species'].unique()

Split the dataset into training, validation and test sets

In [None]:
from sklearn.model_selection import train_test_split

# Create a Training & Test set (80% training, 20% testing)
train, test = train_test_split(dataset, test_size=0.2)

In [None]:
train

Split off a validation set (cross validation would also work, but here we use a single validation set)

In [None]:
# Set aside validation set (validation set is 25% of the training data)
# Overall, leaves us with 60% training, 20% validation, 20% testing
train, validate = train_test_split(train, test_size=0.25)
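
The cross validation alternative mentioned above could look roughly like the sketch below. It is an illustration rather than part of the workshop flow: it assumes scikit-learn's bundled copy of iris (`load_iris`) in place of the CSV so that it runs offline, and the choice of 5 folds and `random_state=0` is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Bundled copy of iris (assumption: stands in for the CSV loaded earlier)
X, y = load_iris(return_X_y=True, as_frame=True)

# Hold out the test set once; cross validation only sees the remaining 80%
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# 5-fold cross validation: each fold takes one turn as the validation set
cv_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=5, random_state=0),
    X_train_val,
    y_train_val,
    cv=5,
    scoring="accuracy",
)
print("Fold accuracies:", cv_scores)
print("Mean accuracy:", cv_scores.mean())
```

Each training example is used for validation exactly once, which gives a steadier estimate than a single split on a dataset this small.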

Start the decision tree. [Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)

In [None]:
# Decision Tree Classifier
# ************************
from sklearn.tree import DecisionTreeClassifier

# Initialize the classifier
classifier = DecisionTreeClassifier(max_depth=5)  # <-- the line to edit when tuning

# Separate the Features from the Target
train_X = train.drop('species', axis=1) # features
train_y = train['species'] # target

# Train the classifier
classifier.fit(train_X, train_y)

Plot the tree.
Boo matplotlib

In [None]:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plt.figure(figsize=[10, 10])

# Use the feature columns only: train.columns would also include 'species'
plot_tree(classifier,
          feature_names=list(train_X.columns),
          class_names=list(classifier.classes_))
plt.show()

Evaluate on the training and validation sets

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

predictions = classifier.predict(train_X)
print('----- Training -----')
print('Accuracy:', accuracy_score(train_y, predictions))
print('Precision:', list(zip(classifier.classes_, precision_score(train_y, predictions, average=None))))
print('Recall:', list(zip(classifier.classes_, recall_score(train_y, predictions, average=None))))

# Separate the Features from the Target
validate_X = validate.drop('species', axis=1)
validate_y = validate['species']

predictions = classifier.predict(validate_X)
print('\n----- Validation -----')
print('Accuracy:', accuracy_score(validate_y, predictions))
print('Precision:', list(zip(classifier.classes_, precision_score(validate_y, predictions, average=None))))
print('Recall:', list(zip(classifier.classes_, recall_score(validate_y, predictions, average=None))))

In [None]:
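# `predictions` now holds the validation predictions, so compare with validate_y.
# With more than two classes, `average` must be set explicitly
# (the default, 'binary', raises an error here).
precision_score(validate_y, predictions, average='macro')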

Some options for tuning:

`max_depth`: The maximum depth of the learned tree. Reducing this value makes the tree shallower and limits how many rules it can learn.

`min_samples_leaf`: The minimum number of samples allowed at a leaf node. Increasing this value prevents splits that separate off only a small number of samples (think about splitting on an ID number).

`class_weight`: How much weight examples from different classes are given when finding the optimal splits. This can help deal with class imbalance, a common problem in data science.
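
One way to try these options is a small loop that scores each combination on the validation set. The sketch below is illustrative only: the candidate values are arbitrary, `class_weight="balanced"` is just one possible setting, and scikit-learn's bundled `load_iris` is assumed in place of the CSV so the block runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Bundled copy of iris (assumption: stands in for the CSV loaded earlier)
X, y = load_iris(return_X_y=True, as_frame=True)

# Same 60/20/20 split as in the workshop, fixed with a seed for repeatability
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.25, random_state=0, stratify=y_train_val
)

# Try every combination and record its validation accuracy
results = []
for max_depth in [1, 2, 3, 5]:
    for min_samples_leaf in [1, 5, 10]:
        model = DecisionTreeClassifier(
            max_depth=max_depth,
            min_samples_leaf=min_samples_leaf,
            class_weight="balanced",
            random_state=0,
        )
        model.fit(X_train, y_train)
        val_accuracy = accuracy_score(y_val, model.predict(X_val))
        results.append((val_accuracy, max_depth, min_samples_leaf))

# Pick the first combination with the highest validation accuracy
best_accuracy, best_depth, best_leaf = max(results, key=lambda r: r[0])
print("Best validation accuracy:", best_accuracy)
print("max_depth:", best_depth, "min_samples_leaf:", best_leaf)
```

The test set is deliberately left untouched here; it is only used once the settings have been chosen.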

Evaluate the model on the test set

In [None]:
# Separate the Testing features from the Target
test_X = test.drop('species', axis=1)
test_y = test['species']

predictions = classifier.predict(test_X)
print('\n----- Testing -----')
print('Accuracy:', accuracy_score(test_y, predictions))
print('Precision:', list(zip(classifier.classes_, precision_score(test_y, predictions, average=None))))
print('Recall:', list(zip(classifier.classes_, recall_score(test_y, predictions, average=None))))



### **Accuracy**
Accuracy is the proportion of correctly classified examples.

Accuracy is perhaps the most common metric for evaluating classification problems. However, there are a few problems with it.

First, **it doesn't account for differences in importance between classes.**

Second, **accuracy can be misleading under class imbalance:** a model that always predicts the majority class can score a high accuracy while never finding a single example of the minority class.
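
A tiny made-up example shows the class imbalance problem. The labels below are hypothetical (95 negatives, 5 positives), and the "model" simply predicts the majority class every time.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 95 negatives (0) and 5 positives (1)
y_true = [0] * 95 + [1] * 5

# A "model" that always predicts the majority class
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)  # recall for the positive class (1)

print("Accuracy:", accuracy)
print("Recall (positive class):", recall)
```

The accuracy looks strong even though none of the positive examples were found, which is why the per-class metrics below are worth checking as well.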

### **Precision**
For binary classification problems (where we can think of classes as Positive and Negative), precision is the percentage of predicted positives which are truly positive.

For multi-class classification problems (when we have more than 2 classes), precision is calculated for each class individually. We can choose to calculate the average precision over all classes, or leave it as a class-by-class measure. 
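
As a rough illustration, the `average` argument of `precision_score` switches between the class-by-class and the averaged view. The six labels below are made up for the example and are not taken from the workshop data.

```python
from sklearn.metrics import precision_score

labels = ["setosa", "versicolor", "virginica"]

# Hypothetical true and predicted labels (one versicolor is mistaken for virginica)
y_true = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
y_pred = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

# average=None returns one precision value per class, in the order of `labels`
per_class = precision_score(y_true, y_pred, labels=labels, average=None)

# average="macro" is the unweighted mean of the per-class values
macro = precision_score(y_true, y_pred, labels=labels, average="macro")

print("Per class:", list(zip(labels, per_class)))
print("Macro average:", macro)
```

Here virginica was predicted three times but only two of those were correct, so its precision is 2/3 while the other two classes score 1.0.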

### **Recall**
For binary classification problems, recall is the percentage of truly positive examples which were predicted as positive.

As with precision, recall is calculated on a class-by-class basis for multi-class classification problems.
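
To see where the per-class numbers come from, recall can be read off a confusion matrix: the diagonal holds the correctly predicted examples and each row sums to the number of true examples of that class. The labels below are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

labels = ["setosa", "versicolor", "virginica"]

# Hypothetical true and predicted labels (one versicolor is mistaken for virginica)
y_true = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
y_pred = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Recall per class = correctly predicted / all true examples of that class
manual_recall = cm.diagonal() / cm.sum(axis=1)

# The same values straight from scikit-learn
sk_recall = recall_score(y_true, y_pred, labels=labels, average=None)

print(cm)
print("Manual recall: ", manual_recall)
print("sklearn recall:", sk_recall)
```

Only one of the two true versicolor examples was recovered, so versicolor has a recall of 0.5 while the other classes have 1.0.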

### **F Measure**
Precision and recall are often at odds with one another (very high recall often requires taking a hit to precision). The F-measure (or F1-score) is an attempt to balance precision and recall in a single metric: it is their harmonic mean.

The F-measure gives equal importance to precision and recall, while in reality one is often more important than the other (determined by the problem itself).

As with precision and recall, F-measure is calculated per class for classification with more than 2 classes.
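
The sketch below computes the F-measure per class both by hand, as 2PR / (P + R), and with `f1_score`. It also shows `fbeta_score`, which scikit-learn provides for cases where precision and recall should not count equally; `beta=2` is an arbitrary choice that favours recall. The labels are made up for the example.

```python
import numpy as np
from sklearn.metrics import f1_score, fbeta_score, precision_score, recall_score

labels = ["setosa", "versicolor", "virginica"]

# Hypothetical true and predicted labels (one versicolor is mistaken for virginica)
y_true = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
y_pred = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

precision = precision_score(y_true, y_pred, labels=labels, average=None)
recall = recall_score(y_true, y_pred, labels=labels, average=None)

# F1 per class, written out as the harmonic mean of precision and recall
manual_f1 = 2 * precision * recall / (precision + recall)

# The same values from scikit-learn
f1 = f1_score(y_true, y_pred, labels=labels, average=None)

# F-beta with beta=2 weights recall more heavily than precision
f2 = fbeta_score(y_true, y_pred, beta=2, labels=labels, average=None)

print("F1 (manual): ", manual_f1)
print("F1 (sklearn):", f1)
print("F2 (sklearn):", f2)
```

With `beta=2`, versicolor (high precision, low recall) scores lower than its F1, while virginica (lower precision, full recall) scores higher.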

## Logistic Regression example

In [None]:
# Import algorithm
from sklearn.linear_model import LogisticRegression

# select features & target
X = dataset.drop('species', axis=1)
y = dataset['species']

# split the dataset
X_train_val, X_test, y_train_val, y_test = train_test_split(X, 
                                                            y, 
                                                            test_size=0.2, 
                                                            random_state=0)

X_train, X_val, y_train, y_val = train_test_split(X_train_val, 
                                                  y_train_val, 
                                                  test_size=0.25, 
                                                  random_state=0)

# Train the classifier
# max_iter is raised from the default of 100 to give the solver room to converge
log_classifier = LogisticRegression(max_iter=200)
log_classifier.fit(X_train, y_train)

# Validate & Adjust
predictions = log_classifier.predict(X_val)
print('\n----- Validation -----')
print('Accuracy:', accuracy_score(y_val, predictions))
print('Precision:', list(zip(log_classifier.classes_, precision_score(y_val, predictions, average=None))))
print('Recall:', list(zip(log_classifier.classes_, recall_score(y_val, predictions, average=None))))

# Hyper-parameter tuning would go here; print the current settings
print(log_classifier)

# Evaluate on the test set (only once tuning is finished)
predictions = log_classifier.predict(X_test)
print('\n----- Testing -----')
print('Accuracy:', accuracy_score(y_test, predictions))
print('Precision:', list(zip(log_classifier.classes_, precision_score(y_test, predictions, average=None))))
print('Recall:', list(zip(log_classifier.classes_, recall_score(y_test, predictions, average=None))))