## Classical Machine Learning
Prerequisites:
- Python experience, including a basic understanding of Python syntax, loops, conditional statements, functions, and data types
- Some background in numerical computing (MATLAB, R, NumPy, or similar) and an understanding of vectors, matrices, and relevant linear algebra concepts

Goals for this session:
- Introduce the sklearn API and common practices in the field of machine learning
- Provide intuition for various classical machine learning techniques regarding their complexity, performance, and effectiveness in the context of different applications
- Explore concepts such as feature selection, model selection, hyperparameter tuning, performance metrics, and the bias/variance tradeoff
- Apply this knowledge to a real-world dataset in a competition-style activity


In [1]:
# For Colab
# !git clone https://github.com/WAT-ai/onboarding-tutorials-2023
TRAIN_DATA_PATH = "/content/onboarding-tutorials-2023/data/cml_training_data.csv"
TEST_DATA_PATH = "/content/onboarding-tutorials-2023/data/cml_testing_data.csv"

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import precision_score, recall_score, f1_score

Classifier shortlist
- LogisticRegression
- RidgeClassifier
- SVC
- KNeighborsClassifier
- GaussianProcessClassifier (too slow)
- GaussianNB
- DecisionTreeClassifier

In [3]:
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

In [4]:
# Process the data
# df_train = pd.read_csv(TRAIN_DATA_PATH)
# df_test = pd.read_csv(TEST_DATA_PATH, index_col=0)

df_train = pd.read_csv("../data/cml_training_data.csv")
df_test = pd.read_csv("../data/cml_testing_data.csv")


X_train = df_train.drop("labels", axis=1).values
y_train = df_train["labels"]

X_test = df_test.values

In [5]:
X_test.shape, X_train.shape

((2000, 12), (8000, 12))

# Train your model

In [6]:
params = {
    "class_weight": "balanced",
    "random_state": 87,
    "max_iter": 10_000
}
model = LogisticRegression(**params)
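The parameters above are fixed by hand. As a minimal sketch of hyperparameter tuning, the example below searches over the regularization strength `C` with `GridSearchCV`. The grid values, the `f1_macro` scoring choice, and the synthetic data are all assumptions made for illustration, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the tutorial data (assumption: 12 features, 4 classes).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_classes=4, random_state=0
)

# Same fixed settings as the cell above; only C is searched.
base_params = {"class_weight": "balanced", "random_state": 87, "max_iter": 10_000}
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}  # hypothetical grid

search = GridSearchCV(
    LogisticRegression(**base_params),
    param_grid,
    scoring="f1_macro",  # macro-averaged F1 across classes
    cv=5,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated macro F1: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` is a model refit on all of the supplied data with the best parameters, so it could be used in place of `model` in the cells that follow.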

In [9]:
trained_models = []
metrics = {}

name = model.__class__.__name__
print(f"Training {name}...")
metrics.setdefault("model_name", []).append(name)

model.fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
print(f"Model training accuracy: {train_acc*100:.2f}%")

Training LogisticRegression...
Model training accuracy: 87.01%
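Training accuracy is measured on data the model has already seen, so it tends to overstate how well the model generalizes. A common alternative is to hold out part of the labelled data and compute metrics on that instead. The sketch below does this with `train_test_split` and the `precision_score`, `recall_score`, and `f1_score` functions imported earlier. The synthetic dataset, the 80/20 split, and the `macro` averaging are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the tutorial data (assumption: 12 features, 4 classes).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=12, n_informative=6, n_classes=4, random_state=0
)

# Hold out 20% of the labelled data; stratify to keep class proportions similar.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=87
)

val_model = LogisticRegression(
    class_weight="balanced", random_state=87, max_iter=10_000
)
val_model.fit(X_fit, y_fit)
val_preds = val_model.predict(X_val)

# Macro averaging weights every class equally, regardless of its size.
val_metrics = {
    "train_accuracy": val_model.score(X_fit, y_fit),
    "val_accuracy": val_model.score(X_val, y_val),
    "val_precision": precision_score(y_val, val_preds, average="macro"),
    "val_recall": recall_score(y_val, val_preds, average="macro"),
    "val_f1": f1_score(y_val, val_preds, average="macro"),
}
for metric_name, value in val_metrics.items():
    print(f"{metric_name}: {value:.3f}")
```

A large gap between `train_accuracy` and `val_accuracy` is one sign of overfitting, which ties back to the bias/variance tradeoff listed in the session goals.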


In [10]:
preds = model.predict(X_test)

In [12]:
preds

array([3, 2, 0, ..., 1, 3, 0])

In [13]:
preds = model.predict(X_test)

df_preds = pd.DataFrame({
    "index": np.arange(X_test.shape[0]),
    "test_predictions": preds,
})

df_preds.to_csv("./test_predictions.csv", index=False)
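Before uploading, it can help to read the file back and confirm it has the expected shape and columns. The sketch below shows the idea on a small, hypothetical prediction array written to a temporary directory. In the notebook, the same checks would be run on `./test_predictions.csv`, with 2000 rows expected given the shape of `X_test`.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical stand-in predictions; in the notebook this would be `preds`.
demo_preds = np.array([3, 2, 0, 1, 3, 0])

df_demo = pd.DataFrame({
    "index": np.arange(demo_preds.shape[0]),
    "test_predictions": demo_preds,
})

# Write to a temporary location and read the file back, as a grader would.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "test_predictions.csv"
    df_demo.to_csv(out_path, index=False)
    df_check = pd.read_csv(out_path)

# Basic sanity checks on the round-tripped file.
assert list(df_check.columns) == ["index", "test_predictions"]
assert len(df_check) == len(demo_preds)
assert df_check["test_predictions"].notna().all()
assert df_check["index"].is_unique
print(df_check.head())
```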

Upload `test_predictions.csv` to Kaggle to enter your score.