## This is my work template for building ML models

The typical ML workflow is as follows:
1. Define the Problem
2. Data Ingestion
3. Data Splitting: The training set is used to train the model, the validation set is used to tune hyperparameters, and the test set is used to evaluate the model's performance.
4. Data Exploration
5. Data Preprocessing: Handle missing data, class imbalance, and outliers, and apply data transformations (e.g., normalization, encoding categorical variables).
6. Feature Engineering: Engineer new features from the existing data, or apply dimensionality reduction techniques such as PCA if needed.
7. Feature Selection
8. Modelling
9. Hyperparameter Tuning
10. Model Evaluation
11. Iteration
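
As a minimal sketch of step 3, the three-way split can be done with two calls to scikit-learn's `train_test_split`. The synthetic data and the 60/20/20 proportions below are assumptions chosen for illustration, not part of the template itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data purely for illustration: 100 rows, 3 features, balanced binary target.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = np.array([0, 1] * 50)

# First carve off 20% of the rows as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Then split the remainder 75/25, giving 60% train and 20% validation overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Splitting before exploration and preprocessing, as the list orders it, keeps statistics computed on the training set from leaking information about the test set.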

# Uncle Steve's Amazing Do-All function

It slices! It dices!

To streamline the evaluation of each dataset, let's create a function that takes in a dataset, the name of the target column, and the names of any columns to drop (because that's decided by the human), and then automates the rest:

- Converting the datatype of the target column if necessary
- One-hot encoding (OHE) any categorical features
- Splitting data into training and testing sets
- Training and evaluating all the models/ensembles
- Returning a table of the performance of all the models
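
The first two bullets can be sketched on a toy frame. The column names and values here are made up for illustration; only `astype('category').cat.codes` and `pd.get_dummies` reflect what the function itself relies on.

```python
import pandas as pd

# Toy frame; the columns and values are hypothetical.
df = pd.DataFrame({
    "Age": [25, 40, 31],
    "Housing": ["own", "rent", "own"],
    "Class": ["Good", "Bad", "Good"],
})

# Convert a string target to integer codes.
# Categories are sorted alphabetically, so "Bad" -> 0 and "Good" -> 1.
y = df["Class"].astype("category").cat.codes

# One-hot encode the remaining string columns; numeric columns pass through unchanged.
X = pd.get_dummies(df.drop(columns=["Class"]))

print(list(X.columns))  # ['Age', 'Housing_own', 'Housing_rent']
print(y.tolist())       # [1, 0, 1]
```

Because the codes follow alphabetical order, it is worth checking which class ends up as 1, since metrics such as F1, recall, and precision treat that class as the positive one by default.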

In [1]:
import datetime
print(datetime.datetime.now())

2023-10-08 17:56:46.008497


In [2]:
import pandas as pd
pd.show_versions(as_json=False)

import sklearn
sklearn.__version__


INSTALLED VERSIONS
------------------
commit           : 8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7
python           : 3.8.10.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.22631
machine          : AMD64
processor        : Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : English_Canada.1252

pandas           : 1.5.2
numpy            : 1.24.1
pytz             : 2022.7.1
dateutil         : 2.8.2
setuptools       : 56.0.0
pip              : 23.2.1
Cython           : 0.29.14
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 3.1.2
IPython          : 8.8.0
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli     

'1.2.0'

In [3]:
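# These commands assume a Debian/Ubuntu environment such as Colab;
# apt-get is not available on Windows.
!apt-get install swig -y
!pip install Cython numpy

# Sometimes you have to run the next command twice on Colab;
# I haven't figured out why.
!pip install auto-sklearn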

'apt-get' is not recognized as an internal or external command,
operable program or batch file.


Collecting auto-sklearn
  Using cached auto-sklearn-0.15.0.tar.gz (6.5 MB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'error'


  error: subprocess-exited-with-error
  
  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [17 lines of output]
      Traceback (most recent call last):
        File "C:\Users\james\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 353, in <module>
          main()
        File "C:\Users\james\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
        File "C:\Users\james\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 118, in get_requires_for_build_wheel
          return hook

In [None]:
import time

import pandas as pd
import sklearn.metrics

from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.ensemble import (
    RandomForestClassifier, StackingClassifier, VotingClassifier,
    BaggingClassifier, ExtraTreesClassifier, AdaBoostClassifier,
    GradientBoostingClassifier,
)

from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, recall_score, precision_score, roc_auc_score

import autosklearn.classification
import autosklearn.metrics

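# Helper function: preprocess, split, then train and evaluate every model
def do_all_for_dataset(dataset_name, df, target_col, drop_cols=None):
    # Avoid a mutable default argument and avoid modifying the caller's DataFrame
    drop_cols = list(drop_cols) if drop_cols else []
    df = df.copy()

    # If target_col is an object, convert to numbers
    if df[target_col].dtype == 'object':
        df[target_col] = df[target_col].astype('category').cat.codes

    # OHE all categorical columns, skipping the target and any columns to be dropped
    cat_cols = [c for c in df.select_dtypes(include=['object']).columns
                if c != target_col and c not in drop_cols]
    if len(cat_cols) > 0:
        df = pd.concat([df, pd.get_dummies(df[cat_cols])], axis=1)

    # Split into X and y (the original categorical columns are replaced by their dummies)
    X = df.drop(drop_cols + cat_cols + [target_col], axis=1)
    y = df[target_col]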

    # Split into training and testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

    print('Y (train) counts:')
    print(y_train.value_counts())
    print('Y (test) counts:')
    print(y_test.value_counts())
    
    nb = GaussianNB()   
    lr = LogisticRegression(random_state=42, solver='lbfgs', max_iter=5000)
    dt = DecisionTreeClassifier(random_state=42)
    knn = KNeighborsClassifier(n_neighbors=7)

    rf = RandomForestClassifier(random_state=42, n_estimators=200)
    ada = AdaBoostClassifier(random_state=42, n_estimators=200)

    # Have auto-sklearn optimize F1 rather than its default metric
    scorer = autosklearn.metrics.make_scorer(
        'f1_score',
        sklearn.metrics.f1_score
    )
    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=100,  # run auto-sklearn for at most 100 secs in total
        per_run_time_limit=15,  # spend at most 15 secs on each model training
        metric=scorer
    )


    est_list = [('DT', dt), ('LR', lr), ('NB', nb), ('RF', rf), ('ADA', ada)]
       
    dict_classifiers = {
        "LR": lr, 
        "NB": nb,
        "DT": dt,
        "KNN": knn,
        "Voting": VotingClassifier(estimators = est_list, voting='soft'),
        "Bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=200, random_state=42),
        "RF": rf,
        "ExtraTrees": ExtraTreesClassifier(random_state=42, n_estimators=200),
        "Adaboost": ada,
        "GBC": GradientBoostingClassifier(random_state=42, n_estimators=200),
        "Stacking": StackingClassifier(estimators=est_list, final_estimator=LogisticRegression()),
        "automl": automl,
    }
    
    model_results = list()
    
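    for model_name, model in dict_classifiers.items():
        # Time fitting plus prediction together
        start = time.time()
        y_pred = model.fit(X_train, y_train).predict(X_test)
        end = time.time()
        total = end - start

        # AUC needs scores, so use the positive-class probability, not hard labels
        # (this assumes a binary target, as do the default F1/recall/precision)
        y_score = model.predict_proba(X_test)[:, 1]

        accuracy       = accuracy_score(y_test, y_pred)
        f1             = f1_score(y_test, y_pred)
        recall         = recall_score(y_test, y_pred)
        precision      = precision_score(y_test, y_pred)
        roc_auc        = roc_auc_score(y_test, y_score)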
    
        # One-row table for this model (named `row` so it does not shadow the input `df`)
        row = pd.DataFrame({"Dataset"   : [dataset_name],
                            "Method"    : [model_name],
                            "Time"      : [total],
                            "Accuracy"  : [accuracy],
                            "Recall"    : [recall],
                            "Precision" : [precision],
                            "F1"        : [f1],
                            "AUC"       : [roc_auc],
                           })
        model_results.append(row)
   

    # Stack the per-model rows, then rank the models by F1 (best first)
    dataset_results = pd.concat(model_results, ignore_index=True)
    dataset_results = dataset_results.sort_values(by=['F1'], ascending=False)
    dataset_results['Rank'] = range(1, len(dataset_results) + 1)
    
    return dataset_results
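
One detail worth checking in an evaluation loop like the one above: `roc_auc_score` accepts either hard class labels or continuous scores, and the two generally give different values. The sketch below uses a tiny hand-made set of labels and probabilities (hypothetical values, not output from any model in this notebook) to show the difference.

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Hypothetical ground truth and predicted positive-class probabilities.
y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.1, 0.6, 0.8, 0.4, 0.9, 0.2]

# Hard labels at the usual 0.5 threshold: [0, 1, 1, 0, 1, 0]
y_pred = [int(p >= 0.5) for p in y_prob]

accuracy = accuracy_score(y_true, y_pred)   # 4 of 6 correct
f1 = f1_score(y_true, y_pred)               # precision = recall = 2/3

# AUC from probabilities uses the full ranking of the examples.
auc_scores = roc_auc_score(y_true, y_prob)

# AUC from hard labels only sees a single threshold.
auc_labels = roc_auc_score(y_true, y_pred)

print(round(accuracy, 3), round(f1, 3))              # 0.667 0.667
print(round(auc_scores, 3), round(auc_labels, 3))    # 0.889 0.667
```

For classifiers that expose `predict_proba`, passing the positive-class column to `roc_auc_score` is usually the more informative choice.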

In [None]:
# German Credit Example
results = []  # collects one results table per dataset
df = pd.read_csv('https://raw.githubusercontent.com/stepthom/869_course/main/data/GermanCredit.csv')
r = do_all_for_dataset('GermanCredit', df, target_col='Class', drop_cols=[])
results.append(r)
r

Now: