# Training Classifiers on a Categorical Dataset in Sklearn

## Objective

A common machine learning classification task is to load a CSV file, specify a classifier (SVM, KNN, tree, etc.), and select the training attributes (e.g. Outlook, Temperature, Humidity, Windy) and the target attribute (e.g. Play golf). Training attributes with categorical values (e.g. Outlook takes the values Rainy, Overcast, Sunny) need to be converted from text into numeric values, here using sklearn.preprocessing.LabelEncoder.

The Python TrainModel class below bundles these steps together.
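To make the encoding step concrete, here is a minimal sketch of what `LabelEncoder` does to a single categorical column. The list `outlook` is an illustrative stand-in for the Outlook column, not the full dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative values in the style of the Outlook column.
outlook = ["Rainy", "Rainy", "Overcast", "Sunny"]

encoder = LabelEncoder()
codes = encoder.fit_transform(outlook)

# classes_ holds the distinct labels in sorted order;
# each integer code is an index into that array.
print(encoder.classes_.tolist())
print(codes.tolist())
```

The codes follow the sorted order of the labels, so they imply an ordering (Overcast < Rainy < Sunny) that has no meaning in the data. Tree-based models are usually tolerant of this, while distance-based and linear models may be affected by it.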

In [2]:
import pandas as pd
df = pd.read_csv('weather_nominal.csv')
df

Unnamed: 0,Outlook,Temperature,Humidity,Windy,Play golf
0,Rainy,Hot,High,False,No
1,Rainy,Hot,High,True,No
2,Overcast,Hot,High,False,Yes
3,Sunny,Mild,High,False,Yes
4,Sunny,Cool,Normal,False,Yes
5,Sunny,Cool,Normal,True,No
6,Overcast,Cool,Normal,True,Yes
7,Rainy,Mild,High,False,No
8,Rainy,Cool,Normal,False,Yes
9,Sunny,Mild,Normal,False,Yes
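
The `Unnamed: 0` column in this output looks like a row index that was saved into the CSV. Since TrainModel uses every column except the target as a training attribute, that index would be fed to the classifier as a feature. A minimal sketch of one way to avoid this, assuming the first column of the file really is a saved index, is to pass `index_col=0` to `read_csv`. The inline `csv_text` below is a small stand-in for `weather_nominal.csv`; the outputs in the rest of this notebook were produced with the column left in.

```python
import io

import pandas as pd

# Stand-in for the first rows of weather_nominal.csv (header starts with an empty name).
csv_text = """,Outlook,Temperature,Humidity,Windy,Play golf
0,Rainy,Hot,High,False,No
1,Rainy,Hot,High,True,No
2,Overcast,Hot,High,False,Yes
"""

# Default read: the unnamed first column becomes a regular column "Unnamed: 0".
with_index_column = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column becomes the DataFrame index instead of a feature.
without_index_column = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_index_column.columns))
print(list(without_index_column.columns))
```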


## TrainModel

In [3]:
# %load TrainModel.py
from sklearn import preprocessing

class TrainModel:
    """Label-encode the text columns of a DataFrame and train an estimator on it."""
    def __init__(self, clf, df_raw, target):
        self.clf = clf
        self.encoders = {}  # column name -> fitted LabelEncoder
        self.df = self.transform(df_raw)
        self.target = target
    def transform(self, df_raw):
        # Replace every text (object dtype) column with integer codes
        df = df_raw.copy()
        for c in df:
            if df[c].dtype == 'object':
                le = preprocessing.LabelEncoder()
                df[c] = le.fit_transform(df[c].tolist())
                self.encoders[c] = le
        return df
    def get_train_x(self):
        # All columns except the target
        return self.df[self.get_train_x_names()]
    def get_train_y(self):
        return self.df[self.target].values
    def get_train_x_names(self):
        return [x for x in self.df.columns if x != self.target]
    def get_train_y_names(self):
        # Original class labels of the target, in encoded order
        return list(self.encoders[self.target].classes_)
    def run(self):
        self.clf.fit(self.get_train_x(), self.get_train_y())
    def predict(self):
        # Predicts on the training data itself, so this is not a held-out evaluation
        print('trained y', self.get_train_y())
        print('predict y', self.clf.predict(self.get_train_x()))


## Decision Tree

To use TrainModel, supply a sklearn classifier, the DataFrame, and the name of the target attribute; all remaining columns are used as training attributes.

In [4]:
from sklearn import tree
clf = tree.DecisionTreeClassifier(criterion='entropy')
train = TrainModel(clf, df, target=df.columns[-1])
train.run()
train.predict()

trained y [0 0 1 1 1 0 1 0 1 1 1 1 1 0]
predict y [0 0 1 1 1 0 1 0 1 1 1 1 1 0]
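
The printed predictions are integer codes. A minimal sketch of mapping them back to the original labels with `LabelEncoder.inverse_transform` follows. Here `target_encoder` is a stand-in for the encoder that TrainModel stores for the target column, assuming the target takes the values No and Yes, and `predicted_codes` is copied from the output above.

```python
from sklearn.preprocessing import LabelEncoder

# Stand-in for the encoder TrainModel keeps for the "Play golf" column.
target_encoder = LabelEncoder().fit(["No", "Yes"])

# Predicted codes copied from the decision tree output.
predicted_codes = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]

# inverse_transform turns each code back into its label (0 -> "No", 1 -> "Yes").
predicted_labels = target_encoder.inverse_transform(predicted_codes).tolist()
print(predicted_labels)
```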


## Linear Regression

In [5]:
from sklearn import linear_model
clf = linear_model.LinearRegression()
train = TrainModel(clf, df, target=df.columns[-1])
train.run()
train.predict()

trained y [0 0 1 1 1 0 1 0 1 1 1 1 1 0]
predict y [0.44342746 0.1540409  0.69483934 0.38305745 0.6292113  0.33982473
 0.84264849 0.63446933 0.88062317 1.01129503 0.97332035 0.59649464
 1.32307692 0.09367089]
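
LinearRegression is a regressor, so it returns continuous scores rather than class labels; some fall outside the range 0 to 1. One simple way to turn the scores into labels is to threshold them. The sketch below uses the values printed above and a cut-off of 0.5, which is an assumption made here for illustration and not part of the original notebook.

```python
import numpy as np

# Encoded target and regression scores copied from the output above.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0])
scores = np.array([
    0.44342746, 0.1540409, 0.69483934, 0.38305745, 0.6292113, 0.33982473,
    0.84264849, 0.63446933, 0.88062317, 1.01129503, 0.97332035, 0.59649464,
    1.32307692, 0.09367089,
])

threshold = 0.5  # assumed cut-off, chosen for illustration
y_pred = (scores >= threshold).astype(int)

# Number of training rows whose thresholded prediction matches the target.
n_correct = int((y_pred == y_true).sum())
print(y_pred.tolist())
print(n_correct, "of", len(y_true), "correct")
```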


## Logistic Regression

In [6]:
from sklearn import linear_model
clf = linear_model.LogisticRegression()
train = TrainModel(clf, df, target=df.columns[-1])
train.run()
train.predict()

trained y [0 0 1 1 1 0 1 0 1 1 1 1 1 0]
predict y [1 0 1 0 1 1 1 1 1 1 1 1 1 0]
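
Comparing the two printed arrays by eye gets tedious. As a sketch, `sklearn.metrics` can summarize them; the arrays below are copied from the logistic regression output above. Because the predictions are made on the training data, the score describes the fit to the training set rather than performance on unseen data.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Arrays copied from the logistic regression output.
y_true = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

# Fraction of rows where the prediction equals the target.
accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred)

print(accuracy)
print(matrix)
```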


## Support Vector Machine

In [7]:
from sklearn import svm
clf = svm.SVC()
train = TrainModel(clf, df, target=df.columns[-1])
train.run()
train.predict()

trained y [0 0 1 1 1 0 1 0 1 1 1 1 1 0]
predict y [1 1 1 1 1 1 1 1 1 1 1 1 1 1]


## Naive Bayes

### Gaussian

In [8]:
from sklearn import naive_bayes
clf = naive_bayes.GaussianNB()
train = TrainModel(clf, df, target=df.columns[-1])
train.run()
train.predict()

trained y [0 0 1 1 1 0 1 0 1 1 1 1 1 0]
predict y [0 0 1 0 1 1 1 0 1 1 1 1 1 0]


### Bernoulli

In [9]:
from sklearn import naive_bayes
clf = naive_bayes.BernoulliNB()
train = TrainModel(clf, df, target=df.columns[-1])
train.run()
train.predict()

trained y [0 0 1 1 1 0 1 0 1 1 1 1 1 0]
predict y [0 0 1 0 1 1 1 0 1 1 1 1 1 0]


### Multinomial

In [10]:
from sklearn import naive_bayes
clf = naive_bayes.MultinomialNB()
train = TrainModel(clf, df, target=df.columns[-1])
train.run()
train.predict()

trained y [0 0 1 1 1 0 1 0 1 1 1 1 1 0]
predict y [1 0 1 1 1 1 1 1 1 1 1 1 1 0]


## K Nearest Neighbors

In [11]:
from sklearn import neighbors
clf = neighbors.KNeighborsClassifier()
train = TrainModel(clf, df, target=df.columns[-1])
train.run()
train.predict()

trained y [0 0 1 1 1 0 1 0 1 1 1 1 1 0]
predict y [0 0 0 0 0 0 1 0 1 1 1 1 1 0]


## Random Forest

In [12]:
from sklearn import ensemble
clf = ensemble.RandomForestClassifier()
train = TrainModel(clf, df, target=df.columns[-1])
train.run()
train.predict()

trained y [0 0 1 1 1 0 1 0 1 1 1 1 1 0]
predict y [0 0 1 1 1 0 1 0 1 1 1 1 1 0]


## K Means Clustering

In [13]:
from sklearn import cluster
clf = cluster.KMeans(n_clusters=3, random_state=0)
train = TrainModel(clf, df, target=df.columns[-1])
train.run()
train.predict()

trained y [0 0 1 1 1 0 1 0 1 1 1 1 1 0]
predict y [0 0 0 1 2 2 0 1 2 1 1 0 0 1]
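
KMeans is an unsupervised method: `fit` ignores the target, and `predict` returns cluster ids (here 0 to 2, since `n_clusters=3`), not class labels, so the two arrays cannot be compared element by element. A minimal sketch of one way to inspect how clusters relate to classes is a cross-tabulation of the two printed arrays; the Series names are chosen here for illustration.

```python
import pandas as pd

# Arrays copied from the KMeans output above.
y_true = pd.Series([0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0], name="target")
clusters = pd.Series([0, 0, 0, 1, 2, 2, 0, 1, 2, 1, 1, 0, 0, 1], name="cluster")

# Rows are encoded target classes, columns are cluster ids,
# and each cell counts the rows falling into that combination.
table = pd.crosstab(y_true, clusters)
print(table)
```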


## Gradient Boosting

In [14]:
from sklearn import ensemble
clf = ensemble.GradientBoostingClassifier(n_estimators=100, learning_rate=1, max_depth=1, random_state=0)
train = TrainModel(clf, df, target=df.columns[-1])
train.run()
train.predict()

trained y [0 0 1 1 1 0 1 0 1 1 1 1 1 0]
predict y [0 0 1 1 1 0 1 0 1 1 1 1 1 0]


## Extreme Gradient Boosting

In [15]:
# git clone --recursive https://github.com/dmlc/xgboost
# cd xgboost; cp make/minimum.mk ./config.mk; make -j4
# cd python-package; sudo python setup.py install
from xgboost import XGBClassifier
clf = XGBClassifier()
train = TrainModel(clf, df, target=df.columns[-1])
train.run()
train.predict()

trained y [0 0 1 1 1 0 1 0 1 1 1 1 1 0]
predict y [0 0 0 1 1 1 1 1 1 1 1 0 1 0]




## Light Gradient Boosting

In [16]:
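import lightgbm
# NOTE: clf here is still the XGBClassifier fitted in the previous cell.
train = TrainModel(clf, df, target=df.columns[-1])

# LightGBM's native API trains with lightgbm.train() on a Dataset, not with fit(),
# so run() is not called here.
d_train = lightgbm.Dataset(train.get_train_x(), label=train.get_train_y())
params = {}
params['learning_rate'] = 0.003
params['boosting_type'] = 'gbdt'
params['objective'] = 'binary'
params['metric'] = 'binary_logloss'
params['sub_feature'] = 0.5
params['num_leaves'] = 10
params['min_data'] = 1
params['max_depth'] = 10
clf = lightgbm.train(params, d_train, 100)

# NOTE: train.clf was not reassigned, so predict() still uses the XGBClassifier;
# that is why the output below matches the XGBoost section.
# Set train.clf = clf to predict with the LightGBM booster instead
# (with a binary objective its predict() returns probabilities, not class labels).
train.predict()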

trained y [0 0 1 1 1 0 1 0 1 1 1 1 1 0]
predict y [0 0 0 1 1 1 1 1 1 1 1 0 1 0]




## CatBoost

In [17]:
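from catboost import CatBoostRegressor
# CatBoostRegressor treats the encoded target as a continuous value;
# CatBoostClassifier is the classification counterpart.
clf = CatBoostRegressor()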
train = TrainModel(clf, df, target=df.columns[-1])
train.run()
train.predict()

0:	learn: 0.7952409	total: 55.4ms	remaining: 55.3s
1:	learn: 0.7833689	total: 58.7ms	remaining: 29.3s
2:	learn: 0.7761413	total: 60ms	remaining: 19.9s
3:	learn: 0.7688179	total: 61.6ms	remaining: 15.3s
4:	learn: 0.7581547	total: 62.7ms	remaining: 12.5s
5:	learn: 0.7500263	total: 63.7ms	remaining: 10.6s
6:	learn: 0.7392812	total: 64.9ms	remaining: 9.21s
7:	learn: 0.7288340	total: 66.5ms	remaining: 8.24s
8:	learn: 0.7197826	total: 67.4ms	remaining: 7.42s
9:	learn: 0.7136919	total: 69.4ms	remaining: 6.87s
10:	learn: 0.7055841	total: 71.3ms	remaining: 6.41s
11:	learn: 0.7002862	total: 74.9ms	remaining: 6.17s
12:	learn: 0.6928357	total: 76.4ms	remaining: 5.8s
13:	learn: 0.6867423	total: 77.4ms	remaining: 5.45s
14:	learn: 0.6788206	total: 78.3ms	remaining: 5.14s
15:	learn: 0.6713538	total: 79.2ms	remaining: 4.87s
16:	learn: 0.6630735	total: 80.1ms	remaining: 4.63s
17:	learn: 0.6561555	total: 81.1ms	remaining: 4.43s
18:	learn: 0.6504448	total: 82.3ms	remaining: 4.25s
19:	learn: 0.6451427	tota