# Testing different ML models

### Imports
- We can use numpy darrays to store the matrix
- Then for easier access and better presentation we can convert it to a pandas DataFrame
- This will allow us the quick access methods of pandas, so that we can iteratively select each column from the dataframe and fit a model for it based on other columns
- Then we can import and use any of the sklearn clasifiers, I chose for these tests Naive Bayes, SVM and Random Forest

In [1]:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pylab as plt

### Matrix/Data Format
- Our matrix is likely going to be a large matrix of 0s and 1s where 1 denotes the existance of this attribute for a specific document
- The dimensionality of the matrix will be the number of attributes we predefine for each bot use case (by the number of documents we have available)

In [2]:
# generate a random matrix of 1s and 0s of dimensionality 10000 x 5
matrix = np.random.randint(2, size=(10000, 5))

In [3]:
matrix

array([[0, 0, 1, 1, 0],
       [0, 0, 0, 0, 1],
       [1, 0, 1, 0, 0],
       ...,
       [0, 0, 0, 0, 0],
       [1, 1, 1, 1, 0],
       [1, 0, 1, 1, 1]])

In [4]:
# convert to data frame and assign generic attribute/entity names
df = pd.DataFrame(matrix, columns=['A', 'B', 'C', 'D', 'E'])

In [5]:
df.head()

Unnamed: 0,A,B,C,D,E
0,0,0,1,1,0
1,0,0,0,0,1
2,1,0,1,0,0
3,0,0,1,0,1
4,1,0,0,1,0


## Method

- For the models to work we need to have a separation of the input data X and the training output data Y
- The main problem arises from the fact that we do not know how many attributes will be provided at a time to use (i.e. predict the value of attribute 'D' based only on the known existence of attributes 'A' and 'B')
- Hence, the current solution I propose is to train models for each of the attributes by using as input data all the other attributes and use means to fill in those values which are missing (I'll explain further later)
- As shown below, this would mean, for example, that we take the known values of columns A, B, D, E from the training data and fit the model for the attribute 'C'
- This way we will have models for predicting the probability of existence for each and every attribute on its ow

In [6]:
# separate the input/output distinction in the training data
X, y = df[['A', 'B', 'D', 'E']], df['C']

### Naive Bayes
- Benefit of this model is that it simple and straightforward, using only conditional probabilities

In [7]:
clf = GaussianNB()
# fit the classifier on the training dataset
clf.fit(X, y)

# predict the Pr(C = ['0' or '1'] | A, B, D, E)
print(clf.predict_proba([[1, 1, 1, 1]]))

[[0.51000558 0.48999442]]


### SVM
- SVM or SVC (support vector classifier) are effectively at fitting the data and separating them into clear classfication labels, especially when only predicting 0 or 1 like in this case

In [8]:
# clf = SVC()
clf.fit(X, y)

# SVM does not support probability prediction, only the value directly
print(clf.predict([[1, 0, 0, 1]]))
print(clf.predict([[0, 1, 1, 0]]))

[1]
[0]


### Random Forest
- Random Forests are proably the most human-understandable approach since they simple use a number of decision trees and splits based on the values/existence of other attributes.

In [9]:
clf = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0)
clf.fit(X, y)

print(clf.predict_proba([[1, 1, 1, 1]]))

[[0.52614083 0.47385917]]


### Testing the execution time for Random Forest for 10000 docs with 5 attributes

In [10]:
# I wanted to test this to simply validate that the training of these models will be extremely fast since
# we are only dealing with a limited dimensionality space (i.e. max 15-20 attributes) and values of 0 and 1
%time

clf = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0)
columns = ['A', 'B', 'C', 'D', 'E']
for col in columns:
    temp = columns.copy()
    temp.remove(col)
    X, y = df[temp], df[col]
    clf.fit(X, y)


CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 5.96 µs


## Dealing with the problem of not knowing the values since the bot is an iterative experience (one attribute value after another is obtained)

- The problem is the fact that at the start we will get either a simple message "I'd like to create a contract with this and that value of this 'attribute'..." or just "I'd like to create a new contract"
- Hence, we need to decide which attribute to ask for (either in which order or based on the limited number of given attribute values)
- One solution to this is to train models for each subset of input variables (i.e. train models for only knowing attributes 'A', or 'B', or ['A', 'C'], or ['B', 'C', 'D']
- However, while this may produce better results, this will result in an inordinate number of models that need to be trained and retrained (based on my understanding, this will be a factorial expansion, meaning that if there are 10 attributes, we will have to train 10! models which is unreasonable
- The other more appropriate solution to this problem is what is often done in ML and that is to use the mean values of the missing attributes with the idea being to fill their values with the most neutral ones for the model (shown below)

In [11]:
df.mean()

A    0.4978
B    0.5004
C    0.4942
D    0.4895
E    0.5004
dtype: float64

In [12]:
# Predicting the probability for C based only on the mean values of other attributes
print(clf.predict_proba([[0.5106, 0.4937, 0.4953, 0.5033]]))

[[0.50010003 0.49989997]]


As shown above, the value predicted when using only means will be close to the mean of the original attribute values

### Problems with the 'mean values' solution

- The known problems with this solution is that it essentially assumes a normal distribution of the input values as well as a relative level of inter-independence
- However I think that this will not be an issue even if certain attributes are always present and the individual models for each attribute should ensure that if A and B are always together, then whenever A==1, the model should predict B=1 as well

### Testing out Random Forest classifier with actual values instead of 0s and 1s... (to be done later)

- This turned out to be a much more difficult problem than I assumed at first since I need to figure out and research which kind of approach to data labeling and which ML algorithm approach to actually use for this
- I'll get back to it next week