<h1>Classifier Building in Scikit-learn</h1>

This is part of my article on <a href="https://theactivereader.medium.com/?source=home---------2---------------------c40960af_8c5a_44ca_96fb_32b14f8bf3e7-------2">Salam</a>; you can read the full article there.

<p>This script gives a very basic, step-by-step walkthrough of the Naive Bayes algorithm
using the scikit-learn Python library.<br>
The library includes several Naive Bayes classifiers; which one to use depends on the distribution of your data. The single feature here is categorical, so the script uses <code>CategoricalNB</code>.<br>
At the end of the script, I predict the class probabilities for each possible 'Evidence' category.
</p>

This example uses a dummy dataset with two columns: Evidence and Cancer.

In [3]:
import pandas as pd
from sklearn import preprocessing
from sklearn.naive_bayes import CategoricalNB
import numpy as np

In [18]:
# create some data
evidence = pd.Series(['smoke', 'Obesity','Alcohol', 'Diet', 'Materials', 'smoke', 'Cancer_syndromes', 'Cancer_syndromes', 'Bacteria_and_parasites',
                     'Diet', 'Viruses', 'smoke', 'Alcohol', 'Alcohol', 'smoke', 'Cancer_syndromes', 'Obesity'])

cancer = pd.Series(['yes', 'no','no', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'no', 'no', 'no', 'yes'])

In [19]:
# create pandas dataframe and rename columns
df = pd.concat([evidence, cancer], axis=1)
df.columns = ['Evidence', 'Cancer']

In [20]:
print(df.head())

    Evidence Cancer
0      smoke    yes
1    Obesity     no
2    Alcohol     no
3       Diet     no
4  Materials    yes


In [21]:
# create frequency table
print(pd.crosstab(index=df['Evidence'], columns=df['Cancer']))

Cancer                  no  yes
Evidence                       
Alcohol                  2    1
Bacteria_and_parasites   0    1
Cancer_syndromes         1    2
Diet                     1    1
Materials                0    1
Obesity                  1    1
Viruses                  0    1
smoke                    1    3
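
The frequency table above holds everything Naive Bayes needs. As a minimal sketch, the posterior can be worked out by hand from those counts, assuming additive (Laplace) smoothing with alpha = 1, which is the default in scikit-learn's `CategoricalNB`. The `counts` dictionary and the `posterior` helper are illustrative names, not part of the original script.

```python
# Counts copied from the frequency table above: {evidence: (no, yes)}
counts = {
    "Alcohol": (2, 1),
    "Bacteria_and_parasites": (0, 1),
    "Cancer_syndromes": (1, 2),
    "Diet": (1, 1),
    "Materials": (0, 1),
    "Obesity": (1, 1),
    "Viruses": (0, 1),
    "smoke": (1, 3),
}


def posterior(evidence, alpha=1.0):
    """Return (P(no | evidence), P(yes | evidence)) using additive smoothing."""
    n_categories = len(counts)
    # Total number of samples in each class: index 0 is 'no', index 1 is 'yes'
    class_totals = [sum(c[k] for c in counts.values()) for k in (0, 1)]
    n_samples = sum(class_totals)

    joint = []
    for k in (0, 1):
        prior = class_totals[k] / n_samples
        # Smoothed likelihood P(evidence | class)
        likelihood = (counts[evidence][k] + alpha) / (
            class_totals[k] + alpha * n_categories
        )
        joint.append(prior * likelihood)

    # Normalise so the two probabilities sum to 1
    total = sum(joint)
    return tuple(p / total for p in joint)


p_no, p_yes = posterior("smoke")
print(round(p_no, 2), round(p_yes, 2))  # prints: 0.27 0.73
```

The smoothing term is why categories with a zero count, such as Materials, still receive a non-zero probability for the 'no' class.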


In [22]:
# encode the features
oe_evidence = preprocessing.OrdinalEncoder()
le_cancer = preprocessing.LabelEncoder()

In [27]:
# ordinal-encode the evidence feature; the encoder expects a 2-D array, hence the reshape
evidence = oe_evidence.fit_transform(np.array(df['Evidence']).reshape(-1, 1))

In [28]:
evidence

array([[7.],
       [5.],
       [0.],
       [3.],
       [4.],
       [7.],
       [2.],
       [2.],
       [1.],
       [3.],
       [6.],
       [7.],
       [0.],
       [0.],
       [7.],
       [2.],
       [5.]])

In [31]:
# label-encode the target; fit_transform fits and encodes in one step
cancer_y = le_cancer.fit_transform(df['Cancer'])

In [32]:
cancer_y

array([1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1])

In [33]:
# view encodings
print(oe_evidence.categories_)

[array(['Alcohol', 'Bacteria_and_parasites', 'Cancer_syndromes', 'Diet',
       'Materials', 'Obesity', 'Viruses', 'smoke'], dtype=object)]


In [35]:
print(le_cancer.classes_)

['no' 'yes']
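
The integer codes are only useful if they can be mapped back to the original labels. A small sketch of that round trip, using a shortened, made-up list of evidence values and `OrdinalEncoder.inverse_transform`:

```python
import numpy as np
from sklearn import preprocessing

# A shortened, illustrative subset of the evidence values
labels = np.array(['smoke', 'Obesity', 'Alcohol', 'Diet']).reshape(-1, 1)

oe = preprocessing.OrdinalEncoder()
codes = oe.fit_transform(labels)        # strings -> float codes
decoded = oe.inverse_transform(codes)   # float codes -> strings

print(oe.categories_[0])  # categories in sorted order; index = code
print(codes.ravel())
print(decoded.ravel())
```

Categories are sorted before codes are assigned, and uppercase letters sort before lowercase ones, which is why 'smoke' receives the highest code in the full dataset.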


In [36]:
# set and fit classifier
clf = CategoricalNB()
clf.fit(evidence, cancer_y)

CategoricalNB()
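
To check what the classifier learned, its fitted attributes can be compared with the frequency table. The sketch below re-creates the encoded arrays shown in the earlier cells (copied by hand, so treat them as an assumption) and prints `class_count_` and `category_count_`.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Encoded evidence and target values copied from the outputs above
X = np.array([7, 5, 0, 3, 4, 7, 2, 2, 1, 3, 6, 7, 0, 0, 7, 2, 5]).reshape(-1, 1)
y = np.array([1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1])

clf = CategoricalNB().fit(X, y)

# Number of training samples per class, in the order [no, yes]
print(clf.class_count_)

# One array per feature: rows are classes, columns are evidence categories
print(clf.category_count_[0])

# Class priors, recovered from the stored log priors
print(np.exp(clf.class_log_prior_))
```

Each row of `category_count_[0]` should match one column of the frequency table.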

In [37]:
# collect the distinct encoded evidence values to predict on
evidence_classes = np.unique(evidence)

In [40]:
evidence_classes

array([0., 1., 2., 3., 4., 5., 6., 7.])

In [74]:
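# print the predicted probabilities for each evidence category;
# the two columns follow le_cancer.classes_, i.e. [P(no), P(yes)]
for name, code in zip(oe_evidence.categories_[0], evidence_classes):
    print("evidence", name, "-",
          "cancer probability:", np.round(clf.predict_proba([[code]]), 2))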

evidence Alcohol - cancer probability: [[0.53 0.47]]
evidence Bacteria_and_parasites - cancer probability: [[0.27 0.73]]
evidence Cancer_syndromes - cancer probability: [[0.33 0.67]]
evidence Diet - cancer probability: [[0.43 0.57]]
evidence Materials - cancer probability: [[0.27 0.73]]
evidence Obesity - cancer probability: [[0.43 0.57]]
evidence Viruses - cancer probability: [[0.27 0.73]]
evidence smoke - cancer probability: [[0.27 0.73]]
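
The probabilities above can also be turned into a predicted label. The following is a self-contained sketch of the whole pipeline, ending with a hypothetical helper, `predict_label`, that encodes an evidence name, calls `clf.predict`, and decodes the result with `LabelEncoder.inverse_transform`. The helper is illustrative and not part of the original script.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.naive_bayes import CategoricalNB

# Same dummy data as in the script above
evidence = np.array([
    'smoke', 'Obesity', 'Alcohol', 'Diet', 'Materials', 'smoke',
    'Cancer_syndromes', 'Cancer_syndromes', 'Bacteria_and_parasites',
    'Diet', 'Viruses', 'smoke', 'Alcohol', 'Alcohol', 'smoke',
    'Cancer_syndromes', 'Obesity',
]).reshape(-1, 1)
cancer = np.array([
    'yes', 'no', 'no', 'no', 'yes', 'yes', 'yes', 'yes', 'yes',
    'yes', 'yes', 'yes', 'yes', 'no', 'no', 'no', 'yes',
])

oe = preprocessing.OrdinalEncoder()
le = preprocessing.LabelEncoder()
X = oe.fit_transform(evidence)
y = le.fit_transform(cancer)

clf = CategoricalNB().fit(X, y)


def predict_label(evidence_name):
    """Hypothetical helper: return 'yes' or 'no' for one evidence name."""
    code = oe.transform([[evidence_name]])           # name -> encoded value
    return le.inverse_transform(clf.predict(code))[0]  # class index -> label


print(predict_label('smoke'))    # prints: yes
print(predict_label('Alcohol'))  # prints: no
```

With only 17 rows, these labels say more about the dummy data than about any real risk factor.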
