Welcome! In this tutorial, we'll be building a decision tree classifier using scikit-learn for the Human Chromosome 1 Dataset.

In [1]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

First, let's grab our data. This dataset contains 1200 experimentally validated binding sites, and 1200 random sequences. We'll need to split our data into a training set and a test set.

In [2]:
training_data = pd.read_csv('../datasets/humanSP1/humanSP1_train.csv', sep=',', header=None)
X = training_data.values[:, 0]  # column 0: the DNA sequences
Y = training_data.values[:, 1]  # column 1: the class labels, as the 1-D array scikit-learn expects

Depending on the classifier you're using, you might need to change the encoding of your input from letters (ACGT) to numbers. Here we map:

A -> 00
C -> 01
G -> 10
T -> 11

Now you know the origin of the hackseq logo ;)

In [3]:
# Two-bit code for each nucleotide (kept as strings, as in the mapping above)
encoding = {'A': "00", 'C': "01", 'G': "10", 'T': "11"}

updated_X = []
for line in X:
    # Any character other than A, C, G or T is skipped
    updated_X.append([encoding[character] for character in line if character in encoding])
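The cell above stores each code as a string such as "01". Depending on your scikit-learn version, string-valued features may be reinterpreted as decimal numbers or rejected outright, so it can be safer to build a numeric array explicitly. Below is a minimal sketch of the same two-bit mapping with each bit in its own integer column. The names `TWO_BIT` and `encode_two_bit` are hypothetical helpers, not part of the original notebook, and the sketch assumes all sequences have the same length and contain only A, C, G and T.

```python
import numpy as np

# Hypothetical helper: same A/C/G/T -> 00/01/10/11 mapping, stored as integer bits
TWO_BIT = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}

def encode_two_bit(sequences):
    """Turn equal-length DNA strings into a 2-D integer array, two columns per base."""
    rows = []
    for seq in sequences:
        bits = []
        for base in seq:
            # Raises KeyError on unexpected characters (e.g. 'N') instead of skipping them
            bits.extend(TWO_BIT[base])
        rows.append(bits)
    return np.array(rows, dtype=int)

encoded = encode_two_bit(["ACGT", "TTAA"])
print(encoded)
```

Unlike the loop above, this version fails loudly on unknown characters, which keeps every row the same width.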

Finally, we will use scikit-learn's train_test_split function to split our dataset into 70% training and 30% testing:

In [4]:
X_train, X_test, y_train, y_test = train_test_split(updated_X, Y, test_size = 0.3, random_state = 100)
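To see what the 70/30 split does, here is a small sketch on made-up data. The arrays `toy_X` and `toy_y` are invented for illustration and have nothing to do with the real dataset; only `train_test_split` and its `test_size` and `random_state` parameters come from the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each, and one label per sample
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.array(["binding site"] * 5 + ["non-binding site"] * 5)

# Same arguments as in the tutorial: 30% held out, fixed seed for reproducibility
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=100
)

print(toy_X_train.shape, toy_X_test.shape)
```

Fixing `random_state` means the same rows end up in the test set on every run, which makes scores comparable between experiments.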

In [5]:
tfbs_classifier = tree.DecisionTreeClassifier()
tfbs_classifier = tfbs_classifier.fit(X_train, y_train)
print("Data is trained!")

Data is trained!


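[Here](https://medium.com/greyatom/decision-trees-a-simple-way-to-visualize-a-decision-dc506a403aeb) is a great in-depth explanation of decision trees. At each node, a decision tree picks the split that most reduces the impurity of the labels. One common impurity measure is the information theory concept of "entropy", where the best split is the one that maximizes the information gained. Note that scikit-learn's DecisionTreeClassifier uses Gini impurity by default; pass criterion='entropy' to use entropy instead. Now let's try running our test data: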

In [6]:
y_pred = tfbs_classifier.predict(X_test)
print(y_pred)

['non-binding site' 'non-binding site' 'binding site' 'non-binding site'
 'non-binding site' 'binding site' 'non-binding site' 'binding site'
 'binding site' 'binding site' 'binding site' 'binding site'
 'binding site' 'binding site' 'non-binding site' 'binding site'
 'binding site' 'non-binding site' 'non-binding site' 'binding site'
 'non-binding site' 'binding site' 'binding site' 'binding site'
 'binding site' 'binding site' 'non-binding site' 'non-binding site'
 'non-binding site' 'binding site' 'binding site' 'binding site'
 'binding site' 'binding site' 'non-binding site' 'non-binding site'
 'non-binding site' 'non-binding site' 'binding site' 'non-binding site'
 'binding site' 'non-binding site' 'non-binding site' 'non-binding site'
 'binding site' 'binding site' 'binding site' 'non-binding site'
 'binding site' 'binding site' 'non-binding site' 'non-binding site'
 'binding site' 'binding site' 'non-binding site' 'non-binding site'
 'non-binding site' 'non-binding site' ...
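As a side note on how the tree arrives at predictions like these, the sketch below scores a candidate split with entropy and information gain, and also computes Gini impurity for comparison. The functions and the toy label lists are hypothetical and written for illustration; this is not how scikit-learn implements its splitter internally.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
    return entropy(parent) - weighted

# Made-up node with 4 labels of each class, and one candidate split of it
parent = ["binding site"] * 4 + ["non-binding site"] * 4
left = ["binding site"] * 3 + ["non-binding site"]
right = ["binding site"] + ["non-binding site"] * 3

print(entropy(parent), gini(parent), information_gain(parent, left, right))
```

A perfectly mixed two-class node has the highest possible impurity, and a split is useful to the extent that its children are purer than the parent.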

Let's see how well our predictions compare with the true labels.
There are a number of metrics we can use; we'll use the F1 score. [Here](https://towardsdatascience.com/precision-vs-recall-386cf9f89488) is an introduction to the F1 score, including why we want to use it over accuracy.
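To make the metric concrete, here is a minimal sketch that computes a macro-averaged F1 score by hand: the F1 score (the harmonic mean of precision and recall) is computed for each class separately, and the per-class scores are then averaged. The helper names and the toy labels are made up for illustration.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score when `positive` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Made-up labels: one binding site is wrongly predicted as non-binding
toy_true = ["binding site", "binding site", "non-binding site", "non-binding site"]
toy_pred = ["binding site", "non-binding site", "non-binding site", "non-binding site"]

toy_score = macro_f1(toy_true, toy_pred)
print(toy_score)
```

The cell below asks scikit-learn for the same kind of average through average='macro'.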

In [7]:
f1_score(y_test, y_pred, average='macro')

0.6992481203007519

Not bad, but I'm sure with some optimizations we can improve our score. Now let's talk about rules.

Rules take the form of an {IF:THEN} expression. For example:

{IF 'red' AND 'octagon' THEN 'stop-sign'}

{IF 'salary' < 70,000 AND 1 yr < 'time_employed' < 3 yrs AND 'last_promotion' == null THEN 'employee_quits'}
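The two example rules can be written directly as code. This is only an illustrative sketch: the dictionary keys and function names are assumptions chosen to mirror the rules above, and a missing promotion is represented with Python's `None`.

```python
def is_stop_sign(sign):
    """IF 'red' AND 'octagon' THEN 'stop-sign'."""
    return sign["color"] == "red" and sign["shape"] == "octagon"

def employee_quits(employee):
    """IF salary < 70,000 AND 1 < years employed < 3 AND no promotion THEN quits."""
    return (
        employee["salary"] < 70_000
        and 1 < employee["time_employed"] < 3
        and employee["last_promotion"] is None
    )

print(is_stop_sign({"color": "red", "shape": "octagon"}))
print(employee_quits({"salary": 65_000, "time_employed": 2, "last_promotion": None}))
```

Each rule is a conjunction of simple tests on features, which is the same shape as a path from the root of a decision tree down to one of its leaves.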

Decision trees are great for extracting rules. Your goal is to try to extract some rules from this dataset.
[Here](https://scikit-learn.org/stable/modules/tree.html#classification) is the scikit-learn guide to decision tree classification, which also covers exporting a fitted tree in a readable form.
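As a starting point, here is a minimal sketch that uses scikit-learn's `export_text` to print a fitted tree as nested IF/ELSE conditions. It runs on a made-up version of the stop-sign example rather than on the binding-site data; the feature names `is_red` and `is_octagon` and the toy rows are assumptions for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data: columns are [is_red, is_octagon]; only red octagons are stop signs
toy_X = [[1, 1], [1, 0], [0, 1], [0, 0]]
toy_y = ["stop-sign", "other", "other", "other"]

toy_tree = DecisionTreeClassifier(random_state=0).fit(toy_X, toy_y)

# Each root-to-leaf path in the printed text reads as one IF ... THEN rule
rules = export_text(toy_tree, feature_names=["is_red", "is_octagon"])
print(rules)
```

For the tutorial's classifier you would pass `tfbs_classifier` instead, with one feature name per sequence position. Limiting the tree depth (for example with `max_depth`) tends to produce shorter, more readable rules.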