# Machine learning specialization: LaTeX example

First, import the dataset.

In [3]:
import pandas as pd

df = pd.read_csv("dataset_latex.csv")

df.head()

The dataset records the results of compiling a LaTeX document under different compilation options. Each row holds one set of options and a `valid` flag that says whether the resulting document is valid with respect to its number of pages.

The next step is to split the columns of the dataset into two parts.

In [4]:
X = df.drop(columns=["valid"])
y = df["valid"]

`X` holds the compilation options, which is the data we always know. `y` holds the value that has to be predicted from `X`, usually called the label.

Then, we need to split the rows of the dataset into two sets. One will be used to train the machine learning model, and the other will be used to verify how well the model predicts the label on new data.
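
As a minimal sketch of this column split, the toy frame below stands in for the real dataset. The columns `option_a` and `option_b` and all the values are hypothetical; only the `valid` column name comes from the notebook.

```python
import pandas as pd

# Toy stand-in for dataset_latex.csv; option_a and option_b are hypothetical columns.
toy = pd.DataFrame({
    "option_a": [0, 1, 1, 0],
    "option_b": [0.75, 0.85, 0.95, 0.75],
    "valid": [True, False, True, True],
})

X_toy = toy.drop(columns=["valid"])  # features: every column except the label
y_toy = toy["valid"]                 # label: the column to predict

print(list(X_toy.columns))
print(y_toy.value_counts().to_dict())  # class balance of the label
```

Checking the class balance at this point is worthwhile, because a strongly unbalanced label makes accuracy harder to interpret later on.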

In [5]:
from sklearn.model_selection import train_test_split

test_size=0.3
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)
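
The split above is random, so every run of the notebook produces different train and test sets. The sketch below shows one way to make the split repeatable and to keep the class ratio the same in both parts, using the `random_state` and `stratify` parameters of `train_test_split`. The data is synthetic and the seed value is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 2 features, 70 % positive labels.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([True] * 70 + [False] * 30)

# random_state makes the split repeatable; stratify keeps the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())  # share of positive labels in each part
```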

Now that we have the data all set up, here comes the machine learning part.

We will use a decision tree classifier.

In [7]:
from sklearn import tree

clf = tree.DecisionTreeClassifier(max_depth=4)

The algorithm is now configured and is ready to be trained on the data.

In [8]:
clf.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

The model can now predict, for any LaTeX compilation configuration, whether compiling the subject document with it will give a valid result.

In [9]:
clf.predict(X_test)

array([False, False,  True,  True,  True, False,  True, False,  True,
        True,  True, False,  True,  True,  True,  True,  True, False,
        True,  True,  True,  True, False, False, False,  True,  True,
        True, False,  True,  True,  True,  True,  True,  True,  True,
       False,  True, False,  True,  True,  True, False, False,  True,
        True, False,  True, False, False,  True, False, False, False,
        True,  True,  True,  True,  True,  True,  True, False, False,
       False,  True, False,  True,  True, False,  True, False, False,
        True,  True, False,  True,  True,  True, False,  True, False,
       False, False, False, False, False, False,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True, False, False, False, False,  True, False,  True,  True,
       False,  True,  True,  True, False,  True,  True,  True,  True,
        True, False,  True, False])

We can measure how good the predictions are by computing the accuracy of the model on the test set.

In [10]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, clf.predict(X_test))

0.8677685950413223
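
Accuracy is simply the share of test rows whose prediction matches the true label. The score above is consistent with 105 correct predictions out of the 121 test rows. The sketch below checks this definition against `accuracy_score` on a handful of hypothetical labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions, for illustration only.
y_true = np.array([True, True, False, True, False])
y_pred = np.array([True, False, False, True, True])

manual = np.mean(y_true == y_pred)        # share of matching entries
library = accuracy_score(y_true, y_pred)  # same quantity computed by scikit-learn

print(manual, library)
```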

One perk of decision trees: they are interpretable.

We can render the tree to visualize it.

In [11]:
import graphviz

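def print_tree(clf, f_names, name):
    # Export the fitted tree to Graphviz DOT format, then render it to a file.
    dot_data = tree.export_graphviz(clf, out_file=None,
                                    feature_names=f_names,
                                    filled=True, rounded=True,
                                    special_characters=True)
    graph = graphviz.Source(dot_data)
    graph.render(name)  # saves the DOT source as `name` and the rendering next to it (PDF by default)

print_tree(clf, X_train.columns.values, "tree")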
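
When Graphviz is not available, a plain-text rendering can serve the same purpose. The sketch below uses `sklearn.tree.export_text` on a small tree trained on synthetic data; the feature names `f0` to `f3` are hypothetical and the tree is not the one trained above.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; the feature names are hypothetical.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
names = ["f0", "f1", "f2", "f3"]

demo_clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# export_text returns the tree as an indented plain-text string.
text = export_text(demo_clf, feature_names=names)
print(text)
```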

We can create a set of rules for which the compilation will give a valid result (according to the model).

In [14]:
from sklearn.tree import _tree


def tree_to_rules_valid(tree, feature_names):
    # Print one line per leaf that predicts "valid": the conjunction of the tests on the path to it.
    tree_ = tree.tree_
    # Name of the feature tested at each node; leaves test no feature.
    feature_name = [
        feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!"
        for i in tree_.feature
    ]

    def recurse(node, previous_rules):
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            # Internal node: follow both branches, adding the test taken to the rule list.
            name = feature_name[node]
            threshold = tree_.threshold[node]
            recurse(tree_.children_left[node], previous_rules + [name + " <= " + str(threshold)])
            recurse(tree_.children_right[node], previous_rules + [name + " > " + str(threshold)])
        else:
            # Leaf: value[node][0] holds the per-class totals, ordered as tree.classes_ (False, then True).
            if tree_.value[node][0][0] < tree_.value[node][0][1]:
                print(" & ".join(previous_rules))

    recurse(0, [])


tree_to_rules_valid(clf, X_train.columns)

JS_SCRIPTSIZE <= 0.5 & cserver_size <= 0.8499999940395355 & bref_size <= 0.949999988079071
JS_SCRIPTSIZE <= 0.5 & cserver_size <= 0.8499999940395355 & bref_size > 0.949999988079071 & cserver_size <= 0.75
JS_SCRIPTSIZE <= 0.5 & cserver_size <= 0.8499999940395355 & bref_size > 0.949999988079071 & cserver_size > 0.75
JS_SCRIPTSIZE <= 0.5 & cserver_size > 0.8499999940395355 & PL_FOOTNOTE <= 0.5 & vspace_bib <= 4.549999952316284
JS_SCRIPTSIZE <= 0.5 & cserver_size > 0.8499999940395355 & PL_FOOTNOTE > 0.5 & bref_size <= 0.949999988079071
JS_SCRIPTSIZE > 0.5 & cserver_size <= 0.75 & PL_FOOTNOTE <= 0.5 & PARAGRAPH_ACK <= 0.5
JS_SCRIPTSIZE > 0.5 & cserver_size <= 0.75 & PL_FOOTNOTE > 0.5 & bref_size <= 0.8499999940395355
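
Each printed rule is a conjunction of comparisons, so it can be used as a filter on a table of configurations. The sketch below passes the first rule to `DataFrame.query`; the three rows are hypothetical values chosen for illustration, and only the column names and thresholds come from the output above.

```python
import pandas as pd

# Hypothetical configurations; only the column names come from the rules above.
configs = pd.DataFrame({
    "JS_SCRIPTSIZE": [0, 1, 0],
    "cserver_size": [0.75, 0.75, 0.9],
    "bref_size": [0.9, 0.9, 1.0],
})

# First rule printed above, copied verbatim.
rule = "JS_SCRIPTSIZE <= 0.5 & cserver_size <= 0.8499999940395355 & bref_size <= 0.949999988079071"

# query() evaluates the rule row by row and keeps the rows that satisfy it.
matching = configs.query(rule)
print(matching)
```

Only the first row satisfies all three tests, so it is the only configuration this rule marks as valid.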


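Accuracy is a useful metric, but it can hide flaws: it does not distinguish between the two kinds of error, false positives and false negatives. The confusion matrix shows them separately.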

In [39]:
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, clf.predict(X_test)).ravel()

print("            ","Predicted True","  Predicted False")
print("Actual True   ",tp,"(TP)         ", fn,"(FN)")
print("Actual False  ",fp,"(FP)         ", tn,"(TN)")

             Predicted True   Predicted False
Actual True    71 (TP)          4 (FN)
Actual False   7 (FP)          39 (TN)


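What we learn from the confusion matrix:

accuracy = (TP + TN) / (TP + TN + FP + FN)

recall = TP / (TP + FN) -> flexibility: the share of actually valid configurations that the model accepts

precision = TP / (TP + FP) -> safety: the share of configurations predicted valid that are actually valid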
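
As a worked example, the three formulas can be evaluated on the counts from the confusion matrix printed above. This is plain arithmetic on those four numbers; scikit-learn's `recall_score` and `precision_score` compute the same quantities directly from the labels.

```python
# Counts copied from the confusion matrix printed above.
tp, fn, fp, tn = 71, 4, 7, 39

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 110 / 121
recall = tp / (tp + fn)                     # 71 / 75
precision = tp / (tp + fp)                  # 71 / 78

print(f"accuracy={accuracy:.3f} recall={recall:.3f} precision={precision:.3f}")
# prints: accuracy=0.909 recall=0.947 precision=0.910
```

Here recall is higher than precision: the model rarely rejects a valid configuration, but about 9 % of the configurations it accepts are in fact invalid.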

Do it again, this time varying the parameters.

In [23]:

# Parameters

test_size = 0.3

# Decision tree classifier parameters
# More details here: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
hyperparams = {
    "criterion": "gini",
    "splitter": "best",
    "max_features": None,
    "max_depth": None,
    "min_samples_split": 2,
    "min_samples_leaf": 1,
    "min_weight_fraction_leaf": 0.,
    "max_leaf_nodes": None,
    "class_weight": None,
    "random_state": None,
    "min_impurity_decrease": 1e-7,
    "presort": False,  # deprecated in scikit-learn 0.22 and removed in 0.24; drop this key on recent versions
}


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)

clf = tree.DecisionTreeClassifier(**hyperparams)
clf.fit(X_train, y_train)
accuracy_score(y_test, clf.predict(X_test))

0.9090909090909091
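
Because the train/test split and the tree are both random here (`random_state` is `None`), a single accuracy value changes from run to run, which makes it hard to tell whether a parameter change actually helped. One possible way to compare settings more reliably is cross-validation, sketched below for a few `max_depth` values. The data is synthetic, and the depth values, the seed, and the use of 5 folds are arbitrary choices, not something the notebook prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the LaTeX dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

scores = {}
for depth in (2, 4, None):  # None lets the tree grow until its leaves are pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # Mean accuracy over 5 folds, so every row is used for testing exactly once.
    scores[depth] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

for depth, score in scores.items():
    print(depth, round(score, 3))
```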