<a href="https://colab.research.google.com/github/bogden1/TechneTraining1/blob/main/Code/Techne_ML_workbook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Set up variables and install useful library code

In [None]:
import sys
data_source = "Github"
!git clone https://github.com/bogden1/TechneTraining1.git
sys.path.insert(0, 'TechneTraining1')
sys.path.insert(0, 'TechneTraining1/Code')
github_data = "TechneTraining1/Data/TopicModelling/"
import techne_library_code as tlc
from IPython.display import display
import math
TEST_MODE = False

# Topic Modelling

### Load word list from Topic Model built on 'regulation' related websites.
Display words in table.

A topic model is created from a corpus of text by an unsupervised machine learning algorithm. The process is non-deterministic, which means the results will differ every time it is run. Below is one set of results from running the MALLET software over the 'regulation' corpus.

The primary output is a list of topics (in this case 8, one row each) and a list of words most representative of that topic. A word can appear in more than one topic.

From this we can get a high level overview of a corpus of text.

In [None]:
topic_words = tlc.read_topic_list(github_data + "topic_list.txt")
TOPICS = len(topic_words)
topic_table = tlc.pd.DataFrame([v[0:12] for v in topic_words.values()])
topic_table

Stop words are used in Natural Language Processing to filter out very common words, or those which may negatively affect the results.
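
As a minimal sketch of what filtering means in practice, the snippet below drops stop words from a sentence. The sentence, the small stop-word set and the function name `remove_stop_words` are all made up for illustration; they are not part of `techne_library_code`.

```python
# Hypothetical stop-word set and sentence, for illustration only.
example_stop_words = {"the", "of", "and", "is", "a", "at"}

def remove_stop_words(text, stops):
    """Lower-case the text, split on whitespace, and drop any stop words."""
    return [token for token in text.lower().split() if token not in stops]

tokens = remove_stop_words("The regulation of freight is a priority", example_stop_words)
print(tokens)  # ['regulation', 'freight', 'priority']
```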

Below is a list of example stop words.

In [None]:
stop_words = ["i","or","which","of","and","is","the","a","you're","you","at","his","etc",'an','where','when']

### Exercise: Picking stop words

We can add more to this list by selecting from the following list. Which ones do you think might be worth filtering out?

In [None]:
additional_stops = ['medical','freight','pdf','plan','kb','regulation','risk']
stop_word_select = tlc.widgets.SelectMultiple(options=additional_stops, rows=len(additional_stops))
stop_word_select

In [None]:
for w in stop_word_select.value:
    print("Adding",w,"to stop words")
    stop_words.append(w)

As well as a list of topics and related words, MALLET also produces a topic breakdown for each document in the corpus.

Here we load the topic data and visualise some examples.

In [None]:
topics_per_doc = tlc.read_doc_topics(github_data + "topics_per_doc.txt")

The following plots show the proportion of each topic attributed to four different documents.

1. Top Left: One topic clearly dominates
2. Top Right: One dominant topic but a second topic is above the level of others
3. Bottom Left: Two topics clearly above others
4. Bottom Right: Topics close to being even

In [None]:
file_number_list = [212, 85, 9, 372]
fig, ax = tlc.plot_doc_topics(file_number_list, topics_per_doc, TOPICS)
tlc.pyplot.show()

### Exercise: From Topics to Classes

For this Machine Learning exercise we want to predict a Category of regulation (e.g. "Medicine" or "Rail"). The categories we may want to predict do not map one-to-one with the topics above. So first we need to create that mapping.

We start by defining a list of possible categories. Sometimes a topic that comes out of the model is worth ignoring (e.g. cookie information), but in this case all of them seem to be of interest.

In [None]:
topics_of_interest = [0,1,2,3,4,5,6,7]
class_names = {0:"General",1:"Medicine",2:"Rail",3:"Safety",4:"Pensions",5:"Education",6:"Other",-1:"Unclassified"}
topic_to_class = {}
if TEST_MODE:
    topic_to_class = {0:1,1:0,2:2,3:3,4:4,5:4,6:2,7:2}  #For testing

Using the dropdown and list selector below we can set the mapping from topic to Class (a term more commonly used in machine learning for category).

In [None]:
class_drop = tlc.widgets.Dropdown(options=[(v,k) for k,v in class_names.items()], 
                                  value=topics_of_interest[0], rows = len(class_names))
topic_select = tlc.widgets.SelectMultiple(options=[(w[0:5],t) for t,w in topic_words.items()],
                                      value = [k for k,v in topic_to_class.items() if v == class_drop.value],
                                      rows = TOPICS+1, height="100%")

button = tlc.widgets.Button(description="Update")
output = tlc.widgets.Output()

def on_button_clicked(b):
    for v in topic_select.value:
        topic_to_class[v] = class_drop.value
        print("Updated")

button.on_click(on_button_clicked)

### Exercise: Map topics to classes

In [None]:
V = tlc.widgets.VBox([class_drop, topic_select, button])
V

Choose a class in the dropdown, select the topics that belong to it in the list, and press Update. Repeat for each class until every topic has been mapped.

Display the resulting mappings from **Topic** to **Class**

In [None]:
tlc.pd.DataFrame([(",".join(topic_words[k][0:10]),class_names[v]) for k,v in topic_to_class.items()], columns=['Topic Words','Class'])

## Viewing document class proportions

In [None]:
classes_per_doc = tlc.topic_to_class_scores(topics_per_doc, topic_to_class)
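
The internals of `tlc.topic_to_class_scores` are not shown in this notebook. One plausible reading, sketched here under that assumption, is that each class score is the sum of the proportions of the topics mapped to it. The function name `sum_topics_into_classes` and the numbers are hypothetical.

```python
def sum_topics_into_classes(topic_scores, mapping, n_classes):
    """Add each topic's proportion to the class that topic is mapped to."""
    class_scores = [0.0] * n_classes
    for topic, score in enumerate(topic_scores):
        if topic in mapping:          # unmapped topics are ignored
            class_scores[mapping[topic]] += score
    return class_scores

# Hypothetical document with four topics mapped onto two classes.
example_doc_topics = [0.5, 0.25, 0.125, 0.125]
example_mapping = {0: 0, 1: 1, 2: 1, 3: 0}
example_class_scores = sum_topics_into_classes(example_doc_topics, example_mapping, 2)
print(example_class_scores)  # [0.625, 0.375]
```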

Generally every document contains a bit of each topic. Before visualising the class breakdown for our sample documents, we can filter out the lowest-scoring classes and focus on the primary class(es) by setting every value below a threshold to zero. We then **normalise** the remaining probabilities so that they add up to 1.
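
The two steps can be sketched in a few lines of plain Python. This illustrates the idea only and is not necessarily how `tlc.filter_topics_by_threshold` or the plotting code does it; the name `filter_and_normalise` and the scores are made up.

```python
def filter_and_normalise(scores, threshold):
    """Zero every score below the threshold, then rescale the rest to sum to 1."""
    kept = [s if s >= threshold else 0.0 for s in scores]
    total = sum(kept)
    if total == 0:                    # nothing survived the threshold
        return kept
    return [s / total for s in kept]

# Hypothetical class scores for one document.
example_filtered = filter_and_normalise([0.5, 0.25, 0.125, 0.125], 0.2)
print([round(s, 3) for s in example_filtered])
```

With a threshold of 0.2 the two smallest scores are dropped and the remaining two are rescaled, so they keep their 2:1 ratio.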

Run the next piece of code to create a slider to set the threshold, and then the following one will draw graphs. To try a different threshold, adjust the slider and rerun the graph code.

In [None]:
class_threshold = tlc.widgets.FloatSlider(0.10, min=0.10, max=0.65, step=0.05)
class_threshold

This is the graph code. It shows Classes above the threshold defined by the slider for the four documents previously visualised. 

In [None]:
filtered_classes_per_doc = tlc.filter_topics_by_threshold(classes_per_doc, class_threshold.value)
class_count = len(filtered_classes_per_doc['file_1.txt'])
fig, ax = tlc.plot_doc_topics(file_number_list, filtered_classes_per_doc, class_count, normalise=True)
tlc.pyplot.show()

# Representing text as numbers

The next stage is to use the results of the topic modelling to train a Supervised Learning algorithm.

Supervised Learning is learning by example. We label our data in advance with categories, and then the algorithm derives a function which will map an input data item to an output Class. The input data is usually termed **Features**, the outputs are often called **Responses**.



## Term Frequency-Inverse Document Frequency (TF-IDF)

For this exercise the features are the words in our documents. There is no semantic meaning attached; the words are no more than tokens. Imagine a spreadsheet where each row represents a document and each column represents a word from a fixed vocabulary.

One representation would be a 1 to indicate that a word appears in a document and a 0 if it doesn't. This is simple, but overly so. A better representation is TF-IDF, which stands for Term Frequency-Inverse Document Frequency. The gist is that a word which appears in most documents scores low, while a word which appears in few documents scores high (this is the Inverse Document Frequency part). The Term Frequency part increases a word's score when it appears many times in the document being scored.
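
To make the arithmetic concrete, here is a minimal sketch of the textbook form of TF-IDF on three tiny made-up documents. Libraries typically use a smoothed and normalised variant, so the numbers will not match what the notebook computes; the function name `tfidf` and the documents are assumptions for illustration.

```python
import math
from collections import Counter

# Three tiny hypothetical documents, already split into word tokens.
toy_docs = [
    ["rail", "passenger", "rail", "safety"],
    ["medicine", "safety", "licence"],
    ["rail", "freight", "safety"],
]

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for doc in toy_docs for word in set(doc))

def tfidf(doc):
    """Score = (count of word in this document) * log(N / document frequency)."""
    counts = Counter(doc)
    return {w: c * math.log(len(toy_docs) / doc_freq[w]) for w, c in counts.items()}

toy_scores = tfidf(toy_docs[0])
for word, score in sorted(toy_scores.items(), key=lambda kv: -kv[1]):
    print(word, round(score, 3))
```

In the first document "safety" scores zero because it appears in every document, while "passenger" scores highest because it appears in only one.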

Now that we've mapped topics and categories, it is time to prepare the text corpus (the document contents) for Machine Learning.

First we load the content from a file.

In [None]:
D4ML = tlc.MLData()
D4ML.load_content(github_data + "tm_file_contents.txt")
D4ML.set_classes(classes_per_doc)

We have some influence over the parameters used to define the TF-IDF representation of our corpus. How they are set can influence the results.

1. Features: how many distinct words from the corpus to use for the Vocabulary
2. Min Doc Frequency: minimum number of documents a word must appear in to be considered
3. Max Doc Frequency: maximum number of documents a word may appear in to be considered
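
As a rough sketch of how these three settings interact, the function below first keeps words whose document frequency lies between the minimum and maximum, then keeps the most frequent of those up to the feature limit. This is an assumption about the general technique, not a description of `D4ML.calc_tfidf`; the name `select_vocabulary` and the documents are hypothetical.

```python
from collections import Counter

def select_vocabulary(docs, max_features, min_df, max_df):
    """Keep words whose document frequency lies in [min_df, max_df],
    then keep the max_features most frequent of those."""
    doc_freq = Counter(word for doc in docs for word in set(doc))
    total_count = Counter(word for doc in docs for word in doc)
    eligible = [w for w in doc_freq if min_df <= doc_freq[w] <= max_df]
    # Most frequent first; ties broken alphabetically so the result is stable.
    eligible.sort(key=lambda w: (-total_count[w], w))
    return eligible[:max_features]

vocab_docs = [
    ["rail", "safety", "passenger"],
    ["rail", "safety", "freight"],
    ["medicine", "safety", "licence"],
]
toy_vocabulary = select_vocabulary(vocab_docs, max_features=2, min_df=2, max_df=2)
print(toy_vocabulary)  # ['rail']
```

Here "safety" is rejected for appearing in too many documents, and the words that appear only once are rejected for appearing in too few.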


In [None]:
FEATURES=1000
MIN_DOC_FREQ=4
MAX_DOC_FREQ=100

Calculate the TF-IDF scores for each document and prepare some of the data for Machine Learning

In [None]:
D4ML.add_stop_words(*stop_words)
D4ML.calc_tfidf(FEATURES, MIN_DOC_FREQ, MAX_DOC_FREQ)
print("Documents:",D4ML.TFIDF.shape[0],"\tWords",D4ML.TFIDF.shape[1])
training_features, training_classes, training_ids = D4ML.get_ml_data()
training_features.shape

Here we can see an example of the TF-IDF scores for a document. They are sorted in score order, the higher scores indicating a greater importance for that document.

In [None]:
EXAMPLE = 0   # 0 to 3 only
# Assuming a scikit-learn vectorizer: get_feature_names() was removed in scikit-learn 1.2
vocabulary = D4ML.vectorizer.get_feature_names_out()
example_row = D4ML.TFIDF[D4ML.file_to_idx['file_' + str(file_number_list[EXAMPLE]) + '.txt']]
example_table = tlc.pd.DataFrame(zip([vocabulary[w] for w in example_row.nonzero()[1]],
                                 [int(example_row[0,v]*1000)/1000 for i,v in enumerate(example_row.nonzero()[1])]),
                                 columns = ['word','tfidf']).sort_values(by='tfidf', ascending=False)
example_table

# Supervised Machine Learning

For this exercise we will use the Naive Bayes algorithm. It is called Bayes because it uses Bayesian probability (named after the Reverend Thomas Bayes who discovered it). It is Naive because it assumes that all words in a document are independent of each other, which is rarely true: in the sentence "my cat miaows when hungry", the word "cat" makes "miaows" far more likely. It seems like a bad assumption but actually works well in practice.

Bayesian probability is surprisingly simple and gives us the ability to flip probabilities around. From a corpus of text I can easily calculate the probability of the word "Passenger" appearing in a document about "Railways", and also in a document about "Medicines". What Bayes' rule does is allow me to then calculate the probability that a document is about "Railways" or "Medicines" given that it contains the word "Passenger". Very handy!
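
The flip can be worked through with a few lines of arithmetic. The probabilities below are invented purely to illustrate Bayes' rule and do not come from the corpus.

```python
# Hypothetical numbers, chosen only to illustrate the arithmetic.
prior = {"Railways": 0.7, "Medicines": 0.3}                 # P(class)
p_word_given_class = {"Railways": 0.4, "Medicines": 0.05}   # P("Passenger" | class)

# Bayes' rule: P(class | word) = P(word | class) * P(class) / P(word)
joint = {c: p_word_given_class[c] * prior[c] for c in prior}
p_word_overall = sum(joint.values())
posterior = {c: joint[c] / p_word_overall for c in joint}

for c, p in posterior.items():
    print(c, round(p, 3))
```

With these made-up inputs, seeing the word "Passenger" raises the probability of "Railways" from the prior of 0.7 to roughly 0.95.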

### Preparing training and test data

Before starting any Machine Learning we need to split our data into Training and Test datasets. The reason for this is that algorithms can appear to perform very well against the dataset they were trained with, but then perform very badly on new, unseen, data.

The algorithm will learn from the Training data and then we check its performance against the Test data.
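
The cell below relies on a library routine for the split. As a minimal sketch of what a stratified split does (keeping the class proportions the same in both parts), here is a plain-Python version; the function name `stratified_split` and the labels are hypothetical, and a library implementation will choose different documents.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, seed=42):
    """Return (train_indices, test_indices), sampling test_size of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * test_size)
        test.extend(indices[:cut])
        train.extend(indices[cut:])
    return sorted(train), sorted(test)

toy_labels = ["Rail"] * 8 + ["Medicine"] * 2
train_idx, test_idx = stratified_split(toy_labels, test_size=0.5)
print(len(train_idx), len(test_idx))  # 5 5
```

Stratifying matters for a skewed corpus: without it, a rare class could end up entirely in one of the two parts.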

In [None]:
TEST_PROPORTION = 0.6
X_train, X_test, \
y_train, y_test, \
x_train_ids, x_test_ids = tlc.train_test_split(training_features, training_classes, training_ids,
                                               test_size = TEST_PROPORTION, 
                                               random_state=42, stratify=training_classes)

### Training the Naive Bayes model

Now we **fit** a Naive Bayes model to the training data. Two lines of code and it is done!

In [None]:
model = tlc.BernoulliNB()
model.fit(X_train, y_train)

### Optional: Under the hood of Naive Bayes
One nice feature of Naive Bayes is we can see exactly what is going on internally and could recreate its results with a calculator, if we so wished.

The first part of the calculation is called the prior and is the probability of each class without any further information (i.e. without seeing the document content).

This corpus is heavily skewed, so "Rail" is dominant.

In [None]:
tlc.pyplot.bar(height=[math.exp(p) for p in model.class_log_prior_], 
               x=[x for x in range(len(model.class_log_prior_))])
# Take tick positions and labels from the fitted model rather than assuming five classes
tlc.pyplot.xticks(range(len(model.classes_)), [class_names[c] for c in model.classes_])
tlc.pyplot.show()

We can also take a class and see which words have the highest probability within that class.

In [None]:
C = 1
N = 10
print("Class:",class_names[C])
topN = (-model.feature_log_prob_[C]).argsort()[0:N]
words = [vocabulary[w] for w in topN]
probs = tlc.np.exp(model.feature_log_prob_[C,topN])
tlc.pyplot.bar(height=probs, 
               x=[x for x in range(len(words))])
tlc.pyplot.xticks(range(len(words)),words, rotation=45)
tlc.pyplot.show()

We can then choose a word and see the probability of each class for that word. The final result is a combination of this and the prior.

Since the prior is heavily in favour of one class, the word probabilities need to be strongly in favour of another to change the result.
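
How the prior and the word probabilities combine can be sketched by hand. The example below follows the Bernoulli form of Naive Bayes, where every vocabulary word contributes either its probability of being present or of being absent, and everything is added in log space. All of the numbers and the names `log_score` and `classify` are invented for illustration.

```python
import math

# Hypothetical model with two classes and a three-word vocabulary.
log_prior = {"Rail": math.log(0.8), "Medicine": math.log(0.2)}
# P(word appears in a document | class)
p_present = {
    "Rail":     {"passenger": 0.6, "dose": 0.05, "safety": 0.5},
    "Medicine": {"passenger": 0.1, "dose": 0.7,  "safety": 0.5},
}

def log_score(present_words, cls):
    """Log prior plus, for every vocabulary word, log P(present) or log P(absent)."""
    total = log_prior[cls]
    for word, p in p_present[cls].items():
        total += math.log(p if word in present_words else 1.0 - p)
    return total

def classify(present_words):
    """Pick the class with the highest combined log score."""
    return max(log_prior, key=lambda cls: log_score(present_words, cls))

print(classify({"passenger", "safety"}))  # Rail
print(classify({"dose", "safety"}))       # Medicine
```

In the second call the word "dose" is strong enough evidence to overturn a prior that favours "Rail" four to one.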

In [None]:
W = 9
print("Word:",vocabulary[topN[W]])
w_probs = tlc.np.exp([model.feature_log_prob_[i,topN[W]] for i in range(len(model.class_log_prior_))])
tlc.pyplot.bar(height=w_probs, 
               x=[x for x in range(len(model.class_log_prior_))])
tlc.pyplot.show()

### Evaluating the model's performance

Having created the model we now use it to generate predictions for the test dataset.

We have two ways of assessing the performance of the model. The first is to give an accuracy score, quite simply the percentage of predictions it has got right.

In [None]:
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)
print("Prediction Accuracy:",int(tlc.accuracy_score(y_test, y_pred)*1000)/10,'%')

A more granular method is to view what is quite rightly called a **Confusion Matrix**.

A confusion matrix is a grid mapping 'correct' answers to predictions. The rows represent the class assigned in the test data and the columns represent predictions. The top left to bottom right diagonal shows us how many of each class have been predicted correctly. All of the other numbers count incorrect predictions. The number in row 2, column 3, will show how many documents of whatever the 2nd class represents ("Medicine") have been misclassified as the 3rd class ("Rail").
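
A confusion matrix is simple enough to build by hand, which can help when reading one. The sketch below uses made-up labels and a hypothetical helper `build_confusion`; it is not the code behind `tlc.draw_confusion`.

```python
def build_confusion(true_labels, predicted_labels, n_classes):
    """matrix[r][c] counts documents of true class r predicted as class c."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

# Hypothetical labels: 0 = General, 1 = Medicine, 2 = Rail.
toy_true = [2, 2, 2, 1, 1, 0]
toy_pred = [2, 2, 2, 2, 1, 2]
toy_cm = build_confusion(toy_true, toy_pred, 3)
for row in toy_cm:
    print(row)
# [0, 0, 1]
# [0, 1, 1]
# [0, 0, 3]

# Accuracy is the sum of the diagonal divided by the number of documents.
toy_accuracy = sum(toy_cm[i][i] for i in range(3)) / len(toy_true)
```

Reading the second row: of the two "Medicine" documents, one was predicted correctly and one was misclassified as "Rail".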

In this example we see that "Rail" documents tend to be classified correctly, but a lot of the other types are also being misclassified as "Rail".

We can see from the numbers that this dataset is highly imbalanced, so most of the records are "Rail". This may be responsible for the bias towards that class.

In [None]:
fig, ax = tlc.draw_confusion(y_test, y_pred, model, class_names)
tlc.pyplot.show()

### Exploring individual predictions
Similar to the topic model output, the prediction also comes with probabilities. We can visualise these probabilities for a selection of predictions.

In [None]:
y_true_pred = [x for x in zip(range(len(y_test)),y_test, y_pred, y_prob)]
incorrect_predictions = [(y[0],y[1],y[2],y[3]) for i,y in enumerate(y_true_pred) if y[1] != y[2]]
incorrect_unsure = [x for x in incorrect_predictions if max(x[3]) < 0.8]

In [None]:
prediction_sample = tlc.random.sample(range(len(incorrect_unsure)), min(5, len(incorrect_unsure)))
fig, ax = tlc.pyplot.subplots(min(5,len(prediction_sample)),1)
fig.set_size_inches(5,7)

for i, sample_idx in enumerate(prediction_sample):
    prediction_row = incorrect_unsure[sample_idx]
    data_idx = prediction_row[0]
    true_class = prediction_row[1]
    predicted = prediction_row[2]
    tp = prediction_row[3]
    class_probs = [tp[0],tp[1],tp[2],tp[3],tp[4]]
    ax[i].set_ylim([0,1.0])
    ax[i].text(.4,0.8, str(data_idx) + ": True:" + class_names[true_class] + ", Predicted:" + class_names[predicted],
        horizontalalignment='center',
        transform=ax[i].transAxes)
    if i < len(prediction_sample) - 1:
        ax[i].set_xticks([])
    else:
        # Label only the bottom plot, however many samples were drawn
        ax[i].set_xticks(ticks=[0,1,2,3,4])
        ax[i].set_xticklabels(labels=['General','Medicine','Rail', 'Safety', 'Pensions'])
    ax[i].bar(x = [0,1,2,3,4], height = class_probs)
tlc.pyplot.show()

If we look at probabilities in aggregate we see that generally high confidence predictions match the correct class, and lower confidence ones are more likely to be incorrect. This is desirable behaviour because it gives us more trust in the high confidence predictions.
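
The aggregation can be sketched as follows: group predictions into bins by their highest class probability, then compute the fraction of correct predictions in each bin. The records and the function name `match_rate_by_confidence` are made up for illustration.

```python
def match_rate_by_confidence(records, n_bins=10):
    """records: (max_probability, was_correct) pairs.
    Returns {bin_lower_edge: fraction correct} for every non-empty bin."""
    totals = [0] * (n_bins + 1)      # extra slot for a probability of exactly 1.0
    matches = [0] * (n_bins + 1)
    for prob, correct in records:
        b = int(prob * n_bins)
        totals[b] += 1
        matches[b] += int(correct)
    return {b / n_bins: matches[b] / totals[b] for b in range(n_bins + 1) if totals[b]}

# Hypothetical predictions: three confident ones and three unsure ones.
toy_records = [(0.95, True), (0.92, True), (0.97, False),
               (0.55, False), (0.52, True), (0.58, False)]
toy_rates = match_rate_by_confidence(toy_records)
print(toy_rates)
```

In this toy data the 0.9 bin has a higher match rate than the 0.5 bin, which is the pattern described above.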

In [None]:

prob_match_sum = tlc.np.zeros((11,4))

for i in range(11):
    prob_match_sum[i,0] = i/10
for row in y_true_pred:
    max_prob = int(max(row[3]) * 10)
    prob_match_sum[max_prob,int(row[1] == row[2])+1] += 1
    prob_match_sum[max_prob,3] += 1

prob_match_sum = tlc.pd.DataFrame(prob_match_sum, columns=["Probability", "NoMatch", "Match", "Total"])

ax = prob_match_sum.plot(x="Probability", y="Total", kind="bar", color="blue")
ax.legend(['Disagree', 'Match'])
prob_match_sum.plot(x="Probability", y="Match", kind="bar", ax=ax, color="orange")
tlc.pyplot.show()

We will now use another Machine Learning algorithm called Nearest Neighbours to help us classify some new training data.

This code prepares the data for the next section. Firstly the data is indexed to find similar documents quickly (KDTree), then we sort the predictions by their prediction probability (most confident first). Finally we create a sample of most and least confident predictions.
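
The idea behind Nearest Neighbours can be shown without an index at all: measure the distance from the query document to every other document and keep the closest few. The brute-force sketch below uses Euclidean distance (which is what a Minkowski metric with p=2 amounts to) on made-up vectors; a KD-tree is a way of organising the data so that similar lookups are faster. The name `nearest_neighbours` is hypothetical.

```python
import math

def nearest_neighbours(query, vectors, k):
    """Return the indices of the k vectors closest to query (Euclidean distance)."""
    distances = [(math.dist(query, v), i) for i, v in enumerate(vectors)]
    return [i for _, i in sorted(distances)[:k]]

# Hypothetical three-word TF-IDF vectors for four documents.
toy_vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.8],
]
toy_neighbours = nearest_neighbours([1.0, 0.0, 0.0], toy_vectors, k=2)
print(toy_neighbours)  # [0, 1]
```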

In [None]:
kdt = tlc.KDTree(training_features, leaf_size=30, metric='minkowski', p=2)
# Assuming a scikit-learn vectorizer: get_feature_names() was removed in scikit-learn 1.2
tfidf_words = D4ML.vectorizer.get_feature_names_out()
max_sorted_asc = sorted(y_true_pred, key=lambda max_x : max(max_x[3]), reverse=True)
sample_ids = (tlc.np.random.beta(0.55, 0.4, 50) * len(max_sorted_asc)).astype('int')
sample_ids = list(set(sample_ids))
#tlc.random.shuffle(sample_ids)
s = 0

# Reviewing and correcting predictions

## Form setup code

In [None]:
output = tlc.widgets.Output()

neighbour_list = []
neighbour_idx = 0
NUMBER_OF_NEIGHBOURS = 4

human_classification = {}

def dropdown_update(change):

    check_row = change.new
    this_idx = max_sorted_asc[check_row][0]
    this_words = set([tfidf_words[w] for w in tlc.np.nonzero(X_test[this_idx])[1]])
    true_class_drop.value = y_test[this_idx]
    this_file_name = D4ML.idx_to_file[x_test_ids[this_idx]]
    file_name_text.set_state({'value':this_file_name})
    file_name_text.send_state('value')
    predicted_class_text.set_state({'value': class_names[y_pred[this_idx]]})
    predicted_class_text.send_state('value')
    prediction_prob = tlc.np.max(y_prob[this_idx])
    doc_pred_prob.set_state({'value': prediction_prob})
    doc_pred_prob.send_state('value')
    doc_words.set_state({'value': ",".join(list(this_words))})
    doc_words.send_state('value')
    nn_dist, nn_ind = kdt.query(X_test[this_idx], k=NUMBER_OF_NEIGHBOURS)
    doc_contents.set_state({'value': D4ML.file_contents[this_file_name]})
    doc_contents.send_state('value')
    neighbour_count = 0
    global neighbour_list
    neighbour_list = []
    for i in range(len(nn_dist[0])):
        dist = nn_dist[0][i]
        idx =  nn_ind[0][i]
        file_name = D4ML.idx_to_file[idx]
        if idx == x_test_ids[this_idx]:
            continue
        pred = model.predict(training_features[idx])
        neighbour_distance.set_state({'value':dist})
        neighbour_distance.send_state('value')
        true_class = -1
        words = set([tfidf_words[w] for w in tlc.np.nonzero(training_features[idx])[1]])
        if len(words) == 0:
            continue
        # Assumes classes_per_doc maps each file name to a list of class scores
        if file_name in classes_per_doc:
            true_class = tlc.np.argmax(classes_per_doc[file_name])
        overlap_words = ",".join(list(words.intersection(this_words)))
        
        neighbour_list.append([idx, dist, file_name, overlap_words, D4ML.file_contents[file_name],
                               true_class, class_names[pred[0]]])
        neighbour_count += 1
    accordion.set_title(3, "Nearest Neighbours: " + str(len(neighbour_list)))
    with output:
        output.clear_output()
        display(accordion)


sample_dropdown = tlc.widgets.Dropdown(options=sample_ids, value=None, description='Choose a document',
                                       layout={'width':'100px'}, style={'description_width': 'initial'})
sample_dropdown.observe(dropdown_update, 'value')

file_name_text = tlc.widgets.Text(value=None, description="File Name")
class_options = [(v,k) for k,v in class_names.items()]
true_class_drop = tlc.widgets.Dropdown(options=class_options,
                                       value = None, description='True Class')
update_true = tlc.widgets.Button(description="Update")

def on_true_pressed(b):
    human_classification[file_name_text.value] = true_class_drop.value

update_true.on_click(on_true_pressed)

#true_class_drop.observe(on_true_changed,'value')
predicted_class_text = tlc.widgets.Text(value=None, description='Predicted class', style={'description_width': 'initial'})
doc_pred_prob = tlc.widgets.FloatText(value=None, description='Probability')
details = tlc.widgets.VBox([file_name_text,
                            tlc.widgets.HBox([true_class_drop, predicted_class_text, update_true]),
                            doc_pred_prob])

neighbour_overlap = tlc.widgets.Text(value="", 
                               layout={'height': '100%', 'width': '700px'}, disabled=False)
neighbour_distance = tlc.widgets.FloatText(value=None, description='Distance')
neighbour_true_class = tlc.widgets.Dropdown(value=None, description="True", options=class_options)
neighbour_prediction = tlc.widgets.Text(value=None, description="Prediction")
neighbour_file = tlc.widgets.Text(value=None)
neighbour_content = tlc.widgets.Textarea(value=None, layout={'height': '150px', 'width': '700px'})

def on_next_clicked(b):
    global neighbour_idx
    global neighbour_list
    if len(neighbour_list) == 0:
        return
    neighbour_idx += 1
    if neighbour_idx >= len(neighbour_list):
        neighbour_idx = 0
    neighbour_data = neighbour_list[neighbour_idx]
    neighbour_true_class.value = neighbour_data[5]
    neighbour_true_class.send_state('value')
    neighbour_prediction.set_state({'value':neighbour_data[6]})
    neighbour_prediction.send_state('value')
    neighbour_distance.set_state({'value':neighbour_data[1]})
    neighbour_distance.send_state('value')
    neighbour_file.set_state({'value':neighbour_data[2]})
    neighbour_file.send_state('value')
    neighbour_overlap.set_state({'value':neighbour_data[3]})
    neighbour_overlap.send_state('value')
    neighbour_content.set_state({'value':neighbour_data[4]})
    neighbour_content.send_state('value')


next_neighbour = tlc.widgets.Button(description="Next", layout={'width':'100px'})
next_neighbour.on_click(on_next_clicked)

neighbour_true_update = tlc.widgets.Button(description="Update")

def on_neighbour_update_click(b):
    human_classification[neighbour_file.value] = neighbour_true_class.value

neighbour_true_update.on_click(on_neighbour_update_click)

neighbour_details = tlc.widgets.VBox([next_neighbour, neighbour_file, neighbour_distance,
                                      tlc.widgets.HBox([neighbour_true_class, neighbour_true_update]),
                                      neighbour_prediction])

doc_words = tlc.widgets.Text(value=None, layout={'height': '100%', 'width': '700px'}, disabled=False)
#overlap.observe(sample_dropdown, on_button_clicked)
doc_contents = tlc.widgets.Textarea(value=None,
                                    layout={'height': '200px', 'width': '700px'})
sub_accordion = tlc.widgets.Accordion(children=[neighbour_details, neighbour_overlap, neighbour_content])
accordion = tlc.widgets.Accordion(children=[details,doc_words, doc_contents, sub_accordion])
accordion.set_title(0,"Details")
accordion.set_title(1,"Document Features")
accordion.set_title(2,"Document Contents")
accordion.set_title(3,"Nearest Neighbours")
sub_accordion.set_title(0,"Neighbour details")
sub_accordion.set_title(1,"Word overlap")
sub_accordion.set_title(2,"Neighbour content")


## Refreshing the training data

Run the line of code below and then use the dropdown menu to select a sample document. The ids will be meaningless so pick any. This will open a form for viewing the classification results for the document. The form layout is called 'accordion' and you can click on any title and a different part of the form will open up. You can also change the 'true' classification and press update to save the new value. As well as checking the selected document classification, the form also shows the most similar documents (using nearest neighbours) to classify those at the same time.

The output from this exercise could be used in future to train a new classifier.

In [None]:
display(sample_dropdown, output)

In [None]:
human_classification