
## Set up variables and install useful library code

In [None]:
import sys
data_source = "Github"
!git clone https://github.com/nationalarchives/TechneTraining.git
sys.path.insert(0, 'TechneTraining')
sys.path.insert(0, 'TechneTraining/Code')
github_data = "TechneTraining/Data/"
import techne_library_code as tlc
from IPython.display import display

## Load word list from Topic Model built on 'regulation' related websites.
Display words in table.

A topic model is created from a corpus of text by an unsupervised machine learning algorithm. The process is non-deterministic, which means the results will differ every time it is run. Below is one set of results from running the MALLET software over the 'regulation' corpus.

The primary output is a list of topics (in this case 8, one row each) and a list of words most representative of that topic. A word can appear in more than one topic.

From this we can get a high-level overview of a corpus of text.

In [None]:
topic_words = tlc.read_topic_list(github_data + "TM/topic_list.txt")
TOPICS = len(topic_words)
topic_table = tlc.pd.DataFrame([v[0:12] for v in topic_words.values()])
topic_table

Stop words are used in Natural Language Processing to filter out very common words, or those which may negatively affect the results.

Below is a list of example stop words.

In [None]:
stop_words = ["i","or","which","of","and","is","the","a","you're","you","at","his","etc",'an','where','when']
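
As a rough illustration of what filtering does, the sketch below removes stop words from a single sentence. The helper `remove_stop_words`, the shortened stop list and the example sentence are all hypothetical; later in this notebook the filtering is handled by the vectorizer rather than by hand.

```python
# Hypothetical helper: drop stop words from a lower-cased, whitespace-split sentence.
example_stop_words = {"i", "or", "which", "of", "and", "is", "the", "a"}

def remove_stop_words(text, stops):
    """Return the tokens of `text` that are not in `stops`."""
    return [token for token in text.lower().split() if token not in stops]

sentence = "The regulation of freight is a matter which the board reviews"
filtered_tokens = remove_stop_words(sentence, example_stop_words)
print(filtered_tokens)
```

Only the content-bearing words survive, which is why a badly chosen stop word (one that actually distinguishes categories) can harm the results.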

We can add more to this list by selecting from the following list. Which ones do you think might be worth filtering out?

In [None]:
additional_stops = ['medical','freight','pdf','plan','kb','regulation','risk']
stop_word_select = tlc.widgets.SelectMultiple(options=additional_stops, rows=len(additional_stops))
stop_word_select

In [None]:
for w in stop_word_select.value:
    print("Adding",w,"to stop words")
    stop_words.append(w)

As well as a list of topics and related words, MALLET also produces a topic breakdown for each document in the corpus.

Here we load the topic data and visualise some examples.

In [None]:
topics_per_doc = tlc.read_doc_topics(github_data + "TM/topics_per_doc.txt")

The following plots show the proportion of each topic attributed to 4 different documents.



1.   Top Left: One topic clearly dominates
2.   Top Right: One dominant topic but a second topic is above the level of others
3.   Bottom Left: Two topics clearly above others
4.   Bottom Right: Topics close to being even

In [None]:
file_number_list = [212, 85, 9, 372]
fig, ax = tlc.plot_doc_topics(file_number_list, topics_per_doc, TOPICS)
tlc.pyplot.show()

## From Topics to Classes

For this Machine Learning exercise we want to predict a Category of regulation (e.g. "Medicine" or "Rail"). The categories we may want to predict do not map one-to-one with the topics above. So first we need to create that mapping.

Firstly we will define a list of possible categories. Sometimes the topics that come out may be worth ignoring (e.g. cookie information) but in this case all of them seem to be of interest.

In [None]:
topics_of_interest = [0,1,2,3,4,5,6,7]
class_names = {0:"General",1:"Medicine",2:"Rail",3:"Safety",4:"Pensions",5:"Education",6:"Other",-1:"Unclassified"}
topic_to_class = {}
topic_to_class = {0:1,1:0,2:2,3:3,4:4,5:4,6:2,7:2}  #For testing

Using the dropdown and list selector below we can set the mapping from topic to Class (a term more commonly used in machine learning for category).

In [None]:
class_drop = tlc.widgets.Dropdown(options=[(v,k) for k,v in class_names.items()], value=topics_of_interest[0], rows = len(class_names))
topic_select = tlc.widgets.SelectMultiple(options=[(w[0:5],t) for t,w in topic_words.items()],
                                      value = [k for k,v in topic_to_class.items() if v == class_drop.value],
                                      rows = TOPICS+1, height="100%")

button = tlc.widgets.Button(description="Update")
output = tlc.widgets.Output()

#D = display(button, output)

def on_button_clicked(b):
    for v in topic_select.value:
        topic_to_class[v] = class_drop.value
        print("Updated")
        #with output:
        #    print("Mapped",topic_to_class[v],"to",class_drop.value)

button.on_click(on_button_clicked)

In [None]:
V = tlc.widgets.VBox([class_drop, topic_select, button])
V

Update the selected values here and then return to the dropdown until finished.

Display the resulting mappings from **Topic** to **Class**

In [None]:
tlc.pd.DataFrame([(",".join(topic_words[k][0:10]),class_names[v]) for k,v in topic_to_class.items()], columns=['Topic Words','Class'])

In [None]:
classes_per_doc = tlc.topic_to_class_scores(topics_per_doc, topic_to_class)
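
One plausible way to turn per-topic proportions into per-class scores is to add up the proportions of all topics mapped to the same class. The sketch below assumes that approach; `sum_topics_into_classes` and the proportions are made up for illustration, and the actual behaviour of `tlc.topic_to_class_scores` may differ.

```python
def sum_topics_into_classes(topic_scores, topic_to_class, n_classes):
    """Add each topic's proportion to the class that topic is mapped to."""
    class_scores = [0.0] * n_classes
    for topic, score in enumerate(topic_scores):
        class_scores[topic_to_class[topic]] += score
    return class_scores

# Same mapping as the test mapping defined earlier in the notebook.
example_mapping = {0: 1, 1: 0, 2: 2, 3: 3, 4: 4, 5: 4, 6: 2, 7: 2}
# Made-up topic proportions for a single document (they sum to 1).
example_topic_scores = [0.05, 0.10, 0.40, 0.05, 0.10, 0.05, 0.15, 0.10]

example_class_scores = sum_topics_into_classes(example_topic_scores, example_mapping, 5)
print([round(s, 2) for s in example_class_scores])
```

Because three topics map to class 2 ("Rail"), that class ends up with the largest combined score for this document.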

Generally every document contains a little of each topic. Before visualising the class breakdown for our sample documents, we can filter out the lowest-scoring classes and focus on the primary class(es) by zeroing all values below a threshold. We then **normalise** the remaining values so that they sum to 1.

Run the next piece of code to create a slider to set the threshold, and then the following one will draw graphs. To try a different threshold, adjust the slider and rerun the graph code.

In [None]:
class_threshold = tlc.widgets.FloatSlider(0.10, min=0.10, max=0.65, step=0.05)
class_threshold

This is the graph code. It shows Classes above the threshold defined by the slider for the four documents previously visualised. 

In [None]:
filtered_classes_per_doc = tlc.filter_topics_by_threshold(classes_per_doc, class_threshold.value)
class_count = len(filtered_classes_per_doc['file_1.txt'])
fig, ax = tlc.plot_doc_topics(file_number_list, filtered_classes_per_doc, class_count, normalise=True)
tlc.pyplot.show()
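
The thresholding and normalisation described above can be sketched in a few lines. This is a minimal illustration with made-up scores; `filter_and_normalise` is a hypothetical helper and is not the library's `filter_topics_by_threshold`.

```python
def filter_and_normalise(scores, threshold):
    """Zero every score below `threshold`, then rescale the rest to sum to 1."""
    kept = [s if s >= threshold else 0.0 for s in scores]
    total = sum(kept)
    if total == 0:
        return kept  # nothing passed the threshold, so leave everything at zero
    return [s / total for s in kept]

made_up_scores = [0.10, 0.05, 0.65, 0.05, 0.15]
result = filter_and_normalise(made_up_scores, threshold=0.10)
print([round(s, 3) for s in result])
```

Raising the threshold removes more of the minor classes, so the surviving classes take a larger share after normalisation.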

Now that we've mapped topics and categories, it is time to prepare the text corpus (the document contents) for Machine Learning.

First we load the content from a file.

In [None]:
file_contents = tlc.load_content(github_data + "TM/tm_file_contents.txt")

file_list = []
corpus = []

for k,v in file_contents.items():
    file_list.append(k)
    corpus.append(tlc.clean_string(v))

file_to_idx = dict([(x,i) for i,x in enumerate(file_list)])

## Representing text as numbers

The next stage is to use the results of the topic modelling to train a Supervised Learning algorithm.

Supervised Learning is learning by example. We label our data in advance with categories, and then the algorithm derives a function which will map an input data item to an output Class. The input data is usually termed **Features**, the outputs are often called **Responses**.

### Term Frequency-Inverse Document Frequency (TF-IDF)

For this exercise the features are the words in our documents. There is no semantic meaning attached, the words are no more than tokens. Imagine a spreadsheet where each row represents a document and each column represents a word from a fixed vocabulary.

One representation would be to use a 1 to indicate that a word appears in a document and a 0 if it doesn't. This is simple, but overly so. A better representation is TF-IDF, which stands for Term Frequency-Inverse Document Frequency. The general idea is that a word which appears in most documents scores low, while a word which appears in few documents scores high (this is the Inverse Document Frequency part). The Term Frequency part increases the score when the word appears many times in the document being scored.
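
To make the idea concrete, here is a small hand calculation on a made-up three-document corpus. It uses a textbook form of the score (raw count multiplied by the log of total documents over document frequency). Library implementations such as scikit-learn's use a smoothed and normalised variant, so their numbers will not match these exactly.

```python
import math
from collections import Counter

# Made-up corpus for illustration only.
toy_docs = [
    "rail passenger safety rail",
    "medicine safety licence",
    "pension scheme safety",
]
tokenised = [doc.split() for doc in toy_docs]

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenised for word in set(tokens))

def tf_idf(word, tokens):
    """Textbook TF-IDF: count in this document times log(N / document frequency)."""
    tf = tokens.count(word)
    idf = math.log(len(tokenised) / doc_freq[word])
    return tf * idf

rail_score = tf_idf("rail", tokenised[0])      # frequent here, absent elsewhere
safety_score = tf_idf("safety", tokenised[0])  # appears in every document
print(round(rail_score, 3), round(safety_score, 3))
```

"safety" appears in every document, so it carries no information about which document we are looking at and scores zero; "rail" is both frequent in the first document and rare elsewhere, so it scores highest.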

We have some influence over the parameters used to define the TF-IDF representation of our corpus. How they are set can influence the results.

1. Features: how many distinct words from the corpus to use for the Vocabulary
2. Min Doc Frequency: minimum number of documents a word must appear in to be considered
3. Max Doc Frequency: maximum number of documents a word may appear in and still be considered
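
The effect of these three parameters on the vocabulary can be sketched as follows. `build_vocabulary` is a hypothetical, simplified helper written for this illustration, and the toy documents are made up; a real vectorizer applies the same kind of pruning internally, though its tie-breaking may differ.

```python
from collections import Counter

def build_vocabulary(tokenised_docs, max_features, min_df, max_df):
    """Keep words whose document count lies in [min_df, max_df], then the most frequent."""
    doc_freq = Counter(w for tokens in tokenised_docs for w in set(tokens))
    term_freq = Counter(w for tokens in tokenised_docs for w in tokens)
    eligible = [w for w in term_freq if min_df <= doc_freq[w] <= max_df]
    # Rank by total occurrences across the corpus; break ties alphabetically.
    eligible.sort(key=lambda w: (-term_freq[w], w))
    return eligible[:max_features]

toy_tokens = [
    ["rail", "safety", "rail", "track"],
    ["rail", "safety", "passenger"],
    ["medicine", "safety", "licence"],
    ["medicine", "safety", "rail"],
]
toy_vocabulary = build_vocabulary(toy_tokens, max_features=2, min_df=2, max_df=3)
print(toy_vocabulary)
```

Here "safety" is dropped for appearing in too many documents, while "track", "passenger" and "licence" are dropped for appearing in too few.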


In [None]:
FEATURES=1000
MIN_DOC_FREQ=4
MAX_DOC_FREQ=100

In [None]:
vectorizer = tlc.TfidfVectorizer(max_features=FEATURES, min_df=MIN_DOC_FREQ, max_df=MAX_DOC_FREQ, stop_words = stop_words)
TFIDF = vectorizer.fit_transform(corpus)
print("Documents:",TFIDF.shape[0],"\tWords",TFIDF.shape[1])
training_files, training_features, training_class = tlc.prepare_for_ml(TFIDF, classes_per_doc, file_to_idx)


In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vocabulary = vectorizer.get_feature_names_out()
example_row = TFIDF[file_number_list[2]]
example_table = tlc.pd.DataFrame(zip([vocabulary[w] for w in example_row.nonzero()[1]],
                                 [int(example_row[0,v]*1000)/1000 for i,v in enumerate(example_row.nonzero()[1])]),
                                 columns = ['word','tfidf'])
example_table

## Supervised Machine Learning

For this exercise we will use the Naive Bayes algorithm. It is called Bayes because it uses Bayesian probability (named after the Reverend Thomas Bayes, who discovered it). It is Naive because it assumes that all words in a document are independent of each other. This is clearly untrue: in the sentence "my cat miaows when hungry", the presence of "cat" makes "miaows" far more likely. It seems like a bad assumption but actually works well in practice.

Bayesian probability is surprisingly simple and gives us the ability to flip probabilities around. From a corpus of text I can easily calculate the probability of the word "Passenger" appearing in a document about "Railways", and also in a document about "Medicines". What Bayes' rule does is allow me to then calculate the probability that a document is about "Railways" or "Medicines" given that it contains the word "Passenger". Very handy!
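
The "flip" can be worked through with a few lines of arithmetic. All of the probabilities below are invented for illustration (they are not measured from the corpus), and the sketch assumes there are only two classes.

```python
# Made-up figures for illustration only; they are not taken from the corpus.
p_rail = 0.7                       # P(Railways): share of documents about railways
p_medicine = 0.3                   # P(Medicines)
p_passenger_given_rail = 0.40      # P("passenger" | Railways)
p_passenger_given_medicine = 0.02  # P("passenger" | Medicines)

# Total probability of seeing the word "passenger" in any document.
p_passenger = (p_passenger_given_rail * p_rail
               + p_passenger_given_medicine * p_medicine)

# Bayes' rule: P(class | word) = P(word | class) * P(class) / P(word)
p_rail_given_passenger = p_passenger_given_rail * p_rail / p_passenger
p_medicine_given_passenger = p_passenger_given_medicine * p_medicine / p_passenger
print(round(p_rail_given_passenger, 3), round(p_medicine_given_passenger, 3))
```

With these figures, a document containing "passenger" is almost certainly about railways. Naive Bayes repeats this calculation for every word in the document and multiplies the results together, which is where the independence assumption comes in.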

Before starting any Machine Learning we need to split our data into Training and Test datasets. The reason for this is that algorithms can appear to perform very well against the dataset they were trained with, but then perform very badly on new, unseen, data.

The algorithm will learn from the Training data and then we check its performance against the Test data.

In [None]:
TEST_PROPORTION = 0.6
X_train, X_test, y_train, y_test, f_file_train, f_file_test = tlc.train_test_split(training_features, training_class, training_files,
                                                                             test_size = TEST_PROPORTION, random_state=42, stratify=training_class)
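
The `stratify` argument asks for the split to preserve the proportion of each class in both halves, which matters when some classes are rare. The sketch below shows the idea in plain Python; `stratified_split` is a hypothetical helper written for this illustration and is not how the library implements it.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, seed=42):
    """Return (train_indices, test_indices), taking the test share within each class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Made-up, imbalanced labels: 10 "Rail" documents and 5 "Medicine" documents.
toy_labels = ["Rail"] * 10 + ["Medicine"] * 5
train_idx, test_idx = stratified_split(toy_labels, test_size=0.6)
print(len(train_idx), len(test_idx))
```

Both halves keep the same 2:1 ratio of "Rail" to "Medicine" as the full dataset, so the rarer class is not left out of either by chance.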

Now we **fit** a Naive Bayes model to the training data. Two lines of code and it is done!

In [None]:
model = tlc.BernoulliNB()
model.fit(X_train, y_train)

Having created the model we now use it to generate predictions for the test dataset.

We have two ways of assessing the performance of the model. The first is to give an accuracy score, quite simply the percentage of predictions it has got right.

In [None]:
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)
print("Prediction Accuracy:",int(tlc.accuracy_score(y_test, y_pred)*1000)/10,'%')


A more granular method is to view what is quite rightly called a **Confusion Matrix**.

A confusion matrix is a grid mapping 'correct' answers to predictions. The rows represent the class assigned in the test data and the columns represent predictions. The top left to bottom right diagonal shows us how many of each class have been predicted correctly. All of the other numbers count incorrect predictions. The number in row 2, column 3, will show how many documents of whatever the 2nd class represents ("Medicine") have been misclassified as the 3rd class ("Rail").

In this example we see that "Rail" documents tend to be classified correctly, but many documents of the other types are also being misclassified as "Rail".

We can see from the numbers that this dataset is highly imbalanced: most of the records are "Rail". This may be responsible for the bias towards that class.

In [None]:
fig, ax = tlc.draw_confusion(y_test, y_pred, model, class_names)
tlc.pyplot.show()
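
To see how the grid and the accuracy score relate, the sketch below builds a confusion matrix by hand from made-up labels. `confusion_counts` is a hypothetical helper for illustration; the plot above is produced by the library's own `draw_confusion`.

```python
def confusion_counts(true_labels, predicted_labels, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[true][predicted] += 1
    return matrix

# Made-up labels: 0 = General, 1 = Medicine, 2 = Rail.
toy_true = [2, 2, 2, 2, 1, 1, 0, 0]
toy_pred = [2, 2, 2, 2, 1, 2, 2, 0]
toy_matrix = confusion_counts(toy_true, toy_pred, n_classes=3)

# Accuracy is the diagonal total divided by the number of documents.
toy_accuracy = sum(toy_matrix[c][c] for c in range(3)) / len(toy_true)
for row in toy_matrix:
    print(row)
print("Accuracy:", toy_accuracy)
```

The entry in row 2, column 3 counts the one "Medicine" document misclassified as "Rail". The third column collects every document predicted as "Rail", which is the pattern to look for when a dominant class is absorbing the others.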

In [None]:
y_true_pred = [x for x in zip(range(len(y_test)),y_test, y_pred, y_prob)]

In [None]:
# Keep rows where the true and predicted classes are 0 and 2 (in either order) and disagree
true_0_pred_2 = [y for y in y_true_pred if y[1] in [0,2] and y[2] in [0,2] and y[1] != y[2]]
true_0_pred_2[0:5]

In [None]:
prediction_sample = tlc.random.sample(range(len(true_0_pred_2)), min(5, len(true_0_pred_2)))
fig, ax = tlc.pyplot.subplots(min(5,len(prediction_sample)),1)
fig.set_size_inches(5,7)

for i, sample_idx in enumerate(prediction_sample):
    data_idx = true_0_pred_2[sample_idx][0]
    tp = true_0_pred_2[sample_idx][3]
    class_probs = [tp[0],tp[1],tp[2],tp[3],tp[4]]
    ax[i].set_ylim([0,1.0])
    #ax[int(i/2), i % 2].set_xlim([0,TOPICS-1])
    #ax[i].title.set_text(str(idx))
    ax[i].text(.4,0.8, str(data_idx),
        horizontalalignment='center',
        transform=ax[i].transAxes)
    if i < 4:
        ax[i].set_xticks([])
    else:
        ax[i].set_xticks(ticks=[0,1,2,3,4])
        ax[i].set_xticklabels(labels=['General','Medicine','Rail', 'Safety', 'Pensions'])
    ax[i].bar(x = [0,1,2,3,4], height = class_probs)
tlc.pyplot.show()

In [None]:
uncertains = [x for x in true_0_pred_2 if max(x[3][1], x[3][0]) < 0.95]
for u in uncertains:
    print([u[0],u[1],u[2],[int(y*100)/100 for y in u[3]]])

In [None]:
kdt = tlc.KDTree(training_features, leaf_size=30, metric='euclidean')
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidf_words = vectorizer.get_feature_names_out()

In [None]:
this_idx = uncertains[0][0]
this_file = file_list[this_idx]
print(file_contents[this_file])
this_words = set([tfidf_words[w] for w in tlc.np.nonzero(training_features[this_idx])[1]])
print(this_words)

In [None]:

prob_match_sum = tlc.np.zeros((11,4))
#print(prob_match_sum.shape)
for i in range(11):
    prob_match_sum[i,0] = i/10
for row in y_true_pred:
    max_prob = int(max(row[3]) * 10)
    #if prob_match_sum[max_prob, 0] == 0:
    #    prob_match_sum[max_prob, 0] = max_prob/10
    prob_match_sum[max_prob,int(row[1] == row[2])+1] += 1
    prob_match_sum[max_prob,3] += 1

#for x in max_sorted_asc[0:5]:
#    print(x)

#print("")
#for x in max_sorted_desc[0:5]:
#    print(x)

prob_match_sum = tlc.pd.DataFrame(prob_match_sum, columns=["Probability", "NoMatch", "Match", "Total"])
#prob_match_sum

ax = prob_match_sum.plot(x="Probability", y="Total", kind="bar", color="blue")
ax.legend(['Disagree', 'Match'])
prob_match_sum.plot(x="Probability", y="Match", kind="bar", ax=ax, color="orange")
tlc.pyplot.show()

In [None]:
max_sorted_asc = sorted(y_true_pred, key=lambda max_x : max(max_x[3]), reverse=False)

In [None]:
check_row = tlc.random.randint(0,100)
print("Random number:", check_row)
print(max_sorted_asc[check_row])
this_idx = max_sorted_asc[check_row][0]
this_words = set([tfidf_words[w] for w in tlc.np.nonzero(training_features[this_idx])[1]])

print("Doc Index:", this_idx, "Doc words:", ",".join(list(this_words)))
nn_dist, nn_ind = kdt.query(training_features[this_idx], k=4)
print("Contents:", file_contents[file_list[this_idx]])
print("")
for i in range(len(nn_dist[0])):
    dist = nn_dist[0][i]
    idx = nn_ind[0][i]
    if idx == this_idx:
        continue
    pred = model.predict(training_features[idx])
    true_class = -1
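    # `class_file_scores` is not defined anywhere in this notebook; `classes_per_doc`
    # (built earlier from the topic scores) is assumed to be the intended lookup.
    if file_list[idx] in classes_per_doc:
        true_class = tlc.np.argmax(classes_per_doc[file_list[idx]])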
    print("Match index:", idx, "; Predicted as:", class_names[pred[0]],
          "; Labelled as:", class_names[true_class],
          "; Distance score:",dist)
    # `feature_matrix` is not defined in this notebook; `training_features` is assumed.
    words = set([tfidf_words[w] for w in tlc.np.nonzero(training_features[idx])[1]])
    print("\tWords:  ", ",".join(list(words)))
    print("\tOverlap:", ",".join(list(words.intersection(this_words))))
    print("\tContent:",file_contents[file_list[idx]].strip())
    print("")

    