Building a machine learning
model in python
for topics extraction and relate all legislations by their similarity.
Given a new phrase, the model with predict the topic is about and will list all related legislations.
LDA (Latent Dirichlet Allocation)
unsupervised machine-learning model will be used for classifying documents into topics and prediction.
Dataset of legislations are downloaded from the Official Journal of the European Commission using this project.
This project has been built as part of the dissertation for the Postgraduate Studies in Big Data & Analytics at University of Leuven.
-
Load all
documents
from the filesystem, clean and pre-process the content, and build a list with tuples <FileName, FileContent>[ [fileName1, fileContent1], [fileName1, fileContent2], ... ]
-
Build the
LDA model
:documents_words = documents[:, 1] id2word = corpora.Dictionary(documents_words) bow_corpus = [id2word.doc2bow(document) for document in documents_words] lda_model = LdaModel(corpus=bow_corpus, id2word=id2word)
Dictionary
with all the words present in all files is created.Bag of Words
is created for each file using the dictionary. A BoW is a list of tuples <tokenId, tokenCount>, expressing how many times a word is present in other documents.Corpus Bag of Words
is built putting all BoWs together.LDA model
with the Dictionary and Corpus BoW is built.
Topics
calculated by the LDA model are a set of words followed by a probability. See the chart below for distribution of topics and their words. -
Prediction function
:- Using the LDA model’s Dictionary, we calculate the
BoW
for a givenunseen phrase
. - The model will calculate the relation the new BoW has with each topic.
- A list of tuples
<topicId, probability>
will be returned, and after ordering we take the highest probability, which is the predicted topic.
def predictTopic(lda_model, unseen_phrase): id2word = lda_model.id2word unseen_bow = id2word.doc2bow(unseen_phrase) predicted_topics = sorted(lda_model[unseen_bow]) return predicted_topics[0][0]
- Using the LDA model’s Dictionary, we calculate the
-
MAPPER
topic_to_documents
- Treat the full file as an unseen phrase, and predict the topic using the prediction function of step 3.
- Repeat the process for each file adding its filename to the predicted topic's list, building a
topic_to_document
mapper as follows.Code:[ [topic0, [docName1, docName2, docName3]] [topic1, [docName4]] ... ]
topic_to_documents = defaultdict(list) for doc in documents: prediction = predictTopic(lda_model, doc[1]) topic_to_documents[prediction].append(doc[0])
-
Test the model
- An unseen phrase is given for prediction.
- Call the
prediction function
which gives us the predicted topic. - Using the MAPPER, we extract the list of files associated with that topic.
unseen_phrase = 'In order to ensure a consistent level of protection for natural persons ....' # phrase taken from GDPR reg predictedTopic = predictTopic(lda_model, unseen_phrase.split()) print("unseen phrase: ", unseen_phrase) print("Predicted Topic", predictedTopic, "-", lda_model.print_topic(predictedTopic)) print("Documents containing Topic", predictedTopic,
Output:
Predicted Topic 4 - 0.072*"datum" + 0.034*"personal" + 0.028*"processing" + 0.026*"subject" + 0.024*"authority" + 0.021*"supervisory" + 0.020*"controller" + 0.014*"protection" + 0.013*"right" + 0.013*"purpose" Documents containing Topic 4 - ['reg_2016_679_akn_nr119seq0001.xml']
The file containing all the described steps in sequence is ALL.lda-topic-modeling-ec-laws.py
.
The actual implementation is not the most accurate approach because we are associating only one topic to a regulation, but often regulations talk about different topics. Rather than choosing the most ranked topic for a document, and for the unseen phrase too, would be better if we build a matrix with the probabilities as follow:
This way we know that our unseen phrase is talking about 3 different topics: Topic2, Topic5 and Topic7
and when it’s time for document extraction, we extract:
Document 4 and 5
for Topic2;Document 1 and 2
for Topic5;Document 2
for Topic7.
To achieve this solution the algorithm should consider Euclidean distances
between the topic and documents. To be implemented!
Running the single python
file will penalise the execution time.
The model requires around 20 minutes to be built for the range of 1500 legislations.
There is no need to create and train the model each time we want to do a prediction.
Typically the model will be created only once and used many times for prediction.
Only when the data set, or the pre-processing strategy changes, we need to recreate the model.
For this reason the project has been split in the following files:
-
1.create_documents_array.py
- Loads all documents, reads fileContent, cleans xml, lemmatize, and creates the array
documents
with tuples <FileName, FileContent>. - OUTPUT: save the array
documents
in the filesystem. - EXAMPLES: different .pkl files are created for regulations of 2016, 2017, 2018, 2019 2020 and ALL.
For regulations of the year 2016 use the file ./model/EU_REG_year-2016_documentsArr.pkl.
- Loads all documents, reads fileContent, cleans xml, lemmatize, and creates the array
-
2.create_lda_model.py
- Loads from the filesystem the array documents created in previews .py file and creates the
LDA model
. - OUTPUT: creates .pkl files in filesystem:
bow_corpus
, Bag of Words for all documents.lda_model
, the model.
- EXAMPLES: different models, and the related bow_corpus, for regulations of 2016, 2017, 2018, 2019 2020 are created in the folder ./model.
- Loads from the filesystem the array documents created in previews .py file and creates the
-
3.visualize.py
- Visualises topics distribution of the model.
- Uses file
2.create_lda_model.py
. - OUTPUT: Html file with interactive visualisation. Ex
./model/EU_REG_year-2016_nrtopics20_visualize.html
-
4.predict_topic.py
- Tests the model using an unseen phrase and print the predicted Topic closer to the phrase.
- Uses file
2.create_lda_model.py
- OUTPUT: Print in the console the result of 4 predictions.
See
predict.out
with all predictions. Or fulllogs
- Tests the model using an unseen phrase and print the predicted Topic closer to the phrase.
-
5.create_mapper_topic_to_docs.py
- Builds a mapper Topic-Documents.
- Uses files
2.create_lda_model.py
and4.predict_topic.py
. - OUTPUT: Ex, mapper
./model/EU_REG_year-2016_nrtopics20_topic_to_docs.pkl
-
6.predict_documents.py
- Tests the model using an unseen phrase and print the topic closer to the phrase, together with the list of documents containing that topic.
- Uses files
2.create_lda_model.py
and4.predict_topic.py
and5.create_mapper_topic_to_docs.py
- OUTPUT: Print in the console the result of 4 predictions. See
logs
The following info is used during the Lemmatization of the document content to reduce dimensionality of the dictionary for topic modelling.
Less frequent lemmas are removed in the pre-processing phase.