- Kontogeorgos Ioannis (P3351807)
- Panagiotis Spyrakis (p3351819)
The goal of this challenge was to apply machine learning algorithms and techniques to a real-world problem. The problem was to classify messages posted on Twitter that were related to the recent SARS-CoV-2 outbreak. Each message was classified into one of fifteen classes based on its context. We were also given a graph that models the retweet relationships between users of the platform.
The project has been split into 4 sections (notebooks):
- Notebook communities_detection contains the implementation of the community detection algorithms. It also contains the ensembling of the features extracted by community detection with some supervised ML algorithms.
- Notebook ml_classifiers contains the implementation of a set of supervised ML algorithms.
- Notebook flair_text_classification_torch contains the RNN implementation using the Flair framework.
- Notebook keras_rnn_stackoverflow_posts_with_lstm contains the uni- and bidirectional LSTM RNN implementation using Keras and Talos.
The project is structured as follows:
- Folder /notebooks, which contains the Jupyter notebooks described above.
- Folder /app, which contains the main functionality of the custom LSTM RNN implementation using Keras.
  - preprocessing.py module, which contains all the functions that the notebooks use to shape and process the data.
  - models.py module, which contains the code that generates the RNN models. The implementation follows the Talos documentation so that parameter tuning can be run for each model.
  - metrics.py module, which contains the code for the metrics used when compiling the RNN models, such as F1 and accuracy.
  - visualization.py module, which contains reusable code for visualizing model performance.
  - layers.py module, which contains the code for the custom LinearAttention and DeepAttention layers used by the RNN models.
- Folder data, which contains all the reusable data sources and execution logs. The autogenerated subfolders, used for better file handling, are:
  - fasttext_dir: all files related to the fastText embeddings should be placed here.
  - flair_data_dir: all the input data for the Corpus of the Flair models should be placed here.
  - flair_emdg_dir: all the downloaded embeddings and vocabularies for the Flair models should be placed here.
  - flair_output_dir: the folder with the execution logs of the flair package.
- definitions.py module, which contains global variables and definitions that are essential for the project organization.
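The data subfolders listed above are described as autogenerated. A minimal sketch of how that could be done with `pathlib` is shown here; the names `SUBFOLDERS` and `build_data_dirs` are hypothetical and are not taken from the project's definitions.py, and the project root is assumed to be passed in by the caller.

```python
from pathlib import Path

# Subfolder names as listed in this README.
SUBFOLDERS = (
    "fasttext_dir",
    "flair_data_dir",
    "flair_emdg_dir",
    "flair_output_dir",
)


def build_data_dirs(project_root):
    """Create <project_root>/data and its subfolders if they are missing.

    Hypothetical helper: returns a dict mapping each subfolder name
    to its Path so that callers do not hard-code paths.
    """
    data_dir = Path(project_root) / "data"
    dirs = {name: data_dir / name for name in SUBFOLDERS}
    for path in dirs.values():
        # parents=True also creates the data folder itself on first run.
        path.mkdir(parents=True, exist_ok=True)
    return dirs
```

Calling `build_data_dirs(".")` from the repository root would leave existing folders untouched and create only the missing ones.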
- Create or load an existing conda environment:
    conda env create -f requirements.yml
    conda activate text_analytics
  Useful reference link
- Install the packages from the requirements.txt file.
- Download the vocabulary that will be used in the project and place it under the /data folder. Download here
- Flair Embeddings
  - Twitter Embeddings
    - https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/twitter.gensim.vectors.npy
    - https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/twitter.gensim
  - News Forward English
    - https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/lm-news-english-forward-1024-v0.2rc.pt
  - News Backward English
    - https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/lm-news-english-backward-1024-v0.2rc.pt
  - Glove
    - https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/glove.gensim.vectors.npy
    - https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/glove.gensim
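The embedding files above could be fetched with a short script instead of by hand. The sketch below uses only the standard library; the helper names `target_path` and `download_embeddings` are hypothetical, the default destination `data/flair_emdg_dir` is an assumption based on the folder description in this README, and the URLs are assumed to still be reachable.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve

# Base URL and file names copied from the list of Flair embeddings.
BASE_URL = "https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/"
EMBEDDING_FILES = [
    "twitter.gensim.vectors.npy",
    "twitter.gensim",
    "lm-news-english-forward-1024-v0.2rc.pt",
    "lm-news-english-backward-1024-v0.2rc.pt",
    "glove.gensim.vectors.npy",
    "glove.gensim",
]


def target_path(url, dest_dir):
    """Return the local path for a URL: dest_dir plus the URL's file name."""
    return Path(dest_dir) / Path(urlparse(url).path).name


def download_embeddings(dest_dir="data/flair_emdg_dir", overwrite=False):
    """Download every embedding file that is not already present.

    The default dest_dir is an assumption, not a path confirmed by the project.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    paths = []
    for name in EMBEDDING_FILES:
        url = BASE_URL + name
        path = target_path(url, dest)
        # Skip files that already exist unless overwrite is requested.
        if overwrite or not path.exists():
            urlretrieve(url, str(path))
        paths.append(path)
    return paths
```

Calling `download_embeddings()` from the repository root would fetch only the files that are missing, so the script can be re-run safely after an interrupted download.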