### Data Set: 
https://github.com/mhjabreel/CharCnn_Keras/tree/master/data/ag_news_csv

The AG's news topic classification dataset is constructed by choosing the 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000 and testing 7,600. 

The classes are: 

* World
* Sports
* Business
* Science/Technology

#### For more information on how to use Lbl2Vec, visit the [API Guide](https://lbl2vec.readthedocs.io/en/latest/api.html#)

In [1]:
from lbl2vec import Lbl2Vec
import pandas as pd
from gensim.utils import simple_preprocess
from gensim.models.doc2vec import TaggedDocument, Doc2Vec
from gensim.parsing.preprocessing import strip_tags
from sklearn.metrics import f1_score

### Load data

In [25]:
# load train data
# ag_train = pd.read_csv('data/train.csv',sep=',',header=None, names=['class','title','description'])
ag_train = pd.read_csv('data/compiled_data.csv',sep=',',header=None, names=['class','title','description'])

# load test data
# ag_test = pd.read_csv('data/test.csv',sep=',',header=None, names=['class','title','description'])
ag_test = pd.read_csv('data/delme.csv',sep=',',header=None, names=['class','title','description'])

# load labels with keywords
labels = pd.read_csv('data/labels.csv',sep=',')

# split keywords by separator and save them as array
labels['keywords'] = labels['keywords'].apply(lambda x: x.split(' '))

# convert description keywords to lowercase
labels['keywords'] = labels['keywords'].apply(lambda description_keywords: [keyword.lower() for keyword in description_keywords])

# get number of keywords for each class
labels['number_of_keywords'] = labels['keywords'].apply(lambda row: len(row))

In [4]:
labels

Unnamed: 0,class_index,class_name,keywords,number_of_keywords
0,1,business,"[crypto, binance, ftx, coinbase, blockchain, s...",13
1,2,ecommerce,"[mobile, laptop, smart, watch, resolution, pix...",14
2,3,educational,"[machine, learning, deep, neural, networks, ar...",14
3,4,entertainment,"[film, movie, show, actor, box, office, festiv...",16
4,5,news,"[india, us, modi, biden, pm, asean, summit, g2...",15


### Tokenize data

In [5]:
# doc: document text string
# returns tokenized document
# strip_tags removes meta tags from the text
# simple preprocess converts a document into a list of lowercase tokens, ignoring tokens that are too short or too long 
# simple preprocess also removes numerical values as well as punktuation characters
def tokenize(doc):
    return simple_preprocess(strip_tags(doc), deacc=True, min_len=2, max_len=15)

In [6]:
# add data set type column
ag_train['data_set_type'] = 'train'
ag_test['data_set_type'] = 'test'

# concat train and test data
ag_full_corpus = pd.concat([ag_train,ag_test]).reset_index(drop=True)

In [7]:
# tokenize and tag documents combined title + description for Lbl2Vec training
ag_full_corpus['tagged_docs'] = ag_full_corpus.apply(lambda row: TaggedDocument(tokenize(row['title'] + '. ' + row['description']), [str(row.name)]), axis=1)

In [8]:
# add doc_key column
ag_full_corpus['doc_key'] = ag_full_corpus.index.astype(str)

In [9]:
# add class_name column
ag_full_corpus = ag_full_corpus.merge(labels, left_on='class', right_on='class_index', how='left').drop(['class', 'keywords'], axis=1)

In [10]:
ag_full_corpus.head()

Unnamed: 0,title,description,data_set_type,tagged_docs,doc_key,class_index,class_name,number_of_keywords
0,Crypto_CEO_warns_his_industry_faces_2008-style...,CNN values your feedback 1. How relevant is th...,train,"([style_crisis_, cnn, values, your, feedback, ...",0,1,business,13
1,At_least_$1_billion_of_client_funds_missing_at...,CNN values your feedback 1. How relevant is th...,train,"([at_least_, cnn, values, your, feedback, how,...",1,1,business,13
2,Crypto_is_making_a_big_comeback,CNN values your feedback 1. How relevant is th...,train,"([cnn, values, your, feedback, how, relevant, ...",2,1,business,13
3,Elon_Musk_sold_nearly_$4_billion_worth_of_Tesl...,CNN values your feedback 1. How relevant is th...,train,"([cnn, values, your, feedback, how, relevant, ...",3,1,business,13
4,"Crypto_giant_Binance_drops_bid_to_save_rival,_...",CNN values your feedback 1. How relevant is th...,train,"([cnn, values, your, feedback, how, relevant, ...",4,1,business,13


### Train Doc2Vec model

In [11]:
# in order to get optimal Lbl2Vec results the given Doc2Vec model 
# should be trained with the parameters “dbow_words=1” and “dm=0”.
doc2vec_model = Doc2Vec(documents=ag_full_corpus['tagged_docs'][ag_full_corpus['data_set_type']=='train'], dbow_words=1, dm=1)

# Train Lbl2Vec

Train a new Lbl2Vec model using a pretrained [Doc2Vec](https://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.Doc2Vec) model:
* keywords_list : iterable list of lists with descriptive keywords for each topic.
* doc2vec_model : pretrained [gensim.models.doc2vec.Doc2Vec](https://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.Doc2Vec) model. Lbl2Vec uses its word and document vectors to compute the label vectors.
* label_names : iterable list of custom names for each label. Label names and keywords of the same topic must have the same index.
* similarity_threshold : only documents with a higher similarity to the respective description keywords than this treshold are used to calculate the label embedding.
* min_num_docs : minimum number of documents that are used to calculate the label embedding. 
* epochs : number of iterations over the corpus.

In [14]:
# init model with parameters
lbl2vec_model = Lbl2Vec(keywords_list=list(labels['keywords']), doc2vec_model=doc2vec_model, label_names=list(labels['class_name']), similarity_threshold=0.30, min_num_docs=10, epochs=10)

In [15]:
# train model
lbl2vec_model.fit()

2022-11-13 20:27:19,387 - Lbl2Vec - INFO - Load document and word embeddings
2022-11-13 20:27:19,391 - Lbl2Vec - INFO - Train label embeddings


# Predict topics of documents used to train Doc2Vec

Compute similarity scores of learned document vectors from documents that were used to train the Doc2Vec model to each of the learned label vectors. The similarity scores consist of cosine similarities and therefore have a value range of [-1,1].

In [30]:
# predict similarity scores
model_docs_lbl_similarities = lbl2vec_model.predict_model_docs()

2022-11-13 20:45:19,300 - Lbl2Vec - INFO - Get document embeddings from model
2022-11-13 20:45:19,305 - Lbl2Vec - INFO - Calculate document<->label similarities


In [31]:
model_docs_lbl_similarities.head()

Unnamed: 0,doc_key,most_similar_label,highest_similarity_score,business,ecommerce,educational,entertainment,news
0,0,news,0.947287,0.947287,0.947287,0.600837,0.947287,0.947287
1,1,news,0.958892,0.958892,0.958892,0.594815,0.958892,0.958892
2,2,news,0.962748,0.962748,0.962748,0.647924,0.962748,0.962748
3,3,news,0.972864,0.972864,0.972864,0.663397,0.972864,0.972864
4,4,news,0.98782,0.98782,0.98782,0.680078,0.98782,0.98782


In [32]:
model_docs_lbl_similarities

Unnamed: 0,doc_key,most_similar_label,highest_similarity_score,business,ecommerce,educational,entertainment,news
0,0,news,0.947287,0.947287,0.947287,0.600837,0.947287,0.947287
1,1,news,0.958892,0.958892,0.958892,0.594815,0.958892,0.958892
2,2,news,0.962748,0.962748,0.962748,0.647924,0.962748,0.962748
3,3,news,0.972864,0.972864,0.972864,0.663397,0.972864,0.972864
4,4,news,0.98782,0.98782,0.98782,0.680078,0.98782,0.98782
5,5,news,0.954411,0.954411,0.954411,0.621703,0.954411,0.954411
6,6,news,0.985901,0.985901,0.985901,0.651554,0.985901,0.985901
7,7,news,0.976708,0.976708,0.976708,0.661348,0.976708,0.976708
8,8,news,0.976484,0.976484,0.976484,0.689421,0.976484,0.976484
9,9,news,0.981094,0.981094,0.981094,0.687815,0.981094,0.981094


## Evaluate prediction of documents used to train Doc2Vec

In [33]:
# merge DataFrames to compare the predicted and true topic labels
evaluation_train = model_docs_lbl_similarities.merge(ag_full_corpus[ag_full_corpus['data_set_type']=='train'], left_on='doc_key', right_on='doc_key')

In [34]:
y_true_train = evaluation_train['class_name']
y_pred_train = evaluation_train['most_similar_label']
print('F1 score:',f1_score(y_true_train, y_pred_train, average='micro'))

F1 score: 0.425


# Predict topics of unknown documents

Learn document vectors of new documents that were **not** used to train the Doc2Vec model and compute the similarity scores to each of the learned label vectors. The similarity scores consist of cosine similarities and therefore have a value range of [-1,1].

In [26]:
# predict similarity scores of new test documents (they were not used during Doc2Vec training)
new_docs_lbl_similarities = lbl2vec_model.predict_new_docs(tagged_docs=ag_full_corpus['tagged_docs'][ag_full_corpus['data_set_type']=='test'])

2022-11-13 20:44:56,698 - Lbl2Vec - INFO - Calculate document embeddings
2022-11-13 20:44:57,913 - Lbl2Vec - INFO - Calculate document<->label similarities


In [27]:
new_docs_lbl_similarities.head()

Unnamed: 0,doc_key,most_similar_label,highest_similarity_score,business,ecommerce,educational,entertainment,news
0,40,news,0.911007,0.911007,0.911007,0.595638,0.911007,0.911007
1,41,news,0.867709,0.867709,0.867709,0.549646,0.867709,0.867709
2,42,news,0.888887,0.888887,0.888887,0.591218,0.888887,0.888887
3,43,news,0.881533,0.881533,0.881533,0.568577,0.881533,0.881533
4,44,news,0.910535,0.910535,0.910535,0.584487,0.910535,0.910535


## Evaluate prediction of new documents

In [28]:
# merge DataFrames to compare the predicted and true topic labels
evaluation_test = new_docs_lbl_similarities.merge(ag_full_corpus[ag_full_corpus['data_set_type']=='test'], left_on='doc_key', right_on='doc_key')

In [29]:
y_true_test = evaluation_test['class_name']
y_pred_test = evaluation_test['most_similar_label']
print('F1 score:',f1_score(y_true_test, y_pred_test, average='micro'))

F1 score: 0.35
