# TopicLDA use case

In [1]:
import sys
sys.path.append("../")

import numpy as np
import pandas as pd
from utils import TopicLDA

## Data

In [2]:
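data_path = '../../data/'
# Build an {id: name} lookup; each row after the header is reversed so the id becomes the key.
with open(data_path + 'categories_string.csv') as f:
    categories = dict(line.strip('\n').split(',')[::-1] for line in f.readlines()[1:])

# Descriptions indexed by Id, and the matching labels as a categorical Series.
X = pd.read_json(data_path + 'train.json').set_index('Id')
y = pd.read_csv(data_path + 'train_label.csv', index_col='Id', dtype={'Category': 'category'}).Category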
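
The category lookup built in this cell relies on each row of `categories_string.csv` holding a name followed by its id, which is what `categories['24']` returning `'architect'` implies. A minimal sketch of the same mapping with the `csv` module, run on an in-memory stand-in for the file (the header names and the `dentist` row are invented for illustration; only `architect,24` is taken from the notebook):

```python
import csv
import io

# Hypothetical stand-in for categories_string.csv.
# Only the 'architect' -> '24' pair comes from the notebook itself.
sample = io.StringIO("name,id\narchitect,24\ndentist,7\n")

reader = csv.reader(sample)
next(reader)  # skip the header row

# Swap the two fields so the id (kept as a string) becomes the key.
categories = {cat_id: name for name, cat_id in reader}

print(categories["24"])
```

Keeping the ids as strings matches the string categories later used to index `y`.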

Label of interest, with two examples whose descriptions do not clearly match it:

In [3]:
categories['24']

'architect'

In [4]:
X_arch = X[y == '24']
X_arch.description.to_list()[:2]

[' He runs a boutique design studio attending clients in the United States, Europe and Asia. His work explores the convergence of human arts and science to give shape to an ever evolving design practice. With a particular commitment towards design education, Arturo is permanently engaged with the international design and development communities and often travels the world to exchange ideas about design with other designers and developers. Prior to his current venture, Arturo worked for Microsoft in Redmond for 7 years driving design evangelism.',
 ' He focuses on cloud security, identity and access management, mobility security, and security for Microsoft platforms and solutions. Jan is based in Belgium.']

## Processing
The `TopicLDA` class takes the number of clusters (topics) to build.  
It loosely follows the scikit-learn estimator interface, with `.fit`, `.predict` and `.fit_predict` methods.  
`.model` gives access to the underlying LDA model, and `.dictionary` to its vocabulary.

In [5]:
clust = TopicLDA(2)  # two topics
# One row per description, one column per topic.
logits_arch = clust.fit_predict(X_arch.description)
logits_arch

array([[0.69572955, 0.3042705 ],
       [0.03479052, 0.9652095 ],
       [0.12912217, 0.8708778 ],
       ...,
       [0.91574144, 0.08425859],
       [0.9634624 , 0.03653761],
       [0.03820208, 0.96179795]], dtype=float32)

In [6]:
clust

[(0, '0.019*"architecture" + 0.019*"design" + 0.012*"university" + 0.009*"projects" + 0.008*"architectural" + 0.006*"new" + 0.006*"work" + 0.005*"architects" + 0.005*"also" + 0.005*"project"'), (1, '0.010*"years" + 0.008*"experience" + 0.007*"software" + 0.007*"development" + 0.006*"systems" + 0.006*"solutions" + 0.005*"management" + 0.005*"data" + 0.005*"worked" + 0.005*"business"')]

In what follows, the label '24' is kept for cluster 0 and a new label '29' is created for cluster 1.  
At classification time, a Lambda layer will aggregate the logits of '24' and '29' back into '24'.
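
The Lambda layer itself is not shown in this notebook. As a rough illustration of what such an aggregation could compute, here is a NumPy sketch that folds one logit column into another with log-sum-exp, so that after a softmax the merged class carries the sum of the two original probabilities. The function name `merge_logits`, the three-class toy batch, and the column positions are all assumptions made for the example.

```python
import numpy as np

def merge_logits(logits, keep, extra):
    """Fold column `extra` into column `keep`, then drop `extra`."""
    logits = np.asarray(logits, dtype=float)
    # log(exp(a) + exp(b)): the logit whose softmax mass is p_a + p_b.
    merged = np.logaddexp(logits[:, keep], logits[:, extra])
    out = np.delete(logits, extra, axis=1)
    # Dropping a column to the left of `keep` shifts it by one position.
    keep_idx = keep - 1 if extra < keep else keep
    out[:, keep_idx] = merged
    return out

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy batch with 3 classes; columns 1 and 2 stand in for '24' and '29'.
logits = np.array([[0.5, 1.0, 2.0],
                   [2.0, 0.0, -1.0]])
merged = merge_logits(logits, keep=1, extra=2)

p_full = softmax(logits)
p_merged = softmax(merged)
```

Summing raw logits instead would not preserve the probabilities, which is why the sketch uses `np.logaddexp`. The same expression could be wrapped in a Lambda layer of whichever framework the classifier uses.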

In [7]:
# Hard-assign each description to its highest-scoring topic, then map the topic index to a label.
y_arch = np.vectorize({0: '24', 1: '29'}.get)(np.argmax(logits_arch, axis=1))
y_arch

array(['24', '29', '29', ..., '24', '24', '29'], dtype='<U2')
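
As a sanity check on this mapping, the six rows displayed in the `fit_predict` output can be pushed through the same argmax-and-relabel step by hand. The sketch below copies those rows verbatim; the names `probs` and `topic_to_label` are introduced only for the example.

```python
import numpy as np

# The six rows visible in the fit_predict output (first three and last three).
probs = np.array([
    [0.69572955, 0.3042705],
    [0.03479052, 0.9652095],
    [0.12912217, 0.8708778],
    [0.91574144, 0.08425859],
    [0.9634624, 0.03653761],
    [0.03820208, 0.96179795],
], dtype=np.float32)

topic_to_label = {0: "24", 1: "29"}

# Pick the highest-scoring topic per row, then translate it to a label.
labels = np.array([topic_to_label[i] for i in probs.argmax(axis=1)])
print(labels)

# Each displayed row sums to 1 (up to float32 rounding).
rows_sum_to_one = np.allclose(probs.sum(axis=1), 1.0)
print(rows_sum_to_one)
```

The resulting labels agree with the visible entries of `y_arch`. Since each row sums to 1, the values behave as topic probabilities rather than unnormalised logits, despite the variable name.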

In [8]:
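# '29' has to be registered as a category before it can be assigned.
y_new = y.cat.add_categories('29')
# Overwrite the former '24' rows with their cluster-based labels (same row order as X_arch).
y_new[y_new == '24'] = y_arch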
dict(zip(y_new[3:5], X_arch.description.to_list()[:2]))

{'24': ' He runs a boutique design studio attending clients in the United States, Europe and Asia. His work explores the convergence of human arts and science to give shape to an ever evolving design practice. With a particular commitment towards design education, Arturo is permanently engaged with the international design and development communities and often travels the world to exchange ideas about design with other designers and developers. Prior to his current venture, Arturo worked for Microsoft in Redmond for 7 years driving design evangelism.',
 '29': ' He focuses on cloud security, identity and access management, mobility security, and security for Microsoft platforms and solutions. Jan is based in Belgium.'}