# Training step

Now that we have all the basic features, let's build a simple topic model to discover what these news articles are about. For this experiment we will use topic modeling with LDA (Latent Dirichlet Allocation), a simple unsupervised model that will help us investigate classification features in the data. An implementation is available in the Gensim library.
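
LDA does not read raw text: each document is first turned into a bag of words, that is, a list of `(token_id, count)` pairs. The sketch below illustrates that representation in plain Python. The helper names `build_vocabulary` and `to_bow` are hypothetical and the ids are only illustrative; Gensim's own `Dictionary` may number tokens differently.

```python
from collections import Counter


def build_vocabulary(documents):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab = {}
    for tokens in documents:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs, skipping tokens missing from the vocabulary."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())


# Two tiny, made-up token lists standing in for the `token_set` column.
docs = [
    ["ponte", "rio", "água", "ponte"],
    ["polícia", "rio", "denúncia"],
]
vocab = build_vocabulary(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(bows)
```

Word order is discarded and only counts remain, which is why the quality of the token lists produced by the feature step matters so much for the topics we get.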

## Project setup

In [1]:
import sys
import os

sys.path.append('../')
os.chdir("../")

In [2]:
os.environ['DATA_DIR'] = os.path.join(os.path.abspath("./"), './data')

In [3]:
!wget -O - https://install.python-poetry.org | python3 -

--2022-12-01 15:10:36--  https://install.python-poetry.org/
Resolving install.python-poetry.org (install.python-poetry.org)... 76.76.21.98, 76.76.21.123
Connecting to install.python-poetry.org (install.python-poetry.org)|76.76.21.98|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28457 (28K) [text/plain]
Saving to: ‘STDOUT’


2022-12-01 15:10:36 (22.4 MB/s) - written to stdout [28457/28457]

Retrieving Poetry metadata

The latest version (1.2.2) is already installed.


In [4]:
import os

os.environ['PATH'] = f"{os.environ['PATH']}:/home/jovyan/.local/bin"

In [5]:
!poetry env use system
!poetry env remove --all
!poetry config virtualenvs.create false

### Install dependencies

In [6]:
!poetry install --no-cache -n

Skipping virtualenv creation, as specified in config file.
Installing dependencies from lock file

No dependencies to install or update


### Project structure

Since we are still following the recommended [Data Science cookiecutter template](https://drivendata.github.io/cookiecutter-data-science/), the training step will live in the `src/models` folder.

In [11]:
!mkdir -p src/models
!touch src/models/__init__.py

## Feature load step

Following the model composition strategy discussed in the previous notebooks, we are going to load the artifacts produced by `FeatureBuildFlow`. To keep track of features and experiments, the loading step needs to know which version of the artifacts to load, so a version tag is used as input for this training step.

**P.S.**: the training and feature building steps must use the same tag. Training will fail if there is no feature build run with the same tag.
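
To make the tag rule concrete, here is a minimal sketch of the lookup in plain Python. It does not call Metaflow: the run records are hypothetical dictionaries, the helper name `latest_run_with_tag` is ours, and we assume the runs are listed newest first.

```python
def latest_run_with_tag(runs, tag):
    """Return the first run carrying `tag`; `runs` is assumed to be ordered newest first."""
    for run in runs:
        if tag in run["tags"]:
            return run
    # An explicit error is easier to diagnose than an IndexError on an empty list.
    raise LookupError(f"No feature build run found with tag {tag!r}")


# Hypothetical run records standing in for feature build runs.
runs = [
    {"id": "3", "tags": {"0.0.2"}},
    {"id": "2", "tags": {"0.0.1"}},
    {"id": "1", "tags": {"0.0.1"}},
]

selected = latest_run_with_tag(runs, "0.0.1")
print(selected["id"])
```

Raising a descriptive error when no run matches is a design choice of this sketch; indexing an empty list, as in a bare `list(...)[0]`, would also fail, only with a less helpful message.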

In [36]:
%%writefile src/models/train_model.py
import os
import pandas as pd

from metaflow import FlowSpec, Parameter, step, Flow, current

class TrainModelFlow(FlowSpec):
    
    @step
    def start(self):
        version = list(current.tags)[0]
        print(f"Loading features from version {version}")
        fl = Flow('FeatureBuildFlow')
        r = list(fl.runs(version))[0]
        self.results = r.data.results
        self.next(self.end)
        
    @step
    def end(self):
        print("Done!")
        

if __name__ == '__main__':
    TrainModelFlow()

Overwriting src/models/train_model.py


In [37]:
!python src/models/train_model.py run --tag 0.0.1

Metaflow 2.7.14 executing TrainModelFlow for user:jovyan
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint is happy!
2022-12-01 16:00:18.164 Workflow starting (run-id 1669921218093386):
2022-12-01 16:00:18.190 [1669921218093386/start/1 (pid 3007)] Task is starting.
2022-12-01 16:00:18.608 [1669921218093386/start/1 (pid 3007)] Loading features from version 0.0.1
2022-12-01 16:00:18.799 [1669921218093386/start/1 (pid 3007)] Task finished successfully.
2022-12-01 16:00:18.834 [1669921218093386/end/2 (pid 3017)] Task is starting.
2022-12-01 16:00:19.246 [1669921218093386/end/2 (pid 3017)] Do

In [38]:
from metaflow import Flow
fl = Flow('TrainModelFlow')
r = fl.latest_run
df_r = r.data.results
df_r

Unnamed: 0,TITLE,CONTENT,LINK,PUBLISHED,PUBLISHED_DATE,token_set
0,"Com surto na região Norte, campanha contra o s...","<img src=""https://s2.glbimg.com/HA6WyXj_0FWazn...",https://g1.globo.com/ap/amapa/noticia/2018/07/...,2018-07-23T22:40:12+00:00,2018-07-23,"[vacinação, voltada, público, infantil, aconte..."
1,Polícia Civil de Juiz de Fora recebe denúncia ...,Duas firmas são de São Paulo e uma de Belo Hor...,https://g1.globo.com/mg/zona-da-mata/noticia/2...,2018-07-23T21:01:26+00:00,2018-07-23,"[duas, firmas, paulo, belo, horizonte, agora, ..."
2,Segunda edição do ‘Encontro de Bateristas do T...,"<img src=""https://s2.glbimg.com/ygBGnNNFsGwUPA...",https://g1.globo.com/mg/triangulo-mineiro/noti...,2018-07-23T20:45:48+00:00,2018-07-23,"[170, bateristas, esperados, evento, pode, tor..."
3,Comissariado do AP fiscaliza embarque de menor...,"<img src=""https://s2.glbimg.com/3ywfLG3crqb2_I...",https://g1.globo.com/ap/amapa/noticia/2018/07/...,2018-07-23T20:36:37+00:00,2018-07-23,"[fluxo, passageiros, aumenta, principais, port..."
4,Ceará tem 66 municípios com emergência reconhe...,"<img src=""https://s2.glbimg.com/aktFEMxx84AuWQ...",https://g1.globo.com/ce/ceara/noticia/2018/07/...,2018-07-23T20:13:49+00:00,2018-07-23,"[nesta, segunda-feira, reconhecimento, municíp..."
...,...,...,...,...,...,...
175,Por que a esquadria de alumínio se tornou um d...,"<img src=""https://s2.glbimg.com/k5A5kFIGvVX5Ib...",https://g1.globo.com/ba/bahia/especial-publici...,2022-11-28T18:40:41+00:00,2022-11-28,"[aliando, versatilidade, design, resistência, ..."
176,Perfil da PM em SC curte posts antidemocrático...,"<img src=""https://s2.glbimg.com/hHaPhjo7AiNLn-...",https://g1.globo.com/sc/santa-catarina/noticia...,2022-11-28T18:39:29+00:00,2022-11-28,"[polícia, militar, rodoviária, instaurou, proc..."
177,Ponte que liga Toca da Onça a Rio Bonito é rec...,"<img src=""https://s2.glbimg.com/iA1ARsUvBzuFOH...",https://g1.globo.com/rj/regiao-serrana/noticia...,2022-11-28T18:39:25+00:00,2022-11-28,"[ponte, existia, local, levada, força, água, r..."
178,Menina denuncia estupro à professora e ex-padr...,"<img src=""https://s2.glbimg.com/DsJJ0ftZ_UF0g9...",https://g1.globo.com/go/goias/noticia/2022/11/...,2022-11-28T18:38:07+00:00,2022-11-28,"[garota, anos, disse, abusada, desde, anos, se..."


## Training step

In [16]:
# FIXME: there is an install deprecation notice on gensim that prevents us using poetry to install it: https://github.com/RaRe-Technologies/gensim/issues/3362

!poetry run python -m pip install gensim --disable-pip-version-check --no-deps --no-cache-dir --no-binary gensim

Skipping virtualenv creation, as specified in config file.
DEPRECATION: --no-binary currently disables reading from the cache of locally built wheels. In the future --no-binary will not influence the wheel cache. pip 23.1 will enforce this behaviour change. A possible replacement is to use the --no-cache-dir option. You can use the flag --use-feature=no-binary-enable-wheel-cache to test the upcoming behaviour. Discussion can be found at https://github.com/pypa/pip/issues/11453


In [17]:
!poetry add gensim

Skipping virtualenv creation, as specified in config file.
The following packages are already present in the pyproject.toml and will be skipped:

  • gensim

If you want to update it to the latest compatible version, you can use `poetry update package`.
If you prefer to upgrade it to the latest available version, you can use `poetry add package@latest`.

Nothing to add.


In [52]:
%%writefile src/models/train_model.py
import os
import pandas as pd

import gensim

from metaflow import FlowSpec, Parameter, step, Flow, current

class TrainModelFlow(FlowSpec):
    # Explicit type so that `--topics 10` reaches gensim as an int, not a string.
    topics = Parameter('topics', type=int, required=True)
    iterations = Parameter('iterations', default=100)

    @property
    def version(self):
        return list(current.tags)[0]
    
    @step
    def start(self):
        print(f"Loading features from version {self.version}")
        fl = Flow('FeatureBuildFlow')
        r = list(fl.runs(self.version))[0]
        self.results = r.data.results
        self.next(self.train_lda)

    @step 
    def train_lda(self):
        print(f"Training LDA model version {self.version}...")
        dictionary = gensim.corpora.Dictionary(self.results['token_set'])
        self.corpus_train = self.results['token_set'].apply(dictionary.doc2bow)
        self.lda_model = gensim.models.LdaMulticore(self.corpus_train, id2word=dictionary, iterations=self.iterations, num_topics=self.topics, random_state=42)
        self.next(self.end)
    
    @step
    def end(self):
        print("Done!")
        

if __name__ == '__main__':
    TrainModelFlow()

Overwriting src/models/train_model.py


### Registering parameters

Now we can use Metaflow's `Parameter` class to register the model parameters. This will be important when we compare experiments later.
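
One detail to keep in mind when registering parameters: values passed on the command line arrive as text, so a numeric parameter generally needs an explicit type (or a numeric default) to be converted. The sketch below shows the effect with the standard library's `argparse`, used here only as a stand-in and not as Metaflow's API; the `--untyped` option and the `parse_params` helper are hypothetical.

```python
import argparse


def parse_params(argv):
    """Parse training parameters; typed options are converted, untyped ones stay strings."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--topics", type=int, required=True)
    parser.add_argument("--iterations", type=int, default=100)
    parser.add_argument("--untyped")  # no type given: the value is kept as text
    return parser.parse_args(argv)


params = parse_params(["--topics", "10", "--untyped", "10"])
print(type(params.topics).__name__, type(params.untyped).__name__)
```

A string such as `"10"` passed where a model expects a number of topics typically surfaces as a confusing error deep inside the training call, so it is worth declaring the type where the parameter is defined.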

In [53]:
!python src/models/train_model.py run --tag 0.0.1 --topics 10 --iterations 100

Metaflow 2.7.14 executing TrainModelFlow for user:jovyan
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint is happy!
2022-12-01 16:12:47.002 Workflow starting (run-id 1669921966934743):
2022-12-01 16:12:47.029 [1669921966934743/start/1 (pid 3551)] Task is starting.
2022-12-01 16:12:47.625 [1669921966934743/start/1 (pid 3551)] Loading features from version 0.0.1
2022-12-01 16:12:47.875 [1669921966934743/start/1 (pid 3551)] Task finished successfully.
2022-12-01 16:12:47.908 [1669921966934743/train_lda/2 (pid 3561)] Task is starting.
2022-12-01 16:12:48.463 [1669921966934743/train_lda/2 (pid 3561)]

### Log model performance

Topic model quality can be estimated with Gensim's `CoherenceModel`. The score is stored as a run artifact so that it can be compared with later versions.
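
To give an intuition for the `u_mass` measure, the sketch below computes a UMass-style score for a single topic from document co-occurrence counts, using the classic formulation with `+1` smoothing. It is an illustration under our own assumptions, not Gensim's implementation, which may differ in smoothing and in how pair scores are aggregated. The function names are hypothetical, and every top word is assumed to occur in at least one document.

```python
import math
from itertools import combinations


def doc_frequency(doc_sets, *words):
    """Number of documents that contain all of `words`."""
    return sum(1 for doc in doc_sets if all(w in doc for w in words))


def umass_topic_coherence(top_words, docs):
    """Sum log((D(w_i, w_j) + 1) / D(w_j)) over word pairs, with w_j ranked above w_i."""
    doc_sets = [set(doc) for doc in docs]
    score = 0.0
    for j, i in combinations(range(len(top_words)), 2):  # always j < i
        w_j, w_i = top_words[j], top_words[i]
        joint = doc_frequency(doc_sets, w_i, w_j)
        score += math.log((joint + 1) / doc_frequency(doc_sets, w_j))
    return score


# Made-up token lists; real input would be the `token_set` column.
docs = [
    ["ponte", "rio", "água"],
    ["ponte", "rio"],
    ["polícia", "denúncia"],
    ["polícia", "rio"],
]
coherent = umass_topic_coherence(["rio", "ponte"], docs)
mixed = umass_topic_coherence(["rio", "denúncia"], docs)
print(coherent, mixed)
```

Scores are at most slightly above zero and usually negative; values closer to zero mean the topic's top words tend to appear in the same documents, so the first topic above scores higher than the second.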

In [54]:
%%writefile src/models/train_model.py
import os
import pandas as pd

import gensim

from metaflow import FlowSpec, Parameter, step, Flow, current

class TrainModelFlow(FlowSpec):
    # Explicit type so that `--topics 10` reaches gensim as an int, not a string.
    topics = Parameter('topics', type=int, required=True)
    iterations = Parameter('iterations', default=100)
    coherence_alg = Parameter('coherence', default='u_mass')

    @property
    def version(self):
        return list(current.tags)[0]
    
    @step
    def start(self):
        print(f"Loading features from version {self.version}")
        fl = Flow('FeatureBuildFlow')
        r = list(fl.runs(self.version))[0]
        self.results = r.data.results
        self.next(self.train_lda)

    @step 
    def train_lda(self):
        print(f"Training LDA model version {self.version}...")
        dictionary = gensim.corpora.Dictionary(self.results['token_set'])
        self.corpus_train = self.results['token_set'].apply(dictionary.doc2bow)
        self.lda_model = gensim.models.LdaMulticore(self.corpus_train, id2word=dictionary, iterations=self.iterations, num_topics=self.topics, random_state=42)
        self.next(self.end)
    
    @step
    def end(self):
        cm = gensim.models.coherencemodel.CoherenceModel(model=self.lda_model, corpus=self.corpus_train, coherence=self.coherence_alg)
        self.coherence = cm.get_coherence()
        print(f"Done! Model coherence is {self.coherence}")
        

if __name__ == '__main__':
    TrainModelFlow()

Overwriting src/models/train_model.py


In [55]:
!python src/models/train_model.py run --tag 0.0.1 --topics 10 --iterations 100

Metaflow 2.7.14 executing TrainModelFlow for user:jovyan
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint is happy!
2022-12-01 16:19:39.424 Workflow starting (run-id 1669922379345632):
2022-12-01 16:19:39.455 [1669922379345632/start/1 (pid 3796)] Task is starting.
2022-12-01 16:19:40.010 [1669922379345632/start/1 (pid 3796)] Loading features from version 0.0.1
2022-12-01 16:19:40.246 [1669922379345632/start/1 (pid 3796)] Task finished successfully.
2022-12-01 16:19:40.284 [1669922379345632/train_lda/2 (pid 3806)] Task is starting.
2022-12-01 16:19:40.836 [1669922379345632/train_lda/2 (pid 3806)]