# Introduction

The [previous notebook](./01_workflow_demo.ipynb) investigated the data and generated some features to use in a data science model. The idea was to take public news sources and look for patterns in the data. The next step is to use this data to train a topic modelling technique and apply it to the data.

## Project setup

In [1]:
import sys
import os

sys.path.append('../')
os.chdir("../")

In [2]:
os.environ['DATA_DIR'] = os.path.join(os.path.abspath("./"), './data')
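
Joining an absolute path with `'./data'` keeps the `./` segment, so `DATA_DIR` ends up looking like `/home/jovyan/./data`. The path still resolves correctly. If a cleaner value is preferred, `os.path.normpath` collapses the redundant segments. A small sketch, where the `/home/jovyan` base is an assumption taken from the outputs shown in this notebook:

```python
import os

# Assumed project root, taken from the paths printed in this notebook
base = "/home/jovyan"

# Same construction as in the cell above: the './' segment is kept verbatim
raw = os.path.join(base, "./data")

# normpath collapses './' and duplicate separators without touching the disk
clean = os.path.normpath(raw)

print(raw)
print(clean)
```

Both strings point at the same directory, so this is a cosmetic choice that mainly makes logged paths easier to read.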

In [3]:
!wget -O - https://install.python-poetry.org | python3 -

--2022-12-01 14:38:35--  https://install.python-poetry.org/
Resolving install.python-poetry.org (install.python-poetry.org)... 76.76.21.164, 76.76.21.22
Connecting to install.python-poetry.org (install.python-poetry.org)|76.76.21.164|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28457 (28K) [text/plain]
Saving to: ‘STDOUT’


2022-12-01 14:38:36 (11.3 MB/s) - written to stdout [28457/28457]

Retrieving Poetry metadata

The latest version (1.2.2) is already installed.


In [4]:
import os

os.environ['PATH'] = f"{os.environ['PATH']}:/home/jovyan/.local/bin"

In [5]:
!poetry env use system
!poetry env remove --all
!poetry config virtualenvs.create false

## Install dependencies

In [6]:
!poetry install --no-cache -n

Skipping virtualenv creation, as specified in config file.
Installing dependencies from lock file

Package operations: 21 installs, 17 updates, 0 removals

  • Updating six (1.16.0 /home/conda/feedstock_root/build_artifacts/six_1620240208055/work -> 1.16.0): Pending...
  • Updating six (1.16.0 /home/conda/feedstock_root/build_artifacts/six_1620240208055/work -> 1.16.0): Installing...
  • Updating six (1.16.0 /home/conda/feedstock_root/build_artifacts/six_1620240208055/work -> 1.16.0)
  • Installing jmespath (1.0.1): Pending...


## Loading data

A few parameters are needed to load the data produced earlier.

In [17]:
import os

os.environ['FILE_URL'] = os.path.join(os.path.abspath("./"), 'data/news/news.parquet')
os.environ['FILE_URL']

'/home/jovyan/data/news/news.parquet'

In [18]:
import os

import pandas as pd

filepath = os.environ['FILE_URL']
df = pd.read_parquet(filepath, engine='fastparquet')
df

Unnamed: 0,TITLE,CONTENT,LINK,PUBLISHED,PUBLISHED_DATE
0,"Com surto na região Norte, campanha contra o s...","<img src=""https://s2.glbimg.com/HA6WyXj_0FWazn...",https://g1.globo.com/ap/amapa/noticia/2018/07/...,2018-07-23T22:40:12+00:00,2018-07-23
1,Polícia Civil de Juiz de Fora recebe denúncia ...,Duas firmas são de São Paulo e uma de Belo Hor...,https://g1.globo.com/mg/zona-da-mata/noticia/2...,2018-07-23T21:01:26+00:00,2018-07-23
2,Segunda edição do ‘Encontro de Bateristas do T...,"<img src=""https://s2.glbimg.com/ygBGnNNFsGwUPA...",https://g1.globo.com/mg/triangulo-mineiro/noti...,2018-07-23T20:45:48+00:00,2018-07-23
3,Comissariado do AP fiscaliza embarque de menor...,"<img src=""https://s2.glbimg.com/3ywfLG3crqb2_I...",https://g1.globo.com/ap/amapa/noticia/2018/07/...,2018-07-23T20:36:37+00:00,2018-07-23
4,Ceará tem 66 municípios com emergência reconhe...,"<img src=""https://s2.glbimg.com/aktFEMxx84AuWQ...",https://g1.globo.com/ce/ceara/noticia/2018/07/...,2018-07-23T20:13:49+00:00,2018-07-23
...,...,...,...,...,...
175,Por que a esquadria de alumínio se tornou um d...,"<img src=""https://s2.glbimg.com/k5A5kFIGvVX5Ib...",https://g1.globo.com/ba/bahia/especial-publici...,2022-11-28T18:40:41+00:00,2022-11-28
176,Perfil da PM em SC curte posts antidemocrático...,"<img src=""https://s2.glbimg.com/hHaPhjo7AiNLn-...",https://g1.globo.com/sc/santa-catarina/noticia...,2022-11-28T18:39:29+00:00,2022-11-28
177,Ponte que liga Toca da Onça a Rio Bonito é rec...,"<img src=""https://s2.glbimg.com/iA1ARsUvBzuFOH...",https://g1.globo.com/rj/regiao-serrana/noticia...,2022-11-28T18:39:25+00:00,2022-11-28
178,Menina denuncia estupro à professora e ex-padr...,"<img src=""https://s2.glbimg.com/DsJJ0ftZ_UF0g9...",https://g1.globo.com/go/goias/noticia/2022/11/...,2022-11-28T18:38:07+00:00,2022-11-28


# Dependencies and workflows

For the next step we will use the `IncludeFile` feature of Metaflow workflows, so that this data is stored as an artifact in Metaflow storage. This helps with reproducibility, which we can investigate later.

## Loading step

The first step just loads the data from the directory and puts it inside the workflow. Following the [Data Science cookiecutter template](https://drivendata.github.io/cookiecutter-data-science/), the feature building step lives in the `src/features` folder.

In [9]:
!mkdir -p src/features
!touch src/features/__init__.py

In [10]:
%%writefile src/features/build_features.py
import os
import pandas as pd

from metaflow import FlowSpec, Parameter, step


FILE_URL = os.environ['FILE_URL']

class FeatureBuildFlow(FlowSpec):
    news = FILE_URL
    
    @step
    def start(self):
        self.results = pd.read_parquet(self.news, engine='fastparquet')
        self.next(self.end)
        
    @step
    def end(self):
        print("Done!")
        

if __name__ == '__main__':
    FeatureBuildFlow()

Overwriting src/features/build_features.py


In [11]:
!python src/features/build_features.py run

[35m[1mMetaflow 2.7.14[0m[35m[22m executing [0m[31m[1mFeatureBuildFlow[0m[35m[22m[0m[35m[22m for [0m[31m[1muser:jovyan[0m[35m[22m[K[0m[35m[22m[0m
[35m[22mValidating your flow...[K[0m[35m[22m[0m
[32m[1m    The graph looks good![K[0m[32m[1m[0m
[35m[22mRunning pylint...[K[0m[35m[22m[0m
[32m[1m    Pylint is happy![K[0m[32m[1m[0m
[35m2022-11-30 18:10:52.224 [0m[1mWorkflow starting (run-id 1669831852155982):[0m
[35m2022-11-30 18:10:52.253 [0m[32m[1669831852155982/start/1 (pid 225)] [0m[1mTask is starting.[0m
[35m2022-11-30 18:10:52.948 [0m[32m[1669831852155982/start/1 (pid 225)] [0m[1mTask finished successfully.[0m
[35m2022-11-30 18:10:52.985 [0m[32m[1669831852155982/end/2 (pid 231)] [0m[1mTask is starting.[0m
[35m2022-11-30 18:10:53.458 [0m[32m[1669831852155982/end/2 (pid 231)] [0m[22mDone![0m
[35m2022-11-30 18:10:53.608 [0m[32m[1669831852155982/end/2 (pid 231)] [0m[1mTask finished successfully.[0m
[35m2

In [14]:
from metaflow import Flow
fl = Flow('FeatureBuildFlow')
runs_list = list(fl)
runs_list

[Run('FeatureBuildFlow/1669832474162958'),
 Run('FeatureBuildFlow/1669832118021600'),
 Run('FeatureBuildFlow/1669831889593254'),
 Run('FeatureBuildFlow/1669831852155982'),
 Run('FeatureBuildFlow/1669816256226342'),
 Run('FeatureBuildFlow/1669816224447863'),
 Run('FeatureBuildFlow/1669815068030614'),
 Run('FeatureBuildFlow/1669815009964566'),
 Run('FeatureBuildFlow/1669814927431930'),
 Run('FeatureBuildFlow/1669814088506057')]
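
The run IDs listed here look like Unix timestamps in microseconds: `1669831852155982` corresponds to the `2022-11-30 18:10:52` start time in the run log, assuming the log timestamps are UTC. This is an observation about these local runs, not a documented guarantee. The hypothetical helper `run_id_to_datetime` below decodes such an ID, which can help when picking a run from the list:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def run_id_to_datetime(run_id):
    """Interpret a run ID as microseconds since the Unix epoch (assumption)."""
    # Integer microseconds avoid the rounding a float division could introduce
    return EPOCH + timedelta(microseconds=int(run_id))


started = run_id_to_datetime("1669831852155982")
print(started.isoformat())
```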

In [15]:
from metaflow import Run
r = fl.latest_run
df_r = r.data.results
df_r

Unnamed: 0,TITLE,CONTENT,LINK,PUBLISHED,PUBLISHED_DATE,token_set
0,"Com surto na região Norte, campanha contra o s...","<img src=""https://s2.glbimg.com/HA6WyXj_0FWazn...",https://g1.globo.com/ap/amapa/noticia/2018/07/...,2018-07-23T22:40:12+00:00,2018-07-23,"[vacinação, voltada, público, infantil, aconte..."
1,Polícia Civil de Juiz de Fora recebe denúncia ...,Duas firmas são de São Paulo e uma de Belo Hor...,https://g1.globo.com/mg/zona-da-mata/noticia/2...,2018-07-23T21:01:26+00:00,2018-07-23,"[duas, firmas, paulo, belo, horizonte, agora, ..."
2,Segunda edição do ‘Encontro de Bateristas do T...,"<img src=""https://s2.glbimg.com/ygBGnNNFsGwUPA...",https://g1.globo.com/mg/triangulo-mineiro/noti...,2018-07-23T20:45:48+00:00,2018-07-23,"[170, bateristas, esperados, evento, pode, tor..."
3,Comissariado do AP fiscaliza embarque de menor...,"<img src=""https://s2.glbimg.com/3ywfLG3crqb2_I...",https://g1.globo.com/ap/amapa/noticia/2018/07/...,2018-07-23T20:36:37+00:00,2018-07-23,"[fluxo, passageiros, aumenta, principais, port..."
4,Ceará tem 66 municípios com emergência reconhe...,"<img src=""https://s2.glbimg.com/aktFEMxx84AuWQ...",https://g1.globo.com/ce/ceara/noticia/2018/07/...,2018-07-23T20:13:49+00:00,2018-07-23,"[nesta, segunda-feira, reconhecimento, municíp..."
...,...,...,...,...,...,...
175,Por que a esquadria de alumínio se tornou um d...,"<img src=""https://s2.glbimg.com/k5A5kFIGvVX5Ib...",https://g1.globo.com/ba/bahia/especial-publici...,2022-11-28T18:40:41+00:00,2022-11-28,"[aliando, versatilidade, design, resistência, ..."
176,Perfil da PM em SC curte posts antidemocrático...,"<img src=""https://s2.glbimg.com/hHaPhjo7AiNLn-...",https://g1.globo.com/sc/santa-catarina/noticia...,2022-11-28T18:39:29+00:00,2022-11-28,"[polícia, militar, rodoviária, instaurou, proc..."
177,Ponte que liga Toca da Onça a Rio Bonito é rec...,"<img src=""https://s2.glbimg.com/iA1ARsUvBzuFOH...",https://g1.globo.com/rj/regiao-serrana/noticia...,2022-11-28T18:39:25+00:00,2022-11-28,"[ponte, existia, local, levada, força, água, r..."
178,Menina denuncia estupro à professora e ex-padr...,"<img src=""https://s2.glbimg.com/DsJJ0ftZ_UF0g9...",https://g1.globo.com/go/goias/noticia/2022/11/...,2022-11-28T18:38:07+00:00,2022-11-28,"[garota, anos, disse, abusada, desde, anos, se..."


## Tokenization step

Now that we are able to load news from other sources, let's run a simple feature generation process. Since the goal is to tokenize the results, we add a new step to the workflow that creates a tokenized version of the content. The final goal is to apply an LDA (Latent Dirichlet Allocation) transformation.

In [20]:
!poetry add nltk

Skipping virtualenv creation, as specified in config file.
The following packages are already present in the pyproject.toml and will be skipped:

  • nltk

If you want to update it to the latest compatible version, you can use `poetry update package`.
If you prefer to upgrade it to the latest available version, you can use `poetry add package@latest`.

Nothing to add.


In [21]:
!poetry install

Skipping virtualenv creation, as specified in config file.
Installing dependencies from lock file

No dependencies to install or update


### Download nltk features

Download the NLTK resources that will be used for feature generation.

In [18]:
%%writefile src/features/build_features.py
import os
import nltk

import pandas as pd

from metaflow import FlowSpec, Parameter, step


FILE_URL = os.environ['FILE_URL']

class FeatureBuildFlow(FlowSpec):
    news = FILE_URL
    
    @step
    def start(self):
        self.results = pd.read_parquet(self.news, engine='fastparquet')
        self.next(self.download_nltk)
        
    @step
    def download_nltk(self):
        os.environ['NLTK_DATA'] = os.path.join(os.environ['DATA_DIR'], './nltk_data')
        os.makedirs(os.environ['NLTK_DATA'], exist_ok=True)
        nltk.data.path = [os.environ['NLTK_DATA']]
        nltk.download('all')
        self.next(self.end)
    
    @step
    def end(self):
        print("Done!")
        

if __name__ == '__main__':
    FeatureBuildFlow()

Overwriting src/features/build_features.py


In [19]:
!python src/features/build_features.py run

[35m[1mMetaflow 2.7.14[0m[35m[22m executing [0m[31m[1mFeatureBuildFlow[0m[35m[22m[0m[35m[22m for [0m[31m[1muser:jovyan[0m[35m[22m[K[0m[35m[22m[0m
[35m[22mValidating your flow...[K[0m[35m[22m[0m
[32m[1m    The graph looks good![K[0m[32m[1m[0m
[35m[22mRunning pylint...[K[0m[35m[22m[0m
[32m[1m    Pylint is happy![K[0m[32m[1m[0m
[35m2022-11-30 18:15:18.094 [0m[1mWorkflow starting (run-id 1669832118021600):[0m
[35m2022-11-30 18:15:18.123 [0m[32m[1669832118021600/start/1 (pid 404)] [0m[1mTask is starting.[0m
[35m2022-11-30 18:15:18.891 [0m[32m[1669832118021600/start/1 (pid 404)] [0m[1mTask finished successfully.[0m
[35m2022-11-30 18:15:18.932 [0m[32m[1669832118021600/download_nltk/2 (pid 410)] [0m[1mTask is starting.[0m
[35m2022-11-30 18:15:19.478 [0m[32m[1669832118021600/download_nltk/2 (pid 410)] [0m[22m['/home/jovyan/./data/./nltk_data'][0m
[35m2022-11-30 18:15:19.620 [0m[32m[1669832118021600/download_nltk

### Tokenization

Generate tokens from the text and save them back to the dataset as a new `token_set` column.
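
The cleaning logic can be tried in isolation before it is wired into the flow. The sketch below is a simplified stand-in, not the flow's implementation: it swaps `nltk.word_tokenize` for a plain regular expression and uses a tiny, hypothetical stop-word list instead of `stopwords.words('portuguese')`, so that it runs without any NLTK downloads. The sample sentence is made up for illustration.

```python
import re
import string

# Hypothetical, tiny stop-word list for illustration only; the flow uses
# nltk.corpus.stopwords.words('portuguese') plus punctuation instead.
STOP = {"a", "o", "e", "de", "do", "da", "em", "um", "que", "com"} | set(string.punctuation)

# Same pattern as in the flow: HTML tags and character entities
HTML_RE = re.compile(r"<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});")


def preprocess(text):
    """Lower-case, strip HTML, split into words and drop short or stop words."""
    text = HTML_RE.sub("", text.lower())
    # Regex tokenizer (assumption): runs of word characters, hyphens allowed inside
    tokens = re.findall(r"\w+(?:-\w+)*", text)
    return [t for t in tokens if t not in STOP and len(t) > 2]


sample = '<img src="https://example.com/a.jpg" /> A campanha de vacinação &amp; o público infantil.'
tokens = preprocess(sample)
print(tokens)
```

The regex tokenizer will not split text exactly as NLTK does, so the resulting tokens can differ from those in the `token_set` column.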

In [22]:
%%writefile src/features/build_features.py
import os
import string
import re

import nltk

import pandas as pd

from nltk import word_tokenize
from nltk.corpus import stopwords

from metaflow import FlowSpec, Parameter, step


FILE_URL = os.environ['FILE_URL']

class FeatureBuildFlow(FlowSpec):
    news = FILE_URL
    
    @step
    def start(self):
        self.results = pd.read_parquet(self.news, engine='fastparquet')
        self.next(self.download_nltk)
        
    @step
    def download_nltk(self):
        os.environ['NLTK_DATA'] = os.path.join(os.environ['DATA_DIR'], './nltk_data')
        os.makedirs(os.environ['NLTK_DATA'], exist_ok=True)
        nltk.data.path = [os.environ['NLTK_DATA']]
        nltk.download('all')
        self.next(self.tokenization)
        
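    @step
    def tokenization(self):
        # Each step runs in its own process, so make sure the download
        # directory is also on NLTK's search path here
        nltk.data.path.append(os.path.join(os.environ['DATA_DIR'], './nltk_data'))

        # Stop words: Portuguese stop words, punctuation and a few extra tokens
        stop = set(stopwords.words('portuguese') + list(string.punctuation))
        stop.update(['http', 'pro', 'https', 't.', 'co'])

        def preprocess(words):
            # Remove HTML tags and character entities
            words = re.sub(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});', '', words)
            tokens = word_tokenize(words)
            # Drop stop words and punctuation
            tokens = [word for word in tokens if word not in stop]
            # Keep tokens that contain a word character and are longer than two characters
            tokens = [word for word in tokens if re.search(r'\w+', word) and len(word) > 2]
            return tokens

        # Lower-case the content of each row and tokenize it
        self.results['token_set'] = self.results.apply(lambda row: preprocess(row.CONTENT.lower()), axis=1)
        print("Tokenization finished!")
        self.next(self.end)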
    
    @step
    def end(self):
        print("Done!")
        

if __name__ == '__main__':
    FeatureBuildFlow()

Overwriting src/features/build_features.py


In [23]:
!python src/features/build_features.py run

[35m[1mMetaflow 2.7.14[0m[35m[22m executing [0m[31m[1mFeatureBuildFlow[0m[35m[22m[0m[35m[22m for [0m[31m[1muser:jovyan[0m[35m[22m[K[0m[35m[22m[0m
[35m[22mValidating your flow...[K[0m[35m[22m[0m
[32m[1m    The graph looks good![K[0m[32m[1m[0m
[35m[22mRunning pylint...[K[0m[35m[22m[0m
[32m[1m    Pylint is happy![K[0m[32m[1m[0m
[35m2022-11-30 18:21:14.238 [0m[1mWorkflow starting (run-id 1669832474162958):[0m
[35m2022-11-30 18:21:14.266 [0m[32m[1669832474162958/start/1 (pid 612)] [0m[1mTask is starting.[0m
[35m2022-11-30 18:21:15.052 [0m[32m[1669832474162958/start/1 (pid 612)] [0m[1mTask finished successfully.[0m
[35m2022-11-30 18:21:15.087 [0m[32m[1669832474162958/download_nltk/2 (pid 618)] [0m[1mTask is starting.[0m
[35m2022-11-30 18:21:15.782 [0m[32m[1669832474162958/download_nltk/2 (pid 618)] [0m[22m[nltk_data] Downloading collection 'all'[0m
[35m2022-11-30 18:21:15.782 [0m[32m[1669832474162958/download

In [28]:
from metaflow import Flow
fl = Flow('FeatureBuildFlow')
runs_list = list(fl)
runs_list

[Run('FeatureBuildFlow/1669832474162958'),
 Run('FeatureBuildFlow/1669832118021600'),
 Run('FeatureBuildFlow/1669831889593254'),
 Run('FeatureBuildFlow/1669831852155982'),
 Run('FeatureBuildFlow/1669816256226342'),
 Run('FeatureBuildFlow/1669816224447863'),
 Run('FeatureBuildFlow/1669815068030614'),
 Run('FeatureBuildFlow/1669815009964566'),
 Run('FeatureBuildFlow/1669814927431930'),
 Run('FeatureBuildFlow/1669814088506057')]

In [29]:
from metaflow import Run
r = fl.latest_run
df_r = r.data.results
df_r

Unnamed: 0,TITLE,CONTENT,LINK,PUBLISHED,PUBLISHED_DATE,token_set
0,"Com surto na região Norte, campanha contra o s...","<img src=""https://s2.glbimg.com/HA6WyXj_0FWazn...",https://g1.globo.com/ap/amapa/noticia/2018/07/...,2018-07-23T22:40:12+00:00,2018-07-23,"[vacinação, voltada, público, infantil, aconte..."
1,Polícia Civil de Juiz de Fora recebe denúncia ...,Duas firmas são de São Paulo e uma de Belo Hor...,https://g1.globo.com/mg/zona-da-mata/noticia/2...,2018-07-23T21:01:26+00:00,2018-07-23,"[duas, firmas, paulo, belo, horizonte, agora, ..."
2,Segunda edição do ‘Encontro de Bateristas do T...,"<img src=""https://s2.glbimg.com/ygBGnNNFsGwUPA...",https://g1.globo.com/mg/triangulo-mineiro/noti...,2018-07-23T20:45:48+00:00,2018-07-23,"[170, bateristas, esperados, evento, pode, tor..."
3,Comissariado do AP fiscaliza embarque de menor...,"<img src=""https://s2.glbimg.com/3ywfLG3crqb2_I...",https://g1.globo.com/ap/amapa/noticia/2018/07/...,2018-07-23T20:36:37+00:00,2018-07-23,"[fluxo, passageiros, aumenta, principais, port..."
4,Ceará tem 66 municípios com emergência reconhe...,"<img src=""https://s2.glbimg.com/aktFEMxx84AuWQ...",https://g1.globo.com/ce/ceara/noticia/2018/07/...,2018-07-23T20:13:49+00:00,2018-07-23,"[nesta, segunda-feira, reconhecimento, municíp..."
...,...,...,...,...,...,...
175,Por que a esquadria de alumínio se tornou um d...,"<img src=""https://s2.glbimg.com/k5A5kFIGvVX5Ib...",https://g1.globo.com/ba/bahia/especial-publici...,2022-11-28T18:40:41+00:00,2022-11-28,"[aliando, versatilidade, design, resistência, ..."
176,Perfil da PM em SC curte posts antidemocrático...,"<img src=""https://s2.glbimg.com/hHaPhjo7AiNLn-...",https://g1.globo.com/sc/santa-catarina/noticia...,2022-11-28T18:39:29+00:00,2022-11-28,"[polícia, militar, rodoviária, instaurou, proc..."
177,Ponte que liga Toca da Onça a Rio Bonito é rec...,"<img src=""https://s2.glbimg.com/iA1ARsUvBzuFOH...",https://g1.globo.com/rj/regiao-serrana/noticia...,2022-11-28T18:39:25+00:00,2022-11-28,"[ponte, existia, local, levada, força, água, r..."
178,Menina denuncia estupro à professora e ex-padr...,"<img src=""https://s2.glbimg.com/DsJJ0ftZ_UF0g9...",https://g1.globo.com/go/goias/noticia/2022/11/...,2022-11-28T18:38:07+00:00,2022-11-28,"[garota, anos, disse, abusada, desde, anos, se..."


## Tagging and publishing

The goal of this step is to provide features for the next step (training). To do that we are going to use the [tags provided by Metaflow](https://docs.metaflow.org/scaling/tagging). Following [semantic versioning conventions](https://semver.org/), the run that produced the features will have two tags:

* Numbered version: `0.0.1`
* Reference to latest version: `latest`

These tags will be used in the next step to pin down which version of the features is used for training.

In [18]:
from metaflow import Flow

fl = Flow('FeatureBuildFlow')
r = fl.latest_run
r.add_tags(['0.0.1', 'latest'])
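
When the features change, the numbered tag has to move forward. A minimal sketch of that bookkeeping, assuming plain `MAJOR.MINOR.PATCH` strings with no pre-release or build suffixes. `bump_version` is a hypothetical helper and not part of Metaflow:

```python
def bump_version(version, part="patch"):
    """Return the next semantic version after `version` (MAJOR.MINOR.PATCH only)."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version part: {part!r}")


next_tag = bump_version("0.0.1")
print(next_tag)
```

The resulting string could then be passed to `add_tags` on the new run. Note that adding `latest` to a new run does not remove it from older runs, so earlier runs would need that tag removed for `latest` to stay unambiguous.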