# Introduction

This notebook demonstrates how to set up a basic workflow for working with Jupyter notebooks.

## Install

First, install the CD4ML package in the development environment.

In [1]:
#!pip install git+https://github.com/eduardosan/cd4ml@issue


In [9]:
!python ../setup.py develop
import sys

sys.path.append('../')

running develop
running egg_info
writing cd4ml.egg-info/PKG-INFO
writing dependency_links to cd4ml.egg-info/dependency_links.txt
writing requirements to cd4ml.egg-info/requires.txt
writing top-level names to cd4ml.egg-info/top_level.txt
reading manifest file 'cd4ml.egg-info/SOURCES.txt'
writing manifest file 'cd4ml.egg-info/SOURCES.txt'
running build_ext
Creating /opt/conda/lib/python3.10/site-packages/cd4ml.egg-link (link to .)
Adding cd4ml 0.0.1 to easy-install.pth file

Installed /home/eduardo/work/notebooks
Processing dependencies for cd4ml==0.0.1
Finished processing dependencies for cd4ml==0.0.1


Install your project's dependencies for the experiment here.

In [36]:
%%writefile requirements.txt

feedparser==6.0.10
pandas==1.4.2
openpyxl==3.0.10
nltk==3.7

Overwriting requirements.txt


In [37]:
!pip install -r requirements.txt



## Data extraction

For this example we will download some news feed data to use as a dataset. Since this is an introductory example, only a few news items will be used.

### Step 1: download data

The goal is to structure this case as a multistep feature extraction. The first step is to create a function that downloads data from a service provider.

In [14]:
import feedparser
import pandas as pd

def fetch_feed_data(url):
    """Download an RSS feed and return its entries as a DataFrame."""
    blog_feed = feedparser.parse(url)

    post_list = []
    for post in blog_feed.entries:
        # Keep only the fields used later in the notebook
        post_dict = {
            "TITLE": post.title,
            "CONTENT": post.summary,
            "LINK": post.link,
            "TIME_PUBLISHED": post.published,
            # "TAGS": [tag.term for tag in post.tags],
        }
        post_list.append(post_dict)

    return pd.DataFrame(post_list)

Next, we use this function to define a task with CD4ML.

In [5]:
from cd4ml.task import Task

download = Task(name='download', task=fetch_feed_data)

Every task in CD4ML has a method named `run` that executes it with the given arguments. Let's test it with a feed we know.

In [6]:
df_g1 = download.run('https://g1.globo.com/rss/g1/')
df_g1

Unnamed: 0,TITLE,CONTENT,LINK,TIME_PUBLISHED
0,Polícia Civil prende em flagrante suspeitos de...,"<img src=""https://s2.glbimg.com/2GTBoN0HdLB-d7...",https://g1.globo.com/sp/ribeirao-preto-franca/...,"Fri, 24 Jun 2022 14:59:48 -0000"
1,"Motociclista foge de blitz, dirige na contramã...","<img src=""https://s2.glbimg.com/bLPhUKr35VELAU...",https://g1.globo.com/ce/ceara/noticia/2022/06/...,"Fri, 24 Jun 2022 14:58:16 -0000"
2,Assista ao JAP1 desta sexta-feira,"<img src=""https://s2.glbimg.com/EFYUjNezFJIcnR...",https://g1.globo.com/ap/amapa/ao-vivo/acontece...,"Fri, 24 Jun 2022 14:56:55 -0000"
3,Tom Zé põe a língua brasileira para fora da na...,"<img src=""https://s2.glbimg.com/RYJ6V9cDlNn1Q9...",https://g1.globo.com/pop-arte/musica/blog/maur...,"Fri, 24 Jun 2022 14:56:04 -0000"
4,"'Vida de Bruno foi de coragem, dedicação e fid...","<img src=""https://s2.glbimg.com/O_BvPtE0HOQ5Nl...",https://g1.globo.com/pe/pernambuco/noticia/202...,"Fri, 24 Jun 2022 14:54:34 -0000"
5,Mais de 130 pinos de cocaína em Santo Antônio ...,"<img src=""https://s2.glbimg.com/VX-_Z95Tf_KFCs...",https://g1.globo.com/mg/centro-oeste/noticia/2...,"Fri, 24 Jun 2022 14:54:15 -0000"
6,Prefeitura anuncia pacote de obras de R$ 100 m...,"<img src=""https://s2.glbimg.com/91fJdMDJDsh_Tz...",https://g1.globo.com/sp/presidente-prudente-re...,"Fri, 24 Jun 2022 14:53:10 -0000"
7,Atividade de fusões e aquisições globais desac...,"Volume de transações internacionais caiu 25,5%...",https://g1.globo.com/economia/noticia/2022/06/...,"Fri, 24 Jun 2022 14:52:29 -0000"
8,Bolsonaro é aplaudido e xingado durante São Jo...,"<img src=""https://s2.glbimg.com/j8c4LyPi9vcDkX...",https://g1.globo.com/pe/caruaru-regiao/sao-joa...,"Fri, 24 Jun 2022 14:49:48 -0000"
9,EPTV 1 Piracicaba ao vivo,"<img src=""https://s2.glbimg.com/k7wZiPzBi-9kR_...",https://g1.globo.com/sp/piracicaba-regiao/ao-v...,"Fri, 24 Jun 2022 14:49:44 -0000"
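
To make the idea of a task with a `run` method concrete, here is a minimal sketch of a wrapper that behaves the way the cells above use `Task`. The class name `SketchTask` is hypothetical and this is not CD4ML's implementation; it only assumes that a task stores a name and a callable and forwards arguments to it.

```python
class SketchTask:
    """Hypothetical stand-in for a task object; not the CD4ML source code."""

    def __init__(self, name, task):
        self.name = name  # identifier used to refer to the task
        self.task = task  # the plain Python callable being wrapped

    def run(self, *args, **kwargs):
        # Forward positional and keyword arguments to the wrapped callable
        return self.task(*args, **kwargs)


double = SketchTask(name="double", task=lambda x: x * 2)
print(double.run(21))  # 42
```

Wrapping a function this way keeps the function itself testable on its own, while the name gives a workflow something to refer to.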


### Step 2: create your first workflow

Now that we have downloaded the data, we can use it to create new features. Let's create a workflow that downloads feeds from different providers.

In [15]:
from cd4ml.task import Task
from cd4ml.workflow import Workflow

download_g1 = Task(name='download_g1', task=fetch_feed_data)
download_g1_brasil = Task(name='download_g1_brasil', task=fetch_feed_data)
download_folha = Task(name='download_folha', task=fetch_feed_data)

run_config = {
    "download_folha": {
        'params': {'url': "https://feeds.folha.uol.com.br/emcimadahora/rss091.xml"},
        'output': 'download_folha'
    },
    "download_g1": {
        'params': {'url': "https://g1.globo.com/rss/g1/"},
        'output': 'download_g1'
    },
    "download_g1_brasil": {
        'params': {'url': "https://g1.globo.com/rss/g1/brasil"},
        'output': 'download_g1_brasil'
    },
}

w = Workflow()
w.add_task(download_g1)
w.add_task(download_g1_brasil)
w.add_task(download_folha)
output = w.run(run_config=run_config, executor='local')

When defining a workflow, two pieces of information matter for each task: `params` and `output`. The first supplies the parameters passed to the task's function; the second names the variable under which the workflow returns the task's output. Let's start by checking the output of the first step.

In [16]:
output['download_g1']

Unnamed: 0,TITLE,CONTENT,LINK,TIME_PUBLISHED
0,Polícia Civil prende em flagrante suspeitos de...,"<img src=""https://s2.glbimg.com/2GTBoN0HdLB-d7...",https://g1.globo.com/sp/ribeirao-preto-franca/...,"Fri, 24 Jun 2022 14:59:48 -0000"
1,"Motociclista foge de blitz, dirige na contramã...","<img src=""https://s2.glbimg.com/bLPhUKr35VELAU...",https://g1.globo.com/ce/ceara/noticia/2022/06/...,"Fri, 24 Jun 2022 14:58:16 -0000"
2,Assista ao JAP1 desta sexta-feira,"<img src=""https://s2.glbimg.com/EFYUjNezFJIcnR...",https://g1.globo.com/ap/amapa/ao-vivo/acontece...,"Fri, 24 Jun 2022 14:56:55 -0000"
3,Tom Zé põe a língua brasileira para fora da na...,"<img src=""https://s2.glbimg.com/RYJ6V9cDlNn1Q9...",https://g1.globo.com/pop-arte/musica/blog/maur...,"Fri, 24 Jun 2022 14:56:04 -0000"
4,"'Vida de Bruno foi de coragem, dedicação e fid...","<img src=""https://s2.glbimg.com/O_BvPtE0HOQ5Nl...",https://g1.globo.com/pe/pernambuco/noticia/202...,"Fri, 24 Jun 2022 14:54:34 -0000"
5,Mais de 130 pinos de cocaína em Santo Antônio ...,"<img src=""https://s2.glbimg.com/VX-_Z95Tf_KFCs...",https://g1.globo.com/mg/centro-oeste/noticia/2...,"Fri, 24 Jun 2022 14:54:15 -0000"
6,Prefeitura anuncia pacote de obras de R$ 100 m...,"<img src=""https://s2.glbimg.com/91fJdMDJDsh_Tz...",https://g1.globo.com/sp/presidente-prudente-re...,"Fri, 24 Jun 2022 14:53:10 -0000"
7,Atividade de fusões e aquisições globais desac...,"Volume de transações internacionais caiu 25,5%...",https://g1.globo.com/economia/noticia/2022/06/...,"Fri, 24 Jun 2022 14:52:29 -0000"
8,Bolsonaro é aplaudido e xingado durante São Jo...,"<img src=""https://s2.glbimg.com/j8c4LyPi9vcDkX...",https://g1.globo.com/pe/caruaru-regiao/sao-joa...,"Fri, 24 Jun 2022 14:49:48 -0000"
9,EPTV 1 Piracicaba ao vivo,"<img src=""https://s2.glbimg.com/k7wZiPzBi-9kR_...",https://g1.globo.com/sp/piracicaba-regiao/ao-v...,"Fri, 24 Jun 2022 14:49:44 -0000"
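
The roles of `params` and `output` in `run_config` can be illustrated with a small, self-contained sketch. The helper `run_from_config` and the toy `shout` function are hypothetical, and treating `params` as keyword arguments is an assumption based on how the configuration above is written; this is not how CD4ML is implemented internally.

```python
def run_from_config(tasks, run_config):
    """Call each function with its 'params' and store the result under 'output'."""
    results = {}
    for name, func in tasks.items():
        config = run_config[name]
        params = config.get("params") or {}  # None means "no parameters"
        results[config["output"]] = func(**params)
    return results


def shout(text):
    return text.upper()


tasks = {"shout_hello": shout}
run_config = {
    "shout_hello": {"params": {"text": "hello"}, "output": "greeting"},
}
results = run_from_config(tasks, run_config)
print(results)  # {'greeting': 'HELLO'}
```

Here the output key (`greeting`) deliberately differs from the task name (`shout_hello`) to show that the two are configured independently.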


As we can see, the data from the different sources still needs to be aggregated, so we proceed to the next step.

### Step 3: dependencies

The next step introduces **dependencies** between steps. Let's create a new step that uses the data produced by the other steps of the workflow to build a unified dataset. First we write a new function, define a task from it, and add it to the previous workflow.

In [30]:
import pandas as pd

def aggregate(download_g1, download_g1_brasil, download_folha):  
    return pd.concat([download_g1, download_g1_brasil, download_folha], ignore_index=True)

In [31]:
from cd4ml.task import Task

feeds_aggregate = Task(name='feeds_aggregate', task=aggregate)

The key point in this step is to make sure that all dependencies are declared.

In [32]:
w = Workflow()

w.add_task(download_g1)
w.add_task(download_g1_brasil)
w.add_task(download_folha)
w.add_task(feeds_aggregate, dependency=['download_g1', 'download_g1_brasil', 'download_folha'])
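
One way to picture what declaring dependencies buys us is a topological ordering: a task runs only after everything it depends on has finished. The sketch below uses the standard library's `graphlib` (Python 3.9+) with toy tasks. The helper `run_in_order` is hypothetical, and passing each upstream result as a keyword argument named after its output key is an assumption suggested by the signature of `aggregate`, not a statement about CD4ML's internals.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_in_order(tasks, dependencies, run_config):
    """Run callables so that every task executes after its dependencies."""
    graph = {name: set(dependencies.get(name, ())) for name in tasks}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        kwargs = dict(run_config[name].get("params") or {})
        # Assumption: each upstream result is passed as a keyword argument
        # named after the upstream task's output key.
        for upstream in dependencies.get(name, ()):
            output_key = run_config[upstream]["output"]
            kwargs[output_key] = results[output_key]
        results[run_config[name]["output"]] = tasks[name](**kwargs)
    return results


tasks = {
    "load_a": lambda: [1, 2],
    "load_b": lambda: [3],
    "merge": lambda load_a, load_b: load_a + load_b,
}
dependencies = {"merge": ["load_a", "load_b"]}
run_config = {
    "load_a": {"params": None, "output": "load_a"},
    "load_b": {"params": None, "output": "load_b"},
    "merge": {"params": None, "output": "merge"},
}
results = run_in_order(tasks, dependencies, run_config)
print(results["merge"])  # [1, 2, 3]
```

If a dependency were left undeclared, `merge` could be scheduled before its inputs exist, which is why every dependency has to be listed.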

Let's look at the aggregated results after all the steps have been executed.

In [33]:
run_config = {
    "download_folha": {
        'params': {'url': "https://feeds.folha.uol.com.br/emcimadahora/rss091.xml"},
        'output': 'download_folha'
    },
    "download_g1": {
        'params': {'url': "https://g1.globo.com/rss/g1/"},
        'output': 'download_g1'
    },
    "download_g1_brasil": {
        'params': {'url': "https://g1.globo.com/rss/g1/brasil"},
        'output': 'download_g1_brasil'
    },
    "feeds_aggregate": {
        'params': None,
        'output': 'feeds_aggregate'
    }
}
output = w.run(run_config=run_config, executor='local')

{'download_folha': {'params': {'url': 'https://feeds.folha.uol.com.br/emcimadahora/rss091.xml'}, 'output': 'download_folha'}, 'download_g1': {'params': {'url': 'https://g1.globo.com/rss/g1/'}, 'output': 'download_g1'}, 'download_g1_brasil': {'params': {'url': 'https://g1.globo.com/rss/g1/brasil'}, 'output': 'download_g1_brasil'}, 'feeds_aggregate': {'params': None, 'output': 'feeds_aggregate'}}
node 'download_g1' was already marked done
node 'download_g1_brasil' was already marked done
node 'download_folha' was already marked done
node 'download_g1' was already marked done
node 'download_g1_brasil' was already marked done
node 'download_folha' was already marked done


In [34]:
output['feeds_aggregate']

Unnamed: 0,TITLE,CONTENT,LINK,TIME_PUBLISHED
0,Prazo para entrega de propostas para gestão do...,"<img src=""https://s2.glbimg.com/diFWQznLCT3vwy...",https://g1.globo.com/am/amazonas/noticia/2022/...,"Fri, 24 Jun 2022 19:11:38 -0000"
1,Mulher investigada por tráfico de drogas é pre...,"<img src=""https://s2.glbimg.com/3PcmKypzftoM9a...",https://g1.globo.com/se/sergipe/noticia/2022/0...,"Fri, 24 Jun 2022 19:10:55 -0000"
2,Covid: Campinas prevê iniciar na segunda aplic...,"<img src=""https://s2.glbimg.com/0ywst1CYsd2O5P...",https://g1.globo.com/sp/campinas-regiao/notici...,"Fri, 24 Jun 2022 19:10:38 -0000"
3,Homem é preso por espancar companheira com fio...,"<img src=""https://s2.glbimg.com/Z5l2yi2GZW7-7i...",https://g1.globo.com/sc/santa-catarina/noticia...,"Fri, 24 Jun 2022 19:07:18 -0000"
4,Inaugurado CAPSI que atenderá público infantoj...,"<img src=""https://s2.glbimg.com/ILF6cPIDpCxG40...",https://g1.globo.com/pa/santarem-regiao/notici...,"Fri, 24 Jun 2022 19:06:29 -0000"
...,...,...,...,...
175,Investigação da ONU aponta que jornalista da A...,"A <a href=""https://www1.folha.uol.com.br/folha...",https://redir.folha.com.br/redir/online/emcima...,24 Jun 2022 07:45:00 -0300
176,Globo: produções originais da linha de shows m...,"Em um prazo de dois anos, de 2020 para cá, nad...",https://redir.folha.com.br/redir/online/emcima...,24 Jun 2022 07:42:00 -0300
177,"MEC, o ministério que se tornou palco de escân...",As reviravoltas e retrocessos das políticas pa...,https://redir.folha.com.br/redir/online/emcima...,24 Jun 2022 07:00:00 -0300
178,Laudo pericial reforça relato de jovem sobre a...,A perícia apontou que as lesões do adolescente...,https://redir.folha.com.br/redir/online/emcima...,24 Jun 2022 07:00:00 -0300
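
The aggregated table mixes two `TIME_PUBLISHED` layouts: the G1 rows include the weekday and a `-0000` offset, while the Folha rows omit the weekday and use `-0300`. As a sketch of how these strings could be normalized to UTC, the standard library's `email.utils.parsedate_to_datetime` accepts both layouts. The helper name `parse_published` is hypothetical, and treating the `-0000` offset as UTC is an assumption (the parser returns a timezone-naive value for it).

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_published(value):
    """Parse an RFC 2822 style date string and convert it to UTC."""
    parsed = parsedate_to_datetime(value)
    if parsed.tzinfo is None:
        # "-0000" yields a naive datetime; assume it is meant as UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


g1_time = parse_published("Fri, 24 Jun 2022 14:59:48 -0000")
folha_time = parse_published("24 Jun 2022 07:45:00 -0300")
print(g1_time.isoformat())     # 2022-06-24T14:59:48+00:00
print(folha_time.isoformat())  # 2022-06-24T10:45:00+00:00
```

With pandas, a function like this could be applied to the column, for example `df['TIME_PUBLISHED'].apply(parse_published)`, so that rows from all sources become comparable.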


### Step 4: feature generation

Now that we have aggregated news from several sources, let's run a simple feature generation process. We add a new step to the workflow that creates a tokenized version of the content. The end goal is to apply an LDA transformation.

In [39]:
import nltk

nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /home/eduardo/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /home/eduardo/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /home/eduardo/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /home/eduardo/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /home/eduardo/nltk_data...
[nltk_data]    |   Unzipping grammars/basque_grammars.zip.
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /home/eduardo/nltk_data...
[nltk_data]    |

True

In [79]:
output['feeds_aggregate']['CONTENT'][0]

'<img src="https://s2.glbimg.com/RwgvPlDgshwwly8-SCV7FpE891I=/i.s3.glbimg.com/v1/AUTH_59edd422c0c84a879bd37670ae4f538a/internal_photos/bs/2019/F/t/MghUrGQDA4LxQ9cxDYng/trecho-dom-pedro.jpg" /><br />   Interdição ocorre das 7h às 17h, no sentido Anhanguera, a partir da divisa com Valinhos (SP). Faixa do lado esquerdo ficará livre durante as obras. Trecho da Rodovia Dom Pedro I  terá bloqueios neste sábado, em Campinas.\nReprodução/EPTV\nA Rodovia D. Pedro I (SP-065) terá faixas interditadas neste sábado (25) entre os km 125 e 127, no sentido Anhanguera, para obras de recuperação do pavimento. O trecho em Campinas (SP) fica próximo com a divisa de Valinhos (SP). Motoristas devem ficar atentos à sinalização da via, que indicarão as faixas indisponíveis.\nOs bloqueios ocorrem das das 7h às 17h, no entanto a faixa da esquerda permanecerá aberta para circulação durante todo o período. Ao todo, o trecho da rodovia tem entre duas e três faixas de rolamento.\nA concessionária Rota da Bandeiras 

In [None]:
import re
from nltk import word_tokenize
from nltk.corpus import stopwords
import string

def preprocess_pandas(feeds_aggregate):
    stop = set(stopwords.words('portuguese') + list(string.punctuation))
    stop.update(['http', 'pro', 'https', 't.', 'co'])

    def preprocess(words):
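        # Remove HTML tags and character entities
        words = re.sub(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});', '', words)
        tokens = word_tokenize(words)
        # Drop stopwords and punctuation
        tokens = [word for word in tokens if word not in stop]
        # Keep tokens that contain a word character and are longer than two characters
        tokens = [word for word in tokens if re.search(r'\w+', word) and len(word) > 2]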
        return tokens
    
    feeds_aggregate['token_set'] = feeds_aggregate.apply(lambda row: preprocess(row.CONTENT.lower()), axis=1)
    return feeds_aggregate
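
To see what the cleaning steps do to a single string without downloading any NLTK data, here is a simplified, standard-library-only sketch. It reuses the same HTML-stripping regular expression, but swaps NLTK's `word_tokenize` for a plain `\w+` regex and the Portuguese stopword corpus for a tiny hand-written set, so its output will not match the NLTK-based function exactly. The names `simple_preprocess` and `SAMPLE_STOPWORDS` and the sample text are illustrative assumptions.

```python
import re

# Same pattern as in the notebook: HTML tags and character entities
HTML_PATTERN = re.compile(r"<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});")

# Tiny illustrative stopword set; the notebook uses NLTK's Portuguese list
SAMPLE_STOPWORDS = {"de", "da", "do", "das", "dos", "em", "para", "que"}


def simple_preprocess(text, stop=SAMPLE_STOPWORDS):
    """Lowercase, strip HTML, split into word tokens, drop short and stop words."""
    text = HTML_PATTERN.sub("", text.lower())
    tokens = re.findall(r"\w+", text)  # simplified tokenizer, not word_tokenize
    return [token for token in tokens if token not in stop and len(token) > 2]


sample = (
    '<img src="https://example.com/foto.jpg" /><br /> '
    "Trecho da Rodovia Dom Pedro I terá bloqueios em Campinas &amp; Valinhos."
)
print(simple_preprocess(sample))
# ['trecho', 'rodovia', 'dom', 'pedro', 'terá', 'bloqueios', 'campinas', 'valinhos']
```

The image tag, the line break, and the `&amp;` entity disappear before tokenization, which is why none of the URL fragments show up among the tokens.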

In [None]:
w = Workflow()

pre = Task(name='preprocess', task=preprocess_pandas)

w.add_task(download_g1)
w.add_task(download_g1_brasil)
w.add_task(download_folha)
w.add_task(feeds_aggregate, dependency=['download_g1', 'download_g1_brasil', 'download_folha'])
w.add_task(pre, dependency=['feeds_aggregate'])

In [None]:
run_config = {
    "download_folha": {
        'params': {'url': "https://feeds.folha.uol.com.br/emcimadahora/rss091.xml"},
        'output': 'download_folha'
    },
    "download_g1": {
        'params': {'url': "https://g1.globo.com/rss/g1/"},
        'output': 'download_g1'
    },
    "download_g1_brasil": {
        'params': {'url': "https://g1.globo.com/rss/g1/brasil"},
        'output': 'download_g1_brasil'
    },
    "feeds_aggregate": {
        'params': None,
        'output': 'feeds_aggregate'
    },
    "preprocess": {
        'params': None,
        'output': 'preprocess'
    }
}
output = w.run(run_config=run_config, executor='local')

Now we can see a preview with the processed text

In [86]:
output['preprocess']['token_set'][0]

['projeto',
 'principal',
 'acesso',
 'zona',
 'norte',
 'macapá',
 'feito',
 'convênio',
 'prefeitura',
 'unifap',
 'revitalização',
 'ponte',
 'sérgio',
 'arruda',
 'vai',
 'melhorar',
 'trânsito',
 'zona',
 'norte',
 'macapá',
 'etapa',
 'projeto',
 'revitalização',
 'ponte',
 'sérgio',
 'arruda',
 'principal',
 'acesso',
 'zona',
 'norte',
 'macapá',
 'divulgado',
 'diversas',
 'propostas',
 'devem',
 'dar',
 'mobilidade',
 'evitar',
 'alagamentos',
 'principais',
 'corredores',
 'trânsito',
 'capital',
 'ações',
 'fazem',
 'parte',
 'convênio',
 'assinado',
 'prefeitura',
 'curso',
 'engenharia',
 'civil',
 'universidade',
 'federal',
 'amapá',
 'unifap',
 'fazem',
 'parte',
 'iniciativa',
 'técnicos',
 'programa',
 'calha',
 'norte',
 'governo',
 'federal',
 'projeto',
 'traz',
 'revitalização',
 'necessária',
 'mobilidade',
 'urbana',
 'cidade',
 'ganha',
 'população',
 'sabemos',
 'construído',
 'profissionais',
 'excelência',
 'satisfeitos',
 'parceria',
 'firmada',
 'unifap',