# Introduction

This notebook demonstrates how to set up a basic workflow for working with Jupyter notebooks.

## Install

First, install the CD4ML package into the development environment.

In [10]:
!python ../setup.py develop

Processing /home/eduardo/work
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: cd4ml
  Building wheel for cd4ml (setup.py) ... done
  Created wheel for cd4ml: filename=cd4ml-0.0.1-py3-none-any.whl size=6733 sha256=54791abb92a138b2dc00f5a9003cc0547b72b750be3a435d9627b17bda2d95bc
  Stored in directory: /tmp/pip-ephem-wheel-cache-5qpbxoph/wheels/dd/fe/a9/e71b0c4e14d41f4dc4a95af3c2c4d87289e569db2eca1c7602
Successfully built cd4ml
Installing collected packages: cd4ml
  Attempting uninstall: cd4ml
    Found existing installation: cd4ml 0.0.1
    Uninstalling cd4ml-0.0.1:
      Successfully uninstalled cd4ml-0.0.1
Successfully installed cd4ml-0.0.1


Install your project dependencies for the experiment here.

In [5]:
%%writefile requirements.txt

feedparser==6.0.10
pandas==1.4.2
openpyxl==3.0.10

Writing requirements.txt


In [6]:
!pip install -r requirements.txt

Collecting feedparser==6.0.10
  Downloading feedparser-6.0.10-py3-none-any.whl (81 kB)
Collecting pandas==1.4.2
  Downloading pandas-1.4.2-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (11.0 MB)
Collecting openpyxl==3.0.10
  Downloading openpyxl-3.0.10-py2.py3-none-any.whl (242 kB)
Collecting sgmllib3k
  Downloading sgmllib3k-1.0.0.tar.gz (5.8 kB)
  Preparing metadata (setup.py) ... done
Collecting numpy>=1.21.0
  Downloading numpy-1.23.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (13.9 MB)

## Data extraction

For this example, we will download some news feed data to use as a dataset. Because this is an introductory example, only a few news items will be used.

### Step 1: download data

The goal is to build the case as a multistep feature extraction. The first step is to create a function that downloads data from a service provider.

In [7]:
feeds_json = {
    "Folha - Brasil - Em Cima da Hora": "https://feeds.folha.uol.com.br/emcimadahora/rss091.xml",
    "G1 - Land Page 1": "https://g1.globo.com/rss/g1/",
    "G1 - Brasil": "https://g1.globo.com/rss/g1/brasil",
}

In [36]:
import feedparser
import pandas as pd

def fetch_feed_data(feed_rss_url):
    """Download an RSS feed and return its entries as a DataFrame."""
    # Parse the feed; its items are exposed through `entries`
    blog_feed = feedparser.parse(feed_rss_url)
    posts = blog_feed.entries
    post_list = []

    for post in posts:
        # Keep only the fields that become dataset columns
        post_dict = dict()
        post_dict["TITLE"] = post.title
        post_dict["CONTENT"] = post.summary
        post_dict["LINK"] = post.link
        post_dict["TIME_PUBLISHED"] = post.published
        # post_dict["TAGS"] = [tag.term for tag in post.tags]
        post_list.append(post_dict)

    # One row per feed entry
    df_post = pd.DataFrame(post_list)
    return df_post
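
The function above reads each field as an attribute, so an entry that lacks, say, a summary would likely raise an error. As a minimal sketch of a more defensive variant, the hypothetical helper below builds one row from any dictionary-like entry and falls back to `None` for missing fields. The names `FIELDS` and `entry_to_row` and the choice of `None` as the default are assumptions made for illustration, and plain dictionaries stand in for feed entries so the example runs without network access.

```python
from typing import Any, Mapping

# Hypothetical mapping from dataset column to feed entry key (an assumption).
FIELDS = {
    "TITLE": "title",
    "CONTENT": "summary",
    "LINK": "link",
    "TIME_PUBLISHED": "published",
}


def entry_to_row(entry: Mapping[str, Any]) -> dict:
    """Build one row, using None for any field the entry does not provide."""
    return {column: entry.get(key) for column, key in FIELDS.items()}


# Plain dicts stand in for feed entries here.
complete = {
    "title": "A",
    "summary": "text",
    "link": "http://example.com/a",
    "published": "Thu, 23 Jun 2022 14:16:28 -0000",
}
partial = {"title": "B", "link": "http://example.com/b"}

rows = [entry_to_row(complete), entry_to_row(partial)]
print(rows[1])
```

A list of such rows can be passed to `pd.DataFrame` in the same way as `post_list`.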

Next, we use this function to define a task with CD4ML.

In [32]:
from cd4ml.task import Task

download = Task(name='download', task=fetch_feed_data)
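
To make the idea of a task concrete, here is a minimal sketch of a named wrapper around a function. It is not CD4ML's implementation: the class `SimpleTask`, its internals, and the placeholder `fake_fetch` are all hypothetical and only mirror the `name` and `task` arguments used above.

```python
from typing import Any, Callable


class SimpleTask:
    """Hypothetical stand-in for a task: a named wrapper around a callable."""

    def __init__(self, name: str, task: Callable[..., Any]) -> None:
        self.name = name
        self.task = task

    def run(self, *args: Any, **kwargs: Any) -> Any:
        # Forward the arguments to the wrapped function and return its result.
        return self.task(*args, **kwargs)


def fake_fetch(url: str) -> list:
    # Placeholder for fetch_feed_data so the sketch runs offline.
    return [{"LINK": url}]


demo = SimpleTask(name="download", task=fake_fetch)
result = demo.run("https://g1.globo.com/rss/g1/")
print(demo.name, result)
```

Keeping the name separate from the function lets the same function back several differently named tasks.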

Every task in CD4ML has a method named `run` that executes it with the given arguments. Let's test it with a feed we know.

In [37]:
df_g1 = download.run('https://g1.globo.com/rss/g1/')
df_g1

| | TITLE | CONTENT | LINK | TIME_PUBLISHED |
|---|---|---|---|---|
| 0 | Veja o que abre e o que fecha no feriado do pa... | `<img src="https://s2.glbimg.com/X6nwl6tG1i3uDU...` | https://g1.globo.com/pr/oeste-sudoeste/noticia... | Thu, 23 Jun 2022 14:16:28 -0000 |
| 1 | Quem é Arilton Moura, pastor preso no Pará pel... | `<img src="https://s2.glbimg.com/KdvTp7vLIQgHSo...` | https://g1.globo.com/pa/para/noticia/2022/06/2... | Thu, 23 Jun 2022 14:16:28 -0000 |
| 2 | Planetário de Vitória faz aniversário: confira... | `<img src="https://s2.glbimg.com/kHRaUCHBGBd1za...` | https://g1.globo.com/es/espirito-santo/noticia... | Thu, 23 Jun 2022 14:15:48 -0000 |
| 3 | Laboratório para caracterização e gestão de re... | `<img src="https://s2.glbimg.com/l_WpqgAvdsn36v...` | https://g1.globo.com/sp/presidente-prudente-re... | Thu, 23 Jun 2022 14:15:45 -0000 |
| 4 | Homem se apresenta à polícia de SP dizendo que... | `<img src="https://s2.glbimg.com/7LOZFzJyVUY8SW...` | https://g1.globo.com/sp/sao-paulo/noticia/2022... | Thu, 23 Jun 2022 14:15:42 -0000 |
| 5 | Em noite de muitos recados em defesa da democr... | A noite de quarta-feira (22) em Brasília foi d... | https://g1.globo.com/politica/blog/gerson-cama... | Thu, 23 Jun 2022 14:14:05 -0000 |
| 6 | Natal sanciona isenção de ISS para permissioná... | `<img src="https://s2.glbimg.com/Gz16XivaBWjQle...` | https://g1.globo.com/rn/rio-grande-do-norte/no... | Thu, 23 Jun 2022 14:13:58 -0000 |
| 7 | PRF apreende 8 kg de prata em Atibaia, SP | `<img src="https://s2.glbimg.com/giVTZRnJ71fZ__...` | https://g1.globo.com/sp/vale-do-paraiba-regiao... | Thu, 23 Jun 2022 14:12:49 -0000 |
| 8 | Beneficiários do Auxílio Brasil já podem solic... | `<img src="https://s2.glbimg.com/0YiFmeEfDLDLdU...` | https://g1.globo.com/sp/bauru-marilia/especial... | Thu, 23 Jun 2022 14:12:38 -0000 |
| 9 | Norte Fluminense vai receber unidade piloto do... | `<img src="https://s2.glbimg.com/0QpFBdvf3KPq0E...` | https://g1.globo.com/rj/norte-fluminense/notic... | Thu, 23 Jun 2022 14:12:11 -0000 |
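
The `TIME_PUBLISHED` values in this output are plain strings in the usual email-style date format, which a later feature extraction step would probably want as real timestamps. A minimal sketch of that conversion with the standard library follows; the helper name `parse_published` is hypothetical and not part of CD4ML.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime


def parse_published(value: str) -> datetime:
    """Parse an email-style date string such as the ones in TIME_PUBLISHED."""
    return parsedate_to_datetime(value)


# A "-0000" offset carries no timezone information, so the result is naive.
published = parse_published("Thu, 23 Jun 2022 14:16:28 -0000")
print(published.isoformat())
```

Applied to the DataFrame, this could look like `df_g1["TIME_PUBLISHED"].map(parse_published)`.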


### Step 2: create your first workflow

Now that we have downloaded the data, we can use it to create new features. Let's create a workflow that downloads feeds from different providers.

In [None]:
from cd4ml.task import Task
from cd4ml.workflow import Workflow

download_g1 = Task(name='download_g1', task=fetch_feed_data)
download_g1_brasil = Task(name='download_g1_brasil', task=fetch_feed_data)
download_folha = Task(name='download_folha', task=fetch_feed_data)

# Keys match the task names defined above
feeds_json = {
    "download_folha": "https://feeds.folha.uol.com.br/emcimadahora/rss091.xml",
    "download_g1": "https://g1.globo.com/rss/g1/",
    "download_g1_brasil": "https://g1.globo.com/rss/g1/brasil",
}

w = Workflow()
w.add_task(download_g1)
w.add_task(download_g1_brasil)
w.add_task(download_folha)
output = w.run(params=feeds_json, executor='local')
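
The notebook does not show what `output` contains or how the workflow matches parameters to tasks. One plausible reading, given that the keys of `feeds_json` equal the task names, is that each task receives the parameter stored under its own name. The sketch below illustrates only that idea. `SimpleWorkflow`, its `add_task` signature, the returned dictionary, and `fake_fetch` are all hypothetical, and none of it is CD4ML's actual implementation.

```python
from typing import Any, Callable, Dict


class SimpleWorkflow:
    """Hypothetical workflow that runs named callables one after another."""

    def __init__(self) -> None:
        self.tasks: Dict[str, Callable[[Any], Any]] = {}

    def add_task(self, name: str, func: Callable[[Any], Any]) -> None:
        self.tasks[name] = func

    def run(self, params: Dict[str, Any]) -> Dict[str, Any]:
        # Assumption: each task gets the parameter stored under its own name.
        return {name: func(params[name]) for name, func in self.tasks.items()}


def fake_fetch(url: str) -> str:
    # Placeholder for fetch_feed_data so the sketch runs offline.
    return f"fetched {url}"


flow = SimpleWorkflow()
flow.add_task("download_g1", fake_fetch)
flow.add_task("download_folha", fake_fetch)

results = flow.run(
    {
        "download_g1": "https://g1.globo.com/rss/g1/",
        "download_folha": "https://feeds.folha.uol.com.br/emcimadahora/rss091.xml",
    }
)
print(results)
```

Keying both the parameters and the results by task name keeps each feed's data traceable to the task that produced it.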