# Introduction

This notebook demonstrates how to set up a basic workflow for working with Jupyter notebooks.


## Install

Let's configure the environment and install the project dependencies.

In [1]:
import sys

sys.path.append('../')

This cell writes all the Python requirements for the project to `requirements.txt`.

In [3]:
%%writefile requirements.txt

feedparser==6.0.10
pandas==1.4.2
openpyxl==3.0.10
nltk==3.7

Overwriting requirements.txt
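The pinned `name==version` lines written above can be compared against what is actually installed in the kernel's environment. The following is a minimal sketch using only the standard library (`importlib.metadata`, Python 3.8+); `parse_pins` and `check_pins` are hypothetical helper names, not part of this project, and the requirements text is repeated inline so the cell runs on its own.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_pins(text):
    """Return {package: pinned_version} for every `name==version` line."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not an exact pin.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, pinned = line.split("==", 1)
        pins[name.strip()] = pinned.strip()
    return pins


def check_pins(pins):
    """Map each package to (pinned, installed); installed is None if missing."""
    report = {}
    for name, pinned in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        report[name] = (pinned, installed)
    return report


# Same pins as in requirements.txt, repeated here for a self-contained cell.
requirements = """
feedparser==6.0.10
pandas==1.4.2
openpyxl==3.0.10
nltk==3.7
"""

for name, (pinned, installed) in check_pins(parse_pins(requirements)).items():
    status = "ok" if installed == pinned else "mismatch"
    print(f"{name}: pinned {pinned}, installed {installed} -> {status}")
```

Running this after the install cell should report `ok` for every package; a `mismatch` usually means the kernel is using a different environment than the one `pip` installed into.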


In [2]:
!pip install -r requirements.txt

Defaulting to user installation because normal site-packages is not writeable
Collecting feedparser==6.0.10
  Downloading feedparser-6.0.10-py3-none-any.whl (81 kB)
Collecting pandas==1.4.2
  Downloading pandas-1.4.2-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (11.0 MB)
Collecting openpyxl==3.0.10
  Downloading openpyxl-3.0.10-py2.py3-none-any.whl (242 kB)
Collecting nltk==3.7
  Downloading nltk-3.7-py3-none-any.whl (1.5 MB)

## Data extraction

For this example, we will download some news feed data to use as a dataset. Because this is an introductory example, only a few news items are used.

### Step 1: download data

The goal is to build the case as a multistep feature extraction. The first step is to write the code that downloads data from a service provider.

In [5]:
import feedparser
import pandas as pd

url = 'https://g1.globo.com/rss/g1/'

blog_feed = feedparser.parse(url)

posts = blog_feed.entries  
post_list = []

for post in posts:
    post_dict = dict()

    post_dict["TITLE"] = post.title
    post_dict["CONTENT"] = post.summary
    post_dict["LINK"] = post.link
    post_dict["TIME_PUBLISHED"] = post.published
    # post_dict["TAGS"] = [tag.term for tag in post.tags]

    post_list.append(post_dict)
df_post = pd.DataFrame(post_list)
df_post

,TITLE,CONTENT,LINK,TIME_PUBLISHED
0,Vai a júri acusado de assassinar a tiros asses...,"<img src=""https://s2.glbimg.com/pzvczM0JN2NWQz...",https://g1.globo.com/rs/rio-grande-do-sul/noti...,"Fri, 07 Oct 2022 18:05:00 -0000"
1,"Stand-up, teatro e shows são algumas das atraç...","<img src=""https://s2.glbimg.com/rRBiwHg_zjAT71...",https://g1.globo.com/sp/vale-do-paraiba-regiao...,"Fri, 07 Oct 2022 18:03:08 -0000"
2,Nas ruas desde os 6 anos e usuário de drogas d...,"Com a ajuda de ONG, jovem de 15 anos voltou à ...",https://g1.globo.com/profissao-reporter/notici...,"Fri, 07 Oct 2022 18:03:07 -0000"
3,Cerca de 65% da carne bovina produzida em Mato...,"<img src=""https://s2.glbimg.com/fC1atT4q9H6K8H...",https://g1.globo.com/mt/mato-grosso/maisagromt...,"Fri, 07 Oct 2022 18:01:58 -0000"
4,"Comunidades quilombolas de Óbidos, no PA, serã...","<img src=""https://s2.glbimg.com/ktp75_4OeWpLas...",https://g1.globo.com/pa/santarem-regiao/notici...,"Fri, 07 Oct 2022 17:59:59 -0000"
5,Delegado de Umbaúba apresenta melhora de edema...,"<img src=""https://s2.glbimg.com/g97UjFr82VAUpB...",https://g1.globo.com/se/sergipe/noticia/2022/1...,"Fri, 07 Oct 2022 17:59:35 -0000"
6,"Prefeito de Maceió, JCH troca de partido para ...","<img src=""https://s2.glbimg.com/rulQBzPKprdmZ3...",https://g1.globo.com/al/alagoas/eleicoes/2022/...,"Fri, 07 Oct 2022 17:56:08 -0000"
7,Chico Chico volta ao Circo Voador com a turnê ...,"<img src=""https://s2.glbimg.com/QfZFD16gY7RkyS...",https://g1.globo.com/rj/rio-de-janeiro/o-que-f...,"Fri, 07 Oct 2022 17:56:06 -0000"
8,Câmera de segurança flagra batida entre moto e...,"<img src=""https://s2.glbimg.com/_vEHstjJ_cElRW...",https://g1.globo.com/mg/sul-de-minas/noticia/2...,"Fri, 07 Oct 2022 17:53:33 -0000"
9,Maíra Cardi é condenada a pagar R$ 24 mil após...,"<img src=""https://s2.glbimg.com/UMjkiHzs5vskZU...",https://g1.globo.com/pb/paraiba/noticia/2022/1...,"Fri, 07 Oct 2022 17:53:06 -0000"
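The cell above runs the download inline. One way to package the record-building part as a reusable function is sketched below. `entries_to_records` and `parse_published` are hypothetical names introduced here for illustration. The sketch assumes each entry supports `.get(key)`, as feedparser's dict-like entries do, so that a feed missing a field yields `None` instead of raising. Plain dictionaries stand in for real feed entries so the cell runs without network access.

```python
from email.utils import parsedate_to_datetime

# Output column name -> feed entry key, matching the columns used above.
FIELDS = {
    "TITLE": "title",
    "CONTENT": "summary",
    "LINK": "link",
    "TIME_PUBLISHED": "published",
}


def parse_published(value):
    """Parse an RFC 2822 date string; return None if missing or malformed."""
    if not value:
        return None
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None


def entries_to_records(entries):
    """Turn feed entries into a list of dicts, tolerating missing fields."""
    records = []
    for entry in entries:
        record = {column: entry.get(key) for column, key in FIELDS.items()}
        record["TIME_PUBLISHED"] = parse_published(record["TIME_PUBLISHED"])
        records.append(record)
    return records


# Made-up entries used only to exercise the function.
sample_entries = [
    {
        "title": "Example headline",
        "summary": "Example summary",
        "link": "https://example.com/a",
        "published": "Fri, 07 Oct 2022 18:05:00 -0000",
    },
    {"title": "Entry without a date", "link": "https://example.com/b"},
]

records = entries_to_records(sample_entries)
print(records[0]["TIME_PUBLISHED"])
print(records[1]["TIME_PUBLISHED"])
```

With a real feed, the same function could be called as `pd.DataFrame(entries_to_records(feedparser.parse(url).entries))`. Parsing `TIME_PUBLISHED` into datetimes at this stage makes later sorting and filtering by date easier than working with the raw strings.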


### Step 2: create your first workflow

Now that the data is downloaded, we can use it to create new features. Let's create a workflow to download feeds from different providers. We are going to use [Metaflow, Netflix's open-source workflow package](https://metaflow.org/), to make the job easier.

In [4]:
%%writefile ../cd4ml/feeds_flow.py

import feedparser
import pandas as pd

from metaflow import FlowSpec, step

class FeedsFlow(FlowSpec):

    @step
    def start(self):
        self.feeds_url = [
            'https://feeds.folha.uol.com.br/emcimadahora/rss091.xml',
            'https://g1.globo.com/rss/g1/',
            'https://g1.globo.com/rss/g1/brasil'
        ]
        self.next(self.fetch_feed_data, foreach='feeds_url')

    @step
    def fetch_feed_data(self):
        
        print(f"Downloading from url {self.input}")
        blog_feed = feedparser.parse(self.input)

        posts = blog_feed.entries  
        post_list = []

        for post in posts:
            post_dict = dict()

            post_dict["TITLE"] = post.title
            post_dict["CONTENT"] = post.summary
            post_dict["LINK"] = post.link
            post_dict["TIME_PUBLISHED"] = post.published
            # post_dict["TAGS"] = [tag.term for tag in post.tags]

            post_list.append(post_dict)
        self.posts = pd.DataFrame(post_list)        
        self.next(self.feeds_aggregate)

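    @step
    def feeds_aggregate(self, inputs):
        # Join step: merge the DataFrames produced by each foreach branch.
        # `inp` avoids shadowing the built-in `input`.
        self.results = pd.concat([inp.posts for inp in inputs])
        self.next(self.end)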
              
    @step
    def end(self):
        print('Workflow finished!')
        
if __name__ == '__main__':
    FeedsFlow()

Overwriting ../cd4ml/feeds_flow.py
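The flow above fans out over `feeds_url` with `foreach` and then merges the branches in the join step `feeds_aggregate`. The data flow of that pattern can be illustrated in plain Python, as sketched below. This is only an analogy and not how Metaflow executes steps: Metaflow runs each branch as a separate task, whereas this sketch uses a thread pool. The function bodies are stand-ins that return fake posts, so nothing is downloaded.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_feed_data(url):
    """Stand-in for the fetch_feed_data step: return fake posts for one URL."""
    return [{"TITLE": f"post from {url}", "LINK": url}]


def feeds_aggregate(branches):
    """Stand-in for the join step: flatten the per-branch lists into one."""
    return [post for posts in branches for post in posts]


feeds_url = [
    "https://feeds.folha.uol.com.br/emcimadahora/rss091.xml",
    "https://g1.globo.com/rss/g1/",
    "https://g1.globo.com/rss/g1/brasil",
]

# Fan out: one call per URL, like `foreach='feeds_url'` in the flow.
with ThreadPoolExecutor() as pool:
    branches = list(pool.map(fetch_feed_data, feeds_url))

# Fan in: combine the results of all branches, like the join step.
results = feeds_aggregate(branches)
print(len(results))
```

In the real flow, each branch reads its own URL from `self.input`, and the join step receives all finished branches through its `inputs` argument.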


In [5]:
!python ../cd4ml/feeds_flow.py run

Metaflow 2.7.12 executing FeedsFlow for user:eduardo
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint is happy!
2022-10-13 17:59:12.712 Workflow starting (run-id 1665694752621826):
2022-10-13 17:59:12.737 [1665694752621826/start/1 (pid 65)] Task is starting.
2022-10-13 17:59:13.264 [1665694752621826/start/1 (pid 65)] Foreach yields 3 child steps.
2022-10-13 17:59:13.264 [1665694752621826/start/1 (pid 65)] Task finished successfully.
2022-10-13 17:59:13.305 [1665694752621826/fetch_feed_data/2 (pid 71)] Task is starting.
2022-10-13 17:59:13.329 [1665694752621826/fetch_feed_data/3 (pid 72)] 