# Importing, Organizing, and Transforming Data with Pandas

**Objective:** Walk through, in a hands-on way, the initial stages of the Data Science workflow using Python and Pandas.

## 0. Initial Setup and Imports

Let's import the libraries needed for this lesson.

In [None]:
import bs4
import pandas as pd
import numpy as np

import requests              # HTTP requests (web scraping)
import sqlite3               # Working with SQLite databases
import os                    # Operating-system helpers (e.g., checking whether files exist)
import time                  # Pauses between scraping requests (good practice)

from bs4 import BeautifulSoup # HTML parsing (web scraping)
from IPython.display import display # Explicit import so display() also works outside a notebook cell

# Pandas display options for easier reading (optional)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 100)

print(f"Pandas Version: {pd.__version__}")
print(f"NumPy Version: {np.__version__}")
print(f"Requests Version: {requests.__version__}")
print(f"BeautifulSoup Version: {bs4.__version__}")
print(f"SQLite3 Version: {sqlite3.sqlite_version}")

## 1. Introduction: The Foundation of Data Analysis

The initial stages of **Collection, Import, Organization, and Structural Transformation** are crucial: they determine the raw material and ensure the data is in a format suitable for analysis. We will use the **Tidy Data** principle as our guide.


<svg aria-roledescription="flowchart-v2" role="graphics-document document" viewBox="0 0 1128.484375 169.75" style="max-width: 1128.484375px;" class="flowchart" xmlns="http://www.w3.org/2000/svg" width="100%" id="export-svg"><style xmlns="http://www.w3.org/1999/xhtml">@import url("https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.2.0/css/all.min.css"); p {margin: 0;}</style><style>#export-svg{font-family:arial,sans-serif;font-size:14px;fill:#333;}@keyframes edge-animation-frame{from{stroke-dashoffset:0;}}@keyframes dash{to{stroke-dashoffset:0;}}#export-svg .edge-animation-slow{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 50s linear infinite;stroke-linecap:round;}#export-svg .edge-animation-fast{stroke-dasharray:9,5!important;stroke-dashoffset:900;animation:dash 20s linear infinite;stroke-linecap:round;}#export-svg .error-icon{fill:#552222;}#export-svg .error-text{fill:#552222;stroke:#552222;}#export-svg .edge-thickness-normal{stroke-width:1px;}#export-svg .edge-thickness-thick{stroke-width:3.5px;}#export-svg .edge-pattern-solid{stroke-dasharray:0;}#export-svg .edge-thickness-invisible{stroke-width:0;fill:none;}#export-svg .edge-pattern-dashed{stroke-dasharray:3;}#export-svg .edge-pattern-dotted{stroke-dasharray:2;}#export-svg .marker{fill:#333333;stroke:#333333;}#export-svg .marker.cross{stroke:#333333;}#export-svg svg{font-family:arial,sans-serif;font-size:14px;}#export-svg p{margin:0;}#export-svg .label{font-family:arial,sans-serif;color:#333;}#export-svg .cluster-label text{fill:#333;}#export-svg .cluster-label span{color:#333;}#export-svg .cluster-label span p{background-color:transparent;}#export-svg .label text,#export-svg span{fill:#333;color:#333;}#export-svg .node rect,#export-svg .node circle,#export-svg .node ellipse,#export-svg .node polygon,#export-svg .node path{fill:#ECECFF;stroke:#B8B8FF;stroke-width:1px;}#export-svg .rough-node .label text,#export-svg .node .label text,#export-svg .image-shape .label,#export-svg 
.icon-shape .label{text-anchor:middle;}#export-svg .node .katex path{fill:#000;stroke:#000;stroke-width:1px;}#export-svg .rough-node .label,#export-svg .node .label,#export-svg .image-shape .label,#export-svg .icon-shape .label{text-align:center;}#export-svg .node.clickable{cursor:pointer;}#export-svg .root .anchor path{fill:#333333!important;stroke-width:0;stroke:#333333;}#export-svg .arrowheadPath{fill:#333333;}#export-svg .edgePath .path{stroke:#333333;stroke-width:1px;}#export-svg .flowchart-link{stroke:#333333;fill:none;}#export-svg .edgeLabel{background-color:rgba(232,232,232, 0.8);text-align:center;}#export-svg .edgeLabel p{background-color:rgba(232,232,232, 0.8);}#export-svg .edgeLabel rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#export-svg .labelBkg{background-color:rgba(232, 232, 232, 0.5);}#export-svg .cluster rect{fill:#ffffde;stroke:#aaaa33;stroke-width:1px;}#export-svg .cluster text{fill:#333;}#export-svg .cluster span{color:#333;}#export-svg div.mermaidTooltip{position:absolute;text-align:center;max-width:200px;padding:2px;font-family:arial,sans-serif;font-size:12px;background:hsl(80, 100%, 96.2745098039%);border:1px solid #aaaa33;border-radius:2px;pointer-events:none;z-index:100;}#export-svg .flowchartTitleText{text-anchor:middle;font-size:18px;fill:#333;}#export-svg rect.text{fill:none;stroke-width:0;}#export-svg .icon-shape,#export-svg .image-shape{background-color:rgba(232,232,232, 0.8);text-align:center;}#export-svg .icon-shape p,#export-svg .image-shape p{background-color:rgba(232,232,232, 0.8);padding:2px;}#export-svg .icon-shape rect,#export-svg .image-shape rect{opacity:0.5;background-color:rgba(232,232,232, 0.8);fill:rgba(232,232,232, 0.8);}#export-svg .node .neo-node{stroke:#B8B8FF;}#export-svg [data-look="neo"].node rect,#export-svg [data-look="neo"].cluster rect,#export-svg [data-look="neo"].node polygon{stroke:#B8B8FF;filter:drop-shadow( 1px 2px 2px rgba(185,185,185,1));}#export-svg 
[data-look="neo"].node path{stroke:#B8B8FF;stroke-width:1;}#export-svg [data-look="neo"].node .outer-path{filter:drop-shadow( 1px 2px 2px rgba(185,185,185,1));}#export-svg [data-look="neo"].node .neo-line path{stroke:#B8B8FF;filter:none;}#export-svg [data-look="neo"].node circle{stroke:#B8B8FF;filter:drop-shadow( 1px 2px 2px rgba(185,185,185,1));}#export-svg [data-look="neo"].node circle .state-start{fill:#000000;}#export-svg [data-look="neo"].statediagram-cluster rect{fill:#ECECFF;stroke:#B8B8FF;stroke-width:1;}#export-svg [data-look="neo"].icon-shape .icon{fill:#B8B8FF;filter:drop-shadow( 1px 2px 2px rgba(185,185,185,1));}#export-svg [data-look="neo"].icon-shape .icon-neo path{stroke:#B8B8FF;filter:drop-shadow( 1px 2px 2px rgba(185,185,185,1));}#export-svg :root{--mermaid-font-family:"trebuchet ms",verdana,arial,sans-serif;}</style><g><marker orient="auto" markerHeight="14" markerWidth="10.5" markerUnits="userSpaceOnUse" refY="7" refX="7.75" viewBox="0 0 11.5 14" class="marker flowchart-v2" id="export-svg_flowchart-v2-pointEnd"><path style="stroke-width: 0; stroke-dasharray: 1, 0;" class="arrowMarkerPath" d="M 0 0 L 11.5 7 L 0 14 z"/></marker><marker orient="auto" markerHeight="14" markerWidth="11.5" markerUnits="userSpaceOnUse" refY="7" refX="4" viewBox="0 0 11.5 14" class="marker flowchart-v2" id="export-svg_flowchart-v2-pointStart"><polygon style="stroke-width: 0; stroke-dasharray: 1, 0;" class="arrowMarkerPath" points="0,7 11.5,14 11.5,0"/></marker><marker orient="auto" markerHeight="14" markerWidth="10.5" markerUnits="userSpaceOnUse" refY="7" refX="11.5" viewBox="0 0 11.5 14" class="marker flowchart-v2" id="export-svg_flowchart-v2-pointEnd-margin"><path style="stroke-width: 0; stroke-dasharray: 1, 0;" class="arrowMarkerPath" d="M 0 0 L 11.5 7 L 0 14 z"/></marker><marker orient="auto" markerHeight="14" markerWidth="11.5" markerUnits="userSpaceOnUse" refY="7" refX="1" viewBox="0 0 11.5 14" class="marker flowchart-v2" 
id="export-svg_flowchart-v2-pointStart-margin"><polygon style="stroke-width: 0; stroke-dasharray: 1, 0;" class="arrowMarkerPath" points="0,7 11.5,14 11.5,0"/></marker><marker orient="auto" markerHeight="14" markerWidth="14" markerUnits="userSpaceOnUse" refX="10.75" refY="5" viewBox="0 0 10 10" class="marker flowchart-v2" id="export-svg_flowchart-v2-circleEnd"><circle style="stroke-width: 0; stroke-dasharray: 1, 0;" class="arrowMarkerPath" r="5" cy="5" cx="5"/></marker><marker orient="auto" markerHeight="14" markerWidth="14" markerUnits="userSpaceOnUse" refY="5" refX="0" viewBox="0 0 10 10" class="marker flowchart-v2" id="export-svg_flowchart-v2-circleStart"><circle style="stroke-width: 0; stroke-dasharray: 1, 0;" class="arrowMarkerPath" r="5" cy="5" cx="5"/></marker><marker orient="auto" markerHeight="14" markerWidth="14" markerUnits="userSpaceOnUse" refX="12.25" refY="5" viewBox="0 0 10 10" class="marker flowchart-v2" id="export-svg_flowchart-v2-circleEnd-margin"><circle style="stroke-width: 0; stroke-dasharray: 1, 0;" class="arrowMarkerPath" r="5" cy="5" cx="5"/></marker><marker orient="auto" markerHeight="14" markerWidth="14" markerUnits="userSpaceOnUse" refY="5" refX="-2" viewBox="0 0 10 10" class="marker flowchart-v2" id="export-svg_flowchart-v2-circleStart-margin"><circle style="stroke-width: 0; stroke-dasharray: 1, 0;" class="arrowMarkerPath" r="5" cy="5" cx="5"/></marker><marker orient="auto" markerHeight="12" markerWidth="12" markerUnits="userSpaceOnUse" refY="7.5" refX="17.7" viewBox="0 0 15 15" class="marker cross flowchart-v2" id="export-svg_flowchart-v2-crossEnd"><path style="stroke-width: 2.5;" class="arrowMarkerPath" d="M 1,1 L 14,14 M 1,14 L 14,1"/></marker><marker orient="auto" markerHeight="12" markerWidth="12" markerUnits="userSpaceOnUse" refY="7.5" refX="-3.5" viewBox="0 0 15 15" class="marker cross flowchart-v2" id="export-svg_flowchart-v2-crossStart"><path style="stroke-width: 2.5; stroke-dasharray: 1, 0;" class="arrowMarkerPath" d="M 1,1 L 
14,14 M 1,14 L 14,1"/></marker><marker orient="auto" markerHeight="12" markerWidth="12" markerUnits="userSpaceOnUse" refY="7.5" refX="17.7" viewBox="0 0 15 15" class="marker cross flowchart-v2" id="export-svg_flowchart-v2-crossEnd-margin"><path style="stroke-width: 2.5;" class="arrowMarkerPath" d="M 1,1 L 14,14 M 1,14 L 14,1"/></marker><marker orient="auto" markerHeight="12" markerWidth="12" markerUnits="userSpaceOnUse" refY="7.5" refX="-3.5" viewBox="0 0 15 15" class="marker cross flowchart-v2" id="export-svg_flowchart-v2-crossStart-margin"><path style="stroke-width: 2.5; stroke-dasharray: 1, 0;" class="arrowMarkerPath" d="M 1,1 L 14,14 M 1,14 L 14,1"/></marker><g class="root"><g class="clusters"><g data-look="neo" data-et="cluster" data-id="Ciclo de Exploração e Modelagem" id="Ciclo de Exploração e Modelagem" class="cluster"><rect height="153.75" width="491.140625" y="8" x="461.734375" style="fill:#ffffde"/><g transform="translate(681.234375, 8)" class="cluster-label"><foreignObject height="21" width="52.140625"><div xmlns="http://www.w3.org/1999/xhtml" style="display: table-cell; white-space: normal; line-height: 1.5; max-width: 200px; text-align: center;"><span class="nodeLabel"><p>Explorar</p></span></div></foreignObject></g></g></g><g class="edgePaths"><path marker-end="url(#export-svg_flowchart-v2-pointEnd-margin)" data-points="W3sieCI6MTU0LjQwNjI1LCJ5Ijo5NC4yNX0seyJ4IjoxNzkuNDA2MjUsInkiOjk0LjI1fSx7IngiOjIwNC40MDYyNSwieSI6OTQuMjV9XQ==" data-id="L_A_B_0" data-et="edge" data-edge="true" style="stroke-dasharray: 0 0 37 9; stroke-dashoffset: 0;;" class="edge-thickness-normal edge-pattern-solid edge-thickness-normal edge-pattern-solid flowchart-link" id="L_A_B_0" d="M154.40625,94.25L179.40625,94.25L200.40625,94.25"/><path marker-end="url(#export-svg_flowchart-v2-pointEnd-margin)" data-points="W3sieCI6NDExLjczNDM3NSwieSI6OTQuMjV9LHsieCI6NDM2LjczNDM3NSwieSI6OTQuMjV9LHsieCI6NDYxLjczNDM3NSwieSI6OTQuMjV9LHsieCI6NDg2LjczNDM3NSwieSI6OTQuMjV9XQ==" data-id="L_B_T_0" 
data-et="edge" data-edge="true" style="stroke-dasharray: 0 0 62 9; stroke-dashoffset: 0;;" class="edge-thickness-normal edge-pattern-solid edge-thickness-normal edge-pattern-solid flowchart-link" id="L_B_T_0" d="M411.734375,94.25L436.734375,94.25L461.734375,94.25L482.734375,94.25"/><path marker-end="url(#export-svg_flowchart-v2-pointEnd-margin)" data-points="W3sieCI6NjEyLjM1OTM3NSwieSI6NzMuNjg1MDUzMzgwNzgyOTJ9LHsieCI6NjM3LjM1OTM3NSwieSI6NjUuNX0seyJ4Ijo2NjIuMzU5Mzc1LCJ5Ijo2NS41fV0=" data-id="L_T_V_0" data-et="edge" data-edge="true" style="stroke-dasharray: 0 0 38.13121032714844 9; stroke-dashoffset: 0;;" class="edge-thickness-normal edge-pattern-solid edge-thickness-normal edge-pattern-solid flowchart-link" id="L_T_V_0" d="M612.359375,73.68505338078292L627.3805869454299,68.76707651608703Q637.359375,65.5 647.859375,65.5L658.359375,65.5"/><path marker-end="url(#export-svg_flowchart-v2-pointEnd-margin)" data-points="W3sieCI6NzcyLjcxODc1LCJ5Ijo2NS41fSx7IngiOjc5Ny43MTg3NSwieSI6NjUuNX0seyJ4Ijo4MjIuNzE4NzUsInkiOjc0Ljc2NDg1Mzk3Nzg0NDkyfV0=" data-id="L_V_M_0" data-et="edge" data-edge="true" style="stroke-dasharray: 0 0 38.4246940612793 9; stroke-dashoffset: 0;;" class="edge-thickness-normal edge-pattern-solid edge-thickness-normal edge-pattern-solid flowchart-link" id="L_V_M_0" d="M772.71875,65.5L786.3879818372984,65.5Q797.71875,65.5 808.343389203467,69.43742923149635L818.9680284069341,73.3748584629927"/><path marker-end="url(#export-svg_flowchart-v2-pointEnd-margin)" data-points="W3sieCI6ODIyLjcxODc1LCJ5IjoxMTMuNzM1MTQ2MDIyMTU1MDh9LHsieCI6Nzk3LjcxODc1LCJ5IjoxMjN9LHsieCI6NzE3LjUzOTA2MjUsInkiOjEyM30seyJ4Ijo2MzcuMzU5Mzc1LCJ5IjoxMjN9LHsieCI6NjEyLjM1OTM3NSwieSI6MTE0LjgxNDk0NjYxOTIxNzA4fV0=" data-id="L_M_T_0" data-et="edge" data-edge="true" style="stroke-dasharray: 0 0 199.8626251220703 9; stroke-dashoffset: 0;;" class="edge-thickness-normal edge-pattern-solid edge-thickness-normal edge-pattern-solid flowchart-link" id="L_M_T_0" 
d="M822.71875,113.73514602215508L810.21875,118.36757301107754Q797.71875,123 784.3879818372984,123L717.5390625,123L648.5122748594061,123Q637.359375,123 626.7600965342039,119.52977359838702L616.1608180684077,116.05954719677405"/><path marker-end="url(#export-svg_flowchart-v2-pointEnd-margin)" data-points="W3sieCI6OTI3Ljg3NSwieSI6OTQuMjV9LHsieCI6OTUyLjg3NSwieSI6OTQuMjV9LHsieCI6OTc3Ljg3NSwieSI6OTQuMjV9LHsieCI6MTAwMi44NzUsInkiOjk0LjI1fV0=" data-id="L_M_E_0" data-et="edge" data-edge="true" style="stroke-dasharray: 0 0 62 9; stroke-dashoffset: 0;;" class="edge-thickness-normal edge-pattern-solid edge-thickness-normal edge-pattern-solid flowchart-link" id="L_M_E_0" d="M927.875,94.25L952.875,94.25L977.875,94.25L998.875,94.25"/></g><g class="edgeLabels"><g class="edgeLabel"><g transform="translate(0, 0)" data-id="L_A_B_0" class="label"><foreignObject height="0" width="0"><div class="labelBkg" xmlns="http://www.w3.org/1999/xhtml" style="display: table-cell; white-space: normal; line-height: 1.5; max-width: 200px; text-align: center;"><span class="edgeLabel"></span></div></foreignObject></g></g><g class="edgeLabel"><g transform="translate(0, 0)" data-id="L_B_T_0" class="label"><foreignObject height="0" width="0"><div class="labelBkg" xmlns="http://www.w3.org/1999/xhtml" style="display: table-cell; white-space: normal; line-height: 1.5; max-width: 200px; text-align: center;"><span class="edgeLabel"></span></div></foreignObject></g></g><g class="edgeLabel"><g transform="translate(0, 0)" data-id="L_T_V_0" class="label"><foreignObject height="0" width="0"><div class="labelBkg" xmlns="http://www.w3.org/1999/xhtml" style="display: table-cell; white-space: normal; line-height: 1.5; max-width: 200px; text-align: center;"><span class="edgeLabel"></span></div></foreignObject></g></g><g class="edgeLabel"><g transform="translate(0, 0)" data-id="L_V_M_0" class="label"><foreignObject height="0" width="0"><div class="labelBkg" xmlns="http://www.w3.org/1999/xhtml" style="display: table-cell; 
white-space: normal; line-height: 1.5; max-width: 200px; text-align: center;"><span class="edgeLabel"></span></div></foreignObject></g></g><g class="edgeLabel"><g transform="translate(0, 0)" data-id="L_M_T_0" class="label"><foreignObject height="0" width="0"><div class="labelBkg" xmlns="http://www.w3.org/1999/xhtml" style="display: table-cell; white-space: normal; line-height: 1.5; max-width: 200px; text-align: center;"><span class="edgeLabel"></span></div></foreignObject></g></g><g class="edgeLabel"><g transform="translate(0, 0)" data-id="L_M_E_0" class="label"><foreignObject height="0" width="0"><div class="labelBkg" xmlns="http://www.w3.org/1999/xhtml" style="display: table-cell; white-space: normal; line-height: 1.5; max-width: 200px; text-align: center;"><span class="edgeLabel"></span></div></foreignObject></g></g></g><g class="nodes"><g transform="translate(81.203125, 94.25)" data-look="neo" data-et="node" data-node="true" data-id="A" id="flowchart-A-0" class="node default"><rect stroke="url(#gradient)" height="45" width="146.40625" y="-22.5" x="-73.203125" data-id="A" style="" class="basic label-container"/><g transform="translate(-57.203125, -10.5)" style="" class="label"><rect/><foreignObject height="21" width="114.40625"><div xmlns="http://www.w3.org/1999/xhtml" style="display: table-cell; white-space: normal; line-height: 1.5; max-width: 200px; text-align: center;"><span class="nodeLabel"><p>Importação/Coleta</p></span></div></foreignObject></g></g><g transform="translate(308.0703125, 94.25)" data-look="neo" data-et="node" data-node="true" data-id="B" id="flowchart-B-1" class="node default"><rect stroke="url(#gradient)" height="45" width="207.328125" y="-22.5" x="-103.6640625" data-id="B" style="" class="basic label-container"/><g transform="translate(-87.6640625, -10.5)" style="" class="label"><rect/><foreignObject height="21" width="175.328125"><div xmlns="http://www.w3.org/1999/xhtml" style="display: table-cell; white-space: normal; line-height: 1.5; 
max-width: 200px; text-align: center;"><span class="nodeLabel"><p>Organização/Limpeza (Tidy)</p></span></div></foreignObject></g></g><g transform="translate(549.546875, 94.25)" data-look="neo" data-et="node" data-node="true" data-id="T" id="flowchart-T-3" class="node default"><rect stroke="url(#gradient)" height="45" width="125.625" y="-22.5" x="-62.8125" data-id="T" style="" class="basic label-container"/><g transform="translate(-46.8125, -10.5)" style="" class="label"><rect/><foreignObject height="21" width="93.625"><div xmlns="http://www.w3.org/1999/xhtml" style="display: table-cell; white-space: normal; line-height: 1.5; max-width: 200px; text-align: center;"><span class="nodeLabel"><p>Transformação</p></span></div></foreignObject></g></g><g transform="translate(717.5390625, 65.5)" data-look="neo" data-et="node" data-node="true" data-id="V" id="flowchart-V-5" class="node default"><rect stroke="url(#gradient)" height="45" width="110.359375" y="-22.5" x="-55.1796875" data-id="V" style="" class="basic label-container"/><g transform="translate(-39.1796875, -10.5)" style="" class="label"><rect/><foreignObject height="21" width="78.359375"><div xmlns="http://www.w3.org/1999/xhtml" style="display: table-cell; white-space: normal; line-height: 1.5; max-width: 200px; text-align: center;"><span class="nodeLabel"><p>Visualização</p></span></div></foreignObject></g></g><g transform="translate(875.296875, 94.25)" data-look="neo" data-et="node" data-node="true" data-id="M" id="flowchart-M-7" class="node default"><rect stroke="url(#gradient)" height="45" width="105.15625" y="-22.5" x="-52.578125" data-id="M" style="" class="basic label-container"/><g transform="translate(-36.578125, -10.5)" style="" class="label"><rect/><foreignObject height="21" width="73.15625"><div xmlns="http://www.w3.org/1999/xhtml" style="display: table-cell; white-space: normal; line-height: 1.5; max-width: 200px; text-align: center;"><span 
class="nodeLabel"><p>Modelagem</p></span></div></foreignObject></g></g><g transform="translate(1061.6796875, 94.25)" data-look="neo" data-et="node" data-node="true" data-id="E" id="flowchart-E-11" class="node default"><rect stroke="url(#gradient)" height="45" width="117.609375" y="-22.5" x="-58.8046875" data-id="E" style="" class="basic label-container"/><g transform="translate(-42.8046875, -10.5)" style="" class="label"><rect/><foreignObject height="21" width="85.609375"><div xmlns="http://www.w3.org/1999/xhtml" style="display: table-cell; white-space: normal; line-height: 1.5; max-width: 200px; text-align: center;"><span class="nodeLabel"><p>Comunicação</p></span></div></foreignObject></g></g></g></g></g><defs><filter width="130%" height="130%" id="drop-shadow"><feDropShadow flood-color="#FFFFFF" flood-opacity="0.06" stdDeviation="0" dy="4" dx="4"/></filter></defs><defs><filter width="150%" height="150%" id="drop-shadow-small"><feDropShadow flood-color="#FFFFFF" flood-opacity="0.06" stdDeviation="0" dy="2" dx="2"/></filter></defs></svg>


## 2. Data Collection

This stage involves identifying and obtaining data from a variety of sources:
*   Databases (internal/external)
*   APIs (web services)
*   Files (CSV, JSON, Excel, TXT, etc.)
*   Websites (via web scraping)
*   Sensors / IoT

In this notebook we focus on processing the data *after* collection, mainly from files and from a simulated web-scraping example.

## 3. Web Scraping

A technique for extracting data from websites when no API or downloadable file is available.

**Process:** HTTP request -> receive HTML -> parse -> extract -> store.

**Tools:** `requests` (HTTP requests), `BeautifulSoup` (parsing).

⚠️ **IMPORTANT: Ethical and Legal Considerations** ⚠️
*   **Check `robots.txt`:** Respect the site's rules.
*   **Read the Terms of Service (ToS):** Confirm that scraping is allowed.
*   **Don't overload the server:** Pause between requests (`time.sleep()`).
*   **Identify yourself:** Use a descriptive `User-Agent`.
*   **LGPD:** Handle personal data with care (LGPD is Brazil's General Data Protection Law).
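
The `robots.txt` check in the list above can be sketched with the standard library alone. The example parses a made-up `robots.txt` from a string, so it runs without network access, and asks whether a bot may fetch two paths. The rules, the `example.com` URLs, and the reuse of the bot name are illustrative assumptions, not the rules of any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. For a real site you would instead call
# parser.set_url("https://<site>/robots.txt") followed by parser.read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "MeuBotDeEstudo/1.0"  # illustrative bot name

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed_public = parser.can_fetch(USER_AGENT, "https://example.com/data/table.html")
allowed_private = parser.can_fetch(USER_AGENT, "https://example.com/private/report.html")
delay = parser.crawl_delay(USER_AGENT)  # None when the file sets no Crawl-delay

print(allowed_public, allowed_private, delay)

# Between real requests, pause for at least the requested delay, e.g.:
# time.sleep(delay or 1)
```

A scraper would call `can_fetch` before each request and skip any URL for which it returns `False`.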

In [None]:
url = 'https://diegopatr.github.io/data-science-course/data-science-course/_attachments/tabela_dados.html'

df_scraped = pd.DataFrame() # Start with an empty DataFrame

print(f"Fetching URL: {url}")

try:
    # 1. HTTP request
    headers = {'User-Agent': 'MeuBotDeEstudo/1.0 (contato@exemplo.com)'} # Good practice: identify yourself
    response = requests.get(url, timeout=10, headers=headers)
    response.raise_for_status() # Raises on HTTP errors (4xx, 5xx)
    print(f"Status Code: {response.status_code} - Connection OK!")

    # 2. HTML parsing ('lxml' needs the lxml package; 'html.parser' is the built-in alternative)
    soup = BeautifulSoup(response.text, 'lxml')

    # 3. Extraction (look for the table with id='dados_principais')
    tabela = soup.find('table', id='dados_principais')

    if tabela:
        headers_scraped = [th.text.strip() for th in tabela.find_all('th')]
        rows = []
        for tr in tabela.find_all('tr')[1:]: # Skip the header row
            cells = [td.text.strip() for td in tr.find_all('td')]
            if len(cells) == len(headers_scraped): # Keep only rows that match the header width
                rows.append(cells)

        # 4. Store in a DataFrame (every column is still text at this point)
        df_scraped = pd.DataFrame(rows, columns=headers_scraped)
        print("\nData extracted via web scraping:")
        display(df_scraped)
    else:
        print(f"Table with id 'dados_principais' not found at {url}")

except requests.exceptions.ConnectionError as e:
    print(f"\nCONNECTION ERROR: Check that the URL {url} is reachable. Details: {e}")
except requests.exceptions.RequestException as e:
    print(f"\nHTTP REQUEST ERROR for {url}: {e}")
except Exception as e:
    print(f"\nUNEXPECTED ERROR while scraping: {e}")

## 4. Importing Data

This step loads data into Pandas DataFrames. Controlling the reader parameters is essential when dealing with real-world data.

**Common parameters:**
*   `filepath_or_buffer`: File path or URL.
*   `sep`/`delimiter`: Field separator (CSV, TSV).
*   `header`: Row that holds the column names.
*   `names`: Column names (when the file has no header).
*   `dtype`: *Explicit data types* (highly recommended!).
*   `na_values`: Values to treat as missing (NaN).
*   `parse_dates`: Columns to convert to date/time.
*   `encoding`: File encoding (`utf-8`, `latin-1`, etc.).
*   `skiprows`, `nrows`: Skip rows / limit the number of rows read.
*   `decimal`, `thousands`: Non-default numeric separators.
*   `on_bad_lines`: How to handle malformed lines.
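
To see several of these parameters working together, here is a minimal sketch that reads a small in-memory text instead of a file. The content (a note line, `;` as separator, `,` as decimal mark, `.` as thousands mark) and the column names are invented for illustration.

```python
import io

import pandas as pd

# Made-up export: one free-text line, then a ';'-separated table that uses
# ',' for decimals and '.' for thousands.
raw = """\
report generated by a hypothetical system
id;city;amount;signup
001;Recife;1.234,50;2024-01-15
002;--;87,10;2024-02-03
003;Natal;N/A;2024-03-20
"""

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",
    skiprows=1,                # skip the free-text first line
    header=0,                  # the next line holds the column names
    dtype={"id": str},         # keep the leading zeros
    na_values=["--", "N/A"],   # markers to treat as missing
    decimal=",",
    thousands=".",
    parse_dates=["signup"],
)

print(df)
print(df.dtypes)
```

Without `dtype={"id": str}` the identifiers would be read as integers and lose their leading zeros, which is one reason explicit types are recommended above.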

### 4.1. Importing CSV with Parameters

In [None]:
df_csv = pd.DataFrame() # Reset/initialize
csv_file_path = 'data/arquivo_dados.csv'
try:
    # Set types and missing-value markers explicitly
    col_types = {
        'ID_Cliente': str,
        'Nome': str,
        'Idade': 'Int64', # Nullable integer (supports missing values)
        'Cidade': str,
        # 'Data_Cadastro': str, # Let Pandas infer it, or parse it later
        'Valor_Gasto': float
    }
    missing_values = ["", "NA", "N/A", "--"] # What to treat as NaN

    df_csv = pd.read_csv(
        csv_file_path,
        sep=',',
        encoding='utf-8',
        header=0,              # First row is the header
        dtype=col_types,       # Explicit types
        na_values=missing_values, # Missing-value markers
        parse_dates=['Data_Cadastro'], # Try to convert this column to datetime
        # dayfirst=False       # Optional: helps with ambiguous dates (DD/MM vs MM/DD)
    )
    print("DataFrame loaded from CSV:")
    display(df_csv)
    print("\nDataFrame info:")
    df_csv.info()

except FileNotFoundError:
    print(f"Error: File '{csv_file_path}' not found.")
except pd.errors.ParserError as e:
    print(f"CSV parsing error: {e}")
except Exception as e:
    print(f"Unexpected error while reading the CSV: {e}")

### 4.2. Importing JSON

In [None]:
df_json = pd.DataFrame() # Reset/initialize
try:
    # orient='records' expects a JSON list of objects, one per row
    df_json = pd.read_json('data/arquivo_dados.json', orient='records',
                           dtype={'ID_Cliente':str, 'Quantidade':'Int64', 'Preco_Unitario':float}) # Explicit types!
    print("DataFrame loaded from JSON:")
    display(df_json)
    df_json.info()
except FileNotFoundError:
    print("Error: File 'data/arquivo_dados.json' not found.")
except ValueError as e:
    print(f"Error parsing the JSON (check the format/orient): {e}")
except Exception as e:
    print(f"Unexpected error while reading the JSON: {e}")
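
The `orient` argument tells `read_json` how the rows are laid out in the text. As a small sketch with invented data, the same two rows are written below in the `records` layout (a list of objects) and in the `columns` layout (one object per column, keyed by row label), and both are read from in-memory strings.

```python
import io

import pandas as pd

# The same two illustrative rows in two JSON layouts.
records_json = '[{"id": "a1", "qty": 2}, {"id": "b2", "qty": 5}]'
columns_json = '{"id": {"0": "a1", "1": "b2"}, "qty": {"0": 2, "1": 5}}'

df_records = pd.read_json(io.StringIO(records_json), orient="records")
df_columns = pd.read_json(io.StringIO(columns_json), orient="columns")

print(df_records)
print(df_columns)
```

If the file layout and `orient` disagree, the result is usually a `ValueError` or a table with an unexpected shape, which is why the cell above catches `ValueError` separately.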

### 4.3. Importing Excel

Reading Excel files (`.xlsx`, `.xls`) requires an extra library: `openpyxl` for `.xlsx`, or `xlrd` for the older `.xls` format.

In [None]:
excel_file_path = 'data/planilha_dados.xlsx'
df_excel = pd.DataFrame() # Initialize so the name exists even if the file is missing
if os.path.exists(excel_file_path):
    try:
        df_excel = pd.read_excel(excel_file_path,
                                 sheet_name='DadosExemplo', # Sheet name or index
                                 dtype={'col1': int, 'col2': float}, # Explicit types
                                 na_values=['Nulo'])
        print("\nDataFrame loaded from Excel:")
        display(df_excel.head())
        df_excel.info()
    except ImportError:
        print("Error: Install 'openpyxl' (`pip install openpyxl` or `!pip install openpyxl`) to read .xlsx files.")
    except Exception as e:
        print(f"Error reading the Excel file: {e}")
else:
    print(f"\nExcel file '{excel_file_path}' not found, skipping read.")

### 4.4. Importing from a SQLite Database

In [None]:
db_filename = 'data/meu_banco.db'
table_name = 'minha_tabela'

conn_read = None
df_sql = pd.DataFrame() # Reset/initialize
try:
    # Note: sqlite3.connect() creates an empty database file if none exists at this path
    conn_read = sqlite3.connect(db_filename)
    # Query that selects specific columns with a condition
    query = f"SELECT id, coluna1, coluna2, coluna_numerica, condicao FROM {table_name} WHERE condicao = 'valor';"
    print(f"\nRunning query: {query}")

    # Read the result straight into a DataFrame
    df_sql = pd.read_sql_query(query, conn_read,
                               index_col='id', # Use the SQL id column as the DataFrame index
                               parse_dates=None, # No date columns to parse in this example
                               dtype={'coluna_numerica': float} # Explicit type, to be safe
                              )

    print("\nDataFrame loaded from the SQL database:")
    display(df_sql)
    df_sql.info()

except sqlite3.Error as e:
    print(f"Error connecting or running the SQL query: {e}")
except pd.io.sql.DatabaseError as e:
    print(f"Pandas error while reading SQL: {e}")
except Exception as e:
    print(f"Unexpected error while reading from the database: {e}")
finally:
    if conn_read:
        conn_read.close()
        print("\nRead connection to the database closed.")
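
When the filter value comes from user input or another variable, it is generally safer to pass it through `params` than to format it into the SQL string. The sketch below assumes a hypothetical `orders` table built in an in-memory database so that it is self-contained; `sqlite3` uses `?` as its placeholder.

```python
import sqlite3

import pandas as pd

# In-memory database with a hypothetical table, created only for this example.
conn = sqlite3.connect(":memory:")
try:
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, total REAL)")
    conn.executemany(
        "INSERT INTO orders (id, status, total) VALUES (?, ?, ?)",
        [(1, "paid", 10.5), (2, "open", 99.0), (3, "paid", 7.25)],
    )
    conn.commit()

    # The value travels separately from the SQL text via `params`.
    df_paid = pd.read_sql_query(
        "SELECT id, status, total FROM orders WHERE status = ?",
        conn,
        params=("paid",),
        index_col="id",
    )
finally:
    conn.close()

print(df_paid)
```

Placeholders cover values only. Table and column names cannot be bound this way, so a table name taken from a variable, as in the cell above, should come from trusted code rather than from user input.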

### 4.5. Importing Fixed-Width Files (FWF)

In [None]:
df_fwf = pd.DataFrame() # Reset/initialize
try:
    # Specify either widths or positions
    # widths = [3, 10, 8, 1] # ID is 3 chars, Nome 10, Valor 8, Status 1
    col_specs = [(0, 3), (3, 13), (13, 21), (21, 22)] # Half-open intervals [start, end)
    col_names = ['ID', 'Nome', 'Valor', 'Status']

    df_fwf = pd.read_fwf(
        'data/arquivo_largura_fixa.txt',
        colspecs=col_specs,
        names=col_names,
        header=None, # Column names come from 'names'
        skiprows=1,  # Skip the file's own header line
        encoding='utf-8',
        dtype={'ID': str, 'Valor': float, 'Status': str} # Explicit types
    )
    print("DataFrame loaded from fixed-width file (FWF):")
    display(df_fwf)
    df_fwf.info()

except FileNotFoundError:
    print("Error: File 'data/arquivo_largura_fixa.txt' not found.")
except Exception as e:
    print(f"Error reading the FWF file: {e}")
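
`widths` and `colspecs` are two ways of describing the same layout: consecutive widths of 3, 10, 8, and 1 correspond to the positions `(0, 3)`, `(3, 13)`, `(13, 21)`, `(21, 22)`. The sketch below builds a tiny fixed-width text in memory (the rows are invented) and reads it both ways.

```python
import io

import pandas as pd

# Build illustrative fixed-width lines: ID (3), Nome (10), Valor (8), Status (1).
rows = [("001", "Ana", 12.5, "A"), ("002", "Bruno", 100.0, "I")]
lines = [f"{'ID':<3}{'Nome':<10}{'Valor':>8}{'S':<1}"]
lines += [f"{i:<3}{n:<10}{v:>8.2f}{s:<1}" for i, n, v, s in rows]
raw = "\n".join(lines) + "\n"

common = dict(
    names=["ID", "Nome", "Valor", "Status"],
    header=None,
    skiprows=1,          # skip the header line inside the text
    dtype={"ID": str},   # keep the leading zeros
)

df_widths = pd.read_fwf(io.StringIO(raw), widths=[3, 10, 8, 1], **common)
df_specs = pd.read_fwf(
    io.StringIO(raw), colspecs=[(0, 3), (3, 13), (13, 21), (21, 22)], **common
)

print(df_widths)
print(df_widths.equals(df_specs))
```

`widths` is shorter to write when the fields are contiguous; `colspecs` is the one to use when some character ranges should be skipped.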

### 4.6. Exporting Data

Pandas can save DataFrames in several formats.

In [None]:
# Example using the df_csv loaded earlier (if it loaded successfully)
if not df_csv.empty:
    try:
        os.makedirs('output', exist_ok=True) # to_csv does not create missing directories

        # Save as CSV with ';' as separator and ',' as decimal mark
        df_csv.to_csv('output/dados_exportados.csv', index=False, sep=';', decimal=',', encoding='utf-8')
        print("File 'output/dados_exportados.csv' saved successfully (separator ';', decimal ',').")

        # Save as Excel (requires openpyxl)
        # df_csv.to_excel('output/dados_exportados.xlsx', index=False, sheet_name='Clientes')
        # print("File 'output/dados_exportados.xlsx' saved successfully.")
    except Exception as e:
        print(f"Error exporting data: {e}")
else:
    print("DataFrame df_csv is empty, skipping export.")
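
A file written with `sep=';'` and `decimal=','` has to be read back with the same settings, otherwise the numbers arrive as text. The round trip below is a small sketch on invented data, using an in-memory buffer instead of a file on disk.

```python
import io

import pandas as pd

df = pd.DataFrame({"nome": ["Ana", "Bruno"], "valor": [1234.5, 87.1]})

# Write with ';' as separator and ',' as decimal mark.
buffer = io.StringIO()
df.to_csv(buffer, index=False, sep=";", decimal=",")
text = buffer.getvalue()
print(text)

# Reading with matching settings restores the floats...
df_back = pd.read_csv(io.StringIO(text), sep=";", decimal=",")
# ...while leaving decimal at its default '.' keeps the column as text.
df_wrong = pd.read_csv(io.StringIO(text), sep=";")

print(df_back.dtypes)
print(df_wrong.dtypes)
```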

## 5. Organization and Structural Transformation (Data Tidying)

After importing, we organize and restructure the data into the **Tidy Data** format, which makes analysis easier.

**Tidy principles:**
1.  Each variable in its own column.
2.  Each observation in its own row.
3.  Each type of observational unit in its own table.
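
As a small illustration of the first principle, the invented table below stores two variables (city and state) in a single column. Splitting that column gives each variable its own column. The column names and values are assumptions made for this sketch.

```python
import pandas as pd

# Messy: 'city_state' holds two variables in one column.
messy = pd.DataFrame({
    "customer": ["c1", "c2"],
    "city_state": ["Recife/PE", "Natal/RN"],
})

tidy = messy.copy()
# expand=True returns one column per piece of the split.
tidy[["city", "state"]] = tidy["city_state"].str.split("/", expand=True)
tidy = tidy.drop(columns="city_state")

print(tidy)
```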

### 5.1. Initial Inspection

Revisit the `df_csv` DataFrame (if it was loaded) to check its structure and content.

In [None]:
if not df_csv.empty:
    print("Reviewing df_csv:")
    print("Shape:", df_csv.shape)
    print("\nData types:\n", df_csv.dtypes)
    print("\nHead:")
    display(df_csv.head())
    print("\nMissing values per column:\n", df_csv.isnull().sum())
    print("\nDuplicated rows:", df_csv.duplicated().sum())
    print("\nStatistical summary (numeric):")
    display(df_csv.describe(include=[np.number]))
    print("\nStatistical summary (dates):")
    # describe() raises if no column matches, so check for datetime columns first
    date_cols = df_csv.select_dtypes(include=['datetime'])
    if date_cols.shape[1] > 0:
        display(date_cols.describe())
    else:
        print("No datetime columns found (date parsing may have failed).")
else:
    print("df_csv was not loaded or is empty.")

### 5.2. Renaming Columns

Standardize the column names.

In [None]:
if not df_csv.empty:
    df_renomeado = df_csv.copy() # Work on a copy

    # Method 1: direct assignment (good for renaming every column; mind the order)
    # df_renomeado.columns = ['id_cliente', 'nome_cliente', 'idade', 'cidade_residencia', 'data_registro', 'gasto_total']

    # Method 2: rename() (good for renaming only some columns)
    df_renomeado = df_renomeado.rename(columns={
        'ID_Cliente': 'id_cli',
        'Nome': 'nome',
        'Idade': 'idade', # Only the capitalization changes here
        'Cidade': 'cidade',
        'Valor_Gasto': 'valor',
        'Data_Cadastro': 'dt_cadastro'
    })

    print("Columns after renaming:")
    print(df_renomeado.columns)
    display(df_renomeado.head(2))
else:
    df_renomeado = pd.DataFrame() # Initialize empty if df_csv is not available
    print("df_csv is empty, skipping renaming.")
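
When there are many columns, the same standardization can be applied by rule instead of name by name. The sketch below is one possible convention (trim, lowercase, replace runs of other characters with `_`), shown on invented column names; it is not the notebook's required naming scheme.

```python
import pandas as pd

# Empty frame with deliberately inconsistent, made-up column names.
df = pd.DataFrame(columns=[" ID Cliente", "Valor-Gasto", "Data_Cadastro "])

df.columns = (
    df.columns.str.strip()                               # drop surrounding spaces
    .str.lower()                                         # lowercase everything
    .str.replace(r"[^0-9a-z]+", "_", regex=True)         # other characters -> '_'
)

print(list(df.columns))
```

Note that this pattern also turns accented letters into `_`, so names with accents would need to be normalized first.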

### 5.3. Ajustar Tipos de Dados (`dtypes`)

Converter colunas para tipos apropriados, se não foram ajustados na importação.

In [None]:
if not df_renomeado.empty:
    df_tipos = df_renomeado.copy()
    print("Types BEFORE the conversions:")
    df_tipos.info()

    # Example: if 'valor' were a string (object) column, we would convert it:
    # df_tipos['valor'] = df_tipos['valor'].astype(str) # Force to string for the example
    # df_tipos['valor'] = pd.to_numeric(df_tipos['valor'], errors='coerce')

    # Example: convert 'idade' (Int64) to float.
    # Note: Int64 -> float turns <NA> into NaN. Values are preserved, although
    # integers beyond 2**53 cannot be represented exactly as float64.
    try:
        df_tipos['idade'] = df_tipos['idade'].astype(float)
    except Exception as e:
        print(f"Error converting idade to float: {e}")

    # Example: if 'dt_cadastro' were a string column, we would convert it:
    # df_tipos['dt_cadastro'] = df_tipos['dt_cadastro'].astype(str)
    # df_tipos['dt_cadastro'] = pd.to_datetime(df_tipos['dt_cadastro'], errors='coerce', format='%Y-%m-%d')

    print("\nTypes AFTER the conversions:")
    df_tipos.info()
    display(df_tipos.head(3))
else:
    df_tipos = pd.DataFrame()
    print("df_renomeado is empty, skipping type adjustment.")
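
Since the conversions above are mostly commented out, the following self-contained sketch shows them running on a small hypothetical Series (`raw` is not part of the lesson's data). It illustrates how `errors='coerce'` turns unparseable text into a missing value and how that missing value is represented in `float64` versus the nullable `Int64` type.

```python
import pandas as pd

# Hypothetical text column with one unparseable entry.
raw = pd.Series(["10", "25", "n/a", "7"])

# Unparseable strings become NaN instead of raising an error.
as_number = pd.to_numeric(raw, errors="coerce")

# Nullable integer type: the missing value is stored as <NA>.
as_nullable_int = as_number.astype("Int64")

# Back to float: <NA> becomes NaN again.
back_to_float = as_nullable_int.astype(float)

print(as_number.dtype, as_nullable_int.dtype, back_to_float.dtype)
```

Coercion is convenient, but it hides bad input silently, so it is worth counting the missing values before and after the conversion.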

### 5.4. Reshaping: Wide vs. Long Format

*   **Wide:** An observation is spread across multiple columns (e.g., `vendas_2022`, `vendas_2023`).
*   **Long:** The Tidy format, with each observation in its own row.

#### `pd.melt()` (Wide to Long, like tidyr's `gather`)

In [None]:
df_wide = pd.DataFrame({
    'Aluno': ['João', 'Maria', 'Pedro'],
    'Nota_P1': [7.5, 8.0, 6.0],
    'Nota_P2': [8.5, 7.0, 9.0],
    'Nota_Trabalho': [9.0, 9.5, 8.0]
})
print("Wide DataFrame (original):")
display(df_wide)

df_long = pd.melt(
    df_wide,
    id_vars=['Aluno'],                                   # Identifier column(s)
    value_vars=['Nota_P1', 'Nota_P2', 'Nota_Trabalho'],  # Columns to stack
    var_name='Avaliacao',                                # New column holding the old column names
    value_name='Nota'                                    # New column holding the values
)

print("\nLong DataFrame (after melt):")
# sort_values makes the transformation easier to see
display(df_long.sort_values(by=['Aluno', 'Avaliacao']).reset_index(drop=True))

#### `df.pivot_table()` (Long to Wide, like tidyr's `spread`)

We use `pivot_table`, which is more flexible than `pivot` because it can aggregate when several rows share the same index/column combination.

In [None]:
# Using df_long from the previous example
print("Long DataFrame (original):")
display(df_long.head())

try:
    df_pivoted = pd.pivot_table(
        df_long,
        index='Aluno',       # Column(s) that form the new index
        columns='Avaliacao', # Column whose values become the new columns
        values='Nota',       # Column with the values to fill in
        aggfunc='first'      # Aggregation function (with duplicates, 'first' takes the first value)
                             # Without duplicates, the aggregation simply returns each value unchanged.
    )
    # pivot_table moves the `index` column into the index; reset_index() turns it back into a column
    df_pivoted = df_pivoted.reset_index()
    # Column order may change, so reorder to compare with the original
    df_pivoted = df_pivoted[['Aluno', 'Nota_P1', 'Nota_P2', 'Nota_Trabalho']]
    # Remove the name of the columns index (set by pivot_table)
    df_pivoted.columns.name = None

    print("\nWide DataFrame (after pivot_table):")
    display(df_pivoted)

except Exception as e:
    print(f"Error running pivot_table: {e}")
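
The difference between `pivot` and `pivot_table` only shows up when duplicates exist, which the example above does not have. The sketch below uses a small hypothetical table (`scores`, with one student who has two `P1` entries) to show `pivot` refusing to reshape while `pivot_table` aggregates. The choice of `'mean'` is an assumption for illustration; the right aggregation depends on what the duplicates mean.

```python
import pandas as pd

# Hypothetical long table: Ana has two P1 scores (a duplicate index/column pair).
scores = pd.DataFrame({
    "student": ["Ana", "Ana", "Ana", "Bia"],
    "exam": ["P1", "P1", "P2", "P1"],
    "score": [6.0, 8.0, 9.0, 7.0],
})

# pivot() cannot decide which duplicate to keep and raises ValueError.
try:
    scores.pivot(index="student", columns="exam", values="score")
    pivot_failed = False
except ValueError as e:
    pivot_failed = True
    print(f"pivot failed: {e}")

# pivot_table() resolves duplicates with an aggregation function.
wide = scores.pivot_table(index="student", columns="exam", values="score", aggfunc="mean")
print(wide)
```

Combinations that never occur in the long table (here, Bia's `P2`) appear as `NaN` in the wide result.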

### 5.5. Splitting and Combining Columns

#### `str.split()` (Splitting, like tidyr's `separate`)

In [None]:
df_sep = pd.DataFrame({'Codigo_Completo': ['PROD-A-10', 'PROD-B-25', 'SERV-C-05', 'PROD-D']}) # Includes one case with fewer parts
print("Original DataFrame:")
display(df_sep)

# Split the 'Codigo_Completo' column into three new columns using '-' as the delimiter
# expand=True returns the pieces as separate columns
# The number of columns created is based on the maximum number of splits found
# If a row has fewer pieces, the extra columns are filled with None
split_cols = df_sep['Codigo_Completo'].str.split('-', expand=True)

# Rename the automatically generated columns (0, 1, 2...)
split_cols.columns = [f'parte_{i+1}' for i in range(split_cols.shape[1])]

# Attach the new columns to the original DataFrame (or replace the original column)
df_sep = pd.concat([df_sep, split_cols], axis=1)

print("\nDataFrame after split:")
display(df_sep)
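
When the pieces have a known structure, `str.extract` with named regex groups is a possible alternative to `str.split`: it names the columns directly and lets the pattern state which parts are optional. The sketch below reuses the same codes; the group names (`kind`, `item`, `qty`) and the pattern itself are assumptions about what the code parts mean.

```python
import pandas as pd

codes = pd.Series(["PROD-A-10", "PROD-B-25", "SERV-C-05", "PROD-D"])

# Named groups become column names; the trailing numeric part is optional.
pattern = r"^(?P<kind>[A-Z]+)-(?P<item>[A-Z]+)(?:-(?P<qty>\d+))?$"
parts = codes.str.extract(pattern)

# The extracted pieces are text, so convert the numeric part explicitly.
parts["qty"] = pd.to_numeric(parts["qty"])
print(parts)
```

Rows that do not match the pattern at all would come back entirely missing, which makes malformed codes easy to spot.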

#### Combining Columns (like tidyr's `unite`)

Usually done with string concatenation.

In [None]:
df_unir = pd.DataFrame({
    'Prefixo': ['USR', 'ADM', 'USR'],
    'ID_Num': [101, 5, 22],
    'Status': ['Ativo', 'Inativo', 'Ativo']
})
print("Original DataFrame:")
display(df_unir)

# Combine columns to build a unique identifier
# Numbers must be converted to strings before concatenating
# str.zfill(3) pads the ID with leading zeros to at least 3 digits
df_unir['ID_Completo'] = (
    df_unir['Prefixo'] + '_'
    + df_unir['ID_Num'].astype(str).str.zfill(3) + '_'
    + df_unir['Status']
)

print("\nDataFrame after combining columns:")
display(df_unir)
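
Concatenation with `+` propagates missing values: if any piece is missing, the whole result is missing. A sketch of one way to handle this is `Series.str.cat` with `na_rep`. The data below (`df_ids`) and the `"UNK"` placeholder are hypothetical, used only to show the difference.

```python
import pandas as pd

# Hypothetical data with a missing prefix in the last row.
df_ids = pd.DataFrame({
    "prefix": ["USR", "ADM", None],
    "num": [101, 5, 22],
})
num_text = df_ids["num"].astype(str).str.zfill(3)

# With '+', a missing piece makes the whole identifier missing.
with_plus = df_ids["prefix"] + "_" + num_text

# str.cat() can substitute a placeholder for missing pieces instead.
with_cat = df_ids["prefix"].str.cat(num_text, sep="_", na_rep="UNK")

print(with_plus.tolist())
print(with_cat.tolist())
```

Whether a placeholder is appropriate depends on the use case; for a true unique identifier, leaving the value missing may be the safer choice.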

## 6. Integration into the Workflow

These stages (Collection -> Import -> Organization/Transformation) are interconnected and often iterative. The goal is a *Tidy* DataFrame that is ready for the next phases.

As a reminder:

*[Flowchart: Import/Collection -> Organization/Cleaning (Tidy) -> Explore cycle (Transformation -> Visualization -> Modeling -> back to Transformation) -> Communication]*

## 7. Conclusion

In this hands-on lesson, we covered:
*   **Collection:** Data sources (concept).
*   **Web Scraping:** Extracting HTML with `requests` and `BeautifulSoup` (and the ethical/legal implications).
*   **Import:** Reading files (`CSV`, `JSON`, `Excel`, `FWF`) and databases (`SQLite`) with `Pandas`, using parameters for fine-grained control (`dtype`, `na_values`, `parse_dates`, etc.).
*   **Organization:** Initial inspection, renaming, and adjusting data types.
*   **Structural Transformation:** Reshaping (`melt`, `pivot_table`) and column manipulation (`split`, concatenation) to reach the Tidy format.

Mastering these stages is essential to ensure that the data is of good quality and suited to the analyses that follow.

**Next Steps:** Data Cleaning (handling nulls and outliers) and Exploratory Data Analysis (EDA).