# Web Scraping

Web scraping means collecting data automatically, without going through an API. A scraper can query a web server, request data, and then parse the response to extract the information it needs.
It is generally used when there is no other way to obtain the data: public databases are often not directly available, but the data is still exposed through endpoints.
***Beware of the LGPD*** (Brazil's General Data Protection Law)!

**BeautifulSoup**: one of the most widely used Python libraries for web scraping. It builds a tree from the elements of a page and provides a simple interface for accessing them.
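To make the "tree plus simple interface" idea concrete, here is a minimal sketch that parses a small inline HTML string instead of a live page. The HTML snippet and the variable names are made up for illustration; only `BeautifulSoup`, `find`, `find_all` and `.parent` come from the library.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, made up for illustration only.
html = """
<html><body>
  <h1 id="title">Languages</h1>
  <ul>
    <li class="lang">Python</li>
    <li class="lang">Java</li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns the first matching element (or None if nothing matches).
title = soup.find("h1", {"id": "title"}).text

# find_all returns every matching element.
langs = [li.text for li in soup.find_all("li", class_="lang")]

# Each element knows its parent, so the tree can be walked upwards too.
parent_name = soup.find("li").parent.name

print(title)        # Languages
print(langs)        # ['Python', 'Java']
print(parent_name)  # ul
```

The same three calls are all the demonstration in this section relies on.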

## Demonstration

In [1]:
import requests
from bs4 import BeautifulSoup as bsoup
import pandas as pd

In [2]:
# Download the page and parse it into a BeautifulSoup tree
html = requests.get('https://statisticstimes.com/tech/top-computer-languages.php').content
soup = bsoup(html, 'html.parser')
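The request in this cell assumes the server answers quickly and successfully. A slightly more defensive version might look like the sketch below. The helper name `fetch_soup`, the 10-second timeout and the optional `session` parameter are assumptions added for illustration, not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10.0, session=None):
    """Download `url` and return its parsed tree (hypothetical helper).

    `session` can be any object with a requests-style `get` method;
    when omitted, the module-level `requests.get` is used.
    """
    http = session if session is not None else requests
    # Without a timeout, a stalled server can block the call indefinitely.
    response = http.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")


# Usage with the page from this demonstration (needs network access):
# soup = fetch_soup('https://statisticstimes.com/tech/top-computer-languages.php')
```

Failing early on an HTTP error tends to be easier to debug than a later `None` from `find` on an unexpected page.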

In [3]:
# find_all returns every <p> element on the page, not only the first one
primeiro_paragrafo = soup.find_all('p')
primeiro_paragrafo

[<p>Python is the top programming language in TIOBE and PYPL Index. C closely follow Top-ranked Python in TIOBE. 
 In PYPL, a gap is wider as top-ranked Python has taken a lead of close to 10% from 2nd ranked Java.</p>,
 <p><b>TIOBE:</b> Python, C, Java and C++ are way ahead of others in TIOBE Index. C++ is about to surpass Java.
 C# and Visual Basic are very close to each other at 5th and 6th number. These four have negative
 trends in the past five years: Java, C, C#, and PHP. PHP was at 3rd position in Mar 2010 is now at 13th. 
 Positions of Java and C have not been much affected, but their ratings are constantly declining. The rating
 of Java has declined from 26.49% in June 2001 to 10.47% in Jun 2022.</p>,
 <p><b>PYPL:</b> Acc to PYPL, which publishes separate ranking for five countries, Python is the top language in all five countries
 (US, India, Germany, United Kingdom, France). Python has taken a huge lead in these five countries over the 2nd number
 of Java, and its shares ar

In [4]:
# Select the table with id 'table_id1' and keep only its <tbody> rows
ta_bela = soup.find('table', {'id': 'table_id1'}).find('tbody')
ta_bela

<tbody>
<tr><td class="data1">1</td><td class="data1"></td><td class="name">Python</td><td class="data1"> 27.61 %</td><td class="data1">-2.8 %</td></tr>
<tr><td class="data1">2</td><td class="data1"></td><td class="name">Java</td><td class="data1"> 17.64 %</td><td class="data1">-0.7 %</td></tr>
<tr><td class="data1">3</td><td class="data1"></td><td class="name">JavaScript</td><td class="data1"> 9.21 %</td><td class="data1">+0.4 %</td></tr>
<tr><td class="data1">4</td><td class="data1"></td><td class="name">C#</td><td class="data1"> 7.79 %</td><td class="data1">+0.8 %</td></tr>
<tr><td class="data1">5</td><td class="data1"></td><td class="name">C/C++</td><td class="data1"> 7.01 %</td><td class="data1">+0.4 %</td></tr>
<tr><td class="data1">6</td><td class="data1"></td><td class="name">PHP</td><td class="data1"> 5.27 %</td><td class="data1">-1.0 %</td></tr>
<tr><td class="data1">7</td><td class="data1"></td><td class="name">R</td><td class="data1"> 4.26 %</td><td class="data1">+0.5 %</td

In [5]:
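import numpy as np

# Each <tr> is one language; judging by the output above, the cells are:
# rank, an empty cell, name, share and a signed change.
linhas = ta_bela.find_all('tr')
linguagens = []
pontos = []

for linha in linhas:
    dados = linha.find_all('td')
    linguagens.append(dados[2].text)
    # ' 27.61 %' -> 27.61
    pontos.append(float(dados[3].text.replace(' %', '')))

print(linguagens)
# Sanity check: the shares should add up to about 100
print(np.sum(pontos))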

['Python', 'Java', 'JavaScript', 'C#', 'C/C++', 'PHP', 'R', 'TypeScript', 'Objective-C', 'Swift', 'Matlab', 'Kotlin', 'Go', 'Rust', 'Ruby', 'VBA', 'Ada', 'Scala', 'Visual Basic', 'Dart', 'Abap', 'Lua', 'Groovy', 'Perl', 'Julia', 'Cobol', 'Haskell', 'Delphi/Pascal']
100.0
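The percentage cleanup in the loop above can be pulled into a small helper, which also handles the signed values in the last column. This is only a sketch: `parse_percent` is a hypothetical name, and returning `None` for an empty cell is an assumption about how blanks should be treated.

```python
def parse_percent(text):
    """Convert a cell such as ' 27.61 %' or '+0.4 %' into a float.

    Returns None for an empty cell (an assumption made for this sketch).
    """
    cleaned = text.replace("%", "").strip()
    if not cleaned:
        return None
    return float(cleaned)


# Cell texts copied from the table output earlier in this section.
shares = [parse_percent(t) for t in (" 27.61 %", " 17.64 %", " 9.21 %")]
changes = [parse_percent(t) for t in ("-2.8 %", "-0.7 %", "+0.4 %")]

print(shares)   # [27.61, 17.64, 9.21]
print(changes)  # [-2.8, -0.7, 0.4]
```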


In [6]:
dados = pd.DataFrame(linguagens, columns=['Linguagem'])
dados['Pontos (%)'] = pontos
dados

|   | Linguagem | Pontos (%) |
|---|---|---|
| 0 | Python | 27.61 |
| 1 | Java | 17.64 |
| 2 | JavaScript | 9.21 |
| 3 | C# | 7.79 |
| 4 | C/C++ | 7.01 |
| 5 | PHP | 5.27 |
| 6 | R | 4.26 |
| 7 | TypeScript | 2.43 |
| 8 | Objective-C | 2.21 |
| 9 | Swift | 2.17 |
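The two-step construction in cell `In [6]` can also be written as a single call by passing a dictionary of columns. The sketch below uses only the first three rows from the output above, and the `top` variable is added for illustration.

```python
import pandas as pd

# First three rows copied from the output above.
linguagens = ["Python", "Java", "JavaScript"]
pontos = [27.61, 17.64, 9.21]

# One column per dictionary key; both lists must have the same length.
dados = pd.DataFrame({"Linguagem": linguagens, "Pontos (%)": pontos})

# Sort explicitly rather than relying on the order of the scraped page.
top = dados.sort_values("Pontos (%)", ascending=False).head(2)
print(top)
```

Building the frame from a dictionary keeps the column names and their data side by side, which can make a length mismatch between the lists easier to spot.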
