# Getting started with BigQuery in Google Colab
*   Based on the notebook https://colab.research.google.com/notebooks/bigquery.ipynb

*   Follow the steps below, as explained in the slides:
1.   Use the [Cloud Resource Manager](https://console.cloud.google.com/cloud-resource-manager) to **create a Google Cloud Platform (GCP) project** if you do not already have one.
2.   [Enable the BigQuery APIs](https://console.cloud.google.com/flows/enableapi?apiid=bigquery) for the project.

* Or watch the [video tutorial](https://www.youtube.com/watch?v=JLXLCv5nUCE)

In [None]:
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


## Preprocessing the municipality centroids

### Solution 1 - Processing the database file (dbf)




In [1]:
!pip install dbf

Collecting dbf
  Downloading dbf-0.99.1-py3-none-any.whl (107 kB)
Collecting aenum
  Downloading aenum-3.1.5-py3-none-any.whl (128 kB)

In [2]:
# Official source: https://www.ibge.gov.br/geociencias/organizacao-do-territorio/estrutura-territorial/27385-localidades.html?=&t=downloads

!wget https://github.com/renatocol/Latitude_Longitude_Brasil/raw/master/BR_Localidades_2010.dbf

--2021-11-15 22:33:11--  https://github.com/renatocol/Latitude_Longitude_Brasil/raw/master/BR_Localidades_2010.dbf
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/renatocol/Latitude_Longitude_Brasil/master/BR_Localidades_2010.dbf [following]
--2021-11-15 22:33:12--  https://raw.githubusercontent.com/renatocol/Latitude_Longitude_Brasil/master/BR_Localidades_2010.dbf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 17794056 (17M) [application/octet-stream]
Saving to: ‘BR_Localidades_2010.dbf’


2021-11-15 22:33:12 (103 MB/s) - ‘BR_Localidades_2010.dbf’ saved [17794056/17794056]



In [3]:
import dbf
import pandas as pd

table = dbf.Table(filename='./BR_Localidades_2010.dbf')
table.open(dbf.READ_ONLY)
df = pd.DataFrame(table)
table.close()

print(df)

          0                     1           2   ...         19          20   21
0          1  110001505000001       URBANO      ... -11.935540  337.735719  0.0
1          2  110001515000001       URBANO      ... -12.437239  215.244429  0.0
2          3  110001520000001       URBANO      ... -12.601415  181.044807  0.0
3          4  110001525000001       URBANO      ... -11.919792  191.576571  0.0
4          5  110001530000001       URBANO      ... -13.079806  157.285277  0.0
...      ...                   ...         ...  ...        ...         ...  ...
21881  21882  530010805180237       URBANO      ... -15.939671  911.712363  0.0
21882  21883  530010805180238       URBANO      ... -15.936009  926.632968  0.0
21883  21884  530010805180314       URBANO      ... -15.939968  902.635257  0.0
21884  21885  530010805200120       URBANO      ... -15.939726  921.346973  0.0
21885  21886  530010805200123       URBANO      ... -15.947606  953.389949  0.0

[21886 rows x 22 columns]


In [4]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21
0,1,110001505000001,URBANO,110001505006.0,Redondo ...,11000150500,...,110001505,ALTA FLORESTA D'OESTE ...,1100015,ALTA FLORESTA D'OESTE ...,CACOAL ...,LESTE RONDONIENSE ...,RONDÔNIA ...,1,5,CIDADE ...,ALTA FLORESTA D'OESTE ...,-61.999824,-11.93554,337.735719,0.0
1,2,110001515000001,URBANO,,...,11000151500,...,110001515,FILADÉLFIA D'OESTE ...,1100015,ALTA FLORESTA D'OESTE ...,CACOAL ...,LESTE RONDONIENSE ...,RONDÔNIA ...,2,15,VILA ...,FILADÉLFIA D'OESTE ...,-62.043898,-12.437239,215.244429,0.0
2,3,110001520000001,URBANO,,...,11000152000,...,110001520,IZIDOLÂNDIA ...,1100015,ALTA FLORESTA D'OESTE ...,CACOAL ...,LESTE RONDONIENSE ...,RONDÔNIA ...,2,20,VILA ...,IZIDOLÂNDIA ...,-62.175549,-12.601415,181.044807,0.0
3,4,110001525000001,URBANO,,...,11000152500,...,110001525,NOVA GEASE D'OESTE ...,1100015,ALTA FLORESTA D'OESTE ...,CACOAL ...,LESTE RONDONIENSE ...,RONDÔNIA ...,2,25,VILA ...,NOVA GEASE D'OESTE ...,-62.31865,-11.919792,191.576571,0.0
4,5,110001530000001,URBANO,,...,11000153000,...,110001530,ROLIM DE MOURA DO GUAPORÉ ...,1100015,ALTA FLORESTA D'OESTE ...,CACOAL ...,LESTE RONDONIENSE ...,RONDÔNIA ...,2,30,VILA ...,ROLIM DE MOURA DO GUAPORÉ ...,-62.276812,-13.079806,157.285277,0.0
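
The DataFrame built from the dbf table only has positional column labels (0 to 21), which makes a selection such as `df[[9,16,18,19]]` hard to read. A minimal sketch of one way to give the columns names, assuming the dbf fields are stored in the same order as the header of the xlsx export of this file used in Solution 2 (the field names come from that header; the helper `name_columns` and the one-row `demo` frame are made up for illustration):

```python
import pandas as pd

# Field names in the order they appear in the xlsx export of BR_Localidades_2010
# (assumption: the dbf file stores its fields in the same order).
FIELD_NAMES = [
    "ID", "CD_GEOCODI", "TIPO", "CD_GEOCODB", "NM_BAIRRO", "CD_GEOCODS",
    "NM_SUBDIST", "CD_GEOCODD", "NM_DISTRIT", "CD_GEOCODM", "NM_MUNICIP",
    "NM_MICRO", "NM_MESO", "NM_UF", "CD_NIVEL", "CD_CATEGOR", "NM_CATEGOR",
    "NM_LOCALID", "LONG", "LAT", "ALT", "GMRotation",
]

def name_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with positional labels 0..21 replaced by field names."""
    if frame.shape[1] != len(FIELD_NAMES):
        raise ValueError("unexpected number of columns")
    return frame.rename(columns=dict(enumerate(FIELD_NAMES)))

# Illustrative one-row frame with 22 positional columns (placeholder values).
demo = pd.DataFrame([list(range(22))])
named = name_columns(demo)
print(named[["CD_GEOCODM", "NM_CATEGOR", "LONG", "LAT"]])
```

With named columns, the positions 9, 16, 18 and 19 correspond to `CD_GEOCODM`, `NM_CATEGOR`, `LONG` and `LAT`.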


In [9]:

# SQL equivalent: select col9 as cod_ibge, col16 as categoria, col18 as long, col19 as lat from df
df_geo = df[[9,16,18,19]].rename(columns={9:"cod_ibge", 16:"categoria", 18:"long", 19:"lat"})
df_geo['cod_ibge'] = df_geo['cod_ibge'].str.strip()
df_geo['categoria'] = df_geo['categoria'].str.strip()
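# Keep only city rows; .copy() avoids a SettingWithCopyWarning when columns are added later.
df_geo = df_geo[df_geo['categoria']=='CIDADE'].copy()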
df_geo.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5565 entries, 0 to 21855
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   cod_ibge   5565 non-null   object 
 1   categoria  5565 non-null   object 
 2   long       5565 non-null   float64
 3   lat        5565 non-null   float64
dtypes: float64(2), object(2)
memory usage: 217.4+ KB
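
The `.str.strip()` calls matter because dbf character fields are fixed-width and padded with spaces, so an unstripped value never compares equal to `'CIDADE'`. A small sketch with made-up padded values (the `padded` frame is hypothetical, not the real table):

```python
import pandas as pd

# Made-up rows imitating space-padded dbf character fields.
padded = pd.DataFrame({
    "cod_ibge": ["1100015   ", "1100015   ", "1100023   "],
    "categoria": ["CIDADE    ", "VILA      ", "CIDADE    "],
})

# Without stripping, the comparison matches nothing.
unstripped_matches = int((padded["categoria"] == "CIDADE").sum())

# Strip every text column, then filter on the cleaned category.
clean = padded.apply(lambda col: col.str.strip())
cities = clean[clean["categoria"] == "CIDADE"].copy()

print(unstripped_matches)   # 0
print(len(cities))          # 2
```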


In [10]:
df_geo['lat_long'] = df_geo[['lat','long']].apply(lambda x: f"{str(x['lat']).replace(',','.')},{str(x['long']).replace(',','.')}", axis=1)
df_geo.head()

Unnamed: 0,cod_ibge,categoria,long,lat,lat_long
0,1100015,CIDADE,-61.999824,-11.93554,"-11.9355403048,-61.9998238963"
6,1100023,CIDADE,-63.033269,-9.908463,"-9.90846286657,-63.033269278"
7,1100031,CIDADE,-60.544314,-13.499763,"-13.4997634597,-60.5443135812"
9,1100049,CIDADE,-61.442944,-11.433865,"-11.4338650287,-61.4429442118"
18,1100056,CIDADE,-60.818426,-13.195033,"-13.195033032,-60.8184261647"
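
The row-wise `apply` above can also be written as a column-wise string concatenation, which avoids building a Series object for every row. A hedged sketch on two made-up rows (coordinates rounded from the table above); it is an alternative formulation, not the notebook's original code:

```python
import pandas as pd

geo = pd.DataFrame({
    "cod_ibge": ["1100015", "1100023"],
    "lat": [-11.93554, -9.908463],
    "long": [-61.999824, -63.033269],
})

# Convert each coordinate to text and join the pair as "lat,long".
geo["lat_long"] = geo["lat"].map(str) + "," + geo["long"].map(str)
print(geo["lat_long"].tolist())
```

Because `str()` of a float already uses a dot as the decimal separator, no comma replacement is needed in this form.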


### Solution 2 - Convert the dbf file to xlsx using Excel, then process the xlsx file with pandas
#### Advantage: faster than processing the dbf file with pandas.
#### URL of the file exported to xlsx: https://github.com/alexlopespereira/enapespcd2021/raw/main/data/originais/centroide_municipios/BR_Localidades_2010_v1.xlsx

In [11]:
df_xlsx = pd.read_excel('https://github.com/alexlopespereira/enapespcd2021/raw/main/data/originais/centroide_municipios/BR_Localidades_2010_v1.xlsx')
df_xlsx.head()

Unnamed: 0,"ID,N,10,0","CD_GEOCODI,C,20","TIPO,C,10","CD_GEOCODB,C,20","NM_BAIRRO,C,60","CD_GEOCODS,C,20","NM_SUBDIST,C,60","CD_GEOCODD,C,20","NM_DISTRIT,C,60","CD_GEOCODM,C,20","NM_MUNICIP,C,60","NM_MICRO,C,100","NM_MESO,C,100","NM_UF,C,60","CD_NIVEL,C,1","CD_CATEGOR,C,5","NM_CATEGOR,C,50","NM_LOCALID,C,60","LONG,N,24,6","LAT,N,24,6","ALT,N,24,5","GMRotation,N,24,5"
0,1,110001505000001,URBANO,110001500000.0,Redondo,11000150500,,110001505,ALTA FLORESTA D'OESTE,1100015,ALTA FLORESTA D'OESTE,CACOAL,LESTE RONDONIENSE,RONDÔNIA,1,5,CIDADE,ALTA FLORESTA D'OESTE,-61.999824,-11.93554,337.735719,0
1,2,110001515000001,URBANO,,,11000151500,,110001515,FILADÉLFIA D'OESTE,1100015,ALTA FLORESTA D'OESTE,CACOAL,LESTE RONDONIENSE,RONDÔNIA,2,15,VILA,FILADÉLFIA D'OESTE,-62.043898,-12.437239,215.244429,0
2,3,110001520000001,URBANO,,,11000152000,,110001520,IZIDOLÂNDIA,1100015,ALTA FLORESTA D'OESTE,CACOAL,LESTE RONDONIENSE,RONDÔNIA,2,20,VILA,IZIDOLÂNDIA,-62.175549,-12.601415,181.044807,0
3,4,110001525000001,URBANO,,,11000152500,,110001525,NOVA GEASE D'OESTE,1100015,ALTA FLORESTA D'OESTE,CACOAL,LESTE RONDONIENSE,RONDÔNIA,2,25,VILA,NOVA GEASE D'OESTE,-62.31865,-11.919792,191.576571,0
4,5,110001530000001,URBANO,,,11000153000,,110001530,ROLIM DE MOURA DO GUAPORÉ,1100015,ALTA FLORESTA D'OESTE,CACOAL,LESTE RONDONIENSE,RONDÔNIA,2,30,VILA,ROLIM DE MOURA DO GUAPORÉ,-62.276812,-13.079806,157.285277,0


In [16]:
dfxlsx_geo = df_xlsx[['CD_GEOCODM,C,20','NM_CATEGOR,C,50','LONG,N,24,6','LAT,N,24,6']].rename(columns={'CD_GEOCODM,C,20':"cod_ibge", 'NM_CATEGOR,C,50':"categoria", 'LONG,N,24,6':"long", 'LAT,N,24,6':"lat"})
dfxlsx_geo.head()

Unnamed: 0,cod_ibge,categoria,long,lat
0,1100015,CIDADE,-61.999824,-11.93554
1,1100015,VILA,-62.043898,-12.437239
2,1100015,VILA,-62.175549,-12.601415
3,1100015,VILA,-62.31865,-11.919792
4,1100015,VILA,-62.276812,-13.079806


In [19]:
# dfxlsx_geo['cod_ibge'] = dfxlsx_geo['cod_ibge'].str.strip()
dfxlsx_geo['categoria'] = dfxlsx_geo['categoria'].str.strip()
# Keep only city rows; .copy() avoids a SettingWithCopyWarning on the assignment below.
dfxlsx_geo = dfxlsx_geo[dfxlsx_geo['categoria']=='CIDADE'].copy()
dfxlsx_geo['lat_long'] = dfxlsx_geo[['lat','long']].apply(lambda x: f"{str(x['lat']).replace(',','.')},{str(x['long']).replace(',','.')}", axis=1)
dfxlsx_geo.head()

Unnamed: 0,cod_ibge,categoria,long,lat,lat_long
0,1100015,CIDADE,-61.999824,-11.93554,"-11.9355403047646,-61.9998238962936"
6,1100023,CIDADE,-63.033269,-9.908463,"-9.9084628665672,-63.0332692780484"
7,1100031,CIDADE,-60.544314,-13.499763,"-13.4997634596963,-60.5443135812009"
9,1100049,CIDADE,-61.442944,-11.433865,"-11.4338650286852,-61.4429442118224"
18,1100056,CIDADE,-60.818426,-13.195033,"-13.1950330320399,-60.8184261646815"
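
Note that `cod_ibge` is left unstripped in Solution 2, presumably because `read_excel` parses it as a number rather than as text. If that is the case, the key would need to be converted to a string before joining it with a text `id_municipio` column. A sketch under that assumption (the helper `normalize_ibge_code` and the sample values are illustrative):

```python
import pandas as pd

def normalize_ibge_code(codes: pd.Series) -> pd.Series:
    """Return municipality codes as stripped strings, whatever dtype they came in as."""
    if pd.api.types.is_numeric_dtype(codes):
        # Go through a nullable integer so that 1100015.0 becomes "1100015".
        return codes.astype("Int64").astype(str)
    return codes.astype(str).str.strip()

from_xlsx = pd.Series([1100015, 1100023])          # parsed as integers
from_dbf = pd.Series(["1100015   ", "1100023 "])   # padded text

print(normalize_ibge_code(from_xlsx).tolist())
print(normalize_ibge_code(from_dbf).tolist())
```

Applying the same normalization to both sides of a join keeps the key dtypes aligned regardless of which solution produced the centroid table.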


## Join with the GDP per capita table

In [26]:
import pandas as pd
# Set this to the id of your own BigQuery project.
project_id = 'enap-331414'

df_pibpercapita = pd.io.gbq.read_gbq('''
SELECT pop.*, dsc.nome_municipio, pib.pib, pib.pib/pop.populacao as pibpercapita FROM `basedosdados.br_ibge_populacao.municipio` pop
LEFT JOIN `basedosdados.br_ibge_pib.municipio` pib on pop.id_municipio = pib.id_municipio and pib.ano = pop.ano
LEFT JOIN (
    select distinct (sc.id_municipio), sc.nome_municipio from `basedosdados.br_geobr_mapas.setor_censitario_2010` sc
    ) as dsc on dsc.id_municipio = pop.id_municipio
''', project_id=project_id)

df_pibpercapita.head()

Unnamed: 0,ano,sigla_uf,id_municipio,populacao,nome_municipio,pib,pibpercapita
0,1991,RO,1100015,31981.0,Alta Floresta D'oeste,,
1,1992,RO,1100015,34768.0,Alta Floresta D'oeste,,
2,1993,RO,1100015,37036.0,Alta Floresta D'oeste,,
3,1994,RO,1100015,39325.0,Alta Floresta D'oeste,,
4,1995,RO,1100015,41574.0,Alta Floresta D'oeste,,
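
In the query above, `select distinct (sc.id_municipio), sc.nome_municipio` applies `DISTINCT` to the whole row, not only to `id_municipio`; the parentheses have no effect. If a municipality ever appeared with two spellings of its name, the left join would duplicate its population rows. The following pandas sketch reproduces that situation with made-up data and shows one possible guard:

```python
import pandas as pd

# Made-up population rows (values are illustrative only).
pop = pd.DataFrame({"id_municipio": ["1100015", "1100023"], "populacao": [31981, 50000]})

# Hypothetical lookup where one id has two name variants.
names = pd.DataFrame({
    "id_municipio": ["1100015", "1100015", "1100023"],
    "nome_municipio": ["Alta Floresta D'oeste", "Alta Floresta D'Oeste", "Ariquemes"],
})

# Row-wise distinct (what SELECT DISTINCT does) keeps both variants...
row_distinct = names.drop_duplicates()
duplicated_join = pop.merge(row_distinct, how="left", on="id_municipio")

# ...while one row per key keeps the join at one row per population record.
key_distinct = names.drop_duplicates(subset="id_municipio")
safe_join = pop.merge(key_distinct, how="left", on="id_municipio")

print(len(duplicated_join), len(safe_join))  # 3 2
```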


In [30]:
# Relational algebra: join, i.e., combining the two tables on the municipality code.
df_merge = df_pibpercapita.merge(df_geo[['cod_ibge','lat_long']], how='left', left_on='id_municipio', right_on='cod_ibge') 
df_merge.head()

Unnamed: 0,ano,sigla_uf,id_municipio,populacao,nome_municipio,pib,pibpercapita,cod_ibge,lat_long
0,1991,RO,1100015,31981.0,Alta Floresta D'oeste,,,1100015,"-11.9355403048,-61.9998238963"
1,1992,RO,1100015,34768.0,Alta Floresta D'oeste,,,1100015,"-11.9355403048,-61.9998238963"
2,1993,RO,1100015,37036.0,Alta Floresta D'oeste,,,1100015,"-11.9355403048,-61.9998238963"
3,1994,RO,1100015,39325.0,Alta Floresta D'oeste,,,1100015,"-11.9355403048,-61.9998238963"
4,1995,RO,1100015,41574.0,Alta Floresta D'oeste,,,1100015,"-11.9355403048,-61.9998238963"
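
A left join silently leaves `NaN` in `lat_long` for any municipality missing from `df_geo`. pandas can report this directly: `indicator=True` adds a `_merge` column, and `validate='many_to_one'` raises an error if the right-hand key is not unique. A minimal sketch on made-up frames (the code `9999999` is a fictitious id used to force a miss):

```python
import pandas as pd

pib = pd.DataFrame({
    "ano": [1991, 1992, 1991],
    "id_municipio": ["1100015", "1100015", "9999999"],
})
geo = pd.DataFrame({
    "cod_ibge": ["1100015"],
    "lat_long": ["-11.9355403048,-61.9998238963"],
})

merged = pib.merge(
    geo, how="left", left_on="id_municipio", right_on="cod_ibge",
    indicator=True, validate="many_to_one",
)

# Rows flagged "left_only" found no centroid.
unmatched = merged.loc[merged["_merge"] == "left_only", "id_municipio"].unique().tolist()
print(unmatched)  # ['9999999']

# Drop the helper columns once the check is done.
merged = merged.drop(columns=["cod_ibge", "_merge"])
```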


In [31]:
del df_merge['cod_ibge']
df_merge.head()

Unnamed: 0,ano,sigla_uf,id_municipio,populacao,nome_municipio,pib,pibpercapita,lat_long
0,1991,RO,1100015,31981.0,Alta Floresta D'oeste,,,"-11.9355403048,-61.9998238963"
1,1992,RO,1100015,34768.0,Alta Floresta D'oeste,,,"-11.9355403048,-61.9998238963"
2,1993,RO,1100015,37036.0,Alta Floresta D'oeste,,,"-11.9355403048,-61.9998238963"
3,1994,RO,1100015,39325.0,Alta Floresta D'oeste,,,"-11.9355403048,-61.9998238963"
4,1995,RO,1100015,41574.0,Alta Floresta D'oeste,,,"-11.9355403048,-61.9998238963"


In [32]:
project_id = 'enap-331414'

In [33]:
df_merge.to_gbq("enapdatasets.pibpercapita",
              project_id=project_id,
              chunksize=40000,
              if_exists='replace',
              )

5it [00:20,  4.06s/it]
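
The progress line `5it` most likely counts upload chunks: with `chunksize=40000`, the DataFrame would be sent in pieces of at most 40,000 rows. Assuming that reading is right, the expected number of chunks can be estimated before uploading; the helper `expected_chunks` and the row count below are made up for illustration and do not reflect the real size of `df_merge`:

```python
import math

def expected_chunks(n_rows: int, chunksize: int) -> int:
    """Number of pieces needed to send n_rows in chunks of at most chunksize rows."""
    if chunksize <= 0:
        raise ValueError("chunksize must be positive")
    return math.ceil(n_rows / chunksize)

# Hypothetical row count chosen for illustration.
print(expected_chunks(170_000, 40_000))  # 5
```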
