# Data Acquisition and Loading

There are many ways to get a dataset, such as:
- API
- Scraping
- File download


## Files

To keep the project organized, all scripts that fetch data must live in the `src` directory.

```
├── src                <- Source code for use in this project.
    ├── __init__.py    <- Makes src a Python module
    │
    ├── make_dataset.py <- Scripts to download or generate data

```

### Download files
- Save the script in `src/make_dataset.py`.
- If another notebook needs to import it, use:<br/>
`from <package>.<module> import <class>`
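
As a minimal sketch of what such a script might contain, the helper below downloads one file using only the standard library. The function name `download_file` and the `data/raw` default are assumptions made for illustration; they are not taken from the project.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url: str, dest_dir: str = "data/raw") -> Path:
    """Download `url` into `dest_dir` and return the path of the saved file."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Use the last segment of the URL path as the local file name
    filename = url.rstrip("/").split("/")[-1]
    target = dest / filename

    # Stream the response body to disk instead of loading it into memory
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)

    return target
```

If this function lived in `src/make_dataset.py`, a notebook could import it with `from src.make_dataset import download_file`.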

## Imports

In [1]:
# Data analysis and data wrangling
import numpy as np
import pandas as pd

# Plotting
import seaborn as sns
import matplotlib.pyplot as plt

# PCA
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D

# Preprocessing
from sklearn.preprocessing import LabelEncoder

# Machine learning
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
import xgboost as xgb

# Metrics
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Dataset
from sklearn.datasets import load_iris

# Other
from IPython.display import Image
import configparser
import subprocess
import warnings
import pprint
import time
import os

---

## Cell Format

In [2]:
# Guarantees visualization inside the jupyter
%matplotlib inline

# OPTIONAL: Load the "autoreload" extension so that code can change
%load_ext autoreload

# Format floats in all tables with 6 significant digits
pd.set_option('display.float_format', '{:.6}'.format)

# Show all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Suppress unnecessary warnings so that the presentation looks clean
warnings.filterwarnings('ignore')

# Pretty print
pp = pprint.PrettyPrinter(indent=4)

### Prepare the Main Directory

In [3]:
def prepare_directory_work(end_directory: str = 'notebooks'):
    # Directory the notebook is currently running from
    curr_dir = os.getcwd()

    # When running inside `end_directory`, move up to the project root
    # and return the directory we started from
    if curr_dir.endswith(end_directory):
        os.chdir('..')
        return curr_dir

    return f'Current working directory: {curr_dir}'

In [4]:
prepare_directory_work(end_directory='notebooks')

'/home/campos/projetos/artificial_inteligence/data-science/flow_analysis/notebooks'
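
A variant that does not depend on the name of the notebook folder is to walk up the directory tree until a known marker is found. The sketch below is only an illustration: the name `find_project_root` and the choice of `src` as the marker are assumptions.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Return the closest ancestor of `start` (or `start` itself) that contains `marker`."""
    start = start.resolve()
    # Check the starting directory first, then each parent from nearest to farthest
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No directory containing '{marker}' found above {start}")
```

In a notebook this could be used as `os.chdir(find_project_root(Path.cwd()))`.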

### API

It is very important that this process be automated.

In [15]:
import requests
import json
import datetime


url = 'https://www.alphavantage.co/query?function=CURRENCY_EXCHANGE_RATE&from_currency=USD&to_currency=BRL&apikey=O3MU8PIQJXVFJ3F3'
response = requests.get(url)

# Get the data
data_realtime = response.json()
pp.pprint(data_realtime)

# Date stamp for the file name
now = datetime.datetime.now().strftime("%Y-%m-%d")

# Save the response as JSON
with open('data/dumps/' + 'USD-BRL-' + now + '.json', mode='w') as writer:
    writer.write(json.dumps(data_realtime))
    print('Data Storaged!')

{   'Realtime Currency Exchange Rate': {   '1. From_Currency Code': 'USD',
                                           '2. From_Currency Name': 'United '
                                                                    'States '
                                                                    'Dollar',
                                           '3. To_Currency Code': 'BRL',
                                           '4. To_Currency Name': 'Brazilian '
                                                                  'Real',
                                           '5. Exchange Rate': '3.76610000',
                                           '6. Last Refreshed': '2019-07-10 '
                                                                '16:06:13',
                                           '7. Time Zone': 'UTC',
                                           '8. Bid Price': '-',
                                           '9. Ask Price': '-'}}
Data Storaged!
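
To make the request easier to automate, the URL construction and the file writing can be separated into small functions. The following is a sketch under stated assumptions: the endpoint and query parameter names come from the URL used in the cell above, while the function names `build_exchange_rate_url` and `save_json` are hypothetical.

```python
import datetime
import json
from pathlib import Path
from urllib.parse import urlencode

BASE_URL = "https://www.alphavantage.co/query"


def build_exchange_rate_url(from_currency: str, to_currency: str, api_key: str) -> str:
    """Build the query URL for the CURRENCY_EXCHANGE_RATE function."""
    params = {
        "function": "CURRENCY_EXCHANGE_RATE",
        "from_currency": from_currency,
        "to_currency": to_currency,
        "apikey": api_key,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def save_json(payload: dict, dest_dir: str, prefix: str) -> Path:
    """Write `payload` to `<dest_dir>/<prefix>-<YYYY-MM-DD>.json` and return the path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    today = datetime.date.today().strftime("%Y-%m-%d")
    target = dest / f"{prefix}-{today}.json"
    with open(target, "w", encoding="utf-8") as writer:
        json.dump(payload, writer, ensure_ascii=False, indent=2)
    return target
```

A notebook cell could then read the key from an environment variable (for example a hypothetical `ALPHAVANTAGE_API_KEY`), call `requests.get(build_exchange_rate_url("USD", "BRL", api_key)).json()`, and pass the result to `save_json(data, "data/dumps", "USD-BRL")`. This keeps the key out of the notebook itself.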


### Scraping a page
It is very important that this process be automated.

- Examples
  - Download senators' public expenses
  - Download stock prices

#### Senators' public expenses

- Install Beautiful Soup:
`pip install beautifulsoup4`

In [6]:
# Site to get csv
url = 'https://www12.senado.leg.br/transparencia/dados-abertos-transparencia/dados-abertos-ceaps'

In [7]:
# from <package>.<module> import <class>
from src.dump_data.dump_data import dump_file_csv

In [9]:
dump_file_csv(url)

Try analysing page ...
http://www.senado.gov.br/transparencia/LAI/verba/2019.csv
http://www.senado.gov.br/transparencia/LAI/verba/2018.csv
http://www.senado.gov.br/transparencia/LAI/verba/2017.csv
http://www.senado.gov.br/transparencia/LAI/verba/2016.csv
http://www.senado.gov.br/transparencia/LAI/verba/2015.csv
http://www.senado.gov.br/transparencia/LAI/verba/2014.csv
http://www.senado.gov.br/transparencia/LAI/verba/2013.csv
http://www.senado.gov.br/transparencia/LAI/verba/2012.csv
http://www.senado.gov.br/transparencia/LAI/verba/2011.csv
http://www.senado.gov.br/transparencia/LAI/verba/2010.csv
http://www.senado.gov.br/transparencia/LAI/verba/2009.csv
http://www.senado.gov.br/transparencia/LAI/verba/2008.csv
data/dumps/2019.csv downloaded!
data/dumps/2018.csv downloaded!
data/dumps/2017.csv downloaded!
data/dumps/2016.csv downloaded!
data/dumps/2015.csv downloaded!
data/dumps/2014.csv downloaded!
data/dumps/2013.csv downloaded!
data/dumps/2012.csv downloaded!
data/dumps/2011.csv downl
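
The implementation of `dump_file_csv` is not shown here. As a rough illustration of the link-extraction step only, the sketch below collects every `.csv` link from a page. It deliberately swaps in the standard library's `html.parser` instead of Beautiful Soup so that it runs without extra dependencies, and the names `CsvLinkParser` and `extract_csv_links` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CsvLinkParser(HTMLParser):
    """Collect the href of every <a> tag that points to a .csv file."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".csv"):
            # Resolve relative links against the URL of the page
            self.links.append(urljoin(self.base_url, href))


def extract_csv_links(html: str, base_url: str) -> list:
    """Return the absolute URLs of all CSV links found in `html`."""
    parser = CsvLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

Each returned URL could then be passed to a download step that saves the file under `data/dumps/`.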

## Load data

- Using `open`
- Using pandas `read_csv`

#### open

In [17]:
open('data/raw/enrollments.csv', 'rb')

<_io.BufferedReader name='data/raw/enrollments.csv'>
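
The call above returns a file object and leaves it open. A common pattern is to open the file in a `with` block so that it is closed automatically. The sketch below reads the first rows with the standard `csv` module; the helper name `preview_csv` is an assumption for illustration.

```python
import csv
from itertools import islice


def preview_csv(path: str, n_rows: int = 5) -> list:
    """Return the header plus up to `n_rows` data rows of a CSV file."""
    # newline="" lets the csv module handle line endings itself
    with open(path, mode="r", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle)
        return list(islice(reader, n_rows + 1))
```

For example, `preview_csv('data/raw/enrollments.csv')` would return the header row followed by the first five records as lists of strings.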

#### pandas

In [18]:
import pandas as pd

In [20]:
%%time

dataframe_name = pd.read_csv('data/raw/enrollments.csv', 
                            encoding='utf8',
                            delimiter=',',
                            verbose=True)

Tokenization took: 1.38 ms
Type conversion took: 2.01 ms
Parser memory cleanup took: 0.01 ms
CPU times: user 9.94 ms, sys: 6 µs, total: 9.94 ms
Wall time: 8.8 ms
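
`read_csv` can also set column types while loading, which avoids a separate conversion step. The sketch below uses an in-memory sample so that it runs on its own; the column names and values are hypothetical and do not describe the real contents of `data/raw/enrollments.csv`.

```python
import io

import pandas as pd

# Hypothetical sample standing in for a CSV file on disk
sample = io.StringIO(
    "account_key,status,join_date\n"
    "448,canceled,2014-11-10\n"
    "700,current,2015-01-05\n"
)

df = pd.read_csv(
    sample,
    dtype={"account_key": str},   # keep identifiers as text, not integers
    parse_dates=["join_date"],    # convert this column to datetime
)

print(df.dtypes)
```

Reading identifiers as text preserves leading zeros, and parsing dates at load time makes date arithmetic available immediately.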
