# **Data Acquisition Notebook**

This notebook aims to centralize the code that captures the data necessary for the development of the trading advisor project. We will use data sources such as Nasdaq Data Link, Yahoo Finance, and Investpy.

## **Initial Setup**

This section installs and manages the packages needed to run the notebook that acquires the data used in the project. A requirements.txt file listing all the packages is also provided, so they can be installed in a single step with less output.

### Install Libs

In [3]:
%pip install --upgrade pip --q --no-cache
%pip install pandas --q --no-cache
%pip install python-dotenv --q --no-cache
%pip install yfinance --q --no-cache
%pip install Nasdaq-Data-Link --q --no-cache
%pip install investpy --q --no-cache

Note: you may need to restart the kernel to use updated packages.


### Import Libs

In [4]:
import os
from pathlib import Path
import pandas as pd
from dotenv import load_dotenv
import nasdaqdatalink as ndl
import yfinance as yf
import investpy as inv

### Load environment variables

In [5]:
load_dotenv()

True

### Set constants

In [6]:
NASDAQ_API_KEY = os.getenv("NASDAQ_API_KEY")
ndl.ApiConfig.api_key = NASDAQ_API_KEY
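
If the variable is missing from the .env file, os.getenv returns None and the problem only shows up later as a failed request. A minimal sketch of a fail-fast check, assuming a hypothetical helper named require_env (it is not part of python-dotenv or the Nasdaq Data Link client):

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty.

    `require_env` is a hypothetical helper written for illustration only.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; add it to the .env file."
        )
    return value
```

With this helper, the assignment above could be written as NASDAQ_API_KEY = require_env("NASDAQ_API_KEY").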

In [7]:
country_name = "brazil"
start_date = "2019-01-01"

### Create the default file path

In [8]:
file_path_raw = str(Path(os.getcwd()).parent/"data/raw")
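
The same location can be built with pathlib alone, which avoids mixing string concatenation with path objects when file names are appended later. A small sketch, assuming a hypothetical helper raw_data_dir and the same data/raw folder next to the notebook's parent directory:

```python
from pathlib import Path
from typing import Optional


def raw_data_dir(base: Optional[Path] = None) -> Path:
    """Return <parent of base>/data/raw; base defaults to the current working directory.

    `raw_data_dir` is a hypothetical helper, not part of the original notebook.
    """
    base = Path.cwd() if base is None else Path(base)
    return base.parent / "data" / "raw"
```

A file inside the folder is then addressed as raw_data_dir() / "stocks_raw.csv", and pandas accepts that Path object directly in to_csv.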

## **Collecting Data**

This data collection section is responsible for retrieving data from the sources for the respective processing. We will collect fundamental and balance sheet data for publicly listed companies, historical data for key macroeconomic indices, and, finally, historical stock price data for the companies' shares.

### Fundamentals

* Listing all publicly traded companies in Brazil.

In [9]:
df_compaines = inv.stocks.get_stocks(country = country_name)
df_compaines.head(2)

| | country | name | full_name | isin | currency | symbol |
|---|---|---|---|---|---|---|
| 0 | brazil | ABC BRASIL PN | Banco ABC Brasil SA | BRABCBACNPR4 | BRL | ABCB4 |
| 1 | brazil | BRASILAGRO ON | BrasilAgro - Co ON NM | BRAGROACNOR7 | BRL | AGRO3 |


* Creating a list with the company symbols and appending the (.SA) suffix. The suffix is required because Yahoo Finance uses it to identify securities listed on the São Paulo stock exchange (B3). Only symbols with at most five characters are kept.

In [10]:
tickers = list(df_compaines["symbol"])
tickers_filtered = [ticker + ".SA" for ticker in tickers if len(ticker) <= 5]
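
The filtering and suffixing step can also be wrapped in a small function so that the suffix and the length limit are explicit parameters. This is only a sketch; the name to_yahoo_symbols and its parameters are assumptions made for illustration:

```python
from typing import Iterable, List


def to_yahoo_symbols(
    symbols: Iterable[str], suffix: str = ".SA", max_len: int = 5
) -> List[str]:
    """Append the exchange suffix to every symbol of at most max_len characters."""
    return [symbol + suffix for symbol in symbols if len(symbol) <= max_len]


# The six-character symbol is dropped by the length filter.
print(to_yahoo_symbols(["ABCB4", "AGRO3", "TAEE11"]))
```

Note that a limit of five characters also excludes six-character symbols such as units ending in 11, so the limit is worth revisiting if those are needed.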

* Creating the get_fundamentals_data function to extract data from the Yahoo API. The function builds a dictionary containing only the main columns we will use, since the Yahoo API returns many columns that are not needed in the project.

In [12]:
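def get_fundamentals_data(tickers):
    """Fetch selected fundamentals for each ticker from Yahoo Finance; failed tickers are printed and skipped."""
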
    df_list = []

    for ticker in tickers:
        try:
            company = yf.Ticker(ticker)
            fundamental_data = company.info

            fundamental_data = {
                'ticker': fundamental_data.get('symbol'),
                'long_name': fundamental_data.get('longName'),
                'sector': fundamental_data.get('sector'),
                'industry': fundamental_data.get('industry'),
                'market_cap': fundamental_data.get('marketCap'),
                'enterprise_value': fundamental_data.get('enterpriseValue'),
                'total_revenue': fundamental_data.get('totalRevenue'),
                'profit_margins': fundamental_data.get('profitMargins'),
                'operating_margins': fundamental_data.get('operatingMargins'),
                'net_income': fundamental_data.get('netIncome'),
                'dividend_rate': fundamental_data.get('dividendRate'),
                'beta': fundamental_data.get('beta'),
                'ebitda': fundamental_data.get('ebitda'),
                'trailing_pe': fundamental_data.get('trailingPE'),
                'forward_pe': fundamental_data.get('forwardPE'),
                'volume': fundamental_data.get('volume'),
                'average_volume': fundamental_data.get('averageVolume'),
                'fifty_two_week_low': fundamental_data.get('fiftyTwoWeekLow'),
                'fifty_two_week_high': fundamental_data.get('fiftyTwoWeekHigh'),
                'price_to_sales_trailing_12_months': fundamental_data.get('priceToSalesTrailing12Months'),
                'fifty_day_average': fundamental_data.get('fiftyDayAverage'),
                'two_hundred_day_average': fundamental_data.get('twoHundredDayAverage'),
                'trailing_annual_dividend_rate': fundamental_data.get('trailingAnnualDividendRate'),
                'trailing_annual_dividend_yield': fundamental_data.get('trailingAnnualDividendYield'),
                'book_value': fundamental_data.get('bookValue'),
                'price_to_book': fundamental_data.get('priceToBook'),
                'total_cash': fundamental_data.get('totalCash'),
                'total_cash_per_share': fundamental_data.get('totalCashPerShare'),
                'total_debt': fundamental_data.get('totalDebt'),
                'earnings_quarterly_growth': fundamental_data.get('earningsQuarterlyGrowth'),
                'revenue_growth': fundamental_data.get('revenueGrowth'),
                'gross_margins': fundamental_data.get('grossMargins'),
                'ebitda_margins': fundamental_data.get('ebitdaMargins'),
                'return_on_assets': fundamental_data.get('returnOnAssets'),
                'return_on_equity': fundamental_data.get('returnOnEquity'),
                'gross_profits': fundamental_data.get('grossProfits')
            }

            df_ticker = pd.DataFrame([fundamental_data])
            df_list.append(df_ticker)

        except Exception as e:
            print(f"Error processing ticker {ticker}: {e}")

    if df_list:
        df_fundamentals = pd.concat(df_list, ignore_index=True)
    else:
        df_fundamentals = pd.DataFrame()

    return df_fundamentals
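
The column selection inside the function can also be expressed as one mapping from Yahoo's camelCase keys to the snake_case column names. The sketch below uses hypothetical names (FIELD_MAP, select_fields) and only a few of the fields; keys that are missing from the response become None, just as with .get above:

```python
from typing import Any, Dict, Mapping

# Subset of the fields used above: output column name -> key in the Yahoo info dict.
FIELD_MAP: Dict[str, str] = {
    "ticker": "symbol",
    "long_name": "longName",
    "market_cap": "marketCap",
    "trailing_pe": "trailingPE",
}


def select_fields(
    info: Mapping[str, Any], field_map: Mapping[str, str] = FIELD_MAP
) -> Dict[str, Any]:
    """Keep only the mapped keys; keys absent from info become None."""
    return {column: info.get(key) for column, key in field_map.items()}


# Made-up response with one unused key and two missing ones.
sample_info = {"symbol": "ABCB4.SA", "longName": "Banco ABC Brasil S.A.", "extra": 1}
print(select_fields(sample_info))
```

Keeping the mapping in one place makes it easier to add or remove columns without touching the loop.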

In [13]:
df_fundamentals = get_fundamentals_data(tickers_filtered)

Error processing ticker ENBR3.SA: 404 Client Error: Not Found for url: https://query2.finance.yahoo.com/v6/finance/quoteSummary/ENBR3.SA?modules=financialData&modules=quoteType&modules=defaultKeyStatistics&modules=assetProfile&modules=summaryDetail&ssl=true
Error processing ticker GFSA1.SA: 404 Client Error: Not Found for url: https://query2.finance.yahoo.com/v6/finance/quoteSummary/GFSA1.SA?modules=financialData&modules=quoteType&modules=defaultKeyStatistics&modules=assetProfile&modules=summaryDetail&ssl=true
Error processing ticker CELP6.SA: 404 Client Error: Not Found for url: https://query2.finance.yahoo.com/v6/finance/quoteSummary/CELP6.SA?modules=financialData&modules=quoteType&modules=defaultKeyStatistics&modules=assetProfile&modules=summaryDetail&ssl=true
Error processing ticker JBDU1.SA: 404 Client Error: Not Found for url: https://query2.finance.yahoo.com/v6/finance/quoteSummary/JBDU1.SA?modules=financialData&modules=quoteType&modules=defaultKeyStatistics&modules=assetProfile


* When executing the get_fundamentals_data function, data retrieval fails for some companies, for example (Error processing ticker JBDU1.SA: 404 Client Error: Not Found for url: ...). This error occurs because the Yahoo API could not find data for the ticker, either because the company is no longer listed on the stock exchange or because it has changed its symbol, as in the case of Via Varejo, which has changed its symbol several times (VVAR3, VIIA3, and BHIA3).
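
Instead of only printing the errors, the failed tickers can be collected so they can be reviewed or retried later. A minimal sketch, assuming a hypothetical fetch_all helper and a stand-in fake_fetch function in place of the real API call:

```python
from typing import Any, Callable, Dict, List, Tuple


def fetch_all(
    tickers: List[str],
    fetch: Callable[[str], Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], Dict[str, str]]:
    """Call fetch for each ticker; return (rows, failures).

    failures maps each failed ticker to the text of the exception it raised.
    """
    rows: List[Dict[str, Any]] = []
    failures: Dict[str, str] = {}
    for ticker in tickers:
        try:
            rows.append(fetch(ticker))
        except Exception as exc:  # broad on purpose, like the function above
            failures[ticker] = str(exc)
    return rows, failures


def fake_fetch(ticker: str) -> Dict[str, Any]:
    """Stand-in for a real API call: pretends JBDU1.SA cannot be found."""
    if ticker == "JBDU1.SA":
        raise LookupError("404 Client Error: Not Found")
    return {"ticker": ticker}


rows, failures = fetch_all(["ABCB4.SA", "JBDU1.SA"], fake_fetch)
print(rows)
print(failures)
```

The failures dictionary can then be saved alongside the raw data to document which symbols need a manual check.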

In [14]:
df_fundamentals.head(2)

| | ticker | long_name | sector | industry | market_cap | enterprise_value | total_revenue | profit_margins | operating_margins | net_income | ... | total_cash | total_cash_per_share | total_debt | earnings_quarterly_growth | revenue_growth | gross_margins | ebitda_margins | return_on_assets | return_on_equity | gross_profits |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ABCB4.SA | Banco ABC Brasil S.A. | Financial Services | Banks - Regional | 4265434112 | 14773393408 | 1941778944 | 0.41576 | 0.38826 | | ... | 7774305792 | 35.162 | 18298464256 | 0.001 | 0.003 | 0.0 | 0.0 | 0.0153 | 0.1568 | 1973086000 |
| 1 | AGRO3.SA | BrasilAgro - Companhia Brasileira de Proprieda... | Consumer Defensive | Farm Products | 2466479872 | 2912933120 | 1249437056 | 0.21493 | 0.25031 | | ... | 383836992 | 3.885 | 872075008 | 6.801 | 0.671 | 0.25252 | 0.21201 | 0.03839 | 0.1217 | 315504000 |


In [15]:
Path(file_path_raw).mkdir(parents=True, exist_ok=True)
df_fundamentals.to_csv(file_path_raw + "/fundamentals_raw.csv", index=False)

### Stocks

* The stocks section is responsible for collecting historical price data of stocks traded on the stock exchange.

In [16]:
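def get_stocks_data(tickers_filtered, start_date):
    """Download daily price history for the given tickers and return it in long format."""

    # With several tickers, yf.download returns columns indexed by (field, ticker).
    df = yf.download(tickers_filtered, start=start_date)

    # Move the ticker level into the rows, then turn the index into regular columns.
    _stacked = df.stack()
    _stacked.reset_index(inplace=True)
    _stacked.rename(columns={"level_1": "ticker"}, inplace=True)
    df_stocks = _stacked

    return df_stocks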
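
The reshaping done by stack and reset_index can be tried on a small synthetic frame without calling the API. The data below is made up and only mimics the two-level column layout of a multi-ticker download:

```python
import pandas as pd

# Synthetic frame shaped like a multi-ticker download: columns are (field, ticker) pairs.
columns = pd.MultiIndex.from_product([["Close", "Volume"], ["AAA.SA", "BBB.SA"]])
wide = pd.DataFrame(
    [[10.0, 20.0, 100, 200], [11.0, 21.0, 110, 210]],
    index=pd.DatetimeIndex(["2019-01-02", "2019-01-03"], name="Date"),
    columns=columns,
)

# stack() moves the innermost column level (the ticker) into the row index;
# reset_index() then turns both index levels into ordinary columns.
# The unnamed ticker level is exposed as "level_1", hence the rename.
long_df = wide.stack().reset_index().rename(columns={"level_1": "ticker"})
print(long_df)
```

The result has one row per date and ticker. If the column levels carry names, reset_index uses those names instead of level_1, so the rename would need to target that name.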

In [17]:
df_stocks = get_stocks_data(tickers_filtered, start_date)

[*********************100%%**********************]  389 of 389 completed


83 Failed downloads:
['BIDI3.SA', 'LINX3.SA', 'CEPE5.SA', 'MEND6.SA', 'TCNO3.SA', 'JPSA3.SA', 'NATU3.SA', 'CCXC3.SA', 'ITEC3.SA', 'TCNO4.SA', 'SMLS3.SA', 'BTOW3.SA', 'BKBR3.SA', 'BBRK3.SA', 'EEEL4.SA', 'LCAM3.SA', 'GPCP3.SA', 'LLIS3.SA', 'SEDU3.SA', 'JBDU4.SA', 'DMMO3.SA', 'MTIG4.SA', 'CELP6.SA', 'GFSA1.SA', 'PARD3.SA', 'CRDE3.SA', 'RLOG3.SA', 'TRPN3.SA', 'TESA3.SA', 'PNVL4.SA', 'JBDU1.SA', 'CELP5.SA', 'CESP3.SA', 'SULA4.SA', 'TIMP3.SA', 'DTEX3.SA', 'ELPL3.SA', 'TCR11.SA', 'CARD3.SA', 'CELP3.SA', 'JBDU3.SA', 'IGTA3.SA', 'CAMB4.SA', 'CESP6.SA', 'IDVL3.SA', 'BTTL3.SA', 'ELEK3.SA', 'IDNT3.SA', 'WIZS3.SA', 'BIDI4.SA', 'MEND5.SA', 'BSEV3.SA', 'CPRE3.SA', 'TIET3.SA', 'IDVL9.SA', 'PCAR4.SA', 'BRDT3.SA', 'LAME4.SA', 'CESP5.SA', 'RANI4.SA', 'LIQO3.SA', 'ELEK4.SA', 'TIET2.SA', 'SPRI3.SA', 'TOYB4.SA', 'LAME3.SA', 'IDVL4.SA', 'GNDI3.SA', 'MMXM3.SA', 'VVAR3.SA', 'OMGE3.SA', 'EEEL3.SA', 'BRML3.SA', 'CELP7.SA', 'TIET4.SA', 'SULA3.SA', 'TOYB3.SA', 'CCPR3.SA', 'CNTO3.SA', 'VIVT4.SA', 'HGTX3.SA', 'JBDU




* When executing the get_stocks_data function, the output shows a progress bar with the completion percentage, followed by the tickers that could not be downloaded. The reasons are the ones already described: the company may have changed its ticker symbol or may no longer be traded on the stock exchange. In the case of stocks, it may also be that the company is not traded on the spot market.
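
The tickers that returned no data can also be derived programmatically by comparing the requested list with the tickers present in the result. A small sketch, assuming a hypothetical helper named missing_tickers:

```python
from typing import Iterable, List


def missing_tickers(requested: Iterable[str], downloaded: Iterable[str]) -> List[str]:
    """Return the requested tickers that are absent from the downloaded data, sorted."""
    return sorted(set(requested) - set(downloaded))


# Made-up inputs: one requested ticker has no rows in the downloaded data.
print(missing_tickers(["ABCB4.SA", "VVAR3.SA", "AALR3.SA"], ["AALR3.SA", "ABCB4.SA"]))
```

In this notebook the call would look like missing_tickers(tickers_filtered, df_stocks["ticker"].unique()).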

In [18]:
df_stocks.head(2)

| | Date | ticker | Adj Close | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|---|---|---|
| 0 | 2019-01-02 | AALR3.SA | 13.116831 | 13.25 | 13.5 | 13.25 | 13.31 | 264200.0 |
| 1 | 2019-01-02 | ABCB4.SA | 13.077144 | 17.120001 | 17.200001 | 16.35 | 16.469999 | 571700.0 |


In [19]:
Path(file_path_raw).mkdir(parents=True, exist_ok=True)
df_stocks.to_csv(file_path_raw + "/stocks_raw.csv", index=False)

### Macroeconomic

* TODO - FIX 404 NasdaqDataLink

* The macroeconomics section is responsible for collecting the country's key macroeconomic data, such as IPCA, GDP, SELIC, and others.

In [20]:
def get_marcoeconomic_data(start_date):

    df_selic = pd.DataFrame()
    df_consumer_confidence = pd.DataFrame()
    df_pib = pd.DataFrame()
    df_incc = pd.DataFrame()
    df_ipca = pd.DataFrame()
    df_dolar = pd.DataFrame()

    df_selic['selic'] = ndl.get('BCB/432', start_date = start_date, collapse = "monthly")
    df_consumer_confidence['confidence'] = ndl.get('BCB/4393', start_date = start_date, collapse = "monthly")
    df_pib['pib'] = ndl.get("BCB/4380", start_date = start_date, collapse = "monthly")
    df_incc['incc'] = ndl.get('BCB/192', start_date = start_date, collapse = "monthly")
    df_ipca['ipca'] = ndl.get("BCB/13522", start_date = start_date, collapse = "monthly")
    df_dolar['dolar'] = ndl.get('BCB/10813', start_date = start_date, collapse = "monthly")

    df_macroeconomic = pd.concat([df_selic, df_consumer_confidence, df_pib, df_incc, df_ipca, df_dolar], axis=1)
    df_macroeconomic = df_macroeconomic.reset_index()
    
    return df_macroeconomic

In [None]:
df_macroeconomic = get_marcoeconomic_data(start_date)

In [None]:
df_macroeconomic.head(2)

In [73]:
Path(file_path_raw).mkdir(parents=True, exist_ok=True)
df_macroeconomic.to_csv(file_path_raw + "/macroeconomic_raw.csv", index=False)

* At the end of the generated dataset, the last two rows contain null values. This occurs because of the meeting frequency of the regulatory body responsible for these indices, which meets every 45 days.
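
How such null values arise when series are combined can be reproduced with synthetic data. In the sketch below the numbers are made up; it only shows that pd.concat with axis=1 aligns the series on their date index and fills dates missing from one series with NaN:

```python
import pandas as pd

# Two synthetic monthly series; the second has no value for the last two months.
selic = pd.Series(
    [6.5, 6.5, 6.5, 6.5],
    index=pd.to_datetime(["2019-01-31", "2019-02-28", "2019-03-31", "2019-04-30"]),
    name="selic",
)
ipca = pd.Series(
    [3.78, 3.89],
    index=pd.to_datetime(["2019-01-31", "2019-02-28"]),
    name="ipca",
)

# concat(axis=1) aligns on the date index; dates missing from a series become NaN.
combined = pd.concat([selic, ipca], axis=1)

# Count the nulls per column and list the dates where any column is null.
print(combined.isna().sum())
print(combined[combined.isna().any(axis=1)].index.tolist())
```

Running the same two checks on df_macroeconomic shows which columns and dates are affected before deciding how to treat them in later processing.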