# Importing Finance Data

In [29]:
import pandas_datareader.data as web
from pandas_datareader import wb

In [6]:
from datetime import datetime

In [12]:
import pandas as pd

### Importing CSV Data Manually

To import data from a CSV file, the pandas module provides a useful function: `read_csv`.

This function takes the location of the CSV file as an argument and returns a pandas DataFrame containing the data from the file.

In [27]:
apple_prices = pd.read_csv("aapl_prices.csv")

print(apple_prices)

          date   close      volume    open      high     low
0   2019/08/13  208.97  47539790.0  201.02  212.1400  200.83
1   2019/08/12  200.48  22481890.0  199.62  202.0516  199.15
2   2019/08/09  200.99  24619750.0  201.30  202.7600  199.29
3   2019/08/08  203.43  27009520.0  200.20  203.5300  199.39
4   2019/08/07  199.04  33364400.0  195.41  199.5600  193.82
..         ...     ...         ...     ...       ...     ...
60  2019/05/17  189.00  32879090.0  186.93  190.9000  186.76
61  2019/05/16  190.08  33031360.0  189.91  192.4689  188.84
62  2019/05/15  190.92  26544720.0  186.27  191.7500  186.02
63  2019/05/14  188.66  36529680.0  186.41  189.7000  185.41
64  2019/05/13  185.72  57430620.0  187.71  189.4800  182.85

[65 rows x 6 columns]
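The printed DataFrame lists the newest row first and keeps `date` as plain text. As a minimal sketch, `read_csv` can also parse the dates and use them as the index. The inline `csv_text` below is an assumption standing in for `aapl_prices.csv`; its three rows are copied from the output above.

```python
import io

import pandas as pd

# Stand-in for aapl_prices.csv: three rows copied from the printed output.
csv_text = """date,close,volume,open,high,low
2019/08/13,208.97,47539790.0,201.02,212.1400,200.83
2019/08/12,200.48,22481890.0,199.62,202.0516,199.15
2019/08/09,200.99,24619750.0,201.30,202.7600,199.29
"""

# parse_dates turns the date strings into timestamps;
# index_col makes that column the DataFrame index.
prices = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col="date")

# The rows arrive newest first, so sort them into chronological order.
prices = prices.sort_index()
print(prices)
```

With a real file, the same arguments can be passed alongside the file path instead of the `io.StringIO` object.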


### Importing Data Using Datareader

Many financial institutions, stock markets, and world banks provide large amounts of the data they store to the public.

Most of this data is well organized, updated live, and accessible through an application programming interface (API), which gives programming languages like Python a way to download and import it.

### Pandas-Datareader Module

The `pandas-datareader` module is designed specifically to interact with some of the world’s most popular finance data APIs, and import their data into an easily digestible pandas DataFrame.

Each finance API is accessed using a different function exposed by `pandas-datareader`. Each API generally requires its own set of arguments, which the programmer must supply.

In [31]:
start = datetime(2005, 1, 1)
end = datetime(2008, 1, 1)
indicator_id = 'NY.GDP.PCAP.KD'

gdp_per_capita = wb.download(indicator=indicator_id, start=start, end=end, country=['US', 'CA', 'MX'])

print(gdp_per_capita)


                    NY.GDP.PCAP.KD
country       year                
Canada        2008    42063.633052
Mexico        2008     9276.054837
United States 2008    53854.160612
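The result is indexed by two levels, `country` and `year`. The sketch below rebuilds that output by hand (so it runs without a network call) and shows two common ways to work with such an index. Storing `year` as a string is an assumption about how the values are returned.

```python
import pandas as pd

# Rebuild the printed result by hand; values are copied from the output above.
index = pd.MultiIndex.from_tuples(
    [("Canada", "2008"), ("Mexico", "2008"), ("United States", "2008")],
    names=["country", "year"],
)
gdp_per_capita = pd.DataFrame(
    {"NY.GDP.PCAP.KD": [42063.633052, 9276.054837, 53854.160612]},
    index=index,
)

# Select every row for one country using the outer index level.
mexico = gdp_per_capita.loc["Mexico"]
print(mexico)

# reset_index turns both index levels into ordinary columns.
flat = gdp_per_capita.reset_index()
print(flat)
```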


### Getting NASDAQ Symbols

The NASDAQ stock exchange identifies each of its stocks using a unique symbol. It also provides a useful API for accessing the symbols that are currently trading on it.

Pandas-datareader provides several functions for importing data from NASDAQ’s API through its `nasdaq_trader` sub-module.

In [32]:
from pandas_datareader.nasdaq_trader import get_nasdaq_symbols

symbols = get_nasdaq_symbols()
print(symbols)

        Nasdaq Traded                                      Security Name  \
Symbol                                                                     
A                True            Agilent Technologies, Inc. Common Stock   
AA               True                    Alcoa Corporation Common Stock    
AAA              True  Investment Managers Series Trust II AXS First ...   
AAAU             True             Goldman Sachs Physical Gold ETF Shares   
AAC              True  Ares Acquisition Corporation Class A Ordinary ...   
...               ...                                                ...   
ZXYZ.A           True                 Nasdaq Symbology Test Common Stock   
ZXZZT            True                                  NASDAQ TEST STOCK   
ZYME             True                      Zymeworks Inc. - Common Stock   
ZYNE             True       Zynerba Pharmaceuticals, Inc. - Common Stock   
ZYXI             True                         Zynex, Inc. - Common Stock   

       List
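Because the symbol is the index, individual securities can be looked up directly. The following sketch uses a hand-built three-row stand-in for `symbols`, with names copied from the output above, and also filters out rows whose name marks them as test securities. Matching on the word "test" is an illustrative heuristic, not something the API prescribes.

```python
import pandas as pd

# Hand-built stand-in for the symbols DataFrame (rows copied from the output).
symbols = pd.DataFrame(
    {
        "Nasdaq Traded": [True, True, True],
        "Security Name": [
            "Agilent Technologies, Inc. Common Stock",
            "NASDAQ TEST STOCK",
            "Zymeworks Inc. - Common Stock",
        ],
    },
    index=pd.Index(["A", "ZXZZT", "ZYME"], name="Symbol"),
)

# Look up one security by its ticker symbol.
print(symbols.loc["ZYME", "Security Name"])

# Keep only rows whose name does not contain the word "test".
is_test = symbols["Security Name"].str.contains("test", case=False)
real_symbols = symbols[~is_test]
print(real_symbols.index.tolist())
```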

### Filtering Data by Date

Many of the APIs pandas-datareader connects with allow us to filter the data we get back by time.

Financial institutions tend to keep track of data dating back several decades, and when we’re importing that data, it’s useful to be able to specify exactly when we want it to be from.

One API that does just that is FRED (Federal Reserve Economic Data), maintained by the Federal Reserve Bank of St. Louis. We can access it by first importing the `pandas_datareader.data` sub-module and then calling its `DataReader` function:

In [33]:
start = datetime(2019, 1, 1)
end = datetime(2019, 2, 1)

sap_data = web.DataReader('SP500', 'fred', start, end)
print(sap_data)
print(type(sap_data))

              SP500
DATE               
2019-01-01      NaN
2019-01-02  2510.03
2019-01-03  2447.89
2019-01-04  2531.94
2019-01-07  2549.69
2019-01-08  2574.41
2019-01-09  2584.96
2019-01-10  2596.64
2019-01-11  2596.26
2019-01-14  2582.61
2019-01-15  2610.30
2019-01-16  2616.10
2019-01-17  2635.96
2019-01-18  2670.71
2019-01-21      NaN
2019-01-22  2632.90
2019-01-23  2638.70
2019-01-24  2642.33
2019-01-25  2664.76
2019-01-28  2643.85
2019-01-29  2640.00
2019-01-30  2681.05
2019-01-31  2704.10
2019-02-01  2706.53
<class 'pandas.core.frame.DataFrame'>
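The `NaN` rows on 2019-01-01 and 2019-01-21 fall on market holidays (New Year's Day and Martin Luther King Jr. Day), when the index was not published. A minimal sketch of counting and dropping such rows, using a hand-built sample copied from the first four rows of the output:

```python
import numpy as np
import pandas as pd

# First four rows of the printed output, rebuilt by hand.
dates = pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04"])
sap_sample = pd.DataFrame(
    {"SP500": [np.nan, 2510.03, 2447.89, 2531.94]},
    index=pd.Index(dates, name="DATE"),
)

# Count the days with no published value.
missing_days = int(sap_sample["SP500"].isna().sum())
print(missing_days)

# Keep only the days that have a value.
trading_days = sap_sample.dropna()
print(trading_days)
```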


### API Keys

Many finance APIs require us to pass along extra information when requesting data. One common argument is an API key.

An API key is a unique string used to identify and authenticate entities requesting data.

In [None]:
import pandas_datareader as dr

print(dr.get_data_tiingo('AAPL', api_key='my-api-key'))     # won't run without a valid API key
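Hard-coding a key in a notebook makes it easy to leak. One common alternative, sketched here, is to read it from an environment variable. The helper `load_api_key` and the variable name `TIINGO_API_KEY` are assumptions chosen for illustration.

```python
import os


def load_api_key(env_var="TIINGO_API_KEY"):
    """Return the API key stored in an environment variable, or raise if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


# For demonstration only: set a placeholder so the example runs end to end.
os.environ["TIINGO_API_KEY"] = "my-api-key"

api_key = load_api_key()
print(len(api_key))
```

The returned string can then be passed as the `api_key` argument instead of a literal.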

### Using the Shift Operation

Once we’ve imported a DataFrame full of finance data, there are some pretty cool ways we can manipulate it.

In this exercise we’ll look at the shift operation, a DataFrame function which shifts all the rows in a column up or down.

<img src="https://content.codecademy.com/programs/python-for-finance/importing-finance-data/data-frame-shift.gif"  width="30%" height="30%">
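As a small sketch of the behaviour, here is `shift` applied to a made-up Series. A positive argument moves values down and a negative one moves them up; the vacated slots are filled with `NaN`.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])

# Shift down by one row: the first slot becomes NaN.
down = s.shift(1)
print(down.tolist())  # [nan, 10.0, 20.0, 30.0]

# Shift up by one row: the last slot becomes NaN.
up = s.shift(-1)
print(up.tolist())  # [20.0, 30.0, 40.0, nan]
```

Note that the integers become floats, because `NaN` is a floating-point value.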

To demonstrate shift’s power, let’s use it on some financial data. Using data from the FRED API, we’ll calculate the amount of GDP growth over a 10-year period.

Start by creating two datetime variables, `start` and `end`. Set `start` to January 1, 2008, and `end` to January 1, 2018.

In [7]:
start = datetime(2008, 1, 1)
end = datetime(2018, 1, 1)

Now call the `web.DataReader` function to get the GDP data from FRED, and store it in a variable called `gdp`.

In [8]:
gdp = web.DataReader('GDP', 'fred', start, end)
print(gdp)

                  GDP
DATE                 
2008-01-01  14706.538
2008-04-01  14865.701
2008-07-01  14898.999
2008-10-01  14608.208
2009-01-01  14430.901
2009-04-01  14381.236
2009-07-01  14448.882
2009-10-01  14651.248
2010-01-01  14764.611
2010-04-01  14980.193
2010-07-01  15141.605
2010-10-01  15309.471
2011-01-01  15351.444
2011-04-01  15557.535
2011-07-01  15647.681
2011-10-01  15842.267
2012-01-01  16068.824
2012-04-01  16207.130
2012-07-01  16319.540
2012-10-01  16420.386
2013-01-01  16629.050
2013-04-01  16699.551
2013-07-01  16911.068
2013-10-01  17133.114
2014-01-01  17144.281
2014-04-01  17462.703
2014-07-01  17743.227
2014-10-01  17852.540
2015-01-01  17991.348
2015-04-01  18193.707
2015-07-01  18306.960
2015-10-01  18332.079
2016-01-01  18425.306
2016-04-01  18611.617
2016-07-01  18775.459
2016-10-01  18968.041
2017-01-01  19148.194
2017-04-01  19304.506
2017-07-01  19561.896
2017-10-01  19894.750
2018-01-01  20155.486


To calculate the growth over each three-month period, we’ll want to subtract the previous quarter’s GDP from each quarter’s GDP.

To do this, subtract the result of shifting the GDP column by 1 from the unshifted GDP column, and store it in a new column on the DataFrame called `growth`.

In [10]:
gdp['growth'] = gdp['GDP'] - gdp['GDP'].shift(1)
print(gdp)

                  GDP   growth
DATE                          
2008-01-01  14706.538      NaN
2008-04-01  14865.701  159.163
2008-07-01  14898.999   33.298
2008-10-01  14608.208 -290.791
2009-01-01  14430.901 -177.307
2009-04-01  14381.236  -49.665
2009-07-01  14448.882   67.646
2009-10-01  14651.248  202.366
2010-01-01  14764.611  113.363
2010-04-01  14980.193  215.582
2010-07-01  15141.605  161.412
2010-10-01  15309.471  167.866
2011-01-01  15351.444   41.973
2011-04-01  15557.535  206.091
2011-07-01  15647.681   90.146
2011-10-01  15842.267  194.586
2012-01-01  16068.824  226.557
2012-04-01  16207.130  138.306
2012-07-01  16319.540  112.410
2012-10-01  16420.386  100.846
2013-01-01  16629.050  208.664
2013-04-01  16699.551   70.501
2013-07-01  16911.068  211.517
2013-10-01  17133.114  222.046
2014-01-01  17144.281   11.167
2014-04-01  17462.703  318.422
2014-07-01  17743.227  280.524
2014-10-01  17852.540  109.313
2015-01-01  17991.348  138.808
2015-04-01  18193.707  202.359
2015-07-
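The subtract-the-shifted-column pattern is common enough that pandas has a shortcut for it. The sketch below, built from the first four GDP rows shown above, checks that `diff()` gives the same result, and uses `pct_change()` to express growth as a percentage. The column name `growth_pct` is an assumption for illustration.

```python
import pandas as pd

# First four quarters from the GDP output above.
gdp_sample = pd.DataFrame(
    {"GDP": [14706.538, 14865.701, 14898.999, 14608.208]},
    index=pd.to_datetime(["2008-01-01", "2008-04-01", "2008-07-01", "2008-10-01"]),
)

# The manual approach used in this section.
by_shift = gdp_sample["GDP"] - gdp_sample["GDP"].shift(1)

# diff() computes the same row-to-row difference.
by_diff = gdp_sample["GDP"].diff()

# pct_change() divides each difference by the previous value.
gdp_sample["growth_pct"] = gdp_sample["GDP"].pct_change() * 100
print(gdp_sample)
```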

### Calculating Basic Financial Statistics
Two useful calculations that can be made on financial data are variance and covariance.

To illustrate these concepts, let’s use the example of a DataFrame which measures stock and bond prices over time.

##### Variance
Variance measures how far a set of numbers is spread out from its average. In finance, this is used to determine the volatility of investments.

```
        dataframe['stocks'].var()
        dataframe['bonds'].var()
```

In variance calculations, stocks tend to have a larger value than bonds.

That’s because the stock prices are more spread out than bonds, indicating that stocks are a more volatile investment.
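To make the definition concrete, this sketch computes the variance by hand and compares it with `var()`. The prices are hypothetical values invented for illustration. Note that pandas computes the sample variance, dividing by n - 1 rather than n.

```python
import pandas as pd

# Hypothetical prices, invented for illustration.
dataframe = pd.DataFrame({
    "stocks": [100.0, 104.0, 98.0, 107.0, 101.0],
    "bonds": [50.0, 50.5, 49.8, 50.2, 50.0],
})

stocks = dataframe["stocks"]

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
manual_var = ((stocks - stocks.mean()) ** 2).sum() / (len(stocks) - 1)

print(manual_var)                # 12.5
print(stocks.var())              # matches the manual calculation
print(dataframe["bonds"].var())  # much smaller: the bond prices barely move
```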

##### Covariance

Covariance, in a financial context, describes the relationship between the returns on two different investments over a period of time, and can be used to help balance a portfolio.

```
        dataframe.cov()
```        

Calling `cov()` on our stocks/bonds produces a matrix which defines the covariance values between each column pair in the DataFrame.

In our example data, when stock prices go up, bonds go down. We can use the covariance function to see this numerically.
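A minimal sketch of that inverse relationship, using hypothetical prices invented for illustration: the off-diagonal entry of the covariance matrix comes out negative when one column rises as the other falls.

```python
import pandas as pd

# Hypothetical prices: stocks rise steadily while bonds fall.
dataframe = pd.DataFrame({
    "stocks": [100.0, 102.0, 104.0, 106.0],
    "bonds": [53.0, 52.0, 51.0, 50.0],
})

cov_matrix = dataframe.cov()
print(cov_matrix)

# A negative value means the two columns tend to move in opposite directions.
stock_bond_cov = cov_matrix.loc["stocks", "bonds"]
print(stock_bond_cov)
```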

In the code editor, there is some data that was originally obtained from the Thrift Savings Plan (TSP) API...

In [15]:
tsp_data = pd.read_csv("tsp_data.csv", header = 0, index_col = 0)

print(tsp_data.head(10))

                    L Income   L 2020   L 2030   L 2040   L 2050   G Fund  \
Date                                                                        
2019-01-02 0:00:00   19.6889  26.7883  29.6180  31.8164  18.2119  15.9949   
2019-01-03 0:00:00   19.6271  26.6553  29.3183  31.4315  17.9603  15.9961   
2019-01-04 0:00:00   19.7511  26.9151  29.8906  32.1665  18.4410  15.9973   
2019-01-07 0:00:00   19.7808  26.9710  30.0121  32.3227  18.5436  16.0009   
2019-01-08 0:00:00   19.8170  27.0415  30.1738  32.5311  18.6806  16.0021   
2019-01-09 0:00:00   19.8508  27.1066  30.3210  32.7207  18.8050  16.0033   
2019-01-10 0:00:00   19.8655  27.1350  30.3851  32.8031  18.8590  16.0045   
2019-01-11 0:00:00   19.8683  27.1385  30.3880  32.8062  18.8607  16.0057   
2019-01-14 0:00:00   19.8503  27.1011  30.2959  32.6858  18.7810  16.0093   
2019-01-15 0:00:00   19.8835  27.1651  30.4403  32.8710  18.9019  16.0105   

                     F Fund   C Fund   S Fund   I Fund  
Date              

Print the result of calling the variance function on the entire `tsp_data` DataFrame. Notice how it outputs a Series with the variance for each column.

In [17]:
print(tsp_data.var())

L Income    0.062194
L 2020      0.179628
L 2030      0.670104
L 2040      1.053787
L 2050      0.434321
G Fund      0.005991
F Fund      0.237778
C Fund      3.319402
S Fund      3.631685
I Fund      0.793820
dtype: float64


Print out the result of calling the `cov()` function on the DataFrame and see if you can spot any trends among the columns.

In [18]:
print(tsp_data.cov())

          L Income    L 2020    L 2030    L 2040    L 2050    G Fund  \
L Income  0.062194  0.105525  0.200417  0.249990  0.159678  0.017801   
L 2020    0.105525  0.179628  0.343624  0.429226  0.274497  0.029519   
L 2030    0.200417  0.343624  0.670104  0.840046  0.538868  0.052727   
L 2040    0.249990  0.429226  0.840046  1.053787  0.676359  0.064969   
L 2050    0.159678  0.274497  0.538868  0.676359  0.434321  0.041061   
G Fund    0.017801  0.029519  0.052727  0.064969  0.041061  0.005991   
F Fund    0.101581  0.166371  0.286665  0.350549  0.220029  0.036658   
C Fund    0.447764  0.766153  1.487608  1.863328  1.194421  0.119597   
S Fund    0.406779  0.712232  1.450433  1.833292  1.184367  0.089845   
I Fund    0.193118  0.336594  0.685383  0.865807  0.558977  0.043314   

            F Fund    C Fund    S Fund    I Fund  
L Income  0.101581  0.447764  0.406779  0.193118  
L 2020    0.166371  0.766153  0.712232  0.336594  
L 2030    0.286665  1.487608  1.450433  0.685383  
L 2
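One pattern worth noticing is that the diagonal of the covariance matrix repeats the variances printed earlier (for example, 0.062194 for `L Income` in both outputs), because the covariance of a column with itself is its variance. A short sketch checking this on the first five rows of two funds from the table above:

```python
import numpy as np
import pandas as pd

# First five rows of two funds, copied from the tsp_data output.
funds = pd.DataFrame({
    "G Fund": [15.9949, 15.9961, 15.9973, 16.0009, 16.0021],
    "L Income": [19.6889, 19.6271, 19.7511, 19.7808, 19.8170],
})

cov_matrix = funds.cov()

# Pull out the diagonal and label it with the column names.
diagonal = pd.Series(np.diag(cov_matrix), index=cov_matrix.index)

# The diagonal entries equal the per-column variances.
diagonal_matches_var = bool(np.allclose(diagonal.values, funds.var().values))
print(diagonal_matches_var)
```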

## Review

Let’s review importing a CSV file. In the workspace we have an `aapl_prices.csv` file which has historical stock price data for Apple.

Use pandas to import this file into a variable called `apple_prices` and print out the resulting DataFrame.

In [22]:
apple_prices = pd.read_csv("aapl_prices.csv")
print(apple_prices.head(10))

         date   close      volume    open      high       low
0  2019/08/13  208.97  47539790.0  201.02  212.1400  200.8300
1  2019/08/12  200.48  22481890.0  199.62  202.0516  199.1500
2  2019/08/09  200.99  24619750.0  201.30  202.7600  199.2900
3  2019/08/08  203.43  27009520.0  200.20  203.5300  199.3900
4  2019/08/07  199.04  33364400.0  195.41  199.5600  193.8200
5  2019/08/06  197.00  35824790.0  196.31  198.0670  194.0400
6  2019/08/05  193.34  52392970.0  197.99  198.6490  192.5800
7  2019/08/02  204.02  40862120.0  205.53  206.4300  201.6300
8  2019/08/01  208.43  54017920.0  213.90  218.0300  206.7435
9  2019/07/31  213.04  69281360.0  216.42  221.3700  211.3000


Now, let’s calculate the variance on our Apple stock data to see how volatile the stock is...

In [23]:
print("Variance: ", apple_prices['open'].var())

Variance:  103.82291877403847
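Variance is expressed in squared units (here, dollars squared), which makes it hard to compare with the prices themselves. As a small sketch, taking the square root of the printed value gives the standard deviation in dollars:

```python
import math

# Variance of the 'open' column, copied from the output above.
variance = 103.82291877403847

# The square root of the variance is the standard deviation.
std_dev = math.sqrt(variance)
print(round(std_dev, 2))  # 10.19
```

On the DataFrame itself, `apple_prices['open'].std()` would return this value directly.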


Finally, let’s grab some finance data from the FRED API.

FRED stores historical prices on gasoline for New York State identified by the code `APUS12A74714`. Use the `web.DataReader` function to grab historical data between January 1, 2008 and January 1, 2018 and store it in a variable called `gas_prices`.



In [26]:
start = datetime(2008, 1, 1)
end = datetime(2018, 1, 1)

gas_prices = web.DataReader('APUS12A74714', 'fred', start, end)
gas_prices.rename(columns={'APUS12A74714': 'gas_price'}, inplace=True)
print(gas_prices.head(10))
print(type(gas_prices))

            gas_price
DATE                 
2008-01-01      3.125
2008-02-01      3.081
2008-03-01      3.221
2008-04-01      3.411
2008-05-01      3.856
2008-06-01      4.131
2008-07-01      4.175
2008-08-01      3.855
2008-09-01      3.634
2008-10-01      3.148
<class 'pandas.core.frame.DataFrame'>
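Because the index holds dates, the data can be sliced by date strings and searched for its peak. The sketch below rebuilds the ten printed rows by hand so it runs without a network call. The names `gas_sample`, `summer`, and `peak_date` are assumptions for illustration.

```python
import pandas as pd

# The ten monthly rows printed above, rebuilt by hand.
gas_sample = pd.DataFrame(
    {"gas_price": [3.125, 3.081, 3.221, 3.411, 3.856,
                   4.131, 4.175, 3.855, 3.634, 3.148]},
    index=pd.date_range("2008-01-01", periods=10, freq="MS", name="DATE"),
)

# Slice a date range using partial date strings (both ends inclusive).
summer = gas_sample.loc["2008-06":"2008-08"]
print(summer)

# idxmax returns the index label (here, the date) of the highest price.
peak_date = gas_sample["gas_price"].idxmax()
print(peak_date.date(), gas_sample.loc[peak_date, "gas_price"])
```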
