# Data Analysis - Introduction to Pandas

**Author**: [Gabriele Pompa](https://www.linkedin.com/in/gabrielepompa/): gabriele.pompa@unisi.com

# Table of contents

[Executive Summary](#summary)

**TODO**

### **Resources**: 

**TODO**

# Executive Summary <a name="summary"></a>

**TODO**

These are the basic imports that we need to work with NumPy and Pandas, and to plot data using Matplotlib.

In [3]:
# for NumPy arrays
import numpy as np

# for Pandas Series and DataFrame
import pandas as pd

# for Matplotlib plotting
import matplotlib.pyplot as plt

# to do inline plots in the Notebook
%matplotlib inline

# Introduction

Before talking about specific Input/Output (IO) protocols, it is important to mention that typical operating system functionalities (like creating and deleting files, folders, etc.) are accessible from Python code using the [`os` module](https://docs.python.org/3/library/os.html). This is a module we will include in our basic imports section hereafter.

In [4]:
import os

We use the `os.makedirs()` function to create the `Data` folder, under our `IT_For_Business_And_Finance_2019_20` class folder, where we will put all our data files. The `os.path.exists()` function returns `True` if the folder (or file) path it receives in input already exists, and `False` otherwise.

In [5]:
dataFolderPath = "../Data"

if not os.path.exists(dataFolderPath):
    os.makedirs(dataFolderPath)

Notice the use of the `..` syntax. The double dots `..` in a file path string refer to _one directory above_ in the directory tree. Therefore, since the notebook you are reading is located in the `IT_For_Business_And_Finance_2019_20/Notebooks` folder, `../Data` points (and is thus equivalent) to `IT_For_Business_And_Finance_2019_20/Data`.
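As a small check of how `..` is resolved, the sketch below builds the path described above and collapses it with `os.path.normpath()`. The variable names `notebooksFolder`, `rawPath` and `normalizedPath` are introduced here only for illustration, and no folder needs to exist for the example to run.

```python
import os

# folder that contains the notebook (names taken from the text above)
notebooksFolder = os.path.join("IT_For_Business_And_Finance_2019_20", "Notebooks")

# ".." steps one directory up from Notebooks, then "Data" steps down again
rawPath = os.path.join(notebooksFolder, "..", "Data")

# normpath collapses the "Notebooks/.." pair without touching the file system
normalizedPath = os.path.normpath(rawPath)

print(rawPath)
print(normalizedPath)
```

The second printed path no longer contains `Notebooks` nor `..`: it points directly to the `Data` folder under the class folder.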

# IO without Pandas

When it comes to IO operations, Python is very flexible and offers several options. We'll review here two typical ways to transfer Python objects across machines:
- [JSON](https://docs.python.org/3/library/json.html#module-json) module, which implements human-readable encoding and decoding of basic Python object hierarchies. It is mostly suitable for Python Lists and Dicts.
- [Pickle](https://docs.python.org/3/library/pickle.html) module, which implements binary protocols for serializing and de-serializing a Python object structure. It covers a broad spectrum of Python data-structures.


## JSON format: `json` module

[JSON](https://docs.python.org/3/tutorial/inputoutput.html#saving-structured-data-with-json) is the acronym for JavaScript Object Notation. It is a popular data interchange format. 

The `json` Python module can take Python hierarchies (like nested Lists with Dicts inside etc.), _serialize_ them as `.json` files (that is, convert to String representations) and then _deserialize_ them (that is, reconstruct back the original Python object).

Pros:
- JSON is a widely used standard format to send data over a network connection.
- `.json` files are, in general, human-readable.

Cons: 
- not all Python objects are serializable using `json` (e.g. NumPy arrays cannot be serialized in this way).

Let's look at an example. We want to save the `refData` Dict of Python Lists

In [6]:
refData = {
    'S&P Rating': ['A', 'BB', 'AA', 'CCC'],
    'Spread': [100, 300, 70, 700],
    'Country': ['USA', 'ITA', 'UK', 'ITA']
}

refData

{'S&P Rating': ['A', 'BB', 'AA', 'CCC'],
 'Spread': [100, 300, 70, 700],
 'Country': ['USA', 'ITA', 'UK', 'ITA']}

First of all, we import the `json` module

In [7]:
import json

We define the complete file path using the `os.path.join()` function, which joins `dataFolderPath` (the path to the `Data` folder) with `"refData.json"`, the name of the `.json` file that will contain the serialized `refData` object.

In [8]:
filePath = os.path.join(dataFolderPath, "refData.json")

To create and open a new file `filePath`, we use the [`open(filename, mode)` function](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files), giving it the complete path `filePath` to the file to open and mode `'w'` to open it in write-mode. Function `open()` returns a [file-object](https://docs.python.org/3/glossary.html#term-file-object) (which mediates between IO operations and the underlying resource). We capture it in the `file` variable.

In [9]:
with open(filePath, 'w') as file:
    %time json.dump(refData, file, indent="\t")

Wall time: 0 ns


Here and elsewhere we use the syntax

```python
%time statement
```
to execute a statement and measure its execution time (Wall time) with the [`%time` magic function](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-time).
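The `%time` magic is only available inside IPython and Jupyter. As a rough sketch of the same idea in plain Python, one can read a clock before and after the statement with `time.perf_counter()` from the standard library. The timed statement `sum(range(1_000_000))` is a hypothetical example chosen for illustration.

```python
import time

# read the clock, run the statement, read the clock again
start = time.perf_counter()
total = sum(range(1_000_000))   # the statement being timed
elapsed = time.perf_counter() - start

print("Wall time: {:.3f} ms".format(elapsed * 1e3))
```

Very fast statements can be reported as taking `0 ns` by `%time` when the system clock is too coarse to resolve them, which is what happens in some of the outputs of this notebook.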

Function 

```python
json.dump(obj, file_object[, indent])
```

takes the object `obj` (here `refData`) and serializes it as a text file, using the `file_object` file object. The optional argument `indent` is used to pretty-print nested levels of the object. Here we have used the `"\t"` character so that each nested level is indented by one tab. Take a look at the `refData.json` file in the `Data` folder... you can actually read it!

Notice the use of the [`with` statement](https://www.geeksforgeeks.org/with-statement-in-python/) which:
- manages the opening of the file `filePath`, calling `open()` function, 
- assigns the file-object to the `file` variable, through the `as` keyword,
- manages the closing of the file after the end of the indented block
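As a rough illustration of what the `with` statement takes care of, the sketch below writes a file using an explicit `try`/`finally` block instead. The file name `example.txt` and the temporary folder are hypothetical and are used only for this example.

```python
import os
import tempfile

# hypothetical scratch file, created in a temporary folder
tmpFolder = tempfile.mkdtemp()
tmpPath = os.path.join(tmpFolder, "example.txt")

# explicit version of: with open(tmpPath, 'w') as file: file.write("hello")
file = open(tmpPath, 'w')
try:
    file.write("hello")
finally:
    # executed even if the write raises an exception
    file.close()

closedAfterBlock = file.closed
print(closedAfterBlock)

# clean up the scratch file and folder
os.remove(tmpPath)
os.rmdir(tmpFolder)
```

The `with` form is shorter and harder to get wrong, which is why it is the one used throughout this notebook.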

Now that we have serialized the `refData` object as the `refData.json` file, we can check that the file-object is indeed closed, using the `.closed` attribute of the `file` file-object

In [10]:
file.closed

True

Let's now reload the serialized object and retrieve the original `refData` object. Same opening through `open()`, but now in reading-mode, using mode `'r'`

In [11]:
with open(filePath, 'r') as file:
    %time refData_reloaded = json.load(file)

Wall time: 0 ns


The deserialization (from text file to Python object) is managed by function

```python
json.load(file_object)
```

which loads the contents of the file referred to by `file_object` and converts them into a Python object

In [12]:
refData_reloaded

{'S&P Rating': ['A', 'BB', 'AA', 'CCC'],
 'Spread': [100, 300, 70, 700],
 'Country': ['USA', 'ITA', 'UK', 'ITA']}

Now that we have finished our IO operation, we can delete our `refData.json` file. We define a utility function to do this.

In [22]:
def removeFile(fileName):
    """
    removeFile(fileName) function removes file 'fileName', if it exists. It also prints a success/failure message on screen.
    
    Parameters:
        fileName (str): name of the file ('Data' folder is assumed)
        
    Returns:
        None
    """

    if os.path.isfile(os.path.join(dataFolderPath, fileName)):
        os.remove(os.path.join(dataFolderPath, fileName))

        # double-check if file still exists
        fileStillExists = os.path.isfile(os.path.join(dataFolderPath, fileName))

        if fileStillExists:
            print("Failure: file {} still exists...".format(fileName))
        else:
            print("Success: file {} successfully removed!".format(fileName))
            
    else:
        print("File {} already removed.".format(fileName))

Notice the use of `os`'s functions:
- `os.path.isfile()` which returns `True` if the file in input exists and `False`, otherwise;
- `os.remove()` which removes the file in input.

In [14]:
removeFile(filePath)

Success: file ../Data\refData.json successfully removed!


Take a look in the `Data` folder to see that the `refData.json` file is indeed not there anymore...

Unfortunately, not all objects that you work with in Python are serializable (and thus, transferable) using the JSON format. A counter-example? NumPy arrays...

In [15]:
# unserializableFilePath = os.path.join(dataFolderPath, "dummyArray.json")
#
# with open(unserializableFilePath, 'w') as file:
#    
#    # raises a TypeError: Object of type ndarray is not JSON serializable
#    %time json.dump(np.array([1,2,3]), file)
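One possible workaround, sketched here as an assumption and not as part of the original cells, is to convert the array into nested Python Lists with the `.tolist()` method before serializing it, and to rebuild the array after loading. The names `arr`, `encoded` and `decoded` are hypothetical. Notice that the original `dtype` of the array is not stored in the JSON text.

```python
import json
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

# nested Python Lists of ints are JSON serializable
encoded = json.dumps(arr.tolist())
print(encoded)

# rebuild a NumPy array from the decoded nested Lists
decoded = np.array(json.loads(encoded))
print(decoded)
```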

## `pickle` module

Contrary to JSON, pickle is a protocol which allows the serialization of arbitrarily complex Python objects. In Python, it is implemented in the `pickle` module.

Pros:
- Pickle works with almost any Python object (NumPy arrays and Pandas Series/DataFrames too).

Cons: 
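- Pickle format is Python-specific and its portability is fragile. That is, a file serialized on one machine might be impossible to de-serialize on another one (say, from a Mac OS to a Windows machine, and vice versa) if the Python version or the installed libraries (like NumPy or Pandas) differ. Moreover, de-serializing a `.pkl` file coming from an untrusted source is unsafe, because loading it can execute arbitrary code.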
- `.pkl` files are not human-readable.

In real life, especially if you have to pass data across different machines, don't use Pickle.

Let's look at an example. We want to save the `mat` NumPy array

In [16]:
rows = int(1e6)

In [17]:
mat = np.array([[i*k for i in range(1,rows+1)] for k in range(1,6)]).T

In [18]:
mat

array([[      1,       2,       3,       4,       5],
       [      2,       4,       6,       8,      10],
       [      3,       6,       9,      12,      15],
       ...,
       [ 999998, 1999996, 2999994, 3999992, 4999990],
       [ 999999, 1999998, 2999997, 3999996, 4999995],
       [1000000, 2000000, 3000000, 4000000, 5000000]])

In [19]:
mat.shape

(1000000, 5)

In [20]:
mat.dtype

dtype('int32')

First of all, we import the `pickle` module

In [21]:
import pickle

In [22]:
filePath = os.path.join(dataFolderPath, "mat.pkl")

with open(filePath, 'wb') as file:
    %time pickle.dump(mat, file)

Wall time: 27.9 ms


Notice the use of `'wb'` mode when opening the file to store `mat` array. It's going to be a binary file.

Function 

```python
pickle.dump(obj, file_object)
```

takes the `mat` object and serializes it as a binary file, using the `file_object` file object.

In [23]:
file.closed

True

Let's now reload it, using the `'rb'` mode to read the binary file `"mat.pkl"`

In [24]:
with open(filePath, 'rb') as file:
    %time mat_reloaded = pickle.load(file)

Wall time: 19.9 ms


In [25]:
file.closed

True

In [26]:
mat_reloaded

array([[      1,       2,       3,       4,       5],
       [      2,       4,       6,       8,      10],
       [      3,       6,       9,      12,      15],
       ...,
       [ 999998, 1999996, 2999994, 3999992, 4999990],
       [ 999999, 1999998, 2999997, 3999996, 4999995],
       [1000000, 2000000, 3000000, 4000000, 5000000]])

Let's clean up the `Data` folder...

In [31]:
removeFile(filePath)

File ../Data\mat.pkl already removed.


In case you have several objects that you want to keep together in a single file, wrap them in a Python Dict

In [32]:
mat_dict = {'mat': mat,
            'mat_squared': mat**2}

In [33]:
mat_dict['mat']

array([[      1,       2,       3,       4,       5],
       [      2,       4,       6,       8,      10],
       [      3,       6,       9,      12,      15],
       ...,
       [ 999998, 1999996, 2999994, 3999992, 4999990],
       [ 999999, 1999998, 2999997, 3999996, 4999995],
       [1000000, 2000000, 3000000, 4000000, 5000000]])

In [34]:
mat_dict['mat_squared']

array([[          1,           4,           9,          16,          25],
       [          4,          16,          36,          64,         100],
       [          9,          36,          81,         144,         225],
       ...,
       [ -731379964,  1369447440,  2007514916,  1182822464, -1104629916],
       [ -729379967,  1377447428,  2025514889,  1214822416, -1054629991],
       [ -727379968,  1385447424,  2043514880,  1246822400, -1004630016]],
      dtype=int32)
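The negative values in the last rows of `mat_squared` are not a Pickle issue. Here `mat.dtype` is `int32`, whose largest representable value is 2147483647, so squares like 1000000**2 do not fit and silently wrap around. The sketch below reproduces the effect on a single value and shows one possible fix, converting to `int64` before squaring. The fix is an assumption about what one would want here, not something the original cells do, and the names `x32`, `squared32` and `squared64` are hypothetical.

```python
import numpy as np

x32 = np.array([1000000], dtype=np.int32)

# 10**12 does not fit in 32 bits: the result wraps around without any error
squared32 = x32**2
print(squared32)

# converting to 64-bit integers first gives the expected result
squared64 = x32.astype(np.int64)**2
print(squared64)
```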

In [41]:
filePath = os.path.join(dataFolderPath, "mat_dict.pkl")

In [42]:
with open(filePath, 'wb') as file:
    %time pickle.dump(mat_dict, file)

Wall time: 51.9 ms


In [43]:
with open(filePath, 'rb') as file:
    %time mat_dict_reloaded = pickle.load(file)

Wall time: 42.9 ms


In [44]:
mat_dict_reloaded['mat']

array([[      1,       2,       3,       4,       5],
       [      2,       4,       6,       8,      10],
       [      3,       6,       9,      12,      15],
       ...,
       [ 999998, 1999996, 2999994, 3999992, 4999990],
       [ 999999, 1999998, 2999997, 3999996, 4999995],
       [1000000, 2000000, 3000000, 4000000, 5000000]])

In [45]:
mat_dict_reloaded['mat_squared']

array([[          1,           4,           9,          16,          25],
       [          4,          16,          36,          64,         100],
       [          9,          36,          81,         144,         225],
       ...,
       [ -731379964,  1369447440,  2007514916,  1182822464, -1104629916],
       [ -729379967,  1377447428,  2025514889,  1214822416, -1054629991],
       [ -727379968,  1385447424,  2043514880,  1246822400, -1004630016]])

In [46]:
removeFile(filePath)

Success: file ../Data\mat_dict.pkl successfully removed!


# IO with Pandas

Pandas supports IO operations from/to many file formats. As a rule of thumb:

- to import data into Pandas, you can use `read_*` functions (like `pd.read_sql()`, `pd.read_csv()`, `pd.read_excel()`, etc.);
- to export data from Pandas, you can use `to_*` methods of Pandas DataFrames (like for a `df` DataFrame, `df.to_sql()`, `df.to_csv()`, `df.to_excel()`, etc.);
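A minimal sketch of this `to_*` / `read_*` symmetry, using the CSV format and an in-memory text buffer (`io.StringIO`) instead of a real file. The DataFrame `df_example` and the names `buffer` and `df_back` are hypothetical and used only for illustration.

```python
import io
import pandas as pd

df_example = pd.DataFrame({'Spread': [100, 300]}, index=['Firm_1', 'Firm_2'])

# export: a to_* method of the DataFrame writes into the buffer
buffer = io.StringIO()
df_example.to_csv(buffer)

# import: the matching read_* function of Pandas reads it back
buffer.seek(0)
df_back = pd.read_csv(buffer, index_col=0)
print(df_back)
```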

Here we'll review some of the most common file formats:

- [SQL](https://it.wikipedia.org/wiki/Structured_Query_Language), using the `sqlite3` module;
- [CSV](https://it.wikipedia.org/wiki/Comma-separated_values);
- [Excel](https://it.wikipedia.org/wiki/Microsoft_Excel).

## SQL

In a nutshell, SQL stands for Structured Query Language and is a domain-specific language to manage data held in Relational Databases. 

Let's suppose we want to save in a SQL table our reference data DataFrame `df_refData`

In [6]:
df_refData = pd.DataFrame(data={
                             'S&P Rating': ['A', 'BB', 'AA', 'CCC'],
                             'Spread': [100, 300, 70, 700],
                             'Country': ['USA', 'ITA', 'UK', 'ITA'],
                             'Market Cap': [430.0, 45.0, 161.25, 5.00]
                            },
                       index=['Firm_1', 'Firm_2', 'Firm_3', 'Firm_4'])

df_refData

| | S&P Rating | Spread | Country | Market Cap |
|---|---|---|---|---|
| Firm_1 | A | 100 | USA | 430.0 |
| Firm_2 | BB | 300 | ITA | 45.0 |
| Firm_3 | AA | 70 | UK | 161.25 |
| Firm_4 | CCC | 700 | ITA | 5.0 |


We'll first follow a step-by-step approach, typing real SQL queries and executing them using the `sqlite3` module. Then we'll explain the real-life approach using the `df.to_sql()` method and the `pd.read_sql()` function.

### SQL queries from Python: `sqlite3` module

SQLite is a library that implements a SQL database engine. The built-in [module `sqlite3`](https://wiki.python.org/moin/SQLite) implements the interface between Python and SQLite, such that you can define SQL tables using Python code and store Pandas DataFrames into them.

First of all we import `sqlite3`, giving it the alias `sq3` 

In [7]:
import sqlite3 as sq3

Next we have to open a _connection_ to the SQL engine, which creates an empty `refData.db` file in our `Data` folder

In [8]:
filePath = os.path.join(dataFolderPath, "refData.db")

In [9]:
con = sq3.connect(filePath)

The `sq3.connect()` function returns a connector `con` that manages the interaction between Python code and the SQLite engine

In [67]:
type(con)

sqlite3.Connection

Let's now write the SQL query to create the table `refData`, as a `query` Python string. For details on SQL syntax, there is a good [SQLite tutorial](https://www.sqlitetutorial.net/). In short, `query` creates an empty table `refData` of five columns:
- `Firms`: a column of text strings constrained to store non-missing values. We will store here the index of `df_refData`;
- `SnP_Rating`: column of text strings that will store the `'S&P Rating'` column;
- `Spread`: column of integer numbers that will store the `'Spread'` column;
- `Country`: column of text strings that will store the `'Country'` column;
- `Market_Cap`: column of float numbers that will store the `'Market Cap'` column;

For those who are not familiar with SQL, notice that `TEXT`, `INT` and `REAL` are the SQLite names for the `str`, `int` and `float` Python data-types, respectively. For details, see section [SQLite Data Types](https://www.sqlitetutorial.net/sqlite-data-types/)

In [10]:
query = """CREATE TABLE refData (
                Firms TEXT NOT NULL,
                SnP_Rating TEXT,
                Spread INT,
                Country TEXT,
                Market_Cap REAL
)"""

print(query)

CREATE TABLE refData (
                Firms TEXT NOT NULL,
                SnP_Rating TEXT,
                Spread INT,
                Country TEXT,
                Market_Cap REAL
)
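The correspondence between SQLite column types and Python data-types can be checked with a small experiment. The sketch below uses a throw-away in-memory database (the special path `":memory:"`) and a hypothetical table `demo` with only three of the five columns. The names `con_demo`, `storedTypes` and `pythonTypes` are introduced here for illustration.

```python
import sqlite3

# in-memory database: nothing is written to disk
con_demo = sqlite3.connect(":memory:")

con_demo.execute("CREATE TABLE demo (Firms TEXT NOT NULL, Spread INT, Market_Cap REAL)")
con_demo.execute("INSERT INTO demo VALUES ('Firm_1', 100, 430.0)")

# typeof() is the SQLite function returning the storage type of a value
storedTypes = con_demo.execute(
    "SELECT typeof(Firms), typeof(Spread), typeof(Market_Cap) FROM demo").fetchone()

# types of the same values once fetched back into Python
row = con_demo.execute("SELECT * FROM demo").fetchone()
pythonTypes = tuple(type(value).__name__ for value in row)

con_demo.close()

print(storedTypes)
print(pythonTypes)
```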


We now execute the query using the `.execute()` method of the connector

In [11]:
con.execute(query)

<sqlite3.Cursor at 0x1ef43b97110>

Using the `.commit()` method, we save to the `"refData.db"` file the changes made by running the `query`

In [12]:
con.commit()

You can actually open the `"refData.db"` using [DB Browser for SQLite](https://sqlitebrowser.org/) and there you'll see an empty table `refData` of five columns and column types as requested (under 'Browse Data' tab).

We can store `df_refData` into the `refData` table row-by-row, using the `.iterrows()` method, which at each iteration of a for loop returns the index and a Pandas Series of the given row. Check it out

In [16]:
for index, row in df_refData.iterrows():
    print("Index: {} \nRow: \n{} \n\n".format(index, row))

Index: Firm_1 
Row: 
S&P Rating      A
Spread        100
Country       USA
Market Cap    430
Name: Firm_1, dtype: object 


Index: Firm_2 
Row: 
S&P Rating     BB
Spread        300
Country       ITA
Market Cap     45
Name: Firm_2, dtype: object 


Index: Firm_3 
Row: 
S&P Rating        AA
Spread            70
Country           UK
Market Cap    161.25
Name: Firm_3, dtype: object 


Index: Firm_4 
Row: 
S&P Rating    CCC
Spread        700
Country       ITA
Market Cap      5
Name: Firm_4, dtype: object 




The `INSERT INTO refData VALUES...` query stores the values of each row. We write and `.execute()` one query per row, replacing the `{}` placeholders in the string declaration with the appropriate values of each column of the given `row`

In [18]:
for index, row in df_refData.iterrows():
    query = "INSERT INTO refData VALUES ('{}', '{}', {}, '{}', {})".\
    format(index, row['S&P Rating'], row['Spread'], row["Country"], row["Market Cap"])
    
    print(query)
    con.execute(query)

con.commit()

INSERT INTO refData VALUES ('Firm_1', 'A', 100, 'USA', 430.0)
INSERT INTO refData VALUES ('Firm_2', 'BB', 300, 'ITA', 45.0)
INSERT INTO refData VALUES ('Firm_3', 'AA', 70, 'UK', 161.25)
INSERT INTO refData VALUES ('Firm_4', 'CCC', 700, 'ITA', 5.0)


When the loop over `df_refData.iterrows()` is over, we `.commit()` the changes. Check it out with DB Browser.
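Building queries with `.format()` works for this small example, but it breaks if a value contains a quote character and it is unsafe with data coming from outside. A common alternative, sketched here under the assumption of a throw-away in-memory database, is to let `sqlite3` fill `?` placeholders with the values, either one row at a time with `.execute()` or for many rows with `.executemany()`. The names `con_demo`, `rows` and `stored` are hypothetical.

```python
import sqlite3

con_demo = sqlite3.connect(":memory:")
con_demo.execute("""CREATE TABLE refData (
                        Firms TEXT NOT NULL,
                        SnP_Rating TEXT,
                        Spread INT,
                        Country TEXT,
                        Market_Cap REAL)""")

rows = [('Firm_1', 'A', 100, 'USA', 430.0),
        ('Firm_2', 'BB', 300, 'ITA', 45.0)]

# one "?" per column: sqlite3 substitutes the values of each tuple safely
con_demo.executemany("INSERT INTO refData VALUES (?, ?, ?, ?, ?)", rows)
con_demo.commit()

stored = con_demo.execute("SELECT * FROM refData").fetchall()
con_demo.close()

print(stored)
```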

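Now that we have stored values into the `refData` table, we can retrieve them using the standard `SELECT *` query. Remember to always commit after executing queries that modify the data (like the `INSERT INTO` queries above), otherwise the changes are not saved to the file. A `SELECT` query only reads data, so the `.commit()` below is harmless but not strictly needed.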

In [19]:
query = "SELECT * FROM refData"
cursor = con.execute(query)
con.commit()
cursor

<sqlite3.Cursor at 0x1ef43b97b20>

In this case, we capture the output of `.execute()` in a `cursor`, which we can use to fetch the data as per the `SELECT *` query

In [20]:
data=cursor.fetchall()
data

[('Firm_1', 'A', 100, 'USA', 430.0),
 ('Firm_2', 'BB', 300, 'ITA', 45.0),
 ('Firm_3', 'AA', 70, 'UK', 161.25),
 ('Firm_4', 'CCC', 700, 'ITA', 5.0)]

As expected, the values are there, but the `data` returned is not the original `df_refData` Pandas DataFrame: it is a Python List of Tuples. With some List comprehension effort, we can reconstruct our original DataFrame...

In [21]:
df_refData_reloaded = pd.DataFrame(data=[t[1:] for t in data],
                                   index=[t[0] for t in data],
                                   columns=['S&P Rating', 'Spread', 'Country', 'Market Cap'])

df_refData_reloaded

| | S&P Rating | Spread | Country | Market Cap |
|---|---|---|---|---|
| Firm_1 | A | 100 | USA | 430.0 |
| Firm_2 | BB | 300 | ITA | 45.0 |
| Firm_3 | AA | 70 | UK | 161.25 |
| Firm_4 | CCC | 700 | ITA | 5.0 |
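In the reconstruction above the column names are typed by hand. As a side note, the cursor returned by a `SELECT` query also exposes the names of the SQL columns through its `.description` attribute, a sequence of tuples whose first element is the column name. The sketch below shows this on a throw-away in-memory database with a hypothetical two-column table. The names `con_demo`, `cursor_demo` and `columnNames` are introduced only for illustration.

```python
import sqlite3

con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE refData (Firms TEXT NOT NULL, Spread INT)")
con_demo.execute("INSERT INTO refData VALUES ('Firm_1', 100)")

cursor_demo = con_demo.execute("SELECT * FROM refData")

# the first element of each description tuple is the column name
columnNames = [description[0] for description in cursor_demo.description]

con_demo.close()

print(columnNames)
```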


In the next section we'll see that Pandas provides much more efficient methods to save and retrieve data with a SQL engine.

When you stop working with a SQL engine, close the connection

In [23]:
con.close()

We clean-up our `Data` folder deleting the `"refData.db"` file (if still open, shut down DB Browser otherwise you won't be able to remove the file)

In [26]:
removeFile(filePath)

File ../Data\refData.db already removed.


### Pandas and SQL: `df.to_sql()` and `pd.read_sql()`

In [None]:
# re-open the connection (it was closed above); the refData.db file is created if missing
con = sq3.connect(filePath)

In [None]:
# store the whole DataFrame in the refData table; the index is saved as the Firms column
df_refData.to_sql(name="refData", con=con, index_label="Firms", if_exists="replace")

In [None]:
query = "SELECT * FROM refData"

df_refData_reloaded = pd.read_sql(sql=query, con=con)

df_refData_reloaded

In [None]:
# use the Firms column as index; to_sql() has kept the original column names
df_refData_reloaded = pd.read_sql(sql=query, con=con, index_col="Firms")

df_refData_reloaded

In [None]:
# filter rows in SQL: column names containing spaces must be double-quoted
query = 'SELECT * FROM refData WHERE "Market Cap" > 100'

pd.read_sql(sql=query, con=con, index_col="Firms")

In [None]:
# the same condition applied in Pandas (here on the Market Cap column only)
df_refData_reloaded["Market Cap"][df_refData_reloaded["Market Cap"] > 100]

In [None]:
con.close()
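A minimal self-contained sketch of the `df.to_sql()` / `pd.read_sql()` round trip, run against a throw-away in-memory SQLite database so that no file is created. The DataFrame `df_demo`, the table name `demo` and the names `con_demo` and `df_big` are hypothetical. Notice that `df.to_sql()` keeps the DataFrame column names, so a name containing a space has to be double-quoted inside the SQL query.

```python
import sqlite3
import pandas as pd

df_demo = pd.DataFrame(data={'Spread': [100, 300, 70],
                             'Market Cap': [430.0, 45.0, 161.25]},
                       index=['Firm_1', 'Firm_2', 'Firm_3'])

con_demo = sqlite3.connect(":memory:")

# export: the index is stored in a column named Firms
df_demo.to_sql(name="demo", con=con_demo, index_label="Firms")

# import: filter in SQL and use the Firms column as index again
query = 'SELECT * FROM demo WHERE "Market Cap" > 100'
df_big = pd.read_sql(sql=query, con=con_demo, index_col="Firms")

con_demo.close()

print(df_big)
```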

### Another example: parsing dates

In [None]:
df = pd.DataFrame(data=np.array([[i**k for i in range(1,11)] for k in range(1,6)]).T, 
                  index=pd.date_range('2020-01-01', periods=10, freq='B'), 
                  columns=['x', 'x^2', 'x^3', 'x^4', 'x^5'])
df

In [None]:
df.index

In [None]:
df.index[0]

In [None]:
# open a connection to a new df.db file in the Data folder
con = sq3.connect(os.path.join(dataFolderPath, "df.db"))

In [None]:
df.to_sql(name="df", con=con, index_label="Dates")

In [None]:
query = "SELECT * FROM df"
df_reloaded = pd.read_sql(sql=query, con=con, index_col="Dates")
df_reloaded

In [None]:
df_reloaded.index

In [None]:
df_reloaded.index[0]

In [None]:
type(df_reloaded.index[0])

In [None]:
query = "SELECT * FROM df"
df_reloaded = pd.read_sql(sql=query, con=con, index_col="Dates", parse_dates=["Dates"])

In [None]:
df_reloaded

In [None]:
df_reloaded.index

In [None]:
df_reloaded.index[0]

In [None]:
con.close()

---

## CSV

In [None]:
df_refData

In [None]:
# complete path of the .csv file in the Data folder
filePath = os.path.join(dataFolderPath, "df_refData.csv")

%time df_refData.to_csv(path_or_buf=filePath)

In [None]:
%time df_refData_reloaded = pd.read_csv(filepath_or_buffer=filePath, index_col=0)

In [None]:
df_refData_reloaded

In [None]:
if os.path.isfile(filePath):
    os.remove(filePath)

# double-check if file still exists
os.path.isfile(filePath)

### Another example: parsing dates

In [None]:
df

In [None]:
# complete path of the .csv file in the Data folder
filePath = os.path.join(dataFolderPath, "df.csv")

%time df.to_csv(path_or_buf=filePath)

In [None]:
%time df_reloaded = pd.read_csv(filepath_or_buffer=filePath, index_col=0)

In [None]:
df_reloaded.index

In [None]:
df_reloaded.index[0]

In [None]:
type(df_reloaded.index[0])

In [None]:
# parse_dates=True converts the index into dates
%time df_reloaded = pd.read_csv(filepath_or_buffer=filePath, index_col=0, parse_dates=True)

In [None]:
df_reloaded

In [None]:
df_reloaded.index

In [None]:
df_reloaded.index[0]

In [None]:
if os.path.isfile(filePath):
    os.remove(filePath)

# double-check if file still exists
os.path.isfile(filePath)
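The cells above compare what happens to a date index with and without `parse_dates=True`. A compact, self-contained sketch of the same comparison is given below. It goes through an in-memory string instead of a file, and the names `df_dates`, `csvText`, `asStrings` and `asDates` are hypothetical.

```python
import io
import pandas as pd

df_dates = pd.DataFrame(data={'x': [1, 2, 3]},
                        index=pd.date_range('2020-01-01', periods=3, freq='B'))

# without a path, to_csv() returns the CSV content as a string
csvText = df_dates.to_csv()

# default: the first column is read back as plain strings
asStrings = pd.read_csv(io.StringIO(csvText), index_col=0)
print(type(asStrings.index[0]))

# parse_dates=True: the index is converted into a DatetimeIndex
asDates = pd.read_csv(io.StringIO(csvText), index_col=0, parse_dates=True)
print(type(asDates.index[0]))
```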

---

## Excel

In [None]:
df_refData

In [None]:
# complete path of the .xlsx file in the Data folder
filePath = os.path.join(dataFolderPath, "df_refData.xlsx")

# writing .xlsx files requires an Excel engine (such as openpyxl) to be installed
%time df_refData.to_excel(excel_writer=filePath, sheet_name="reference data table")

In [None]:
%time df_refData_reloaded = pd.read_excel(io=filePath, index_col=0, sheet_name="reference data table")

In [None]:
df_refData_reloaded

In [None]:
if os.path.isfile(filePath):
    os.remove(filePath)

# double-check if file still exists
os.path.isfile(filePath)

### Another example: parsing dates

In [None]:
df

In [None]:
# complete path of the .xlsx file in the Data folder
filePath = os.path.join(dataFolderPath, "df.xlsx")

%time df.to_excel(excel_writer=filePath)

In [None]:
%time df_reloaded = pd.read_excel(io=filePath, index_col=0)

In [None]:
df_reloaded

In [None]:
df_reloaded.index

In [None]:
df_reloaded.index[0]

In [None]:
if os.path.isfile(filePath):
    os.remove(filePath)

# double-check if file still exists
os.path.isfile(filePath)

---


## Yahoo Finance

In [None]:
# for Yahoo Finance API
import yfinance as yf

In [None]:
data = yf.download("^GSPC", period="max")

In [None]:
data.loc['2010-01-01':, 'High'].plot()

In [None]:
data.head()

In [None]:
spx = yf.Ticker("^GSPC")
spx_hist = spx.history(period="max")

In [None]:
spx_hist.tail()

In [None]:
data2 = yf.download("SPY AAPL", start="2017-01-01", end="2017-04-30", group_by = 'ticker')

In [None]:
data2.head()

In [None]:
data2['SPY']