<h1>Accessing Data</h1>

<ul>
<li>Reading and writing pandas data from files</li>
<li>Working with data in CSV, JSON, HTML, Excel, and HDF5 formats</li>
<li>Accessing data on the web and in the cloud</li>
<li>Reading and writing from/to SQL databases</li>
<li>Reading data from remote web data services</li>

In [2]:
# import pandas and numpy
import numpy as np
import pandas as pd

# Set some pandas options for controlling output
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 10)

<h3>Reading a CSV file into a DataFrame</h3>

In [22]:
# read in msft.csv into a DataFrame
msft = pd.read_csv("data/msft.csv")
msft.head()

         Date   Open   High    Low  Close   Volume  Adj Close
0  2014-07-21  83.46  83.53  81.81  81.93  2359300      81.93
1  2014-07-18  83.30  83.40  82.52  83.35  4020800      83.35
2  2014-07-17  84.35  84.63  83.33  83.63  1974000      83.63
3  2014-07-16  83.77  84.91  83.66  84.91  1755600      84.91

Wow, that was easy! pandas has realized that the first line of the file contains the names of
the columns and bulk read in the data to DataFrame.

<h3>Specifying the index column when reading a CSV file</h3>

In [4]:
# use column 0 as the index
msft = pd.read_csv("data/msft.csv", index_col=0)
msft.head()

             Open   High    Low  Close   Volume  Adj Close
Date                                                      
2014-07-21  83.46  83.53  81.81  81.93  2359300      81.93
2014-07-18  83.30  83.40  82.52  83.35  4020800      83.35
2014-07-17  84.35  84.63  83.33  83.63  1974000      83.63
2014-07-16  83.77  84.91  83.66  84.91  1755600      84.91

In [14]:
msft['MyDate'] = msft.index

In [15]:
msft

             Open   High    Low  Close   Volume  Adj Close      MyDate
Date                                                                  
2014-07-21  83.46  83.53  81.81  81.93  2359300      81.93  2014-07-21
2014-07-18  83.30  83.40  82.52  83.35  4020800      83.35  2014-07-18
2014-07-17  84.35  84.63  83.33  83.63  1974000      83.63  2014-07-17
2014-07-16  83.77  84.91  83.66  84.91  1755600      84.91  2014-07-16

In [16]:
# examine the types of the columns in this DataFrame
msft.dtypes

Open         float64
High         float64
Low          float64
Close        float64
Volume         int64
Adj Close    float64
MyDate        object
dtype: object

To force the types of columns, use the dtypes parameter of pd.read_csv(). The following
forces the Volume column to also be float64:

In [19]:
# specify that the Volume column should be a float64
msft = pd.read_csv("data/msft.csv", dtype = { 'Volume' : np.float})
msft.dtypes

Date          object
Open         float64
High         float64
Low          float64
Close        float64
Volume       float64
Adj Close    float64
dtype: object

It is also possible to specify the column names at the time of reading the data using the
names parameter:

In [21]:
# specify a new set of names for the columns
# all lower case, remove space in Adj Close
# also, header=0 skips the header row
df = pd.read_csv("data/msft.csv", header=0, names=['open', 'high', 'low',
'close', 'volume', 'adjclose'])
df

             open   high    low  close   volume  adjclose
2014-07-21  83.46  83.53  81.81  81.93  2359300     81.93
2014-07-18  83.30  83.40  82.52  83.35  4020800     83.35
2014-07-17  84.35  84.63  83.33  83.63  1974000     83.63
2014-07-16  83.77  84.91  83.66  84.91  1755600     84.91

In [24]:
# specify a new set of names for the columns
# all lower case, remove space in Adj Close
# also, header=0 skips the header row
df = pd.read_csv("data/msft.csv", header=0, names=['open', 'high', 'low',
'close', 'volume', 'adjclose'])
df

             open   high    low  close   volume  adjclose
2014-07-21  83.46  83.53  81.81  81.93  2359300     81.93
2014-07-18  83.30  83.40  82.52  83.35  4020800     83.35
2014-07-17  84.35  84.63  83.33  83.63  1974000     83.63
2014-07-16  83.77  84.91  83.66  84.91  1755600     84.91

Note that because we specified the names of the columns, we need to skip over the column
names’ row in the file, which was performed with header=0. If not, pandas will assume
the first row is part of the data, which will cause some issues later in processing.

<h3>Specify specific columns to load</h3>

It is also possible to specify which columns to load when reading the file. This can be
useful if you have a lot of columns in the file and some are of no interest to your analysis
and you want to save the time and memory required to read and store them. Specifying
which columns to read is accomplished with the usecols parameter, which can be passed
a list of column names or column offsets.

In [26]:
# read in data only in the Date and Close columns
# and index by the Date column
df2 = pd.read_csv("data/msft.csv", usecols=['Date', 'Close'], index_col=['Date'])
df2

            Close
Date             
2014-07-21  81.93
2014-07-18  83.35
2014-07-17  83.63
2014-07-16  84.91

<h3>Saving DataFrame to a csv file</h3>

In [35]:
# save df2 to a new csv file
# also specify naming the index as date
df2.to_csv("data/msft_modified.csv", index_label='date')
df2

            Close
Date             
2014-07-21  81.93
2014-07-18  83.35
2014-07-17  83.63
2014-07-16  84.91

It was necessary to tell the method that the index label should be saved with a column
name of date using index_label=date. Otherwise, the index does not have a name
added to the first row of the file, which makes it difficult to read back properly.

In [36]:
# view the start of the file just saved
!head data/msft_modified.csv # Linux or osx

date,Close
2014-07-21,81.93
2014-07-18,83.35
2014-07-17,83.63
2014-07-16,84.91


<h3>General field-delimited data</h3>

In [38]:
# use read_table with sep=',' to read a CSV
df = pd.read_table("data/msft.csv", sep=',', index_col=['Date'])
df.head()

             Open   High    Low  Close   Volume  Adj Close
Date                                                      
2014-07-21  83.46  83.53  81.81  81.93  2359300      81.93
2014-07-18  83.30  83.40  82.52  83.35  4020800      83.35
2014-07-17  84.35  84.63  83.33  83.63  1974000      83.63
2014-07-16  83.77  84.91  83.66  84.91  1755600      84.91

In [39]:
# save as pipe delimited
df.to_csv("data/msft_piped.txt", sep='|')

In [40]:
!head -n 5 data/msft_piped.txt

Date|Open|High|Low|Close|Volume|Adj Close
2014-07-21|83.46|83.53|81.81|81.93|2359300|81.93
2014-07-18|83.3|83.4|82.52|83.35|4020800|83.35
2014-07-17|84.35|84.63|83.33|83.63|1974000|83.63
2014-07-16|83.77|84.91|83.66|84.91|1755600|84.91


<h3>Handling noise rows in field-delimited data</h3>

Skip unwanted rows

<pre>#!head data\msft2.csv
This is fun because the data does not start on the first line
Date,Open,High,Low,Close,Volume,Adj Close

And there is space between the header row and data

2014-07-21,83.46,83.53,81.81,81.93,2359300,81.93
2014-07-18,83.30,83.40,82.52,83.35,4020800,83.35
2014-07-17,84.35,84.63,83.33,83.63,1974000,83.63
2014-07-16,83.77,84.91,83.66,84.91,1755600,84.91
2014-07-15,84.30,84.38,83.20,83.58,1874700,83.58
2014-07-14,83.66,84.64,83.11,84.40,1432100,84.40</pre>

In [43]:
# read, but skip rows 0, 2 and 3
df = pd.read_csv("data/msft2.csv", skiprows=[0, 2, 3])
df

         Date   Open   High    Low  Close   Volume  Adj Close
0  2014-07-18  83.30  83.40  82.52  83.35  4020800      83.35
1  2014-07-17  84.35  84.63  83.33  83.63  1974000      83.63
2  2014-07-16  83.77  84.91  83.66  84.91  1755600      84.91
3  2014-07-15  84.30  84.38  83.20  83.58  1874700      83.58
4  2014-07-14  83.66  84.64  83.11  84.40  1432100      84.40

Another common situation is where a file has content at the end of the file, which should
be ignored to prevent an error, such as the following.

<pre>
# another messy file, with the mess at the end
!cat data/msft_with_footer.csv # osx or Linux
Date,Open,High,Low,Close,Volume,Adj Close
2014-07-21,83.46,83.53,81.81,81.93,2359300,81.93
2014-07-18,83.30,83.40,82.52,83.35,4020800,83.35
Uh oh, there is stuff at the end.
</pre>

This will cause an exception during reading, but it can be handled using the skip_footer
parameter, which specifies how many lines at the end of the file to ignore:

In [54]:
# skip only two lines at the end
df = pd.read_csv("data/msft_with_footer.csv", skipfooter=2)
df

  


         Date   Open   High    Low  Close   Volume  Adj Close
0  2014-07-21  83.46  83.53  81.81  81.93  2359300      81.93
1  2014-07-18  83.30  83.40  82.52  83.35  4020800      83.35

In [55]:
# skip only two lines at the end
df = pd.read_csv("data/msft_with_footer.csv", skipfooter=2, engine='python')
df

         Date   Open   High    Low  Close   Volume  Adj Close
0  2014-07-21  83.46  83.53  81.81  81.93  2359300      81.93
1  2014-07-18  83.30  83.40  82.52  83.35  4020800      83.35

Suppose the file is large and you only want to read the first few rows, as you only want the
data at the start of the file and do not want to read it all into the memory. This can be
handled with the nrows parameter:

In [56]:
# only process the first three rows
pd.read_csv("data/msft.csv", nrows=3)

         Date   Open   High    Low  Close   Volume  Adj Close
0  2014-07-21  83.46  83.53  81.81  81.93  2359300      81.93
1  2014-07-18  83.30  83.40  82.52  83.35  4020800      83.35
2  2014-07-17  84.35  84.63  83.33  83.63  1974000      83.63

The following example skips 100 rows and then reads in the next
5

In [57]:
# skip 100 lines, then only process the next five
pd.read_csv("data/msft.csv", skiprows=100, nrows=5,header=0,
            names=['open', 'high', 'low', 'close', 'vol','adjclose'])

Empty DataFrame
Columns: [open, high, low, close, vol, adjclose]
Index: []

<b>Note</b>
<p>Note that the preceding example also skipped reading the header line, so it was necessary
to inform the process to not look for a header and use the specified names.</p>

<h3>Reading and writing data in an Excel format</h3>

pandas supports reading data in Excel 2003 and newer formats using the
pd.read_excel() function or via the ExcelFile class. Internally, both techniques use
either the XLRD or OpenPyXL packages, so you will need to ensure that either is installed
first in your Python environment.

In [60]:
# read excel file
# only reads first sheet (aapl in this case)
df = pd.read_excel("data/stocks.xlsx")
df.head()

        Date   Open   High    Low  Close    Volume  Adj Close
0  7/21/2014  94.99  95.00  93.72  93.94  38887700      93.94
1  7/18/2014  93.62  94.74  93.02  94.43  49898600      94.43
2  7/17/2014  95.03  95.28  92.57  93.09  57152000      93.09
3  7/16/2014  96.97  97.10  94.74  94.78  53396300      94.78
4  7/15/2014  96.80  96.85  95.03  95.32  45477900      95.32

This has read only content from the first worksheet in the Excel file (the aapl worksheet)
and used the contents of the first row as column names. To read the other worksheet, you
can pass the name of the worksheet using the sheetname parameter:

In [64]:
# read from the aapl worksheet
aapl = pd.read_excel("data/stocks.xlsx", sheet_name='msft')
aapl.head()

        Date    Open   High    Low  Close    Volume  Adj Close
0  7/21/2014  150.99  95.00  93.72  93.94  38887700      93.94
1  7/18/2014   93.62  94.74  93.02  94.43  49898600      94.43
2  7/17/2014   95.03  95.28  92.57  93.09  57152000      93.09
3  7/16/2014   96.97  97.10  94.74  94.78  53396300      94.78
4  7/15/2014   96.80  96.85  95.03  95.32  45477900      95.32

Excel files can be written using the .to_excel() method of DataFrame. Writing to the
XLS format requires the inclusion of the XLWT package, so make sure it is loaded in your
Python environment.

In [68]:
# save to an .XLS file, in worksheet 'Sheet1'
df.to_excel("data/stocks2.xls", sheet_name='MSFT')

To write more than one DataFrame to a single Excel file and each DataFrame object on a
separate worksheet, use the ExcelWriter object, along with the with keyword.
ExcelWriter is part of pandas, but you will need to make sure it is imported, as it is not in
the top level namespace of pandas. The following writes two DataFrame objects to two
different worksheets in one Excel file:

In [70]:
# write multiple sheets
# requires use of the ExcelWriter class
from pandas import ExcelWriter
with ExcelWriter("data/all_stocks.xls") as writer:
    aapl.to_excel(writer, sheet_name='AAPL')
    df.to_excel(writer, sheet_name='MSFT')

Writing to XLSX files uses the same function but specifies .XLSX through the file
extension:

In [72]:
# write to xlsx
df.to_excel("data/msft2.xlsx")

<h3>Reading and writing JSON files</h3>

To demonstrate saving as JSON, we will save the Excel data we just read in to a JSON file
and then take a look at the contents:

In [73]:
df

        Date   Open   High    Low  Close    Volume  Adj Close
0  7/21/2014  94.99  95.00  93.72  93.94  38887700      93.94
1  7/18/2014  93.62  94.74  93.02  94.43  49898600      94.43
2  7/17/2014  95.03  95.28  92.57  93.09  57152000      93.09
3  7/16/2014  96.97  97.10  94.74  94.78  53396300      94.78
4  7/15/2014  96.80  96.85  95.03  95.32  45477900      95.32

In [74]:
df.to_json("data/stocks.json")

JSON-based data can be read with the pd.read_json() function:

In [75]:
# read data in from JSON
df_from_json = pd.read_json("data/stocks.json")
df_from_json.head(5)

        Date   Open   High    Low  Close    Volume  Adj Close
0 2014-07-21  94.99  95.00  93.72  93.94  38887700      93.94
1 2014-07-18  93.62  94.74  93.02  94.43  49898600      94.43
2 2014-07-17  95.03  95.28  92.57  93.09  57152000      93.09
3 2014-07-16  96.97  97.10  94.74  94.78  53396300      94.78
4 2014-07-15  96.80  96.85  95.03  95.32  45477900      95.32

<h3>Reading HTML data from the Web</h3>

<p>pandas has very nice support for reading data from HTML files (or HTML from URLs).
Underneath the covers, pandas makes use of the LXML, Html5Lib, and BeautifulSoup4
packages, which provide some very impressive capabilities for reading and writing HTML
tables.</p>
<p>The pd.read_html() function will read HTML from a file (or URL) and parse all HTML
tables found in the content into one or more pandas DataFrame object. The function
always returns a list of DataFrame objects (actually, zero or more, depending on the
number of tables found in the HTML).</p>

In [78]:
# the URL to read
url = "http://www.fdic.gov/bank/individual/failed/banklist.html"

# read it
banks = pd.read_html(url)

# examine a subset of the first table read
banks

[                             Bank Name           City  ST   CERT  \
 0                 The First State Bank  Barboursville  WV  14361   
 1                   Ericson State Bank        Ericson  NE  18265   
 2     City National Bank of New Jersey         Newark  NJ  21111   
 3                        Resolute Bank         Maumee  OH  58317   
 4                Louisa Community Bank         Louisa  KY  58112   
 ..                                 ...            ...  ..    ...   
 556                 Superior Bank, FSB       Hinsdale  IL  32646   
 557                Malta National Bank          Malta  OH   6629   
 558    First Alliance Bank & Trust Co.     Manchester  NH  34264   
 559  National State Bank of Metropolis     Metropolis  IL   3815   
 560                   Bank of Honolulu       Honolulu  HI  21029   
 
                    Acquiring Institution       Closing Date  
 0                         MVB Bank, Inc.      April 3, 2020  
 1             Farmers and Merchants Bank  F

In [80]:
banks[0][0:5]

                          Bank Name           City  ST   CERT  \
0              The First State Bank  Barboursville  WV  14361   
1                Ericson State Bank        Ericson  NE  18265   
2  City National Bank of New Jersey         Newark  NJ  21111   
3                     Resolute Bank         Maumee  OH  58317   
4             Louisa Community Bank         Louisa  KY  58112   

               Acquiring Institution       Closing Date  
0                     MVB Bank, Inc.      April 3, 2020  
1         Farmers and Merchants Bank  February 14, 2020  
2                    Industrial Bank   November 1, 2019  
3                 Buckeye State Bank   October 25, 2019  
4  Kentucky Farmers Bank Corporation   October 25, 2019  

In [81]:
df

        Date   Open   High    Low  Close    Volume  Adj Close
0  7/21/2014  94.99  95.00  93.72  93.94  38887700      93.94
1  7/18/2014  93.62  94.74  93.02  94.43  49898600      94.43
2  7/17/2014  95.03  95.28  92.57  93.09  57152000      93.09
3  7/16/2014  96.97  97.10  94.74  94.78  53396300      94.78
4  7/15/2014  96.80  96.85  95.03  95.32  45477900      95.32

Write DataFrame to HTML

In [84]:
# Write DataFrame to HTML Document
df.head(2).to_html("data/stocks.html")

<h3>Reading and writing HDF5 format files</h3>

HDF5 is a data model, library, and file format to store and manage data. It is commonly
used in scientific computing environments. It supports an unlimited variety of data types
and is designed for flexible and efficient I/O and for high volume and complex data.

HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5.
The HDF5 Technology suite includes tools and applications to manage, manipulate, view,
and analyze data in the HDF5 format. HDF5 is:

<ul>
<li>A versatile data model that can represent very complex data objects and a wide
    variety of metadata</li>
<li>A completely portable file format with no limit on the number or size of data objects
in the collection</li>
<li>A software library that runs on a range of computational platforms, from laptops to
massively parallel systems, and implements a high-level API with C, C++, Fortran
90, and Java interfaces</li>
<li>A rich set of integrated performance features that allow for access time and storage
space optimizations</li>
<li>Tools and applications to manage, manipulate, view, and analyze the data in the
collection    </li>
</ul>

<p>HDFStore is a hierarchical, dictionary-like object that reads and writes pandas objects to
the HDF5 format. Under the covers, HDFStore uses the PyTables library, so make sure that
it is installed if you want to use this format.</p>
<p>The following demonstrates writing DataFrame into an HDF5 format. The output shows
you that the HDF5 store has a root level object named df, which is a frame and whose
shape is eight rows of three columns:</p>

In [86]:
# seed for replication
np.random.seed(123456)

# create a DataFrame of dates and random numbers in three columns
df = pd.DataFrame(np.random.randn(8, 3), index=pd.date_range('1/1/2000', periods=8),
columns=['A', 'B', 'C'])
df

                   A         B         C
2000-01-01  0.469112 -0.282863 -1.509059
2000-01-02 -1.135632  1.212112 -0.173215
2000-01-03  0.119209 -1.044236 -0.861849
2000-01-04 -2.104569 -0.494929  1.071804
2000-01-05  0.721555 -0.706771 -1.039575
2000-01-06  0.271860 -0.424972  0.567020
2000-01-07  0.276232 -1.087401 -0.673690
2000-01-08  0.113648 -1.478427  0.524988

In [89]:
# create HDF5 store
store = pd.HDFStore('data/store.h5')
store['df'] = df # persisting happened here
store

<class 'pandas.io.pytables.HDFStore'>
File path: data/store.h5

The following reads the HDF5 store and retrieves DataFrame:

In [90]:
# read in data from HDF5
store = pd.HDFStore("data/store.h5")
df = store['df']
df

                   A         B         C
2000-01-01  0.469112 -0.282863 -1.509059
2000-01-02 -1.135632  1.212112 -0.173215
2000-01-03  0.119209 -1.044236 -0.861849
2000-01-04 -2.104569 -0.494929  1.071804
2000-01-05  0.721555 -0.706771 -1.039575
2000-01-06  0.271860 -0.424972  0.567020
2000-01-07  0.276232 -1.087401 -0.673690
2000-01-08  0.113648 -1.478427  0.524988

Changes to DataFrame made after that point are not persisted, at least not until the object
is assigned to the data store object again. The following demonstrates this by making a
change to DataFrame and then reassigning it to the HDF5 store, thereby updating the data
store:

In [106]:
# this changes the DataFrame, but did not persist
df.loc['2000-01-01'].A = 1
df

                   A         B         C
2000-01-01  1.000000 -0.282863 -1.509059
2000-01-02  1.000000  1.212112 -0.173215
2000-01-03  0.119209 -1.044236 -0.861849
2000-01-04  1.000000 -0.494929  1.071804
2000-01-05  0.721555 -0.706771 -1.039575
2000-01-06  0.271860 -0.424972  0.567020
2000-01-07  0.276232 -1.087401 -0.673690
2000-01-08  0.113648 -1.478427  0.524988

In [93]:
# to persist the change, assign the DataFrame to the
# HDF5 store object
store['df'] = df

# it is now persisted

In [94]:
# the following loads the store and
# shows the first two rows, demonstrating
# the the persisting was done
pd.HDFStore("data/store.h5")['df'].head(2) # it's now in there

                   A         B         C
2000-01-01  0.469112 -0.282863 -1.509059
2000-01-02 -1.135632  1.212112 -0.173215

<h3>Accessing data on the web and in the
cloud</h3>

It is quite common to read data off the web and from the cloud. pandas makes it extremely
easy to read data from the web and cloud. All of the pandas functions we have examined
can also be given an HTTP URL, FTP address, or S3 address instead of a local file path,
and all work just the same as they work with a local file.

In [None]:
# read csv directly from Yahoo! Finance from a URL
df = pd.read_csv("http://ichart.yahoo.com/table.csv?s=MSFT&" +
"a=5&b=1&c=2014&" +
"d=5&e=30&f=2014&" +
"g=d&ignore=.csv")

<h3>Reading and writing from/to SQL databases</h3>

<p>pandas can read data from any SQL databases that support Python data adapters, that
respect the Python DB-API. Reading is performed using the pandas.io.sql.read_sql()
function and writing to SQL databases using the .to_sql() method of DataFrame.</p>
<p>As an example of writing, the following reads the stock data from msft.csv and
aapl.csv. It then makes a connection to a SQLite3 database file. If the file does not exist,
it creates it on the fly. It then writes the MSFT data to a table named STOCK_DATA. If the
table did not exist, it is created. If it exists, all the data is replaced with the MSFT data. It
then appends the AAPL stock data to that table:</p>

In [None]:
# reference SQLite
import sqlite3

# read in the stock data from CSV
msft = pd.read_csv("data/msft.csv")
msft["Symbol"]="MSFT"
aapl = pd.read_csv("data/aapl.csv")
aapl["Symbol"]="AAPL"

# create connection
connection = sqlite3.connect("data/stocks.sqlite")

# .to_sql() will create SQL to store the DataFrame
# in the specified table. if_exists specifies
# what to do if the table already exists
msft.to_sql("STOCK_DATA", connection, if_exists="replace")
aapl.to_sql("STOCK_DATA", connection, if_exists="append")
# commit the SQL and close the connection
connection.commit()
connection.close()

Data can be read using SQL from the database using the pd.io.sql.read_sql() function.
The following queries the data from stocks.sqlite using SQL and reports it to the user:

In [None]:
# connect to the database file
connection = sqlite3.connect("data/stocks.sqlite")

# query all records in STOCK_DATA
# returns a DataFrame
# inde_col specifies which column to make the DataFrame index
stocks = pd.io.sql.read_sql("SELECT * FROM STOCK_DATA;",
connection, index_col='index')

# close the connection
connection.close()

# report the head of the data retrieved
stocks.head()

It is also possible to use the WHERE clause in the SQL, as well as to select columns. To
demonstrate, the following selects the records where MSFT’s volume is greater than
29200100:

In [None]:
# open the connection
connection = sqlite3.connect("data/stocks.sqlite")

# construct the query string
query = "SELECT * FROM STOCK_DATA WHERE Volume>29200100 AND Symbol='MSFT';"

# execute and close connection
items = pd.io.sql.read_sql(query, connection, index_col='index')

connection.close()
# report the query result

Items

A final point, is that most of the code of these examples was SQLite3 code. The only
pandas part of these examples is the use of the .to_sql() and .read_sql() methods. As
these functions take a connection object, which can be any Python DB-API-ompatible data
adapter, you can more or less work with any supported database data by simply creating
an appropriate connection object. The code at the pandas level should remain the same for
any supported database

<h3>Reading data from remote data services</h3>

<p>pandas has direct support for various web-based data source classes in the
pandas.io.data namespace. The primary class of interest is
pandas.io.data.DataReader, which is implemented to read data from various supported
sources and return it to the application directly as DataFrame.</p>
<p>Currently, support exists for the following sources via the DataReader class:</p>
<ul>
    <li>Daily historical prices’ stock from either Yahoo! and Google Finance</li>
    <li>Yahoo! Options</li>
<li>The Federal Reserve Economic Data library</li>
    <li>Kenneth French’s Data Library</li>
<li>The World Bank</li>

The specific source of data is specified via the DataReader object’s data_source
parameter. The specific items to be retrieved are specified using the name parameter. If the
data source supports selecting data between a range of dates, these dates can be specified
with the start and end parameters. We will now take a look at reading data from each of
these sources.

<h3>Reading stock data from Yahoo! and Google
Finance</h3>

Yahoo! Finance is specified by passing 'yahoo' as the data_source parameter. The
following retrieves data from Yahoo! Finance, specifically, the data for MSFT between
2012-01-01 and 2014-01-27:

In [122]:
# import pandas.io.data namespace, alias as web
import pandas_datareader.data as web

# and datetime for the dates
import datetime

# start end end dates
start = datetime.datetime(2012, 1, 1)
end = datetime.datetime(2014, 1, 27)

# read the MSFT stock data from yahoo! and view the head
yahoo = web.DataReader('MSFT', 'yahoo', start, end)
yahoo.head()

                 High        Low       Open      Close      Volume  Adj Close
Date                                                                         
2012-01-03  26.959999  26.389999  26.549999  26.770000  64731500.0  21.959635
2012-01-04  27.469999  26.780001  26.820000  27.400000  80516100.0  22.476425
2012-01-05  27.730000  27.290001  27.379999  27.680000  56081400.0  22.706108
2012-01-06  28.190001  27.530001  27.530001  28.110001  99455500.0  23.058842
2012-01-09  28.100000  27.719999  28.049999  27.740000  59706800.0  22.755325

The source of the data can be changed to Google Finance with a change of the
data_source parameter to 'google':

In [123]:
# read from google and display the head of the data
goog = web.DataReader("MSFT", 'google', start, end)
goog.head()

NotImplementedError: data_source='google' is not implemented