## Introduction
The pandas DataFrame object extends the capabilities of the Series object into
two dimensions. A Series object adds an index to a NumPy array but can only
associate a single data item with each index label. A DataFrame integrates multiple Series
objects by aligning them along common index labels.

Like a Series, a DataFrame uses the [] operator for selection, but on a DataFrame it
selects columns of data.
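
A minimal sketch of that difference, using a small made-up Series and DataFrame (the names `s` and `frame` and their values are illustrative only, not part of the datasets used later):

```python
import pandas as pd

# Hypothetical data for illustration only
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
frame = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]}, index=['a', 'b', 'c'])

# On a Series, [] looks up a value by index label
series_value = s['b']

# On a DataFrame, [] looks up a column by name and returns a Series
column = frame['x']

# A list of names returns a DataFrame, even when the list holds one name
sub_frame = frame[['x']]
```

Row selection on a DataFrame is normally done with `.loc` and `.iloc`, which the later cells use.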

## Creating DataFrame from scratch

In [244]:
# reference NumPy and pandas
import numpy as np
import pandas as pd

# Set some pandas options
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 10)

In [245]:
# Using a NumPy 2-D array
pd.DataFrame(np.array([[10, 11], [20, 21]]))

    0   1
0  10  11
1  20  21

In [246]:
# Create a DataFrame from a list of Series objects
df1 = pd.DataFrame([pd.Series(np.arange(10, 15)),
                    pd.Series(np.arange(15, 20))])
df1

    0   1   2   3   4
0  10  11  12  13  14
1  15  16  17  18  19

In [247]:
# what's the shape of this DataFrame
df1.shape # it is two rows by 5 columns

(2, 5)

In [248]:
# specify column names
df = pd.DataFrame(np.array([[10, 11], [20, 21]]), columns=['a', 'b'])
df

    a   b
0  10  11
1  20  21

In [249]:
# what is the name of the first column?
df.columns[0]


'a'

In [250]:
# rename the columns
df.columns = ['c1', 'c2']
df

   c1  c2
0  10  11
1  20  21

In [251]:
# create a DataFrame with named columns and rows
df = pd.DataFrame(np.array([[0, 1], [2, 3]]),
                  columns=['c1', 'c2'],
                  index=['r1', 'r2'])
df

    c1  c2
r1   0   1
r2   2   3

In [252]:
# create a DataFrame with two Series objects
# and a dictionary
s1 = pd.Series(np.arange(1, 6, 1))
s2 = pd.Series(np.arange(6, 11, 1))
pd.DataFrame({'c1': s1, 'c2': s2})

   c1  c2
0   1   6
1   2   7
2   3   8
3   4   9
4   5  10

In [253]:
# demonstrate alignment during creation
s3 = pd.Series(np.arange(12, 14), index=[1, 2])
df = pd.DataFrame({'c1': s1, 'c2': s2, 'c3': s3})
df

   c1  c2    c3
0   1   6   NaN
1   2   7  12.0
2   3   8  13.0
3   4   9   NaN
4   5  10   NaN

## S&P 500 & Monthly stock historical prices Datasets

In [254]:
# read in the data and print the first five rows
# use the Symbol column as the index, and
# only read in columns in positions 0, 2, 3, 7
 
sp500 = pd.read_csv("sp500.csv",
 index_col='Symbol',
 usecols=[0, 2, 3, 7])

sp500.head()

                        Sector   Price  Book Value
Symbol                                            
MMM                Industrials  141.14      26.668
ABT                Health Care   39.60      15.573
ABBV               Health Care   53.95       2.954
ACN     Information Technology   79.79       8.326
ACE                 Financials  102.91      86.897

In [255]:
len(sp500)

500

In [256]:
sp500.index[:5]

Index(['MMM', 'ABT', 'ABBV', 'ACN', 'ACE'], dtype='object', name='Symbol')

In [257]:
sp500.columns

Index(['Sector', 'Price', 'Book Value'], dtype='object')

In [258]:
# read in the data
one_mon_hist = pd.read_csv("omh.csv")
 
# examine the first three rows
one_mon_hist[:3]

         Date   MSFT    AAPL
0  2014-12-01  48.62  115.07
1  2014-12-02  48.46  114.63
2  2014-12-03  48.08  115.93

Note: This type of data is referred to as a time series, which is a series of data points indexed (or listed or graphed) in time order.
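
To treat such data as a time series in pandas, the date strings are usually parsed and used as the index. The following is a sketch under assumptions: it rebuilds the three sample rows shown above by hand instead of reading `omh.csv`, and the names `prices` and `by_date` are illustrative.

```python
import pandas as pd

# The three sample rows shown above, rebuilt by hand for illustration
prices = pd.DataFrame({
    'Date': ['2014-12-01', '2014-12-02', '2014-12-03'],
    'MSFT': [48.62, 48.46, 48.08],
    'AAPL': [115.07, 114.63, 115.93],
})

# Parse the strings into timestamps and use them as the index
prices['Date'] = pd.to_datetime(prices['Date'])
by_date = prices.set_index('Date')

# Values can now be looked up by date label
msft_dec2 = by_date.loc[pd.Timestamp('2014-12-02'), 'MSFT']
```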

In [259]:
# get the first five rows of the second and third columns (positions 1 and 2) by location

sp500.iloc[:5 ,[1, 2]]

         Price  Book Value
Symbol                    
MMM     141.14      26.668
ABT      39.60      15.573
ABBV     53.95       2.954
ACN      79.79       8.326
ACE     102.91      86.897

In [260]:
df2 = sp500.iloc[:5,[1]]
df2

         Price
Symbol        
MMM     141.14
ABT      39.60
ABBV     53.95
ACN      79.79
ACE     102.91

In [261]:
# It's a DataFrame not a Series
type(df2)

pandas.core.frame.DataFrame

In [262]:
# create a new DataFrame with integers as the column names
# make sure to use .copy(), otherwise the rename would change sp500 itself
df = sp500.copy()
df.columns=[0, 1, 2]
df.head()

                             0       1       2
Symbol                                        
MMM                Industrials  141.14  26.668
ABT                Health Care   39.60  15.573
ABBV               Health Care   53.95   2.954
ACN     Information Technology   79.79   8.326
ACE                 Financials  102.91  86.897

In [263]:
# It's a Series, because a single integer column position was passed without wrapping it in a list
type(df.iloc[:,1])

pandas.core.series.Series

In [264]:
type(sp500[['Price', 'Sector']])

pandas.core.frame.DataFrame

In [265]:
sp500.Price.head()

Symbol
MMM     141.14
ABT      39.60
ABBV     53.95
ACN      79.79
ACE     102.91
Name: Price, dtype: float64

## Selecting rows and values of a DataFrame using the index

In [266]:
# The following code returns rows starting with the ABT label through the ACN label
sp500['ABT':'ACN']

                        Sector  Price  Book Value
Symbol                                           
ABT                Health Care  39.60      15.573
ABBV               Health Care  53.95       2.954
ACN     Information Technology  79.79       8.326

In [267]:
# get row with label MMM
# returned as a Series
sp500.loc['MMM']

Sector        Industrials
Price              141.14
Book Value         26.668
Name: MMM, dtype: object

In [268]:
sp500.loc['ABT':'ACN', ['Price', 'Sector']]

        Price                  Sector
Symbol                               
ABT     39.60             Health Care
ABBV    53.95             Health Care
ACN     79.79  Information Technology

In [269]:
# extract the first three rows and just the Price column

rcopy = sp500[0:3]['Price'].copy()
rcopy

Symbol
MMM     141.14
ABT      39.60
ABBV     53.95
Name: Price, dtype: float64

## Scalar lookup by label or location

In [270]:
# by label in both the index and column
sp500.at['MMM', 'Price']

141.14

In [271]:
# by location. Row 0, column 1
sp500.iat[0, 1]

141.14

## Selecting rows of a DataFrame by Boolean selection


In [272]:
# what rows have a price < 100?
sp500[sp500.Price < 100]

                        Sector  Price  Book Value
Symbol                                           
ABT                Health Care  39.60      15.573
ABBV               Health Care  53.95       2.954
ACN     Information Technology  79.79       8.326
ADBE    Information Technology  64.30      13.262
AES                  Utilities  13.61       5.781
...                        ...    ...         ...
XYL                Industrials  38.42      12.127
YHOO    Information Technology  35.02      12.768
YUM     Consumer Discretionary  74.77       5.147
ZION                Financials  28.43      30.191
ZTS                Health Care  30.53       2.150

[407 rows x 3 columns]

In [273]:
# get only the Price where Price is < 10 and > 0
r = sp500[(sp500.Price < 10) &
 (sp500.Price > 0)] [['Price']]
r

        Price
Symbol       
FTR      5.81
HCBK     9.80
HBAN     9.10
SLM      8.82
WIN      9.38

## Modifying the structure and content of DataFrame

In [274]:
# rename the Book Value column to not have a space
# this returns a copy with the column renamed
df = sp500.rename(columns=
 {'Book Value': 'BookValue'})
df[:2]

             Sector   Price  BookValue
Symbol                                
MMM     Industrials  141.14     26.668
ABT     Health Care   39.60     15.573

In [275]:
# this changes the column in-place

sp500.rename(columns=
 {'Book Value': 'BookValue'},
 inplace=True)

# we can see the column is changed
sp500.columns

Index(['Sector', 'Price', 'BookValue'], dtype='object')

## Adding and inserting columns

In [276]:
# The simplest way is by merging a new Series into the DataFrame object, 
# along the index using the [] operator assigning the Series to a new column,
# with a name not already in the .columns index. 

copy = sp500.copy()
copy['TwicePrice'] = sp500.Price * 2
copy[:2]

             Sector   Price  BookValue  TwicePrice
Symbol                                            
MMM     Industrials  141.14     26.668      282.28
ABT     Health Care   39.60     15.573       79.20

In [277]:
# If you want to add the column at a different location in the DataFrame object,
# instead of at the rightmost position, use the .insert() method

copy = sp500.copy()
 
# insert sp500.Price * 2 as the
# second column in the DataFrame
copy.insert(1, 'TwicePrice', sp500.Price * 2)
copy[:2]

             Sector  TwicePrice   Price  BookValue
Symbol                                            
MMM     Industrials      282.28  141.14     26.668
ABT     Health Care       79.20   39.60     15.573

## Replacing the contents of a column

In [278]:
copy = sp500.copy()
 
# replace the Price column data with the new values
# instead of adding a new column
copy.Price = sp500.Price * 2
copy[:5]

                        Sector   Price  BookValue
Symbol                                           
MMM                Industrials  282.28     26.668
ABT                Health Care   79.20     15.573
ABBV               Health Care  107.90      2.954
ACN     Information Technology  159.58      8.326
ACE                 Financials  205.82     86.897

## Deleting columns in a DataFrame
* del will simply delete the Series from the DataFrame (in-place)
* pop() will both delete the Series and return the Series as a result (also in-place)
* drop(labels, axis=1) will return a new DataFrame with the column(s) removed (the original DataFrame object is not modified)



In [279]:
# Example of using del to delete a column
copy = sp500[:2].copy()
del copy['BookValue']
copy

             Sector   Price
Symbol                     
MMM     Industrials  141.14
ABT     Health Care   39.60

In [280]:
# Example of using pop to remove a column from a DataFrame
# this will remove Sector and return it as a series
# Sector column removed in-place

copy = sp500[:2].copy()
popped = copy.pop('Sector')
copy

         Price  BookValue
Symbol                   
MMM     141.14     26.668
ABT      39.60     15.573

In [281]:
type(popped)

pandas.core.series.Series

In [282]:
# Example of using drop to remove a column
# this will return a new DataFrame with 'Sector' removed
# the copy DataFrame is not modified

copy = sp500[:2].copy()
afterdrop = copy.drop(['Sector'], axis = 1)
afterdrop

         Price  BookValue
Symbol                   
MMM     141.14     26.668
ABT      39.60     15.573

## Adding rows to a DataFrame
* Appending a DataFrame to another
* Concatenation of two DataFrame objects
* Setting with enlargement

### Appending

In [283]:
# Appending rows returns a new DataFrame

# copy the first three rows of sp500, and take the rows at positions 10, 11 and 2
df1 = sp500.iloc[0:3].copy()
df2 = sp500.iloc[[10, 11, 2]]

# append df2 to df1
# (DataFrame.append was removed in pandas 2.0; there, use pd.concat([df1, df2]))
appended = df1.append(df2)

# the result is the rows of the first followed by
# those of the second
# The resulting DataFrame will consist of the union of the columns in both and
# where either did not have a column, NaN will be used as the value.
appended

# Appending may produce duplicate index labels (ABBV appears twice here).
# To ignore the index labels and create a default index instead:
# df1.append(df2, ignore_index=True)

             Sector   Price  BookValue
Symbol                                
MMM     Industrials  141.14     26.668
ABT     Health Care   39.60     15.573
ABBV    Health Care   53.95      2.954
A       Health Care   56.18     16.928
GAS       Utilities   52.98     32.462
ABBV    Health Care   53.95      2.954
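
On pandas 2.0 and later, where `DataFrame.append` is no longer available, the same result can be obtained with `pd.concat`. The sketch below uses two small hand-built frames (`top` and `more` are illustrative names, not objects from this notebook) and also shows the effect of `ignore_index=True`:

```python
import pandas as pd

# Small hand-built frames for illustration only
top = pd.DataFrame({'Price': [141.14, 39.60]}, index=['MMM', 'ABT'])
more = pd.DataFrame({'Price': [53.95, 39.60]}, index=['ABBV', 'ABT'])

# Rows of the first frame followed by rows of the second
stacked = pd.concat([top, more])

# The label ABT now appears twice in the index
has_duplicates = stacked.index.has_duplicates

# ignore_index=True discards the labels and builds a fresh integer index
renumbered = pd.concat([top, more], ignore_index=True)
```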

### concat()

In [284]:
# Using .concat
# Can concatenate more than two objects in a single call.
# Adds the ability to specify an axis (appending can be row or column based)

pd.concat([df1, df2])


             Sector   Price  BookValue
Symbol                                
MMM     Industrials  141.14     26.668
ABT     Health Care   39.60     15.573
ABBV    Health Care   53.95      2.954
A       Health Care   56.18     16.928
GAS       Utilities   52.98     32.462
ABBV    Health Care   53.95      2.954

In [285]:
# copy df2
df2_2 = df2.copy()

# add a column to df2_2 that is not in df1
df2_2.insert(3, 'Foo', pd.Series(0, index=df2.index))

# now concatenate and notice the duplicate index labels
pd.concat([df1, df2_2], sort=True)

        BookValue  Foo   Price       Sector
Symbol                                     
MMM        26.668  NaN  141.14  Industrials
ABT        15.573  NaN   39.60  Health Care
ABBV        2.954  NaN   53.95  Health Care
A          16.928  0.0   56.18  Health Care
GAS        32.462  0.0   52.98    Utilities
ABBV        2.954  0.0   53.95  Health Care

In [286]:
# specify keys
r = pd.concat([df1, df2_2], keys=['df1', 'df2'], sort=True)
r

            BookValue  Foo   Price       Sector
    Symbol                                     
df1 MMM        26.668  NaN  141.14  Industrials
    ABT        15.573  NaN   39.60  Health Care
    ABBV        2.954  NaN   53.95  Health Care
df2 A          16.928  0.0   56.18  Health Care
    GAS        32.462  0.0   52.98    Utilities
    ABBV        2.954  0.0   53.95  Health Care

In [287]:
# Concat columns
# first three rows, columns 0 and 1
df3 = sp500.iloc[:3, [0, 1]]

# first three rows, column 2
df4 = sp500.iloc[:3, [2]]

# concatenate along the column axis
pd.concat([df3, df4], axis=1)

             Sector   Price  BookValue
Symbol                                
MMM     Industrials  141.14     26.668
ABT     Health Care   39.60     15.573
ABBV    Health Care   53.95      2.954

Notice that columns are appended blindly, without regard to columns that already
exist, so the result can contain duplicate column names.
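
A short sketch of what happens when both inputs share a column name. The frames `left` and `right` are hand-built for illustration and are not part of the notebook's data flow:

```python
import pandas as pd

# Hand-built frames for illustration; both carry a Price column
left = pd.DataFrame({'Sector': ['Industrials', 'Health Care'],
                     'Price': [141.14, 39.60]}, index=['MMM', 'ABT'])
right = pd.DataFrame({'Price': [141.14, 39.60]}, index=['MMM', 'ABT'])

# Column-wise concatenation keeps both copies of Price
wide = pd.concat([left, right], axis=1)
column_names = list(wide.columns)

# Selecting the duplicated name now returns a DataFrame, not a Series
price_cols = wide['Price']
```

Because `wide['Price']` is now ambiguous, it is usually worth renaming or dropping one of the columns before concatenating.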

### Enlargement 

In [288]:
# copy the slice so that sp500 itself is not modified
ss = sp500[:3].copy()
 
# create a new row with index label FOO
# and assign some values to the columns via a list
ss.loc['FOO'] = ['the sector', 100, 110]
ss

             Sector   Price  BookValue
Symbol                                
MMM     Industrials  141.14     26.668
ABT     Health Care   39.60     15.573
ABBV    Health Care   53.95      2.954
FOO      the sector  100.00    110.000

Notice that:
* Change is made in place
* If FOO already exists as an index label, then the values in that row would be replaced.
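
Both points can be checked with a small sketch. It uses a hand-built two-column frame instead of `sp500`, and the replacement values are made up for illustration:

```python
import pandas as pd

# Hand-built frame for illustration only
ss = pd.DataFrame({'Sector': ['Industrials', 'Health Care'],
                   'Price': [141.14, 39.60]}, index=['MMM', 'ABT'])

# New label: a row is added in place
ss.loc['FOO'] = ['the sector', 100.0]
rows_after_add = len(ss)

# Existing label: the row's values are overwritten and no row is added
ss.loc['FOO'] = ['another sector', 200.0]
rows_after_overwrite = len(ss)
foo_price = ss.at['FOO', 'Price']
```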

## Removing rows from a DataFrame
* Using the .drop() method
* Boolean selection
* Selection using a slice

In [289]:
ss = sp500[:5].copy()

# drop() returns a copy
# drop rows with labels ABT and ACN
afterdrop = ss.drop(['ABT', 'ACN'])
afterdrop

             Sector   Price  BookValue
Symbol                                
MMM     Industrials  141.14     26.668
ABBV    Health Care   53.95      2.954
ACE      Financials  102.91     86.897

In [290]:
# Boolean selection can be used to remove rows from a DataFrame by creating a new
# DataFrame that contains only the rows to keep.

selection = sp500.Price > 300
 
# to make the output shorter, report the length of the Boolean Series (500)
# and the number of rows where Price > 300 (which is 10)
"{0} {1}".format(len(selection), selection.sum())

'500 10'
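
The cell above only builds the Boolean mask. A sketch of the removal step itself, on a hand-built frame (the label `XYZ` and its price are hypothetical):

```python
import pandas as pd

# Hand-built frame for illustration; XYZ is a hypothetical symbol
prices = pd.DataFrame({'Price': [141.14, 39.60, 530.66]},
                      index=['MMM', 'ABT', 'XYZ'])

selection = prices.Price > 300

# ~ inverts the mask, so only rows where Price is not above 300 are kept
kept = prices[~selection]
```

The original frame is left untouched; `kept` is a new DataFrame without the selected rows.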

In [291]:
# Using slicing
onlyFirstThree = sp500[:3].copy()
onlyFirstThree

             Sector   Price  BookValue
Symbol                                
MMM     Industrials  141.14     26.668
ABT     Health Care   39.60     15.573
ABBV    Health Care   53.95      2.954

## Changing scalar values in a DataFrame

In [292]:
subset = sp500[:3].copy()
subset.loc['MMM', 'Price'] = 10
subset.loc['ABBV', 'Price'] = 20
subset

             Sector  Price  BookValue
Symbol                               
MMM     Industrials   10.0     26.668
ABT     Health Care   39.6     15.573
ABBV    Health Care   20.0      2.954

## Summarized data and descriptive statistics
Reductive methods such as .mean() result in a single value when applied to a Series.
When applied to a DataFrame, the method is applied to each column by default, or to
each row when axis=1 is specified, and the result is a Series.

In [293]:
# .mean()
# .mean(axis=1)
# .var()
# .median()
# .min() 
# .max()
# .idxmin() # index label of the minimum value
# .idxmax()
# .mode()
# .cumsum() # calculate a cumulative sum
# .describe()  # summary statistics
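
A brief sketch of a few of these methods in use. It rebuilds the three sample rows of `omh.csv` shown earlier by hand; the variable names are illustrative:

```python
import pandas as pd

# Values taken from the omh.csv sample rows shown earlier
hist = pd.DataFrame({'MSFT': [48.62, 48.46, 48.08],
                     'AAPL': [115.07, 114.63, 115.93]})

col_means = hist.mean()               # one value per column, as a Series
row_means = hist.mean(axis=1)         # one value per row, as a Series
msft_min_at = hist['MSFT'].idxmin()   # index label of the smallest MSFT value
running = hist.cumsum()               # cumulative sums, same shape as hist
summary = hist.describe()             # count, mean, std, min, quartiles, max
```

Note that `.cumsum()` and `.describe()` are not reductions to a single value per column: the first returns a DataFrame of the same shape, and the second returns a small table of statistics.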