# DataFrame

A DataFrame is a tabular, spreadsheet-like data structure containing an ordered collection of columns. Each column can hold a different value type (numeric, string, boolean, etc.). One of the most common ways to construct a DataFrame is from a dict of equal-length lists or NumPy arrays.

In [1]:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np

In [2]:
employees = {"name": ["Kasper", "Ellen", "Lexi", "Cecilia", "Jason", "Andrew", "Doug"],
             "year": [2012, 2011, 2011, 2012, 2013, 2011, 2012],
             "school": ["Cal Poly", "UCB", "Stanford", "Cal Tech", "UCSB", "Stanford", "Michigan"]}
frame = DataFrame(employees)
frame

      name    school  year
0   Kasper  Cal Poly  2012
1    Ellen       UCB  2011
2     Lexi  Stanford  2011
3  Cecilia  Cal Tech  2012
4    Jason      UCSB  2013
5   Andrew  Stanford  2011
6     Doug  Michigan  2012
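
The same constructor also accepts NumPy arrays as the dict values. The following is a minimal sketch under that assumption; the `readings` data and its column names are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two equal-length NumPy arrays keyed by column name.
readings = {
    "sensor_id": np.array([101, 102, 103]),
    "temperature": np.array([21.5, 22.0, 19.8]),
}
readings_frame = pd.DataFrame(readings)

# Each column keeps its own dtype (integer and float here).
print(readings_frame.dtypes)
```

Arrays of unequal length would be rejected, just as unequal lists would.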


You can pass the names of the columns if you would like them to appear in a specific order.

In [3]:
DataFrame(employees, columns = ["name", "year", "school"])

      name  year    school
0   Kasper  2012  Cal Poly
1    Ellen  2011       UCB
2     Lexi  2011  Stanford
3  Cecilia  2012  Cal Tech
4    Jason  2013      UCSB
5   Andrew  2011  Stanford
6     Doug  2012  Michigan


Passing a column name that is not in the data produces a column of null values (the same way a missing label does with Series).

In [4]:
frame2 = DataFrame(employees, columns = ["name", "year", "school", "hometown"],
                       index = ["one", "two", "three", "four", "five", "six", "seven"])
frame2

          name  year    school  hometown
one     Kasper  2012  Cal Poly       NaN
two      Ellen  2011       UCB       NaN
three     Lexi  2011  Stanford       NaN
four   Cecilia  2012  Cal Tech       NaN
five     Jason  2013      UCSB       NaN
six     Andrew  2011  Stanford       NaN
seven     Doug  2012  Michigan       NaN


A column can be retrieved as a Series either by dict-like notation or by attribute.

In [5]:
frame2["name"]

one       Kasper
two        Ellen
three       Lexi
four     Cecilia
five       Jason
six       Andrew
seven       Doug
Name: name, dtype: object

In [6]:
frame2.name

one       Kasper
two        Ellen
three       Lexi
four     Cecilia
five       Jason
six       Andrew
seven       Doug
Name: name, dtype: object
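
The two access styles are not fully interchangeable. As a small sketch (the `scores` frame and its column names are hypothetical), bracket notation works for any column label, while attribute access fails for labels that are not valid Python identifiers and is shadowed by existing DataFrame methods:

```python
import pandas as pd

# Hypothetical frame: one label contains a space, the other shadows a method.
scores = pd.DataFrame({"first name": ["Ann", "Bo"], "count": [3, 5]})

# Bracket notation works for any column label.
first_names = scores["first name"]
counts = scores["count"]

# Attribute access finds the DataFrame.count method, not the "count" column.
is_method = callable(scores.count)
print(is_method)
```

For that reason bracket notation is usually the safer choice in scripts.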

Rows can be retrieved by label in a similar way, using the loc indexer.

In [7]:
frame2.loc["four"]

name         Cecilia
year            2012
school      Cal Tech
hometown         NaN
Name: four, dtype: object
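
In current pandas, row selection is split between two indexers: `loc` selects by index label and `iloc` by integer position (the older `ix` indexer was removed in pandas 1.0). A minimal sketch, using a hypothetical `staff` frame:

```python
import pandas as pd

# Hypothetical three-row frame with string index labels.
staff = pd.DataFrame(
    {"name": ["Kasper", "Ellen", "Lexi"], "year": [2012, 2011, 2011]},
    index=["one", "two", "three"],
)

by_label = staff.loc["two"]     # row whose index label is "two"
by_position = staff.iloc[1]     # second row, counting from zero

# Both expressions return the same row as a Series.
print(by_label.equals(by_position))
```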

You can modify columns by assignment. A scalar value is broadcast to every row; a list or array must have the same length as the DataFrame.

In [8]:
frame2["hometown"] = "SF"
frame2

          name  year    school  hometown
one     Kasper  2012  Cal Poly        SF
two      Ellen  2011       UCB        SF
three     Lexi  2011  Stanford        SF
four   Cecilia  2012  Cal Tech        SF
five     Jason  2013      UCSB        SF
six     Andrew  2011  Stanford        SF
seven     Doug  2012  Michigan        SF


In [9]:
frame2["year"] = np.arange(7.)
frame2

          name  year    school  hometown
one     Kasper     0  Cal Poly        SF
two      Ellen     1       UCB        SF
three     Lexi     2  Stanford        SF
four   Cecilia     3  Cal Tech        SF
five     Jason     4      UCSB        SF
six     Andrew     5  Stanford        SF
seven     Doug     6  Michigan        SF


When assigning a list or array to a column, its length must match the length of the DataFrame. When assigning a Series, pandas aligns it to the DataFrame's index and inserts null values for any labels the Series lacks.

In [10]:
exp = Series([3, 3, 2, 1, 7], index = ["two", "three", "four", "five", "six"])
frame2["year"] = exp
frame2

          name  year    school  hometown
one     Kasper   NaN  Cal Poly        SF
two      Ellen   3.0       UCB        SF
three     Lexi   3.0  Stanford        SF
four   Cecilia   2.0  Cal Tech        SF
five     Jason   1.0      UCSB        SF
six     Andrew   7.0  Stanford        SF
seven     Doug   NaN  Michigan        SF
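
The difference between the two assignment rules can be sketched as follows. The small `frame` and its labels are hypothetical; the point is that a wrong-length list raises a `ValueError`, while a Series is aligned on labels instead.

```python
import pandas as pd

frame = pd.DataFrame({"name": ["Kasper", "Ellen", "Lexi"]},
                     index=["one", "two", "three"])

# A list or array of the wrong length is rejected.
error_message = None
try:
    frame["year"] = [2012, 2011]          # 2 values for 3 rows
except ValueError as err:
    error_message = str(err)

# A Series is aligned on index labels: rows without a matching label
# get NaN, and Series labels absent from the frame ("nine") are ignored.
frame["year"] = pd.Series([2011, 2010], index=["two", "nine"])
print(frame)
```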


Assigning a column that doesn't exist creates a new column. You can also use del to delete a column.

In [11]:
frame2["status"] = "intern"
frame2

          name  year    school  hometown  status
one     Kasper   NaN  Cal Poly        SF  intern
two      Ellen   3.0       UCB        SF  intern
three     Lexi   3.0  Stanford        SF  intern
four   Cecilia   2.0  Cal Tech        SF  intern
five     Jason   1.0      UCSB        SF  intern
six     Andrew   7.0  Stanford        SF  intern
seven     Doug   NaN  Michigan        SF  intern


In [12]:
del frame2["status"]
frame2

          name  year    school  hometown
one     Kasper   NaN  Cal Poly        SF
two      Ellen   3.0       UCB        SF
three     Lexi   3.0  Stanford        SF
four   Cecilia   2.0  Cal Tech        SF
five     Jason   1.0      UCSB        SF
six     Andrew   7.0  Stanford        SF
seven     Doug   NaN  Michigan        SF
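
`del` changes the DataFrame in place. Two related options are sketched below, assuming a hypothetical two-column `frame`: `drop` returns a new DataFrame without the column, and `pop` removes the column in place while handing it back as a Series.

```python
import pandas as pd

frame = pd.DataFrame({"name": ["Kasper", "Ellen"],
                      "status": ["intern", "intern"]})

# drop returns a new DataFrame and leaves the original untouched.
without_status = frame.drop(columns=["status"])
columns_after_drop = list(frame.columns)

# pop removes the column in place and returns it as a Series.
status = frame.pop("status")
columns_after_pop = list(frame.columns)
print(columns_after_drop, columns_after_pop)
```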


Another common form of input data is a nested dict of dicts. The outer keys become the columns and the inner keys become the row index.

In [13]:
wins = {"Giants": {2009: 88, 2010: 92, 2011: 86, 2012: 94, 2013: 76, 2014: 88},
        "Dodgers": {2010: 80, 2011: 82, 2012: 86, 2013: 92, 2014: 94},
        "Padres": {2010: 90, 2011: 71, 2012: 76, 2013: 76, 2014: 77}}
frame3 = DataFrame(wins)
frame3

      Dodgers  Giants  Padres
2009      NaN      88     NaN
2010     80.0      92    90.0
2011     82.0      86    71.0
2012     86.0      94    76.0
2013     92.0      76    76.0
2014     94.0      88    77.0


You can transpose the results if you want to flip columns with indexes.

In [14]:
frame3.T

         2009  2010  2011  2012  2013  2014
Dodgers   NaN    80    82    86    92    94
Giants   88.0    92    86    94    76    88
Padres    NaN    90    71    76    76    77


The keys of the inner dicts are combined to form the index of the result, unless a specific index is specified.

In [15]:
DataFrame(wins, index = [2008, 2009, 2010, 2011])

      Dodgers  Giants  Padres
2008      NaN     NaN     NaN
2009      NaN    88.0     NaN
2010     80.0    92.0    90.0
2011     82.0    86.0    71.0


The index and column names can also be displayed if their name attributes are set.

In [16]:
frame3.index.name = "year"
frame3.columns.name = "team"
frame3

team  Dodgers  Giants  Padres
year
2009      NaN      88     NaN
2010     80.0      92    90.0
2011     82.0      86    71.0
2012     86.0      94    76.0
2013     92.0      76    76.0
2014     94.0      88    77.0


The values attribute returns the data contained in the DataFrame as a two-dimensional NumPy array, just as it returns a one-dimensional array for a Series.

In [17]:
frame3.values

array([[ nan,  88.,  nan],
       [ 80.,  92.,  90.],
       [ 82.,  86.,  71.],
       [ 86.,  94.,  76.],
       [ 92.,  76.,  76.],
       [ 94.,  88.,  77.]])
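
When the columns have different types, the returned array needs a single dtype that can hold all of them. A minimal sketch, using two hypothetical frames, shows a purely numeric frame yielding a numeric array and a mixed frame yielding an `object` array:

```python
import pandas as pd

numeric = pd.DataFrame({"wins": [88, 92], "losses": [74, 70]})
mixed = pd.DataFrame({"team": ["Giants", "Dodgers"], "wins": [88, 92]})

numeric_array = numeric.values   # homogeneous integer array
mixed_array = mixed.values       # object dtype, to hold strings and ints

print(numeric_array.dtype, mixed_array.dtype)
```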

### Index Objects

Index objects in pandas hold the axis labels and other metadata such as the axis name. Index objects are immutable, meaning they can't be modified in place. This allows them to be safely shared among data structures. Each Index has methods for set logic and for building new indexes from existing ones. Some examples are as follows:
- append (concatenate with additional index objects)
- difference (set difference; called diff in early pandas versions)
- intersection (set intersection)
- union (set union)
- delete (delete element at index i)

IMPORTANT: All of these methods create a new index; they do not modify the old index.
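
The methods listed above can be sketched on two small, hypothetical indexes. Note that in current pandas the set difference method is named `difference`.

```python
import pandas as pd

idx = pd.Index(["a", "b", "c"])
other = pd.Index(["c", "d"])

# Item assignment is not allowed on an Index.
try:
    idx[0] = "z"
    mutated = True
except TypeError:
    mutated = False

combined = idx.append(other)        # concatenation, duplicates kept
both = idx.intersection(other)      # labels present in both
either = idx.union(other)           # labels present in either
only_idx = idx.difference(other)    # labels in idx but not in other
shorter = idx.delete(0)             # drops the element at position 0

# The original index is unchanged by all of the calls above.
print(list(idx))
```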

### Reindexing

A pandas method called reindex allows you to create a new object with the data conformed to a new index. Any label in the new index that is missing from the original object is filled with a null value unless a fill value is specified.

In [20]:
object1 = Series([13, 14, 1, 22], index = ["e", "r", "i", "c"])
object1

e    13
r    14
i     1
c    22
dtype: int64

In [24]:
object2 = object1.reindex(["o", "c", "i", "m", "e"])
object2

o   NaN
c    22
i     1
m   NaN
e    13
dtype: float64

In [25]:
object2 = object1.reindex(["o", "c", "i", "m", "e"], fill_value = 0)
object2

o     0
c    22
i     1
m     0
e    13
dtype: int64

You may want to forward fill or backfill values when reindexing. The method option accepts ffill for forward fill and bfill for backfill. This can be really useful for ordered data such as time series.

In [26]:
cum_time = Series(["wait time", "setup time", "queue time", "load time", "run time"], index = [0, 3, 5, 6, 8])
cum_time.reindex(range(12), method = "ffill")

0      wait time
1      wait time
2      wait time
3     setup time
4     setup time
5     queue time
6      load time
7      load time
8       run time
9       run time
10      run time
11      run time
dtype: object
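
For comparison, a backfill sketch on a shortened, hypothetical version of the same kind of Series. Each new label takes the value of the next existing label, so labels past the last original one have nothing to borrow from and stay null.

```python
import pandas as pd

cum_time = pd.Series(["wait time", "setup time", "run time"], index=[0, 3, 5])

# Back fill: labels 1 and 2 take the value at 3, label 4 takes the value at 5.
back_filled = cum_time.reindex(range(7), method="bfill")

# Label 6 comes after the last original label, so it is left as NaN.
print(back_filled)
```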

reindex can also alter the rows, the columns, or both of a DataFrame. After the DataFrame is constructed, passing a sequence reindexes the rows, and the columns keyword reindexes the columns.

In [28]:
frame = DataFrame(np.arange(16).reshape((4,4)), index = ["a", "b", "c", "d"], columns = ["e", "f", "g", "h"])
frame

    e   f   g   h
a   0   1   2   3
b   4   5   6   7
c   8   9  10  11
d  12  13  14  15
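
Reindexing this frame along each axis might look like the sketch below. The new labels `"z"` and `"x"` are hypothetical additions chosen to show how missing entries are filled.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=["a", "b", "c", "d"],
                     columns=["e", "f", "g", "h"])

# Reindex rows: "z" is not in the original, so its row is filled with NaN.
rows = frame.reindex(["d", "b", "z"])

# Reindex columns: keep "h" and "e" in that order, fill new column "x" with 0.
cols = frame.reindex(columns=["h", "e", "x"], fill_value=0)

print(rows)
print(cols)
```

Both calls return new DataFrames; `frame` itself is left unchanged.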
