# DataFrame
A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).
The DataFrame has both a row and a column index; it can be thought of as a dict of Series all sharing the same index. Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dictionary, or some other collection of one-dimensional arrays.

In [1]:
# imports
import numpy as np
import pandas as pd

In [2]:
# there are many ways to construct a dataframe.
# the most common is from a dictionary of equal-length lists or NumPy arrays
data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002, 2003],
    'population': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
}
frame = pd.DataFrame(data)
frame

    state  year  population
0    Ohio  2000         1.5
1    Ohio  2001         1.7
2    Ohio  2002         3.6
3  Nevada  2001         2.4
4  Nevada  2002         2.9
5  Nevada  2003         3.2
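
To make the "each column can be a different value type" and "dict of Series sharing the same index" points concrete, here is a minimal sketch that rebuilds the same frame and inspects it. The variable names `dtypes` and `shares_index` are only illustrative.

```python
import pandas as pd

data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002, 2003],
    'population': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
}
frame = pd.DataFrame(data)

# each column keeps its own dtype (string-like, integer, float)
dtypes = frame.dtypes
print(dtypes)

# every column is a Series that shares the frame's row index
shares_index = all(frame[col].index.equals(frame.index) for col in frame.columns)
print(shares_index)
```

The exact dtype reported for the string column may depend on the installed pandas version, so the sketch prints it rather than assuming a value.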


In [3]:
# index
frame.index

RangeIndex(start=0, stop=6, step=1)

In [4]:
# keys
frame.keys()

Index(['state', 'year', 'population'], dtype='object')

In [5]:
# values
frame.values

array([['Ohio', 2000, 1.5],
       ['Ohio', 2001, 1.7],
       ['Ohio', 2002, 3.6],
       ['Nevada', 2001, 2.4],
       ['Nevada', 2002, 2.9],
       ['Nevada', 2003, 3.2]], dtype=object)
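
The `dtype=object` above appears because the columns have different types, so NumPy needs a dtype that can hold all of them. As a small sketch (using a shortened, made-up two-row frame), selecting only the numeric columns yields a homogeneous float array instead. `to_numpy()` is the method counterpart of the `values` attribute.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'state': ['Ohio', 'Nevada'],
    'year': [2000, 2001],
    'population': [1.5, 2.4],
})

# mixed column types fall back to a generic object array
mixed = frame.to_numpy()
print(mixed.dtype)

# integer and float columns share a common numeric dtype
numeric = frame[['year', 'population']].to_numpy()
print(numeric.dtype)
```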

In [6]:
# for large data frames, the head method selects only the first five rows
frame.head()

    state  year  population
0    Ohio  2000         1.5
1    Ohio  2001         1.7
2    Ohio  2002         3.6
3  Nevada  2001         2.4
4  Nevada  2002         2.9


In [7]:
# head(n) returns the first n rows instead of the default five
frame.head(3)

  state  year  population
0  Ohio  2000         1.5
1  Ohio  2001         1.7
2  Ohio  2002         3.6


In [8]:
# If you specify a sequence of columns, the DataFrame’s columns will be arranged in that order:
pd.DataFrame(data, columns=['year', 'state', 'population'])

   year   state  population
0  2000    Ohio         1.5
1  2001    Ohio         1.7
2  2002    Ohio         3.6
3  2001  Nevada         2.4
4  2002  Nevada         2.9
5  2003  Nevada         3.2


In [9]:
# a column in a data frame can be retrieved as a series either by dict-like notation or by attribute
frame2 = pd.DataFrame(data, columns=['year', 'state', 'population', 'debt'])
print('frame2:\n{}\n'.format(frame2))

# frame state
print('frame state:\n{}\n'.format(frame2['state']))

# frame year: this time the column is retrieved by attribute access instead of dict-like notation
print('frame year:\n{}\n'.format(frame2.year))

frame2:
   year   state  population debt
0  2000    Ohio         1.5  NaN
1  2001    Ohio         1.7  NaN
2  2002    Ohio         3.6  NaN
3  2001  Nevada         2.4  NaN
4  2002  Nevada         2.9  NaN
5  2003  Nevada         3.2  NaN

frame state:
0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object

frame year:
0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64



The series returned above have the same index as the DataFrame, and their name attribute has been appropriately set.
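
As a small sketch of both access styles: the returned Series carries the column name, and attribute access only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute. The `count` column below is a hypothetical example, chosen because `DataFrame.count` is a method.

```python
import pandas as pd

frame2 = pd.DataFrame({
    'year': [2000, 2001],
    'count': [10, 20],   # hypothetical column whose name clashes with DataFrame.count
})

# the Series returned for a column is named after that column
print(frame2['year'].name)

# attribute access finds the DataFrame method here, not the column
print(callable(frame2.count))

# dict-like notation always reaches the column
print(frame2['count'].tolist())
```

For that reason, dict-like notation is the safer default when column names are not under your control.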
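Rows can be retrieved by index label with the special `loc` attribute (or by integer position with `iloc`). With the default integer index used here, the label of each row is the same as its position.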

In [10]:
frame2.loc[3]

year            2001
state         Nevada
population       2.4
debt             NaN
Name: 3, dtype: object
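
With the default `RangeIndex`, labels and positions coincide, which hides the difference between label-based and position-based selection. The following sketch uses a hypothetical string index (`'a'`, `'b'`, `'c'`) to separate the two: `loc` takes a label and `iloc` takes an integer position.

```python
import pandas as pd

frame = pd.DataFrame(
    {'year': [2000, 2001, 2002], 'population': [1.5, 1.7, 3.6]},
    index=['a', 'b', 'c'],   # hypothetical string labels
)

# select the row whose index label is 'b'
by_label = frame.loc['b']

# select the row at integer position 1 (the second row)
by_position = frame.iloc[1]

print(by_label)
print(by_label.equals(by_position))
```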

In [11]:
# columns can be modified by assignment
# for example, the empty 'debt' column could be assigned a scalar value or an array of values
frame2['debt'] = 16.5
frame2

   year   state  population  debt
0  2000    Ohio         1.5  16.5
1  2001    Ohio         1.7  16.5
2  2002    Ohio         3.6  16.5
3  2001  Nevada         2.4  16.5
4  2002  Nevada         2.9  16.5
5  2003  Nevada         3.2  16.5


In [12]:
# or an array of values, one per row
frame2['debt'] = np.arange(6.)
frame2

   year   state  population  debt
0  2000    Ohio         1.5   0.0
1  2001    Ohio         1.7   1.0
2  2002    Ohio         3.6   2.0
3  2001  Nevada         2.4   3.0
4  2002  Nevada         2.9   4.0
5  2003  Nevada         3.2   5.0


When assigning lists or arrays to a column, the value’s length must match the length of the DataFrame. If you assign a Series, its labels will be realigned exactly to the DataFrame’s index, inserting missing values (NaN) at any index labels the Series does not contain:

In [13]:
val = pd.Series([-1.2, -1.5, -1.7], index=[1,3,5])
frame2['debt'] = val
frame2

   year   state  population  debt
0  2000    Ohio         1.5   NaN
1  2001    Ohio         1.7  -1.2
2  2002    Ohio         3.6   NaN
3  2001  Nevada         2.4  -1.5
4  2002  Nevada         2.9   NaN
5  2003  Nevada         3.2  -1.7
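
The two assignment rules can be contrasted in a short sketch. It assumes a small made-up frame: a list of the wrong length is rejected with a `ValueError`, while a Series of a different length is accepted because it is aligned on labels rather than on position.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002]})

# a list whose length differs from the number of rows is rejected
try:
    frame['debt'] = [1.0, 2.0]
    length_error = None
except ValueError as exc:
    length_error = exc
    print('ValueError:', exc)

# a Series is aligned on labels: label 7 does not exist in the frame and is dropped,
# while rows 0 and 2 have no matching label and become NaN
frame['debt'] = pd.Series([-1.2, -1.5], index=[1, 7])
print(frame)
```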


Assigning a column that doesn’t exist will create a new column. The del keyword will delete columns as with a dict.
As an example of del, I first add a new column of boolean values that is True where the state column is not 'Nevada':

In [19]:
# True for rows whose state is not 'Nevada'
frame2['eastern'] = frame2['state'] != 'Nevada'
frame2

   year   state  population  debt  eastern
0  2000    Ohio         1.5   NaN     True
1  2001    Ohio         1.7  -1.2     True
2  2002    Ohio         3.6   NaN     True
3  2001  Nevada         2.4  -1.5    False
4  2002  Nevada         2.9   NaN    False
5  2003  Nevada         3.2  -1.7    False


In [23]:
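# the del keyword removes the column; the try/except keeps the cell safe to re-run,
# since deleting a column that is already gone raises KeyError (as in the output below)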
try:
    del frame2['eastern']
except KeyError:
    print('Key does not exist')
frame2.columns

Key does not exist


Index(['year', 'state', 'population', 'debt'], dtype='object')
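
As an alternative to `del`, pandas also provides the `drop` and `pop` methods. The sketch below uses a small made-up frame: `drop` returns a new DataFrame, its `errors='ignore'` option avoids the `KeyError` handled above, and `pop` removes the column in place while returning it.

```python
import pandas as pd

frame = pd.DataFrame({
    'state': ['Ohio', 'Nevada'],
    'year': [2000, 2001],
    'eastern': [True, False],
})

# drop returns a new DataFrame and leaves the original untouched
without_eastern = frame.drop(columns=['eastern'])

# errors='ignore' skips labels that are not present instead of raising KeyError
still_fine = without_eastern.drop(columns=['eastern'], errors='ignore')

# pop removes the column in place and returns it as a Series
eastern = frame.pop('eastern')

print(list(without_eastern.columns))
print(list(frame.columns))
```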

In [28]:
# another common form of data is a nested dict of dicts:
pop = {
    'Nevada': {
        2001: 2.4, 2002: 2.9
    }, 'Ohio': {
        2000: 1.5, 2001: 1.7, 2002: 3.6
    }
}

# If the nested dict is passed to the DataFrame, pandas will interpret the outer dict keys
# as the columns and the inner keys as the row indices:
    
frame3 = pd.DataFrame(pop)
frame3

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5


In [29]:
# the dataframe can be transposed with similar syntax to a numpy array
frame3.T

        2001  2002  2000
Nevada   2.4   2.9   NaN
Ohio     1.7   3.6   1.5
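
The row labels built from the inner keys are not necessarily sorted, as the output for the nested dict shows. Here is a minimal sketch of two ways to control the row order, assuming the same `pop` dict: sorting afterwards with `sort_index`, or passing an explicit `index` to the constructor.

```python
import pandas as pd

pop = {
    'Nevada': {2001: 2.4, 2002: 2.9},
    'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6},
}
frame3 = pd.DataFrame(pop)

# sort_index returns a copy with the rows in ascending label order
sorted_frame = frame3.sort_index()
print(sorted_frame)

# an explicit index is used as given; a label with no data (2003) is filled with NaN
explicit = pd.DataFrame(pop, index=[2001, 2002, 2003])
print(explicit)
```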
