# 5.1 Introduction to pandas Data Structures

To get started with pandas, you will need to get comfortable with its two workhorse data structures: **Series** and **DataFrame**.

## Series

A Series is a one-dimensional array-like object containing a sequence of values (of types similar to NumPy's) and an associated array of data labels, called its index.

In [151]:
import numpy as np  # used later for np.exp and np.arange
import pandas as pd

In [152]:
obj = pd.Series([4, 7, -5, 3])

In [153]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [154]:
obj.values

array([ 4,  7, -5,  3])

In [155]:
obj.index

RangeIndex(start=0, stop=4, step=1)

In [156]:
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

In [157]:
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [159]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

In [160]:
obj2["a"]

-5

In [161]:
obj2["b"]

7

In [162]:
obj2[["a", "d", "c"]]

a   -5
d    4
c    3
dtype: int64

In [163]:
obj2[obj2 > 2]

d    4
b    7
c    3
dtype: int64

In [164]:
obj2 * 2

d     8
b    14
a   -10
c     6
dtype: int64

In [165]:
np.exp(obj2)

d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

In [166]:
"b" in obj2

True

In [167]:
"e" in obj2

False
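The two membership checks above test index labels, much as `in` tests the keys of a dict, rather than the stored values. A minimal sketch of the difference, rebuilding `obj2` so the block runs on its own:

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

# `in` looks at the index labels, like dict keys.
label_found = "b" in obj2        # "b" is a label
value_as_label = 7 in obj2       # 7 is a value, not a label

# To test the stored values, check the underlying array instead.
value_found = 7 in obj2.values
```

Here `label_found` and `value_found` are true, while `value_as_label` is false.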

Should you have data contained in a Python dict, you can create a Series from it by passing the dict:

In [168]:
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}

In [169]:
sdata

{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [170]:
obj3 = pd.Series(sdata)

In [171]:
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
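As a small illustration of the dict round trip, the sketch below rebuilds `obj3` and converts it back with `to_dict`. It assumes a recent pandas and Python, where dict insertion order is preserved.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)

# The index labels come from the dict keys.
labels = list(obj3.index)

# to_dict() converts the Series back into a plain Python dict.
round_trip = obj3.to_dict()
```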

When you pass only a dict, the index of the resulting Series follows the dict's key insertion order, as the output above shows (older pandas versions sorted the keys instead). You can override this by passing an index with the labels in the order you want them to appear in the resulting Series:

In [172]:
states = ["California", 'Ohio', "Utah"]

In [173]:
obj4 = pd.Series(data=sdata, index=states)

In [174]:
obj4

California        NaN
Ohio          35000.0
Utah           5000.0
dtype: float64

In [175]:
obj3.index

Index(['Ohio', 'Texas', 'Oregon', 'Utah'], dtype='object')

In [176]:
obj3.reindex(index=states) # reindex by passing a list

California        NaN
Ohio          35000.0
Utah           5000.0
dtype: float64

In [177]:
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

Since no value for 'California' was found, it appears as **NaN** (not a number), which pandas uses to mark missing or NA values. Since 'Texas' and 'Oregon' were not included in **states**, they are excluded from the resulting object.

In [178]:
pd.isnull(obj4)

California     True
Ohio          False
Utah          False
dtype: bool

In [179]:
pd.notnull(obj4)

California    False
Ohio           True
Utah           True
dtype: bool

In [180]:
pd.notna(obj4)

California    False
Ohio           True
Utah           True
dtype: bool
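The same checks are also available as methods on the Series itself. A minimal sketch, rebuilding `obj4` from the data above so it runs standalone:

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj4 = pd.Series(sdata, index=["California", "Ohio", "Utah"])

# Method form of pd.isna / pd.notna.
missing_mask = obj4.isna()
n_missing = int(missing_mask.sum())   # count of missing entries

# A boolean mask can be used to keep only the entries that are present.
present = obj4[obj4.notna()]
```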

Both the Series object itself and its index have a **name** attribute, which integrates with other key areas of pandas functionality:

In [181]:
obj4.name = "population"

In [182]:
obj4.index.name = 'state'

In [183]:
obj4

state
California        NaN
Ohio          35000.0
Utah           5000.0
Name: population, dtype: float64
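One place where these names show up, offered here as an illustrative sketch rather than an exhaustive list, is `reset_index`, which turns the index into a regular column and uses both names as column labels:

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj4 = pd.Series(sdata, index=["California", "Ohio", "Utah"])
obj4.name = "population"
obj4.index.name = "state"

# The index name and the Series name become the column labels.
table = obj4.reset_index()
column_labels = list(table.columns)
```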

A Series's index can be altered in-place by assignment:

In [184]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [185]:
obj.index = ["Bob", "Steve", "Jeff", "Ryan"]

In [186]:
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
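The new index must have the same number of labels as the Series has values. The sketch below shows what happens otherwise; the two-label list is a deliberately wrong, made-up input.

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

try:
    obj.index = ["Bob", "Steve"]   # only 2 labels for 4 values
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Inspect the index after the failed assignment.
index_after = list(obj.index)
```

The assignment raises a `ValueError`, and the Series keeps its original index.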

## DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). 

The DataFrame has both a row and column index.

There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays:

In [187]:
data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
}

In [188]:
frame = pd.DataFrame(data)

In [189]:
frame

|   | state | year | pop |
|---|---|---|---|
| 0 | Ohio | 2000 | 1.5 |
| 1 | Ohio | 2001 | 1.7 |
| 2 | Ohio | 2002 | 3.6 |
| 3 | Nevada | 2001 | 2.4 |
| 4 | Nevada | 2002 | 2.9 |
| 5 | Nevada | 2003 | 3.2 |
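To make the "row and column index" point concrete, here is a small sketch that inspects both indexes of a DataFrame. It uses a shortened two-row version of the data purely for brevity.

```python
import pandas as pd

frame = pd.DataFrame(
    {"state": ["Ohio", "Nevada"], "year": [2000, 2001], "pop": [1.5, 2.4]}
)

row_labels = list(frame.index)        # default integer row labels
column_labels = list(frame.columns)   # labels taken from the dict keys
shape = frame.shape                   # (number of rows, number of columns)
```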


In [190]:
frame.head()

|   | state | year | pop |
|---|---|---|---|
| 0 | Ohio | 2000 | 1.5 |
| 1 | Ohio | 2001 | 1.7 |
| 2 | Ohio | 2002 | 3.6 |
| 3 | Nevada | 2001 | 2.4 |
| 4 | Nevada | 2002 | 2.9 |


In [191]:
frame.tail()

|   | state | year | pop |
|---|---|---|---|
| 1 | Ohio | 2001 | 1.7 |
| 2 | Ohio | 2002 | 3.6 |
| 3 | Nevada | 2001 | 2.4 |
| 4 | Nevada | 2002 | 2.9 |
| 5 | Nevada | 2003 | 3.2 |


If you specify a sequence of columns, the DataFrame's columns will be arranged in that order:

In [192]:
pd.DataFrame(data, columns=["year", "state", "pop"])

|   | year | state | pop |
|---|---|---|---|
| 0 | 2000 | Ohio | 1.5 |
| 1 | 2001 | Ohio | 1.7 |
| 2 | 2002 | Ohio | 3.6 |
| 3 | 2001 | Nevada | 2.4 |
| 4 | 2002 | Nevada | 2.9 |
| 5 | 2003 | Nevada | 3.2 |


If you pass a column that isn't contained in the dict, it will appear with missing values in the result:

In [193]:
frame2 = pd.DataFrame(
    data,
    columns=["year", "state", "pop", "debt"],
    index=["one", "two", "three", "four", "five", "six"],
)

In [194]:
frame2

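|   | year | state | pop | debt |
|---|---|---|---|---|
| one | 2000 | Ohio | 1.5 | NaN |
| two | 2001 | Ohio | 1.7 | NaN |
| three | 2002 | Ohio | 3.6 | NaN |
| four | 2001 | Nevada | 2.4 | NaN |
| five | 2002 | Nevada | 2.9 | NaN |
| six | 2003 | Nevada | 3.2 | NaN |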


In [195]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

In [196]:
frame2["state"]

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

In [197]:
frame2.index

Index(['one', 'two', 'three', 'four', 'five', 'six'], dtype='object')

In [198]:
frame2.state

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

**frame2[column]** works for any column name, but **frame2.column** only works when the column name is a valid Python variable name and does not conflict with any DataFrame method name.
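The `pop` column used in this section is a handy case where attribute access does not return the column, because `DataFrame.pop` is an existing method. A minimal sketch; the `"total debt"` column is a hypothetical name added only to show a label that is not a valid identifier:

```python
import pandas as pd

frame = pd.DataFrame(
    {"state": ["Ohio", "Nevada"], "pop": [1.5, 2.4], "total debt": [0.0, 1.0]}
)

by_key = frame["pop"]          # the column, as a Series
by_attr = frame.pop            # the DataFrame.pop method, not the column
spaced = frame["total debt"]   # bracket syntax handles the space in the name
state_attr = frame.state       # attribute access works for "state"
```

For this reason, bracket syntax is the safer default whenever column names are not fully under your control.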

In [200]:
frame2.loc["three"]

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

In [201]:
frame2.loc["three", "pop"]

3.6
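Rows can also be selected by integer position with `iloc`, the positional counterpart of the label-based `loc` used above. A short sketch on the same data, omitting the empty `debt` column for brevity:

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
}
frame2 = pd.DataFrame(
    data,
    columns=["year", "state", "pop"],
    index=["one", "two", "three", "four", "five", "six"],
)

by_label = frame2.loc["three"]        # select the row by its label
by_position = frame2.iloc[2]          # select the same row by position
same = by_label.equals(by_position)

cell_by_position = frame2.iloc[2, 2]  # row "three", column "pop"
```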

Columns can be modified by assignment. For example, the empty **'debt'** column could be assigned a scalar value or an array of values:

In [202]:
frame2["debt"] = 16.5

In [203]:
frame2

|   | year | state | pop | debt |
|---|---|---|---|---|
| one | 2000 | Ohio | 1.5 | 16.5 |
| two | 2001 | Ohio | 1.7 | 16.5 |
| three | 2002 | Ohio | 3.6 | 16.5 |
| four | 2001 | Nevada | 2.4 | 16.5 |
| five | 2002 | Nevada | 2.9 | 16.5 |
| six | 2003 | Nevada | 3.2 | 16.5 |


In [204]:
frame2["debt"] = np.arange(6.0)

In [205]:
frame2

|   | year | state | pop | debt |
|---|---|---|---|---|
| one | 2000 | Ohio | 1.5 | 0.0 |
| two | 2001 | Ohio | 1.7 | 1.0 |
| three | 2002 | Ohio | 3.6 | 2.0 |
| four | 2001 | Nevada | 2.4 | 3.0 |
| five | 2002 | Nevada | 2.9 | 4.0 |
| six | 2003 | Nevada | 3.2 | 5.0 |


When you assign a list or array to a column, the value's length must match the length of the DataFrame. If you assign a Series, its labels will be realigned exactly to the DataFrame's index, inserting missing values in any holes:

In [206]:
val = pd.Series([-1.2, -1.5, -1.7], index=["one", "two", "five"])

In [207]:
frame2["debt"] = val

In [208]:
frame2

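|   | year | state | pop | debt |
|---|---|---|---|---|
| one | 2000 | Ohio | 1.5 | -1.2 |
| two | 2001 | Ohio | 1.7 | -1.5 |
| three | 2002 | Ohio | 3.6 | NaN |
| four | 2001 | Nevada | 2.4 | NaN |
| five | 2002 | Nevada | 2.9 | -1.7 |
| six | 2003 | Nevada | 3.2 | NaN |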
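The contrast between the two rules can be sketched on a smaller, made-up three-row frame: a list of the wrong length is rejected, while a shorter Series is accepted and aligned on its labels.

```python
import pandas as pd

frame2 = pd.DataFrame({"year": [2000, 2001, 2002]}, index=["one", "two", "three"])

# A plain list must match the number of rows exactly.
try:
    frame2["debt"] = [1.0, 2.0]   # 2 values for 3 rows
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# A Series is matched on index labels; unmatched rows become NaN.
frame2["debt"] = pd.Series([-1.2], index=["two"])
missing = frame2["debt"].isna().tolist()
```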


In [212]:
frame2["eastern"] = frame2.state == "Ohio"

In [213]:
frame2.columns

Index(['year', 'state', 'pop', 'debt', 'eastern'], dtype='object')

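New columns cannot be created with the **frame2.eastern** attribute syntax: assigning to a new attribute name sets a plain Python attribute on the object (pandas emits a warning) instead of adding a column. Use the bracket syntax, as in the cells above and below.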

In [217]:
frame2["western"] = frame2.state == "Nevada"

In [218]:
frame2.columns

Index(['year', 'state', 'pop', 'debt', 'eastern', 'western'], dtype='object')

In [219]:
frame2

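|   | year | state | pop | debt | eastern | western |
|---|---|---|---|---|---|---|
| one | 2000 | Ohio | 1.5 | -1.2 | True | False |
| two | 2001 | Ohio | 1.7 | -1.5 | True | False |
| three | 2002 | Ohio | 3.6 | NaN | True | False |
| four | 2001 | Nevada | 2.4 | NaN | False | True |
| five | 2002 | Nevada | 2.9 | -1.7 | False | True |
| six | 2003 | Nevada | 3.2 | NaN | False | True |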
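A quick way to check how attribute assignment behaves is to try it on a small stand-in frame. In the sketch below the warning filter is only there to keep the output quiet; recent pandas versions are expected to warn about this pattern.

```python
import warnings

import pandas as pd

frame2 = pd.DataFrame({"state": ["Ohio", "Nevada"]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # silence the pandas warning for this demo
    frame2.eastern = frame2["state"] == "Ohio"   # sets an attribute only

attr_created_column = "eastern" in frame2.columns

# Bracket syntax is what actually adds the column.
frame2["eastern"] = frame2["state"] == "Ohio"
bracket_created_column = "eastern" in frame2.columns
```

Only the bracket assignment leaves `"eastern"` in `frame2.columns`.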


The **del** keyword can then be used to remove these columns:

In [226]:
del frame2["eastern"]

In [228]:
del frame2["western"]

In [229]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')
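Where modifying the frame in place is not wanted, `drop` is one alternative to `del`: it returns a new DataFrame without the listed columns. A minimal sketch on a small stand-in frame:

```python
import pandas as pd

frame2 = pd.DataFrame(
    {"state": ["Ohio", "Nevada"], "eastern": [True, False]}
)

# drop() returns a new object; frame2 itself keeps the column.
trimmed = frame2.drop(columns=["eastern"])

original_columns = list(frame2.columns)
trimmed_columns = list(trimmed.columns)
```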

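The column returned from indexing a DataFrame has traditionally been a *view* on the underlying data, not a copy, so in-place modifications to the Series could be reflected in the DataFrame. With Copy-on-Write (the default behavior starting in pandas 3.0), such modifications no longer propagate. Either way, the column can be explicitly copied with the Series's **copy** method. Assignment works differently: assigning a Series to a column stores its values in the DataFrame, so modifying **val** afterward leaves **frame2** unchanged, as the cells below show.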

In [240]:
val = pd.Series([-1.2, -1.5, -1.7], index=["one", "two", "five"])

In [241]:
frame2["debt"] = val

In [242]:
frame2

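|   | year | state | pop | debt |
|---|---|---|---|---|
| one | 2000 | Ohio | 1.5 | -1.2 |
| two | 2001 | Ohio | 1.7 | -1.5 |
| three | 2002 | Ohio | 3.6 | NaN |
| four | 2001 | Nevada | 2.4 | NaN |
| five | 2002 | Nevada | 2.9 | -1.7 |
| six | 2003 | Nevada | 3.2 | NaN |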


In [246]:
val["one"] = 5

In [247]:
val

one     5.0
two    -1.5
five   -1.7
dtype: float64

In [249]:
frame2

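|   | year | state | pop | debt |
|---|---|---|---|---|
| one | 2000 | Ohio | 1.5 | -1.2 |
| two | 2001 | Ohio | 1.7 | -1.5 |
| three | 2002 | Ohio | 3.6 | NaN |
| four | 2001 | Nevada | 2.4 | NaN |
| five | 2002 | Nevada | 2.9 | -1.7 |
| six | 2003 | Nevada | 3.2 | NaN |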
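The two copy-related behaviors discussed in this part can be sketched together on a small, made-up frame: an explicit `copy` of a column is independent of the DataFrame, and a Series assigned to a column can be changed afterward without affecting the DataFrame.

```python
import pandas as pd

frame2 = pd.DataFrame({"pop": [1.5, 1.7, 3.6]}, index=["one", "two", "three"])

# An explicit copy of a column is independent of the DataFrame.
pop_copy = frame2["pop"].copy()
pop_copy["one"] = 99.0
frame_value_after_copy_edit = frame2.loc["one", "pop"]

# Assigning a Series stores aligned values in the frame, so later
# changes to the Series do not reach the DataFrame.
val = pd.Series([-1.2, -1.5], index=["one", "two"])
frame2["debt"] = val
val["one"] = 5
frame_value_after_val_edit = frame2.loc["one", "debt"]
```

Both lookups return the values originally stored in the frame (1.5 and -1.2).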
