# 5.1 Introduction to pandas Data Structures

To get started with pandas, you will need to get comfortable with its two workhorse data structures: **Series** and **DataFrame**.

## Series

A Series is a one-dimensional array-like object containing a sequence of values (of types similar to NumPy's) and an associated array of data labels, called its index.

In [1]:
import numpy as np  # needed for np.exp later in this section
import pandas as pd

In [2]:
obj = pd.Series([4, 7, -5, 3])

In [3]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [4]:
obj.values

array([ 4,  7, -5,  3])

In [5]:
obj.index

RangeIndex(start=0, stop=4, step=1)

In [6]:
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

In [7]:
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [9]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

In [10]:
obj2["a"]

-5

In [11]:
obj2["b"]

7

In [12]:
obj2[["a", "d", "c"]]

a   -5
d    4
c    3
dtype: int64

In [13]:
obj2[obj2 > 2]

d    4
b    7
c    3
dtype: int64

In [14]:
obj2 * 2

d     8
b    14
a   -10
c     6
dtype: int64

In [15]:
np.exp(obj2)

d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

In [16]:
"b" in obj2

True

In [17]:
"e" in obj2

False

Should you have data contained in a Python dict, you can create a Series from it by passing the dict:

In [18]:
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}

In [19]:
sdata

{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [20]:
obj3 = pd.Series(sdata)

In [21]:
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

When you pass only a dict, the index of the resulting Series follows the dict's key insertion order. You can override this by passing an index with the labels in the order you want them to appear in the resulting Series:

In [22]:
states = ["California", "Ohio", "Utah"]

In [23]:
obj4 = pd.Series(data=sdata, index=states)

In [24]:
obj4

California        NaN
Ohio          35000.0
Utah           5000.0
dtype: float64

In [25]:
obj3.index

Index(['Ohio', 'Texas', 'Oregon', 'Utah'], dtype='object')

In [26]:
obj3.reindex(index=states) # reindex by passing a list

California        NaN
Ohio          35000.0
Utah           5000.0
dtype: float64

In [27]:
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

Since no value for 'California' was found, it appears as **NaN** (not a number), which pandas uses to mark missing or NA values. Since 'Texas' and 'Oregon' were not included in **states**, they are excluded from the resulting object.
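To make the label matching concrete, here is a minimal sketch that rebuilds the same Series and lists which requested labels had no value and which dict keys were left out. The helper names `missing` and `dropped` are illustrative only and not part of pandas.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
states = ["California", "Ohio", "Utah"]

# Values are matched to the requested labels by key, not by position.
obj4 = pd.Series(sdata, index=states)

# Requested labels with no entry in the dict end up as NaN.
missing = [label for label in states if label not in sdata]

# Dict keys that were not requested do not appear in the result.
dropped = [key for key in sdata if key not in states]

print(missing)     # ['California']
print(dropped)     # ['Texas', 'Oregon']
print(obj4.dtype)  # float64, because NaN cannot be stored in an int64 Series
```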

In [28]:
pd.isnull(obj4)

California     True
Ohio          False
Utah          False
dtype: bool

In [29]:
pd.notnull(obj4)

California    False
Ohio           True
Utah           True
dtype: bool

In [30]:
pd.notna(obj4)

California    False
Ohio           True
Utah           True
dtype: bool
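The same checks are also available as methods on the Series itself. The sketch below, which assumes the same `sdata` and labels as above, uses `isna` and `notna` to count the missing entries and to keep only the labels that have a value.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj4 = pd.Series(sdata, index=["California", "Ohio", "Utah"])

# Boolean mask marking the missing entries (same result as pd.isnull(obj4)).
mask = obj4.isna()

# True counts as 1, so summing the mask counts the missing values.
n_missing = int(mask.sum())

# Boolean indexing keeps only the entries that have a value.
present = obj4[obj4.notna()]

print(n_missing)            # 1
print(list(present.index))  # ['Ohio', 'Utah']
```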

Both the Series object itself and its index have a **name** attribute, which integrates with other key areas of pandas functionality:

In [31]:
obj4.name = "population"

In [33]:
obj4.index.name = 'state'

In [34]:
obj4

state
California        NaN
Ohio          35000.0
Utah           5000.0
Name: population, dtype: float64
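One place where these names show up is conversion to a DataFrame. As a small sketch (the variable names `as_columns` and `as_frame` are illustrative), `reset_index` turns the index into a regular column labeled with the index name, and the values land in a column labeled with the Series name:

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj4 = pd.Series(sdata, index=["California", "Ohio", "Utah"])
obj4.name = "population"
obj4.index.name = "state"

# The index becomes a column named "state"; the values become "population".
as_columns = obj4.reset_index()
print(list(as_columns.columns))  # ['state', 'population']

# to_frame keeps the index in place and uses the Series name as the column label.
as_frame = obj4.to_frame()
print(list(as_frame.columns))    # ['population']
print(as_frame.index.name)       # state
```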

A Series's index can be altered in place by assignment:

In [35]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [36]:
obj.index = ["Bob", "Steve", "Jeff", "Ryan"]

In [37]:
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
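Assignment only relabels the existing values, so the new index has to have the same length as the Series. A minimal sketch of both the successful case and the length-mismatch case (the variable `mismatch_error` is just for illustration):

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

# Four labels for four values: the Series is relabeled in place.
obj.index = ["Bob", "Steve", "Jeff", "Ryan"]
print(obj["Jeff"])  # -5

# Two labels for four values: pandas refuses the assignment.
try:
    obj.index = ["Bob", "Steve"]
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(type(mismatch_error).__name__)  # ValueError
```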

## DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). 

The DataFrame has both a row and column index.

There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays:

In [71]:
data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
}

In [72]:
frame = pd.DataFrame(data)

In [73]:
frame

|   | state  | year | pop |
|---|--------|------|-----|
| 0 | Ohio   | 2000 | 1.5 |
| 1 | Ohio   | 2001 | 1.7 |
| 2 | Ohio   | 2002 | 3.6 |
| 3 | Nevada | 2001 | 2.4 |
| 4 | Nevada | 2002 | 2.9 |
| 5 | Nevada | 2003 | 3.2 |
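The row index and column index mentioned earlier can be inspected directly. A short sketch, rebuilding the same `data` dict so the block runs on its own:

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
}
frame = pd.DataFrame(data)

# Row labels: a default integer range, since no index was passed.
print(frame.index)          # RangeIndex(start=0, stop=6, step=1)

# Column labels: taken from the dict keys, in insertion order.
print(list(frame.columns))  # ['state', 'year', 'pop']

# (number of rows, number of columns)
print(frame.shape)          # (6, 3)
```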


In [74]:
frame.head()

|   | state  | year | pop |
|---|--------|------|-----|
| 0 | Ohio   | 2000 | 1.5 |
| 1 | Ohio   | 2001 | 1.7 |
| 2 | Ohio   | 2002 | 3.6 |
| 3 | Nevada | 2001 | 2.4 |
| 4 | Nevada | 2002 | 2.9 |


In [75]:
frame.tail()

|   | state  | year | pop |
|---|--------|------|-----|
| 1 | Ohio   | 2001 | 1.7 |
| 2 | Ohio   | 2002 | 3.6 |
| 3 | Nevada | 2001 | 2.4 |
| 4 | Nevada | 2002 | 2.9 |
| 5 | Nevada | 2003 | 3.2 |


If you specify a sequence of columns, the DataFrame's columns will be arranged in that order:

In [76]:
pd.DataFrame(data, columns=["year", "state", "pop"])

|   | year | state  | pop |
|---|------|--------|-----|
| 0 | 2000 | Ohio   | 1.5 |
| 1 | 2001 | Ohio   | 1.7 |
| 2 | 2002 | Ohio   | 3.6 |
| 3 | 2001 | Nevada | 2.4 |
| 4 | 2002 | Nevada | 2.9 |
| 5 | 2003 | Nevada | 3.2 |
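The `columns` argument selects as well as orders. In the sketch below, the column name `"debt"` is a made-up example that does not exist in `data`: a requested column with no matching dict key comes back filled with missing values, and a dict key that is not requested is left out.

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
}

# "debt" is not a key in data, so that column is entirely missing values.
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])
print(list(frame2.columns))               # ['year', 'state', 'pop', 'debt']
print(bool(frame2["debt"].isna().all()))  # True

# Leaving a key out of columns drops it from the result.
frame3 = pd.DataFrame(data, columns=["year", "state"])
print(list(frame3.columns))               # ['year', 'state']
```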
