# Introducing Pandas Objects

Pandas objects can be thought of as enhanced versions of NumPy structured arrays __in which the rows and columns are identified with labels rather than simple integer indices.__ Let's introduce __three fundamental Pandas data structures: the ``Series``, ``DataFrame``, and ``Index``.__

In [1]:
import numpy as np
import pandas as pd

### The Pandas Series Object

A Pandas ``Series`` is a __one-dimensional array of indexed data.__ It can be created from a list or array.

In [2]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

A ``Series`` __wraps both a sequence of values and a sequence of indices, which we can access with the ``values`` and ``index`` attributes__. The ``values`` are simply a NumPy array. The ``index`` is an array-like object of type ``pd.Index``.

In [4]:
data.values

array([ 0.25,  0.5 ,  0.75,  1.  ])

In [5]:
data.index

RangeIndex(start=0, stop=4, step=1)

``Series`` data can be accessed by the associated index using square-bracket notation, just as with a NumPy array.

In [7]:
data[1]

0.5

In [8]:
data[1:3]

1    0.50
2    0.75
dtype: float64

### ``Series`` as generalized NumPy array

__The difference between NumPy arrays and Pandas ``Series`` is the index__. While the NumPy array has an *implicitly defined* integer index used to access the values, the Pandas ``Series`` __has an *explicitly defined* index__ associated with the values.

This explicit index definition gives the ``Series`` object additional capabilities. For example, the index can consist of values of any desired type.

In [9]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [10]:
data['b']

0.5

__Non-contiguous and non-sequential indices are also allowed:__

In [12]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 5, 3, 7])
data

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64

In [13]:
data[5]

0.5
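Because an integer index like this one holds labels rather than positions, ``data[5]`` looks up the label ``5``, not the sixth element. As a minimal sketch, the two kinds of lookup can be written out explicitly with the ``.loc`` (label-based) and ``.iloc`` (position-based) accessors. These accessors are not introduced in the text above; they are brought in here only for illustration.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

by_label = data.loc[5]      # label 5 is attached to the second element
by_position = data.iloc[1]  # position 1 is that same second element
first_value = data.iloc[0]  # position 0 carries the label 2

print(by_label, by_position, first_value)
```

Spelling out the accessor avoids ambiguity whenever the labels are themselves integers.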

### Series as specialized dictionary

Think of a ``Series`` as a specialized dictionary. It is a structure which maps typed keys to a set of typed values.

In [14]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
dtype: int64

In [15]:
population['California']

38332521
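The dictionary analogy extends to a few other familiar operations. The sketch below uses a shortened, hypothetical two-state ``Series`` (``population_small``) to show membership tests, ``keys()``, and ``items()``, all of which work on the index labels:

```python
import pandas as pd

population_small = pd.Series({'California': 38332521,
                              'Texas': 26448193})

has_texas = 'Texas' in population_small    # membership checks the labels
labels = list(population_small.keys())     # keys() returns the index
pairs = list(population_small.items())     # (label, value) tuples

print(has_texas, labels, pairs)
```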

Unlike a dictionary, a __``Series`` also supports array-style operations such as slicing:__

In [16]:
population['California':'Illinois']

California    38332521
Florida       19552860
Illinois      12882135
dtype: int64
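Note that the slice above includes its end label, ``'Illinois'``. The following sketch, using a hypothetical four-element ``Series`` named ``s``, contrasts a label slice (end included) with a positional slice via ``.iloc`` (stop excluded, as in NumPy):

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

label_slice = s['a':'c']       # slicing by label keeps the end label 'c'
position_slice = s.iloc[0:2]   # slicing by position drops position 2

print(list(label_slice.index), list(position_slice.index))
```

The result of a label slice also depends on the order of the index, so the same slice can return different rows if the labels are stored in a different order.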

### Constructing Series objects

We've already seen a few ways of constructing a Pandas ``Series`` from scratch. ``data`` can be a list or NumPy array, in which case ``index`` defaults to an integer sequence:

In [17]:
pd.Series([2, 4, 6])

0    2
1    4
2    6
dtype: int64

``data`` can be a scalar:

In [19]:
pd.Series(5, index=[100, 200, 300])

100    5
200    5
300    5
dtype: int64

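``data`` can be a dictionary, in which case ``index`` defaults to the sorted dictionary keys (as in the output below; recent Pandas versions keep the dictionary's insertion order instead):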

In [20]:
pd.Series({2:'a', 1:'b', 3:'c'})

1    b
2    a
3    c
dtype: object

__The index can be explicitly set__. In this case the ``Series`` is only populated with the explicitly identified keys.

In [21]:
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])

3    c
2    a
dtype: object
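As a small sketch of two related behaviors: an explicitly requested key that is absent from the dictionary is filled with ``NaN``, and ``sort_index()`` yields key-sorted output regardless of how the installed Pandas version orders dictionary keys by default. The names ``letters``, ``subset``, and ``ordered`` are chosen here for illustration.

```python
import pandas as pd

letters = {2: 'a', 1: 'b', 3: 'c'}

# Only the requested keys are kept; key 4 is not in the dictionary,
# so its entry is filled with NaN.
subset = pd.Series(letters, index=[3, 2, 4])

# sort_index() returns a new Series ordered by its labels.
ordered = pd.Series(letters).sort_index()

print(subset)
print(ordered)
```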

### Pandas DataFrames

If a ``Series`` resembles a 1D array with flexible indices, a ``DataFrame`` resembles a 2D array with flexible row and column indices. __Think of a ``DataFrame`` as a sequence of aligned ``Series`` objects. "Aligned" means they share the same index.__

In [22]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
dtype: int64

Let's combine this ``Series`` with the ``population`` ``Series`` from before to build a two-dimensional object.

In [23]:
states = pd.DataFrame({'population': population,
                       'area': area})
states

              area  population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193
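To see what "aligned" means in practice, here is a small sketch with two hypothetical ``Series`` (``pop_small`` and ``area_small``, not part of the example above) whose keys appear in different orders and only partly overlap. The constructor matches values by label, and a label missing from one ``Series`` shows up as ``NaN``:

```python
import pandas as pd

pop_small = pd.Series({'Texas': 26448193,
                       'California': 38332521,
                       'New York': 19651127})
area_small = pd.Series({'California': 423967,
                        'Texas': 695662})

# Rows are matched on the index labels, not on position.
combined = pd.DataFrame({'population': pop_small, 'area': area_small})

print(combined)
```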


``DataFrame``s have an ``index`` attribute holding the row labels and a ``columns`` attribute (an ``Index`` object holding the column labels).

In [24]:
states.index

Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')

In [25]:
states.columns

Index(['area', 'population'], dtype='object')

### DataFrame as specialized dictionary

A ``DataFrame`` can also be seen as a specialized dictionary. __It maps a column name to a ``Series`` of column data.__

In [26]:
states['area']

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

Note: in a 2D NumPy array, ``data[0]`` returns the first *row*. In a ``DataFrame``, ``data['col0']`` will return the first *column*.
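The column-first behavior can be checked with a short sketch. It uses a hypothetical two-state frame, ``states_small``, and the ``.loc`` accessor for row lookup, which is not covered in the text above and is introduced here only for contrast:

```python
import pandas as pd

states_small = pd.DataFrame({
    'population': {'California': 38332521, 'Texas': 26448193},
    'area': {'California': 423967, 'Texas': 695662},
})

area_column = states_small['area']      # dictionary-style: selects a column
texas_row = states_small.loc['Texas']   # label-based: selects a row

has_area = 'area' in states_small       # membership tests the column names
has_texas = 'Texas' in states_small     # row labels are not dictionary keys

print(has_area, has_texas)
```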

### Constructing DataFrames from a single Series object

A ``DataFrame`` is a collection of ``Series`` objects. A single-column ``DataFrame`` can be constructed from a single ``Series``:

In [27]:
pd.DataFrame(population, columns=['population'])

            population
California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193


### Constructing DataFrames from a list of dicts

In [28]:
data = [{'a': i, 'b': 2*i}
        for i in range(5)]
pd.DataFrame(data)

   a  b
0  0  0
1  1  2
2  2  4
3  3  6
4  4  8


If some dictionary keys are missing, __Pandas will fill them in with ``NaN`` values:__

In [26]:
pd.DataFrame([
    {'a': 1, 'b': 2}, 
    {'b': 3, 'c': 4}])

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0
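The output above shows columns ``a`` and ``c`` as floating-point numbers while ``b`` stays an integer. Since ``NaN`` is a floating-point value, a column that receives one is stored as floats. A minimal sketch that inspects this (the variable names are chosen for illustration):

```python
import pandas as pd

df = pd.DataFrame([
    {'a': 1, 'b': 2},
    {'b': 3, 'c': 4}])

column_kinds = {name: df[name].dtype.kind for name in df.columns}
missing_per_column = df.isna().sum()   # count of NaN entries per column

print(column_kinds)
print(missing_per_column)
```

Here the dtype kind ``'f'`` stands for floating point and ``'i'`` for signed integer.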


### Constructing DataFrames from a dictionary of Series objects

As we saw before, a ``DataFrame`` can be constructed from a dictionary of ``Series`` objects as well:

In [27]:
pd.DataFrame({'population': population, 'area': area})

              area  population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193


### Constructing DataFrames from a two-dimensional NumPy array

Given a two-dimensional array of data, we can create a ``DataFrame`` with any specified column and index names. If omitted, an integer index will be used for each:

In [29]:
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

        foo       bar
a  0.669045  0.221414
b  0.180722  0.101075
c  0.757857  0.621290
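For comparison, here is a short sketch of the case where both ``columns`` and ``index`` are omitted. It uses an array of zeros rather than random values so that the result is reproducible; ``unlabeled`` is a name chosen for illustration.

```python
import numpy as np
import pandas as pd

# No labels supplied: rows and columns both get integer labels from 0.
unlabeled = pd.DataFrame(np.zeros((3, 2)))

row_labels = list(unlabeled.index)
column_labels = list(unlabeled.columns)

print(row_labels, column_labels)
```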


### Constructing DataFrames from a NumPy structured array

``DataFrame``s operate much like a structured array, and can be created directly from one:

In [30]:
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
A

array([(0,  0.), (0,  0.), (0,  0.)],
      dtype=[('A', '<i8'), ('B', '<f8')])

In [31]:
pd.DataFrame(A)

   A    B
0  0  0.0
1  0  0.0
2  0  0.0


### Pandas Index Objects

``Series`` and ``DataFrame`` objects contain an explicit *index* that lets you reference and modify data. __The ``Index`` can be thought of as an *immutable array* or an *ordered set*__ (technically a multi-set, as ``Index`` objects may contain repeated values).

In [32]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

### Indexes as immutable arrays

We can use standard Python indexing to retrieve values or slices:

In [33]:
ind[1]

3

In [34]:
ind[::2]

Int64Index([2, 5, 11], dtype='int64')

``Index`` objects have many NumPy array-like attributes:

In [35]:
print(ind.size, ind.shape, ind.ndim, ind.dtype)

5 (5,) 1 int64


Note: __``Index`` objects are immutable; they cannot be modified via the normal means__. Immutability makes it safer to share indices between multiple ``DataFrame``s and arrays without the risk of side effects from inadvertent index modification.

In [37]:
ind[1] = 0

TypeError: Index does not support mutable operations
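In a script, as opposed to an interactive session, the failed assignment can be caught like any other exception. A minimal sketch (the flag ``assignment_failed`` is a name chosen for illustration) that also confirms the index is left untouched:

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

try:
    ind[1] = 0                  # item assignment is not supported
    assignment_failed = False
except TypeError:
    assignment_failed = True

print(assignment_failed, list(ind))
```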

### Indexes as ordered sets

Pandas objects are designed for operations such as dataset joins, which depend on set arithmetic. The ``Index`` object follows many of the conventions used by Python's built-in ``set`` data structure, __so that unions, intersections, differences, and other combinations can be computed in a familiar way:__

In [38]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [39]:
indA.intersection(indB)  # intersection; in pandas 2.0+ "indA & indB" is element-wise, not a set operation

Int64Index([3, 5, 7], dtype='int64')

In [40]:
indA.union(indB)  # union; in pandas 2.0+ "indA | indB" is element-wise, not a set operation

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [41]:
indA.symmetric_difference(indB)  # symmetric difference; in pandas 2.0+ "indA ^ indB" is element-wise

Int64Index([1, 2, 9, 11], dtype='int64')