# Introducing Pandas Objects

At a basic level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than integer indices.
The three fundamental Pandas data structures are:
  * Series
  * DataFrame
  * Index

In [1]:
#standard numpy/pandas imports
import numpy as np
import pandas as pd

## The Pandas Series Object
A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows:

In [2]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

The output shows that the Series wraps both a sequence of values and a sequence of indices. Both of these can be accessed with the values and index attributes.

In [3]:
data.values

array([ 0.25,  0.5 ,  0.75,  1.  ])

In [4]:
data.index

RangeIndex(start=0, stop=4, step=1)

As with a NumPy array, the data can be accessed by its associated index using square-bracket notation.

In [5]:
data[1]

0.5

In [6]:
data[1:3]

1    0.50
2    0.75
dtype: float64

### Series as a generalized NumPy array
The essential difference between a NumPy array and a Series is the index: a NumPy array has an implicitly defined integer index, while a Series has an explicitly defined index associated with its values.

This explicit index gives the Series additional capabilities. For example, the index can be set to whatever we want. In the following case, strings are used as the index.

In [7]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [8]:
data['b']

0.5
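The index labels also do not have to be contiguous or sequential. The following is a minimal sketch of that idea; the name scattered and the label values are illustrative assumptions, not part of the original notebook. It uses .loc so that the lookup is unambiguously by label.

```python
import pandas as pd

# Integer labels that are neither sorted nor contiguous (illustrative values)
scattered = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

# .loc looks up by label, so this fetches the value stored under label 5,
# not the element at position 5 (which does not exist in a 4-element Series)
value_at_label_5 = scattered.loc[5]
print(value_at_label_5)
```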

### Series as a specialized dictionary
A Series can also be viewed as a specialization of Python's built-in dictionary. A dictionary maps arbitrary keys to a set of arbitrary values. A Pandas Series, however, uses *typed* keys and values, which makes it more efficient than a dict for certain operations.

A Series can be constructed directly from a Python dict.

In [9]:
#made up values
population_dict = {'California': 7849781947,
                  'Texas': 72401374,
                  'New York': 8491041,
                  'Florida': 91348104,
                  'Illinois': 7381941}
population = pd.Series(population_dict)
population

California    7849781947
Florida         91348104
Illinois         7381941
New York         8491041
Texas           72401374
dtype: int64

In [10]:
population['California']


7849781947
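Unlike a plain dict, a Series also supports array-style operations such as slicing by label. The sketch below reuses the made-up values from above. The sort_index call is an addition so that the slice does not depend on the order in which the dict was written, and subset is an illustrative name.

```python
import pandas as pd

# Same made-up values as in the cell above
population = pd.Series({'California': 7849781947,
                        'Texas': 72401374,
                        'New York': 8491041,
                        'Florida': 91348104,
                        'Illinois': 7381941})

# Sort the labels so the slice below does not depend on construction order
population = population.sort_index()

# Label-based slices include both endpoints
subset = population.loc['California':'Illinois']
print(subset)
```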

### Other ways to construct Series objects


In [11]:
pd.Series([2,5,1])

0    2
1    5
2    1
dtype: int64

The data can be a scalar, which is repeated to fill the specified index.

In [12]:
pd.Series(5, index=[1,2,3])

1    5
2    5
3    5
dtype: int64

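The data can be a Python dict, in which case the index is taken from the dict keys. The output below comes from an older Pandas release that sorted the keys; recent releases keep the dict's insertion order instead.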

In [13]:
pd.Series({2:'b', 1:'a', 3:'c'})

1    a
2    b
3    c
dtype: object
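When an index is passed alongside a dict, only the listed keys are kept, in the order given. A small sketch, with the name subset chosen purely for illustration:

```python
import pandas as pd

# Only the keys named in index are kept, in the order given
subset = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])
print(subset)
```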

## The Pandas DataFrame Object
The next fundamental data structure is the Pandas DataFrame. Like the Series, it can be thought of either as a generalized NumPy array or as a specialized Python dictionary.

### DataFrame as a generalized NumPy array
If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with flexible row indices and flexible column names. Just as a two-dimensional array can be seen as a sequence of aligned one-dimensional arrays, a DataFrame can be seen as a sequence of aligned Series objects. Here "aligned" means that they share the same index.

In [14]:
#construct a new Series of state areas
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
dtype: int64

Now that we have a population Series and an area Series, we can combine them into a two-dimensional DataFrame.

In [15]:
states = pd.DataFrame({'population': population, 'area': area})
states

              area  population
California  423967  7849781947
Florida     170312    91348104
Illinois    149995     7381941
New York    141297     8491041
Texas       695662    72401374
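Alignment is done by label, not by position. The sketch below uses two small hypothetical Series (left and right, not from the notebook) whose indices only partly overlap. Labels missing from one Series show up as NaN in its column.

```python
import pandas as pd

# Two illustrative Series with partly overlapping labels
left = pd.Series({'a': 1, 'b': 2, 'c': 3})
right = pd.Series({'b': 20, 'c': 30, 'd': 40})

# The resulting index is the union of both indices;
# entries with no matching label are filled with NaN
aligned = pd.DataFrame({'left': left, 'right': right})
print(aligned)
```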


DataFrames have the following attributes:
 * index, which holds the row labels
 * columns, which holds the column labels

In [16]:
states.index

Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')

In [17]:
states.columns

Index(['area', 'population'], dtype='object')

### DataFrame as a specialized dictionary
A DataFrame can also be seen as a specialized version of Python's built-in dict. Where a dict maps a key to a value, a DataFrame maps a column name to a Series of column data. For example, asking for the 'area' item returns the Series object containing the area data.

In [18]:
states['area']

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

In [19]:
states['population']

California    7849781947
Florida         91348104
Illinois         7381941
New York         8491041
Texas           72401374
Name: population, dtype: int64
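The dictionary analogy extends to membership tests and keys, which refer to column names rather than row labels. A minimal sketch using a two-state subset of the made-up data above; the variable names are illustrative:

```python
import pandas as pd

states = pd.DataFrame({
    'population': pd.Series({'California': 7849781947, 'Texas': 72401374}),
    'area': pd.Series({'California': 423967, 'Texas': 695662}),
})

# As with a dict, "in" and keys() look at the keys, i.e. the column names
has_area = 'area' in states
has_texas = 'Texas' in states   # 'Texas' is a row label, not a column name
column_names = list(states.keys())
print(has_area, has_texas, column_names)
```

This is one place where the array analogy and the dict analogy differ: indexing a two-dimensional NumPy array with a single value returns a row, while indexing a DataFrame with a single key returns a column.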

### Constructing DataFrame objects
DataFrames can be constructed in a variety of ways.

**From a single Series object**

A single Series object yields a DataFrame with one column.

In [20]:
pd.DataFrame(population, columns=['population'])

            population
California  7849781947
Florida       91348104
Illinois       7381941
New York       8491041
Texas         72401374


**From a list of dictionaries**

Any list of dictionaries can be turned into a DataFrame. Here a simple list comprehension builds the list.

In [22]:
data = [{'a': i, 'b': 2 * i} for i in range(4)]
pd.DataFrame(data)

   a  b
0  0  0
1  1  2
2  2  4
3  3  6
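The dictionaries do not all need the same keys. In the sketch below (records and frame are illustrative names), Pandas fills the missing entries with NaN.

```python
import pandas as pd

# Two records with partly different keys
records = [{'a': 1, 'b': 2}, {'b': 3, 'c': 4}]

# Keys missing from a record become NaN in that row
frame = pd.DataFrame(records)
print(frame)
```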


**From a two-dimensional NumPy array** 

A DataFrame can be constructed from a two-dimensional array of data with specified index and column names. If the index or column names are omitted, the DataFrame defaults to integer labels for each.

In [24]:
array = np.random.rand(3,2)
pd.DataFrame(array, index=['a', 'b', 'c'], columns=['foo', 'bar'])

        foo       bar
a  0.262957  0.454110
b  0.376431  0.534379
c  0.436688  0.754184
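To see the integer defaults, the same kind of array can be passed without index or columns. A short sketch, using a fixed array rather than random values so the result is reproducible:

```python
import numpy as np
import pandas as pd

# A fixed 3x2 array (illustrative values)
values = np.arange(6).reshape(3, 2)

# With no index or columns given, both default to integers starting at 0
frame = pd.DataFrame(values)
print(frame.index)
print(frame.columns)
```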


## The Pandas Index Object

Both the Series and DataFrame objects contain an explicit index that lets you reference and modify data.

The Index object can be thought of either as an *immutable* array or as an *ordered set*. Strictly speaking it is a multiset, since an Index may contain repeated values.
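The multiset behavior can be checked directly. This is a minimal sketch with an illustrative Index named repeated:

```python
import pandas as pd

# An Index is allowed to hold the same label more than once
repeated = pd.Index([1, 2, 2, 5])

print(len(repeated))        # all four entries are kept
print(repeated.is_unique)   # reports whether every label is distinct
print(repeated.unique())    # a new Index with duplicates dropped
```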

Constructing an Index from a list of integers:

In [26]:
ind = pd.Index([1, 2, 5, 8])
ind

Int64Index([1, 2, 5, 8], dtype='int64')

An Index behaves a lot like an array. For example, using standard Python indexing notation:

In [27]:
ind[3]

8

In [28]:
ind[::2]

Int64Index([1, 5], dtype='int64')

Index objects also have many of the attributes familiar from NumPy arrays:

In [29]:
print(ind.size, ind.shape, ind.ndim, ind.dtype)

4 (4,) 1 int64


However, its contents cannot be modified.

In [30]:
ind[2] = 2

TypeError: Index does not support mutable operations
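Because an Index cannot be changed in place, a modified version has to be built as a new object. The sketch below catches the error and then uses the delete and insert methods, each of which returns a new Index. The names message and changed are illustrative.

```python
import pandas as pd

ind = pd.Index([1, 2, 5, 8])

# In-place assignment is rejected
message = None
try:
    ind[2] = 2
except TypeError as err:
    message = str(err)
    print('TypeError:', message)

# delete() and insert() each return a new Index; ind itself is untouched
changed = ind.delete(2).insert(2, 2)
print(changed)
```

This immutability makes it safer to share one Index between several Series and DataFrames, since no holder can change it under the others.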

### Index as an Ordered Set

The Index object follows many of the conventions of mathematical sets, so the following can be computed:
 * Unions
 * Intersections
 * Differences
 * And other combinations

In [32]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [33]:
indA & indB #intersection

Int64Index([3, 5, 7], dtype='int64')

In [34]:
indA | indB #union

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [35]:
indA ^ indB #symmetric difference

Int64Index([1, 2, 9, 11], dtype='int64')

In [36]:
#can also be accessed via object methods
indA.intersection(indB)

Int64Index([3, 5, 7], dtype='int64')
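The method forms are worth knowing because later Pandas releases appear to have moved away from the operator forms: in Pandas 2.x, &, | and ^ on Index objects act elementwise rather than as set operations. The following sketch, with illustrative result names, should behave the same across versions.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

common = indA.intersection(indB)            # labels found in both
either = indA.union(indB)                   # labels found in at least one
only_one = indA.symmetric_difference(indB)  # labels found in exactly one
only_a = indA.difference(indB)              # labels in indA but not in indB

print(common, either, only_one, only_a, sep='\n')
```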