# Pandas

Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which rows and columns are identified with labels rather than simple integer indices. Pandas also provides a host of useful functions and operations on top of these structures.

In [24]:
import numpy as np
import pandas as pd

## Series Object

A Series object is a one-dimensional array of indexed data. It can be created from a list. The Series wraps both a sequence of values and a sequence of indices, both of which are accessible.

In [25]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

Data can be accessed by its associated index using the usual square-bracket notation.

In [26]:
data[1]

0.5

In [27]:
data[1:3]

1    0.50
2    0.75
dtype: float64

The 'values' attribute gives the familiar NumPy array.

In [28]:
data.values

array([ 0.25,  0.5 ,  0.75,  1.  ])

The 'index' attribute is an array-like object of type pd.Index.

In [29]:
data.index

RangeIndex(start=0, stop=4, step=1)

The big change that Pandas brings to NumPy one-dimensional arrays is the presence of the index. A NumPy array has an implicitly defined integer index, while a Pandas Series has an explicitly defined index associated with the values.

The explicit index definition gives the Series object additional capabilities. The index does not even need to be an integer; it can be of any desired type.

In [30]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                index=['a','b','c','d'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

Item access can then be done using the explicit index.

In [31]:
data['b']

0.5

The indices don't even need to be contiguous or sequential.

In [32]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 5, 3, 7])
data

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64

In [33]:
data[5]

0.5
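
With integer labels, plain square brackets can be ambiguous between "the entry labeled 5" and "the entry at position 5". The sketch below, using the same data as above, spells out the difference with the label-based `loc` and position-based `iloc` accessors; the variable names are illustrative only.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

# Explicit index: look up the entry whose label is 3
by_label = data.loc[3]

# Implicit index: look up the entry at position 3 (the fourth entry)
by_position = data.iloc[3]

# Positional slice: positions 1 and 2, end position excluded
pos_slice = data.iloc[1:3]

print(by_label, by_position)
print(pos_slice)
```

Using `loc` and `iloc` makes the intent explicit and avoids relying on how plain brackets resolve integer keys.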

## Series as specialized dictionary

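A Series can also be treated as a specialized dictionary that maps keys to values. Where a Python dictionary maps arbitrary keys to arbitrary values, a Series maps typed keys to typed values. Because the values are stored in a typed NumPy array, a Series is much more efficient than a standard Python dictionary for many operations.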

In [34]:
population_dict = {'california': 38332521,
                  'texas': 26448193,
                  'new york': 19651127,
                  'florida': 18552860,
                  'illinois': 12882135}
population = pd.Series(population_dict)
population

california    38332521
florida       18552860
illinois      12882135
new york      19651127
texas         26448193
dtype: int64

Once the Series is created from the dictionary, normal key access works as expected.

In [35]:
population['florida']

18552860
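
The dictionary analogy extends beyond key access. As a rough illustration, the sketch below uses membership tests, `keys()` and `items()` on the same population data, then a vectorized operation that a plain dictionary does not offer. The variable names are illustrative only.

```python
import pandas as pd

population = pd.Series({'california': 38332521,
                        'texas': 26448193,
                        'new york': 19651127,
                        'florida': 18552860,
                        'illinois': 12882135})

# Dictionary-style operations work on the index labels
has_texas = 'texas' in population
labels = list(population.keys())
pairs = list(population.items())

# Array-style operation: applied to every value at once
in_millions = population / 1e6

print(has_texas)
print(labels)
print(in_millions)
```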

Unlike a dictionary, a Series also supports array-like operations such as slicing.

In [36]:
population['california':'illinois']

california    38332521
florida       18552860
illinois      12882135
dtype: int64
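
Note that the label slice above includes its end label, unlike a positional slice. The sketch below contrasts the two. It calls `sort_index()` first on the assumption that a recent pandas version is in use, where a Series built from a dictionary keeps the dictionary's insertion order rather than sorting the keys.

```python
import pandas as pd

population = pd.Series({'california': 38332521,
                        'texas': 26448193,
                        'new york': 19651127,
                        'florida': 18552860,
                        'illinois': 12882135}).sort_index()

# Label-based slice: the end label 'illinois' is included
by_label = population['california':'illinois']

# Position-based slice: the end position 2 is excluded
by_position = population.iloc[0:2]

print(by_label)
print(by_position)
```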

## Constructing Series Objects

One way of constructing Series objects has already been shown.

In [37]:
# pd.Series(data, index=index)
# index is an optional parameter
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a','b','c','d'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

A Series can also be constructed from a list or NumPy array. In this case the index defaults to an integer sequence starting at 0.

In [38]:
pd.Series([2,4,8])

0    2
1    4
2    8
dtype: int64
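
The same holds for a NumPy array. A minimal sketch, with illustrative variable names, showing the default integer index and an explicitly supplied one:

```python
import numpy as np
import pandas as pd

arr = np.array([2, 4, 8])

# No index given: labels default to 0, 1, 2
s = pd.Series(arr)
default_index = list(s.index)

# Index given: one label per array element
named = pd.Series(arr, index=['x', 'y', 'z'])

print(default_index)
print(named)
```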

The data can also be a scalar, in which case the scalar is repeated to fill each entry of the given index.

In [39]:
pd.Series(4, index=[100, 200, 300])

100    4
200    4
300    4
dtype: int64

The data can also be a dictionary, in which case the index defaults to the dictionary keys.

In [40]:
# strings as keys, numbers as values
pd.Series({'a':4, 'g':2, 'p':6})

a    4
g    2
p    6
dtype: int64

In [41]:
# number as key, string as value
pd.Series({4:'a', 2:'g', 6:'p'})

2    g
4    a
6    p
dtype: object

Alternatively, the index can be given explicitly. The index values act as gatekeepers: any dictionary entry whose key is not in the given index is left out of the Series.

In [42]:
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3,2])

3    c
2    a
dtype: object
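
The filtering works in the other direction as well. As a small sketch, the index below includes the label 9, which is not a key of the dictionary (it is chosen only for illustration); that label is kept and its value is marked as missing.

```python
import pandas as pd

# Key 1 is dropped because it is not in the index;
# label 9 is kept but has no matching dictionary entry
s = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2, 9])

missing = s.isna()

print(s)
print(missing)
```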

## DataFrame Object

### As Generalized NumPy arrays
Pandas DataFrame objects can be thought of as generalized NumPy arrays. Recall the population Series from earlier, extended here with a Montana entry:

In [47]:
population_dict = {'California': 38332521,
                  'Texas': 26448193,
                  'New York': 19651127,
                  'Florida': 18552860,
                  'Illinois': 12882135,
                  'Montana': 9999999}
population = pd.Series(population_dict)

Now introduce a Series containing the area of each state:

In [48]:
area_dict = {'California': 423967, 
             'Texas': 695662, 
             'New York': 141297,
             'Florida': 170312, 
             'Illinois': 149995,
             'Nevada': 999999}
area = pd.Series(area_dict)


Finally, putting these two together in a DataFrame results in a single two-dimensional object aligned on the shared keys. Notice what happens to a value whose key is missing from one of the Series: it is filled with NaN.
- population does not have a Nevada entry
- area does not have a Montana entry

In [49]:
states = pd.DataFrame({'population': population,
                      'area': area})
states

                area  population
California  423967.0  38332521.0
Florida     170312.0  18552860.0
Illinois    149995.0  12882135.0
Montana          NaN   9999999.0
Nevada      999999.0         NaN
New York    141297.0  19651127.0
Texas       695662.0  26448193.0
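
The values print as floats because NaN is itself a floating-point value, so an integer column that gains a missing entry is converted to float64. The sketch below rebuilds the same DataFrame and shows one way to find or drop the incomplete rows; `incomplete` and `complete` are illustrative names.

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 18552860,
                        'Illinois': 12882135, 'Montana': 9999999})
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995, 'Nevada': 999999})

states = pd.DataFrame({'population': population, 'area': area})

# Rows where at least one column is missing
incomplete = states[states.isna().any(axis=1)]

# Rows where every column is present
complete = states.dropna()

print(states.dtypes)
print(incomplete)
```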


A DataFrame also has an index attribute that gives access to the row labels.

In [52]:
states.index

Index(['California', 'Florida', 'Illinois', 'Montana', 'Nevada', 'New York',
       'Texas'],
      dtype='object')

A DataFrame also has a columns attribute, itself an Index object, that holds the column names.

In [54]:
states.columns

Index(['area', 'population'], dtype='object')

Viewed as a generalized NumPy array, a DataFrame is a way to combine one-dimensional arrays into a two-dimensional structure aligned on a shared index (keys).

### As Specialized Dictionary 

A dictionary maps a key to a value; a DataFrame maps a column name to a Series of column data.

In [57]:
states['area']

California    423967.0
Florida       170312.0
Illinois      149995.0
Montana            NaN
Nevada        999999.0
New York      141297.0
Texas         695662.0
Name: area, dtype: float64

It is important to note that accessing a DataFrame in this way returns a column of data, whereas the same operation on a two-dimensional NumPy array returns a row.
Therefore it is often better to think of DataFrames as specialized dictionaries rather than as arrays.
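
To make the contrast concrete, here is a minimal sketch using a two-state subset of the data above. It compares indexing a NumPy array with `[0]` against indexing a DataFrame with a column name, and shows `iloc` and `loc` as the way to get rows. Variable names are illustrative.

```python
import numpy as np
import pandas as pd

states = pd.DataFrame({'area': [423967, 695662],
                       'population': [38332521, 26448193]},
                      index=['California', 'Texas'])

arr = states.to_numpy()

# NumPy: a single index selects a row
first_row_np = arr[0]

# DataFrame: a single key selects a column
area_col = states['area']

# DataFrame rows are reached through iloc (position) or loc (label)
first_row_df = states.iloc[0]
texas_row = states.loc['Texas']

print(first_row_np)
print(area_col)
print(texas_row)
```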

### Constructing DataFrame Objects

A DataFrame is a collection of Series objects, so a DataFrame can be constructed from a single Series if desired.

In [58]:
pd.DataFrame(population, columns=['population'])

            population
California    38332521
Florida       18552860
Illinois      12882135
Montana        9999999
New York      19651127
Texas         26448193


DataFrames can also be constructed from a list of dictionaries.

In [63]:
data = [{'a': 1**i, 'b': 2**i, 'c': 3**i} for i in range(3)]
pd.DataFrame(data)

   a  b  c
0  1  1  1
1  1  2  3
2  1  4  9


Pandas will figure out the details: if a key is missing from one of the dictionaries, the corresponding entry is filled with NaN.

In [64]:
# notice that the first dictionary is missing an entry for 'c'
# the second is missing an entry for 'a'
pd.DataFrame([{'a':1, 'b':2}, 
              {'b':3, 'c':4}])

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0


DataFrames can be constructed from a dictionary of Series objects. The dictionary keys become the column names.
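
A short sketch of this construction, using a two-state subset of the earlier population and area data:

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193})
area = pd.Series({'California': 423967, 'Texas': 695662})

# Each dictionary key names a column; rows are aligned on the Series index
df = pd.DataFrame({'population': population, 'area': area})
column_names = list(df.columns)

print(df)
```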

DataFrames can also be constructed from a two-dimensional array of data, with any specified column and index names. If either is omitted, an integer index is used in its place.
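
A minimal sketch of this construction follows. The array contents and the names 'foo', 'bar', 'a', 'b' and 'c' are arbitrary placeholders chosen for illustration.

```python
import numpy as np
import pandas as pd

values = np.arange(6).reshape(3, 2)   # [[0, 1], [2, 3], [4, 5]]

# Explicit column and index names
df = pd.DataFrame(values, columns=['foo', 'bar'], index=['a', 'b', 'c'])

# Names omitted: both axes fall back to integer labels
default_df = pd.DataFrame(values)

print(df)
print(default_df)
```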

## Index Object 

Both the Series and DataFrame objects contain an explicit index used to reference and modify data. An Index can be thought of either as an immutable array or as an ordered set (strictly a multiset, since an Index can contain repeated values).

Indexes can be constructed from a list of integers

In [65]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

Indexes operate like arrays in many ways. For example, the usual notation for accessing values or slices holds.

In [66]:
ind[1]

3

In [69]:
ind[::2]

Int64Index([2, 5, 11], dtype='int64')

Pandas Indexes are immutable. This makes it safer to share indices between DataFrames and Series without worrying about side effects from an inadvertent modification.
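
As a quick illustration of both points, the sketch below attempts an item assignment on an Index, which pandas rejects with a TypeError, and checks a second Index for repeated values. The names `mutated` and `dup` are illustrative only.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

# Item assignment is not supported on an Index
try:
    ind[1] = 0
except TypeError as err:
    mutated = False
    print("TypeError:", err)
else:
    mutated = True

# An Index may hold repeated values, so it is not a strict set
dup = pd.Index([1, 1, 2])
print(dup.is_unique)
```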

## Index as ordered set 

Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic. The Index object follows many of the conventions used by Python's built-in set data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way:

In [70]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7 , 11])

In [71]:
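# intersection of a, b
# (the method form is used because recent pandas versions no longer treat & as a set operation)
indA.intersection(indB)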

Int64Index([3, 5, 7], dtype='int64')

In [72]:
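# union of a, b
# (the method form is used because recent pandas versions no longer treat | as a set operation)
indA.union(indB)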

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [73]:
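# symmetric difference
# (the method form is used because recent pandas versions no longer treat ^ as a set operation)
indA.symmetric_difference(indB)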

Int64Index([1, 2, 9, 11], dtype='int64')
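
These set operations are what make alignment across datasets possible. As a sketch tied to the earlier state data, the example below uses the `intersection` and `difference` methods of Index to find which labels the population and area Series share and which appear in only one of them. Variable names are illustrative.

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 18552860,
                        'Illinois': 12882135, 'Montana': 9999999})
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995, 'Nevada': 999999})

# Labels present in both Series
shared = population.index.intersection(area.index)

# Labels present in one Series but not the other
only_population = population.index.difference(area.index)
only_area = area.index.difference(population.index)

print(shared)
print(only_population)
print(only_area)
```

The labels in `only_population` and `only_area` are exactly the rows that end up with NaN when the two Series are combined into a DataFrame.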