# *pandas* Data Structures

*pandas* provides data structures and data manipulation tools designed to make data cleaning and analysis fast and easy in Python. It is often used in tandem with numerical computing tools like *NumPy* and *SciPy*, analytical libraries such as *statsmodels* and *scikit-learn*, and data visualisation libraries like *matplotlib*. Whereas *NumPy* is best suited to homogeneous numerical array data, *pandas* is designed for working with tabular or heterogeneous data.

To better appreciate the use of *pandas*, I will introduce its two workhorse data structures - *Series* and *DataFrame*.

## Series

A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its *index*. You can think of a Series as a fixed-length, ordered `dict`, as it is a mapping of index values to data values:

In [1]:
import pandas as pd   # we first need to import pandas to use its library
import numpy as np

# generating a simple series
obj = pd.Series([4, 7, -5, 3]) 
# 0    4
# 1    7
# 2   -5
# 3    3
# dtype: int64


# getting the values of a Series, in array form
obj.values 
# [ 4  7 -5  3]


# this returns a range of numbers
obj.index
# RangeIndex(start=0, stop=4, step=1)


# we can generate a series with predefined labels
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

### Labels

Unlike a NumPy array, a pandas Series lets you use the labels in its index to select a single value or a set of values:

In [2]:
# selecting value of label 'a'
obj2['a'] 
# -5


# selecting multiple values via list of corresponding labels
obj2[['c', 'a', 'd']]
# c    3
# a   -5
# d    4
# dtype: int64


# assigning values
obj2['d'] = 6
obj2

d    6
b    7
a   -5
c    3
dtype: int64

### Scalar multiplication, boolean indexing

Series also allows NumPy functions or operations such as filtering with a boolean condition and scalar multiplication:

In [3]:
# returning only values greater than zero in the series
obj2[obj2 > 0]  
# d    6
# b    7
# c    3
# dtype: int64

# multiply every value by 2
obj2 * 2 
# d    12
# b    14
# a   -10
# c     6
# dtype: int64

np.exp(obj2)  # calculate exponential of every value

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

### Similar to in-built `dict` in Python

Because a Series maps index values to data values, it can be used in many contexts where you might use a dict:

In [4]:
'b' in obj2   # check for b index/label in obj2
# True


# creating a Series from a defined dict
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3
# Ohio      35000
# Texas     71000
# Oregon    16000
# Utah       5000
# dtype: int64


# override order of how keys would appear in the Series
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In the above example, no value for `'California'` was found in `sdata`, so it appears as `NaN` (not a number), which pandas uses to mark missing values. Conversely, `obj3` includes the value for `'Utah'` but `obj4` does not, because `'Utah'` was not in the `states` list passed as the index.

`pandas` also provides functions which can easily detect missing values in a Series:

In [5]:
pd.isnull(obj4)

# the instance method does the same as above
obj4.isnull()
# California     True
# Ohio          False
# Oregon        False
# Texas         False
# dtype: bool


# the opposite of isnull() is notnull()
obj4.notnull()

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
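
As a small, self-contained sketch of how these boolean masks are typically used, the snippet below rebuilds a Series like `obj4` and uses the masks to count and filter missing entries. The names `n_missing` and `present` are illustrative only.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)

# summing a boolean mask counts its True entries, i.e. the missing values
n_missing = int(obj4.isnull().sum())

# boolean indexing with notnull() keeps only the labels that have data
present = obj4[obj4.notnull()]

print(n_missing)
print(present)
```

Here only `'California'` is missing, so the filtered Series keeps the other three labels.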

### Arithmetic operations on Series

A useful Series feature is that it automatically aligns by index label in arithmetic operations:

In [6]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
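
Labels that appear in only one of the two operands have nothing to be added to, which is why they come out as `NaN`. The sketch below uses two small illustrative Series (`s1` and `s2` are made-up names and values) to contrast the `+` operator with the `add` method, whose `fill_value` argument is one way to treat a label missing from one side as a given value instead.

```python
import pandas as pd

s1 = pd.Series({'Ohio': 35000, 'Texas': 71000, 'Utah': 5000})
s2 = pd.Series({'California': 10000, 'Ohio': 35000, 'Texas': 71000})

# plain addition: labels found in only one Series become NaN
plain = s1 + s2

# add() with fill_value=0: a label missing from one side is treated as 0
filled = s1.add(s2, fill_value=0)

print(plain)
print(filled)
```

Whether filling with zero is appropriate depends on what the data means; it is shown here only to make the alignment behaviour visible.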

## DataFrame

A DataFrame represents a table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dict of Series all sharing the same index. Internally, the data is stored as one or more two-dimensional blocks rather than a list, dict or some other collection of one-dimensional arrays.

To construct a DataFrame:

In [7]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame

|  | state | year | pop |
| --- | --- | --- | --- |
| 0 | Ohio | 2000 | 1.5 |
| 1 | Ohio | 2001 | 1.7 |
| 2 | Ohio | 2002 | 3.6 |
| 3 | Nevada | 2001 | 2.4 |
| 4 | Nevada | 2002 | 2.9 |
| 5 | Nevada | 2003 | 3.2 |


***Note***: *in Jupyter notebook, `pandas` DataFrame objects will be displayed as a more browser-friendly HTML table as an output.*

For larger DataFrames, you can use the `head` method, which returns only the first five rows by default:

In [8]:
frame.head()

|  | state | year | pop |
| --- | --- | --- | --- |
| 0 | Ohio | 2000 | 1.5 |
| 1 | Ohio | 2001 | 1.7 |
| 2 | Ohio | 2002 | 3.6 |
| 3 | Nevada | 2001 | 2.4 |
| 4 | Nevada | 2002 | 2.9 |
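
The number of rows can also be chosen explicitly. As a brief sketch, `head` accepts a row count, and `tail` is its counterpart for the end of the table; the variable names below are illustrative.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

first_three = frame.head(3)   # the first three rows
last_two = frame.tail(2)      # the last two rows

print(first_three)
print(last_two)
```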


Specifying a sequence of columns arranges the DataFrame's columns in that order. You can also override the default index labels:

In [9]:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])
frame2

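|  | year | state | pop | debt |
| --- | --- | --- | --- | --- |
| one | 2000 | Ohio | 1.5 | NaN |
| two | 2001 | Ohio | 1.7 | NaN |
| three | 2002 | Ohio | 3.6 | NaN |
| four | 2001 | Nevada | 2.4 | NaN |
| five | 2002 | Nevada | 2.9 | NaN |
| six | 2003 | Nevada | 3.2 | NaN |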


Notice in the above example that if you pass a column that isn't found in the `dict` data, the corresponding values for that column will be initialised with `NaN` values in the DataFrame. 

Another common form of data passed to a DataFrame is a nested dict of dicts:

In [10]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

frame3 = pd.DataFrame(pop)
frame3

|  | Nevada | Ohio |
| --- | --- | --- |
| 2001 | 2.4 | 1.7 |
| 2002 | 2.9 | 3.6 |
| 2000 | NaN | 1.5 |


In the above example, `pandas` interprets the outer dict keys as the column labels and the inner keys as the row indices. The table below lists possible data inputs to the DataFrame constructor:

**Type** | **Notes**
--- | ---
2D ndarray | A matrix of data, passing optional row and column labels
`dict` of arrays, lists or tuples | Each sequence becomes a column in the DataFrame; all sequences must be the same length
NumPy structured/record array | Treated as the "dict of arrays" case
`dict` of Series | Each value becomes a column; indexes from each Series are unioned together to form the result's row index if no explicit index is passed.
`dict` of `dicts` |  Each inner dict becomes a column; keys are unioned to form the row index as in the "dict of Series" case
List of `dicts` or Series | Each item becomes a row in the DataFrame; union of dict keys or Series indexes become the DataFrame's column labels
List of lists or tuples | Treated as the "2D ndarray" case
Another DataFrame | The DataFrame's indexes are used unless different ones are passed
NumPy MaskedArray | Like the "2D ndarray" case except masked values become NA/missing in the DataFrame result
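
To make a few rows of the table concrete, here is a minimal sketch that builds DataFrames from a 2D ndarray, a list of dicts and a dict of Series. All labels and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# 2D ndarray, with optional row and column labels
from_array = pd.DataFrame(np.arange(6).reshape((2, 3)),
                          index=['r1', 'r2'],
                          columns=['a', 'b', 'c'])

# list of dicts: each dict becomes a row; the keys are unioned into the columns
from_records = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'c': 4}])

# dict of Series: each Series becomes a column; their indexes are unioned
from_series = pd.DataFrame({'x': pd.Series([1, 2], index=['p', 'q']),
                            'y': pd.Series([3, 4], index=['q', 'r'])})

print(from_array)
print(from_records)
print(from_series)
```

In the last two cases, any cell with no corresponding input value is filled with `NaN`.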

### Indexing

A column in a DataFrame can be retrieved as a Series, which retains the column label as its name and the DataFrame's index labels:

In [11]:
frame2['state']
# one        Ohio
# two        Ohio
# three      Ohio
# four     Nevada
# five     Nevada
# six      Nevada
# Name: state, dtype: object

frame2.year   # works similarly but not encouraged

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

**Note**: `frame2[column]` in the above example works for any column name, but `frame2.column` only works when the column name is a valid Python variable name (recall the [chapter](https://github.com/colintanwh/python-basics/blob/master/variables.ipynb) on *Variables* in Python basics).
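
A further pitfall of attribute access, sketched below, is that a column name can collide with an existing DataFrame attribute. The `pop` column used in this chapter is one such case, since `pop` is also a DataFrame method. The `'debt level'` column is a hypothetical name added only to show a label that is not a valid identifier.

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'],
                      'pop': [1.5, 2.4],
                      'debt level': [0.0, 1.0]})   # hypothetical column with a space

# bracket access works for any column label
pop_col = frame['pop']
debt_col = frame['debt level']

# 'pop' is also the name of a DataFrame method, so attribute access
# returns that method rather than the column
attr = frame.pop
print(callable(attr))
print(type(pop_col).__name__)
```

This is one reason bracket access is the safer habit.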

### Assigning values to existing DataFrame

Columns can be modified by assignment. For example, the empty *debt* column could be assigned a scalar value, an array of values or even a Series. Note that the column returned by indexing may be a *view* on the underlying data rather than a copy, depending on the pandas version and its Copy-on-Write setting; where it is a view, in-place modifications to the Series are reflected in the DataFrame:

In [12]:
# assigns all 'debt' values to 16.5
frame2['debt'] = 16.5  
print(f"{frame2}\n")

# assigns an array of floats from 0.0 to 5.0 to 'debt'
frame2['debt'] = np.arange(6.)
print(f"{frame2}\n")

# assigns a Series of values to 'debt'
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
print(f"{frame2}\n")

       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
six    2003  Nevada  3.2  16.5

       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2002  Nevada  2.9   4.0
six    2003  Nevada  3.2   5.0

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN



In the example above, assigning a Series will have its labels realigned exactly to the DataFrame's index, inserting missing values in any gaps.
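
Returning to the earlier point about views and copies: whether a selected column shares memory with its DataFrame can depend on the pandas version and configuration. When an independent Series is needed, an explicit `copy` is unambiguous. The following is a minimal sketch with made-up data, not a statement about how any particular pandas release handles views.

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

# take an explicit, independent copy of the column
pop_copy = frame['pop'].copy()
pop_copy.iloc[0] = 99.0          # modifies the copy only

print(frame['pop'].tolist())     # the DataFrame keeps its original values
print(pop_copy.tolist())
```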

When assigning a column that doesn't exist, a new column will be created in the DataFrame:

In [13]:
frame2['eastern'] = frame2.state == 'Ohio'  # True or False
frame2

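|  | year | state | pop | debt | eastern |
| --- | --- | --- | --- | --- | --- |
| one | 2000 | Ohio | 1.5 | NaN | True |
| two | 2001 | Ohio | 1.7 | -1.2 | True |
| three | 2002 | Ohio | 3.6 | NaN | True |
| four | 2001 | Nevada | 2.4 | -1.5 | False |
| five | 2002 | Nevada | 2.9 | -1.7 | False |
| six | 2003 | Nevada | 3.2 | NaN | False |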


### Delete columns

The `del` keyword can be used to remove columns from a DataFrame, just as it removes keys from a dict:

In [14]:
del frame2['eastern']
frame2

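|  | year | state | pop | debt |
| --- | --- | --- | --- | --- |
| one | 2000 | Ohio | 1.5 | NaN |
| two | 2001 | Ohio | 1.7 | -1.2 |
| three | 2002 | Ohio | 3.6 | NaN |
| four | 2001 | Nevada | 2.4 | -1.5 |
| five | 2002 | Nevada | 2.9 | -1.7 |
| six | 2003 | Nevada | 3.2 | NaN |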


### Index Objects

`pandas`'s Index objects are responsible for holding the axis labels and other metadata. Any array or other sequence of labels you use when constructing a Series or DataFrame is internally converted to an Index. Index objects are immutable and thus can't be modified by the user:

In [15]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index
# Index(['a', 'b', 'c'], dtype='object')

labels = pd.Index(np.arange(3))  # creating index objects from pandas

obj2 = pd.Series([1.5, -2.5, 0], index=labels)  # generate series with specified index objects passed in
obj2

0    1.5
1   -2.5
2    0.0
dtype: float64
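
The immutability mentioned above can be checked directly. In this short sketch, assigning to a single position of an Index raises a `TypeError`, while replacing the whole index of a Series is allowed. The `raised` flag is only there to record what happened.

```python
import pandas as pd

obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index

raised = False
try:
    index[1] = 'd'               # item assignment is not supported on an Index
except TypeError as exc:
    raised = True
    print(f"TypeError: {exc}")

# to change labels, assign a complete new index to the Series instead
obj.index = ['x', 'y', 'z']
print(obj.index.tolist())
```

Immutability makes it safer to share one Index object between several data structures.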

A `pandas` Index can contain duplicate labels. Selections with duplicate labels will select all occurrences of that label:

In [16]:
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
dup_labels

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
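
To illustrate selection with duplicate labels, the sketch below uses a small made-up Series: selecting a repeated label returns a Series holding every match, whereas selecting a unique label returns a single value.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['foo', 'foo', 'bar'])

both = s['foo']     # repeated label: a Series with both matching entries
single = s['bar']   # unique label: a scalar value

print(both)
print(single)
print(s.index.is_unique)
```

Because the return type differs between the two cases, code that indexes by label may need to check `is_unique` first.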

Index objects have a number of methods and properties for set logic. Some useful ones are summarised below:

**Method** | **Description**
--- | ---
`append` | Concatenate with additional index objects, producing a new Index 
`difference` | Compute set difference as an Index
`intersection` | Compute set intersection
`union` | Compute set union
`isin` | Compute boolean array indicating if each value is contained in the passed collection
`delete` | Compute new Index with element at index `i` deleted
`drop` | Compute new Index by deleting passed values
`insert` | Compute new Index by inserting element at index `i`
`is_monotonic_increasing` | Returns `True` if each element is greater than or equal to the previous element (named `is_monotonic` in older pandas releases)
`is_unique` | Returns `True` if the index has no duplicate values
`unique` | Compute the array of unique values in the index
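
A few of the methods in the table are exercised in the sketch below, using two small illustrative Index objects (`left` and `right` are made-up names). Each call returns a new object and leaves the original Index unchanged.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

union = left.union(right)             # labels found in either Index
common = left.intersection(right)     # labels found in both
only_left = left.difference(right)    # labels in left but not in right
mask = left.isin(['a', 'c'])          # boolean array, one entry per label
extended = left.insert(1, 'z')        # new Index with 'z' at position 1
shortened = left.delete(0)            # new Index without the element at position 0
dropped = left.drop(['b'])            # new Index without the label 'b'

print(union, common, only_left, sep='\n')
print(mask)
print(extended, shortened, dropped, sep='\n')
```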

## Essential Functionality

In the following sections, we will delve more deeply into data analysis and manipulation topics using `pandas`. 

### Reindexing

`reindex` helps create a new object with the data conformed to a new index. Calling `reindex` will rearrange the data according to the new index, introducing missing values if any index values were not already present. Consider this example below:

In [17]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
# d    4.5
# b    7.2
# a   -5.3
# c    3.6
# dtype: float64


obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

For ordered data like time series, it may be desirable to fill in values when reindexing. The `method` option allows this; for example, `method='ffill'` forward-fills each new label with the last preceding value:

In [18]:
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3 = obj3.reindex(range(6), method='ffill')
obj3

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
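
Forward filling is only one option. As a hedged sketch on the same kind of data, `method='bfill'` fills each new label from the next available value instead, and `fill_value` substitutes a constant. The placeholder string `'unknown'` and the variable names are illustrative choices.

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# backward fill: each new label takes the next known value;
# label 5 has no later value, so it stays missing
back = colors.reindex(range(6), method='bfill')

# constant fill: every new label gets the same placeholder
const = colors.reindex(range(6), fill_value='unknown')

print(back)
print(const)
```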

When passed only a sequence, the `reindex` method reindexes the rows. The columns can be reindexed with the `columns` keyword:

In [19]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
print(frame)

states = ['Texas', 'Utah', 'California']
frame = frame.reindex(['a', 'b', 'c', 'd'], columns=states)
frame

   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8


|  | Texas | Utah | California |
| --- | --- | --- | --- |
| a | 1.0 | NaN | 2.0 |
| b | NaN | NaN | NaN |
| c | 4.0 | NaN | 5.0 |
| d | 7.0 | NaN | 8.0 |


### Dropping Entries

The `drop` method will return a new object with the indicated value or values deleted from an axis: