# Data Structures
In __pandas__, there are two workhorse data structures: _Series_ and _DataFrame_.

## Series
A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels, called its _index_. The __default__ index runs from 0 to _N-1_, where N is the number of elements in the data.

In [1]:
import pandas as pd

In [2]:
arr = pd.Series([-2, 90, 3, 4], index = ['a', 'd', 'b', 'c'])
arr

a    -2
d    90
b     3
c     4
dtype: int64

In [3]:
arr.values

array([-2, 90,  3,  4])

In [4]:
arr.index

Index(['a', 'd', 'b', 'c'], dtype='object')
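
For contrast with the explicit labels above, here is a minimal sketch of the default index that pandas assigns when no `index` argument is given. The name `default_indexed` and its values are made up for illustration.

```python
import pandas as pd

# No index argument: pandas labels the elements 0 .. N-1.
default_indexed = pd.Series([10, 20, 30])

print(list(default_indexed.index))  # [0, 1, 2]
print(default_indexed.loc[2])       # 30, looked up by the label 2
```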

If you think of a _Series_ as a dictionary mapping index labels to values, then the following expression makes sense:

In [5]:
'b' in arr

True
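
The dictionary analogy can be pushed a little further. The sketch below, using a hypothetical `scores` Series, builds a Series from a dict and shows that membership tests look at the labels rather than the values.

```python
import pandas as pd

# Dictionary keys become the index labels.
scores = pd.Series({"a": -2, "d": 90, "b": 3})

print("b" in scores)       # True: 'b' is a label
print(90 in scores)        # False: 90 is a value, not a label
print(scores.get("z", 0))  # 0: dict-style lookup with a default
print(scores.to_dict())    # {'a': -2, 'd': 90, 'b': 3}
```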

A Series automatically aligns differently indexed data in arithmetic operations; a label that appears in only one operand produces NaN.

In [6]:
population1 = pd.Series([500, 45, 345, 55], index=['Nanjing', "Madison", "San Diego", "Twin City"])
population2 = pd.Series([1000, 900, 23, 1500], index=['BeiJing', 'Nanjing', 'Madison', "Shanghai"])

In [7]:
population1

Nanjing      500
Madison       45
San Diego    345
Twin City     55
dtype: int64

In [8]:
population2

BeiJing     1000
Nanjing      900
Madison       23
Shanghai    1500
dtype: int64

In [9]:
population1 + population2

BeiJing         NaN
Madison        68.0
Nanjing      1400.0
San Diego       NaN
Shanghai        NaN
Twin City       NaN
dtype: float64

To substitute a fill value for labels that are missing from one of the operands, use the __add__ method of the _Series_ or _DataFrame_ with the `fill_value` argument.

In [43]:
population1.add(population2, fill_value = 0)

BeiJing      1000.0
Madison        68.0
Nanjing      1400.0
San Diego     345.0
Shanghai     1500.0
Twin City      55.0
dtype: float64

In [10]:
population1.name = "Population"
population1.index.name = "City"

In [11]:
population1

City
Nanjing      500
Madison       45
San Diego    345
Twin City     55
Name: Population, dtype: int64

## DataFrame
A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can have a different value type.

In [12]:
data = {'city': ['Nanjing', 'Shenzhen', 'Madison', 'Seattle', "Santa Cruz"],
        'year': [2007, 2011, 2015, 2017, 2013]}
frame = pd.DataFrame(data)

In [13]:
frame

| | city | year |
|---|---|---|
| 0 | Nanjing | 2007 |
| 1 | Shenzhen | 2011 |
| 2 | Madison | 2015 |
| 3 | Seattle | 2017 |
| 4 | Santa Cruz | 2013 |


In [14]:
frame.columns

Index(['city', 'year'], dtype='object')

In [15]:
frame['city']

0       Nanjing
1      Shenzhen
2       Madison
3       Seattle
4    Santa Cruz
Name: city, dtype: object

One can also access a column as an attribute of the data frame.

In [16]:
frame.city

0       Nanjing
1      Shenzhen
2       Madison
3       Seattle
4    Santa Cruz
Name: city, dtype: object

In [17]:
# Select a row by its index label (.ix was removed in pandas 1.0; use .loc)
frame.loc[3]

city    Seattle
year       2017
Name: 3, dtype: object

In [18]:
frame.values

array([['Nanjing', 2007],
       ['Shenzhen', 2011],
       ['Madison', 2015],
       ['Seattle', 2017],
       ['Santa Cruz', 2013]], dtype=object)

To check whether the index contains duplicated labels, use its __is_unique__ attribute.

In [57]:
frame.index.is_unique

True
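
As a hedged illustration of what a non-unique index looks like, the sketch below uses a made-up Series `dup` with a repeated label. Selecting a duplicated label returns a Series, while a unique label returns a scalar.

```python
import pandas as pd

# Hypothetical Series whose label 'a' appears twice.
dup = pd.Series([1, 2, 3], index=["a", "a", "b"])

print(dup.index.is_unique)              # False
print(dup["a"].tolist())                # [1, 2]: a Series of both matches
print(dup["b"])                         # 3: a scalar
print(dup.index.duplicated().tolist())  # [False, True, False]
```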

For arithmetic operations between a DataFrame and a Series, by default the index of the Series is __matched on__ the columns of the DataFrame, <font color='blue'>broadcasting down the rows</font>.

In [44]:
import numpy as np
frame = pd.DataFrame(np.arange(15).reshape((5, 3)), columns=['CA', 'WA', 'TX'], index=range(5))
series = frame.loc[0]  # first row by label; .ix was removed in pandas 1.0

In [45]:
frame

| | CA | WA | TX |
|---|---|---|---|
| 0 | 0 | 1 | 2 |
| 1 | 3 | 4 | 5 |
| 2 | 6 | 7 | 8 |
| 3 | 9 | 10 | 11 |
| 4 | 12 | 13 | 14 |


In [46]:
series

CA    0
WA    1
TX    2
Name: 0, dtype: int64

In [47]:
frame - series

| | CA | WA | TX |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 1 | 3 | 3 | 3 |
| 2 | 6 | 6 | 6 |
| 3 | 9 | 9 | 9 |
| 4 | 12 | 12 | 12 |


To broadcast over the columns instead, matching on the row index, one of the arithmetic methods is __needed__, with `axis=0`.

In [48]:
series = frame['CA']

In [49]:
series

0     0
1     3
2     6
3     9
4    12
Name: CA, dtype: int64

In [50]:
frame.sub(series, axis = 0)

| | CA | WA | TX |
|---|---|---|---|
| 0 | 0 | 1 | 2 |
| 1 | 0 | 1 | 2 |
| 2 | 0 | 1 | 2 |
| 3 | 0 | 1 | 2 |
| 4 | 0 | 1 | 2 |
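
A common use of the two broadcasting directions is centering data. The following is a minimal sketch on a small made-up frame (the names `col_centered` and `row_centered` are illustrative): subtracting the column means relies on the default matching on columns, while subtracting the row means needs `axis=0`.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6).reshape((2, 3)), columns=["CA", "WA", "TX"])

# frame.mean() is indexed by column name, so the default alignment applies.
col_centered = frame - frame.mean()

# frame.mean(axis=1) is indexed by row label, so match on the row index.
row_centered = frame.sub(frame.mean(axis=1), axis=0)

print(col_centered)
print(row_centered)
```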


## Basic Functionality
Several pieces of essential functionality are covered here, including:
* Indexing
* Arithmetic calculations
* Sorting and ranking
* Function application

### Reindexing
__reindex__ is an important method that creates a new object with the data conformed to a new index.

In [19]:
obj = pd.Series([4.5, 6.4, 3, 9.78], index = ['a', 'b', 'c', 'd'])

In [20]:
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value = 0)

a    4.50
b    6.40
c    3.00
d    9.78
e    0.00
dtype: float64

In [21]:
obj.reindex(list('abcde'), method = 'ffill')

a    4.50
b    6.40
c    3.00
d    9.78
e    9.78
dtype: float64

__reindex__ also works on a DataFrame, for rows, columns, or both.

In [22]:
import numpy as np
data = pd.DataFrame(np.arange(9).reshape((3,3)), index = [1,2,3], columns = ['CA', 'OH', 'WI'])

In [23]:
data

| | CA | OH | WI |
|---|---|---|---|
| 1 | 0 | 1 | 2 |
| 2 | 3 | 4 | 5 |
| 3 | 6 | 7 | 8 |


In [24]:
data.reindex(index = range(1, 5), columns = ['CA', 'NY', 'WA'])

| | CA | NY | WA |
|---|---|---|---|
| 1 | 0.0 | NaN | NaN |
| 2 | 3.0 | NaN | NaN |
| 3 | 6.0 | NaN | NaN |
| 4 | NaN | NaN | NaN |


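Older pandas releases offered a more succinct way to reindex with __ix__. That indexer was removed in pandas 1.0, and its replacement __loc__ raises a `KeyError` when a requested label is missing, so chained __reindex__ calls are used here instead.

In [25]:
data.reindex(range(1, 5)).reindex(columns=['CA', 'NY', 'WA'])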

| | CA | NY | WA |
|---|---|---|---|
| 1 | 0.0 | NaN | NaN |
| 2 | 3.0 | NaN | NaN |
| 3 | 6.0 | NaN | NaN |
| 4 | NaN | NaN | NaN |


### Indexing, Selection and filtering
A _Series_ can be indexed much like a NumPy array, and its __index__ labels can also be used for selection. However, slicing with labels __includes the end point__, <font color='red'>which is different from NumPy</font>.

In [29]:
obj[:2]

a    4.5
b    6.4
dtype: float64

In [30]:
obj['a']

4.5

In [31]:
obj['a':'c']

a    4.5
b    6.4
c    3.0
dtype: float64
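
To make the end-point difference explicit, this short sketch rebuilds the same `obj` Series and slices it once by label and once by position; `by_label` and `by_position` are illustrative names.

```python
import pandas as pd

obj = pd.Series([4.5, 6.4, 3, 9.78], index=["a", "b", "c", "d"])

by_label = obj.loc["a":"c"]   # label slice: the end label 'c' is included
by_position = obj.iloc[0:2]   # positional slice: position 2 is excluded

print(list(by_label.index))     # ['a', 'b', 'c']
print(list(by_position.index))  # ['a', 'b']
```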

For a DataFrame, one can directly retrieve one or more columns by passing a single label or a sequence of labels.

In [32]:
data['CA']

1    0
2    3
3    6
Name: CA, dtype: int64

There are some special cases in the DataFrame slicing or indexing:
* By slicing or a boolean array one can select __rows__.

In [33]:
data[:2]

| | CA | OH | WI |
|---|---|---|---|
| 1 | 0 | 1 | 2 |
| 2 | 3 | 4 | 5 |


In [34]:
data[data['WI'] > 2]

| | CA | OH | WI |
|---|---|---|---|
| 2 | 3 | 4 | 5 |
| 3 | 6 | 7 | 8 |


* Indexing with a boolean DataFrame

In [35]:
data < 4

| | CA | OH | WI |
|---|---|---|---|
| 1 | True | True | True |
| 2 | True | False | False |
| 3 | False | False | False |
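
A boolean DataFrame such as `data < 4` is typically used as a mask. The sketch below, which rebuilds `data` so that it runs on its own, shows two common patterns: selecting with the mask (non-matching cells become NaN) and assigning through it. The names `masked` and `zeroed` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(9).reshape((3, 3)),
                    index=[1, 2, 3], columns=["CA", "OH", "WI"])

# Selection: cells where the mask is False are replaced by NaN.
masked = data[data < 4]

# Assignment: set every cell below 4 to zero on a copy of the data.
zeroed = data.copy()
zeroed[zeroed < 4] = 0

print(masked)
print(zeroed)
```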


The label-based indexer __loc__ can be used for row selection in a DataFrame (the older __ix__ indexer was removed in pandas 1.0).

In [37]:
data.loc[2]

CA    3
OH    4
WI    5
Name: 2, dtype: int64

In [42]:
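# .loc raises a KeyError for the missing row label 0, so reindex is used to get a NaN row instead
data.reindex(index=range(2), columns=['CA', 'WI'])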

| | CA | WI |
|---|---|---|
| 0 | NaN | NaN |
| 1 | 0.0 | 2.0 |
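
In current pandas releases, row selection is split between the label-based `loc` and the position-based `iloc` indexers. The following sketch rebuilds `data` and contrasts the two; because the row labels are the integers 1 to 3, the same number selects different rows. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(9).reshape((3, 3)),
                    index=[1, 2, 3], columns=["CA", "OH", "WI"])

row_by_label = data.loc[2]      # the row whose label is 2
row_by_position = data.iloc[2]  # the third row (label 3)
subset = data.loc[[1, 3], ["CA", "WI"]]  # rows and columns by label

print(row_by_label.tolist())     # [3, 4, 5]
print(row_by_position.tolist())  # [6, 7, 8]
print(subset)
```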


### Drop entries from an axis
The __drop__ method returns a new object rather than a _view_, and it can delete entries from either axis of a DataFrame.

In [27]:
data.drop('CA', axis = 1)

| | OH | WI |
|---|---|---|
| 1 | 1 | 2 |
| 2 | 4 | 5 |
| 3 | 7 | 8 |


In [28]:
data.drop(1, axis = 0)

| | CA | OH | WI |
|---|---|---|---|
| 2 | 3 | 4 | 5 |
| 3 | 6 | 7 | 8 |
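
Because `drop` returns a new object, the original frame is left untouched. A minimal sketch (rebuilding `data`, with illustrative result names) that also shows the `columns=` keyword form and dropping several rows at once:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(9).reshape((3, 3)),
                    index=[1, 2, 3], columns=["CA", "OH", "WI"])

without_ca = data.drop("CA", axis=1)
same_result = data.drop(columns=["CA"])  # keyword form of the call above
two_rows_gone = data.drop([1, 3])        # drop several row labels

print(list(without_ca.columns))   # ['OH', 'WI']
print(list(data.columns))         # ['CA', 'OH', 'WI']: original unchanged
print(list(two_rows_gone.index))  # [2]
```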


### Sorting and ranking
__`sort_index`__ sorts by the row or column index and returns a <font color='green'>new, sorted</font> object.

In [51]:
data.sort_index(axis = 1)

| | CA | OH | WI |
|---|---|---|---|
| 1 | 0 | 1 | 2 |
| 2 | 3 | 4 | 5 |
| 3 | 6 | 7 | 8 |
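
Since `data` is already in sorted order, the effect is easier to see on a frame whose labels are shuffled. The frame `unsorted` below is a made-up example, not part of the original notebook.

```python
import numpy as np
import pandas as pd

unsorted = pd.DataFrame(np.arange(6).reshape((2, 3)),
                        index=[2, 1], columns=["WI", "CA", "OH"])

by_rows = unsorted.sort_index()                          # sort row labels
by_cols = unsorted.sort_index(axis=1)                    # sort column labels
by_cols_desc = unsorted.sort_index(axis=1, ascending=False)

print(list(by_rows.index))         # [1, 2]
print(list(by_cols.columns))       # ['CA', 'OH', 'WI']
print(list(by_cols_desc.columns))  # ['WI', 'OH', 'CA']
print(list(unsorted.index))        # [2, 1]: the original is unchanged
```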


The __`sort_values`__ method sorts a Series by its values.

In [54]:
population1.sort_values()

City
Madison       45
Twin City     55
San Diego    345
Nanjing      500
Name: Population, dtype: int64

One can sort by multiple columns by passing a list of column names to __by__ in __sort_values__.

In [55]:
data.sort_values(by = ['CA', 'WI'])

| | CA | OH | WI |
|---|---|---|---|
| 1 | 0 | 1 | 2 |
| 2 | 3 | 4 | 5 |
| 3 | 6 | 7 | 8 |


_Ranking_ is similar to __numpy.argsort__, except that ties are broken according to a rule; by default, tied values each receive the average of their ranks.

In [56]:
data.rank()

| | CA | OH | WI |
|---|---|---|---|
| 1 | 1.0 | 1.0 | 1.0 |
| 2 | 2.0 | 2.0 | 2.0 |
| 3 | 3.0 | 3.0 | 3.0 |
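
The columns of `data` contain no ties, so the tie-breaking rule never comes into play there. The sketch below uses a hypothetical Series `ties` with a repeated value to compare three of the `method` options of `rank`.

```python
import pandas as pd

ties = pd.Series([7, 7, 2, 9])

average_rank = ties.rank()              # default: tied values share the mean rank
min_rank = ties.rank(method="min")      # tied values share the lowest rank
first_rank = ties.rank(method="first")  # ties ranked in order of appearance

print(average_rank.tolist())  # [2.5, 2.5, 1.0, 4.0]
print(min_rank.tolist())      # [2.0, 2.0, 1.0, 4.0]
print(first_rank.tolist())    # [2.0, 3.0, 1.0, 4.0]
```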


## Summary Statistics
By default, the summary and descriptive statistics in pandas exclude missing data.

In [58]:
data.sum()

CA     9
OH    12
WI    15
dtype: int64

In [59]:
data.describe()

| | CA | OH | WI |
|---|---|---|---|
| count | 3.0 | 3.0 | 3.0 |
| mean | 3.0 | 4.0 | 5.0 |
| std | 3.0 | 3.0 | 3.0 |
| min | 0.0 | 1.0 | 2.0 |
| 25% | 1.5 | 2.5 | 3.5 |
| 50% | 3.0 | 4.0 | 5.0 |
| 75% | 4.5 | 5.5 | 6.5 |
| max | 6.0 | 7.0 | 8.0 |
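
The frame `data` has no missing values, so the default exclusion is not visible in the outputs above. A small sketch with a made-up Series `with_gap` shows the default behaviour and how `skipna=False` turns it off.

```python
import numpy as np
import pandas as pd

with_gap = pd.Series([1.0, np.nan, 3.0])

total = with_gap.sum()                     # NaN is skipped
average = with_gap.mean()                  # mean of the two present values
strict_total = with_gap.sum(skipna=False)  # NaN propagates

print(total)         # 4.0
print(average)       # 2.0
print(strict_total)  # nan
```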


### Correlation and Covariance
The data structures in pandas have instance methods that compute correlation and covariance, either between a pair of Series or across all pairs of columns in a DataFrame.

In [60]:
population1.corr(population2)

1.0

In [61]:
data.corr()

| | CA | OH | WI |
|---|---|---|---|
| CA | 1.0 | 1.0 | 1.0 |
| OH | 1.0 | 1.0 | 1.0 |
| WI | 1.0 | 1.0 | 1.0 |


In [62]:
data.cov()

| | CA | OH | WI |
|---|---|---|---|
| CA | 9.0 | 9.0 | 9.0 |
| OH | 9.0 | 9.0 | 9.0 |
| WI | 9.0 | 9.0 | 9.0 |
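
When two Series are correlated, pandas first aligns them on their index labels and uses only the labels present in both. The sketch below illustrates this with two made-up Series, `x` and `y`, that overlap on three labels.

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0], index=list("abcd"))
y = pd.Series([8.0, 6.0, 4.0], index=list("bcd"))

# Only b, c and d are shared: x = (2, 3, 4) against y = (8, 6, 4).
r = x.corr(y)  # perfectly negative linear relationship
c = x.cov(y)   # sample covariance (divides by n - 1)

print(round(r, 6))  # -1.0
print(round(c, 6))  # -2.0
```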


### Unique values, value counts
For a one-dimensional Series, one may be interested in its unique values or in how many times each value occurs. Both are very common questions for categorical data in data analysis.

In [63]:
population2.unique()

array([1000,  900,   23, 1500])

In [64]:
data['CA'].value_counts()

3    1
6    1
0    1
Name: CA, dtype: int64
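
Every value in the examples above occurs exactly once, so the counts are not very informative. The following sketch uses a hypothetical categorical Series `colors` with repeated values.

```python
import pandas as pd

colors = pd.Series(["red", "blue", "red", "green", "red", "blue"])

distinct = list(colors.unique())  # unique values in order of first appearance
counts = colors.value_counts()    # occurrences, most frequent first
n_distinct = colors.nunique()     # number of distinct values

print(distinct)          # ['red', 'blue', 'green']
print(counts.to_dict())  # {'red': 3, 'blue': 2, 'green': 1}
print(n_distinct)        # 3
```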

## Missing data handling
In pandas, the floating-point value __NaN__ represents missing data in both floating-point and non-floating-point arrays. Furthermore, the built-in Python __None__ value is treated as NA as well.
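
Missing entries can be detected with the `isna` method. This is a minimal sketch with two made-up Series, one of strings and one of floats, showing that both `None` and `NaN` are reported as missing.

```python
import numpy as np
import pandas as pd

labels = pd.Series(["x", None, "z"])     # None inside non-numeric data
numbers = pd.Series([1.0, np.nan, None])  # None is converted to NaN here

print(labels.isna().tolist())   # [False, True, False]
print(numbers.isna().tolist())  # [False, True, True]
print(numbers.dtype)            # float64
```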

### Filtering out missing data
__dropna__ is the method to use for filtering out missing data.
* In Series, it drops all missing values and returns a new Series.

In [65]:
data1 = pd.Series([1, np.nan, 4.04, -3.2, np.nan, 9])

In [66]:
data1.dropna()

0    1.00
2    4.04
3   -3.20
5    9.00
dtype: float64

* In DataFrame, by default __dropna__ drops any row containing a missing value.

In [68]:
data2 = pd.DataFrame([[4.23, -34.4, 0, 0.03], [np.nan, 9.03, 12.3, -9.3], [5, 3, 4, 9]])

In [69]:
data2

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 4.23 | -34.4 | 0.0 | 0.03 |
| 1 | NaN | 9.03 | 12.3 | -9.3 |
| 2 | 5.0 | 3.0 | 4.0 | 9.0 |


In [70]:
data2.dropna()

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 4.23 | -34.4 | 0.0 | 0.03 |
| 2 | 5.0 | 3.0 | 4.0 | 9.0 |


To drop columns containing missing values instead, pass __axis=1__.

In [71]:
data2.dropna(axis = 1)

| | 1 | 2 | 3 |
|---|---|---|---|
| 0 | -34.4 | 0.0 | 0.03 |
| 1 | 9.03 | 12.3 | -9.3 |
| 2 | 3.0 | 4.0 | 9.0 |


One can also keep only the rows that have at least a given number of non-missing observations by passing the __thresh__ argument.
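
A short sketch of `thresh`, using a made-up frame `sparse` whose rows have three, two and one non-missing values respectively:

```python
import numpy as np
import pandas as pd

sparse = pd.DataFrame([[1.0, 2.0, 3.0],
                       [np.nan, 5.0, 6.0],
                       [np.nan, np.nan, 9.0]])

at_least_two = sparse.dropna(thresh=2)  # keep rows with >= 2 non-missing values
complete_only = sparse.dropna()         # default: drop rows with any missing value

print(list(at_least_two.index))   # [0, 1]
print(list(complete_only.index))  # [0]
```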

### Filling in missing data
The __fillna__ method fills missing values with a given value, and it can modify the object in place when the __inplace__ argument is set to `True`.

In [72]:
data2.fillna(0)

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 4.23 | -34.4 | 0.0 | 0.03 |
| 1 | 0.0 | 9.03 | 12.3 | -9.3 |
| 2 | 5.0 | 3.0 | 4.0 | 9.0 |
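
A single constant is not always the right fill value. The sketch below, on a hypothetical frame `gaps`, fills each column with its own value by passing a dict, and then fills with the column means by passing a Series. Neither call modifies `gaps`, because `inplace` is left at its default.

```python
import numpy as np
import pandas as pd

gaps = pd.DataFrame({"x": [1.0, np.nan, 3.0],
                     "y": [np.nan, 5.0, 6.0]})

per_column = gaps.fillna({"x": 0, "y": -1})  # a different constant per column
with_means = gaps.fillna(gaps.mean())        # each column's mean (NaN skipped)

print(per_column)
print(with_means)
```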
