# CME 193 - Lecture 6 - Pandas

Pandas is a package for working with tabular data.

# Pandas

[Pandas](https://pandas.pydata.org/) is a Python library for dealing with data.  The main thing you'll hear people talk about is the DataFrame object (inspired by R), which is designed to hold tabular data.

## Difference between a DataFrame and NumPy Array

Pandas DataFrames and NumPy arrays both have similarities to Python lists.
* NumPy arrays are designed to contain data of one type (e.g. int, float, ...)
* DataFrames can contain different types of data (int, float, string, ...)
    * Usually each column has the same type

Both arrays and DataFrames are optimized for storage and performance beyond what Python lists offer.
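
A minimal sketch of the dtype difference described above. The variable names and values are made up for illustration and are not part of the lecture data.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype: mixing ints and floats
# upcasts every element to a float.
arr = np.array([1, 2.5, 3])
print(arr.dtype)

# A DataFrame tracks one dtype per column, so columns of
# different types can sit side by side.
df = pd.DataFrame({
    "count": [1, 2, 3],
    "price": [0.5, 1.5, 2.5],
    "name": ["a", "b", "c"],
})
print(df.dtypes)
```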

Pandas is also well suited to handling missing data, working with time series, reading and writing data, and reshaping, grouping, and merging data.

## Key Features

* File I/O - integrations with multiple file formats
* Working with missing data (.dropna(), pd.isnull())
* Normal table operations: merging and joining, groupby functionality, reshaping via stack and pivot tables
* Time series-specific functionality:
    * date range generation and frequency conversion, moving window statistics/regressions, date shifting and lagging, etc.
* Built-in Matplotlib integration
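
As a small illustration of the missing-data helpers named in the list above, here is a sketch using `pd.isnull` and `.dropna()`. The Series `s_missing` is a hypothetical example, not lecture data.

```python
import numpy as np
import pandas as pd

# A Series with one missing entry (NaN).
s_missing = pd.Series([1.0, np.nan, 3.0])

# pd.isnull returns a boolean Series marking the missing entries.
mask = pd.isnull(s_missing)
print(mask)

# dropna returns a new Series without the missing entries;
# the original index labels of the surviving rows are kept.
cleaned = s_missing.dropna()
print(cleaned)
```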

## Other Strengths

* Strong community, support, and documentation
* Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
* Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
* Makes it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
* Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
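
The split-apply-combine idea mentioned in the list above can be sketched as follows. The column names `group` and `value` are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["x", "y", "x", "y"],
    "value": [1, 2, 3, 4],
})

# Aggregate: split by group, sum each group, combine into one row per group.
totals = df.groupby("group")["value"].sum()
print(totals)

# Transform: same length as the input, each value replaced by its group's mean.
group_means = df.groupby("group")["value"].transform("mean")
print(group_means)
```

Aggregation shrinks the data to one entry per group, while a transform keeps the original shape, which is convenient for adding a derived column.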

## Python/Pandas vs. R

* R is a language dedicated to statistics. Python is a general-purpose language with statistics modules.
* R has more statistical analysis features than Python, and specialized syntaxes.

However, when it comes to building complex analysis pipelines that mix statistics with e.g. image analysis, text mining, or control of a physical experiment, the richness of Python is an invaluable asset.

# Getting Started

[Here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) is a link to the documentation for DataFrames.

In [3]:
import pandas as pd
import numpy as np

## Objects and Basic Creation

| Name | Dimensions | Description  |
| ------:| -----------:|----------|
| ```pd.Series``` | 1 | 1D labeled homogeneously-typed array |
| ```pd.DataFrame```  | 2| General 2D labeled, size-mutable tabular structure |
| ```pd.Panel``` | 3|  General 3D labeled, also size-mutable array (deprecated in pandas 0.20 and removed in 0.25) |
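
A quick sketch of the first two rows of the table, using made-up values: a Series is 1D, a DataFrame is 2D, and selecting one column of a DataFrame gives back a Series.

```python
import pandas as pd

ser = pd.Series([10, 20, 30], index=["a", "b", "c"])
frame = pd.DataFrame(
    {"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]},
    index=["a", "b", "c"],
)

# ndim reports the number of dimensions of each object.
print(ser.ndim, frame.ndim)

# A single column of a DataFrame is a Series sharing the same index.
print(type(frame["x"]))
```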

# Series
## What are they?
- A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.
- Basic method to create a Series:
```python
s = pd.Series(data, index=index)
```
- `data` can be many things:
    * A Python dictionary
    * An ndarray (or a regular list)
    * A scalar
- The passed `index` is a list of axis labels; how it is interpreted depends on what `data` is

Think "Series = Vector + labels"

In [4]:
first_series = pd.Series([1,2,4,8,16,32,64])
print(type(first_series))
print(first_series)

<class 'pandas.core.series.Series'>
0     1
1     2
2     4
3     8
4    16
5    32
6    64
dtype: int64


In [8]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)
print('-'*50)
print(s.index)

a    0.285041
b   -0.375502
c   -0.029911
d    1.617153
e   -0.122967
dtype: float64
--------------------------------------------------
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')


If `data` is a dictionary and an index is passed, the values in `data` corresponding to the labels in the index are pulled out. Otherwise, the index is built from the dict's keys: in insertion order on current pandas (0.23+ with Python 3.6+), or from the sorted keys on older versions.

In [10]:
d = {'a': [0., 0], 'b': {'1':1.}, 'c':2.}
pd.Series(d)

a      [0.0, 0]
b    {'1': 1.0}
c             2
dtype: object
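
The cell above does not pass an index. As a minimal sketch of the other case, the snippet below passes an explicit `index` alongside a dict. The dict values and the label `'z'` are made up for illustration; a label that is absent from the dict comes back as NaN.

```python
import pandas as pd

d = {"a": 0.0, "b": 1.0, "c": 2.0}

# Only the labels listed in `index` are pulled out, in that order.
# 'z' is not a key of d, so its value is NaN.
picked = pd.Series(d, index=["b", "c", "z"])
print(picked)
```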

You can create a Series from a scalar, but you need to specify the index.

In [11]:
pd.Series(5, index=['a', 'b', 'c'])

a    5
b    5
c    5
dtype: int64

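You can index and slice a Series much like NumPy arrays or Python lists. For positional access on a Series with non-integer labels, prefer `s.iloc[0]`: plain `s[0]` is deprecated for that case in recent pandas (2.1+).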

In [47]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)

a    0.047532
b    1.117764
c   -0.926234
d    0.240809
e   -0.139864
dtype: float64


In [48]:
end_string = '\n' + '#'*50 + '\n'
print(s.iloc[0], end=end_string)  # first element by position
# slicing
print(s[:3], end=end_string)

0.04753222276252921
##################################################
a    0.047532
b    1.117764
c   -0.926234
dtype: float64
##################################################


In [49]:
end_string = '\n' + '-'*50 + '\n'
print(s.iloc[0], end=end_string)  # first element by position
# slicing
print(s[:3], end=end_string)

0.04753222276252921
--------------------------------------------------
a    0.047532
b    1.117764
c   -0.926234
dtype: float64
--------------------------------------------------
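
To make the distinction between positions and labels explicit, here is a sketch using `.iloc` (position-based) and `.loc` (label-based). The Series `s_demo` is a hypothetical example with fixed values so the results are easy to check.

```python
import pandas as pd

s_demo = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

first = s_demo.iloc[0]        # element at position 0
by_label = s_demo.loc["a"]    # element with label 'a'

# Positional slices exclude the end position, like Python lists.
pos_slice = s_demo.iloc[:2]

# Label slices include the end label.
label_slice = s_demo.loc["a":"c"]

print(first, by_label)
print(pos_slice)
print(label_slice)
```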


In [50]:
# Is each element of s greater than the mean of s?
s > s.mean()

a    False
b     True
c    False
d     True
e    False
dtype: bool

In [51]:
# boolean indexing - keep only the elements above the mean
print(s[s > s.mean()], end=end_string)
# elementwise function - vectorization
print(np.exp(s), end=end_string)

b    1.117764
d    0.240809
dtype: float64
--------------------------------------------------
a    1.048680
b    3.058009
c    0.396042
d    1.272278
e    0.869476
dtype: float64
--------------------------------------------------


Series are also like dictionaries - you can access values using index labels

In [52]:
print(s, end=end_string)
print(s['a'], end=end_string)

a    0.047532
b    1.117764
c   -0.926234
d    0.240809
e   -0.139864
dtype: float64
--------------------------------------------------
0.04753222276252921
--------------------------------------------------


In [53]:
s['e'] = 12  # set element using index label
print(s, end=end_string)
print('f' in s, end=end_string)  # check for index label
print(s.get('f', None), end=end_string)  # get item with index 'f' - if no such item return None
print(s.get('e', None), end=end_string)

a     0.047532
b     1.117764
c    -0.926234
d     0.240809
e    12.000000
dtype: float64
--------------------------------------------------
False
--------------------------------------------------
None
--------------------------------------------------
12.0
--------------------------------------------------


### Series Attributes:

- Get the index:
```python
s.index
```
- Get the values:
```python
s.values
```
- Find the shape:
```python
s.shape
```

In [54]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [55]:
s.values

array([ 0.04753222,  1.11776391, -0.92623426,  0.24080936, 12.        ])

In [56]:
s.shape

(5,)

### Series Iteration

In [57]:
# items() yields (label, value) pairs; iteritems() was removed in pandas 2.0
for idx, val in s.items():
    print(idx, val)

a 0.04753222276252921
b 1.1177639072831642
c -0.9262342605999548
d 0.2408093569053461
e 12.0


Sort by index or by value

In [58]:
print(s.sort_values(), end=end_string)
print(s.sort_index(), end=end_string)

c    -0.926234
a     0.047532
d     0.240809
b     1.117764
e    12.000000
dtype: float64
--------------------------------------------------
a     0.047532
b     1.117764
c    -0.926234
d     0.240809
e    12.000000
dtype: float64
--------------------------------------------------


Find counts of unique values

In [59]:
s = pd.Series([0,0,0,1,1,1,2,2,2,2])
sct = s.value_counts()
print(sct)

2    4
1    3
0    3
dtype: int64


You can do just about anything you can do with a NumPy array:

- Series.mean()
- Series.median()
- Series.mode()
- Series.nsmallest(num)
- Series.max() ...

In [61]:
print(s.min(), end=end_string)
print(s.max(), end=end_string)
print(s.mean(), end=end_string)
print(s.median(), end=end_string)
print(s.nsmallest(2), end=end_string)

0
--------------------------------------------------
2
--------------------------------------------------
1.1
--------------------------------------------------
1.0
--------------------------------------------------
0    0
1    0
dtype: int64
--------------------------------------------------
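
`Series.mode()` is listed above but not exercised in the cell. A short sketch, rebuilding the same values under the hypothetical name `s_stats`, also shows `nlargest` as the counterpart of `nsmallest`.

```python
import pandas as pd

s_stats = pd.Series([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])

# mode() returns a Series rather than a scalar, because several
# values can be tied for most frequent.
modes = s_stats.mode()
print(modes)

# nlargest(n) returns the n largest values with their index labels.
largest = s_stats.nlargest(3)
print(largest)
```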


## Exercise

- Consider the series `s` of letters in a sentence.
- What is the count of each letter in the sentence? Output a Series sorted by count.
- Create a list with only the 5 most common letters (not including the space character).

In [63]:
s = pd.Series(list("Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index."))

In [64]:
s

0      S
1      e
2      r
3      i
4      e
      ..
195    n
196    d
197    e
198    x
199    .
Length: 200, dtype: object

In [66]:
# Series sorted by count
sct = s.value_counts()
sct

     29
e    23
a    15
t    12
i    12
n    12
s    11
l    11
r    10
o    10
d     6
y     5
c     5
b     5
g     4
,     4
h     4
p     3
.     3
f     3
m     2
x     2
(     1
P     1
u     1
S     1
j     1
)     1
T     1
-     1
v     1
dtype: int64

In [77]:
# list with the 5 most common letters
sct_5 = []
for idx, val in sct.items():  # iteritems() was removed in pandas 2.0
    if idx != ' ':
        sct_5.append(idx)   
        if len(sct_5) == 5:
            break
print(sct_5)

['e', 'a', 't', 'i', 'n']
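
The loop above works. One possible alternative without an explicit loop is sketched below on a shorter, made-up sentence (`"banana bandana"`) so the counts are easy to verify by hand. It relies on `value_counts()` sorting by count in descending order.

```python
import pandas as pd

letters = pd.Series(list("banana bandana"))

# Counts sorted from most to least frequent: a=6, n=4, b=2, then d and ' '.
counts = letters.value_counts()

# Drop the space label, then take the labels of the first three rows.
top3 = counts.drop(" ").head(3).index.tolist()
print(top3)
```

Note that the relative order of letters with equal counts may differ between pandas versions, so ties at the cutoff deserve a second look.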
