# Python basic data storage – towards spreadsheets

If we want to manipulate tabular data, like spreadsheets, in Python, what would be the natural data structures, and what problems does Pandas solve?

In [1]:
import numpy as np
import pandas as pd

### Lists

Besides a single variable, a list is perhaps the most basic data container in Python.

- It stores multiple values
- and it's super flexible – you're not confined to just one type of object in a list

In [2]:
ll = ['a', 'b', 'c', 'd', 'e']
ll

['a', 'b', 'c', 'd', 'e']

### List access

- We access individual elements by integer position, starting with `0`, and ending with `length-1`
- we can use the `n:m` "slice" notation to access a sequence (which remains a list)
- notice the returned `n:m` slice starts with `n` but ends with `m-1`
- **knowing the correct integer position can be error prone**
- if you don't know where an element is, searching through the list is slow

In [3]:
print('first element:\t', ll[0])
print('1:3 slice:\t', ll[1:3])
print('last element:\t', ll[-1])

first element:	 a
1:3 slice:	 ['b', 'c']
last element:	 e
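To illustrate the last point about searching, here is a minimal sketch. The name `letters` is only illustrative; `list.index()` and the `in` operator are standard Python and both walk the list one element at a time.

```python
# A small list like the one above; `letters` is an illustrative name
letters = ['a', 'b', 'c', 'd', 'e']

# list.index() scans from the start until it finds a match (linear search)
position = letters.index('d')
print(position)          # 3

# the `in` operator also checks the elements one by one
print('z' in letters)    # False

# asking for the position of a missing value raises ValueError
try:
    letters.index('z')
except ValueError as err:
    print('not found:', err)
```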


### Dictionaries

Knowing the correct integer position to grab an item out of a list can be tricky and error prone, so often it's handy to name things and access them by the name. Python has a "dictionary" for just this purpose.

- Dictionaries contain what are called "**key, value pairs**"
- Keys must be hashable, so immutable types such as numbers, strings, and tuples are fine
- Values can be anything
- Since Python 3.7, dictionaries preserve insertion order, but items are looked up by key, not by position
- **Lookup is super fast!**

In [4]:
dd = {'aa':1, 'bb':2, 'cc':3, 'dd':4, 'ee':5}
dd

{'aa': 1, 'bb': 2, 'cc': 3, 'dd': 4, 'ee': 5}

### Dictionary access

You access the "values" by putting the "key" in square brackets after the variable name.

In [5]:
print("'aa' key's value:", dd['aa'])
print("'cc' key's value:", dd['cc'])

'aa' key's value: 1
'cc' key's value: 3
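A few other common dictionary operations can be sketched as follows. The dictionary `prices` and its contents are made up for illustration; `in`, `.get()`, and `KeyError` are standard Python behavior.

```python
# Illustrative dictionary; the keys and values are made up
prices = {'aa': 1, 'bb': 2, 'cc': 3}

# membership tests check the keys and do not scan the values
print('bb' in prices)        # True

# .get() returns a default instead of failing on a missing key
print(prices.get('zz', 0))   # 0

# square-bracket access on a missing key raises KeyError
try:
    prices['zz']
except KeyError as err:
    print('missing key:', err)

# assigning to a new key adds a key, value pair
prices['dd'] = 4
print(list(prices))          # ['aa', 'bb', 'cc', 'dd']
```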


---

## Spreadsheets?

If we want to store something like a spreadsheet, where we have columns of data in a grid, how might we think of doing that?

### Lists of lists

Lists can hold other lists, to create an array of values

In [6]:
lol = [[1,2,3], [4,5,6], [7,8,9]]
lol

[[1, 2, 3], [4, 5, 6], [7, 8, 9]]

But we have to access elements by integer again

In [7]:
lol[1][2]

6

And we can only do math easily along one direction – the first-level lists are easiest.

In [8]:
for l in lol:
    print(sum(l))

6
15
24


(or changing the loop to a "list comprehension", for a more "pythonic" expression)

In [9]:
[sum(l) for l in lol]

[6, 15, 24]
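Going the other direction (down the "columns") takes an extra step. One possible approach, shown here as a sketch, is to regroup the values with `zip(*...)` before summing. The name `grid` is illustrative.

```python
# Same values as the list of lists above; `grid` is an illustrative name
grid = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# zip(*grid) regroups the values by position, yielding one tuple per column
column_sums = [sum(col) for col in zip(*grid)]
print(column_sums)   # [12, 15, 18]
```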

### Numpy arrays

For 2D arrays of values we can use Numpy arrays. That is getting us closer to a spreadsheet!

In [10]:
nn = np.array(lol)
nn

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

### Numpy array access

These arrays are built to access by integer position, kind of like lists, but easier than lists in 2 (and higher) dimensions

In [11]:
print('first row and first column value:', nn[0,0])
print('second row and third column value:', nn[1,2])

first row and first column value: 1
second row and third column value: 6


### Numpy math

And with these we can easily do math in any direction. axis=0 is down.

These arrays are especially good for doing things like matrix math (linear algebra) and image processing!

*Note: Numpy arrays are stored more efficiently than Python lists and allow mathematical operations to be vectorized, which results in significantly higher performance than with looping constructs in Python
[[ref]](https://medium.com/@ericvanrees/pandas-series-objects-and-numpy-arrays-15dfe05919d7)*

In [12]:
nn.sum(axis=0)

array([12, 15, 18])

and axis=1 is across

In [13]:
nn.sum(axis=1)

array([ 6, 15, 24])
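As a small sketch of what "vectorized" means in the note above, arithmetic is applied to every element at once, without a Python loop. The variable names here are illustrative.

```python
import numpy as np

# Same values as the array above; `grid` is an illustrative name
grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# element-wise arithmetic: no explicit loop needed
doubled = grid * 2
print(doubled[1, 2])   # 12

# other reductions accept the same axis argument as sum
col_means = grid.mean(axis=0)   # down the columns
row_means = grid.mean(axis=1)   # across the rows
print(col_means)   # [4. 5. 6.]
print(row_means)   # [2. 5. 8.]
```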

### But in a Numpy array all the data has to be the same type

If you try to store mixed types, Numpy falls back to a "lowest common denominator" data type. Here that is `<U21`, a unicode string of up to 21 characters, so the numbers are stored as strings rather than integers and we can't do math on them.

In [14]:
mixed_array = np.array([[1,2,3],['a','b','c']])
mixed_array

array([['1', '2', '3'],
       ['a', 'b', 'c']], dtype='<U21')
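To make the consequence concrete, here is a sketch of recovering the numbers from such an array. It assumes we know which row holds numeric text; `mixed` and `first_row` are illustrative names.

```python
import numpy as np

# Mixing integers and strings produces an array of strings
mixed = np.array([[1, 2, 3], ['a', 'b', 'c']])
print(mixed.dtype.kind)      # U  (unicode string)

# the numeric row has to be converted back before doing math
first_row = mixed[0].astype(int)
print(first_row.sum())       # 6

# converting the text row fails, because 'a' is not a number
try:
    mixed[1].astype(int)
except ValueError as err:
    print('cannot convert:', err)
```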

---

## Dictionary of lists or arrays would be a decent idea for a spreadsheet

To store columns of mixed data types, as in a typical spreadsheet, the ideal would be a dictionary of columns: we access each column by name, and its values live in a list or array so we can do math on them and index into them easily.

- dictionary key is the column name
- value is a list or array holding our rows of values, including Nulls or NaNs

In [15]:
dict_of_arrays = {'a':np.array([1,2,3]), 
                  'b':np.array(['x','y','z']),
                  'c':np.array([7,8,np.nan])}
dict_of_arrays

{'a': array([1, 2, 3]),
 'b': array(['x', 'y', 'z'], dtype='<U1'),
 'c': array([ 7.,  8., nan])}

### But math is hard on a dictionary of lists or arrays

- If we want to do math down the columns it's complicated
    - have to iterate through columns
    - have to deal with non-numeric types
- across the rows is impossible in this form
- there is nothing to guarantee alignment of the values into rows
- NaN screws up stats like sum and mean

In [16]:
for k,v in dict_of_arrays.items():
    try:
        print('mean of column', k, 'is', v.mean())
    except TypeError:
        print("can't do mean on column", k)

mean of column a is 2.0
can't do mean on column b
mean of column c is nan
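With plain Numpy, one way around the NaN problem is to call the NaN-aware variants of the statistics explicitly. A minimal sketch, using the same values as column `c` above:

```python
import numpy as np

# Same values as column 'c' above
c = np.array([7, 8, np.nan])

# the ordinary mean is "poisoned" by the NaN
print(c.mean())         # nan

# the nan-aware functions skip missing values
print(np.nanmean(c))    # 7.5
print(np.nansum(c))     # 15.0
```

This works, but we have to remember to use the special functions every time, for every column.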


---

## Pandas DataFrame is sort of like a more flexible dictionary of arrays

### Data Structures [[ref]](https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html)

- **Series**: 1D labeled, homogeneously-typed array
    - Container for scalars or strings
    - Each one has an index and potentially a name
    - Based on a Numpy array [[ref]](https://medium.com/@ericvanrees/pandas-series-objects-and-numpy-arrays-15dfe05919d7)
- **DataFrame**: general 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns
    - Container for Series
    - Overall index, and each Series (column) has a name

When we print in a notebook you see the column names/labels along the top, and the "index", which are the row labels along the left-hand side.

In [17]:
df = pd.DataFrame(dict_of_arrays)
df

|   | a | b | c |
|---|---|---|---|
| 0 | 1 | x | 7.0 |
| 1 | 2 | y | 8.0 |
| 2 | 3 | z | NaN |


### Access is easy on a DataFrame

Each column is a Series that shares the DataFrame's index, and we can access columns by name so we don't make mistakes with integer positions

- the index is like a row number in Excel, but more flexible and powerful
- if you don't specify an index when you create the DataFrame, Pandas will create an integer index by default
- **the index can be other types of identifiers like strings or dates**

In [18]:
df['c']

0    7.0
1    8.0
2    NaN
Name: c, dtype: float64
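As a sketch of the last bullet, here is a DataFrame with a string index. The row labels and the name `labeled` are made up for illustration; `.loc` (label-based) and `.iloc` (position-based) are the standard Pandas accessors.

```python
import numpy as np
import pandas as pd

# Illustrative frame with made-up string row labels
labeled = pd.DataFrame(
    {'a': [1, 2, 3], 'c': [7, 8, np.nan]},
    index=['first', 'second', 'third'],
)

# .loc looks things up by row label and column name
print(labeled.loc['second', 'c'])   # 8.0

# a single row label returns that row as a Series
print(labeled.loc['first'])

# .iloc is still available for access by integer position
print(labeled.iloc[0, 0])           # 1
```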

### Math is easy on a DataFrame

- Default is down columns
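- Strings are handled in a logical way: `sum` concatenates them, and `mean` skips them when called with `numeric_only=True` (needed in pandas 2.0 and later, where `mean` on a string column otherwise raises a `TypeError`)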
- NaN/Null is ignored rather than causing NaN

In [19]:
df.mean(numeric_only=True)

a    2.0
c    7.5
dtype: float64

In [20]:
df.sum()

a      6
b    xyz
c     15
dtype: object

- and we can also do calculations across rows

In [21]:
df.mean(axis=1, numeric_only=True)

0    4.0
1    5.0
2    3.0
dtype: float64
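The NaN handling can be switched off when a missing value should propagate. A short sketch using the `skipna` parameter, with `frame` as an illustrative name:

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing value in column 'c'
frame = pd.DataFrame({'a': [1, 2, 3], 'c': [7, 8, np.nan]})

# default behavior: missing values are skipped
print(frame['c'].mean())               # 7.5

# skipna=False makes the NaN propagate, as in plain Numpy
print(frame['c'].mean(skipna=False))   # nan

# count() reports how many non-missing values went into the result
print(frame['c'].count())              # 2
```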

### We can do things like Transpose

As with a Numpy array, we can do flexible manipulations like transpose

In [22]:
df.T

|   | 0 | 1 | 2 |
|---|---|---|---|
| a | 1 | 2 | 3 |
| b | x | y | z |
| c | 7 | 8 | NaN |


#### or even turn the DataFrame into a dictionary of dictionaries

In [23]:
df.to_dict()

{'a': {0: 1, 1: 2, 2: 3},
 'b': {0: 'x', 1: 'y', 2: 'z'},
 'c': {0: 7.0, 1: 8.0, 2: nan}}

#### or a Numpy array if we need to

In [24]:
df.to_numpy()

array([[1, 'x', 7.0],
       [2, 'y', 8.0],
       [3, 'z', nan]], dtype=object)

#### or a JSON object

In [25]:
df.to_json()

'{"a":{"0":1,"1":2,"2":3},"b":{"0":"x","1":"y","2":"z"},"c":{"0":7.0,"1":8.0,"2":null}}'

#### or save to a CSV or Excel file

In [26]:
df.to_csv('df_out.csv')
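One detail worth knowing when reading a CSV back in: `to_csv` writes the index as an unnamed first column by default. The sketch below round-trips through an in-memory buffer rather than a file, purely to keep the example self-contained. `frame` and `buffer` are illustrative names.

```python
import io

import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1, 2, 3],
                      'b': ['x', 'y', 'z'],
                      'c': [7, 8, np.nan]})

# write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
frame.to_csv(buffer)
csv_text = buffer.getvalue()

# read back naively: the saved index shows up as an extra column
naive = pd.read_csv(io.StringIO(csv_text))
print(list(naive.columns))      # ['Unnamed: 0', 'a', 'b', 'c']

# index_col=0 tells Pandas to use the first column as the index again
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(restored.columns))   # ['a', 'b', 'c']
```

Alternatively, `frame.to_csv(buffer, index=False)` leaves the index out of the file altogether.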

## Operations are vectorized – don't iterate through rows!

When most of us learn programming, one of the first constructs we learn (besides if/then statements) is the loop.

If you find yourself **iterating through the elements** (rows or columns) of a DataFrame, there's a good chance you're doing something **much slower and more complicated** than it needs to be! *I hardly ever need to loop through a DataFrame!*

In [27]:
df['B'] = df['b'].str.upper()
df

|   | a | b | c | B |
|---|---|---|---|---|
| 0 | 1 | x | 7.0 | X |
| 1 | 2 | y | 8.0 | Y |
| 2 | 3 | z | NaN | Z |


In [28]:
df['ac'] = df['a']/df['c']
df

|   | a | b | c | B | ac |
|---|---|---|---|---|---|
| 0 | 1 | x | 7.0 | X | 0.142857 |
| 1 | 2 | y | 8.0 | Y | 0.25 |
| 2 | 3 | z | NaN | Z | NaN |


In [29]:
df['c_filled'] = df['c'].fillna(df['c'].mean())
df

|   | a | b | c | B | ac | c_filled |
|---|---|---|---|---|---|---|
| 0 | 1 | x | 7.0 | X | 0.142857 | 7.0 |
| 1 | 2 | y | 8.0 | Y | 0.25 | 8.0 |
| 2 | 3 | z | NaN | Z | NaN | 7.5 |


In [30]:
df['c_a'] = df['c'] > df['a']
df

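|   | a | b | c | B | ac | c_filled | c_a |
|---|---|---|---|---|---|---|---|
| 0 | 1 | x | 7.0 | X | 0.142857 | 7.0 | True |
| 1 | 2 | y | 8.0 | Y | 0.25 | 8.0 | True |
| 2 | 3 | z | NaN | Z | NaN | 7.5 | False |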
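For comparison, here is a sketch of what the row-by-row version of one of these calculations would look like. Both give the same numbers, but the loop is longer and, on large tables, typically far slower. `frame`, `looped`, and `vectorized` are illustrative names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1, 2, 3], 'c': [7, 8, np.nan]})

# the slow way: visit every row and build the result by hand
looped = []
for _, row in frame.iterrows():
    looped.append(row['a'] / row['c'])

# the vectorized way: one expression operating on whole columns
vectorized = frame['a'] / frame['c']

# same values either way (NaN positions are treated as equal here)
print(np.allclose(looped, vectorized, equal_nan=True))   # True
```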


## Series and DataFrame indices are automatically aligned

In [31]:
sd = pd.Series(index=[2,0,1], data=['third','first','second'])
sd

2     third
0     first
1    second
dtype: object

In [32]:
df['s'] = sd
df

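|   | a | b | c | B | ac | c_filled | c_a | s |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | x | 7.0 | X | 0.142857 | 7.0 | True | first |
| 1 | 2 | y | 8.0 | Y | 0.25 | 8.0 | True | second |
| 2 | 3 | z | NaN | Z | NaN | 7.5 | False | third |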
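Alignment also applies to arithmetic between Series. In this sketch, with made-up labels and values, values are matched by index label rather than by position, and labels present on only one side produce NaN.

```python
import pandas as pd

# Two illustrative Series whose labels only partly overlap
left = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
right = pd.Series([10, 20, 30], index=['z', 'y', 'w'])

# values are paired up by label, not by position
total = left + right
print(total)
# 'y' -> 2 + 20 = 22.0, 'z' -> 3 + 10 = 13.0
# 'w' and 'x' appear on one side only, so they come out as NaN
```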


## `.dropna()` – Drop rows with NAs

```
df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

Parameters
----------
axis : {0 or 'index', 1 or 'columns'}, default 0
    Determine if rows or columns which contain missing values are removed.
    * 0, or 'index' : Drop rows which contain missing values.
    * 1, or 'columns' : Drop columns which contain missing value.

how : {'any', 'all'}, default 'any'
    Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.
    * 'any' : If any NA values are present, drop that row or column.
    * 'all' : If all values are NA, drop that row or column.

thresh : int, optional
    Require that many non-NA values.
subset : array-like, optional
    Labels along other axis to consider, e.g. if you are dropping rows
    these would be a list of columns to include.
inplace : bool, default False
    If True, do operation inplace and return None.
```

*Here we're not going to make the change in place, but just see what the DataFrame looks like with the NAs dropped.*

In [33]:
df.dropna()

|   | a | b | c | B | ac | c_filled | c_a | s |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | x | 7.0 | X | 0.142857 | 7.0 | True | first |
| 1 | 2 | y | 8.0 | Y | 0.25 | 8.0 | True | second |
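The other parameters from the docstring can be sketched on a small made-up frame. The data here is chosen only so that each option gives a different result.

```python
import numpy as np
import pandas as pd

# Illustrative frame: row 0 is complete, row 1 is missing b and c,
# row 2 is missing only c
frame = pd.DataFrame({'a': [1, 2, 3],
                      'b': ['x', None, 'z'],
                      'c': [7, np.nan, np.nan]})

# default: drop every row that has any missing value
print(len(frame.dropna()))                  # 1

# subset: only look at column 'b' when deciding which rows to drop
print(len(frame.dropna(subset=['b'])))      # 2

# thresh: keep rows that have at least 2 non-missing values
print(len(frame.dropna(thresh=2)))          # 2

# axis=1: drop columns (instead of rows) that contain missing values
print(frame.dropna(axis=1).columns.tolist())   # ['a']
```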


## Essential Basic Functionality

https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html

Pandas is well suited for many different kinds of data, but most of the time we use it for **tabular data with heterogeneously-typed columns**, as in an SQL table or Excel spreadsheet.

Here are just a few of the things that Pandas does well:

- Easy handling of missing data (represented as NaN)
- Automatic and explicit data alignment
- Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
- Intuitive merging and joining data sets
- Flexible reshaping and pivoting of data sets
- Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format
- Time series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging, etc
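As a small taste of two items from this list, here is a sketch of a group-by aggregation and a merge. The `sales` and `managers` tables, including every name and number in them, are made up for illustration.

```python
import pandas as pd

# Hypothetical sales records
sales = pd.DataFrame({'region': ['east', 'west', 'east', 'west'],
                      'units': [10, 20, 30, 40]})

# split-apply-combine: split by region, sum the units, combine the results
totals = sales.groupby('region')['units'].sum()
print(totals)
# east -> 40, west -> 60

# Hypothetical lookup table to join onto the sales records
managers = pd.DataFrame({'region': ['east', 'west'],
                         'manager': ['Ana', 'Bo']})

# merge works like a database join on the shared 'region' column
merged = sales.merge(managers, on='region', how='left')
print(merged)
```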