# Python basic data storage – towards spreadsheets

It may seem like a waste of time to go through this when I could just start showing you Pandas syntax, but **there were a lot of things that seemed really strange to me when I first learned Pandas. Perhaps if you see what problems Pandas avoids, it will make more sense why Pandas is built like it is.**

The only spreadsheets I'd ever dealt with were in Excel, which has almost no constraints – you can type any value into any cell – but I didn't realize how that lack of constraints can cause problems.

**I love Excel, it's very handy and I use it a lot, but it can be problematic in various ways:**
- Operations are not "reproducible"
    - there is no record of what you did to the data (editing, transforming, etc)
    - so you don't know what you did
    - and others don't know what you did
    - so you have to manually take notes to share with others
    - like when you publish your paper and others want to try to reproduce your work
    - it's not easy to apply the same manipulations to new data
    - or to a bunch of files containing the same types of data
- You're allowed to mix data types in a column, which can cause problems with analysis
- It's possible to sort one column and not sort the others
- When joining tables together:
    - copy/pasting new columns can result in misaligned rows
    - copy/pasting new rows can result in misaligned columns
    - joining based on a key column is supported, but formulas take a while to set up in each new case, and it's tedious to set up when joining multiple columns
- You can't easily do some common transformations, like pivoting/melting/gathering wide data into tall (tidy) data
- It's not Excel's fault, but few people learn how to use the super-useful/powerful Pivot Table functionality

To help understand Pandas better, let's ask ourselves, **"If we want to manipulate tabular data, like spreadsheets, in Python, what would be the natural data structures, and what problems does Pandas solve?"** Along the way we'll review a few Python data structures, which will help us later, too.

---

*To preserve the mystery, select from the notebook menus*

`Edit -> Clear All Outputs`

---

## Let's review two basic native Python data containers

*Note: there are more than these two, but **Lists** and **Dictionaries** are the two most common mutable (changeable) data structures, and understanding them will help with understanding how Pandas data structures work, so they're worth a review.*

In [1]:
import numpy as np
import pandas as pd

### Lists

Besides a single variable, a list is perhaps the most basic data container in Python.

- It stores multiple values
- and it's super flexible – you're not confined to just one type of object in a list

In [2]:
ll = ['a', 'b', 'c', 'd', 'e']
ll

['a', 'b', 'c', 'd', 'e']

#### List access – simple indexing by integer

- We access individual elements by integer position, starting with `0`, and ending with `length-1`
- negative values count from the end starting at -1
- **knowing the correct integer position can be error prone**
- if you don't know where an element is, searching through the list is slow

In [3]:
print('first element\tll[0]\t', ll[0])
print('second element\tll[1]\t', ll[1])
print('last element\tll[4]\t', ll[4])
print('last element\tll[-1]\t', ll[-1])

first element	ll[0]	 a
second element	ll[1]	 b
last element	ll[4]	 e
last element	ll[-1]	 e


#### List access – slicing a range of values

- we can use the `m:n` "slice" notation to access a sequence (which remains a list)
- notice the returned `m:n` slice starts with `m` but ends with `n-1`
- if you leave out the beginning or ending index the slice goes all the way from the beginning or end
- the `:` by itself returns all elements

In [4]:
print('1:3 slice\tll[1:3]\t', ll[1:3])
print('first 3 elems\tll[:3]\t', ll[:3])
print('index 3 to end\tll[3:]\t', ll[3:])
print('all elements\tll[:]\t', ll[:])

1:3 slice	ll[1:3]	 ['b', 'c']
first 3 elems	ll[:3]	 ['a', 'b', 'c']
index 3 to end	ll[3:]	 ['d', 'e']
all elements	ll[:]	 ['a', 'b', 'c', 'd', 'e']


### Dictionaries

Knowing the correct integer position to grab an item out of a list can be tricky and error prone, so often it's handy to name things and access them by the name. Python has a "dictionary" for just this purpose.

- Dictionaries contain what are called "**key, value pairs**"
- Keys must be hashable (in practice, immutable), so numbers and strings are fine
- Values can be anything
- Since Python 3.7, dictionaries keep their items in insertion order, but items are accessed by key, not by position
- **Lookup is super fast!**

In [5]:
dd = {'aa':1, 'bb':2, 'cc':3, 'dd':4, 'ee':5}
dd

{'aa': 1, 'bb': 2, 'cc': 3, 'dd': 4, 'ee': 5}

#### Dictionary access

You access the "values" by putting the "key" in square brackets after the variable name.

In [6]:
print("'aa' key's value:", dd['aa'])
print("'cc' key's value:", dd['cc'])

'aa' key's value: 1
'cc' key's value: 3
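
A few other everyday dictionary operations are worth knowing. This is just a small sketch of standard Python behavior, not something from the original notebook, and the `'zz'` and `'ff'` keys are made up for illustration: testing for a key with `in`, using `.get()` to supply a default instead of raising a `KeyError`, and adding a new pair by assignment.

```python
dd = {'aa': 1, 'bb': 2, 'cc': 3, 'dd': 4, 'ee': 5}

# Membership tests check the keys, not the values
has_aa = 'aa' in dd
has_zz = 'zz' in dd

# .get() returns a default instead of raising KeyError for a missing key
missing = dd.get('zz', 0)

# Assigning to a key that doesn't exist yet adds a new key, value pair
dd['ff'] = 6

print(has_aa, has_zz, missing, dd['ff'])
```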


---

## Spreadsheets? (Tabular data)

Let's define spreadsheets as:

- Data in a grid of columns and rows
- Within a column, all values have a consistent data type *(string, integer, floating point, boolean...)*
- Not all columns in a spreadsheet need to be the same data type

**If we want to store something like a spreadsheet, how might we think of doing that in Python?**

### Lists of lists

Lists can hold other lists, to create an array of values

In [7]:
lol = [[1,2,3], [4,5,6], [7,8,9]]
lol

[[1, 2, 3], [4, 5, 6], [7, 8, 9]]

But we have to access elements by integer again

In [8]:
lol[1][2]

6

And we can only do math easily along one direction: summing each inner list (a row) is simple, but summing down the columns takes more work.

In [9]:
for l in lol:
    print(sum(l))

6
15
24


(or changing the loop to a "list comprehension", for a more "pythonic" expression)

In [10]:
[sum(l) for l in lol]

[6, 15, 24]
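
Going in the other direction, down the columns, takes an extra step. One way to do it, sketched here with the built-in `zip` (my choice for illustration, not something the notebook relies on later), is to regroup the values so that each column becomes a tuple we can sum:

```python
lol = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# zip(*lol) unpacks the rows and groups the i-th element of every row together,
# so each item it yields is one column of the original grid
column_sums = [sum(col) for col in zip(*lol)]
print(column_sums)
```

It works, but it's not obvious at a glance, which is part of why plain lists of lists are awkward for tabular data.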

### Numpy arrays

For 2D arrays of values we can use Numpy arrays. This is getting us closer to a spreadsheet!

In [11]:
nn = np.array(lol)
nn

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

### Numpy array access

These arrays are built to access by integer position, kind of like lists, but easier than lists in 2 (and higher) dimensions

In [12]:
print('first row and first column value:', nn[0,0])
print('second row and third column value:', nn[1,2])

first row and first column value: 1
second row and third column value: 6


### Numpy math

- With Numpy arrays we can easily do math in any direction. 
- They are especially good for doing things like matrix math (linear algebra) and image processing!

*Note: Numpy arrays are stored more efficiently than Python lists and allow mathematical operations to be vectorized, which results in significantly higher performance than with looping constructs in Python
[[ref]](https://medium.com/@ericvanrees/pandas-series-objects-and-numpy-arrays-15dfe05919d7)*

#### axis=0 is down

This notation is horrible, but you'll see it again in Pandas, so you just have to get used to it. (Accessing elements in a 2D array, or a Pandas DataFrame, is always done in the order [row, column], so the way I remember it is that axis 0, the first one, runs down over all the rows, and axis 1, the second one, runs across over all the columns.)

In [13]:
nn.sum(axis=0)

array([12, 15, 18])

#### axis=1 is across

In [14]:
nn.sum(axis=1)

array([ 6, 15, 24])
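
One way that may help the axis convention stick is to look at shapes: the axis you name is the one that gets collapsed. Here's a minimal sketch using a non-square array (made up for this example) so the two directions are easy to tell apart:

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])     # 2 rows, 3 columns

down = arr.sum(axis=0)     # collapses axis 0 (the rows): one sum per column
across = arr.sum(axis=1)   # collapses axis 1 (the columns): one sum per row

print(arr.shape, down.shape, across.shape)
print(down, across)
```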

### But in a Numpy array all the data has to be the same type

If you try to store mixed types, Numpy has to fall back to a "lowest common denominator" data type. In the example below it falls back to a unicode string type ("<U21", meaning strings of up to 21 characters), so the numbers are stored as strings instead of integers and we can't do math on them.

In [15]:
mixed_array = np.array([[1,2,3],['a','b','c']])
mixed_array

array([['1', '2', '3'],
       ['a', 'b', 'c']], dtype='<U21')
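
To see the consequence, here is a small sketch that checks the array's type and shows the conversion we'd need before doing any math on the "numeric" row. The use of `astype(int)` is just one way to convert back, chosen for illustration:

```python
import numpy as np

mixed_array = np.array([[1, 2, 3], ['a', 'b', 'c']])

# The whole array was converted to a unicode string type
print(mixed_array.dtype.kind)   # 'U' means unicode string

# To do math on the first row we have to convert it back to integers ourselves
first_row_sum = mixed_array[0].astype(int).sum()
print(first_row_sum)
```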

---

## Dictionary of lists (or arrays) would be a decent idea for a spreadsheet

- *Dictionary key acts as a column name*
- *Dictionary value is a list or array holding our rows of values (including Nulls or NaNs)*

**This is really starting to look more like a spreadsheet!** We now have:
- Column names
- Columns don't need to be of the same type!
- Easy math down the columns with the lists or arrays

In [16]:
dict_of_arrays = {'name':np.array(['bernice', 'jinyue', 'haim']),
                  'level':np.array([1, 2, 3]), 
                  'grade':np.array([2.9, 3.8, np.nan])}
dict_of_arrays

{'name': array(['bernice', 'jinyue', 'haim'], dtype='<U7'),
 'level': array([1, 2, 3]),
 'grade': array([2.9, 3.8, nan])}

### But math is slightly complicated on a dictionary of lists or arrays

- If we want to do math down all the columns we have to
    - iterate through columns
    - deal with non-numeric types
- Math across the rows can't be done directly in this form
- there is nothing to guarantee alignment of the values into rows
    - **if you sort one column, the rest don't get sorted at the same time!**
    - **if you add a new column, you need to be very careful that the rows are in the same order!**
- NaNs (nulls) screw up stats like sum and mean

Here's an example of trying to take the mean of the columns:

In [17]:
for k,v in dict_of_arrays.items():
    try:
        print('mean of', k, 'column is', v.mean())
    except TypeError:
        print("can't do mean on", k, "column")

can't do mean on name column
mean of level column is 2.0
mean of grade column is nan
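
Numpy does have NaN-aware versions of the common statistics, such as `np.nanmean` and `np.nansum`, but you have to remember to ask for them. A quick sketch of the difference:

```python
import numpy as np

grade = np.array([2.9, 3.8, np.nan])

plain_mean = grade.mean()        # nan, because one value is missing
nan_aware = np.nanmean(grade)    # ignores the NaN: (2.9 + 3.8) / 2

print(plain_mean, nan_aware)
```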


### It's hard to return a certain row

In a dictionary of arrays it's easy to access a single column through the "column name", which is the dictionary key, but with a spreadsheet we often want to access a certain row, based on one of the row values.

**For example, if you want to see all of Jinyue's information, you have to**

- Find the index of the "name" array corresponding to Jinyue
- Grab the value from that index in all arrays and pair it with the column name

In [18]:
ind = np.argwhere( dict_of_arrays['name']=='jinyue' )[0,0]
row = [(k, v[ind]) for k,v in dict_of_arrays.items()]
row

[('name', 'jinyue'), ('level', 2), ('grade', 3.8)]
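
The same kind of manual bookkeeping is needed to keep rows aligned when sorting. Here's a minimal sketch, assuming we want to sort by `level` from highest to lowest: compute the ordering once with `np.argsort`, then apply it to every array ourselves. Forget one column and the rows silently fall out of alignment.

```python
import numpy as np

dict_of_arrays = {'name': np.array(['bernice', 'jinyue', 'haim']),
                  'level': np.array([1, 2, 3]),
                  'grade': np.array([2.9, 3.8, np.nan])}

# Integer positions that would sort 'level' from highest to lowest
order = np.argsort(dict_of_arrays['level'])[::-1]

# Apply the same ordering to every column so the rows stay aligned
sorted_cols = {k: v[order] for k, v in dict_of_arrays.items()}
print(sorted_cols['name'], sorted_cols['level'])
```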

## List of Dictionaries makes row access easier

An alternative spreadsheet storage method with the built-in Python data structures is to store each row as a dictionary of values. That has the advantage of keeping the row information all together, always keyed to the "column name".

In [19]:
list_of_dicts = [{'name': 'bernice', 'level': 1, 'grade': 2.9},
                 {'name': 'jinyue', 'level': 2, 'grade': 3.8},
                 {'name': 'haim', 'level': 3, 'grade': np.nan}]

for l in list_of_dicts:
    if l['name']=='jinyue':
        print(l)

{'name': 'jinyue', 'level': 2, 'grade': 3.8}


### But, finding elements and math down columns is hard

We basically need to re-collect all the values in a certain column and then perform the math operation on it:

In [20]:
math_level = [row['level'] for row in list_of_dicts]
print('mean level =', sum(math_level)/len(math_level))

mean level = 2.0


---

## Pandas DataFrame is sort of like a more flexible and efficient dictionary of arrays (or list of dictionaries)

### Pandas Data Structures [[ref]](https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html)

- **Series**: 1D labeled homogeneously-typed array
    - Container for scalars (numbers), strings, or booleans (True/False)
    - Each Series has an Index (names for all entries) and potentially an overall name
    - Based on a Numpy array [[ref]](https://medium.com/@ericvanrees/pandas-series-objects-and-numpy-arrays-15dfe05919d7)
- **DataFrame**: General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns
    - Container for Series
    - Overall Index (row names), and each Series (column) has a name

Printing a DataFrame in a notebook shows the column names/labels along the top, and the "index", which are the row labels along the left-hand side (here just sequential integers).

Here we easily initialize (create and insert values into) a new DataFrame using the dictionary of arrays

In [21]:
df = pd.DataFrame(dict_of_arrays)
df

| | name | level | grade |
|---|---|---|---|
| 0 | bernice | 1 | 2.9 |
| 1 | jinyue | 2 | 3.8 |
| 2 | haim | 3 | NaN |


We could have created the DataFrame from the list of dictionaries just as easily

In [22]:
pd.DataFrame(list_of_dicts)

| | name | level | grade |
|---|---|---|---|
| 0 | bernice | 1 | 2.9 |
| 1 | jinyue | 2 | 3.8 |
| 2 | haim | 3 | NaN |


### Access is easy on a DataFrame

- **Each column is a Series, and all the columns share the DataFrame's Index**
- we can access each element by name so we don't make mistakes with integers
- the Index is like a row number in Excel, but more flexible and powerful
- if you don't specify an Index when you create the DataFrame, Pandas will create a sequential integer Index by default
- **the Index can be other types of identifiers like strings or dates**

Let's grab (select) just the column called "grade", which is a Series.

**Besides the data access syntax that we'll cover in the next section, this output was always the most confusing thing to me about Pandas. I expected to get just an array of values when I asked for a column or row, and instead I got this weird structure with both labels and values. I didn't understand how helpful the Series was!**

In [23]:
df['grade']

0    2.9
1    3.8
2    NaN
Name: grade, dtype: float64
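
To illustrate the point that the Index can hold identifiers other than integers, here is a small sketch that uses the `name` column as the Index. It peeks ahead at the `.loc` label-based accessor, and it rebuilds a small DataFrame (called `df_demo` here, a name I made up) so that it runs on its own:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'name': ['bernice', 'jinyue', 'haim'],
                        'level': [1, 2, 3],
                        'grade': [2.9, 3.8, np.nan]})

# set_index returns a new DataFrame whose row labels are the names
by_name = df_demo.set_index('name')

# A column is still a Series, now labeled by name instead of by integer
grades = by_name['grade']
print(grades)

# Look up a single value by row label and column label
jinyue_grade = by_name.loc['jinyue', 'grade']
print(jinyue_grade)
```

This is where the labels on a Series start to pay off: the value for Jinyue stays attached to the label `'jinyue'` no matter how the rows are ordered.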

### Math is easy on a DataFrame

- Default is down columns
- NaNs/Nulls are ignored rather than leading to a NaN result

*Note that a sum() works on strings*

In [24]:
df.sum()

name     bernicejinyuehaim
level                    6
grade                  6.7
dtype: object

- In older versions of Pandas, columns were silently ignored if there was a TypeError (like trying to take the mean of strings), but since Pandas 2.0 you have to exclude those columns yourself (for example with `numeric_only=True`) if the aggregation function doesn't make sense for them.
- It's a little inconvenient, but I can see why they do this so it's more explicit what's happening


In [25]:
df.mean(numeric_only=True)

level    2.00
grade    3.35
dtype: float64

- and we can also do calculations across rows

*I know this notation is confusing: axis=0 is down columns, axis=1 is across rows*

In [26]:
df.mean(axis=1, numeric_only=True)

0    1.95
1    2.90
2    3.00
dtype: float64
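
An alternative to `numeric_only=True`, sketched here as one option rather than the recommended way, is to pick the numeric columns explicitly with `select_dtypes` before aggregating. The example also shows `skipna=False`, which turns off the default NaN-skipping behavior. The DataFrame is rebuilt under the made-up name `df_demo` so the block runs on its own:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'name': ['bernice', 'jinyue', 'haim'],
                        'level': [1, 2, 3],
                        'grade': [2.9, 3.8, np.nan]})

# Keep only the numeric columns, then aggregate down the columns
numeric = df_demo.select_dtypes(include='number')
col_means = numeric.mean()

# NaNs are skipped by default; skipna=False makes them propagate instead
strict_means = numeric.mean(skipna=False)

print(col_means)
print(strict_means)
```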

### We can do things like Transpose

As with a Numpy array, we can do flexible manipulations like transpose

In [27]:
df.T

| | 0 | 1 | 2 |
|---|---|---|---|
| name | bernice | jinyue | haim |
| level | 1 | 2 | 3 |
| grade | 2.9 | 3.8 | NaN |


#### or even turn the DataFrame into a dictionary of dictionaries

In [28]:
df.to_dict()

{'name': {0: 'bernice', 1: 'jinyue', 2: 'haim'},
 'level': {0: 1, 1: 2, 2: 3},
 'grade': {0: 2.9, 1: 3.8, 2: nan}}

#### or a Numpy array if we need to

In [29]:
df.to_numpy()

array([['bernice', 1, 2.9],
       ['jinyue', 2, 3.8],
       ['haim', 3, nan]], dtype=object)

#### or a JSON object

In [30]:
df.to_json()

'{"name":{"0":"bernice","1":"jinyue","2":"haim"},"level":{"0":1,"1":2,"2":3},"grade":{"0":2.9,"1":3.8,"2":null}}'

#### or save to a CSV or Excel file

In [31]:
df.to_csv('df_out.csv')
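
One thing to watch for when reading a saved CSV back in: by default `to_csv` writes the Index as the first column. The sketch below does a round trip through an in-memory buffer (`io.StringIO`, used here only so the example doesn't touch the disk) and uses `index_col=0` to restore the Index. The `df_demo` name is made up for this example.

```python
import io

import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'name': ['bernice', 'jinyue', 'haim'],
                        'level': [1, 2, 3],
                        'grade': [2.9, 3.8, np.nan]})

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
df_demo.to_csv(buffer)
buffer.seek(0)

# index_col=0 tells read_csv that the first column holds the row labels
df_back = pd.read_csv(buffer, index_col=0)
print(df_back)
```

Without `index_col=0`, the saved Index comes back as an extra column named `Unnamed: 0`. If you don't need the Index in the file at all, `to_csv(..., index=False)` leaves it out.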

## Operations are vectorized – don't iterate through rows!

When most of us learn programming, one of the most common constructs we learn (besides if/then statements) is the loop.

If you find yourself **iterating through the elements** (rows or columns) of a DataFrame, there's a good chance you're doing something **much slower and more complicated** than it needs to be! ***I hardly ever need to loop through the elements of a DataFrame!***
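
To make the contrast concrete, here is a minimal sketch that computes the same thing both ways on a tiny made-up DataFrame (the values are hypothetical). Both give the same numbers, but the vectorized version is a single expression and, on large data, typically much faster:

```python
import pandas as pd

df_demo = pd.DataFrame({'level': [1, 2, 3], 'grade': [2.9, 3.8, 3.1]})

# Row-by-row loop: works, but slow and verbose on large DataFrames
looped = []
for _, row in df_demo.iterrows():
    looped.append(row['level'] * row['grade'])

# Vectorized: one expression operates on whole columns at once
vectorized = df_demo['level'] * df_demo['grade']

print(looped)
print(vectorized.tolist())
```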

*Note that if we assign to a column name that doesn't exist, Pandas will just create a new column*

#### String functions on columns

In [32]:
df['NAME'] = df['name'].str.upper()
df

| | name | level | grade | NAME |
|---|---|---|---|---|
| 0 | bernice | 1 | 2.9 | BERNICE |
| 1 | jinyue | 2 | 3.8 | JINYUE |
| 2 | haim | 3 | NaN | HAIM |


#### Math on combinations of columns

In [33]:
df['level_grade'] = df['level']/df['grade']
df

| | name | level | grade | NAME | level_grade |
|---|---|---|---|---|---|
| 0 | bernice | 1 | 2.9 | BERNICE | 0.344828 |
| 1 | jinyue | 2 | 3.8 | JINYUE | 0.526316 |
| 2 | haim | 3 | NaN | HAIM | NaN |


#### Filling nulls with values

In [34]:
df['grade_filled'] = df['grade'].fillna(df['grade'].mean())
df

| | name | level | grade | NAME | level_grade | grade_filled |
|---|---|---|---|---|---|---|
| 0 | bernice | 1 | 2.9 | BERNICE | 0.344828 | 2.9 |
| 1 | jinyue | 2 | 3.8 | JINYUE | 0.526316 | 3.8 |
| 2 | haim | 3 | NaN | HAIM | NaN | 3.35 |


#### Boolean comparisons between columns

In [35]:
df['grade_gt_level'] = df['grade'] > df['level']
df

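| | name | level | grade | NAME | level_grade | grade_filled | grade_gt_level |
|---|---|---|---|---|---|---|---|
| 0 | bernice | 1 | 2.9 | BERNICE | 0.344828 | 2.9 | True |
| 1 | jinyue | 2 | 3.8 | JINYUE | 0.526316 | 3.8 | True |
| 2 | haim | 3 | NaN | HAIM | NaN | 3.35 | False |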


## Series and DataFrame indices are automatically aligned

**Both rows and columns in Pandas are ordered**, and that order doesn't need to be numerical or alphabetical in either the column or row indexes.

In [36]:
sd = pd.Series(index=[2,0,1], data=['third','first','second'])
sd

2     third
0     first
1    second
dtype: object

#### Assigning our new Series as a column in our DataFrame matches on the index

*Warning: entries that don't have a matching index value in the DataFrame will be dropped! (There are ways to get around this.)*

In [37]:
df['s'] = sd
df

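| | name | level | grade | NAME | level_grade | grade_filled | grade_gt_level | s |
|---|---|---|---|---|---|---|---|---|
| 0 | bernice | 1 | 2.9 | BERNICE | 0.344828 | 2.9 | True | first |
| 1 | jinyue | 2 | 3.8 | JINYUE | 0.526316 | 3.8 | True | second |
| 2 | haim | 3 | NaN | HAIM | NaN | 3.35 | False | third |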
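
Here's a small sketch of what the warning above means in practice, using a made-up Series whose index only partly overlaps the DataFrame's. The entry labeled `5` has nowhere to go and is dropped, while row `2` of the DataFrame has no matching entry and gets a NaN:

```python
import pandas as pd

df_demo = pd.DataFrame({'name': ['bernice', 'jinyue', 'haim']})  # Index 0, 1, 2

# Index value 5 has no matching row in df_demo; index value 2 is absent here
extra = pd.Series(index=[5, 0, 1], data=['sixth', 'first', 'second'])

# Assignment aligns on the Index labels, not on position
df_demo['s'] = extra
print(df_demo)
```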


## Easy to drop rows with NAs

[.dropna() documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html)

Many options, including 
- drop any row that has an NA
- specify certain columns in which to look for NAs
- require a minimum number of non-NA values for a row to be kept

*Here we're not going to make the change in place, but just see what the DataFrame looks like with the NAs dropped.*

In [38]:
df.dropna()

| | name | level | grade | NAME | level_grade | grade_filled | grade_gt_level | s |
|---|---|---|---|---|---|---|---|---|
| 0 | bernice | 1 | 2.9 | BERNICE | 0.344828 | 2.9 | True | first |
| 1 | jinyue | 2 | 3.8 | JINYUE | 0.526316 | 3.8 | True | second |
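
A quick sketch of the three options listed above, on a small made-up DataFrame (the `score` column and its values are hypothetical). Note that `thresh` counts the non-NA values a row must have in order to be kept:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'name': ['bernice', 'jinyue', 'haim'],
                        'grade': [2.9, np.nan, 3.1],
                        'score': [10.0, 20.0, np.nan]})

any_na = df_demo.dropna()                    # drop rows containing any NA
grade_na = df_demo.dropna(subset=['grade'])  # only look for NAs in 'grade'
at_least_2 = df_demo.dropna(thresh=2)        # keep rows with >= 2 non-NA values

print(any_na.index.tolist())
print(grade_na.index.tolist())
print(at_least_2.index.tolist())
```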


## Also easy to drop duplicates

[.drop_duplicates() documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html)

Many options here, too, including subsets of columns to consider and which row instance to keep.

In [39]:
df_dup = pd.DataFrame({'col1':['a','b','c','a','b'],'col2':[1,2,3,1,2]})
df_dup

| | col1 | col2 |
|---|---|---|
| 0 | a | 1 |
| 1 | b | 2 |
| 2 | c | 3 |
| 3 | a | 1 |
| 4 | b | 2 |


In [40]:
df_dup.drop_duplicates()

| | col1 | col2 |
|---|---|---|
| 0 | a | 1 |
| 1 | b | 2 |
| 2 | c | 3 |
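
The two options mentioned above can be sketched like this, using the `keep` and `subset` parameters of `drop_duplicates`. Looking at which Index labels survive makes it clear which row instances were kept:

```python
import pandas as pd

df_dup = pd.DataFrame({'col1': ['a', 'b', 'c', 'a', 'b'],
                       'col2': [1, 2, 3, 1, 2]})

# Keep the last occurrence of each duplicated row instead of the first
keep_last = df_dup.drop_duplicates(keep='last')

# Only consider 'col1' when deciding whether rows are duplicates
by_col1 = df_dup.drop_duplicates(subset=['col1'])

print(keep_last.index.tolist())
print(by_col1.index.tolist())
```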


---

## Pandas Essential Basic Functionality

https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html

Pandas is well suited for many different kinds of data, but most of the time we use it for **tabular data with heterogeneously-typed columns**, as in an SQL table or Excel spreadsheet

Here are just a few of the things that Pandas does well:

- Easy handling of missing data (represented as NaN)
- Automatic and explicit data alignment
- Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
- Intuitive merging and joining data sets
- Flexible reshaping and pivoting of data sets
- Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format
- Time series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging, etc
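
As a small taste of the group by item in the list above, here is a minimal split-apply-combine sketch. The `scores` DataFrame, with its `team` and `points` columns, is made up purely for illustration:

```python
import pandas as pd

# Hypothetical data, made up for illustration
scores = pd.DataFrame({'team': ['red', 'blue', 'red', 'blue'],
                       'points': [1, 2, 3, 4]})

# Split the rows by team, apply a mean to each group, combine into one Series
mean_points = scores.groupby('team')['points'].mean()
print(mean_points)
```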