# Review: lists, loops, numpy, matplotlib

* Lists are convenient ways to store data.
  * Create a list by enclosing entries in square brackets:  ```my_list = [1,3,4,7,8]```
  * Append to the end of a list using the ```.append()``` method:  ```my_list.append(10)```
  * Subset a list using single integers or a slice:  ```my_list[1]```, ```my_list[0:4]```
  * The length of a list can be found using ```len(my_list)```

* Use a ```for __ in ___:``` loop to go through a range or items in a list

* Import libraries like ```numpy```, ```matplotlib.pyplot```, and ```datetime``` for better data manipulation
* Typical imports you'll see in the Python community are:
  * ```import numpy as np```
  * ```import matplotlib.pyplot as plt```
* Create a quick figure using syntax like ```plt.plot()``` or ```plt.scatter()```
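
As a quick refresher, the list and loop operations above can be put together in one short sketch. The variable names (`second_item`, `first_four`, `total`, `doubled`) are illustrative only and do not appear elsewhere in this notebook.

```python
# Build a list, grow it, and inspect it.
my_list = [1, 3, 4, 7, 8]
my_list.append(10)            # my_list is now [1, 3, 4, 7, 8, 10]

second_item = my_list[1]      # indexing starts at 0, so this is 3
first_four = my_list[0:4]     # the slice stops before index 4: [1, 3, 4, 7]
n_items = len(my_list)        # 6

# Loop over the items directly...
total = 0
for value in my_list:
    total += value            # running sum of the entries

# ...or loop over a range of positions.
doubled = []
for i in range(len(my_list)):
    doubled.append(2 * my_list[i])

print(second_item, first_four, n_items, total, doubled)
```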

In [1]:
import numpy
#import scipy
#import scipy.stats
import matplotlib.pyplot as plt # "plt" is the conventional short name
import pandas # for 2D tables like csv and text files
import datetime # for time series data

# special code for Jupyter Notebook; allows in-line plotting (may not be needed on your machine)
%matplotlib inline

# Creating data frames in pandas

Definition: Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame

In [2]:
dates = pandas.date_range('20190101','20200101',freq='2SM') # Set up index for data

In [3]:
numpy.random.seed(1) # Makes sure every time you run this, the random numbers don't change
dummy_data=pandas.DataFrame(numpy.random.randn(12, 4), index=dates, columns=list('ABCD'))

In [4]:
dummy_data

,A,B,C,D
2019-01-15,1.624345,-0.611756,-0.528172,-1.072969
2019-02-15,0.865408,-2.301539,1.744812,-0.761207
2019-03-15,0.319039,-0.24937,1.462108,-2.060141
2019-04-15,-0.322417,-0.384054,1.133769,-1.099891
2019-05-15,-0.172428,-0.877858,0.042214,0.582815
2019-06-15,-1.100619,1.144724,0.901591,0.502494
2019-07-15,0.900856,-0.683728,-0.12289,-0.935769
2019-08-15,-0.267888,0.530355,-0.691661,-0.396754
2019-09-15,-0.687173,-0.845206,-0.671246,-0.012665
2019-10-15,-1.11731,0.234416,1.659802,0.742044


Syntax

* The dates are your indices.

* A, B, C, & D are columns.

In [5]:
dummy_data.index

DatetimeIndex(['2019-01-15', '2019-02-15', '2019-03-15', '2019-04-15',
               '2019-05-15', '2019-06-15', '2019-07-15', '2019-08-15',
               '2019-09-15', '2019-10-15', '2019-11-15', '2019-12-15'],
              dtype='datetime64[ns]', freq='2SM-15')

In [6]:
dummy_data.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

## Retrieving subsets of data

In [7]:
dummy_data['A'] #one way to get a column of data

2019-01-15    1.624345
2019-02-15    0.865408
2019-03-15    0.319039
2019-04-15   -0.322417
2019-05-15   -0.172428
2019-06-15   -1.100619
2019-07-15    0.900856
2019-08-15   -0.267888
2019-09-15   -0.687173
2019-10-15   -1.117310
2019-11-15   -0.191836
2019-12-15    0.050808
Freq: 2SM-15, Name: A, dtype: float64

In [8]:
dummy_data.A #another way of getting a column of data

2019-01-15    1.624345
2019-02-15    0.865408
2019-03-15    0.319039
2019-04-15   -0.322417
2019-05-15   -0.172428
2019-06-15   -1.100619
2019-07-15    0.900856
2019-08-15   -0.267888
2019-09-15   -0.687173
2019-10-15   -1.117310
2019-11-15   -0.191836
2019-12-15    0.050808
Freq: 2SM-15, Name: A, dtype: float64

In [9]:
dummy_data[0:1] # slice rows by position (here, only the first row)

,A,B,C,D
2019-01-15,1.624345,-0.611756,-0.528172,-1.072969


In [10]:
dummy_data['2019-01-15':'2019-03-15']

,A,B,C,D
2019-01-15,1.624345,-0.611756,-0.528172,-1.072969
2019-02-15,0.865408,-2.301539,1.744812,-0.761207
2019-03-15,0.319039,-0.24937,1.462108,-2.060141


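* Note that slicing by label, as in the second example, includes the end point (2019-03-15), whereas slicing by position, as in the first example, excludes it (```[0:1]``` returns only row 0)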
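
The difference between the two kinds of slices is easy to check on a small frame. The sketch below uses a hypothetical four-row data set (not the notebook's `dummy_data`) and the explicit `.iloc` and `.loc` accessors, which make it clear whether a slice is by position or by label.

```python
import pandas

# Hypothetical small frame with a monthly DatetimeIndex (illustration only).
idx = pandas.to_datetime(['2019-01-15', '2019-02-15', '2019-03-15', '2019-04-15'])
df = pandas.DataFrame({'A': [1.0, 2.0, 3.0, 4.0]}, index=idx)

by_position = df.iloc[0:2]                      # rows 0 and 1; row 2 is excluded
by_label = df.loc['2019-01-15':'2019-03-15']    # the 2019-03-15 row is included

print(len(by_position))   # number of rows from the positional slice
print(len(by_label))      # number of rows from the label slice
```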

In [11]:
dummy_data.loc[dates[0:2],['A','C']] # Can be even more specific with the way we retrieve data

,A,C
2019-01-15,1.624345,-0.528172
2019-02-15,0.865408,1.744812


* One can choose specific indices and columns

In [12]:
dummy_data.iloc[[0,1],[0,2]] # the same data, selected by integer position instead of by label

,A,C
2019-01-15,1.624345,-0.528172
2019-02-15,0.865408,1.744812


In [13]:
dummy_data[dummy_data.A>0]

,A,B,C,D
2019-01-15,1.624345,-0.611756,-0.528172,-1.072969
2019-02-15,0.865408,-2.301539,1.744812,-0.761207
2019-03-15,0.319039,-0.24937,1.462108,-2.060141
2019-07-15,0.900856,-0.683728,-0.12289,-0.935769
2019-12-15,0.050808,-0.636996,0.190915,2.100255


* One can also use boolean indexing
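
Boolean indexing works in two steps: a comparison produces a True/False Series (a mask), and the mask then selects the rows where it is True. The sketch below, on a hypothetical four-row frame, also shows how conditions might be combined; each condition needs its own parentheses because `&` binds more tightly than `>`.

```python
import pandas

# Hypothetical data, chosen so the result is easy to check by eye.
df = pandas.DataFrame({'A': [1.5, -0.5, 0.3, -2.0],
                       'B': [-1.0, 2.0, 0.5, -0.1]})

mask = df['A'] > 0            # boolean Series: True, False, True, False
positive_a = df[mask]         # keeps rows 0 and 2

# Combine conditions with & (and) or | (or).
both_positive = df[(df['A'] > 0) & (df['B'] > 0)]   # keeps only row 2

print(positive_a.index.tolist())
print(both_positive.index.tolist())
```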

## Manipulations with the data

In [14]:
dummy_data.mean()

A   -0.008268
B   -0.464053
C    0.364507
D   -0.059944
dtype: float64

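What did it take the average over? By default, ```mean()``` averages down each column (over the index), which is why the result has one value per column.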

In [15]:
dd_index_mean=dummy_data.mean(axis=1) # axis=1 averages across the columns, giving one mean per date
dd_index_mean

2019-01-15   -0.147138
2019-02-15   -0.113132
2019-03-15   -0.132091
2019-04-15   -0.168148
2019-05-15   -0.106314
2019-06-15    0.362047
2019-07-15   -0.210383
2019-08-15   -0.206487
2019-09-15   -0.554072
2019-10-15    0.379738
2019-11-15   -0.033542
2019-12-15    0.426246
Freq: 2SM-15, dtype: float64
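
The effect of the `axis` argument is easiest to see on numbers small enough to average by hand. This is a minimal sketch on a hypothetical 2x2 frame; the row labels `r1` and `r2` are made up for illustration.

```python
import pandas

df = pandas.DataFrame({'A': [1.0, 3.0], 'B': [5.0, 7.0]}, index=['r1', 'r2'])

col_means = df.mean(axis=0)   # default: one mean per column -> A: 2.0, B: 6.0
row_means = df.mean(axis=1)   # one mean per row -> r1: 3.0, r2: 5.0

print(col_means)
print(row_means)
```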

To subtract each date's mean from the values on that date, use ```.sub()``` with ```axis='index'``` so that the means are matched to rows by date:

In [16]:
dummy_data.sub(dd_index_mean, axis='index')

,A,B,C,D
2019-01-15,1.771483,-0.464619,-0.381034,-0.925831
2019-02-15,0.978539,-2.188407,1.857943,-0.648075
2019-03-15,0.45113,-0.117279,1.594199,-1.92805
2019-04-15,-0.154269,-0.215906,1.301918,-0.931743
2019-05-15,-0.066114,-0.771544,0.148528,0.68913
2019-06-15,-1.462667,0.782676,0.539543,0.140447
2019-07-15,1.111239,-0.473345,0.087493,-0.725387
2019-08-15,-0.061401,0.736842,-0.485174,-0.190267
2019-09-15,-0.1331,-0.291133,-0.117174,0.541408
2019-10-15,-1.497048,-0.145322,1.280064,0.362306
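
Why `axis='index'` matters can be sketched on a hypothetical 2x2 frame: with it, the Series of row means is matched to the rows by label; with a plain `-`, pandas instead tries to match the Series labels to the column names, finds no overlap, and returns only missing values.

```python
import pandas

df = pandas.DataFrame({'A': [1.0, 3.0], 'B': [5.0, 7.0]}, index=['r1', 'r2'])
row_means = df.mean(axis=1)                   # r1: 3.0, r2: 5.0

centered = df.sub(row_means, axis='index')    # subtract each row's own mean
# r1 becomes [-2.0, 2.0] and r2 becomes [-2.0, 2.0]

misaligned = df - row_means                   # aligns 'r1', 'r2' against columns 'A', 'B'

print(centered)
print(centered.mean(axis=1))                  # each row now averages to zero
print(misaligned)                             # every entry is NaN
```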


You can also write your own function and apply it to the data set:

In [17]:
def calculate_range(x):
    y=x.max()-x.min()
    return y

In [18]:
dummy_data

,A,B,C,D
2019-01-15,1.624345,-0.611756,-0.528172,-1.072969
2019-02-15,0.865408,-2.301539,1.744812,-0.761207
2019-03-15,0.319039,-0.24937,1.462108,-2.060141
2019-04-15,-0.322417,-0.384054,1.133769,-1.099891
2019-05-15,-0.172428,-0.877858,0.042214,0.582815
2019-06-15,-1.100619,1.144724,0.901591,0.502494
2019-07-15,0.900856,-0.683728,-0.12289,-0.935769
2019-08-15,-0.267888,0.530355,-0.691661,-0.396754
2019-09-15,-0.687173,-0.845206,-0.671246,-0.012665
2019-10-15,-1.11731,0.234416,1.659802,0.742044


In [19]:
dummy_data.apply(calculate_range)

A    2.741656
B    3.446262
C    2.491970
D    4.160396
dtype: float64
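
By default `apply` passes each column to the function, which is why the result above has one range per column. Passing `axis=1` passes each row instead. The sketch below repeats `calculate_range` on a hypothetical three-row frame so that both results can be checked by hand.

```python
import pandas

def calculate_range(x):
    """Return the spread (max minus min) of the values in x."""
    return x.max() - x.min()

# Hypothetical data for illustration.
df = pandas.DataFrame({'A': [1.0, 4.0, 2.0], 'B': [10.0, 0.0, 5.0]})

col_ranges = df.apply(calculate_range)           # per column: A 3.0, B 10.0
row_ranges = df.apply(calculate_range, axis=1)   # per row: 9.0, 4.0, 3.0

print(col_ranges)
print(row_ranges)
```

For a calculation this simple, `df.max() - df.min()` gives the same per-column result without `apply`; a custom function becomes more useful as the logic grows.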

## Types of data frames, grouping, and sorting

The data don't all have to be floats. A data frame can hold a mix of strings, integers, and floats, with one type per column.

In [20]:
numpy.random.seed(1)
dummy_data2 = pandas.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                    'B': ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                    'C': numpy.random.randn(8),
                    'D': numpy.random.randn(8)})

In [21]:
dummy_data2

,A,B,C,D
0,foo,one,1.624345,0.319039
1,bar,one,-0.611756,-0.24937
2,foo,two,-0.528172,1.462108
3,bar,three,-1.072969,-2.060141
4,foo,two,0.865408,-0.322417
5,bar,two,-2.301539,-0.384054
6,foo,one,1.744812,1.133769
7,foo,three,-0.761207,-1.099891


In [22]:
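dummy_data2.groupby('B')[['C', 'D']].sum() # sum only the numeric columns; recent pandas versions would otherwise also concatenate the strings in A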

B,C,D
one,2.757401,1.203438
three,-1.834176,-3.160032
two,-1.964303,0.755636
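
Grouping splits the rows by the values in one column, applies a calculation to each group, and combines the results into a new table indexed by group. A minimal sketch on a hypothetical frame (the column names `key`, `label`, and `value` are made up) shows a single sum and several statistics at once via `agg`.

```python
import pandas

# Hypothetical data for illustration.
df = pandas.DataFrame({'key': ['one', 'two', 'one', 'two'],
                       'label': ['foo', 'bar', 'foo', 'bar'],
                       'value': [1.0, 2.0, 3.0, 4.0]})

# Select the numeric column before aggregating.
sums = df.groupby('key')['value'].sum()                     # one: 4.0, two: 6.0
stats = df.groupby('key')['value'].agg(['mean', 'count'])   # several statistics per group

print(sums)
print(stats)
```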


In [23]:
dummy_data2.sort_values(by='A')

,A,B,C,D
1,bar,one,-0.611756,-0.24937
3,bar,three,-1.072969,-2.060141
5,bar,two,-2.301539,-0.384054
0,foo,one,1.624345,0.319039
2,foo,two,-0.528172,1.462108
4,foo,two,0.865408,-0.322417
6,foo,one,1.744812,1.133769
7,foo,three,-0.761207,-1.099891
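
Sorting can use more than one column, with a separate direction for each. The sketch below, on hypothetical data, sorts by `A` ascending and then by `C` descending within each value of `A`. Note that `sort_values` returns a new frame and keeps the original row labels, as in the output above; `reset_index(drop=True)` renumbers the rows if that is preferred.

```python
import pandas

# Hypothetical data for illustration.
df = pandas.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                       'C': [0.5, -1.0, -0.2, 2.0]})

by_a_then_c = df.sort_values(by=['A', 'C'], ascending=[True, False])
# original row labels now appear in the order 3, 1, 0, 2

renumbered = by_a_then_c.reset_index(drop=True)   # labels become 0, 1, 2, 3

print(by_a_then_c)
print(renumbered)
```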


# Key points

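* pandas data frames store two-dimensional, labeled data, and each column can hold a different type (strings, integers, floats)
* Data can be retrieved by position (```.iloc```), by index and column labels (```.loc```), or with boolean conditions
* Data can be manipulated directly with built-in methods such as ```data.mean()``` or with your own functions through ```data.apply(function)```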
* Data can be sorted and grouped by the values or categories in different columns