# Scientific Data Types

## Numpy Arrays

Numpy arrays are highly optimized data structures suited to number crunching. Most of the methods available to numpy objects are implemented in compiled C or Fortran code, and most Python analysis packages use numpy under the hood.

In [1]:
import numpy as np # it is common practice to import the numpy library under the name np

In [2]:
a = np.linspace(0, 2, 12) # 12 evenly spaced real numbers on [0,2]
a.shape

(12,)

In [3]:
a = a**2 # square the array
a

array([0.        , 0.03305785, 0.1322314 , 0.29752066, 0.52892562,
       0.82644628, 1.19008264, 1.61983471, 2.11570248, 2.67768595,
       3.30578512, 4.        ])

In [4]:
a.resize(4,3) # resize the array in-place
a

array([[0.        , 0.03305785, 0.1322314 ],
       [0.29752066, 0.52892562, 0.82644628],
       [1.19008264, 1.61983471, 2.11570248],
       [2.67768595, 3.30578512, 4.        ]])

In [5]:
a[1,:] # grab the second row (index 1, since indexing starts at 0)

array([0.29752066, 0.52892562, 0.82644628])

In [6]:
np.linalg.norm(a, axis = 1) # magnitude of each row

array([0.13630101, 1.02532644, 2.91828   , 5.83936789])
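
As a cross-check on the last few cells, the sketch below rebuilds the same 4x3 array with ```reshape``` (which returns a reshaped view and leaves the original untouched, unlike the in-place ```resize```) and recomputes the row magnitudes by hand. The names ```flat```, ```grid``` and ```row_norms``` are illustrative and do not appear in the cells above.

```python
import numpy as np

flat = np.linspace(0, 2, 12) ** 2      # same values as the squared array above, still 1-D
grid = flat.reshape(4, 3)              # 4x3 view of the data; `flat` keeps its own shape

# Magnitude of each row: square, sum along the columns (axis=1), take the square root.
row_norms = np.sqrt((grid ** 2).sum(axis=1))

print(flat.shape, grid.shape)          # (12,) (4, 3)
print(np.allclose(row_norms, np.linalg.norm(grid, axis=1)))  # True
```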

### Iterating with numpy arrays
While you *could* iterate over numpy arrays, you probably do not need to: numpy has an optimized C or Fortran version for almost every numerical operation you'll need. As an example, suppose we want to square a very large array. 

In [7]:
a = np.linspace(0,2,10000000) # 10 million points on [0,2]

Let's create a copy of ```a```, which we will use to store the result.

In [8]:
b = a.copy() 

Note: we could have written ```b = a```, but that only binds a second name to the same array, so changing ```b``` would also change ```a```.
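
The difference between assignment and ```copy()``` can be seen on a small array. This is a minimal sketch; the names ```original```, ```alias``` and ```independent``` are made up for the illustration.

```python
import numpy as np

original = np.zeros(3)

alias = original               # a second name for the same array, no data is copied
alias[0] = 1.0                 # the change is also visible through `original`

independent = original.copy()  # a new array with its own data
independent[1] = 2.0           # `original` is not affected

print(original.tolist())                          # [1.0, 0.0, 0.0]
print(independent.tolist())                       # [1.0, 2.0, 0.0]
print(np.shares_memory(original, alias))          # True
print(np.shares_memory(original, independent))    # False
```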

The slow way of squaring the array is to use a python for-loop:

In [9]:
for i in range(len(a)):
    b[i] = a[i]**2
b[-1]

4.0

The fast way is to rely on numpy's vectorized arithmetic, which runs the equivalent loop in compiled code:

In [10]:
b = a**2
b[-1]

4.0
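
To put rough numbers on "slow" and "fast", the two approaches can be timed with the standard library's ```timeit``` module. This is a sketch rather than a careful benchmark: the helper functions are hypothetical, the array is smaller than the one above so the loop finishes quickly, and the timings depend on the machine, so no expected values are shown for them.

```python
import timeit
import numpy as np

x = np.linspace(0, 2, 100_000)   # smaller than the 10-million-point array above

def square_with_loop(values):
    """Square element by element in a Python for-loop."""
    out = np.empty_like(values)
    for i in range(len(values)):
        out[i] = values[i] ** 2
    return out

def square_vectorized(values):
    """Square the whole array in one vectorized operation."""
    return values ** 2

# Total time in seconds for 3 calls of each version.
loop_time = timeit.timeit(lambda: square_with_loop(x), number=3)
vector_time = timeit.timeit(lambda: square_vectorized(x), number=3)

print(np.allclose(square_with_loop(x), square_vectorized(x)))  # True
print(f"loop: {loop_time:.4f} s, vectorized: {vector_time:.4f} s")
```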

Note that the same would not work with a built-in Python sequence such as a ```range``` or a list:

In [11]:
a = range(10) # lazy sequence of the integers 0 through 9
try:
    a**2
except TypeError as e:
    print(e)

unsupported operand type(s) for ** or pow(): 'range' and 'int'
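
One way around this error, sketched below, is to convert the sequence to an array first with ```np.asarray```; a list comprehension is the pure-Python alternative for short sequences. The variable names are illustrative.

```python
import numpy as np

values = range(10)                 # a range object, not an array

as_array = np.asarray(values)      # convert once...
squared = as_array ** 2            # ...then use vectorized arithmetic

# Pure-Python equivalent, fine for short sequences.
squared_list = [v ** 2 for v in values]

print(squared.tolist())   # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
print(squared_list)       # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```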


**If you are dealing with numerical data you should probably use numpy.** First, in many cases numpy avoids the need to iterate in Python. Second, numpy probably has a function suited to your purposes. Almost anything you would do in MATLAB can be done in numpy. You may consult the numpy documentation here:

https://docs.scipy.org/doc/numpy/reference/index.html

## Pandas DataFrame

On the surface, pandas data types look much like Excel spreadsheets. Under the hood, they are built on numpy arrays, and they bring together many powerful features we've seen in other data types, making them ideal for data processing. Together, pandas and numpy have become a mainstay in the data science community.

To see how pandas works, let's start with its most commonly used type, the ```DataFrame```.

In [12]:
import pandas as pd # it is common to see pandas imported as pd

In [13]:
names = [('elvis', 'presley', 85), ('bob','smith', 30), ('jane','doe', 32)]

names = pd.DataFrame(names, columns = ['First','Last', 'Age'], index = ['first','second','third'])
names

|  | First | Last | Age |
|---|---|---|---|
| first | elvis | presley | 85 |
| second | bob | smith | 30 |
| third | jane | doe | 32 |


The DataFrame above renders as a spreadsheet-like table when viewed in a Jupyter notebook.

### Accessing columns
A given column (a ```pd.Series``` type) can be retrieved using dictionary-like syntax.

In [14]:
names['Age']

first     85
second    30
third     32
Name: Age, dtype: int64

We can also access the same column through dot notation, provided the column name is a valid Python identifier and does not clash with an existing ```DataFrame``` attribute or method.

In [15]:
names.First

first     elvis
second      bob
third      jane
Name: First, dtype: object
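
The caveat about dot notation is easy to run into. In the sketch below, the hypothetical ```people``` frame has a column named ```count```, which clashes with the ```DataFrame.count``` method, and a column named ```first name```, which is not a valid identifier; both remain reachable with bracket syntax.

```python
import pandas as pd

people = pd.DataFrame({'count': [3, 1, 2],
                       'first name': ['elvis', 'bob', 'jane']})

# Dot notation finds the DataFrame.count method, not the column.
print(callable(people.count))           # True

# Bracket syntax always refers to the column.
print(people['count'].tolist())         # [3, 1, 2]

# `people.first name` would be a syntax error, so brackets are the only option here.
print(people['first name'].tolist())    # ['elvis', 'bob', 'jane']
```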

In [16]:
names.sort_values('Age')

|  | First | Last | Age |
|---|---|---|---|
| second | bob | smith | 30 |
| third | jane | doe | 32 |
| first | elvis | presley | 85 |


In [17]:
names.Age**2

first     7225
second     900
third     1024
Name: Age, dtype: int64

### Accessing rows

A given row may be accessed using the ```loc``` and ```iloc``` indexers, both of which will return a ```pd.Series``` object. Use ```loc``` if you know the row's index label and ```iloc``` if you know its integer position.

In [18]:
names.loc['second']

First      bob
Last     smith
Age         30
Name: second, dtype: object

In [19]:
names.iloc[1]

First      bob
Last     smith
Age         30
Name: second, dtype: object
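
Both indexers also accept a column selector and slices. The sketch below rebuilds the same ```names``` frame so that it runs on its own, and highlights one difference that is easy to miss: label slices with ```loc``` include both endpoints, while position slices with ```iloc``` exclude the stop, as in ordinary Python slicing.

```python
import pandas as pd

names = pd.DataFrame(
    [('elvis', 'presley', 85), ('bob', 'smith', 30), ('jane', 'doe', 32)],
    columns=['First', 'Last', 'Age'],
    index=['first', 'second', 'third'],
)

print(names.loc['second', 'Age'])    # 30  (row label, column label)
print(names.iloc[1, 2])              # 30  (row position, column position)

# Label slices include both endpoints; position slices exclude the stop.
print(len(names.loc['first':'second']))   # 2
print(len(names.iloc[0:1]))               # 1
```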

You may also provide a boolean series object as the indexer, which acts like a filter.

In [20]:
names[names.Age > 30]

|  | First | Last | Age |
|---|---|---|---|
| first | elvis | presley | 85 |
| third | jane | doe | 32 |
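
Boolean filters can be combined element-wise with ```&``` (and), ```|``` (or) and ```~``` (not). A minimal sketch on the same data follows; the intermediate names ```over_30``` and ```not_doe``` are illustrative. When the conditions are written inline, each comparison needs its own parentheses, because ```&``` binds more tightly than ```>```.

```python
import pandas as pd

names = pd.DataFrame(
    [('elvis', 'presley', 85), ('bob', 'smith', 30), ('jane', 'doe', 32)],
    columns=['First', 'Last', 'Age'],
    index=['first', 'second', 'third'],
)

over_30 = names.Age > 30          # boolean Series, one value per row
not_doe = names.Last != 'doe'

selected = names[over_30 & not_doe]
# Inline form of the same filter: names[(names.Age > 30) & (names.Last != 'doe')]

print(over_30.tolist())           # [True, False, True]
print(selected.index.tolist())    # ['first']
```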


## Multi-indexed data

Suppose we want to represent a regular grid of 24 values (2 x 3 x 4) indexed by i, j, k. Pandas provides the ```MultiIndex``` for this purpose.

In [21]:
multi_index = pd.MultiIndex.from_product([range(2), range(3), range(4)], names = ['i','j','k'])

Now we need our data in a flattened array of compatible length.

In [22]:
data = np.linspace(1,24,24)

In [23]:
df = pd.DataFrame(data, index = multi_index)

df.tail(10) # get the last 10 rows

| i | j | k | 0 |
|---|---|---|---|
| 1 | 0 | 2 | 15.0 |
| 1 | 0 | 3 | 16.0 |
| 1 | 1 | 0 | 17.0 |
| 1 | 1 | 1 | 18.0 |
| 1 | 1 | 2 | 19.0 |
| 1 | 1 | 3 | 20.0 |
| 1 | 2 | 0 | 21.0 |
| 1 | 2 | 1 | 22.0 |
| 1 | 2 | 2 | 23.0 |
| 1 | 2 | 3 | 24.0 |


In [24]:
df.loc[1,:,2] # get all values for i = 1, k = 2

| i | j | k | 0 |
|---|---|---|---|
| 1 | 0 | 2 | 15.0 |
| 1 | 1 | 2 | 19.0 |
| 1 | 2 | 2 | 23.0 |


Documentation on multi-indexers can be found here: https://pandas.pydata.org/pandas-docs/stable/advanced.html
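
The selection for i = 1, k = 2 can also be written more explicitly with ```pd.IndexSlice```, which keeps the row selector and the column selector visibly separate. This is a sketch that rebuilds the same frame so it runs on its own; the names ```idx``` and ```subset``` are illustrative.

```python
import numpy as np
import pandas as pd

multi_index = pd.MultiIndex.from_product(
    [range(2), range(3), range(4)], names=['i', 'j', 'k']
)
df = pd.DataFrame(np.linspace(1, 24, 24), index=multi_index)

idx = pd.IndexSlice
# Rows with i == 1 and k == 2 for every j; the trailing `:` selects all columns.
subset = df.loc[idx[1, :, 2], :]

print(subset[0].tolist())   # [15.0, 19.0, 23.0]
```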

## Time Series Data

For data indexed by time, we may use a ```DatetimeIndex```.

In [25]:
timerange = pd.date_range('Jan 1, 2018', 'Dec 31, 2018', freq = 'D') # one timestamp per calendar day
doy = pd.Series(1, index = timerange)
doy.head()

2018-01-01    1
2018-01-02    1
2018-01-03    1
2018-01-04    1
2018-01-05    1
Freq: D, dtype: int64

Now we can group the data by month and sum each group individually. Because every entry is 1, the sum is the number of days in each month.

In [26]:
doy.groupby(pd.Grouper(freq = 'M')).sum() # 'M' is the month-end alias; pandas 2.2 and later spell it 'ME'

2018-01-31    31
2018-02-28    28
2018-03-31    31
2018-04-30    30
2018-05-31    31
2018-06-30    30
2018-07-31    31
2018-08-31    31
2018-09-30    30
2018-10-31    31
2018-11-30    30
2018-12-31    31
Freq: M, dtype: int64
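
A variant that does not depend on a frequency alias is to group on the month number taken from the index. This is a sketch under the same setup as above; the result is indexed by the integers 1 to 12 rather than by month-end dates, and the name ```days_per_month``` is illustrative.

```python
import pandas as pd

timerange = pd.date_range('2018-01-01', '2018-12-31', freq='D')
doy = pd.Series(1, index=timerange)

# doy.index.month holds the month number (1-12) of every timestamp.
days_per_month = doy.groupby(doy.index.month).sum()

print(days_per_month.tolist())
# [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
```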

We can also group rows by the values in a column:

In [27]:
df = pd.DataFrame(dict(count = range(6),
                 fruit =['orange','apple','pear','orange','pear','pear']))
df

|  | count | fruit |
|---|---|---|
| 0 | 0 | orange |
| 1 | 1 | apple |
| 2 | 2 | pear |
| 3 | 3 | orange |
| 4 | 4 | pear |
| 5 | 5 | pear |


In [28]:
df.groupby('fruit').mean()

| fruit | count |
|---|---|
| apple | 1.0 |
| orange | 1.5 |
| pear | 3.666667 |
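
Several statistics can be computed per group in one call with ```agg```. The sketch below rebuilds the same frame so it runs on its own; the name ```summary``` is illustrative.

```python
import pandas as pd

df = pd.DataFrame(dict(count=range(6),
                       fruit=['orange', 'apple', 'pear', 'orange', 'pear', 'pear']))

# One row per fruit, one column per statistic of the `count` column.
summary = df.groupby('fruit')['count'].agg(['mean', 'sum', 'max'])

print(summary.loc['orange', 'mean'])   # 1.5
print(summary.loc['pear', 'sum'])      # 11
print(summary.loc['pear', 'max'])      # 5
```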
