# Session 4: Numpy and Pandas

## Numpy

Python lists are easy to use and versatile. Once you've mastered the basic syntax, array handling in other languages will quickly feel cumbersome.

However, the downside of this flexibility is poor performance: a list can hold objects of arbitrary types, so its data cannot be arranged efficiently in memory, and operations on it cannot be vectorized or parallelized in obvious ways to exploit modern SIMD instruction set extensions such as AVX2. This is where ``Numpy`` shines: its goal is not only to provide functions that simplify all sorts of everyday numerical operations, but also to provide a new data type that trades a little bit of flexibility for a huge performance gain. Instead of lists, ``Numpy`` uses arrays with a fixed data type such as ``double`` or ``int``. Python lists can be converted directly into arrays:

In [None]:
import numpy as np

ints = np.array( [1,1,2,3,5] )
floats = np.array( [1.,1.,2.,3.,5.] )
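
To make the trade-off a bit more tangible, here is a minimal sketch of how the same operator behaves on a list and on an array, and how a fixed data type is enforced. The variable names and sample values are only chosen for illustration:

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration
values = [1, 2, 3]
arr = np.array(values)

print(values * 2)   # For lists, * repeats the list
print(arr * 2)      # For arrays, * multiplies every element

# Mixing integers and floats yields one common data type for the whole array
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)
```

The array stores all elements with the same type, which is exactly the bit of flexibility we give up in exchange for speed.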

Accessing one-dimensional arrays is just like accessing list elements:

In [None]:
print(ints)
print(ints[0])
print(floats[-1])
print(floats[1:3])

Nested lists can be converted into 2D arrays. Accessing elements in such higher-dimensional arrays works similarly to accessing nested lists, but with a single set of brackets and comma-separated indices:

In [None]:
mat = np.array( [[1,2,3],[4,5,6],[7,8,9]] )

print(mat)
print()
print(mat.shape)
print()
print(mat[0,0])
print(mat[0,1])
print()
print(mat[:,0])
print(mat[0,:])
print()
print(mat[0:2,0:2])

Elements can also be selected based on some condition:

In [None]:
sub = mat[mat%2==0]

print( sub.shape )
print( sub )
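
The condition inside the brackets is itself an array of booleans, a so-called mask. The following sketch (the names ``mask``, ``both`` and ``evens_only`` are my own) shows how masks can be combined and used for assignment:

```python
import numpy as np

mat = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

mask = mat % 2 == 0   # Boolean array with the same shape as mat
print(mask)

# Conditions are combined with & (and) and | (or); the parentheses are required
both = mat[(mat % 2 == 0) & (mat > 4)]
print(both)

# A mask also works on the left-hand side: set all odd entries to zero
evens_only = mat.copy()
evens_only[~mask] = 0
print(evens_only)
```

Note that selecting with a mask always returns a flat, one-dimensional array, while assigning through a mask keeps the original shape.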

The data type is chosen automatically but can also be set explicitly. To check the data type, inspect the ``dtype`` attribute:

In [None]:
print(ints.dtype)
print(floats.dtype)

To create an array of zeros with a given shape, use the ``zeros`` function:

In [None]:
arr0 = np.zeros(5)
arr1 = np.zeros([2,3], dtype="int")

print(arr0)
print(arr0.dtype)
print()
print(arr1)
print(arr1.shape)
print(arr1.dtype)

The default data type is ``float64``, also known as ``double``. To pre-set an array with ones instead of zeros, use the ``ones`` function, while ``empty`` returns an uninitialized array. ``full`` creates an array with a given shape and a default value of your choice.
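
A short sketch of these initializers, with shapes and values picked arbitrarily. Note the difference between the function ``np.full``, which creates a new array, and the array method ``fill``, which overwrites an existing one in place:

```python
import numpy as np

ones = np.ones([2, 3])          # 2x3 array filled with 1.0
sevens = np.full([2, 3], 7.5)   # 2x3 array filled with 7.5
scratch = np.empty(4)           # Uninitialized: the contents are arbitrary

scratch.fill(-1.0)              # Overwrite every element in place

print(ones)
print(sevens)
print(scratch)
```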

Naturally, ``Numpy`` has built-in functions to create linearly spaced elements. The two most commonly used ones are ``arange`` and ``linspace``:

In [None]:
vals1 = np.arange(0,5,0.5) # Just like the range function, but also works with float spacing
vals2 = np.linspace(0,5,11) # Generate 11 values between 0. and 5., including the borders

print(vals1, len(vals1))
print(vals2, len(vals2))

Where ``Numpy`` really shines is its support for whole-array (vectorized) operations. For most mathematical functions in Python's ``math`` library there is a ``Numpy`` equivalent, usually with the same name. As the ``Numpy`` functions also accept normal data types such as integers, floats and lists, and generally give much better performance on arrays (some distributions additionally link ``Numpy`` against the Intel Math Kernel Library (MKL)), I recommend using the ``Numpy`` functions:

In [None]:
def f(x):
    return np.sin(x)

x = np.linspace(0.,5.,11) # 11 values between 0. and 5.; np.arange(0.,5.,11) would yield only [0.]
y = f(x) # The function f is applied to all values in the array x

As you might have expected, ``Numpy`` arrays are compatible with ``Matplotlib``.

Thanks to ``Numpy``, we can manipulate entire arrays in a single line of code:

In [None]:
x = x**2 + y + 3. # Square every entry of x, add y element-wise and add 3 to every element
z = np.sqrt(np.arange(0,30,1))
print(z)

``Numpy`` provides a plethora of whole-array functions that help to get a first impression of the data:

In [None]:
print("Maximum: {0:f}".format(np.max(z)))
print("Minimum: {0:f}".format(np.min(z)))
print("Sum: {0:f}".format(np.sum(z)))
print("Average: {0:f}".format(np.mean(z)))
print("Standard deviation: {0:f}".format(np.std(z)))
print("Median: {0:f}".format(np.median(z)))

``Numpy`` also provides solutions for your everyday linear algebra problems:

In [None]:
vecA = np.array( [2,-1,4] )
vecB = np.array( [-1,-2,5] )

matA = np.array( [[2,-1,4],[-1,-2,5],[4,-2,8]] )

print("Cross-product of vectors A and B:", np.cross(vecA,vecB))
print("Matrix A times vector A: \n", np.dot(matA, vecA))
print("Matrix A squared: \n", np.dot(matA, matA))
print("Determinant of Matrix A:", np.linalg.det(matA))
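
Another everyday task is solving a linear system of equations. The matrix ``matA`` above is not suited for this because its third row is twice the first one, so its determinant is zero. The sketch below therefore uses a small, hypothetical 2x2 system of my own choosing:

```python
import numpy as np

# Hypothetical non-singular system A x = b, chosen only for illustration
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])

x = np.linalg.solve(A, b)   # Solve A x = b without forming the inverse explicitly
print(x)

# The @ operator is the matrix product, equivalent to np.dot for these arrays
print(A @ x)                # Should reproduce b up to rounding errors
```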

``Numpy`` also provides functions to get data into files and back into your RAM:

In [None]:
def poly(x,a):
    return x**a

x = np.arange(0,1,0.01)
a = np.arange(0,3.5,0.5)

lines = len(x)
cols = len(a)
data = np.zeros( [lines,cols] )
print(data.shape)

for i in range(lines):
    for j in range(cols):
        data[i,j] = poly(x[i],a[j])
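
The nested loop is easy to read, but it is not the only way to fill the table. As a possible alternative, here is a sketch that uses broadcasting to compute all entries at once (``data_vec`` is just a name I picked):

```python
import numpy as np

x = np.arange(0, 1, 0.01)
a = np.arange(0, 3.5, 0.5)

# x[:, None] is a column of shape (len(x), 1), a[None, :] a row of shape (1, len(a)).
# Broadcasting expands both to (len(x), len(a)) before the power is taken.
data_vec = x[:, None] ** a[None, :]

print(data_vec.shape)
```

The result has the same shape and content as the array filled by the loop, but the work is done inside ``Numpy`` instead of the Python interpreter.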


In [None]:
dataExp = np.insert(data,0,x,axis=1) # Insert the x values as the first column

print(dataExp.shape)

np.savetxt("data.txt", dataExp)

In [None]:
data = np.loadtxt("data.txt")

print(data.shape)

In [None]:
x = data[:,0] # Extract the first column

sqt = data[:,2] # Extract the third column
lin = data[:,3]
quad = data[:,5]

print(sqt)
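
``savetxt`` and ``loadtxt`` take a few optional arguments that make the files easier to read for humans and other programs. The following sketch writes comma-separated values with a header line. To keep it self-contained, it uses an in-memory buffer instead of a real file, and the small table is made up:

```python
import io
import numpy as np

# Hypothetical table with x and x**2 in two columns
table = np.array([[0.0, 0.0], [0.5, 0.25], [1.0, 1.0]])

buffer = io.StringIO()   # In-memory stand-in for a file on disk
np.savetxt(buffer, table, delimiter=",", header="x,x_squared", fmt="%.4f")
print(buffer.getvalue())

buffer.seek(0)           # Rewind before reading
restored = np.loadtxt(buffer, delimiter=",")  # Lines starting with # are skipped
print(restored.shape)
```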

## Pandas

``Numpy`` is just perfect when you're dealing with purely numerical data. But in the real world, datasets regularly contain string-valued fields such as a date or some category, which makes getting the data into your Python program somewhat more tricky (see ``np.genfromtxt``). In addition, working with ``Numpy`` indices feels more like C (or Fortran...) and is not entirely in line with *the Zen of Python*, where we can work with such beautiful data structures as dictionaries that allow us to access data based on strings instead of indices. For instance, to a Python purist, data in the popular CSV (comma-separated values) format seems predestined for reading into a dictionary where every column forms a list that can be accessed via the column header specified in the first line of the file:

![CSV Header in Notepad++](csv_header.png "Title")

Here, we are dealing with dates in the English date format, floats, integers and nulls indicating days with either no trading or simply missing information. Python lists are flexible enough to deal with all of these, and a simple function that reads a CSV file into a dictionary takes just a few lines:

In [None]:
def loadCSV(fname):
    data = {}
    with open(fname, "r") as file:
        # The first line holds the column headers
        header = file.readline().rstrip().split(",")
        for head in header:
            data[head] = []
        for line in file:
            cols = line.rstrip().split(",")
            for head, value in zip(header, cols):
                try:
                    data[head].append(float(value)) # Numerical entry
                except ValueError:
                    data[head].append(value) # Fallback: keep the string
    return data

Detecting the correct data type is where things start to get a little bit annoying, so I have even included some runtime error handling to automatically convert all kinds of numerical data into floats, with strings being the fallback option in case the conversion fails.

In [None]:
data = loadCSV("Gold.csv")

print("Dictionary keys:", data.keys())

Accessing and inspecting the data dictionary is convenient and straightforward. We can even do some basic statistics:

In [None]:
print("Start date:", data["Date"][0])
print("End date:", data["Date"][-1])
print("Number of entries:", len(data["Date"]))
print("Missing entries:", data["Open"].count("null"))

More advanced numerical analysis fails because we're mixing floats and strings even within one column of data, so we would need to clean up our data dictionary first. But even then, things like list comprehensions over several columns quickly get tiresome and bring back what we wanted to avoid: working with indices...
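
To give an idea of what such a clean-up could look like, here is a minimal sketch for a single column. The sample values are made up and only mimic the mixture of floats and ``"null"`` strings that a loader like the one above produces:

```python
import numpy as np

# Hypothetical column mixing floats and strings
column = [1200.5, "null", 1210.0, 1195.25, "null"]

# Replace every non-float entry by NaN so that the column becomes purely numerical
cleaned = np.array([v if isinstance(v, float) else np.nan for v in column])

print(cleaned.dtype)
print(np.isnan(cleaned).sum())   # Number of missing entries
print(np.nanmean(cleaned))       # Mean that ignores the NaN values
```

This works, but it has to be repeated for every column, and the plain ``np.mean`` would return NaN as soon as a single entry is missing.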

Thankfully, there is a Python library that combines the speed of ``numpy`` with the flexibility and readability of simple Python structures: ``Pandas``, whose name is commonly read as short for *Python Data Analysis Library*. Everything revolves around dataframes which consist of series, roughly equivalent to tables and columns. A dataframe can be created from nested lists where every element of the outer list represents one row of the table. Pandas takes the column headers as a list of strings and creates a nice table:

In [None]:
import pandas as pd

month = ["Jan-2021", "Feb-2021", "Mar-2021"]
profit = [638, 436, 887]
debt = [-3554, -3145, -2901]

someData = list(zip(month,profit,debt))

df = pd.DataFrame(someData,columns=["Month","Profit","Debt"])
print(df)

Dictionaries can be directly converted into dataframes:

In [None]:
df = pd.DataFrame(data)
print(df)

Among other file types such as JSON, XLS, ODS or HDF5, CSV files can be read directly into a dataframe. A subsequent call to the ``info`` method provides some basic information regarding the structure of the data:

In [None]:
df = pd.read_csv("Gold.csv")
print(df.info())

Pandas automatically detected a total of 5231 dates with the generic data type *object* in column 0 and 5119 non-null values in columns 1 to 6, and set the data type of these columns to ``float64``. For numerical data, the ``describe`` method is a good start:

In [None]:
df.describe()
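
If the dates should be more than generic objects, ``read_csv`` can be asked to parse them. The sketch below uses a made-up three-line excerpt held in a string instead of the real file. The column layout and the ISO date format are assumptions for the sake of the example and may differ from your data:

```python
import io
import pandas as pd

# Hypothetical excerpt, not taken from the actual file
csv_text = """Date,Open,High,Low
2012-01-03,1570.0,1605.0,1566.8
2012-01-04,null,null,null
2012-01-05,1611.9,1625.0,1596.5
"""

sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(sample.dtypes)
print(sample["Open"].isna().sum())   # The "null" entries are read as NaN
```

With a proper datetime column, rows can later be selected by comparing dates instead of searching for substrings.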

Dataseries (columns) can be selected using the column header string and dictionary syntax. Passing a list of headers returns a dataframe containing just those columns:

In [None]:
subDf = df[["Open","Low","High"]]

print(subDf)

To investigate just a subset of rows, use the ``iloc`` indexer to slice your dataframe:

In [None]:
print(subDf.iloc[:1000])
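
``iloc`` selects by position, just like list slicing. Its sibling ``loc`` selects by index label instead, and the difference only becomes visible once the index is not simply 0, 1, 2, and so on. A small sketch with a made-up frame:

```python
import pandas as pd

# Hypothetical frame with a non-default index
frame = pd.DataFrame({"Open": [10.0, 11.0, 12.0, 13.0]}, index=[100, 101, 102, 103])

by_position = frame.iloc[0:2]    # Positions 0 and 1, the end is excluded
by_label = frame.loc[100:102]    # Labels 100 to 102, the end is included

print(by_position)
print(by_label)
print(frame.iloc[-1])            # The last row as a dataseries
```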

You can also select data based on a criterion such as a numerical threshold or a substring:

In [None]:
print(subDf[ subDf["High"] > 1000 ])

print(df[ df["Date"].str.contains("2012")  ] )
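
Several criteria can be combined as well. As with ``Numpy`` masks, the operators are ``&``, ``|`` and ``~`` rather than ``and``, ``or`` and ``not``. The frame in this sketch is made up for illustration:

```python
import pandas as pd

# Hypothetical data, chosen only for illustration
frame = pd.DataFrame({
    "Date": ["2011-12-30", "2012-01-03", "2012-01-04"],
    "High": [995.0, 1005.0, 990.0],
})

in_2012 = frame["Date"].str.contains("2012")
above = frame["High"] > 1000

print(frame[in_2012 & above])   # Both conditions hold
print(frame[in_2012 | above])   # At least one condition holds
print(frame[~in_2012])          # Negation of a condition
```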

Similar to the ``keys()`` method provided by dictionaries, a short one-liner is sufficient to create a list of all column names, which comes in handy when we need to iterate through a dataframe:

In [None]:
print(df.columns.tolist())

You can also do element-wise calculations with entire columns similar to ``numpy`` arrays. Here, we will calculate the daily fluctuation of the gold price relative to the open price to quantify the daily volatility:

In [None]:
vola = (subDf["High"]-subDf["Low"])/subDf["Open"]
print(vola)

The result will be a dataseries which we can inspect further:

In [None]:
print("Median Fluctuation:", vola.median())
print("Max Fluctuation:", vola.max())

Dataseries can be added to existing dataframes just like new dictionary entries:

In [None]:
results = pd.DataFrame() # Create an empty Dataframe
results["Volatility"] = vola
print(results)

All basic statistical quantities such as ``mean``, ``median`` or ``std`` can be calculated for entire frames, selected series, or subsets of series based on some sort of filter. ``Matplotlib`` natively supports ``Pandas`` dataseries:

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline 

plt.figure(figsize=(15,10))
plt.hist(vola, 100)
plt.axvline(vola.mean(), lw = 4, ls = "--", color = "red") # Arithmetic mean
plt.axvline(vola.median(), lw = 4, ls = "--", color = "orange") # Median
plt.axvline(vola.mode().iloc[0], lw = 4, ls = "--", color = "yellow") # Mode of the distribution (mode() returns a dataseries, so we take its first entry)
plt.show()
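
Coming back to the statistical quantities for frames, series and filtered subsets, here is a short sketch on a made-up frame. One detail worth knowing: the ``std`` method of ``Pandas`` uses the sample standard deviation (``ddof=1``) by default, whereas ``np.std`` defaults to ``ddof=0``:

```python
import numpy as np
import pandas as pd

# Hypothetical data, chosen only for illustration
frame = pd.DataFrame({
    "Open": [10.0, 12.0, 14.0, 16.0],
    "High": [11.0, 15.0, 15.0, 20.0],
})

col_means = frame.mean()                                       # One value per column
high_std = frame["High"].std()                                 # A single series
open_median = frame.loc[frame["High"] > 12, "Open"].median()   # Filtered subset

print(col_means)
print(high_std, np.std(frame["High"].to_numpy(), ddof=1))
print(open_median)
```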

By the way: If you need to compare multiple distributions of some quantities, e.g. the volatility of various cryptocurrencies, boxplots are a good start. They include key information such as the median (orange line), the interquartile range between the lower and upper quartiles, which contains 50% of all values, as a box surrounding the orange line, *whiskers* that extend to the most extreme values within 1.5 times the interquartile range from the box to visualize the spread of most of the values, plus outliers as individual dots:

In [None]:
plt.figure(figsize=(15,10))
plt.boxplot(results["Volatility"].dropna(), labels=["Gold"])
plt.show()
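
The quantities shown in a boxplot can also be computed by hand, which helps when reading the plot. The following sketch follows the default convention of ``Matplotlib`` (whiskers within 1.5 times the interquartile range) and uses a made-up sample with one obvious outlier:

```python
import numpy as np

# Hypothetical sample with one obvious outlier
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1   # Interquartile range, i.e. the height of the box

# Whiskers reach to the most extreme data points within 1.5 * IQR from the box
low_whisker = values[values >= q1 - 1.5 * iqr].min()
high_whisker = values[values <= q3 + 1.5 * iqr].max()

# Everything beyond the whiskers is drawn as an individual dot
outliers = values[(values < low_whisker) | (values > high_whisker)]

print(q1, median, q3)
print(low_whisker, high_whisker)
print(outliers)
```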