# Methods I: Programming and Data Analysis

## Session 14: NumPy; Pandas

### Gerhard Jäger

#### (based on Johannes Dellert's slides)

February 8, 2022

In [None]:
import numpy as np

To create sequences of numbers, NumPy provides a function analogous to `range` that returns arrays instead of lists.

In [None]:
np.arange( 10, 30, 5 )

In [None]:
np.arange( 0, 2, 0.3 )  

When `arange` is used with floating point arguments, it is generally not possible to predict the number of elements obtained, due to the finite floating point precision. For this reason, it is usually better to use the function `linspace` that receives as an argument the number of elements that we want, instead of the step:

In [None]:
np.linspace( 0, 2, 9 )                 # 9 numbers from 0 to 2
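The following sketch illustrates the point about floating point steps. The concrete values (`1`, `1.3`, `0.1`) are chosen for illustration only; depending on rounding, `arange` may include the stop value even though it is nominally excluded, whereas `linspace` always returns exactly the requested number of elements.

```python
import numpy as np

# With a float step, the element count depends on rounding in (stop - start) / step.
risky = np.arange(1, 1.3, 0.1)
print(len(risky), risky)   # the nominally excluded stop value 1.3 may show up

# linspace fixes the element count and includes both endpoints by default.
safe = np.linspace(1, 1.3, 4)
print(len(safe), safe)
```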

In [None]:
from numpy import pi
from matplotlib import pyplot as plt

x = np.linspace( 0, 2*pi, 100 )        # useful to evaluate function at lots of points
f = np.sin(x)
plt.plot(x, f)

In [None]:
np.array([[[[1,2], [3,4], [5,6]], 
          [[1,2], [3,4], [5,6]]],
         [[[1,2], [3,4], [5,6]], 
          [[1,2], [3,4], [5,6]]]
         ])

## Printing Arrays

When you print an array, NumPy displays it in a similar way to nested lists, but with the following layout:
- the last axis is printed from left to right,
- the second-to-last is printed from top to bottom,
- the rest are also printed from top to bottom, with each slice separated from the next by an empty line.

One-dimensional arrays are then printed as rows, two-dimensional arrays as matrices, and three-dimensional arrays as lists of matrices.

In [None]:
import numpy as np
a = np.arange(6)
print(a)

In [None]:
b = np.arange(12).reshape(4,3)
print(b)

In [None]:
c = np.arange(24).reshape(2,3,4)
print(c)

If an array is too large to be printed, NumPy automatically skips the central part of the array and only prints the corners:

In [None]:
print(np.arange(10000))

In [None]:
print(np.arange(10000).reshape(100, 100))

## Basic operations

Arithmetic operators on arrays apply *elementwise*. A new array is created and filled with the result.

In [None]:
a = np.array([20,30,40,50], dtype=int)
b = np.arange(4)
c = a*b
print(c)

In [None]:
print(b**2)

In [None]:
print(10*np.sin(a))

In [None]:
print(a<35)

Some operations, such as `+=` and `*=`, act in place to modify an existing array rather than create a new one.

In [None]:
a = np.ones((2,3))
b = np.random.random((2,3))
a *= 3
print(a)

In [None]:
b += a
print(b)
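A minimal sketch of what "in place" means here: a second name bound to the same array (the name `alias` is introduced for illustration) sees the change made by `*=`, whereas `a = a * 3` builds a new array and leaves the old one untouched.

```python
import numpy as np

a = np.ones((2, 3))
alias = a            # second name for the same array object

a *= 3               # in place: the existing array is modified
print(alias is a)    # True
print(alias[0, 0])   # 3.0

a = a * 3            # not in place: a new array is created and bound to `a`
print(alias is a)    # False
print(alias[0, 0])   # still 3.0
```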

When operating with arrays of different types, the type of the resulting array corresponds to the more general or precise one (a behavior known as upcasting).

In [None]:
from math import pi
a = np.ones(3, dtype=np.int32)
b = np.linspace(0, pi, 3)
b.dtype.name

In [None]:
a.dtype

In [None]:
c = a+b
c

In [None]:
c.dtype.name

In [None]:
d = np.exp(c*1j)

In [None]:
d.dtype.name
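The result type of a mixed operation can also be queried without performing the operation, using `np.result_type`. The sketch below uses its own small arrays (`ints`, `floats`) rather than the variables from the cells above.

```python
import numpy as np

print(np.result_type(np.int32, np.float64))       # float64
print(np.result_type(np.float64, np.complex128))  # complex128

ints = np.ones(3, dtype=np.int32)
floats = np.linspace(0, 1, 3)
# The dtype of the sum matches the predicted result type.
print((ints + floats).dtype == np.result_type(ints, floats))  # True
```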

Many unary operations, such as computing the sum of all the elements in the array, are implemented as methods of the `ndarray` class.

In [None]:
a = np.random.random((2,3))
a

In [None]:
a.sum()

In [None]:
a.min()

In [None]:
a.max()

By default, these operations apply to the array as though it were a list of numbers, regardless of its shape. However, by specifying the `axis` parameter you can apply an operation along the specified axis of an array:

In [None]:
b = np.arange(12).reshape(3,4)
b

In [None]:
b.sum(axis=0)

In [None]:
b.min(axis=1)

In [None]:
np.arange(5).cumsum()

In [None]:
b.cumsum(axis=1)
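One way to read the `axis` parameter is that it names the axis that gets collapsed. The sketch below checks this on the same 3×4 array by looking at the shapes of the results; the names `col_sums` and `row_sums` are introduced for illustration.

```python
import numpy as np

b = np.arange(12).reshape(3, 4)

col_sums = b.sum(axis=0)   # collapses axis 0 (rows): one value per column
row_sums = b.sum(axis=1)   # collapses axis 1 (columns): one value per row

print(col_sums.shape, col_sums)   # (4,) [12 15 18 21]
print(row_sums.shape, row_sums)   # (3,) [ 6 22 38]
```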

## Universal Functions

NumPy provides familiar mathematical functions such as sin, cos, and exp. In NumPy, these are called "universal functions" (`ufunc`). Within NumPy, these functions operate elementwise on an array, producing an array as output.

In [None]:
B = np.arange(3)
B

In [None]:
np.exp(B)

In [None]:
np.sqrt(B)

In [None]:
C = np.array([2., -1., 4.])
np.add(B, C)
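As a small check of the above, the functions used here are instances of `np.ufunc`, and the arithmetic operators on arrays give the same result as the corresponding universal function:

```python
import numpy as np

B = np.arange(3)
C = np.array([2., -1., 4.])

print(isinstance(np.add, np.ufunc))          # True
print(np.array_equal(np.add(B, C), B + C))   # True
print(np.add(B, C))                          # [2. 0. 6.]
```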

## Indexing, Slicing and Iterating

**One-dimensional** arrays can be indexed, sliced and iterated over, much like lists and other Python sequences.

In [None]:
a = np.arange(10)**3
a

In [None]:
a[2]

In [None]:
a[2:5]

In [None]:
a[:6:2] = -1000
a

In [None]:
a[::-1]

In [None]:
for i in a:
    print(np.cbrt(i))   # cube root; i**(1/3.) would yield nan for the negative entries

**Multidimensional** arrays can have one index per axis. These indices are given in a tuple separated by commas:

In [None]:
def f(x,y):
    return 10*x+y

b = np.fromfunction(f,(5,4),dtype=int)
b

In [None]:
b[2,1]

In [None]:
b[0:5,1]

In [None]:
b[:,1]

In [None]:
b[1:3, :]

When fewer indices are provided than the number of axes, the missing indices are considered complete slices:

In [None]:
b[-1]
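In other words, `b[-1]` is shorthand for `b[-1, :]`. A self-contained sketch, rebuilding the same 5×4 array with a lambda instead of the function `f`:

```python
import numpy as np

b = np.fromfunction(lambda x, y: 10 * x + y, (5, 4), dtype=int)

print(b[-1])                             # [40 41 42 43]
print(np.array_equal(b[-1], b[-1, :]))   # True
```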

**Iterating** over multidimensional arrays is done with respect to the first axis:

In [None]:
for row in b:
    print(row)
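If every single element should be visited instead of every row, one option is the `flat` attribute of the array. The sketch below uses a small 2×3 array introduced for illustration to contrast the two ways of iterating.

```python
import numpy as np

b = np.arange(6).reshape(2, 3)

for row in b:            # iterates over axis 0: each item is a 1-D array
    print(row.shape)     # (3,)

elements = [int(x) for x in b.flat]   # b.flat iterates over every element
print(elements)          # [0, 1, 2, 3, 4, 5]
```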

# Pandas

Pandas is a package that builds on NumPy arrays and offers a more convenient interface for tabular data. You can name rows and columns, easily compute summary statistics, plot the data, etc.

Pandas defines a data type `DataFrame`, which is similar in functionality to R's data frames.

(Some of the following material is taken from https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/)

In [None]:
import pandas as pd

In [None]:
data = {
    'apples': [3, 2, 0, 1], 
    'oranges': [0, 3, 7, 2]
}

Data frames can be created from lists, NumPy arrays or dictionaries. When a dictionary is used, its keys are re-used as column names.

In [None]:
purchases = pd.DataFrame(data)

purchases
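For comparison, here is a sketch of the other two construction routes, from a NumPy array and from a list of lists. Neither carries column names, so they are passed via the `columns` keyword; the names `values`, `from_array` and `from_lists` are introduced for illustration.

```python
import numpy as np
import pandas as pd

# Each inner list is one row: [apples, oranges]
values = np.array([[3, 0], [2, 3], [0, 7], [1, 2]])

from_array = pd.DataFrame(values, columns=['apples', 'oranges'])
from_lists = pd.DataFrame(values.tolist(), columns=['apples', 'oranges'])

print(from_array)
```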

Using the `index` keyword, you can add row names.

In [None]:
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

purchases

Individual rows or columns can be accessed using the corresponding names.

In [None]:
purchases['apples']

In [None]:
purchases.loc['Robert']
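Row and column names can also be combined to pick out a single cell, and `iloc` does the same by integer position. A self-contained sketch with the same data:

```python
import pandas as pd

purchases = pd.DataFrame(
    {'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]},
    index=['June', 'Robert', 'Lily', 'David'],
)

print(purchases.loc['Robert', 'apples'])   # 2  (row label, column label)
print(purchases.iloc[1, 0])                # 2  (row position, column position)
print(purchases.loc[['June', 'Lily'], 'oranges'].tolist())   # [0, 7]
```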

Pandas data frames can easily be written to and read from CSV files (as well as a host of other file types).

In [None]:
purchases.to_csv("purchases.csv")

In [None]:
pd.read_csv("purchases.csv", index_col=0)
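The sketch below shows why `index_col=0` is passed when reading the file back in. To avoid touching the file system, it writes the CSV to a string and reads it via `io.StringIO`; this is an assumption made for the example, not how the cells above do it.

```python
import io
import pandas as pd

purchases = pd.DataFrame(
    {'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]},
    index=['June', 'Robert', 'Lily', 'David'],
)

csv_text = purchases.to_csv()   # no path given: the CSV is returned as a string
print(csv_text)

# With index_col=0, the first column becomes the row names again.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(restored.index))              # ['June', 'Robert', 'Lily', 'David']

# Without it, the row names end up in an ordinary, unnamed column.
without_index_col = pd.read_csv(io.StringIO(csv_text))
print(list(without_index_col.columns))   # ['Unnamed: 0', 'apples', 'oranges']
```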

In [None]:
! wget http://www.sfs.uni-tuebingen.de/~jdellert/northeuralex/0.9/northeuralex-0.9-forms.tsv

In [None]:
# on_bad_lines="skip" replaces error_bad_lines=False, which was removed in pandas 2.0
northeuralex = pd.read_csv("northeuralex-0.9-forms.tsv", sep="\t", on_bad_lines="skip")
northeuralex

In [None]:
northeuralex.head()

In [None]:
northeuralex.head(10000).tail(10)

In [None]:
northeuralex.info()

In [None]:
northeuralex.shape

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows instead
temp_df = pd.concat([northeuralex, northeuralex])


In [None]:
temp_df.shape


In [None]:
temp_df = temp_df.drop_duplicates()

temp_df.shape


In [None]:
northeuralex[["Language_ID", "Concept_ID"]].drop_duplicates().Concept_ID.value_counts()


In [None]:
northeuralex[["Language_ID", "Concept_ID"]].drop_duplicates().Concept_ID.value_counts().plot()


In [None]:
northeuralex[["Language_ID", "Concept_ID"]].drop_duplicates().Concept_ID.value_counts().hist()


In [None]:
northeuralex[["Language_ID", "Concept_ID"]].drop_duplicates().Language_ID.value_counts().hist()


It is straightforward to split a data frame into groups according to the values in one column and apply an operation to each group separately.

In [None]:
northeuralex.groupby('Glottocode').apply(lambda x: len(x.Concept_ID.unique())).sort_values()


In [None]:
northeuralex.groupby('Glottocode').apply(lambda x: len(x.Concept_ID.unique())).hist()


In [None]:
northeuralex.groupby('Glottocode').apply(lambda x: len(x.Concept_ID.unique())).plot(kind="box")
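To make the split-and-apply step easier to follow, here is a sketch on a tiny, made-up data frame. The column names mirror those used above, but the values are hypothetical. It uses `nunique()`, which counts distinct values per group; for columns without missing values this should give the same counts as `len(x.Concept_ID.unique())`.

```python
import pandas as pd

# Hypothetical toy data: two languages, with one repeated concept in the first
toy = pd.DataFrame({
    'Glottocode': ['aaaa1111', 'aaaa1111', 'aaaa1111',
                   'bbbb2222', 'bbbb2222', 'bbbb2222'],
    'Concept_ID': ['eye', 'eye', 'ear', 'eye', 'nose', 'ear'],
})

# Split by Glottocode, then count the distinct concepts within each group
per_group = toy.groupby('Glottocode').Concept_ID.nunique()
print(list(per_group.index))   # ['aaaa1111', 'bbbb2222']
print(per_group.tolist())      # [2, 3]
```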