# Installing 3rd party libraries

## conda/mamba vs pip
### pip:
- Python package installer.
- Installs packages from the Python Package Index ([PyPI](https://pypi.org/)).
- comes with your python installation
- you need some other tool for managing virtual environments
- most packages are available on PyPI
- but sometimes they need compilation, making installation more difficult

### conda/mamba:
- Cross-language package manager (Python, R, Ruby, Lua, Scala, Java, JavaScript, C/C++, FORTRAN).
- Installs packages from the Anaconda repository (or the community `conda-forge` [repository](https://conda-forge.org/))
- needs to be installed separately
- Environment management built-in
- licensing restrictions apply when using Anaconda's default repositories for commercial purposes.
  - conda-forge (used by default in mamba) is fine though
- some packages that need compilation are available as pre-built binaries from conda, making installation easier in some cases

## So what should I do?
- use conda/mamba for managing virtual environments
  - `conda create -n <NAME> python=<VERSION>`
  - `conda activate` / `conda deactivate`
  - `conda env list`
  - `conda env remove -n <NAME>`

- use pip for installing packages. Will always install into the currently active environment.
  - `pip install <PACKAGE_NAME>==<VERSION>` (install into current environment, version spec is optional)
  - `pip install --upgrade <PACKAGE_NAME>` (update package to latest available version)
  - `pip uninstall <PACKAGE_NAME>` (remove package from current environment)
  - `pip install -r requirements.txt` (install all packages listed in requirements file)
  - `pip list` (display list of all installed packages)

- use conda/mamba for installing packages only if you cannot install the package using pip
- in case of problems (very rare):
  - first create a new environment with only python
  - then use conda/mamba to install all those packages you cannot install via pip
  - then use pip to install remaining packages
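
To confirm that `pip` really installed into the environment you expect, you can ask the running interpreter where it lives and which package versions it sees. A minimal sketch using only the standard library; the helper name `installed_version` is made up for this example:

```python
import sys
from importlib import metadata
from typing import Optional


def installed_version(package_name: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# path of the interpreter running this code (should point into the active environment)
print(sys.executable)
# version of pip visible to this interpreter, or None if it is missing
print(installed_version("pip"))
```

If the printed path does not belong to the environment you activated, packages were probably installed somewhere else.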

# numpy

- **the** most important library for working with numerical data
- basis for a whole host of other libraries forming a vast ecosystem around numpy
  - **scipy**: mathematical analysis
  - **pandas**: tabular data & statistics
  - **matplotlib** / **plotnine**: graphic visualizations
  - **scikit-learn**: machine learning
  - **tensorflow** / **pytorch**: deep learning
  - and many more, for statistics, signal processing, simulations, graphs & networks, astronomy, bioinformatics, chemistry, quantum computing, ...
- see [the overall website](https://www.numpy.org) and [the user guide](https://numpy.org/doc/stable/user/index.html) for additional information

## what do you get in numpy?

> numpy provides a **multidimensional array object**, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.

### What’s the difference between a Python list and a NumPy array?

> NumPy gives you an enormous range of fast and efficient ways of creating arrays and manipulating numerical data inside them. While a Python list can contain different data types within a single list, all of the elements in a NumPy array should be homogeneous. The mathematical operations that are meant to be performed on arrays would be extremely inefficient if the arrays weren’t homogeneous.

### Why use NumPy?

> NumPy arrays are faster and more compact than Python lists. An array consumes less memory and is convenient to use. NumPy uses much less memory to store data and it provides a mechanism of specifying the data types. This allows the code to be optimized even further.


### And what are these magical arrays?
> An array is a central data structure of the NumPy library. An array is a grid of values and it contains information about the raw data, how to locate an element, and how to interpret an element. It has a grid of elements that can be indexed in various ways. The elements are all of the same type, referred to as the array dtype.

> An array can be indexed by a tuple of nonnegative integers, by booleans, by another array, or by integers. The rank of the array is the number of dimensions. The shape of the array is a tuple of integers giving the size of the array along each dimension.

> One way we can initialize NumPy arrays is from Python lists, using nested lists for two- or higher-dimensional data.

(everything above from the numpy documentation)
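
The homogeneity and compactness described in the quotes above can be checked directly. A small sketch; the variable names are only for illustration:

```python
import numpy as np

mixed_list = [1, 2.5, 3]      # a list happily stores an int next to a float
arr = np.array(mixed_list)    # the array upcasts everything to one common dtype
print(arr.dtype)              # float64

int_arr = np.arange(1000, dtype=np.int64)
print(int_arr.itemsize)       # 8 bytes per element
print(int_arr.nbytes)         # 8000 bytes for the raw data
```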

## Installation and Import

In [None]:
!/home/atreju/.conda/envs/dhbw/bin/pip install numpy  # the exclamation mark just passes the command to a shell

In [None]:
import numpy as np  # convention! numpy is very, very often imported as `np`

that's it. nothing more to be done

## Numpy Arrays
- `np.ndarray` class (usually created with the `np.array` function)
- strongly and statically typed. The type is referred to as the array's `dtype`
- multidimensional. your array can have any number of dimensions
- sized: has a fixed (pre-allocated) size along each dimension

<div class="alert alert-block alert-info">
<b>Nomenclature:</b> <br>
<a>
    A numpy array can have any number of dimensions. It is always an instance of `np.ndarray` (created e.g. with `np.array`), no matter how many dimensions it has.<br>
    People (mathematicians) sometimes talk about vectors, matrices or tensors. It doesn't matter to us, all of these are `np.ndarray`s.<br>
    People (computer scientists) sometimes talk about 1D-, 2D-, or ndarrays.  It doesn't matter to us, all of these are `np.ndarray`s.
</a>
</div>

In [None]:
import numpy as np  # convention! numpy is very, very often imported as `np`
import string

### arrays and their contents

In [None]:
arr = np.array([1, 2, 3, 4, 5, 6])
arr

In [None]:
# arrays have a type:
arr.dtype

In [None]:
# and we can convert it to an array of a different type
float_array = arr.astype(float)
float_array

In [None]:
float_array.dtype

In [None]:
# we can use python types in the conversion
arr.astype(str)

In [None]:
arr.astype(bool)

In [None]:
arr.astype(complex)

In [None]:
# but the `dtype` attribute might look not so familiar -- in particular it's not just 'int' here...
arr.dtype

In [None]:
# numpy dtype contains more explicit information on type of data, size of data, byte order etc.
# but the types are really also /different/ from the python types (regardless of using the python types in 'astype')
# for example this also means values in numpy integer arrays are **not** unlimited size (unlike regular python integers)
print(f'{arr.dtype.itemsize=}, {arr.dtype.byteorder=}, {arr.dtype.name=}')

In [None]:
# raises an OverflowError: 10**100 does not fit into a fixed-size integer
arr[0] = 10**100

In [None]:
object_array = arr.astype('object')

In [None]:
object_array

In [None]:
object_array[0] = 10**100

In [None]:
object_array

In [None]:
# so: you **can** use the 'object' dtype, where the array just contains pointers to python objects
# but only do that if you really, really must. array operations on native numeric types are /much/ faster 
# than on 'object'
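
The fixed size of numpy integers mentioned above also shows up in arithmetic. The following sketch uses the small `int8` type to make the limits easy to see; `np.iinfo` reports the representable range:

```python
import numpy as np

info = np.iinfo(np.int8)
print(info.min, info.max)   # -128 127

small = np.array([100], dtype=np.int8)
wrapped = small + small     # 200 does not fit into int8, so the result wraps around
print(wrapped)              # [-56]
```

Note that the array addition wraps around silently, so choosing a large enough dtype is up to you.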

In [None]:
# watch out when using strings, those are limited in size!
string_array = np.array(['a', 'b', 'c'])
string_array

In [None]:
string_array[0] = 'xyz'

In [None]:
# whoopsie, only room for one character... (and no error message during the assignment)
string_array

In [None]:
# at least it's enough room for a full unicode code point :)
string_array[0] =  '\U0001F622'
string_array

In [None]:
string_array = string_array.astype('U256')  # you can make the size explicit!
string_array[0] = 'xyz'
string_array

In [None]:
string_array.astype('object')  # or use the object type (ok-ish for string)

### creating more arrays

In [None]:
# creating arrays from (nested) lists
arr = np.array(
    [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]
)

In [None]:
arr

In [None]:
# you can also create pre-initialized arrays of any size
np.zeros((3, 5), dtype=int)

In [None]:
# the same works for arrays filled with ones
np.ones((3, 5), dtype=int)

In [None]:
# or non-initialized arrays (slightly faster)
np.empty((3, 5), dtype=int)

In [None]:
# you can also create ranges -- very similar to the built-in `range`
np.arange(10)

In [None]:
# and, often useful, a fixed number of regularly spaced elements in a certain range (including bounds)
np.linspace(1, 2, 21)

In [None]:
# two special methods for two-dimensional arrays, often useful in linear algebra

In [None]:
# creating unit matrices
np.eye(3)

In [None]:
# creating diagonal matrices, specifying elements on the diagonal
np.diag([97, 98, 99])

### multi-dimensional arrays

In [None]:
# create a matrix, aka 2D-array
arr = np.zeros((3, 5), dtype=int)

In [None]:
arr

In [None]:
# figuring out the number of dimensions
arr.ndim

In [None]:
# and the number of entries along each dimension
arr.shape  # two axes, length 3 and 5 respectively

In [None]:
# total number of elements (product of all elements of arr.shape)
arr.size

In [None]:
# you can index into elements in multi-dimensional arrays within a single []
arr[1, 2] = 1
arr

In [None]:
# using multiple `[]`-pairs also works, but is less efficient, as a separate intermediate view is created this way
arr[0, 2] == arr[0][2] 

In [None]:
# and you can use multi-dimensional slicing as well
# in many arithmetic operations the arguments are automatically 'broadcast' to the correct shape -- more later
arr[1, :] += 2
arr[:, 2] += 2
arr

In [None]:
# just leaving out a dimension is the same as `:` for any following dimensions
arr[1]

In [None]:
arr[1, :]

In [None]:
# and no-one said we're limited to two dimensions
five_d_array = np.ones((1, 2, 3, 4, 5), dtype=int)
five_d_array

In [None]:
five_d_array[0, 0, 0]  # indexing out the first three axes leaves me with the last two (4x5)

In [None]:
# of course I get a different (4x5)-section if I use different indices in the first dimensions
# (broadcasting again, btw)
five_d_array[0, 0, 0] = 0
five_d_array[0, 0, 1] = 1
five_d_array[0, 0, 2] = 2
five_d_array[0, 1, 0] = 5
five_d_array[0, 1, 1] = 6
five_d_array[0, 1, 2] = 7
five_d_array

<div class="alert alert-block alert-warning">
<b>Slicing creates views:</b> <br>
<a>
<p>Whenever you use basic slicing to access parts of an array, what you get is a `view` in numpy language. A reference, effectively. It points to the same area of memory, so changing the view also changes the original. Indexing with boolean or integer arrays returns a copy instead.</p>
use `np.copy` as necessary
</a>
</div>
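
A short sketch of the difference between a view and a copy. `np.shares_memory` reports whether two arrays use the same underlying data; the variable names are only for illustration:

```python
import numpy as np

original = np.arange(6)

view = original[1:4]           # basic slicing: a view onto the same memory
copy = original[1:4].copy()    # explicit copy: independent data
fancy = original[[1, 2, 3]]    # indexing with a list of integers: also a copy

view[0] = 100                  # writes through to `original`
copy[1] = 200                  # leaves `original` untouched
fancy[2] = 300                 # leaves `original` untouched

print(original)                           # [  0 100   2   3   4   5]
print(np.shares_memory(original, view))   # True
print(np.shares_memory(original, fancy))  # False
```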

### reshaping arrays

In [None]:
# I can change shape (and dimensionality) of an array
np.arange(100).reshape((10, 10))

In [None]:
np.arange(100).reshape((2, 5, 10))

In [None]:
# or I can `flatten` (= remove dimensions)
five_d_array.flatten()

In [None]:
# there's also `ravel` which looks the same
# `flatten` always creates a copy, while `ravel` returns a view of the original data whenever possible, it just looks different to you
five_d_array.ravel()

In [None]:
arr = np.arange(4).reshape(2, 2)
arr

In [None]:
raveled_arr = arr.ravel()
flattened_arr = arr.flatten()

In [None]:
flattened_arr

In [None]:
flattened_arr[1] = 4
flattened_arr

In [None]:
arr

In [None]:
raveled_arr

In [None]:
raveled_arr[1] = 4
raveled_arr

In [None]:
arr

In [None]:
# reshaping can add extra dimensions (as long as the total number of entries stays the same)
np.arange(10).reshape(1, 1, 1, 2, 5, 1)
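
When reshaping, one axis can be given as `-1` and numpy works out its length from the total number of elements. A small sketch, including what happens if the element count does not fit:

```python
import numpy as np

arr = np.arange(12)

# -1 asks numpy to infer that axis from the total number of elements
print(arr.reshape(3, -1).shape)   # (3, 4)
print(arr.reshape(-1, 6).shape)   # (2, 6)

# a shape that does not match the number of elements raises a ValueError
try:
    arr.reshape(5, 3)
except ValueError as err:
    print(f"cannot reshape: {err}")
```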

### logical indexing

In [None]:
arr = np.arange(9)

In [None]:
arr

In [None]:
arr > 5

In [None]:
# logical indexing
# use an array the same shape as your original array, with boolean values
arr[arr > 5]

In [None]:
# of course you can have anything that gives you boolean values with the right shape
arr[arr % 2 == 0]

In [None]:
# and you can combine conditions with a single (!) `&`, `|` or `^`
# (requires brackets due to operator precedence...)
arr[(arr > 5) & (arr % 2 == 0)]

In [None]:
arr[(arr > 5) ^ (arr % 2 == 0)]

In [None]:
# you can also get the indexes meeting some condition
np.where(arr > 5)

In [None]:
# and you can also index with those if you like
arr[np.where(arr > 5)]

In [None]:
# of course that's also possible in higher dimensions
arr = arr.reshape(3, 3)

In [None]:
arr

In [None]:
# logical indexing collapses the dimensions: the result is always one-dimensional
arr[arr % 2 == 0]

In [None]:
# in multi-dimensional arrays `where` returns a tuple with one array for each axis
np.where(arr % 2 == 0)

In [None]:
# and you can use these tuples for indexing still
arr[np.where(arr % 2 == 0)]

In [None]:
# but if you'd rather have 'coordinates' instead:
even_indices = np.where(arr % 2 == 0)
list(zip(*even_indices))
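
Two related tools that may be handy here: `np.argwhere` returns the coordinates directly as a 2D array, and a boolean mask can also be used as the target of an assignment. A sketch on the same 3x3 example:

```python
import numpy as np

arr = np.arange(9).reshape(3, 3)

# one row per matching element, one column per axis
coords = np.argwhere(arr % 2 == 0)
print(coords.shape)   # (5, 2): five even numbers, two axes

# a boolean mask can also be used on the left-hand side of an assignment
clipped = arr.copy()
clipped[clipped > 5] = 5
print(clipped)
```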

### combining and splitting arrays

In [None]:
arr1 = np.array([[1, 1], [2, 2]])
arr1

In [None]:
arr2 = np.array([[3, 3], [4, 4]])
arr2

In [None]:
# stack them 'horizontally' (inner-most dimension)
h_stacked = np.hstack([arr1, arr2])
h_stacked

In [None]:
h_stacked.shape

In [None]:
# stack them 'vertically' (outer-most dimension)
v_stacked = np.vstack([arr1, arr2])
v_stacked

In [None]:
v_stacked.shape

In [None]:
# obviously the dimensions need to match
np.hstack([np.ones((2, 2)), np.ones((3, 2))])

In [None]:
# you can also split arrays
# you can specify into how many segments to split
split_arr1, split_arr2 = np.hsplit(h_stacked, 2)

In [None]:
split_arr1

In [None]:
split_arr2

In [None]:
# of course you could also split the other way around
np.vsplit(h_stacked, 2)

In [None]:
# splits specified this way need to divide the dimension length
np.split(h_stacked, 3)

In [None]:
# but if you specify split points rather than the number of splits you're free to create unequal parts
# if you want to specify one (or multiple) splitting points, pass a tuple
np.hsplit(h_stacked, (1, 3))

In [None]:
# for higher-dimensional arrays, use `np.split` and specify the axis
cube_array = np.arange(27).reshape((3, 3, 3))
cube_array

In [None]:
np.split(cube_array, (1, ), axis=0)

In [None]:
np.split(cube_array, (1, ), axis=1)

In [None]:
np.split(cube_array, (1, ), axis=2)
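
Just as `np.split` generalizes `hsplit`/`vsplit` with an explicit axis, `np.concatenate` does the same for `hstack`/`vstack`. A small sketch reusing the two 2x2 arrays from above:

```python
import numpy as np

arr1 = np.array([[1, 1], [2, 2]])
arr2 = np.array([[3, 3], [4, 4]])

rows = np.concatenate([arr1, arr2], axis=0)   # same result as np.vstack for 2D input
cols = np.concatenate([arr1, arr2], axis=1)   # same result as np.hstack for 2D input
print(rows.shape, cols.shape)                 # (4, 2) (2, 4)

# np.stack, in contrast, joins along a *new* axis
stacked = np.stack([arr1, arr2])
print(stacked.shape)                          # (2, 2, 2)
```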

### Broadcasting

> The term broadcasting describes how NumPy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations. There are, however, cases where broadcasting is a bad idea because it leads to inefficient use of memory that slows computation.

In [None]:
arr_1 = np.arange(1, 4)
arr_2 = np.ones(3)

In [None]:
arr_1 + arr_2

In [None]:
# but just adding a (scalar) 1 works just as well...?
arr_1 + 1.

In [None]:
# same for division (or any other math operation)
arr_2 / arr_1

In [None]:
1 / arr_1

> NumPy operations are usually done on pairs of arrays on an element-by-element basis. In the simplest case, the two arrays must have exactly the same shape. <br>
> NumPy’s broadcasting rule relaxes this constraint when the arrays’ shapes meet certain constraints. The simplest broadcasting example occurs when an array and a scalar value are combined in an operation, as above.<br>
> General rules: When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing (i.e. rightmost) dimension and works its way left. Two dimensions are compatible when **they are equal**, or **one of them is 1**.<br>
> Input arrays do not need to have the same number of dimensions. The resulting array will have the same number of dimensions as the input array with the greatest number of dimensions
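
The shape comparison described in the rules above can be tried out without any data using `np.broadcast_shapes` (available in NumPy 1.20 and later). The sketch below combines a column of shape `(4, 1)` with a row of shape `(3,)`:

```python
import numpy as np

# shapes are compared from the right: (4, 1) vs (3,) -> (4, 3)
print(np.broadcast_shapes((4, 1), (3,)))   # (4, 3)

column = np.arange(4).reshape(4, 1)   # shape (4, 1)
row = np.arange(3)                    # shape (3,)
table = column * 10 + row             # shape (4, 3)
print(table)

# incompatible trailing dimensions (3 vs 4) raise a ValueError
try:
    np.ones((4, 3)) + np.ones(4)
except ValueError as err:
    print(f"not broadcastable: {err}")
```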

In [None]:
# example: let's create a 4x4x3 array (imagine: 4x4 pixels, 3 color channels for example)
arr = np.zeros((4, 4, 3))
# the first row is all red
arr[0, :, 0] = 1
# the second row is all green
arr[1, :, 1] = 1
# the third row is all blue
arr[2, :, 2] = 1
# and the last row is white
arr[3, :, :] = 1
arr

In [None]:
# another way to look at this:
# here's the red contributions on the screen
arr[..., 0]

In [None]:
# let's make everything less blue. we halve the blues:
# broadcasting takes care of giving us the right shape
arr / np.array([1, 1, 2])

<div class="alert alert-block alert-info">
<b>another curse of dimensionality:</b> <br>
<p>
    These broadcasting rules are incredibly useful.<br>
    They are also nice and easy to understand if one side is a scalar.<br>
    If both sides are arrays, and one (or both) of these arrays are very high-dimensional they can become pretty unwieldy....</p>
</div>

## Advanced Topics:
- `numpy.linalg` -- Linear Algebra: just leave it to scipy
- `numpy.matlib` -- matrices: just use plain arrays and make the (scipy) calls explicit
- `ufunc`s: vectorized element-wise functions on arrays. You're using those already, there's more (technical) details though...
- `numpy.ctypeslib`, `numpy.datetime64`, `numpy.fft`, masked arrays, various utility functions, ...

# pandas
- 'fast, powerful, flexible and easy to use open source data analysis and manipulation tool'
- functionality looks a bit like Excel sheets / relational tables
- built on top of numpy
- two main data structures:
  - `Series`: 1D Data -- typically single column, multiple rows
  - `DataFrame`: 2D Data -- multiple rows, multiple columns
- generally: avoid looping over rows of `DataFrames`: you can almost always achieve your goal using vectorized operations, joins etc.
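
To illustrate the last point, here is a minimal sketch that computes the same result once with a row loop and once with a vectorized expression. The column names are made up for this example:

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "quantity": [1, 2, 3]})

# row-by-row loop: works, but slow for large tables
totals_loop = []
for _, row in df.iterrows():
    totals_loop.append(row["price"] * row["quantity"])

# vectorized: one expression over whole columns
totals_vectorized = df["price"] * df["quantity"]

print(totals_vectorized.tolist())   # [10.0, 40.0, 90.0]
```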

## Installation and Import

In [None]:
!/home/atreju/.conda/envs/dhbw/bin/pip install pandas

In [None]:
import numpy as np
import pandas as pd  # convention, as usual

## pd.Series Basics
- one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).
- The axis labels are collectively referred to as the index. 

### creating a Series
- can be easily created from lists/numpy arrays/dicts/scalars

In [None]:
# create a Series from a list/numpy array
pd.Series([3, 4, 5])

In [None]:
# and specify an index while you're at it
pd.Series([3, 4, 5], index=['a', 'b', 'c'])

In [None]:
# or you can create it from a dictionary
pd.Series({'a': 3, 'b': 4, 'c': 5})

In [None]:
# or from constants, basically like broadcasting. 
pd.Series(3, index=['a', 'b', 'c'])

### Series properties
- dtypes, math + broadcasting, names, ...

In [None]:
# you can get out the index again
s = pd.Series({'a': 3, 'b': 4, 'c': 5})
s.index

In [None]:
# and there's a single datatype for a Series, usually one of the numpy types
s.dtype

In [None]:
# just like numpy you can do vectorized math according to numpy broadcasting rules
s + 2

In [None]:
# and a `pd.Series` can be used in many numpy functions directly (the index is preserved, not modified)
np.sqrt(s)

In [None]:
# and you can get a real numpy array out of it if you need to
arr = s.to_numpy()
arr

In [None]:
type(arr)

In [None]:
arr.dtype

In [None]:
# a Series can also have a `name`, behaving sort of like a column label
s = pd.Series({'a': 3, 'b': 4, 'c': 5}, name='series_name')
s

In [None]:
# and you can change the name, of course
s.rename('new_name')

### Indexing
- indexing of rows by index value simply using []
- indexing of rows by numerical row-number using `.iloc`
- logical indexing like in numpy also works
- no indexing of columns, since there's only one :)

In [None]:
s = pd.Series({'a': 3, 'b': 4, 'c': 5})
s

In [None]:
# square bracket indexing returns the row with matching index
s['a']

In [None]:
# you can slice with non-numeric indices
s['a':'c':2] *= 2
s

In [None]:
# alternatively, use `Series.loc` (more interesting for DataFrames)
s.loc['a':'b']

In [None]:
# and iff the row name is a valid python variable name you can also access it as an attribute (but no slicing here)
s.a

In [None]:
# you can also use numerical indices -- row number, effectively
s.iloc[1]

In [None]:
# and you can use logical indexing
s[s > 4]

In [None]:
# you can create new rows simply by indexing and assignment
# (but this is rather slow)
s['foo'] = 5
s

In [None]:
# and rows can be removed again using `del`
# (again, rather slow)
del s['foo']
s

### Automatic alignment
- operations combining two Series objects automatically consider the index in all element-wise operations

In [None]:
s1 = pd.Series({'a': 3, 'b': 4, 'c': 5           })
s2 = pd.Series({        'b': 14, 'c': 15, 'd': 16})

In [None]:
s1 + s2
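
Labels that exist in only one of the two Series produce `NaN` in the result. If that is not what you want, the method form `Series.add` accepts a `fill_value`. A sketch with the same two Series:

```python
import pandas as pd

s1 = pd.Series({"a": 3, "b": 4, "c": 5})
s2 = pd.Series({"b": 14, "c": 15, "d": 16})

total = s1 + s2
print(total)   # 'a' and 'd' exist in only one Series, so their result is NaN

# the method form accepts a fill value for labels missing on one side
filled = s1.add(s2, fill_value=0)
print(filled)
```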

## pd.DataFrame Basics
- 2-dimensional labeled data structure with columns of **potentially different** types
- pretty much like a spreadsheet or SQL table
- index for both rows and columns (and indices can be hierarchical)
- most commonly used and most important pandas object

### creating a DataFrame
- can easily be created from dicts of lists, lists of dicts, Series, ...
- or (commonly) read from files

In [None]:
data = {
    'first_column': [1, 2, 3, 4, 5],
    'second_column': [1, 2, 3.1415, 4, 5]
}
df = pd.DataFrame(data)
df

In [None]:
# or with more interesting row index
data = {
    'first_column': [1, 2, 3, 4, 5],
    'second_column': [1, 2, 3.1415, 4, 5]
}
df = pd.DataFrame(data, index = ['a', 'b', 'c', 'd', 'e'])
df

In [None]:
# or read from a large variety of file formats
pd.read_csv('../data/iris.csv')

In [None]:
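# type `pd.read_` and press <TAB> to list the available readers (running the cell as-is raises an AttributeError)
pd.read_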

### DataFrame properties

In [None]:
data = {
    'first_column': [1, 2, 3, 4, 5],
    'second_column': [1, 2, 3.1415, 4, 5]
}
df = pd.DataFrame(data, index = ['a', 'b', 'c', 'd', 'e'])
df

In [None]:
# you can get out the index again
df.index

In [None]:
# but now there's also an index for the columns
df.columns

In [None]:
# and the dtype is now column-specific
df.dtypes

In [None]:
# you also have a shape, same as in numpy
df.shape

In [None]:
# just like numpy you can do vectorized math according to numpy broadcasting rules
df + 2

In [None]:
# or pass it to numpy functions (if the datatypes of all columns are compatible of course)
# this will automatically upcast column dtypes as necessary
np.sqrt(df)

In [None]:
np.sqrt(df).dtypes

In [None]:
# you can still convert it to a single numpy array
df_arr = df.to_numpy()
df_arr

In [None]:
# but that needs to bring all columns to a common type by upcasting
df_arr.dtype

In [None]:
# more interesting type mixing
data = {
    'first_column': [1, 2, 3, 4, 5],
    'second_column': ['foo', 'bar', 'bazz', 'here', 'there']
}
df = pd.DataFrame(data, index = ['a', 'b', 'c', 'd', 'e'])
df

In [None]:
df.dtypes

In [None]:
# obviously these kinds of operations only work on compatible column types
df + 2

In [None]:
# and the last resort for upcasting mixed types is `object`
df.to_numpy()

### indexing
- square brackets now index columns, not rows
- `.loc` allows indexing of both rows and columns
- `.iloc` allows numerical row/column-indices
- logical indexing now doesn't reduce the size, but just replaces unselected values by `NaN`. Use `.dropna` to actually remove rows

In [None]:
data = {
    'first_column': [1, 2, 3, 4, 5],
    'second_column': [1, 2, 3.1415, 4, 5]
}
df = pd.DataFrame(data, index = ['a', 'b', 'c', 'd', 'e'])
df

In [None]:
# normal indexing gives me a column
df['first_column']

In [None]:
# but (watch out!) a slice in the same position will be applied to rows, not columns
df['a'::2]

In [None]:
# you can also access the column as an attribute if it's a valid python variable name
df.second_column

In [None]:
# `.loc` by default will be a row-index still
df.loc['a']

In [None]:
# but I can pass the column as second argument
df.loc['c', 'first_column']

In [None]:
# slicing still works also in `.loc`
# and also, I can select multiple rows (or columns) by indexing with a list/tuple
df.loc['a':'d':2, ('first_column', 'second_column')]

In [None]:
# and even re-order the index that way
df.loc[('a', 'c', 'b'), 'first_column'::2]

In [None]:
# `.iloc` for numerical indices into both rows and columns, otherwise working like `.loc`
df.iloc[-1, :]

In [None]:
# logical indexing will simply set all non-selected values to NaN
df[df > 3]

In [None]:
# use dropna to actually get rid of extra rows/columns
df[df > 3].dropna(how='any')

In [None]:
# decide on dropping rows where any value is NaN, or all values are NaN
df[df > 3].dropna(how='all')
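
The masking behaviour above applies when you index with a boolean `DataFrame`. Indexing with a boolean `Series` (for example a condition on a single column) selects whole rows instead. A small sketch of both on the same data:

```python
import pandas as pd

df = pd.DataFrame(
    {"first_column": [1, 2, 3, 4, 5], "second_column": [1, 2, 3.1415, 4, 5]},
    index=["a", "b", "c", "d", "e"],
)

# a boolean DataFrame masks individual values (shape is kept, NaN fills the gaps)
masked = df[df > 3]
print(masked.shape)          # (5, 2)

# a boolean Series selects whole rows instead
rows = df[df["first_column"] > 3]
print(rows.index.tolist())   # ['d', 'e']
```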

In [None]:
# you can again add columns (or rows) using indexing + assignment
df['new_column'] = [11, 12, 13, 14, 15]
df

In [None]:
# obviously the length needs to be broadcastable
df['new_column'] = [11, 12, 13]

In [None]:
# broadcasting works just fine here
df['new_column'] = 5
df

In [None]:
# same for rows, using loc
df.loc['f', :] = [6, 6., 5]
df

In [None]:
# or both could be new
df.loc['g', 'some_column'] = 42
df

### Automatic Alignment
- for `DataFrame`s alignment happens on both rows and columns

In [None]:
df1 = pd.DataFrame(np.ones((6, 4)), columns=['A', 'B', 'C', 'D'], index=['a', 'b', 'c', 'd', 'e', 'f'])
df2 = pd.DataFrame(np.ones((3, 3)), columns=['A', 'C', 'D'], index=['b', 'd', 'f'])
df1 + df2


In [None]:
# but you can manually specify a 'fill value' to use if a value is missing in one of the DataFrames
df1.add(df2, fill_value=0)

In [None]:
df = pd.DataFrame(np.arange(15).reshape(5, 3), columns=['A', 'B', 'C'])
df

In [None]:
# operations between DataFrames and Series are broadcast row-wise
s = pd.Series([1, 1, 1], index=['A', 'B', 'C'])
df - s

In [None]:
# but only if the indices match will the result make much sense (automatic alignment again)
s = pd.Series([1, 1, 1], index=['X', 'B', 'Z'])
df - s

In [None]:
# so subtracting e.g. one column from the rest doesn't work the way you think:
df - df.C

In [None]:
# you can use explicit dataframe methods if you want to apply arithmetic column-wise, specifying an axis
df.sub(df.C, axis=0)

## Truthiness and comparison of DataFrames

In [None]:
data = {
    'first_column': [1, 2, 3, 4, 5],
    'second_column': [1, 2, 3.1415, 4, 5]
}
df = pd.DataFrame(data, index = ['a', 'b', 'c', 'd', 'e'])
df

In [None]:
# raises a ValueError: the truth value of a DataFrame is ambiguous
if df > 0:
    print('It is indeed greater than 0')

In [None]:
boolean_df = df > 0
boolean_df

In [None]:
boolean_df.all()

In [None]:
boolean_df.all(axis=1)

In [None]:
boolean_df.all().all()

In [None]:
if boolean_df.all().all():
    print('Actually, all the elements are True')

In [None]:
df.loc['a', 'first_column'] = np.nan  # `np.NaN` was removed in NumPy 2.0, `np.nan` works everywhere
df

In [None]:
# NaNs hiding in a DataFrame can make some things surprisingly false
df + df == 2*df

In [None]:
(df + df == 2*df).all().all()

In [None]:
# use `equals` to compare dataframes for equality instead
(df + df).equals(2*df)
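
The reason for the surprising result is that `NaN` never compares equal to anything, including itself. A minimal sketch on two small Series:

```python
import numpy as np
import pandas as pd

print(np.nan == np.nan)        # False: NaN never compares equal, not even to itself

left = pd.Series([1.0, np.nan])
right = pd.Series([1.0, np.nan])

print((left == right).all())   # False, because of the NaN position
print(left.equals(right))      # True: NaNs in the same location count as equal
```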

## Summarizing DataFrames

In [None]:
df = pd.read_csv('../data/iris.csv')

In [None]:
# by default a limited number of rows (and columns) is printed
df

In [None]:
# look only at the first N rows
df.head(5)

In [None]:
# or the last N rows
df.tail(3)

In [None]:
# get some overview of datatypes and NULL/NaN values
df.info()

In [None]:
# or some descriptive statistics (leaving out non-numeric columns)
# (you can specify your own percentiles, too)
df.describe()

In [None]:
# describe also works on some non-numeric columns (just not in combination with numerical columns)
df.variety.describe()

In [None]:
# unless you force it
df.describe(include='all')

## sorting DataFrames

In [None]:
df = pd.DataFrame(
    np.random.randint(low=3, high=17, size=(5, 3)),
    columns=['col_1', 'col_2', 'col_3'],
    index=['e', 'a', 'x', 'y', 'b']
)
df

In [None]:
# sort by index
df.sort_index()

In [None]:
# sort by the values in one column
df.sort_values(['col_1'])

In [None]:
# or by multiple columns (if there are ties in the first column)
df.sort_values(['col_1', 'col_3'])

## `query` for convenient filtering

In [None]:
df = pd.read_csv('../data/iris.csv')

In [None]:
# not so super convenient to read
df[((df.variety == 'Setosa') & (df['petal_width'] > 0.4)) | ((df.variety == 'Virginica') & (df['sepal_width'] > 3.5))]

In [None]:
# use a query string instead

df.query('(variety == "Setosa" and petal_width > 0.4) or (variety == "Virginica" and sepal_width > 3.5)')

In [None]:
# of course you can just refer to column values in both sides of the comparison, and you can do math in the query...
df.query('sepal_length < 1.1*petal_length')

In [None]:
# sometimes column names are not valid python identifiers:
renamed_df = df.rename(columns={'sepal_length': 'sepal length'})
renamed_df.head(3)

In [None]:
# query doesn't work out of the box then
renamed_df.query('sepal length > 7')

In [None]:
# but you can use backticks to escape such variable names
renamed_df.query('`sepal length` > 7')

In [None]:
# you can refer to external variables by prefixing them with an @-sign
min_sepal_length = 7
df.query('sepal_length > @min_sepal_length')

## aggregations
- return a single value for a series
- usually applied per-column (but could also be applied per-row)
- easy methods for aggregating multiple columns with multiple aggregation functions
- revisit later with windowing/grouping

**there are a large number of such functions we could apply**
- count: Number of non-NA observations
- sum: Sum of values
- prod: Product of values
- mean: Mean of values
- std: Sample standard deviation
- sem: Standard error of the mean
- var: Unbiased variance
- skew: Sample skewness (3rd moment)
- kurt: Sample kurtosis (4th moment)
- median: Arithmetic median of values
- quantile: Sample quantile (value at %)
- min/max: Smallest/Largest value
- idxmin/idxmax: Index of smallest/largest value
- mode: Most frequent value
- nunique: number of unique values
- cumsum/cumprod: Cumulative sum/product
- cummax/cummin: Cumulative maximum/minimum
- ... probably more that I forgot

In [None]:
df = pd.read_csv('../data/iris.csv')

In [None]:
# applying an aggregation function to a dataframe applies the function column-wise
df.drop(columns='variety').mean()

In [None]:
# I can apply it row-wise using the `axis` parameter
df.drop(columns='variety').mean(axis=1)

In [None]:
# sometimes we want to apply multiple aggregations, there's a convenient helper `aggregate`
df.drop(columns='variety').aggregate(['sum', 'mean', 'median', 'nunique'])

In [None]:
# you can use your own functions -- the argument will be a series
# but the name is not very nice now...
df.drop(columns='variety').aggregate([lambda x: sum(x), lambda x: x.median()])

In [None]:
def my_mean(s: pd.Series) -> float | str:
    if s.dtype.name != 'object':
        return sum(s) / len(s)
    return '<CAN NOT AGGREGATE OBJECTS>'

In [None]:
# if I use named functions rather than lambdas, the name of the function is used as the label
df.aggregate([my_mean])

In [None]:
# I can also use aggregate with dictionary arguments, allowing me to apply separate functions for each column
df.aggregate(
    {
        'sepal_length': 'mean',
        'sepal_width': ['min', 'max']
    }
)

In [None]:
# finally, you can use named arguments to select columns /and/ rename the output label
df.aggregate(
    sepal_length_mean=('sepal_length', lambda x: x.mean()),
    sepal_width_max=('sepal_width', lambda x: x.max()),
)

In [None]:
# `agg` is an alias for `aggregate`
# documentation actually recommends using the `agg` alias
df.agg == df.aggregate

## Special Accessors
- `.str` for string columns
- `.dt` for timestamp columns

### `str` accessor
- makes string functions (eg `upper`, `lower`, indexing, ...) available as vectorized methods on a Series

In [None]:
df = pd.DataFrame(
    {
        'numeric': [1, 2, 3.1415, 4, 5],
        'text': ['hello', 'world', 'aint', 'this', 'fun']
    })
df

In [None]:
# just a normal row index
df.text[0]

In [None]:
# vectorized, returning the first character of each element
df['first_character'] = df.text.str[0]
df

In [None]:
df['has_a'] = df.text.str.contains('a')
df
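
A few more `.str` methods, as a sketch. Missing values are propagated by default, and `contains` accepts an `na` argument to decide what a missing entry should count as:

```python
import pandas as pd

text = pd.Series(["hello", "world", None])

upper = text.str.upper()                       # missing values stay missing
lengths = text.str.len()
has_o = text.str.contains("o", na=False)       # treat missing entries as "no match"

print(upper.isna().tolist())   # [False, False, True]
print(has_o.tolist())          # [True, True, False]
```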

### `.dt`-accessor
- just like the string accessor, but for datetime objects

In [None]:
# note: recent pandas versions (2.2+) prefer freq='ME' for month-end and deprecate 'M'
df = pd.DataFrame({'ts': pd.date_range("20221201 09:10:12", periods=4, freq='M', tz='utc'), 'value': np.arange(4)})
df

In [None]:
df.ts.dt.month_name()

In [None]:
df.ts.dt.month
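
The `.dt` accessor only works on datetime columns, so text timestamps need to be parsed first, for example with `pd.to_datetime`. A small sketch with two made-up timestamps:

```python
import pandas as pd

raw = pd.Series(["2022-12-31 09:10:12", "2023-01-31 09:10:12"])
ts = pd.to_datetime(raw)                     # parse strings into datetime values

print(ts.dt.year.tolist())                   # [2022, 2023]
print(ts.dt.day_name().tolist())             # ['Saturday', 'Tuesday']
print(ts.dt.strftime("%d.%m.%Y").tolist())   # ['31.12.2022', '31.01.2023']
```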

## apply functions to DataFrames
- apply functions to a whole table: `pipe`
- apply functions to rows/columns: `apply`
- apply functions to each element: `map`
- apply multiple functions to each element: `transform`

### pipe
- function argument is a dataframe
- function return is a dataframe

In [None]:
address_df = pd.DataFrame(dict(address = ['Frau Dr. Ute Herzog,  Esplanade 89,  31759 Teugn', 'Gaby Maier, Landsberger Allee 59, 31759 München']))
address_df

In [None]:
def extract_postcode_city(df):
    df['postcode_city'] = df['address'].str.split(',').str.get(-1).str.strip()
    return df

def extract_postcode(df):
    df['postcode'] = df['postcode_city'].str.split(' ').str.get(0).str.strip()
    return df

def extract_name(df):
    df['name'] = df['address'].str.split(',').str.get(0).str.strip()
    return df

def add_country_name(df, country):
    df['country'] = country
    return df

In [None]:
# applying all four is not very nice to read
extract_name(add_country_name(extract_postcode(extract_postcode_city(address_df)), 'DE'))

In [None]:
# not much better if you spread it out over multiple lines
extract_name(
    add_country_name(
        extract_postcode(
            extract_postcode_city(address_df)
        ),
        'DE'
    )
)

In [None]:
# much nicer to read with `pipe`
(
  address_df
    .pipe(extract_postcode_city)
    .pipe(extract_postcode)
    .pipe(add_country_name, 'DE')
    .pipe(extract_name)
)
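
The helper functions above add columns to the DataFrame they receive, so `address_df` itself is modified. A minimal sketch of a variant that leaves the input untouched, using `assign` (which returns a new DataFrame); the names `with_name`, `with_country` and `addresses` are made up for this illustration.

```python
import pandas as pd

def with_name(df: pd.DataFrame) -> pd.DataFrame:
    # `assign` returns a new DataFrame; the input is left untouched
    return df.assign(name=df['address'].str.split(',').str.get(0).str.strip())

def with_country(df: pd.DataFrame, country: str) -> pd.DataFrame:
    return df.assign(country=country)

addresses = pd.DataFrame({'address': ['Gaby Maier, Landsberger Allee 59, 31759 München']})

# extra arguments can be passed to `pipe` positionally or by keyword
result = addresses.pipe(with_name).pipe(with_country, country='DE')
```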

### apply
- function argument is a series
- function return is a pd.Series or a scalar
  - if the function returns a Series, `apply` returns a DataFrame
  - if the function returns a scalar, `apply` returns a Series
- the function is applied to all rows or columns of the `DataFrame` by `apply`

In [None]:
# example where the function returns a Series
def scale_min_max(s: pd.Series) -> pd.Series:
    return (s - s.min()) / (s.max() - s.min())

In [None]:
df = pd.DataFrame(np.random.randint(low=3, high=17, size=(5, 3)), columns=['col_1', 'col_2', 'col_3'])
df

In [None]:
# apply our min-max-scaler to each column separately
df.apply(scale_min_max)

In [None]:
# or apply it to each row...
df.apply(scale_min_max, axis=1)

In [None]:
def min_max_index_diff(s: pd.Series) -> int:
    return s.idxmax() - s.idxmin()

In [None]:
df.apply(min_max_index_diff)

### map
- function argument is a scalar
- return value is a scalar
- return value of `map` is a `DataFrame`
- the function is applied to all elements of the DataFrame by `map`

In [None]:
df = pd.DataFrame(
    {
        'numeric': [1, 2, 3.1415, 4, 5],
        'text': ['hello', 'world', 'aint', 'this', 'fun']
    })
df

In [None]:
df.map(lambda x: len(str(x)))
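
`DataFrame.map` is available from pandas 2.1 on; older versions provide the same functionality under the name `applymap`. A small sketch that works with either, using a made-up `df_mixed` frame:

```python
import pandas as pd

df_mixed = pd.DataFrame({'numeric': [1, 2.5], 'text': ['hello', 'fun']})

# pick whichever element-wise method this pandas version provides
elementwise = df_mixed.map if hasattr(df_mixed, 'map') else df_mixed.applymap
lengths = elementwise(lambda x: len(str(x)))
```

The `numeric` column holds floats, so `1` is stored as `1.0` and its string has three characters.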

### transform
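- like `map`, but you can pass several functions and get one result per function
- each function is called on whole columns and must return a result of the same length as its input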
- similar argument syntax to `aggregate`

In [None]:
df = pd.DataFrame(np.random.randint(low=3, high=17, size=(5, 3)), columns=['col_1', 'col_2', 'col_3'])
df

In [None]:
# columns become a multi-index here, too...
df.transform(['sqrt', lambda x: x+1])

In [None]:
df.transform({'col_1': lambda x: x+1, 'col_2': lambda x: x+2, 'col_3': [lambda x: x+3, 'sqrt']})

In [None]:
# but the keyword-argument part of the `aggregate` syntax doesn't work for `transform` (this cell raises a TypeError)
df.transform(x=('col_1', 'sqrt'))

## cleaning data

### dealing with NaNs

In [None]:
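# NaN is a float, so we use float columns; `np.NaN` was removed in NumPy 2.0, `np.nan` works everywhere
df = pd.DataFrame(np.random.randint(2, 12, size=(3, 4)), columns=['col1', 'col2', 'col3', 'col4']).astype(float)
df.iloc[0, ::2] = np.nan
df.iloc[1, 1::2] = np.nan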
df

In [None]:
# DataFrame of the same shape, telling you for each element whether it is NA
df.isna()

In [None]:
# opposite of the above...
df.notna()

In [None]:
# many of the aggregation functions ignore NAs by default
df.mean()

In [None]:
# you can replace NA values by something new
df.fillna(-1)

In [None]:
# but you can also impute e.g. mean (or any other value) per column
df.fillna(df.mean())

In [None]:
# reminder, this is what `df.mean()` looked like:
df.mean()

In [None]:
# so this also works (unspecified columns are simply skipped):
df.fillna(pd.Series({'col1': -5, 'col3': -7}))

In [None]:
# and you don't even need the series, just use a dict
df.fillna({'col1': -5, 'col3': -7})
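
Besides filling, you can also drop or interpolate missing values. A minimal sketch on a made-up `df_gaps` frame:

```python
import numpy as np
import pandas as pd

df_gaps = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 5.0, 6.0]})

complete_rows = df_gaps.dropna()             # keep only rows without any NaN
rows_with_a = df_gaps.dropna(subset=['a'])   # only require column 'a' to be present
filled = df_gaps.interpolate()               # linear interpolation along each column
na_counts = df_gaps.isna().sum()             # number of missing values per column
```

By default `interpolate` only fills forward, so the leading NaN in column `b` stays.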

### duplicate rows

In [None]:
df = pd.DataFrame({
    'id': ['foo', 'bar', 'foo', 'bazz', 'bazz'],
    'color': ['red', 'red', 'green', 'blue', 'blue'],
    'length': [1, 2, 3, 2, 2]
})
df

In [None]:
# find rows that are full duplicates of an earlier row
df.duplicated()

In [None]:
# find rows where some columns are duplicated
df.duplicated(['id'])

In [None]:
# drop rows that are full duplicates
df.drop_duplicates()

In [None]:
# drop rows that are partial duplicates
df.drop_duplicates(['id'])

In [None]:
# by default you keep the first row
# of course you can change that
df.drop_duplicates(['id'], keep='last')
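
`keep=False` is the third option: it treats every copy as a duplicate, including the first one. A short sketch on a reduced version of the table above (`df_dup` is just an illustrative name):

```python
import pandas as pd

df_dup = pd.DataFrame({
    'id': ['foo', 'bar', 'foo', 'bazz', 'bazz'],
    'length': [1, 2, 3, 2, 2],
})

# True for every row whose id occurs more than once
all_copies = df_dup.duplicated(['id'], keep=False)
# drop all rows with a repeated id, keeping only ids that are unique
unique_ids_only = df_dup.drop_duplicates(['id'], keep=False)
```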

## group-by
- group dataframe by value in one or multiple columns
- and apply functions to the groups
  - **aggregating** functions: calculate one value per group (and column), e.g. group means
  - **transforming** functions: modify all group values with some group-specific function e.g. scaling by group
  - **filter** functions: discard some groups based on group-specific criteria, e.g. discard small groups

In [None]:
df = pd.read_csv('../data/iris.csv')
df.head(3)

In [None]:
# groupby returns a `DataFrameGroupBy` object -- not very helpful
df.groupby('variety')

In [None]:
# you can get information on the identified groups, and which rows belong to a group
df.groupby('variety').groups

In [None]:
# and you can pull out a single group
df.groupby('variety').get_group('Setosa').head(3)

#### aggregation over groups

In [None]:
# but mostly, you just call functions on the `GroupBy` object itself; that looks very useful :)
# the value of the group-by column becomes the index of the new DataFrame!
# (change that behaviour with the `as_index` parameter to `groupby`)
df.groupby('variety').mean()

In [None]:
# you can also group by multiple columns, and get a result for each possible combination

df['is_odd'] = np.arange(df.shape[0]) % 2
df.groupby(['variety', 'is_odd']).max()

In [None]:
del df['is_odd']

In [None]:
# you can use `aggregate`, just like we saw for the `DataFrame` itself
# everything we saw above for aggregate should work here as well
df.groupby('variety').aggregate(['min', 'max'])
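
For example, the named-argument form of `aggregate` also works on groups. A minimal sketch with a made-up `flowers` table that borrows the iris column names:

```python
import pandas as pd

flowers = pd.DataFrame({
    'variety': ['A', 'A', 'B', 'B'],
    'sepal_length': [5.0, 6.0, 7.0, 9.0],
    'sepal_width': [3.0, 3.5, 2.5, 3.0],
})

# keyword = (column, function): selects the column and names the output
summary = flowers.groupby('variety').agg(
    sepal_length_mean=('sepal_length', 'mean'),
    sepal_width_max=('sepal_width', 'max'),
)
```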

#### transformation of groups

In [None]:
# transformations don't return the group label...
df.groupby('variety').cumsum()

In [None]:
# usually the transformation result is combined back with the original dataframe
combined = pd.concat(
    [
        df,
        df.groupby('variety').cumsum().rename(columns=lambda x: x + '_cumsum')
    ],
    axis=1
)
combined[-50:-45]

In [None]:
# you can of course still use your own transformation functions
# your function is applied to each column in the group separately, so it still accepts a Series
def scale_min_max(s: pd.Series) -> pd.Series:
    return (s - s.min()) / (s.max() - s.min())

df.groupby('variety').transform(scale_min_max)

#### Filtration
- filter, i.e. return some of the rows in each group, specific to the group

In [None]:
# for example, the two rows of each group with the smallest values in some column (thanks to sorting first)
df.sort_values('sepal_length').groupby('variety').head(2)

In [None]:
# keep only the groups whose mean sepal width exceeds their mean petal length
df.groupby('variety').filter(lambda group: group['sepal_width'].mean() > group['petal_length'].mean())
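
Discarding small groups, as mentioned in the overview, follows the same pattern. A sketch on a made-up `df_groups` table:

```python
import pandas as pd

df_groups = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'c', 'c'],
    'value': [1, 2, 3, 4, 5, 6],
})

# the function receives each group as a DataFrame and returns True to keep it
large_only = df_groups.groupby('group').filter(lambda g: len(g) >= 2)
```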

## concatenation and joins
- you can stack `DataFrame`s on top of, or next to, each other using `concat`; the cell contents are ignored, only the labels on the other axis are aligned
- just like in SQL you can also join tables, i.e. combine them along one axis while choosing how to treat the other index
  - `merge` is typically used for joins based on column values
  - `join` is typically used for joins based on index values

### concatenation

In [None]:
left = pd.DataFrame(
    {
        "key1": ["K0", "K1", "K2", "K3"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)

right = pd.DataFrame(
    {
        "key1": ["K5", "K6", "K7", "K8"],
        "A": ["C0", "C1", "C2", "C3"],
        "B": ["D0", "D1", "D2", "D3"],
    }
)


In [None]:
left

In [None]:
right

In [None]:
# stack one below the other
pd.concat([left, right])

In [None]:
# stack one to the right of the other
pd.concat([left, right], axis=1)

In [None]:
# same but when the columns/index have different names
left = pd.DataFrame(
    {
        "key1": ["K0", "K1", "K2", "K3"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)

right = pd.DataFrame(
    {
        "key2": ["K5", "K6", "K7", "K8"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }, index=['a', 'b', 'c', 'd']
)


In [None]:
# content irrelevant, but labels preserved
pd.concat([left, right])

In [None]:
pd.concat([left, right], axis=1)

In [None]:
# you can e.g. reset the index to get a more 'common' representation
pd.concat([left.reset_index(), right.reset_index()], axis=1)

### merging dataframes -- joins based on columns

In [None]:
left = pd.DataFrame(
    {
        "key1": ["K0", "K1", "K2", "K3"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)

right = pd.DataFrame(
    {
        "key1": ["K0", "K1", "K2", "K3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)

In [None]:
# join two dataframes on the same key
pd.merge(left, right, on='key1')

In [None]:
# let's drop a row
smaller_right = right.drop(1)
smaller_right

In [None]:
# by default this is an inner join, i.e. a row only shows up in the result if its key is present in both tables
pd.merge(left, smaller_right, on='key1')

In [None]:
# you can specify the join type (left/right/inner/outer) through the `how` parameter
pd.merge(left, smaller_right, how='right')

In [None]:
# there can be multiple columns to join on
left = pd.DataFrame(
    {
        "key1": ["K0", "K0", "K1", "K2"],
        "key2": ["K0", "K1", "K0", "K1"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)

right = pd.DataFrame(
    {
        "key1": ["K0", "K1", "K1", "K2"],
        "key2": ["K0", "K0", "K0", "K0"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)

In [None]:
pd.merge(left, right, on=['key1', 'key2'], how='outer')
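
With outer joins it is often useful to know where each row came from. A minimal sketch using the `indicator` parameter of `merge`; the small `left_df` and `right_df` tables are made up for this illustration.

```python
import pandas as pd

left_df = pd.DataFrame({'key1': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right_df = pd.DataFrame({'key1': ['K0', 'K2', 'K3'], 'C': ['C0', 'C2', 'C3']})

# `indicator=True` adds a `_merge` column: 'both', 'left_only' or 'right_only'
merged = pd.merge(left_df, right_df, on='key1', how='outer', indicator=True)
origin = merged.set_index('key1')['_merge'].astype(str).to_dict()
```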

### join based on indices

In [None]:
# same as before, but now the old `key` columns become the index

left = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }, index=["K0", "K1", "K2", "K3"],
)

right = pd.DataFrame(
    {
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }, index=["K0", "K1", "K2", "K3"],
)

In [None]:
left

In [None]:
right

In [None]:
# join automatically uses the index as join column
left.join(right)

In [None]:
small_right = right.drop('K1')

In [None]:
# default `how` is 'left' this time around
left.join(small_right)

In [None]:
# but you can pick whichever you like
left.join(small_right, how='right')

In [None]:
# you can do the same thing with `merge` if you like, so `join` is just a convenient shorthand
pd.merge(left, small_right, left_index=True, right_index=True)

## Timeseries

In [None]:
df = pd.DataFrame({
    'value_1': np.random.randint(3, 12, size=(4,)),
    'value_2': np.random.randint(3, 12, size=(4,))
}, index=[pd.Timestamp('2023-11-01'), pd.Timestamp('2023-11-02'), pd.Timestamp('2023-11-04'), pd.Timestamp('2023-11-05')])
df

In [None]:
# fill in gaps in a timeseries
daily_df = df.reindex(pd.date_range('2023-11-01', '2023-11-05', freq='D'))
daily_df

In [None]:
# note: pandas 2.2+ warns about freq='H' and prefers the lowercase 'h'
hourly_df = df.reindex(pd.date_range('2023-11-01', '2023-11-05', freq='H'))
hourly_df.head()

In [None]:
# indexing timestamps is 'fuzzy' if you use a string as index
hourly_df.loc['2023-11-01'].head()

In [None]:
# funky slicing: strings work as slice bounds, and both endpoints are included
hourly_df.loc['2023-11-01 05:00:00':'2023-11-01 08:00:00'].head()

In [None]:
# using a Timestamp as index is precise, though
hourly_df.loc[pd.Timestamp('2023-11-01')]

In [None]:
# fill NaNs with older values
daily_df.ffill()

In [None]:
# you can also use `asfreq` to convert to a particular frequency, and fill missing values in one go
df.asfreq(pd.offsets.BDay(), method="ffill")

In [None]:
# just a reminder what `df` looks like
df

In [None]:
# shifting data
pd.concat([df, df.shift(1)], axis=1)

In [None]:
# better specify a frequency!
pd.concat([df, df.shift(1, freq='D')], axis=1)

In [None]:
# windowing functions
df.rolling(2).mean()
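
Related to windowing is `resample`, which groups a time series into fixed time bins and aggregates each bin. A minimal sketch on a made-up daily series (`daily` is an illustrative name):

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2023-11-01', periods=6, freq='D')
daily = pd.DataFrame({'value': np.arange(6)}, index=idx)

# non-overlapping bins of two days each, summed up
two_day_sums = daily.resample('2D').sum()
```

Unlike `rolling`, the bins do not overlap, so the result has fewer rows than the input.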

# matplotlib
- the basic library for visualizing/plotting data
- plays very well with numpy

## Installation and Import

In [None]:
# `%pip` installs into the environment of the running notebook kernel
%pip install matplotlib

In [None]:
# importing matplotlib is a bit special for common convenient use:
from matplotlib import pyplot as plt  # pyplot provides a 'nice' API for matplotlib functionality. `plt` is just convention again
%matplotlib inline
# ^ helpful in jupyter to show figures directly in the notebook

## basic usage

In [None]:
import numpy as np

In [None]:
x = np.linspace(0, 2 * np.pi, 200)
y = np.sin(x)
plt.plot(x, y);  # two variables for x and y position of points, by default draws lines between points and no markers at points

In [None]:
x = np.linspace(0, 2 * np.pi, 20)
y = np.sin(x)

plt.plot(x, y, 'cd-.');
# format strings: [marker][line][color] ([color][marker][line] is also understood (and common in examples) but may be ambiguous)
# markers: one of .,ov^<>12348spP*hH+xXDd|_
# line: one of - -- -. :
# color: one of rgbcmykw

In [None]:
# you can also specify these (and more) parameters as keywords
plt.plot(x, y, color='c', marker='>', linestyle='--', linewidth=2, markersize=10, alpha=0.5, fillstyle='none');

In [None]:
# points are connected in sequence, not by x-value:
indices = np.random.permutation(20)
plt.plot(x[indices], y[indices], 'cd-');

## two API flavours

### Explicit axes objects

In [None]:
x = np.linspace(0, 2, 100)  # Sample data.

fig, ax = plt.subplots(figsize=(5, 2.7), layout='constrained')
ax.plot(x, x, label='linear')  # Plot some data on the axes.
ax.plot(x, x**2, label='quadratic')  # Plot more data on the axes...
ax.plot(x, x**3, label='cubic')  # ... and some more.
ax.set_xlabel('x label')  # Add an x-label to the axes.
ax.set_ylabel('y label')  # Add a y-label to the axes.
ax.set_title("Simple Plot")  # Add a title to the axes.
ax.legend();  # Add a legend.

In [None]:
# let matplotlib figure it out (pyplot-style)
x = np.linspace(0, 2, 100)  # Sample data.

plt.figure(figsize=(5, 2.7), layout='constrained')
plt.plot(x, x, label='linear')  # Plot some data on the (implicit) axes.
plt.plot(x, x**2, label='quadratic')  # etc.
plt.plot(x, x**3, label='cubic')
plt.xlabel('x label')
plt.ylabel('y label')
plt.title("Simple Plot")
plt.legend();

<div class="alert alert-block alert-info">
<b>API Choice:</b> <br>
<a>
    For complex plots or longer-lived code prefer the axes-object-style.<br>
    For quick (interactive) visualizations pyplot-style is often more convenient.
</a>
</div>

## Plot Types

### scatter plot
- 2D, no connecting lines
- each point can have a separate style (eg size, color)

In [None]:
# also shows off the 'data' argument
# come back to that with pandas
data = {'x': np.arange(50),
        'c': np.random.randint(0, 50, 50),
        's': np.random.standard_normal(50)}
data['y'] = data['x'] + 10 * np.random.standard_normal(50)
data['s'] = np.abs(data['s']) * 100

plt.scatter('x', 'y', c='c', s='s', data=data);

### bar plot (like)

In [None]:
x = 0.5 + np.arange(8)  # to make the bars centered in integer intervals -- there's also an 'align' parameter
y = [4.8, 5.5, 3.5, 4.6, 6.5, 6.6, 2.6, 3.0]
plt.bar(x, y, width=1, edgecolor='white');

In [None]:
plt.stem(x, y, bottom=5);

### statistics -- single distributions

In [None]:
x = np.random.normal(20, 5, 100000)
plt.hist(x, bins=100);

In [None]:
y = 1.2 * x + np.random.normal(5, 3, 100000)
plt.hist2d(x, y, bins=(np.linspace(0, 40, 200), np.linspace(0, 60, 100)));

### statistics -- multiple distributions

In [None]:
data = np.random.normal((3, 5, 4), (1.25, 1.00, 1.25), (400, 3))
plt.hist(data);

In [None]:
plt.boxplot(data);
# shows: median, 1st and 3rd quartile (box), whiskers reaching up to 1.5 IQR beyond the quartiles, outliers

In [None]:
plt.violinplot(data, showmedians=True);

### plotting 2D data

In [None]:
arr = np.zeros((8, 8))
arr[::2, ::2] = 1
arr[1::2, 1::2] = 1
plt.imshow(arr)

In [None]:
X, Y = np.meshgrid(np.linspace(-3, 3, 256), np.linspace(-3, 3, 256))
Z = (1 - X/2 + X**5 + Y**3) * np.exp(-X**2 - Y**2)

In [None]:
plt.imshow(np.flipud(Z))

In [None]:
plt.pcolormesh(X, Y, Z);
plt.gca().set_aspect('equal')

In [None]:
plt.contour(X, Y, Z);
plt.gca().set_aspect('equal')

### plotting 3D data

In [None]:
data = {'x': np.arange(50),
        'c': np.random.randint(0, 50, 50),
        's': np.random.standard_normal(50)}
data['y'] = data['x'] + 10 * np.random.standard_normal(50)
data['z'] = data['y'] + 10 * np.random.standard_normal(50)
data['s'] = np.abs(data['s']) * 100

fig, ax = plt.subplots(subplot_kw={"projection": "3d"})
ax.scatter('x', 'y', 'z', c='c', s='s', data=data);


In [None]:
X, Y = np.meshgrid(np.arange(-5, 5, 0.25), np.arange(-5, 5, 0.25))
Z = np.sin(np.sqrt(X**2 + Y**2))

# Plot the surface
fig, ax = plt.subplots(subplot_kw={"projection": "3d"})
ax.plot_surface(X, Y, Z);

In [None]:
fig, ax = plt.subplots(subplot_kw={"projection": "3d"})
ax.plot_wireframe(X, Y, Z);

## Multiple plots in one figure

In [None]:
import matplotlib

X, Y = np.meshgrid(np.linspace(-3, 3, 128), np.linspace(-3, 3, 128))
Z = (1 - X/2 + X**5 + Y**3) * np.exp(-X**2 - Y**2)

fig, axs = plt.subplots(2, 2, layout='constrained')
axs[0, 0].pcolormesh(X, Y, Z, vmin=-1, vmax=1, cmap='RdBu_r')
axs[0, 0].set_title('pcolormesh()')
axs[0, 0].set_aspect('equal')

axs[0, 1].contour(X, Y, Z, levels=np.linspace(-1.25, 1.25, 11))
axs[0, 1].set_title('contour()')
axs[0, 1].set_aspect('equal')

axs[1, 0].imshow(np.flipud(Z), vmin=-.5, vmax=.8)
axs[1, 0].set_title('imshow()')

axs[1, 1].imshow(np.flipud(Z**2), norm=matplotlib.colors.LogNorm(vmin=0.01, vmax=1))
axs[1, 1].set_title('imshow(Z**2) with LogNorm()');

# plotnine
- alternative API for plotting
- built on top of matplotlib
- works great with pandas
- not used so much in python
- but similar to a **very** popular R library, `ggplot2`
- based on concepts from [A grammar of graphics](https://www.cs.uic.edu/~wilkinson/TheGrammarOfGraphics/GOG.html)

## Installation and Import    

In [None]:
%pip install plotnine

In [None]:
import plotnine as p9  # convention, as usual

## Basic Example

In [None]:
df = pd.read_csv('../data/iris.csv')

In [None]:
(
    p9.ggplot(df, p9.aes(x='sepal_length', y='sepal_width', colour='variety', group='variety'))
      + p9.geom_point()
      + p9.geom_smooth(method='lm')
)

In [None]:
(
  p9.ggplot(df, p9.aes(x='variety', y='sepal_width', fill='variety')) 
      + p9.geom_violin(draw_quantiles=0.5, trim=False)
      + p9.scale_fill_brewer(type='qual')
)

# Seaborn
- another library for plotting & visualization
- specific focus on statistics
- with very high-level API

## Installation and Import

In [None]:
%pip install seaborn

In [None]:
import seaborn as sns  # guess what: convention
%matplotlib inline

In [None]:
tips = sns.load_dataset('tips')
tips

## Examples

In [None]:
# plot statistical relationship between multiple variables
sns.relplot(
    data=tips,
    x="total_bill", y="tip", col="time",
    hue="smoker", style="smoker", size="size",
);

In [None]:
# or for distributions
sns.displot(data=tips, x="total_bill", col="time", kde=True);

In [None]:
# or more violins
# here called `catplot` (categorical, not the animal :)) because we're using
# several categorical variables (day, smoker)
sns.catplot(data=tips, kind="violin", x="day", y="total_bill", hue="smoker", split=True)

In [None]:
# plot correlations between many variables in one go, and color by group
df = pd.read_csv('../data/iris.csv')
sns.pairplot(data=df, hue="variety");

# scipy

## Installation and Import

In [None]:
%pip install scipy

In [None]:
# typically import parts of scipy, e.g.
from scipy import linalg

## linear algebra
- everything in the world needs linear algebra, all the time
- and for ML students/practitioners everything needs an extra dose of linear algebra
- so if you learn linear algebra somewhere, pay attention :)

In [None]:
from scipy import linalg

#### linear algebra -- basic operations

In [None]:
# we're doing math, so we're using mathy language...
matrix = np.array([[1, 2, 3], [3, 2, 1], [1, 0, -1]])
# the next line replaces the matrix above; comment it out to work with the first example instead
matrix = np.arange(9).reshape(3, 3)
matrix

In [None]:
vector1 = np.array([1, 0, 0])
vector2 = np.array([0, 1, 0])
vector3 = np.array([1, 1, 0])

In [None]:
# transpose an array (mirror along diagonal)
matrix.T

In [None]:
# calculate inner products between vectors
# some things still needed from numpy here

In [None]:
np.dot(vector1, vector2)

In [None]:
np.dot(vector1, vector3)

In [None]:
# shortcut using `@`-operator for matrix multiplication

In [None]:
vector1 @ vector3

In [None]:
# you can also multiply vectors and matrices
np.dot(matrix, vector1)

In [None]:
# and you can multiply matrices together

In [None]:
np.dot(matrix, matrix)

In [None]:
# and outer products :)
# note: `np.outer` flattens its inputs, so two 3x3 matrices give a 9x9 result
np.outer(matrix, matrix)

In [None]:
# you can calculate the determinant of a matrix
linalg.det(matrix)

In [None]:
# you can calculate the inverse of a matrix
invertible_matrix = np.array([[1, 2], [3, 4]])
linalg.inv(invertible_matrix)

In [None]:
# or pseudo-inverse
linalg.pinv(invertible_matrix)

In [None]:
# a matrix times its inverse should give (numerically) the identity matrix
invertible_matrix @ np.linalg.inv(invertible_matrix)

In [None]:
# you can calculate eigenvalues and eigenvectors of a matrix
values, vectors = np.linalg.eig(matrix)

In [None]:
values

In [None]:
vectors

In [None]:
# check the eigenvalue equation for the first pair: the result should be (numerically) zero
np.dot(matrix, vectors[:, 0]) - values[0]*vectors[:, 0]

#### and what do we do all this for?

In [None]:
# mostly for solving systems of linear equations
#   x + 3y + 5z = 10
#  2x + 5y +  z = 8
#  2x + 3y + 8z = 3

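we can express that problem as a matrix multiplication

$$
\begin{pmatrix} 1 & 3 & 5 \\ 2 & 5 & 1 \\ 2 & 3 & 8 \end{pmatrix}
\begin{pmatrix} x \\ y \\ z \end{pmatrix}
=
\begin{pmatrix} 10 \\ 8 \\ 3 \end{pmatrix}
$$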


In [None]:
matrix = np.array([
    [1, 3, 5],
    [2, 5, 1],
    [2, 3, 8]
])

In [None]:
solution = linalg.inv(matrix).dot(np.array([10, 8, 3]))
solution

In [None]:
# put that in for your x, y and z and the equation becomes true...
matrix.dot(solution)
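
Computing the inverse explicitly is rarely necessary: `linalg.solve` solves the system directly and is usually the preferred route. A minimal sketch for the same system (`coefficients` and `rhs` are just illustrative names):

```python
import numpy as np
from scipy import linalg

coefficients = np.array([
    [1, 3, 5],
    [2, 5, 1],
    [2, 3, 8],
])
rhs = np.array([10, 8, 3])

# solves coefficients @ solution = rhs without forming the inverse
solution = linalg.solve(coefficients, rhs)
residual = coefficients @ solution - rhs   # should be (numerically) zero
```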

**and lots more**
- linear least squares and pseudo-inverses (fitting curves to data)
- matrix decomposition (cholesky, singular value, LU, ...)
- matrix powers/logarithms/trigonometric functions
- some special matrices
- ...
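
As one example from this list, a minimal sketch of a linear least-squares fit with `linalg.lstsq`. The data is made up and noise-free, so the fit should recover the line exactly.

```python
import numpy as np
from scipy import linalg

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0   # made-up data on the line y = 2x + 1

# design matrix with one column for the slope and one for the intercept
design = np.column_stack([x, np.ones_like(x)])
coeffs, residues, rank, singular_values = linalg.lstsq(design, y)
slope, intercept = coeffs
```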

## statistics
- random distributions are almost as important as linear algebra :)
- obviously closely related to generating random numbers in the first place...
- there's lots of them :)

common to all distributions:
- rvs: random variates
- pdf: probability density (continuous)
- pmf: probability mass (discrete)
- cdf: cumulative distribution function
- stats: mean, variance, ...
- ...
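
A minimal sketch of these methods on a single distribution. Calling a distribution with parameters "freezes" them, so they don't have to be repeated in every call; `ppf` (the inverse of the cdf) is one of the methods hidden behind the `...` above.

```python
from scipy import stats

dist = stats.norm(loc=1, scale=2)   # frozen normal: mean 1, standard deviation 2

mean, variance = dist.stats()       # variance is scale squared
density_at_mean = dist.pdf(1)       # probability density at x = 1
prob_below_mean = dist.cdf(1)       # 0.5 by symmetry
median = dist.ppf(0.5)              # inverse of the cdf
samples = dist.rvs(size=5, random_state=0)   # seeded random variates
```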

In [None]:
from scipy import stats

### intro example

In [None]:
import numpy as np
from scipy import stats
from matplotlib import pyplot as plt
%matplotlib inline

In [None]:
x = np.linspace(-3, 3, 601)
plt.plot(x, stats.norm.pdf(x))

In [None]:
plt.plot(x, stats.norm.cdf(x))

In [None]:
# get mean and variance
stats.norm.stats()

In [None]:
# get random samples
stats.norm.rvs(size=10)

In [None]:
# can specify mean (`loc`) and standard deviation (`scale`)
plt.plot(x, stats.norm.pdf(x, loc=1, scale=2))

### better random numbers

In [None]:
# scipy uses a random number generator from numpy
from numpy.random import default_rng
rng = default_rng(seed=None)  # pass an integer seed to get reproducible numbers
stats.norm.rvs(size=5, random_state=rng)

### more distributions: discrete -- three examples
- there's 19 different ones in scipy...
- I won't discuss them all...

In [None]:
# discrete: binomial (# successes in a fixed number of trials, e.g. coin flips)
stats.binom.stats(n=5, p=0.5)

In [None]:
x = np.linspace(-.5, 5.5, 601)
plt.plot(x, stats.binom.pmf(x, n=5, p=0.5))
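
A probability mass function is only non-zero at the integers, which is why the plot above consists of isolated spikes. A small sketch evaluating it on the integer grid only:

```python
import numpy as np
from scipy import stats

k = np.arange(0, 6)                               # possible numbers of successes: 0..5
probabilities = stats.binom.pmf(k, n=5, p=0.5)    # one probability per integer
off_grid = stats.binom.pmf(2.5, n=5, p=0.5)       # no probability mass between integers
```

Something like `plt.stem(k, probabilities)` is then a natural way to plot the result.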

In [None]:
# discrete: poisson (# independent events in fixed interval)
stats.poisson.stats(mu=10)

In [None]:
x = np.linspace(0, 20, 2001)
plt.plot(x, stats.poisson.pmf(x, mu=10))

In [None]:
# discrete: uniform
stats.randint.stats(low=5, high=10)  # high is (as usual) exclusive

In [None]:
x = np.linspace(5, 10, 501)
plt.plot(x, stats.randint.pmf(x, low=5, high=10))

### more distributions: continuous -- three examples
- there's more than **90** different ones in scipy...
- I won't discuss them all...
- top place should go to the normal distribution, but we had that already above

In [None]:
# exponential: time between events in a Poisson process
stats.expon.stats()

In [None]:
x = np.linspace(0, 5, 501)
plt.plot(x, stats.expon.pdf(x))

In [None]:
# uniform: continuous version of the discrete uniform distribution (`randint`)
stats.uniform.stats(loc=5)

In [None]:
x = np.linspace(-1, 2, 301)
plt.plot(x, stats.uniform.pdf(x))

In [None]:
# beta: common in bayesian statistics
stats.beta.stats(a=2, b=5)

In [None]:
plt.plot(x, stats.beta.pdf(x, a=2, b=5))

## interpolation
- 'guess' function values between measured points

In [None]:
from scipy import interpolate

In [None]:
import numpy as np
x = np.linspace(0, 10, num=11)
y = np.cos(-x**2 / 9)

In [None]:
xnew = np.linspace(0, 10, num=1001)
ynew = np.interp(xnew, x, y)

In [None]:
xtrue = np.linspace(0, 10, num=1001)
ytrue = np.cos(-xtrue**2 / 9)

In [None]:
plt.plot(xnew, ynew, '-', label='linear interp')
plt.plot(x, y, 'o', label='data')
plt.plot(xtrue, ytrue, '--', label='true function')
plt.legend(loc='best')
plt.show()

In [None]:
spline_interpolator = interpolate.make_interp_spline(x, y)
ynew = spline_interpolator(xnew)

In [None]:
plt.plot(xnew, ynew, '-', label='spline interp')
plt.plot(x, y, 'o', label='data')
plt.plot(xtrue, ytrue, '--', label='true function')
plt.legend(loc='best')
plt.show()
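
To compare the two interpolation methods with numbers rather than by eye, you can measure the largest deviation from the true function. A minimal sketch reusing the setup from above; the variable names are illustrative.

```python
import numpy as np
from scipy import interpolate

x = np.linspace(0, 10, num=11)
y = np.cos(-x**2 / 9)
x_fine = np.linspace(0, 10, num=1001)
y_true = np.cos(-x_fine**2 / 9)

linear = np.interp(x_fine, x, y)
spline = interpolate.make_interp_spline(x, y)(x_fine)   # cubic (k=3) by default

# largest absolute deviation from the true function on the fine grid
max_err_linear = np.max(np.abs(linear - y_true))
max_err_spline = np.max(np.abs(spline - y_true))
```

Both interpolants pass through the measured points; they only differ in between.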

## integration
- numerical integration of functions

In [None]:
from scipy import integrate
import numpy as np

In [None]:
def integrand(t, n, x):
    return np.exp(-x*t) / t**n

In [None]:
def integral(n, x):
    return integrate.quad(integrand, 1, np.inf, args=(n, x))[0]

In [None]:
vectorized_integral = np.vectorize(integral)

In [None]:
x = np.linspace(1, 5, 401)
plt.plot(x, vectorized_integral(0, x))
plt.plot(x, vectorized_integral(1, x))
plt.plot(x, vectorized_integral(2, x))
plt.plot(x, vectorized_integral(3, x))
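
`integrate.quad` returns two values: the integral and an estimate of the absolute error (the `[0]` above discards the latter). For `n=0` the integral has the closed form `exp(-x) / x`, which makes for a quick sanity check; `x_value` is an arbitrary choice.

```python
import numpy as np
from scipy import integrate

def integrand(t, n, x):
    return np.exp(-x * t) / t**n

x_value = 2.0
value, abs_error = integrate.quad(integrand, 1, np.inf, args=(0, x_value))

# integral of exp(-x*t) dt from 1 to infinity
closed_form = np.exp(-x_value) / x_value
```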

## fft
- get frequency components of signals
- very important in analyzing signals

In [None]:
from scipy import fft

In [None]:
x = np.linspace(0, 10*2*np.pi, 10001)
y = (np.sin(x*np.pi) + np.sin(2*x*np.pi) + np.cos(3*np.pi*x)) 

In [None]:
plt.plot(x, y, '-')
plt.gca().set_xlim((0, 4*np.pi))

In [None]:
yf = fft.fft(y)
xf = fft.fftfreq(10001, 10*2*np.pi/10000)
plt.plot(xf, np.abs(yf))
plt.gca().set_xlim((-2, 2));
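
For real-valued signals, `fft.rfft` and `fft.rfftfreq` return only the non-negative frequencies, which makes it easy to read off the dominant components. A minimal sketch with a made-up signal containing the frequencies 0.5 and 1.5:

```python
import numpy as np
from scipy import fft

sample_spacing = 0.01
t = np.arange(2000) * sample_spacing   # 2000 samples covering 20 time units
signal = np.sin(2 * np.pi * 0.5 * t) + 0.5 * np.sin(2 * np.pi * 1.5 * t)

spectrum = np.abs(fft.rfft(signal))              # magnitudes at non-negative frequencies
freqs = fft.rfftfreq(t.size, d=sample_spacing)   # matching frequency axis
strongest = freqs[np.argsort(spectrum)[-2:]]     # frequencies of the two largest peaks
```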

## optimization
- numerically find extrema of functions

In [None]:
from scipy import optimize

In [None]:
X, Y = np.meshgrid(np.linspace(-3, 3, 256), np.linspace(-3, 3, 256))
Z = (1 - X/2 + X**5 + Y**3) * np.exp(-X**2 - Y**2)

In [None]:
plt.pcolormesh(X, Y, Z)
plt.gca().set_aspect('equal')

In [None]:
def f(params):
    # `minimize` looks for minima, so we negate the function to find its maxima
    x, y = params
    return -(1 - x/2 + x**5 + y**3) * np.exp(-x**2 - y**2)

In [None]:
res1 = optimize.minimize(f, (0, 0), bounds=((-3, 3), (-3, 3)))

In [None]:
res2 = optimize.minimize(f, (3, 0), bounds=((-3, 3), (-3, 3)))

In [None]:
plt.pcolormesh(X, Y, Z)
plt.contour(X, Y, Z)
plt.gca().set_aspect('equal')
plt.plot(*res1.x, 'o')
plt.plot(*res2.x, 'x')
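
The object returned by `minimize` holds more than the location of the optimum. A short sketch of how to read it, repeating the negated function from above under the illustrative name `negated`:

```python
import numpy as np
from scipy import optimize

def negated(params):
    x, y = params
    return -(1 - x / 2 + x**5 + y**3) * np.exp(-x**2 - y**2)

result = optimize.minimize(negated, (0, 0), bounds=((-3, 3), (-3, 3)))

maximum_location = result.x    # where the original function has a (local) maximum
maximum_value = -result.fun    # undo the sign flip
converged = result.success     # whether the optimizer reports convergence
```

As the two different starting points above show, the result is a local optimum and depends on where the search starts.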

## and more...
- sparse arrays
- signal processing
- special functions
- spatial data structures