# Important Python packages and how to use them

A big part of Python's power is the vast number of packages available for nearly everything.
If you want to do anything with Python, it is always worth a quick web search to check whether there
is already a package that does just that.

### important packages we will discuss:

- scipy: a lot of features for curve fitting, integration, solving differential equations, statistics, ...
- numpy: arrays for fast processing of large amounts of numerical data. Implements a lot of mathematical functions -> usage highly recommended for any sort of math task
- matplotlib: for plots of all sorts
- pandas: import and export of data from nearly any format, performing statistics, grouping, applying functions on whole data sets, ... a very powerful data tool!
- seaborn: easy and powerful data visualisation

### important packages we will not discuss in this tutorial:

- os: comes with some useful methods to create directories, change the current working directory, join paths platform independently, check for the existence of files, ...
- sys: lets you use arguments passed to a python script via the console, control the stdout flow, check the operating system, ...
- uncertainties: easy error propagation
- pickle: store python objects (like a trained learner from machine learning) in binary files
- scikit-learn: machine learning package


## 1. import
Packages can be imported with the import keyword.
For some packages there is a common alias like np for numpy. You can specify an alias as shown below. To import only a certain module from a package or only a function or class from a module, you can do so as well.

In [None]:
# packages are included this way:
import numpy

# you can give an 'alias' to imported packages
import numpy as np

# you can import certain functions or classes from a module of a package
from scipy.optimize import curve_fit

## 2. numpy: numerical python

There is nearly always a need to process arrays of numbers. If you work with arrays of numbers, you should almost always use numpy arrays.

In [None]:
# numpy arrays are recommended whenever you do something with arrays of numbers
a = np.array([1,2,3,4])

print(a)

# the handling is quite similar to lists
print(a[-2:-1]) # slice from the second-to-last entry up to (but excluding) the last one: [3]

# very useful: functions called with numpy arrays normally act on each entry without code changes:
def f(x):
    return x**2

# a big exception are if statements, which won't work as expected on arrays!

print(f(a))
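
The exception for if statements mentioned in the comment above can be made visible with a short sketch. The function names `step_scalar` and `step_array` are made up for this illustration, and `np.where` is only one possible way to write an entry-wise decision.

```python
import numpy as np

a = np.array([-2, -1, 0, 1, 2])

def step_scalar(x):
    # works for a single number, but not for an array
    if x > 0:
        return 1
    return 0

try:
    step_scalar(a)
    failed = False
except ValueError as err:
    # numpy refuses to turn an array with several entries into a single True/False
    failed = True
    print('if statement on an array failed:', err)

def step_array(x):
    # np.where evaluates the condition entry by entry
    return np.where(x > 0, 1, 0)

result = step_array(a)
print(result)
```

The array version returns one result per entry, which is usually what you want when a function is called with an array.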

### numpy - some important features
These are some functions you will almost certainly need.

In [None]:
# important features:

# array of length 4 filled with zeros
a = np.zeros(4)
print(a)

# array of length 3 filled with ones
b = np.ones(3)
print(b)


# matrix of zeros with the size 3x4, meaning there will be 3 rows and 4 columns
a = np.zeros((3,4))
print(a)

In [None]:
# linearly spaced numbers between start and end (end included)
# the third argument gives the number of points
x = np.linspace(0, 10, 21)
print('x =', x)

# logarithmically spaced numbers, the first two arguments are the exponents of the start and end value
# this means from 10**1 = 10 up to 10**5 = 100000 (end included), the third argument gives the number of points
y = np.logspace(1, 5, 11)
print('y =', y)

# numbers between low (included) and high (excluded) with the step size as third argument
k = np.arange(0, 16, 3)
print('k =', k)

# many functions for arrays are efficiently implemented in numpy
y = np.exp(x)
print('np.exp(x) =', y)

#similar: np.log, np.sqrt, np.sin, np.cos,.........


# in np.random there are several random generators, for example normal, uniform, poisson, ...
x = np.random.normal(0, 1, 10) #this means, the mean of the distribution is 0, the sigma is 1 and we generate 10 numbers

### masking
Very often you want to "filter" an array by a value, e.g. keep all values larger than a certain threshold. This can be done very easily!

In [None]:
# very useful for slicing
print('gaussian distributed x =', x)
print('Is x>0.2?', x>0.2)
print('The numbers with x>0.2 are ',x[x>0.2])

In [None]:
# often overlooked but an extremely powerful tool:
x = np.arange(-5, 5)
print("Values: ", x)

# entry-wise if evaluation
# you can chain multiple evaluations with 
# AND: (first condition in brackets) & (second condition in brackets)
# OR: (first condition in brackets) | (second condition in brackets)
print("Values are below zero at these indices: ", np.where(x < 0)[0])
# with three arguments, np.where picks the second argument where the condition is True and the third one where it is False
print("Rectified values: ", np.where(x < 0, 0, x))
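
The chaining of conditions with & and | described in the comments above could look like the following sketch. The variable names `inside`, `outside` and `clipped` are chosen only for this illustration.

```python
import numpy as np

x = np.arange(-5, 5)

# AND: both conditions have to be True, each condition needs its own brackets
inside = x[(x > -3) & (x < 3)]
print('between -3 and 3:', inside)

# OR: at least one condition has to be True
outside = x[(x < -3) | (x > 3)]
print('below -3 or above 3:', outside)

# a mask can also be used to overwrite the selected entries
clipped = x.copy()
clipped[clipped < 0] = 0
print('negative entries set to zero:', clipped)
```

The brackets around each condition are needed because & and | bind more strongly than the comparison operators.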

## 3. matplotlib
matplotlib is quite a flexible library for easy plot creation. Usually pyplot from matplotlib is imported with the alias plt.

In [None]:
import matplotlib.pyplot as plt

# fill a histogram with the created numbers and show the histogram
x = np.random.normal(0, 1, 10)
plt.hist(x)
plt.show()

### python histograms
Python does not treat histograms in quite the same way as, for example, ROOT. In Python they are stored and handled as a collection of:
> (bin contents, bin edges, "drawn" bars for display)
<br>

You can get these from the function call of plt.hist 

In [None]:
things = plt.hist(x)
print(things)
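
If only the numbers are of interest, the bin contents and bin edges can also be computed without drawing anything. The following is a minimal sketch using np.histogram on a small hand-made array (the values are made up for this illustration).

```python
import numpy as np

values = np.array([0.5, 1.5, 1.7, 2.2, 2.8, 2.9])

# np.histogram computes bin contents and bin edges, but does not draw anything
content, edges = np.histogram(values, bins=3, range=(0, 3))

print('bin contents:', content)
print('bin edges:   ', edges)

# there is always one edge more than there are bins
print(len(content), len(edges))
```

plt.hist returns the bin contents and bin edges in the same order, followed by the drawn bars as a third entry.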

### Example 3.1: Distributions and histograms

Look up another distribution available from numpy random. Generate some data points, fill them into a histogram and show it.

### plot with matplotlib
This is an example of a more "complex" plot

In [None]:
# let's create an array and evaluate a formula on it
x = np.linspace(-4, 4) # by default, linspace creates 50 equally spaced numbers, here between -4 and 4
y = x**2

# get a figure and specify the size (in inches ...)
# first number corresponds to the width, second to the height
# dpi are "dots per inch" and refer to image quality
# 300 is good for printing (like a thesis), but you should prefer pdf files for this anyway
f = plt.figure(figsize=(6,4), dpi=300) 

# plot the values, add a 'style' and a label
plt.plot(x, y, 'k+', label = '$f(x) = x^2$') # k stands for black, the + will plot a + as the marker

# add labels to the axis
plt.xlabel('$x$')
plt.ylabel('$y$')

# add a legend to the best position
plt.legend(loc='best')

# activate the grid (True optional as argument)
plt.grid()

# adjust the spacing so that labels and titles are not cut off or overlapping
plt.tight_layout()

# save to file
plt.savefig('plot.png')

# show
plt.show()

### Example 3.2: Enhance your histograms
Add a label, x- and y-axis labels and a legend to your histograms. Use a different color and save it.

### Example 3.3: Red dashed sine function
Create an array from 0 to 2pi (np.pi is available ...) and compute the sine function of it. Plot it to a canvas using a red dashed line.

### Example 3.4: Slicing plots
Plot positive values black, negative ones red.

### Example 3.5: Add gaussian noise
Add gaussian noise to the sine plot and use the same color for all data points again.

### errorbar
For many plots in physics you also need to show the uncertainties of a given value. This can be done with the errorbar function, passing the uncertainties as yerr=values.

In [None]:
# another very important 'plotstyle': errorbar

# get another figure
plt.figure(figsize=(6,4), dpi=300)

# the hist function returns the bin contents, the bin edges and the drawn bars of the histogram
x = np.random.normal(0, 1, 1000)
content, edges, hist = plt.hist(x, label='histogram')
#print(edges)
#print(edges[:-1])
#print(edges[1:])
#print((edges[:-1]+edges[1:])/2)

# add the errorbar plot
# the style is specified using the fmt argument
## we need the bin middles for the position of the errorbars
## we can get them from the bin edges building the mean values of a pair of two edges
## we can use our known form of indexing for that
### plt.errorbar(x-position, y-position, yerr=values, xerr=values, ...)
plt.errorbar((edges[:-1]+edges[1:])/2, y=content, yerr=np.sqrt(content), fmt='+', label='uncertainties')

# adding the legend and show
plt.legend(loc='best')
plt.show()
plt.clf()
# errorbar also takes two arrays (lower and upper errors) as yerr input for asymmetric errorbars

# other useful plot styles: semilogy, semilogx, loglog
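
The two remarks above (asymmetric errorbars and logarithmic axes) can be combined in a small sketch. The data and the relative uncertainties of 30% and 60% are made up for this illustration; plt.yscale('log') is used here as one way to obtain the logarithmic y axis that semilogy also produces.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 6)
y = 10.0**x

# asymmetric uncertainties: first row = lower errors, second row = upper errors
yerr = np.array([0.3*y, 0.6*y])

fig = plt.figure(figsize=(6,4))
plt.errorbar(x, y, yerr=yerr, fmt='k+', label='asymmetric uncertainties')

# logarithmic y axis
plt.yscale('log')
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.legend(loc='best')

# in a script, call plt.show() or plt.savefig(...) here to see the result
```

The lower errors are chosen smaller than the values themselves, because a logarithmic axis cannot display points at or below zero.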

## 4. scipy
scipy is especially useful for curve fitting and integration.

In [None]:
# we need to import the function curve_fit which implements a least squares algorithm
# it is included in the scipy package within the optimize module
from scipy.optimize import curve_fit

# to fit a gaussian, we define a function taking three parameters in addition to x:
def gaussian(x, mu, sigma, I):
    return I*np.exp(-(x-mu)*(x-mu)/2/sigma/sigma)

# we need the bin middles for the fit
xvals = (edges[1:]+edges[:-1])/2
yvals = content

# the actual fit is pretty straightforward
# we pass the fit function and the x and y values, and curve_fit returns a vector of best parameters
# (in our case mu, sigma and I)
## the variable covariance contains the covariance matrix of the fit parameters and 
## can therefore provide us with fit uncertainties
params, covariance = curve_fit(gaussian, xvals, yvals)
print(params)

# more parameters:
# - sigma, absolute_sigma: uncertainties of the y values and whether they are absolute (True) or only relative weights (False)
# - maxfev: maximum number of function evaluations
# - p0=[...]: initial values for fit start
# - bounds=([...], [...]): lower and upper limits for the fit parameters
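
How the additional parameters listed above and the covariance matrix can be used is sketched below. To keep the example independent of the histogram, it fits a straight line to made-up toy data; the function `line`, the noise width of 0.5 and the chosen start values and limits are assumptions for this illustration only.

```python
import numpy as np
from scipy.optimize import curve_fit

def line(x, slope, offset):
    return slope*x + offset

# toy data: a straight line with gaussian noise of known width
rng = np.random.default_rng(42)
xdata = np.linspace(0, 10, 50)
noise = 0.5
ydata = line(xdata, 2.0, 1.0) + rng.normal(0, noise, xdata.size)

params, covariance = curve_fit(
    line, xdata, ydata,
    p0=[1.0, 0.0],                        # start values for slope and offset
    sigma=np.full(xdata.size, noise),     # uncertainty of each y value
    absolute_sigma=True,                  # sigma is given in the units of y
    bounds=([0.0, -10.0], [10.0, 10.0]),  # lower and upper limits per parameter
)

# the diagonal of the covariance matrix holds the variances of the parameters
errors = np.sqrt(np.diag(covariance))
print('slope  = %.3f +- %.3f' % (params[0], errors[0]))
print('offset = %.3f +- %.3f' % (params[1], errors[1]))
```

The square roots of the diagonal entries are the one-sigma uncertainties of the fit parameters; the off-diagonal entries describe how strongly the parameters are correlated.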

### plot fit values
You can plot the fit results into a figure using, for example, an array created with np.linspace.

In [None]:
# first we create a larger figure
plt.figure(figsize=(6,4), dpi=300)

# we create a histogram and get its parameters
content, edges, hist = plt.hist(x, label='histogram')
# add some errorbars in the bin middles
plt.errorbar((edges[:-1]+edges[1:])/2, y = content, yerr=np.sqrt(content), fmt='+', label='uncertainties')

# create a linspace for plotting the fit results
xvals = np.linspace(-3.5, 3.5, 200)

# simply plot the results, the * operator unpacks the fit parameters into the function arguments
plt.plot(xvals, gaussian(xvals, *params), 'k--', label='gaussian fit')
plt.legend(loc='best')
plt.show()

### Example 4.1: Noisy sine fit
Use your noisy sine function from before. Define your own sine function (with additional parameters for shifting it along the x and y axes), fit it to the noisy data and display the fitted curve together with the data.
Save the figure with a name of your choice.

### Example 4.2: Weighted fit
Redo the fit of the gaussian histogram above, but use weights this time.


## 5. pandas
pandas offers easy data import, export and statistical data analysis. pandas is often imported with the alias pd.

In [None]:
import pandas as pd

In [None]:
# pandas works with DataFrames, similar to dicts
data = pd.DataFrame()

data['x'] = np.linspace(0,20, 200)
data['y'] = data['x']**2
# columns can be specified before as well
# data = pd.DataFrame(columns=['x', 'y'])

plt.figure(figsize=(6,4), dpi=150)
plt.plot(data['x'], data['y'])
plt.show()

### pandas element access
There are several ways to access elements in a pandas DataFrame.
The most powerful way is the df.loc[row, column] indexer. It can also be used to write new values, which chained access such as df[column][row] = value cannot do reliably. In many pandas versions the latter leads to the rather cryptic SettingWithCopyWarning. You will recognise it when you see it.

In [None]:
# single elements can be accessed best with the loc method
print(len(data.loc[:, 'x']))
print(data.loc[0, 'x'])

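# data export is very easy as well:
data.to_csv('myData.csv')
# recent pandas versions can no longer write the old xls format, so xlsx is used here (needs the openpyxl package)
data.to_excel('myData.xlsx')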
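
Writing values with loc, as described in the text above, could look like the following sketch. The small DataFrame `df` is made up for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.arange(5), 'y': np.arange(5)**2})

# write a single element: row label 0, column 'y'
df.loc[0, 'y'] = 100

# write all rows that fulfil a condition, the mask selects the rows
df.loc[df['x'] > 2, 'y'] = -1

print(df)
```

Row selection and column selection happen in one step here, so pandas always writes into the original DataFrame and not into a temporary copy.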

### import data from xls or csv

In [None]:
# DataFrames can be imported and exported from many formats like xls, csv, txt, ...
data = pd.read_excel('detectors.xls')

# check data content
print(data.columns)

# loop over the detector columns
for col in ['det_0', 'det_1', 'det_2', 'det_3']:
    
    plt.figure(figsize=(6,4))
    
    # plot measurement
    plt.plot(data['time'], data[col], label='measurement')
    
    # add legend and show
    plt.legend(loc='best')
    plt.show()

### Example 5.1: Gaussian fit
Fit a gaussian function into the data and show the data and the fit on the same plot.

### Another data set
Let's have a look at another data set.

In [None]:
# import another data set
frame = pd.read_csv('iris.csv')

# inspect columns and length of the data frame
print(frame.columns)
print(len(frame))    

In [None]:
# let's get an overview over the frame entries:
for col in frame.columns:
    plt.figure(figsize=(10,7))
    plt.hist(frame[col])
    plt.ylabel('entries')
    plt.xlabel(col)
    plt.show()
    
# How would you get mean value and std of each feature??

### Group data by one column
When a data set contains several classes, you often want to analyse the entries of each class separately. To do so, you need to be able to group the entries by a certain column.

Naive approach: using masks like with numpy

In [None]:
# naive approach:
group1 = frame[frame['species']=='setosa']
group2 = frame[frame['species']=='versicolor']
group3 = frame[frame['species']=='virginica']


# mean1 = group1.mean()....

### groupby and aggregate - mighty pandas tools
pandas can be used for the statistical analysis of a data set. For example, if a column represents a classification group, you can group the entries by the values in that column to inspect the distributions within each group.

With the aggregate method you can gather some statistical information about data sets. It is particularly useful in combination with groupby.

In [None]:
# you can group by a column using pandas
grouped_frame = frame.groupby('species')
print(grouped_frame.mean())

In [None]:
# sometimes you need more than just one statistical measure
frame.groupby('species').aggregate(['mean', 'std', 'sem'])
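
The result of groupby followed by aggregate has one row per group and two column levels (feature and statistic). The following sketch shows this on a tiny hand-made DataFrame; the frame `toy` and its values are made up for this illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'species': ['a', 'a', 'b', 'b'],
    'length': [1.0, 3.0, 10.0, 20.0],
    'width': [0.5, 1.5, 2.0, 4.0],
})

# one row per group, one column per (feature, statistic) pair
stats = toy.groupby('species').agg(['mean', 'std'])
print(stats)

# the columns form two levels, so single entries are addressed with a tuple
mean_length_a = stats.loc['a', ('length', 'mean')]
print(mean_length_a)
```

agg is simply a shorter name for aggregate, both can be used interchangeably.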

In [None]:
# a mighty and useful tool: groupby and aggregate
for group, values in frame.groupby('species'):
    
    # agg is useful to obtain several statistical measures of a data set
    # the species column contains text, so it is dropped before computing the statistics
    features = values.drop(columns='species')
    print(features.agg(['mean', 'std']))
    for col in features.columns:
        plt.hist(features[col])
        plt.title('%s: %s'%(group, col))
        plt.show()

### Example 5.2: Inspect correlations

You can inspect correlations in data by plotting features against each other. To be able to distinguish the different species, you can plot the classes in different colors with the help of the *color='...'* argument of matplotlib.pyplot.plot(...).

Create at least 3 plots, each showing one feature against another one. Use different colors to distinguish the classes.

## 6. seaborn
seaborn is very useful for easy data visualisation and has a lot to offer. We will only have a look at a single feature, the rest is up to you to google ;)

In [None]:
# seaborn is often imported with the alias sns
import seaborn as sns

# let's use the data from the previous example
sns.pairplot(frame, hue='species');

# Quite impressive, isn't it?

In [None]:
# you can also specify which columns you want to see (of course you can ...)
sns.pairplot(frame[['sepal_length', 'sepal_width', 'species']], hue='species');
