## Basic Data Science Libraries

This notebook gives a brief introduction to the basic data science libraries `numpy`, `pandas`, and `matplotlib`. These libraries help you store and inspect your data efficiently. It is especially important to know how to access individual elements, rows, and columns of a matrix or table with the various indexing techniques, so play around with the examples here until you feel comfortable with them.

### numpy
The `numpy` library is used for mathematical operations and scientific computing. If you need more advanced operations, also check out the `scipy` library, which builds on numpy. If you have worked with MATLAB before, many things here will seem very familiar. Just remember that in Python indexing starts at 0, not 1.

The most important data structure in numpy is the so-called `array`, which is used to represent a vector or matrix.

In [None]:
# import the library with its standard abbreviation
import numpy as np

In [None]:
# create a vector with 3 elements from a list
a = np.array([1., -2., 0.])
# look at the vector
a

In [None]:
# create a 2x3 matrix from nested lists
M = np.array([[1., 2., 3.], [4., 5., 6.]])
M

In [None]:
# multiply all elements in the matrix by 3
3*M

In [None]:
# multiply the matrix M with the vector a
np.dot(M, a)

In [None]:
# multiply the matrix M with its transpose
np.dot(M, M.T)

In [None]:
# make sure the dimensions always line up, otherwise you'll get an error like this
np.dot(M, M)
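The shape rule behind this error can be checked directly. The following is a minimal sketch that assumes the same 2x3 matrix as above; it uses the `@` operator, which for 2D arrays computes the same product as `np.dot`, and catches the `ValueError` that numpy raises when the inner dimensions do not match.

```python
import numpy as np

M = np.array([[1., 2., 3.], [4., 5., 6.]])

# the inner dimensions must match: (2, 3) @ (3, 2) works
product = M @ M.T  # same result as np.dot(M, M.T)
print(product.shape)

# (2, 3) @ (2, 3) does not work, because 3 != 2
try:
    M @ M
    error_message = None
except ValueError as err:
    # numpy reports the mismatching shapes in the error message
    error_message = str(err)
print(error_message)
```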

In [None]:
# check the shape of a matrix or vector
M.shape

In [None]:
# create a 3x3 identity matrix
np.eye(3)

In [None]:
# create a 3x2 matrix with zeros
np.zeros((3, 2))

In [None]:
# np.random provides different options to create random data
# create a 4x4 matrix with random, normally distributed values
# you might want to set a random seed first to get reproducible results:
# --> execute the cell a few times to see that you get a different matrix every time
# --> then uncomment the line below and execute the cell again a few times
# np.random.seed(13)
R = np.random.randn(4, 4)
R
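Besides the global `np.random.seed`, numpy also offers generator objects created with `np.random.default_rng`, which keep the random state local instead of global. The sketch below is one possible way to get reproducible matrices with it; the seed value 13 is simply reused from the cell above.

```python
import numpy as np

# two generators created with the same seed produce identical values
rng_a = np.random.default_rng(13)
rng_b = np.random.default_rng(13)
R_a = rng_a.standard_normal((4, 4))
R_b = rng_b.standard_normal((4, 4))
print(np.array_equal(R_a, R_b))

# drawing again from the same generator continues the sequence with new values
R_c = rng_a.standard_normal((4, 4))
print(np.array_equal(R_a, R_c))
```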

In [None]:
# indexing of matrices works similarly to indexing lists
# remember: indexing starts at 0 and the end index of a slice is exclusive
# this gives you the first 2 rows with all columns
R[:2, :]  # for rows, you can also omit the last :, i.e., write R[:2]

In [None]:
# all rows starting at the 3rd row with all columns
R[2:, :]

In [None]:
# columns 2 and 4
R[:, [1, 3]]

In [None]:
# column 3 - notice the shape of the returned array:
# indexing with a list keeps the column axis, so it's a proper column vector (shape: (4, 1))
R[:, [2]]

In [None]:
# column 3 but as a flattened array (shape: (4,))
R[:, 2]
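The difference between the two results is easiest to see by comparing their shapes. This is a small sketch with a stand-in matrix (`np.arange` values instead of random numbers, so the content is predictable); `reshape` and `ravel` convert between the two forms.

```python
import numpy as np

R = np.arange(16.).reshape(4, 4)  # stand-in for the random matrix above

col_2d = R[:, [2]]  # indexing with a list keeps the column axis
col_1d = R[:, 2]    # indexing with an integer drops it
print(col_2d.shape)
print(col_1d.shape)

# convert between the two forms
as_column = col_1d.reshape(-1, 1)  # flat array -> column vector
as_flat = col_2d.ravel()           # column vector -> flat array
```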

In [None]:
# create a boolean mask that indicates which values in R are smaller than 0
# (stored under its own name so that the matrix M from above is not overwritten)
mask = (R < 0)
mask

In [None]:
# set all entries in R that are smaller than 0 to -99
R[mask] = -99.
R
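Assigning through a mask changes `R` in place. If the original values are still needed, `np.where` is one alternative that returns a new array instead. The matrix below is a small made-up example, not the random matrix from above.

```python
import numpy as np

R = np.array([[0.5, -1.2], [-0.3, 2.0]])  # small made-up example

# np.where(condition, value_if_true, value_if_false) builds a new array
R_replaced = np.where(R < 0, -99., R)
print(R_replaced)

# a boolean mask can also be used for counting, since True counts as 1
n_negative = int((R < 0).sum())
print(n_negative)
```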

### Pandas

The `pandas` library takes our basic data manipulation to the next level. It is a lifesaver if you need to read in nasty Excel files and helps you with all your basic data science tasks, e.g., if you want to get a quick overview of your data or create some simple plots. If you have used the R programming language before, some concepts here might be familiar to you.

The most important data structure in pandas is the `DataFrame`, which is simply a table with all your data in it. 
We'll always assume that our data is structured such that each row corresponds to one data point (or observation), while each column represents a different attribute/variable that was measured for the data points (in machine learning contexts, these attributes are usually referred to as "features").

In [None]:
# import with standard abbreviation
import pandas as pd

#### Creating DataFrames

In [None]:
# create a dataframe from the matrix M
pd.DataFrame(M)

In [None]:
# create a more interesting dataframe from a dictionary (keys are columns ("features"))
df = pd.DataFrame(
       {
          'sex': ['m', 'w', 'm', 'w'],
          'height': [1.80, 1.77, 1.89, 1.65],
          'weight': [65.3, 73.4, 80.0, 77.0],
          'subject_id': ['subject1', 'subject8', 'subject12', 'subject23']
       }
)
# look at the dataframe
df

In [None]:
# notice the additional column with the values 0-3 in the output above; this is the index
# for easier handling of the data, we can explicitly set
# the subject_id column as our index
df = df.set_index('subject_id')
df

#### Basic manipulations & statistics

In [None]:
# select the column "sex" from the dataframe 
# (returns a pandas series, similar to a flat array in numpy)
df['sex']

In [None]:
# add a new column (similar to adding a new key-value pair to a dict)
# whose values are computed from other columns of the dataframe
df['BMI'] = df['weight'] / (df['height'] ** 2)
df

In [None]:
# get all column names
df.columns

In [None]:
# we can compute basic statistics on the dataframe
df['BMI'].mean()

In [None]:
# summary statistics of the dataframe
df.describe()

In [None]:
# by grouping based on one column...
g = df.groupby('sex')
# ...we can compute statistics for different groups
g.BMI.mean()

In [None]:
# on a group we can also compute different aggregations for different columns
# maximum for height, minimum for weight, mean for BMI
# (referring to the aggregations by name lets pandas use its own implementations)
g.agg({'height': 'max', 'weight': 'min', 'BMI': 'mean'})
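The aggregated columns above keep their original names, which can hide which statistic was computed. One option is pandas' named aggregation, sketched below on a rebuilt copy of the example table; the output column names `max_height`, `min_weight`, and `mean_BMI` are made up for this illustration.

```python
import pandas as pd

# rebuild the example table so that this cell runs on its own
df = pd.DataFrame({
    'sex': ['m', 'w', 'm', 'w'],
    'height': [1.80, 1.77, 1.89, 1.65],
    'weight': [65.3, 73.4, 80.0, 77.0],
})
df['BMI'] = df['weight'] / df['height'] ** 2

# each keyword is the new column name, each tuple is (source column, aggregation)
summary = df.groupby('sex').agg(
    max_height=('height', 'max'),
    min_weight=('weight', 'min'),
    mean_BMI=('BMI', 'mean'),
)
print(summary)
```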

#### Import & Export

In [None]:
# we can export our data as a .csv file (other formats are also supported)
df.to_csv('bmi_dataset.csv')

In [None]:
# we can also read in files and create a dataframe from them
df_imported = pd.read_csv('bmi_dataset.csv')
df_imported

In [None]:
# with additional options, we can already correctly set our index column
# other options let you, e.g., skip some lines at the beginning of a file
df_imported = pd.read_csv('bmi_dataset.csv', index_col='subject_id')
df_imported
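To try out the export and import options without creating a file on disk, the same functions can also write to and read from an in-memory text buffer. This is a minimal sketch with a shortened, made-up version of the table; `io.StringIO` comes from the Python standard library.

```python
import io
import pandas as pd

df = pd.DataFrame(
    {'height': [1.80, 1.77], 'weight': [65.3, 73.4]},
    index=pd.Index(['subject1', 'subject8'], name='subject_id'),
)

buffer = io.StringIO()
df.to_csv(buffer)   # write the CSV text into the buffer instead of a file
buffer.seek(0)      # jump back to the start before reading
df_roundtrip = pd.read_csv(buffer, index_col='subject_id')

print(df_roundtrip.index.name)
print(list(df_roundtrip.columns))
```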

#### Indexing

In [None]:
# select the rows of the dataframe for which a boolean mask based on the column "sex" is True
df[df['sex'] == 'm']

In [None]:
# get only entries from the column "height" with the boolean mask
# (.loc selects rows and column in one step instead of chaining two lookups)
df.loc[df['sex'] == 'm', 'height']

In [None]:
# filter the dataframe based on the height
df[df['height'] < 1.80]

In [None]:
# select a specific data point based on the index name
df.loc['subject12']

In [None]:
# select a specific data point based on the row number
df.iloc[0]

In [None]:
# select a specific entry in the dataframe using index name and column name
df.loc['subject8', 'BMI']
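The same `.loc` syntax also works for assignment and accepts a boolean mask in place of an index name. The sketch below rebuilds a small version of the table and recodes one value; the replacement label `'f'` is only an illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'sex': ['m', 'w', 'm', 'w'], 'height': [1.80, 1.77, 1.89, 1.65]},
    index=['subject1', 'subject8', 'subject12', 'subject23'],
)

# .loc uses index names, .iloc uses positions: both select the third row here
same_row = df.loc['subject12'].equals(df.iloc[2])

# boolean mask for the rows, column name for the column
male_heights = df.loc[df['sex'] == 'm', 'height']

# assignment with the same pattern changes the dataframe in place
df.loc[df['sex'] == 'w', 'sex'] = 'f'
print(df)
```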

### Matplotlib

With the `matplotlib` library, it is possible to create highly customizable plots. It is also the basis for more advanced plotting libraries such as `seaborn`.

In [None]:
# import with standard abbreviation
import matplotlib.pyplot as plt

In [None]:
# get some data that we want to plot
x = np.arange(10)  # numpy array with numbers 0 - 9
y = x**2           # squared numbers
# create a very basic plot of x vs. y
plt.figure()   # new canvas
plt.plot(x, y) # simple line plot

In [None]:
# more advanced plot with axis labels etc.
plt.figure()
plt.plot(x, y, label="x^2")  # label is later used in the legend
plt.plot(x, x**3, "r", label="x^3") # "r" creates a red line
# axis labels, legend based on the specified labels, and title
plt.xlabel("X axis")
plt.ylabel("Y axis")
plt.legend(loc="best")  # loc="best" (same as loc=0) automatically determines the best location for the legend
plt.title("Title of the figure")

In [None]:
# create randomly distributed data
x = np.random.randn(100)
y = np.random.randn(100)
# create a scatter plot of x vs y
plt.figure()
# by passing for c an array of the same length as x and y,
# each dot can have its individual color
plt.scatter(x, y, c=x)
plt.xlabel("X axis")
plt.ylabel("Y axis")
plt.colorbar()  # creates a colorbar (more appropriate than a legend here)
plt.title("Title of the figure")
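The cells above use the `plt.` functions, which always act on the current figure. Matplotlib also has an object-oriented interface in which each plot is an explicit axes object, which tends to be easier to follow once a figure contains several plots. The following sketch places a line plot and a scatter plot side by side; the figure size and titles are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(10)

# one figure with two side-by-side axes
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))

# line plots on the left axes
ax_left.plot(x, x**2, label="x^2")
ax_left.plot(x, x**3, "r", label="x^3")
ax_left.set_xlabel("X axis")
ax_left.set_ylabel("Y axis")
ax_left.legend(loc="best")
ax_left.set_title("Line plot")

# scatter plot on the right axes, colored by the x values
points = ax_right.scatter(x, x**2, c=x)
fig.colorbar(points, ax=ax_right)  # the colorbar is attached to the right axes
ax_right.set_title("Scatter plot")

fig.tight_layout()  # avoid overlapping labels
```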