## Basic Data Science Libraries

This notebook contains a brief introduction to the basic data science libraries `numpy`, `pandas`, and `matplotlib`. These libraries help you to efficiently load, store, manipulate, and look at your data. It is especially important that you know how to access individual elements/rows/columns from a matrix or table with various indexing techniques. Therefore, make sure to play around here a bit until you feel comfortable with these methods.


**Exercise:** After you're done with the tutorial, find a dataset of your own, load it with pandas, and examine the different variables.

### numpy
The `numpy` library is used for mathematical operations and scientific computations. If you need more advanced operations, also check out the `scipy` library, which is closely related to numpy. If you have worked with MATLAB before, many things here will seem very familiar. Just always remember that in Python, indexing starts at 0, not 1.

The most important data structure in numpy is the so-called `array`, which is used to represent a vector or matrix.

**Official `numpy` tutorial:** https://numpy.org/devdocs/user/quickstart.html

(If you're interested in more advanced scientific programming (e.g. optimization), you may also want to check out the official [scipy tutorial](https://docs.scipy.org/doc/scipy/reference/tutorial/index.html) for more information on the `scipy` library.)

In [None]:
# import the library with its standard abbreviation
import numpy as np

In [None]:
# create a vector with 3 elements from a list
a = np.array([1., -2., 0.])
# look at the vector
a

In [None]:
# create a 2x3 matrix from nested lists
M = np.array([[1., 2., 3.], [4., 5., 6.]])
M

In [None]:
# multiply all elements in the matrix by 3
3*M

In [None]:
# multiply the matrix M with the vector a
np.dot(M, a)

In [None]:
# multiply the matrix M with its transpose
np.dot(M, M.T)

In [None]:
# elementwise multiplication
M*M

In [None]:
# make sure the dimensions always line up, otherwise you'll get an error like this
np.dot(M, M)

In [None]:
# check the shape of a matrix or vector (e.g. to investigate errors like the one above)
M.shape
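
The rule behind the error above can be checked by hand: for a matrix product, the number of columns of the left operand has to equal the number of rows of the right operand. The following is a minimal sketch with made-up example matrices `A` and `B` (names chosen only for illustration). It also uses the `@` operator, which gives the same result as `np.dot` for 2D arrays.

```python
import numpy as np

# made-up example matrices (hypothetical names, not used elsewhere in the tutorial)
A = np.array([[1., 2., 3.], [4., 5., 6.]])  # shape (2, 3)
B = A.T                                      # shape (3, 2)

# inner dimensions match (3 == 3), so the product has shape (2, 2)
P = A @ B
print(P.shape)

# inner dimensions do not match (3 != 2), so numpy raises a ValueError
try:
    A @ A
except ValueError as err:
    print("shape mismatch:", err)
```

Printing the shapes of both operands is usually the quickest way to find out which of them needs to be transposed.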

In [None]:
# create a 3x3 identity matrix
np.eye(3)

In [None]:
# create a 3x2 matrix with zeros
np.zeros((3, 2))

In [None]:
# np.random provides different options to create random data.
# here we create a 4x4 matrix with random, normally distributed values.
# you might want to set a random seed first to get reproducible results:
# --> execute the cell a few times to see that you always get a different matrix
# --> then uncomment the line below and execute it again a few times
# np.random.seed(13)
R = np.random.randn(4, 4)
R
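
As a self-contained sketch of what the seed does: resetting it to the same value before drawing numbers reproduces exactly the same "random" values. The second half uses `np.random.default_rng`, numpy's newer generator interface, which is not used in the rest of this tutorial and is shown here only as an alternative. The seed value 13 is arbitrary.

```python
import numpy as np

# same seed -> same sequence of random numbers
np.random.seed(13)
first = np.random.randn(2, 2)
np.random.seed(13)
second = np.random.randn(2, 2)
print(np.array_equal(first, second))

# alternative: a Generator object carries its own seed,
# so it does not change numpy's global random state
rng_a = np.random.default_rng(13)
rng_b = np.random.default_rng(13)
same = np.array_equal(rng_a.standard_normal((2, 2)), rng_b.standard_normal((2, 2)))
print(same)
```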

In [None]:
# indexing of matrices works similarly to indexing lists
# remember: indexing starts at 0 and the last element is exclusive
# this gives you the first 2 rows with all columns
R[:2, :]  # for rows, you can also omit the last :, i.e., write R[:2]

In [None]:
# all rows starting at the 3rd row with all columns
R[2:, :]

In [None]:
# columns 2 and 4
R[:, [1, 3]]

In [None]:
# column 3 - notice the shape of the returned array, 
# i.e., it's a proper column vector (shape: (4, 1))
R[:, [2]]

In [None]:
# column 3 but as a flattened array (shape: (4,))
R[:, 2]
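
The difference between the last two cells is easy to overlook, so here is a small sketch that prints both shapes and converts between them. The matrix `X` is a made-up example; `reshape(-1, 1)` and `ravel()` are one common way to switch between the two forms.

```python
import numpy as np

X = np.arange(12).reshape(4, 3)  # made-up 4x3 example matrix

col_2d = X[:, [2]]  # indexing with a list keeps the column axis: shape (4, 1)
col_1d = X[:, 2]    # indexing with an integer drops it: shape (4,)
print(col_2d.shape, col_1d.shape)

# convert between the two forms
as_column = col_1d.reshape(-1, 1)  # flat array -> column vector
as_flat = col_2d.ravel()           # column vector -> flat array
```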

In [None]:
# create a binary mask that indicates which values in R are smaller than 0
M = (R < 0)
M

In [None]:
# set all entries in R that are smaller than 0 to -99
R[M] = -99.
R
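
Note that the assignment above changes `R` in place. If you want to keep the original values, one option is `np.where`, which builds a new array instead. A minimal sketch with a made-up example matrix `X`:

```python
import numpy as np

X = np.array([[1.5, -0.3], [-2.0, 0.7]])  # made-up example values
X_before = X.copy()

# np.where(condition, value_if_true, value_if_false) returns a new array
# and leaves X untouched
X_replaced = np.where(X < 0, -99., X)
print(X_replaced)
```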

### Matplotlib

With the `matplotlib` library, it is possible to create highly customizable plots. It is also the basis for more advanced plotting libraries such as `seaborn`.

[This set of cheat sheets](https://github.com/matplotlib/cheatsheets) may be helpful for creating the perfect plots.

In [None]:
# import with standard abbreviation
import matplotlib.pyplot as plt

In [None]:
# get some data that we want to plot
x = np.arange(10)  # numpy array with numbers 0 - 9
y = x**2           # squared numbers
# create a very basic plot of x vs. y
plt.figure()   # new canvas
plt.plot(x, y) # simple line plot

In [None]:
# more advanced plot with axis labels etc.
plt.figure()
plt.plot(x, y, label="x^2")  # 'label' is later used in the legend
plt.plot(x, x**3, "r", label="x^3") # "r" creates a red line
# axis labels, legend based on the specified labels, and title
plt.xlabel("X axis")
plt.ylabel("Y axis")
plt.legend(loc=0)  # loc=0 automatically determines the best location for the legend
plt.title("Title of the figure")

In [None]:
# create randomly distributed data
x = np.random.randn(100)
y = np.random.randn(100)
# create a scatter plot of x vs y
plt.figure()
# by passing for c (=color) an array of the same length as x and y,
# each dot can have its individual color
plt.scatter(x, y, c=x)
plt.xlabel("X axis")
plt.ylabel("Y axis")
plt.colorbar()  # creates a colorbar (more appropriate than a legend here)
plt.title("Title of the figure")
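
The cells above use the `plt.` functions, which always act on the current figure. As an alternative sketch, the same kind of plot can be written with matplotlib's object-oriented interface, where every call is made on an explicit axes object. This can be easier to follow once a figure has several subplots. The variable names `fig` and `ax` are just the usual convention.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(10)

# plt.subplots() returns the figure and a single axes object
fig, ax = plt.subplots()
ax.plot(x, x**2, label="x^2")
ax.plot(x, x**3, "r", label="x^3")  # "r" creates a red line
ax.set_xlabel("X axis")
ax.set_ylabel("Y axis")
ax.set_title("Title of the figure")
ax.legend(loc="best")  # same as loc=0
```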

### Pandas

The `pandas` library takes our basic data manipulation to the next level. It is a lifesaver if you need to read in nasty Excel files and helps you with all your basic data science tasks, e.g., if you want to get a quick overview of your data or create some simple plots. If you have used the R programming language before, some concepts here might be familiar to you.

The most important data structure in pandas is the `DataFrame`, which is simply a table with all your data in it. 
We'll always assume that our data is structured such that each row corresponds to one data point (or observation), while each column represents a different attribute/variable that was measured for the data points (in machine learning contexts, these different attributes are usually referred to as "features").

**Official `pandas` tutorial:** https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html

In [None]:
# import with standard abbreviation
import pandas as pd

#### Creating DataFrames

In [None]:
# create a dataframe from a numpy array M with 10 rows and 3 columns
M = np.random.random((10, 3))
df = pd.DataFrame(M)
df

In [None]:
# convert dataframe back to a numpy array
df.to_numpy()

In [None]:
# create a more interesting dataframe from a dictionary (keys are columns (="features"))
df = pd.DataFrame(
       {
          'sex': ['m', 'w', 'm', 'w'],
          'height': [1.80, 1.77, 1.89, 1.65],
          'weight': [65.3, 73.4, 80.0, 77.0],
          'subject_id': ['subject1', 'subject8', 'subject12', 'subject23']
       }
)
# look at the dataframe
df

In [None]:
# notice the additional column on the left with 0-3 above; this is the index column.
# for easier handling of the data, we can explicitly set the subject_id column as our index
df = df.set_index('subject_id')
df

#### Basic manipulations & statistics

In [None]:
# select the column "sex" from the dataframe 
# (returns a pandas Series, similar to a flat array in numpy)
df['sex']

In [None]:
# by selecting a list of columns, the DataFrame structure is preserved
df[['sex']]

In [None]:
# add a new column (similar to adding a new key-value pair to a dict)
# compute with other columns from the dataframe
df['BMI'] = df['weight'] / (df['height'] ** 2)
df

In [None]:
# get all column names
df.columns

In [None]:
# we can compute basic statistics on the dataframe
df['BMI'].mean()

In [None]:
# summary statistics for the whole dataframe
df.describe()

In [None]:
# we can also compute different aggregations for different columns; just pass
# the name of an aggregation (or any function that is then called on the specified column):
# maximum for height, minimum for weight, mean for BMI
# (string names avoid the warnings that newer pandas versions show for builtins like max)
df.agg({'height': 'max', 'weight': 'min', 'BMI': 'mean'})

In [None]:
# by grouping based on one column...
g = df.groupby('sex')
# ...we can compute statistics for different groups
g["BMI"].mean()

In [None]:
# aggregations work here too
g.agg({'height': 'max', 'weight': 'min', 'BMI': 'mean'})
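
If you need several aggregations at once and want readable column names in the result, pandas also supports so-called named aggregation. The sketch below uses a made-up dataframe `people` with the same structure as the one above. The output column names (`max_height` etc.) are chosen freely for illustration.

```python
import pandas as pd

# made-up example data, similar in structure to the dataframe above
people = pd.DataFrame({
    "sex": ["m", "w", "m", "w"],
    "height": [1.80, 1.77, 1.89, 1.65],
    "weight": [65.3, 73.4, 80.0, 77.0],
})

# each keyword is a new column name; the tuple is (source column, aggregation)
summary = people.groupby("sex").agg(
    max_height=("height", "max"),
    min_weight=("weight", "min"),
    n_subjects=("height", "count"),
)
print(summary)
```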

#### Import & Export

Have a look at the [pandas tutorial](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) for more info on possible file formats as well as additional options for saving and loading data.

In [None]:
# we can export our data as a .csv file (other formats are also supported)
# (--> have a look at the folder to see the file that was created!)
df.to_csv('bmi_dataset.csv')

In [None]:
# we can also read in files and create a dataframe from them
df_imported = pd.read_csv('bmi_dataset.csv')
df_imported

In [None]:
# note how our index column was treated like just a regular column.
# with additional options, we can already correctly set our index column while loading.
# other options allow you to, e.g., skip some lines at the beginning of a file.
df_imported = pd.read_csv('bmi_dataset.csv', index_col='subject_id')
df_imported

In [None]:
# pandas infers the data type of the columns when reading in the data 
df_imported.dtypes
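
The export/import round trip can also be tried without writing a file to disk. The following sketch writes a made-up dataframe into an in-memory text buffer (`io.StringIO` from the Python standard library) and reads it back, which shows both the effect of `index_col` and the inferred data types.

```python
import io
import pandas as pd

# made-up example data with a named index
original = pd.DataFrame(
    {"height": [1.80, 1.77], "sex": ["m", "w"]},
    index=pd.Index(["subject1", "subject8"], name="subject_id"),
)

# write the CSV text into a buffer instead of a file
buffer = io.StringIO()
original.to_csv(buffer)
buffer.seek(0)  # rewind so read_csv starts at the beginning

# index_col restores the index instead of treating it as a regular column
restored = pd.read_csv(buffer, index_col="subject_id")
print(restored.dtypes)
```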

#### Indexing

In [None]:
# select rows of the dataframe with a boolean mask based on the column "sex"
df[df['sex'] == 'm']

In [None]:
# get only entries from the column "height" with the binary mask
# (.loc with a row mask and a column name does this in a single indexing step)
df.loc[df['sex'] == 'm', 'height']

In [None]:
# filter the dataframe based on the height
df[df['height'] < 1.80]

In [None]:
# select a specific data point based on the index name
df.loc['subject12']

In [None]:
# select a specific data point based on the row number
df.iloc[0]

In [None]:
# iloc works similarly to numpy indexing
df.iloc[:2, [0, 2]]

In [None]:
# select a specific entry in the dataframe using index name and column name
df.loc['subject8', 'BMI']

In [None]:
# also works with lists of names
df.loc[['subject8', 'subject12'], ['BMI', 'sex']]
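
One difference between `loc` and `iloc` that often causes confusion concerns slices: a label slice with `loc` includes the end label, while a position slice with `iloc` excludes the end position (like numpy and lists). A small sketch with a made-up dataframe `scores`:

```python
import pandas as pd

# made-up example data with string labels as index
scores = pd.DataFrame(
    {"a": [10, 20, 30, 40], "b": [1, 2, 3, 4]},
    index=["w", "x", "y", "z"],
)

by_label = scores.loc["w":"y"]   # label slice: the end label "y" is included
by_position = scores.iloc[0:2]   # position slice: the end position 2 is excluded
print(len(by_label), len(by_position))

# the same cell, addressed by names and by positions
print(scores.loc["x", "b"], scores.iloc[1, 1])
```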

#### Dealing with missing values (NaNs)
NaN ("Not a Number") or missing values in your data are no fun and can lead to errors when using this data down the road. Therefore, you should remove these entries (or fill them with sensible defaults, though this can result in other problems).

In [None]:
# read in a dataset
df = pd.read_csv("data/test_na.csv")
# there are some NaNs!
df

In [None]:
# check how many NaNs you have per column
# (--> try also without the .sum() or .sum(axis=1))
df.isnull().sum(axis=0)

In [None]:
# you could fill the missing values with some defaults, e.g. 0
df.fillna(0)

In [None]:
# or with mean values - use with EXTREME caution!!!
# (numeric_only=True skips text columns, for which no mean can be computed)
df.fillna(df.mean(numeric_only=True))

In [None]:
# usually it's safest if we just remove the data points that have NaNs anywhere
# instead of adding garbage values to the data.
# but be aware that NaNs might not be random, e.g., in surveys rich people might more
# often decline to answer questions about their wealth than middle class people
# so removing data points this way can create a systematic bias in the dataset.
# whatever you do, just be sure to note your decision somewhere
# and communicate it when you present your results!
df = df.dropna(axis=0, how="any")  # axis=0: drop rows, not columns; how='any': drop if there is a NaN in one field
# rows 1 and 3 are missing now
df

In [None]:
# since some rows were removed, be careful when indexing the df!
# row 0 is still present
print("row 0:\n", df.iloc[0])
# row 0 also has index 0
print("\n\nindex 0:\n", df.loc[0])
# row 1 now corresponds to index 2
print("\n\nrow 1:\n", df.iloc[1])
print("\n\nindex 2:\n", df.loc[2])
# index 1 is missing - this will give a KeyError!
print("\n\nindex 1:\n", df.loc[1])

In [None]:
# to avoid errors, it's often a good idea to reset the index
# inplace: change df directly instead of returning a new object
# drop: don't keep old index
df.reset_index(inplace=True, drop=True)
# df.loc[1] would work again now
df
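
If you don't have the example file at hand, the same steps can be reproduced with a small made-up dataframe. The sketch below (column names `x1` and `x2` and all values are invented for illustration) counts the NaNs, drops the affected rows, and resets the index.

```python
import numpy as np
import pandas as pd

# made-up data with one NaN in each column
data = pd.DataFrame({
    "x1": [1.0, np.nan, 3.0, 4.0],
    "x2": [0.5, 0.1, 0.2, np.nan],
})

# number of NaNs per column
nans_per_column = data.isnull().sum(axis=0)
print(nans_per_column)

# drop every row that contains at least one NaN (rows 1 and 3 here)
clean = data.dropna(axis=0, how="any")
print(list(clean.index))  # the remaining index labels have a gap

# renumber the rows so labels and positions agree again
clean_reset = clean.reset_index(drop=True)
print(list(clean_reset.index))
```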

#### Examining and transforming features
Many machine learning models work best when your variables are (approximately) normally or uniformly distributed. When you get a new dataset, always plot all the variables, and if a variable is far from such a distribution, consider applying a transformation, e.g., take the logarithm of a variable with a long tail of extreme values.

In [None]:
# plot the distribution of values for each column
df.hist(bins=20)

In [None]:
# plot x2 again to examine these few large values in a bit more detail
plt.figure()
# plot x2 against some small random jitter so the dots don't overlap too much
plt.scatter(df["x2"], 0.1*np.random.randn(len(df["x2"])))
plt.xlabel("x2")
plt.ylabel("random jitter")

In [None]:
# try a transformation of x2 to get the values more normally distributed, i.e.,
# without these few very extreme values
x2_new = np.log(df["x2"] + 1)
plt.figure()
# not perfect, but much better!
plt.scatter(x2_new, 0.1*np.random.randn(len(x2_new)))
plt.xlabel("log(x2 + 1)")
plt.ylabel("random jitter")
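
Besides looking at the plots, you can also put a number on how lopsided a distribution is, e.g., with the skewness (0 for a symmetric distribution, large positive values for a long right tail). The sketch below uses made-up long-tailed data, not the dataset from above, and `np.log1p`, which computes `log(x + 1)`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# made-up long-tailed data: exponentiating normally distributed values
# gives a log-normal distribution with a few very large values
values = pd.Series(np.exp(rng.normal(size=1000)))

skew_before = values.skew()
skew_after = np.log1p(values).skew()  # same transformation as np.log(x + 1)
print("skewness before:", skew_before)
print("skewness after: ", skew_after)
```

The skewness is only a rough summary, so it complements the histogram rather than replacing it.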

#### Dealing with time series data
Time series data is often stored in specialized databases, which sometimes export the data in a so-called "long format", where the variables are no longer stored in individual columns (like in the regular "wide format" that we have been dealing with so far). To later use this data with our machine learning methods, we therefore first need to transform it into the wide format (data points/observations in rows, variables/features in columns).

The dataset is originally from [here](http://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+) but was modified for our purposes. It contains timestamped data of different sensors (temp, co2, etc.) in a room and how many people are currently present in the room.

In [None]:
# load a time series dataset in long format
df = pd.read_csv("data/test_timeseries.csv")
# the column "variable" contains the name of the measured variable
# and "value" the corresponding value at this timestamp
df

In [None]:
# check what kind of variables there are --> these will be our columns later
df["variable"].unique()

In [None]:
# check what data types were detected
df.dtypes

In [None]:
# apparently, timestamp was detected as a string ("object").
# we want to transform it into a proper datetime format to enable
# a bunch of data operations on time stamps (incl. correct plotting)
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%Y-%m-%d %H:%M:%S")
df.dtypes

In [None]:
# now we want to convert the long format to our regular wide format
df_wide = df.pivot(index='timestamp', columns='variable')
# by default, the columns are now a MultiIndex; with droplevel we can transform them 
# into a regular Index so we can work with the dataframe as we're used to
df_wide.columns = df_wide.columns.droplevel(0)
df_wide
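
To see the long-to-wide conversion on a dataset small enough to check by eye, here is a sketch with a made-up long-format table. It assumes the measurement column is called `value`. If you pass this column explicitly via `values=`, the resulting columns are a regular Index and no `droplevel` is needed.

```python
import pandas as pd

# made-up long-format data: one row per (timestamp, variable) pair
long_df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2020-01-01 00:00:00", "2020-01-01 00:00:00",
        "2020-01-01 00:01:00", "2020-01-01 00:01:00",
    ]),
    "variable": ["temp", "co2", "temp", "co2"],
    "value": [21.0, 400.0, 21.5, 410.0],
})

# wide format: one row per timestamp, one column per variable
wide_df = long_df.pivot(index="timestamp", columns="variable", values="value")
print(wide_df)
```

Note that `pivot` raises an error if the same (timestamp, variable) combination occurs more than once, so duplicates have to be removed or aggregated first.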

In [None]:
# let's plot the time series (this only works so nicely because
# the timestamp is the index column and was correctly formatted!)
df_wide.plot(subplots=True);  # the ";" at the end prevents unnecessary additional output

In [None]:
# often, time series data is sampled with a very high frequency (every few seconds)
# but the sensor measurements might not change that much over time
# you can use "resample" to change the sampling frequency and reduce the size of your dataset.
# by calling .mean() on the resampler, we tell it to compute the new values as the mean
# of the values in one interval. you could also use the function to upsample your data.
print("original shape:", df_wide.shape)
df_wide = df_wide.resample("10min").mean()
print("downsampled shape:", df_wide.shape)
# looks a bit smoother than above
df_wide.plot(subplots=True);
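
The effect of `resample` is easiest to verify on data where you can compute the result in your head. The sketch below creates a made-up series with one value per minute for one hour (the column name `temp` is invented) and downsamples it to 10-minute means.

```python
import numpy as np
import pandas as pd

# made-up data: the values 0, 1, ..., 59, one per minute
index = pd.date_range("2020-01-01 00:00:00", periods=60, freq="1min")
series = pd.DataFrame({"temp": np.arange(60, dtype=float)}, index=index)

# each new row is the mean of the 10 values in its interval
downsampled = series.resample("10min").mean()
print("original shape:", series.shape)
print("downsampled shape:", downsampled.shape)
print(downsampled)
```

The first interval contains the values 0 to 9, so its mean is 4.5.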

## Exercise: Work with your own data

Find a dataset, load it with pandas, make sure it is in the correct (wide) format (data points in rows, variables/features in columns; columns should have the correct column names), and examine the different variables (e.g. by plotting them) to see if they are (approximately) normally/uniformly distributed.

You can also do this in a new notebook, just remember to import all needed libraries.

If you don't have a dataset of your own that you want to explore, maybe you'll find one here that interests you: https://www.kaggle.com/datasets?fileType=csv&sizeEnd=100%2CMB

If you're not working on your own computer but online, e.g., with Binder, go back to the folder view with the list of notebooks. In the top right there is an "Upload" button, which you can use to upload, e.g., a .csv file from your computer to work with online.

Don't be afraid to google if you don't know how to do something (tip: search in English to get more results!). You can also get the documentation of any function by writing a `?` after its name to see, for example, what parameters you can pass to it.

In [None]:
# execute this to see the documentation of the read_csv function
pd.read_csv?