# Contact
Kathrin Kefer, University of Applied Sciences Upper Austria
kathrin.kefer@fh-hagenberg.at
2021/22

# Goals for the ML course (in general)
* quickstart to practical aspects of (the basics of):
* loading and handling data in Python
* visualizing data and features in Python
* feature co-correlation
* regression and classification API of sklearn
* quantifying and displaying error of regression and classification
* overfitting
* data partitioning


# TOOL RECOMMENDATIONS
## Python distribution: 
* consider using anaconda: https://www.anaconda.com/what-is-anaconda/
* use Python3!

## IDE: 
* for scientific computing consider using Jupyter Notebooks or Spyder (shipped with anaconda).
* other IDEs are good too, like PyCharm and similar ones.
** hint 1 in Spyder: F9 runs the currently selected code or, if no code is selected, the current line.
** hint 2 in Spyder: a "cell" is a code block, like in Matlab. Cells range from a cell start comment, which is "#%%", to the beginning of the next cell. In Spyder you can run the cell the cursor is currently in with Shift+Enter or Ctrl+Enter.
** hint 3 in Spyder: Ctrl+I in editor with cursor on a function name opens the help for that function

## Libraries:
* library for data wrangling (loading, structuring, etc): pandas (https://pandas.pydata.org/)
* library for machine learning: Scikit-learn/SKlearn (https://scikit-learn.org/stable/)
* library for deep learning: keras (standardized "interface" to TensorFlow, etc)
* libraries for numeric/math/scientific computations (pandas, SKlearn build on them): numpy, scipy
* library for plotting: matplotlib


# FURTHER MATERIAL FOR QUICKSTART
* super quick intro to python: https://learnxinyminutes.com/docs/python3/
* super quick intro to statistical computing with python: https://learnxinyminutes.com/docs/pythonstatcomp/
* intro to pandas: https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html

# INTRODUCTION TO PYTHON: SYNTAX

In [None]:
# comments start with #
# cells start with a #%% comment

""" multi-line comments are in 3 " - they are used as documentation
"""

# basic operations in interactive console
1
1;
1+1

# the interactive console automatically prints the return value. a script does not. you can print with print(...)
print(1)

# float division
5 / 3
# floor division: the result is rounded down (towards negative infinity)
5 // 3 # int, rounded down
5.0 // 3.0 # float, but rounded down as well

# code style: whitespace before and after +, -, *, /, %, etc. Also before and
# after =, except when it is used for a named function parameter

# exponentiation
7 ** 2

# modulo
7 % 2

# parentheses
(1 + 3) * 2

# boolean
True
False
not True
not False

# compares the content
True == (not False)
# compares the object: is only true if pointers point to the same object
# see e.g. https://dbader.org/blog/difference-between-is-and-equals-in-python
True is (not False)

# variables
a = 1
b = 'foo'
c = 'bar'
# d = a + b # does not work due to int + str
# d = b + a # does not work either
d = b + c
e = str(a) + b
# useful to print debug messages
print('my debug message, x=' + str(a))

# variables cannot start with a number
# 1a = 1 # nope...
a1 = 1

# we can selectively delete objects too, e.g. if they take too much memory
del a
del b
del c
del d
del e

# if you want to explicitly free memory you can call the garbage collector
import gc
gc.collect()
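The difference between `==` and `is`, and the rounding direction of `//`, are easiest to see on values where they disagree. The following is a minimal sketch; all variable names are made up for illustration.

```python
# == compares content, `is` compares identity (same object in memory)
list_a = [1, 2, 3]
list_b = [1, 2, 3]   # equal content, but a separate object
list_c = list_a      # a second name for the same object

same_content = list_a == list_b   # True
same_object = list_a is list_b    # False
same_alias = list_a is list_c     # True

# // rounds towards negative infinity (floor), not towards zero
floor_pos = 5 // 3     # 1
floor_neg = -5 // 3    # -2
mod_neg = -5 % 3       # 1, because -5 == 3 * (-2) + 1

print(same_content, same_object, same_alias)
print(floor_pos, floor_neg, mod_neg)
```

For value comparisons `==` is usually what you want; `is` is mostly used for checks such as `x is None`.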

# INTRODUCTION TO PYTHON: VECTORS, LISTS

In [None]:
# this is the basics version. before you use them too much:
#   use numpy.array, numpy.arange, etc, instead! they are better for our cases...
#   the same indexing syntax works there too
#   https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html
#   https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.arange.html
my_list1 = [1, 2, 3, 4, 5]

my_list1[0]

# slicing with ranges includes the left index and excludes the right index
# this means you can slice up to the length of the list without errors
my_list1[0:len(my_list1)]
my_list1[0:4]
my_list1[:4]
my_list1[2:]
my_list1[:-1]
my_list1[:-2]
my_list1[1:-2]

my_list2 = [1, 'b']

my_list3 = list(range(10))
my_list3 = list(range(1, 10))
my_list3 = list(range(10, -1, -1))

# dictionary = (hash)map in other languages
my_dict = {'a': 1, 'b': 'foo', 3: 'bar'}
my_dict['a']
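The slicing rules above can be checked on a small list. This is a minimal sketch with made-up names and values; it also shows a step value in a slice and `dict.get`, which are not used in the cell above.

```python
letters = ['a', 'b', 'c', 'd', 'e']

first_three = letters[0:3]      # indices 0, 1, 2 (index 3 is excluded)
too_far = letters[0:100]        # slices are clipped to the list length, no error
every_second = letters[::2]     # the third slice value is the step
reversed_copy = letters[::-1]   # a negative step walks backwards

# dictionaries map keys to values
ages = {'anna': 31, 'ben': 27}
ages['carl'] = 45               # assigning to a new key adds an entry
missing = ages.get('dora')      # .get returns None instead of raising a KeyError

print(first_three, too_far, every_second, reversed_copy)
print(ages, missing)
```

Note that only slices are clipped: a single out-of-range index such as `letters[100]` raises an `IndexError`.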

# INTRODUCTION TO NUMPY: the #1 python lib for numerical computation
   this is what you should use to handle your numeric lists

In [None]:
import numpy as np

a = np.array([1,2,3])
a = np.arange(10)
a = np.arange(10, -1, -1)
a[2:]
a[:-1]
a[:]

# Check out the available functions in the numpy API:
#   https://docs.scipy.org/doc/numpy/reference/

# We'll later also use scipy, which does scientific computations on
#   numbers, such as signal filtering in the scipy signal module
#   for scipy API see https://docs.scipy.org/doc/scipy/reference/
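One reason to prefer numpy arrays over plain lists for numeric data is that operators work element-wise. The following is a minimal sketch with made-up variable names, assuming numpy is installed.

```python
import numpy as np

values_list = [1, 2, 3]
values_arr = np.array(values_list)

doubled_list = values_list * 2   # list repetition: [1, 2, 3, 1, 2, 3]
doubled_arr = values_arr * 2     # element-wise multiplication: 2, 4, 6

countdown = np.arange(10, -1, -1)    # 10, 9, ..., 0 (the stop value -1 is excluded)
even_mask = countdown % 2 == 0       # boolean array, one entry per element
even_values = countdown[even_mask]   # boolean indexing keeps the True positions

print(doubled_list, doubled_arr)
print(even_values)
```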

# INTRODUCTION TO PANDAS
   more details: https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html

In [None]:
import pandas as pd
import numpy as np
#import matplotlib.pyplot as plt

# ensure your working directory!
# those are built-in magic commands in ipython, which is spyder's interactive console
#   you can do cool stuff with that!
# https://ipython.readthedocs.io/en/stable/interactive/magics.html
#pwd
#cd('the_directory_i_wanna_work_in')
# in Spyder you can right click on the tab of the currently open file in the top and set the working directory to be the one of the script

# load data from csv file
# loading and writing to/from csv files works with:
#   regular or compressed local file, remotely, etc
df_iris = pd.read_csv('iris.csv')
df_iris = pd.read_csv('iris.csv.bz2')
df_iris = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/iris.data')

# you can also load data in other formats, e.g. json, even... excel, or from e.g. databases. check the documentation for that!

# print data to console
print(df_iris) # explicit print
df_iris.head(5)
df_iris.tail(5)

# pandas internally uses numpy arrays


# STATISTICS & INFO ABOUT DATA - a few examples
df_iris.dtypes
df_iris.describe() # statistics

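# measures of "average" (average != mean)
# numeric_only=True skips the string column 'Name'; newer pandas versions raise an error without it
df_iris.mean(numeric_only=True)
df_iris.mean(axis=1, numeric_only=True) # not meaningful on the iris data!
df_iris.median(numeric_only=True) # don't know what the median is? google it!

# don't know what quantiles are? google them!
df_iris.quantile(0.1, numeric_only=True)
# Q1 = 25%
df_iris.quantile(0.25, numeric_only=True)
# median = Q2 = 50%
df_iris.quantile(0.5, numeric_only=True)
# Q3 = 75%
df_iris.quantile(0.75, numeric_only=True)
df_iris.quantile(0.9, numeric_only=True)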

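# measures of spread: standard deviation (sd or std), mad, interquartile range

df_iris.std(numeric_only=True) # don't know what the std is? google it!
# mad = mean absolute deviation. don't know what it is? google it!
# DataFrame.mad() was removed in pandas 2.0, so we compute it by hand on the numeric columns
df_num = df_iris.select_dtypes('number')
(df_num - df_num.mean()).abs().mean()

# interquartile range = Q3 - Q1
df_iris.quantile(0.75, numeric_only=True) - df_iris.quantile(0.25, numeric_only=True)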

# statistics about the discrete feature = the class
df_iris['Name'].value_counts()

# COLUMNS AND INDEXES
# redefine columns?
df_iris.columns
df_iris.columns = ['feature1', 'feature2', 'feature3', 'feature4', 'target']
df_iris.columns
# redefine indexes?
df_iris.index
list(df_iris.index)
df_iris.index = ['row' + str(i) for i in range(df_iris.shape[0])]
list(df_iris.index)
# we don't want modified columns/rows in this example so we load the original again
df_iris = pd.read_csv('iris.csv')

# SUBSETTING PANDAS DF
#   the optimized access methods are: .at, .iat, .loc, and .iloc
# first sample = first row
df_iris.iloc[0,:]
# first 10 samples
df_iris.iloc[0:10, :]
# sample nr 100-114 (the right bound 115 is excluded)
df_iris.iloc[100:115,:]
# only the last sample
df_iris.iloc[-1,:]
# all but the last sample
df_iris.iloc[:-1,:]
# first feature = first column
df_iris.iloc[:,0]
# first 2 features
df_iris.iloc[:,0:2]
# 3 features by position
df_iris.iloc[:,[0,2,3]]
# sample 100-104, feature nr 2-4 (right bounds are excluded)
df_iris.iloc[100:105, 2:5]
# the last feature
df_iris.iloc[:,-1]
# all but the last feature
df_iris.iloc[:,:-1]
# select features by name
df_iris['SepalLength']
df_iris.SepalLength
df_iris[['SepalLength', 'SepalWidth']]
# certain samples of certain features selected by name
df_iris[['SepalLength', 'SepalWidth']][0:20]
# boolean subset: subset e.g. by criteria
df_iris[df_iris['SepalLength'] > 7]

# we can also transpose the dataframe
df_iris.transpose().columns
df_iris.transpose().index
df_iris.transpose().dtypes # careful about datatypes: we have a string column in original data: with transposing this causes all data to become objects
df_iris.iloc[:, :4].transpose().dtypes # if we exclude the string columns in the data before transposing: all floats instead

# CONCAT DFs
# concat multiple DFs vertically
tmp3 = pd.concat([df_iris, df_iris])
# concat multiple DFs horizontally
tmp4 = pd.concat([df_iris, df_iris], axis=1)
# take a look at tmp3 and tmp4 (e.g. in your variable explorer)
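The cell above lists `.at`, `.iat`, `.loc`, and `.iloc` but only demonstrates `.iloc`. The following is a minimal sketch of the label-based counterparts and of the index behaviour of `pd.concat`. The dataframe `df_toy` and its values are made up for illustration and are not part of the iris data.

```python
import pandas as pd

# made-up toy data, only for illustration
df_toy = pd.DataFrame(
    {'length': [5.1, 7.2, 6.3], 'width': [3.5, 3.0, 2.9], 'name': ['a', 'b', 'c']},
    index=['row0', 'row1', 'row2'],
)

by_label = df_toy.loc['row1', 'length']     # label-based access
by_position = df_toy.iloc[1, 0]             # position-based access to the same cell
single_cell = df_toy.at['row2', 'width']    # scalar access by label

# .loc slices by label and includes BOTH ends, unlike .iloc
label_slice = df_toy.loc['row0':'row1', ['length', 'width']]

# boolean subset combined with a column selection
long_names = df_toy.loc[df_toy['length'] > 6, 'name']

# vertical concat keeps the old index labels (now duplicated) unless ignore_index=True
stacked = pd.concat([df_toy, df_toy])
stacked_renumbered = pd.concat([df_toy, df_toy], ignore_index=True)

print(by_label, by_position, single_cell)
print(label_slice)
print(list(long_names))
print(list(stacked.index), list(stacked_renumbered.index))
```

Duplicated index labels, as in `stacked`, make label-based selection return several rows at once, which is why `ignore_index=True` is often the safer choice after a vertical concat.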

# PLOTTING = DATA VISUALIZATION
* pandas: https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html
* matplotlib: https://matplotlib.org/tutorials/index.html#introductory
* seaborn: https://seaborn.pydata.org/index.html

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="ticks")

# line plots
df_iris.plot()
df_iris.plot(linestyle=':', linewidth=2) # color='red', ...

# scatterplot matrix, pairplot
pd.plotting.scatter_matrix(df_iris)
pd.plotting.scatter_matrix(df_iris, alpha=0.2)
pd.plotting.scatter_matrix(df_iris, alpha=0.2, diagonal='kde')
# scatterplot matrix from seaborn
sns.pairplot(df_iris, hue="Name")

# with selected features
df_iris.plot.scatter(x=0, y=1, c='red')
# "." as a marker can speed up plotting significantly
# you can also use column names for x and y
df_iris.plot.scatter(x="SepalLength", y="SepalWidth", marker='.', c='blue')

# boxplot, histograms, density plots, etc...
df_iris.plot.box()
df_iris.boxplot(by='Name')
df_iris.hist()
df_iris.plot.hist(alpha=0.5)
df_iris.plot.hist(alpha=0.5, stacked=True)
df_iris.plot.density()

# MATPLOTLIB PLOTTING (pandas plotting is based on matplotlib)

fig = plt.figure()
plt.plot(df_iris['SepalLength'])
plt.plot(df_iris['SepalWidth'], color='red', linestyle=':')
plt.title('title!')
plt.xlabel('x axis!')
plt.ylabel('y axis!')
plt.legend(['SepalLength', 'SepalWidth']) #adding legend manually
# if the labels are already specified in the plot call, like this:
# plt.plot(data, label=['col1','col2','col3'])
# then calling plt.legend() without arguments is enough to show the legend

# save plot into file
# svg and pdf are vector graphics, png is not
df_iris.boxplot()
plt.savefig('iris_boxplot.png')
plt.savefig('iris_boxplot.svg')
plt.savefig('iris_boxplot.pdf')
# in the interactive console you need to execute the savefig in the same call (e.g. running all code with F9 at once)
#   --> line by line execution in the interactive console causes the plots to be empty
df_iris.plot.box()
plt.savefig('iris_boxplot.png')
plt.savefig('iris_boxplot.svg')
plt.savefig('iris_boxplot.pdf')

# now take a quick look at the pandas plot possibilities in
#   https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html
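One way to avoid empty saved plots is to keep an explicit reference to the figure and call `savefig` on it, instead of relying on whichever figure is currently active. The following is a minimal sketch using matplotlib's object-oriented interface. The sine and cosine data, the in-memory buffer, and the `Agg` backend line are assumptions made so that the example runs on its own without a display; in Spyder or Jupyter you would drop the backend line and pass a file name.

```python
import io

import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 50)

# plt.subplots returns the figure and its axes explicitly
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label='sin')
ax.plot(x, np.cos(x), color='red', linestyle=':', label='cos')
ax.set_title('title!')
ax.set_xlabel('x axis!')
ax.set_ylabel('y axis!')
ax.legend()  # uses the label= values given in the plot calls

# fig.savefig saves this specific figure, independent of which figure is "current"
png_buffer = io.BytesIO()  # in-memory stand-in for a file name
fig.savefig(png_buffer, format='png')
plt.close(fig)             # free the figure once it is saved
```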


# VECTORIZED OPERATIONS
   for operations on data, you usually do NOT need any for loops, etc!
   e.g. https://www.oreilly.com/library/view/python-for-data/9781449323592/ch04.html

In [None]:
import math

my_array = np.arange(0, 10, 2)
print(my_array)
my_array = 3.0 ** np.arange(-5, 5)
print(my_array)
plt.plot(my_array)
plt.yscale('log')

# define function
def my_func(p1, p2):
    return p1 ** 2 + p2
# apply in list comprehension: https://www.pythonforbeginners.com/basics/list-comprehensions-in-python
print([my_func(p1, p1) for p1 in my_array])
print([my_func(p1, p2) for p1, p2 in zip(my_array, my_array + 1)])

# we can apply functions to a whole dataframe at once
(df_iris * 2).head()

# both rows and columns are possible as axis here:
#   https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
df_iris.apply(func=np.cumsum).head()
df_iris.iloc[:,:-1].apply(func=np.cumsum, axis=1).head()

# we can apply functions to each value in a dataframe
#   https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.applymap.html
#   note: pandas >= 2.1 deprecates applymap in favor of DataFrame.map
df_iris.iloc[:, :4].applymap(func=math.log)
df_iris.iloc[:, :4].applymap(func=lambda x: x**2)

# filter example (we abuse the columns of iris as time series signal now)
# we will see more filtering later in the course
from scipy.signal import savgol_filter
plt.plot(df_iris.iloc[:, 0])
plt.plot(savgol_filter(x=df_iris.iloc[:, 0], window_length=15, polyorder=2))
plt.legend(['original', 'filtered'])
# how to vectorize this for all columns?
# original data
df_iris.plot()
# filtered data
# with axis=0 the function receives one column at a time
df_iris.iloc[:, :4].apply(func=lambda col: savgol_filter(x=col, window_length=15, polyorder=2), axis=0).plot()
# be aware that all those functions, including apply, etc, return a dataframe again (of which we call .plot() above)
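The point that for loops are usually unnecessary can be seen by calling the same function once per element and once on whole arrays. The following is a minimal sketch that reuses `my_func` from the cell above; `df_toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

def my_func(p1, p2):
    return p1 ** 2 + p2

my_array = np.arange(0, 10, 2)   # 0, 2, 4, 6, 8

# loop version: one Python-level call per element
looped = [my_func(p1, p2) for p1, p2 in zip(my_array, my_array + 1)]
# vectorized version: a single call on the whole arrays
vectorized = my_func(my_array, my_array + 1)

# the same function also works on a whole dataframe (made-up toy values)
df_toy = pd.DataFrame({'f1': [1.0, 2.0, 3.0], 'f2': [10.0, 20.0, 30.0]})
squared_plus_one = my_func(df_toy, 1)   # applied to every value at once
# apply with axis=0 passes one column at a time and yields one result per column
col_ranges = df_toy.apply(lambda col: col.max() - col.min(), axis=0)

print(looped)
print(vectorized)
print(squared_plus_one)
print(col_ranges)
```

Both versions give the same numbers; the vectorized call tends to be shorter and, on large arrays, considerably faster.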