Course: scientific and empirical methods  
Lecturer: Miroslav Despotovic

# 1. Introduction

## The course
### The goal of this course is to teach you data analysis with Python with use cases from the field of real estate and energy management.

### The concept of the course is structured as follows:


- **Introduction**: presentation of syntax and programming methodologies with examples.
- **Data analyses**: Treatment of specific statistical problems to consolidate the learning content.
- **Exercises**: Individual and group exercises
- **Grading**: Group project

## Why Python?
Python is one of the most popular and widely used high-level programming languages today.

The Python language is relatively easy to learn, very flexible and supported by an extensive standard library.
Python offers the possibility for object-oriented programming and writing complex software systems.  
### The Python programming language is widely used especially in the field of **data analysis and machine learning**.

## Code editing

Similar to a natural language, a programming language has a certain vocabulary (**keywords**) and a grammar (**syntax**), according to whose rules the program code is to be written. The code in a programming language is the collection of syntactically correct instructions. 

The code can be written with the help of code editor software, e.g. PyCharm, Spyder or Visual Studio, or via a web browser using Jupyter Notebook. **Within the scope of this course, we will use Jupyter Notebook for coding!**<br> 
Here are web links to some code editors as well as to a very good Python tutorial provided by W3Schools: <br>
https://www.jetbrains.com/pycharm/ <br> 
https://www.spyder-ide.org/ <br> 
https://visualstudio.microsoft.com/vs/features/python/ <br>
https://jupyter.org/ <br>
https://www.w3schools.com/python/default.asp


## Working with Jupyter Notebook

The **Jupyter Notebook** is a **standard tool** for editing and executing program code, especially in the area of **data analysis**.

Jupyter Notebook is a free, web-based notebook that combines text and executable code. The software supports over 40 programming languages. It is primarily used as a web-based extension of IPython for running Python code.

A notebook consists of individual cells that can display different types of content, such as source code, text in HTML or Markdown, images or mathematical formulas (in LaTeX).


**The individual steps for installing Python and Python modules, including Jupyter Notebook, can be found on the course website on the Moodle!**

### Creating a notebook
+ To create a new notebook, go to "File" in the menu and select "New Notebook".
+ To save and name a notebook, go to "File" in the menu and select "Save as".

### Creating cells in a notebook
#### In this course, each cell in a notebook is called a "chunk".
+ To insert a chunk, go to "Insert" in the menu.
+ To delete, copy or paste chunk(s), click on "Edit" in the menu.

### Exercise:
1. Create a new notebook.
2. Insert 2 new chunks.
3. Copy 2 chunks and paste them.
4. Delete 2 empty chunks.
5. Name and save the notebook.

### Executing chunks

+ To make a chunk with text, write the text and **select "Markdown" from the drop-down list in the menu**. 
Then execute the chunk by clicking "Run" in the menu.
<br>
+ To make a chunk with code, write the code and **select "Code" from the drop-down list in the menu**. 
Then execute the chunk by clicking "Run" in the menu.

### Exercise:
1. Write some text in an empty chunk
2. Execute chunk as Markdown
3. Write **print("FH Kufstein")**  in an empty chunk
4. Execute chunk as code

### Rules for writing text in notebooks

It is possible to format text in a notebook in a similar way as in a common text editor. You can find a cheatsheet with text formatting rules under: <br>
https://www.ibm.com/docs/en/watson-studio-local/1.2.3?topic=notebooks-markdown-jupyter-cheatsheet

#### Headings
#### To create a heading in the text, write some text and place a hash followed by a space before the text, e.g. # myText
1. You need to select the formatting for the heading as Heading or Markdown from the drop-down list in the menu.   
(Multiple hash marks before the heading text reduce the size of the heading.)
2. After selecting the correct formatting, execute the chunk by clicking Run in the menu.

> Further rules for bullet points, numbering, bold and italics, etc., can be found in the above link with the cheatsheet. 

### Exercise:
Write some
1. text with numberings
2. text with bullet points
3. bold text 
4. italic text
5. raw text

### Importing images
To import an image, go to the menu: Edit ==> Insert Image ==> select a file


# 2. Python modules & Python syntax



### 2.1 Modules
Most functions in Python are packed in so-called modules. A module provides functions that serve a specific programming purpose, such as the "math" module for calculating mathematical functions or the "seaborn" module for drawing diagrams. Two types of modules exist in Python:
- **Global modules**: Python modules that are already installed with Python itself (built-in modules that are part of the Python Standard Library, e.g. the math or calendar modules).
- **Local modules**: Third-party packages that are not part of the standard library and must be installed separately (e.g. numpy, matplotlib).



#### The Python Standard Library
The Python Standard Library is the collection of modules that is distributed with Python. Its documentation also describes some of the optional components that are commonly included in Python distributions.
https://docs.python.org/3/library/

#### Importing modules into Python session

Every time we start the Jupyter and open a notebook, we start a working session.
While the changes in the code can be saved as a notebook, all executed functions and imported modules are only valid during the active session and have to be executed again in a new session.<br>
To import a module during the session, use the **import** statement. 

In [None]:
import pandas

Every time we want to use a function from an imported module, we need to put the module name as a prefix in front of the function name, e.g. in the code we write the function **DataFrame** from the Pandas module as **pandas.DataFrame**. <br>
We can also assign an abbreviated **alias** to the module name. This makes it a little easier for us to code during the session. For this we have to define an alias when we import a module:

In [None]:
import pandas as pd

Now we can write for example the DataFrame function from the Pandas module as **pd.DataFrame**. <br>

It is also possible to import only one specific function from a module. This makes coding a bit easier because no prefix is needed, but the imported name must not collide with another name used in the notebook, e.g.:

In [None]:
from pandas import DataFrame

Now we can write the DataFrame function from the Pandas module as **DataFrame** in our notebook without any prefix like pandas or pd.

It is also possible to import one specific function from a module and assign an alias to it:

In [None]:
from pandas import DataFrame as df

In addition, it is also possible to import a submodule, i.e. a collection of functions within a specific module, and to assign an alias to it: 

In [None]:
import matplotlib.pyplot as plt

See https://matplotlib.org/stable/tutorials/introductory/pyplot.html
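
The import variants described above can be compared side by side. The following minimal sketch uses the standard library module math (chosen here only for illustration, since it needs no installation); the alias m is an arbitrary name:

```python
import math             # whole module, functions need the prefix "math."
import math as m        # the same module under the shorter alias "m"
from math import sqrt   # only the function sqrt, usable without a prefix

# All three calls refer to the same function
print(math.sqrt(16))
print(m.sqrt(16))
print(sqrt(16))
```

All three lines print the same result, because the alias and the directly imported name point to the same function.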

### Exercise:
Import
1. numpy module
2. numpy module with alias np
3. function mean from numpy module

### 2.2 Syntax

#### The Python Language Reference describes the syntax and “core semantics” of the Python programming language. 
https://docs.python.org/3/reference/index.html#reference-index

Every programming language has a fixed syntax that allows only certain combinations of selected symbols and keywords. The keywords are the "vocabulary" of the language with a fixed meaning, which cannot be used for other purposes (e.g. for naming variables). The syntax of a programming language covers all possibilities for the definition of different data structures, control structures (control of the program flow) and instructions.

#### Python Keywords
Python 3 defines a fixed set of keywords (35 in current versions; the exact number depends on the Python version). Since they have a predefined function in the code, they **cannot be used as identifiers** for variable names, function names, or any other programming elements. With the function **help** we can list all available keywords:

In [None]:
help("keywords")
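
As a possible alternative to **help**, the standard library module keyword gives access to the keyword list as a regular Python list. This is a small sketch; the tested names "for" and "n" are just examples:

```python
import keyword

# Number of keywords in the Python version that is currently running
print(len(keyword.kwlist))

# Check whether a given word is reserved
print(keyword.iskeyword("for"))   # "for" is a keyword
print(keyword.iskeyword("n"))     # "n" is free to use as an identifier
```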

#### Python identifiers

Apart from keywords, a Python program can have variables, functions, classes, modules, etc. An identifier is the **user-defined name** given to these programming elements. <br>
The main advantage of using identifiers is to: 
+ distinguish or identify one programming element from another,
+ abbreviate the code and make it more readable, 
+ make it easier to refer from one element to another, and 
+ adapt the code to your own coding design. <br>

There are some **important rules for writing identifiers**:

1. A Python identifier can consist of a combination of lowercase or uppercase letters, digits and underscores.

These are the valid characters.

+ Lowercase letters (a to z)
+ Uppercase letters (A to Z)
+ Digits (0 to 9)
+ Underscore (_) <br>

Examples:
- num1
- FLAG
- get_user_name
- myTable
- _1234
- df 

2. An identifier cannot start with a digit. If we create an identifier that starts with a digit, we get a syntax error.
3. We also cannot use special symbols in an identifier's name.

Symbols like ( !, @, #, $, %, . ) are **invalid**. <br>
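
The rules above can be checked without provoking a syntax error. The following sketch relies on the string method isidentifier() and on keyword.iskeyword(); the candidate names and the dictionary results are made up for illustration:

```python
import keyword

candidates = ["num1", "get_user_name", "_1234", "1num", "user@name", "my.table", "for"]

results = {}
for name in candidates:
    # A usable identifier follows the character rules and is not a keyword
    results[name] = name.isidentifier() and not keyword.iskeyword(name)
    print(name, results[name])
```

The first three names pass the check; the remaining ones fail because they start with a digit, contain a special symbol, or are a reserved keyword.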


The following chunk shows an example of a valid identifier with the arbitrary name n for a variable of integer type containing the value 4. <br>
In combination with the **for** statement, we can use the variable **n** to create a loop in which all integers from 0 up to n-1 are displayed one after another:

In [None]:
n = 4
for i in range(0, n):
    print(i)

#### Assignment of an identifier in a Python statement

Each code line terminated by a new line in a Python script is called a **statement**. The following example shows statements with identifiers for variables with numeric and string values:

In [None]:
msg="Willkommen an der FH Kufstein"
code=842744

We can print variable contents by simply using the identifier:

In [None]:
print(msg)
print(code)

Similar to other programming languages such as C/C++/C#, R, or Matlab, Python accepts a semicolon (;) at the end of a statement. In Python the semicolon is optional; it is only needed to separate multiple statements written on a single line.

In [None]:
from datetime import date; dt = date.today();
msg="Guten Tag"; code=dt; name="Markus";
print(code); print(msg); print(name);

We can also spread a single statement over more than one line by using the backslash (\\) as a continuation character. This is called a **multiline statement**. Look at the following example:

In [None]:
msg="Hello. Welcome to \
data analysis \
course with Python."
aa=print(msg)
aa

#### Code description:
+ we have defined an identifier **msg** for a string variable with text
+ we have executed the **print** function with the identifier **msg** as argument and assigned its return value to the identifier **aa**
+ **print** displays the text but returns the value None, so **aa** contains None and not the printed text. To reuse the text anywhere in the code, use the identifier **msg**.
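
A quick check of what **aa** actually holds after such an assignment. This is a minimal sketch with a shortened message:

```python
msg = "Hello. Welcome to the course."

aa = print(msg)      # print displays the text and returns None
print(aa is None)    # aa does not contain the printed text

# To reuse the text itself, refer to the original identifier
print(msg.upper())
```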

Another example for a multiline statement:

In [None]:
a = 1 + 2 + 3 + \
    4 + 5 + 6 + \
    7 + 8 + 9
a
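
As an alternative to the backslash, an expression enclosed in parentheses may also span several lines. The following sketch repeats the sum from above in that form:

```python
# Inside parentheses, line breaks are allowed without a backslash
a = (1 + 2 + 3 +
     4 + 5 + 6 +
     7 + 8 + 9)
print(a)  # 45
```

This form is often preferred because a stray space after a backslash would cause a syntax error.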

### Exercise:
Write 3 identifiers with an arbitrary content and name them whatever you like. Combine lowercase letters (a to z), uppercase letters (A to Z), digits (0 to 9) and underscore (_) for naming.
1. an identifier with numerical value
2. an identifier with string (text)
3. an identifier for the return value of an executed print function.

#### Indents in Python

In Python, indentation is used to define code blocks. An indentation can consist of 1, 2 or, most commonly, **4 spaces**. The line directly before an indented block always ends with a colon (:).
The same number of spaces should be used consistently for the entire project. <br>
Here we construct a custom function using the **def** keyword with an indentation of 4 spaces:

In [None]:
def greeting(name):
    print("Servus", name)  # print already inserts a space between its arguments

Let's execute custom function:

In [None]:
greeting("Markus")

Here we use several levels of indentation, one of which is typed **incorrectly**, and therefore get an **error message**:

In [None]:
newName="Stefan"
if not newName=="Stefan":  
    print("falscher User")
else:  
    greeting(newName)
     print("Du bist jetzt eingeloggt")

#### We correct the indent and the statement can now be executed without the error:

In [None]:
newName="Stefan"
if not newName=="Stefan":  
    print("falscher user")
else:  
    greeting(newName)
    print("Du bist jetzt eingeloggt")

### Dealing with errors in Python
Python can return various types of errors. Most of them are syntax errors or errors related to a module that cannot be accessed. When an error occurs, the interpreter outputs an **error message, the error position and an explanatory text**. <br>
Most common errors in Python can be tracked down relatively easily if you read the error message. For complex errors, consult the list of Python's built-in exceptions and their meaning under the following link: https://docs.python.org/3/library/exceptions.html
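
To see the parts of an error message without stopping the notebook, an error can be caught with try and except. This goes slightly beyond what has been introduced so far and is only a sketch; the name undefined_variable is deliberately not defined anywhere:

```python
try:
    print(undefined_variable)        # this name does not exist
except NameError as err:
    error_type = type(err).__name__  # the type of the error
    print(error_type, "-", err)      # error type followed by the explanatory text
```

The error type printed here, NameError, is one of the built-in exceptions documented under the link above.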

### Comments in Python

Comments in Python code are very important, as they can be used to write information about individual statements or blocks within the code. In Python code, the symbol # indicates the start of a comment. The comment extends to the end of the line. If # is the first character of the line, then the entire line is a comment. A comment can also start at an arbitrary position within a line. In the following example, the text before the # sign is a valid Python expression, while the text following the # sign is a comment.

In [None]:
# this is comment before code statement
print ("Hello World")
print ("Welcome to Python Tutorial") # this is also comment but after code statement.

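**A triple-quoted multi-line string can also serve as a comment if it is not a part of a function or a class: Python evaluates the string and discards it, because it is neither assigned to an identifier nor printed.**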

In [None]:
'''
comment1
comment2
this is a comment. Let us execute this chunk with comment 
and the function "print"
'''
print ("Hello World")

**Here, the triple-quoted multi-line string is passed as an argument to the function print and is therefore displayed:**

In [None]:
print('''This string literal
has more than one
line''')

### Exercise:
Write multiple comments with an arbitrary text. 
1. a comment before code statement
2. a comment in-line with some code statement
3. a multiline comment
4. a multiline comment within the function print

# 3. Python data types
A data type in Python represents a kind of value and determines which operations can be performed on it. **Numeric, non-numeric and Boolean (logical)** data are the most used data types. However, each programming language has its own data type classification largely reflecting its programming philosophy.

Python has the following standard or built-in data types:

### Numeric data type
A numeric data type is any representation of data which has a numeric value(s). Python identifies three numeric data types:

<b>Integer</b>: Positive or negative whole numbers.
<br>
<b>Float</b>: Any real number with a floating point representation in which a fractional component is denoted by a decimal symbol or scientific notation. <br>
<b>Complex</b>: Combination of real number and an imaginary number
<br>
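
The three numeric types can be inspected with the built-in function type. A small sketch with arbitrary example values:

```python
whole = 42             # integer
decimal = 3.14         # float with a decimal point
scientific = 1.5e3     # float in scientific notation (1500.0)
cplx = 6 + 4j          # complex number: real part 6, imaginary part 4

for value in (whole, decimal, scientific, cplx):
    print(value, type(value))
```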

In the following example, we construct a data table using the **DataFrame** function from the Pandas module. The table has 2 columns, where one column contains decimal numbers (so-called floats) and the other one contains integers. We also deliberately insert one missing value and assign the column names ('Dezimalzahlen' and 'Ganzzahlen'). 
<a id='the_destination1'></a>

In [None]:
import pandas as pd
df = pd.DataFrame({'Dezimalzahlen': [1.1,3.2,4.3,6.4,4.5, float("nan")],  # float("nan") marks the missing value
                   "Ganzzahlen": [1,4,2,9,6,6]})
df

**This is how we check data types in a data frame**

In [None]:
df.dtypes
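
How a missing value is written affects the data type of the whole column. The following sketch compares the text "NaN" with a real missing value created by float("nan"); the column name x and the two variable names are chosen only for this illustration:

```python
import pandas as pd

# The text "NaN" is an ordinary string, so the column mixes floats and text
text_marker = pd.DataFrame({"x": [1.1, 3.2, "NaN"]})

# float("nan") is a real missing value, so the column stays numeric
real_nan = pd.DataFrame({"x": [1.1, 3.2, float("nan")]})

print(text_marker.dtypes)
print(real_nan.dtypes)
print(real_nan["x"].isna().sum())   # number of missing values in the column
```

Only the second column is recognized as float64 and can be used directly for calculations.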

### Boolean data type
A boolean data type is any representation of data which has one of the two built-in values **True or False**. Notice that 'T' and 'F' are capital letters. If you write true and false in lower case, Python raises a NameError because these names are not defined. 

In [None]:
ans = 10 < 9

In [None]:
print(ans)
type(ans)

In [None]:
a = True
print(a)
type(a)

### String data type

A string data type is any representation of alphanumerical data (including special characters), put in a single, double or triple quote. <br>
For simplicity we can say string data type is text data type.

In [None]:
strvar = 'Yoga à Paris dans le parc magique de @holistikatulum '

In [None]:
print(strvar)
type(strvar)

If a text contains both single and double quotes, we define the string with triple quotes:

In [None]:
print('''This string has a single (') and a double (") quote.''')

If we define the same text with double quotes only, we get an error because the inner double quote ends the string prematurely: 

In [None]:
print("This string has a single (') and a double (") quote.")
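
Besides triple quotes, a quote character inside a string can also be escaped with a backslash. This is a sketch of that alternative, which the text above does not cover:

```python
# \" tells Python that this double quote belongs to the text
escaped = "This string has a single (') and a double (\") quote."
triple = '''This string has a single (') and a double (") quote.'''

print(escaped)
print(escaped == triple)   # both notations produce the same string
```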

## Data collection types

### Sequence type
A sequence is an ordered collection of items of the same or different data types. Lists and tuples are Python's most common built-in sequence types. The set is listed here for comparison, although it is unordered and therefore not a sequence:

* <b>List</b> : A list object is an ordered collection of data items, not necessarily of the same type, put in **square** brackets. **We can modify** a list due to its mutable nature. Because lists are mutable, their size can change.
* <b>Tuple</b>: A tuple object is an ordered collection of data items, not necessarily of the same type, put in **parentheses**. We **can't modify** a tuple due to its immutable nature. Because tuples are immutable, their size is fixed.
* <b>Set</b>: A set contains an unordered collection of unique and immutable objects. Unlike lists or tuples, sets can't have multiple occurrences of the same element. They are enclosed in **curly brackets**.

In [None]:
lst = [34,22,'mao',44] # list
tup = (22,'long','hao') # tuple
st = {'foo', 65, 6+4j, 65, 'foo'} # set
print(tup, lst, st)

In [None]:
print('data types: ', type(lst), type(tup), type(st))

**Explanation: Ordered vs Unordered**

* ordered data type – items are kept in the order in which you insert them
* unordered data type – items are NOT kept in the order in which you insert them

### Dictionary

A dictionary object is an ordered collection of data in the form of "key: value" pairs (since Python 3.7, the pairs keep the order in which they were inserted). A collection of such pairs is enclosed in curly brackets.

In [None]:
dic = {"person 1":"Steve", "person 2":"Bill", "person 3":"Ram", "person 4": "Farha"}
dic

In [None]:
dic = {"person 1":["Steve","Maray"], "person 2":["Steve","Jobs"], "person 3":["Steve","Jones"], 
       "person 4":["Steve","Stevenson"]}
dic

In [None]:
type(dic)
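
Values in a dictionary are accessed through their key. The following sketch uses a shortened dictionary named people (a name made up for this example) to show reading, adding and safely looking up entries:

```python
people = {"person 1": "Steve", "person 2": "Bill"}

print(people["person 1"])        # read a value by its key
people["person 3"] = "Ram"       # add a new key: value pair
print(list(people.keys()))       # keys in insertion order

# get() returns a default value instead of an error for unknown keys
print(people.get("person 9", "unknown"))
```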

Using dictionaries we can construct a table (dataframe). <br>
(see next example or 
[our previous example](#the_destination1))

### Data Frame
A data frame represents a table, similar to an Excel spreadsheet, which can be constructed from a dictionary. Here we construct a data frame with arbitrary real estate data:

In [None]:
df = pd.DataFrame({'Immobilienart': ['Wohnung', 'Reihenhaus', 'Bauland'], 'Gemeinde': 
                   ['Graz','Leibnitz', 'Gratwein'], 'PLZ': [8020, 8430, 8112], 
                   'Preis': [210000, 330000, 85000.55]})
df

In [None]:
df.dtypes

### Datetime
In Python, date and time are not built-in data types of their own, but with the built-in module datetime or with the pandas function Timestamp we can work with date and time values. In the following example, a date given as plain text is converted into a date object using the Timestamp function:

In [None]:
dtm = pd.Timestamp('20180310')
dtm

In [None]:
type(dtm)

Example of creating a date with the Python built-in module datetime:

In [None]:
from datetime import date
# call the constructor and pass the arguments in the 
# order year, month, day
my_date = date(1996, 12, 31)
# my_date = date.today()  # uncomment to use the current date instead
print("The date is", my_date)
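
Once a date object exists, it can be displayed in different formats. This sketch uses the methods isoformat and strftime of the datetime module; the format string "%d.%m.%Y" (day.month.year) is just one possible choice:

```python
from datetime import date

my_date = date(1996, 12, 31)

print(my_date.isoformat())                        # 1996-12-31
print(my_date.strftime("%d.%m.%Y"))               # 31.12.1996
print(my_date.year, my_date.month, my_date.day)   # individual components
```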

### Recapitulation: Use of brackets in the Python syntax

* "[ ]" Brackets are used for lists and arrays. They are also used to retrieve a single or multiple items from a list or from a table or to assign a new value to an existing list.
* "( )" Parentheses are used for calling functions, constructing functions and classes, calculations and for creating tuples. 
* "{ }" Braces are used to define a dictionary or a set.

### Indexing
Indexing means accessing data items.
To access data items in a list, you can use the square bracket notation.

Note:
Python lists are 0-indexed, so the first element has index 0, the second has index 1, and so on. If there are n elements in a list, the last element has index n-1. 

### Referencing and modifying lists

In [None]:
# Let’s create a new list and assign it to a variable.
a = ["apples", "bananas", "oranges", "strawberries", "pears"]
a

Let's return only specific values from the list by referencing the position of a value:

In [None]:
ff=a[1]; ll=a[-2]; gg=a[2:4];
# call print function
print(ff)
print(ll)
print(gg)

Now let’s see what happens when we try to modify the first item of the list. Let’s change “apples” to “berries”.

In [None]:
a[0] = "berries"
a
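
For comparison, the same modification fails on a tuple, because tuples are immutable. A minimal sketch with made-up names, in which the error is caught so that the chunk runs to the end:

```python
fruits_list = ["apples", "bananas"]
fruits_tuple = ("apples", "bananas")

fruits_list[0] = "berries"          # works: lists are mutable

try:
    fruits_tuple[0] = "berries"     # fails: tuples are immutable
except TypeError as err:
    tuple_error = str(err)
    print("TypeError:", tuple_error)

print(fruits_list, fruits_tuple)
```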

## Arrays in Python
Arrays are used to store multiple values of the same type in one single variable. Arrays can be one-dimensional or multidimensional. Arrays have the following properties:
+ they consist of elements belonging to the same data type
+ numerical arrays need to be created explicitly with a dedicated function
+ arrays with numerical data types support arithmetic operations on all elements at once
+ a module has to be imported explicitly to create them
+ arrays are more compact in memory than lists

If we simply place a series of quoted alphanumeric elements in square brackets, we get a list of strings, not an array. 

In [None]:
y = ["one", "two", "three", "four"]
type(y)

Thus, a series of strings in square brackets is a list.

To create a numerical array, we need a specific function from either the array module (i.e., array.array()) or the NumPy module (i.e., numpy.array()):

In [None]:
import numpy as np
x = np.array((3,6,9,12))
type(x)

Let's see what happens if we apply a mathematical operation to a numerical array and to a list with numerical values:

In [None]:
x/3.5

Let's create a list with numerical values:

In [None]:
z = [3, 6, 9, 12]
type(z)

If we apply the same operation here, we get an error because z is a list:

In [None]:
z/3.5
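
There are two common ways around this error: dividing each list element separately, or converting the list to a NumPy array first. The following is a sketch; the names by_loop and by_array are chosen only for this example:

```python
import numpy as np

z = [3, 6, 9, 12]

# Option 1: divide each element separately (the result is again a list)
by_loop = [value / 3.5 for value in z]

# Option 2: convert the list to an array, then divide all elements at once
by_array = np.array(z) / 3.5

print(by_loop)
print(by_array)
```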

### Types of numerical arrays in Python

In statistics, a vertical or horizontal arrangement of numbers is called a **vector**. A **matrix** is the combination of horizontal and vertical vectors. That is, a vector is a one-dimensional and a matrix is a two-dimensional arrangement of numbers. Let us examine these types of numerical arrays.
### Vectors and Matrices
A one-dimensional horizontal vector:

In [None]:
v5a = np.arange(5) + 10         # A horizontal one-dimensional vector of length 5
v5a

Multiplying two vectors of the same size with * works element by element and produces a vector of the same size.

In [None]:
v5d=v5a*v5a
v5d

To check the size of a vector use:

In [None]:
v5d.shape 

We can turn a one-dimensional horizontal vector into a vertical vector by adding a second axis. However, the new vector will be two-dimensional:

In [None]:
v5b=v5a[:,None]
v5b

We can transpose the vertical vector back to horizontal using the attribute **T**. However, the new horizontal vector will also be two-dimensional:

In [None]:
v5a=v5b.T
v5a

The NumPy constant **newaxis**, used as an index, allows us to create a two-dimensional vector ad hoc:

In [None]:
v5a = np.array([10, 11, 12, 13, 14])[np.newaxis] # A horizontal two-dimensional vector of length 5
v5a

#### Multiplication of a horizontal and a vertical vector produces a matrix.

In [None]:
v5a*v5b

**Let's transpose the vertical vector back to horizontal:**

In [None]:
v5b=v5b.transpose()
v5b

**Let's now multiply both vectors:**

In [None]:
v5d=v5a*v5b
v5d

**As we can see, the element-wise product of two vectors of the same shape is a vector, not a matrix.**
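
The attribute shape makes the difference between both cases visible. In this sketch, row and col are made-up names for a horizontal and a vertical two-dimensional vector; NumPy stretches the two differently shaped vectors to a common shape before multiplying, a mechanism called broadcasting:

```python
import numpy as np

row = np.arange(5)[np.newaxis]   # horizontal vector, shape (1, 5)
col = row.T                      # vertical vector, shape (5, 1)

print(row.shape, col.shape)
print((row * row).shape)   # same shapes: the result stays a vector
print((row * col).shape)   # different shapes: the result is a 5 x 5 matrix
```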

#### Multiplication of two matrices
A matrix is a two-dimensional rectangular array. We can perform mathematical calculations with two matrices. Let's declare two simple matrices:

In [None]:
A = np.array([[1, 4, 5], 
             [-5, 8, 9],
             [3, 7, 23]])
A

In [None]:
B = np.array([[6, 4, 2], 
             [9, -7, 19],
             [11, 1, 16]])
B

Now we multiply the two matrices with *, which multiplies them element by element:

In [None]:
R=A*B
R
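
Note that the * operator multiplies two arrays element by element. The matrix product known from linear algebra (rows times columns) is calculated with the @ operator or with the function np.dot. The following sketch repeats the two matrices from above; the names E and P are chosen only for this comparison:

```python
import numpy as np

A = np.array([[1, 4, 5],
              [-5, 8, 9],
              [3, 7, 23]])
B = np.array([[6, 4, 2],
              [9, -7, 19],
              [11, 1, 16]])

E = A * B    # element-wise product: E[0, 0] = 1 * 6
P = A @ B    # matrix product: P[0, 0] = 1*6 + 4*9 + 5*11

print(E)
print(P)
```

Both results are 3 x 3 matrices, but their values differ, so it is worth checking which kind of product a calculation requires.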