# Applied Data Visualization Lecture 2: Jupyter Notebooks, Python Data Wrangling


Welcome to your first Jupyter notebook! This will be our main working environment for this class.


Hi there, welcome to our first coding lecture. We will be using Python, a popular data science programming language, in the lectures, homeworks, and projects. As part of Homework 0, you should have already set up Python, IPython and Jupyter notebooks, so it's time to get started!

## Executing your first program

Now it's time to run python! Open a terminal and execute:

```bash
$ python
```

You'll see something like this:

```bash
% python
Python 3.11.3 (main, Apr 19 2023, 18:49:55) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
```
What does this tell us? It shows us the version number of Python. At the end of this statement, you see the three `>>>` signs: these indicate a prompt, but it looks different from your console prompt (`$` or `%`), to indicate you're in an interactive python environment.

There are two fundamental ways you can run Python: in interactive mode (what we're doing here) or in batch mode. 

In interactive mode you write your program interactively, i.e., each new statement is interpreted as you type it. 

If you just run ```python``` without any other parameter, you enter the **interactive** mode. Let's write our very first program:

```python
>>> print("Hello World!")
Hello World!
```

"Hello World!" is by tradition the very first program that you should write in a new programming language! And see, when we instructed python to print the text "Hello World!", it did just that.

So, let's briefly take that statement apart: it contains a call to the `print()` function and passes a parameter to that print function, the string `Hello World!`. 

The string is enclosed in quotation marks `"`. Given that information, python knows you want to print the string, and it does exactly that.

`print()` is a built-in function of Python. There are several other built-in functions, which you can check out [here](https://docs.python.org/3/library/functions.html).

Let's define our first variable. Type

```python
>>> my_string_var = "Are you still spinning?"
```

This statement is executed without any feedback. What you're doing here, intuitively, is that first, you create a new variable of type string with the name ```my_string_var```, and then you assign a value to it, "Are you still spinning?".


Note that the equals sign `=` is NOT a test for equality here, but an ASSIGNMENT. This can be confusing for beginning programmers. 

Equality is tested with a double equals sign `==` in many programming languages including python. Arguably, a different assignment operator such as `:=` would be a better idea and is implemented in other programming languages.

We now can print this variable:

```python
>>> print(my_string_var)
Are you still spinning?
```

which produces the result we expected!

There are many different types of variables. For example, Python has three different data types for numbers: integers, floats (which represent real numbers), and complex numbers. Check out the details about the built-in data types [here](https://docs.python.org/3/library/stdtypes.html).

Let's start with a simple example:

```python
>>> a = 3
>>> b = 2.5
>>> c = a + b
>>> c
5.5
```

Here we've created three variables (`a, b, c`) and executed an operation, the addition of `a` and `b` using the `+` operator, which we have then assigned to `c`. Finally, we've printed `c`.

The data types of `a` and `b`, however, are subtly different. `a` is an integer and `b` is a float. We can check the data type of any variable using the `type()` function:

```python
>>> a = 3
>>> type(a)
<class 'int'>
>>> b = 2.5
>>> type(b)
<class 'float'>
>>> c = "hello"
>>> type(c)
<class 'str'>
```
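
As a small illustration of how the two numeric types interact, the following sketch checks the type of the sum from the earlier example. The variable names simply repeat the ones used above; the point is only that adding an `int` and a `float` gives a `float`.

```python
a = 3        # an integer
b = 2.5      # a float
c = a + b    # adding an int and a float produces a float

print(type(a).__name__)  # int
print(type(b).__name__)  # float
print(type(c).__name__)  # float
```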

## Writing code in a file

Let's look at another way to run python: by executing a file. Exit the interactive environment, by calling the exit function:

```python
exit()
```

Now, open up your favorite text editor and create a new file called "first_steps.py". We've created such a file for you [here](first_steps.py).

You can also copy and paste this code into the file:

```python
def double_number(a):
    # btw, here is a comment! Use the # symbol to add comments or temporarily remove code
    # shorthand operator for 'a = a * 2'
    a *= 2
    return a

print(double_number(3))
print(double_number(14.22))
```

Here we've also defined our first function! We'll go into details about functions at a later time. For now, just notice that the indentation matters!

Now, run

```bash
$ python first_steps.py
6
28.44
```

What happened here? Python executed the commands in the file and then terminated. You saw the result, but the run was no longer interactive: the whole file was executed in a couple of milliseconds.

Larger programs are commonly written in source code files and are not run interactively. They read data from files, wait for user input, etc.


# Jupyter Notebooks Basics

First, let's get familiar with Jupyter Notebooks. 

Notebooks are made up of "cells" that can contain text or code. Notebooks also show you output of the code right below a code cell. These words are written in a text cell using a simple formatting dialect called [markdown](https://jupyterlab.readthedocs.io/en/latest/user/notebook.html). 

Double click on this cell text or press enter while the cell is selected to see how it is formatted and change it. We can make words *italic* or **bold** or add [links](https://www.dataviscourse.net/2023-applied/) or include pictures:

![Data science cat](datasciencecat.jpg)

The content of the notebook, as you edit it in your browser, is written to an `.ipynb` file. 

If you want to read up on Notebooks in detail, check out the [excellent documentation](https://jupyterlab.readthedocs.io/en/latest/index.html).

## Notebook Editors

Jupyter Notebooks can be edited and run in a number of ways. The easiest way is JupyterLab, a development environment explicitly made for notebooks. You can start JupyterLab using this command: 

```bash
$ jupyter-lab
```

You then edit your notebook in the browser. 

The alternative we're using because of Copilot support is to install **Visual Studio Code** and its notebook extensions, and then edit there (see Homework 0).


An alternative to native Jupyter Notebooks is cloud-hosted [Google Colab](https://colab.research.google.com/) notebooks, a cloud-based notebook solution that can read notebook files. Because it's cloud-based, things like reading from files work differently. For this class you should have the notebooks installed locally, but Google Colab is a good alternative if you need to collaborate.



## Writing Code

The most interesting aspect of notebooks, however, is that we can write code in the cells. You can use [many different programming languages](https://github.com/jupyter/jupyter/wiki/Jupyter-kernels) in Jupyter notebooks, but we'll stick to Python. So, let's try it out:

In [8]:
print ("Hello World!")
a = 3
# the return value of the last line of a cell is the output
a 

Hello World!


3

Notice that the output here is directly written into the notebook. 

You can change something in a cell and re-run it using the "run cell" button, or use the `CTRL+ENTER` shortcut.

Another cool thing about cells is that they preserve the state of what happened before. Let's initialize a couple of variables in the next cell: 

In [9]:
age = 2
gender = "female"
name = "Datascience Cat"
smart = True

These variables are now available to all cells below or above **if you executed the cell**. In practice, you should never rely on a variable from a lower cell in an earlier cell. **This behavior is different from executing the cells as a Python file.**

If you make a change to a cell, you need to execute it again. You can also batch-execute multiple cells from the toolbar. 

Let's do something with the variables we just defined:

In [10]:
print (name + ", age: " + str(age) + ", " + 
       gender + ", is smart: " + str(smart))

Datascience Cat, age: 2, female, is smart: True


In the previous cell, we've [concatenated a couple of strings](https://docs.python.org/3.5/tutorial/introduction.html#strings) to produce one longer string using the `+` operator. Also, we had to call the `str()` function to get [string representations of these variables](https://docs.python.org/3.5/library/stdtypes.html#str).

An alternative way to do this is not to concatenate the string but to pass each variable in as a separate argument to the print function: 

In [12]:
print (name, 
       "age: " + str(age),
       gender, 
       "is smart: " + str(smart),
       sep=", ")

Datascience Cat, age: 2, female, is smart: True


The last argument, `sep=", "` tells the print function to use a comma and a space between each argument.
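
A third option, sketched here as an aside, is a formatted string literal (f-string), which converts the values in curly braces to strings for us. The variables are repeated so that the cell runs on its own; `description` is just an illustrative name.

```python
# Same values as in the cells above, repeated so this cell runs on its own
age = 2
gender = "female"
name = "Datascience Cat"
smart = True

# An f-string converts the values in curly braces to strings automatically
description = f"{name}, age: {age}, {gender}, is smart: {smart}"
print(description)
# prints: Datascience Cat, age: 2, female, is smart: True
```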

## Modes

Notebooks have two modes, a **command mode** and an **edit mode**. These modes are shown differently in different notebook environments; in JupyterLab: 
 * **green** means edit mode, 
 * **blue** means command mode. 
 
Many operations depend on your mode. For code cells, you can switch into edit mode with "Enter", and get out of it with "Escape".




## Shortcuts

While you can always use the tool-bar above, you'll be much more efficient if you use a couple of shortcuts. The most important ones are:

**`Ctrl+Enter`** runs the current cell.  
**`Shift+Enter`** runs the current cell and jumps to the next cell.   
**`Alt+Enter`** runs the cell and adds a new one below it.

In command mode:

**`h`** shows a help menu with all these commands (doesn't work in Visual Studio Code).  
**`a`** adds a cell before the current cell.  
**`b`** adds a cell after the current cell.  
**`dd`** deletes a cell.  
**`m`** as in **m**arkdown, switches a cell to markdown mode.  
**`y`** as in p**y**thon switches a cell to code.  

## Kernels

When you [run code](http://jupyter-notebook.readthedocs.io/en/latest/examples/Notebook/Running%20Code.html), the code is actually executed in a **kernel**. You can do bad things to a kernel: you can make it stuck in an endless loop, crash it, corrupt it, etc. And you probably will do all of these things :). 

So sometimes you might have to interrupt your kernel or restart it. Use the "Kernel" menu to restart the kernel, re-run your notebook, etc.

Also, before submitting a homework or a project, make sure to `Restart` and `Run All`. This will create a clean run of your project, without any side effects that you might encounter during development. We want you to submit the homeworks **with output**, and by doing that you will make sure that we actually can also execute your code properly.



## Storing Output

Notebooks contain both the input to a computation and its outputs. If you run a notebook, all the outputs generated by the code cells are also stored in the notebook. That way, you can also look at notebooks in non-interactive environments, like on [GitHub for this notebook](https://github.com/dataviscourse/2023-applied-vis-homeworks/blob/main/HW0/notebook-demo.ipynb). 

The Notebook itself is stored in a rather ugly format containing the text, code, and the output. This can sometimes be challenging when working with version control.
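
To give a rough idea of what that format looks like, here is a minimal sketch. It assumes the JSON-based version 4 notebook format; the notebook text below is hand-written and heavily simplified for illustration, and real files carry considerably more metadata.

```python
import json

# A hand-written, heavily simplified notebook with a single code cell.
# Real notebooks carry more metadata; this only sketches the overall shape.
minimal_notebook_text = """
{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {},
      "source": ["print('Hello World!')"],
      "outputs": [
        {"output_type": "stream", "name": "stdout", "text": ["Hello World!\\n"]}
      ]
    }
  ],
  "metadata": {},
  "nbformat": 4,
  "nbformat_minor": 5
}
"""

notebook = json.loads(minimal_notebook_text)
first_cell = notebook["cells"][0]

print(first_cell["cell_type"])                            # the kind of cell
print("".join(first_cell["source"]))                      # the code we typed
print("".join(first_cell["outputs"][0]["text"]), end="")  # the stored output
```

Because code and output live in the same file, re-running a cell changes the file even if the code did not change, which is one reason version control diffs of notebooks can be noisy.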

# Python Basics

## Functions

In math, functions transform an input to an output as defined by the property of the function. 

You probably remember functions defined like this

$f(x) = x^2 + 3$

In programming, functions can do exactly this, but are also used to execute “subroutines”, i.e., to execute pieces of code in various order and under various conditions. Functions in programming are very important for structuring and modularizing code. 

In computer science, functions are also called “procedures” and “methods” (there are subtle distinctions, but nothing we need to worry about at this time). 

The following Python function, for example, provides the output of the above defined function for every valid input: 

In [13]:
def f(x):
    result = x ** 2 + 3 
    return result

We can now run this function with multiple input values: 

In [14]:
print(f(2))
print(f(3))
f(5)

7
12


28

Let's take a look at this function. The first line
```python
def f(x):
```
defines the function of name `f` using the `def` keyword. The name we use (`f` here) is largely arbitrary, but following good software engineering practices it should be something meaningful. So instead of `f`, **`square_plus_three` would be a better function name in this case**.  

After the function name follows a list of parameters, in parentheses. In this case we define that the function takes only one parameter, `x`, but we could also define multiple parameters like this:
```python 
def f(x, y, z):
```

The parameters are then available as local variables within the function.

The second line does the actual computation and assigns it to a **local variable** called `result`. 

The third line uses the `return` keyword to return the result variable. Functions can have a return value that we can assign to a variable. For example, here we could write: 

```python
my_result = f(10)
``` 

Which would assign the return value of the function to the variable `my_result`.
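
The pieces described above can be put together in a short sketch. `square_plus_three` is the more descriptive name suggested earlier; `weighted_sum` is a hypothetical function, made up here only to show several parameters in use.

```python
def square_plus_three(x):
    # same computation as f(x) above, under a more descriptive name
    return x ** 2 + 3

def weighted_sum(x, y, z):
    # hypothetical example of a function with three parameters
    return 1 * x + 2 * y + 3 * z

# the return value can be assigned to a variable ...
my_result = square_plus_three(10)
print(my_result)              # 103

# ... or used directly as an argument to another function
print(weighted_sum(1, 2, 3))  # 14
```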

Note that the lines of code that belong to a function are **indented by four spaces** (you can hit tab to indent, but it will be converted to four spaces). Python defines the scope of a function using indentation. Many other programming languages use curly brackets `{}` to do this. 

A function ends at the first line that is no longer indented.

For example, the same function wouldn't work like this:

In [7]:
def f(x):
    result = x ** 2 + 3
# Throws a SyntaxError because the return statement is no longer part of the function
return result

SyntaxError: 'return' outside function (562016350.py, line 4)

Equally, we can't indent by too much:

In [None]:
def f(x):
    result = x ** 2 + 3
    # Throws an IndentationError
        return result

## Scope

Another critical concept when working with functions is to understand the scope of a variable. Scope defines under which circumstances a variable is accessible. For example, in the following code snippet we cannot access the variable defined inside a function:

In [15]:
def scope_test():
    function_scope = "only readable in here"
    # Within the function, we can use the variable we have defined
    print("Within function: " + function_scope)

# calling the function, which will print     
scope_test()

Within function: only readable in here


In [16]:
# If we try to use the function_scope variable outside of the function, we will find that it is not defined. 
# This will throw a NameError, because Python doesn't know about that variable here
print("Outside function: " + function_scope)

NameError: name 'function_scope' is not defined

You might wonder “Why is that? Wouldn't it make sense to have access to variables wherever I need access?”. The reason for scoping is that it's simply much easier to **build reliable software when we modularize code**. When we use a function, we shouldn't have to worry about its internals. 

Another practical reason is that this way we can **re-use variable names** that were used in other places. This is really important when we work with other people's code (e.g., libraries). If that weren't possible, we might get nasty side effects just because the library uses a variable with the same name somewhere. 

You can, however, use variables defined in the larger scope in the sub-scope:

In [17]:
name = "Science Cat"

def print_name_with_dr():
    print("Dr.", name)
    
print_name_with_dr()

Dr. Science Cat


This is generally **not considered good practice** - functions should rely on their input parameters. Otherwise it can easily lead to side effects. This would be the better approach: 

In [18]:
# notice that we're re-using the parameter name
def print_name_with_dr(name):
    print("Dr.", name)
    
print_name_with_dr(name)

Dr. Science Cat
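
The point about re-using variable names can be illustrated with a short sketch. The names `counter` and `reset_counter` are made up for this example: assigning to a name inside a function creates a new local variable and leaves the outer variable of the same name untouched.

```python
counter = 10  # defined in the outer (global) scope

def reset_counter():
    # this assignment creates a new local variable called counter;
    # it does not change the outer variable with the same name
    counter = 0
    return counter

local_value = reset_counter()
print(local_value)  # 0
print(counter)      # 10, the outer variable is unchanged
```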


## More on Data Types and Operators

We've already covered the basic data types and operators. Now we'll recap and go into some more details. 

Also, make sure to check out the [complete documentation of standard types and operations](https://docs.python.org/3/library/stdtypes.html).

### Boolean

Boolean values represent truth values `True` and `False`. Booleans can be used as any other variable:

In [19]:
my_true_var = True
print (my_true_var)
my_false_var = False
print (my_false_var)

True
False


`True` and `False` are reserved keywords in their capitalized form. 

There are three operations defined on booleans: `and`, `or`, and `not`. 

| Operation | Result | 
|------|------|
| `x or y`	| if x is false, then y, else x  |
| `x and y`	| if x is false, then x, else y  |
| `not x`	    | if x is false, then True, else False  |


In [20]:
True or False

True

In [21]:
True and False

False

In [22]:
not True

False

In [23]:
not False

True
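
The table above says that `or` and `and` return one of their operands rather than always returning `True` or `False`. The sketch below shows this with non-boolean values; which values count as false is covered in the section on conditions. The variable names are only illustrative.

```python
# `or` returns the first operand if it counts as true, otherwise the second
display_name = "" or "Anonymous"

# `and` returns the first operand if it counts as false, otherwise the second
checked_value = 0 and 42
other_value = 7 and 42

print(display_name)   # Anonymous
print(checked_value)  # 0
print(other_value)    # 42
```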

#### Comparisons

Comparisons are very important in programming: they let us decide on conditional flows, which we will discuss later. To compare two entities, Python provides eight comparison operators: 


| Operation | Meaning |
| - | - |
| `<` | strictly less than |
| `<=` | less than or equal |
| `>` | strictly greater than |
| `>=` | greater than or equal |
| `==` | equal |
| `!=` | not equal |
| `is` | object identity |
| `is not` | negated object identity |

These operators take two operands and return a boolean. We'll skip the last two for now, but here are some examples of the others:

In [24]:
1 < 2 

True

In [25]:
1 <= 1

True

In [26]:
14 == 14

True

In [27]:
14 != 14 

False

In [28]:
"my text" == "my text"

True

In [29]:
"my text" == "my other text"

False

In [30]:
"a" > "b"

False

In [31]:
"a" < "b"

True

In [32]:
"aa" < "aba"

True

In [33]:
"aaa" < "aa"

False

We see that the operations work on numbers just as we would expect. 

Strings are also compared as we'd expect. The greater and less than operators use lexicographic ordering. 
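
One consequence of lexicographic ordering that is easy to overlook: strings that contain digits are still compared character by character, not by numeric value. A small sketch (the variable names are only for illustration):

```python
words_in_order = "apple" < "banana"   # 'a' comes before 'b'
digits_as_text = "10" < "9"           # compared as text: '1' comes before '9'
digits_as_numbers = 10 < 9            # compared as numbers

print(words_in_order)     # True
print(digits_as_text)     # True
print(digits_as_numbers)  # False
```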

### Numerical Data Types

Python supports three built-in numerical data types, `int`, `float`, and `complex`. Since Python is dynamically typed, we don't have to define the data types explicitly!

The **int** data type is used to represent integers $\mathbb{Z}$. Python is special in the way it handles integers as it allows arbitrarily large integers, while most other programming languages reserve a certain chunk of memory for integers, which can lead to a number "overflowing". This, for example, would not work properly with the built-in integer types of C or Java:

In [34]:
2 ** 200

1606938044258990275541962092341162602522202993782792835301376

However, we can still experience overflows in Python if we work with pandas, a library we will extensively use.

Integers can be **positive, zero, or negative**, as you would expect. 

The **float** datatype is used to represent real numbers $\mathbb{R}$. Most real numbers, however, cannot be represented precisely by a computer. Take the example of $1/3$. Representing $1/3$ accurately would require the computer to store an infinite number of digits, $0.33333333333333333333....$ (if a computer used a decimal number system). 

Since computers use binary numbers, also seemingly simple numbers such as 0.1 cannot be accurately represented. Check out this example: 

In [35]:
.1 + .1 + .1 == .3

False

What computers do is store approximations, using a limited chunk of memory for each number. At the same time, Python rounds the output of numbers:


In [36]:
1 / 10

0.1

This number is in fact not 0.1 but is stored in the computer as: 

`0.1000000000000000055511151231257827021181583404541015625`

This representation, however, is rarely useful, hence the number is rounded. 
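
If you want to see the stored value yourself, one way (sketched here using the standard library's `decimal` module, which is not otherwise used in this lecture) is to convert the float to a `Decimal`, which shows the binary approximation exactly.

```python
from decimal import Decimal

# Decimal(0.1) converts the stored binary float exactly, without rounding
exact_value = Decimal(0.1)
print(exact_value)
# prints: 0.1000000000000000055511151231257827021181583404541015625

# formatting with more digits than the default also reveals the approximation
twenty_digits = f"{0.1:.20f}"
print(twenty_digits)
# prints: 0.10000000000000000555
```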

The lesson that you should remember is that **you should NOT compare two float numbers with the `==` operator**: the comparison runs, but rounding errors make its result unreliable. 

In [37]:
a = .1 + .1 + .1 
b = .3
a == b

False

Instead, you can do something like this: 

In [38]:
# Compare for equality up to a constant value
a < b + 0.00001 and a > b - 0.00001

True

This, of course, only compares up to the fifth decimal place. 

A better way to do this is the [isclose](https://docs.python.org/3/library/math.html#math.isclose) function from the math package. 

In [39]:
# this is how we import a package
import math 
# here we call the isclose function that comes with the math package. 
math.isclose(a, b, rel_tol=0.00001)

True
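
`math.isclose` accepts both a relative tolerance (`rel_tol`) and an absolute tolerance (`abs_tol`). The sketch below, with made-up variable names, shows why the absolute tolerance matters when one of the values is zero: a relative tolerance scales with the size of the numbers, so next to zero it allows almost no difference at all.

```python
import math

a = .1 + .1 + .1
b = .3

# the default relative tolerance (1e-09) absorbs ordinary rounding errors
close_default = math.isclose(a, b)

# a relative tolerance alone is not useful when comparing against zero ...
tiny = 1e-10
close_to_zero_rel = math.isclose(tiny, 0.0)
# ... so an absolute tolerance is needed there
close_to_zero_abs = math.isclose(tiny, 0.0, abs_tol=1e-9)

print(close_default)      # True
print(close_to_zero_rel)  # False
print(close_to_zero_abs)  # True
```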

Here we've also used our first package, the package `math`! 

Packages extend the basic functionality of Python. We'll work a lot with packages in the future; details will follow.

**Type Annotations**

Python now supports [type annotations](https://docs.python.org/3/library/typing.html), but those are not enforced. They can be used by IDEs or linters to check your code. 

In [40]:
# a type annotation for string
greeting: str = "Hello World"
print(greeting)
# we can still override that
greeting = 3
print(greeting)

Hello World
3


In [41]:
# we can also hint at the return type of a function
def greet(name: str) -> str:
    return "Hello " + name

greeting = greet("Max")
print(greeting)

Hello Max


#### Numerical Operators

Here is a selection of operators and functions that work on numerical data types. 

| Operation | Result |
| - | - |
| `x + y` | sum of x and y |
| `x - y` | difference of x and y |
| `x * y` | product of x and y |
| `x / y` | quotient of x and y |
| `x % y` | remainder of x / y |
| `-x` | x negated |
| `abs(x)` | absolute value or magnitude of x |
| `int(x)` | x converted to integer |
| `float(x)` | x converted to floating point |
| `pow(x, y)` | x to the power y |
| `x ** y` | x to the power y |

Most of these should be rather straightforward.

You might not have heard of the "modulo operator" `%`, which returns the remainder of a division x / y. Here is an example:

In [42]:
7 % 2

1
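
The remainder is closely related to floor division, written `//`, which is not listed in the table above. The sketch below shows how the two fit together and uses the built-in `divmod()` function, which returns both at once.

```python
x = 7
y = 2

quotient = x // y      # floor division: how many times y fits into x
remainder = x % y      # what is left over
both = divmod(x, y)    # built-in function returning (quotient, remainder)

print(quotient, remainder, both)
# prints: 3 1 (3, 1)

# the two results always recombine to the original number
recombined = quotient * y + remainder
print(recombined == x)
# prints: True
```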

Also, remember, that many operations have a shorthand assignment version, i.e., instead of:

In [43]:
x = 2
y = 3
x = x+y
x

5

you can also write: 

In [44]:
x = 2
y = 3
x += y
x

5

This works also for other operators: 

In [45]:
x = 2
y = 3
x -= y
x

-1

In [46]:
x = 2
y = 3
x /= y
x

0.6666666666666666

In [47]:
x = 2
y = 3
x **= y
x

8

### Exercise 1:

**Task 1.1:** Try how capitalization affects string comparison, e.g., compare "datascience" to "Datascience".

**Task 1.2:** Try to compare floats using the `==` operator defined as expressions of integers, e.g., whether 1/3 is equal to 2/6. Does that work?

**Task 1.3:** Write an expression that compares the "floor" value of a float to an integer, e.g., compare the floor of 1/3 to 0. There are two ways to calculate a floor value: using `int()` and using `math.floor()`. Are they equal? What is the data type of the returned values?

In [None]:
# your solution

## Conditions: if-elif-else statements

We've learned how to make comparisons between items and do boolean operations. The result of these operations was usually a boolean value. 

We can now make use of these boolean values to **steer the program flow using conditions**. 

We can do that using **if statements**. If statements evaluate an expression for its boolean value and execute one branch of code if it is true, and, optionally, another branch if it is false:

In [48]:
def isOdd(x):
    # the statement within the brackets is evaluated for truth
    if (x % 2 == 1):
        # body, executed if true
        print(str(x), "is in fact an odd number")
    else:
        # executed if false
        print(str(x), "is an even number")

isOdd(144)
isOdd(13)

144 is an even number
13 is in fact an odd number


Notice the **"body" of the if statement is indented**, just as for functions.

Here's an example of a more complex boolean expression:

In [49]:
if ((True and False) or True):
    print(True)

True


In addition to the explicit boolean values that we can use to test for truth, most **programming languages define a range of things to be true or false**. 

By definition, the following values are considered **false**:
 * 0 of any numeric type, 
 * empty sequences or lists, including empty strings,
 * `None` values, etc.

Everything else is considered true.

In [50]:
if (0):
    print("This should never happen")
else:
    print("0 is false")

undefined_var = None
if (not undefined_var):
    print("An undefined variable is false")
    
if (not []):
    print("An empty list is false")
    
if (not ""):
    print("An empty string is false")


0 is false
An undefined variable is false
An empty list is false
An empty string is false
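
A quick way to check how Python treats a particular value is the built-in `bool()` function, which converts any value to `True` or `False` using the rules above. A small sketch:

```python
print(bool(0))      # False: zero
print(bool(""))     # False: empty string
print(bool([]))     # False: empty list
print(bool(None))   # False: None
print(bool("0"))    # True: a non-empty string, even though it contains a zero
print(bool([0]))    # True: a list with one element
print(bool(-1))     # True: any non-zero number
```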


You can also **chain conditions using the `elif` statement**, which is short for else if:

In [51]:
def smallest_factors(x):
    # notice the use of the negation and the use of 0 as false
    if(not x % 2):
        print("2 is a factor of " + str(x))  
    elif(not x % 3):     # only evaluated when if was false
        print("3 is a factor of " + str(x))
    else: # only evaluated when both if and elif were false
        print("Neither 2 nor 3 are factors of " + str(x))

smallest_factors(4)
smallest_factors(9)
smallest_factors(12)

2 is a factor of 4
3 is a factor of 9
2 is a factor of 12


Notice that the `elif` (or the `else`) branch is not evaluated if the `if` branch matches. A function that prints whether 2, 3, or both are factors could be written like this: 

In [52]:
def factors(x):
    # notice the use of the negation and the use of 0 as false
    if(not x % 2):
        print("2 is a factor of " + str(x))  
    if(not x % 3):     
        print("3 is a factor of " + str(x))
    if (x % 2) and (x % 3):
        print("Neither 2 nor 3 are factors of " + str(x))

factors(4)
factors(9)
factors(12)
factors(13)

2 is a factor of 4
3 is a factor of 9
2 is a factor of 12
3 is a factor of 12
Neither 2 nor 3 are factors of 13


## Lists

Up to now we've worked only with basic data types such as booleans, numbers and strings. Now we'll take a look at a compound data type: [lists](https://docs.python.org/3/tutorial/introduction.html#lists).

**A list is a collection of items.** Another word commonly used for a list in other programming languages is an **array** (though there are differences between lists and arrays in many languages). 

**Lists are created with square brackets `[]` and can be accessed via an index:**

In [53]:
beatles = ["0-Paul", "1-John", "2-George", "3-Ringo"]
# printing the whole array
print(beatles)
# printing the first element of that array, at index 0
print(beatles[0])
# third element, at index 2
print(beatles[2])
# access the last element
print(beatles[-1])
# access the one-but-last element
print(beatles[-2])

['0-Paul', '1-John', '2-George', '3-Ringo']
0-Paul
2-George
3-Ringo
2-George


If we try to address an index outside of the range of an array, we get an error: 

In [54]:
beatles[4]

IndexError: list index out of range

Sometimes, it makes sense to pre-initialize an array of a certain size:

In [55]:
[0] * 10

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

There is also a handy shortcut for quickly initializing lists. This uses the [range()](https://docs.python.org/3/library/functions.html#func-range) function, which we'll explore in more detail later.
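
As a brief sketch of that shortcut: `range()` produces a sequence of integers, and wrapping it in `list()` turns that sequence into a list. The variable names are only illustrative.

```python
# the integers from 0 up to, but not including, 10
first_ten = list(range(10))
print(first_ten)
# prints: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# start at 0, stop before 10, in steps of 2
evens = list(range(0, 10, 2))
print(evens)
# prints: [0, 2, 4, 6, 8]
```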

We can also create **slices of an array with the slice operator `:`**

```python
a[start:end] # items start through end-1
a[start:]    # items start through the rest of the array
a[:end]      # items from the beginning through end-1
a[:]         # a copy of the whole array
```

There is also the step value, which can be used with any of the above:

```python
a[start:end:step] # start through not past end, by step
```

See [this post](http://stackoverflow.com/questions/509211/explain-pythons-slice-notation) for a good explanation on slicing.

In [56]:
# Get the slice from 0 (included) to 2 (excluded)
beatles[:2] # this can also be written as [0:2]

['0-Paul', '1-John']

In [57]:
# Slice from index 2 (3rd element) to end
beatles[2:]

['2-George', '3-Ringo']

In [58]:
# A copy of the array 
beatles[:]

['0-Paul', '1-John', '2-George', '3-Ringo']

The slice operations return a new array; the original array is untouched: 

In [59]:
beatles

['0-Paul', '1-John', '2-George', '3-Ringo']

Slicing outside of a defined range returns an empty list:

In [60]:
beatles[4:9]

[]
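
The step value mentioned earlier was not used in the examples above, so here is a short sketch. It uses a separate list called `members` (a name chosen for this example) so that the `beatles` list stays as it is.

```python
members = ["0-Paul", "1-John", "2-George", "3-Ringo"]

# every second element, starting at index 0
every_second = members[::2]
print(every_second)
# prints: ['0-Paul', '2-George']

# a negative step walks through the list backwards
reversed_copy = members[::-1]
print(reversed_copy)
# prints: ['3-Ringo', '2-George', '1-John', '0-Paul']
```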

Strings can be treated similarly to arrays with respect to indexing and slicing:

In [61]:
paul = "Paul McCartney"
paul[0:4]

'Paul'

Lists (in contrast to strings) are mutable. 

That means **we can change the elements that are contained in a list**: 

In [62]:
beatles[1] = "JohnYoko"
beatles

['0-Paul', 'JohnYoko', '2-George', '3-Ringo']

This does not work with strings, because strings are immutable: 

In [63]:
# This will return an error
paul[1] = "o"

TypeError: 'str' object does not support item assignment

Arrays can also be **extended with the `append()` function**:

In [64]:
beatles.append("4-George Martin")
beatles

['0-Paul', 'JohnYoko', '2-George', '3-Ringo', '4-George Martin']

Lists can be **concatenated**: 

In [65]:
zeppelin = ["Jimmy", "Robert", "John", "John"]
supergroup = beatles + zeppelin
supergroup

['0-Paul',
 'JohnYoko',
 '2-George',
 '3-Ringo',
 '4-George Martin',
 'Jimmy',
 'Robert',
 'John',
 'John']

We can **check the length** of a list using the built-in [`len()`](https://docs.python.org/3.3/library/functions.html#len) function:

In [66]:
len(zeppelin)

4

Lists can also be **nested**: 

In [67]:
bands = [beatles, zeppelin]
bands

[['0-Paul', 'JohnYoko', '2-George', '3-Ringo', '4-George Martin'],
 ['Jimmy', 'Robert', 'John', 'John']]
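
Elements of a nested list are accessed by chaining index operations: the first index selects the inner list, the second selects an element within it. A minimal sketch with made-up lists (`duo_a`, `duo_b`, and `groups` are names chosen for this example):

```python
duo_a = ["Paul", "John"]
duo_b = ["Jimmy", "Robert"]
groups = [duo_a, duo_b]

print(groups[0])     # the first inner list: ['Paul', 'John']
print(groups[1][0])  # first element of the second inner list: Jimmy
print(len(groups))   # 2, the number of inner lists, not of all names
```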

In fact, lists can contain mixed data types, which, however, is something that you typically don't want and shouldn't do:

In [68]:
bad_bands = bands + [1, 0.3, 17, "This is bad"]
# this list contains lists, integers, floats and strings
bad_bands

[['0-Paul', 'JohnYoko', '2-George', '3-Ringo', '4-George Martin'],
 ['Jimmy', 'Robert', 'John', 'John'],
 1,
 0.3,
 17,
 'This is bad']

### Exercise: Lists

* Create a list for the Rolling Stones: Mick, Keith, Charlie, Ronnie.
* Create a slice of that list that contains only members of the original lineup (Mick, Keith, Charlie). 
* Add the Stones list to the bands list.

## NumPy Arrays

We will frequently use [NumPy](https://numpy.org/) arrays instead of regular Python lists. NumPy provides data structures and operations that are suitable, especially with regards to performance, for scientific computing.

Here's a simple NumPy array. We can do slicing, etc., just like on regular lists. 

In [70]:
import numpy as np

my_array = np.array([1,2,3,4,5])

print(my_array[1])
print(my_array[-1])
print(my_array[1:3])
# Notice that the data type is different from a regular python data type
print(my_array.dtype.name)

2
5
[2 3]
int64


NumPy arrays have a lot of additional functionality, which we will introduce as needed. One significant difference to regular Python lists is that a NumPy array has to be of a single data type.

In [71]:
# trying to set up a hybrid array; that would be OK in python lists. 
my_hybrid_array = np.array([1,"test",3,4,5])

# We see that the elements are up-casted to the most inclusive data type, a string.
print(my_hybrid_array)
print(type(my_hybrid_array[-1]))
print(my_hybrid_array.dtype.name)

['1' 'test' '3' '4' '5']
<class 'numpy.str_'>
str672


## Loops

So far we have learned about two ways to control the flow of a program: functions and if-statements. Now we'll look at another important control structure: loops. 

Like an if statement, a loop has a condition, and as long as that condition is true, it will continue to re-execute its body. 

There are two types of loops: **for** loops and **while** loops.

### While loops

While loops use the `while` keyword, a condition, and the loop body:

In [None]:
a = 1

# print numbers 1-100
while (a <= 100):
    # end is a parameter of print that defines how the string to be printed ends. 
    # By default, a newline \n is appended, which we overwrite here
    print(a, end=", ") 
    a += 1

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 

What happens here? The `while` keyword indicates that this is a loop. It is followed by the **loop condition `a <= 100`**. As long as that condition is true, the loop's body is executed again and again and again ...

Once the condition evaluates to false, the loop body is skipped and the flow of execution continues below the loop. 

You might rightly guess that it's easy to write loops that don't terminate. Here is one example:
```python
while True:
    print("Stuck")
```

This program is stuck in the loop forever (or until you terminate it by interrupting your kernel, shutting down your computer, etc.). It is therefore important to make sure that loops actually reach a terminating condition. Unlike in the previous example, it is not always obvious when they don't.
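
To illustrate a loop whose termination is harder to see, here is a minimal sketch. It repeatedly halves even numbers and maps odd numbers to `3 * n + 1` until it reaches 1. The starting value and the safety cap `max_steps` are assumptions chosen for this illustration; the cap guarantees that the loop ends even if the main condition never becomes false.

```python
# It is not obvious from the condition alone that n ever reaches 1,
# so we add a safety cap on the number of iterations.
n = 6             # hypothetical starting value
max_steps = 1000  # assumed cap, chosen arbitrarily for this sketch
steps = 0

while n != 1 and steps < max_steps:
    if n % 2 == 0:
        n = n // 2      # even: halve
    else:
        n = 3 * n + 1   # odd: triple and add one
    steps += 1

print(steps)  # 8
```

Combining the actual condition with a counter is a simple way to make sure a loop cannot run forever while you are still developing it.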

But we could also **use the `break` statement to terminate a loop**:

In [None]:
a = 1
while (True):
    print(a, end=", ") 
    a += 1
    if (a > 100):
        break

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 

Here, we've moved the check of the condition into an if statement, and we break out of the loop as soon as that condition is true. 
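
A `break` is especially handy for search-style loops, where we want to stop as soon as we have found what we are looking for. The following sketch, with an arbitrarily chosen starting value, looks for the first multiple of 7 above 100:

```python
n = 101  # hypothetical starting point for the search
while True:
    if n % 7 == 0:
        break   # leave the loop as soon as we find a multiple of 7
    n += 1

print(n)  # 105
```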

Similar to the `break` statement, there is also a `continue` statement, which ends the current pass through the loop body and jumps back to the start of the loop, where the condition is checked again:

In [None]:
a = 0
while (a < 100):
    a += 1
    # put brackets around all numbers divisible by 3
    if (not a % 3):
        print(f"[{a}]", end=", ")
        continue # the next line isn't executed because the flow goes back to the beginning of the loop
    print(a, end=", ")

1, 2, [3], 4, 5, [6], 7, 8, [9], 10, 11, [12], 13, 14, [15], 16, 17, [18], 19, 20, [21], 22, 23, [24], 25, 26, [27], 28, 29, [30], 31, 32, [33], 34, 35, [36], 37, 38, [39], 40, 41, [42], 43, 44, [45], 46, 47, [48], 49, 50, [51], 52, 53, [54], 55, 56, [57], 58, 59, [60], 61, 62, [63], 64, 65, [66], 67, 68, [69], 70, 71, [72], 73, 74, [75], 76, 77, [78], 79, 80, [81], 82, 83, [84], 85, 86, [87], 88, 89, [90], 91, 92, [93], 94, 95, [96], 97, 98, [99], 100, 

Here we've also introduced a [format string](https://docs.python.org/3/library/string.html?highlight=f%20string#format-string-syntax) (often called an f-string), which is convenient for creating strings that mix variable values with other text. 

A format string begins with an `f` before the opening quote. Variables, or more generally expressions, are placed in curly brackets `{}` and are replaced by their values. 

In [None]:
name = "Alex"
print(f"My name is {name}")

My name is Alex
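
The curly brackets can hold more than a plain variable. As a small sketch with made-up values, the following shows an expression inside the brackets and a format specification after a colon, which controls how the value is rendered:

```python
price = 3.14159   # hypothetical values for illustration
count = 3

# an expression plus a format spec: two digits after the decimal point
total_line = f"Total: {price * count:.2f}"
# right-align the value in a field that is 5 characters wide
padded = f"[{count:>5}]"

print(total_line)  # Total: 9.42
print(padded)      # [    3]
```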


### For loops

Python uses for loops mainly to iterate over the items of a sequence. Many other programming languages instead use for loops to iterate over a range of indices.

A for loop uses the following syntax:
```python
for variable in sequence:
    # body
```

The variable is then accessible within the body of the loop.

Here is an example:

In [72]:
for member in zeppelin: 
    print(member)

Jimmy
Robert
John
John


Of course, that works with arbitrary **slices of lists**: 

In [73]:
for member in zeppelin[:2]:
    print(member)

Jimmy
Robert


We can iterate over **nested lists** with nested for loops: 

In [74]:
for band in bands:
    print("Band Members: ")
    print("-------------")
    for member in band:
        print(member)
    print()

Band Members: 
-------------
0-Paul
JohnYoko
2-George
3-Ringo
4-George Martin

Band Members: 
-------------
Jimmy
Robert
John
John



When you want to iterate over a sequence of numbers, use the [`range()`](https://docs.python.org/3/library/stdtypes.html#range) function. `range()` generates a sequence of numbers:

In [75]:
# we create a new list with the output of the range function
list(range(5))

[0, 1, 2, 3, 4]

In [76]:
# start at 0, stop before 10, step size 2
list(range(0, 10, 2))

[0, 2, 4, 6, 8]
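
The reason we wrapped `range()` in `list()` above is that `range()` does not build a list itself. It returns a lightweight range object that produces its numbers on demand. The following sketch shows a few things you can do with such an object directly:

```python
r = range(0, 10, 2)

print(r)         # range(0, 10, 2)
print(len(r))    # 5
print(6 in r)    # True
print(r[1])      # 2
print(list(r))   # [0, 2, 4, 6, 8]
```

Because the numbers are generated as needed, even a very large range uses very little memory.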

Using this range function, we can now iterate over a sequence of numbers:

In [77]:
for i in range(10): 
    print(i)

0
1
2
3
4
5
6
7
8
9


As we saw above, the range function also takes "start", "stop", and "step-size" parameters. With a negative step size, it counts down:

In [78]:
for i in range(0, -20, -3):
    print(i)

0
-3
-6
-9
-12
-15
-18
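
To connect `range()` with the earlier point that Python's for loops iterate over items rather than indices, here is a short sketch comparing both styles. The list `colors` is a made-up example. The built-in `enumerate()` function is useful when you need both the index and the item.

```python
colors = ["red", "green", "blue"]  # hypothetical example list

# Index-based style, common in other languages
for i in range(len(colors)):
    print(i, colors[i])

# Item-based style, usually preferred in Python
for color in colors:
    print(color)

# enumerate() yields (index, item) pairs when both are needed
for i, color in enumerate(colors):
    print(i, color)
```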


## Revisiting Lists: List Comprehension

Now that we know about loops, we can also take a look at [list comprehension](https://docs.python.org/3.5/tutorial/datastructures.html#list-comprehensions). List comprehensions can be used to initialize and transform lists. 



In [79]:
# _ is customary for a variable name if you don't need it
[0 for _ in range(10)]

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [80]:
["John" for _ in range(10)]

['John',
 'John',
 'John',
 'John',
 'John',
 'John',
 'John',
 'John',
 'John',
 'John']

In [81]:
# we can also make use of the values we iterate over
[i for i in range(10)]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

The expression at the front of a list comprehension does not have to be a variable; it can also be a function call, for example. Here we initialize a list of random numbers in the unit interval:

In [82]:
import random
rands = [random.random() for _ in range(10)]
rands

[0.02136476025223777,
 0.7238070266677037,
 0.6911373993041022,
 0.9731891593154719,
 0.9549287808403482,
 0.5624902883428432,
 0.15344844614298914,
 0.25287820547022455,
 0.7932854346404551,
 0.2493445035665418]

You can also use list comprehension to create a list based on another list:

In [83]:
[x*10 for x in rands]

[0.2136476025223777,
 7.2380702666770365,
 6.911373993041022,
 9.73189159315472,
 9.54928780840348,
 5.624902883428432,
 1.5344844614298914,
 2.5287820547022455,
 7.932854346404551,
 2.493445035665418]
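
A list comprehension is a compact way of writing a for loop that appends to a list, and it can include an optional `if` clause to filter items. The sketch below uses made-up example data and shows the explicit loop next to the equivalent comprehension:

```python
numbers = list(range(10))  # example data: 0 through 9

# Explicit loop: collect the squares of the even numbers
squares_loop = []
for x in numbers:
    if x % 2 == 0:
        squares_loop.append(x * x)

# Equivalent list comprehension with a filter
squares_comp = [x * x for x in numbers if x % 2 == 0]

print(squares_comp)                  # [0, 4, 16, 36, 64]
print(squares_loop == squares_comp)  # True
```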

## Exercise: List Comprehension

Write a list comprehension that creates a list with the lengths of the words in the following sentence:

In [84]:
sentence = "the quick brown fox jumps over the lazy dog"
word_list = sentence.split()
word_list

['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

In [85]:
# your solution