# Introductions

My name is Sergey Antopolskiy.

I am a postdoc in Mathew Diamond's Tactile Perception and Learning Lab.

# Organizational

### First

We have people with diverse backgrounds in programming, so it is inevitable that for some of you the pace of the course will be too fast and for others it will seem too slow. 

Here are my suggestions for these groups of people.

If you belong to the first group, you will need to invest more time between lectures in reviewing the course materials and, more importantly, practicing. Set aside some time for that; in particular, try to free the first week of the course from other major commitments. I am organizing *office hours*, a time during which you can come and ask for my help (more on that below). Use them to catch up.

If you feel like I am going too slowly, look ahead through the materials and try to apply them to your data. Look at the additional materials in the "where to go from here" section; I will put links to some advanced topics there as well. Also, look through the materials for the next lectures in advance and see what you might be interested in and what is already familiar to you. I will try to announce a day ahead what we will be covering in the next lesson.

### Second

I really want this course to be as interactive as possible, which is always a challenge for the instructor. Here is how I want to approach this:
- I will ask you questions during the lectures, for example asking you to raise your hands if you know something. This is not only to engage you, but also to understand how many of you know something and how much detail I need to go into
- I might ask some of you to explain something, if you said that you know about it
- I might ask you to propose some explanation or ideas

The aim of this course is two-fold. On the one hand, I want you to learn certain concepts from computer science and programming. Some of them you will be able to apply directly to your work; others will help you find information later, when you need it. On the other hand, I want to teach you some practical skills, which you will be able to apply immediately to your work. In fact, I encourage you to do so as we go, and I will help you as much as I can. Then we can look together at what you did and how you did it, which will be helpful for everyone.

Course materials here: https://github.com/antopolskiy/sciprog 

Course Slack channel for discussions, questions, announcements: https://sciprog.slack.com/ (you can sign up with @sissa.it email; if you don't have one, send an email to the course instructor).

# Office hours: come and talk to me!

I can answer your questions, help you with assignments or application of the concepts to your data.

- Tuesday from 16:00 to 18:00
- Thursday from 16:00 to 18:00

Find me in office 324. You can come freely during these periods, but it is better to write me an email first at: <b>s.antopolsky@gmail.com</b>.

- Saturday from 10:00 to 12:00 (You will need to tell me in advance that you will come)

# Let's get started

## Why learn programming? 

## Why not use Excel or other statistical tools with easy user interface?

### - scripting and modularity
### - speed and memory efficiency
### - freedom to try (almost) anything

Freedom is particularly important in research and development, because you want to push the boundaries; you don't want to only walk the "main road".

There are other practical reasons. Programming is the *lingua franca* of industry and technology. If you ever want to leave academia, it is very likely that your best chances of an interesting and fulfilling job are in the tech industry. In that case you absolutely need to be fluent in programming. Even if you want to stay in academia, you don't want to be *forced* to stay in academia. Doing anything, even something you like, while feeling like you don't have a choice is a sure way to be frustrated and stressed, and eventually to start hating the thing you liked. Besides, these days the gap between academia and industry is shrinking all the time. So learning to program in at least one language is one of the best investments in your future you can make.

# Ok, so we learn to program... but what?

### C++? Fortran? Pascal? Basic? Assembler?! No.

# Language we need is:

### Interpreted, not compiled (at least at first)

### High-level, not low-level

### Convenient for working with data

### (Relatively) easy to learn

What is the difference between interpreted and compiled languages, at least in a nutshell? Why are compiled languages usually so fast, while interpreted ones are only fast if you implement things correctly? Answers: dynamic typing vs static typing, memory management.
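To make "dynamic typing" a bit more concrete, here is a small sketch in Python (the variable names are just for illustration). The same name can point to values of different types, and the type is only checked at the moment an operation actually runs, which is part of the work an interpreter has to do on every step.

```python
# The same name can refer to values of different types at different times.
value = 42
first_type = type(value).__name__     # type name while value holds an integer
value = 'forty-two'
second_type = type(value).__name__    # type name after rebinding to a string
print(first_type, second_type)

# Types are checked only when the operation is executed, not beforehand.
try:
    result = 'abc' + 1
except TypeError as error:
    print('Checked at run time:', error)
```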

# Our main choices are: MATLAB, R or Python
(there are other promising options, like Julia, but we skip them for now)

# Which language to choose?

### It doesn't matter

### Things to consider:

- You need to use big chunks of someone else's code 

- You want to use a particular library or package (Brainstorm, Psychopy)

However, remember that you don't need to use the same language at all stages of your process. Switching between several languages may prove tedious at first, but it is usually simpler than it seems. You can easily run your experiment using one language, and analyse it in another.

- You want to work locally on any computer

Matlab is paid software, and it requires you to have a licence. Even if your university has a licence, it is likely available only when you're connected to the university network. Therefore, it can be difficult to work remotely.

# Pros and cons
(disclosure: somewhat subjective)

<img src="https://www.mathworks.com/content/mathworks/www/en/company/newsletters/articles/the-mathworks-logo-is-an-eigenfunction-of-the-wave-equation/_jcr_content/mainParsys/image_2.img.gif/1469941373397.gif" align="left" alt="Drawing" style="width: 80px;"/>

# Matlab 

Pros:
- out-of-the-box solution
- easy to start
- decent documentation: not great, but good enough
- really fast for linear algebra
- you write in (almost) mathematical notation
    
        X = [1 2 3; 4 5 6]
        X_transpose = X'
        
- "matlab apps"
- "parfor" allows easy parallelization of loops
- Simulink

<img src="https://www.mathworks.com/content/mathworks/www/en/company/newsletters/articles/the-mathworks-logo-is-an-eigenfunction-of-the-wave-equation/_jcr_content/mainParsys/image_2.img.gif/1469941373397.gif" align="left" alt="Drawing" style="width: 80px;"/>

# Matlab 

Cons:
- expensive, need to pay for add-ons (toolboxes)
- developed by selected experts
- proprietary code ("closed source")
- slow development 
    - example: `splitapply` was introduced only in 2015, while it was around for a decade in R
- plotting and exporting figures is not great
- data manipulation is clunky
- huge overhead costs (memory and CPU)
- not widely used outside academia
- small community

<img src="https://www.r-project.org/logo/Rlogo.svg" align="left" alt="Drawing" style="width: 80px;"/>

# R

Pros:
- developed by statisticians
- great plotting capabilities
- great data manipulation capabilities
- vibrant community
- high demand on the data science market 
    - (this is changing though)

<img src="https://www.r-project.org/logo/Rlogo.svg" align="left" alt="Drawing" style="width: 80px;"/>

# R

Cons:
- developed by statisticians
- sometimes obscure syntax
- slow
- lots of packages, sometimes without a "standard"
- relatively steep learning curve
- not a general-purpose language

<img src="https://www.python.org/static/opengraph-icon-200x200.png" align="left" alt="Drawing" style="width: 80px;"/>
# Python

Pros:
- really well designed language
    - gets the best of both worlds: speed for computations from MATLAB (`numpy` package) and data manipulation from R (`pandas` package) 
- (relatively) easy to learn
- a lot (!) of great resources
- great string manipulation (*de-facto* standard for NLP; R and MATLAB are not even close)
- object-oriented and introspective
- general purpose (learn for data analysis, but use for anything!)
- standardized "stack" of packages for data analysis
- Jupyter notebooks
- huge and welcoming community offline (PyData) and online (on Stackoverflow)

<img src="https://www.python.org/static/opengraph-icon-200x200.png" align="left" alt="Drawing" style="width: 80px;"/>
# Python

Cons:
- None!

Ok, I am joking: I am an enthusiast, not a fanatic :D

<img src="https://www.python.org/static/opengraph-icon-200x200.png" align="left" alt="Drawing" style="width: 80px;"/>
# Python

Cons:
- some things are still missing
- uneven documentation 
    - some things are amazingly well documented: scikit-learn, pandas, matplotlib
    - others, not so much: wavelets, psychopy, etc.
- can be slow if you do things in a wrong way
- sometimes syntax may be too verbose

# Before we go into nitty-gritty details

- we will look (mostly) at Python, but if you cannot find how certain things are done in Matlab, ask on Slack or during office hours; for R ask Davide Crepaldi (dcrepaldi@sissa.it)

- in the classroom focus on the conceptual understanding of what you **can** do, as opposed to **how** you do it (you will have the course materials for that)

- ask questions at any point if something is unclear

- outside the classroom focus on **how** you do things, try for yourself; bring questions to the next lesson or to office hours

- the goal of the class is for you to be able to apply concepts to your own work

Who here has their own data? If someone doesn't have data, either ask in your lab, or find online, or ask me.

# Let's go through the basics
- **go through the basic notebook**
- variable assignment
- using functions
- functions

# Jupyter notebook

The Jupyter Notebook (what you're looking at now) is an interactive computing environment (IDE) that enables users to make notebook documents that include: 
- Live code
- Plots
- Narrative text
- Equations
- Images

You are looking at this document in one of two ways. The first way is on https://github.com/antopolskiy/sciprog, which is the repository for the course materials. There, this is a static document: you can't do anything with it, only look at it. If this is the case, you need to switch to the second way: opening it locally on your computer, with the ability to run and edit code. To do this, you need to have a distribution of Python 3.5 or higher on your machine. You can get it by installing the <a href="https://www.continuum.io/downloads">Anaconda distribution of Python</a> (a distribution which includes most of the packages we will be working with during this course). I strongly suggest that you install it instead of "vanilla" Python (please install Anaconda for Python 3.6, not Python 2.7; you can install Python 2.7 alongside it later when you need it). 

If you already have Python and don't want to install Anaconda, follow guidelines <a href="https://jupyter.readthedocs.io/en/latest/install.html">here</a>.

After you have installed Anaconda (or just the `notebook` module in your Python distribution), run `jupyter notebook` in your OS console/terminal ("Command Prompt" in Windows). It will start a notebook server and open a web browser with the Jupyter Dashboard (despite the fact that notebooks appear in the browser, they are not online -- they are just rendered in the browser). The folder in which the Dashboard opens is the one in which your console runs by default (e.g. in Windows it is most likely `C:\Users\<username>`). Download this document: go to https://github.com/antopolskiy/sciprog, click the green button `Clone or Download` on the right, and choose `Download ZIP`: this will download all course materials as a `ZIP` archive. Unpack the archive and put the contents in a folder which you can navigate to in your Jupyter Dashboard (e.g. on Windows something like `C:\Users\<username>\scientific_programming_course`). Open this notebook and follow further instructions from the local version. As you open the file, you might be asked which kernel to use: choose Python 3.

As an alternative, if you don't have the time or opportunity to install Anaconda on your computer, you can run the notebook using a free service: https://try.jupyter.org/. Just choose "Upload" in the top right corner, then upload the notebook file and open it. Again, you will be asked to select a kernel: choose Python 3.

You can find full documentation <a href="http://jupyter-notebook.readthedocs.io/en/latest/index.html">here</a>. If you want to get more comfortable with notebooks quickly, I suggest you go through <a href="http://jupyter-notebook.readthedocs.io/en/latest/examples/Notebook/Notebook%20Basics.html">notebook basics</a> section. For now what you need to know is that notebooks are composed of cells, and each cell contains either **Markdown** (like this cell), which is used for text, or **Code**, which is used for snippets of code or even whole scripts. You can execute the content of the cell by pressing **Shift+Enter**. To create a new cell **b**elow the current cell, press **B** key; to create a new cell **a**bove the current cell, press **A** key.

# Spyder

<a href="https://pythonhosted.org/spyder/">Spyder</a> (Scientific PYthon Development EnviRonment) is another IDE shipped with Anaconda. To start it, you can run `spyder` in your terminal. It is more Matlab-like, and more suited for running scripts rather than exploring and prototyping.

There are other IDEs available for Python, but these will fulfill most of your needs.

# Python 2 vs Python 3

A question that almost always comes up in Python courses: what's the deal with two different versions of Python -- Python 2 (currently in version 2.7) and Python 3 (currently in version 3.6)? A very quick note on that. Python is a quickly developing language, with new versions coming out almost every year. Usually they contain small and specific improvements, and new versions are *backward compatible*: whatever you could run in the previous version, you can also run in the new one. However, at some point (in 2008) it became clear that a stack of changes to the internal logic of the language had to be made for the language to develop further, and that they would quite significantly change how people use the language (significantly is a strong word: in reality it is significant for software development, not really for data science). At this point the language became *backward incompatible*: you cannot (should not!) run a Python 2 script in Python 3 without any changes.

Python 2.7 is the last version of Python 2, and no new versions will come out. In fact, in 2020 support for Python 2 will stop completely. Python 3 is the future of the language, so definitely use that if you can. That said, there are several packages which haven't been ported to Python 3, like `psychopy`. It is generally not a problem: you can easily install both versions of Python with Anaconda and keep them separate. Whenever you need to use a library supported only by Python 2, just switch the kernel. We will learn how to do this later, in the last parts of the course (if you need it now, ask me during office hours). In any case, hopefully very soon all active packages will be ported to Python 3.
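To give a feeling for what kind of changes these are, here is a small sketch of two well-known differences, written for Python 3. It is not a complete list, only an illustration; the variable names are made up for this example.

```python
# Python 3: print is a function and needs parentheses.
# In Python 2 you could write the statement  print "hello"  (a SyntaxError in Python 3).
print('hello')

# Python 3: `/` between two integers is true division, `//` is floor division.
# In Python 2, `7 / 2` between two integers gave 3.
true_division = 7 / 2
floor_division = 7 // 2
print(true_division, floor_division)
```

A Python 2 script relying on the old meaning of `/` would still run in Python 3, but silently give different numbers, which is why you should not run old scripts without checking them.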

If you want to know more, check out here: https://wiki.python.org/moin/Python2orPython3

# Documentation
Python has very good documentation: https://docs.python.org/3/. It also includes a very thorough <a href="https://docs.python.org/3/tutorial/index.html">tutorial</a>. Different Python packages have their own documentation in separate places, but they are only one google search away.

##### Quickly work through the pre-course tutorial notebook

# Now let's dive a bit deeper on each topic

## Notebook's cell output

As you noticed, when you run a cell, it displays the output of the script inside it. However, it displays only the output of the **last** line. If the last line doesn't have an output, the cell won't display anything. Compare the following two cells:

In [25]:
x = 10
x

10

In [26]:
x = 10
x
y = 9

If you want to actually display something, you need to say it explicitly with `print` function. Note that in this case the message is not an "output" *per se* (which you can notice by the fact that on the left it doesn't say `Out[]:`), it is just printed. You can `print` infinitely many things, but show output of only one per cell.

In [24]:
x = 10

# this value is just printed
print(x)

y = 5

# this is going to be displayed as an output
y

10


5

**Pro-tip**: When you're not sure what a function's output is, just put it in a separate cell and run it. If it has an output, it will show up as the output of the cell.

## Assignment

Variables allow you to store values. Or do they? In fact, a variable is just a pointer (a *reference*) to an object in memory. Here is an example which can be confusing to a novice:

In [41]:
X = [0,5,10,15,20]
Y = X
Y[1] = -999
print(X)
print(Y)

[0, -999, 10, 15, 20]
[0, -999, 10, 15, 20]


What happens here? We created a list in memory, and variable `X` points to that object. Variable `Y` is assigned the same value as `X`, but the list is not copied, rather `Y` merely points to the same object. We modify the list through `Y` and discover that `X` still points to the same object.

If you want to avoid this, use explicit `.copy()` method on the list:

In [39]:
X = [0,5,10,15,20]
Y = X.copy()
Y[1] = -999
print(X)
print(Y)

[0, 5, 10, 15, 20]
[0, -999, 10, 15, 20]


This is an example of Python giving you more control over memory and pointers. In Matlab and R assignment behaves as if the object were copied, which can lead to some serious memory replication (using a lot of memory for copies of the same objects). Control is good, but you need to be aware of this behavior.
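If you are ever unsure whether two variables point to the same object, you can check it directly. The sketch below uses the `is` operator, which tests identity (same object in memory), as opposed to `==`, which tests equal contents. The names `Z`, `same_object`, etc. are only for illustration.

```python
X = [0, 5, 10, 15, 20]
Y = X           # a second name for the same list
Z = X.copy()    # a new list with the same contents

same_object = Y is X       # Y and X are one object
copied_object = Z is X     # Z is a different object...
equal_contents = Z == X    # ...but holds equal values
print(same_object, copied_object, equal_contents)
```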

# Tuples: immutable lists

We already learned about `list`: lists are containers for different types of stuff. There is another *built-in* container data type, called `tuple`. Tuples are denoted with parentheses `()` instead of brackets `[]`. 

In [1]:
# make a tuple
info = ('Sergey', 28, 'Russian', 1989, 1, 9)
info

('Sergey', 28, 'Russian', 1989, 1, 9)

In [2]:
type(info)

tuple

Tuples are very much like lists, except for one thing: they cannot be changed. Here is an example:

In [3]:
# let's make a list out of our tuple:
info_list = list(info)
info_list

['Sergey', 28, 'Russian', 1989, 1, 9]

In [6]:
type(info_list)

list

In [4]:
# now try to change something in it: it works
info_list[1] = 29
info_list

['Sergey', 29, 'Russian', 1989, 1, 9]

In [5]:
# let's try to do the same with tuple
info[1] = 29

TypeError: 'tuple' object does not support item assignment

We get an error if we try to change some value in a tuple. The same if we try to add something to it. 

A good question would be -- why do we need to have exactly the same thing as `list`, but which can do LESS than a `list`? It turns out that for many reasons it is very convenient to have a data type which cannot be changed. We won't go into details here, but if you have something which you don't intend to change, consider making it a `tuple` instead of a `list`. At the very least you won't change it *accidentally*.
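Here is a small sketch of what "cannot add something to it" looks like in practice. A tuple simply has no `append` method; if you need a longer tuple, you build a new one and the original stays as it was. The name `longer` is just for this example.

```python
info = ('Sergey', 28, 'Russian')

# Tuples have no append method, so adding an element in place fails.
try:
    info.append(1989)
except AttributeError as error:
    print('Cannot append:', error)

# Concatenation creates a NEW tuple; the original is untouched.
longer = info + (1989,)
print(info)
print(longer)
```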

# Mapping data types

Another *built-in* container data type is `dict` (short for *dictionary*). A `dict` contains **pairs of things**. Any entry in a `dict` is a pair `key`:`value` (in programming this relationship is called *mapping*: each key maps onto a value). Think about it as a real world dictionary -- in an English-Italian dictionary you have a `key` word, e.g. **shirt**, and a `value` associated with it: **camicia**. And you can find a `value` by looking up the `key`. Just like in a real dictionary, you cannot go the other way and find the word **shirt** by looking up **camicia** -- you would need another, Italian-English dictionary for that. Same with `dict`: `keys` and `values` are not symmetric, you can only get them in one direction `key`->`value`.

Syntax for a `dict` is to put `key:value` pairs inside curly brackets `{}`, with different pairs separated by comma:

In [1]:
{'shirt':'camicia'}

{'shirt': 'camicia'}

In [2]:
info = {'name':'Adina', 'surname':'Drumea', 'lab':'Diamond', 'taken_prog_class':False, 'languages': ['Matlab','C++']}
info

{'lab': 'Diamond',
 'languages': ['Matlab', 'C++'],
 'name': 'Adina',
 'surname': 'Drumea',
 'taken_prog_class': False}

Another way of defining a `dict`. The results are equivalent, so choose whichever you like. Note that in this case the `keys` are written without quotes (as keyword arguments), but they become strings in the dict:

In [3]:
info = dict(name='Adina', surname='Drumea', lab='Diamond', taken_prog_class=False, languages=['Matlab','C++'])
info

{'lab': 'Diamond',
 'languages': ['Matlab', 'C++'],
 'name': 'Adina',
 'surname': 'Drumea',
 'taken_prog_class': False}

We can retrieve values from `dict` by specifying `key` like this:

In [4]:
info['surname']

'Drumea'

In [5]:
info['taken_prog_class']

False

**Note**: Both `key` and `value` can be of any type (with the only exception that `keys` cannot be `list` and some other *mutable* types; this has to do with the implementation of `dict` in Python). If a `key` repeats, the later value overrides the earlier one:

In [6]:
{'name':'Adina', 'name':'Marinella'}

{'name': 'Marinella'}
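To illustrate the restriction on keys, here is a small sketch with made-up data (the `sites` dict and its contents are hypothetical). A `tuple` works as a key, because it cannot change; a `list` with the same contents is rejected.

```python
# Hypothetical example: label positions on a grid
sites = {}

# A tuple is immutable, so it can be used as a key.
sites[(3, 4)] = 'electrode A'

# A list is mutable, so Python refuses to use it as a key.
try:
    sites[[3, 4]] = 'electrode A'
except TypeError as error:
    print('List as key fails:', error)

print(sites)
```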

Besides storing and retrieving values from `dict`, you can also iterate through `keys` and `values` easily:

In [7]:
for (key, value) in info.items():
    print('The key was:', key)
    print('The value was:', value)
    print('')

The key was: name
The value was: Adina

The key was: surname
The value was: Drumea

The key was: lab
The value was: Diamond

The key was: taken_prog_class
The value was: False

The key was: languages
The value was: ['Matlab', 'C++']



`dict` supports a lot of different operations (check full documentation <a href="https://docs.python.org/2/library/stdtypes.html#mapping-types-dict">here</a>). Here are some of them:

In [8]:
# check whether certain key is in the dict
'surname' in info

True

In [9]:
'age' in info

False

In [10]:
# return the keys (as a view object you can iterate over)
info.keys()

dict_keys(['name', 'surname', 'lab', 'taken_prog_class', 'languages'])

In [11]:
# return the values (as a view object you can iterate over)
info.values()

dict_values(['Adina', 'Drumea', 'Diamond', False, ['Matlab', 'C++']])

In [12]:
# add stuff to the dict
info.update({'age':28, 'rooms':324})

In [13]:
info

{'age': 28,
 'lab': 'Diamond',
 'languages': ['Matlab', 'C++'],
 'name': 'Adina',
 'rooms': 324,
 'surname': 'Drumea',
 'taken_prog_class': False}

**Note**: The most attentive of you will notice that the order of the `key`:`value` pairs in the displayed output is not the order in which we inserted them. This shows a potential pitfall of using `dict`, which you have to be careful about: **`dict` is not guaranteed to store the order of inserted pairs**! For example, if you try to iterate through the values in the dict, you cannot trust that it will iterate in the order in which you inserted the pairs. (This is true for Python versions before 3.7; from Python 3.7 on, `dict` is guaranteed to keep insertion order.)

If ever you need to use mapping type which remembers the order, take a look at <a href="https://docs.python.org/2/library/collections.html#collections.OrderedDict">`OrderedDict` from `collections` module</a>. It operates the same way as `dict`, but will keep the order if you iterate.

In [14]:
from collections import OrderedDict
info_ordered = OrderedDict(name='Adina', surname='Drumea', lab='Diamond', taken_prog_class=False, languages=['Matlab','C++'])
info_ordered

OrderedDict([('name', 'Adina'),
             ('surname', 'Drumea'),
             ('lab', 'Diamond'),
             ('taken_prog_class', False),
             ('languages', ['Matlab', 'C++'])])

In [15]:
for key, value in info_ordered.items():
    print(key, value)

name Adina
surname Drumea
lab Diamond
taken_prog_class False
languages ['Matlab', 'C++']


# List comprehensions

In Python there are a number of syntax simplifications which can be used to speed up coding. You will learn those over time, but there is one particularly useful shortcut called *list comprehensions*, which not only speeds up the coding, but also significantly improves code readability. As a consequence it is used ubiquitously. 

It has to do with how we write `for` loops. In particular, consider the following (real life) example. Let's say I recorded behavior in a bunch of rats, and for each session I have a name, which contains the year, month and day of the session and the codename of the rat in the following format: YYYYMMDDratcode. Example: `20170114S8`, where `S8` is the name of the rat.

In [102]:
sessions = ['20160701S8', '20160702S9', '20160702S8','20160703S10', '20160703S9', '20160703S8']
sessions

['20160701S8',
 '20160702S9',
 '20160702S8',
 '20160703S10',
 '20160703S9',
 '20160703S8']

Now I just want to get the dates of the session, so that I can see on which days I recorded at least 1 rat. I could construct the following loop:

In [109]:
# create a new empty list, which we will append later
sessions_date = []
# iterate through every session
for s in sessions:
    # append the new list with the first 8 characters from the session name
    sessions_date.append(s[:8])
    
sessions_date

['20160701', '20160702', '20160702', '20160703', '20160703', '20160703']

Possible, but a bit too tedious. Especially the part where you have to create an empty `list` and then append values there. There is a better way in Python:

In [111]:
[s[:8] for s in sessions]

['20160701', '20160702', '20160702', '20160703', '20160703', '20160703']

This produces exactly the same output, but instead of taking several lines it takes just one. It is also quite easy to read once you get the hang of it. See `for s in sessions`, which is exactly the same as in the long `for` loop above, and it does the same thing: iterates through the values of `sessions`, and on each iteration `s` takes a value from the list, one after another. And for each iteration, you return `s[:8]`, which is the first 8 characters of `s`. These values are automatically collected in a list. You can assign it to a variable in the same way as any other list:

In [112]:
sessions_date = [s[:8] for s in sessions]
sessions_date

['20160701', '20160702', '20160702', '20160703', '20160703', '20160703']

**Side note**: To follow through with the example, if I wanted to get unique days of the recording, I can use a function `unique` from `numpy` module, which will return only the unique values:

In [124]:
from numpy import unique
print(unique(sessions_date))

['20160701' '20160702' '20160703']


*List comprehensions* (or *listcomps* for short) will save you a lot of time and space in your script. You can even do some conditional things inside. Let's say I wanted to return the date ONLY for the rat `S9`. I can use `in` to check presence of a sub-string in a larger string like so:

In [125]:
s = '20160701S9'
'S9' in s

True

However to go through all sessions and check, I would need a `for` loop with `if` inside (if you want, try to implement it like that as an exercise). Instead we can do the same with listcomp:

In [127]:
[s[:8] for s in sessions if 'S9' in s]

['20160702', '20160703']
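For reference, here is one possible way to write the same thing as a `for` loop with an `if` inside (try it yourself first if you want the exercise). The name `s9_dates` is just for this sketch.

```python
sessions = ['20160701S8', '20160702S9', '20160702S8', '20160703S10', '20160703S9', '20160703S8']

# start with an empty list and append only the matching sessions
s9_dates = []
for s in sessions:
    # keep the session only if the rat code 'S9' appears in its name
    if 'S9' in s:
        s9_dates.append(s[:8])

print(s9_dates)
```

One thing to watch out for with `in` on strings: it matches any sub-string, so `'S1' in s` would also be true for the rat `S10`. Comparing the rat code exactly, e.g. `s[8:] == 'S9'`, avoids this.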

# A (very brief) intro to object-oriented programming

**Note**: This is a somewhat more advanced topic, which might be difficult to understand on your own (outside class). If you have difficulties with this section, just skip it. It is not necessary for the later parts; it is here only to improve your understanding of what is going on inside the language.
___

<img src="https://cdn.meme.am/cache/instances/folder307/46864307.jpg">



In Python, everything is an object, which belongs to a certain class (which is the same thing as type). We already saw many different classes (types) of objects, such as `int`, `float`, `str`, `list`, `tuple`, `dict`, etc. You can use existing classes, but you can also create new ones. In reality, in working with data there is almost no application for this (however if you do simulations, especially parts-based simulations like neuronal simulations, this is extremely useful). Still, it will help you understand what is going on with the language. Let's create a class `Student` and inside it define a function, like so:

In [16]:
class Student():
    
    def say_hi(self):
        print('Hi!')

Now we can create an *instance* of that class:

In [17]:
Alex = Student()

It does nothing special, it just exists. We can check the type of `Alex` and verify that it is of our class `Student`:

In [18]:
type(Alex)

__main__.Student

(`__main__` refers to the fact that we defined the class inside this particular notebook; if you imported that class from, let's say, module `math`, it would say `math.Student`)

There is one particular thing about our class `Student` though. We defined a function inside it. This function is not actually a function *per se*, which you can verify by trying to run it (we can use any input just for demonstration):

In [19]:
say_hi(some_input)

NameError: name 'say_hi' is not defined

Python says that the function doesn't exist. That is because it only exists inside the object. We can call it, using *dot-notation*, like so:

In [20]:
Alex.say_hi()

Hi!


In [21]:
Ehsan = Student()
Ehsan.say_hi()

Hi!


The reason I am describing creating classes is to point out the difference between *functions* and *methods*. 

Methods are "functions inside an object". They cannot be called separately from an object, but are accessed with *dot-notation*: `<object>.<method>`. Which methods an object has is defined by its `class`, for example, object of the class `dict` will have different methods from object of the class `list`. In the example above `say_hi()` is a method of a class `Student`.

Let's consider an example from above (section about dictionaries):

In [22]:
for key, value in info.items():
    print(key, value)

name Adina
surname Drumea
lab Diamond
taken_prog_class False
languages ['Matlab', 'C++']
age 28
rooms 324


Note that to iterate through the dictionary, we used the syntax `info.items()`. You might not have recognized it before, but now you should understand that `items()` is nothing else but a *method* of the object `info`. It is there because `info` is an instance of the class `dict`. In particular, `items()` is a method of the class `dict` which returns the contents of the `dict` (the `key:value` pairs saved there) in a form that is easy to iterate over in a loop.

In your work, you will use both functions and methods and we will see many examples of that.
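As a small illustration of the difference, here is a sketch that sorts a list in two ways: with the built-in function `sorted` and with the list method `sort`. The variable names are made up for this example.

```python
values = [3, 1, 2]

# Function: called on its own, the object is passed in as an argument.
ordered = sorted(values)    # returns a new sorted list, `values` is unchanged
print(ordered, values)

# Method: accessed through the object with dot-notation.
values.sort()               # sorts the list in place
print(values)

# Which methods exist depends on the class: a list has `sort`, a dict does not.
list_has_sort = hasattr(values, 'sort')
dict_has_sort = hasattr({}, 'sort')
print(list_has_sort, dict_has_sort)
```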

**Pro-tip**: if you write `<object>.` and then press TAB, Jupyter will list all the methods available for the object. E.g., in the cell below try writing `info.` and press TAB, and it will list all the methods available for this instance of the class `dict`.

In [23]:
# write below info. and press TAB


# Troubles with floats
Example from the survey:

In [12]:
x = 0.1 + 0.2
y = x == 0.3
y

False

Why does this happen? If you can spare 9 minutes of your time, I suggest you watch a great explanation on the <a href="https://www.youtube.com/watch?v=PZRI1IfStY0">Computerphile</a> YouTube channel. You will understand how floating-point numbers are stored and why floating-point arithmetic is not precise.

I will attempt to explain it here in a nutshell. The problem is more evident if we just look at what `0.1 + 0.2` gives us:

In [213]:
0.1+0.2

0.30000000000000004

As you can see, there is a tiny error at the end. Why is it there? The answer has to do with how numbers (in particular, the `float` type) are stored in memory: they can only store a certain number of *significant digits*. When exact representation would require an infinite number of repeating digits, it fails. It is easier to understand with an analogy between fractions and decimals.

Think about the fraction `1/3`. If you try to write it as a decimal, you'd have to write `0.33333333...` and so on. But what if you were able to store only 5 significant digits after `0.`? You'd have to write `0.33333`, that's the best approximation you can do. Now try to do arithmetic with `1/3`: with exact fractions, `1/3 + 1/3 + 1/3 = 1`. However, if you do it in decimals where you can only store 5 significant digits, you get `0.99999`.

A similar thing happens in computers when you write decimals (like `0.1` and `0.2`), because computers cannot store decimal notation in memory; they have to translate it to *binary*. And sometimes this translation gives a repeating pattern, just like `0.33333...`. But computers cannot store infinitely many digits; they have to cut the number off after some digits. This introduces errors when working with `float`, and it will happen in any language, because it is a fundamental property of how computers store numbers.
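You can actually look at what is stored for `0.1`. The sketch below uses two standard tools: the `hex()` method of floats, which shows the stored binary digits in hexadecimal, and `Decimal`, which converts the stored value exactly. Note the repeating `9999...` pattern that had to be cut off (and rounded) at the end.

```python
from decimal import Decimal

# Hexadecimal view of the stored bits: a repeating pattern, cut off and rounded.
stored_bits = (0.1).hex()
print(stored_bits)

# Decimal(float) converts the stored binary value exactly, digit by digit.
exact_value = Decimal(0.1)
print(exact_value)

# The stored value is close to, but not exactly, one tenth.
is_exactly_one_tenth = exact_value == Decimal('0.1')
print(is_exactly_one_tenth)
```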

Based on these difficulties, there are 2 fundamental principles one has to keep in mind when working with `float` type.

**First**: Never test floats for exact equality. This is exactly the mistake we made in the script from the survey:
    
    x = 0.1 + 0.2
    y = x == 0.3
    
The result will sometimes be `True` and sometimes `False`, depending on exactly which numbers you try; the point is that you cannot predict it without knowing the binary representation, so you cannot trust it. It is usually fine to compare which float is larger, though. (**Pro-tip**: if you ever do need to compare floats for equality and cannot get around it, there is a function called <a href="https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.isclose.html">`isclose`</a> in the `numpy` package, which compares two values up to a specified tolerance.)
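A minimal sketch of tolerance-based comparison, applied to the survey example. Besides `np.isclose`, I also show `math.isclose` from the standard library (available since Python 3.5) and a hand-written check; the tolerance `1e-9` in the last line is an arbitrary value picked for illustration.

```python
import math
import numpy as np

x = 0.1 + 0.2

# exact comparison fails because of the rounding error
exact_equal = (x == 0.3)

# comparisons up to a tolerance succeed
close_np = np.isclose(x, 0.3)
close_math = math.isclose(x, 0.3)
close_manual = abs(x - 0.3) < 1e-9   # tolerance chosen by hand

print(exact_equal, close_np, close_math, close_manual)
```

Which tolerance is appropriate depends on the scale of your data, so check the default values of `rtol` and `atol` in the documentation before relying on them.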

**Second**: Be very careful when adding or subtracting floats of vastly different magnitude: the smaller ones can get lost in the rounding. Consider the following example, where we add `0.001` one thousand times to a large number $10^{13}$, and expect to get $10^{13}+1$:

In [226]:
a = 10.0**13
for i in range(1000):
    a = a + 0.001
a

10000000000001.953

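As you can see, the error (about `0.953`) is almost as large as the total amount we intended to add (`1`). Near $10^{13}$, neighbouring `float` values are about `0.002` apart, so every addition of `0.001` gets rounded up to roughly `0.002`.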
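The sketch below is one way to check where this error comes from. It uses `np.spacing`, which returns the distance from a number to the next representable float, and `math.fsum` from the standard library, which sums a sequence while tracking the lost digits. Neither appears elsewhere in this notebook; I use them here only for illustration.

```python
import math
import numpy as np

big = 10.0**13

# distance between big and the next float that can be stored
gap = np.spacing(big)
print(gap)

# naive loop: every step is rounded to the nearest storable float
a = big
for _ in range(1000):
    a = a + 0.001
print(a)

# math.fsum adds all the terms and rounds only once, at the end
exact = math.fsum([big] + [0.001] * 1000)
print(exact)
```

Since `0.001` is slightly more than half of the gap, each step of the loop is rounded up to a full gap, which is how the error accumulates.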

**Pro-tip**: if you ever need to do very precise decimal arithmetic, check out the <a href="https://docs.python.org/2/library/decimal.html">`decimal`</a> module in Python.
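A short sketch of how the survey example behaves with `decimal`. Note that I build the numbers from strings; building them from floats (e.g. `Decimal(0.1)`) would carry the binary rounding error along.

```python
from decimal import Decimal

# build Decimals from strings so that no binary rounding sneaks in
x = Decimal("0.1") + Decimal("0.2")
is_equal = (x == Decimal("0.3"))

print(x)
print(is_equal)
```

The price is speed: `Decimal` arithmetic is much slower than `float`, so it is usually reserved for cases like money, where exact decimal results matter.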

If you want to dig a little deeper on that, check out this really great page: http://floating-point-gui.de/

<img src="http://www.contribute.geeksforgeeks.org/wp-content/uploads/numpy-logo1.jpg" align="left" alt="Drawing" style="width: 80px;"/>
# NumPy 
 Numerical Python

Arrays
- hold elements of the same data type
- occupy a contiguous segment of memory
- give constant-time access to any element by index
- but inserting or appending is inefficient, because the whole array has to be copied; hence always preallocate
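To illustrate the last point, here is a sketch comparing the two patterns. The array size `n` and the computed values (squares) are arbitrary choices for the example.

```python
import numpy as np

n = 1000

# slow pattern: np.append allocates a brand-new array on every call
grown = np.array([], dtype=float)
for i in range(n):
    grown = np.append(grown, i ** 2)

# preferred pattern: allocate once, then fill the elements in place
prealloc = np.empty(n)
for i in range(n):
    prealloc[i] = i ** 2

print(np.array_equal(grown, prealloc))
```

Both loops produce the same result, but the first one copies the whole array at every step, so its cost grows much faster with `n`.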

In [6]:
import numpy as np

In [5]:
x = np.array([1,2,3])
print(x)

[1 2 3]


In [5]:
x = np.zeros(3)
print(x)

[ 0.  0.  0.]


In [24]:
x = np.zeros(10, dtype='int')
print(x)
print("ndim: ", x.ndim)
print("shape:", x.shape)
print("size: ", x.size)
print("dtype:", x.dtype)

[0 0 0 0 0 0 0 0 0 0]
ndim:  1
shape: (10,)
size:  10
dtype: int32


In [25]:
print("itemsize:", x.itemsize, "bytes")
print("nbytes:", x.nbytes, "bytes")

itemsize: 4 bytes
nbytes: 40 bytes


# Array indexing
## Accessing individual items

In [10]:
x = np.arange(20)
x

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

In [90]:
x[5]

5

In [91]:
x[3]

3

In [92]:
# in Python indexing starts with 0
x[0]

0

In [94]:
# indexing from the end
x[-1]

19

In [95]:
x[-5]

15

In [105]:
# multidimensional indexing
x2 = np.random.randint(10,size=(3,5))
x2

array([[0, 7, 0, 4, 2],
       [0, 1, 9, 5, 7],
       [8, 8, 8, 3, 8]])

In [107]:
x2[1,4]

7

In [108]:
x2[1,-1]

7

In [114]:
x3 = np.random.randint(10,size=(3,5,7))
x3

array([[[2, 6, 2, 5, 8, 0, 2],
        [6, 5, 7, 6, 7, 1, 9],
        [0, 0, 0, 7, 3, 8, 1],
        [1, 0, 5, 9, 4, 1, 0],
        [6, 7, 5, 0, 5, 4, 8]],

       [[4, 0, 5, 1, 0, 8, 3],
        [2, 6, 7, 0, 9, 3, 1],
        [1, 4, 6, 3, 0, 3, 9],
        [9, 2, 3, 0, 8, 4, 7],
        [1, 3, 0, 2, 5, 6, 7]],

       [[8, 4, 8, 1, 0, 1, 2],
        [3, 9, 6, 2, 2, 4, 2],
        [4, 6, 8, 6, 8, 6, 8],
        [4, 1, 0, 4, 4, 2, 7],
        [8, 9, 7, 6, 5, 1, 7]]])

In [115]:
x3[1,3,4]

8

In [116]:
x2

array([[0, 7, 0, 4, 2],
       [0, 1, 9, 5, 7],
       [8, 8, 8, 3, 8]])

In [117]:
# modifying items
x2[0,0] = 101
x2

array([[101,   7,   0,   4,   2],
       [  0,   1,   9,   5,   7],
       [  8,   8,   8,   3,   8]])

Keep in mind that NumPy arrays have a fixed type and will not "upcast" automatically: a float assigned to an integer array is silently truncated!

In [118]:
x2[0,0] = 3.1415
x2

array([[3, 7, 0, 4, 2],
       [0, 1, 9, 5, 7],
       [8, 8, 8, 3, 8]])
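If you do need to store floats, you have to convert the array explicitly. A minimal sketch using `astype` (the variable names are made up for the example):

```python
import numpy as np

ints = np.array([1, 2, 3])
ints[0] = 3.9                 # fractional part is dropped silently
print(ints)

# astype returns a new array with the requested type
floats = ints.astype(float)
floats[0] = 3.9               # now the value is stored as given
print(floats)
```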

## Array slicing
Using *:* within brackets we can access slices of the array with the following pattern:
        
    x[start:stop:step]
    
If any of these are unspecified, they default to: `start=0, stop=`*size of dimension*`, step=1`
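Here is a quick sketch of the pattern on a small array, so that the results are easy to check by eye. Note that `stop` is not included in the result.

```python
import numpy as np

x = np.arange(10)

# writing out all the defaults gives the whole array
full = x[0:10:1]

# start at index 2, stop before index 8, take every second element
part = x[2:8:2]

# with a negative step we walk backwards: from index 7 down to (not including) 2
back = x[7:2:-2]

print(full)
print(part)
print(back)
```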

In [124]:
x = np.random.rand(50)
x

array([ 0.35730147,  0.92591748,  0.21475045,  0.63482056,  0.76368891,
        0.78922831,  0.96987125,  0.84906021,  0.02231264,  0.58855579,
        0.79702615,  0.71298563,  0.34782458,  0.29281994,  0.33894322,
        0.04343247,  0.16218615,  0.7214245 ,  0.96740122,  0.88163388,
        0.50298211,  0.43952651,  0.91062204,  0.66982251,  0.15239474,
        0.46351212,  0.5548472 ,  0.80435443,  0.46839106,  0.36561595,
        0.12178768,  0.72255183,  0.91057855,  0.80411022,  0.21650579,
        0.40289904,  0.35962963,  0.256189  ,  0.20162084,  0.77324285,
        0.77014375,  0.76209701,  0.73445638,  0.89330203,  0.20395683,
        0.03984289,  0.52263979,  0.46054579,  0.54771782,  0.91143364])

In [131]:
# first 5 elements
x[:5]

array([ 0.35730147,  0.92591748,  0.21475045,  0.63482056,  0.76368891])

In [132]:
# from 5th to 10th element
x[4:10]

array([ 0.76368891,  0.78922831,  0.96987125,  0.84906021,  0.02231264,
        0.58855579])

In [134]:
# from 11th element until the end
x[10:]

array([ 0.79702615,  0.71298563,  0.34782458,  0.29281994,  0.33894322,
        0.04343247,  0.16218615,  0.7214245 ,  0.96740122,  0.88163388,
        0.50298211,  0.43952651,  0.91062204,  0.66982251,  0.15239474,
        0.46351212,  0.5548472 ,  0.80435443,  0.46839106,  0.36561595,
        0.12178768,  0.72255183,  0.91057855,  0.80411022,  0.21650579,
        0.40289904,  0.35962963,  0.256189  ,  0.20162084,  0.77324285,
        0.77014375,  0.76209701,  0.73445638,  0.89330203,  0.20395683,
        0.03984289,  0.52263979,  0.46054579,  0.54771782,  0.91143364])

In [135]:
# every second element
x[::2]

array([ 0.35730147,  0.21475045,  0.76368891,  0.96987125,  0.02231264,
        0.79702615,  0.34782458,  0.33894322,  0.16218615,  0.96740122,
        0.50298211,  0.91062204,  0.15239474,  0.5548472 ,  0.46839106,
        0.12178768,  0.91057855,  0.21650579,  0.35962963,  0.20162084,
        0.77014375,  0.73445638,  0.20395683,  0.52263979,  0.54771782])

In [137]:
# every third element
x[::3]

array([ 0.35730147,  0.63482056,  0.96987125,  0.58855579,  0.34782458,
        0.04343247,  0.96740122,  0.43952651,  0.15239474,  0.80435443,
        0.12178768,  0.80411022,  0.35962963,  0.77324285,  0.73445638,
        0.03984289,  0.54771782])

If the step is negative, the array is traversed from end to start. This is a convenient way to reverse an array:

In [142]:
# reversed array
x = np.arange(10)
print(x)
print(x[::-1])

[0 1 2 3 4 5 6 7 8 9]
[9 8 7 6 5 4 3 2 1 0]


You can access rows and columns of multidimensional arrays by combining *:* with element indexing:

In [147]:
# print 2 dimensional array
print(x2)

# access third column
print(x2[:,2])

[[3 7 0 4 2]
 [0 1 9 5 7]
 [8 8 8 3 8]]
[0 9 8]
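The same idea works for rows and for rectangular blocks. A small sketch on an array with known content (`grid` is a name I made up for this example):

```python
import numpy as np

grid = np.arange(15).reshape(3, 5)
print(grid)

# third column: all rows, column index 2
col = grid[:, 2]

# second row: row index 1, all columns
row = grid[1, :]

# block: first two rows, columns with index 1 to 3
block = grid[:2, 1:4]

print(col)
print(row)
print(block)
```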


## Distinction between *memory views* and *copies*
When you access sub-arrays by slicing, it is important to keep in mind that you get a *view* of the array, not a copy of it! This means that by default the new array is not a separate entity; it accesses the same memory as the original array. Here is a simple example:

In [148]:
x2

array([[3, 7, 0, 4, 2],
       [0, 1, 9, 5, 7],
       [8, 8, 8, 3, 8]])

In [149]:
# get first two elements from both dimensions
x2_sub = x2[:2,:2]
x2_sub

array([[3, 7],
       [0, 1]])

In [150]:
# modify an element in the new array
x2_sub[1,1] = 99
x2_sub

array([[ 3,  7],
       [ 0, 99]])

In [151]:
# see that the original array also got modified
x2

array([[ 3,  7,  0,  4,  2],
       [ 0, 99,  9,  5,  7],
       [ 8,  8,  8,  3,  8]])

This behavior is very useful for working with datasets, because it saves you a lot of memory. If you need to make a copy of the array, you must do it explicitly:

In [153]:
# make a copy
x2_sub_copy = x2[:2,:2].copy()

In [155]:
# modify a copy and verify that the original array is intact
x2_sub_copy[1,1] = -666
print(x2_sub_copy)
print(x2)

[[   3    7]
 [   0 -666]]
[[ 3  7  0  4  2]
 [ 0 99  9  5  7]
 [ 8  8  8  3  8]]
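If you are ever unsure whether two arrays use the same memory, one way to check is `np.shares_memory`. This function is not used elsewhere in this notebook; the sketch below is just an illustration.

```python
import numpy as np

arr = np.zeros((3, 5), dtype=int)

view = arr[:2, :2]           # slicing gives a view
copy = arr[:2, :2].copy()    # explicit copy

view_shares = np.shares_memory(arr, view)
copy_shares = np.shares_memory(arr, copy)
print(view_shares, copy_shares)

# writing through the view changes the original array
view[0, 0] = 7
print(arr[0, 0])
```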


## Reshaping of the array
Reshaping allows you to rearrange the elements of an array into a new shape. Note that for it to work, the number of elements in the new shape must match the number in the original. Example:

In [162]:
y = np.arange(9)
y

array([0, 1, 2, 3, 4, 5, 6, 7, 8])

In [163]:
y.reshape(3,3)

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
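A short sketch of what happens when the number of elements does not match, together with a shortcut: passing `-1` for one dimension asks NumPy to work out its size from the others.

```python
import numpy as np

y = np.arange(9)

# -1 means "infer this dimension": 9 elements / 3 rows = 3 columns
square = y.reshape(3, -1)
print(square.shape)

# 9 elements cannot fill a 2 x 5 array, so NumPy raises an error
failed = False
try:
    y.reshape(2, 5)
except ValueError as err:
    failed = True
    print("reshape failed:", err)
```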

You can also add an axis using `reshape` (an alternative is to use the `np.newaxis` object):

In [166]:
y.reshape(9,1)

array([[0],
       [1],
       [2],
       [3],
       [4],
       [5],
       [6],
       [7],
       [8]])

In [167]:
y[:, np.newaxis]

array([[0],
       [1],
       [2],
       [3],
       [4],
       [5],
       [6],
       [7],
       [8]])

## Array concatenation and splitting

In [168]:
# simple concatenation
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])

array([1, 2, 3, 3, 2, 1])

In [170]:
# two dimensional concatenation
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
np.concatenate([grid, grid])

array([[1, 2, 3],
       [4, 5, 6],
       [1, 2, 3],
       [4, 5, 6]])

In [171]:
# control axis of concatenation
np.concatenate([grid, grid], axis=1)

array([[1, 2, 3, 1, 2, 3],
       [4, 5, 6, 4, 5, 6]])

In [182]:
# array splitting example
x = np.arange(10)
print(x)
np.split(x,(2,7))

[0 1 2 3 4 5 6 7 8 9]


[array([0, 1]), array([2, 3, 4, 5, 6]), array([7, 8, 9])]

This is a good opportunity to show a neat Python trick: simultaneous assignment (sometimes called *unpacking*). Whenever a function returns a `list` or a `tuple`, you can assign its individual elements to separate variables in one go:

In [180]:
# simultaneous assignment
a1, a2, a3 = np.split(x,(2,7))
print(a1, a2, a3)

[0 1] [2 3 4 5 6] [7 8 9]

## Operations on arrays
We use arrays to speed up computations. The key thing to understand is that Python is *dynamically typed*: when you loop over an *array* or a *list* and operate on each element, during execution the Python core has to look up the type of every element (whether it is `int`, `float`, `str`, etc.) to decide which compiled function to apply to it. This is slow. Consider the following piece of code.

In [188]:
def compute_reciprocals(values):
    # preallocate array for reciprocals
    output = np.empty(len(values))
    
    # compute reciprocal of each element
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    
    return output
        
values = np.arange(1, 10)
print(values)
compute_reciprocals(values)

[1 2 3 4 5 6 7 8 9]


array([ 1.        ,  0.5       ,  0.33333333,  0.25      ,  0.2       ,
        0.16666667,  0.14285714,  0.125     ,  0.11111111])

Now let's see how much time it takes to run this function on an array of 1 million integers:

In [194]:
big_array = np.random.randint(1, 100, size=1000000)
%timeit compute_reciprocals(big_array)

1 loop, best of 3: 2.85 s per loop


In [204]:
%timeit np.divide(1,big_array)

100 loops, best of 3: 6.07 ms per loop


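As you can see, the NumPy version is ~500 times faster. `np.divide` is one of NumPy's *universal functions* (*ufuncs*), which apply an operation to a whole array at once in compiled code.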

It can be cumbersome to write `np.divide` every time you want to perform a division. The good news is that Python allows objects to redefine what operators (`/`, `*`, `+`, `>`, etc.) do for their type. For any operation on arrays, the normal operators automatically call the corresponding NumPy `ufuncs`:

In [206]:
# this is the same as using np.divide
1 / big_array

# this is the same as using np.multiply
big_array * 0.5

array([ 48. ,  29.5,   3.5, ...,  22. ,  31.5,  38. ])

The following table lists the arithmetic operators implemented in NumPy:

| Operator      | Equivalent ufunc    | Description                           |
|---------------|---------------------|---------------------------------------|
|``+``          |``np.add``           |Addition (e.g., ``1 + 1 = 2``)         |
|``-``          |``np.subtract``      |Subtraction (e.g., ``3 - 2 = 1``)      |
|``-``          |``np.negative``      |Unary negation (e.g., ``-2``)          |
|``*``          |``np.multiply``      |Multiplication (e.g., ``2 * 3 = 6``)   |
|``/``          |``np.divide``        |Division (e.g., ``3 / 2 = 1.5``)       |
|``//``         |``np.floor_divide``  |Floor division (e.g., ``3 // 2 = 1``)  |
|``**``         |``np.power``         |Exponentiation (e.g., ``2 ** 3 = 8``)  |
|``%``          |``np.mod``           |Modulus/remainder (e.g., ``9 % 4 = 1``)|
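You can convince yourself that the operators and the *ufuncs* give the same results with a sketch like this one (the small test array `a` is made up for the example):

```python
import numpy as np

a = np.array([1, 2, 3, 4])

# each entry pairs the operator form with the explicit ufunc call
pairs = {
    "/": (1 / a, np.divide(1, a)),
    "*": (a * 0.5, np.multiply(a, 0.5)),
    "**": (a ** 2, np.power(a, 2)),
    "%": (a % 3, np.mod(a, 3)),
    "-": (-a, np.negative(a)),
}

for symbol, (via_operator, via_ufunc) in pairs.items():
    print(symbol, np.array_equal(via_operator, via_ufunc))
```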

If these *ufuncs* have operator wrappers, why would you want to use the full function notation? There are some circumstances in which the full notation gives you more flexibility. I outline some examples here for you to work through, but if you intend to do a lot of number crunching on large datasets, certainly take a look at the [NumPy](http://www.numpy.org) (especially the [ufunc](https://docs.scipy.org/doc/numpy-1.10.0/reference/ufuncs.html) section) and [SciPy](http://www.scipy.org) documentation.

Example 1: Using `out` argument. In this example we have numbers from 0 to 4 which we want to multiply by 10. Here is one way to do it:

In [209]:
x = np.arange(5)
x = x*10
x

array([ 0, 10, 20, 30, 40])

This is explicit, short and easy to understand. However, what it does is create a new array `x*10` and then assign it to `x`. The array originally created with `np.arange(5)` is left without a reference and its memory is reclaimed ("collected") afterwards. While the multiplication runs, though, both the old and the new array have to fit in memory, and allocating the new one takes time, which matters for very large arrays. To avoid this, we can use `out` to write the result directly in place of the old array, without creating any new array in memory (note that `np.multiply` still returns an array here, but it is the very same object we passed as `out`). The result is the same in both cases, but the second implementation can lead to significant savings when preprocessing large numerical datasets.

In [210]:
x = np.arange(5)
np.multiply(x, 10, out=x)
x

array([ 0, 10, 20, 30, 40])
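A small sketch to check that no new array is created. I also show the augmented assignment `x *= 10`, which is another way to modify an array in place.

```python
import numpy as np

x = np.arange(5)

# the returned array is the very same object that we passed as out
result = np.multiply(x, 10, out=x)
same_object = result is x
print(same_object)

# augmented assignment also works in place on the existing array
x *= 10
print(x)
```

Keep in mind that in-place operations cannot change the type of the array, so the result has to fit into the existing `dtype`.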

Example 2: Accumulation. *ufuncs* have several methods.

>**Additional information**: a *method* is a function inside an object, and can be accessed with dot notation. Toy example: if you have an object `car` it might have a method `start`, and you would access it like so: `car.start()`. *ufuncs* are actually objects, which can be called like a function, but they also have methods, which allow you to change their behavior, which is what we will be doing in this example. The cool thing is that although *ufuncs* are objects, you can use them like functions without even knowing this, as we were doing before.

One such method is `accumulate`, which applies the *ufunc* cumulatively to the elements of an array. In this example we want to compute the cumulative product (i.e. the factorials) of the numbers from 1 to 10.

In [228]:
one_to_ten = np.arange(1,11)
one_to_ten

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [229]:
np.multiply.accumulate(one_to_ten)

array([      1,       2,       6,      24,     120,     720,    5040,
         40320,  362880, 3628800], dtype=int32)
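For the common cases NumPy also provides shortcuts. The sketch below compares `accumulate` with `np.cumprod` and shows `reduce`, another *ufunc* method not covered above, which keeps only the final value.

```python
import numpy as np

one_to_ten = np.arange(1, 11)

# np.cumprod is a shortcut for np.multiply.accumulate
running_product = np.multiply.accumulate(one_to_ten)
same_as_cumprod = np.cumprod(one_to_ten)
print(np.array_equal(running_product, same_as_cumprod))

# reduce keeps only the last value of the accumulation
total_product = np.multiply.reduce(one_to_ten)
print(total_product)

# the same methods exist for other ufuncs, e.g. a running sum
running_sum = np.add.accumulate(one_to_ten)
print(running_sum)
```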

Another method is `outer` which will apply the *ufunc* to all pairs of elements:

In [231]:
np.multiply.outer(one_to_ten, one_to_ten)

array([[  1,   2,   3,   4,   5,   6,   7,   8,   9,  10],
       [  2,   4,   6,   8,  10,  12,  14,  16,  18,  20],
       [  3,   6,   9,  12,  15,  18,  21,  24,  27,  30],
       [  4,   8,  12,  16,  20,  24,  28,  32,  36,  40],
       [  5,  10,  15,  20,  25,  30,  35,  40,  45,  50],
       [  6,  12,  18,  24,  30,  36,  42,  48,  54,  60],
       [  7,  14,  21,  28,  35,  42,  49,  56,  63,  70],
       [  8,  16,  24,  32,  40,  48,  56,  64,  72,  80],
       [  9,  18,  27,  36,  45,  54,  63,  72,  81,  90],
       [ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100]])

### Aggregator functions
Functions which reduce an array (or a dimension of an array) to a single value are called aggregator functions. Some of the most useful include `sum`, `min`, `max`, `mean`, `median`, `std`, etc. Python has built-in versions of some of these functions, but on NumPy arrays the NumPy versions are much faster and you should always use them:

In [233]:
big_array = np.random.rand(1000000)
%timeit sum(big_array)
%timeit np.sum(big_array)

10 loops, best of 3: 158 ms per loop
1000 loops, best of 3: 940 µs per loop


Most of the aggregator functions have a sister function with a `nan` prefix, which does the same thing but ignores `NaN` (*Not a Number*) elements. `NaN` is usually used as a placeholder for missing data, so these functions are very useful for working with real data. We will revisit this in a future lesson.

The following table provides a list of useful aggregation functions available in NumPy:

|Function name      |   NaN-ignoring version  | Description                                   |
|-------------------|---------------------|-----------------------------------------------|
| ``np.sum``        | ``np.nansum``       | Compute sum of elements                       |
| ``np.prod``       | ``np.nanprod``      | Compute product of elements                   |
| ``np.mean``       | ``np.nanmean``      | Compute mean of elements                      |
| ``np.std``        | ``np.nanstd``       | Compute standard deviation                    |
| ``np.var``        | ``np.nanvar``       | Compute variance                              |
| ``np.min``        | ``np.nanmin``       | Find minimum value                            |
| ``np.max``        | ``np.nanmax``       | Find maximum value                            |
| ``np.argmin``     | ``np.nanargmin``    | Find index of minimum value                   |
| ``np.argmax``     | ``np.nanargmax``    | Find index of maximum value                   |
| ``np.median``     | ``np.nanmedian``    | Compute median of elements                    |
| ``np.percentile`` | ``np.nanpercentile``| Compute rank-based statistics of elements     |
| ``np.any``        | N/A                 | Evaluate whether any elements are true        |
| ``np.all``        | N/A                 | Evaluate whether all elements are true        |
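A minimal sketch of the difference between the normal and the NaN-ignoring versions, on a made-up array with one missing value:

```python
import numpy as np

data = np.array([1.0, 2.0, np.nan, 4.0])

# a single NaN "poisons" the normal aggregate
plain_mean = np.mean(data)
print(plain_mean)

# the nan-version skips the missing value and averages 1, 2 and 4
nan_mean = np.nanmean(data)
nan_max = np.nanmax(data)
print(nan_mean, nan_max)
```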

We won't discuss each in detail, but feel free to try them for yourself.

Just to mention 2 things about aggregates. 

First, some of them (`sum`, `min`, `max` and some others) can be accessed via method notation, like so:

In [236]:
# print min, max and sum of the array
print('Min:', big_array.min())
print('Max:', big_array.max())
print('Sum:', big_array.sum())

Min: 1.12602703384e-07
Max: 0.999999824302
Sum: 499820.072513


And second, for multidimensional arrays, you can specify the `axis` parameter to aggregate only along a specific axis. By default, they aggregate over the whole array:

In [239]:
multi_dim_array = np.random.randint(100, size=(5,10))
multi_dim_array

array([[57, 29, 95, 48, 90, 11,  1, 77, 94, 44],
       [ 4, 64, 99, 50, 74, 77, 46, 48, 93, 54],
       [61, 15, 25, 60, 82, 60, 27, 16, 45, 50],
       [84, 87,  4, 29, 39, 33, 53, 39, 77, 29],
       [69, 48, 77,  2, 74, 55,  0, 89, 94, 99]])

In [241]:
# default behavior gives maximum of all elements of the array 
np.max(multi_dim_array)

99

In [244]:
# specifying axis gives you control over which dimension is aggregated;
# in this particular case, the function will give max of every column 
np.max(multi_dim_array, axis=0)

array([84, 87, 99, 60, 90, 77, 53, 89, 94, 99])
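The `axis` argument is easy to mix up, so here is a sketch on a tiny made-up array where you can verify the result by eye. A way to remember it: `axis` names the dimension that *disappears* from the result.

```python
import numpy as np

small = np.array([[1, 5, 3],
                  [4, 2, 6]])

# axis=0 collapses the rows: one value per column
col_max = small.max(axis=0)

# axis=1 collapses the columns: one value per row
row_max = small.max(axis=1)

print(col_max, col_max.shape)
print(row_max, row_max.shape)
```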