<a href="https://colab.research.google.com/github/moreymat/scpo-data-science-bootcamp/blob/main/notebooks/1_python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Python basics

Welcome to the "Data science bootcamp" in Python.

In this notebook, the first in a series, you will see the basics of the Python programming language, a fundamental tool nowadays for data science, artificial intelligence and machine learning.

This notebook is mostly (and heavily) derived from the notebook [Python basics](https://colab.research.google.com/github/data-psl/lectures2020/blob/master/notebooks/01_python_basics.ipynb#scrollTo=clWaFCzBMfkv), authored by Mathieu Blondel for the [preparatory week of the transverse program for data sciences at Université PSL (Paris Sciences & Lettres)](https://data-psl.github.io/preparatory-week/).

## Preamble: Colaboratory and notebooks

We are working on a platform, [Google Colaboratory](https://colab.research.google.com), that enables users to edit and run [Jupyter notebooks](https://en.wikipedia.org/wiki/Project_Jupyter#Jupyter_Notebook) in the cloud for free, without any installation, and with access to GPUs.

For a very short history and description of Colaboratory, see the [Wikipedia entry](https://en.wikipedia.org/wiki/Project_Jupyter#Google_Colaboratory).

### Note on how to change the language on Google Colaboratory

Google Colaboratory uses the default language settings from your web browser.
If the menus are displayed in a different language, you can switch to English by going to (the equivalent of) the "Help" menu, then clicking on (the translation of) the item "Display in English".

Having the interface in English will ensure you can easily follow the instructions and advice in these notebooks.

### Using a notebook

All our work will happen inside a *notebook*, an interactive document where you can write a mix of :

* text formatted in [Markdown](https://guides.github.com/features/mastering-markdown/) that is rendered, and
* code in [Python](https://www.python.org/) that can be executed.

More precisely, a notebook consists of a sequence of *cells*, where each cell contains either (Markdown) text or (Python) code.

Notebooks are not limited to text and code: they can also render math equations, images, plain HTML content and even interactive widgets!
But in this series of notebooks, we will mostly write text and code.

You can double click on any text cell to edit it.

By default, Colab shows you the Markdown text on the left and a preview of its rendering on the right that is updated (almost) in real time as you type.

When you are done, just press "Shift+Enter" and the cell will be rendered.


### Markdown basics

Markdown relies on special characters to format text, in particular the star `*` (the asterisk, also used as the multiplication sign).

*Text between stars is in italic*, **text between pairs of stars is in bold**. 

A sequence of lines that start with a star is rendered as a list with bullets :

* first item
* second item
* third item

Lines that start with one or more hash `#` are headings :

* 1 hash `#` for level-1 headings (document title), 
* 2 hashes `##` for level-2 headings (section title),
* 3 hashes `###` for level-3 headings (subsection title).

Links are defined by two elements : the text of the link between square brackets `[` and `]`, and the target of the link between parentheses.
As an example, here is a link to the [Sciences-Po website](https://www.sciencespo.fr/en/home).

If you want to add a text (or code) cell after the cell you are currently editing, you can either :
* click on the menu item "Insert" > "Text cell" (or "Code cell"),
* hover your mouse 1-2 mm below the bottom of the cell and click on one of the two buttons "+ Code" and "+ Text" that appear.

If you forget the syntax for any element, the menu bar that appears at the top of the cell (when you edit it) contains the most commonly used elements in a familiar interface.

A smaller bar appears on the upper right corner of the cell, with "up" and "down" arrows to move the cell up or down, a button to delete the cell, etc.

There are many resources online to learn more about the Markdown syntax, e.g. this [short tutorial by GitHub](https://guides.github.com/features/mastering-markdown/).

### Python in the notebook

Notebooks can also contain code cells. You edit them like text cells, writing Python instead of Markdown, and you can then execute them.

A code cell contains one or more Python instructions.

For example, here is a code cell that computes the number of seconds in a day (`24 * 60 * 60`), stores the result in a variable (`seconds_in_a_day`) and displays its value (a notebook shows the value of the last expression in a cell):

In [None]:
seconds_in_a_day = 24 * 60 * 60
seconds_in_a_day

86400

Click on the "play" button on the left of the cell to execute it.
You should be able to see the result. 

Alternatively, you can also execute the cell by pressing "Ctrl+Enter" if you are on Windows or Linux, "Command+Enter" if you are on a Mac.

Variables that you defined in one cell can later be used in other cells:

In [None]:
seconds_in_a_week = 7 * seconds_in_a_day
seconds_in_a_week

604800

Note that the order of execution is important.
For instance, had we not run the cell that computes and stores `seconds_in_a_day` beforehand, executing the above cell would have raised an error, as it depends on this variable.

To make sure that you run all the cells in the correct order, the easiest option is to click on "Runtime" in the top-level menu, then on the "Run all" item.

**Exercise.** Add a cell below this cell: click on this cell then click on "+ Code". In the new cell, compute the number of seconds in a year by reusing the variable `seconds_in_a_day`. Run the new cell.

## Python

Let us start with basic notions of the Python programming language.

### Numbers and arithmetic operations

Python lets you manipulate numeric values of various types:
* [integers](https://en.wikipedia.org/wiki/Integer_(computer_science)) (`int`),
* [booleans](https://en.wikipedia.org/wiki/Boolean_data_type) (`bool`),
* [real numbers](https://en.wikipedia.org/wiki/Floating-point_arithmetic) (`float`),
* [complex numbers](https://en.wikipedia.org/wiki/Complex_data_type) (`complex`).

On these numeric values, Python supports the usual arithmetic operators:

* `+` (addition),
* `-` (subtraction),
* `*` (multiplication),
* `/` (division),
* `**` (power),
* `//` (integer division).

In general, you do not need to explicitly state the type of a numeric value because Python infers it from the way the value is written.
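
As a small illustration of this type inference, the built-in `type()` function reports the type Python assigned to a value. The variable names and values below are arbitrary examples chosen for this sketch.

```python
# Each value is written differently, so Python infers a different type
an_int = 1           # no decimal point: int
a_float = 3.2        # decimal point: float
a_bool = True        # the keywords True and False: bool
a_complex = 1 + 2j   # the suffix j marks the imaginary part: complex

print(type(an_int))
print(type(a_float))
print(type(a_bool))
print(type(a_complex))
```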

In [None]:
# (a line starting with a hash in Python is a comment, it is not interpreted)
# let us declare an integer
x = 1
# and display its value, just by calling its name
x

1

In [None]:
# let us declare a float
y = 3.2
# and display its value
y

3.2

If you sum an integer and a real number, Python will automatically convert the integer to a float before doing the computation.

In [None]:
z = x + y
z

4.2

**Exercise.** Add code cells and try out each of the remaining arithmetic operations, with various values.

Ill-defined operations will raise an error, such as a division by zero.

In [None]:
wrong_comp = 14 / 0

ZeroDivisionError: division by zero

The "stop signal" on the left of the cell signals that the execution of this Python block has failed.

The error message provides you with (hopefully) useful information to know what went wrong.
If you do not know what the error means, or how to fix it, click on the "Search stack overflow" button and read from the suggested webpages. 

### Lists

A fundamental notion in programming is to operate not only on simple values, but also on collections of values (or more complex objects).

Python provides several data structures to hold such collections, the simplest being the list.

Lists are a container type for ordered sequences of elements.

Lists can be initialized empty

In [None]:
my_list = []

or with some initial elements

In [None]:
my_list = [1, 2, 3]

Lists have a dynamic size: new elements can be added (*appended*) at the end.

In [None]:
my_list.append(4)
my_list

[1, 2, 3, 4]

We can access individual elements of a list with their position or *index*.
Note that in Python, **indexing starts from 0**.

In [None]:
# access element at index 0 (hence the 1st value in the list)
my_list[0]

1

In [None]:
# access element at index 3 (hence the 4th value in the list)
my_list[3]

4

We can access "slices" of a list using `my_list[i:j]` where `i` is the start of the slice (included; again, indexing starts from 0) and `j` is the end of the slice (excluded). For instance:

In [None]:
my_list[1:3]

[2, 3]

Omitting the second index means that the slice should run until the end of the list.

In [None]:
my_list[1:]

[2, 3, 4]
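
Symmetrically, the first index can be omitted, in which case the slice starts at the beginning of the list. Here is a minimal sketch on a separate, made-up list (`numbers` is not part of the notebook's variables), which also shows that the end index is excluded.

```python
numbers = [10, 20, 30, 40]

first_two = numbers[:2]   # start omitted: from index 0 up to index 2 (excluded)
middle = numbers[1:3]     # index 1 included, index 3 excluded

print(first_two)
print(middle)
```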

We can check if an element is in the list using `in`.

In [None]:
5 in my_list

False

The length of a list can be obtained using the `len` function.

In [None]:
len(my_list)

4

### Strings

Computers can manipulate not only numeric values, but also text values.

Text values are stored in *strings*.
In Python, strings are enclosed in single quotes `'` or double quotes `"`.

In [None]:
string1 = "some text"
string2 = 'some other text'

Strings behave similarly to lists.
As such, we can access individual characters in exactly the same way, by specifying their index.

In [None]:
string1[3]

'e'

and similarly, we can take a slice of a string

In [None]:
string2[5:]

'other text'

String concatenation is performed using the `+` operator.

In [None]:
string1 + " " + string2

'some text some other text'
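
The similarity with lists extends to `len` and `in`, as this short sketch suggests (the variable `greeting` is introduced only for this example). One difference worth noting: on strings, `in` checks for a substring rather than a single element.

```python
greeting = "some text"

length = len(greeting)            # number of characters, the space included
has_text = "text" in greeting     # True if "text" appears somewhere in the string
first_word = greeting[:4]         # slice with the start omitted

print(length)
print(has_text)
print(first_word)
```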

### Conditionals

As their name indicates, conditionals are a way to execute code depending on whether a condition is `True` or `False`.

As in other languages, Python supports `if` and `else` but `else if` is contracted into `elif`, as the example below demonstrates.

We will use the `print()` function to print a message below the code cell at execution.

In [None]:
my_variable = 5
if my_variable < 0:
  print("negative")
elif my_variable == 0:
  print("null")
else: # my_variable > 0
  print("positive")

positive


Here `<` and `>` are the strict "less than" and "greater than" operators, while `==` is the equality operator (not to be confused with `=`, the variable assignment operator).

The operators `<=` and `>=` can be used for "less than or equal" and "greater than or equal" comparisons.
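
A comparison is itself an expression whose value is a boolean, so it can be stored or displayed like any other value. The sketch below uses an arbitrary variable `a`; it also uses `!=` ("not equal"), which is not mentioned above.

```python
a = 3                    # assignment: stores 3 in a

is_equal = a == 3        # equality test, evaluates to True
at_most = a <= 3         # "less than or equal", evaluates to True
strictly_more = a > 3    # strict comparison, evaluates to False
is_different = a != 4    # "not equal", evaluates to True

print(is_equal, at_most, strictly_more, is_different)
```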

Unlike many other languages, Python delimits blocks of code using indentation.

Here, we use 2-space indentation but many programmers also use 4-space indentation.
Either is fine as long as you are consistent throughout your code.
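
To see how indentation decides which lines belong to a block, consider this small sketch. The variable names and the threshold are made up for the illustration.

```python
messages = []
temperature = 25  # hypothetical value

if temperature > 30:
  messages.append("hot")          # indented: part of the if block
  messages.append("drink water")  # same indentation: same block
messages.append("done")           # not indented: runs whatever the condition

print(messages)
```

Since the condition is `False` here, both indented lines are skipped and only the last instruction runs.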

### Loops

Loops are a way to execute a block of code multiple times.
There are two main types of loops: `while` loops and `for` loops.

A `while` loop keeps running as long as its condition is `True`.

In [None]:
i = 0
while i < len(my_list):
  print(my_list[i])
  i += 1 # equivalent to i = i + 1

1
2
3
4


A `for` loop is executed once for each element in a sequence.

If the goal is simply to iterate over a list and manipulate each element in turn, we can do so directly as follows :

In [None]:
for element in my_list:
  print(element)

1
2
3
4


We can also iterate over a sequence of numbers, with the [range](https://docs.python.org/3.7/tutorial/controlflow.html#the-range-function) function.

The following code cell has the effect of iterating over the indices of the elements of `my_list`.

In [None]:
for i in range(len(my_list)):
  print(my_list[i])

1
2
3
4
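
To inspect what `range` produces, one option is to convert it to a list. This sketch uses arbitrary bounds; as with slices, the start is included and the end is excluded.

```python
from_zero = list(range(4))         # one argument: starts at 0, stops before 4
with_start = list(range(1, 4))     # start included, end excluded
with_step = list(range(0, 10, 3))  # optional third argument: the step

print(from_zero)
print(with_start)
print(with_step)
```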


### Functions

To improve code readability, it is common to separate the code into different blocks, responsible for performing precise actions: **functions**.

A function takes some inputs and processes them to return some outputs.

In [None]:
def square(x):
  return x ** 2

In [None]:
square(4)

16

In [None]:
def multiply(a, b):
  return a * b

In [None]:
multiply(3, 5)

15

In [None]:
# Functions can be composed.
square(multiply(3, 2))

36

To improve code readability, it is sometimes useful to explicitly name the arguments.

In [None]:
square(multiply(a=3, b=2))

36

**Exercise.** Write a `divide(a, b)` function. Compare the results returned by `divide(15, 3)`, `divide(3, 15)`, `divide(a=15, b=3)`, `divide(b=3, a=15)`.

### Exercises

**Exercise 1.** Using a conditional, write the [relu](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) function (used in neural networks) defined as follows

$\text{relu}(x) = \left\{
   \begin{array}{rl}
     x, & \text{if }  x \ge 0 \\
     0, & \text{otherwise }.
   \end{array}\right.$

In [None]:
def relu(x):
  # Write your function here
  return

relu(-3)

**Exercise 2.** Using a for loop, write a function that computes the [average](https://en.wikipedia.org/wiki/Average) of a list of values.

In [None]:
def average(vector):
  # write your function here
  return

my_vector = [12, 11, 9, 5, 7]
# the result should be 8.8
average(my_vector)


**Exercise 3.** Using a for loop and a conditional, write a function that returns the maximum value in a vector.

In [None]:
def vector_maximum(vector):
  # Write your function here
  return

**Bonus exercise.** If you still have time, write a function that sorts a list in ascending order (from smaller to bigger) using the [bubble sort](https://en.wikipedia.org/wiki/Bubble_sort) algorithm.

In [None]:
def bubble_sort(my_list):
  # Write your function here
  return

my_list = [1, -3, 3, 2]
# Should return [-3, 1, 2, 3]
bubble_sort(my_list)

### Going further

We covered some essentials of the Python programming language, but it has many more features.

If or when you have time, you can check out these resources :

* [Ultimate Python study guide](https://github.com/huangsam/ultimate-python)
* List of Python [tutorials](https://wiki.python.org/moin/BeginnersGuide/Programmers)
* Four-hour [course](https://www.youtube.com/watch?v=rfscVS0vtbw) on YouTube (you can skip the parts on PyCharm, files etc. if you only work on Colaboratory)
* [Automate the Boring Stuff with Python - Practical Programming for Total Beginners](https://automatetheboringstuff.com/) by Al Sweigart is Sylvain's (and many others') favorite. It is "written for office workers, students, administrators, and anyone who uses a computer to learn how to code small, practical programs to automate tasks on their computer."
The entire content of the book is available on its website: scroll down to the "Table of contents" and you will see links to each chapter.


## NumPy

Computers are designed so that it is more efficient to apply an operation on an array of numbers at once, in bulk, rather than on each number in turn.
Pure Python code, using only the Python standard library, does not enable you to do that.

### Gaining functionalities with libraries

This can be remedied by using an additional [software library](https://en.wikipedia.org/wiki/Library_(computing)), which is, roughly speaking, a collection of code that provides functionalities to perform operations on a given task or domain.
You might have heard, or will hear about, libraries dedicated to machine learning such as [scikit-learn](https://en.wikipedia.org/wiki/Scikit-learn) or Google's [TensorFlow](https://en.wikipedia.org/wiki/TensorFlow) to build neural networks.

The most widely used library in Python to store arrays of numbers and perform computations on them is [NumPy](https://en.wikipedia.org/wiki/NumPy).
NumPy is a library for scientific computing, providing optimized data structures and operations, mathematical functions, linear algebra routines, etc.

NumPy's optimized implementation in [C](https://en.wikipedia.org/wiki/C_(programming_language)) lets you benefit from this bulk processing, which means your operations on numbers can run faster than in pure Python.


To use NumPy in your notebook or program, you need to import it as follows

In [None]:
import numpy as np

Here we imported the NumPy library, and gave it a shorter name `np` that we can use whenever we need to use one of its functionalities.

### Array creation

NumPy arrays can be created from Python lists

In [None]:
my_array = np.array([1, 2, 3])
my_array

array([1, 2, 3])

NumPy supports arrays of arbitrary dimension. 

For example, we can create two-dimensional arrays (e.g. to store a matrix) as follows

In [None]:
my_2d_array = np.array([[1, 2, 3], [4, 5, 6]])
my_2d_array

array([[1, 2, 3],
       [4, 5, 6]])

We can access individual elements of a 2d-array using two indices

In [None]:
my_2d_array[1, 2]

6

We can also access rows

In [None]:
my_2d_array[1]

array([4, 5, 6])

and columns

In [None]:
my_2d_array[:, 2]

array([3, 6])

NumPy arrays are [Python objects](https://docs.python.org/3/reference/datamodel.html).
Roughly speaking, an object is a rich structure that can model an entity, a collection of entities, or relations in the real world.
An object can have attributes, i.e. variables (with a name and value) that are attached to the object and can be accessed with the dot notation.

For instance, NumPy arrays have a `shape` attribute.

In [None]:
print(my_array.shape)
print(my_2d_array.shape)

(3,)
(2, 3)
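
Other attributes are read in the same way. As an illustration, the sketch below reads `ndim` (the number of dimensions) and `size` (the total number of elements) on a made-up array named `grid`.

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6]])

print(grid.shape)  # (number of rows, number of columns)
print(grid.ndim)   # number of dimensions
print(grid.size)   # total number of elements
```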


Unlike Python lists, NumPy arrays must have a type, and all elements of the array must have the same type.
Here, `dtype` is another attribute of the array.

In [None]:
my_array.dtype

dtype('int64')

The main types are `int32` (32-bit integers), `int64` (64-bit integers), `float32` (32-bit real values) and `float64` (64-bit real values).
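
One consequence of the single-type rule can be sketched as follows: when the input list mixes integers and floats, NumPy picks a type able to hold all of them, so the integers are stored as floats. The array `mixed` is a made-up example.

```python
import numpy as np

# The list mixes two integers and one float
mixed = np.array([1, 2.5, 3])

print(mixed)        # every element is displayed as a float
print(mixed.dtype)  # a single floating-point type for the whole array
```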

The `dtype` can be specified when creating the array

In [None]:
my_array = np.array([1, 2, 3], dtype=np.float64)
my_array.dtype

dtype('float64')

We can create arrays of all zeros using

In [None]:
zero_array = np.zeros((2, 3))
zero_array

array([[0., 0., 0.],
       [0., 0., 0.]])

and similarly for all ones using `ones` instead of `zeros`.

**Exercise.** Create an array filled with ones.

We can create a range of values using the [arange](https://numpy.org/doc/stable/reference/generated/numpy.arange.html) function

In [None]:
np.arange(5)

array([0, 1, 2, 3, 4])

If a single argument is provided, the starting value is assumed to be 0, but you can specify the starting (included) and ending (excluded) values

In [None]:
np.arange(3, 5)

array([3, 4])

Another useful routine is `linspace` for creating linearly spaced values in an interval. For instance, to create 10 values in `[0, 1]`, we can use

In [None]:
np.linspace(0, 1, 10)

array([0.        , 0.11111111, 0.22222222, 0.33333333, 0.44444444,
       0.55555556, 0.66666667, 0.77777778, 0.88888889, 1.        ])

Python objects, including NumPy arrays, also have methods, i.e. functions that are attached, or "belong", to an object.

The differences between methods and "simple" functions are very minor from our point of view, so you just need to know that calling a method is also done via the dot notation.

For instance, another important operation on NumPy arrays is [reshape](https://numpy.org/doc/stable/user/quickstart.html?highlight=reshape#changing-the-shape-of-an-array), for changing the shape of an array

In [None]:
# 1d array
my_array = np.array([1, 2, 3, 4, 5, 6])
my_array

array([1, 2, 3, 4, 5, 6])

In [None]:
# reshape into a 3 rows, 2 cols array
# we call the reshape() method on the my_array object
my_array.reshape(3, 2)

array([[1, 2],
       [3, 4],
       [5, 6]])
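
To illustrate how close methods and functions are, NumPy also offers `np.reshape` as a plain function that takes the array as its first argument. The sketch below, on a made-up array `values`, checks that both forms give the same result and that the original array keeps its shape.

```python
import numpy as np

values = np.arange(6)

as_method = values.reshape(3, 2)          # method call, dot notation
as_function = np.reshape(values, (3, 2))  # function call, array passed as argument

same_result = np.array_equal(as_method, as_function)
print(same_result)
print(values.shape)  # reshape returns a new array object, values is still 1d
```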

You can play with these operations and make sure you understand them well.

### Basic operations

In NumPy, we express computations directly over arrays. This makes the code much more succinct.

Arithmetic operations can be performed directly over arrays. For instance, assuming two arrays have a compatible shape, we can add them as follows

In [None]:
array_a = np.array([1, 2, 3])
array_b = np.array([4, 5, 6])
array_a + array_b

array([5, 7, 9])
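
Here is a short sketch of what "compatible shape" can mean in practice, using a made-up array `base`. A single number is compatible with any array and is applied to every element, whereas two 1d arrays of different lengths (here 3 and 2) are not compatible and raise a `ValueError`. The `try` / `except` construct, not covered in this notebook, is used only to catch that error.

```python
import numpy as np

base = np.array([1, 2, 3])
doubled = base * 2    # the scalar 2 is applied to every element
shifted = base + 10   # same for addition

print(doubled)
print(shifted)

# Arrays of lengths 3 and 2 cannot be combined element-wise
try:
  base + np.array([1, 2])
  shapes_are_compatible = True
except ValueError as error:
  shapes_are_compatible = False
  print("ValueError:", error)
```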

Compare this with the equivalent computation using a for loop

In [None]:
# create an array with the same shape as array_a, but filled with 0s
array_out = np.zeros_like(array_a)
# sum element-wise
for i in range(len(array_a)):
  array_out[i] = array_a[i] + array_b[i]
array_out

array([5, 7, 9])

Not only is this code more verbose, it will also run much more slowly.
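
One way to check the speed difference on your own machine is the `timeit` module from the standard library. The following is a rough sketch rather than a rigorous benchmark: the array size, the number of repetitions and the helper names (`add_with_loop`, `add_vectorized`) are arbitrary choices, and the measured times will vary from one run to the next.

```python
import timeit
import numpy as np

big_a = np.arange(100_000)
big_b = np.arange(100_000)

def add_with_loop(a, b):
  # element-wise sum written with an explicit for loop
  out = np.zeros_like(a)
  for i in range(len(a)):
    out[i] = a[i] + b[i]
  return out

def add_vectorized(a, b):
  # element-wise sum expressed directly over the arrays
  return a + b

# timeit.timeit calls the given function `number` times and returns the total time in seconds
loop_time = timeit.timeit(lambda: add_with_loop(big_a, big_b), number=3)
vector_time = timeit.timeit(lambda: add_vectorized(big_a, big_b), number=3)

print("for loop:  ", loop_time, "seconds")
print("vectorized:", vector_time, "seconds")
```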

In NumPy, functions that operate on arrays in an element-wise fashion are called [universal functions](https://numpy.org/doc/stable/reference/ufuncs.html). For instance, this is the case of `np.sin`

In [None]:
np.sin(array_a)

array([0.84147098, 0.90929743, 0.14112001])

The [inner product](https://en.wikipedia.org/wiki/Dot_product) of two vectors can be computed using `np.dot`

In [None]:
np.dot(array_a, array_b)

32

#### Advanced notions (optional)
When the two arguments to `np.dot` are both 2d arrays, `np.dot` becomes matrix multiplication

In [None]:
array_A = np.random.rand(5, 3)
array_B = np.random.randn(3, 4)
np.dot(array_A, array_B)

array([[ 0.0869419 , -1.59793116,  0.79760988, -3.08597472],
       [-0.06744098, -1.80575593,  0.87360006, -3.39354257],
       [-0.07784893, -1.43986958,  0.74358606, -2.86826416],
       [-0.4351405 , -1.98194074,  0.95437821, -3.71338019],
       [-0.24484233, -1.33190555,  0.61521527, -2.40476818]])
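
Because the example above uses random values, its output changes at every run. For a result that can be checked by hand, here is a sketch with two small made-up matrices, `left` and `right`. It also uses the `@` operator, which for 2d arrays computes the same matrix product as `np.dot`.

```python
import numpy as np

left = np.array([[1, 2], [3, 4], [5, 6]])  # shape (3, 2)
right = np.array([[1, 0, 2], [0, 1, 3]])   # shape (2, 3)

# The number of columns of left must equal the number of rows of right;
# the result has as many rows as left and as many columns as right
product = np.dot(left, right)

print(product.shape)
print(product)
print(np.array_equal(product, left @ right))
```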

Matrix transpose can be done using `.transpose()` or `.T` for short

In [None]:
array_A.T

array([[0.84254571, 0.8729602 , 0.76196144, 0.87053858, 0.55601909],
       [0.40069714, 0.48700109, 0.53893757, 0.76891072, 0.40953449],
       [0.36015303, 0.53530548, 0.19596679, 0.59329035, 0.52866948]])

### Slicing and masking

Like Python lists, NumPy arrays support slicing.

In [None]:
np.arange(10)[5:]

array([5, 6, 7, 8, 9])

We can also select only certain elements of an array with a boolean mask: an array of the same shape whose entries are `True` where a condition holds and `False` elsewhere.

In [None]:
# create the integers from 0 to 9 (10 is excluded)
x = np.arange(10)
# build a mask that is True for all values in x greater than or equal to 5
mask = x >= 5
# display the mask: False for values 0 to 4 included, True for 5 and up
mask

array([False, False, False, False, False,  True,  True,  True,  True,
        True])

In [None]:
# apply the mask to x itself
x[mask]

array([5, 6, 7, 8, 9])
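
The mask does not need to be stored in a variable first. As a minimal sketch on a made-up array `numbers`, the condition can be written directly inside the brackets; summing a mask counts its `True` entries, since `True` counts as 1 and `False` as 0.

```python
import numpy as np

numbers = np.arange(10)

large = numbers[numbers >= 5]    # condition written directly inside the brackets
how_many = (numbers >= 5).sum()  # number of elements that satisfy the condition

print(large)
print(how_many)
```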

### Exercises

**Exercise 1.** Create a 3d array of shape (2, 2, 2), containing 8 values. Access individual elements and slices.

**Exercise 2.** Rewrite the relu function (see Python section) using [np.maximum](https://numpy.org/doc/stable/reference/generated/numpy.maximum.html). Check that it works on both a single value and on an array of values.

In [None]:
def relu_numpy(x):
  # write your function here
  return

relu_numpy(np.array([1, -3, 2.5]))

**Exercise 3.** Rewrite the average of a vector (1d array) using NumPy (without for loop), with [np.sum](https://numpy.org/doc/stable/reference/generated/numpy.sum.html) and the shape of the array.

In [None]:
def average_numpy(vector):
  # write your function here
  return

my_vector = np.array([12, 11, 9, 5, 7])
# the result should be 8.8
average_numpy(my_vector)

Compare with what you get with [np.mean](https://numpy.org/doc/stable/reference/generated/numpy.mean.html)

### Going further

* [Scientific Computing in Python: Introduction to NumPy and Matplotlib](https://sebastianraschka.com/blog/2020/numpy-intro.html)
* [NumPy reference](https://numpy.org/doc/stable/reference/)
* [SciPy lectures](https://scipy-lectures.org/)
* One-hour [tutorial](https://www.youtube.com/watch?v=QUT1VHiLmmI) on Youtube