<table style="float:left">
    <tr>
        <td>
            <img src="images/emlyon.png" style="height:60px; float:left; padding-right:10px; margin-top:5px" />
        </td>
        <td style="padding-bottom:10px;">
            <h1 style="border-bottom: 1px solid #eeeeee;"> AI Booster Week 01 - Python for Data Science </h1>
            <span style="display:inline-block; margin-top:-15px;">
            <a href="https://masters.em-lyon.com/fr/msc-in-data-science-artificial-intelligence-strategy">[Emlyon]</a> MSc in Data Science & Artificial Intelligence Strategy (DSAIS)    
            <br/>
            Sep 2024, Paris | © Saeed VARASTEH
            </span>
        </td>
    </tr>
</table>

### Data Structures 
<br/>
<div class="alert-info"> 
A data structure models a collection of data, such as a list of numbers, a row in a spreadsheet, or a record in a database.
</div>

### Tuples

___

Tuples are sequences.

A sequence is an ordered list of values. Each element in a sequence is assigned an integer, called an __index__, that determines the order in which the values appear. 

Just like strings, the index of the first value in a sequence is 0.

In [None]:
full_details = ("John","Doe",25)

<div class="alert-info"> 
Tuples and strings have a lot in common. The main difference is that strings are sequences of characters, whereas tuples may contain <b>any type of value</b>, including values of different types.
</div>

Tuples Support Indexing and Slicing:

In [None]:
full_details[1]

<div class="alert-danger"> 
    Like strings, tuples are <b>immutable</b>. This means you can’t change the value of an element of a tuple once it has been created.
</div>
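As a small illustration of immutability, the sketch below tries to assign to an element of a tuple and catches the resulting error. The name `updated_details` is only an illustrative choice; the usual workaround is to build a new tuple rather than modify the old one.

```python
full_details = ("John", "Doe", 25)

try:
    # Item assignment is not supported on tuples, so this raises TypeError.
    full_details[0] = "Jane"
except TypeError as error:
    print(f"TypeError: {error}")

# To "change" a tuple, build a new one from pieces of the old one.
updated_details = ("Jane",) + full_details[1:]
print(updated_details)  # ('Jane', 'Doe', 25)
```

The original tuple is left untouched; only the new variable holds the modified data.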

Tuples Are Iterable (you can loop over them):

In [None]:
for item in full_details:
    print(item)

<div class="alert-info"> 
    Checking Existence of Values With <b>in</b>
</div>

In [None]:
if "Bob" in full_details:
    print('Bob is there')
else:
    print('Nope')

### Lists

___

The list data structure is another sequence type in Python.

Just like strings and tuples, lists contain items that are indexed by integers, starting with 0.

A list literal looks almost exactly like a tuple literal, except that it is surrounded with square brackets [ and ] instead of parentheses:

In [None]:
mylist = [1, "John", "Red"]

In [None]:
mylist

Indexing and slicing operations work on lists the same way they do on tuples:

In [None]:
mylist[1:]

<div class="alert-danger"> 
    Unlike tuples, however, lists are <b>mutable</b>, meaning you can change the value at an index even after the list has been created.
</div>

In [None]:
mylist[1] = 'Bob'

In [None]:
mylist

#### List methods

List methods provide a more natural and readable way to mutate a list.

The __list.insert()__ method is used to insert a single new value into a list. It takes two parameters, an index i and a value x, and inserts the value x at index i in the list:

In [None]:
mylist2 = []

In [None]:
mylist2.insert(0, 'A')

In [None]:
mylist2

In [None]:
mylist2.insert(1, 'B')

In [None]:
mylist2

The __list.pop()__ method takes one optional parameter, an index i, and removes the value from the list at that index. If no index is given, the last value is removed. The value that is removed is returned by the method:

In [None]:
out = mylist2.pop(0)

In [None]:
out

In [None]:
mylist2

The __list.append()__ method is used to append a new element to the end of a list:

In [None]:
mylist2.append('C')

In [None]:
mylist2

#### Nesting Lists 

 A nested list is a list that is contained as a value in another list.

In [None]:
my_nested_list = [[1, 2], [3, 4]]

In [None]:
my_nested_list

In [None]:
my_nested_list[1]

<div class="alert-info"> 
You can use <b>double index</b> notation to access an element in the nested list.
</div>

In [None]:
my_nested_list[1][0]

<div class="alert-warning"> 
Readers interested in data analysis or scientific computing may recognize lists of lists as a sort of <b>matrix</b> of values.
</div>

#### Sorting Lists

Lists have a __.sort()__ method that sorts all of the items in ascending order. The sort happens in place: the list itself is modified and the method returns None.

In [None]:
numbers = [1, 10, 5, 3]

In [None]:
numbers.sort()

In [None]:
numbers

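The __.sort()__ method also accepts an optional __key__ parameter, a function that is applied to each item to compute the value used for comparison. For example, to sort a list of strings by the length of each string, you can pass the len function to __key__: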

In [None]:
colors = ["red", "yellow", "green", "blue"]

In [None]:
colors.sort(key=len)

In [None]:
colors
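If the original list must stay unchanged, the built-in `sorted()` function is an alternative to `.sort()`. The sketch below is only an illustration; the variable names `by_length` and `descending` are arbitrary.

```python
colors = ["red", "yellow", "green", "blue"]

# sorted() returns a new list and leaves the original untouched.
by_length = sorted(colors, key=len)

# reverse=True sorts in descending order (here: reverse alphabetical).
descending = sorted(colors, reverse=True)

print(colors)      # ['red', 'yellow', 'green', 'blue']
print(by_length)   # ['red', 'blue', 'green', 'yellow']
print(descending)  # ['yellow', 'red', 'green', 'blue']
```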

### Dictionaries

___


One of the most useful data structures in Python is the __dictionary__.

Python dictionaries, like lists and tuples, store a collection of objects.

However, instead of storing objects in a sequence, dictionaries hold information in pairs of data called __key-value pairs__. 

That is, each object in a dictionary has two parts: a __key__ and a __value__.

The __key__ in a key-value pair is a unique name that identifies the value part of the pair.

The following code creates a dictionary literal containing names of countries and their capitals:

In [None]:
capitals = {
"France": "Paris",
"Spain": "Madrid",
"Italy": "Rome",
}

Notice that each key is separated from its value by a colon (:), each key-value pair is separated by a comma (,), and the entire dictionary is enclosed in curly braces { and }.

In [None]:
capitals

__Accessing Dictionary Values__ To access a value in a dictionary, write the corresponding key in square brackets [ and ] after the dictionary literal or after the name of a variable assigned to a dictionary:

In [None]:
capitals['France']
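Square-bracket access assumes the key exists. The following sketch shows what happens with a missing key and how `dict.get()` can be used instead; the key `"Portugal"` and the fallback string `"unknown"` are made up for the example.

```python
capitals = {"France": "Paris", "Spain": "Madrid", "Italy": "Rome"}

try:
    # Looking up a key that is not in the dictionary raises KeyError.
    capitals["Portugal"]
except KeyError as error:
    print(f"KeyError: {error}")

# .get() returns None (or a default you supply) instead of raising.
missing = capitals.get("Portugal")
fallback = capitals.get("Portugal", "unknown")
print(missing, fallback)
```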

<div class="alert-warning"> 
    <b>How are dictionaries fundamentally different from sequence types like lists and tuples?</b>
        
Values in a sequence type are accessed by index, which is an integer value expressing the order of items in the sequence.

On the other hand, items in a dictionary are accessed by a key, which doesn’t define any kind of order, but just provides a label that can be used to reference the value.
</div>

__Adding and Removing Values in a Dictionary__ You can add and remove items from a dictionary. Like lists, dictionaries are mutable data structures.

Let’s add a new capital to the capitals dictionary:

In [None]:
capitals['Germany'] = "Berlin"

In [None]:
capitals

To remove an item from a dictionary, use the del keyword with the key for the value you want to delete:

In [None]:
del capitals["Spain"]

In [None]:
capitals

__Checking the Existence of Dictionary Keys__

You can check that a key exists in a dictionary using the __in__ keyword:

In [None]:
if "Spain" in capitals.keys():
    print( capitals["Spain"] )
else:
    print( "Nope" )

__Iterating Over Dictionaries__ Like lists and tuples, dictionaries are iterable. However, looping over a dictionary is a bit different than looping over a list or tuple.

When you loop over a dictionary with a for loop, you iterate over the __dictionary’s keys__:

In [None]:
for key in capitals:
    print(key)

In [None]:
for country in capitals:
    print(f"The capital of {country} is {capitals[country]}")

However, there is a slightly more succinct way to do this:

In [None]:
capitals.items()

In [None]:
for country, capital in capitals.items():
    print(f"The capital of {country} is {capital}")

### Sets

___

Finally, a short note on sets. Sets are used to store multiple items in a single variable.

A set is a collection that is unordered and unindexed. The items in a set cannot be changed in place, but you can add and remove items.

Sets cannot have two items with the same value.

In [None]:
myset = {"apple", "banana", "cherry"}

In [None]:
myset
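A minimal sketch of the behaviour described above: duplicate values are dropped when the set is created, and items can still be added or removed afterwards. The set name `fruits` and its contents are illustrative only.

```python
# "apple" appears twice in the literal but is stored only once.
fruits = {"apple", "banana", "cherry", "apple"}
print(len(fruits))  # 3

fruits.add("orange")      # add a new item
fruits.discard("banana")  # remove an item (no error if it is absent)

print("orange" in fruits)  # True
print("banana" in fruits)  # False
```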

---

__How to Pick a Data Structure?__

Use a __list__ when:
- Data has a natural order to it
- You will need to update or alter the data during the program
- The primary purpose of the data structure is iteration

Use a __tuple__ when:
- Data has a natural order to it
- You will not need to update or alter the data during the program
- The primary purpose of the data structure is iteration

Use a __dictionary__ when:
- The data is unordered, or the order does not matter
- You will need to update or alter the data during the program
- The primary purpose of the data structure is looking up values


### List Comprehensions

___

A way to create a list from an existing iterable is with a list comprehension:

In [None]:
numbers = (1, 2, 3, 4, 5)

squares = [num**2 for num in numbers]

In [None]:
squares

A list comprehension is a short-hand for a for loop. In the example above, a tuple literal containing five numbers is created and assigned to the numbers variable. On the second line, a list comprehension loops over each number in numbers, squares each number, and adds it to a new list called squares.

A traditional for loop is:

In [None]:
squares = []
for num in numbers:
    squares.append(num**2)

In [None]:
squares

<div class="alert-warning"> 
    List comprehensions are commonly used to convert values in one list to a different type.
</div>

For instance, converting a list of strings containing floating point values to a list of float objects:

In [None]:
str_numbers = ["1.5", "2.3", "5.25"]
float_numbers = [float(value) for value in str_numbers]

In [None]:
float_numbers
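A list comprehension can also carry a condition that filters the items it loops over. The sketch below reuses the `numbers` tuple from earlier and keeps only the squares of even numbers; the name `even_squares` is just for illustration.

```python
numbers = (1, 2, 3, 4, 5)

# The trailing "if" keeps only the items for which the condition is True.
even_squares = [num**2 for num in numbers if num % 2 == 0]
print(even_squares)  # [4, 16]

# The equivalent traditional loop:
even_squares_loop = []
for num in numbers:
    if num % 2 == 0:
        even_squares_loop.append(num**2)
```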

### Modules and Packages

___

A module is a file containing Python code that can be re-used in other Python code files.

<br/>
<div class="alert-info"> 
Check the following four variations for importing modules from packages:
</div>

    import [package]
    import [package] as [other_name]
    from [package] import [module]
    from [package] import [module] as [other_name]
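As a concrete sketch, here are the four variations applied to the standard library's `math` module and its `sqrt` function. The aliases `m` and `square_root` are arbitrary names chosen for the example.

```python
import math                            # use as math.sqrt(...)
import math as m                       # use as m.sqrt(...)
from math import sqrt                  # use as sqrt(...)
from math import sqrt as square_root   # use as square_root(...)

# All four names refer to the same function.
print(math.sqrt(16), m.sqrt(16), sqrt(16), square_root(16))
```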
    
    
#### Python Libraries (Standard Libraries)

The so called "Standard Library" doesn't need to be installed separately and it includes modules like copy, os, math, time, random, and shutil.

If you want to look at all inbuilt packages, here's the official list: [Check this out](https://docs.python.org/3/library/)

#### Math

The math package includes - you guessed it - mathematical stuff: it provides mathematical constants, e.g. pi or e, as well as mathematical functions, like log or cos, and also some utilities like ceil and floor. Here are some examples:

In [None]:
import math

print(math.log(5, 2))

In [None]:
print(math.floor(2.6))

In [None]:
print(math.ceil(2.6))

#### Time

The time package deals with time and timing-related things (e.g. timing how fast a function is, but also which year it is). If you're wondering why there's this entire module as well as the datetime and calendar packages; it's because time is quite a tricky thing to properly describe in programming.

In [None]:
import time

print(time.time())
# Output: These are the seconds that have passed since January 1st 1970 00:00:00

In [None]:
# The script will now pause for 2 seconds
time.sleep(2)
print("Hi")

In [None]:
# Here's how we can time how long a function takes:
def my_function():
    for n in range(1000000):
        n += 2 * n

start_time = time.time()
my_function()

print("{:.4f} seconds".format(time.time() - start_time))
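For measuring short durations, the standard library also offers `time.perf_counter()`, a clock intended for timing code. The sketch below is a variation on the example above, not a replacement for it; the function accumulates a `total` and returns it so that the loop has an observable result.

```python
import time

def my_function():
    total = 0
    for n in range(1_000_000):
        total += 2 * n
    return total

start = time.perf_counter()
result = my_function()
elapsed = time.perf_counter() - start  # elapsed time in seconds

print(f"{elapsed:.4f} seconds")
```

The measured time varies between runs and machines, so no fixed output is shown.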

You can do the same measurement using the `%%time` command.

`%%time` is a so-called "magic" command of `IPython`.

In [None]:
%%time

def my_function():
    for n in range(1000000):
        n += 2 * n

my_function()

#### External Libraries

__External packages__ need to be installed before using them. These various packages are not included with Python by default.

Many programming languages offer a __package manager__ that automates the process of installing, upgrading, and removing third-party packages. Python is no exception.

The package manager for Python is called __pip__.


#### Installing Third-Party Packages With Pip

__pip__ is a separate program and a command line tool. That means you must run it from a command line or terminal program.

With your terminal program open, type in the following command to check whether or not pip is installed on your system:

<br/>

<div class="alert-danger"> 
Before doing anything, after you open the terminal, activate your conda environment using this command:
    <b>conda activate</b>
</div>

<code>
pip --version
</code>

#### Upgrading Pip

To upgrade pip, type the following into your terminal and press Enter:

<code>
pip install --upgrade pip
</code>

#### Listing All Installed Packages

To list all of the packages you have installed:

<code>
pip list
</code>

#### Installing a Package

Let’s install your first Python package! In your terminal, type the following:

<code>
pip install numpy
</code>

#### Installing Specific Package Versions

You can pin dependencies to a specific version with the == version specifier:

<code>
pip install numpy==1.16.5
</code>

#### Uninstalling a Package

Finally, to uninstall a package, type the following into your terminal:

<code>
pip uninstall numpy
</code>

---

<div class="alert-info" style="border-bottom: solid 1px lightgray; background-color:#f0ffff;">
    <img src="images/self.png" style="height:60px; float:left; padding-right:10px;" />
    <span style="font-weight:bold; color:#1a8a8a">
        <h4 style="padding-top:25px;"> SELF-STUDY </h4>
    </span>
</div>

### Numerical Computing with NumPy

In Python we can use __lists__ to store and manipulate sequences of objects, any objects.

While that is very convenient for us it comes at a cost of time and memory.

In this example we create 1,000,000 integers:

In [None]:
import random 
measurements = [random.randint(150, 200) for _ in range(1_000_000)]

In [None]:
measurements[:10]

and compute their mean:

In [None]:
list_time = %timeit -o sum(measurements) / len(measurements)

Because Python doesn't know that our list only contains integers, it has to check every time it adds values together whether the objects actually support addition. That's why __sum__ takes "so long".

If we could tell the interpreter that we are only adding integers, we could skip all that typechecking and speed up the operation.

For this purpose, __numpy__ was invented.  

To use numpy we have to import it. The import is usually aliased as __np__ so we have to type less later on.

In [None]:
import numpy as np

Numpy's standard datatype is the __ndarray__ (which stands for n-dimensional array). In the simplest case, a numpy array can be created from a list.

In [None]:
measurements_array = np.array(measurements)

In [None]:
measurements_array

Now we can use the built-in __mean__ function provided by Numpy:

In [None]:
numpy_time = %timeit -o np.mean(measurements_array)

As we can see, using Numpy significantly speeds up our computation, making it

In [None]:
print(f"{list_time.average / numpy_time.average} times faster")

#### Numpy Arrays 

#### creating arrays from lists

As we already saw, we can create Numpy arrays from lists.

In [None]:
np.array([10,2,35])

#### creating arrays using Numpy functions

Whenever we don't want to create an array from specific values like [10, 2, 35] we can use Numpy's utility functions.

These are also faster than Python's built-in functions (for example, __np.arange__ works like __range__).

In [None]:
list(range(5))

In [None]:
np.arange(start=2, stop=14, step=2)

#### np.linspace

Creating an array with a certain number of values in a certain interval:

In [None]:
np.linspace(start=-5, stop=5, num=9)

#### np.zeros and np.ones

Creating an array with only zeros or ones:

In [None]:
np.zeros(3), np.ones(3)

Both take a `shape` argument that lets us create multidimensional arrays.

In [None]:
np.zeros((2, 2))

In [None]:
np.ones((2,3))

#### Anatomy of arrays

#### dtype

Returns data type of the array. Arrays can contain bools, ints, unsigned ints, floats or complex numbers of various byte sizes.

They can also store strings or Python objects, but that has very few use cases.

In [None]:
values = [0, 1, 2, 3, 4]
int_arr = np.array(values, dtype='int')
int_arr, int_arr.dtype

If the dtype does not match the given values, numpy will cast everything to that data type.

In [None]:
bool_arr = np.array(values, dtype='bool')
bool_arr, bool_arr.dtype

If no explicit data type is given, numpy will choose the "smallest common denominator". <br>
In the following example, everything becomes a float, as ints can be represented as floats, but not vice versa.

In [None]:
values = [0, 1, 2.5, 3, 4]
float_arr = np.array(values)
float_arr, float_arr.dtype
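An existing array can also be converted to another data type with the `.astype()` method, which returns a new array. The sketch below assumes the same values as the cell above; note that converting floats to integers drops the fractional part.

```python
import numpy as np

float_arr = np.array([0, 1, 2.5, 3, 4])

# astype() returns a converted copy; the fractional part of 2.5 is dropped.
int_arr = float_arr.astype(int)

print(float_arr, float_arr.dtype)
print(int_arr, int_arr.dtype)
```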

#### shape and ndim

__.shape__ is very important for keeping track of arrays with more than one dimension. It is a tuple with the number of elements in each dimension.

__.ndim__ is just the number of dimensions in total. 

#### 1D

In [None]:
values = [1, 2, 3, 4]
one_dim_arr = np.array(values)
one_dim_arr

In [None]:
one_dim_arr.shape

In [None]:
one_dim_arr.ndim

#### 2D

In [None]:
values = [[1, 2, 3, 4, 5],
          [1, 2, 3, 4, 1],
          [1, 2, 3, 4, 2]]
two_dim_arr = np.array(values)
two_dim_arr

In [None]:
two_dim_arr.shape

In [None]:
two_dim_arr.ndim

In [None]:
two_dim_arr[1, 3]

In [None]:
two_dim_arr[1,3] = 10
two_dim_arr

#### 3D

In [None]:
values = [[[1, 2, 3, 4]] * 3] * 6
three_dim_arr = np.array(values)
three_dim_arr

In [None]:
three_dim_arr.shape

In [None]:
three_dim_arr.ndim

In [None]:
three_dim_arr[1,1,1]

#### Numpy Functions - what can we do with arrays 

#### reshape

In [None]:
a = np.arange(start=2, stop=14)
print(a.shape)
a

In [None]:
a.reshape(3, 4)

<div class="alert-info"> 
Passing -1 for one dimension tells Numpy to infer the size of that dimension from the total number of elements and the sizes of the other dimensions.
</div>

In [None]:
a.reshape(2, -1)

In [None]:
a.reshape(2, -1).shape
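The sketch below shows how the inferred dimension depends on the other one, and what happens when the number of elements cannot be divided evenly. The names `auto_cols` and `auto_rows` are illustrative.

```python
import numpy as np

a = np.arange(start=2, stop=14)  # 12 elements

auto_cols = a.reshape(3, -1)   # 3 rows, columns inferred: 12 / 3 = 4
auto_rows = a.reshape(-1, 6)   # 6 columns, rows inferred: 12 / 6 = 2
print(auto_cols.shape, auto_rows.shape)  # (3, 4) (2, 6)

try:
    # 12 elements cannot be split into 5 equal rows.
    a.reshape(5, -1)
except ValueError as error:
    print(f"ValueError: {error}")
```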

#### Comparing Arrays

In [None]:
a = np.zeros((3,3))
b = np.zeros((3,3))

a == b

In [None]:
a[0,0] += 0.000000000000001
a == b

In [None]:
np.isclose(a, b)
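The reason `np.isclose` is useful is that floating point arithmetic introduces tiny rounding errors, so values that should be equal may differ in their last digits. A minimal sketch, using the classic `0.1 + 0.2` case as an assumed example:

```python
import numpy as np

print(0.1 + 0.2 == 0.3)  # False, because of floating point rounding

a = np.array([0.1 + 0.2, 1.0])
b = np.array([0.3, 1.0])

exact = a == b                # element-wise exact comparison
close = np.isclose(a, b)      # element-wise comparison within a tolerance
all_close = np.allclose(a, b) # a single bool: are all elements close?

print(exact)      # [False  True]
print(close)      # [ True  True]
print(all_close)  # True
```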

#### Mathematical operations

Numpy contains a lot of mathematical functions that operate on arrays in a vectorized manner. That means that they are applied to each element, without explicit for-loops. Vectorized functions are called __ufuncs__ (universal functions) in Numpy.

In [None]:
arr = np.arange(1, 10)
arr

In [None]:
arr * 3

In [None]:
arr * arr

In [None]:
arr + (arr*2)

In [None]:
arr - arr

In [None]:
arr / arr

In [None]:
arr ** 2

Using `@` you can even do matrix multiplication. In the case of 1d arrays, this is the inner product between two vectors.

In [None]:
arr @ arr

In [None]:
# That's the same as
np.sum(arr * arr)

In [None]:
np.log(arr)

In [None]:
np.log2(arr)

In [None]:
np.exp(arr)

In [None]:
np.sin(arr)

#### Aggregation functions

Aggregation functions reduce the dimensionality of an array, for example by collapsing all values along one dimension into a single value. They provide an __axis__ argument to specify which dimension to reduce.

In [None]:
two_dim_arr = np.random.randint(0, high=20, size=(3, 4))
two_dim_arr

If just the array is passed, the aggregation operation is performed over the whole array.

In [None]:
np.min(two_dim_arr)

The optional `axis` argument allows us to specify which dimension should be aggregated. You can think of it as the operation being applied to all entries that are obtained by keeping the indices in all dimensions fixed except for the `axis` dimension.
Let's look at the result of the minimum operation with `axis=0`:

In [None]:
np.min(two_dim_arr, axis=0)
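Since the array above is random, a small fixed array makes the effect of `axis` easier to check by hand. This is only a sketch; the array `grid` and its values are made up for the example.

```python
import numpy as np

grid = np.array([[4, 9, 2],
                 [7, 1, 8]])   # shape (2, 3)

# axis=0 collapses the rows: one minimum per column.
col_min = np.min(grid, axis=0)
# axis=1 collapses the columns: one minimum per row.
row_min = np.min(grid, axis=1)

print(col_min)  # [4 1 2]
print(row_min)  # [2 1]
```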

In [None]:
np.max(two_dim_arr)

In [None]:
np.sum(two_dim_arr)

In [None]:
np.average(two_dim_arr, axis=0)

#### Combining arrays

There are many ways to combine existing arrays, like `np.append`, `np.concatenate` and `np.stack`. However, these operations always require the whole array to be copied. Therefore, it often makes more sense to allocate an array of the size you need later upfront and then just fill the respective parts.
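The preallocation idea can be sketched as follows: allocate the full array once, then write each part into its slice. The chunk layout (`n_chunks`, `chunk_size`) and the use of `np.full` to produce the chunks are assumptions made purely for illustration.

```python
import numpy as np

n_chunks, chunk_size = 3, 4

# Allocate the final array once; np.empty does not initialise the values.
result = np.empty(n_chunks * chunk_size, dtype=int)

for i in range(n_chunks):
    chunk = np.full(chunk_size, i)  # stand-in for data computed per step
    # Write the chunk into its slice instead of appending to a growing array.
    result[i * chunk_size:(i + 1) * chunk_size] = chunk

print(result)  # [0 0 0 0 1 1 1 1 2 2 2 2]
```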

#### `concatenate`

In [None]:
a = np.arange(10)
b = np.arange(10)[::-1]

In [None]:
a

In [None]:
b

In [None]:
# Needs a sequence (in this case a tuple) of array-likes
np.concatenate((a,b))

#### `append`

Uses `concatenate` internally.

In [None]:
# Needs exactly two array-likes
np.append(a, b)

#### `stack`
For higher-dimensional arrays, other functions are useful

In [None]:
np.stack((np.arange(10), np.arange(10)))
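The difference between `np.stack` and `np.concatenate` is easiest to see in the resulting shapes: `stack` adds a new dimension, while `concatenate` joins along an existing one. A small sketch with shorter arrays than above (the names `rows`, `cols` and `flat` are illustrative):

```python
import numpy as np

a = np.arange(3)        # [0 1 2]
b = np.arange(3)[::-1]  # [2 1 0]

rows = np.stack((a, b))          # new first axis: each input becomes a row
cols = np.stack((a, b), axis=1)  # new second axis: each input becomes a column
flat = np.concatenate((a, b))    # no new axis: inputs are joined end to end

print(rows.shape, cols.shape, flat.shape)  # (2, 3) (3, 2) (6,)
```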

#### Masking 

Logical arrays, i.e. arrays containing boolean values, can be used to index other arrays. These logical arrays are then called masks. This is especially useful to index based on logical conditions.

In [None]:
# A simple integer array.
arr = np.arange(1, 6)
arr

In [None]:
# A boolean array of the same shape as arr.
mask = np.array([True, False, True, False, True])
mask

Using the mask as an index, it's possible to return only the elements where the mask is true.

In [None]:
arr[mask]

In [None]:
arr < 5

In [None]:
arr[arr < 3]
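Masks can also be combined and used for assignment. The sketch below assumes the same `arr` as above; note that conditions are combined with `&` (and `|`), with each condition in parentheses, rather than with Python's `and` / `or`.

```python
import numpy as np

arr = np.arange(1, 6)  # [1 2 3 4 5]

# Combine two conditions element-wise with & (parentheses are required).
middle = arr[(arr > 1) & (arr < 5)]
print(middle)  # [2 3 4]

# A mask on the left-hand side assigns to the selected elements only.
clipped = arr.copy()
clipped[clipped > 3] = 3
print(clipped)  # [1 2 3 3 3]
```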

#### Indexing

In [None]:
large_two_dim_arr = np.arange(81).reshape((9, 9))
large_two_dim_arr

The syntax works as `[row, column]` or, broken down further, `[row_start : row_end, col_start : col_end]`.

In [None]:
large_two_dim_arr[:, 2]

In [None]:
large_two_dim_arr[1, 2:7]

Standard slicing with `start:stop:step` works as expected.

In [None]:
large_two_dim_arr[:, 2:7:2]

---