<div>
    <img src="images/emlyon.png" style="height:60px; float:left; padding-right:10px; margin-top:5px" />
    <span>
        <h1 style="padding-bottom:5px;"> Python BootCamp </h1>
        <a href="https://masters.em-lyon.com/en/msc-in-digital-marketing-data-science">[Emlyon]</a> MSc in Digital Marketing & Data Science (DMDS) <br/>
         September 2022, Paris | © Saeed VARASTEH [RP] | Lucas VILLAIN
    </span>
</div>

### Lecture 04 : Modules and Packages

A module is a file containing Python code that can be re-used in other Python code files.

---

<div class="alert-info"> 
You have learned three variations of the import statement for importing names from modules. They translate into the following four variations for importing modules from packages:
</div>

1. import [package]
2. import [package] as [other_name]
3. from [package] import [module]
4. from [package] import [module] as [other_name]
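
As a small sketch of these four forms, the cell below uses the standard library's `json` package and its `decoder` module. The aliases `js` and `dec` and the sample string are arbitrary choices made only for illustration.

```python
# 1. import [package]
import json

# 2. import [package] as [other_name]
import json as js

# 3. from [package] import [module]
from json import decoder

# 4. from [package] import [module] as [other_name]
from json import decoder as dec

# All four names give access to the same underlying code.
text = '{"course": "Python BootCamp"}'
print(json.loads(text))
print(js.loads(text))
print(decoder.JSONDecoder().decode(text))
print(dec.JSONDecoder().decode(text))
```

An alias does not create a second copy of the package or module; it is only another name for the same object.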

---

#### Python Libraries

Let's talk about Python Libraries.

There are libraries already included and others we have to install before being able to import and use them.

The so-called "Standard Library" does not need to be installed separately. It includes modules like copy, os and shutil, some of which we will look at later in this section.

Numpy and Pandas are external libraries that you need to install before using them.

All these libraries offer ready-made functions that you can use once they are imported, which saves you a lot of time and opens up new possibilities. So let's start with the standard library in the next section.

#### Standard Library:

As you learned, when writing Python code you will often import packages in order to use their functions. While there are many packages that you have to install separately, your Python installation comes with a number of packages already included. Together, these are called the Python Standard Library.

Packages that are part of the Standard Library are also often called inbuilt packages, as opposed to external packages.

The standard library consists of packages that are commonly used, and you've already used some of them.

If you want to look at all inbuilt packages, here's the official list: [Check this out](https://docs.python.org/3/library/)

#### COPY

The __copy__ module is part of Python's standard library and provides functions for copying objects, so that the copy is a separate object from the original.

Assignment statements in Python do not copy objects; they create bindings between a target and an object. For collections that are mutable or contain mutable items, a copy is sometimes needed so one can change one copy without changing the other. This module provides generic shallow and deep copy operations.

In [1]:
animals = ["lion", "tiger", "cheetah"]
large_cats = animals
large_cats.append("Leopard")

In [2]:
large_cats

['lion', 'tiger', 'cheetah', 'Leopard']

In [3]:
animals

['lion', 'tiger', 'cheetah', 'Leopard']

In [4]:
animals = ["lion", "tiger", "cheetah"]
large_cats = animals.copy()
large_cats.append("Leopard")

In [5]:
large_cats

['lion', 'tiger', 'cheetah', 'Leopard']

In [6]:
animals

['lion', 'tiger', 'cheetah']
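
The cells above use the list method `.copy()`, which makes a shallow copy. As a minimal sketch of the two operations the `copy` module itself provides, the example below compares `copy.copy` (shallow) with `copy.deepcopy` (deep) on a nested list. The variable names and the nested data are made up for illustration.

```python
import copy

# A list that contains other (mutable) lists.
groups = [["lion", "tiger"], ["wolf"]]

shallow = copy.copy(groups)     # new outer list, inner lists are shared
deep = copy.deepcopy(groups)    # new outer list and new inner lists

# Change one of the inner lists of the original.
groups[0].append("cheetah")

print(shallow[0])  # ['lion', 'tiger', 'cheetah'] (inner list is shared)
print(deep[0])     # ['lion', 'tiger'] (fully independent)
```

A shallow copy is enough for a flat list of strings or numbers; a deep copy matters once the list contains other mutable objects.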

#### MATH, TIME & SYS:

This section will briefly present three inbuilt packages that can be very useful.

__math__ The math package includes, you guessed it, mathematical tools: it provides mathematical constants, e.g. pi or e, as well as mathematical functions, like log or cos, and also some utilities like ceil and floor. Here are some examples:

In [None]:
import math

print(math.log(5, 2))

In [None]:
print(math.pi)

In [None]:
print(math.floor(2.6))

In [None]:
print(math.ceil(2.6))

__time__ The time package deals with time and timing-related things (e.g. timing how fast a function is, but also which year it is). If you're wondering why there's this entire module as well as the datetime and calendar packages, it's because time is quite a tricky thing to describe properly in programming.

In [8]:
import time

print(time.time())
# Output: the number of seconds that have passed since January 1st 1970 00:00:00 UTC

1666104607.0629044


In [9]:
# The script will now pause for 2 seconds
time.sleep(2)
print("Hi")

Hi


In [10]:
# Here's how we can time how long a function takes:
def my_function():
    for n in range(1000000):
        n += 2 * n

start_time = time.time()
my_function()

print("{:.4f} seconds".format(time.time() - start_time))

0.3123 seconds
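
`time.time()` is fine for a rough measurement. The standard library also offers `time.perf_counter()`, which is intended for measuring short durations. The following is a minimal sketch of the same kind of measurement. The loop body and the variable names are illustrative, and the measured time will differ on every machine.

```python
import time

def my_function():
    # Some arbitrary work to measure.
    total = 0
    for n in range(1_000_000):
        total += 2 * n
    return total

start = time.perf_counter()
result = my_function()
elapsed = time.perf_counter() - start

print(f"{elapsed:.4f} seconds")
```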


__time.localtime()__ returns a `struct_time` object whose fields, such as the year, can be accessed by name:

In [None]:
now = time.localtime()
print(now)

In [None]:
print(now.tm_year)
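
As a short sketch of what else can be done with such a time object, the cell below reads a few more fields and turns the object into a readable string with `time.strftime`. The format string is just one possible choice.

```python
import time

now = time.localtime()

# Individual fields are available as attributes.
print(now.tm_year, now.tm_mon, now.tm_mday)

# strftime formats the time object according to a format string.
formatted = time.strftime("%Y-%m-%d %H:%M:%S", now)
print(formatted)
```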

---

<div class="alert-info" style="background-color:#ece4f5; padding-bottom:22px; background-image:url(images/arrows.png); background-repeat:no-repeat; background-position: right; background-size: contain;">
    <img src="images/assignment.png" style="height:60px; float:left; padding-right:10px;" />
    <span style="font-weight:bold; color:#8966b0;">
        <h4 style="padding-top:25px;"> EXERCISES 03 - PART ONE </h4>
    </span>
</div>

<div class="alert-info" style="background-color:#fff4e3; padding-bottom:22px; background-image:url(images/arrows.png); background-repeat:no-repeat; background-position: right; background-size: contain;">
    <img src="images/homework.png" style="height:60px; float:left; padding-right:10px; padding-left:7px;" />
    <span style="font-weight:bold; color:#db9425;">
        <h4 style="padding-top:25px;"> HOMEWORK 01 & 02 </h4>
    </span>
</div>

---

<div class="alert-info" style="border-bottom: solid 1px lightgray; background-color:#f0ffff;">
    <img src="images/self.png" style="height:60px; float:left; padding-right:10px;" />
    <span style="font-weight:bold; color:#1a8a8a">
        <h4 style="padding-top:25px;"> SELF-STUDY </h4>
    </span>
</div>

#### External Library

__External packages__ need to be installed before using them. These packages are not included with Python by default.

Many programming languages offer a __package manager__ that automates the process of installing, upgrading, and removing third-party packages. Python is no exception.

The package manager for Python is called __pip__.


#### Installing Third-Party Packages With Pip

__pip__ is a separate program and a command line tool. That means you must run it from a command line or terminal program.

With your terminal program open, type in the following command to check whether or not pip is installed on your system:

<br/>

<div class="alert-danger"> 
Before doing anything, after you open the terminal, activate your conda environment using this command:
    <b>conda activate</b>
</div>

<code>
pip --version
</code>

#### Upgrading Pip

To upgrade pip, type the following into your terminal and press Enter:

<code>
pip install --upgrade pip
</code>

#### Listing All Installed Packages

To list all of the packages you have installed:

<code>
pip list
</code>

#### Installing a Package

Let’s install your first Python package! In your terminal, type the following:

<code>
pip install numpy
</code>

#### Installing Specific Package Versions

You can pin dependencies to a specific version with the == version specifier:

<code>
pip install numpy==1.16.5
</code>
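
To check from within Python which version of a package is installed, one option is the standard library module `importlib.metadata` (available from Python 3.8). The sketch below assumes nothing about your environment: the helper `installed_version` is a hypothetical name introduced for this example, and it returns `None` when a package is not installed.

```python
from importlib import metadata

def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None

# Prints a version string if numpy is installed, otherwise None.
print(installed_version("numpy"))

# A name that is very unlikely to exist, to show the "not installed" case.
print(installed_version("surely-not-a-real-package-name-xyz"))
```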

#### Uninstalling a Package

Finally, to uninstall a package, type the following into your terminal:

<code>
pip uninstall numpy
</code>

---

#### Numerical Computing with NumPy

In Python we can use __lists__ to store and manipulate sequences of objects, any objects.

While that is very convenient for us, it comes at a cost in time and memory.

In this example we create 1,000,000 integers:

In [11]:
import random 
measurements = [random.randint(150, 200) for _ in range(1_000_000)]

In [12]:
measurements[:10]

[164, 169, 158, 186, 199, 160, 156, 191, 166, 179]

and compute their mean:

In [14]:
list_time = %timeit -o sum(measurements) / len(measurements)

20.8 ms ± 818 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


Because Python doesn't know that our list only contains integers, it has to check every time it adds two values whether the objects actually support addition. That's why __sum__ takes "so long".

If we could tell the interpreter that we are only adding integers, we could skip all that typechecking and speed up the operation.

For this purpose, __numpy__ was invented.  

To use numpy we have to import it. The import is usually aliased as __np__ so we have to type less later on.

In [None]:
import numpy as np

Numpy's standard datatype is the __ndarray__ (which stands for n-dimensional array). In the simplest case, a numpy array can be created from a list.

In [None]:
measurements_array = np.array(measurements)

In [None]:
measurements_array

Now we can use the built-in __mean__ function provided by Numpy:

In [None]:
numpy_time = %timeit -o np.mean(measurements_array)

As we can see, using Numpy significantly speeds up our computation, making it

In [None]:
print(f"{list_time.average / numpy_time.average} times faster")

#### Numpy Arrays 

#### creating arrays from lists

As we already saw, we can create Numpy arrays from lists:

In [None]:
np.array([10,2,35])

#### creating arrays using Numpy functions

Whenever we don't want to create an array from specific values like [10, 2, 35] we can use Numpy's utility functions.

These are also faster than Python's built-in functions (e.g. __np.arange__ works like __range__, but returns an array).

In [None]:
list(range(5))

In [None]:
np.arange(start=2, stop=14, step=2)

#### np.linspace

Creating an array with a certain number of values in a certain interval:

In [None]:
np.linspace(start=-5, stop=5, num=9)
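
The difference between the two functions is easiest to see side by side. In this small sketch (the interval from 0 to 1 is an arbitrary choice), `np.arange` is given a step size and excludes the stop value, while `np.linspace` is given the number of values and includes the stop value by default.

```python
import numpy as np

# Step size given: 0, 0.25, 0.5, 0.75 (the stop value 1 is excluded).
stepped = np.arange(0, 1, 0.25)
print(stepped)

# Number of values given: 0, 0.25, 0.5, 0.75, 1 (the stop value is included).
spaced = np.linspace(0, 1, num=5)
print(spaced)
```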

#### np.zeros and np.ones

Creating an array with only zeros or ones:

In [None]:
np.zeros(3), np.ones(3)

Both take a `shape` argument that lets us create multidimensional arrays.

In [None]:
np.zeros((2, 2))

In [None]:
np.ones((2,3))

#### Anatomy of arrays

#### dtype

__.dtype__ returns the data type of the array. Arrays can contain bools, ints, unsigned ints, floats or complex numbers of various byte sizes.

They can also store strings or Python objects, but that has very few use cases.

In [None]:
values = [0, 1, 2, 3, 4]
int_arr = np.array(values, dtype='int')
int_arr, int_arr.dtype

If the dtype does not match the given values, numpy will cast everything to that data type.

In [None]:
bool_arr = np.array(values, dtype='bool')
bool_arr, bool_arr.dtype

If no explicit data type is given, numpy will choose the "smallest common denominator". <br>
In the following example, everything becomes a float, as ints can be represented as floats, but not vice versa.

In [None]:
values = [0, 1, 2.5, 3, 4]
float_arr = np.array(values)
float_arr, float_arr.dtype
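
The same rule applies when more than two types are mixed. In the sketch below (values chosen only for illustration), a bool, an int and a float end up in one float array. Converting back with `astype(int)` cuts off the decimals, which is a reason to convert explicitly and deliberately.

```python
import numpy as np

# bool, int and float together: everything is stored as float64.
mixed = np.array([True, 2, 3.5])
print(mixed, mixed.dtype)

# Explicit conversion to int drops the fractional part (3.5 becomes 3).
as_int = mixed.astype(int)
print(as_int, as_int.dtype)
```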

#### shape and ndim

__.shape__ is very important for keeping track of arrays with more than one dimension. It is a tuple with the number of elements in each dimension.

__.ndim__ is just the number of dimensions in total. 

#### 1D

In [None]:
values = [1, 2, 3, 4]
one_dim_arr = np.array(values)
one_dim_arr

In [None]:
one_dim_arr.shape

In [None]:
one_dim_arr.ndim

#### 2D

In [None]:
values = [[1, 2, 3, 4, 5],
          [1, 2, 3, 4, 1],
          [1, 2, 3, 4, 2]]
two_dim_arr = np.array(values)
two_dim_arr

In [None]:
two_dim_arr.shape

In [None]:
two_dim_arr.ndim

In [None]:
two_dim_arr[1, 3]

In [None]:
two_dim_arr[1,3] = 10
two_dim_arr

#### 3D

In [None]:
values = [[[1, 2, 3, 4]] * 3] * 6
three_dim_arr = np.array(values)
three_dim_arr

In [None]:
three_dim_arr.shape

In [None]:
three_dim_arr.ndim

In [None]:
three_dim_arr[1,1,1]

#### Numpy Functions - what can we do with arrays 

#### reshape

In [None]:
a = np.arange(start=2, stop=14)
print(a.shape)
a

In [None]:
a.reshape(3, 4)

<div class="alert-info"> 
Passing -1 as the size of one dimension makes numpy infer that size automatically from the number of elements and the sizes of the other dimensions.
</div>

In [None]:
a.reshape(2, -1)

In [None]:
a.reshape(2, -1).shape
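
A few more cases may help to see how the inferred dimension behaves. This sketch uses a separate 12-element array named `numbers` (a name chosen for this example) and also shows what happens when the requested shape cannot hold all elements.

```python
import numpy as np

numbers = np.arange(12)

# 12 elements with 3 columns: numpy infers 4 rows.
print(numbers.reshape(-1, 3).shape)     # (4, 3)

# Works with more than two dimensions as well.
print(numbers.reshape(2, 2, -1).shape)  # (2, 2, 3)

# 12 is not divisible by 5, so no size fits and numpy raises an error.
try:
    numbers.reshape(5, -1)
except ValueError as error:
    print("reshape failed:", error)
```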

#### Comparing Arrays

In [None]:
a = np.zeros((3,3))
b = np.zeros((3,3))

a == b

In [None]:
a[0,0] += 0.000000000000001
a == b

In [None]:
np.isclose(a, b)
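
Tiny differences like this usually come from floating point rounding rather than from deliberate changes. The sketch below uses the classic `0.1 + 0.2` case (the arrays `x` and `y` are made up for this example) and adds `np.allclose`, which reduces the element-wise result to a single True or False.

```python
import numpy as np

x = np.array([0.1 + 0.2, 1.0])
y = np.array([0.3, 1.0])

print(x == y)              # exact comparison: first element differs
print(np.isclose(x, y))    # element-wise comparison with a tolerance
print(np.allclose(x, y))   # True if all elements are close
```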

#### Mathematical operations

Numpy contains a lot of mathematical functions that operate on arrays in a vectorized manner. That means that they are applied to each element, without explicit for-loops. Vectorized functions are called __ufuncs__ (universal functions) in Numpy.

In [None]:
arr = np.arange(1, 10)
arr

In [None]:
arr * 3

In [None]:
arr * arr

In [None]:
arr + (arr*2)

In [None]:
arr - arr

In [None]:
arr / arr

In [None]:
arr ** 2

Using `@` you can even do matrix multiplication. In the case of 1d arrays, this is the inner product between two vectors.

In [None]:
arr @ arr

In [None]:
# That's the same as
np.sum(arr * arr)

In [None]:
np.log(arr)

In [None]:
np.log2(arr)

In [None]:
np.exp(arr)

In [None]:
np.sin(arr)
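
To make the "no explicit for-loops" point concrete, here is a small comparison between a plain Python list and a Numpy array. The list `numbers` is an arbitrary example. Note that `*` means something different for lists: it repeats the list instead of multiplying each element.

```python
import numpy as np

numbers = [1, 2, 3, 4]

# With a list, we need a loop (here a list comprehension).
doubled_loop = [n * 2 for n in numbers]
print(doubled_loop)

# With an array, the multiplication is applied to every element.
doubled_vec = np.array(numbers) * 2
print(doubled_vec)

# For a list, * repeats the list instead.
print(numbers * 2)
```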

#### Aggregation functions

Aggregation functions are functions that reduce the dimensionality of an array. They provide an __axis__ argument to specify which dimension to reduce.

In [None]:
two_dim_arr = np.random.randint(0, high=20, size=(3, 4))
two_dim_arr

If just the array is passed, the aggregation operation is performed over the whole array.

In [None]:
np.min(two_dim_arr)

The optional `axis` argument allows us to specify which dimension should be aggregated. You can think of it as the operation being applied to all entries that are obtained by keeping the indices in all dimensions fixed except for the `axis` dimension.
Let's look at the result of the minimum operation with `axis=0`:

In [None]:
np.min(two_dim_arr, axis=0)

In [None]:
np.max(two_dim_arr)

In [None]:
np.sum(two_dim_arr)

In [None]:
np.average(two_dim_arr, axis=0)
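
Because the array above is random, the results change on every run. The following sketch uses a small fixed array (named `scores` only for this example) so that the effect of `axis` can be checked by hand.

```python
import numpy as np

scores = np.array([[1, 2, 3],
                   [4, 5, 6]])

# No axis: all six values are summed.
print(np.sum(scores))           # 21

# axis=0: the rows are collapsed, one result per column.
print(np.sum(scores, axis=0))   # [5 7 9]

# axis=1: the columns are collapsed, one result per row.
print(np.sum(scores, axis=1))   # [ 6 15]
```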

#### Combining arrays

There are many ways to combine existing arrays, like `np.append`, `np.concatenate` and `np.stack`. However, these operations always require the whole array to be copied. Therefore, it often makes more sense to allocate an array of the final size upfront and then just fill the respective parts.

#### `concatenate`

In [None]:
a = np.arange(10)
b = np.arange(10)[::-1]

In [None]:
a

In [None]:
b

In [None]:
# Needs a sequence (in this case a tuple) of array-likes
np.concatenate((a,b))

#### `append`

`np.append` uses `concatenate` internally.

In [None]:
# Needs exactly two array-likes
np.append(a, b)

#### `stack`
For higher-dimensional arrays, other functions are useful. `np.stack` joins arrays along a new axis:

In [None]:
np.stack((np.arange(10), np.arange(10)))
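
The advice from the start of this section, allocating the full array first and then filling it, could look like the sketch below. The three `chunks` are made-up example data. Both approaches give the same result, but the first one copies the growing array in every loop iteration.

```python
import numpy as np

chunks = [np.arange(3), np.arange(3, 6), np.arange(6, 9)]

# Approach 1: grow the array step by step (copies the data every time).
grown = np.array([], dtype=int)
for chunk in chunks:
    grown = np.append(grown, chunk)

# Approach 2: allocate the final size once, then fill the parts.
filled = np.zeros(9, dtype=int)
position = 0
for chunk in chunks:
    filled[position:position + len(chunk)] = chunk
    position += len(chunk)

print(grown)
print(filled)
```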

#### Masking 

Logical arrays, i.e. arrays containing boolean values, can be used to index other arrays. These logical arrays are then called masks. This is especially useful to index based on logical conditions.

In [None]:
# A simple integer array.
arr = np.arange(1, 6)
arr

In [None]:
# A boolean array of the same shape as arr.
mask = np.array([True, False, True, False, True])
mask

Using the mask, it's possible to return only the elements where the mask is True.

In [None]:
arr[mask]

In [None]:
arr < 5

In [None]:
arr[arr < 3]
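
Masks can also be combined and used for assignment. In this sketch (the `temps` values are invented for the example), two conditions are joined with `&`, which needs parentheses around each condition, and a mask is used to overwrite selected elements.

```python
import numpy as np

temps = np.array([12, 25, 31, 18, 40])

# Combine two conditions element-wise with & (and) or | (or).
mild = (temps > 15) & (temps < 35)
print(temps[mild])    # [25 31 18]

# A mask on the left-hand side changes only the selected elements.
capped = temps.copy()
capped[capped > 30] = 30
print(capped)         # [12 25 30 18 30]
```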

#### Indexing

In [None]:
large_two_dim_arr = np.arange(81).reshape((9, 9))
large_two_dim_arr

The syntax is `[row, column]` or, with slices, `[row_start:row_end, col_start:col_end]`.

In [None]:
large_two_dim_arr[:, 2]

In [None]:
large_two_dim_arr[1, 2:7]

Standard slicing with `start:stop:step` works as expected.

In [None]:
large_two_dim_arr[:, 2:7:2]
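
On a smaller array the results are easier to verify by eye. This sketch repeats the same kinds of indexing on a 3 x 4 array called `grid` (a name chosen for this example).

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)
# grid is:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

print(grid[:, 1])      # all rows, column 1        -> [1 5 9]
print(grid[1, 1:3])    # row 1, columns 1 and 2    -> [5 6]
print(grid[:, ::2])    # all rows, every 2nd column
```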

---