<center>
<img src="images/nva5b9dq8r631.png" alt="Python-MEME" width="500" height="600">
</center>

Credit: this notebook was inspired by and built upon [this tutorial](https://github.com/HSE-LAMBDA/MLatMIPS-2020) from the Mosphys 2020 school.

## **Before we start**

Don't forget **the most important helper features** in Jupyter notebooks:
* if you're typing something, press `Tab` to see automatic suggestions; use the arrow keys and `Enter` to pick one.
* if you move your cursor inside a function call and press `Shift + Tab`, you'll get a help window.

## **Numpy**

Almost any machine learning model requires some computational heavy lifting, usually involving linear algebra. Unfortunately, pure Python is slow at this because every operation is interpreted at runtime.

So instead, we'll use [NumPy](https://numpy.org), a library that lets you run [fast](https://numpy.org/doc/stable/user/whatisnumpy.html#why-is-numpy-fast) computations with vectors, matrices and other tensors. In this notebook we present a general overview of NumPy, but we also encourage you to have a look at [this](https://numpy.org/devdocs/user/quickstart.html) quickstart tutorial for another introduction to the library.

In [None]:
import numpy as np

In [None]:
big_array = np.random.rand(1000000)
%timeit sum(big_array)
%timeit np.sum(big_array)
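
The `%timeit` magic only works inside IPython/Jupyter. As a rough sketch of the same comparison in plain Python, the standard `timeit` module can be used; the array size, seed and number of repetitions here are arbitrary choices, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for reproducibility
values = rng.random(100_000)

# The built-in sum walks over the array element by element in the interpreter.
python_time = timeit.timeit(lambda: sum(values), number=10)
# np.sum runs a single compiled loop over the whole buffer.
numpy_time = timeit.timeit(lambda: np.sum(values), number=10)

print(f"built-in sum: {python_time:.4f} s for 10 runs")
print(f"np.sum:       {numpy_time:.4f} s for 10 runs")
print("results agree:", np.isclose(sum(values), np.sum(values)))
```

Both calls compute the same number; only the time spent per element differs.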

### Creation

In [None]:
# We can initialize NumPy arrays from Python lists, and access elements using square brackets
a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 4, 3, 2, 1])
print("a = ", a)
print("b = ", b)

In [None]:
type(a)

In [None]:
c = np.array([[1, 2.0], [0, 0],
              (1 + 1j, 3.)])  # Note mix of lists and tuple, and mix of types
c

In [None]:
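# Expected to raise a TypeError: complex values cannot be cast to int (the old np.int alias is gone from recent NumPy)
c = np.array([[1, 2.0], [0, 0], (1 + 1j, 3.)], dtype=int)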

In [None]:
c = np.array([[1.5, 2, 3], [4, 5, 6]], dtype=complex)
c

In [None]:
# or initialise using the following methods:

In [None]:
np.arange(3, 15, 2)  # start, stop, step

In [None]:
np.linspace(0, 10, 11)  # Divide [0, 10] interval into 11 points

In [None]:
np.logspace(1, 10, base=2, dtype=np.int64)  # Base 2, number = 50

In [None]:
np.logspace(1, 10, 10, base=2)  # Base 2, number = 10

In [None]:
np.zeros(shape=(3, 4))

In [None]:
np.ones(shape=(2, 5))

In [None]:
np.ones(shape=(2, 5), dtype=bool)

In [None]:
np.full((2, 2), 99)

In [None]:
# Return a 2-D array with ones on the diagonal and zeros elsewhere
np.eye(4)

In [None]:
np.random.random((2, 2))

### Problem №0

Compute the determinant of a given matrix.  
_Hint:_ you might want to use `numpy.linalg` module for this

In [None]:
a = np.array([[1, 0], [1, 2]])
a

### Operations 

In [None]:
a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 4, 3, 2, 1])

In [None]:
# Math and boolean operations can be applied to each element of an array

print("a + 1 = ", a + 1)
print("a * 2 = ", a * 2)
print("a == 2 ", a == 2)

# And corresponding elements of two (or more) arrays
print("a + b = ", a + b)
print("a * b = ", a * b)
print("a / b = ", a / b)

In [None]:
a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 4, 3, 2])

print("a+b = ", a+b)  # raises ValueError: shapes (5,) and (4,) cannot be broadcast together
print("a*b = ", a*b)
print("(a*b) / 2 = ", (a*b) / 2)
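
The cell above fails because element-wise operations need compatible shapes. A minimal sketch of what does and does not work, with the error caught so the cell keeps running (the variable names are only illustrative):

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b_short = np.array([5, 4, 3, 2])

try:
    a + b_short  # shapes (5,) and (4,) are incompatible
except ValueError as err:
    print("ValueError:", err)
    shape_error_raised = True
else:
    shape_error_raised = False

# A length-1 array (or a scalar) is stretched to match the other operand.
stretched = a + np.array([10])
print(stretched)
```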

In [None]:
a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 4, 3, 2, 1])

print("a * b = ", a * b)

In [None]:
print("numpy.sum(a) = ", np.sum(a)) 
print("numpy.prod(a) = ", np.prod(a))
print("numpy.mean(a) = ", np.mean(a))
print("numpy.min(a) = ", np.min(a))
print("numpy.argmin(a) = ", np.argmin(a))  # Index of the minimal element


In [None]:
print("numpy.dot(a, b) = ", np.dot(a, b))  # Dot product. Also used for matrix/tensor multiplication
print("a@b = ", a@b)  # fancier dot product 
print("numpy.subtract(a, b) = ", np.subtract(a, b))  # a - b
print("numpy.divide(a, b) = ", np.divide(a, b))  # a / b
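
To make the difference between `*` and `@` concrete, here is a small sketch. For 1-D arrays the dot product is the sum of element-wise products; for a matrix and a vector, `@` is the matrix-vector product while `*` multiplies element by element. The matrix `m` and vector `v` are made up for illustration.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([5, 4, 3, 2, 1])

elementwise_sum = np.sum(a * b)  # 1*5 + 2*4 + 3*3 + 4*2 + 5*1
dot_ab = a @ b                   # same value, computed as a dot product

m = np.array([[1, 2], [3, 4]])
v = np.array([10, 1])
matvec = m @ v        # matrix-vector product: one number per row of m
elementwise = m * v   # v is broadcast across the rows of m

print(dot_ab, elementwise_sum)
print(matvec)
print(elementwise)
```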

In [None]:
# Find the unique elements of an array.
np.unique(['male', 'male', 'female', 'female', 'male', 'qqq'])

In [None]:
np.unique([1, 2, 1, 1, 0, 99], return_counts=True)
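
With `return_counts=True`, `np.unique` returns two aligned arrays: the sorted unique values and how often each occurs. One possible way to use them together is sketched here:

```python
import numpy as np

values, counts = np.unique([1, 2, 1, 1, 0, 99], return_counts=True)

# Pair each unique value with its count.
frequency = dict(zip(values.tolist(), counts.tolist()))
# The most frequent value sits at the position of the largest count.
most_common = values[np.argmax(counts)]

print(frequency)
print(most_common)
```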

### Transformations

#### indexing

In [None]:
array = np.array([7, 5, 3, 2, 6, 1, 4])
array

In [None]:
sorted_array = np.sort(array)
sorted_array

In [None]:
reverse_array = sorted_array[::-1]
reverse_array

In [None]:
a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
a

In [None]:
a.shape

In [None]:
# Use slicing to pull out the subarray consisting of the first 2 rows
# and columns 1 and 2; b is the following array of shape (2, 2):

b = a[:2, 1:3]
b

In [None]:
b = a[:3, 0:4]
b

In [None]:
b.shape

In [None]:
a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
a

In [None]:
# Two ways of accessing the data in the middle row of the array.
# Mixing integer indexing with slices yields an array of lower rank,
# while using only slices yields an array of the same rank as the
# original array:

a_row_1 = a[1, :]  # Rank 1 view of the second row of a
a_row_1

In [None]:
a_row_1.shape

In [None]:
a_row_2 = a[1:2, :]  # Rank 2 view of the second row of a
a_row_2

In [None]:
a_row_2.shape

In [None]:
a_row_1 = a[1, 1:3]  # Rank 1 view of columns 1 and 2 of the second row of a
a_row_1
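
The comments above say "view" for a reason: basic slicing does not copy the data. A short sketch of the consequence, using `np.shares_memory` to check whether two arrays overlap in memory:

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

row_view = a[1, :]         # basic slicing returns a view into a
row_copy = a[1, :].copy()  # an explicit copy owns its own data

row_view[0] = 100  # also changes a[1, 0]
row_copy[1] = -1   # leaves a untouched

print(a[1])
print(np.shares_memory(a, row_view))
print(np.shares_memory(a, row_copy))
```

If you need to modify a slice without touching the original array, call `.copy()` first.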

In [None]:
a = np.array([[1, 2], [3, 4], [5, 6]])

bool_idx = a > 2

bool_idx
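
A boolean array like `bool_idx` is mostly useful as a mask. The sketch below shows two common uses: selecting the matching elements and assigning to them. The threshold of 4 in the second part is an arbitrary choice.

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])
bool_idx = a > 2

# Indexing with a boolean mask returns a 1-D array of the matching elements.
selected = a[bool_idx]

# A mask can also be used on the left-hand side of an assignment.
a_clipped = a.copy()
a_clipped[a_clipped > 4] = 4

print(selected)
print(a_clipped)
```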

#### reshaping

In [None]:
np.arange(24)

In [None]:
np.arange(24).reshape(6, 4)

In [None]:
np.arange(24).reshape(2, 3, 4)

In [None]:
np.arange(24).reshape(4, 3, 2)

In [None]:
# add dimension
# do you understand why it might be useful?
np.arange(3)[:, np.newaxis]

In [None]:
np.arange(3)[np.newaxis, :]

In [None]:
np.arange(3)[:, np.newaxis] + np.arange(3)[np.newaxis, :]
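
One situation where an extra axis helps, sketched under the assumption that we want to subtract each row's mean from that row: the per-row means have shape `(2,)`, which does not broadcast against a `(2, 3)` matrix until a trailing axis is added.

```python
import numpy as np

x = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]
row_means = x.mean(axis=1)      # shape (2,)

# x - row_means would fail: shapes (2, 3) and (2,) do not line up.
# With a trailing axis the means have shape (2, 1) and broadcast across columns.
centered = x - row_means[:, np.newaxis]

print(row_means[:, np.newaxis].shape)
print(centered)
```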

#### concatenating

In [None]:
matrix1 = np.arange(50).reshape(10, 5)
matrix1

In [None]:
matrix2 = -np.arange(20).reshape(10, 2)
matrix2

In [None]:
np.concatenate([matrix1, matrix2],
               axis=1)  # Join a sequence of arrays along an existing axis.

In [None]:
# Axes are defined for arrays with more than one dimension.
# A 2-dimensional array has two corresponding axes: the first running vertically downwards across rows (axis 0),
# and the second running horizontally across columns (axis 1).

# Suppose your data tensor has the shape (19,19,5,80). This means:
# axis 0 = 19 elements
# axis 1 = 19 elements
# axis 2 = 5 elements
# axis -1 = axis 3 = 80 elements

In [None]:
# Many operations can take place along one of these axes.
# For example, we can sum each row of an array, in which case we operate along the columns, or axis 1:

x = np.arange(12).reshape(3, 4)
x

In [None]:
x.sum(axis=1)  # one sum per row

In [None]:
x.sum(axis=0)

In [None]:
x.sum(axis=-1)
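
The same rule extends to higher-dimensional arrays: summing along an axis removes that axis from the shape. A small sketch on a `(2, 3, 4)` array, also showing `keepdims=True`, which keeps the reduced axis with length 1:

```python
import numpy as np

t = np.arange(24).reshape(2, 3, 4)

shape_axis0 = t.sum(axis=0).shape                 # axis 0 disappears
shape_axis1 = t.sum(axis=1).shape                 # axis 1 disappears
shape_last = t.sum(axis=-1).shape                 # the last axis disappears
shape_two_axes = t.sum(axis=(0, 1)).shape         # several axes at once
shape_kept = t.sum(axis=-1, keepdims=True).shape  # reduced axis kept as length 1

print(shape_axis0, shape_axis1, shape_last, shape_two_axes, shape_kept)
```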

In [None]:
matrix1 = np.arange(50).reshape(10, 5)
matrix2 = -np.arange(20).reshape(10, 2)

np.concatenate([matrix1, matrix2])  # Default axis is 0; raises ValueError here because the column counts (5 vs 2) differ
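
When concatenating, all dimensions except the one being joined must match. The sketch below repeats the two matrices from above and adds a hypothetical `extra_rows` array to show a case where joining along axis 0 does work.

```python
import numpy as np

matrix1 = np.arange(50).reshape(10, 5)
matrix2 = -np.arange(20).reshape(10, 2)

# Same number of rows, so joining along axis 1 (side by side) works.
side_by_side = np.concatenate([matrix1, matrix2], axis=1)

# Joining along axis 0 needs the same number of columns: 5 vs 2 fails.
try:
    np.concatenate([matrix1, matrix2], axis=0)
except ValueError as err:
    print("ValueError:", err)
    axis0_failed = True
else:
    axis0_failed = False

# extra_rows is a made-up array with 5 columns, so it can be appended below matrix1.
extra_rows = -np.arange(10).reshape(2, 5)
stacked = np.concatenate([matrix1, extra_rows], axis=0)

print(side_by_side.shape, stacked.shape)
```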

In [None]:
matrix2.T  # transposed array

### Problem №1

Divide the diagonal elements of the matrix by the sum of the elements in the corresponding row.

In [None]:
a = np.arange(1, 17).reshape(4,4)
a

## **Pandas**

[Pandas](https://pandas.pydata.org/docs/index.html#) is a Python package that provides fast, flexible, and expressive data structures designed to make working with structured (tabular, multidimensional, potentially heterogeneous) and time series data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.

Pandas is well suited for many different kinds of data:
* Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
* Ordered and unordered (not necessarily fixed-frequency) time series data.
* Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
* Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

[Here](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html) you can have a look at a quick overview of the data structures in Pandas, but we will also go through them in this tutorial.

In [None]:
import pandas as pd

### Creating dataframes

#### from numpy arrays

In [None]:
a = np.random.normal(size=100)
a

In [None]:
# The main object in pandas is DataFrame
df = pd.DataFrame(a)
df

In [None]:
a = np.random.normal(size=20)
df = pd.DataFrame(a, columns=['column_name'], dtype=complex)
df

In [None]:
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['A', 'B', 'C'])
df

In [None]:
df.A

In [None]:
type(df)

In [None]:
type(df.A)

#### from series

In [None]:
a_array = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
b_array = np.random.randint(low=1, high=11, size=len(a_array))
b_array

In [None]:
type(b_array)

In [None]:
# Series is a one-dimensional labeled array
a_series = pd.Series(a_array, name='a_array', dtype='float32')
b_series = pd.Series(b_array, name='b_array', dtype='float32')
b_series

In [None]:
type(b_series)

In [None]:
df = pd.DataFrame({'a_series': a_series, 'b_series': b_series})
df

#### from file

In [None]:
# .csv is one of the formats to store tabular data
!ls -lhtr data/train.csv

In [None]:
# this is a Titanic dataset
data = pd.read_csv("./data/train.csv")

In [None]:
data

In [None]:
data = pd.read_csv("./data/train.csv", index_col='PassengerId')

In [None]:
data

In [None]:
data.shape

In [None]:
data.  # press Tab

In [None]:
data.columns

Here is the [description](https://www.kaggle.com/c/titanic/data) of some of the columns:
* Name - a string with a person's full name
* Survived - 1 if a person survived the shipwreck, 0 otherwise.
* Pclass - passenger class. Pclass == 3 is cheap'n'cheerful, Pclass == 1 is for moneybags.
* Sex - a person's gender
* Age - age in years, if available
* SibSp - number of siblings/spouses on a ship
* Parch - number of parents/children on a ship
* Fare - ticket cost
* Embarked - the port where the passenger embarked (C = Cherbourg; Q = Queenstown; S = Southampton)
* Cabin - Cabin number
* Ticket - Ticket Number

In [None]:
data.describe()  # Generate descriptive statistics.

In [None]:
data["Sex"].describe()

### Accessing data

In [None]:
head = data[:10]
head

In [None]:
type(head)

In [None]:
# return the first 5 rows
data.head()

In [None]:
data.head(10)

In [None]:
# return the last 5 rows
data.tail()

In [None]:
data.tail(10)

In [None]:
# return a random sample of items
data.sample()

In [None]:
data.sample(5)

In [None]:
# dimensions
print("len(data) = ", len(data))
print("data.shape = ", data.shape)

In [None]:
# select a single row by label
data.loc[4]

In [None]:
type(data.loc[4])  # a single row is returned as a pandas Series

In [None]:
# select a single column.
ages = data["Age"]
ages[:10]

In [None]:
data.Age[:10]

In [None]:
# select several columns and rows at once

data.loc[5:10, ["Fare", "Pclass"]]  # Alternatively: data[["Fare", "Pclass"]].loc[5:10]

In [None]:
data[["Fare", "Pclass"]].loc[5:10]

There are two ways of [indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) the rows in pandas:
 *   by index column values (`PassengerId` in our case) – with `.loc` 
 *   by positional index - with `.iloc` 

In our dataset the index values (`PassengerId`) start from 1, so positional index 0 corresponds to index value 1, positional index 1 to index value 2, and so on:

In [None]:
data.index

In [None]:
print(data.iloc[0])

In [None]:
print(data.loc[1])
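
The difference between `.loc` and `.iloc` also shows up when slicing: label slices include the end label, while positional slices exclude the end position. A minimal sketch on a hypothetical four-row frame (the fares are made-up numbers, not taken from the Titanic file):

```python
import pandas as pd

toy = pd.DataFrame(
    {"Fare": [7.25, 71.28, 7.93, 53.10]},  # made-up values
    index=pd.Index([1, 2, 3, 4], name="PassengerId"),
)

first_by_position = toy.iloc[0]  # first row by position
first_by_label = toy.loc[1]      # row whose index label is 1 (the same row)

label_slice = toy.loc[1:3]        # labels 1, 2 and 3: the end label is included
position_slice = toy.iloc[1:3]    # positions 1 and 2: the end position is excluded

print(len(label_slice), len(position_slice))
print(list(position_slice.index))
```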

### Querying data

In [None]:
data[data['Sex'] == 'female']

In [None]:
data[data["Pclass"] < 3]

In [None]:
data.query('Pclass < 3')  # Query the columns of a DataFrame with a boolean expression.

In [None]:
data.eval('SibSp + Parch')

In [None]:
data.eval('FamilyRel = SibSp + Parch', inplace = True)

In [None]:
data.head()
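
Conditions can be combined as well. The sketch below, on a small made-up frame with the same column names, shows a boolean mask built with `&` next to the equivalent `query` string, and a non-mutating alternative to `eval(..., inplace=True)` based on `assign`.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["female", "male", "female", "male"],  # made-up rows
    "Pclass": [1, 3, 3, 2],
    "SibSp": [1, 0, 0, 1],
    "Parch": [0, 0, 2, 1],
})

# Each condition needs its own parentheses when combined with & or |.
mask = (toy["Sex"] == "female") & (toy["Pclass"] < 3)
by_mask = toy[mask]
by_query = toy.query('Sex == "female" and Pclass < 3')

family = toy.eval("SibSp + Parch")  # returns a Series, toy is unchanged
with_family = toy.assign(FamilyRel=toy["SibSp"] + toy["Parch"])  # returns a new frame

print(by_query)
print(with_family)
```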

### Transforming data

In [None]:
subdata = data[['Age', 'Parch']]

In [None]:
subdata.apply(np.sqrt)  # Apply a function along an axis of the DataFrame.

In [None]:
subdata.apply(np.sum, axis=0)

In [None]:
subdata.apply(np.sum, axis=1)

In [None]:
subdata.apply(lambda x: x, axis=1)

In [None]:
# and now groups
# see the user guide for a lot more on groups: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
grouped = data.groupby('Sex')

In [None]:
grouped['Sex'].apply(lambda x: x.describe())

In [None]:
grouped.sum(numeric_only=True)  # sum only the numeric columns

In [None]:
for name, group in grouped:
    print(name)
    print(group.head())

In [None]:
# apply filter to group as a whole
filtered_group = data.groupby('Pclass').filter(lambda x: len(x.Survived) > 300)
filtered_group

In [None]:
np.unique(filtered_group.Pclass)

In [None]:
grouped.agg({'Fare': 'sum',
             'Age': lambda x: np.std(x, ddof=1),
             'Survived': 'mean'
            })
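
An alternative to the dictionary form is named aggregation, where each output column gets an explicit name. This is only a sketch on made-up data; the output column names `total_fare`, `survival_rate` and `n_passengers` are illustrative choices.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["female", "male", "female", "male"],  # made-up rows
    "Fare": [10.0, 20.0, 30.0, 40.0],
    "Survived": [1, 0, 1, 1],
})

# Each keyword is an output column: (input column, aggregation).
summary = toy.groupby("Sex").agg(
    total_fare=("Fare", "sum"),
    survival_rate=("Survived", "mean"),
    n_passengers=("Survived", "size"),
)
print(summary)
```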

### NaNs

Some columns contain __NaN__ values, which means that the data is missing there. For example, the passenger at positional index 5 (`data.iloc[5]`) has an unknown age. To simplify further data analysis, we'll replace NaN values using the pandas `fillna` function.

**NB: we do this so easily because it's a tutorial. In general, you should think twice before you modify data like this.**

In [None]:
data.iloc[5]

In [None]:
data.loc[889]

In [None]:
pd.isna(data)  # Detect them all

In [None]:
data.notna()  # Detect existing (non-missing) values

### Problem №2

Identify the columns in the Titanic dataset that contain NaN values, and count how many NaNs each of them has. In one line of code, please 😉

In [None]:
data.head()

In [None]:
# filling NaNs with 0
data.fillna(0)

In [None]:
data['Age'][-5:]

In [None]:
data['Age'].mean()

In [None]:
# filling NaNs with mean of the column
data['Age'].fillna(value=data['Age'].mean())[-5:]
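
Note that `fillna` returns a new object and leaves the original untouched unless the result is assigned back. A minimal sketch on a made-up `Age` series:

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, np.nan, 30.0, np.nan], name="Age")  # made-up values

# mean() skips NaN by default, so this is (22 + 30) / 2.
filled = ages.fillna(ages.mean())

print(filled.tolist())
print("NaNs left in the original:", ages.isna().sum())
```

To keep the result, assign it back, for example `ages = ages.fillna(ages.mean())`.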

In [None]:
# drop NaN values
data.dropna()  

# do you understand the default parameters?

In [None]:
data.dropna(axis='columns')

In [None]:
# Drop the rows where all elements are missing.
data.dropna(how='all')
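
The effect of the `dropna` parameters is easier to see on a tiny frame. The sketch below uses made-up values in which every column has at least one NaN and the last row is entirely missing.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan],   # made-up values
    "Cabin": ["C85", np.nan, np.nan],
    "Fare": [7.25, 8.05, np.nan],
})

any_missing = toy.dropna()                  # default: drop rows with at least one NaN
all_missing = toy.dropna(how="all")         # drop only rows where everything is NaN
by_column = toy.dropna(axis="columns")      # drop columns with at least one NaN
fare_known = toy.dropna(subset=["Fare"])    # only look at the Fare column

print(any_missing.shape, all_missing.shape, by_column.shape, fare_known.shape)
```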

In [None]:
# Keep in mind that in Python (and NumPy) NaN values don’t compare equal to each other, while None does.
# Note that pandas/NumPy uses the fact that np.nan != np.nan, and treats None like np.nan.

In [None]:
df = pd.DataFrame(np.random.randn(5, 3),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df

In [None]:
df_nans = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
df_nans

In [None]:
None == None

In [None]:
np.nan == np.nan

In [None]:
# So a scalar equality comparison against None/np.nan doesn’t provide useful information: every entry comes out False.
df_nans['one'] == np.nan
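
Since equality with `np.nan` is always `False`, missing values have to be detected with dedicated functions such as `isna`. A short sketch on a made-up series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])  # made-up values

equality_check = (s == np.nan).tolist()  # never True, even for the NaN entry
isna_check = s.isna().tolist()           # True exactly where the value is missing

print(equality_check)
print(isna_check)
print(np.isnan(np.nan))
```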

### Problem №3

Find the passenger who paid the most for the ticket. Yeah, one line of code, as always.

### Problem №4

Find out whether children are more likely to survive.

In [None]:
ax = data['Age'].hist()
ax.axvline(x=18, color='brown', linewidth=3)

### Coffee Break?