# Lab #1 Intro to numpy and Data Analysis

The aim of this lab is to get you acquainted with three very important Python modules (libraries):
- numpy
- pandas
- matplotlib

Other modules are **prohibited** in this Jupyter notebook unless stated otherwise.

#### About tasks

This notebook consists of numerous tasks, but please make it read as a whole story: a report with your own code, thoughts and conclusions. Some tasks ask you to implement custom functions; others ask you to present plots and describe them. Keep your code as short as possible and your answers as clear as possible.

#### Evaluation

Each task is worth the number of points shown next to it, **15 points** in total. If you use open-source code, make sure to include its URL.

#### How to submit
- Name your file according to this convention: `lab01_GroupNo_Surname_Name.ipynb`. If you don't have a group number, put `nan` instead.
- Attach it to an email with the subject `lab01_GroupNo_Surname_Name.ipynb`
- Send it to `cosmic.research.ml@yandex.ru`

## Part 1. Numpy [7 points]

In this part you must not use loops (`for`, `while`) or the `map` function. For every function you implement, provide a usage example (you may use randomly sampled matrices). Pay attention to the types of the input variables.
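
As an illustration of the loop-free style expected here, the sketch below replaces an element-by-element loop with a boolean mask and a reduction. The function `sum_of_positive_squares` is a hypothetical helper, not one of the lab tasks, and the random matrix only serves as a usage example.

```python
import numpy as np

def sum_of_positive_squares(arr):
    # Hypothetical helper for illustration only, not a lab task.
    assert isinstance(arr, np.ndarray)
    mask = arr > 0                 # boolean array with the same shape as arr
    return np.sum(arr[mask] ** 2)  # select, square and reduce without a loop

# Usage example on a randomly sampled integer matrix
rng = np.random.default_rng(0)
sample = rng.integers(-5, 6, size=(3, 4))
print(sample)
print(sum_of_positive_squares(sample))
```

The same pattern (build a mask or an index, then apply one array operation) covers most tasks in this part.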

In [None]:
import numpy as np

**Task 1. [0.5 points]** Implement a function that takes two matrices as input, flattens them and returns a one-dimensional array in which the elements of these matrices alternate. If one matrix has more elements than the other, its leftover elements go at the end.

For example, `(np.array([[1,2,3], [4,5,6]]), np.array([[7,8],[9,10]])) -> [1,7,2,8,3,9,4,10,5,6]`.

In [None]:
def flatten_merge(arr_a, arr_b):
    assert isinstance(arr_a, np.ndarray) and isinstance(arr_b, np.ndarray)
    # YOUR CODE HERE

**Task 2. [0.5 points]** Implement a function that calculates the product of the non-zero elements of an array. For example, for `np.array([1,2,0,6])` the answer is 12.

If there are no non-zero elements, the function must return `nan`.

In [None]:
def product_non_zero(arr):
    assert isinstance(arr, np.ndarray)
    # YOUR CODE HERE

**Task 3. [1 point]** Normalize every column of the input matrix: subtract the column mean and divide by the column standard deviation (avoid division by zero).

In [None]:
def vertical_scale(arr):
    assert isinstance(arr, np.ndarray)
    # YOUR CODE HERE

**Task 4. [0.5 points]** Implement a function that returns the transposed matrix without modifying the input matrix.

In [None]:
def safe_transpose(arr):
    assert isinstance(arr, np.ndarray)
    # YOUR CODE HERE

**Task 5. [0.5 points]** Implement a function that returns the index of the maximum element in the matrix.

In [None]:
def max_elem_index(arr):
    assert isinstance(arr, np.ndarray)
    # YOUR CODE HERE

**Task 6. [1 point]** Implement a function that takes a matrix and inserts zeros between every pair of adjacent rows and every pair of adjacent columns.

Example: `[[1, 2], [3, 4]] -> [[1, 0, 2], [0, 0, 0], [3, 0, 4]]`

In [None]:
def insert_zeros(arr):
    assert isinstance(arr, np.ndarray)
    # YOUR CODE HERE   

**Task 7. [0.5 points]** Implement a function that returns those columns of a matrix in which the number of elements greater than `k` exceeds the number of elements smaller than `k`.

For example, `([[1,2],[3,4]], 2) -> [[2], [4]]`.

In [None]:
def k_columns(arr, k):
    assert isinstance(arr, np.ndarray)
    # YOUR CODE HERE 

**Task 8. [0.5 points]** Implement a function that takes an integer matrix and an integer as input and multiplies each element of the matrix by the minimal positive integer factor that makes the element divisible by the given number.

For example, `([[5, 4, 36, 8]], 12) -> [[60, 12, 36, 24]]`.
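
To see where the numbers in this example come from: for positive integers `x` and `k`, the smallest factor `m` such that `x * m` is divisible by `k` is `k / gcd(x, k)`. The scalar check below only verifies the example; it deliberately uses list comprehensions, so it is not an acceptable solution for this part. The helper name `minimal_factor` is hypothetical.

```python
from math import gcd

def minimal_factor(x, k):
    # Smallest positive integer m such that x * m is divisible by k.
    # Assumes x and k are positive integers.
    return k // gcd(x, k)

row = [5, 4, 36, 8]
k = 12
factors = [minimal_factor(x, k) for x in row]
scaled = [x * m for x, m in zip(row, factors)]
print(factors)  # [12, 3, 1, 3]
print(scaled)   # [60, 12, 36, 24]
```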

In [None]:
def make_divisible(arr, k):
    assert isinstance(arr, np.ndarray)
    assert isinstance(k, int)
    # YOUR CODE HERE

**Task 9. [0.5 points]** Given a matrix, implement a function that replaces all elements greater than `a_max` with `a_max` and all elements smaller than `a_min` with `a_min`.

In [None]:
def min_max_crop(arr, a_min, a_max):
    assert isinstance(arr, np.ndarray)
    # YOUR CODE HERE

**Task 10. [0.5 points]** Implement a function that replaces nan elements with the mean of all non-nan elements. If all elements are nan, the function does nothing.
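
A short reminder of how nan behaves in numpy, as a sketch on a toy array (the values are made up): nan is not equal to itself, so it has to be detected with `np.isnan`, and it can only be stored in floating-point arrays.

```python
import numpy as np

values = np.array([1.0, np.nan, 3.0])

print(np.nan == np.nan)      # False: comparison with nan never matches
mask = np.isnan(values)      # element-wise nan detection
print(mask)                  # [False  True False]
print(values[~mask].mean())  # 2.0, mean over the non-nan elements
print(np.nanmean(values))    # 2.0, same result with the nan-aware reduction
```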

In [None]:
def replace_nans(arr):
    assert isinstance(arr, np.ndarray)
    # YOUR CODE HERE

**Task 11. [0.5 points]** Implement a function that calculates the following for a given matrix:

- determinant
- trace
- eigenvalues
- Frobenius norm
- inverse matrix
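
Two details about these quantities are worth checking on a toy matrix (the matrix below is made up): the Frobenius norm is the square root of the sum of squared entries, and a singular matrix has no inverse, in which case `np.linalg.inv` raises `np.linalg.LinAlgError`. How your function should react to a singular input is not specified by the task, so the handling shown here is only an assumption.

```python
import numpy as np

a = np.array([[3.0, 0.0],
              [4.0, 0.0]])

# Frobenius norm computed from its definition and with numpy
by_definition = np.sqrt(np.sum(np.abs(a) ** 2))
by_numpy = np.linalg.norm(a, ord="fro")
print(by_definition, by_numpy)  # 5.0 5.0

# The second column is zero, so the matrix is singular
try:
    np.linalg.inv(a)
    inverse_exists = True
except np.linalg.LinAlgError:
    inverse_exists = False
print(inverse_exists)  # False
```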

In [None]:
def matrix_stats(arr):
    assert isinstance(arr, np.ndarray)
    # YOUR CODE HERE

**Task 12. [0.5 points]** Implement a function that takes two lists of the same length `N` and constructs an `N` by 3 matrix. The first two columns are the elements of the input lists, and the values in the third column are the bitwise xor of the first two elements of the same row.
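
As a reminder, bitwise xor compares the binary representations of two integers and sets a bit wherever they differ. A small sketch with made-up numbers:

```python
import numpy as np

print(5 ^ 3)  # 6, because 0b101 xor 0b011 is 0b110

left = np.array([5, 12, 7])
right = np.array([3, 10, 7])
print(np.bitwise_xor(left, right))  # [6 6 0]
```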

In [None]:
def construct_xor_matrix(list_a, list_b):
    assert isinstance(list_a, list)
    assert isinstance(list_b, list)
    # YOUR CODE HERE

## Part 2. Dataset analysis [8 points]


In this part we are going to analyze the "Titanic dataset".
The main goal of this task is to describe the data.

Here are some tips:
- use plots
- notice peculiarities in the data
- present verbal explanations, don't be too shy

**Important:** Please pay attention to your plots: titles, axis labels and legends are required.

These tasks involve `numpy`, `pandas` and `matplotlib`, which are very common Python modules. In one task you may use `scipy`.

#### Input data
This task uses 2 files:
- `passengers_record.csv` contains some general information on passengers (name, class, age, etc.)
- `survival_info.csv` contains binary labels indicating whether each passenger survived

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

**Task 1. [1 point]**

Read these files and join them into a single dataset (use `passenger_id` as the join key). Then provide some description of the dataset:
- What are age/gender/class distributions and their averages? How many people belong to each group?
- Find the oldest/youngest passengers in every class
- Compare survival rates between classes/age groups/genders (configuration of age groups is up to you)
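
Joining two tables on a key can be sketched on toy data as follows. Only `passenger_id` comes from the task; the other column names (`age`, `survived`) and all values are made up for illustration and may not match the real files.

```python
import pandas as pd

# Toy tables, not the real Titanic files
records = pd.DataFrame({"passenger_id": [1, 2, 3], "age": [22.0, 38.0, None]})
labels = pd.DataFrame({"passenger_id": [3, 1, 2], "survived": [1, 0, 1]})

# Inner join on the key; validate raises if the key is duplicated on either side
merged = records.merge(labels, on="passenger_id", how="inner", validate="one_to_one")
print(merged)
print(merged.shape)  # (3, 3)
```

Comparing the shape of the result with the shapes of the inputs is a quick way to check that no rows were lost or duplicated by the join.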

In [None]:
passengers_record = # YOUR CODE HERE
survival_info = # YOUR CODE HERE

assert passengers_record.shape == (891, 11) and survival_info.shape == (891, 2), "Wrong db shapes"

In [None]:
db = # YOUR CODE HERE

**Task 2. [2 points]**

Compare average age per class: can we consider these values to be equal? Use visualisation to prove your point. 

Can we answer this question using the T-test? Why? Here you may use `scipy.stats`.
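
For reference, a two-sample t-test in `scipy.stats` can be called as in the sketch below. The data here are synthetic samples with made-up parameters, not the Titanic ages; `equal_var=False` selects Welch's variant, which does not assume equal variances. Whether the assumptions of the test hold for the real data is for you to argue.

```python
import numpy as np
from scipy import stats

# Synthetic samples for illustration only
rng = np.random.default_rng(42)
group_a = rng.normal(loc=38.0, scale=14.0, size=200)
group_b = rng.normal(loc=25.0, scale=12.0, size=400)

# Welch's t-test for the difference of the two means
result = stats.ttest_ind(group_a, group_b, equal_var=False)
print(result.statistic, result.pvalue)
```

If the input contains missing values, `ttest_ind` also accepts `nan_policy="omit"`.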

**Task 3. [2 points]**

Use `plt.subplots` to create side-by-side histograms of distributions:
- columns: 3 classes
- rows: age, fare, sex, survival rate

The output is a 4 by 3 grid of plots.
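
A minimal sketch of how `plt.subplots` returns a grid of axes that can be indexed by row and column. It draws a smaller 2 by 3 grid from synthetic data; the titles, labels and distributions are placeholders, not the expected answer.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# axes is a 2D numpy array of Axes objects, indexed as axes[row, column]
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(12, 6))

for col in range(3):
    # Synthetic data standing in for two numeric features of one class
    axes[0, col].hist(rng.normal(30, 10, size=100), bins=15)
    axes[0, col].set_title(f"Group {col + 1}: feature A")
    axes[0, col].set_xlabel("value")
    axes[0, col].set_ylabel("count")

    axes[1, col].hist(rng.exponential(20, size=100), bins=15)
    axes[1, col].set_title(f"Group {col + 1}: feature B")
    axes[1, col].set_xlabel("value")
    axes[1, col].set_ylabel("count")

fig.tight_layout()
```

In a notebook with `%matplotlib inline`, the figure is rendered automatically at the end of the cell.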

**Task 4. [1 point]**

Find the 5 most common surnames in the passenger list. Try to use `pandas.Series.apply` with a lambda function to extract the surnames.
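
The mechanics of `Series.apply` with a lambda can be sketched on an unrelated toy series (the strings below are made up): each element is passed to the lambda, and `value_counts` then ranks the results by frequency.

```python
import pandas as pd

places = pd.Series(["Paris, France", "Lyon, France", "Rome, Italy"])

# Keep the part after the comma and strip the surrounding whitespace
countries = places.apply(lambda text: text.split(",")[1].strip())

# Most frequent values first
print(countries.value_counts().head(2))
```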

**Task 5. [2 points]**

You have probably already noticed that there are missing values in the dataset. We will try to fix that.
1. Drop all rows that contain missing values. Is the result plausible? How does this method affect the amount of data and the values from Task 1 (age histograms, survival rates, etc.)?
2. Suggest some better options for handling missing values in the data and check whether they distort the statistics.
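
The two approaches can be compared on a toy table as sketched below. The column names `age` and `deck` and all values are made up, and median imputation is shown only as one possible option, not as the recommended one.

```python
import numpy as np
import pandas as pd

# Toy table with missing values in both columns
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "deck": ["C", None, None, "E"],
})

print(toy.isna().sum())  # number of missing values per column

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: fill missing ages with the median of the observed ages
filled = toy.assign(age=toy["age"].fillna(toy["age"].median()))

print(len(toy), len(dropped))                   # 4 2
print(toy["age"].mean(), filled["age"].mean())  # mean age before and after filling
```

Even on four rows, dropping halves the data, while filling changes the mean age; the same comparison applies to the real dataset.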

### Great! Don't forget to submit before the deadline :)