# Lab 'Afterlife'
### Review of basic concepts to pass the course

### Important: do not delete any blocks
#### But you may add as many as you need.


#### About tasks

This notebook consists of numerous tasks, but please make it read as a single story: a report with your own code, thoughts, and conclusions. Some tasks require you to implement custom functions; others ask you to present plots and describe them. Keep your code as short as possible and your answers as clear as possible (in Russian or English).


#### Evaluation

- The tasks contain **Questions**; don't skip them. If you skip a question, the whole task is considered skipped.
- When your answer includes numbers, provide the code or calculations that support them.
- Pay close attention to your plots:
    - Are they comprehensible? Shapes, colours, sizes?
    - Are they titled?
    - Are axes labelled?
    - Is a legend included?

#### How to submit
- Name your file according to this convention: `2021_afterlife_GroupNumber_Surname_Name.ipynb`, for example:
    - `2021_afterlife_404_Sheipak_Sviat.ipynb`
- Attach your .ipynb to an email with the subject `2021_afterlife_GroupNumber_Surname_Name.ipynb`
- Send it to `cosmic.research.ml@yandex.ru`


#### The Data:
- All the datasets you need are here:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Part 1. Basic concepts

**Task 1.1.**

**Q:** What is supervised learning? What is the main goal of supervised learning? What is the difference between classification and regression?

**Your answer here:**


**Task 1.2.**

**Q:** What are objects and labels in classification and regression? Provide at least three examples of classification and regression tasks. For each example, suggest a few features that could be used in the task.

**Your answer here:**


**Task 1.3.**

**Q:** Name as many binary classification quality metrics as you can. Provide a formula for each of them.

**Your answer here:**


**Task 1.4.**

**Q:** Calculate all the metrics you mentioned in the task above. Implement them yourself; do not import anything apart from `numpy`.
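
Most binary classification metrics are ratios of the four confusion-matrix counts. The sketch below shows one possible way to obtain those counts with `numpy` only. The function name `confusion_counts` and the toy arrays are illustrative assumptions, not part of the task data.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels in {0, 1}."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))    # predicted 1, actually 1
    fp = int(np.sum(~y_true & y_pred))   # predicted 1, actually 0
    fn = int(np.sum(y_true & ~y_pred))   # predicted 0, actually 1
    tn = int(np.sum(~y_true & ~y_pred))  # predicted 0, actually 0
    return tp, fp, fn, tn

# Toy data (hypothetical, not the arrays from this task).
toy_truth = [1, 0, 1, 1, 0, 0]
toy_preds = [1, 1, 0, 1, 0, 0]
tp, fp, fn, tn = confusion_counts(toy_truth, toy_preds)
print(tp, fp, fn, tn)
```

Each metric you named in Task 1.3 can then be written as a one-line expression over these four numbers.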

In [None]:
predictions = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1]
ground_truth = [1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0]

# Metric 1:
# YOUR CODE HERE

# Metric 2:
# YOUR CODE HERE
# ...


# Metric N:
# YOUR CODE HERE

**Task 1.6.**

**Q:** What is ROC-AUC? Present two ways to calculate it (plot-based and probability-based). Implement at least one of these methods:
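
As a hint for the probability-based view: ROC-AUC can be read as the probability that a randomly chosen positive object gets a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below is one possible `numpy` implementation of that idea; the name `pairwise_auc` and the toy arrays are assumptions made for illustration.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Score difference for every (positive, negative) pair.
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (len(pos) * len(neg)))

# Toy data (hypothetical, not the arrays from this task).
toy_truth = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_auc = pairwise_auc(toy_truth, toy_scores)
print(toy_auc)
```

This version compares all pairs, so it is quadratic in the number of objects; that is fine for small arrays but a sorting-based approach scales better.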

In [None]:
predictions = [0.8, 0.41, 0.76, 0.6, 0.35, 0.74, 0.54, 0.1, 0.51, 0.68, 0.43, 0.95]
ground_truth = [1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0]

# YOUR CODE HERE

**Task 1.7.**

**Q:** What are train and test sets? What is the purpose of splitting the data into train and test?

**Your answer here:**


**Task 1.8.**

**Q:** What is overfitting? How do you spot overfitting? What are general ways to overcome it?

**Your answer here:**


### Part 2. Iris dataset

In [None]:
from sklearn.datasets import load_iris
iris_db = load_iris()
iris_db.keys()

**Task 2.1.**

**Q:** What are the classes and features in this dataset? For each feature, plot a histogram of its distribution, using a separate colour for each class.

In [None]:
# YOUR CODE HERE

**Task 2.2.**
Let's consider only two features (sepal length and sepal width) and two classes (setosa and versicolor). Plot a 2D scatterplot in which each class has its own colour and marker shape.

In [None]:
# YOUR CODE HERE

**Task 2.3.**

Using the dataset from 2.2, train 4 decision trees of depths `[1, 2, 3, 4]`.

For each tree, examine the classification metrics and plot the decision boundaries [[tips]](https://scikit-learn.org/stable/auto_examples/ensemble/plot_voting_decision_regions.html).
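
A common way to draw a decision boundary in 2D is to evaluate the model on a dense grid of points and colour the grid by predicted class. The helper below is a minimal sketch of the grid step only. The name `decision_grid`, its parameters, and the toy `predict` function are assumptions; in your solution `predict` would be the `predict` method of a fitted model.

```python
import numpy as np

def decision_grid(predict, X, step=0.02, margin=0.5):
    """Evaluate `predict` on a regular grid covering the 2D points in X."""
    X = np.asarray(X, dtype=float)
    x_min, x_max = X[:, 0].min() - margin, X[:, 0].max() + margin
    y_min, y_max = X[:, 1].min() - margin, X[:, 1].max() + margin
    xx, yy = np.meshgrid(np.arange(x_min, x_max, step),
                         np.arange(y_min, y_max, step))
    # Stack grid coordinates into an (n_points, 2) array for prediction.
    grid_points = np.c_[xx.ravel(), yy.ravel()]
    zz = np.asarray(predict(grid_points)).reshape(xx.shape)
    return xx, yy, zz

# Hypothetical stand-in for a fitted model: threshold on the first feature.
toy_X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
toy_predict = lambda pts: (pts[:, 0] > 1.0).astype(int)
xx, yy, zz = decision_grid(toy_predict, toy_X, step=0.1)
print(xx.shape, zz.shape)
```

The returned arrays can be passed to `plt.contourf(xx, yy, zz, alpha=0.3)` and overlaid with a scatter plot of the training points.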

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
# YOUR CODE HERE

**Task 2.4.**

- Using the dataset from 2.2, train a linear model and a linear SVM.
- Plot the decision boundaries for these models and print their accuracies.
- What are the analytical formulas for these boundaries?
- What is the difference between these two models?

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

In [None]:
# YOUR CODE HERE

### Part 3. Grid search

**Task 3.1.**

**Q:** What are the hyperparameters of an ML model? How do they differ from internal parameters?

**Your answer here:**


Now load the breast cancer dataset and split it into train and test sets (70% train and 30% test).

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

In [None]:
bc_db = load_breast_cancer()

In [None]:
X, Y = bc_db.data, bc_db.target

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, stratify=Y)

**Task 3.1.** 

Explain what K-fold cross-validation is:
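
To make the mechanics concrete, here is a minimal `numpy` sketch of how sample indices can be partitioned into K folds, with each fold used once for validation. The generator name `kfold_indices` and its parameters are assumptions for illustration; it is not scikit-learn's implementation.

```python
import numpy as np

def kfold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)        # shuffle before splitting
    folds = np.array_split(indices, n_splits)   # nearly equal-sized folds
    for i in range(n_splits):
        val_idx = folds[i]
        train_idx = np.concatenate(
            [folds[j] for j in range(n_splits) if j != i]
        )
        yield train_idx, val_idx

splits = list(kfold_indices(10, n_splits=5))
for train_idx, val_idx in splits:
    print(sorted(val_idx.tolist()), len(train_idx))
```

A model would be trained on `train_idx` and scored on `val_idx` in each iteration, and the K scores averaged.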

**Your answer here:**


**Task 3.2.** 

Explain how a random forest (RF) is built from decision trees for regression and classification tasks.

**Your answer here:**


In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier  # needed for the RF tasks below
from sklearn.metrics import f1_score

**Task 3.3.**

- Run a 2D grid search for the RF.
- Measure the training time of each iteration of the search.
- Suggest a few ways to make the whole search faster.
- Apply one of these faster methods and compare the consumed time and classification quality.
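
One possible way to time every grid point is to wrap each fit in `time.perf_counter()` calls. The sketch below assumes a user-supplied callable `fit_and_score` that trains a model for the given hyperparameters and returns a validation score. That callable, the helper name `timed_grid_search`, and the toy scoring function are hypothetical placeholders.

```python
import itertools
import time

def timed_grid_search(fit_and_score, param_grid):
    """Evaluate every parameter combination and record score and wall time."""
    names = list(param_grid)
    results = []
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        start = time.perf_counter()
        score = fit_and_score(**params)
        elapsed = time.perf_counter() - start
        results.append({"params": params, "score": score, "seconds": elapsed})
    return results

# Hypothetical stand-in for "train an RF and return its validation F1".
def toy_fit_and_score(n_estimators, max_depth):
    return 1.0 - 1.0 / (n_estimators * max_depth)

toy_grid = {"n_estimators": [10, 50], "max_depth": [2, 4]}
toy_results = timed_grid_search(toy_fit_and_score, toy_grid)
best = max(toy_results, key=lambda r: r["score"])
print(best["params"], best["score"])
```

Collecting the results as a list of dicts makes it easy to load them into a `pandas` DataFrame and plot time against quality.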

In [None]:
# YOUR CODE HERE

**Task 3.4.** 

Try to beat the RF with a non-linear SVM.

In [None]:
# YOUR CODE HERE

**Task 3.5.** 

Tune a voting classifier over SVM and RF, report the results.

In [None]:
# YOUR CODE HERE

### Part 4. Basic text analysis

The first task is binary classification of words: we'll try to separate Russian surnames from common Russian words.

In [None]:
def read_list_from_file(filename, prefix=""):
    res = []
    with open(prefix + filename, "r") as input_file:
        for line in input_file.readlines():
            res.append(line.strip())
    return res

surnames = np.array(read_list_from_file("data/russian_surnames.txt"))
all_words = np.array(read_list_from_file("data/russian.txt"))

In [None]:
surnames_labels = np.ones_like(surnames, dtype=int)
allwords_labels = np.zeros_like(all_words, dtype=int)

X = np.concatenate([surnames, all_words])
y = np.concatenate([surnames_labels, allwords_labels])

**Task 4.1.** 

We are going to use syllables as features. Why might this be a reasonable idea?

**Your answer here:**


**Task 4.2.** 

Implement a simple tokenizer:

In [None]:
def tokenize_word(word, token_len=3):
    ''' Function that splits word into sequence of tokens
    Args:
        word (string): input word
        token_len (int): length of each token
    Returns:
        list(str): list of tokens 
    '''
    tokens_list = []
    # YOUR CODE HERE
    
    return tokens_list

assert tokenize_word("cybersnatch") == ['cyb', 'ybe', 'ber', 'ers', 'rsn', 'sna', 'nat', 'atc', 'tch'], "smth's wrong"
print("tokenize_word: seems legit")

**Task 4.3.** Feature extraction:

- Apply the tokenizer to each word to split it into 3-char syllables
- Map each list of tokens to a vector with CountVectorizer
- Map each list of tokens to a vector with HashingVectorizer
- Train a linear model and an RF on each of these datasets (don't forget to split (X, y) into train and test)
- Report the results (training time and F1): which model is better and why?
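
The two vectorizers differ in how a token is mapped to a column: count-based vectorization builds an explicit vocabulary, while hashing computes the column index from a hash of the token and needs no vocabulary at all. The pure-Python sketch below illustrates that difference only. It is not how scikit-learn implements either class: `zlib.crc32` is a stand-in hash chosen for determinism, and the function names and toy token lists are assumptions.

```python
import zlib
import numpy as np

def count_vectors(token_lists):
    """Vocabulary-based counts: one column per distinct token seen."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    X = np.zeros((len(token_lists), len(vocab)), dtype=int)
    for row, tokens in enumerate(token_lists):
        for tok in tokens:
            X[row, vocab[tok]] += 1
    return X, vocab

def hashed_vectors(token_lists, n_features=8):
    """Hashing trick: fixed width, column = hash(token) mod n_features."""
    X = np.zeros((len(token_lists), n_features), dtype=int)
    for row, tokens in enumerate(token_lists):
        for tok in tokens:
            col = zlib.crc32(tok.encode("utf-8")) % n_features
            X[row, col] += 1  # different tokens may collide in one column
    return X

# Hypothetical pre-tokenized words.
toy_tokens = [["iva", "van", "ano", "nov"], ["van", "van"]]
X_count, vocab = count_vectors(toy_tokens)
X_hash = hashed_vectors(toy_tokens, n_features=8)
print(X_count.shape, X_hash.shape)
```

Note the trade-off: the hashed matrix has a fixed width regardless of the data, at the cost of possible collisions and no way to map a column back to a token.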

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

In [None]:
# YOUR CODE HERE

### Part 5. Clustering

**Task 5.1.** 

What is clustering? What is the difference between clustering and multiclass classification?

**Your answer here:**


**Task 5.2.** 

What is K-means? How does it work?

**Your answer here:**


**Task 5.3.** Iris + K-means

Apply K-means to the Iris dataset. Plot the clustering result on a 2D plane for each pair of features.

In [None]:
# YOUR CODE HERE

**Task 5.4.** PCA

- What is the mathematical concept behind PCA?
- What are the main purposes of using PCA?
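
As a computational hint, PCA can be carried out by centring the data and taking its singular value decomposition: the right singular vectors are the principal directions, and the squared singular values give the variance along them. The sketch below assumes this SVD route; the function name `pca_project` and the random toy data are illustrative assumptions.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project X onto its top principal components via SVD."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are orthonormal principal directions, sorted by variance.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = (S ** 2) / (len(X) - 1)
    Z = X_centered @ components.T
    return Z, components, explained_variance[:n_components]

# Toy data with very different scales per feature (hypothetical).
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(50, 4)) * np.array([5.0, 2.0, 0.5, 0.1])
Z, components, variances = pca_project(toy_X, n_components=2)
print(Z.shape, variances)
```

The first two columns of `Z` are the coordinates typically used for a 2D scatter plot of higher-dimensional data.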

**Your answer here:**


**Task 5.5.** 

Plot the clustering you obtained above on a 2D plane using PCA.

In [None]:
# YOUR CODE HERE