# Quizzes
### Week 1

### What is backpropagation usually used for in neural networks?
* To propagate signal through network from input to output only
* Select gradient update direction by flipping a coin
* Make several random perturbations of parameters and go back to the best one
* **To calculate gradient of the loss function with respect to the parameters of the network**
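To make the correct answer concrete, here is a minimal sketch of backpropagation on a made-up two-parameter network, `y_hat = w2 * tanh(w1 * x)` with a squared-error loss. The network, the function names and the numbers are illustrative assumptions, not part of the quiz. The analytic gradient from the chain rule is checked against a finite-difference estimate.

```python
import math


def forward(w1, w2, x, y):
    """Forward pass of a toy network: hidden activation, prediction, loss."""
    h = math.tanh(w1 * x)
    y_hat = w2 * h
    loss = (y_hat - y) ** 2
    return h, y_hat, loss


def backward(w1, w2, x, y):
    """Backpropagation: apply the chain rule from the loss back to w1 and w2."""
    h, y_hat, _ = forward(w1, w2, x, y)
    d_y_hat = 2.0 * (y_hat - y)       # dL/dy_hat
    d_w2 = d_y_hat * h                # dL/dw2
    d_h = d_y_hat * w2                # dL/dh
    d_w1 = d_h * (1.0 - h * h) * x    # dL/dw1, using tanh'(z) = 1 - tanh(z)^2
    return d_w1, d_w2


def numeric_grad(w1, w2, x, y, eps=1e-6):
    """Central finite differences, used only to sanity-check backward()."""
    g1 = (forward(w1 + eps, w2, x, y)[2] - forward(w1 - eps, w2, x, y)[2]) / (2 * eps)
    g2 = (forward(w1, w2 + eps, x, y)[2] - forward(w1, w2 - eps, x, y)[2]) / (2 * eps)
    return g1, g2


analytic = backward(0.5, -1.5, 2.0, 1.0)
numeric = numeric_grad(0.5, -1.5, 2.0, 1.0)
print(analytic, numeric)
```

The two gradients should agree to several decimal places; an optimizer would then use them to update `w1` and `w2`.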

### Suppose we've trained a RandomForest model with 100 trees. Consider two cases:

1. We drop the first tree in the model
2. We drop the last tree in the model

We then compare the models' performance on the train set. Select the right answer.

* In case 1, performance will drop more than in case 2
* **In case 1, performance will be roughly the same as in case 2**
* In case 1, performance will drop less than in case 2

### Suppose we've trained a GBDT model with 100 trees with a fairly large learning rate. Consider two cases:

1. We drop the first tree in the model
2. We drop the last tree in the model

We then compare the models' performance on the train set. Select the right answer.

* **In case 1, performance will drop more than in case 2**
* In case 1, performance will be roughly the same as in case 2
* In case 1, performance will drop less than in case 2
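The effect can be reproduced with a toy gradient boosting loop. The sketch below is an assumption-laden stand-in for a real GBDT library: it uses depth-1 trees (stumps) on a made-up 1D regression problem with squared loss and a learning rate of 0.5, and all function names are hypothetical. Because each stump is fitted to the residuals of the stumps before it, the first stump carries a large share of the prediction while the last one only applies a small correction.

```python
def fit_stump(xs, residuals):
    """Find the threshold split that minimises squared error on the residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # the largest value would leave the right side empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]


def predict_stump(stump, x):
    t, lv, rv = stump
    return lv if x <= t else rv


def fit_gbdt(xs, ys, n_trees, lr):
    """Sequential boosting: every stump is fitted to the current residuals."""
    stumps, preds = [], [0.0] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return stumps


def mse(stumps, lr, xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        pred = lr * sum(predict_stump(s, x) for s in stumps)
        total += (y - pred) ** 2
    return total / len(xs)


def drop_tree_experiment(n_trees=100, lr=0.5):
    xs = [i / 10 for i in range(20)]
    ys = [x * x for x in xs]
    stumps = fit_gbdt(xs, ys, n_trees, lr)
    full = mse(stumps, lr, xs, ys)
    no_first = mse(stumps[1:], lr, xs, ys)   # case 1: drop the first tree
    no_last = mse(stumps[:-1], lr, xs, ys)   # case 2: drop the last tree
    return full, no_first, no_last


full_mse, no_first_mse, no_last_mse = drop_tree_experiment()
print(full_mse, no_first_mse, no_last_mse)
```

On this toy data the train error after dropping the first stump is much larger than after dropping the last one. The same experiment on a RandomForest would not be expected to show this asymmetry, because its trees are built independently.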

### Consider two cases:

1. We fit two RandomForestClassifiers 500 trees each and average their predicted probabilities on the test set.
2. We fit a RandomForestClassifier with 1000 trees and use it to get test set probabilities.

All hyperparameters except number of trees are the same for all models. Select the right answer.

* **The quality of predictions in case 1 will be roughly the same as the quality of predictions in case 2**
* The quality of predictions in case 1 will be lower than the quality of predictions in case 2
* The quality of predictions in case 1 will be higher than the quality of predictions in case 2
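The arithmetic behind this answer can be sketched without training anything. Assuming a forest outputs the plain average of its trees' predicted probabilities, averaging two 500-tree forests gives the same number as one 1000-tree forest built from the same trees. The random numbers below are only a stand-in for per-tree probabilities, and `average` is a hypothetical helper.

```python
import random


def average(rows):
    """Element-wise mean of equally long prediction lists."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


rng = random.Random(0)
# Stand-in for per-tree probabilities: 1000 "trees", 5 test objects.
tree_probs = [[rng.random() for _ in range(5)] for _ in range(1000)]

forest_a = average(tree_probs[:500])          # first 500-tree forest
forest_b = average(tree_probs[500:])          # second 500-tree forest
two_forests = average([forest_a, forest_b])   # case 1
one_forest = average(tree_probs)              # case 2

print(max(abs(a - b) for a, b in zip(two_forests, one_forest)))
```

In practice the two setups grow different random trees, so the predictions agree only approximately, which is why the answer says "roughly the same".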

### Which library provides the most convenient way to perform matrix multiplication?

* **Numpy**
* Pandas
* Sklearn
* XGBoost


### Which libraries contain implementation of linear models?

* Pandas
* Numpy
* **Sklearn**
* Matplotlib
* t-SNE

### Which library (or libraries) are used to train a neural network?

* Numpy
* **Pytorch**
* Matplotlib
* **Keras**
* **Tensorflow**
* t-SNE

### Select the correct statements about the RandomForest and GBDT models.

* **In GBDT each new tree is built to improve the previous trees.**
  * `The idea of boosting is to correct errors of previously learned models.`
* Trees in GBDT can be constructed in parallel (that is how XGBoost makes use of all your cores)
  * `No, we need to build trees in a sequential manner. In XGBoost multiple cores are used to build a single tree.`
* In RandomForest each new tree is built to improve the previous trees.
  * `No, every tree is independent.`
* **Trees in RandomForest can be constructed in parallel (that is how RandomForest from sklearn makes use of all your cores)**
  * `Yes, since each tree is independent of the other trees.`



### What type does a feature with values `['low', 'middle', 'high']` most likely have?

* Categorical
* Numeric
* Coordinates
* **Ordinal (ordered categorical)**
* Text
* Datetime

### Suppose you have a dataset X, and a version of X where each feature has been standard scaled.
For which model types can training or testing quality differ substantially depending on which of the two datasets is used?

* **Neural network**
* Random Forest
* **Nearest neighbours**
* GBDT
  * `Tree-based methods split features using simple thresholds and are usually insensitive to monotonic transforms of the data, so GBDT will perform more or less the same on both datasets.`
* **Linear models**
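A small sketch of why nearest neighbours are sensitive to scaling: with made-up data where one feature lives in `[0, 1]` and the other in the hundreds, the larger feature dominates the Euclidean distance until both are standard scaled. The data and helper names are illustrative assumptions.

```python
import math
import statistics


def standard_scale(train, rows):
    """Scale each column of `rows` with the mean and std computed on `train`."""
    cols = list(zip(*train))
    means = [statistics.mean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def nearest_index(train, query):
    """Index of the training point closest to `query` in Euclidean distance."""
    dists = [math.sqrt(sum((a - b) ** 2 for a, b in zip(p, query))) for p in train]
    return dists.index(min(dists))


train = [(0.1, 500.0), (0.9, 520.0)]
query = (0.15, 515.0)

raw_nn = nearest_index(train, query)  # second feature dominates the distance

scaled_train = standard_scale(train, train)
scaled_query = standard_scale(train, [query])[0]
scaled_nn = nearest_index(scaled_train, scaled_query)  # both features now count

print(raw_nn, scaled_nn)
```

On the raw data the nearest neighbour is the second training point; after scaling it is the first. A threshold split in a tree, by contrast, would separate the same objects before and after scaling.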

### Suppose we want to fit a GBDT model to data with a categorical feature.
We need to somehow encode the feature. Which of the following statements are true?

* One-hot encoding is always better than label encoding
* **Depending on the dataset, either label encoding or one-hot encoding could be better**
* Label encoding is always better to use than one-hot encoding
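For reference, the two encodings can be sketched in a few lines of plain Python. The helpers below are hypothetical and assign labels in alphabetical order, which is an arbitrary choice made for this illustration.

```python
def label_encode(values):
    """Map each category to an integer (alphabetical order assumed here)."""
    categories = sorted(set(values))
    mapping = {c: i for i, c in enumerate(categories)}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """One binary column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


colors = ["red", "green", "blue", "green"]
labels, mapping = label_encode(colors)
one_hot, columns = one_hot_encode(colors)

print(labels)   # [2, 1, 0, 1]
print(columns)  # ['blue', 'green', 'red']
print(one_hot)  # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Label encoding keeps a single column but imposes an order on the categories; one-hot encoding imposes no order but adds one column per category.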

### What can be useful to do about missing values?

* Impute with feature variance
* **Nothing, but use a model that can deal with them out of the box**
* **Replace them with a constant (-1/-999/etc.)**
* Apply standard scaler
* **Remove rows with missing values**
* **Reconstruct them (for example train a model to predict the missing values)**
* **Impute with a feature mean**
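Three of the simpler options above can be sketched as follows, with `None` standing in for a missing value. The helper names and the sample data are assumptions for illustration only.

```python
def fill_constant(column, value=-999):
    """Replace missing entries with a constant that lies outside the usual range."""
    return [value if v is None else v for v in column]


def fill_mean(column):
    """Replace missing entries with the mean of the observed values."""
    present = [v for v in column if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in column]


def drop_missing_rows(rows):
    """Keep only the rows that have no missing entries."""
    return [row for row in rows if all(v is not None for v in row)]


ages = [20.0, None, 40.0]
print(fill_constant(ages))                     # [20.0, -999, 40.0]
print(fill_mean(ages))                         # [20.0, 30.0, 40.0]
print(drop_missing_rows([[1, None], [2, 3]]))  # [[2, 3]]
```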

# Quiz
### Feature preprocessing and generation with respect to models

## Suppose we have a feature with all the values between 0 and 1 except few outliers larger than 1. 
What can help us to *decrease outliers' influence on non-tree models*?

* **Apply rank transform to the features**
  * `Yes, because after the rank transform the distance between adjacent objects in the sorted array is 1, so outliers end up very close to the other samples.`
* **[Winsorization](https://en.wikipedia.org/wiki/Winsorizing)**
  * `The main purpose of winsorization is to remove outliers by clipping the feature's values.`
* StandardScaler
* MinMaxScaler
* **Apply `np.log1p(x)` transform to the data**
  * `This transformation is non-linear and will move outliers relatively closer to other samples.`
* **Apply `np.sqrt(x)` transform to the data**
  * `This transformation is non-linear and will move outliers relatively closer to other samples.`
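The four correct options can be compared on a made-up feature with one outlier. This is a minimal sketch using only the standard library (`math.log1p` and `math.sqrt` in place of the NumPy calls); the helper names, the quantile rule used for winsorization and the data are assumptions.

```python
import math


def rank_transform(values):
    """Replace each value by its position in sorted order (ties are not handled)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, i in enumerate(order):
        ranks[i] = rank
    return ranks


def winsorize(values, lower_q=0.05, upper_q=0.95):
    """Clip values to the range between two (crudely estimated) quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[int(lower_q * (n - 1))]
    hi = ordered[int(upper_q * (n - 1))]
    return [min(max(v, lo), hi) for v in values]


def outlier_ratio(values):
    """How many times larger the maximum is than the second-largest value."""
    ordered = sorted(values)
    return ordered[-1] / ordered[-2]


values = [0.1, 0.4, 0.5, 0.9, 50.0]  # last value is the outlier

print(rank_transform(values))  # [0, 1, 2, 3, 4]
print(winsorize(values))       # [0.1, 0.4, 0.5, 0.9, 0.9]
print(outlier_ratio(values))
print(outlier_ratio([math.log1p(v) for v in values]))
print(outlier_ratio([math.sqrt(v) for v in values]))
```

The ratio between the outlier and the next-largest value shrinks considerably after `log1p` or `sqrt`. A linear rescaling such as StandardScaler or MinMaxScaler would leave that ratio of distances unchanged, which is why those options are not marked as correct.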

## Suppose we fit a tree-based model. 
In which cases can label encoding be better than one-hot encoding?

* **When the number of categorical features in the dataset is huge**
  * `One-hot encoding a categorical feature with a huge number of values can lead to (1) high memory consumption and (2) non-categorical features being rarely used by the model. You can deal with the 1st case if you employ sparse matrices. The 2nd case can occur if you build a tree using only a subset of features. For example, if you have 9 numeric features and 1 categorical feature with 100 unique values and you one-hot encode that categorical feature, you will get 109 features. If a tree is built with only a subset of features, the initial 9 numeric features will rarely be used. In this case, you can increase the parameter controlling the size of this subset. In xgboost it is called colsample_bytree, in sklearn's Random Forest max_features.`
* **When categorical feature is ordinal**
  * `Correct! Label encoding can lead to better quality if it preserves the correct order of values. In this case a split made by a tree will divide the feature into values 'lower' and 'higher' than the value chosen for this split.`
* **When we can come up with a label encoder that assigns close labels to similar (in terms of target) categories**
  * `Correct! First, in this case the tree will achieve the same quality with fewer splits, and second, this encoding will help to treat rare categories.`
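One way to build a label encoder that places similar categories next to each other is to order the categories by their mean target value. The sketch below is a hypothetical helper on made-up data; it computes the means on the data it is given, which in practice would be the training set only.

```python
from collections import defaultdict


def target_ordered_labels(categories, targets):
    """Assign integer labels so that categories are sorted by mean target."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    ordered = sorted(means, key=lambda c: means[c])
    return {c: i for i, c in enumerate(ordered)}


cities = ["A", "B", "C", "A", "B", "C"]
target = [1, 0, 1, 1, 0, 0]

mapping = target_ordered_labels(cities, target)
print(mapping)  # {'B': 0, 'C': 1, 'A': 2}
```

With this ordering a single threshold split on the encoded feature separates low-target categories from high-target ones.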

## Suppose we fit a tree-based model on several categorical features.
In which cases can one-hot encoding be better than label encoding?

* **If target dependence on the label-encoded feature is very non-linear, i.e. values that are close to each other in the label-encoded feature correspond to target values that aren't close.**
  * `Correct! If this feature is important, a tree would try to make a lot of splits and select each of the feature's values on its own. But because a tree is built in a greedy way, it can be hard to select one important value in a label-encoded vector. This won't be a problem if you use OHE.`
* When the feature has only two unique values
  * `Incorrect. In this case both one-hot encoding and label encoding will produce similar columns.`

## Suppose we have a categorical feature and a linear model. 
We need to somehow encode this feature. Which of the following statements are true?

* Label encoding is always better than one-hot encoding
  * `Usually the dependence between the feature and the target is non-linear. In this case a linear model will not be able to utilize Label Encoded feature efficiently.`
* One-hot encoding is always better than label encoding
  * `Consider the toy example when the label encoded feature and the target are equal. In this case a linear model on this feature will have the perfect quality.`
* **Depending on the dataset either of label encoder or one-hot encoder could be better**

# Practice
### Feature extraction from text and images

## TF-IDF is applied to a matrix where each column represents a word, each row represents a document, and each value shows the number of times a particular word occurred in a particular document. Choose the correct statements.

* **IDF scales features inversely proportionally to the number of word occurrences over documents**
* IDF scales features proportional to the frequency of word’s occurrences
* TF normalizes sum of the column values to 1
* **TF normalizes sum of the row values to 1**
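The two correct statements can be checked on a tiny count matrix. This sketch assumes the plain textbook variant, `tf = count / row sum` and `idf = log(N / document frequency)`; library implementations may use smoothed or differently normalised variants, and the function names here are hypothetical.

```python
import math


def tf(counts):
    """Normalise each row (document) so that its values sum to 1."""
    return [[c / sum(row) for c in row] for row in counts]


def idf(counts):
    """log(N / document frequency) for each column (word)."""
    n_docs = len(counts)
    n_words = len(counts[0])
    dfs = [sum(1 for row in counts if row[j] > 0) for j in range(n_words)]
    return [math.log(n_docs / df) for df in dfs]


def tf_idf(counts):
    weights = idf(counts)
    return [[v * w for v, w in zip(row, weights)] for row in tf(counts)]


# Rows are documents, columns are words; every word occurs in at least one document.
counts = [
    [2, 1, 0],
    [1, 0, 1],
    [3, 0, 0],
]

print([sum(row) for row in tf(counts)])  # each row sums to 1 (up to rounding)
print(idf(counts))                       # the word found in every document gets 0.0
print(tf_idf(counts))
```

The first word occurs in all three documents, so its IDF weight is zero, while the two rarer words receive a larger weight.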

## Which of these methods can be used to preprocess texts?

* **Stemming**
* **Lemmatization**
* Plumping
* **Stopwords removal**
* **Lowercase transformation**
* Levenshteining
* Plumbing
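A rough sketch of three of these steps chained together: lowercasing, stopword removal, and a very crude suffix-stripping rule standing in for stemming. The stopword list and suffix rules are tiny made-up examples, not a real stemming algorithm, and lemmatization (which needs a dictionary) is not shown.

```python
STOPWORDS = {"the", "a", "an", "is", "are", "on", "and"}  # tiny illustrative list
SUFFIXES = ("ing", "ed", "s")  # crude rules, not a real stemming algorithm


def crude_stem(word):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    tokens = text.lower().split()                      # lowercase transformation
    tokens = [t.strip(".,!?") for t in tokens]         # drop surrounding punctuation
    tokens = [t for t in tokens if t and t not in STOPWORDS]  # stopwords removal
    return [crude_stem(t) for t in tokens]             # toy stemming


print(preprocess("The cats are playing on the mats."))  # ['cat', 'play', 'mat']
```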

## What is the main purpose of Lemmatization and Stemming?

* To induce common word amplification standards to the most useful for machine learning algorithms form.
* To reduce significance of common words.
* **To reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.**
* To remove words which are not useful.

## To learn Word2vec embeddings we need ...

* Labels for the documents in the corpora
* GloVe embeddings
* Labels for each word in the documents in the corpora
  * `We only need words. `
* **Text corpora**

# Quiz
### Feature extraction from text and images

## Select true statements about n-grams

* Levenshteining should always be applied before computing n-grams
* **N-grams features are typically sparse**
  * `Correct. N-grams deal with counts of word occurrences, and not every word can be found in a document. For example, if we count occurrences of words from an English dictionary in our everyday speech, a lot of words won't be there, and that is sparsity.`
* **N-grams can help utilize local context around each word**
  * `Correct, because ngrams encode sequences of words.`
* N-grams always help increase significance of important words
  * `No, n-grams deal with word occurrences and not their importance.`
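A minimal sketch of word n-gram extraction, with hypothetical helper names. Each n-gram is a tuple of `n` consecutive tokens, which is how local context such as "not good" gets captured.

```python
from collections import Counter


def ngrams(tokens, n):
    """All runs of n consecutive tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(tokens, n):
    """Count how often each n-gram occurs in one document."""
    return Counter(ngrams(tokens, n))


tokens = "this is not good".split()
print(ngrams(tokens, 2))
# [('this', 'is'), ('is', 'not'), ('not', 'good')]
print(ngram_counts(tokens, 1)[("not",)])  # 1
```

Over a whole corpus most possible n-grams never occur in any given document, which is why the resulting feature matrix is sparse.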

## Select true statements.

* You do not need bag of words features in a competition if you have word2vec features.
  * `Incorrect. Both approaches are valuable and you should try to utilize both of them.`
* Meaning of each value in BOW matrix is unknown.
  * `Incorrect. Meaning of a value in BOW matrix is the number of a word's occurrences in a document.`
* **Semantically similar words usually have similar word2vec embeddings.**
  * `Correct. This is one of the main benefits of w2v in competitions.`
* **Bag of words usually produces longer vectors than Word2vec**
  * `Correct! Number of features in Bag of words approach is usually equal to number of unique words, while number of features in w2v is restricted to a constant, like 300 or so.`

## Suppose in a new competition we are given a dataset of 2D medical images. 
We want to extract image descriptors from a hidden layer of a neural network pretrained on the ImageNet dataset. We will then use extracted descriptors to train a simple logistic regression model to classify images from our dataset.

We are considering two networks: ResNet-50 with ImageNet accuracy of X and VGG-16 with ImageNet accuracy of Y (X < Y). Select true statements.

* Descriptors from ResNet-50 and from VGG-16 are always very similar in cosine distance.
* For any image, descriptors from the last hidden layer of ResNet-50 are the same as the descriptors from the last hidden layer of VGG-16.
* **It is not clear which descriptors are better on our dataset. We should evaluate both.**
* Descriptors from ResNet-50 will always be better than the ones from VGG-16 in our pipeline.
* With one pretrained CNN model you can get only one vector of descriptors for an image


## Data augmentation can be used at (1) train time (2) test time

* False, True
* True, False
* **True, True**
  * `Data augmentation can be used (1) to increase the amount of training data and (2) to average predictions over several augmented versions of one sample.`
* False, False
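Both uses can be sketched with a horizontal flip on a tiny image stored as nested lists. The "model" below is a hypothetical scoring function invented for the illustration, not a trained network.

```python
def horizontal_flip(image):
    """Mirror an image (list of rows) left to right."""
    return [row[::-1] for row in image]


def identity(image):
    return image


def augment_training_set(images, labels):
    """Train-time use: add a flipped copy of every image with the same label."""
    flipped = [horizontal_flip(img) for img in images]
    return images + flipped, labels + labels


def toy_model(image):
    """Hypothetical scorer: mean brightness of the left column."""
    return sum(row[0] for row in image) / len(image)


def predict_with_tta(model, image, augmentations):
    """Test-time use: average the model's predictions over augmented copies."""
    preds = [model(aug(image)) for aug in augmentations]
    return sum(preds) / len(preds)


image = [[1.0, 0.0], [1.0, 0.0]]

train_images, train_labels = augment_training_set([image], [1])
print(len(train_images), train_labels)  # 2 [1, 1]

print(toy_model(image))                                               # 1.0
print(predict_with_tta(toy_model, image, [identity, horizontal_flip]))  # 0.5
```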