# Quizzes

### Exploratory data analysis

![](img/quiz1-1.png)
![](img/quiz1-2.png)

* Z=X−Y
* Z=X/Y
* Z=X+Y
* **Z=XY**

![](img/quiz2-1.png)
![](img/quiz2-2.png)

> 2


![](img/quiz3-1.png)
![](img/quiz3-2.png)

The following code was used to produce these two plots:


```python
# top plot
plt.plot(x, '.')

# bottom plot
logX = np.log1p(x) # no NaNs after this operation
plt.plot(logX, '.')
```
(Note that this is not the same variable X as in the previous questions.)<br>

Which hypotheses about variable X do NOT contradict the plots? In other words: which hypotheses can't we reject (not in the statistical sense) based on the plots and our intuition?

* **X is a counter or label encoded categorical feature**
* X can be the temperature (in Celsius) in different cities at different times
* **X takes only discrete values**
* X can take a value of zero
  * `No! For x=0, log(x+1)=0. But the minimum value on the log(x+1) plot is between 0.5 and 1.0 and is most likely log(1+1)≈0.69, so min(x)=1.`
* **2≤X<3 happens more frequently than 3≤X<4**
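The argument against the zero-value option can be checked numerically. The following is a small sketch and not part of the original quiz; the candidate values are an assumption chosen purely for illustration.

```python
import numpy as np

# log1p(x) = log(1 + x): candidate minimum values of x and their transformed values
candidates = np.array([0, 1, 2, 3])
transformed = np.log1p(candidates)

for value, t in zip(candidates, transformed):
    print(f"x = {value}: log1p(x) = {t:.3f}")

# keep only the candidates whose transformed value falls between 0.5 and 1.0,
# which is where the minimum of the bottom plot lies
in_range = candidates[(transformed > 0.5) & (transformed < 1.0)]
print("candidates consistent with the plot minimum:", in_range)
```

Only x = 1 lands in that interval, which matches the explanation that min(x) = 1 and X never takes the value zero.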

![](img/quiz4.png)

Suppose we are given a dataset with features X and Y and need to learn to classify objects into 2 classes. The corresponding targets for the objects from the dataset are denoted as y.<br>

The top-left plot shows a scatter plot of X vs Y, produced with the following code:

```python
# y is a target vector
plt.scatter(X, Y, c = y)
```

We use the target variable y to color-code the points.<br>

The other three plots were produced by jittering X and Y values:


```python
import numpy as np

def jitter(data, stdev):
    # add zero-mean Gaussian noise with standard deviation stdev to every element
    N = len(data)
    return data + np.random.randn(N) * stdev

# sigma is a given std. dev. for Gaussian distribution
plt.scatter(jitter(X, sigma), jitter(Y, sigma), c = y)
```

That is, we add Gaussian noise to the features before drawing the scatter plot. Select the correct statements.

* We need to jitter variables not only for the sake of visualization, but also because it is beneficial for a model.
  * `Jittering introduces noise to a variable, and it is surely not a good idea to introduce artificial noise to the dataset. Usually, the cleaner the dataset, the better the results.`
* It is always beneficial to jitter variables before building a scatter plot
* Target is completely determined by coordinates (x,y), i.e. the label of a point is completely determined by the point's position (x,y). In other words: if we only had the two features (x,y), we could build a classifier that is accurate 100% of the time.
* **Standard deviation for jittering is the largest on the bottom right plot.**
* **Top right plot is "better" than top left one. That is, every piece of information we can find on the top left we can also find on the top right, but not vice versa.**
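One way to see why a jittered plot can reveal more than the raw one is to count how many distinct positions the points occupy. The sketch below is illustrative only: the data is hypothetical (two discrete features with three values each), and the seeded generator and the `rng` argument are assumptions added for reproducibility, not part of the quiz code.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator so the sketch is reproducible

def jitter(data, stdev, rng):
    # add zero-mean Gaussian noise with standard deviation stdev
    return data + rng.normal(0.0, stdev, size=len(data))

# hypothetical discrete features: many points share exactly the same (X, Y)
X = rng.integers(0, 3, size=200).astype(float)
Y = rng.integers(0, 3, size=200).astype(float)

# number of distinct positions a scatter plot would actually draw
distinct_before = len(set(zip(X, Y)))
distinct_after = len(set(zip(jitter(X, 0.1, rng), jitter(Y, 0.1, rng))))

print("distinct positions without jitter:", distinct_before)
print("distinct positions with jitter:", distinct_after)
```

Without jitter, the 200 points collapse onto at most 9 positions, so overlapping points of different classes hide each other. With a small standard deviation every point becomes visible while the clusters stay recognizable; a standard deviation that is too large would smear the clusters together.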

## Validation
### Practice Quiz, 4 questions

Suppose we are given a huge dataset. We did a KFold validation once and noticed that scores on each fold are roughly the same. Which validation type is most practical to use?


* **We can use a simple holdout validation scheme because the data is homogeneous.**
  * `Correct! If scores on different folds are similar, we indeed can use holdout split. In fact, this is often the case.`
* We should keep on using KFold scheme as the data is homogeneous and KFold is the most computationally efficient scheme.
* Leave-one-out because the data is not homogeneous.

Suppose we are given a medium-sized dataset and we did a KFold validation once. We noticed that scores on each fold differ noticeably. Which validation type is the most practical to use?

* LOO
* **KFold**
  * `Correct. This is the most frequent way to deal with this kind of situation. Also, the deviation of scores across KFold folds will help you identify statistically significant changes in score while tuning a model.`
* Holdout
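The spread of fold scores mentioned in the explanation can be measured directly. Below is a minimal NumPy-only sketch, assuming a least-squares linear model and synthetic data; the function name `kfold_scores`, the model, and the data are all hypothetical and chosen just to keep the example self-contained.

```python
import numpy as np

def kfold_scores(X, y, n_splits=5, seed=0):
    """Return the per-fold mean squared error of a least-squares linear model."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    folds = np.array_split(indices, n_splits)
    scores = []
    for i in range(n_splits):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != i])
        # fit on the training folds, evaluate on the held-out fold
        coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        residuals = X[val_idx] @ coef - y[val_idx]
        scores.append(float(np.mean(residuals ** 2)))
    return np.array(scores)

# hypothetical data: a noisy linear target
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=300)

scores = kfold_scores(X, y)
print("fold scores:", scores)
print("mean:", scores.mean(), "std:", scores.std())
```

As a rough rule of thumb, a change in the mean score that is much smaller than the standard deviation across folds is hard to distinguish from noise. If the fold scores are nearly identical, a single holdout split is likely to give a similar estimate at lower cost.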

The features we generate depend on the train-test data splitting method. Is this true?

* False
* **True**

Which of these can indicate an expected leaderboard shuffle in a competition?

* **A small amount of training and/or test data**
* **Different public/private data or target distributions**
* **Most of the competitors have very similar scores**

## Validation
### Quiz, 4 questions

Select true statements

* **The logic behind validation split should mimic the logic behind train-test split.**
* The model that performs best on the validation set is guaranteed to be the best on the test set.
  * `Incorrect. Target in the test set can have different distribution and our score estimation can fail.`
* Performance increase on a fixed cross-validation split guarantees performance increase on any cross-validation split.
  * `Incorrect. You can overfit to the specific CV split. You should change your split from time to time to reduce the chance of overfitting.`
* **Underfitting refers to not capturing enough patterns in the data**
* **We use validation to estimate the quality of our model**

Usually on Kaggle it is allowed to select two final submissions, which will be checked against the `private LB` and contribute to the competitor's final position. A common practice is to select `one submission with the best validation score`, and `another submission which scored best on the Public LB`. What is the logic behind this choice?

* Generally, this approach is based on the assumption that people rarely tend to overfit to the Public LB. Almost always you have a lot of data in the test set and it is quite hard to overfit. Indeed, this renders validation useless.
* **Generally, this approach is based on the assumption that the test data may have a different target distribution compared to the train data. If that is true, the submission chosen based on the Public LB will perform better. If, on the other hand, the distributions are similar, the submission chosen based on validation scores will perform better.**
* Generally, this approach is based on the assumption that validation is rarely valid in competitions. Often it is hard to trust your validation, and thus you should account for both cases: the validation succeeding and the validation failing.

Suppose we have a competition where we are given a dataset of marketing campaigns. Each campaign runs for a few weeks, and for each day of a campaign we have a target: the number of new customers involved. Thus a row in the dataset looks like<br>

```
Campaign_id, Date, {some features}, Number_of_new_customers
```

The test set consists of multiple campaigns. For each of them, the first several days are given in the train data. For example, if a campaign runs for two weeks, we could have the first three days in the train set, and all subsequent days in the test set. For another campaign, running for weeks, we could have the first 6 days in the train set, and the remaining days in the test set.<br>

Identify the train/test split in this competition.

* **Combined split**
  * `For each campaign train and test are divided by a date, and this date can be different for different campaigns. Thus, split is made by id and by time.`
* Time-based split
* Id-based split
* Random split
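Since the validation split should mimic the train/test split, a combined split can be reproduced locally by cutting each campaign at its own point in time. The following is a minimal sketch under stated assumptions: the field names `campaign_id` and `day`, the function `combined_split`, and the sample campaigns are hypothetical, and each campaign is assumed to have exactly one row per day.

```python
from collections import defaultdict

def combined_split(rows, n_train_days):
    """Split rows per campaign: the first n_train_days[campaign] days go to train."""
    by_campaign = defaultdict(list)
    for row in rows:
        by_campaign[row["campaign_id"]].append(row)

    train, valid = [], []
    for campaign_id, campaign_rows in by_campaign.items():
        # order each campaign by time, then cut at that campaign's own cutoff
        campaign_rows.sort(key=lambda r: r["day"])
        cutoff = n_train_days[campaign_id]
        train.extend(campaign_rows[:cutoff])
        valid.extend(campaign_rows[cutoff:])
    return train, valid

# hypothetical data: campaign "a" runs 14 days, campaign "b" runs 7 days
rows = [{"campaign_id": "a", "day": d} for d in range(1, 15)]
rows += [{"campaign_id": "b", "day": d} for d in range(1, 8)]

train, valid = combined_split(rows, {"a": 3, "b": 6})
print(len(train), len(valid))
```

The cutoff differs per campaign, so the split is neither purely time-based nor purely id-based, which is what makes it a combined split.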

Which of the following problems can you usually identify without the Leaderboard?

* **Train and test data are from different distributions**
* **Public leaderboard score will be unreliable because of too little data**
* Train and test targets are from different distributions
  * `Incorrect! To do this, we would need to have test target values, which is not possible in a competition.`
* **Different scores/optimal parameters between folds**

## Data Leakages
### Quiz, 4 questions

Suppose that you have a credit scoring task, where you have to create an ML model that approximates expert evaluation of an individual's creditworthiness. Which of the following can potentially be a data leakage? Select all that apply.

* Among the features you have a company_id, an identifier of a company where this person works. It turns out that this feature is very important and adding it to the model significantly improves your score.
* **The first half of the data points in the train set have a score of 0, while the second half have scores > 0.**
* **An ID of a data point (row) in the train set correlates with target variable.**
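A quick way to look for this kind of leak is to check whether the row order carries information about the target. The sketch below uses hypothetical data that mimics the second option (zeros in the first half, positive scores in the second half); it is an illustration of the check, not a general leak detector.

```python
import numpy as np

# hypothetical train set: first half of rows has target 0, second half has target > 0
n = 1000
row_id = np.arange(n)
target = np.concatenate([np.zeros(n // 2), np.linspace(0.1, 1.0, n // 2)])

# Pearson correlation between the row position and the target
corr = np.corrcoef(row_id, target)[0, 1]
print(f"correlation between row id and target: {corr:.3f}")
```

A strong correlation suggests that the data was sorted or collected in a way that depends on the target, so the row ID should be investigated before it is used as a feature.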

What is the most foolproof way to set up a time series competition?

* Make a time-based split for train/test and a random split for public/private.
* **Split train, public and private parts of data by time. Remove all features except IDs (e.g. timestamp) from test set so that participants will generate all the features based on past and join them themselves.**
* Split train, public and private parts of data by time. Remove time variable from test set, keep the features.

Suppose that you have a binary classification task being evaluated by logloss metric. You know that there are 10000 rows in public chunk of test set and that constant 0.3 prediction gives the public score of 1.01. Mean of target variable in train is 0.44. What is the mean of target variable in public part of test data (up to 4 decimal places)?

> 

Logloss formula for binary classification:

![](http://wiki.fast.ai/images/math/a/4/6/a4651d4ad311666c617d57c1dde37b28.png)

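Per-row loss for a constant prediction of 0.3 (computed with `numpy` imported as `np`):

* for a row with target 0: `-((1-0)*np.log(1-.3))` gives 0.35667494393873245
* for a row with target 1: `-(np.log(.3))` gives 1.2039728043259361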
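The public score of a constant prediction is a weighted average of the two per-row losses, with the mean of the target as the weight, so the mean can be solved for directly. The sketch below follows from the logloss formula; the variable names are assumptions introduced for illustration.

```python
import numpy as np

pred = 0.3            # constant prediction from the question
public_score = 1.01   # reported public logloss

loss_if_0 = -np.log(1 - pred)  # per-row loss when the true label is 0
loss_if_1 = -np.log(pred)      # per-row loss when the true label is 1

# public_score = p * loss_if_1 + (1 - p) * loss_if_0, solved for p
p = (public_score - loss_if_0) / (loss_if_1 - loss_if_0)
print(round(p, 4))
```

This gives a public-set target mean of about 0.7711. Neither the train mean of 0.44 nor the number of rows is needed for the calculation.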

Suppose that you are solving an image classification task. What is the label of this picture?

![img](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/Po7WII6cEeekTw4xHnjhxA_1349a164252c7f654e705fd2094c88bc_label_is_3.jpg?expiry=1520553600000&hmac=xr_xYq3mmABVdT8T9TD16Cdo2nqknaM40smiYoq-ny4)

> 