# Common Data Scientist Interview Questions

## Past Experiences

From highest business impact to lowest:
1. quality score
    * automate repetitive tasks, data-driven pricing
    * time series as cross-section data; multiple features, missing data, avoid assumption of stationarity for short time periods
    * feature engineering; knowing how many days are needed helps the business make better contracts
    * smaller models with good features are easier to productionize
2. crm customer lookalike
    * offline to online conversion
    * very imbalanced dataset; supervised learning is useless
    * rule-based, PCA, autoencoders
3. search
    * show things people will most likely buy
    * full-text match
    * learning-to-rank with product embeddings
4. recommendation
    * personalized with collaborative filtering
    * association rules with collaborative filtering
    * similar products with category classifier
5. abnormality detection in xrays
    * pretrained models with 3 channels to 1 channel
    * normalization of images
    * false discovery rate at 93% recall
6. text classification in thai
    * ULMFit over BERT with cleaning rules

## What is a stationary time series

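A stationary process has statistical properties that do not change over time:
* the mean,
* the variance, and
* the autocorrelation structure.

The Dickey-Fuller test checks whether the coefficient $\rho-1$ in the regression below is zero. The null hypothesis $\rho = 1$ means the series has a unit root and is non-stationary; rejecting it supports stationarity.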

$$y_t - y_{t-1} = (\rho - 1) y_{t-1} + e_t$$
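
The regression behind the test can be sketched in plain Python. This is a minimal illustration, not a full test: the helper names `simulate_ar1` and `df_coefficient` are made up for this example, and it only estimates $\rho - 1$ by least squares without an intercept. A real Dickey-Fuller test compares the resulting t-statistic against special critical values, which a library routine such as `statsmodels.tsa.stattools.adfuller` handles.

```python
import random

def simulate_ar1(rho, n, seed):
    """Simulate y_t = rho * y_{t-1} + e_t with standard normal noise."""
    rng = random.Random(seed)
    y = [0.0]
    for _ in range(n):
        y.append(rho * y[-1] + rng.gauss(0, 1))
    return y

def df_coefficient(y):
    """Least-squares estimate of (rho - 1) in y_t - y_{t-1} = (rho - 1) y_{t-1} + e_t."""
    lagged = y[:-1]
    diffs = [b - a for a, b in zip(y[:-1], y[1:])]
    return sum(d * l for d, l in zip(diffs, lagged)) / sum(l * l for l in lagged)

random_walk = simulate_ar1(rho=1.0, n=5000, seed=0)  # unit root, non-stationary
stationary = simulate_ar1(rho=0.5, n=5000, seed=1)   # stationary AR(1)

print(df_coefficient(random_walk))  # close to 0
print(df_coefficient(stationary))   # close to -0.5
```

A coefficient near zero is what the null hypothesis of a unit root predicts; a clearly negative coefficient points toward stationarity.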

## ARIMA in [R](https://www.analyticsvidhya.com/blog/2015/12/complete-tutorial-time-series-modeling/) and [Python](https://www.analyticsvidhya.com/blog/2016/02/time-series-forecasting-codes-python/)

1. `AR` - autoregressive

$$y_t = \beta_0 y_{t-1} + e_t$$

2. `I` - differencing

$$z_t = y_t - y_{t-1}$$

3. `MA` - moving average (of errors)

$$y_t = \beta_0 e_{t-1} + e_t$$

Variations
* `ARIMA(1,0,0)` = first-order autoregressive model
* `ARIMA(0,1,0)` = random walk
* `ARIMA(1,1,0)` = differenced first-order autoregressive model
* `ARIMA(0,1,1)` without constant = simple exponential smoothing
* `ARIMA(0,1,1)` with constant = simple exponential smoothing with growth
* `ARIMA(0,2,1)` or (0,2,2) without constant = linear exponential smoothing
* `ARIMA(1,1,2)` with constant = damped-trend linear exponential smoothing
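
One of these equivalences can be checked by hand. The sketch below assumes the MA sign convention used above ($y_t = \beta_0 e_{t-1} + e_t$ applied to the differenced series) and initialises both forecasts at the first observation; the function names are illustrative only. Under those assumptions, `ARIMA(0,1,1)` without constant and simple exponential smoothing produce the same one-step forecasts when $\beta_0 = \alpha - 1$.

```python
import math

def ses_forecasts(y, alpha):
    """Simple exponential smoothing: f_{t+1} = alpha * y_t + (1 - alpha) * f_t."""
    f = [y[0]]  # assumption: first forecast equals the first observation
    for obs in y:
        f.append(alpha * obs + (1 - alpha) * f[-1])
    return f

def arima011_forecasts(y, beta0):
    """ARIMA(0,1,1) without constant: y_t = y_{t-1} + e_t + beta0 * e_{t-1}.

    The one-step forecast is f_{t+1} = y_t + beta0 * e_t, with e_t = y_t - f_t.
    """
    f = [y[0]]  # same initialisation as the smoothing recursion
    for obs in y:
        error = obs - f[-1]
        f.append(obs + beta0 * error)
    return f

series = [10.0, 12.0, 11.0, 13.0, 12.5, 14.0]
alpha = 0.3
ses = ses_forecasts(series, alpha)
arima = arima011_forecasts(series, beta0=alpha - 1)

print(all(math.isclose(a, b) for a, b in zip(ses, arima)))  # True
```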

## What are some probability distributions

* Bernoulli - a single boolean outcome that succeeds with probability $p$
* Binomial - number of successes in $n$ independent Bernoulli trials with probability $p$
* Hypergeometric - like Binomial, but sampling without replacement
* Poisson - number of events in a fixed interval with rate $\lambda$; the limit of Binomial as $n \to \infty$ with $np = \lambda$ held fixed (still discrete)
* Exponential - time between Poisson events with rate $\lambda$
* Weibull - generalized Exponential whose event rate changes over time
* Geometric - number of failures before the first success with probability $p$
* Negative binomial - number of failures before $n$ successes with probability $p$
* Normal distribution - limiting distribution of sample means of independent and identically distributed variables with finite variance (central limit theorem)
* Standard normal distribution - N(0,1)
* t distribution - fatter-tailed normal distribution; approaches the normal as degrees of freedom grow
* Chi-square distribution - sum of squares of independent standard normal random variables
* F distribution - ratio of two independent chi-squared random variables, each divided by its degrees of freedom
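
The relationship between Binomial and Poisson can be checked numerically. The sketch below uses illustrative values ($\lambda = 3$, $n = 10000$) and hypothetical helper names; it compares the two probability mass functions for a Binomial with many trials and a small success probability $p = \lambda / n$.

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k successes in n Bernoulli trials)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(exactly k events in an interval with rate lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam, n = 3.0, 10_000
max_gap = max(
    abs(binomial_pmf(k, n, lam / n) - poisson_pmf(k, lam))
    for k in range(11)
)
print(max_gap)  # small: the two PMFs nearly coincide
```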


## What are the differences between PMF, PDF and CDF

* `PMF` - probability that a discrete random variable takes exactly a given value
* `PDF` - density of a continuous random variable; integrating it over a range of values gives the probability of that range
* `CDF` - probability that a random variable is less than or equal to a given value (sum of the PMF or integral of the PDF up to that value)
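
A small sketch of the three functions, using a Binomial for the discrete case and a normal distribution for the continuous case. The helper names are made up for this example, and only the standard library is used.

```python
import math

def binomial_pmf(k, n, p):
    """PMF: probability of exactly k successes."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def binomial_cdf(k, n, p):
    """CDF: probability of at most k successes (sum of the PMF)."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))

def normal_pdf(x, mu=0.0, sigma=1.0):
    """PDF: a density, not a probability."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF: probability that the variable is at most x."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

print(binomial_pmf(2, 4, 0.5))   # 0.375
print(binomial_cdf(2, 4, 0.5))   # 0.6875
print(normal_pdf(0.0))           # about 0.3989
print(normal_cdf(1.96) - normal_cdf(-1.96))  # about 0.95
```

The last line shows that probabilities for a continuous variable come from differences of the CDF over a range, while the PDF value at a single point is only a density.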

## Some commonly used performance metrics

classification
* precision - $\frac{tp}{tp+fp}$; positive predictive value
* negative precision - $\frac{tn}{tn+fn}$; negative predictive value
* recall - $\frac{tp}{tp+fn}$; true positive rate, sensitivity, power
* negative recall - $\frac{tn}{tn+fp}$; true negative rate, specificity
* false positive rate - $\frac{fp}{fp+tn}$; type 1 error rate
* false negative rate - $\frac{fn}{fn+tp}$; type 2 error rate
* false discovery rate - $\frac{fp}{tp+fp}$; 1-precision
* false omission rate - $\frac{fn}{tn+fn}$; 1-negative precision
* F-score - $F_\beta = \frac{(1 + \beta^2) \cdot tp}{(1 + \beta^2) \cdot tp + \beta^2 \cdot fn + fp}$
* Area under ROC curve - area under the curve of TPR (y-axis) against FPR (x-axis) across all classification thresholds
* kappa - $\kappa \equiv {\frac {p_{o}-p_{e}}{1-p_{e}}}=1-{\frac {1-p_{o}}{1-p_{e}}}$; $p_o$ is observed agreement, $p_e$ is expected agreement
* [BLEU (bilingual evaluation understudy)](https://medium.com/usf-msds/choosing-the-right-metric-for-machine-learning-models-part-1-a99d7d7414e4) - geometric mean of modified 1- to 4-gram precisions, multiplied by a brevity penalty
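
Most of the count-based metrics in this list follow directly from the four cells of a confusion matrix. The sketch below is illustrative: the function name and the example counts are invented, and kappa is computed for the binary case only.

```python
def classification_metrics(tp, fp, fn, tn, beta=1.0):
    """Compute common binary classification metrics from confusion matrix counts."""
    n = tp + fp + fn + tn
    b2 = beta ** 2
    # Observed agreement and agreement expected by chance (for kappa)
    p_o = (tp + tn) / n
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "false_positive_rate": fp / (fp + tn),
        "false_discovery_rate": fp / (tp + fp),
        "f_beta": (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp),
        "kappa": (p_o - p_e) / (1 - p_e),
    }

m = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in m.items():
    print(name, round(value, 4))
```

With these counts precision is 0.8, recall is about 0.667, $F_1$ is about 0.727 and kappa is 0.4.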

regression
* MSE - $\frac{1}{n} \sum_{i=1}^{n} (y_i-\hat{y}_i)^2$
* MAE - $\frac{1}{n} \sum_{i=1}^{n} |y_i-\hat{y}_i|$
* $\text{variation in y}$ - $\frac{1}{n} \sum_{i=1}^{n} (y_i-\bar{y})^2$
* $R^2$ - $1 - \frac{MSE}{\text{variation in y}}$
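
A short worked example of the regression metrics, using the usual definition $R^2 = 1 - MSE / \text{variation in y}$. The function name and the toy data are made up for illustration.

```python
def regression_metrics(y, y_hat):
    """Return MSE, MAE and R^2 for paired lists of targets and predictions."""
    n = len(y)
    mse = sum((a - b) ** 2 for a, b in zip(y, y_hat)) / n
    mae = sum(abs(a - b) for a, b in zip(y, y_hat)) / n
    y_bar = sum(y) / n
    variation = sum((a - y_bar) ** 2 for a in y) / n
    return mse, mae, 1 - mse / variation

mse, mae, r2 = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.5])
print(mse, mae, r2)  # 0.1875 0.375 0.85
```

A model that always predicts $\bar{y}$ has MSE equal to the variation in y, so its $R^2$ is 0; a perfect model has $R^2 = 1$.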

## What hypothesis tests to use for which occasions

* proportion vs proportion - two-proportion Z-test
* mean vs mean (real-valued measurements in two groups) - t-test
* categorical vs categorical - Chi-squared test
* categorical vs real number - (M)ANOVA; F-distribution
* small-sample 2x2 tables - Fisher's exact test, as in [lady tasting tea](https://en.wikipedia.org/wiki/Lady_tasting_tea)
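
Two of these tests are simple enough to sketch with the standard library. The function name and the conversion counts are invented for illustration. The Z-test uses the pooled proportion for the standard error, and the lady tasting tea calculation assumes the classic design of 8 cups with 4 of each kind.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided Z-test for the difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothetical A/B test: 120/1000 conversions vs 150/1000 conversions
z, p_value = two_proportion_z_test(120, 1000, 150, 1000)
print(round(z, 3), round(p_value, 4))

# Lady tasting tea: chance of picking all 4 milk-first cups out of 8 by guessing
p_all_correct = 1 / math.comb(8, 4)
print(p_all_correct)  # 1/70, about 0.0143
```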


## What is entropy

$$H(X) = -\Sigma_{x} p(x) \log_2 p(x)$$

In [91]:
import numpy as np

def entropy(X):
    #X is a list of probabilities that sum to 1; outcomes with p=0 contribute nothing
    res = 0
    for x in X:
        if x > 0: res+=x*np.log2(x)
    return -res

Xs = [[0.99,0.01],[0.9,0.1],[0.5,0.5],[0.1,0.9],[0.01,0.99]]
for X in Xs: print(X,entropy(X))

[0.99, 0.01] 0.08079313589591118
[0.9, 0.1] 0.4689955935892812
[0.5, 0.5] 1.0
[0.1, 0.9] 0.4689955935892812
[0.01, 0.99] 0.08079313589591118


## What are some common loss functions

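Hinge loss
$$L_{i} = \Sigma_{j \neq target} \max(0,f_{j} - f_{target} + \Delta)$$
where $f_j$ is the score for class $j$, $f_{target}$ is the score for the correct class and $\Delta$ is the margin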
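
A minimal sketch of the multiclass hinge loss for a single example, in plain Python. The function name and the scores are illustrative, and the margin $\Delta$ defaults to 1 as an assumption.

```python
def multiclass_hinge_loss(scores, target, delta=1.0):
    """Sum the margin violations of every non-target class."""
    return sum(
        max(0.0, s - scores[target] + delta)
        for j, s in enumerate(scores)
        if j != target
    )

scores = [3.2, 5.1, -1.7]  # class scores f_j for one example
loss = multiclass_hinge_loss(scores, target=0)
print(loss)  # about 2.9: only class 1 violates the margin
```

Class 1 scores higher than the target, so it contributes $5.1 - 3.2 + 1 = 2.9$; class 2 is already more than the margin below the target and contributes 0.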

In [44]:
import torch
import torch.nn.functional as F

#3 rows, 5 classes 0-4
y = torch.tensor([1, 0, 4])
logit = torch.randn(3, 5)

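Negative log likelihood, where $x$ holds log-probabilities and $N$ is the number of rows
\begin{align}
loss(x,class) &= -x_{class} \\
L &= \frac{1}{N} \Sigma loss(x,class)
\end{align}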

In [43]:
#nll_loss in pytorch expects log-probabilities as input; it only negates and averages the target entries (NO LOG!)
F.nll_loss(logit,y), -torch.gather(logit,1,y[:,None]).mean()

(tensor(-0.1049), tensor(-0.1049))

Cross entropy, where $x$ holds raw logits and $N$ is the number of rows
\begin{align}
loss(x,class) &= -log(\frac{e^{x_{class}}}{\Sigma e^{x_j}}) \\
L &= \frac{1}{N} \Sigma loss(x,class)
\end{align}

In [49]:
#cross entropy in pytorch = log_softmax + nll
F.cross_entropy(logit,y), -torch.gather(F.log_softmax(logit,1),1,y[:,None]).mean()

(tensor(1.3862), tensor(1.3862))

In [67]:
#binary cross entropy
y = torch.randint(0,2,(3,1)).float()
logit = torch.randn(3, 1)
y,logit

(tensor([[1.],
         [0.],
         [1.]]),
 tensor([[-0.0066],
         [-1.0632],
         [-1.2841]]))

In [77]:
#binary_cross_entropy_with_logits in pytorch = sigmoid + binary cross entropy
F.binary_cross_entropy_with_logits(logit,y),F.binary_cross_entropy(torch.sigmoid(logit),y)

(tensor(0.8405), tensor(0.8405))

In [80]:
#mse
y = torch.randn(3, 1)
y_hat = torch.randn(3, 1)
F.mse_loss(y_hat,y), (y-y_hat).pow(2).mean()

(tensor(2.2401), tensor(2.2401))

In [82]:
#mae
y = torch.randn(3, 1)
y_hat = torch.randn(3, 1)
F.l1_loss(y_hat,y), (y-y_hat).abs().mean()

(tensor(0.9315), tensor(0.9315))

## What are some common layers

* Softmax - $\frac{exp(x_i)}{\Sigma exp(x_j)}$; range 0-1, outputs sum to 1
* Sigmoid - $\frac{1}{1+ exp(-x)}$; range 0-1
* Tanh - $\frac{exp(x)-exp(-x)}{exp(x)+exp(-x)}$; range -1 to 1
* ReLU - $max(0,x)$; range 0 to infinity
* BatchNorm
* Dropout
* ConvXD
* RNN, LSTM, GRU

## What are some common distances

$$cosineSimilarity = \frac{A \cdot B}{||A||_{2}||B||_{2}} = \frac{\Sigma A_i B_i}{\sqrt{\Sigma A_i^2 \Sigma B_i^2}}$$

$$JaccardSimilarity = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$

$$Manhattan(l1) = \Sigma|A_i-B_i|$$

$$Euclidean(l2) =\sqrt{\Sigma(A_i-B_i)^2}$$
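
The four measures above can be sketched in a few lines of plain Python. The function names and the example vectors are made up for illustration; Jaccard similarity is shown on sets, the others on equal-length numeric lists.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def jaccard_similarity(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

A, B = [1, 0, 1], [1, 1, 0]
print(cosine_similarity(A, B))                              # about 0.5
print(jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"})) # 0.5
print(manhattan(A, B))                                      # 2
print(euclidean(A, B))                                      # sqrt(2), about 1.414
```

Cosine and Jaccard are similarities (higher means closer), while Manhattan and Euclidean are distances (lower means closer).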

## What is bias-variance tradeoff

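Bias is the systematic error of the model's average prediction relative to the true value, and variance is how much the prediction spreads out across different training sets. High bias is underfitting and high variance is overfitting; making a model more flexible typically lowers bias but raises variance.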
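
The tradeoff can be seen in a small simulation. Everything below is an illustrative assumption: the true function $f(x) = x^2$, the noise level, the test point $x_0 = 0.9$ and the two deliberately extreme models. One model always predicts the mean of the training targets (rigid, so high bias and low variance); the other predicts the target of the nearest training point (flexible, so low bias and high variance).

```python
import random

def true_f(x):
    return x ** 2

def simulate(n_train=30, n_repeats=2000, x0=0.9, noise=0.3, seed=0):
    """Collect predictions at x0 from many independently drawn training sets."""
    rng = random.Random(seed)
    preds_mean, preds_nn = [], []
    for _ in range(n_repeats):
        xs = [rng.random() for _ in range(n_train)]
        ys = [true_f(x) + rng.gauss(0, noise) for x in xs]
        preds_mean.append(sum(ys) / n_train)           # rigid model
        nearest = min(range(n_train), key=lambda i: abs(xs[i] - x0))
        preds_nn.append(ys[nearest])                   # flexible model (1-nearest neighbour)
    return preds_mean, preds_nn

def bias_sq_and_variance(preds, target):
    avg = sum(preds) / len(preds)
    variance = sum((p - avg) ** 2 for p in preds) / len(preds)
    return (avg - target) ** 2, variance

preds_mean, preds_nn = simulate()
bias_mean, var_mean = bias_sq_and_variance(preds_mean, true_f(0.9))
bias_nn, var_nn = bias_sq_and_variance(preds_nn, true_f(0.9))
print("mean model:", bias_mean, var_mean)
print("1-NN model:", bias_nn, var_nn)
```

The mean model should show a much larger squared bias, and the nearest-neighbour model a much larger variance, which is the pattern described as underfitting versus overfitting.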