# Machine learning interview questions

## Easy

### 1. You are building a binary classifier for an unbalanced dataset (where 1 class is much rarer than the other, say 1% and 99%, respectively). How do you handle this situation?

Unbalanced classes can be dealt with in several ways.

1. Can we get more data? Is the event inherently rare?
2. Choose the appropriate performance metrics:
   * Don't use accuracy
   * Look at precision, recall, F1 score and the precision-recall or ROC curve
     
3. Apply resampling to the training data:
   * Oversampling the minority class via bootstrapping
   * Undersampling the majority class via bootstrapping
     
4. Generate synthetic examples for the training data:
   * SMOTE - creates synthetic examples of the minority class by interpolating between an instance and its nearest minority-class neighbours

5. Use ensemble models:
   * Apply boosting - misclassified examples, which are often from the minority class, are given a higher weight at each successive iteration

6. Design a custom cost function that penalises misclassification of the rare class more than the majority class
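
A minimal sketch of the metric and resampling points above, assuming NumPy is available. The synthetic labels, the seed and the helper name `precision_recall_f1` are illustrative only. It shows why accuracy misleads on a 1%/99% split and how bootstrapped oversampling rebalances a training set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic labels: exactly 1% positives (made up for illustration)
y_true = np.zeros(10_000, dtype=int)
y_true[:100] = 1

def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive (rare) class."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# A useless model that always predicts the majority class
y_pred = np.zeros_like(y_true)
accuracy = np.mean(y_pred == y_true)                          # 0.99, looks great
precision, recall, f1 = precision_recall_f1(y_true, y_pred)   # all zero

# Oversample the minority class with replacement until the classes are balanced
minority_idx = np.flatnonzero(y_true == 1)
majority_idx = np.flatnonzero(y_true == 0)
resampled_minority = rng.choice(minority_idx, size=majority_idx.size, replace=True)
balanced_idx = np.concatenate([majority_idx, resampled_minority])
y_balanced = y_true[balanced_idx]
```

Resampling should only ever be applied to the training split, so that validation metrics still reflect the true class balance.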

### 2. What are some differences you would expect in a model that minimises squared error versus a model that minimises absolute error? In which case would each error metric be appropriate?

$$MAE = \frac{1}{N} \sum_{i=1}^N |y_i - y_{pred,i}|$$
$$MSE = \frac{1}{N} \sum_{i=1}^N (y_i - y_{pred,i})^2$$
<center> Where N is the number of training samples </center>


Key differences:
* MSE is more sensitive to outliers as errors are squared before being averaged.
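* Minimising MSE pulls predictions towards the conditional mean; minimising MAE pulls them towards the conditional median
* MSE is smooth, so its gradient is easy to calculate during optimisation (linear regression even has a closed-form solution)
* MAE is not differentiable at zero error; minimising it needs subgradient methods or linear programming, which is usually less efficient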

Conclusion:
* Use MAE if the model needs to be robust to outliers and computational efficiency isn't an issue (e.g. small training set)
* Use MSE if the model doesn't need to be robust to outliers and computation time is an issue
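
The outlier sensitivity can be seen with a constant predictor. The sketch below assumes NumPy and uses made-up target values; it searches a grid of constants and finds that the squared-error minimiser sits at the mean (dragged towards the outlier) while the absolute-error minimiser sits at the median.

```python
import numpy as np

# Toy targets with one outlier (values are made up for illustration)
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Evaluate both losses for a grid of constant predictions c
candidates = np.linspace(0.0, 100.0, 10_001)   # step of 0.01
mse = ((y[None, :] - candidates[:, None]) ** 2).mean(axis=1)
mae = np.abs(y[None, :] - candidates[:, None]).mean(axis=1)

best_mse = candidates[np.argmin(mse)]   # close to the mean, 22.0
best_mae = candidates[np.argmin(mae)]   # close to the median, 3.0
```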

### 3. When performing K-means clustering, how do you choose K?

There is no perfect method for picking k (otherwise it would be a supervised problem).
1. ***Business intuition***
* Do you expect a certain number of clusters?
* Visualise features within expected groups, do they behave similarly?
<br>
<br>
2. ***Elbow method***
* A few clusters should explain a lot of the variation in data 
* Plot the Within-Cluster-Sum of Squared Errors (WSS) for different values of k - at what point are there diminishing returns?
* Calculation for each k:
  * Fit a k-means clustering model
  * Calculate the squared error for each point from the centroid of its cluster
  * Sum the squared errors across all points to give the WSS
  * Plot WSS versus k and choose the k at which the decrease in WSS first starts to level off (the "elbow")
<br>
<br>
3. ***Silhouette method***
$$ \text{Silhouette score} = \frac{(x-y)}{\max(x,y)}$$
<center> Where x = mean distance to points of the nearest cluster & y = mean distance to points in the same cluster. Euclidean distance is usually used. </center>

* A silhouette score measures how similar a point is to its own cluster (cohesion) compared to other clusters (separation)
* The score ranges from -1 to +1. A high value indicates a point is placed in the correct cluster
* Calculation for each k:
    * Calculate a silhouette score for each point
    * Plot a clustered bar chart showing scores for each point in their respective cluster
    * Choose the k with the highest mean silhouette score
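
The elbow and silhouette calculations can be sketched with NumPy alone. Everything below is an illustrative assumption rather than a reference implementation: the three synthetic blobs, the seeds, and the helper names `nearest`, `kmeans` and `silhouette_scores`. The `x` and `y` variables follow the notation in the silhouette formula above.

```python
import numpy as np

def nearest(X, centroids):
    """Index of the closest centroid for every row of X."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(X, k, n_init=10, n_iter=50, seed=0):
    """Lloyd's algorithm with random restarts; returns (wss, labels)."""
    rng = np.random.default_rng(seed)
    best_wss, best_labels = np.inf, None
    for _ in range(n_init):
        centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
        for _ in range(n_iter):
            labels = nearest(X, centroids)
            for j in range(k):
                if np.any(labels == j):      # leave an empty cluster's centroid alone
                    centroids[j] = X[labels == j].mean(axis=0)
        labels = nearest(X, centroids)
        inertia = float(((X - centroids[labels]) ** 2).sum())
        if inertia < best_wss:
            best_wss, best_labels = inertia, labels
    return best_wss, best_labels

def silhouette_scores(X, labels):
    """Per-point silhouette (x - y) / max(x, y), using Euclidean distance."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:                  # singleton cluster: score left at 0
            continue
        y = D[i, same].sum() / (same.sum() - 1)   # excludes the point itself
        x = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (x - y) / max(x, y)
    return scores

# Synthetic data: three blobs of 50 points each (made up for illustration)
rng = np.random.default_rng(42)
centres = np.array([[0.0, 0.0], [6.0, 0.0], [3.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.7, size=(50, 2)) for c in centres])

wss, mean_sil = {}, {}
for k in range(1, 7):
    wss[k], labels = kmeans(X, k)
    if k > 1:                                # silhouette needs at least 2 clusters
        mean_sil[k] = float(silhouette_scores(X, labels).mean())

best_k = max(mean_sil, key=mean_sil.get)
```

Plotting `wss` against k gives the elbow chart, and `mean_sil` gives one summary number per k. Because k-means depends on its random initialisation, results can differ between seeds, which is why the sketch keeps the best of several restarts.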

### 4. How can you make your models more robust to outliers?

The first step is to try and understand why outliers occurred. Different steps can then be followed.
1. ***Trimming***
* If the points are truly anomalous, and not worth incorporating, they can be removed.
* Risk losing information.
<br>
2. ***Winsorization***
* Cap the data at a threshold
* A 90% winsorization:
    * Cap the bottom 5% of values at the 5th percentile
    * Cap the top 5% of values at the 95th percentile
<br>
3. ***Change the cost function***
* The mean absolute error cost function is more robust to outliers than the mean squared error cost function (see above)
<br>
4. ***Add regularization***
* L1 & L2 regularization reduce variance by shrinking model weights
<br>
5. ***Transform the data***
<br>
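
As a minimal sketch of the 90% winsorization described above, assuming NumPy: the helper name `winsorize` and the sample values are hypothetical, and `np.percentile` is used with its default linear interpolation.

```python
import numpy as np

def winsorize(values, lower_pct=5.0, upper_pct=95.0):
    """Cap values below/above the given percentiles (90% winsorization by default)."""
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, lo, hi)

# Made-up sample with two extreme values
x = np.array([-50.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 500.0])
x_wins = winsorize(x)   # only the two extremes are pulled inwards
```

In practice the percentile thresholds would be computed on the training data and then reused unchanged on validation and test data.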

### 5. Say that you are running a multiple linear regression and that you have reason to believe that several of the predictors are correlated. How will the results behave if several are indeed correlated? How would you deal with this problem?

Two primary problems arise, both relating to uncertainty in feature importance:
1. ***P values are misleading***
* Important variables may have high, statistically insignificant P-values (as their importance is split across the correlated variables)
<br>

2. ***Coefficient estimates are unstable***
* Coefficients vary depending on which variables are included
* Imprecise estimates of coefficients lead to broad confidence intervals (maybe including zero)
<br>

Solutions:
1. ***Remove correlated predictors***
* Remove variables clearly related to the other (e.g. X and 2X)
* Use a latent (i.e. hidden) variable relating to correlated variables (e.g. speed replaces distance & time)
<br>

2. ***Combine correlated predictors***
* Combine correlated variables using PCA
* Calculate interaction terms - e.g. the product of the two that are correlated
<br>

3. ***Regularization***
* Use L2 regularization (e.g. Ridge regression) to stabilise the size of the coefficients
<br>
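
The instability and the effect of L2 regularization can be sketched with NumPy on synthetic data. The data-generating process, the penalty `lam=1.0` and the helper name `ridge` are assumptions for illustration; the closed-form solution below has no intercept term.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)      # almost a copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(size=n)             # only x1 truly drives y

def ridge(X, y, lam):
    """Closed-form ridge solution (no intercept); lam=0 gives OLS."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta_ols = ridge(X, y, lam=0.0)     # individual coefficients are poorly determined
beta_ridge = ridge(X, y, lam=1.0)   # shrunk towards a stable, shared effect

# Variance inflation factor for x1: 1 / (1 - R^2) from regressing x1 on x2
r = np.corrcoef(x1, x2)[0, 1]
vif_x1 = 1.0 / (1.0 - r ** 2)
```

A VIF above roughly 5 to 10 is a common rule of thumb for problematic collinearity. Note that the sum of the two coefficients stays close to the true effect even when the individual estimates do not.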

### 6. Describe the motivation behind random forests. What are two ways in which they improve upon individual decision trees?

Decision trees are prone to overfitting. Random forests are a type of ensemble learning.
1. They reduce overfitting and therefore variance via bagging (bootstrap aggregating).
2. Each constituent decision tree considers only a random subset of the predictor variables at each split. This decorrelates the trees, meaning they are not equivalent and learn about other features of the data. Without it they would all prioritise the strong predictors.
3. Random forests can be used to produce feature importance values [(see here)](https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html). 
4. Easy to implement and fast to run.
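
The two sources of randomness can be sketched without fitting any trees. The snippet below assumes NumPy, and the sizes and the square-root rule for the number of candidate features are illustrative choices. It draws the bootstrap rows and a random feature subset for each tree, and measures how many rows each tree never sees.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_trees = 1_000, 16, 100
max_features = int(np.sqrt(n_features))   # a common choice for classification

oob_fractions = []
for _ in range(n_trees):
    # Bagging: each tree sees n_samples rows drawn with replacement
    rows = rng.integers(0, n_samples, size=n_samples)
    # Feature subsampling: candidate features for one split
    split_features = rng.choice(n_features, size=max_features, replace=False)
    # Rows never drawn are "out-of-bag" and can act as a validation set
    oob_fractions.append(1 - np.unique(rows).size / n_samples)

mean_oob = float(np.mean(oob_fractions))   # about 1/e, i.e. roughly 0.37
```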

### 7. Given a large dataset of payment transactions, say we want to predict the likelihood of a given transaction being fraudulent. However, there are many rows with missing values for various columns. How would you deal with this?

Approach in a series of steps:

1. ***Characterise***
* What features are missing data? Are missing values numerical or categorical?
* Is it possible to use an additional data source to fill in the missing info?
* Why is the data missing?
    * *Missing completely at random (MCAR)*
      - Randomly distributed across a given variable and the probability of a value being missing is unrelated to other variables.
      - E.g. Skipping survey questions. Equipment malfunction. Data entry errors.
        
    * *Missing at random (MAR)*
      - NOT randomly distributed, as the probability of a value being missing is related to another observed variable (but not to the missing value itself).
      - E.g. Systematic exclusions. Data not collected for a specific demographic.
        
    * *Not missing at random (NMAR)*
       - Missing for reasons related to the values themselves.
       - E.g. People not reporting their income **because** of the value. Fear of discrimination.

2. ***Establish a baseline***
* Build a baseline model - does it meet business goals?
* Is the missing data a problem?
  * MCAR - Do the relevant features have predictive value?
  * MAR - Is the missing data within a category where fraudulent transactions never occur?

3. ***Impute missing data***
* If the baseline model is not OK - impute!
  * Mean/median value (simple but doesn't factor in other features and correlations)
  * Use a nearest neighbour method to estimate a value based on other features

4. ***Check performance with imputed data***
* Use cross validation to compare performance of the model with/without imputed data. If there is no change - alter the imputation method or remove the features with missing data.

Note: A performance increase would only be expected if the imputed features have predictive value.
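
A minimal sketch of median imputation, assuming NumPy and missing values encoded as NaN. The transaction amounts are made up, and keeping a `was_missing` indicator is an optional extra that lets the model learn from the missingness itself, which can matter when data is not missing completely at random.

```python
import numpy as np

# Made-up transaction amounts with missing entries encoded as NaN
amount = np.array([12.0, np.nan, 250.0, 31.0, np.nan, 18.0, 4000.0])

was_missing = np.isnan(amount)            # keep as an extra binary feature
median = np.nanmedian(amount)             # robust to the 4000.0 outlier
amount_imputed = np.where(was_missing, median, amount)
```

To avoid leakage, the median should be computed on the training fold only and reused for the validation fold within cross validation.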

### 8. Say you are running a simple logistic regression to solve a problem but find the results to be unsatisfactory. What are some ways you might improve your model, or what other models might you look into using instead?

***Model improvements***
* Logistic regression models often have high bias - add more features
* Make sure all features are normalised (so that features on larger scales don't dominate model performance)
* Perform feature selection - removing features with no predictive value may reduce noise
* Perform k-fold cross validation to optimise hyperparameters: e.g. choose a form of regularization to reduce overfitting

***Alternative models***
* Logistic regression provides a linear decision boundary.
* The classes may not be linearly separable, therefore try other classification methods:
    * Support vector machine (SVM)
    * Tree-based approaches
    * Neural networks


### 9. Say you were running a linear regression for a dataset but you accidentally duplicated every data point. What happens to your beta coefficient?
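
For ordinary least squares, the coefficient estimates do not change. Duplicating every row doubles both $X^TX$ and $X^Ty$, so $\hat{\beta} = (X^TX)^{-1}X^Ty$ is unaffected. What does change is the reported uncertainty: standard errors shrink by roughly $1/\sqrt{2}$, so P-values look more significant than they should. A small NumPy check on synthetic data (the data and the helper name `ols` are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one feature
y = 2.0 + 0.5 * X[:, 1] + rng.normal(size=n)

def ols(X, y):
    """Return OLS coefficients and their standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta, se

beta, se = ols(X, y)
beta_dup, se_dup = ols(np.vstack([X, X]), np.concatenate([y, y]))
# beta_dup matches beta; se_dup is smaller than se
```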

### 10. Compare and contrast gradient boosting and random forests.

Both are forms of ensemble learning. Key differences:

***Training***
* Random forests rely on independent parallel training of decision trees using bootstrap aggregating (bagging)
* Gradient boosting relies on sequential training of models where weak learners learn from the mistakes of preceding weak learners

***Testing***
$$\hat{f}(x)= mode\{\hat{f}_1(x),\hat{f}_2(x),...,\hat{f}_M(x)\}$$
$$\hat{f}(x)=\frac{1}{M} \sum_{m=1}^M\hat{f}_m(x)$$
<center> In random forests, the outputs of the M trees are combined at test time via majority voting (classification) or averaging (regression) </center>

$$\hat{f}(x)=\sum_{b=1}^B \lambda \hat{f}_b(x)$$
<center> In boosting, B models are combined sequentially during training, each scaled by a learning rate λ. The final model is then applied at test time. </center>
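
The boosting formula above can be sketched for squared-error regression using one-split stumps as weak learners. This is a simplified NumPy illustration under stated assumptions (synthetic 1-D data, hypothetical helper names `fit_stump` and `predict`), not how any particular library implements gradient boosting.

```python
import numpy as np

def fit_stump(x, residuals):
    """Best single-split regression stump on a 1-D feature."""
    best = None
    for threshold in np.unique(x)[:-1]:      # keeps both sides non-empty
        left = x <= threshold
        left_mean, right_mean = residuals[left].mean(), residuals[~left].mean()
        sse = ((residuals - np.where(left, left_mean, right_mean)) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x_new: np.where(x_new <= threshold, left_mean, right_mean)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 6, size=200))
y = np.sin(x) + rng.normal(scale=0.1, size=200)   # synthetic target

learning_rate, n_rounds = 0.1, 100    # lambda and B in the formula above
prediction = np.zeros_like(y)
stumps, train_mse = [], []
for _ in range(n_rounds):
    residuals = y - prediction        # negative gradient of 1/2 * squared error
    stump = fit_stump(x, residuals)
    stumps.append(stump)
    prediction += learning_rate * stump(x)
    train_mse.append(float(np.mean((y - prediction) ** 2)))

def predict(x_new):
    """Final model: sum of shrunken weak learners."""
    return sum(learning_rate * stump(x_new) for stump in stumps)
```

Each round fits the errors left by the previous rounds, which is why the trees cannot be trained in parallel and why training error keeps falling even after the model starts to overfit.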

***Characteristics***
* Gradient boosting is more prone to overfitting due to lack of independence and focus on mistakes
* Gradient boosting hyperparameters are harder to tune
* Gradient boosting can take longer to train overall due to sequential training of constituent models

***Applications***
* Gradient boosting often performs better on unbalanced datasets
* Random forests are often more robust for multi-class object detection with noisy data (e.g. computer vision)

### 11. Say that DoorDash is launching in Singapore. For this new market, you want to predict the estimated time of arrival (ETA) for a delivery to reach a customer after an order has been placed on the app. From an earlier beta test in Singapore, there were 10,000 deliveries made. Do you have enough training data to create an accurate ETA model?

"Accurate" is subjective. Therefore follow the below steps:

1. ***Clarify what is "good" enough***
* What will the prediction be used for? -> get context -> accuracy for order-driver matching may need to be higher than for customers
* What errors are acceptable? -> for customers, better to overestimate delivery than underestimate
* Benchmark performance for ETA -> establish accuracy provided in other markets
  
2. ***Assess baseline performance***
* Develop a baseline model using the 10,000 deliveries
* This is a regression problem therefore try:
  * Multi-linear regression: preparation time + distance
  * Assess performance using RMSE, MAE, $R^2$

3. ***Determine how additional data improves accuracy***
* Choose an evaluation metric (e.g. $R^2$) and build learning curves to assess how performance changes as the fraction of training data used increases
* If the learning curve begins to plateau then more data might not be required - focus on optimisation (i.e. feature selection, regularization etc.)
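
A learning curve for the baseline can be sketched as follows, assuming NumPy. The deliveries here are synthetic stand-ins (the real beta-test data is not available), and the feature names, coefficients and noise level are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the 10,000 beta deliveries (all values are made up)
n = 10_000
prep_time = rng.uniform(5, 30, size=n)            # minutes
distance = rng.uniform(0.5, 10, size=n)           # km
eta = 4 + prep_time + 3 * distance + rng.normal(scale=5, size=n)

X = np.column_stack([np.ones(n), prep_time, distance])
X_train, y_train = X[:8_000], eta[:8_000]
X_val, y_val = X[8_000:], eta[8_000:]

learning_curve = []
for m in (80, 400, 800, 2_000, 4_000, 8_000):
    # Fit the baseline on the first m training rows only
    beta, *_ = np.linalg.lstsq(X_train[:m], y_train[:m], rcond=None)
    rmse = float(np.sqrt(np.mean((y_val - X_val @ beta) ** 2)))
    learning_curve.append((m, rmse))
```

If the validation RMSE in `learning_curve` has already flattened well before 8,000 rows, collecting more deliveries is unlikely to help this model and effort is better spent on features or model choice.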

If performance isn't good enough and more data is required, follow the steps below:

1. ***Assess features***
* Can we add additional features (e.g. traffic patterns, supply & demand)?
* Are there almost as many or more features than data points? If so, the model will be prone to overfitting - apply PCA or feature selection

2. ***Alternative model***
* Do alternative models cope better with smaller training datasets?

3. ***Assess impact***
* Is the less accurate prediction a true launch blocker?
* If not, launch in the new market and retrain the model using the generated data