# Machine learning interview questions

1. **You are building a binary classifier for an unbalanced dataset (where 1 class is much rarer than the other, say 1% and 99%, respectively). How do you handle this situation?**

Unbalanced classes can be dealt with in several ways.

1. Can we get more data? Is the event inherently rare?
2. Choose appropriate performance metrics:
   * Don't use accuracy: a model that always predicts the majority class would score 99%
   * Look at precision, recall, the F1 score and the ROC curve (a precision-recall curve is often more informative when positives are very rare)

3. Apply resampling to the training data:
   * Oversample the minority class via bootstrapping (sampling with replacement)
   * Undersample the majority class by drawing a random subset

4. Generate synthetic examples for the training data:
   * SMOTE - creates synthetic examples of the minority class by interpolating between an instance and its nearest minority-class neighbours

5. Use ensemble models:
   * Apply boosting to reduce bias - each successive iteration gives higher weight to misclassified examples, which are often from the minority class

6. Design a custom cost function that penalises misclassification of the rare class more than the majority class
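
A minimal sketch of two of the points above, using only the standard library: why accuracy misleads on a 1%/99% split, and bootstrap oversampling of the minority class. The data, the always-majority "model" and the helper names `precision_recall_f1` and `oversample_minority` are made up for illustration.

```python
import random

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive (rare) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def oversample_minority(rows, labels, minority=1, seed=0):
    """Bootstrap (sample with replacement) minority rows until the classes are balanced."""
    rng = random.Random(seed)
    minority_rows = [r for r, l in zip(rows, labels) if l == minority]
    n_majority = len(labels) - len(minority_rows)
    n_extra = max(0, n_majority - len(minority_rows))
    extra = rng.choices(minority_rows, k=n_extra)
    return rows + extra, labels + [minority] * n_extra

# Toy data (made up): 1% positives, and a "model" that always predicts the majority class.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# Oversample the training rows only; evaluation data should keep the original class balance.
rows = [[float(i)] for i in range(len(y_true))]
balanced_rows, balanced_labels = oversample_minority(rows, y_true)
print(balanced_labels.count(1), balanced_labels.count(0))
```

The always-majority model scores 99% accuracy while its recall on the rare class is zero, which is why accuracy is the wrong metric here.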

2. **What are some differences you would expect in a model that minimises squared error versus a model that minimises absolute error? In which case would each error metric be appropriate?**

$$MAE = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$$
$$MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
<center> Where N is the number of training samples, $y_i$ is the true value and $\hat{y}_i$ is the predicted value for sample i </center>


Key differences:
* MSE is more sensitive to outliers as errors are squared before being averaged.
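* Minimising MSE pulls predictions towards the conditional mean, whereas minimising MAE pulls them towards the conditional median
* MSE is usually cheaper to optimise: it is smooth everywhere and, for linear regression, has a closed-form solution
* MAE is not differentiable at zero error and its gradient has constant magnitude, so fitting typically relies on subgradient methods or linear programming (less efficient)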

Conclusion:
* Use MAE if the model needs to be robust to outliers and computational efficiency isn't an issue (e.g. small training set)
* Use MSE if the model doesn't need to be robust to outliers and computation time is an issue
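
To make the outlier sensitivity concrete, here is a small sketch that fits the simplest possible model, a single constant prediction, under each loss. The data and the candidate grid are made up for illustration.

```python
from statistics import mean, median

def mae(y, prediction):
    """Mean absolute error of a constant prediction."""
    return sum(abs(v - prediction) for v in y) / len(y)

def mse(y, prediction):
    """Mean squared error of a constant prediction."""
    return sum((v - prediction) ** 2 for v in y) / len(y)

# Toy targets (made up) with one large outlier.
y = [1.0, 2.0, 3.0, 4.0, 100.0]

# Brute-force search over candidate constants 0.0, 0.1, ..., 100.0.
candidates = [c / 10 for c in range(0, 1001)]
best_for_mse = min(candidates, key=lambda c: mse(y, c))
best_for_mae = min(candidates, key=lambda c: mae(y, c))

print("MSE minimiser:", best_for_mse, "| mean of y:", mean(y))
print("MAE minimiser:", best_for_mae, "| median of y:", median(y))
```

The squared-error optimum lands on the mean (22.0), dragged far from the bulk of the data by the single outlier, while the absolute-error optimum stays at the median (3.0).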

3. **When performing K-means clustering, how do you choose K?**

There is no perfect method for picking k (otherwise it would be a supervised problem).
1. *Business intuition*
* Do you expect a certain number of clusters?
* Visualise features within expected groups, do they behave similarly?
<br>
<br>
2. *Elbow method*
* A few clusters should explain a lot of the variation in the data
* Plot the Within-Cluster Sum of Squared Errors (WSS) for different values of k - at what point are there diminishing returns?
* Calculation for each k:
  * Fit a k-means clustering model
  * Calculate the squared error (squared distance) of each point from the centroid of its cluster
  * Sum the squared errors across all points to give the WSS
  * Plot WSS versus k and choose the k at which the decrease in WSS first starts to level off (the "elbow")
<br>
<br>
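
The elbow calculation can be sketched in plain Python as below. This is an illustration under stated assumptions, not a production k-means: the toy points are made up, and the deterministic farthest-point initialisation in `init_centroids` is a choice made here so the example is reproducible (library implementations usually use random or k-means++ initialisation).

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def init_centroids(points, k):
    """Assumed initialisation: start at the first point, then repeatedly add
    the point farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return centroids

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid updates until stable."""
    centroids = init_centroids(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

def wss(points, centroids, labels):
    """Within-cluster sum of squared errors."""
    return sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))

# Toy data (made up): three well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (0, 10), (0, 11), (1, 10), (1, 11)]

wss_by_k = {}
for k in range(1, 5):
    centroids, labels = kmeans(points, k)
    wss_by_k[k] = wss(points, centroids, labels)
    print(k, round(wss_by_k[k], 2))
```

On this toy data WSS falls sharply up to k = 3 and barely changes afterwards, so the elbow sits at k = 3.
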
3. *Silhouette method*
$$ \text{Silhouette score} = \frac{x - y}{\max(x, y)}$$
<center> Where x = mean distance to points of the nearest other cluster & y = mean distance to the other points in the same cluster. Euclidean distance is usually used. </center>
* A silhouette score measures how similar a point is to its own cluster (cohesion) compared to other clusters (separation)
* The score ranges from -1 to +1. A high value indicates that a point is placed in the correct cluster
* Calculation for each k:
    * Calculate a silhouette score for each point
    * Plot a clustered bar chart showing scores for each point in their respective cluster
    * Prefer the k with the highest mean silhouette score
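
A rough sketch of the per-point silhouette calculation, following the formula above with Euclidean distance. The toy points and both labellings are made up, and scoring singleton clusters as 0 is a common convention assumed here.

```python
import math

def mean_distance(p, others):
    """Mean Euclidean distance from p to a list of points."""
    return sum(math.dist(p, q) for q in others) / len(others)

def silhouette_scores(points, labels):
    """Silhouette score (x - y) / max(x, y) for every point."""
    clusters = sorted(set(labels))
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        same = [q for j, (q, l) in enumerate(zip(points, labels)) if l == own and j != i]
        if not same or len(clusters) < 2:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        y = mean_distance(p, same)  # cohesion: own cluster
        x = min(                    # separation: nearest other cluster
            mean_distance(p, [q for q, l in zip(points, labels) if l == c])
            for c in clusters if c != own
        )
        denom = max(x, y)
        scores.append((x - y) / denom if denom else 0.0)
    return scores

# Toy data (made up): two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
good_labels = [0, 0, 0, 0, 1, 1, 1, 1]  # matches the visible groups
bad_labels = [0, 1, 0, 1, 0, 1, 0, 1]   # deliberately mixes the groups

good_mean = sum(silhouette_scores(points, good_labels)) / len(points)
bad_mean = sum(silhouette_scores(points, bad_labels)) / len(points)
print(round(good_mean, 3), round(bad_mean, 3))
```

The labelling that matches the groups scores close to +1, while the mixed labelling scores below zero. The same comparison across different values of k is what the silhouette method uses to choose k.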

4. **How can you make your models more robust to outliers?**

The first step is to try and understand why outliers occurred. Different steps can then be followed.
1. *Trimming*
* If the points are truly anomalous, and not worth incorporating, they can be removed.
* Risk losing information.
<br>
2. *Winsorization*
* Cap the data at a threshold
* A 90% winsorization:
    * Cap the bottom 5% of values at the 5th percentile
    * Cap the top 5% of values at the 95th percentile
<br>
3. *Change the cost function*
* The mean absolute error cost function is more robust to outliers than the mean squared error cost function (see above)
<br>
4. *Add regularization*
* L1 & L2 regularization reduce variance by penalising large model weights, shrinking them towards zero
<br>
5. *Transform the data*
* A log or square-root transform compresses large values, reducing the influence of extreme points
<br>
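
The 90% winsorization described above can be sketched as follows. The data is made up, and the `percentile` helper uses linear interpolation between ranks, which is one of several common percentile definitions and an assumption here.

```python
import math

def percentile(sorted_values, fraction):
    """Percentile with linear interpolation between the two nearest ranks."""
    rank = fraction * (len(sorted_values) - 1)
    low = math.floor(rank)
    high = min(low + 1, len(sorted_values) - 1)
    weight = rank - low
    return sorted_values[low] + weight * (sorted_values[high] - sorted_values[low])

def winsorize(values, lower=0.05, upper=0.95):
    """Cap values below the lower percentile and above the upper percentile."""
    ordered = sorted(values)
    low_cap = percentile(ordered, lower)
    high_cap = percentile(ordered, upper)
    return [min(max(v, low_cap), high_cap) for v in values]

# Toy data (made up): 1..19 plus one extreme value.
values = list(range(1, 20)) + [1000]
capped = winsorize(values)

print("max before:", max(values), "max after:", round(max(capped), 2))
print("mean before:", sum(values) / len(values), "mean after:", round(sum(capped) / len(capped), 2))
```

Unlike trimming, every observation is kept; only the extreme values are pulled in to the caps, so the sample size is unchanged.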

5. **Say that you are running a multiple linear regression and that you have reason to believe that several of the predictors are correlated. How will the results behave if several are indeed correlated? How would you deal with this problem?**

Two primary problems arise, both relating to uncertainty about feature importance:
1. *P values are misleading*
* Important variables may have high, statistically insignificant P-values (as their importance is split across the correlated variables)
<br>

2. *Coefficient estimates are unstable*
* Coefficients vary depending on which variables are included
* Imprecise coefficient estimates lead to wide confidence intervals (possibly including zero)
<br>

Solutions:
1. *Remove correlated predictors*
* Remove variables that are clearly redundant given another (e.g. keep only one of X and 2X)
* Use a latent (i.e. hidden) variable that captures the correlated variables (e.g. speed replaces distance & time)
<br>

2. *Combine correlated predictors*
* Combine correlated variables using PCA
* Calculate interaction terms, e.g. the product of the two correlated variables
<br>

3. *Regularization*
* Use L2 regularization (e.g. Ridge regression) to stabilise the size of the coefficients
<br>
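
A small sketch of the coefficient instability and of how an L2 penalty dampens it, for the two-predictor case solved directly from the normal equations. The data, the perturbation and the penalty value of 1.0 are all made up for illustration.

```python
def center(values):
    """Subtract the mean so no intercept term is needed."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fit_two_predictors(x1, x2, y, l2_penalty=0.0):
    """Solve (X'X + penalty * I) beta = X'y for two centred predictors.
    l2_penalty=0 gives ordinary least squares; a positive value gives ridge."""
    x1, x2, y = center(x1), center(x2), center(y)
    a11 = dot(x1, x1) + l2_penalty
    a22 = dot(x2, x2) + l2_penalty
    a12 = dot(x1, x2)
    b1, b2 = dot(x1, y), dot(x2, y)
    det = a11 * a22 - a12 * a12
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)

# Toy data (made up): x2 is almost a copy of x1, and y = x1 + x2 exactly.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.1, 3.9, 5.1, 5.9]
y_clean = [a + b for a, b in zip(x1, x2)]

# A small perturbation of the targets (at most 0.2 in absolute value).
noise = [0.2, -0.2, 0.2, -0.2, 0.2, -0.2]
y_noisy = [v + e for v, e in zip(y_clean, noise)]

ols_clean = fit_two_predictors(x1, x2, y_clean)
ols_noisy = fit_two_predictors(x1, x2, y_noisy)
ridge_clean = fit_two_predictors(x1, x2, y_clean, l2_penalty=1.0)
ridge_noisy = fit_two_predictors(x1, x2, y_noisy, l2_penalty=1.0)

print("OLS   clean:", [round(b, 3) for b in ols_clean], "noisy:", [round(b, 3) for b in ols_noisy])
print("Ridge clean:", [round(b, 3) for b in ridge_clean], "noisy:", [round(b, 3) for b in ridge_noisy])
```

With these numbers the least-squares coefficients swing from roughly (1, 1) to roughly (-1, 3) under a tiny change in the targets, while the ridge coefficients move by less than 0.1. In practice the penalty strength would be chosen by cross-validation rather than fixed by hand.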