# 120 Interview Questions Answers
Note, there are a lot of user-driven answers as well. See https://github.com/JifuZhao/120-DS-Interview-Questions/blob/master/predictive-modeling.md for example.

# Predictive Modeling

## Question 1
(Given a Dataset) Analyze this dataset and give me a model that can predict this response variable.

Answer:
Initial Exploratory Analysis
1. Check the shape of the data: how long (rows) and how wide (columns) is it? If it is very wide but not long, we might run into variance issues (overfitting).
2. Check for class imbalance
3. Check for missing data. What does missing mean? Is it missing at random? Is there selection bias in the way the data is missing?
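The three checks above can be sketched in plain Python. The rows, column names (`age`, `bmi`, `readmitted`), and values are made up for illustration; on a real dataset you would more likely reach for a dataframe library.

```python
from collections import Counter

# Hypothetical toy dataset: each row is a dict, None marks a missing value.
rows = [
    {"age": 54, "bmi": 27.1, "readmitted": 0},
    {"age": 61, "bmi": None, "readmitted": 0},
    {"age": 47, "bmi": 31.4, "readmitted": 1},
    {"age": None, "bmi": 24.9, "readmitted": 0},
    {"age": 70, "bmi": None, "readmitted": 0},
]
target = "readmitted"

# 1. Shape: number of rows (length) and number of columns (width).
n_rows, n_cols = len(rows), len(rows[0])

# 2. Class balance of the response variable.
class_counts = Counter(row[target] for row in rows)

# 3. Number of missing values per column.
missing = {col: sum(row[col] is None for row in rows) for col in rows[0]}

print("shape:", (n_rows, n_cols))
print("class counts:", dict(class_counts))
print("missing per column:", missing)
```

Counting missing values only says how much is missing; whether it is missing at random still needs domain knowledge.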

Feature Extraction
1. First, is there subject matter expertise on how to pick features? While feature selection methods let the algorithm pick which features are most important, it's good to come in with a priori knowledge about which features to include. Otherwise we could just be throwing in junk features, which adds variance without reducing bias.
2. Think of ways to hierarchically group features. Having too many features can lead to high variance.
3. One-hot encode categorical features if necessary
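As a minimal sketch of the one-hot encoding step, the helper below turns a list of categorical values into 0/1 indicator rows. The function name `one_hot_encode` and the example values are hypothetical, chosen only for illustration.

```python
def one_hot_encode(values):
    """Return (categories, encoded_rows) for a list of categorical values."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one indicator is set per row
        encoded.append(row)
    return categories, encoded

# Hypothetical categorical feature: where a patient was admitted.
categories, encoded = one_hot_encode(["ER", "ICU", "ER", "Clinic"])
print(categories)
for row in encoded:
    print(row)
```

For a linear model, one indicator column is often dropped to avoid perfect collinearity with the intercept.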

Model Fitting
1. Again we have to think about the bias/variance tradeoff. I always like to start with the two models below.
2. Logistic Regression: A simple model that is highly explainable. Especially in a healthcare setting, having coefficients and being able to explain a model is useful. However, it tends to have higher bias but lower variance, since you're imposing a linear decision boundary.
3. Random Forest: Takes into account non-linearities and interaction effects, at the cost of being harder to explain.


## Question 2
**Need to read more on this**

What could be some issues if the distribution of the test data is significantly different than the distribution of the training data?

Answer:

In this case, model performance will likely suffer: the model is fit to the training distribution, so if the test distribution is different, it will not perform as well. In production this could very well happen if populations shift. In the healthcare setting, imagine training your model on data from one hospital and then applying it in a completely different setting; your model may not perform as well.

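Very commonly, we call this "dataset shift". There are three types of dataset shift:
- Covariate Shift: the distribution of the inputs, P(X), changes while P(Y|X) stays the same
- Prior Probability Shift: the distribution of the response, P(Y), changes while P(X|Y) stays the same
- Concept Shift: the relationship between inputs and response, P(Y|X), changes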

See https://medium.com/capital-one-tech/domain-adaptation-5955edf0277b for more information
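One rough way to look for a shift in a single numeric feature is to compare its empirical distribution in the training and test sets. The sketch below uses the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs). This is my choice of illustration, not something taken from the linked article, and the sample values are made up.

```python
def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of samples a and b."""
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)

    # The gap between two step functions is largest at an observed value.
    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

# Hypothetical feature values (e.g. patient age) from two hospitals.
train_age = [34, 41, 45, 52, 58, 63]
test_same = [36, 40, 47, 50, 59, 61]
test_shifted = [66, 70, 74, 79, 81, 88]

print("similar population:", ks_statistic(train_age, test_same))
print("shifted population:", ks_statistic(train_age, test_shifted))
```

A value near 0 means the two samples look alike on that feature; a value near 1 means they barely overlap. It says nothing about concept shift, which needs labels from the new population.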

## Question 3
What are some ways I can make my model more robust to outliers?

Answer:
Regularization, bagging, and using Mean Absolute Error as opposed to Mean Squared Error as the loss in a regression problem.

Tree-based models also tend to do better with outliers in the features because you're just splitting the feature space into regions. For tree-based models, the scale of a feature doesn't matter, only the relative ordering of its values.

## Question 4
What are some differences you would expect in a model that minimizes squared error, versus a model that minimizes absolute error? In which cases would each error metric be appropriate?

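Answer: Absolute error is more robust to outliers but harder to optimize, since it is not differentiable at zero and has no closed-form solution. A model that minimizes squared error estimates the conditional mean, while one that minimizes absolute error estimates the conditional median. If we do actually care about the consequences of large errors, then we should use MSE, since it penalizes large errors much more heavily.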
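A small sketch of the difference, using the simplest possible "model": a single constant prediction. The constant that minimizes squared error is the mean, and the constant that minimizes absolute error is the median, so one outlier moves the first far more than the second. The data values are made up.

```python
import statistics

values = [10, 11, 12, 13, 1000]  # hypothetical data with one large outlier

def mse(c):
    """Mean squared error of predicting the constant c for every value."""
    return sum((v - c) ** 2 for v in values) / len(values)

def mae(c):
    """Mean absolute error of predicting the constant c for every value."""
    return sum(abs(v - c) for v in values) / len(values)

mean = statistics.mean(values)      # best constant under squared error
median = statistics.median(values)  # best constant under absolute error

print("mean:", mean, "median:", median)
print("MSE at mean vs median:", mse(mean), mse(median))
print("MAE at mean vs median:", mae(mean), mae(median))
```

The mean gets dragged toward the outlier, while the median stays with the bulk of the data.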

## Question 5
**Need to read more about this especially for multiclass problems**

What error metric would you use to evaluate how good a binary classifier is? What if the classes are imbalanced? What if there are more than 2 groups?

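Answer: Accuracy, AUPR, AUROC

- AUPR: area under the precision-recall curve (precision vs. recall)
- AUROC: area under the ROC curve (recall, i.e. true positive rate, vs. false positive rate)

If the classes are imbalanced, accuracy can be misleading, since always predicting the majority class already scores well. AUPR is usually more informative in that case because it focuses on the positive class.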

With more than 2 groups, the idea is similar: calculate precision/recall for each class, treating it as one-vs-rest. To compute an overall F1 score, average the per-class scores, either unweighted (macro) or weighted by class frequency (weighted).
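Here is a minimal sketch of per-class precision/recall/F1 and the macro-averaged F1 in plain Python. The function names and the labels are hypothetical; a metrics library would normally handle this for you.

```python
def per_class_scores(y_true, y_pred):
    """Precision, recall and F1 per class, treating each class as one-vs-rest."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[cls] = (precision, recall, f1)
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted average of the per-class F1 scores."""
    scores = per_class_scores(y_true, y_pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)

# Hypothetical 3-class example.
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "b", "a"]

for cls, (p, r, f1) in per_class_scores(y_true, y_pred).items():
    print(cls, round(p, 3), round(r, 3), round(f1, 3))
print("macro F1:", round(macro_f1(y_true, y_pred), 3))
```

Macro averaging treats every class equally, so a rare class that is never predicted correctly (class `c` here) pulls the score down noticeably.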

See https://towardsdatascience.com/multi-class-metrics-made-simple-part-ii-the-f1-score-ebe8b2c2ca1 for an example

## Question 6
**Need to look into this more. I have a high level response, but it's good to go over each of the standard models and know how they work**

What are various ways to predict a binary response variable? Can you compare two of them and tell me when one would be more appropriate? What’s the difference between these? (SVM, Logistic Regression, Naive Bayes, Decision Tree, etc.)

Answer:

At a high level, there are parametric and non-parametric models, and ultimately what you choose is based on the bias/variance tradeoff.

- Logistic Regression: a linear classifier fit by maximum likelihood, where the log odds of the positive class are modeled as a linear function of the features.
- Random Forest: a tree-based classifier. Like a single decision tree, each tree picks splits that reduce impurity (e.g. minimize entropy, i.e. maximize information gain), but to reduce the risk of overfitting, each tree is trained on a bootstrap sample and only a random subset of features is considered at each split; predictions are then aggregated across trees. (**Need to look into this more**)
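To make the logistic regression part concrete, here is a minimal sketch that fits a one-feature logistic model by plain gradient ascent on the log-likelihood. The learning rate, step count, and data are arbitrary choices for illustration; real libraries use faster solvers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, steps=2000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by gradient ascent on the log-likelihood."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the average log-likelihood: residual (y - p) times the input.
        grad_w = sum((y - sigmoid(w * x + b)) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(y - sigmoid(w * x + b) for x, y in zip(xs, ys)) / n
        w += lr * grad_w
        b += lr * grad_b
    return w, b

# Hypothetical data: larger x makes y=1 more likely, but not perfectly separable.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)

p = sigmoid(w * 1.0 + b)
print("w, b:", w, b)
print("p(y=1 | x=1):", p)
print("log odds at x=1:", math.log(p / (1 - p)))  # equals w*1 + b
```

The last line shows the "fit against the log odds" idea: the log odds of the predicted probability are exactly the linear function `w*x + b`.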

## Question 7
What is regularization and where might it be helpful? What is an example of using regularization in a model?

Answer:
At a high level, regularization adds a penalty term for model complexity. The reason you might do this goes back to the fundamental bias-variance tradeoff: a very complex model might have low bias but high variance, so it will overfit the training data. Regularization penalizes that complexity, trading a little bias for lower variance.

Typically, there is a concept of L1 vs L2 regularization. In a regression context, L1 is what Lasso regression uses whereas L2 is what Ridge regression uses. At a high level, both add a penalty term on the coefficients. The main difference in outcome is that Lasso can set some coefficients exactly to 0, whereas Ridge shrinks coefficients toward 0 but generally not exactly to 0. The mechanisms are also different:

- L1 penalizes the sum of the absolute values of the coefficients
- L2 penalizes the sum of the squares of the coefficients
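The difference in outcome is easiest to see with a single coefficient (or, equivalently, an orthonormal design), where both penalized problems have closed-form solutions. The sketch below assumes that simplified setting; `z` stands for the unpenalized estimate and `lam` for the penalty strength, and the numbers are made up.

```python
def lasso_1d(z, lam):
    """argmin over b of 0.5*(b - z)**2 + lam*abs(b): soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small estimates are set exactly to zero

def ridge_1d(z, lam):
    """argmin over b of 0.5*(b - z)**2 + 0.5*lam*b**2: proportional shrinkage."""
    return z / (1.0 + lam)

unpenalized = [3.0, 0.4, -0.2, -2.0]  # hypothetical unpenalized estimates
lam = 0.5

lasso = [lasso_1d(z, lam) for z in unpenalized]
ridge = [ridge_1d(z, lam) for z in unpenalized]

print("lasso:", lasso)
print("ridge:", ridge)
```

L1 subtracts a fixed amount and clips at zero, which is why it can drop features entirely; L2 scales every coefficient by the same factor, so nothing reaches zero unless it started there.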

https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization-How-does-it-solve-the-problem-of-overfitting-Which-regularizer-to-use-and-when is pretty good at explaining this as well.

## Question 8
Why might it be preferable to include fewer predictors over many?

Answer:
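Bias-variance tradeoff: each additional predictor adds variance to the fit, so with many predictors (especially relative to the number of observations) the model is more likely to overfit. Fewer predictors also make the model easier to interpret.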

# Programming

## Question 9
What are the different types of joins? What are the differences between them?

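1. Inner: only rows with a match in both tables
2. Left: all rows from the left table, with NULLs where the right table has no match
3. Right: all rows from the right table, with NULLs where the left table has no match
4. Full Outer: all rows from both tables, with NULLs on whichever side has no match
5. Cross Join: every combination of rows from the two tables (Cartesian product)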
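The join types can be tried out with Python's built-in `sqlite3` module. The tables and rows below are made up for illustration. Right and full outer joins are left out because older SQLite versions do not support them; a right join is just a left join with the tables swapped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table employees (emp_id integer, emp_name text, dept_id integer);
    create table depts (dept_id integer, dept_name text);
    insert into employees values (1, 'Ana', 10), (2, 'Bo', 20), (3, 'Cy', null);
    insert into depts values (10, 'Data'), (30, 'Legal');
""")

def run(sql):
    return conn.execute(sql).fetchall()

# Inner join: only employees whose dept_id exists in depts.
inner = run("""
    select e.emp_name, d.dept_name
    from employees e join depts d on e.dept_id = d.dept_id
    order by e.emp_id
""")

# Left join: every employee, with NULL (None) where no department matches.
left = run("""
    select e.emp_name, d.dept_name
    from employees e left join depts d on e.dept_id = d.dept_id
    order by e.emp_id
""")

# Cross join: every employee paired with every department.
cross = run("select e.emp_name, d.dept_name from employees e cross join depts d")

print("inner:", inner)
print("left:", left)
print("cross row count:", len(cross))
```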

## Question 10
Why might a join on a subquery be slow? How might you speed it up?

Answer:
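Common causes are a lack of indices (the subquery result typically has no index to join on) and the fact that a user-written subquery can force the query planner into a particular plan. The tradeoff is traditionally readability vs performance. One way to speed it up is to materialize the subquery as a table, which can then be indexed.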

Stackoverflow answer: https://stackoverflow.com/questions/31724903/why-might-a-join-on-a-subquery-be-slow-what-could-be-done-to-make-it-faster-s
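A sketch of the "materialize it" idea using `sqlite3`: the same join is written once against an inline subquery and once against a table built from that subquery with an index on the join column. The `orders`, `customers`, and `order_totals` tables are hypothetical, and on data this small there is no measurable speed difference; the point is only the shape of the rewrite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table customers (customer_id integer primary key, name text);
    create table orders (order_id integer primary key, customer_id integer, amount real);
    insert into customers values (1, 'Ana'), (2, 'Bo');
    insert into orders values (1, 1, 20.0), (2, 1, 5.0), (3, 2, 7.5);
""")

# Version 1: join against an inline subquery (its result has no index).
subquery_sql = """
    select c.name, t.total
    from customers c
      join (select customer_id, sum(amount) as total
            from orders group by customer_id) t
        on c.customer_id = t.customer_id
    order by c.customer_id
"""

# Version 2: materialize the subquery once and index the join column.
conn.executescript("""
    create table order_totals as
      select customer_id, sum(amount) as total
      from orders group by customer_id;
    create index idx_order_totals_customer on order_totals (customer_id);
""")
materialized_sql = """
    select c.name, t.total
    from customers c
      join order_totals t on c.customer_id = t.customer_id
    order by c.customer_id
"""

subquery_rows = conn.execute(subquery_sql).fetchall()
materialized_rows = conn.execute(materialized_sql).fetchall()
print(subquery_rows)
print(materialized_rows)
```

The cost of materializing is staleness: `order_totals` has to be refreshed whenever `orders` changes.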

## Question 11
Describe the difference between primary keys and foreign keys in a SQL database.

A primary key uniquely identifies each row in a SQL table and is often backed by an index. A foreign key is a column (or set of columns) that references a key, typically the primary key, of another table.
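Both constraints can be seen in action with `sqlite3`. The tables are made up for illustration. Note that SQLite only enforces foreign keys after `pragma foreign_keys = on`; other databases typically enforce them by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")  # SQLite does not enforce FKs by default
conn.executescript("""
    create table depts (
      dept_id integer primary key,
      dept_name text
    );
    create table employees (
      emp_id integer primary key,
      emp_name text,
      dept_id integer references depts (dept_id)
    );
    insert into depts values (10, 'Data');
    insert into employees values (1, 'Ana', 10);
""")

def try_insert(sql):
    """Run an insert and report whether a constraint rejected it."""
    try:
        conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"

# Primary key: a second row with dept_id 10 is not allowed.
dup_pk = try_insert("insert into depts values (10, 'Duplicate')")
# Foreign key: dept_id 99 does not exist in depts.
bad_fk = try_insert("insert into employees values (2, 'Bo', 99)")
# A valid reference is accepted.
good = try_insert("insert into employees values (3, 'Cy', 10)")

print(dup_pk)
print(bad_fk)
print(good)
```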

## Question 12
Given a `COURSES` table with columns `course_id` and `course_name`, a `FACULTY` table with columns `faculty_id` and `faculty_name`, and a `COURSE_FACULTY` table with columns `faculty_id` and `course_id`, how would you return a list of faculty who teach a course given the name of a course?

```
select
  courses.course_name,
  faculty.faculty_id,
  faculty.faculty_name
from courses
  join course_faculty
    on courses.course_id = course_faculty.course_id
  join faculty
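    on course_faculty.faculty_id = faculty.faculty_id
-- :course_name is a bind parameter holding the course name to look up
where courses.course_name = :course_name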
```
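To check a query like this end to end, here is a sketch that runs it with `sqlite3` against made-up rows and filters on the course name with a bound parameter. The sample data and the helper name `faculty_for_course` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table courses (course_id integer primary key, course_name text);
    create table faculty (faculty_id integer primary key, faculty_name text);
    create table course_faculty (faculty_id integer, course_id integer);
    insert into courses values (1, 'Statistics'), (2, 'Databases');
    insert into faculty values (1, 'Dr. Lee'), (2, 'Dr. Kim'), (3, 'Dr. Roy');
    insert into course_faculty values (1, 1), (2, 1), (3, 2);
""")

def faculty_for_course(course_name):
    """Return (faculty_id, faculty_name) rows for everyone teaching the course."""
    sql = """
        select faculty.faculty_id, faculty.faculty_name
        from courses
          join course_faculty
            on courses.course_id = course_faculty.course_id
          join faculty
            on course_faculty.faculty_id = faculty.faculty_id
        where courses.course_name = ?
        order by faculty.faculty_id
    """
    # Passing the name as a parameter avoids building SQL from user input.
    return conn.execute(sql, (course_name,)).fetchall()

print(faculty_for_course("Statistics"))
print(faculty_for_course("Databases"))
print(faculty_for_course("Physics"))
```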

## Question 13
Given an `IMPRESSIONS` table with `ad_id`, `click` (an indicator that the ad was clicked), and `date`, write a SQL query that will tell me the click-through rate of each ad by month.

```
select
  ad_id,
  date_trunc('month', date) as month,
  sum(click) as total_clicks,
  count(*) as total_impressions,
  -- multiply by 1.0 to avoid integer division
  sum(click) * 1.0 / count(*) as ctr
from impressions
group by ad_id, date_trunc('month', date)
```
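A runnable sketch of a per-ad, per-month click-through rate using `sqlite3`. SQLite has no `date_trunc`, so `strftime('%Y-%m', date)` is swapped in to bucket by month; the impressions are made up. The query groups by both `ad_id` and month and forces floating-point division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table impressions (ad_id integer, click integer, date text);
    insert into impressions values
      (1, 1, '2024-01-03'), (1, 0, '2024-01-15'), (1, 0, '2024-01-20'),
      (1, 0, '2024-02-01'),
      (2, 1, '2024-01-09'), (2, 1, '2024-02-11');
""")

rows = conn.execute("""
    select
      ad_id,
      strftime('%Y-%m', date) as month,
      sum(click) as total_clicks,
      count(*) as total_impressions,
      sum(click) * 1.0 / count(*) as ctr
    from impressions
    group by ad_id, strftime('%Y-%m', date)
    order by ad_id, strftime('%Y-%m', date)
""").fetchall()

for row in rows:
    print(row)
```

Without the `* 1.0`, both operands are integers, so every rate below 1 would be truncated to 0.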

## Question 14
Write a query that returns the name of each department and a count of the number of employees in each:
`EMPLOYEES` containing: `Emp_ID` (Primary key) and `Emp_Name`

`EMPLOYEE_DEPT` containing: `Emp_ID` (Foreign key) and `Dept_ID` (Foreign key)

`DEPTS` containing: `Dept_ID` (Primary key) and `Dept_Name`

```
select
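  depts.dept_name,
  -- count matched employees only, so empty departments show 0
  count(employee_dept.emp_id) as number_employees
from depts
  left join employee_dept
    on depts.dept_id = employee_dept.dept_id
group by depts.dept_id, depts.dept_name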
```
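A runnable sketch of a per-department head count with `sqlite3`, on made-up rows. It groups by department and starts from `depts` with a left join, so a department with no employees still appears with a count of 0; that last part is my reading of "each department", not something the question spells out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table employees (emp_id integer primary key, emp_name text);
    create table depts (dept_id integer primary key, dept_name text);
    create table employee_dept (emp_id integer, dept_id integer);
    insert into employees values (1, 'Ana'), (2, 'Bo'), (3, 'Cy');
    insert into depts values (10, 'Data'), (20, 'Legal'), (30, 'Sales');
    insert into employee_dept values (1, 10), (2, 10), (3, 20);
""")

rows = conn.execute("""
    select
      depts.dept_name,
      count(employee_dept.emp_id) as number_employees
    from depts
      left join employee_dept
        on depts.dept_id = employee_dept.dept_id
    group by depts.dept_id, depts.dept_name
    order by depts.dept_name
""").fetchall()

for dept_name, number_employees in rows:
    print(dept_name, number_employees)
```

`count(employee_dept.emp_id)` skips the NULLs produced by the left join, whereas `count(*)` would report 1 for an empty department.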