# Predictive Modeling - 19 Questions

1. (Given a Dataset) Analyze this dataset and give me a model that can predict this response variable.
    * Problem Determination -> Data Cleaning -> Data Visualization -> Feature engineering -> Modeling
    * Start by fitting a simple model (logistic regression, linear regression), do some feature engineering accordingly, and then try more complicated models. Always split the dataset into train and test sets, and use cross-validation to check performance.
    * Determine if the problem is classification or regression
    * Favor simple models that run quickly and that can be easily explained.
    * Mention cross validation as a means to evaluate the model
    * Plot and visualize the data
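
As a rough illustration of the cross-validation step, the sketch below scores a mean-only baseline with k-fold cross-validation. The helper names (`k_fold_indices`, `cross_val_mse`) and the toy response values are assumptions made for this example; in practice a library routine would normally be used instead.

```python
import random
from statistics import mean

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_mse(y, k=5, seed=0):
    """Cross-validated MSE of a baseline that always predicts the training mean."""
    scores = []
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [y[i] for i in range(len(y)) if i not in held]
        pred = mean(train)  # "fit" on the training folds only
        scores.append(mean((y[i] - pred) ** 2 for i in held_out))
    return mean(scores)

y = [3.0, 5.0, 4.0, 6.0, 5.5, 4.5, 7.0, 3.5, 5.0, 6.5]  # toy response values
baseline_mse = cross_val_mse(y, k=5)
```

Any more complicated model should beat `baseline_mse` under the same folds before it is worth the extra complexity.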

2. What could be some issues if the distribution of the test data is significantly different than the distribution of the training data?
    * The model will have high training accuracy but low test accuracy. Without knowing more about the data, it is hard to tell which dataset better represents the population, so the generalizability of the algorithm is hard to measure. Repeated splitting of train vs. test data (cross-validation) can mitigate this when the difference comes from an unlucky split.
        * If the train and test data have different distributions, the classifier will likely overfit to the train data.
        * This issue can be overcome by using a more general learning method.
        * This can occur when:
            * P(y|x) is the same but P(x) is different (covariate shift).
            * P(y|x) is different (concept shift).
        * The causes can be:
            * Training samples are obtained in a biased way (sample selection bias).
            * Train differs from test because of temporal or spatial changes.
        * Solution to covariate shift:
            * Importance-weighted cross-validation.
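
A minimal sketch of the importance-weighting idea, assuming the train and test input densities are known 1-D Gaussians (in practice the density ratio has to be estimated). The function names, the densities, and the per-example losses are all hypothetical values chosen for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def importance_weights(xs, train=(0.0, 1.0), test=(1.0, 1.0)):
    """w(x) = p_test(x) / p_train(x), assuming known Gaussian densities."""
    return [normal_pdf(x, *test) / normal_pdf(x, *train) for x in xs]

def weighted_error(losses, weights):
    """Self-normalized importance-weighted average of per-example losses."""
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)

xs = [-1.0, 0.0, 1.0, 2.0]       # validation inputs drawn from the training distribution
losses = [0.1, 0.2, 0.4, 0.9]    # hypothetical per-example losses of some fitted model
w = importance_weights(xs)
shifted_estimate = weighted_error(losses, w)   # error estimate under the test distribution
plain_estimate = sum(losses) / len(losses)     # unweighted validation error
```

Because the test distribution puts more mass on larger `x`, where this hypothetical model does worse, the weighted estimate comes out higher than the plain average.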

3. What are some ways I can make my model more robust to outliers?
    * We can use L1 or L2 regularization to reduce variance (at the cost of some added bias).
    * Changes to the algorithm:
        * Use tree-based methods instead of regression methods as they are more resistant to outliers. For statistical tests, use non-parametric tests instead of parametric ones.
        * Use robust error metrics such as MAE or Huber loss.
    * Changes to the data:
        * Winsorizing the data
        * Transforming the data (take the log)
        * Remove them only if you're certain they're anomalies and not worth predicting.
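
Winsorizing can be sketched as clipping values to chosen percentiles. The `winsorize` helper below is an assumption for illustration and uses a simple index-based percentile rather than any particular library's definition.

```python
def winsorize(values, lower_pct=0.05, upper_pct=0.95):
    """Clip values to empirical percentiles (simple index-based percentile)."""
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[int(lower_pct * (n - 1))]
    hi = ordered[int(upper_pct * (n - 1))]
    return [min(max(v, lo), hi) for v in values]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]   # toy data with one extreme value
clipped = winsorize(data, 0.1, 0.9)        # the 1000 is pulled down to 9
```

Unlike dropping the outlier, this keeps the observation in the dataset while limiting its influence.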

4. What are some differences you would expect in a model that minimizes squared error, vs a model that minimizes absolute error? In which cases would each error metric be appropriate?
    * MSE penalizes large errors much more heavily, so it is sensitive to outliers. MAE is more robust in that sense, but it is harder to fit because it is not differentiable at zero and has no closed-form solution. So use MAE when outliers should not dominate the fit and the model is still computationally easy to fit; otherwise use MSE.
    * MSE:
        * Easier to compute the gradient
        * Corresponds to maximizing the likelihood of Gaussian random variables
    * MAE:
        * Not differentiable at zero, so fitting typically relies on linear programming or subgradient methods
        * More robust to outliers. If the consequences of large errors are great, use MSE instead.
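
One way to see the difference: for a constant prediction, squared error is minimized by the mean and absolute error by the median. The toy delay values below are made up for illustration.

```python
from statistics import mean, median

def mse(values, c):
    """Mean squared error of predicting the constant c."""
    return mean((v - c) ** 2 for v in values)

def mae(values, c):
    """Mean absolute error of predicting the constant c."""
    return mean(abs(v - c) for v in values)

delays = [10, 12, 11, 13, 12, 720]   # minutes; one extreme outlier

best_for_mse = mean(delays)     # the constant minimizing squared error is the mean
best_for_mae = median(delays)   # the constant minimizing absolute error is the median
```

The single outlier drags the MSE-optimal prediction far above the typical delay, while the MAE-optimal prediction stays at 12 minutes.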
        

5. What error metric would you use to evaluate how good a binary classifier is? What if the classes are imbalanced? What if there are more than 2 groups?
    * Accuracy:
        * Proportion of instances you predicted correctly
        * Pros: intuitive, easy to explain
        * Cons: works poorly when the class labels are imbalanced and the signal from the data is weak
    * ROC curve and AUC:
        * Plot the false-positive rate (FPR) on the x-axis and the true-positive rate (TPR) on the y-axis for different thresholds. Given a random positive instance and a random negative instance, the AUC is the probability that the classifier ranks the positive one higher.
        * Pros: Works well when testing the ability to distinguish the two classes
        * Cons: Can't interpret predictions as probabilities (b/c AUC is determined by rankings), so it can't explain the uncertainty of the model, and it doesn't directly extend to the multi-class case.
    * Logloss/deviance/cross-entropy:
        * Pros: Error metric based on probabilities
        * Cons: very sensitive to confidently wrong predictions (false positives and false negatives)
    * When there are more than 2 groups, we can treat the problem as k binary (one-vs-rest) classifications and add up their log losses. Some metrics, like AUC, are directly applicable only in the binary case.
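
The three metrics above can be written out directly on a toy example. This is a plain-Python sketch for intuition, with made-up labels and scores; the AUC is computed literally as the pairwise ranking probability described above.

```python
import math

def accuracy(y_true, p_hat, threshold=0.5):
    """Share of instances whose thresholded prediction matches the label."""
    return sum((p >= threshold) == bool(y) for y, p in zip(y_true, p_hat)) / len(y_true)

def auc(y_true, p_hat):
    """Probability a random positive is scored above a random negative (ties count 0.5)."""
    pos = [p for y, p in zip(y_true, p_hat) if y == 1]
    neg = [p for y, p in zip(y_true, p_hat) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def log_loss(y_true, p_hat, eps=1e-15):
    """Average negative log-likelihood of the labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_hat):
        p = min(max(p, eps), 1 - eps)   # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [0, 0, 1, 1]
p_hat = [0.1, 0.4, 0.35, 0.8]   # hypothetical predicted probabilities
```

On this example both `accuracy(y_true, p_hat)` and `auc(y_true, p_hat)` are 0.75, but only the log loss changes if the scores are rescaled without changing their order.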
    

6. What are various ways to predict a binary response variable? Can you compare two of them and tell me when one would be more appropriate? What’s the difference between these? (SVM, Logistic Regression, Naive Bayes, Decision Tree, etc.)
    * Things to look at: N, P, linearly separable, features independent, likely to overfit
    * Logistic Regression:
        * Features roughly linear, problem roughly linearly separable
        * Robust to noise; L1/L2 regularization can be used for model selection and to avoid overfitting
        * The output comes as probabilities
        * Efficient and the computation can be distributed
        * Can be used as a baseline for other algorithms
        * Negative: cannot handle categorical features directly; they must be encoded first (e.g. one-hot)
    * SVM:
        * W/ a non-linear kernel, can deal with problems that are not linearly separable
        * Negative: slow to train and not efficient enough for most industry-scale applications.
    * Naive Bayes:
        * Computationally efficient when P is large; the independence assumption alleviates the curse of dimensionality.
        * Works surprisingly well in some cases even when the independence assumption doesn't hold
        * W/ word frequencies as features, the independence assumption can be seen as reasonable, so the algorithm can be used in text categorization
        * Negative: assumes every feature is conditionally independent of every other feature given the class.
    * Tree Ensembles:
        * Good for large N and large P, can deal with categorical features very well
        * Non-parametric, so less sensitive to outliers in the features
        * GBTs work better but the parameters are harder to tune
        * RF works out of the box, but usually performs worse than GBT
    * Deep Learning:
        * Works well for some classification tasks (e.g. image)
        * Used to squeeze something out of the problem

7. What is regularization and where might it be helpful? What is an example of using regularization in a model?
    * Regularization:
        * Regularization is a tool to deal with overfitting; it is useful for reducing variance in the model.
    * L1 Regularization:
        * Lasso regularization penalizes large coefficients and automatically selects features by shrinking some coefficients to exactly zero
        * Adds the *absolute value of the magnitude* of each coefficient to the cost function
    * L2 Regularization:
        * Ridge regularization penalizes large feature coefficients, shrinking them toward zero without eliminating them
        * Adds the *squared magnitude* of each coefficient to the cost function
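
To make the L1 vs. L2 contrast concrete, the sketch below uses the single-coefficient case (equivalently, an orthonormal design), where both penalized problems have closed-form solutions. The function names, the penalty scaling, and the coefficient values are assumptions for this example.

```python
def ridge_shrink(b, lam):
    """Minimizer of 0.5*(b - beta)**2 + 0.5*lam*beta**2: proportional shrinkage."""
    return b / (1.0 + lam)

def lasso_shrink(b, lam):
    """Minimizer of 0.5*(b - beta)**2 + lam*abs(beta): soft-thresholding."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0

ols_coefs = [3.0, -0.4, 0.2, -2.0]   # hypothetical unpenalized coefficients
ridge = [ridge_shrink(b, 1.0) for b in ols_coefs]   # every coefficient shrinks, none hit zero
lasso = [lasso_shrink(b, 0.5) for b in ols_coefs]   # small coefficients become exactly zero
```

Here ridge halves every coefficient, while lasso zeroes out the two small ones, which is the feature-selection behavior described above.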

8. Why might it be preferable to include fewer predictors over many?
    * When we add irrelevant features, the model becomes more prone to overfitting because those features introduce more noise. When two variables are correlated, their coefficients are harder to interpret in a regression.
    * Curse of dimensionality, adding random noise makes the model more complicated but useless
    * Computational cost

9. Given training data on tweets and their retweets, how would you predict the number of retweets of a given tweet after 7 days after only observing 2 days worth of data?
    * Build a time series model w/ the training data with a seven-day cycle and then use that for a new dataset w/ only 2 days of data.
    * Build a regression function to estimate the number of retweets as a function of time t
    * To determine if one regression function can be built, see if there are clusters in terms of the trends in the number of retweets.

10. How could you collect and analyze data to use social media to predict the weather?
    * Collect the historical feed of interest (could be based on location or social network)
    * Define and label the weather in the past x days for each feed
    * For each feed, extract words or phrases into a document-term matrix (some features, i.e. words, may correlate w/ weather), then use feature reduction to determine the best words to predict weather

11. How would you construct a feed to show relevant content for a site that involves user interactions with items?
    * We could do this using a recommendation engine.
    * The easiest thing we can do is show content that is popular with other users
    * To be more accurate:
        * We can build a content-based filtering or collaborative filtering system.
        * If there's enough user usage data, we can try collaborative filtering and recommend content that other similar users have consumed
        * If there isn't, we can recommend similar items based on vectorization of items (content-based filtering)
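
A minimal sketch of user-based collaborative filtering, assuming a tiny binary interaction matrix. The user names, the data, and the `recommend` helper are hypothetical; a real system would use sparse matrices and far more data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Rows are users, columns are items; 1 means the user interacted with the item.
interactions = {
    "alice": [1, 1, 0, 0],
    "bob":   [1, 1, 1, 0],
    "carol": [0, 0, 1, 1],
}

def recommend(user, interactions, top_n=1):
    """Score unseen items by similarity-weighted votes from other users."""
    target = interactions[user]
    scores = [0.0] * len(target)
    for other, row in interactions.items():
        if other == user:
            continue
        sim = cosine(target, row)
        for item, seen in enumerate(row):
            if seen and not target[item]:
                scores[item] += sim
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if scores[i] > 0][:top_n]
```

For example, `recommend("alice", interactions)` returns item 2, because the most similar user (bob) has consumed it and alice has not.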

12. How would you design the people you may know feature on LinkedIn or Facebook?
    * Find strongly related but unconnected people in a weighted connection graph.
        * Define similarity as how strongly two people are connected.
        * Given a certain feature, we can calculate the similarity based on:
            * Friend connections (neighbors)
            * Check-ins: people being at the same location all the time.
            * Same college, workplace, etc.
            * Randomly drop edges from the graph and test how well the algorithm recovers them
    * Reference News Feed Optimization:
        * Affinity Score: how close the content creator and the users are
        * Weight: Weight for the edge type (comment, like, tag, etc.). Emphasis on features the company wants to promote.
        * Time Decay: the older the less important
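
The neighbor-based similarity can be sketched as counting mutual friends between unconnected pairs. The toy graph and the `suggest_pairs` helper are assumptions for illustration; the other signals listed above (check-ins, shared college, edge weights, time decay) are left out.

```python
from itertools import combinations

# Hypothetical undirected friendship graph as adjacency sets.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"bob", "cat"},
    "eve": set(),
}

def suggest_pairs(graph):
    """Rank unconnected pairs by their number of mutual friends."""
    scored = []
    for a, b in combinations(sorted(graph), 2):
        if b in graph[a]:
            continue  # already connected
        mutual = len(graph[a] & graph[b])
        if mutual:
            scored.append((mutual, a, b))
    return sorted(scored, reverse=True)
```

Here `suggest_pairs(friends)` proposes only ann and dan, who share two mutual friends; eve has no connections, so this signal alone cannot place her.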

13. How would you predict who someone may want to send a Snapchat or Gmail to?
    * For each user, assign every potential recipient a score of how likely the user is to send them an email
    * The rest is feature engineering:
        * Number of past emails, number of responses, the last time they exchanged an email, whether the last email has specific characters in it, features about the other users, etc.
    * Emailing behavior, who they email the most frequently, time decay, etc.


14. How would you suggest to a franchise where to open a new store?
    * Build a master dataset w/ local demographic information available for each location
        * Local income levels, proximity to traffic, weather, population density, proximity to other businesses.
        * A reference dataset on local, regional, and national macroeconomic conditions (e.g. unemployment, inflation, prime interest rate, etc.)
        * Any data on the local franchise owner-operators, down to the level of the manager
    * Identify a set of KPIs acceptable to the management that had requested the analysis concerning the most desirable factors surrounding a franchise
        * Quarterly operating profit, ROI, EVA, pay-down rate, etc.
    * Run econometric models to understand the relative significance of each variable
    * Run machine learning algorithms to predict the performance of each location candidate

15. In a search engine, given partial data on what the user has typed, how would you predict the user's eventual search query?
    * Based on how frequently words appear after a given sequence of characters, we can construct conditional probabilities for the next sequences of words (use n-grams).
    * The sequences w/ the highest conditional probabilities can be shown as top candidates.
    * To further improve the algorithm:
        * We can put more weight on past sequences which showed up more recently or more closely in proximity to the user's location
        * Look at your recent searches
    * Personalize and localize the search
        * Use the user's historical search data
        * Use the historical data from that user's area
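
A very small sketch of the frequency-based idea: estimate P(query | prefix) from a log of past queries and return the most probable completions. The query log and the `suggest` helper are hypothetical, and this works on whole queries rather than word-level n-grams.

```python
from collections import Counter

# Hypothetical log of past search queries.
query_log = ["weather today", "weather tomorrow", "weather today",
             "web design", "weather radar", "web design"]

def suggest(prefix, log, top_n=3):
    """Return (query, P(query | prefix)) for the most frequent matching past queries."""
    counts = Counter(q for q in log if q.startswith(prefix))
    total = sum(counts.values())
    return [(q, c / total) for q, c in counts.most_common(top_n)]
```

For instance, `suggest("wea", query_log, 1)` returns `[("weather today", 0.5)]`. Recency or location weighting could be added by replacing the raw counts with weighted counts.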

16. Given a database of all previous alumni donations to your university, how would you predict which recent alumni are most likely to donate?
    * Use a binary classification algorithm (e.g. RF, Logreg)
    * Base it on frequency and amount of donations, graduation year, major, organization, etc.

17. You’re Uber and you want to design a heatmap to recommend to drivers where to wait for a passenger. How would you approach this?
    * Based on the pickup locations of past passengers around the same time of day, day of week, year, etc. 
    * Based on the number of past pickups
        * account for periodicity
        * special events

18. How would you build a model to predict a March Madness bracket?
    * Use a linear model to predict the score differential between two teams, given their historical data
    * Also use lots of generated features (budget, player physicality, player statistics, etc.)

19. You want to run a regression to predict the probability of a flight delay, but there are flights with delays of up to 12 hours that are really messing up your model. How can you address this?
    * Make the model more robust to outliers:
        * Use regularization or change the algorithm
        * Tree-based algorithms, or MAE or Huber loss
        * Transform the data (log)
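
Two of these options can be sketched in a few lines: the Huber loss, which is quadratic for small residuals and linear for large ones, and a log transform of the delays. The delay values and the choice of `delta` are made up for illustration.

```python
import math

def huber(residual, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond it."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)

delays_min = [5, 12, 0, 30, 720]                    # minutes; one 12-hour delay
log_delays = [math.log1p(d) for d in delays_min]    # log(1 + d) also handles zero delays

small = huber(0.5)    # 0.125, same as half the squared error
large = huber(10.0)   # 9.5, grows linearly instead of reaching 50
```

After the transform the 720-minute delay maps to about 6.6, so it no longer dominates the other observations.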