# Data Science Interview Cheat Sheets

<b>[TTS LinkedIn Candidate Guide](docs/TTS-Candidate-LinkedIn-Guide.pdf)</b> | <b>[Elements Of Programming Interviews in Python](docs/elements-of-programming-interviews-in-python.pdf)</b>

### [Data Science Interview Questions](docs/ds_interview_questions.pdf)

[![Data Science Interview Questions](images/ds-qa.png)](docs/ds_interview_questions.pdf)

- - -
# Data Science

1. <b> What does data science mean?</b>

Data science is a "concept to unify statistics, data analysis, informatics, and their related methods" in order to "understand and analyze actual phenomena" with data. It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge. 

2. <b> What are the assumptions of a linear regression?</b>

[Linear regression](https://www.statology.org/linear-regression/) is a useful statistical method we can use to understand the relationship between two variables, x and y, by finding the “line of best fit”. Before trusting the results, we must first make sure that [four assumptions are met](https://www.statology.org/linear-regression-assumptions/):<br><br>
- <b>Linear relationship</b>: There exists a linear relationship between the independent variable, x, and the dependent variable, y.  The points in the plot below look like they fall on roughly a straight line, which indicates that there is a linear relationship between x and y.<br><br>
![title](images/LinReg1.jpeg)<br><br>

- <b>Independence</b>: The residuals are independent. In particular, there is no correlation between consecutive residuals, which is mostly relevant when working with time series data. The simplest way to test whether this assumption is met is to look at a residual time series plot, which is a plot of residuals vs. time, or at a plot of the residual autocorrelations. Ideally, most of the residual autocorrelations should fall within the 95% confidence bands around zero.<br><br>

- <b>Homoscedasticity</b>: The residuals have constant variance at every level of x.  When this is not the case, the residuals are said to suffer from [heteroscedasticity](https://en.wikipedia.org/wiki/Heteroscedasticity).<br><br>

When heteroscedasticity is present in a regression analysis, the results of the analysis become hard to trust. Specifically, heteroscedasticity increases the variance of the regression coefficient estimates, but the regression model doesn’t pick up on this. This makes it much more likely for a regression model to declare that a term in the model is statistically significant, when in fact it is not.<br><br>

The simplest way to detect heteroscedasticity is by creating a fitted value vs. residual plot.<br><br>
Notice how the residuals become much more spread out as the fitted values get larger. This “cone” shape is a classic sign of heteroscedasticity:<br><br>
![title](images/het1.jpeg)<br><br>

- <b>Normality</b>: The residuals of the model are normally distributed. A common way to check if this assumption is met is a Q-Q plot:

A Q-Q plot, short for quantile-quantile plot, is a type of plot that we can use to determine whether or not the residuals of a model follow a normal distribution. If the points on the plot roughly form a straight diagonal line, then the normality assumption is met.<br><br>
![title](images/qq.jpeg)<br><br>
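
The checks above can be roughed out numerically. The following is a minimal sketch, not a full diagnostic suite: it fits a one-predictor line by ordinary least squares, then computes the Durbin-Watson statistic (values near 2 suggest little correlation between consecutive residuals) and a simple ratio of residual variances between the upper and lower half of x as a crude homoscedasticity check. The data and the helper names (`fit_line`, `durbin_watson`, `spread_ratio`) are made up for illustration.

```python
from statistics import fmean, pvariance

def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (slope, intercept)."""
    x_mean, y_mean = fmean(x), fmean(y)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def durbin_watson(res):
    """Ranges from 0 to 4; values near 2 suggest no lag-1 autocorrelation."""
    num = sum((res[i] - res[i - 1]) ** 2 for i in range(1, len(res)))
    return num / sum(r * r for r in res)

def spread_ratio(res):
    """Residual variance in the upper half of x divided by the lower half.

    Assumes the observations are sorted by x; a ratio far from 1 hints at
    heteroscedasticity.
    """
    half = len(res) // 2
    return pvariance(res[half:]) / pvariance(res[:half])

# Hypothetical data that is roughly linear (y is about 2x)
x = list(range(1, 11))
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

slope, intercept = fit_line(x, y)
res = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
dw = durbin_watson(res)
ratio = spread_ratio(res)
```

In practice these numbers complement the plots rather than replace them; a residual plot or Q-Q plot often reveals patterns that a single statistic hides.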

3.	<b>What is the difference between factor analysis and cluster analysis</b>?

Factor analysis is an exploratory statistical technique to investigate dimensions and the factor structure
underlying a set of variables (items) while cluster analysis is an exploratory statistical technique to group
observations (people, things, events) into clusters or groups so that the degree of association is strong
between members of the same cluster and weak between members of different clusters.<br><br>

4. <b>What is an iterator generator?</b>

An iterator is an object that produces the items of a collection one at a time, so that you can loop over them. In other words, you can run a "for" loop over the object. <br><br>
A Python generator gives us an easier way to create Python iterators. This is done by defining a function that uses the "yield" keyword instead of the "return" statement to hand back values. <br><br>
The difference from a list is that a generator does not actually compute its values until they are needed. This leads to memory efficiency and can save computation as well. It also means that while the size of a list is limited by available memory, a generator can represent an arbitrarily long (even infinite) sequence.<br><br>
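
As a small illustration of the points above, the sketch below defines a generator function with `yield`, pulls values from it lazily with `next`, and compares the memory footprint of a generator expression with that of the equivalent list. The function name `squares` and the sizes are arbitrary choices for the example.

```python
import sys

def squares(n):
    """Generator function: yields one square at a time instead of building a list."""
    for i in range(n):
        yield i * i

gen = squares(5)
first = next(gen)   # only the first value has been computed so far
rest = list(gen)    # consuming the generator computes the remaining values

# A generator expression stores its recipe, not its values
lazy = (i * i for i in range(100_000))
eager = [i * i for i in range(100_000)]
lazy_bytes = sys.getsizeof(lazy)
eager_bytes = sys.getsizeof(eager)
```

Note that a generator can only be consumed once; after `list(gen)` above, `gen` is exhausted.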

5. <b>Write down an SQL script to return data from two tables</b>.

An INNER JOIN returns only rows where a match is found in both input tables.<br><br>

```
SELECT o.orderid, o.qty, i.itemprice, i.itemdesc
FROM orders o
INNER JOIN items i
on o.itemid = i.itemid
```
OUTER JOINS return all rows from one table and matching rows from the second table. In cases where the join cannot find matching records from the second table, the results from the second table are displayed as NULL.  Unlike inner joins, the order in which tables are listed and joined in the FROM clause does matter, as it will determine whether you choose LEFT or RIGHT for your join.<br><br>
```
SELECT o.orderid, o.qty, i.itemprice, i.itemdesc
FROM orders o
LEFT JOIN items i
on o.itemid = i.itemid
```
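
To see the difference between the two joins on actual rows, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The table contents are invented for illustration; order 102 deliberately points at an item that does not exist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items  (itemid INTEGER PRIMARY KEY, itemprice REAL, itemdesc TEXT);
    CREATE TABLE orders (orderid INTEGER PRIMARY KEY, qty INTEGER, itemid INTEGER);
    INSERT INTO items  VALUES (1, 9.99, 'mug'), (2, 4.50, 'pen');
    INSERT INTO orders VALUES (100, 2, 1), (101, 1, 2), (102, 5, 3);
""")

# INNER JOIN: order 102 is dropped because item 3 has no match
inner_rows = conn.execute("""
    SELECT o.orderid, o.qty, i.itemprice, i.itemdesc
    FROM orders o
    INNER JOIN items i ON o.itemid = i.itemid
    ORDER BY o.orderid
""").fetchall()

# LEFT JOIN: order 102 is kept, with NULL (None) for the item columns
left_rows = conn.execute("""
    SELECT o.orderid, o.qty, i.itemprice, i.itemdesc
    FROM orders o
    LEFT JOIN items i ON o.itemid = i.itemid
    ORDER BY o.orderid
""").fetchall()

conn.close()
```

Here `inner_rows` holds two rows and `left_rows` holds three, the last being `(102, 5, None, None)`.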

6. <b>Draw graphs relevant to PPC (pay-per-click) adverts and ticket purchases</b>.

<b>PPC (pay-per-click)</b> advertising rates are determined using the flat-rate model or the bid-based model.
- Flat Rate Model: In the flat rate pay-per-click model, an advertiser pays a publisher a fixed fee for each click. Publishers generally keep a list of different PPC rates that apply to different areas of their website. Note that publishers are generally open to negotiations regarding the price. A publisher is very likely to lower the fixed price if an advertiser offers a long-term or a high-value contract.<br><br>
- Bid-Based Model: In the bid-based model, each advertiser makes a bid with a maximum amount of money they are willing to pay for an advertising spot. Then, a publisher undertakes an auction using automated tools. An auction is run whenever a visitor triggers the ad spot.

Digital Marketing Dashboard:<br><br>

![title](images/digital-marketing-dashboard.png)<br><br>

Ticket Sales Analysis (immersive):<br><br>


```python
from IPython.display import HTML

# Embed the YouTube video in a Jupyter notebook cell
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/8Ku12tO_X9k?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')
```



7. <b>How can you prove an improvement you introduced to a model is actually working</b>?

1. <b>Add More Data</b><br><br>
More data usually helps. It lets the data “speak for itself” instead of relying on assumptions and weak correlations, and it tends to produce better, more accurate models. This option is not always available, though: in data science competitions, for example, we do not get to increase the size of the training data.<br><br>

2. <b>Treat Missing and Outlier Values</b><br><br>
The unwanted presence of missing and outlier values in the training data often reduces the accuracy of a model or leads to a biased model and inaccurate predictions, because the behavior of a variable and its relationship with other variables cannot be analysed correctly. So, it is important to treat missing and outlier values well.<br><br>
- Missing: For continuous variables, you can impute the missing values with the mean, median, or mode. For categorical variables, you can treat missing values as a separate class. You can also build a model to predict the missing values; KNN imputation is a good option here. To know more about these methods, refer to the article “Methods to deal and treat missing values”.
- Outlier: You can delete the observations, perform a transformation, bin the values, impute them (as for missing values), or treat outlier values separately.<br><br>

3. <b>Feature Engineering</b><br><br>
This step helps to extract more information from existing data. New information is extracted in terms of new features. These features may have a higher ability to explain the variance in the training data and can therefore improve model accuracy.

Feature engineering is highly influenced by hypothesis generation. Good hypotheses result in good features, so it is worth investing quality time in hypothesis generation.<br><br>

Feature engineering process can be divided into two steps:<br><br>

- - -

<b>Feature Transformation</b>: There are various scenarios where feature transformation is required:
- Changing the scale of a variable from its original scale to a scale between zero and one. This is known as data normalization. For example: if a data set has the 1st variable in meters, the 2nd in centimeters and the 3rd in kilometers, we must bring these variables onto the same scale before applying any algorithm.
- Some algorithms work well with normally distributed data. Therefore, we must remove the skewness of the variable(s). Methods such as taking the log, square root or inverse of the values can remove skewness.
- Sometimes, creating bins of numeric data works well, since it also handles outlier values. Numeric data can be made discrete by grouping values into bins. This is known as data discretization.
 
- - -
<b>Feature Creation</b>: Deriving new variable(s) from existing variables is known as feature creation. It helps to uncover hidden relationships in a data set. Let’s say we want to predict the number of transactions in a store based on transaction dates. The transaction dates may not correlate directly with the number of transactions, but the day of the week may have a higher correlation. In this case, the information about the day of the week is hidden, and we need to extract it to make the model better.
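
The transformations and the day-of-week example above can be sketched in a few lines of standard-library Python. This is only an illustration under simple assumptions: the helper names, the bin edges and the sample values are all hypothetical, and a real project would more likely use pandas or scikit-learn for the same steps.

```python
import bisect
import math
from datetime import date

def min_max_scale(values):
    """Rescale values to the range [0, 1] (data normalization)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def log_transform(values):
    """Compress a right-skewed, non-negative variable with log(1 + x)."""
    return [math.log1p(v) for v in values]

def bin_index(value, edges):
    """Discretization: index of the bin that `value` falls into."""
    return bisect.bisect_right(edges, value)

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def day_of_week(d):
    """Feature creation: expose the weekday hidden inside a date."""
    return WEEKDAYS[d.weekday()]

scaled = min_max_scale([10, 20, 30])           # values now between 0 and 1
age_bin = bin_index(40, [18, 35, 60])          # hypothetical age bin edges
weekday = day_of_week(date(2024, 1, 1))        # new feature from a transaction date
```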

4. <b>Feature Selection</b>

Feature Selection is a process of finding out the best subset of attributes which better explains the relationship of independent variables with target variable.  You can select the useful features based on various metrics like:

- Domain Knowledge: Based on domain experience, we select feature(s) which may have higher impact on target variable.
- Visualization: As the name suggests, it helps to visualize the relationship between variables, which makes your variable selection process easier.
- Statistical Parameters: We also consider the p-values, information values and other statistical metrics to select right features.
- PCA: It helps to represent training data into lower dimensional spaces, but still characterize the inherent relationships in the data. It is a type of dimensionality reduction technique. There are various methods to reduce the dimensions (features) of training data like factor analysis, low variance, higher correlation, backward/ forward feature selection and others.


5. <b>Multiple Algorithms</b>

Choosing the right machine learning algorithm is the ideal approach to achieving higher accuracy, but it is easier said than done. This intuition comes with experience and practice. Some algorithms are better suited to particular types of data sets than others. Hence, we should apply all relevant models and compare their performance.<br><br>

6. <b>Algorithm Tuning</b>

We know that machine learning algorithms are driven by parameters, which strongly influence the outcome of the learning process. The objective of parameter tuning is to find the optimum value for each parameter to improve the accuracy of the model. To tune these parameters, you must have a good understanding of their meaning and their individual impact on the model. You can repeat this process with a number of well-performing models.<br><br>

For example: In a random forest, we have various parameters like ```max_features, number_trees, random_state, oob_score``` and others (exact names vary by library; scikit-learn, for instance, calls the number of trees `n_estimators`). Careful optimization of these parameter values can result in better and more accurate models.

7. <b>Ensemble Methods</b>

This is the most common approach in winning solutions of data science competitions. The technique combines the results of multiple weak models to produce better results. Ensemble methods often improve the accuracy of a model, for two reasons: (a) they are generally more complex than traditional methods, and (b) the traditional methods give you a good baseline from which you can improve and draw to create your ensembles.

- Bagging (Bootstrap Aggregating) involves fitting many decision trees on different samples of the same dataset and averaging the predictions.
- Stacking involves fitting many different models types on the same data and using another model to learn how to best combine the predictions.
- Boosting involves adding ensemble members sequentially that correct the predictions made by prior models and outputs a weighted average of the predictions.
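
As a rough illustration of bagging, the sketch below trains 25 very weak classifiers, each on its own bootstrap sample, and combines them by majority vote. It is a toy under stated assumptions: the data is one-dimensional and invented, and the "stump" (a threshold halfway between the two class means) stands in for the decision trees a real bagging implementation would use.

```python
import random
from collections import Counter
from statistics import fmean

def fit_stump(sample):
    """Weak learner: threshold halfway between the two class means (1-D data)."""
    mean0 = fmean(x for x, label in sample if label == 0)
    mean1 = fmean(x for x, label in sample if label == 1)
    threshold = (mean0 + mean1) / 2
    if mean1 > mean0:
        return lambda x: int(x > threshold)
    return lambda x: int(x < threshold)

def bootstrap(data, rng):
    """Draw len(data) rows with replacement, retrying until both classes appear."""
    while True:
        sample = [rng.choice(data) for _ in data]
        if {label for _, label in sample} == {0, 1}:
            return sample

def bagged_predict(models, x):
    """Classification by majority vote across the ensemble members."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
data = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 0),
        (6.0, 1), (6.5, 1), (7.0, 1), (7.5, 1)]
models = [fit_stump(bootstrap(data, rng)) for _ in range(25)]

low_prediction = bagged_predict(models, 1.2)
high_prediction = bagged_predict(models, 7.2)
```

For regression, the vote would be replaced by an average of the individual predictions.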

8. <b>Cross Validation</b>: To check whether an improvement is real, we must use the cross validation technique. Cross validation is one of the most important concepts in data modeling. The idea is to set aside a sample on which you do not train the model, and to test the model on this sample before finalizing it. This method helps us to achieve more generalized relationships.

- - -

CAUTION: So far, we have seen methods that can improve the accuracy of a model. But a model with higher accuracy does not necessarily perform better on unseen data points. Sometimes, the improvement in a model’s accuracy is due to over-fitting.

8. <b>Name several types of computer vision models</b>. 

Different types of computer vision include 
- image segmentation
- object detection 
- facial recognition 
- edge detection 
- pattern detection
- image classification
- feature matching

9. <b>How would you explain Random Forest to a non-technical person?</b>

A 'decision tree' is an algorithm that tries to split up the data based on a series of (usually binary) questions.<br><br> 
<b>Example</b>: If we want to learn about the set of all dogs, we could ask "Big or small?", "Long hair or short hair?", "Purebred or mutt?", etc. Basically, this is like a game of 20 questions where the algorithm tries to narrow down the scope of possibilities to something more specific. For each question, the decision tree tries to find the best cutoff between "yes" and "no" (e.g., "big" or "small" for our dog example) based on either an entropy or a 'gini' score. The results of this sequence of questions can be visualized as a tree, where each node represents a question and each branch represents an answer.<br><br>
Now a given tree is determined by 
- which questions to ask and in what order  
- the sample of data that the tree is trained on

<b>Preventing over-fitting</b>: A random tree is a tree where either the questions or the data are selected randomly.<br><br>   

A random forest is a set of random trees, where the final result is taken to be an average (regression) or a 'vote' (classification) from among the individual trees.<br><br>

<b>Algorithms:</b> Different decision tree algorithms use different impurity metrics:
- CART uses Gini impurity. CART is more sensitive to outliers in the target variable (Y) than in the predictors (X). Variable importance in CART is calculated by summing the split improvement score for each variable across all splits in a tree. Random Forest creates multiple CART trees based on "bootstrapped" samples of data and then combines the predictions.
- ID3 and C4.5 use entropy.
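
Both impurity metrics are short formulas over the class proportions in a node. The sketch below is a minimal illustration (function names are chosen for this example): Gini impurity is one minus the sum of squared class proportions, entropy is the negative sum of p * log2(p), and a candidate split is scored by the size-weighted impurity of its two children.

```python
import math
from collections import Counter

def gini_impurity(labels):
    """1 - sum(p_k^2): 0 for a pure node, 0.5 for a 50/50 binary node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """-sum(p_k * log2(p_k)): 0 for a pure node, 1 bit for a 50/50 binary node."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())

def weighted_impurity(left, right, measure):
    """Score of a split: impurity of each child weighted by its share of rows."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)

mixed = ["big", "big", "small", "small"]
mixed_gini = gini_impurity(mixed)
mixed_entropy = entropy(mixed)
# A split that separates the classes perfectly has zero weighted impurity
perfect_split = weighted_impurity(["big", "big"], ["small", "small"], gini_impurity)
```

A tree-building algorithm evaluates many candidate splits this way and keeps the one with the lowest weighted impurity.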

10. <b>What is and how to determine a Gini coefficient?</b>

The Gini coefficient is a statistical measure used to calculate inequality within a nation. It does so by calculating the wealth distribution between members of the population. The Gini coefficient can be calculated using the formula: 

```
Gini Coefficient = A / (A + B)
```
where A is the area between the line of perfect equality and the Lorenz Curve, and B is the area below the Lorenz Curve.<br>
![title](images/gini.png)
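
The formula can be evaluated directly from a list of incomes. The sketch below is one simple way to do it, assuming non-negative incomes with a positive total: it builds the Lorenz curve from the sorted incomes, measures the area B under the curve with the trapezoid rule, and uses the fact that A + B = 0.5 (the area under the line of perfect equality). The function name and sample incomes are hypothetical.

```python
def gini_coefficient(incomes):
    """Gini = A / (A + B), with B the area under the Lorenz curve."""
    values = sorted(incomes)
    n = len(values)
    total = sum(values)
    area_b = 0.0       # area under the Lorenz curve
    prev_share = 0.0   # cumulative income share at the previous person
    running = 0.0
    for v in values:
        running += v
        share = running / total
        # Trapezoid between two consecutive Lorenz points, width 1/n
        area_b += (prev_share + share) / 2 / n
        prev_share = share
    area_a = 0.5 - area_b
    return area_a / (area_a + area_b)

equal_society = gini_coefficient([1, 1, 1, 1])    # everyone earns the same
unequal_society = gini_coefficient([0, 0, 0, 1])  # one person earns everything
```

With perfectly equal incomes the coefficient is 0; when one of four people holds all the income it is 0.75, approaching 1 as the population grows.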

10. Explain K-means.
11. What kind of RDBMS software do you have experience with? What about non-relational databases?
14. What is the difference between SQL, MySQL and SQLServer?

16. Give examples where a false negative is more important than a false positive, and vice versa.
17. What is a logistic regression?

14. <b>How would you start cleaning a big dataset</b>?

- - -
Step 1: <b>Remove duplicate or irrelevant observations</b><br><br>
Remove unwanted observations from your dataset, including duplicate observations or irrelevant observations. Duplicate observations will happen most often during data collection. When you combine data sets from multiple places, scrape data, or receive data from clients or multiple departments, there are opportunities to create duplicate data. De-duplication is one of the largest areas to be considered in this process. Irrelevant observations are when you notice observations that do not fit into the specific problem you are trying to analyze. For example, if you want to analyze data regarding millennial customers, but your dataset includes older generations, you might remove those irrelevant observations. This can make analysis more efficient and minimize distraction from your primary target—as well as creating a more manageable and more performant dataset.

- - -
Step 2: <b>Fix structural errors</b><br><br>
Structural errors are when you measure or transfer data and notice strange naming conventions, typos, or incorrect capitalization. These inconsistencies can cause mislabeled categories or classes. For example, you may find “N/A” and “Not Applicable” both appear, but they should be analyzed as the same category.
- - -
Step 3: <b>Filter unwanted outliers</b><br><br>
Often, there will be one-off observations where, at a glance, they do not appear to fit within the data you are analyzing. If you have a legitimate reason to remove an outlier, like improper data-entry, doing so will help the performance of the data you are working with. However, sometimes it is the appearance of an outlier that will prove a theory you are working on. Remember: just because an outlier exists, doesn’t mean it is incorrect. This step is needed to determine the validity of that number. If an outlier proves to be irrelevant for analysis or is a mistake, consider removing it.
- - -
Step 4: <b>Handle missing data</b><br><br>
You can’t ignore missing data because many algorithms will not accept missing values. There are a few ways to deal with missing data. None is optimal, but each can be considered.

- As a first option, you can drop observations that have missing values, but doing this will drop or lose information, so be mindful of this before you remove them.
- As a second option, you can impute missing values based on other observations; again, there is a risk of losing the integrity of the data because you may be operating from assumptions and not actual observations.
- As a third option, you might alter the way the data is used to effectively navigate null values.

- - -
Step 5: <b>Validate and QA</b><br><br>
At the end of the data cleaning process, you should be able to answer these questions as a part of basic validation:

- Does the data make sense?
- Does the data follow the appropriate rules for its field?
- Does it prove or disprove your working theory, or bring any insight to light?
- Can you find trends in the data to help you form your next theory?
- If not, is that because of a data quality issue?
- - -
- False conclusions because of incorrect or “dirty” data can inform poor business strategy and decision-making.
- False conclusions can lead to an embarrassing moment in a reporting meeting when you realize your data doesn’t stand up to scrutiny.
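
The first, second and fourth steps above can be sketched on a tiny in-memory data set. This is a minimal illustration with invented records and hypothetical helper names; on a genuinely big dataset the same steps would typically be done with pandas, SQL or Spark.

```python
from statistics import median

raw = [
    {"id": 1, "status": "N/A", "age": 34},
    {"id": 1, "status": "N/A", "age": 34},               # exact duplicate
    {"id": 2, "status": "Not Applicable", "age": None},  # same category, other spelling
    {"id": 3, "status": "active", "age": 29},
    {"id": 4, "status": "Active ", "age": 41},           # stray capital and space
]

def deduplicate(rows):
    """Step 1: keep the first occurrence of each identical record."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Hypothetical mapping of spelling variants to one canonical label
CANONICAL = {"n/a": "not applicable"}

def fix_labels(rows):
    """Step 2: normalize case, whitespace and known spelling variants."""
    fixed = []
    for row in rows:
        label = row["status"].strip().lower()
        fixed.append({**row, "status": CANONICAL.get(label, label)})
    return fixed

def impute_median(rows, field):
    """Step 4 (second option): fill missing values with the median of known ones."""
    fill = median(row[field] for row in rows if row[field] is not None)
    return [{**row, field: fill if row[field] is None else row[field]}
            for row in rows]

clean = impute_median(fix_labels(deduplicate(raw)), "age")
```

After these steps `clean` has four records, two distinct status labels, and no missing ages.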

15. <b>What is over fitting and how to fix it?</b>

Overfitting occurs when you achieve a good fit of your model on the training data, while it does not generalize well on new, unseen data.  Translated: the model learned patterns specific to the training data, which are irrelevant in other data.<br><br>
One can identify overfitting by looking at validation metrics, like loss or accuracy.<br><br>
<b>The best way to fix it is to get more training data.</b><br><br>
Or other ways to handle overfitting:
- Reduce the network’s capacity by removing layers or reducing the number of elements in the hidden layers
- Apply regularization, which comes down to adding a cost to the loss function for large weights
- Use Dropout layers, which will randomly remove certain features by setting them to zero<br><br>The goal is to reduce overfitting by lowering the capacity of the model to memorize the training data.<br><br>
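
Two of the remedies above are easy to express in plain Python. The sketch below is a simplified illustration, not how a deep learning framework implements them: `l2_penalized_loss` adds a cost proportional to the squared weights to a mean squared error, and `dropout` zeroes activations at random. Scaling the surviving activations by 1 / (1 - rate) is the common "inverted dropout" convention and is an assumption of this example, as are the function names and numbers.

```python
import random

def l2_penalized_loss(errors, weights, lam):
    """Mean squared error plus an L2 cost that grows with large weights."""
    mse = sum(e * e for e in errors) / len(errors)
    return mse + lam * sum(w * w for w in weights)

def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` (used at training time only).

    Survivors are scaled by 1 / (1 - rate) so the expected value is unchanged.
    """
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(42)
errors, weights = [0.5, -0.5], [3.0, -4.0]
plain = l2_penalized_loss(errors, weights, lam=0.0)       # no regularization
penalized = l2_penalized_loss(errors, weights, lam=0.01)  # large weights now cost extra
dropped = dropout([1.0, 2.0, 3.0, 4.0], rate=0.5, rng=rng)
```

Because the penalty raises the loss for large weights, the optimizer is pushed toward smaller weights and a simpler fit.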

<b>Example</b>: Determining airline passengers' sentiment on Twitter [Data](https://www.kaggle.com/crowdflower/twitter-airline-sentiment) | [Code](https://towardsdatascience.com/handling-overfitting-in-deep-learning-models-c760ee047c6e)

16. <b>[Machine Learning](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/): Supervised learning versus unsupervised learning.</b>

- - -
SUPERVISED LEARNING is a machine learning approach that’s defined by its use of labeled datasets. These datasets are designed to train or “supervise” algorithms into classifying data or predicting outcomes accurately. Using labeled inputs and outputs, the model can measure its accuracy and learn over time.<br><br>
Supervised learning can be separated into two types of problems when data mining: classification and regression:
- CLASSIFICATION problems use an algorithm to accurately assign test data into specific categories, such as separating apples from oranges. Or, in the real world, supervised learning algorithms can be used to classify spam in a separate folder from your inbox. Linear classifiers, support vector machines, decision trees and random forest are all common types of classification algorithms.
- REGRESSION is another type of supervised learning method that uses an algorithm to understand the relationship between dependent and independent variables. Regression models are helpful for predicting numerical values based on different data points, such as sales revenue projections for a given business. Some popular regression algorithms are linear regression and polynomial regression; logistic regression, despite its name, is typically used for classification.

- - -
UNSUPERVISED LEARNING uses machine learning algorithms to analyze and cluster unlabeled data sets. These algorithms discover hidden patterns in data without the need for human intervention (hence, they are “unsupervised”).<br><br>

Unsupervised learning models are used for three main tasks: clustering, association and dimensionality reduction:<br><br>
- CLUSTERING is a data mining technique for grouping unlabeled data based on their similarities or differences. For example, K-means clustering algorithms assign similar data points into groups, where the K value represents the number of groups and therefore the granularity of the grouping. This technique is helpful for market segmentation, image compression, etc.
- ASSOCIATION is another type of unsupervised learning method that uses different rules to find relationships between variables in a given dataset. These methods are frequently used for market basket analysis and recommendation engines, along the lines of “Customers Who Bought This Item Also Bought” recommendations.
- DIMENSIONALITY REDUCTION is a learning technique used when the number of features  (or dimensions) in a given dataset is too high. It reduces the number of data inputs to a manageable size while also preserving the data integrity. Often, this technique is used in the preprocessing data stage, such as when autoencoders remove noise from visual data to improve picture quality.

<b>Main Difference:</b> The main distinction between the two approaches is the use of LABELED DATASETS. To put it simply, supervised learning uses labeled input and output data, while an unsupervised learning algorithm does not.
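
To make the K-means example concrete, here is a minimal sketch of the usual iterative procedure (Lloyd's algorithm) on one-dimensional data: assign each point to its nearest centroid, then move each centroid to the mean of its points, and repeat. The data, the starting centroids and the fixed iteration count are assumptions for this illustration; note that no labels are used anywhere.

```python
from statistics import fmean

def k_means_1d(points, centroids, iterations=10):
    """Lloyd's algorithm on 1-D data; `centroids` holds the K starting guesses."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        centroids = [fmean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]
centroids, clusters = k_means_1d(points, centroids=[0.0, 10.0])
```

With K = 2 the points separate into a group around 1 and a group around 8.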

17. <b>State some biases that you are likely to encounter when cleaning a database.</b><br><br>

- - -
    
<b>SAMPLE BIAS</b><br>

Happens when the collected data doesn’t accurately represent the environment the program is expected to run in. No algorithm can be trained on the entire universe of data; it is always trained on a subset, which must be chosen carefully. There is a science to choosing a subset that is both large enough and representative enough to mitigate sample bias.<br><br>
<b>Example: Security cameras</b><br>

If your goal is to create a model that can operate security cameras at daytime and nighttime, but you train it on nighttime data only, you’ve introduced sample bias into your model.<br>

SAMPLE BIAS can be reduced or eliminated by:
- Training your model on both daytime and nighttime data.
- Covering all the cases you expect your model to be exposed to. This can be done by examining the domain of each feature and making sure the data covers all of it in a balanced, evenly distributed way. Otherwise, the model will produce erroneous results and outputs that don’t make sense.
    


- - -
<b>EXCLUSION BIAS</b><br><br>
Happens as a result of excluding some feature(s) from our dataset usually under the umbrella of cleaning our data.
We delete some feature(s) thinking that they’re irrelevant to our labels/outputs based on pre-existing beliefs.<br><br>
<b>Example: Titanic Survival prediction</b><br>   
In the famous Titanic problem, we predict who survived and who didn’t. One might disregard the passenger id of the travelers, thinking it completely irrelevant to whether they survived or not. But suppose passengers had been assigned rooms according to their passenger id, with smaller id numbers closer to the lifeboats. Those passengers could then reach the lifeboats faster than those deep in the center of the ship, so the survival ratio would fall as the id increases, and discarding the id would throw away a useful signal.<br><br>
EXCLUSION BIAS can be reduced or eliminated by:
- Investigate before discarding feature(s) by doing sufficient analysis on them.
- Ask a colleague to look into the feature(s) you’re considering discarding; a fresh pair of eyes will definitely help.
- If you’re low on time/resources and need to cut your dataset size by discarding feature(s), research the relation between each feature and your label before deleting any. You will most probably find similar solutions; investigate whether they’ve taken similar features into account, and decide then.
- Remember humans are subject to bias. There are tools that can help.

- - -
<b>OBSERVER BIAS (aka experimenter bias)</b><br><br>

The tendency to see what we expect to see, or what we want to see. When a researcher studies a certain group, they usually come to an experiment with prior knowledge and subjective feelings about the group being studied. In other words, they come to the table with conscious or unconscious prejudices.<br>

<b>Example: Is Intelligence influenced by status? — The Burt Affair</b><br><br>
One famous example of observer bias is the work of Cyril Burt, a psychologist best known for his work on the heritability of IQ. He thought that children from families with low socioeconomic status (i.e. working-class children) were also more likely to have lower intelligence, compared to children from higher socioeconomic statuses. His allegedly scientific approach to intelligence testing was revolutionary and allegedly proved that children from the working classes were, in general, less intelligent. This led to the creation of a two-tier educational system in England in the 1960s which sent middle and upper-class children to elite schools and working-class children to less desirable schools.<br><br>
Burt’s research was later discredited, and it was concluded that he had falsified data. His findings therefore cannot be taken as evidence that intelligence is determined by social class.<br><br>
OBSERVER BIAS can be reduced or eliminated by:
- Ensuring that observers (people conducting experiments) are well trained.
- Screening observers for potential biases.
- Having clear rules and procedures in place for the experiment.
- Making sure behaviors are clearly defined.



- - -

<b>PREJUDICE BIAS</b><br><br>
Happens as a result of cultural influences or stereotypes: things we don’t like in our reality, such as judging by appearances, social class, status, gender and much more, are not corrected for in our machine learning model. The model then applies the same stereotyping that exists in real life because of the prejudiced data it is fed.<br><br>
<b>Example: A computer vision program that detects people at work</b><br><br>
Suppose your goal is to detect people at work, and your model has been fed thousands of training examples in which men are coding and women are cooking. The algorithm is likely to learn that coders are men and chefs are women, which is wrong since women can code and men can cook.<br><br>
The problem here is that the data is consciously or unconsciously reflecting stereotypes.<br>
Prejudice bias can be reduced or eliminated by:
- Ignoring the statistical relationship between gender and occupation.
- Exposing the algorithm to a more even-handed distribution of examples.

- - -
<b>MEASUREMENT BIAS</b><br><br>
Systematic value distortion happens when there’s an issue with the device used to observe or measure. This kind of bias tends to skew the data in a particular direction.<br><br>
<b>Example: Shooting images data with a camera that increases the brightness.</b><br><br>
The faulty measurement tool fails to replicate the environment in which the model will operate. In other words, it distorts the training data so that it no longer represents the real data the model will work on once it’s launched.<br><br>
This kind of bias can’t be avoided simply by collecting more data.<br><br>
MEASUREMENT BIAS can be reduced or eliminated by:
- Having multiple measuring devices.
- Hiring humans who are trained to compare the output of these devices.


- - -
### Bias Testing Tools
Bias Testing in your product development cycle
1. [FairML](https://github.com/adebayoj/fairml)
A toolbox for diagnosing bias in predictive modeling. It audits predictive models and determines the significance of their inputs.

2. [LIME](https://github.com/marcotcr/lime)
[Local Interpretable Model-Agnostic Explanations (LIME)](https://www.oreilly.com/learning/introduction-to-local-interpretable-model-agnostic-explanations-lime). Many machine learning models are black boxes, and understanding the rationale behind a model’s predictions helps users decide when to trust them and when not to. LIME's purpose is to explain what machine learning classifiers (or models) are doing.

<b>Example: Predicting the flu</b><br><br>
A model predicts that a certain patient has the flu. The prediction is then explained by an “explainer” that highlights the symptoms that are most important to the model. With this information about the rationale behind the model, the doctor is now empowered to trust the model—or not.
![title](images/figure1.jpeg)

<b>How it works</b>: LIME generates a data set of perturbed instances by turning some of the interpretable components “off”. For an image, this means making parts of the picture gray. For each perturbed instance, we get the probability, according to the model, that the object of interest (here, a tree frog) is in the image.
![title](images/figure2.jpeg)

18. <b>What is marketing automation</b>?

Marketing automation is the process by which software is used to automate conventional marketing processes. Marketing automation helps companies segment customers, launch multichannel marketing campaigns, and provide personalized information for customers based on their specific activities. In this way, a user's activity (or lack thereof) triggers a personal message that is customized to the user on their preferred platform.

19. <b>What is root cause analysis</b>?

In science and engineering, root cause analysis is a method of problem solving used for identifying the root causes of faults or problems. 

20.  <b>What are some of my favorite R packages?</b>

#### [tidyverse](https://www.tidyverse.org/): tibbles, ggplot2 
- [lubridate](https://lubridate.tidyverse.org/) - break out date times, days, weeks, years
- [dplyr](https://dplyr.tidyverse.org/) - primarily a set of functions designed to enable dataframe manipulation in an intuitive, user-friendly way
- [tibble](https://tibble.tidyverse.org/) - simple dataframes<br>
<u>Models</u>
- [broom](https://cran.r-project.org/web/packages/broom/vignettes/broom.html) - takes messy output of built-in functions in R, such as lm, nls, or t.test, and turns them into tidy tibbles - built to work with dplyr
- [modelr](https://modelr.tidyverse.org/) - provides functions that help you create elegant pipelines when modelling

![title](images/tidyverse.png)

21. <b>What is a [Machine Learning Pipeline](https://machinelearningmastery.com/machine-learning-modeling-pipelines/)</b>?

Applied machine learning is typically focused on finding a single model that performs well or best on a given dataset.  A modeling pipeline is a linear sequence of data preparation and modeling steps that can be treated as an atomic unit. It includes steps to:
- prepare the data
- tune the model
- transform the predictions 

A modeling pipeline requires that the sequence of one or more data preparation schemes, the model, the model configuration, and any prediction transform schemes must be evaluated consistently and correctly on a given test harness.
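
As a rough illustration of treating preparation and model as one atomic unit, the sketch below chains a standardization step and a nearest-centroid classifier in plain Python. The class names (`Standardizer`, `NearestCentroid`, `SimplePipeline`) and the toy data are hypothetical and exist only for this example; a real project would more likely use scikit-learn's `Pipeline`.

```python
from statistics import mean, pstdev


class Standardizer:
    """Data preparation step: scale a numeric feature to zero mean, unit variance."""

    def fit(self, xs):
        self.mu = mean(xs)
        self.sigma = pstdev(xs) or 1.0  # avoid division by zero on constant data
        return self

    def transform(self, xs):
        return [(x - self.mu) / self.sigma for x in xs]


class NearestCentroid:
    """Model step: predict the class whose training mean is closest."""

    def fit(self, xs, ys):
        self.centroids = {
            label: mean([x for x, y in zip(xs, ys) if y == label])
            for label in set(ys)
        }
        return self

    def predict(self, xs):
        return [min(self.centroids, key=lambda c: abs(x - self.centroids[c]))
                for x in xs]


class SimplePipeline:
    """Treat preparation + model as one unit with fit/predict."""

    def __init__(self, prep, model):
        self.prep, self.model = prep, model

    def fit(self, xs, ys):
        # The preparation step is fit on the training data only.
        self.model.fit(self.prep.fit(xs).transform(xs), ys)
        return self

    def predict(self, xs):
        # The same fitted transform is reused at prediction time.
        return self.model.predict(self.prep.transform(xs))


train_x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
train_y = ["low"] * 3 + ["high"] * 3
pipe = SimplePipeline(Standardizer(), NearestCentroid()).fit(train_x, train_y)
predictions = pipe.predict([2.5, 9.0])
```

Because the scaler's mean and standard deviation are learned inside `fit`, evaluating the whole pipeline on held-out data keeps test information out of the preparation step.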

22. <b>What is [train-test split](https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/) in a Machine Learning pipeline?</b>

The train-test split procedure is used to estimate the performance of machine learning algorithms when they are used to make predictions on data not used to train the model.
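
A minimal sketch of the idea in plain Python, assuming a 75/25 split and a fixed seed for reproducibility. The helper name `split_train_test` is hypothetical; in practice scikit-learn's `train_test_split` does this job.

```python
import random


def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(20))  # stand-in for 20 labeled examples
train, test = split_train_test(data, test_fraction=0.25, seed=42)
```

The model is fit on `train` only and scored on `test`, so the score reflects data the model has not seen.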

23. <b>What is [k-fold cross-validation](https://machinelearningmastery.com/k-fold-cross-validation/) as a procedure within a Machine Learning pipeline?</b>

Cross-validation, sometimes called rotation estimation or out-of-sample testing, is any of various similar model validation techniques for assessing how the results of a statistical analysis will generalize to an independent data set.
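
For k-fold cross-validation specifically, the data is divided into k folds and each fold serves as the test set exactly once. The sketch below generates the index splits in plain Python; the helper name `k_fold_indices` is hypothetical, and it assumes the data has already been shuffled if order matters.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
```

A model would be trained and scored once per fold, and the k scores averaged.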

24. <b>What is a [repeated k-fold cross-validation](https://machinelearningmastery.com/repeated-k-fold-cross-validation-with-python/) as a procedure within a Machine Learning pipeline?</b>

Repeated k-fold cross-validation provides a way to improve the estimate of a machine learning model's performance. This involves simply repeating the cross-validation procedure multiple times and reporting the mean result across all folds from all runs. This mean result is expected to be a more accurate estimate of the true unknown underlying mean performance of the model on the dataset, as calculated using the standard error.
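
The procedure can be sketched as follows, assuming a toy "model" that predicts the training mean and is scored by mean absolute error. The function names and the data are hypothetical; the point is only how scores from all folds and all repeats are pooled into a mean and a standard error.

```python
import random
from statistics import mean, stdev


def repeated_k_fold_scores(xs, score_fn, k=5, repeats=3, seed=0):
    """Return one score per fold per repeat; each repeat reshuffles the data."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        shuffled = xs[:]
        rng.shuffle(shuffled)
        for i in range(k):
            test = shuffled[i::k]  # every k-th item forms fold i
            train = [x for j, x in enumerate(shuffled) if j % k != i]
            scores.append(score_fn(train, test))
    return scores


def mae_of_mean_predictor(train, test):
    """Toy model: predict the training mean, score by mean absolute error."""
    prediction = mean(train)
    return mean([abs(y - prediction) for y in test])


values = [float(v) for v in range(1, 21)]  # hypothetical target values
scores = repeated_k_fold_scores(values, mae_of_mean_predictor)
estimate = mean(scores)
standard_error = stdev(scores) / len(scores) ** 0.5
```

Increasing `repeats` adds more scores to the pool, which is what tightens the estimate.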

25. <b>What are [Predictive Model Basics](https://r4ds.had.co.nz/model-basics.html)?</b>

There are two parts to a model:
1. <b>Define a family of models</b> that express a precise, but generic, pattern that you want to capture
2. <b>Generate a fitted model</b> by finding the model from the family that is the closest to your data; a fitted model is just the closest model from a family of models. That implies that you have the “best” model (according to some criteria); it doesn’t imply that you have a good model and it certainly doesn’t imply that the model is “true”. 

In addition to discovering which model performs the best on your dataset, you must discover:

- <b>Data transforms</b> that best expose the unknown underlying structure of the problem to the learning algorithms.
- <b>Model hyperparameters</b> that result in a good or best configuration of a chosen model.

26. <b>What are examples of supervised versus unsupervised machine learning models in [Python Scikit Learn](https://scikit-learn.org/)?</b>

<b>Supervised Learning</b>

Definition: Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples.

```
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split  # replaces the removed sklearn.cross_validation

X_iris, y_iris = load_iris(return_X_y=True)  # features and species labels
Xtrain, Xtest, ytrain, ytest = train_test_split(X_iris, y_iris,
                                                random_state=1)

from sklearn.naive_bayes import GaussianNB # 1. choose model class
model = GaussianNB()                       # 2. instantiate model
model.fit(Xtrain, ytrain)                  # 3. fit model to data
y_model = model.predict(Xtest)             # 4. predict on new data

from sklearn.metrics import accuracy_score
accuracy_score(ytest, y_model)             # fraction of correct predictions
```

<b>Unsupervised Learning</b>

Definition: Unsupervised learning is a type of algorithm that learns patterns from untagged data. The hope is that through mimicry, the machine is forced to build a compact internal representation of its world.

``` 
import seaborn as sns
from sklearn.decomposition import PCA  # 1. Choose the model class

iris = sns.load_dataset('iris')        # DataFrame with a 'species' column
X_iris = iris.drop('species', axis=1)  # feature matrix only

model = PCA(n_components=2)            # 2. Instantiate the model with hyperparameters
model.fit(X_iris)                      # 3. Fit to data. Notice y is not specified!
X_2D = model.transform(X_iris)         # 4. Transform the data to two dimensions

iris['PCA1'] = X_2D[:, 0]
iris['PCA2'] = X_2D[:, 1]
sns.lmplot(x="PCA1", y="PCA2", hue='species', data=iris, fit_reg=False)
```

27. <b>What are [Support Vector Machines (SVM)](https://scikit-learn.org/stable/modules/svm.html) and give an example?</b>

<b>[Support Vector Machines (SVM)](https://scikit-learn.org/stable/modules/svm.html)</b>

Definition: Supervised learning models with associated learning algorithms that analyze data for classification and regression analysis.

Example: Face Recognition
```
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_lfw_people

# Load the Labeled Faces in the Wild dataset (downloaded on first use)
faces = fetch_lfw_people(min_faces_per_person=60)
print(faces.target_names)
print(faces.images.shape)

# Plot the first 15 faces, labeled with each person's name
fig, ax = plt.subplots(3, 5)
for i, axi in enumerate(ax.flat):
    axi.imshow(faces.images[i], cmap='bone')
    axi.set(xticks=[], yticks=[],
            xlabel=faces.target_names[faces.target[i]])
```
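
The snippet above loads and displays the faces but does not fit a classifier. As a smaller, hedged sketch of fitting an SVM, the example below trains scikit-learn's `SVC` on the built-in iris data instead of the face images, so it runs without a download. The kernel and `C` value are illustrative choices, not tuned settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = SVC(kernel="rbf", C=1.0)        # illustrative hyperparameters
model.fit(X_train, y_train)             # learn the separating boundaries
accuracy = model.score(X_test, y_test)  # fraction of correct test predictions
```

For the face data, the same `fit` and `score` calls would apply to `faces.data` and `faces.target`, usually after a dimensionality-reduction step.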

28. <b>List some Predictive Analysis Models?</b>

- <b>PCA</b> - Principal component analysis (PCA) is a technique used to emphasize variation and bring out strong patterns in a dataset. It's often used to make data easy to explore and visualize.

- <b>SSA</b> - Singular Spectrum Analysis is a nonparametric method. It tries to overcome the problems of finite sample length and noisiness of sampled time series not by fitting an assumed model to the available series, but by using a data-adaptive basis set, instead of the fixed sine and cosine basis of the Blackman-Tukey (BT) method.

- <b>regression</b> - Regression analysis is a reliable method of identifying which variables have impact on a topic of interest. The process of performing a regression allows you to confidently determine which factors matter most, which factors can be ignored, and how these factors influence each other.

- <b>GLM</b> - general linear model (GLM) usually refers to conventional linear regression models for a continuous response variable given continuous and/or categorical predictors. It includes multiple linear regression, as well as ANOVA and ANCOVA (with fixed effects only).

- <b>time series</b> - Time series analysis is a specific way of analyzing a sequence of data points collected over an interval of time. In time series analysis, analysts record data points at consistent intervals over a set period of time rather than just recording the data points intermittently or randomly.

- <b>DSP</b> - Digital signal processing (DSP) is the process of analyzing and modifying a signal to optimize or improve its efficiency or performance. It involves applying various mathematical and computational algorithms to analog and digital signals to produce a signal that's of higher quality than the original signal.

- <b>financial modeling</b> - Financial modeling is a representation in numbers of a company's operations in the past, present, and the forecasted future. Such models are intended to be used as decision-making tools. Financial models are used to estimate the valuation of a business or to compare businesses to their peers in the industry.

- <b>nonlinear systems</b> - Nonlinear regression is a form of regression analysis in which data is fit to a model that is expressed as a nonlinear mathematical function of its parameters. 

29.  <b>What is the predictive model used for <i>college admission problem</i> and the <i>stable marriage problem</i></b>?

<b>Gale–Shapley algorithm</b> (also known as the deferred acceptance algorithm or propose-and-reject algorithm) is an algorithm for finding a solution to the stable matching problem, named for David Gale and Lloyd Shapley who had described it as solving both the college admission problem and the stable marriage problem. It takes polynomial time, and the time is linear in the size of the input to the algorithm. It is a truthful mechanism from the point of view of the proposing participants, for whom the solution will always be optimal.

![title](images/Gale-Shapley.gif)

The Gale–Shapley algorithm involves a number of "rounds" (or "iterations"):

- In the first round:<br><br>
A) each unengaged man proposes to the woman he prefers most <br><br>
B) each woman replies "maybe" to the suitor she most prefers and "no" to all other suitors. She is then provisionally "engaged" to the suitor she most prefers so far, and that suitor is likewise provisionally engaged to her <br><br>
- In each subsequent round:<br><br>
A) each unengaged man proposes to the most-preferred woman to whom he has not yet proposed (regardless of whether the woman is already engaged)<br><br>
B) each woman replies "maybe" if she is currently not engaged or if she prefers this man over her current provisional partner (in this case, she rejects her current provisional partner who becomes unengaged). The provisional nature of engagements preserves the right of an already-engaged woman to "trade up" (and, in the process, to "jilt" her until-then partner).<br><br>
- This process is repeated until everyone is engaged.

#### Algorithm
```
algorithm stable_matching is
    Initialize all m ∈ M and w ∈ W to free
    while ∃ free man m who has a woman w to propose to do
        w := first woman on m's list to whom m has not yet proposed
        if ∃ some pair (m', w) then
            if w prefers m to m' then
                m' becomes free
                (m, w) become engaged
            else
                (m', w) remain engaged
            end if
        else
            (m, w) become engaged
        end if
    end while
```
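
The pseudocode above translates fairly directly into Python. The sketch below assumes equal numbers of proposers and receivers and complete preference lists; the function name `gale_shapley` and the example preferences are hypothetical.

```python
def gale_shapley(proposer_prefs, receiver_prefs):
    """Return a stable matching {proposer: receiver} with proposers proposing."""
    # rank[r][p] = position of proposer p in receiver r's list (lower is better)
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in receiver_prefs.items()}
    next_choice = {p: 0 for p in proposer_prefs}  # index of next receiver to try
    engaged_to = {}                               # receiver -> proposer
    free = list(proposer_prefs)

    while free:
        p = free.pop()
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = engaged_to.get(r)
        if current is None:
            engaged_to[r] = p          # r was free: provisionally accept
        elif rank[r][p] < rank[r][current]:
            engaged_to[r] = p          # r trades up
            free.append(current)       # the jilted partner is free again
        else:
            free.append(p)             # r rejects p, who stays free

    return {p: r for r, p in engaged_to.items()}


men = {"A": ["X", "Y", "Z"], "B": ["Y", "X", "Z"], "C": ["X", "Y", "Z"]}
women = {"X": ["B", "A", "C"], "Y": ["A", "B", "C"], "Z": ["A", "B", "C"]}
matching = gale_shapley(men, women)
```

Precomputing `rank` makes each preference comparison a constant-time lookup, so the loop runs at most once per (proposer, receiver) pair.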

This algorithm guarantees that:

<b>Everyone gets married</b><br>
- At the end, there cannot be a man and a woman both unengaged, as he must have proposed to her at some point (since a man will eventually propose to everyone, if necessary) and, being proposed to, she would necessarily be engaged (to someone) thereafter.<br>

<b>The marriages are stable</b><br>
- Let Alice and Bob both be engaged, but not to each other. Upon completion of the algorithm, it is not possible for both Alice and Bob to prefer each other over their current partners. If Bob prefers Alice to his current partner, he must have proposed to Alice before he proposed to his current partner. If Alice accepted his proposal, yet is not married to him at the end, she must have dumped him for someone she likes more, and therefore doesn't like Bob more than her current partner. If Alice rejected his proposal, she was already with someone she liked more than Bob.

#### Optimality of Solution

This is a general fact: the Gale-Shapley algorithm in which men propose to women always yields a stable matching that is the best for all men and worst for all women among all stable matchings.

#### R Packages
``` 
matchingMarkets
matchingR
```

#### Python Packages
```
matching library
QuantEcon/MatchingMarkets.py

```

[Javascript Demonstration](http://www.sephlietz.com/gale-shapley/)

30. <b>How Entropy is used for [Decision Trees to make decisions](https://towardsdatascience.com/entropy-how-decision-trees-make-decisions-2946b9c18c8)?</b>

Entropy is a measure of disorder, so it can also be used to measure purity. The goal of each split is to reduce the disorder in our target variable as much as possible.

![title](images/entropy.png)

The x-axis measures the proportion of data points belonging to the positive class in each bubble and the y-axis measures their respective entropies. Right away, you can see the inverted ‘U’ shape of the graph. Entropy is lowest at the extremes, when the bubble either contains no positive instances or only positive instances. That is, when the bubble is pure the disorder is 0. Entropy is highest in the middle, when the bubble is evenly split between positive and negative instances: this is extreme disorder, because there is no majority.

#### Example: Contingency Table

![title](images/contingency.png)

For this illustration, we will use this contingency table to calculate the entropy of our target variable by itself and then the entropy of our target variable given additional information about the feature, credit rating. This allows us to calculate how much additional information “Credit Rating” provides about our target variable, “Liability”.

Knowing the Credit Rating helped us reduce the uncertainty around our target variable, Liability - exactly what a feature is supposed to do: provide us information about our target variable. 

Decision trees use entropy and information gain to determine which feature to split their nodes on to get closer to predicting the target variable with each split and also to determine when to stop splitting the tree. 
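
To make the calculation concrete, here is a small sketch of entropy and information gain in Python. The class counts are hypothetical and are not the values from the contingency table image; the function names are likewise only illustrative.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, child_counts_list):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child)
                   for child in child_counts_list)
    return entropy(parent_counts) - weighted


# Hypothetical counts of (normal, high) liability, overall and per rating group.
parent = [7, 7]
by_credit_rating = [[1, 6], [6, 1]]
gain = information_gain(parent, by_credit_rating)
```

An evenly split parent has entropy 1 bit, and a feature whose groups are purer than the parent yields a positive gain. A decision tree would split on the feature with the largest gain.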

- - -
# Statistics

- <b>What is the difference between false positive and false negative?</b><br><br>
A false positive is when a scientist determines something is true when it is actually false (also called a type I error). A false positive is a “false alarm.” <br><br>A false negative is saying something is false when it is actually true (also called a type II error). <br><br><b>Examples:</b><br><br>
- Airport Security: a "false positive" is when ordinary items such as keys or coins get mistaken for weapons
- Quality Control: a "false positive" is when a good quality item gets rejected, and a "false negative" is when a poor quality item gets accepted. (A "positive" result means there IS a defect.)
- COVID Test: a "false negative" - there's a chance that your COVID-19 diagnostic test could return a false-negative result. This means that the test didn't detect the virus, even though you actually are infected with it.
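
A short sketch of how these outcomes are counted from paired labels, assuming `True` means "condition present" or "test positive". The data and the helper name `confusion_counts` are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for boolean labels."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for truth, guess in zip(actual, predicted):
        if guess and truth:
            counts["TP"] += 1
        elif guess and not truth:
            counts["FP"] += 1  # false alarm (type I error)
        elif not guess and truth:
            counts["FN"] += 1  # missed case (type II error)
        else:
            counts["TN"] += 1
    return counts


# Hypothetical test results for six people.
actual = [True, True, False, False, True, False]
predicted = [True, False, False, True, True, False]
counts = confusion_counts(actual, predicted)
```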


- <b>What is the null hypothesis and how do we state it?</b><br><br>
The null hypothesis states that there is no relationship between the variables being studied, for example that a treatment has no effect. To write a null hypothesis, first start by asking a question. Then rephrase that question in a form that assumes no relationship between the variables.

- <b>How would you explain a linear regression to a business executive?</b><br><br>
Linear regression models are used to show or predict the relationship between two variables or factors. The factor that is being predicted (the factor that the equation solves for) is called the dependent variable. The factors that are used to predict the value of the dependent variable are called the independent variables.

- <b>Tell me what heteroskedasticity is and how to solve it.</b><br><br>

A sequence of random variables is homoscedastic if all its random variables have the same finite variance; heteroscedasticity is the complement of this. See image below.
![title](images/homo1.png)

In simple terms, heteroscedasticity describes any set of data that isn't homoscedastic.  More technically, it refers to data with unequal variability (scatter) across the range of a second, predictor variable. Heteroscedastic data tends to follow a cone shape on a scatter graph.<br><br>
When heteroscedasticity is present in a regression analysis, the results of the analysis become hard to trust. This makes it much more likely for a regression model to declare that a term in the model is statistically significant, when in fact it is not.<br><br>
The simplest way to detect heteroscedasticity is with a fitted value vs. residual plot. Once you fit a regression line to a set of data, you can then create a scatterplot that shows the fitted values of the model vs. the residuals of those fitted values.<br><br>
![title](images/het2.jpeg)

Notice how the residuals become much more spread out as the fitted values get larger. This “cone” shape is a telltale sign of heteroscedasticity.<br><br>
<b>Example 1</b>: Consider a dataset that includes the annual income and expenses of 100,000 people across the United States. For individuals with lower incomes, there will be lower variability in the corresponding expenses since these individuals likely only have enough money to pay for the necessities. For individuals with higher incomes, there will be higher variability in the corresponding expenses since these individuals have more money to spend if they choose to.<br><br>
<b>Example 2</b>: Consider a dataset that includes the populations and the count of flower shops in 1,000 different cities across the United States. For cities with small populations, it may be common for only one or two flower shops to be present. But in cities with larger populations, there will be a much greater variability in the number of flower shops. These cities may have anywhere from 10 to 100 shops. This means when we run a regression analysis and use population to predict number of flower shops, there will inherently be greater variability in the residuals for the cities with higher populations.
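
As a crude numeric counterpart to the fitted value vs. residual plot, the sketch below fits a simple regression and compares the residual spread in the lower and upper halves of the fitted values. The data is simulated with noise that grows with x, which is an assumption made purely for illustration.

```python
import random
from statistics import mean, pstdev

rng = random.Random(0)
# Simulated data: the noise standard deviation is proportional to x.
xs = [float(x) for x in range(1, 201)]
ys = [2.0 * x + rng.gauss(0, 0.5 * x) for x in xs]

# Ordinary least squares for a single predictor.
x_bar, y_bar = mean(xs), mean(ys)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar

# Residuals in order of x, which is also the order of the fitted values here.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
half = len(residuals) // 2
spread_low = pstdev(residuals[:half])   # spread at small fitted values
spread_high = pstdev(residuals[half:])  # spread at large fitted values
ratio = spread_high / spread_low
```

A ratio well above 1 is the numeric version of the cone shape; a ratio near 1 would be consistent with constant variance.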

<b>Why do we log a variable?</b> When logs are applied, the distributions are better behaved. Taking logs also reduces the extrema in the data, and curtails the effects of outliers. We often see economic variables measured in dollars in log form, while variables measured in units of time, or interest rates, are often left in levels.

#### How To Solve Heteroskedasticity

1. Transform the dependent variable<br><br>
One common transformation is to simply take the log of the dependent variable. For example, if we are using population size (independent variable) to predict the number of flower shops in a city (dependent variable), we may instead try to use population size to predict the log of the number of flower shops in a city.<br><br>

2. Redefine the dependent variable.<br><br>
For example, instead of using the population size to predict the number of flower shops in a city, we may instead use population size to predict the number of flower shops per capita. This reduces the variability that naturally occurs among larger populations since we’re measuring the number of flower shops per person, rather than the sheer number of flower shops.<br><br>

3. Use weighted regression<br><br>
Another way to fix heteroscedasticity is to use weighted regression. This type of regression assigns a weight to each data point based on the variance of its fitted value. Essentially, this gives small weights to data points that have higher variances, which shrinks their squared residuals. When the proper weights are used, this can eliminate the problem of heteroscedasticity.<br><br>
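
Weighted regression for a single predictor can be sketched with the closed-form weighted least squares solution. The data and the variances assigned to each point are hypothetical; in practice the variances have to be estimated, and the weights are commonly taken as 1 / variance.

```python
def weighted_least_squares(xs, ys, weights):
    """Fit y = intercept + slope * x by minimizing sum(w * residual**2)."""
    w_total = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, xs)) / w_total
    y_bar = sum(w * y for w, y in zip(weights, ys)) / w_total
    s_xy = sum(w * (x - x_bar) * (y - y_bar)
               for w, x, y in zip(weights, xs, ys))
    s_xx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Hypothetical data: the first four points lie on y = 2x + 1, and the last
# point is noisy and assumed to have a much higher variance.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 20.0]
assumed_variances = [1.0, 1.0, 1.0, 1.0, 100.0]
weights = [1.0 / v for v in assumed_variances]

intercept_wls, slope_wls = weighted_least_squares(xs, ys, weights)
intercept_ols, slope_ols = weighted_least_squares(xs, ys, [1.0] * len(xs))
```

With equal weights the noisy point drags the slope upward; down-weighting it keeps the fitted slope close to the trend of the low-variance points.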

- - -
# Coding

[Python](https://www.python.org/), [R](https://www.r-project.org/), [SAS](https://sas.com) (optional) and [SQL](https://www.tutorialspoint.com/sql/sql-overview.htm) are the bread-and-butter programming languages in data science.