# Data Science Mock Interview #1 

## Goal: Be able to apply experimental design to a given hypothetical scenario. Have a complete understanding and synthesis of the concepts learned in Unit 1 of the Thinkful curriculum, and refresh on the bootcamp.

Topics covered:

* Complex Visualization
* Significance and data interaction
* Experimental Design

## Section 1: Complex Visualization 

Resources: 
1. [8 great examples of complex data visualized -- maptive](https://www.maptive.com/8-great-examples-complex-data-visualized-2/)
2. [What great data visualization looks like -- hubspot](https://blog.hubspot.com/marketing/great-visualization-examples)
3. [Data Visualization using Seaborn -- towardsdatascience](https://towardsdatascience.com/data-visualization-using-seaborn-fc24db95a850)
4. [Seaborn Documentation](http://seaborn.pydata.org/index.html)
5. [Best visualization projects in 2016](https://flowingdata.com/2016/12/29/best-data-visualization-projects-of-2016/)

### Main takeaways

* Plots should be appropriately simple and labeled. Someone looking at your visualization should be able to tell what's going on from the graph alone, without needing extra explanation.
* Basic plots show every datapoint. Statistical plotting aggregates information to highlight features of data.

### What plot where? 

**Line Plots:**
* show data over time

**Scatter Plots:**
* show the relationship between two variables
* two continuous variables
    
**Bar Plots**
* most commonly used method for visualizing grouped data
* frequently criticized because they are easy to misuse and can be read subjectively
* use means or counts
* should be evaluated based on the size of the error bars, not the height of the bars themselves (the "dynamite plot" problem)
* curriculum resource on misuse [here](https://www.washingtonpost.com/graphics/politics/2016-election/trump-charts/?noredirect=on)
* one categorical variable plus either one continuous variable or counts

**Histogram**
* statistical plot
* show the distribution of the dataset and any outliers
* show central tendencies and variance

**Box plot**
* statistical plot
* compare groups and identify differences in variance, as well as outliers
* give central tendencies and variance

**QQ plots**
* statistical plot
* show how close a variable is to a known distribution, and any outliers

**Heatmap**
* shows data density or correlation within a dataset 

**Violin plot**
* used to visualise the distribution of the data and its probability density
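
To make the box plot entry above more concrete, here is a minimal sketch of the numbers a box plot summarizes. It assumes the common 1.5 × IQR rule for flagging outliers and uses Python's default quartile method; plotting libraries may compute quartiles slightly differently. The function name and the sample data are hypothetical.

```python
from statistics import quantiles


def box_plot_summary(values):
    """Return the quartiles, IQR, and outliers a box plot would display."""
    # quantiles() sorts the data and uses the "exclusive" method by default.
    q1, q2, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    # Assumption: points beyond 1.5 * IQR from the box count as outliers.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"q1": q1, "median": q2, "q3": q3, "iqr": iqr, "outliers": outliers}


# Hypothetical sample with one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
summary = box_plot_summary(sample)
print(summary)
```

The single value of 100 lands far outside the upper fence, which is the kind of point a box plot draws as an isolated marker.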

## Section 2: Significance and Data Interaction 

### Significance and Missingness within Data

**Defn:** A result is *statistically significant* when it would be unlikely to occur by chance alone if there were no real relationship between the variables. Statistical hypothesis testing is used to determine whether a result is statistically significant. The test produces a p-value: the probability of seeing a result at least as extreme as the one observed if the null hypothesis (no real effect) were true. In general, a p-value of 5% (0.05) or lower is considered statistically significant. [Source](https://www.investopedia.com/terms/s/statistically_significant.asp)

**Reminders:**

* We can only reject (or fail to reject) a null hypothesis. We never prove a hypothesis.

**Missingness and cleaning**:

*Significant:*
* Lost or missing data is significant if it will alter outcomes or if the data is missing because of bias.

*Insignificant:* 
* Lost or missing data is insignificant if it is completely random within the sample and does not meaningfully reduce statistical power.

**Types of missing information**

* __Missing Completely at Random__ (MCAR) - missingness is unrelated to any variable, observed or unobserved. Insignificant unless the remaining sample is too small to support the analysis. Proceed with analysis. **Ex:** Act of God results in losing 20% of the data.
* __Missing at Random__ (MAR) - missingness is related to other observed variables but not to the missing value itself. Proceed with caution. It is hard to tell when data is MAR, but it can be inferred. Proceed with analysis if the missingness can be explained by observed variables (**Ex:** women of any weight are more likely to skip a weight-related question; 90% of people with missing values on the "depression" score are men).
* __Missing Not at Random__ (MNAR) - missingness is related to the missing value itself. This is an assumption based on looking at the data and noticing what isn't there. Typical cases are answers that people might not want recorded: anything sensitive with respect to discrimination or values, or either/or choices where a person is more likely to skip the question than to pick an answer (**Ex**: LGBTQ / Heterosexual). Throwing out MNAR data results in biased and incomplete conclusions and analysis.
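
One rough way to look for MAR-style patterns is to check how the missing values are distributed across an observed variable. The sketch below mirrors the "depression" score example above; the records, field names, and function name are all hypothetical.

```python
from collections import Counter


def missing_share_by_group(records, group_key, value_key):
    """Among records missing value_key, return each group's share."""
    missing = Counter(r[group_key] for r in records if r[value_key] is None)
    total = sum(missing.values())
    if total == 0:
        return {}
    return {group: count / total for group, count in missing.items()}


# Hypothetical survey: 9 men and 1 woman skipped the depression score.
records = (
    [{"gender": "man", "depression": None} for _ in range(9)]
    + [{"gender": "woman", "depression": None}]
    + [{"gender": "man", "depression": 4} for _ in range(20)]
    + [{"gender": "woman", "depression": 5} for _ in range(20)]
)
shares = missing_share_by_group(records, "gender", "depression")
print(shares)
```

A lopsided split like this suggests the missingness depends on an observed variable. It cannot rule out MNAR on its own, since that depends on values we never see.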

**Imputing Data**

* In cases of missing data, we can choose to throw it out or use one of several methods to 'fill in', guess, or impute the missing values.
* We can also use methods such as winsorization to clean data with extreme outliers. Winsorization moves extreme data points, which may be spurious (false, fake, or incorrectly entered), to the minimum or maximum allowed values or percentiles. For example, if 95% of the data falls within the values of 10-50 and there is a data point of -40 and a data point of 600, the value of -40 will be moved to 10 and the value of 600 will be moved to 50.
* The most straightforward way of imputing data involves replacing missing values with the mode, mean, or median of the variable.
* From curriculum: "If the causes of MNAR (or of major, catastrophic amounts of missingness that is MCAR or MAR) are clear and easy to fix, then fixing those causes and collecting new data may be easier than imputation."
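
The two cleaning steps above can be sketched in a few lines. This is a minimal illustration, not a recommended pipeline: the function names are made up for this example, missing values are represented as `None`, and the 10-50 limits are taken from the example above rather than computed from percentiles.

```python
from statistics import mean, median


def impute(values, strategy="median"):
    """Replace None with the median (default) or mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if strategy == "median" else mean(observed)
    return [fill if v is None else v for v in values]


def winsorize(values, lower, upper):
    """Move values below `lower` up to it and values above `upper` down to it."""
    return [min(max(v, lower), upper) for v in values]


# Hypothetical data with one missing value and two extreme points.
raw = [-40, 12, 18, None, 25, 31, 47, 600]
filled = impute(raw)                 # the None becomes the median, 25
clipped = winsorize(filled, 10, 50)  # -40 becomes 10, 600 becomes 50
print(clipped)
```

Note that with this data the mean of the observed values is 99, far above most of the points, which is why the median is often the safer fill value when outliers are present.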

### Awareness of Bias and Interaction

**Types of Bias:**

* **Sampling or Selection Bias:**
    -   Sample is collected in such a way that some members of the intended population are less likely to be included than others. 
    -  Ex: a survey of high school students to measure teenage use of illegal drugs will be a biased sample because it does not include home-schooled students or dropouts.
    -  Avoided by choosing the sample randomly from the population

* **Bias in Assignment to Conditions**
    -  Occurs when participants are assigned to groups in a way that makes the groups differ systematically in something other than the treatment.
    -  Ex: A random population sample is broken up by age to test whether sprinkles make ice cream more appealing. Group 'A', with mean age 13, rates ice cream with sprinkles, and group 'B', with mean age 30, rates ice cream without sprinkles. Group A gives a higher rating, so the conclusion is drawn that sprinkles make ice cream better. However, the conclusion could be biased because we don't know whether sprinkles are liked more or whether younger people simply like ice cream more.
    
* **Contextual Bias**
   - Where the context of the study influences the outcomes in a way that may not be readily apparent.
   - Ex: During summer, two groups are asked to rate ice cream with or without sprinkles. One of the rooms has a broken AC unit and is significantly hotter.
   - Contextual bias needs careful consideration, especially during A/B testing if the 'A' is going to come from a previous quarter's sales, previous year's traffic, previous history of adoption, etc., instead of being run concurrently with the B population.
   - Always consider: initial populations, bias that could come from who is distributing the test, major events, major weather, major news, holidays. If using previous results for the 'A' group, consider how the population has changed with respect to income, size, and demographics.
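
A quick sanity check for bias in assignment is to compare a background variable, such as age, across the groups before comparing outcomes. The sketch below reuses the ages from the sprinkles example; the function name, the data, and the 2-year tolerance are hypothetical choices for illustration.

```python
from statistics import fmean


def mean_gap(group_a, group_b):
    """Difference in means of a background variable between two groups."""
    return fmean(group_a) - fmean(group_b)


# Hypothetical ages matching the example: mean 13 versus mean 30.
ages_a = [12, 13, 14, 13]
ages_b = [28, 30, 32, 30]

age_gap = mean_gap(ages_a, ages_b)
# Assumption: a gap of more than 2 years is treated as unbalanced.
is_balanced = abs(age_gap) <= 2
print(age_gap, is_balanced)
```

A gap this large means any difference in ratings could come from age rather than sprinkles, so the groups should be re-split before the results are trusted.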
   
### Significance of Results

Three things are needed to determine if the difference between two populations is significant:

1. Summaries of central tendency (mean, median, mode) for each group - most importantly the mean or average.
2. The standard deviation, or variability, in each group - how spread out the data is around the mean. In a normal distribution, the 68–95–99.7 (empirical) rule applies: 68% of results fall within 1 standard deviation of the mean in either direction, 95% fall within 2, and 99.7% fall within 3.
    *  A higher standard deviation indicates there was more noise, or unexplained variability.
3. The size of each group (N).

See the next section for an in-depth discussion of t-tests, p-values, etc.
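
The three ingredients above, and the empirical rule, can be checked on simulated data. This is a sketch under stated assumptions: the scores are drawn from a normal distribution with a made-up mean of 100 and standard deviation of 15, and the helper names are hypothetical.

```python
import random
from statistics import fmean, stdev


def summarize(group):
    """Mean, sample standard deviation, and size of one group."""
    return {"mean": fmean(group), "sd": stdev(group), "n": len(group)}


def share_within(group, k):
    """Fraction of values within k standard deviations of the mean."""
    m, sd = fmean(group), stdev(group)
    return sum(abs(x - m) <= k * sd for x in group) / len(group)


# Simulated, normally distributed scores (seeded for repeatability).
rng = random.Random(1)
scores = [rng.gauss(100, 15) for _ in range(10_000)]

stats = summarize(scores)
shares = {k: share_within(scores, k) for k in (1, 2, 3)}
print(stats)
print(shares)  # should land close to 0.68, 0.95, and 0.997
```

On data that is not roughly normal, these shares can be far from 68/95/99.7, which is one quick way to spot that the normality assumption does not hold.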






## Section 3: Experimental Design

Experimental design is the *only* way to get exactly the data you want or need to answer new, specific questions. While conclusions and analysis can be drawn from existing datasets, an experiment is the only way to get data that is perfectly fitted to the questions at hand.

* [There are many types of experimental design](https://cirt.gcu.edu/research/developmentresources/research_ready/experimental/design_types), but for now the focus is on A/B and A/A testing.

### A/B Testing (also called bucket tests or split-run testing)

__Defn (Technical)__: A randomized experiment with two variants, A and B. A/B testing is a way to compare two versions of a single variable, typically by testing a subject's response to variant A against variant B, and determining which of the two variants is more effective.

__Defn (Layman)__: A technique used to identify whether one version of an object of interest (product, marketing campaign, email text) is better at producing a desired outcome than another. Such tests are used to compare results from two options, A and B, thus allowing us to select the better of the two. 

#### The Five Steps of an A/B Experiment:

1. __Make an A and B version of something. Often one will already be in use, and there is another "test version."__ 
   -  Ask whether you want to run the two versions concurrently, if that is possible, or whether you can use past metrics as your initial version.
   -  If someone is testing whether a 'daily special' increases sales at a coffee shop: Do they own more than one of the same shop? If so, one shop can be 'A', where nothing changes, while the other coffee shop is 'B' and introduces daily specials.
   - Minimize biased results by looking at the demographics and temporal situations. If you are using past data in a city where employment is decreasing, this may affect the gathered data.
   - The 'A' group is often referred to as the control group. Ex: in medicine, control groups are frequently used in double-blind tests (where neither the doctor nor the patient knows whether the patient is in the control or treatment group). Members of the control group for a certain medication are often given a placebo (commonly a sugar pill or saline infusion) or a standard treatment instead of the medication being tested.
   
   
2. __Choose the sample population__ 
    -  Make sure the two groups are as comparable as possible
    -  The sample should be selected so that it is similar to the population you want to understand
    -  The split between the two groups should be as random as possible


3. __Select Metrics (come up with a hypothesis and a null hypothesis)__
     -  The null hypothesis is what we try to reject and is usually something like "The populations are the same"
     -  For the hypothesis, "Brighter emails will result in more website sign-ups," the null hypothesis will be, "Brighter emails will have no effect on website sign-ups."
     -  Clarify definitions. Words like *more* or *effective* can be subjective. Determine primary and secondary metrics, if there are any.
     -  For "Brighter emails will result in more website sign-ups", sign-ups would be a primary metric. 
     -  If this metric isn't enough, secondary metrics can be added. An experiment that measures obesity might have a primary metric of BMI reduction, but a secondary metric of muscle mass. 
     -  Beware of contradictions between metrics. More website sign-ups might naturally lead to more traffic but fewer unique hits. Reduced BMI may lead to reduced muscle mass.
     - Set benchmarks for success and failure, bailout plans, time/money/effort trade-offs, and next steps after a success.
     - As part of the above, decide whether the test could hurt anyone. Running a test that shuts down Amazon for 1 hour can result in a 40-million-dollar loss, whereas Pop's coffee shop going down for an hour will likely not drive away a lot of potential business. No harm, no foul.
     - This step also includes writing the research proposal, which consists of the problem, the potential solution, and the method of testing the solution.
 
 
4. __Collect Data__
    -  Decide on a timeframe for an initial testing and start the test. 
    -  Determine data collection time points and why they matter. For example, the time a bright email goes out can affect when people see it: is it being sent at random times, during work hours, on weekends, etc.?
    -  Standardize methods as much as possible (example: putting the 'specials' board out before opening the coffee shop each day instead of 1, 2, or 3 hours after the shop has opened). This minimizes variation that has nothing to do with the change being tested.
    -  If running tests concurrently, prevent leaks. In medicine, this might be reminding participants not to share what side effects they are experiencing. It can also be trying to prevent people coming into coffee shop "A" asking what the special of the day is.
    
    
5. __Analyze data and make recommendations__
    -  Determine what kind of data is missing and from where, and fill in where appropriate. Missing data can be imputed, put into a separate category, or carefully removed (with special attention paid to how removing missing data can skew results).
    -  Select a statistical test. For a continuous outcome, a t-test can be used. See below for more on t-tests and p-values.
    - Check statistical assumptions. A t-test assumes that the data in each group is approximately normally distributed. [See here for types of statistical tests](https://cyfar.org/types-statistical-tests)
    - Remember that a t-value (the larger its absolute value, the less likely it is that any differences are the result of noise or random chance) and a p-value (the smaller the p-value, the less likely it is that any differences are the result of noise or random chance) can only provide evidence to *reject* the null hypothesis; they do not definitively prove the hypothesis.
    - Revisit the testing method in the research proposal and compare the results against the benchmarks.
    - If t-tests and/or p-values do not give definitive results, or if p-values are close (benchmark was 0.05 and experiment result was 0.06), decide whether or not to continue the experiment according to the initial research proposal.
    - When making recommendations about close or failed results, determine whether to continue the experiment. Take into account time/money/effort trade-offs. Is it worth it from a business perspective to continue the experiment? Are the results too close to call? How likely is it that you will be able to accurately make a call on the null hypothesis at the end of the experiment's extension?
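
Step 2 asks for a split that is as random as possible. One simple way to do that, sketched here with hypothetical subject IDs and a made-up function name, is to shuffle the IDs and cut the list in half. The fixed seed is only there so the example is repeatable.

```python
import random


def assign_groups(subject_ids, seed=42):
    """Shuffle the subjects and split them into two equal-sized groups."""
    rng = random.Random(seed)  # seeded so the split can be reproduced
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical subject IDs 1 through 20.
group_a, group_b = assign_groups(range(1, 21))
print(sorted(group_a))
print(sorted(group_b))
```

Because assignment depends only on the shuffle, neither the experimenter nor the subjects' characteristics decide who lands in A or B. With an odd number of subjects, group B receives the extra one.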
    

### A/A Testing 

A/A testing provides context for the results of A/B testing and can help detect most sources of bias other than selection bias. It involves comparing the outcome of interest between two identical versions of something.

A/A testing can identify problems with:
* The testing method
* The random split method
* The size of the sample
* Any other outside/contextual factors that can impact the results of a test
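
A simulated A/A test shows what "healthy" looks like: when both groups come from the same population, a test at the 0.05 level should flag a difference in roughly 5% of runs. The sketch below is an illustration under assumptions, not a production check. The population parameters are made up, and the p-value uses a normal approximation to the t-distribution, which is only reasonable for large samples.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, stdev


def approx_p_value(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    standard_error = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    t = (fmean(a) - fmean(b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(t)))


def aa_false_positive_rate(runs=1000, n=100, alpha=0.05, seed=0):
    """Share of A/A runs that wrongly report a significant difference."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        # Both groups are drawn from the same hypothetical population.
        a = [rng.gauss(50, 10) for _ in range(n)]
        b = [rng.gauss(50, 10) for _ in range(n)]
        if approx_p_value(a, b) < alpha:
            hits += 1
    return hits / runs


rate = aa_false_positive_rate()
print(rate)  # expected to be near alpha
```

If a real A/A test flags differences much more often than the chosen threshold, that points to a problem with the split, the measurement, or the context rather than a real effect.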

### T-testing and P-Values

__Defn (Technical):__ A statistical test that calculates the size of the difference between two means given their variance and sample size. The t-value is the difference of the sample means divided by the square root of the sum of each sample's squared standard deviation divided by its sample size.

__Defn (Layman):__ A statistical test that indicates whether or not the difference between two groups reflects a "real" difference in the population instead of the difference coming from random chance.


**The T-Value**: The ratio of signal to noise. The larger the absolute T-Value, the less likely it is that the difference between the groups is due to random variance or "noise".
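
The technical definition above translates directly into code. This sketch computes the t-value for two small, hypothetical groups using the unequal-variance (Welch) form that the definition describes; the function name is made up for this example.

```python
from math import sqrt
from statistics import fmean, stdev


def t_value(a, b):
    """Difference in means (signal) divided by its standard error (noise)."""
    signal = fmean(a) - fmean(b)
    noise = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return signal / noise


# Hypothetical daily sign-ups for versions A and B.
group_a = [12, 15, 14, 10, 13]
group_b = [8, 9, 11, 7, 10]

t = t_value(group_a, group_b)
print(round(t, 4))  # about 3.41
```

Here the means differ by 3.8 and the standard error is about 1.11, so the signal is a little over three times the size of the noise.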


**The P-Value**: The probability of getting a t-value at least as extreme as the one above if there were no difference between the groups in the population.

In many cases, a p-value of less than 0.05 (5%) is used to reject the null hypothesis. More specifically, a p-value of less than 0.05 means that, if there were truly no difference between the groups, there would be less than a 5% chance of seeing a difference at least as large as the one in our experiment. **Note** that while p < 0.05 is often accepted to reject a null hypothesis, the threshold depends on the industry. In medicine, ideal p-values might be p < 0.0001 (1 in 10,000) or p < 0.00001 (1 in 100,000), and physics might require p < 0.0000001 (1 in 10,000,000).
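
To see what "if there were no difference between the groups" means in practice, a p-value can be computed by brute force. The sketch below is a permutation test, which is a different technique from the t-distribution lookup a t-test uses: it tries every way of relabeling the pooled data into two groups and counts how often the difference in means is at least as large as the observed one. The data and function name are hypothetical.

```python
from itertools import combinations
from statistics import fmean


def permutation_p_value(a, b):
    """Exact two-sided permutation p-value for a difference in means."""
    pooled = list(a) + list(b)
    observed = abs(fmean(a) - fmean(b))
    extreme = 0
    total = 0
    # Every choice of len(a) positions is one possible relabeling of "group A".
    for chosen in combinations(range(len(pooled)), len(a)):
        picked = set(chosen)
        new_a = [pooled[i] for i in picked]
        new_b = [pooled[i] for i in range(len(pooled)) if i not in picked]
        total += 1
        # Small tolerance guards against floating-point rounding.
        if abs(fmean(new_a) - fmean(new_b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical daily sign-ups for versions A and B.
group_a = [12, 15, 14, 10, 13]
group_b = [8, 9, 11, 7, 10]

p_value = permutation_p_value(group_a, group_b)
print(round(p_value, 4))  # about 0.0238
```

Only 6 of the 252 possible relabelings produce a gap as large as the real one, so the result would clear a 0.05 threshold but not the stricter thresholds mentioned above. Enumerating every split is only practical for small samples.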
 