# MACHINE LEARNING

### Q1: How to define/select metrics?
- There is no one-size-fits-all metric. The metrics chosen to evaluate a machine learning model depend on several factors:
    - Is it a regression or classification task?
    - What is the business objective? E.g. Precision vs Recall
    - What is the distribution of the target variable?
- There are a number of metrics that can be used, including Adjusted R-Squared, MAE, MSE, RMSE, accuracy, recall, precision, f1 score, MCC, ROC-AUC score and the list goes on.

### Q2: How to deal with unbalanced binary classification?
- Rebalance the classes by oversampling the minority class (e.g. SMOTE) or by undersampling the majority class (e.g. Tomek links)
- Focus on metrics that reflect the minority class, such as Recall/Sensitivity (to reduce False Negatives) and Precision (to reduce False Positives), rather than relying on Accuracy alone. Note that Specificity (the True Negative Rate) is a different quantity from Precision
- Use models or settings that cope better with imbalance, such as tree ensembles like XGBoost combined with class weighting

### Q3: What is the difference between a box plot and a histogram?
- A histogram approximates the probability distribution of a variable and shows the shape of that distribution
- A box plot summarizes the range, quartiles and outliers of the data. It is useful for comparing several continuous variables or groups side by side, because box plots take less space than a set of histograms.

### Q4: Describe different regularization methods, such as L1 and L2 regularization?
- Both L1 (Lasso Regression) and L2 (Ridge Regression) regularization reduce overfitting by adding a penalty on the size of the coefficients to the loss function.
- L2 shrinks the coefficients of less important features towards zero but rarely to exactly zero, so every feature stays in the model. L1 can shrink those coefficients to exactly zero, removing the features altogether, which makes it useful for feature selection when there are too many features to begin with.
- L2 has a unique, stable solution. L1 gives sparse solutions, but they are less stable: with highly correlated features there may be multiple solutions, and which feature is kept can change with small changes in the data.
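
The different shrinkage behaviour can be sketched for the special case of an orthonormal design matrix, where both penalties have a closed-form solution in terms of the unpenalized least-squares coefficient. The function names, the coefficient values and the penalty strength `lam` below are illustrative assumptions, not part of any library.

```python
def lasso_shrink(beta_ols: float, lam: float) -> float:
    """L1 solution of 0.5*(b - beta_ols)**2 + lam*|b| (soft-thresholding)."""
    magnitude = max(abs(beta_ols) - lam, 0.0)
    if magnitude == 0.0:
        return 0.0
    return magnitude if beta_ols > 0 else -magnitude


def ridge_shrink(beta_ols: float, lam: float) -> float:
    """L2 solution of 0.5*(b - beta_ols)**2 + 0.5*lam*b**2 (proportional shrinkage)."""
    return beta_ols / (1.0 + lam)


ols_coefficients = [3.0, -1.5, 0.4, -0.2]  # hypothetical unpenalized estimates
lam = 0.5                                  # hypothetical penalty strength

lasso = [lasso_shrink(b, lam) for b in ols_coefficients]
ridge = [ridge_shrink(b, lam) for b in ols_coefficients]

print("lasso:", lasso)  # the two small coefficients become exactly zero
print("ridge:", ridge)  # every coefficient shrinks but stays non-zero
```

In this simplified setting L1 subtracts a fixed amount from each coefficient and clips at zero, while L2 divides each coefficient by a constant, which is why only L1 produces exact zeros.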

### Q5: What is a Neural Network?
- A neural network is a multi-layered model inspired by the human brain. In more practical terms, neural networks are non-linear statistical tools for data modeling or decision making. A network consists of an input layer, one or more hidden layers and an output layer. Each node in a hidden layer applies a function to its inputs and passes the result on, ultimately leading to the output layer. Each input is multiplied by a weight and the results are summed; a positive weight reflects an excitatory connection, while a negative weight reflects an inhibitory connection.

### Q6: What is cross-validation?
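- A technique used to assess how well a model performs on new, independent data. The simplest form of validation is a single hold-out split: you divide your data into training data, used to build the model, and testing data, used to test it. In k-fold cross-validation the idea is repeated k times: the data is split into k folds, each fold serves once as the testing data while the remaining folds are used for training, and the k scores are averaged.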
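
A minimal sketch of how k-fold splitting can be done by hand, using only index lists. The helper name `k_fold_indices` is hypothetical, and the sketch does not shuffle the data, which a real workflow usually would.

```python
def k_fold_indices(n_samples: int, k: int):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every sample lands in the test set exactly once, so each observation is used for both training and evaluation across the k rounds.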

### Q7: What is Recall & Precision?
- Recall attempts to answer “What proportion of actual positives was identified correctly?”
    - Improving recall means reducing False Negatives
    - Recall = True Positives / (True Positives + False Negatives)
- Precision attempts to answer “What proportion of positive identifications was actually correct?”
    - Improving precision means reducing False Positives
    - Precision = True Positives / (True Positives + False Positives)
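
As a small illustration of the two formulas, the sketch below counts TP, FP and FN from made-up labels and predictions (1 = positive, 0 = negative). The function name and the data are assumptions for the example only.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # hypothetical actual classes
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]  # hypothetical predictions

precision, recall = precision_recall(y_true, y_pred)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```

Here 2 of the 3 predicted positives are correct (precision 2/3), while only 2 of the 4 actual positives are found (recall 1/2).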

### Q8: What is False Positive & False Negative?
- False Positive (FP) is an incorrect identification of the presence of a condition when it’s actually absent, also known as Type 1 Error
- False Negative (FN) is an incorrect identification of the absence of a condition when it’s actually present, also known as Type 2 Error

### Q9: What's the difference between supervised learning and unsupervised learning?
- Supervised learning involves learning a function that maps inputs to outputs from labeled examples (e.g. predicting default or not)
    - labeled training data needed
- Unsupervised learning involves finding structure in the data without reference to labeled results (e.g. clustering customer data)
    - no labeled training data needed

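### Q10: What is an Adjusted R-Squared?
- R-Squared measures the proportion of the variation in your dependent variable (y) that is explained by your independent variables (x) in a linear regression model.
- Adjusted R-Squared adjusts R-Squared for the number of independent variables in the model. Plain R-Squared never decreases when a variable is added, whereas Adjusted R-Squared only increases when the new variable improves the fit more than would be expected by chance.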
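
The adjustment uses the standard formula Adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of independent variables. A short sketch with hypothetical numbers:

```python
def adjusted_r_squared(r_squared: float, n_samples: int, n_features: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_samples - 1) / (n_samples - n_features - 1)


# Same R^2 and sample size, different numbers of predictors (made-up values)
few_features = adjusted_r_squared(r_squared=0.80, n_samples=50, n_features=5)
many_features = adjusted_r_squared(r_squared=0.80, n_samples=50, n_features=20)

print(f"5 predictors:  {few_features:.4f}")
print(f"20 predictors: {many_features:.4f}")
```

With the same R-Squared of 0.80, the model that needs 20 predictors gets a noticeably lower adjusted value than the model that needs 5.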

### Q11: What are the advantages of dimension reduction?
- It reduces the computation time and storage space required
- It becomes easier to visualize the data when reduced to very low dimensions such as 2D or 3D
- It helps mitigate the curse of dimensionality
- It can remove multicollinearity, which improves the interpretation of the parameters of the machine learning model

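### Q12: What is Principal Component Analysis (PCA)?
- In practical terms, PCA is a tool for reducing the number of features. It replaces the original variables with a smaller set of uncorrelated principal components, each a linear combination of all the original variables, chosen to retain as much of the variance as possible.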
- It's commonly used for compression purposes, to reduce required memory and to speed up the algorithm, as well as for visualization purposes, making it easier to summarize data.

### Q13: What is the drawback of using Naive Bayes?
- One major drawback of Naive Bayes is its strong assumption that the features are conditionally independent of one another given the class, which is rarely the case in practice. One way to improve an algorithm that uses Naive Bayes is to decorrelate the features so that the assumption holds more closely.

### Q14: What is the drawback of using linear models?
- The major drawback of linear models is that they rest on strong assumptions: multivariate normality (normally distributed residuals), a linear relationship, no autocorrelation, no heteroscedasticity and little or no multicollinearity. These rarely all hold in practice.
- Extreme violations of these assumptions make the results unreliable. Small violations result in a greater bias or variance of the estimates.

### Q15: What is multicollinearity and what to do with it?
- Multicollinearity exists when an independent variable is highly correlated with one or more other independent variables in a multiple regression equation. This is problematic because it inflates the standard errors of the coefficients, which undermines the statistical significance of the affected variables.
- You can use Variance Inflation Factors (VIF) to detect multicollinearity between independent variables. Common rules of thumb are that a VIF above 5 indicates multicollinearity worth investigating, while a VIF above 10 indicates a serious problem. Typical remedies are to drop or combine the correlated variables.

### Q16: Why is MSE a bad measure of model performance? What would you suggest instead?
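- Mean Squared Error (MSE) squares each error, so it gives a relatively high weight to large errors and can be dominated by a few large deviations. It is also expressed in squared units of the target, which makes it harder to interpret.
- Root Mean Squared Error (RMSE) restores the original units, but as the square root of MSE it is still driven by the large errors. A more robust alternative is Mean Absolute Error (MAE), which weights every error in proportion to its size.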
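
To make the sensitivity to large errors concrete, the sketch below compares MSE, RMSE and MAE on two made-up sets of residuals that differ in a single value. MAE (Mean Absolute Error) is included as a point of comparison; the residual values are assumptions for illustration.

```python
from math import sqrt


def mse(errors):
    return sum(e * e for e in errors) / len(errors)


def rmse(errors):
    return sqrt(mse(errors))


def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)


errors_clean = [1, -1, 2, -2, 1]     # hypothetical residuals
errors_outlier = [1, -1, 2, -2, 20]  # same residuals, one large error

for name, metric in (("MSE", mse), ("RMSE", rmse), ("MAE", mae)):
    before, after = metric(errors_clean), metric(errors_outlier)
    print(f"{name}: {before:.2f} -> {after:.2f} (x{after / before:.1f})")
```

A single large residual multiplies MSE far more than it multiplies MAE, with RMSE in between.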

### Q17: How to check if the regression model fits the data well?
- RMSE: Absolute measure of fit.
- R-squared/Adjusted R-squared: Relative measure of fit.
- F-test (F-statistic): Evaluates the null hypothesis that all regression coefficients are equal to zero vs the alternative hypothesis that at least one doesn’t equal zero.

### Q18: What is a decision tree?
- A decision tree is a popular model, used for both regression and classification problems
- Each decision point in the tree is called a node, and the tree starts with a root node. Adding more nodes generally improves the fit on the training data, but a tree that grows too deep tends to overfit.
- The last nodes of the decision tree (where a decision is made) are called the leaves of the tree.

### Q19: What is a random forest? Why is it good?
- Random forests are an ensemble learning technique that builds off of multiple decision trees. 
- Random forests involve creating multiple decision trees using bootstrapped datasets of the original data and randomly selecting a subset of variables at each step of the decision tree. 
- For classification, the model then selects the mode of the predictions of all the decision trees (for regression, it averages them). By relying on a “majority wins” model, it reduces the risk of error from an individual tree, mainly by lowering variance.

### Q20: What is overfitting?
- Overfitting is an error where the model ‘fits’ the data too well, resulting in a model with high variance and low bias. 
- As a consequence, an overfit model will inaccurately predict new data points even though it has a high accuracy on the training data.

### Q21: What is underfitting?
- Underfitting is an error where the model ‘fits’ the data too poorly, resulting in a model with low variance and high bias. 
- As a consequence, an underfit model will have low accuracy both in training & test data

### Q22: What is boosting?
- Boosting is an iterative technique that adjusts the weight of each observation based on the previous classification, so that misclassified observations receive more weight in the next round.
- It's an ensemble method that improves a model mainly by reducing its bias (and often its variance), ultimately combining weak learners into a strong learner.

### Q23: What is Bias?
- IN ML: Bias is the error introduced into a model by oversimplifying assumptions in the machine learning algorithm, which can lead to underfitting
- IN STATISTICS: Bias is the tendency of a sample statistic to systematically over- or underestimate a population parameter

### Q24: What is Variance?
- IN ML: Variance is the error introduced into a model by excessive complexity in the machine learning algorithm, which makes it sensitive to the particular training data and can lead to overfitting
- IN STATISTICS: Variance is a measure of how far a set of numbers is spread out from their average value.

### Q25: What is a Confusion Matrix?
- It's a 2x2 matrix that cross-tabulates the actual classes against the classes predicted by a binary classifier
- Its four cells give the counts of TP, TN, FP & FN

### Q26: What is a ROC-AUC Curve?
- It's a performance measurement for classification problems at various threshold settings
- ROC (Receiver Operating Characteristic) is a curve that plots TPR against FPR as the threshold varies
- AUC (Area Under the Curve) is the degree of separability: the probability that the model ranks a randomly chosen positive above a randomly chosen negative
    - AUC 1 means the model separates the classes perfectly
    - AUC 0.7 means there is a 70% chance that a random positive is ranked above a random negative
    - AUC 0.5 means the model is unable to differentiate between classes (50-50)
    - AUC 0 means the model predicts every 1 as 0 and vice-versa
- TPR (True Positive Rate/Recall/Sensitivity) = TP / (TP + FN)
- FPR (False Positive Rate) = FP / (FP + TN)
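
One way to see what AUC measures is to compute it as the share of (positive, negative) pairs in which the positive example gets the higher score, counting ties as half. The sketch below does this by brute force on made-up scores; the function name and data are assumptions, and real libraries use faster methods.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


y_true = [1, 1, 1, 0, 0, 0]              # hypothetical labels
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]  # hypothetical model scores

auc = roc_auc(y_true, scores)
print(f"AUC = {auc:.3f}")
```

Of the 9 positive/negative pairs, 8 are ranked correctly (only the positive scored 0.4 falls below the negative scored 0.7), giving an AUC of 8/9.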

# STATISTICS & PROBABILITY

### Q1: What are the fundamentals of probability?
- Rule #1: For any event A, 0 ≤ P(A) ≤ 1; in other words, the probability of an event can range from 0 to 1
- Rule #2: The sum of the probabilities of all possible outcomes always equals 1.
- Rule #3: P(not A) = 1 - P(A); This rule explains the relationship between the probability of an event and its complement event. A complement event is one that includes all possible outcomes that aren’t in A.
- Rule #4: If A and B are disjoint events (mutually exclusive), then P(A or B) = P(A) + P(B); this is called the addition rule for disjoint events
- Rule #5: P(A or B) = P(A) + P(B) - P(A and B); this is called the general addition rule.
- Rule #6: If A and B are two independent events, then P(A and B) = P(A) * P(B); this is called the multiplication rule for independent events
- Rule #7: The conditional probability of event B given event A is P(B|A) = P(A and B) / P(A)
- Rule #8: For any two events A and B, P(A and B) = P(A) * P(B|A); this is called the general multiplication rule

### Q2: Fundamental Counting Principle (multiplication)
This method is used when each choice is made independently, so the number of ways to fill an open place is not affected by previous fills.
- e.g. There are 3 types of breakfasts, 4 types of lunches, and 5 types of desserts.
    - The total number of combinations is 3 x 4 x 5 = 60

### Q3: Combinations Formula: C(n,r)=(n!)/[(n−r)!r!]
This is used when replacements are not allowed and the order in which items are selected does not matter.
- e.g. To win the lottery, you must select the 5 correct numbers in any order from 1 to 52.
    - What is the number of possible combinations?
    - C(52,5) = 52! / [(52 - 5)! 5!] = 2,598,960
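
The lottery count can be checked with Python's standard library (`math.comb` requires Python 3.8 or later). The variable names are illustrative.

```python
from math import comb, factorial

n, r = 52, 5

# Built-in binomial coefficient
lottery_combinations = comb(n, r)

# Same value from the formula n! / [(n - r)! r!]
from_formula = factorial(n) // (factorial(n - r) * factorial(r))

print(lottery_combinations, from_formula)
```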


### Q4: Permutations: P(n,r)= n! / (n−r)!
This method is used when replacements are not allowed and the order of the items matters.
- e.g. A code has 4 digits in a particular order and the digits range from 0 to 9.
    - How many permutations are there if each digit can only be used once?
    - P(10,4) = 10! / (10 - 4)! = 10 x 9 x 8 x 7 = 5040
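
The same count can be verified either with `math.perm` (Python 3.8 or later) or by enumerating every ordered selection of 4 distinct digits. The variable names are illustrative.

```python
from itertools import permutations
from math import perm

digits = range(10)  # the digits 0 to 9

# Closed-form count: 10! / (10 - 4)!
code_count = perm(10, 4)

# Brute-force count of all ordered 4-digit codes without repeated digits
enumerated = sum(1 for _ in permutations(digits, 4))

print(code_count, enumerated)
```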

### Q5: Difference between convex and non-convex cost function; what does it mean when a cost function is non-convex?
- A convex function is one where the line segment drawn between any two points on the graph lies on or above the graph. Any local minimum of a convex function is also a global minimum.
- A non-convex function is one where the line segment between some pairs of points on the graph dips below the graph. It is often characterized as “wavy” and can have several local minima.
- When a cost function is non-convex, an optimizer such as gradient descent may converge to a local minimum instead of the global minimum, which is typically undesired in machine learning models from an optimization perspective.

### Q6: Given two fair dice, what is the probability of getting scores that sum to 4? To 8? What about a sum of 4 or 8?
- The sample space has 36 outcomes (6 x 6)
- There are 3 ways of rolling a 4 (1+3, 3+1, 2+2):
    - P(rolling a 4) = 3/36 = 1/12
- There are 5 ways of rolling an 8 (2+6, 6+2, 3+5, 5+3, 4+4):
    - P(rolling an 8) = 5/36
- The two events are mutually exclusive, so the probability of sum=4 or sum=8 is the sum of the two probabilities:
    - P(rolling a 4 or 8) = 3/36 + 5/36 = 8/36 = 2/9
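
These probabilities can be confirmed by enumerating all 36 outcomes. The sketch below uses exact fractions; the helper name `probability` is an assumption for the example.

```python
from fractions import Fraction
from itertools import product

# All ordered outcomes of rolling two six-sided dice
rolls = list(product(range(1, 7), repeat=2))


def probability(event):
    """Exact probability of an event over the equally likely outcomes."""
    favourable = sum(1 for roll in rolls if event(roll))
    return Fraction(favourable, len(rolls))


p_sum_4 = probability(lambda roll: sum(roll) == 4)
p_sum_8 = probability(lambda roll: sum(roll) == 8)
p_sum_4_or_8 = probability(lambda roll: sum(roll) in (4, 8))

print(p_sum_4, p_sum_8, p_sum_4_or_8)
```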

### Q7: You are about to get on a plane to London, you want to know whether you have to bring an umbrella or not. You call three of your random friends and ask each one of them if it’s raining. The probability that your friend is telling the truth is 2/3 and the probability that they are playing a prank on you by lying is 1/3. If all 3 of them tell that it is raining, then what is the probability that it is actually raining in London.
- This question calls for Bayes' theorem because the last statement follows the structure, “What is the probability A is true given B is true?” To answer it we need the prior probability of it raining in London on a given day. Let’s assume it’s 25%. We also assume the three friends answer independently of one another.
- P(A) = probability of it raining = 25%
- P(B) = probability that all 3 friends say that it’s raining
- P(A|B) = probability that it’s raining given that all 3 friends say it is raining
- P(B|A) = probability that all 3 friends say that it’s raining given it’s raining = (2/3)³ = 8/27
- P(B|not A) = probability that all 3 friends say that it’s raining given it’s not raining = (1/3)³ = 1/27

- Step 1: Solve for P(B)
    - Bayes' theorem states P(A|B) = P(B|A) * P(A) / P(B), where by the law of total probability
    - P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
    - P(B) = (2/3)³ * 0.25 + (1/3)³ * 0.75 = 0.25*8/27 + 0.75*1/27
- Step 2: Solve for P(A|B)
    - P(A|B) = 0.25 * (8/27) / ( 0.25*8/27 + 0.75*1/27)
    - P(A|B) = 8 / (8 + 3) = 8/11
- Therefore, if all three friends say that it’s raining, then there’s an 8/11 chance that it’s actually raining.
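
The calculation can be reproduced with exact fractions. The 25% prior is the same assumption made above, and the variable names are illustrative.

```python
from fractions import Fraction

p_rain = Fraction(1, 4)   # assumed prior probability of rain
p_truth = Fraction(2, 3)  # probability that a friend tells the truth
p_lie = 1 - p_truth
n_friends = 3

# Likelihood of all three friends saying "rain" in each scenario
p_all_say_rain_given_rain = p_truth ** n_friends
p_all_say_rain_given_dry = p_lie ** n_friends

# Law of total probability, then Bayes' theorem
p_all_say_rain = (p_all_say_rain_given_rain * p_rain
                  + p_all_say_rain_given_dry * (1 - p_rain))
p_rain_given_all_say_rain = p_all_say_rain_given_rain * p_rain / p_all_say_rain

print(p_rain_given_all_say_rain)
```

Changing `p_rain` shows how strongly the answer depends on the assumed prior.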

### Q8: You are given 40 cards with four different colors- 10 Green cards, 10 Red Cards, 10 Blue cards, and 10 Yellow cards. The cards of each color are numbered from one to ten. Two cards are picked at random. Find out the probability that the cards picked are not of the same number and same color.
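- Read the question as asking for the probability that the second card differs from the first in both number and color. Since these events are not independent, we can use the general multiplication rule:
- P(A and B) = P(A) * P(B|A), which applied to the complements gives
- P(not A and not B) = P(not A) * P(not B | not A)
- For example, if the first card drawn is the yellow 4, then 39 cards remain:
    - P(not 4 and not yellow) = P(not 4) * P(not yellow | not 4)
    - P(not 4 and not yellow) = (36/39) * (27/36)
    - P(not 4 and not yellow) = 27/39 ≈ 0.692
- Therefore, the probability that the two cards picked share neither number nor color is 69.2%.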
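
A brute-force check, under the reading that the two cards must differ in both number and color: enumerate every ordered pair of distinct cards and count the pairs that qualify. The color names and variable names are only labels for the example.

```python
from fractions import Fraction
from itertools import permutations, product

colors = ["green", "red", "blue", "yellow"]
deck = list(product(colors, range(1, 11)))  # 40 (color, number) cards

# All ordered draws of two different cards
draws = list(permutations(deck, 2))

# Draws where the second card shares neither color nor number with the first
different = [
    (first, second) for first, second in draws
    if first[0] != second[0] and first[1] != second[1]
]

p_different = Fraction(len(different), len(draws))
print(p_different, float(p_different))
```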

### Q9: What is the 80/20 rule?
- Also known as the Pareto principle; it states that 80% of the effects come from 20% of the causes, e.g. 80% of sales come from 20% of customers.

### Q10: What is the Law of Large Numbers?
- The Law of Large Numbers is a theorem stating that as the number of trials increases, the average of the results gets closer to the expected value.
- e.g. the proportion of heads from flipping a fair coin 100,000 times should be closer to 0.5 than the proportion from 100 flips.
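
A small simulation sketch of the coin example. The seed and the flip counts are arbitrary choices; any single run with few flips can land close to 0.5 by luck, but the typical deviation shrinks as the number of flips grows.

```python
import random


def proportion_of_heads(n_flips: int, rng: random.Random) -> float:
    """Simulate n_flips fair coin flips and return the share of heads."""
    heads = sum(1 for _ in range(n_flips) if rng.random() < 0.5)
    return heads / n_flips


rng = random.Random(42)  # fixed seed so the run is repeatable
results = {n: proportion_of_heads(n, rng) for n in (100, 10_000, 100_000)}

for n, proportion in results.items():
    print(f"{n:>7} flips: proportion of heads = {proportion:.4f}")
```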

### Q11: Explain what a long-tailed distribution is and provide three examples of relevant phenomena that have long tails. Why are they important in classification and regression problems?
- A long-tailed distribution is a type of heavy-tailed distribution whose tail (or tails) drops off gradually and asymptotically.
- 3 practical examples include the power law, the Pareto principle (more commonly known as the 80–20 rule), and product sales (i.e. best selling products vs others).
- It’s important to be mindful of long-tailed distributions in classification and regression problems because the least frequently occurring values can together make up the majority of the population. This can change the way that you deal with outliers, and it also conflicts with machine learning techniques that assume the data is normally distributed.

### Q12: What is an outlier?
- It's an extreme value that differs significantly from other observations
- A common rule flags a value as an outlier if it's below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR
- When outliers are present, the median is a better measure of central location than the mean
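
A minimal sketch of the 1.5 x IQR rule using the standard library. Quartile conventions differ between tools; this example assumes the `inclusive` method of `statistics.quantiles`, and the data and function name are made up.

```python
from statistics import mean, median, quantiles


def iqr_outliers(values):
    """Return (outliers, (lower_fence, upper_fence)) using the 1.5 x IQR rule."""
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper], (lower, upper)


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # hypothetical sample
outliers, (lower, upper) = iqr_outliers(data)

print("fences:", lower, upper)
print("outliers:", outliers)
print("mean:", mean(data), "median:", median(data))
```

The single extreme value pulls the mean far above the bulk of the data, while the median stays in the middle of it.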

### Q13: What is statistical power?

- It refers to the power of a binary hypothesis test.
- It's the probability that the test rejects the null hypothesis given that the alternative hypothesis is true, i.e. 1 - P(Type 2 Error)

### Q14: What is A/B testing?
- A/B testing is a form of hypothesis testing, usually two-sample hypothesis testing to compare two versions, the control and variant, of a single variable. It is commonly used to improve and optimize user experience and marketing.

### Q15: How can you tell if a given coin is biased?
The answer is simply to perform a hypothesis test:
- The null hypothesis is that the coin is not biased and the probability of flipping heads should equal 50% (H0: p=0.5) 
- The alternative hypothesis is that the coin is biased (H1: p != 0.5)
- Flip the coin 500 times.
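- Calculate the Z-score of the observed proportion of heads (with only a small number of flips, use an exact binomial test instead, because the normal approximation becomes unreliable).
- Compute the two-tailed p-value and compare it against alpha = 0.05 (equivalently, compare |Z| against the critical value 1.96, which leaves 0.05/2 = 0.025 in each tail).
- If p-value > alpha, the null is not rejected: there is no evidence that the coin is biased.
- If p-value < alpha, the null is rejected and we conclude that the coin is biased.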
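
The test can be sketched with the standard library as a two-sided one-sample z-test for a proportion. The head counts below are hypothetical, and the function name is an assumption for the example.

```python
from math import sqrt
from statistics import NormalDist


def coin_bias_test(heads: int, flips: int, p0: float = 0.5, alpha: float = 0.05):
    """Two-sided z-test of H0: p = p0. Returns (z, p_value, reject_null)."""
    p_hat = heads / flips
    standard_error = sqrt(p0 * (1 - p0) / flips)  # standard error under H0
    z = (p_hat - p0) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed p-value
    return z, p_value, p_value < alpha


# Hypothetical experiment: 280 heads in 500 flips
z, p_value, reject = coin_bias_test(heads=280, flips=500)
print(f"z = {z:.3f}, p-value = {p_value:.4f}, reject H0: {reject}")
```

With 280 heads the z-score is about 2.68, which is beyond 1.96, so the null hypothesis of a fair coin is rejected at the 5% level; a result such as 255 heads would not be.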

### Q16: How do you prove that males are on average taller than females knowing only gender and height?
- You can use hypothesis testing to show that males are taller on average than females.
- The null hypothesis would state that males and females are the same height on average, while the alternative hypothesis would state that the average height of males is greater than the average height of females.
- Then you would collect a random sample of heights of males and females and use a one-sided two-sample t-test to determine if you reject the null or not.

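### Q17: You flip a biased coin (p(head)=0.8) five times. What’s the probability of getting three or more heads?

```python
import scipy.stats as st

# P(X >= 3) = 1 - P(X <= 2), so use the CDF at k=2 rather than the PMF
1 - st.binom.cdf(k=2, n=5, p=0.8)
```

Output (rounded to 5 decimal places):

```
0.94208
```

- there's about a 94.2% probability of getting three or more heads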

### Q18: A random variable X is normal with mean 1020 and a standard deviation 50. Calculate P(X>1200)?

```python
import scipy.stats as st

# P(X > 1200) = 1 - P(X <= 1200)
1 - st.norm.cdf(1200, loc=1020, scale=50)
```

Output:

```
0.00015910859015755285
```

- there's a 0.0159% probability that X is more than 1200

### Q19: You are running for office and your pollster polled a hundred people. Sixty of them claimed they will vote for you. Can you relax?

```python
# 95% confidence interval for the proportion
import statsmodels.api as sm
sm.stats.proportion_confint(nobs=100, count=60, alpha=0.05)
```

Output:

```
(0.5039817664728937, 0.6960182335271062)
```

- the upper bound is comfortable, but the lower bound is only about 50.4%, barely above 50%, so it is not enough for me to relax

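### Q20: Suppose the number of people that show up at a bus station follows a Poisson distribution with mean 2.5/h. What is the probability that at most three people show up in a four hour period?

```python
import scipy.stats as st

# a mean of 2.5 per hour over 4 hours gives a rate of 10
st.poisson.cdf(3, 10)
```

Output:

```
0.010336050675925726
```

- the probability that at most 3 people show up is 1.03%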

```python
import numpy as np
import scipy.stats as st

# rate of 10 in 4 hrs
rate = 10

# numbers of arrivals to evaluate: 0, 1, 2, 3, 4
n = np.arange(5)

mypoisson = st.poisson.pmf(n, rate)
print('### Probabilities:')
for i, j in zip(n, mypoisson):
    print("for", i, 'people coming in four hours, the probability is: ', j)
```

Output:

```
### Probabilities:
for 0 people coming in four hours, the probability is:  4.5399929762484854e-05
for 1 people coming in four hours, the probability is:  0.0004539992976248486
for 2 people coming in four hours, the probability is:  0.0022699964881242435
for 3 people coming in four hours, the probability is:  0.007566654960414144
for 4 people coming in four hours, the probability is:  0.01891663740103538
```


### Q21: Geiger counter records 100 radioactive decays in 5 minutes. Find an approximate 95% interval for the number of decays per hour.
- Since this is a Poisson distribution question, mean = lambda = variance, which also means that standard deviation = square root of the mean
- a 95% confidence interval implies a z score of 1.96
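- for the 5-minute count, mean = 100, so one standard deviation = sqrt(100) = 10
- Therefore the 95% interval for a 5-minute period is 100 +/- 19.6 = [80.4, 119.6]
- An hour contains 12 such periods, so the interval for the number of decays per hour is [80.4 x 12, 119.6 x 12] = [964.8, 1435.2]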
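
The arithmetic can be written out as a short sketch: build the interval for the 5-minute count with the normal approximation to the Poisson distribution, then scale it to one hour. The variable names are illustrative.

```python
from math import sqrt

count = 100     # decays observed
minutes = 5     # length of the observation window
z = 1.96        # z-score for a 95% interval

# For a Poisson count, variance = mean, so sd = sqrt(count)
sd = sqrt(count)
low_5min, high_5min = count - z * sd, count + z * sd

# Scale the 5-minute interval to one hour (12 windows of 5 minutes)
scale = 60 / minutes
low_hour, high_hour = low_5min * scale, high_5min * scale

print(f"5 minutes: [{low_5min:.1f}, {high_5min:.1f}]")
print(f"1 hour:    [{low_hour:.1f}, {high_hour:.1f}]")
```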

### Q22: The homicide rate in Scotland fell last year to 99 from 115 the year before. Is this reported change really noteworthy?
- Since this is a Poisson distribution question, mean = lambda = variance, which also means that standard deviation = square root of the mean
- a 95% confidence interval implies a z score of 1.96
- one standard deviation = sqrt(115) = 10.724
- Therefore the 95% interval = 115 +/- 1.96 x 10.724 = 115 +/- 21.02 = [93.98, 136.02]. Since 99 is within this interval, the change is consistent with random variation and is not very noteworthy.