## How do you address/handle/prevent overfitting?
### First, you need to detect overfitting. Split your initial dataset into train and test sets, train the model on the train set, and measure its performance on the test set. If the model does much better on the train set than on the test set, it is likely overfitting.
### You can try removing uninformative features. Some algorithms assign an importance score to each feature, so you can drop the least important ones, validating the performance after each change.
### When you train a learning algorithm iteratively, you can measure how well the model performs after each iteration and stop training when the validation error stops decreasing. This is called early stopping.
### Tune the complexity of your model. For example, regularization is the process of adding a restriction on the weights of your model. For a tree model, you can cap the maximum depth or require a minimum number of samples in each leaf.
### Ensembling: bagging (train strong models independently and combine their predictions) and boosting (combine weak learners sequentially into a single strong learner).
### Try to gather more data or use data augmentation.
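
The early stopping rule described above can be sketched in a few lines. This is a minimal illustration, not a library API: the function name `early_stopping_epoch`, the `patience` parameter, and the list of validation errors are all assumptions made for the example.

```python
def early_stopping_epoch(val_errors, patience=2):
    """Return the index of the epoch with the lowest validation error seen
    before training is stopped. Training stops once `patience` consecutive
    epochs fail to improve on the best error so far."""
    best_epoch, best_error = 0, float("inf")
    epochs_without_improvement = 0
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation error has stopped decreasing
    return best_epoch


# Hypothetical per-epoch validation errors, for illustration only.
val_errors = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.40]
best = early_stopping_epoch(val_errors, patience=2)
print(best)
```

Note that the last value (0.40) is never reached: with `patience=2` the loop stops after two epochs without improvement. A larger `patience` trades extra training time for a lower risk of stopping too early.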

## Bias-Variance Tradeoff
### Bias is the difference between your model's expected predictions and the true values.
### Variance refers to your algorithm's sensitivity to specific sets of training data.
### The bias-variance tradeoff refers to the fact that low-variance (high-bias) algorithms tend to be less complex, with a simple or rigid underlying structure, while low-bias (high-variance) algorithms tend to be more complex, with a flexible underlying structure.
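
For squared-error loss the tradeoff can be written as a decomposition. The notation is an assumption made for this sketch: the data follow $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, $\hat{f}$ is the fitted model, and the expectation is taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```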

## Biases in data
### Sample bias occurs when the sample doesn’t reflect the population. One remedy is to resample the data.
### Selection bias occurs when data is selected from a subjective perspective rather than objectively, or when non-random data has been selected. Subjectively selected data introduces a population that does not represent the actual population. The results from such data become skewed.
### Confirmation bias is the opposite of a hypothesis-based analysis. It occurs, for example, when a survey sets out to test a fully formed opinion rather than explore a hypothesis. It looks for data to support the opinion rather than forming a theory and planning an experiment to test whether the hypothesis is supported by the data.

## SVM
### The motivation is simple. Suppose we are given data points, each belonging to one of two classes, and we want to separate the points of different classes with a hyperplane. This is called a linear classifier. There are many hyperplanes that might classify the data. One reasonable choice is the hyperplane for which the distance to the nearest data point of each class is maximized. It turns out that this hyperplane depends only on the training examples that lie on the boundary of the margin between the classes and on the examples that are misclassified. These examples are called support vectors.
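
For linearly separable data, maximizing the margin is usually stated as the following optimization problem. The notation is an assumption for this sketch: labels $y_i \in \{-1, +1\}$, weight vector $w$, and offset $b$. The margin width is then $2 / \lVert w \rVert$.

```latex
\min_{w,\, b} \; \frac{1}{2} \lVert w \rVert^2
\quad \text{subject to} \quad
y_i \,(w^\top x_i + b) \ge 1, \qquad i = 1, \dots, n
```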

## Naive Bayes
### Naive Bayes is a classification technique based on Bayes’ theorem with an assumption of independence among predictors. That is, given the class, the presence of a particular feature is assumed to be unrelated to the presence of any other feature. Thus the probability of a class given a feature vector is the prior probability of the class multiplied by the probability of the features given the class, divided by the evidence.
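
Written out, with $c$ the class and $x_1, \dots, x_n$ the features (symbols chosen for this sketch), the independence assumption turns the likelihood into a product of per-feature terms:

```latex
P(c \mid x_1, \dots, x_n)
  = \frac{P(c) \, \prod_{i=1}^{n} P(x_i \mid c)}{P(x_1, \dots, x_n)}
```

Because the denominator is the same for every class, the predicted class is the one that maximizes the numerator.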

## Decision tree
### As the name suggests, a decision tree uses a tree-like structure. It is defined recursively: at each node there is a split on one of the features (for example, age is greater than 40), with true going to the left and false to the right. This process continues until a stopping criterion is met, at which point the node becomes a leaf whose value is the most frequent class or the average target value, depending on the task. The split is chosen to maximize information gain, which is the entropy of the parent node minus the weighted average entropy of the child nodes, although other criteria can be used. Splitting stops when the entropy of a node is 0, but other stopping criteria are usually applied as well: for example, we can cap the tree height at some maximum value or require the information gain to exceed some threshold.
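
The information gain of a single binary split can be computed as follows. This is a small sketch with hypothetical helper names (`entropy`, `information_gain`) and toy labels. It evaluates one candidate split and does not build a tree.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent node minus the size-weighted average
    entropy of the two child nodes."""
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
    return entropy(parent) - weighted


# Toy labels, for illustration only.
parent = ["yes", "yes", "no", "no"]
perfect_split = information_gain(parent, ["yes", "yes"], ["no", "no"])
useless_split = information_gain(parent, ["yes", "no"], ["yes", "no"])
print(perfect_split, useless_split)
```

A split that separates the classes completely recovers all of the parent's entropy, while a split that leaves both children as mixed as the parent gains nothing.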

## GB Trees
### Gradient boosting is a machine learning technique for regression and classification problems that produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion by optimizing an arbitrary differentiable loss function.
### Using a training set of known values of x and corresponding values of y, the goal is to find an approximation function that minimizes the expected value of a specified loss function. The gradient boosting method seeks an approximation in the form of a weighted sum of functions called base learners. It starts with a model consisting of a constant function and expands it incrementally in a greedy fashion. The idea is to apply steepest descent to the loss function with respect to the approximation from the previous step. The negative gradient gives a pseudo-residual for each training example; we fit a base learner to the pseudo-residuals, choose its weight, and update the model.
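
The loop above can be sketched for the simplest case. The assumptions here are mine, not the original's: squared-error loss (so the pseudo-residuals are just `y - prediction`), one-dimensional inputs, depth-1 trees (stumps) as base learners, and a fixed `learning_rate` in place of a per-step weight search. All function names and data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree on 1-D inputs by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # candidate split points
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all inputs identical: fall back to a constant
        constant = sum(residuals) / len(residuals)
        return lambda x: constant
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Return a prediction function built by boosting stumps."""
    initial = sum(ys) / len(ys)  # constant model minimizing squared loss
    learners = []
    predictions = [initial] * len(ys)
    for _ in range(n_rounds):
        # Negative gradient of 1/2 * (y - F)^2 with respect to F is y - F.
        pseudo_residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, pseudo_residuals)
        learners.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for x, p in zip(xs, predictions)]

    def predict(x):
        return initial + learning_rate * sum(s(x) for s in learners)

    return predict


# Toy data, for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = gradient_boost(xs, ys)

mean_y = sum(ys) / len(ys)
baseline_mse = sum((y - mean_y) ** 2 for y in ys) / len(ys)
boosted_mse = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(baseline_mse, boosted_mse)
```

With a different differentiable loss, only the pseudo-residual line changes. A fuller implementation would also choose the weight of each base learner, for example by line search.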

## KNN
### The k-nearest neighbors algorithm is a non-parametric method used for classification and regression. In both cases, the prediction is based on the k closest training examples in the feature space. In classification, an object is assigned to the most common class among its k nearest neighbors. In regression, the prediction is the average of the values of the k nearest neighbors.
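
Both variants fit in a short sketch. It assumes Euclidean distance and a brute-force scan of the training set. The function names and toy data are hypothetical.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def k_nearest(train, query, k):
    """train is a list of (point, target) pairs; return the k closest pairs."""
    return sorted(train, key=lambda item: dist(item[0], query))[:k]


def knn_classify(train, query, k=3):
    """Majority vote among the k nearest neighbors."""
    labels = [label for _, label in k_nearest(train, query, k)]
    return Counter(labels).most_common(1)[0][0]


def knn_regress(train, query, k=3):
    """Average target value of the k nearest neighbors."""
    values = [value for _, value in k_nearest(train, query, k)]
    return sum(values) / len(values)


# Toy data, for illustration only.
train_cls = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
             ((8, 8), "b"), ((9, 8), "b")]
train_reg = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]

predicted_class = knn_classify(train_cls, (1.5, 1.5), k=3)
predicted_value = knn_regress(train_reg, (2.0,), k=3)
print(predicted_class, predicted_value)
```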

## K-means
### K-means is a clustering algorithm that aims to partition observations into k clusters such that the sum of squared Euclidean distances between each point and its cluster centroid is minimal. First, initialize a set of k centroids randomly. Then repeat two steps until the assignments stop changing: assign each observation to the cluster whose centroid has the least squared Euclidean distance, then recompute each centroid as the mean of the observations assigned to it.
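
The two alternating steps can be sketched as below. The assumptions are mine: points are tuples, the initial centroids are sampled from the data, an empty cluster keeps its previous centroid, and the `seed` and `n_iters` parameters are hypothetical.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def k_means(points, k, n_iters=100, seed=0):
    """Return (centroids, clusters) after running Lloyd's iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, clusters


# Two well-separated toy blobs, for illustration only.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = k_means(points, k=2)
print(sorted(centroids))
```

The result depends on the initialization, so in practice the algorithm is often run several times with different seeds and the run with the lowest total squared distance is kept.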

## Assumptions of linear regression
* ### First, linear regression needs the relationship between the independent and dependent variables to be linear. 
* ### Secondly, linear regression analysis assumes that the residuals (errors) are normally distributed; the variables themselves do not need to be.
* ### Thirdly, linear regression assumes that there is little or no multicollinearity in the data.  
* ### Fourthly, linear regression analysis requires that there is little or no autocorrelation in the data.  Autocorrelation occurs when the residuals are not independent from each other. 
* ### The last assumption of linear regression analysis is homoscedasticity: the residuals have constant variance across all values of the independent variables.
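
These assumptions are often summarized compactly in matrix form. The notation is an assumption for this sketch: $X$ is the design matrix, $\beta$ the coefficient vector, and $\varepsilon$ the error vector.

```latex
y = X\beta + \varepsilon, \qquad
\varepsilon \mid X \sim \mathcal{N}(0, \sigma^2 I), \qquad
\operatorname{rank}(X) = \text{number of columns of } X
```

Here the first term expresses linearity, the normal distribution of $\varepsilon$ expresses normality of the errors, the covariance $\sigma^2 I$ expresses both homoscedasticity (constant diagonal) and no autocorrelation (zero off-diagonal entries), and the rank condition rules out perfect multicollinearity.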

## Common errors in data
* ### Wrong date format
* ### Multiple representations
* ### Duplicate records
* ### Redundant data
* ### Mixed numerical scales
* ### Spelling errors

## Handle missing data
### Before dealing with missing values, we have to understand why the data is missing. Two possible reasons are that the missingness depends on the unobserved value itself (e.g. people with high salaries generally do not want to reveal their incomes in surveys) or that it depends on some other variable’s value (e.g. if women are less likely to reveal their age, missingness in the age variable is driven by the gender variable).
### Time-Series Specific Methods: last observation carried forward & next observation carried backward, linear interpolation, seasonal adjustment + linear interpolation
### Impute the mean, median, or mode.
### For categorical variables: impute the mode, or treat missing values as a separate category.
### Use KNN to impute from the most similar records.
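
A few of the simpler strategies above can be sketched with the standard library. The example assumes missing values are represented as `None`. The helper names and toy data are hypothetical.

```python
from statistics import mean, median, mode


def impute(values, strategy=mean):
    """Replace None with a statistic computed from the observed values."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]


def carry_forward(values):
    """Last observation carried forward (leading gaps stay None)."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


# Toy data, for illustration only.
ages = [20, None, 40, 30, None]
colors = ["red", None, "blue", "red"]
series = [1.0, None, None, 4.0]

ages_mean = impute(ages, mean)
ages_median = impute(ages, median)
colors_mode = impute(colors, mode)
colors_category = ["missing" if c is None else c for c in colors]
series_locf = carry_forward(series)
print(ages_mean, colors_mode, colors_category, series_locf)
```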

## ROC
### A receiver operating characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied. It is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate is the ratio of true positives to all actual positives; it is also known as sensitivity or recall. The false positive rate, also known as the probability of false alarm, is the ratio of false positives to all actual negatives and can be calculated as (1 − specificity). For scores in [0, 1], when the threshold is below 0, recall = 1 and specificity = 0 (the classifier predicts only the positive class). When the threshold is above 1, recall = 0 and specificity = 1.
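
The curve can be traced by sweeping the threshold over the observed scores. This sketch assumes labels are 0/1, that an example is predicted positive when its score is at least the threshold, and that both classes are present. The function names and data are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, from the strictest threshold to the loosest."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Start above every score (nothing predicted positive), then lower
    # the threshold one distinct score at a time.
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Toy labels and scores, for illustration only.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
auc = area_under_curve(points)
print(points[0], points[-1], auc)
```

The first point is (0, 0), where nothing is predicted positive, and the last is (1, 1), where everything is, matching the two threshold extremes described above.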

## Bag-of-words & N-gram
### The bag-of-words model is a simplifying representation used in natural language processing. In this model, a text is represented as the bag of its words, disregarding grammar and even word order but keeping multiplicity. 
### An n-gram is a contiguous sequence of n items from a given sample of text or speech.
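
Both representations are easy to build by hand. The sketch assumes whitespace tokenization and lowercasing, which is a simplification. The function names are hypothetical.

```python
from collections import Counter


def bag_of_words(text):
    """Word counts, ignoring order but keeping multiplicity."""
    return Counter(text.lower().split())


def ngrams(tokens, n):
    """All contiguous runs of n tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


text = "the cat sat on the mat"
bow = bag_of_words(text)
bigrams = ngrams(text.split(), 2)
print(bow["the"], bigrams[0], len(bigrams))
```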

## A/B testing
### In statistical hypothesis testing a result has statistical significance when it is very unlikely to have occurred given the null hypothesis. Significance level, α, is the probability of rejecting the null hypothesis, given that it were true; and the p-value of a result, p, is the probability of obtaining a result at least as extreme, given that the null hypothesis were true.
### Randomization is at the core of experimentation because it balances out confounding variables. By randomly assigning 50% of users to a control group and 50% to a treatment group, you ensure that every possible confounding variable is distributed roughly equally between the groups.
### Multiple Comparisons. The more metrics you are measuring, the more likely you are to get at least one false positive.
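
The size of the multiple comparisons effect can be estimated directly. The sketch assumes the metrics are tested independently and that every null hypothesis is true. The function name is hypothetical.

```python
def family_wise_error_rate(alpha, n_metrics):
    """Probability of at least one false positive across n independent
    tests, each run at significance level alpha."""
    return 1 - (1 - alpha) ** n_metrics


one_metric = family_wise_error_rate(0.05, 1)
twenty_metrics = family_wise_error_rate(0.05, 20)
print(round(one_metric, 4), round(twenty_metrics, 4))
```

With 20 metrics at α = 0.05, the chance of at least one spurious "significant" result is roughly 64%, compared with 5% for a single metric.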