## Algorithm Understanding
Feature selection methods are intended to reduce the number of input variables to those that are believed to be most useful to a model in order to predict the target variable. What algorithms can be used to automatically select the most important features (regression, etc.)? Describe at least 3.

<span style="color:purple">**Variance Threshold:**</span>

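Features with very low variance take nearly the same value for every observation, so they carry little information a model can use to separate outcomes. We would ideally like features with enough variance that the training examples cover the full distribution of the data we expect at test time and, more importantly, in production. Here we set a variance threshold and drop any feature whose variance falls below it. Because variance depends on a feature's units, the threshold only makes sense when features are on comparable scales.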
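A minimal sketch of the idea, using only the standard library and hypothetical toy data (the column names and the `0.2` threshold are assumptions for illustration). In practice, scikit-learn's `VarianceThreshold` is the usual tool for this.

```python
from statistics import pvariance


def select_by_variance(columns, threshold=0.0):
    """Return the names of columns whose population variance exceeds threshold."""
    return [name for name, values in columns.items()
            if pvariance(values) > threshold]


# Hypothetical toy data: "constant" never changes and "almost_constant"
# changes once, so neither tells the model much.
features = {
    "age": [23, 35, 41, 58, 62],
    "constant": [1, 1, 1, 1, 1],
    "almost_constant": [0, 0, 0, 0, 1],
}

kept = select_by_variance(features, threshold=0.2)
print(kept)
```

Only `age` survives here: `constant` has zero variance and `almost_constant` has a variance of 0.16, which is below the chosen threshold.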
    
<span style="color:purple">**Covariance and Correlation:**</span>

Correlation measures the strength and direction of the linear association between two variables on a scale from -1 to 1. Covariance captures the same direction of co-movement but depends on the variables' units; correlation is covariance rescaled by the two standard deviations. Values near 1 mean the variables tend to move in the same direction, values near -1 mean they tend to move in opposite directions, and values near 0 mean there is little or no linear relationship. For feature selection, we want features that correlate strongly (i.e. closer to 1 or -1) with our target variable. We also need to make sure the features we select are not excessively correlated with each other, as this could lead to other issues (e.g. multicollinearity in the linear regression context).
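The two checks described above (relevance to the target, then redundancy between features) could be sketched as follows. The data, the feature names, and the cut-offs `0.5` and `0.95` are all assumptions chosen for illustration, not recommended values.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data.
target = [10, 20, 30, 40, 50]
features = {
    "size": [1, 2, 3, 4, 5],
    "size_copy": [2, 4, 6, 8, 10],  # redundant: a rescaled copy of "size"
    "age": [5, 3, 4, 1, 2],
    "noise": [3, 1, 2, 1, 3],       # no linear relationship with the target
}

# Step 1: keep features that correlate strongly with the target.
relevant = [name for name, values in features.items()
            if abs(pearson(values, target)) >= 0.5]

# Step 2: greedily drop features that nearly duplicate one already kept.
selected = []
for name in relevant:
    if all(abs(pearson(features[name], features[kept])) < 0.95
           for kept in selected):
        selected.append(name)

print(selected)
```

With this data, `noise` fails the relevance check and `size_copy` is dropped as redundant with `size`, leaving `size` and `age`.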

<span style="color:purple">**Univariate Methods:**</span>

In univariate methods we apply a statistical test to each feature individually to decide whether it should remain in the training data. For example, in a classification problem we might first run a statistical test (e.g. a chi-squared test or an ANOVA F-test) and use the resulting p-values to determine the most relevant features. We could then check that choice by training a model such as a support vector machine (SVM) and comparing its score before and after removing a feature.
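The statistical-test step above can be sketched with the standard library alone. This is an illustration under assumptions: it scores each feature with Welch's t-statistic for a binary target and ranks by its absolute value rather than converting to a p-value, and the data and feature names are hypothetical. In practice, scikit-learn's `SelectKBest` with a scoring function such as `f_classif` covers this use case.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t-statistic for the difference in means of two samples.

    Assumes each group has at least two values and is not constant.
    """
    se = sqrt(variance(group_a) / len(group_a)
              + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


def top_k_features(columns, labels, k):
    """Score every feature on its own and return the k highest-scoring names."""
    scores = {}
    for name, values in columns.items():
        pos = [v for v, y in zip(values, labels) if y == 1]
        neg = [v for v, y in zip(values, labels) if y == 0]
        scores[name] = abs(welch_t(pos, neg))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical toy data: "dose" separates the classes, "weight" does not.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "dose": [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1],
    "weight": [70, 82, 65, 90, 72, 80, 68, 88],
}

selected, scores = top_k_features(columns, labels, k=1)
print(selected)
```

Each feature is scored without reference to the others, which is what makes the method univariate; it also means interactions between features are not captured.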

## Interview Readiness 1
Explain data leakage and overfitting (define each)?
Explain the effect of data leakage and overfitting on the performance of an ML model.

<span style="color:purple">**Data Leakage:**</span>

Data leakage comes in numerous forms, from information about the test set *leaking* into the training set to having features in training that will not be available in production. Data leakage leads to overly optimistic results in both training and test.

With data leakage, performance problems typically show up as poor predictions in production while going unnoticed during the modelling process (on the training and test sets). If serious enough, it could mean discarding the model entirely and going back to the modelling phase of your production pipeline.

I've found that using pipelines for pre-processing tasks helps keep data leakage in check. You should also immediately remove any features that you know will not be available in real production environments.
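To make the pre-processing point concrete, here is a small sketch of the pattern a pipeline enforces: fit the scaling statistics on the training split only, then reuse them on the test split. The helper names and the data are hypothetical, and standard scaling is just one example of a pre-processing step.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn the scaling parameters (mean and standard deviation)."""
    return mean(values), pstdev(values)


def transform(values, params):
    """Apply previously learned scaling parameters."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


# Hypothetical split of a single feature.
train = [10.0, 12.0, 14.0, 16.0]
test = [100.0, 120.0]

# Leaky: the parameters are computed with the test values included,
# so information about the test set shapes the training data.
leaky_params = fit_scaler(train + test)

# Clean: the parameters come from the training split only and are
# then reused, unchanged, on the test split.
clean_params = fit_scaler(train)
train_scaled = transform(train, clean_params)
test_scaled = transform(test, clean_params)

print(leaky_params[0], clean_params[0])
```

The same rule applies to any fitted step, such as imputation values or feature-selection scores: learn it from the training data, then apply it to everything else.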

<span style="color:purple">**Overfitting:**</span>

Overfitting occurs when a model performs well on the training data but does not generalize well to unseen data (i.e. test set, production). It can generally be detected before finalizing your model, as your training results will likely be significantly better than your cross-validation and test set results.
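The train-versus-held-out comparison can be illustrated with a deliberately overfit model. The sketch below uses a 1-nearest-neighbour classifier on synthetic, noisy data (both are assumptions chosen for illustration): the model memorizes the training set, so its training accuracy says nothing about how it generalizes.

```python
import random


def predict_1nn(train_x, train_y, x):
    """Predict the label of the single closest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def accuracy(train_x, train_y, xs, ys):
    """Fraction of points in (xs, ys) that the 1-NN model labels correctly."""
    hits = sum(predict_1nn(train_x, train_y, x) == y for x, y in zip(xs, ys))
    return hits / len(xs)


rng = random.Random(0)  # fixed seed so the run is repeatable


def make_data(n):
    """Synthetic 1-D data: the label depends on x plus Gaussian noise."""
    xs = [rng.random() for _ in range(n)]
    ys = [int(x + rng.gauss(0, 0.3) > 0.5) for x in xs]
    return xs, ys


train_x, train_y = make_data(200)
test_x, test_y = make_data(200)

train_acc = accuracy(train_x, train_y, train_x, train_y)
test_acc = accuracy(train_x, train_y, test_x, test_y)
gap = train_acc - test_acc

print(f"train={train_acc:.2f} test={test_acc:.2f} gap={gap:.2f}")
```

Every training point is its own nearest neighbour, so the training accuracy is perfect by construction. A large gap between that and the held-out accuracy is the signal to look for.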




## Interview Readiness 2
What are outliers in your data?
Explain at least two methods to deal with/treat outliers in your data.

<span style="color:purple">**Outliers:**</span>
    
Outliers are observations in your data set that can be considered extreme, generally seen as being very far from the mean/median of the observations (e.g. > 3 standard deviations away).

Any given feature should be inspected for outliers to determine the outlier type, which makes it easier to decide what to do with them. For example, if it was known that on a given day some observations in an experiment were recorded using a faulty measurement device, you may be comfortable removing them from your dataset. Other outliers may in fact be relevant and needed to understand your dataset better, and your model should be able to predict these values (e.g. very high-priced luxury homes in a house price prediction model).

As mentioned, what to do with outliers depends on what you know about them. Either way, you should use some method or visualization technique to identify outliers in your dataset (e.g. box plots, which are based on the interquartile range, IQR).

- if you know they are faulty observations you may choose to remove them
- alternatively you may not want to delete the observations from your dataset, so you may choose to impute the values somehow (e.g. mean/median)
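Both treatments can be sketched with the common 1.5 × IQR rule for flagging outliers. The prices are hypothetical, and the multiplier `k=1.5` and the choice to impute with the median of the non-outlying values are assumptions, not the only reasonable options.

```python
from statistics import median, quantiles


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences: k * IQR below Q1 and above Q3."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical house prices (in thousands) with one extreme value.
prices = [210, 225, 230, 240, 250, 255, 260, 275, 2400]

low, high = iqr_bounds(prices)
outliers = [p for p in prices if p < low or p > high]

# Option 1: remove the flagged observations.
removed = [p for p in prices if low <= p <= high]

# Option 2: keep every row but replace flagged values with the
# median of the non-outlying values.
fill_value = median(removed)
imputed = [p if low <= p <= high else fill_value for p in prices]

print(outliers, len(removed), imputed[-1])
```

Whether either option is appropriate still depends on the earlier question: a 2400 caused by a faulty reading can go, while a genuine luxury home may be exactly what the model needs to learn from.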



## Interview Readiness 3
What is feature scaling and why is it important to our model?
Explain the difference between Normalization and Standardization?
