# ML Tech Interview

Hello and welcome to the Machine Learning Tech Interview. This interview is divided into two parts: a theoretical part and a practical/coding part.

### **I will review only the scripts that will be sent (by pull request on this repo) by 5:00 pm on Friday**

Good Luck!!

## Theoretical Part

Please answer the following questions. 

#### What are the assumptions of a linear model (or any other type of model)?

Most parametric models make assumptions about the data; tree-based models such as Decision Trees and Random Forests are a notable exception and make very few. For linear models, the main assumptions are:

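- **Linear relationship:** linear regression needs the relationship between the independent variables and the dependent variable to be linear; logistic regression needs the relationship between the independent variables and the log-odds of the outcome to be linear. For linear regression, the simplest way to check this is with scatter plots.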

- **Multivariate normality:** linear regression is often described as requiring multivariate normality; in practice this means the residuals should be approximately normally distributed. A Q-Q plot or a goodness-of-fit test (such as Kolmogorov-Smirnov) can be used to check normality, and a log transformation sometimes helps when the data is not normally distributed.

- **Multicollinearity:** most models also assume little or no multicollinearity in the data (features that are highly correlated with each other). Because a highly correlated feature adds very little information to a model and makes coefficient estimates unstable, a common solution is to remove one feature from each correlated pair (after identifying the pairs with a correlation matrix).

- **Autocorrelation:** this problem occurs when the residuals are not independent of each other (for example in a time series). The Durbin-Watson test can be used to detect autocorrelation in the residuals of a regression.

- **Homoscedasticity:** much like parametric hypothesis tests, linear regression requires homoscedasticity: the residuals should have constant variance across the range of fitted values.

Other examples of assumptions:

Different models have different requirements. For example, time series models (such as ARIMA) assume stationarity: a process whose mean, variance and autocorrelation structure do not change over time. Another example is K-means clustering, which assumes that clusters are roughly spherical and of similar size.
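Two of the checks above are easy to compute by hand. The following is a minimal sketch, not a full diagnostic suite: `durbin_watson` and `pearson` are illustrative helper names written for this example, and the toy lists are made up. A Durbin-Watson statistic near 2 suggests little first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation.

```python
import math


def durbin_watson(residuals):
    """Durbin-Watson statistic for a sequence of regression residuals."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals)))
    den = sum(r ** 2 for r in residuals)
    return num / den


def pearson(x, y):
    """Pearson correlation, used here to flag highly correlated feature pairs."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Toy residuals (assumed values, for illustration only).
alternating = [1.0, -1.0, 1.0, -1.0]  # sign flips every step: negative autocorrelation
constant = [1.0, 1.0, 1.0, 1.0]       # never changes: strong positive autocorrelation
print(durbin_watson(alternating))  # 3.0
print(durbin_watson(constant))     # 0.0

# Toy features: feature_b is an exact multiple of feature_a, so they are collinear.
feature_a = [1.0, 2.0, 3.0, 4.0]
feature_b = [2.0, 4.0, 6.0, 8.0]
print(pearson(feature_a, feature_b))  # very close to 1.0
```

In practice a library implementation would normally be preferred; the point here is only to show what the statistics measure.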



#### What’s the difference between K Nearest Neighbor and K-means Clustering?

K Nearest Neighbor (KNN) is a supervised classification or regression algorithm while K-means Clustering, as the name implies, is an unsupervised method for clustering. The models are fundamentally different and the **k** value represents a very different parameter as well.

As a supervised method, KNN requires labeled data. A new, unlabeled point is classified by majority vote among its **k** nearest labeled points (or, for regression, assigned the average of their values). K-means clustering, on the other hand, partitions the data into **k** clusters. Starting from **k** randomly chosen centroids, it iteratively assigns each point to its nearest centroid and then moves each centroid to the mean of the points assigned to it, until the assignments stop changing.
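The contrast can be sketched in a few lines of plain Python. This is an illustrative sketch under simplifying assumptions: `knn_predict` and `kmeans` are hypothetical helper names, the points are toy data, and the initial centroids are passed in explicitly (rather than drawn at random) so the result is reproducible.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Supervised: label `query` by majority vote among its k nearest labeled points."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def kmeans(points, centroids, n_iter=10):
    """Unsupervised: Lloyd's algorithm starting from the given initial centroids."""
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[j]
            for j, cluster in enumerate(clusters)
        ]
    return centroids


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels = ["a", "a", "a", "b", "b", "b"]  # only KNN uses the labels

print(knn_predict(points, labels, (0.5, 0.5), k=3))  # a
print(kmeans(points, [(0.0, 0.0), (10.0, 10.0)]))    # one centroid per group
```

Note that **k** is the number of neighbors consulted in `knn_predict`, while in `kmeans` it is implied by the number of initial centroids.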

#### How do you address overfitting?

Overfitting is the production of a model that corresponds too closely to a particular set of data, and may therefore fail to predict additional data. This usually occurs when a model has too many features for the amount of data. Overfitting can be detected when our model does much better on the training set than on the test set. There are several ways to address overfitting:

- **Feature selection**: dropping columns that carry little signal, which leaves the model less room to fit noise.
- **Dimensionality reduction:** techniques such as Principal Component Analysis (PCA) reduce the dimensions of your feature space, hence you have fewer relationships between variables to consider and you are less likely to overfit your model.
- **More data points**
- **Cross validation:** not really a solution but more of a prevention measure, like washing our hands. 
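- **Regularization:** techniques that constrain model complexity by adding a penalty on the size of the coefficients to the cost function (for example L1/lasso or L2/ridge). The penalty shrinks coefficients toward zero, so the model cannot rely too heavily on any single feature.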
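To make the penalty idea concrete, here is a minimal sketch of L2 (ridge) regularization in the simplest possible setting: one feature and no intercept. `ridge_slope` is a hypothetical helper and the data is a toy example; the closed form follows from minimizing `sum((y - w*x)**2) + penalty * w**2` with respect to `w`.

```python
def ridge_slope(x, y, penalty):
    """Closed-form slope for one-feature, no-intercept ridge regression.

    Minimizes sum((y_i - w * x_i) ** 2) + penalty * w ** 2.
    """
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + penalty)


x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]  # toy data lying exactly on y = 2x

# penalty = 0 recovers ordinary least squares; larger penalties shrink the slope.
for penalty in (0.0, 14.0, 140.0):
    print(penalty, ridge_slope(x, y, penalty))
```

With no penalty the slope is 2.0; with a penalty of 14 it drops to 1.0, and it keeps shrinking toward zero as the penalty grows. The same trade-off (a slightly worse fit on the training data in exchange for a simpler model) is what reduces overfitting with many features.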


#### Explain Naive Bayes algorithms.

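Naive Bayes is a family of classification algorithms based on Bayes’ Theorem. It makes the strong assumption that features are conditionally independent of each other given the class, and it treats every feature as contributing equally to the outcome, hence the "naive" in the name.
It is simple, fast to train, and often works well even when the independence assumption does not strictly hold.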
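As an illustration, here is a minimal sketch of one member of the family, Gaussian Naive Bayes, which models each numeric feature within a class as a normal distribution. The function names and the toy data are assumptions made for this example. The independence assumption shows up in the loop that simply adds one log-likelihood term per feature.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a class prior and a per-feature (mean, variance) for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        prior = len(rows) / len(X)
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            # Small constant avoids division by zero for constant features.
            var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[label] = (prior, stats)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior for `row`."""
    def log_posterior(label):
        prior, stats = model[label]
        total = math.log(prior)
        # Naive step: features are treated as independent given the class,
        # so their log-likelihoods are simply summed.
        for value, (mean, var) in zip(row, stats):
            total += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)


X = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (5.0, 6.0), (5.2, 5.8), (4.8, 6.2)]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, (1.1, 2.1)))  # 0
print(predict_gaussian_nb(model, (5.1, 5.9)))  # 1
```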

#### When do you use an AUC-ROC score? What kind of information can you gather from it?

The ROC (receiver operating characteristic) curve and its AUC (area under the curve) are evaluation tools for binary classifiers that output a score or probability.
A ROC curve plots the True Positive Rate against the False Positive Rate at every classification threshold. The AUC summarizes that curve in a single number that tells how well the model distinguishes between classes: 0.5 corresponds to random guessing and 1.0 to perfect separation. The higher the AUC, the better the model is at ranking 1s above 0s, regardless of the threshold chosen.
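One way to build intuition for the score: the AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties count as half). The sketch below computes it directly from that definition; `roc_auc` is a hypothetical helper, the labels and scores are toy values, and the quadratic loop is only suitable for small inputs.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))  # 0.75
```

Here three of the four positive/negative pairs are ordered correctly (only the positive scored 0.35 falls below the negative scored 0.4), which gives 0.75.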

#### What is cross validation?

Cross validation is a statistical technique for testing the performance of a machine learning model. It gives us a more comprehensive measure of a model's performance across the whole dataset. It is particularly useful on limited samples, where a single train/test split can suffer from selection bias. One method of cross validation is K-fold: it divides the data into k folds, and each fold is used once as the test set while the remaining k-1 folds are used for training. Averaging the k scores lets us check that our model performs well across the entire data set.
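The splitting logic of K-fold can be sketched as a small generator. `k_fold_indices` is a hypothetical helper written for this example; it does not shuffle, so in practice the rows would usually be shuffled first (or a library splitter used instead).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


for train, test in k_fold_indices(5, 2):
    print("train:", train, "test:", test)
# train: [3, 4] test: [0, 1, 2]
# train: [0, 1, 2] test: [3, 4]
```

Every row appears in exactly one test fold, which is why the averaged score reflects the whole dataset rather than one arbitrary split.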



#### What are confounding variables?

When an unknown or unaccounted-for variable affects both the independent and the dependent variable, it is called a confounding variable. It is particularly problematic in experimental design and interpretation, as it can lead to spurious correlations.
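A small simulation can show the effect. In this sketch (synthetic data, hypothetical variable names), `x` and `y` have no direct link; both are driven by a confounder `z`. They still appear correlated until `z` is controlled for, here with the standard first-order partial correlation formula.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


random.seed(0)
n = 5000
z = [random.gauss(0, 1) for _ in range(n)]  # the confounder
x = [zi + random.gauss(0, 1) for zi in z]   # depends on z only
y = [zi + random.gauss(0, 1) for zi in z]   # depends on z only

r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)

# Partial correlation of x and y after removing the linear effect of z.
partial_xy_given_z = (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

print("raw correlation:", round(r_xy, 3))
print("controlling for z:", round(partial_xy_given_z, 3))
```

The raw correlation comes out around 0.5 even though neither variable causes the other, while the partial correlation is close to zero.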

#### If an important metric for our company stopped appearing in our data source, how would you investigate the causes?

At first glance this seems like it could be a problem in the ETL process. I would start by asking the colleagues who handle data extraction whether this particular metric is still present in the source database. Next I would check every subsequent transformation and loading step to find out where the information is being lost.

## Practical Machine Learning

In this challenge, you will showcase your knowledge in feature engineering, dimensionality reduction, model selection and evaluation, hyperparameter tuning, and any other techniques of machine learning.

There isn't a single correct solution to this challenge. What we would like to learn is your thinking process, which demonstrates your knowledge, experience, and creativity in developing machine learning models. Therefore, in addition to developing the model and optimizing its performance, you should also elaborate on your thinking process and justify your decisions throughout the iterative problem-solving process.

The suggested time to spend on this challenge is 90-120 minutes. If you don't have time to finish all the tasks you plan to do, simply document the to-dos at the end of your response.

#### Instructions:

- Download the housing prices data set (housing_prices.csv). The data is big enough to showcase your thoughts but not so big that processing power will be a problem.
- Using Python, analyze the features and determine which feature set to select for modeling.
- Train and cross validate several regression models, attempting to accurately predict the SalePrice target variable.
- Evaluate all models and show comparison of performance metrics.
- State your thoughts on model performance, which model(s) you would select, and why.

#### Deliverables Checklist:

- Python code.
- Your thinking process.
- The features selected for machine learning.
- The results (e.g., performance metrics) of your selected model(s).

In [19]:
# import libraries
import pandas as pd

# show full cell contents and all rows (None means no limit)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

In [13]:
# load the data
housing = pd.read_csv('housing_prices.csv')

In [14]:
# examine the dataset
print(housing.shape)
housing.head()

(1460, 81)


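| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 12 | 2008 | WD | Normal | 250000 |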


In [None]:
# My first impression from the shape is that there could be too many columns for the number of
# observations in the dataset.

In [15]:
housing.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

In [20]:
housing.dtypes

Id               int64
MSSubClass       int64
MSZoning         object
LotFrontage      float64
LotArea          int64
Street           object
Alley            object
LotShape         object
LandContour      object
Utilities        object
LotConfig        object
LandSlope        object
Neighborhood     object
Condition1       object
Condition2       object
BldgType         object
HouseStyle       object
OverallQual      int64
OverallCond      int64
YearBuilt        int64
YearRemodAdd     int64
RoofStyle        object
RoofMatl         object
Exterior1st      object
Exterior2nd      object
MasVnrType       object
MasVnrArea       float64
ExterQual        object
ExterCond        object
Foundation       object
BsmtQual         object
BsmtCond         object
BsmtExposure     object
BsmtFinType1     object
BsmtFinSF1       int64
BsmtFinType2     object
BsmtFinSF2       int64
BsmtUnfSF        int64
TotalBsmtSF      int64
Heating          object


In [8]:
# data cleaning
# missing values
housing.isna().sum().sort_values(ascending=False)

PoolQC           1453
MiscFeature      1406
Alley            1369
Fence            1179
FireplaceQu       690
                 ... 
CentralAir          0
SaleCondition       0
Heating             0
TotalBsmtSF         0
Id                  0
Length: 81, dtype: int64
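A possible next step, sketched here on a tiny made-up frame rather than on the real data, is to turn the missing counts into shares and drop columns that are mostly empty. The 50% threshold and the toy values are assumptions for illustration; on the real dataset the threshold should be chosen after checking whether a missing value actually means "feature not present" (as it plausibly does for columns like PoolQC).

```python
import numpy as np
import pandas as pd

# Toy frame reusing three column names from the dataset; values are made up.
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Share of missing values per column, highest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Drop columns where more than half of the values are missing (assumed threshold).
threshold = 0.5
to_drop = missing_share[missing_share > threshold].index.tolist()
cleaned = toy.drop(columns=to_drop)
print(to_drop)                   # ['PoolQC']
print(cleaned.columns.tolist())  # ['LotFrontage', 'LotArea']
```

Columns with a moderate share of missing values, such as `LotFrontage` in this toy frame, are kept and would need imputation instead.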