# ML Tech Interview

Hello and welcome to the Machine Learning Tech Interview. This interview will be divided into two parts: the theoretical part and the practical/coding part.

### **I will review only the scripts that will be sent (by pull request on this repo) by 1:00 pm**

Good Luck!!

## Theoretical Part

Please answer the following questions. 

#### What are the assumptions of a linear model (or any other type of model)?

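For a linear model to work properly, the relationship between the features and the target should be approximately linear, and the residuals should be independent of each other (no autocorrelation), have constant variance (homoscedasticity), and be roughly normally distributed. The features should also not be strongly collinear. For classification models, the classes should additionally be reasonably balanced; otherwise the results can be untrustworthy.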
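Collinearity between features can be checked numerically. The sketch below computes a variance inflation factor (VIF) for each feature by regressing it on the others; the helper name `variance_inflation_factors` and the synthetic data are assumptions made for illustration, and a VIF far above 1 is only a rough warning sign, not a hard rule.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return the VIF of each column of X (rows = samples, columns = features)."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        # Regress feature j on all the other features plus an intercept
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic data (an assumption for illustration): x3 is almost a copy of x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + rng.normal(scale=0.1, size=500)
vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(vifs)  # x1 and x3 get large VIFs, x2 stays close to 1
```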

#### What’s the difference between K Nearest Neighbor and K-means Clustering?

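K Nearest Neighbors is a supervised algorithm: it predicts the label (or value) of a new point from the labels of its k closest training points. K-means Clustering is an unsupervised algorithm: it partitions unlabeled data into k clusters by assigning each point to its nearest centroid.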
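The difference shows up directly in what each estimator needs in order to fit. The following is a minimal sketch with scikit-learn on two made-up blobs of points (the data and variable names are illustrative assumptions): KNN is fitted on features and labels, while K-means is fitted on the features alone.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated synthetic blobs (illustrative data, not from the challenge)
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])
y = np.array([0] * 50 + [1] * 50)  # labels, only used by the supervised model

# Supervised: KNN needs the labels y to fit, then predicts a label for new points
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
knn_pred = knn.predict([[0.2, -0.1], [4.8, 5.3]])

# Unsupervised: K-means only sees X and assigns its own cluster ids
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = kmeans.labels_

print(knn_pred)                  # labels taken from the training labels
print(np.bincount(cluster_ids))  # size of each discovered cluster
```

Note that the cluster ids returned by K-means are arbitrary: cluster 0 does not have to correspond to label 0.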

#### How do you address overfitting?

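Overfitting is a common problem in machine learning models. It happens when the model fits the training data so closely that it cannot generalize: it essentially memorizes the results it has already seen and performs poorly on new data. It can be addressed by using more training data, simplifying the model or reducing the number of features, adding regularization, and using cross-validation to detect it early.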
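One common way to address overfitting is regularization. As a minimal sketch (the helper `fit_linear`, the synthetic data, and the penalty value are assumptions for illustration), the code below fits a flexible degree-9 polynomial with and without an L2 penalty. For simplicity the penalty here also applies to the intercept term.

```python
import numpy as np

def fit_linear(X, y, alpha=0.0):
    """Least squares with an L2 penalty alpha on the weights (alpha=0 is plain OLS)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=30)
y = np.sin(3 * x) + rng.normal(scale=0.3, size=30)

# Degree-9 polynomial features: flexible enough to chase the noise
X = np.vander(x, N=10, increasing=True)

w_ols = fit_linear(X, y, alpha=0.0)
w_ridge = fit_linear(X, y, alpha=1.0)

# The penalty shrinks the weights, which limits how wildly the curve can bend
print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```

In practice the penalty strength would be chosen by comparing scores on held-out data rather than fixed by hand.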

#### Explain Naive Bayes algorithms.

Naive Bayes is a family of probabilistic machine learning algorithms based on Bayes' theorem. It assumes that the features that go into the model are independent of each other given the class. That is, once the class is known, the value of one feature tells you nothing about the value of any other feature used in the algorithm.
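The independence assumption is what keeps the computation simple: the likelihood of a row is just the product of per-feature likelihoods (a sum in log space). Below is a small from-scratch sketch of the Gaussian variant; the function names and the toy data are hypothetical and chosen only to illustrate the idea.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate a prior and per-feature mean/variance for every class."""
    params = {}
    for label in np.unique(y):
        rows = X[y == label]
        # A tiny constant keeps the variance strictly positive
        params[label] = (len(rows) / len(X), rows.mean(axis=0), rows.var(axis=0) + 1e-9)
    return params

def predict_gaussian_nb(params, X):
    """Pick the class with the highest log posterior for every row of X."""
    labels = list(params)
    scores = []
    for label in labels:
        prior, mean, var = params[label]
        # "Naive" step: the per-feature log-likelihoods are simply summed
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (X - mean) ** 2 / var).sum(axis=1)
        scores.append(np.log(prior) + log_lik)
    return np.array(labels)[np.argmax(scores, axis=0)]

X_train = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
                    [4.0, 5.0], [4.2, 4.8], [3.8, 5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])
nb_params = fit_gaussian_nb(X_train, y_train)
nb_pred = predict_gaussian_nb(nb_params, np.array([[1.1, 2.1], [4.1, 4.9]]))
print(nb_pred)
```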

#### When do you use an AUC-ROC score? What kind of information can you gather from it?

The AUC-ROC (Area Under the Curve of the Receiver Operating Characteristic) score is a performance measure for binary classification problems across all threshold settings. The ROC curve plots the true positive rate against the false positive rate at each threshold, and the AUC summarizes how well the model separates the classes. The higher the AUC, the better the model is at ranking 1s above 0s; an AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation.
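One way to read the AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way by comparing all positive/negative pairs; `auc_score` is a hypothetical helper written for illustration, not a library function.

```python
import numpy as np

def auc_score(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score against every negative score
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

auc = auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```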

#### What is cross validation?

Cross-validation is a method used to estimate the skill of a machine learning model on unseen data. That is, it uses a limited sample to estimate how the model is expected to perform in general when making predictions on data not used during training. In k-fold cross-validation, the data is split into k subsets of equal size; the model is trained on k-1 of them and tested on the remaining one, and this is repeated until every subset has served as the test set once.
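The mechanics of k-fold cross-validation can be written out by hand. The sketch below is a minimal version, assuming an ordinary least squares model and synthetic data; `k_fold_r2` is a hypothetical helper, and in practice a library routine would normally be used instead.

```python
import numpy as np

def k_fold_r2(X, y, k=5, seed=0):
    """Return the R^2 of an ordinary least squares fit on each of k held-out folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th one
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        residual = y[test_idx] - A_test @ coef
        total = y[test_idx] - y[test_idx].mean()
        scores.append(1.0 - (residual @ residual) / (total @ total))
    return np.array(scores)

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
cv_scores = k_fold_r2(X_demo, y_demo, k=5)
print(cv_scores.mean(), cv_scores.std())
```

Reporting the mean together with the spread across folds gives a more stable picture than a single train/test split.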

#### What are confounding variables?

Confounding variables are variables that influence both the independent and the dependent variables and cause spurious relationships between them. If they are not accounted for, they can ruin a test, because it becomes impossible to tell whether an observed effect is real or not.
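A small simulation makes the effect concrete. In the sketch below (all variables are synthetic and the helper `residualize` is a hypothetical name), `z` drives both `x` and `y`, while `x` has no effect on `y` at all. The raw correlation looks strong, but it mostly disappears once `z` is controlled for.

```python
import numpy as np

def residualize(values, control):
    """Remove the part of `values` that a straight-line fit on `control` explains."""
    design = np.column_stack([np.ones(len(control)), control])
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    return values - design @ coef

rng = np.random.default_rng(0)
z = rng.normal(size=5000)            # the confounder
x = 2 * z + rng.normal(size=5000)    # z drives x
y = 2 * z + rng.normal(size=5000)    # z drives y; x has no effect on y

raw_corr = np.corrcoef(x, y)[0, 1]
partial_corr = np.corrcoef(residualize(x, z), residualize(y, z))[0, 1]
print(raw_corr, partial_corr)  # strong raw correlation, near zero once z is controlled for
```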

#### If an important metric for our company stopped appearing in our data source, how would you investigate the causes?

First of all, I'd take a look at previous data, if available, and check whether there was anything irregular there that points to what to inspect. Then I'd investigate how that data is collected and where it comes from. If the problem cannot be solved by restoring what was being done in the past, I'd try to figure out new ways of collecting the same data.
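The first step, looking at previous data for irregularities, could start with something like the sketch below. The event log, its column names (`date`, `source`, `metric`), and its values are entirely hypothetical; the point is only to show how to locate when and where a metric stopped being reported.

```python
import pandas as pd

# Hypothetical event log: the column names and values are assumptions for illustration
log = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-02",
                            "2020-01-02", "2020-01-03", "2020-01-03"]),
    "source": ["web", "app", "web", "app", "web", "app"],
    "metric": [10.0, 12.0, 11.0, None, 9.0, None],
})

# Share of missing values per day and per source shows when and where the gap starts
missing_share = log["metric"].isna().groupby([log["date"], log["source"]]).mean().unstack()
print(missing_share)

# Last date on which each source still reported the metric
last_seen = log.dropna(subset=["metric"]).groupby("source")["date"].max()
print(last_seen)
```

If the gap is limited to one source and starts on a specific date, that date is a natural place to look for a deployment, schema change, or pipeline failure.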

## Practical Machine Learning

In this challenge, you will showcase your knowledge in feature engineering, dimensionality reduction, model selection and evaluation, hyperparameter tuning, and any other techniques of machine learning.

There isn't a correct solution to this challenge. All we would like to learn is your thinking process that demonstrates your knowledge, experience, and creativity in developing machine learning models. Therefore, in addition to developing the model and optimizing its performance, you should also elaborate your thinking process and justify your decisions throughout the iterative problem-solving process.

The suggested time to spend on this challenge is 90-120 minutes. If you don't have time to finish all the tasks you plan to do, simply document the to-dos at the end of your response.

#### Instructions:

- Download the housing prices data set (housing_prices.csv). The data is big enough to showcase your thoughts but not so big that processing power will be a problem.
- Using Python, analyze the features and determine which feature set to select for modeling.
- Train and cross validate several regression models, attempting to accurately predict the SalePrice target variable.
- Evaluate all models and show comparison of performance metrics.
- State your thoughts on model performance, which model(s) you would select, and why.

#### Deliverables Checklist:

- Python code.
- Your thinking process.
- The features selected for machine learning.
- The results (e.g., performance metrics) of your selected model(s).

In [26]:
# LIBRARIES

import pandas as pd
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from scipy import stats

In [4]:
# Import the data
house = pd.read_csv('housing_prices.csv')

In [5]:
house.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [6]:
# This function drops the columns that have more than 10% of their values as NaN
def dropnull(data):
    # Share of missing values in each column
    null_share = data.isnull().mean()
    return data.drop(columns=null_share[null_share > 0.1].index)

In [7]:
house_clean = dropnull(house)

In [8]:
# Taking a look at the different values of the features with categorical values
for i in house_clean.columns:
    if house_clean[i].dtype=="object":
        print(f"{i} //// {set(house_clean[i])}")

MSZoning //// {'RM', 'FV', 'C (all)', 'RH', 'RL'}
Street //// {'Pave', 'Grvl'}
LotShape //// {'IR3', 'IR1', 'IR2', 'Reg'}
LandContour //// {'Lvl', 'Bnk', 'HLS', 'Low'}
Utilities //// {'AllPub', 'NoSeWa'}
LotConfig //// {'Inside', 'FR3', 'CulDSac', 'Corner', 'FR2'}
LandSlope //// {'Gtl', 'Mod', 'Sev'}
Neighborhood //// {'NAmes', 'Blueste', 'Blmngtn', 'Timber', 'OldTown', 'SawyerW', 'IDOTRR', 'NPkVill', 'BrkSide', 'Gilbert', 'Veenker', 'CollgCr', 'NridgHt', 'SWISU', 'Mitchel', 'ClearCr', 'Sawyer', 'StoneBr', 'NoRidge', 'Edwards', 'Crawfor', 'NWAmes', 'BrDale', 'MeadowV', 'Somerst'}
Condition1 //// {'RRNe', 'RRNn', 'RRAn', 'Norm', 'PosN', 'Feedr', 'RRAe', 'Artery', 'PosA'}
Condition2 //// {'RRNn', 'RRAn', 'Norm', 'PosN', 'Feedr', 'RRAe', 'Artery', 'PosA'}
BldgType //// {'1Fam', 'TwnhsE', 'Twnhs', 'Duplex', '2fmCon'}
HouseStyle //// {'1.5Unf', 'SLvl', '2.5Fin', '1Story', 'SFoyer', '1.5Fin', '2.5Unf', '2Story'}
RoofStyle //// {'Flat', 'Gable', 'Shed', 'Hip', 'Mansard', 'Gambrel'}
RoofMatl /

In [9]:
# This function drops the columns that have more than 5 different categorical values
def dropcategorical(data):
    for i in data:
        if len(set(data[i])) > 5 and data[i].dtype=="object":
            data = data.drop(columns=[i])
    return data

In [10]:
house_clean2 = dropcategorical(house_clean)
house_clean2 = house_clean2.drop(columns=['Id'])

In [11]:
house_clean2.head()

Unnamed: 0,MSSubClass,MSZoning,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,BldgType,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
0,60,RL,8450,Pave,Reg,Lvl,AllPub,Inside,Gtl,1Fam,...,0,61,0,0,0,0,0,2,2008,208500
1,20,RL,9600,Pave,Reg,Lvl,AllPub,FR2,Gtl,1Fam,...,298,0,0,0,0,0,0,5,2007,181500
2,60,RL,11250,Pave,IR1,Lvl,AllPub,Inside,Gtl,1Fam,...,0,42,0,0,0,0,0,9,2008,223500
3,70,RL,9550,Pave,IR1,Lvl,AllPub,Corner,Gtl,1Fam,...,0,35,272,0,0,0,0,2,2006,140000
4,60,RL,14260,Pave,IR1,Lvl,AllPub,FR2,Gtl,1Fam,...,192,84,0,0,0,0,0,12,2008,250000


In [12]:
# Show the values of every remaining column that has fewer than 6 distinct values
for i in house_clean2.columns:
    if len(set(house_clean2[i]))<6:
        print(f"{i} //// {set(house_clean2[i])}")

MSZoning //// {'RM', 'FV', 'C (all)', 'RH', 'RL'}
Street //// {'Pave', 'Grvl'}
LotShape //// {'IR3', 'IR1', 'IR2', 'Reg'}
LandContour //// {'Lvl', 'Bnk', 'HLS', 'Low'}
Utilities //// {'AllPub', 'NoSeWa'}
LotConfig //// {'Inside', 'FR3', 'CulDSac', 'Corner', 'FR2'}
LandSlope //// {'Gtl', 'Mod', 'Sev'}
BldgType //// {'1Fam', 'TwnhsE', 'Twnhs', 'Duplex', '2fmCon'}
MasVnrType //// {nan, 'BrkCmn', 'Stone', 'BrkFace', 'None'}
ExterQual //// {'TA', 'Ex', 'Gd', 'Fa'}
ExterCond //// {'Ex', 'TA', 'Fa', 'Gd', 'Po'}
BsmtQual //// {nan, 'Ex', 'TA', 'Gd', 'Fa'}
BsmtCond //// {nan, 'TA', 'Fa', 'Gd', 'Po'}
BsmtExposure //// {nan, 'No', 'Av', 'Mn', 'Gd'}
HeatingQC //// {'Ex', 'TA', 'Fa', 'Gd', 'Po'}
CentralAir //// {'N', 'Y'}
BsmtFullBath //// {0, 1, 2, 3}
BsmtHalfBath //// {0, 1, 2}
FullBath //// {0, 1, 2, 3}
HalfBath //// {0, 1, 2}
KitchenAbvGr //// {0, 1, 2, 3}
KitchenQual //// {'TA', 'Ex', 'Gd', 'Fa'}
Fireplaces //// {0, 1, 2, 3}
GarageFinish //// {'Fin', nan, 'RFn', 'Unf'}
GarageCars //// {0, 1, 2

In [13]:
len(house_clean2[house_clean2.isna().any(axis=1)]) / len(house_clean2)
# Only 8% of the rows have NaN. I'll drop them.

0.0821917808219178

In [14]:
house_clean3 = house_clean2.dropna()

In [15]:
len(house_clean3[house_clean3.isna().any(axis=1)]) / len(house_clean3)

0.0

In [42]:
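def outliers_z(data):
    # Keep only the rows where every numeric column is within 3 standard deviations of its mean
    z = np.abs(stats.zscore(data.select_dtypes(include=np.number)))
    data_out = data[(z < 3).all(axis=1)]
    return data_out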

In [43]:
# I'm using two different datasets to test and see if the outliers make that much of a difference
house_final = outliers_z(house_clean3)

In [44]:
# Label-encode every column whose mean is above 10.
# Note: this also replaces the values of numeric columns such as LotArea and SalePrice with integer codes.
lb_make = LabelEncoder()
house_final = house_final.copy()  # work on an explicit copy to avoid the SettingWithCopyWarning
for i in house_final.columns:
    if np.mean(house_final[i])>10:
        house_final[i] = lb_make.fit_transform(house_final[i])


In [49]:
# Label-encode every column whose mean is above 10 (same procedure as for house_final)
house_clean3 = house_clean3.copy()  # work on an explicit copy to avoid the SettingWithCopyWarning
for i in house_clean3.columns:
    if np.mean(house_clean3[i])>10:
        house_clean3[i] = lb_make.fit_transform(house_clean3[i])
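LabelEncoder assigns arbitrary integers, which a linear model or KNN will read as an ordering. A possible alternative for nominal features is one-hot encoding restricted to the non-numeric columns, which leaves numeric columns such as LotArea and the SalePrice target untouched. The frame below is a made-up stand-in that only borrows a few column names from the data set; it is a sketch of the idea, not the encoding used for the results in this notebook.

```python
import pandas as pd

# Tiny stand-in frame; the column names mirror the data set but only a few columns are kept
demo = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL", "FV"],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "LotArea": [8450, 9600, 11250, 9550],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Only the non-numeric columns are treated as categorical
categorical_cols = demo.select_dtypes(exclude="number").columns
encoded = pd.get_dummies(demo, columns=list(categorical_cols))

print(encoded.columns.tolist())
```

Each category becomes its own indicator column (for example `MSZoning_RL`), so no artificial order is introduced between categories.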


In [45]:
house_final

Unnamed: 0,MSSubClass,MSZoning,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,BldgType,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
0,5,3,186,1,3,3,0,4,0,0,...,0,42,0,0,0,0,0,2,2,260
2,5,3,434,1,0,3,0,4,0,0,...,0,24,0,0,0,0,0,9,2,282
4,5,3,553,1,0,3,0,2,0,0,...,78,61,0,0,0,0,0,12,2,321
6,0,3,339,1,3,3,0,4,0,0,...,113,38,0,0,0,0,0,8,1,371
10,0,3,426,1,3,3,0,4,0,0,...,0,0,0,0,0,0,0,2,2,66
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1446,0,3,600,1,0,3,0,1,0,0,...,116,22,0,0,0,0,0,4,4,139
1447,5,3,333,1,3,3,0,4,0,0,...,0,46,0,0,0,0,0,12,1,310
1451,0,3,268,1,3,3,0,4,0,0,...,0,20,0,0,0,0,0,5,3,361
1455,5,3,149,1,3,3,0,4,0,0,...,0,23,0,0,0,0,0,8,1,182


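# Rescale every column whose mean is above 10 to the range 0-10
scaler = MinMaxScaler(feature_range=(0,10))
for i in house_clean3.columns:
    if np.mean(house_clean3[i]) > 10:
        # fit_transform expects a 2D array; flatten the result back to 1D
        house_clean3[i] = scaler.fit_transform(house_clean3[i].values.reshape(-1, 1)).ravel()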

In [46]:
house_final.describe()

Unnamed: 0,MSSubClass,MSZoning,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,BldgType,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,751.0,751.0,751.0,751.0,751.0,751.0,751.0,751.0,751.0,751.0,...,751.0,751.0,751.0,751.0,751.0,751.0,751.0,751.0,751.0,751.0
mean,3.707057,3.111851,281.30759,1.0,1.842876,2.941411,0.0,2.994674,0.0,0.463382,...,35.972037,26.027963,2.296937,0.030626,7.138482,0.0,0.071904,6.298269,1.822903,178.123835
std,3.94927,0.331873,172.885719,0.0,1.443819,0.333511,0.0,1.628078,0.0,1.234365,...,44.27346,33.1309,8.40276,0.839282,32.1649,0.0,0.551504,2.649576,1.350282,107.817967
min,0.0,2.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
25%,0.0,3.0,131.5,1.0,0.0,3.0,0.0,1.5,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,1.0,85.5
50%,4.0,3.0,275.0,1.0,3.0,3.0,0.0,4.0,0.0,0.0,...,20.0,12.0,0.0,0.0,0.0,0.0,0.0,6.0,2.0,171.0
75%,5.0,3.0,426.5,1.0,3.0,3.0,0.0,4.0,0.0,0.0,...,61.0,43.5,0.0,0.0,0.0,0.0,0.0,8.0,3.0,268.0
max,12.0,4.0,602.0,1.0,3.0,3.0,0.0,4.0,0.0,4.0,...,165.0,129.0,49.0,23.0,189.0,0.0,7.0,12.0,4.0,397.0


# Using the house_clean3 data
This data includes the outliers and is more complete, although the scores are lower.

In [50]:
X = house_clean3.drop(columns=['SalePrice'])
y = house_clean3['SalePrice']

# divide train test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=13)

In [48]:
# Simple Linear Regression model
regr = LinearRegression()
model = regr.fit(X_train, y_train)

model.score(X_test, y_test)

0.8323714202895727

In [21]:
# Polynomial model
for k in range(1,4):
    poly_model = make_pipeline(StandardScaler(), PolynomialFeatures(k), LinearRegression())
    model2 = poly_model.fit(X_train, y_train)

    print(k, poly_model.score(X_test, y_test))

1 0.8324779267638921
2 -11.023993253288994
3 -3.4727974122523566


In [22]:
# Decision Tree model
regr2 = DecisionTreeRegressor(random_state = 13)

model3 = regr2.fit(X_train, y_train)

regr2.score(X_test, y_test)

0.7210200025492657

In [23]:
# KNeighborsRegressor model
for k in range(1,9):
    knnr = KNeighborsRegressor(n_neighbors = k)
    model4 = knnr.fit(X_train, y_train)
    print(k, model4.score(X_test, y_test))

1 0.5389700556158278
2 0.6214691812659408
3 0.651611926073848
4 0.6678085867809576
5 0.6672864231650107
6 0.6665942387145387
7 0.6574378570097043
8 0.639688394802408


# Using the house_final data
This data excludes the outliers, but in doing so it also loses the information in some features, such as the existence of a pool (PoolArea is 0 for every remaining row), which is an important feature for the price.

In [51]:
X = house_final.drop(columns=['SalePrice'])
y = house_final['SalePrice']

# divide train test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=13)

In [52]:
# Simple Linear Regression model
regr = LinearRegression()
model = regr.fit(X_train, y_train)

model.score(X_test, y_test)

0.9294296864372703

In [53]:
# Polynomial model
for k in range(1,4):
    poly_model = make_pipeline(StandardScaler(), PolynomialFeatures(k), LinearRegression())
    model2 = poly_model.fit(X_train, y_train)

    print(k, poly_model.score(X_test, y_test))

1 0.9294296864372683
2 0.44779842805445946
3 0.8385921522097797


In [55]:
# Decision Tree model
regr2 = DecisionTreeRegressor(random_state = 13)

model3 = regr2.fit(X_train, y_train)

regr2.score(X_test, y_test)

0.7174599773920229

In [54]:
# KNeighborsRegressor model
for k in range(1,9):
    knnr = KNeighborsRegressor(n_neighbors = k)
    model4 = knnr.fit(X_train, y_train)
    print(k, model4.score(X_test, y_test))

1 0.6923054330683349
2 0.7661202591751385
3 0.8017005712185548
4 0.8254631800646975
5 0.8320632657468604
6 0.8352190359111544
7 0.8336026443316213
8 0.8336559040545255
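The scores above come from a single train/test split. To cross-validate and compare the same three model types side by side, something like the following sketch could be used. It runs on synthetic stand-in data (an assumption so that the block is self-contained), so its numbers say nothing about the housing data; `n_neighbors=6` is just an example value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for X and y (an assumption so that the sketch runs on its own)
rng = np.random.default_rng(13)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=300)

models = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=13),
    "KNeighborsRegressor": KNeighborsRegressor(n_neighbors=6),
}

# The same shuffled 5-fold split is reused for every model so the scores are comparable
folds = KFold(n_splits=5, shuffle=True, random_state=13)
cv_results = {}
for name, estimator in models.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=folds, scoring="r2")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean R^2 = {scores.mean():.3f} (std {scores.std():.3f})")
```

Replacing `X_demo` and `y_demo` with the real feature matrix and target would give a mean and spread per model instead of one number per split.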


## Overall, the best model is the Linear Regression using the house_final data. I would go with it if I want to evaluate average houses, although this data doesn't include the outliers. The score is good (0.92943), so I think it's a good model.