# ML Tech Interview

Hello and welcome to the Machine Learning Tech Interview. This interview is divided into two parts: the theoretical part and the practical/coding part.

### **I will review only the scripts submitted (by pull request on this repo) by 1:00 pm**

Good Luck!!

## Theoretical Part

Please answer the following questions. 

#### What are the assumptions of a linear model (or any other type of model)?

- Weak exogeneity
- Linearity
- Constant variance
- Independence of errors
- Lack of perfect multicollinearity in the predictors
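
Some of these assumptions can be checked numerically after fitting. The sketch below is a minimal illustration on synthetic data; the data, the variable names, and the choice of diagnostics are assumptions made for this example and are not part of the challenge. It fits ordinary least squares with NumPy, then inspects the residuals and uses the condition number of the design matrix as a rough multicollinearity check.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data that satisfies the assumptions: y is linear in two
# independent predictors, with constant-variance Gaussian noise.
n = 500
X = rng.normal(size=(n, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Ordinary least squares with an explicit intercept column.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ coef
residuals = y - fitted

# With an intercept, OLS residuals average to zero by construction.
print("mean residual:", residuals.mean())

# Constant variance: compare residual spread for low and high fitted values.
low = residuals[fitted < np.median(fitted)].std()
high = residuals[fitted >= np.median(fitted)].std()
print("residual std (low / high fitted):", low, high)

# Multicollinearity: a very large condition number signals near-collinear predictors.
print("condition number:", np.linalg.cond(A))
```

On real data, a residual spread that grows with the fitted values or a very large condition number would suggest that the corresponding assumption deserves a closer look.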

#### What’s the difference between K Nearest Neighbor and K-means Clustering?

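K-nearest neighbors is a supervised learning algorithm, most often used for classification: it predicts the label of a new point from the labels of its k closest training points.

K-means is an unsupervised clustering algorithm: it partitions unlabeled data into k clusters by assigning each point to the nearest cluster centroid.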
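
The contrast shows up directly in how the two estimators are fitted. The following is a small sketch on made-up 2-D blobs (the data and parameter values are assumptions for illustration): `KNeighborsClassifier` needs the labels, while `KMeans` only sees the points.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Two well-separated 2-D blobs of 50 points each.
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])
labels = np.array([0] * 50 + [1] * 50)

# Supervised: KNN needs the labels to fit, and predicts a label for new points.
knn = KNeighborsClassifier(n_neighbors=3).fit(points, labels)
knn_pred = knn.predict([[0.2, -0.1], [4.8, 5.3]])
print("KNN predictions:", knn_pred)

# Unsupervised: K-means only sees the points and assigns its own cluster ids.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("K-means cluster ids for the first and last point:",
      kmeans.labels_[0], kmeans.labels_[-1])
```

The cluster ids returned by K-means are arbitrary: the algorithm can recover the two groups, but it has no notion of which one is "class 0".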

#### How do you address overfitting?

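Collect more training data. Other common remedies are using a simpler model, adding regularization, and checking performance with cross-validation so that overfitting is detected early.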
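
Regularization is a common complement to more data. As a minimal sketch on synthetic data (the data, the `ridge_fit` helper, and the penalty value `alpha=10.0` are assumptions for illustration), the closed-form ridge solution below shows how an L2 penalty shrinks the fitted weights compared with plain least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Few samples, many noisy features: a setting where plain least squares overfits.
n_samples, n_features = 30, 20
X = rng.normal(size=(n_samples, n_features))
y = X[:, 0] + rng.normal(scale=0.5, size=n_samples)  # only feature 0 matters

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution (no intercept): (X^T X + alpha * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

w_ols = ridge_fit(X, y, alpha=0.0)     # no penalty: ordinary least squares
w_ridge = ridge_fit(X, y, alpha=10.0)  # L2 penalty shrinks the weights

print("OLS weight norm:  ", np.linalg.norm(w_ols))
print("Ridge weight norm:", np.linalg.norm(w_ridge))
```

Smaller weights mean the model reacts less strongly to noise in the 19 irrelevant features, which is the mechanism by which the penalty limits overfitting.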

#### Explain Naive Bayes algorithms.

Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. Naive Bayes is not a single algorithm but a family of algorithms that share a common principle: every feature is assumed to be independent of every other feature, given the class.
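
To make the independence assumption concrete, here is a minimal from-scratch sketch of the Gaussian variant. The helper names `fit_gaussian_nb` and `predict_gaussian_nb` and the toy data are assumptions for illustration; in practice a library implementation would normally be used.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class priors and per-feature means and variances."""
    params = {}
    for cls in np.unique(y):
        X_cls = X[y == cls]
        params[cls] = {
            "prior": len(X_cls) / len(X),
            "mean": X_cls.mean(axis=0),
            "var": X_cls.var(axis=0) + 1e-9,  # small constant avoids division by zero
        }
    return params

def predict_gaussian_nb(params, X):
    """Pick the class with the highest log-posterior for each row of X."""
    classes = list(params)
    log_posteriors = []
    for cls in classes:
        p = params[cls]
        # Naive assumption: the joint log-likelihood is the SUM of the
        # per-feature Gaussian log-likelihoods (independence given the class).
        log_likelihood = -0.5 * np.sum(
            np.log(2 * np.pi * p["var"]) + (X - p["mean"]) ** 2 / p["var"], axis=1
        )
        log_posteriors.append(np.log(p["prior"]) + log_likelihood)
    return np.array(classes)[np.argmax(log_posteriors, axis=0)]

# Toy data: two classes with different feature means.
X_train = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
                    [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])

model = fit_gaussian_nb(X_train, y_train)
predictions = predict_gaussian_nb(model, np.array([[1.1, 2.1], [4.9, 5.9]]))
print(predictions)
```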

#### When do you use an AUC-ROC score? What kind of information can you gather from it?

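The AUC is the area under the ROC curve, which plots the true positive rate against the false positive rate at every classification threshold. It is used to evaluate binary classifiers: a score of 0.5 corresponds to random guessing and 1.0 to perfect separation, so it shows how well the model ranks positive examples above negative ones, independently of any single threshold.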
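
The AUC also has a ranking interpretation: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way on toy scores; the helper name `auc_score` is an assumption for illustration, not a library function.

```python
import numpy as np

def auc_score(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
print(auc_score(labels, [0.1, 0.4, 0.35, 0.8]))  # one of the four pairs is misordered
print(auc_score(labels, [0.1, 0.2, 0.8, 0.9]))   # every positive outranks every negative
```

The pairwise comparison is quadratic in the number of examples, so it is only meant to show the idea on small inputs.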

#### What is cross validation?

Cross-validation is a technique for evaluating a model by repeatedly splitting the data into training and test partitions, so that the estimated performance does not depend on one particular split.
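
A minimal sketch of 5-fold cross-validation with scikit-learn follows. The synthetic data, the choice of `LinearRegression`, and the fold count are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic regression data: y depends linearly on three features plus noise.
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

# 5-fold CV: each fold is held out once while the model trains on the other four.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print("R^2 per fold:", scores.round(3))
print("mean and std:", scores.mean().round(3), scores.std().round(3))
```

Reporting the spread across folds alongside the mean gives a sense of how sensitive the score is to the partition.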

#### What are confounding variables?

A confounding variable is a variable that influences both the independent variable and the dependent variable, which can create a spurious association between them.
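
A small simulation can illustrate the effect. In this sketch (all variables are synthetic and the `residual` helper is an assumption for illustration), `z` drives both `x` and `y`, so the two look correlated even though neither affects the other; the correlation largely disappears once `z` is controlled for.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# z is the confounder: it drives both x and y, but x has no effect on y.
z = rng.normal(size=n)
x = z + 0.5 * rng.normal(size=n)
y = z + 0.5 * rng.normal(size=n)

# The raw correlation between x and y looks strong...
raw_corr = np.corrcoef(x, y)[0, 1]

def residual(target, control):
    """Remove the linear effect of `control` from `target`."""
    slope, intercept = np.polyfit(control, target, deg=1)
    return target - (slope * control + intercept)

# ...but it is close to zero once the confounder is controlled for
# (this is the partial correlation of x and y given z).
partial_corr = np.corrcoef(residual(x, z), residual(y, z))[0, 1]

print("corr(x, y):", round(raw_corr, 3))
print("corr(x, y | z):", round(partial_corr, 3))
```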

#### If an important metric for our company stopped appearing in our data source, how would you investigate the causes?

If we have not changed the data source ourselves, check whether its structure has changed (for example, a renamed or removed field) and update the field mappings accordingly.
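
One possible first step is to find out whether the field is gone from the schema or is still present but empty, and since when. The sketch below assumes a hypothetical daily extract with `date` and `revenue` columns; the function name `first_missing_date` is also made up for this illustration.

```python
import pandas as pd

# Hypothetical daily extract: the `revenue` metric stops arriving on 2024-01-04.
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03",
                            "2024-01-04", "2024-01-05"]),
    "revenue": [120.0, 135.0, 128.0, None, None],
})

def first_missing_date(frame, metric, date_col="date"):
    """Return the first date on which `metric` is entirely null, or None."""
    if metric not in frame.columns:
        # The field was renamed or removed upstream: a schema change.
        raise KeyError(f"{metric!r} is not in the schema: {list(frame.columns)}")
    # Share of null values of the metric per date.
    null_share = frame[metric].isna().groupby(frame[date_col]).mean()
    fully_missing = null_share[null_share == 1.0]
    return fully_missing.index.min() if not fully_missing.empty else None

print(first_missing_date(df, "revenue"))
```

Knowing the first affected date narrows the search to deployments, schema migrations, or upstream changes made around that time.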

## Practical Machine Learning

In this challenge, you will showcase your knowledge in feature engineering, dimensionality reduction, model selection and evaluation, hyperparameter tuning, and any other techniques of machine learning.

There isn't a single correct solution to this challenge. What we would like to learn is your thinking process, which demonstrates your knowledge, experience, and creativity in developing machine learning models. Therefore, in addition to developing the model and optimizing its performance, you should also explain your thinking process and justify your decisions throughout the iterative problem-solving process.

The suggested time to spend on this challenge is 90-120 minutes. If you don't have time to finish all the tasks you plan to do, simply document the to-dos at the end of your response.

#### Instructions:

- Download the housing prices data set (housing_prices.csv). The data is big enough to showcase your thoughts but not so big that processing power will be a problem.
- Using Python, analyze the features and determine which feature set to select for modeling.
- Train and cross validate several regression models, attempting to accurately predict the SalePrice target variable.
- Evaluate all models and show comparison of performance metrics.
- State your thoughts on model performance, which model(s) you would select, and why.

#### Deliverables Checklist:

- Python code.
- Your thinking process.
- The features selected for machine learning.
- The results (e.g., performance metrics) of your selected model(s).

In [2]:
# Pandas and numpy for data manipulation
import pandas as pd
import numpy as np

# Matplotlib visualization
import matplotlib.pyplot as plt
%matplotlib inline

# Seaborn for visualization
import seaborn as sns
sns.set(font_scale = 2)

# Splitting data into training and testing
from sklearn.model_selection import train_test_split


In [3]:
data=pd.read_csv('housing_prices.csv')

In [4]:
def info(df):
    # this function applies many exploratory techniques to a given dataframe
    display("Head",
            df.head()
            .style
            
           )
    display("Data Types",
            df.dtypes
            .to_frame()
           )
    display("Data Types Count",
            df.dtypes
            .to_frame()[0]
            .value_counts()
            .to_frame()
           )
    display("Nans",
            df.isna()
            .sum()
            .to_frame()
            .sort_values(by=[0], ascending=False)
           )
    display("Descriptive Statistics",
            df.describe()
           )
    display("Correlation Matrix",
            df.select_dtypes(include="number").corr()
            .style.background_gradient(cmap='coolwarm')
            .format("{:.2f}")
           )
    
info(data)

'Head'

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


'Data Types'

Column,dtype
Id,int64
MSSubClass,int64
MSZoning,object
LotFrontage,float64
LotArea,int64
Street,object
Alley,object
LotShape,object
LandContour,object
Utilities,object


'Data Types Count'

dtype,Count
object,43
int64,35
float64,3


'Nans'

Column,NaN count
PoolQC,1453
MiscFeature,1406
Alley,1369
Fence,1179
FireplaceQu,690
LotFrontage,259
GarageYrBlt,81
GarageCond,81
GarageType,81
GarageFinish,81


'Descriptive Statistics'

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


'Correlation Matrix'

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
Id,1.0,0.011,-0.011,-0.033,-0.028,0.013,-0.013,-0.022,-0.05,-0.005,-0.006,-0.0079,-0.015,0.01,0.0056,-0.044,0.0083,0.0023,-0.02,0.0056,0.0068,0.038,0.003,0.027,-0.02,7.2e-05,0.017,0.018,-0.03,-0.00048,0.0029,-0.047,0.0013,0.057,-0.0062,0.021,0.00071,-0.022
MSSubClass,0.011,1.0,-0.39,-0.14,0.033,-0.059,0.028,0.041,0.023,-0.07,-0.066,-0.14,-0.24,-0.25,0.31,0.046,0.075,0.0035,-0.0023,0.13,0.18,-0.023,0.28,0.04,-0.046,0.085,-0.04,-0.099,-0.013,-0.0061,-0.012,-0.044,-0.026,0.0083,-0.0077,-0.014,-0.021,-0.084
LotFrontage,-0.011,-0.39,1.0,0.43,0.25,-0.059,0.12,0.089,0.19,0.23,0.05,0.13,0.39,0.46,0.08,0.038,0.4,0.1,-0.0072,0.2,0.054,0.26,-0.0061,0.35,0.27,0.07,0.29,0.34,0.089,0.15,0.011,0.07,0.041,0.21,0.0034,0.011,0.0074,0.35
LotArea,-0.033,-0.14,0.43,1.0,0.11,-0.0056,0.014,0.014,0.1,0.21,0.11,-0.0026,0.26,0.3,0.051,0.0048,0.26,0.16,0.048,0.13,0.014,0.12,-0.018,0.19,0.27,-0.025,0.15,0.18,0.17,0.085,-0.018,0.02,0.043,0.078,0.038,0.0012,-0.014,0.26
OverallQual,-0.028,0.033,0.25,0.11,1.0,-0.092,0.57,0.55,0.41,0.24,-0.059,0.31,0.54,0.48,0.3,-0.03,0.59,0.11,-0.04,0.55,0.27,0.1,-0.18,0.43,0.4,0.55,0.6,0.56,0.24,0.31,-0.11,0.03,0.065,0.065,-0.031,0.071,-0.027,0.79
OverallCond,0.013,-0.059,-0.059,-0.0056,-0.092,1.0,-0.38,0.074,-0.13,-0.046,0.04,-0.14,-0.17,-0.14,0.029,0.025,-0.08,-0.055,0.12,-0.19,-0.061,0.013,-0.087,-0.058,-0.024,-0.32,-0.19,-0.15,-0.0033,-0.033,0.07,0.026,0.055,-0.002,0.069,-0.0035,0.044,-0.078
YearBuilt,-0.013,0.028,0.12,0.014,0.57,-0.38,1.0,0.59,0.32,0.25,-0.049,0.15,0.39,0.28,0.01,-0.18,0.2,0.19,-0.038,0.47,0.24,-0.071,-0.17,0.096,0.15,0.83,0.54,0.48,0.22,0.19,-0.39,0.031,-0.05,0.0049,-0.034,0.012,-0.014,0.52
YearRemodAdd,-0.022,0.041,0.089,0.014,0.55,0.074,0.59,1.0,0.18,0.13,-0.068,0.18,0.29,0.24,0.14,-0.062,0.29,0.12,-0.012,0.44,0.18,-0.041,-0.15,0.19,0.11,0.64,0.42,0.37,0.21,0.23,-0.19,0.045,-0.039,0.0058,-0.01,0.021,0.036,0.51
MasVnrArea,-0.05,0.023,0.19,0.1,0.41,-0.13,0.32,0.18,1.0,0.26,-0.072,0.11,0.36,0.34,0.17,-0.069,0.39,0.085,0.027,0.28,0.2,0.1,-0.038,0.28,0.25,0.25,0.36,0.37,0.16,0.13,-0.11,0.019,0.061,0.012,-0.03,-0.006,-0.0082,0.48
BsmtFinSF1,-0.005,-0.07,0.23,0.21,0.24,-0.046,0.25,0.13,0.26,1.0,-0.05,-0.5,0.52,0.45,-0.14,-0.065,0.21,0.65,0.067,0.059,0.0043,-0.11,-0.081,0.044,0.26,0.15,0.22,0.3,0.2,0.11,-0.1,0.026,0.062,0.14,0.0036,-0.016,0.014,0.39


In [5]:
def missing_values_table(df):
    # Total missing values per column
    mis_val = df.isnull().sum()

    # Percentage of missing values per column
    mis_val_percent = 100 * mis_val / len(df)

    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)

    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})

    # Keep only columns with missing values, sorted by percentage descending
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)

    # Print some summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")

    # Return the dataframe with missing information
    return mis_val_table_ren_columns

missing_values_table(data)

Your selected dataframe has 81 columns.
There are 19 columns that have missing values.


Column,Missing Values,% of Total Values
PoolQC,1453,99.5
MiscFeature,1406,96.3
Alley,1369,93.8
Fence,1179,80.8
FireplaceQu,690,47.3
LotFrontage,259,17.7
GarageType,81,5.5
GarageYrBlt,81,5.5
GarageFinish,81,5.5
GarageQual,81,5.5


In [6]:
data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [7]:
#Deleting columns with NAN values
datad=data.dropna(axis='columns')
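
Dropping every column that contains a NaN also discards columns that are only partly missing, such as `LotFrontage` (17.7% missing in the table above). A gentler alternative, sketched here on a made-up toy frame, is to drop only the mostly empty columns and impute the rest. The 50% threshold and the median imputation are assumptions for illustration, not choices made in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],   # 25% missing: keep and impute
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],    # 75% missing: drop
})

# Keep only columns whose share of missing values is at most 50%.
missing_share = toy.isna().mean()
kept = toy.loc[:, missing_share <= 0.5].copy()

# Fill the remaining gaps in numeric columns with the column median.
numeric_cols = kept.select_dtypes(include="number").columns
kept[numeric_cols] = kept[numeric_cols].fillna(kept[numeric_cols].median())

print(kept)
```

When the data is later split into training and test sets, the medians would ideally be computed on the training portion only.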

In [8]:
datad.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,...,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,8450,Pave,Reg,Lvl,AllPub,Inside,Gtl,...,0,0,0,0,0,2,2008,WD,Normal,208500
1,2,20,RL,9600,Pave,Reg,Lvl,AllPub,FR2,Gtl,...,0,0,0,0,0,5,2007,WD,Normal,181500
2,3,60,RL,11250,Pave,IR1,Lvl,AllPub,Inside,Gtl,...,0,0,0,0,0,9,2008,WD,Normal,223500
3,4,70,RL,9550,Pave,IR1,Lvl,AllPub,Corner,Gtl,...,272,0,0,0,0,2,2006,WD,Abnorml,140000
4,5,60,RL,14260,Pave,IR1,Lvl,AllPub,FR2,Gtl,...,0,0,0,0,0,12,2008,WD,Normal,250000


In [9]:
round(datad[['SalePrice','YrSold']].describe())

Unnamed: 0,SalePrice,YrSold
count,1460.0,1460.0
mean,180921.0,2008.0
std,79443.0,1.0
min,34900.0,2006.0
25%,129975.0,2007.0
50%,163000.0,2008.0
75%,214000.0,2009.0
max,755000.0,2010.0


In [10]:
# Matplotlib visualization
import matplotlib.pyplot as plt

# Histogram of the target variable
plt.figure(figsize=(8, 8))
plt.hist(data['SalePrice'].dropna(), bins=20, edgecolor='black')
plt.xlabel('Sale Price')
plt.ylabel('Count')
plt.title('House Sale Prices');

In [None]:
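fig, ax = plt.subplots(figsize=(50,50))
# Correlate the numeric columns only and draw on the axes created above
sns.heatmap(data.select_dtypes(include="number").corr(), ax=ax)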


In [None]:
# TODO: scale the features with StandardScaler and remove collinear features


from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

In [None]:

target = datad['SalePrice']
carart = datad.drop(columns = ['SalePrice'])
    

In [None]:
# Split into 80% training and 20% testing set
X, X_test, y, y_test = train_test_split(carart, target, test_size = 0.2, random_state = 42)
print(X.shape)
print(X_test.shape)
print(y.shape)
print(y_test.shape)

In [None]:
# Function to calculate mean absolute error
def mae(y_true, y_pred):
    return np.mean(abs(y_true - y_pred))

In [11]:
# Import 'r2_score'
from sklearn.metrics import r2_score

def performance_metric(y_true, y_predict):
    """ Calculates and returns the performance score between 
        true and predicted values based on the metric chosen. """
    
    # Calculate the performance score between 'y_true' and 'y_predict'
    score = r2_score(y_true, y_predict)
    
    # Return the score
    return score

In [12]:
# Select only the numeric columns and drop rows with NaN so the models can be fitted

#pd.to_numeric(data['column'], errors='coerce').notnull().all()
datan=data.select_dtypes(include="number").dropna()


#  MODELS                   


In [13]:
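y = datan["SalePrice"]

# Id is only a row identifier, so it is dropped together with the target
X = datan.drop(["Id", "SalePrice"], axis=1)

# Fixed random_state so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)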

### LogisticRegression

In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
lr = LogisticRegression()


In [15]:
lr.fit(X_train,y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [16]:
import numpy as np

y_pred=lr.predict(X_test)


In [17]:
# Compare the predictions with the held-out targets (same length as y_pred)
np.corrcoef(y_pred, y_test)

### Confusion_matrix

In [None]:
from sklearn.metrics import confusion_matrix
y_pred = lr.predict(X_test)
print(confusion_matrix(y_test, y_pred))
acc = lr.score(X_test,y_test)*100 
acc


### Precision

In [None]:
from sklearn.metrics import precision_score, recall_score

# SalePrice has many distinct values, so the scores are micro-averaged over all classes
print("Precision Score : ", precision_score(y_test, y_pred, average='micro'))
print("Recall Score : ", recall_score(y_test, y_pred, average='micro'))

### Accuracy

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

### Recall

In [None]:
from sklearn.metrics import recall_score
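# The default average='binary' fails on a multiclass target, so micro-average instead
recall_score(y_test, y_pred, average='micro')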

### F1

In [None]:
from sklearn.metrics import f1_score
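# The default average='binary' fails on a multiclass target, so micro-average instead
f1_score(y_test, y_pred, average='micro')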

## KNeighborsClassifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors = 3) 
knn.fit(X_train, y_train)

In [None]:
y_pred = knn.predict(X_test)

acc = knn.score(X_test, y_test)*100
print(acc)
print(confusion_matrix(y_test, y_pred))

### RobustScaler

In [None]:
from sklearn.preprocessing import RobustScaler

# Split first, then fit the scaler on the training set only to avoid leakage
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
robust = RobustScaler().fit(X_train)
X_train, X_test = robust.transform(X_train), robust.transform(X_test)
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print(confusion_matrix(y_test, y_pred))
acc = lr.score(X_test, y_test) * 100
acc

In [None]:
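from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

from sklearn.pipeline import make_pipeline

# Degrees are capped at 3: the number of polynomial terms grows
# combinatorially with the number of input features.
for k in range(1, 4):
    poly_model = make_pipeline(StandardScaler(), PolynomialFeatures(k), LinearRegression())
    # Fit on the training split only so the test score is an honest estimate
    poly_model.fit(X_train, y_train)

    print(k, poly_model.score(X_test, y_test))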
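
Because `SalePrice` is a continuous target, classifiers such as `LogisticRegression` and `KNeighborsClassifier` treat every distinct price as its own class. The task asks for regression models compared with cross-validation, which could be sketched as follows. This is a minimal example on synthetic data; the `X_demo` and `y_demo` arrays, the list of models, and their hyperparameters are assumptions for illustration and would be replaced by the selected housing features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the numeric housing features and the SalePrice target.
X_demo = rng.normal(size=(300, 5))
y_demo = (180_000 + 40_000 * X_demo[:, 0] + 15_000 * X_demo[:, 1]
          + rng.normal(scale=10_000, size=300))

# Scaling lives inside each pipeline so it is re-fitted on every training fold.
models = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "knn": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    # scikit-learn reports negated MAE so that higher is always better.
    neg_mae = cross_val_score(model, X_demo, y_demo, cv=5,
                              scoring="neg_mean_absolute_error")
    results[name] = -neg_mae.mean()
    print(f"{name:>7}: MAE = {results[name]:,.0f}")
```

The same loop works with `scoring="r2"`, which gives a comparison table of regression metrics in place of accuracy, precision, and recall.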