## **Time to Build Something – End-to-end Machine Learning project using different machine learning algorithms**

##### Asterios D. Pantousas

* The primary purpose of this notebook is to help anyone understand Machine Learning by coding.


* The code is meant to be easy to follow. The difficulty increases over the course of the series, because I want to help anyone understand the concepts, from the simplest to the most complicated algorithms and techniques.


* The goal of this series of notebooks is to solve one particular machine learning problem, [*Kaggle Competitions House Prices - Advanced Regression Techniques*](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview), with different approaches, from very simple to very robust and complicated, and to compare the final score we get from every notebook.


In this particular notebook we start with the simplest possible approach to a machine learning problem: put everything together with the bare minimum of data cleaning.

In [139]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib

import matplotlib.pyplot as plt
from scipy.stats import skew
from scipy.stats import pearsonr

%matplotlib inline

In [140]:
# Acquire the train and test data for modeling
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

In [141]:
train.head()

,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [142]:
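# Stack the feature columns of train and test (everything between
# MSSubClass and SaleCondition, so Id and SalePrice are excluded)
# so that both sets go through exactly the same preprocessing
data = pd.concat([train.loc[:,"MSSubClass":"SaleCondition"],
                  test.loc[:,"MSSubClass":"SaleCondition"]])
data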

,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,2,2008,WD,Normal
1,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,0,,,,0,5,2007,WD,Normal
2,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,0,,,,0,9,2008,WD,Normal
3,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,0,,,,0,2,2006,WD,Abnorml
4,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,0,,,,0,12,2008,WD,Normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1454,160,RM,21.0,1936,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,6,2006,WD,Normal
1455,160,RM,21.0,1894,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,4,2006,WD,Abnorml
1456,20,RL,160.0,20000,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,9,2006,WD,Abnorml
1457,85,RL,62.0,10441,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,MnPrv,Shed,700,7,2006,WD,Normal
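
To make the stacking step above concrete, here is a minimal sketch on two tiny hand-made frames. The names `toy_train`, `toy_test` and `toy_data` and all the values are hypothetical stand-ins for the real CSV files.

```python
import pandas as pd

# Hypothetical stand-ins for train.csv and test.csv
toy_train = pd.DataFrame({
    "Id": [1, 2],
    "LotArea": [8450, 9600],
    "Street": ["Pave", "Grvl"],
    "SalePrice": [208500, 181500],
})
toy_test = pd.DataFrame({
    "Id": [3],
    "LotArea": [11250],
    "Street": ["Pave"],
})

# Label-based column slicing keeps only the feature columns,
# dropping Id (before the slice) and SalePrice (after it)
toy_data = pd.concat([toy_train.loc[:, "LotArea":"Street"],
                      toy_test.loc[:, "LotArea":"Street"]])

print(toy_data.shape)           # (3, 2)
print(toy_data.index.tolist())  # [0, 1, 0]
```

Note that `pd.concat` keeps the original row labels, so the index restarts at 0 for the test rows. Any later split back into train and test therefore has to rely on row position, not on index labels.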


### Data preparation

In [143]:
# One-hot encode every categorical (text) column; numeric columns are left as they are
data = pd.get_dummies(data)
data.head()

,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,60,65.0,8450,7,5,2003,2003,196.0,706.0,0.0,...,0,0,0,1,0,0,0,0,1,0
1,20,80.0,9600,6,8,1976,1976,0.0,978.0,0.0,...,0,0,0,1,0,0,0,0,1,0
2,60,68.0,11250,7,5,2001,2002,162.0,486.0,0.0,...,0,0,0,1,0,0,0,0,1,0
3,70,60.0,9550,7,5,1915,1970,0.0,216.0,0.0,...,0,0,0,1,1,0,0,0,0,0
4,60,84.0,14260,8,5,2000,2000,350.0,655.0,0.0,...,0,0,0,1,0,0,0,0,1,0
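
As a small illustration of what `pd.get_dummies` did above, the sketch below encodes a hypothetical three-row frame (`toy` and its values are made up). Passing `dtype=int` is an assumption added here so that the output shows 0/1 regardless of the pandas version; the notebook cell itself uses the default.

```python
import pandas as pd

# Hypothetical frame with one numeric and one categorical column
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Grvl", "Pave"],
})

# Each category of Street becomes its own indicator column;
# the numeric column LotArea passes through unchanged
encoded = pd.get_dummies(toy, dtype=int)

print(encoded.columns.tolist())  # ['LotArea', 'Street_Grvl', 'Street_Pave']
print(encoded["Street_Pave"].tolist())  # [1, 0, 1]
```

Only text columns are expanded this way. A column such as `MSSubClass`, which stores category codes as integers, is treated as numeric and left as it is.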


In [144]:
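# Fill every missing value with the mean of its column
# (the mean is computed over the train and test rows combined)
data = data.fillna(data.mean())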
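
The mean imputation in the cell above can be checked on a tiny example. This is only a sketch: `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing LotFrontage value
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0],
    "LotArea": [8450, 9600, 11250],
})

# toy.mean() skips NaN by default, so the LotFrontage mean is (65 + 80) / 2 = 72.5
filled = toy.fillna(toy.mean())

print(filled["LotFrontage"].tolist())  # [65.0, 72.5, 80.0]
```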

In [145]:
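# Split the combined frame back by row position: the first
# train.shape[0] rows came from train, the rest from test
X_train = data[:train.shape[0]]
X_test = data[train.shape[0]:]
y = train.SalePrice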

In [149]:
from sklearn.linear_model import LinearRegression
# Fit an ordinary least-squares linear regression on all encoded features
simplest_model = LinearRegression()
simplest_reg = simplest_model.fit(X_train,y)

In [151]:
predictions = simplest_reg.predict(X_test)
predictions

array([112689.58663729, 159804.00848064, 186573.67809446, ...,
       179181.14326985, 115065.78432814, 223270.04135271])
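
To see what `fit` and `predict` do in isolation, here is a minimal sketch on made-up data. The names `X_toy`, `y_toy`, `toy_model` and `toy_pred` are hypothetical, and the points are chosen to lie exactly on the line y = 2x + 1 so the result is easy to verify.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data lying exactly on the line y = 2x + 1
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = 2 * X_toy[:, 0] + 1

# fit() returns the fitted estimator itself, so the call can be chained
toy_model = LinearRegression().fit(X_toy, y_toy)

# Predict for an unseen input x = 5
toy_pred = toy_model.predict(np.array([[5.0]]))

print(toy_model.coef_, toy_model.intercept_)  # approximately [2.] and 1.0
print(toy_pred)                               # approximately [11.]
```

Because `fit` returns the estimator, `simplest_reg` and `simplest_model` in the notebook refer to the same fitted object.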

In [152]:
# Build the submission file: one row per test house with its Id and predicted SalePrice
sub = pd.DataFrame()
sub['Id'] = test.Id
sub['SalePrice'] = predictions
sub.to_csv('1st_VerySimple.csv',index=False)

### We upload the CSV to the Kaggle competition House Prices - Advanced Regression Techniques and get a score of 0.19266
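
For context on that number: the competition scores submissions by the root-mean-squared error between the logarithm of the predicted price and the logarithm of the observed price, so lower is better. Below is a minimal sketch of that metric. The helper name `rmse_log` and the example prices are hypothetical, and Kaggle's own scoring code is not shown in this notebook.

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """RMSE between log(y_pred) and log(y_true) (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # The logarithm is undefined for non-positive values; a plain linear
    # regression can produce such predictions, so fail loudly here
    if np.any(y_true <= 0) or np.any(y_pred <= 0):
        raise ValueError("all prices must be strictly positive")
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Made-up prices, only to show the call
example_score = rmse_log([200000, 150000], [220000, 135000])
print(example_score)
```

Working on the log scale means that an error of 10% on a cheap house costs about as much as an error of 10% on an expensive one.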