
In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

In [None]:
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

In [None]:
sns.set_style("darkgrid")

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

In [None]:
!pip list | grep imb

In [None]:
!mkdir data

# Data Importing

We imported the AmesHousing data from Google Drive.

In [None]:
import gdown

urls = ['https://docs.google.com/uc?export=download&id=1cs0UFK8jfF8BhTkB2YtIBxveUG52TbKF']
outputs = ['AmesHousing.csv']
for url,output in zip(urls,outputs):
  gdown.download(url, f'data/{output}', quiet=False)

In [None]:
df=pd.read_csv('data/AmesHousing.csv')

We started by holding out all of the 2010 data as a validation set. This data is not used for training our model.

In [None]:
test = df[df['Yr Sold']==2010]
train = df[df['Yr Sold']<=2009]
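
As a minimal sketch of the year-based holdout described above, the snippet below splits a small made-up frame with a boolean mask. The frame `toy` and its values are assumptions for illustration only; the mask `~is_holdout` matches the `<= 2009` filter only when 2010 is the latest year in the data.

```python
import pandas as pd

# Toy stand-in for the Ames data; the values are made up for illustration.
toy = pd.DataFrame({
    "Yr Sold": [2006, 2007, 2008, 2009, 2010, 2010],
    "SalePrice": [150000, 182000, 139000, 210000, 175000, 198000],
})

holdout_year = 2010
is_holdout = toy["Yr Sold"] == holdout_year

# .copy() gives independent frames, so assigning new columns to them later
# does not trigger pandas' SettingWithCopyWarning.
toy_test = toy[is_holdout].copy()
toy_train = toy[~is_holdout].copy()

# Every row lands in exactly one of the two sets.
print(len(toy_train), len(toy_test))
```

Splitting by sale year rather than at random keeps the validation set strictly later in time than the training data.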

In [None]:
test.head()

In [None]:
test.info()

In [None]:
train.head()

In [None]:
train.info()

# Exploratory Data Analysis

We applied .corr() to the train set to identify general trends in sale price across time, neighborhood, and home characteristics.

However, since Neighborhood is a text variable, we examine it with a box plot below.

In [None]:
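# Restrict to numeric columns; recent pandas versions raise an error on text columns.
train.select_dtypes(include='number').corr()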

In [None]:
corr = train.select_dtypes(include='number').corr()

corr.style.background_gradient(cmap='coolwarm')

In [None]:
sns.heatmap(train.select_dtypes(include='number').corr())

plt.savefig("Plotting_Correlation_HeatMap.jpg")

In [None]:
train_corr = train.select_dtypes(include='number').corr()
train_corr_sort = train_corr.sort_values('SalePrice', ascending = False)
train_corr_sort

We extracted the five variables most strongly correlated with SalePrice. We take six rows because the first row is SalePrice itself.

In [None]:
train_corr_sort['SalePrice'].head(6)

We drew a scatter plot of Gr Liv Area against SalePrice, since Gr Liv Area has the highest correlation with SalePrice among the numeric variables.

As the following graph shows, SalePrice tends to increase as Gr Liv Area increases.

In [None]:
#scatter plot Gr Liv Area / SalePrice
var = 'Gr Liv Area'
data = pd.concat([train['SalePrice'], train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000), s=32);

The following scatter plot shows the relationship between Total Bsmt SF and SalePrice. As in the graph above, SalePrice tends to increase as Total Bsmt SF increases.

In [None]:
#scatter plot totalbsmtsf/saleprice
var = 'Total Bsmt SF'
data = pd.concat([train['SalePrice'], train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));

The graph below visualizes the relationship between garage area and sale price as a scatter plot. We found that the correlation between the two was high.

In [None]:
sns.lmplot(x='Garage Area',y='SalePrice',data=train)

The three graphs below show the relationship between categorical variables and SalePrice.

As expected, SalePrice increases as Overall Qual increases, which we show as a box plot.

In [None]:
#box plot overallqual/saleprice
var = 'Overall Qual'
data = pd.concat([train['SalePrice'], train[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);

The following box plot visualizes the relationship between Neighborhood and SalePrice. We used it to explore in which neighborhoods our model might perform well and in which it might not.

As this box plot shows, the MeadowV neighborhood has the lowest SalePrice in our data, and the StoneBr neighborhood has the highest.

In [None]:
#box plot Neighborhood/saleprice
var = 'Neighborhood'
data = pd.concat([train['SalePrice'], train[var]], axis=1)
f, ax = plt.subplots(figsize=(32, 10))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);

The graph below visualizes the relationship between Garage Cars and SalePrice as a box plot. We found that the correlation between the two was high.

However, when the garage capacity is 4 cars or more, the SalePrice tends to decrease.

In [None]:
plt.figure(figsize=(16,8))
sns.boxplot(x='Garage Cars',y='SalePrice',data=train)
plt.show()

In [None]:
corr = train.select_dtypes(include='number').corr()

To find the 38 variables most highly correlated with SalePrice, we used the code below. We take 39 rows because the first row is SalePrice itself.

In [None]:
train_corr_sort['SalePrice'].head(39)

The variables below were extracted so that we use only those whose correlation coefficient with SalePrice is above 0.3. We judged that variables with a lower correlation did not need to be considered.

In [None]:
corr[corr['SalePrice']>0.3].index
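
One thing worth noting about a `> 0.3` filter is that it keeps only positive correlations. The sketch below, on a made-up frame (`toy`, `area`, `age`, `noise`, and `price` are all hypothetical names), contrasts that filter with one based on the absolute value, which would also keep strongly negative predictors. It is an illustration of the idea, not the selection used in this notebook.

```python
import pandas as pd

# Made-up data: 'area' rises with price, 'age' falls with price,
# and 'noise' is nearly unrelated to price.
toy = pd.DataFrame({
    "area":  [50, 60, 80, 100, 120, 150],
    "age":   [60, 50, 35, 20, 10, 2],
    "noise": [1, 3, 3, 1, 1, 3],
    "price": [100, 120, 155, 210, 240, 300],
})

# Correlation of every feature with the target, excluding the target itself.
corr_with_price = toy.corr()["price"].drop("price")

# Same style of filter as above: positive correlations only.
positive_only = corr_with_price[corr_with_price > 0.3].index.tolist()

# Alternative: filter on magnitude, so strong negative correlations survive too.
by_magnitude = corr_with_price[corr_with_price.abs() > 0.3].index.tolist()

print(positive_only)
print(by_magnitude)
```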

To remove unnecessary variables from the train set and test set, we selected the following list of columns from each data set.

In [None]:
train = train[['Lot Frontage', 'Overall Qual', 'Year Built', 'Year Remod/Add',
       'Mas Vnr Area', 'BsmtFin SF 1', 'Total Bsmt SF', '1st Flr SF',
       'Gr Liv Area', 'Full Bath', 'TotRms AbvGrd', 'Fireplaces',
       'Garage Yr Blt', 'Garage Cars', 'Garage Area', 'Wood Deck SF',
       'Open Porch SF', 'SalePrice']]
test=test[['Lot Frontage', 'Overall Qual', 'Year Built', 'Year Remod/Add',
       'Mas Vnr Area', 'BsmtFin SF 1', 'Total Bsmt SF', '1st Flr SF',
       'Gr Liv Area', 'Full Bath', 'TotRms AbvGrd', 'Fireplaces',
       'Garage Yr Blt', 'Garage Cars', 'Garage Area', 'Wood Deck SF',
       'Open Porch SF', 'SalePrice']]

# Building a sales price prediction model

Before building the sales price prediction model, we checked how much data was missing in each column and removed the columns with too many missing values. This reduces the risk that missing data biases the model.

In [None]:
total = train.isnull().sum().sort_values(ascending=False)
percent = (train.isnull().sum()/train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(25)

We removed variables with more than 136 missing values from the train set.

In [None]:
# Pass axis by keyword; the positional form is no longer accepted by recent pandas.
train = train.drop((missing_data[missing_data['Total'] > 136]).index, axis=1)

Below is the result.

In [None]:
train.isnull().sum().sort_values(ascending=False).head(20)

We did the same for the test data.

In [None]:
total_test = test.isnull().sum().sort_values(ascending=False)
percent_test = (test.isnull().sum()/test.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total_test, percent_test], axis=1, keys=['Total', 'Percent'])
missing_data.head(25)

In [None]:
# Pass axis by keyword; the positional form is no longer accepted by recent pandas.
test = test.drop((missing_data[missing_data['Total'] > 23]).index, axis=1)

Below is the result.

In [None]:
test.isnull().sum().sort_values(ascending=False).head(20)

# Handling missing data

We checked whether the columns with too many missing values had been removed properly.

In [None]:
train.head()

In [None]:
test.head()

Next we pick features for the model. We use the correlation matrix and pick the variables most correlated with SalePrice.

In [None]:
k = 15 
plt.figure(figsize=(16,8))
corrmat = train.corr()
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

We filled the missing values in the train set with the column means.

In [None]:
train['Garage Yr Blt'] = train['Garage Yr Blt'].fillna(train['Garage Yr Blt'].mean())
train['Mas Vnr Area'] = train['Mas Vnr Area'].fillna(train['Mas Vnr Area'].mean())
train['BsmtFin SF 1'] = train['BsmtFin SF 1'].fillna(train['BsmtFin SF 1'].mean())
train['Total Bsmt SF'] = train['Total Bsmt SF'].fillna(train['Total Bsmt SF'].mean())
train['Garage Area'] = train['Garage Area'].fillna(train['Garage Area'].mean())
train['Garage Cars'] = train['Garage Cars'].fillna(train['Garage Cars'].mean())

In [None]:
train = train[cols]

In [None]:
cols

In [None]:
test=test[cols]

In [None]:
cols

In [None]:
test.isnull().sum().sort_values(ascending=False).head(20)

We checked whether the missing values had been filled in correctly.

In [None]:
train.head()

In [None]:
test.head()

We filled the missing values in the test set with the column means as well.

In [None]:
test['Mas Vnr Area'] = test['Mas Vnr Area'].fillna(test['Mas Vnr Area'].mean())
test['Garage Yr Blt'] = test['Garage Yr Blt'].fillna(test['Garage Yr Blt'].mean())
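
A common alternative to filling each set with its own means is to compute the means on the training data and reuse them for the test data, so that no information from the held-out set enters preprocessing. The sketch below shows that variant on two tiny made-up frames (`toy_train` and `toy_test` are hypothetical); it is not the approach used in the cells above.

```python
import numpy as np
import pandas as pd

# Tiny made-up frames with one missing value each.
toy_train = pd.DataFrame({"Mas Vnr Area": [0.0, 100.0, np.nan, 200.0]})
toy_test = pd.DataFrame({"Mas Vnr Area": [50.0, np.nan]})

# Column means are computed on the training data only (NaN is skipped).
train_means = toy_train.mean()

# fillna with a Series fills each column with its matching mean.
toy_train_filled = toy_train.fillna(train_means)
toy_test_filled = toy_test.fillna(train_means)

print(toy_test_filled)
```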

# Linear Regression

We now train our regression model. We first split the data into an X array that contains the features to train on and a y array with the target variable, in this case the SalePrice column. Every remaining column is numeric, so no text columns need to be dropped before fitting.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train.drop('SalePrice', axis=1), train['SalePrice'], test_size=0.3, random_state=101)


We then standardize the features and the target.

In [None]:
# Reshape the targets into column vectors, since StandardScaler expects 2D input.
y_train = y_train.values.reshape(-1,1)
y_test = y_test.values.reshape(-1,1)

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
# Fit each scaler on the training split only, then reuse it on the test split.
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
y_train = sc_y.fit_transform(y_train)
y_test = sc_y.transform(y_test)
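
When the target is standardized, error metrics come out in standard-deviation units rather than dollars. The following sketch, on synthetic data (all names and numbers here are assumptions), shows one way to fit scalers on the training split only and to map predictions back with `inverse_transform` so that MAE is reported in the original units.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

# Synthetic data: price depends mostly on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 180000 + 40000 * X[:, 0] + 5000 * rng.normal(size=200)
X_tr, X_te = X[:150], X[150:]
y_tr, y_te = y[:150], y[150:]

# Fit both scalers on the training split only.
x_scaler = StandardScaler().fit(X_tr)
y_scaler = StandardScaler().fit(y_tr.reshape(-1, 1))

model = LinearRegression()
model.fit(x_scaler.transform(X_tr),
          y_scaler.transform(y_tr.reshape(-1, 1)).ravel())

# Predict in scaled units, then convert back to the original units.
pred_scaled = model.predict(x_scaler.transform(X_te))
pred_dollars = y_scaler.inverse_transform(pred_scaled.reshape(-1, 1)).ravel()

mae_dollars = mean_absolute_error(y_te, pred_dollars)
print(f"MAE in original units: {mae_dollars:.0f}")
```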

In [None]:
X_train

Creating and Training the Model 

In [None]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()

In [None]:
lm.fit(X_train,y_train)
print(lm)

Model Evaluation

We evaluate the model by inspecting its coefficients, which we can then interpret.

In [None]:
print(lm.intercept_)

In [None]:
print(lm.coef_)

Predictions from our Model 

We are going to grab predictions off our test set and see how well it did.

In [None]:
predictions = lm.predict(X_test)
predictions= predictions.reshape(-1,1)

We visualized Y Test and Predicted Y as a scatter plot. The points follow a clear linear trend.

In [None]:
plt.figure(figsize=(15,8))
plt.scatter(y_test,predictions)
plt.xlabel('Y Test')
plt.ylabel('Predicted Y')
plt.show()

To evaluate the performance of our model, we calculated the MAE value.

In [None]:
from sklearn import metrics

In [None]:
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
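
For reference, MAE is the average of the absolute differences between actual and predicted values. The short sketch below computes it by hand on three made-up prices, which is the same quantity `mean_absolute_error` returns.

```python
import numpy as np

# Made-up actual and predicted prices for three houses.
actual = np.array([200000.0, 150000.0, 310000.0])
predicted = np.array([190000.0, 165000.0, 300000.0])

# Absolute errors are 10000, 15000 and 10000; MAE is their mean.
abs_errors = np.abs(actual - predicted)
mae = abs_errors.mean()
print(mae)
```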

# Gradient Boosting Regression

Gradient Boosting trains many models in a gradual, additive and sequential manner. 

In [None]:
from sklearn import ensemble
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
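# 'squared_error' is the current name of the least-squares loss (formerly 'ls').
params = {'n_estimators': 100, 'max_depth': 4, 'min_samples_split': 2,
          'learning_rate': 0.05, 'loss': 'squared_error'}
clf = ensemble.GradientBoostingRegressor(**params)

# ravel() passes y as a 1D array, which the regressor expects.
clf.fit(X_train, y_train.ravel())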
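
To make "gradual, additive and sequential" concrete, here is a hand-rolled sketch of least-squares boosting on synthetic data: each small tree is fitted to the residuals of the current prediction and added with a learning rate. It is a simplified illustration under assumed settings (50 rounds, depth 2, rate 0.1), not a reproduction of what `GradientBoostingRegressor` does internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional regression problem.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # stage 0: a constant model
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    residuals = y - prediction           # what the current ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)   # add a small correction
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.3f} -> {mse_history[-1]:.3f}")
```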

In [None]:
clf_pred=clf.predict(X_test)
clf_pred= clf_pred.reshape(-1,1)

To evaluate the performance of our model, we calculated the MAE value.

In [None]:
print('MAE:', metrics.mean_absolute_error(y_test, clf_pred))

We visualized Y Test and Predicted Y as a scatter plot. The points follow a clear linear trend.

In [None]:
plt.figure(figsize=(15,8))
plt.scatter(y_test,clf_pred, c= 'brown')
plt.xlabel('Y Test')
plt.ylabel('Predicted Y')
plt.show()

# Decision Tree Regression

The decision tree is a simple machine learning model for getting started with regression tasks.

In [None]:
from sklearn.tree import DecisionTreeRegressor
dtreg = DecisionTreeRegressor(random_state = 100)
dtreg.fit(X_train, y_train)

In [None]:
params={'max_depth':[2,3,4],
        'criterion':["squared_error", "friedman_mse"],
        'ccp_alpha':[0.0, 0.1, 1]}
gridsearch_dt =GridSearchCV(dtreg,param_grid=params,cv=5)
gridsearch_dt.fit(X_train,y_train)

In [None]:
gridsearch_dt.best_params_

In [None]:
gridsearch_dt.best_estimator_
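
A grid search only helps if the tuned model is the one used for prediction. The sketch below, on synthetic data, shows that calling `predict` on a fitted `GridSearchCV` (with the default `refit=True`) uses `best_estimator_`. The choice of `scoring="neg_mean_absolute_error"` is an assumption made here so that the search optimizes the same metric the notebook reports; the default scoring for a regressor is R².

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with two features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = 3 * X[:, 0] + np.sin(X[:, 1]) + 0.2 * rng.normal(size=300)
X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [2, 3, 4]},
    scoring="neg_mean_absolute_error",   # assumed choice, see lead-in
    cv=5,
)
search.fit(X_tr, y_tr)

# With refit=True (the default), predict() delegates to best_estimator_.
tuned_pred = search.predict(X_te)
tuned_mae = mean_absolute_error(y_te, tuned_pred)
print(search.best_params_, tuned_mae)
```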

In [None]:
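# Predict with the tuned tree from the grid search rather than the untuned dtreg.
dtr_pred = gridsearch_dt.best_estimator_.predict(X_test)

dtr_pred= dtr_pred.reshape(-1,1)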

To evaluate the performance of our model, we calculated the MAE value.

In [None]:
print('MAE:', metrics.mean_absolute_error(y_test, dtr_pred))

We visualized Y Test and Predicted Y as a scatter plot. The points follow a fairly linear trend.

In [None]:
plt.figure(figsize=(15,8))
plt.scatter(y_test,dtr_pred,c='green')
plt.xlabel('Y Test')
plt.ylabel('Predicted Y')
plt.show()

# Support Vector Machine

Support Vector Machines can also be used for regression while keeping the main features that characterize the algorithm. Support Vector Regression (SVR) uses the same principles as the SVM for classification, with only a few minor differences.

In [None]:
from sklearn.svm import SVR
svr = SVR(kernel = 'rbf')
# ravel() passes y as a 1D array, which SVR expects.
svr.fit(X_train, y_train.ravel())

In [None]:
svr_pred = svr.predict(X_test)
svr_pred= svr_pred.reshape(-1,1)

To evaluate the performance of our model, we calculated the MAE value.

SVM showed the lowest MAE value.

In [None]:
print('MAE:', metrics.mean_absolute_error(y_test, svr_pred))

We visualized Y Test and Predicted Y as a scatter plot. The points follow a linear trend but are quite spread out.

In [None]:
plt.figure(figsize=(15,8))
plt.scatter(y_test,svr_pred, c='red')
plt.xlabel('Y Test')
plt.ylabel('Predicted Y')
plt.show()

# Random Forest

A Random Forest is an ensemble technique capable of performing both regression and classification tasks with the use of multiple decision trees and a technique called Bootstrap Aggregation, commonly known as bagging.

In [None]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators = 100, random_state = 0)
# ravel() passes y as a 1D array, which the regressor expects.
rfr.fit(X_train, y_train.ravel())
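
The bagging idea mentioned above can be sketched by hand: train each tree on a bootstrap sample (rows drawn with replacement) and average the predictions. This toy version on synthetic data covers bagging only; a real random forest additionally considers a random subset of features at each split, which is omitted here.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional regression problem.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=200)

n_trees = 25
trees = []
for seed in range(n_trees):
    # Bootstrap sample: draw as many rows as the data has, with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeRegressor(random_state=seed).fit(X[idx], y[idx]))

# Aggregate: the ensemble prediction is the mean over all trees.
X_new = np.array([[2.5], [7.0]])
per_tree = np.stack([tree.predict(X_new) for tree in trees])
bagged = per_tree.mean(axis=0)
print(bagged)
```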

In [None]:
rfr_pred= rfr.predict(X_test)
rfr_pred = rfr_pred.reshape(-1,1)

To evaluate the performance of our model, we calculated the MAE value.

In [None]:
print('MAE:', metrics.mean_absolute_error(y_test, rfr_pred))


In [None]:
plt.figure(figsize=(15,8))
plt.scatter(y_test,rfr_pred, c='orange')
plt.xlabel('Y Test')
plt.ylabel('Predicted Y')
plt.show()

In conclusion, when we evaluated our prediction model with Random Forest, the MAE value was the lowest at 17141.714.

# Renovation value calculator

We made a renovation value calculator, building on what we learned in class (lab session #4). It uses linear regression.

In [None]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import sklearn.linear_model as skl

In [None]:
!mkdir -p data

Column names that contain spaces caused an error, so we created a file in which the spaces were removed from some of the column names.

In [None]:
import gdown

urls = ['https://docs.google.com/uc?export=download&id=1DSBXiDE-ZZj14lG4y9j8Uu-3UXzxC-JN']
outputs = ['AmesHousing_train_GarageCars.csv']
for url,output in zip(urls,outputs):
  gdown.download(url, f'data/{output}', quiet=False)

In [None]:
lr = pd.read_csv('data/AmesHousing_train_GarageCars.csv')
lr.info()

We model the relationship between SalePrice and each of Garage Cars, Fireplaces, and Half Bath with a linear regression.

We chose these three variables because we expected them to correlate strongly with SalePrice and because they correspond to home features.

In [None]:
plt.figure(figsize=(8,6))
# Take x and y from the same frame and pass them by keyword, as recent seaborn requires.
sns.regplot(x='GarageCars', y='SalePrice', data=lr, order=1, ci=None, scatter_kws={'color':'r', 's':9})

In the results below, we found that each additional vehicle the garage can hold increases the SalePrice by $68,380.

In [None]:
est = smf.ols('SalePrice ~ GarageCars', lr).fit()
print(est.summary().tables[1])

In the results below, we found that each additional fireplace in the house increases the SalePrice by $59,700.

In [None]:
est = smf.ols('SalePrice ~ Fireplaces', lr).fit()
print(est.summary().tables[1])

In the results below, we found that each additional half bathroom in the house increases the SalePrice by $45,470.

In [None]:
est = smf.ols('SalePrice ~ HalfBath', lr).fit()
print(est.summary().tables[1])
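
For a regression with a single predictor, the slope reported in the coefficient table is the covariance of the predictor and the target divided by the variance of the predictor. The sketch below checks this on a handful of made-up points (the arrays are hypothetical and unrelated to the Ames results).

```python
import numpy as np

# Made-up data: garage capacity and sale price in thousands of dollars.
garage_cars = np.array([0, 1, 1, 2, 2, 3], dtype=float)
sale_price = np.array([90, 130, 150, 200, 220, 290], dtype=float)

# Closed-form simple linear regression.
slope = np.cov(garage_cars, sale_price, ddof=1)[0, 1] / np.var(garage_cars, ddof=1)
intercept = sale_price.mean() - slope * garage_cars.mean()

# Each additional garage space changes the predicted price by `slope`.
print(f"price = {intercept:.2f} + {slope:.2f} * garage_cars")
```

Because each model uses one predictor at a time, its slope also absorbs the effect of anything correlated with that predictor (for example, larger houses tend to have larger garages), so it is best read as an association rather than a guaranteed gain.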

# We recommend viewing this section in edit mode, because the $ sign changes the formatting.

We want to recommend to our customers particular home improvements that can increase the value of their home. To do so, we must consider the cost of such improvements.



1. We looked up the cost of building a garage, using the typical range and national average from the link below.

https://www.bobvila.com/articles/cost-to-build-a-garage/
Typical Range: $16,747 to $38,926
National Average: $27,774

The average cost of building a garage is $27,774.

We found that each additional vehicle the garage can hold increases the SalePrice by $68,380. Therefore, we estimate that building a garage generates approximately $40,000 in net revenue ($68,380 - $27,774 = $40,606).



2. We looked up the cost of installing a fireplace, using the typical range and national average from the link below.

https://www.bobvila.com/articles/fireplace-installation-cost/
Typical Range: $870 to $3,792
National Average: $2,314

The average cost of installing a fireplace is $2,314.
We found that each additional fireplace increases the SalePrice by $59,700. Therefore, we estimate that installing a fireplace generates approximately $57,000 in net revenue ($59,700 - $2,314 = $57,386).



3. We looked up the cost of adding a bathroom, using the average cost of a new addition from the link below. Based on our linear regression analysis, each additional half bathroom increases the SalePrice by $45,470.

https://www.homeadvisor.com/cost/bathrooms/
Average cost: $35,000

The average cost of adding a bathroom is $35,000. Therefore, we estimate that adding a bathroom generates approximately $10,000 in net revenue ($45,470 - $35,000 = $10,470).

Based on net revenue, the most profitable improvement is installing a fireplace (about $57,000), and the least profitable is adding a bathroom (about $10,000).
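
The arithmetic behind these estimates can be written as a small calculator. The sketch below uses only the price gains and average costs quoted above; the dictionary names are chosen here for illustration.

```python
# Estimated SalePrice gain per added unit, from the regression results above (USD).
price_gain = {"garage": 68_380, "fireplace": 59_700, "half_bath": 45_470}

# Average project cost quoted from the linked sources above (USD).
cost = {"garage": 27_774, "fireplace": 2_314, "half_bath": 35_000}

# Net revenue is the estimated price gain minus the cost of the improvement.
net_value = {name: price_gain[name] - cost[name] for name in price_gain}

# Rank improvements from most to least profitable.
ranked = sorted(net_value.items(), key=lambda item: item[1], reverse=True)
for name, value in ranked:
    print(f"{name}: ${value:,}")
```

Swapping in a local cost estimate for any entry in `cost` updates the ranking without changing the rest of the code.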


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
train.head()

In [None]:
test.head()

In [None]:
X_train = train.drop(['SalePrice'], axis=1)
y_train = train['SalePrice']

In [None]:
X_test = test.drop(['SalePrice'], axis=1)
y_test = test['SalePrice']

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import Lasso

In [None]:
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

In [None]:
models =[RandomForestRegressor(),
         DecisionTreeRegressor(),
         Lasso()]
param_grids=[{'n_estimators':[10,25,50],
              'max_depth':[2,3,4]},
             {'max_depth':[2,3,4],
              'criterion':["squared_error", "friedman_mse"]},
             {'alpha':[0.1,1,10]}
             ]

In [None]:
grid_search_list=[]
for model, params in zip(models, param_grids):
  grid_search = GridSearchCV(model, params, cv=5)
  grid_search.fit(X_train, y_train)
  grid_search_list.append(grid_search)

In [None]:
grid_search_list

In [None]:
rf_grid_search = grid_search_list[0]

In [None]:
pd.DataFrame(rf_grid_search.cv_results_)

In [None]:
rf_grid_search.best_params_

In [None]:
rf_grid_search.best_estimator_

In [None]:
for best in grid_search_list:
  print(best.best_params_)

In [None]:
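for best in grid_search_list:
  # Evaluate each tuned model on the held-out 2010 data.
  y_pred = best.best_estimator_.predict(X_test)
  mse= mean_squared_error(y_test, y_pred)
  mae= mean_absolute_error(y_test, y_pred)
  print(type(best.best_estimator_).__name__, 'MSE:', mse, 'MAE:', mae)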