<a href="https://colab.research.google.com/github/christian-thomas-schmidt/Python_Thursday/blob/main/Real_Estate_Project_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Christian Schmidt: Real Estate Dataset

This is my MET CS 777 project using pyspark and machine learning in Python.<br>
The dataset is provided by Ahmed Shahriar Sakib on Kaggle.com.<br><br>
The <b>"Goal"</b> of this project is to see whether I can predict house prices based on the features provided using machine learning in pyspark.<br><br>
I expect to use a Linear Regression algorithm to create a model, and I hypothesize that the larger the feature values (more beds/baths, a bigger acre_lot, a larger house_size), the more expensive a house will be.<br><br>
The link for the dataset can be found [here](https://www.kaggle.com/datasets/ahmedshahriarsakib/usa-real-estate-dataset?resource=download).<br>

### Context:
This dataset contains real estate listings in the US, broken down by state and zip code.<br>
Data was collected via web scraping using python libraries.

### Content:

--- The dataset has 1 CSV file, realtor-data.csv (200k+ entries), with 12 columns ---

*   status
*   price
*   bed
*   bath
*   acre_lot
*   full_address
*   street
*   city
*   state
*   zip_code
*   house_size
*   sold_date




The first step is to download the data and import it into the Google Colab environment.

In [1]:
! pip install kaggle

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [6]:
! mkdir -p ~/.kaggle


In [9]:
! cp kaggle.json ~/.kaggle/

In [10]:
! chmod 600 ~/.kaggle/kaggle.json

In [11]:
!kaggle datasets download -d ahmedshahriarsakib/usa-real-estate-dataset

Downloading usa-real-estate-dataset.zip to /content
  0% 0.00/5.07M [00:00<?, ?B/s]
100% 5.07M/5.07M [00:00<00:00, 147MB/s]


In [12]:
!unzip usa-real-estate-dataset

Archive:  usa-real-estate-dataset.zip
  inflating: realtor-data.csv        


### Importing Libraries

In [163]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from xgboost.sklearn import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.linear_model import LinearRegression, Lasso, Ridge, BayesianRidge, SGDRegressor, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import LinearSVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import RandomizedSearchCV

### Exploring the Dataset
I want to gather information about this dataset before I begin my clean-up process.

In [14]:
df = pd.read_csv('realtor-data.csv')
df.head()

Unnamed: 0,status,price,bed,bath,acre_lot,full_address,street,city,state,zip_code,house_size,sold_date
0,for_sale,105000.0,3.0,2.0,0.12,"Sector Yahuecas Titulo # V84, Adjuntas, PR, 00601",Sector Yahuecas Titulo # V84,Adjuntas,Puerto Rico,601.0,920.0,
1,for_sale,80000.0,4.0,2.0,0.08,"Km 78 9 Carr # 135, Adjuntas, PR, 00601",Km 78 9 Carr # 135,Adjuntas,Puerto Rico,601.0,1527.0,
2,for_sale,67000.0,2.0,1.0,0.15,"556G 556-G 16 St, Juana Diaz, PR, 00795",556G 556-G 16 St,Juana Diaz,Puerto Rico,795.0,748.0,
3,for_sale,145000.0,4.0,2.0,0.1,"R5 Comunidad El Paraso Calle De Oro R-5 Ponce,...",R5 Comunidad El Paraso Calle De Oro R-5 Ponce,Ponce,Puerto Rico,731.0,1800.0,
4,for_sale,65000.0,6.0,2.0,0.05,"14 Navarro, Mayaguez, PR, 00680",14 Navarro,Mayaguez,Puerto Rico,680.0,,


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203126 entries, 0 to 203125
Data columns (total 12 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   status        203126 non-null  object 
 1   price         203126 non-null  float64
 2   bed           171963 non-null  float64
 3   bath          172835 non-null  float64
 4   acre_lot      151066 non-null  float64
 5   full_address  203126 non-null  object 
 6   street        203041 non-null  object 
 7   city          203074 non-null  object 
 8   state         203126 non-null  object 
 9   zip_code      202931 non-null  float64
 10  house_size    173448 non-null  float64
 11  sold_date     75339 non-null   object 
dtypes: float64(6), object(6)
memory usage: 18.6+ MB


In [16]:
df.shape

(203126, 12)

With only 203,126 rows of data, I wonder which states this data includes.

In [17]:
state_array = df['state']
state_unique, state_count = np.unique(state_array, return_counts=True)
result = np.column_stack((state_unique, state_count))
print(result)

[['Connecticut' 12207]
 ['Massachusetts' 150792]
 ['New Hampshire' 4721]
 ['New Jersey' 2]
 ['New York' 1874]
 ['Puerto Rico' 24679]
 ['Rhode Island' 4907]
 ['South Carolina' 24]
 ['Tennessee' 18]
 ['Vermont' 1324]
 ['Virgin Islands' 2573]
 ['Virginia' 5]]


With about 150,000 listings, the dataset is mostly focused on the housing market in Massachusetts.<br> With this information, I may want to narrow the scope of my project to just the Massachusetts housing market.<br><br> However, what if there are duplicates?

In [18]:
df.duplicated().sum()

182207

The vast majority of this dataset consists of duplicate rows. I think it would be best to remove these and then see what the distribution is at the state level.

In [19]:
df2 = df.drop_duplicates()
df2.shape

(20919, 12)
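
As a sanity check on the arithmetic: by default `duplicated()` flags every repeat of a row except its first occurrence, so the row count after `drop_duplicates()` should equal the original count minus `duplicated().sum()` (203126 - 182207 = 20919 here). A minimal sketch on a made-up toy frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame (hypothetical values): rows 0, 1 and 3 are identical.
toy = pd.DataFrame({
    "price": [100.0, 100.0, 250.0, 100.0],
    "bed": [2.0, 2.0, 3.0, 2.0],
})

# With the default keep="first", only the repeats are flagged.
n_duplicates = int(toy.duplicated().sum())
deduped = toy.drop_duplicates()

print(n_duplicates)   # number of repeated rows
print(len(deduped))   # rows remaining after de-duplication
```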

In [20]:
# repeat prior step, but include percent distribution
def state_table(df):
  state_array = df['state']
  state_unique, state_count = np.unique(state_array, return_counts=True)
  state_percent = ["%.3f%%" % elem for elem in list(state_count*100/len(state_array))]
  state_df = pd.DataFrame({'State':state_unique,'Count':state_count,"% Dist.":state_percent}).sort_values(by=['Count'],ascending=False)
  return state_df


In [21]:
print(state_table(df2))

             State  Count  % Dist.
1    Massachusetts   9514  45.480%
0      Connecticut   3870  18.500%
5      Puerto Rico   2664  12.735%
6     Rhode Island   2117  10.120%
2    New Hampshire    965   4.613%
4         New York    800   3.824%
10  Virgin Islands    750   3.585%
9          Vermont    235   1.123%
3       New Jersey      1   0.005%
7   South Carolina      1   0.005%
8        Tennessee      1   0.005%
11        Virginia      1   0.005%
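
The same kind of table can probably be produced more directly with pandas' own `value_counts`. The sketch below is an alternative, not the code used above; `state_table_vc` is a hypothetical name and the toy frame is made up.

```python
import pandas as pd

def state_table_vc(df):
    """Count listings per state and express each count as a percentage."""
    counts = df["state"].value_counts()          # sorted by count, descending
    percent = counts / counts.sum() * 100
    return pd.DataFrame({
        "State": counts.index,
        "Count": counts.to_numpy(),
        "% Dist.": [f"{p:.3f}%" for p in percent],
    })

# Made-up example: three listings in one state, one in another.
toy = pd.DataFrame({"state": ["MA", "MA", "MA", "CT"]})
table = state_table_vc(toy)
print(table)
```

Note that `value_counts` skips missing values by default, which does not matter here because `state` has no nulls.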


Given this insight, I will shift my focus to building a model that predicts Massachusetts housing prices.<br>
I will then use Connecticut as a control group to analyze the model's effectiveness across the border.<br><br>
Finally, I will look at the housing market in Puerto Rico to see whether I can find any unique differences and whether the model is also applicable to the US territory.

### Pre-processing

I will limit the scope of this project to the state level.<br>
With that I will pre-process data to prepare the dataset for modeling.

In [85]:
# since we want to predict home prices, I will remove unwanted columns.
df3 = df2.drop(columns=['status','full_address','street','zip_code','sold_date'])
df3.head()

Unnamed: 0,price,bed,bath,acre_lot,city,state,house_size
0,105000.0,3.0,2.0,0.12,Adjuntas,Puerto Rico,920.0
1,80000.0,4.0,2.0,0.08,Adjuntas,Puerto Rico,1527.0
2,67000.0,2.0,1.0,0.15,Juana Diaz,Puerto Rico,748.0
3,145000.0,4.0,2.0,0.1,Ponce,Puerto Rico,1800.0
4,65000.0,6.0,2.0,0.05,Mayaguez,Puerto Rico,


In [86]:
# check whether there are NA values
df3.isnull().sum()

price            0
bed           3987
bath          3943
acre_lot      3776
city            16
state            0
house_size    3927
dtype: int64

In [87]:
# how many rows will this affect?
print("Total rows we started with:",df3.shape[0])
print("Rows with at least one NA (to be dropped):",df3.shape[0] - df3.dropna().shape[0])

Total rows we started with: 20919
Rows with at least one NA (to be dropped): 7850


I will remove the rows with NA values, as I want to avoid any bias in my data and only deal with known features.<br>
I will keep the data in another dataframe if I need to revisit this topic.

In [88]:
# create null dataframe
df_null = df3[df3.isna().any(axis=1)]

# create data frame with null removed
df3.dropna(axis = 0,inplace=True)


In [89]:
# See the results for NA values removed
print(state_table(df3))

            State  Count  % Dist.
1   Massachusetts   5772  44.166%
0     Connecticut   2805  21.463%
6    Rhode Island   1788  13.681%
5     Puerto Rico   1548  11.845%
2   New Hampshire    501   3.833%
4        New York    423   3.237%
7         Vermont    126   0.964%
8  Virgin Islands    105   0.803%
3      New Jersey      1   0.008%


I'm happy with the distribution and glad to see that Massachusetts still has the largest count of homes. From here I will create a subset of the dataset with the 3 states I am interested in.

1.   Massachusetts
2.   Connecticut
3.   Puerto Rico



In [90]:
df_3_states = df3.loc[df3['state'].isin(['Massachusetts', 'Connecticut', 'Puerto Rico'])]
df_3_states.head()

Unnamed: 0,price,bed,bath,acre_lot,city,state,house_size
0,105000.0,3.0,2.0,0.12,Adjuntas,Puerto Rico,920.0
1,80000.0,4.0,2.0,0.08,Adjuntas,Puerto Rico,1527.0
2,67000.0,2.0,1.0,0.15,Juana Diaz,Puerto Rico,748.0
3,145000.0,4.0,2.0,0.1,Ponce,Puerto Rico,1800.0
5,179000.0,4.0,3.0,0.46,San Sebastian,Puerto Rico,2520.0


### Exploratory Analysis

Before I begin, I will perform exploratory analysis to see if there is any additional information I can glean from the 3 state subsets.

In [91]:
import plotly.express as px
#import plotly.graph_objects as go

In [92]:
col = ["bed","bath","acre_lot","house_size"]

for i in col:
  fig = px.scatter(df_3_states, 
                   x = i,
                   y="price",
                   trendline="ols",
                   color='price',
                   title= f"{i} and price")
  fig.show()


What's interesting about this result is that the dataset includes not just single-family homes but also multi-family properties. My initial assumption, that larger feature values go with a higher price, does seem to hold up. I will assume that the bed and bath numbers are correct, but I will remove the egregious outliers in price, house size, bath, and acre_lot.

In [93]:
df_3_states = df_3_states.sort_values(by='house_size',ascending = True)
#df_3_states.tail()
df_3_states = df_3_states[:-1]
df_3_states.tail()

Unnamed: 0,price,bed,bath,acre_lot,city,state,house_size
103034,4999000.0,42.0,42.0,0.23,Somerville,Massachusetts,26942.0
14323,7995000.0,4.0,4.0,0.17,San Juan,Puerto Rico,29679.0
14324,7995000.0,3.0,3.0,0.17,San Juan,Puerto Rico,29679.0
80518,15150000.0,86.0,56.0,1.32,Framingham,Massachusetts,35666.0
108951,14950000.0,60.0,51.0,1.01,Boston,Massachusetts,38442.0


In [94]:
df_3_states = df_3_states.sort_values(by='acre_lot',ascending = True)
#df_3_states.tail()
df_3_states = df_3_states[:-2]
df_3_states.tail()

Unnamed: 0,price,bed,bath,acre_lot,city,state,house_size
8535,95000.0,3.0,2.0,1825.0,Ponce,Puerto Rico,1800.0
14352,1100000.0,3.0,3.0,1954.0,San Juan,Puerto Rico,1954.0
110404,2750000.0,3.0,3.0,2295.68,Boston,Massachusetts,1872.0
8486,795000.0,4.0,4.0,2500.0,Caguas,Puerto Rico,4400.0
8525,95000.0,3.0,10.0,2960.0,Ponce,Puerto Rico,638.0


In [95]:
df_3_states = df_3_states.sort_values(by='bath',ascending = True)
#df_3_states.tail()
df_3_states = df_3_states[:-1]
df_3_states.tail()

Unnamed: 0,price,bed,bath,acre_lot,city,state,house_size
111644,5200000.0,32.0,28.0,0.51,Stoughton,Massachusetts,22456.0
14343,13995000.0,33.0,35.0,0.09,San Juan,Puerto Rico,15000.0
103034,4999000.0,42.0,42.0,0.23,Somerville,Massachusetts,26942.0
108951,14950000.0,60.0,51.0,1.01,Boston,Massachusetts,38442.0
80518,15150000.0,86.0,56.0,1.32,Framingham,Massachusetts,35666.0


In [96]:
df_3_states = df_3_states.sort_values(by='price',ascending = True)
#df_3_states.tail()
df_3_states = df_3_states[:-1]
df_3_states.tail()

Unnamed: 0,price,bed,bath,acre_lot,city,state,house_size
198913,22000000.0,14.0,8.0,4.06,Woods Hole,Massachusetts,11996.0
199509,23750000.0,6.0,9.0,6.24,Nantucket,Massachusetts,7609.0
199512,26500000.0,8.0,6.0,6.42,Nantucket,Massachusetts,6815.0
199504,28000000.0,8.0,12.0,1.07,Nantucket,Massachusetts,8817.0
200979,30000000.0,7.0,14.0,3.66,Barnstable,Massachusetts,15500.0


Finally, the last thing I will test is whether the housing markets of the 3 states are similar to each other.

In [97]:
order = df_3_states.groupby(by=['state'])['price'].median().sort_values(ascending=False).index

fig = px.box(df_3_states, x="state", y="price", points='all', color='state', title='Highest State House Prices Ranking by Median')
fig.update_xaxes(categoryorder='array', categoryarray= list(order))
fig.show()

After removing egregious outliers, I see that the Massachusetts housing market is significantly more expensive, with the median price at around 650k, while Connecticut is only hovering around 300k and Puerto Rico is well below 200k.
This gives me less confidence that a model based on Massachusetts data could predict the price of a house located in either Connecticut or Puerto Rico, given the limitations of the dataset.

In [None]:
# calculate outlier bounds using the interquartile range (IQR) for acre_lot
#acre_lot_outliers = df_3_states['acre_lot'].values.tolist()
#sorted(acre_lot_outliers)

#q1,q3 = np.percentile(sorted(acre_lot_outliers),[25.,75])
#iqr = q3-q1

#lower_bound = q1 - (1.5 * iqr)
#upper_bound = q3 + (1.5 * iqr)

#print(f"The lower bound is: {lower_bound}")
#print(f"The upper bound is: {upper_bound}")
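
The commented-out cell above outlines quartile-based outlier bounds. A runnable sketch of that interquartile-range (IQR) rule follows, on made-up lot sizes; `iqr_bounds` is a hypothetical helper, and the 1.5 multiplier is the usual convention rather than something tuned for this dataset.

```python
import numpy as np
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up lot sizes in acres; the last one is an obvious outlier.
toy = pd.DataFrame({"acre_lot": [0.1, 0.2, 0.3, 0.4, 0.5, 2500.0]})

lower, upper = iqr_bounds(toy["acre_lot"])
trimmed = toy[toy["acre_lot"].between(lower, upper)]
print(f"bounds: {lower:.3f} to {upper:.3f}, rows kept: {len(trimmed)}")
```

Compared with dropping the last sorted row by hand, this applies one rule to every column, at the cost of possibly trimming legitimate large properties.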

### Standardize the dataset

I will standardize the dataset, but only after splitting the data into the 3 states.

In [144]:
# Build each mask from df3's own 'state' column so the mask index matches df3.
# Note: this filters df3, so the outlier rows trimmed from df_3_states are still included.
df_massachusetts = df3[df3['state'] == 'Massachusetts']
df_connecticut = df3[df3['state'] == 'Connecticut']
df_puerto_rico = df3[df3['state'] == 'Puerto Rico']

In [145]:
def standardize(df):
  df = df.drop(columns=['state','city'])
  df = (df-df.mean())/df.std()
  return df

In [146]:
df_massachusetts = standardize(df_massachusetts)
df_connecticut = standardize(df_connecticut)
df_puerto_rico = standardize(df_puerto_rico)
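
One thing to keep in mind: `standardize` scales each state by its own mean and standard deviation, so a z-score of 1 corresponds to a different dollar amount in each state. If the aim were to apply a Massachusetts-trained model to another state on a common scale, one option would be to reuse the Massachusetts statistics. A minimal sketch of that alternative, on made-up numbers; `standardize_with` is a hypothetical helper and is not what the cells above do.

```python
import pandas as pd

def standardize_with(df, mean, std):
    """Scale df using externally supplied column means and standard deviations."""
    return (df - mean) / std

# Made-up numeric frames standing in for two states.
ref = pd.DataFrame({"price": [400.0, 600.0, 800.0], "bed": [2.0, 3.0, 4.0]})
other = pd.DataFrame({"price": [200.0, 300.0], "bed": [3.0, 4.0]})

ref_mean, ref_std = ref.mean(), ref.std()

ref_z = standardize_with(ref, ref_mean, ref_std)       # mean 0, std 1 by construction
other_z = standardize_with(other, ref_mean, ref_std)   # expressed on the reference scale
print(other_z)
```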

### Model Training

I will experiment with different models to find the one with the highest R2 score.

In [116]:
X = df_massachusetts.drop(columns='price')
y = df_massachusetts['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [117]:
ml_models = {}
def train_validate_predict(regressor, x_train, y_train, x_test, y_test, index):
    model = regressor
    model.fit(x_train, y_train)
    
    y_pred = model.predict(x_test)

    r2 = r2_score(y_test, y_pred)
    ml_models[index] = r2

In [118]:
model_list = [LinearRegression, Lasso, Ridge, BayesianRidge, DecisionTreeRegressor, LinearSVR, KNeighborsRegressor,
              RandomForestRegressor, GradientBoostingRegressor, ElasticNet, SGDRegressor, XGBRegressor,
             LGBMRegressor]
model_names = ['Linear Regression', 'Lasso', 'Ridge', 'Bayesian Ridge', 'Decision Tree Regressor', 'Linear SVR', 
               'KNeighbors Regressor', 'Random Forest Regressor', 'Gradient Boosting Regressor', 'Elastic Net', 'SGD Regressor',
               'XGB Regressor', 'LGBM Regressor']

# pair each model class with its display name
for name, regressor in zip(model_names, model_list):
    train_validate_predict(regressor(), X_train, y_train, X_test, y_test, name)



In [124]:
for i in sorted(ml_models,key = ml_models.get, reverse = True):
  print(i, ml_models[i])

XGB Regressor 0.38169990711956825
LGBM Regressor 0.37919306860516333
Gradient Boosting Regressor 0.37660939351666556
Random Forest Regressor 0.3681066050932231
KNeighbors Regressor 0.36580077000201516
Bayesian Ridge 0.29776528119846124
Ridge 0.29731959070318836
Linear Regression 0.2971390426714876
SGD Regressor 0.2826068497491314
Linear SVR 0.2447446211356108
Elastic Net 0.06050882671201496
Lasso -0.000998078230080024
Decision Tree Regressor -0.020701753541440215


Looks like the winner is Extreme Gradient Boosting (XGBoost).
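
These scores come from a single random 70/30 split, so the ranking between closely scored models may shift from run to run. `RepeatedKFold` is imported above but not used; a sketch of how it could give a more stable estimate follows, on synthetic data. The data, the plain `LinearRegression` model, and the choice of 5 folds with 3 repeats are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic regression data: y depends linearly on two features plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# 5 folds repeated 3 times gives 15 R^2 scores instead of one.
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, scoring="r2", cv=cv)

print(f"mean R^2: {scores.mean():.3f} (std {scores.std():.3f}, n={len(scores)})")
```

Reporting the mean together with the spread makes it easier to judge whether a gap such as 0.382 vs 0.379 is meaningful.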

In [137]:
# Implement XGB Regressor and Evaluate the model
model = XGBRegressor()
model.fit(X_train,y_train)
y_pred = model.predict(X_test)



In [164]:
def model_evaluation(y_test,y_pred):
  print(f'Mean Absolute Error: {mean_absolute_error(y_test, y_pred):.5f}')
  print(f'Root Mean Squared Error: {math.sqrt(mean_squared_error(y_test, y_pred)):.5f}')
  print(f'r2 score: {r2_score(y_test, y_pred):.5f}')

In [165]:
model_evaluation(y_test,y_pred)

Mean Absolute Error: 0.32391
Root Mean Squared Error: 0.85969
r2 score: 0.38170
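
Because `price` was standardized, these errors are in standard deviations of price, not dollars. Multiplying by the standard deviation that was used for scaling converts them back. A sketch of the arithmetic; the `price_std` value below is a made-up placeholder, since the actual standard deviation is not shown in this notebook.

```python
# Errors reported above, in units of standard deviations of price.
mae_z = 0.32391
rmse_z = 0.85969

# Hypothetical placeholder: the real value would be the standard deviation
# of the unscaled Massachusetts price column.
price_std = 1_000_000.0

mae_dollars = mae_z * price_std
rmse_dollars = rmse_z * price_std
print(f"MAE ~ ${mae_dollars:,.0f}, RMSE ~ ${rmse_dollars:,.0f}")
```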


In [139]:
predictions = pd.DataFrame({'y_pred': y_pred, 'y_test':y_test})
predictions = predictions.sort_values(by='y_test')
predictions = predictions.reset_index()

In [None]:
fig = px.line(predictions, x=predictions.index, y=predictions.columns[1::], title='Predictions vs Actual Value')
fig.show()

In [141]:
y_pred

array([-0.17650121, -0.10647774, -0.39055717, ..., -0.3135444 ,
       -0.324629  , -0.13071734], dtype=float32)

In [153]:
# Compare Connecticut and Puerto Rico data sets.
X_c = df_connecticut.drop(columns='price')
X_pr = df_puerto_rico.drop(columns='price')

y_c = df_connecticut['price']
y_pr = df_puerto_rico['price']

y_pred_c = model.predict(X_c)
y_pred_pr = model.predict(X_pr)

In [166]:
print("Massachusetts")
model_evaluation(y_test,y_pred)
print("\n")
print("Connecticut")
model_evaluation(y_c,y_pred_c)
print("\n")
print("Puerto Rico")
model_evaluation(y_pr,y_pred_pr)

Massachusetts
Mean Absolute Error: 0.32391
Root Mean Squared Error: 0.85969
r2 score: 0.38170


Connecticut
Mean Absolute Error: 0.50645
Root Mean Squared Error: 1.17130
r2 score: -0.37242


Puerto Rico
Mean Absolute Error: 0.60839
Root Mean Squared Error: 0.90788
r2 score: 0.17522


Unfortunately, my model's R2 score isn't very high: only about 38% of the variability in the (standardized) price is explained by the model, which leaves a lot of variability unexplained. The model does worse when predicting housing prices in Connecticut and Puerto Rico, where R2 drops to -0.37 and 0.18 respectively.
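
For reference, R2 compares the model's squared error with that of always predicting the mean of the observed values, which is why it can go negative, as it does for Connecticut. A small sketch with made-up numbers (`r2_manual` is a hypothetical helper that mirrors the usual definition):

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of predicting the mean
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]

r2_perfect = r2_manual(y_true, y_true)               # exact predictions
r2_mean = r2_manual(y_true, [2.5, 2.5, 2.5, 2.5])    # always predict the mean
r2_worse = r2_manual(y_true, [4.0, 3.0, 2.0, 1.0])   # reversed predictions
print(r2_perfect, r2_mean, r2_worse)
```

A negative score therefore means the predictions fit worse than a constant equal to the mean of the target.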

I have two conclusions.


1.   I believe that the dataset is too narrow and that there may be other features that affect a house's price more than just the house size, lot size, number of beds, and number of baths. It could be something as simple as median income, or something more complex like a shortage vs. a surplus of available houses.

2.   It doesn't seem like one can apply a model trained on one state to another state/U.S. territory. There are far too many factors affecting the price beyond what was available in this dataset.


