# COGS 118B - Final Project

# Home Scout: Machine Learning Meets Real Estate

## Group members
- Nhathan Nguyen
- Kristoffer Alejo 
- Elizabeth Lee
- Colin Sutedja

# Abstract 
Our objective is to improve housing search for a broad audience by using real estate data and market analysis to match individuals with housing that fits their affordability, location, and crime-rate preferences. The core of our approach is an algorithm that sifts through a large set of listings to locate suitable properties, bridging the gap between diverse housing needs and available offers. We measure the effectiveness of the solution by how accurately it aligns users with properties that meet their criteria. To keep the model practical and adaptable, we work with a sizable yet manageable dataset that allows realistic application and can adjust to changing market conditions. The result is a tool that aims to anticipate the needs of potential homeowners or renters and present options that match their preferences and necessities. We use an unsupervised machine learning model, which lets us cluster listings efficiently and surface candidate matches quickly even when the data comes from scattered sources.

# Background
Unsupervised machine learning has received significant attention in the real estate domain because of its potential to change property search, valuation, and investment strategies. Earlier machine learning work in real estate focused primarily on supervised models that predict house prices from a set of predefined features. However, these models often require extensive labeled datasets, which are time-consuming and costly to produce.
Recent work has shifted towards unsupervised learning, which does not require labeled data and can discover hidden patterns within the real estate market, offering insights into customer preferences, market segmentation, and predictive analysis of property values. For example, clustering algorithms have been used to segment properties into distinct categories, enhancing personalized property recommendations.


Furthermore, dimensionality reduction techniques such as Principal Component Analysis have aided the visualization and understanding of complex real estate datasets, enabling investors to make informed decisions by identifying key factors influencing the market. These approaches underscore the versatility of unsupervised machine learning in the real estate sector. By leveraging unsupervised models, the industry can tap into previously unexplored data dimensions, offering a more nuanced understanding of market dynamics and consumer behavior. Despite the promising applications, challenges such as data privacy, model interpretability, and the integration of domain expertise remain. Addressing these issues is crucial for the development of robust, ethical, and practical AI solutions in real estate.


Therefore, we designed a project that speaks to the further potential of unsupervised models and the scalability of these systems<a name="wachter"></a>[<sup>[1]</sup>](#wachternote)<a name="choy"></a>[<sup>[2]</sup>](#choynote)<a name="soltani"></a>[<sup>[3]</sup>](#soltaninote).

# Problem Statement
The problem we aim to solve is the significant inefficiency in pairing potential renters or buyers with the most suitable housing options on the real estate market. There is a noticeable gap in the market for a unified system that accurately aligns consumer preferences with available properties, taking into account factors such as pricing, location, and specific user desires. This inefficiency is time-consuming for consumers and can also lead to suboptimal housing matches, affecting overall quality of life and financial well-being. The challenge can be quantified through data points including, but not limited to, price points, geographical data, housing features, and individual consumer preferences. Success metrics can be established through the rate of successful matches, user satisfaction ratings, time saved in the housing search process, and overall cost-effectiveness for both consumers and sellers.

# Data
### US Real Estate Dataset
Dataset: https://www.kaggle.com/datasets/febinphilips/us-house-listings-2023

This dataset consists of 22,681 observations and 14 variables: State, City, Street, Zipcode, Number of Bedrooms, Number of Bathrooms, Area, Price Per Square Foot, Lot Area, Market Estimate, Rent Estimate, Latitude, Longitude, and Listed Price. The numeric variables are of type float and int. For the model, we plan to include Bedroom, Bathroom, Area, Price Per Square Foot, Lot Area, Market Estimate, Rent Estimate, and Listed Price. The remaining variables (State, City, Street, Zipcode, Latitude, and Longitude) are used to measure location.

We plan to measure location in several ways. One measurement is the distance to the nearest metropolitan city, which we plan to compute from the latitude and longitude coordinates. Other location measurements include the crime rate, socioeconomic status, and employment rate of the city. These can be obtained by cross-referencing the zipcode or city variable with other datasets that measure the corresponding quantities.

For cleaning, we first dropped all rows with N/A values, which left 14,843 observations. For transformation, we use State, City, Street, Zipcode, Latitude, and Longitude to generate new variables (distance to the nearest metropolitan city, crime rate, socioeconomic status score, and employment rate), drop the original location columns, and append the new variables to the remaining ones. To complete this transformation, we must still define how we measure socioeconomic status.
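
As a rough sketch of the distance-to-nearest-metropolitan-city feature described above, the great-circle (haversine) distance can be computed directly from latitude and longitude. The short metro list, the function names, and the choice of haversine distance are illustrative assumptions, not the project's actual implementation.

```python
import numpy as np

# Hypothetical reference points (latitude, longitude) for a few metro centers.
METRO_CENTERS = {
    "New York": (40.7128, -74.0060),
    "Los Angeles": (34.0522, -118.2437),
    "Chicago": (41.8781, -87.6298),
}

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def nearest_metro(lat, lon, centers=METRO_CENTERS):
    """Return (name, distance_km) of the metro center closest to one listing."""
    distances = {
        name: float(haversine_km(lat, lon, c_lat, c_lon))
        for name, (c_lat, c_lon) in centers.items()
    }
    name = min(distances, key=distances.get)
    return name, distances[name]


# Example: a listing located in San Diego, using the toy metro list above.
name, dist_km = nearest_metro(32.7157, -117.1611)
print(name, round(dist_km, 1))
```

In practice the list of metropolitan centers would come from a separate source, and the resulting distance would be appended as a new column to the listings.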


### United States Cities Database
Dataset: https://simplemaps.com/data/us-cities

This dataset consists of 31,120 observations and 17 variables. Each observation represents a city in the United States. We plan to merge this dataset with the US real estate dataset on the city of each listing. The 17 variables are: city, city_ascii, state_id, state_name, county_fips, county_name, lat, lng, population, density, source, military, incorporated, timezone, ranking, zips, id. We plan to use only city, population, density, and ranking. The ranking variable measures the importance of a city, with 1 being the most important and 5 the least. We use these variables to define and measure the location variable. For example, cities with large populations and high density are usually more desirable, so prices tend to be higher. The ranking of the city also factors into how we quantify the city variable. From there we can judge whether the price of a listing is reasonable for its location.
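
The merge described above could look roughly like the following. The toy rows, their numeric values, and the decision to join on both city and state (to avoid matching same-named cities in different states) are assumptions made for illustration; the column names follow the two datasets as listed.

```python
import pandas as pd

# Toy listings; values are placeholders, not real data.
listings = pd.DataFrame({
    "City": ["San Diego", "Austin", "Springfield"],
    "State": ["CA", "TX", "IL"],
    "Listed Price": [850000, 520000, 180000],
})

# Toy city table; population and density are placeholder numbers.
cities = pd.DataFrame({
    "city": ["San Diego", "Austin", "Springfield", "Springfield"],
    "state_id": ["CA", "TX", "IL", "MO"],
    "population": [3000000, 2000000, 150000, 300000],
    "density": [1600.0, 1200.0, 700.0, 750.0],
    "ranking": [1, 1, 2, 2],
})

# Left join keeps every listing and attaches the city-level features.
merged = listings.merge(
    cities[["city", "state_id", "population", "density", "ranking"]],
    how="left",
    left_on=["City", "State"],
    right_on=["city", "state_id"],
).drop(columns=["city", "state_id"])

# Joining on the city name alone duplicates the Springfield listing.
naive = listings.merge(cities, how="left", left_on="City", right_on="city")
print(len(merged), len(naive))
```

Checking the row count before and after the merge is a cheap way to catch this kind of accidental duplication.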


### Crime Database
Dataset: https://ucr.fbi.gov/crime-in-the-u.s/2019/crime-in-the-u.s.-2019/tables/table-8/table-8.xls/view

This table contains summary crime-rate data for the listed U.S. states, with subsections for major U.S. cities.



# Proposed Solution

Our proposed solution is a machine learning system that dynamically evaluates consumer profiles against current market listings. The system uses an algorithm to propose the most suitable housing matches, simplifying the search process and improving the fit between consumers’ needs and the available housing options. We want this approach to be adaptable and scalable, with potential applications across various markets and the ability to incorporate more data in the future.

# Evaluation Metrics

For this project, we propose using the Silhouette Coefficient and mean cluster centroid points as our primary evaluation metrics, because both are directly relevant to assessing clustering performance.


### Silhouette Coefficient
The Silhouette Coefficient is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). For each data point, it is calculated from the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) as `(b - a) / max(a, b)`, which ranges from -1 to 1.
A high Silhouette Coefficient suggests that the object is well matched to its own cluster and poorly matched to neighboring clusters. This metric is useful for determining the number of clusters, as it provides a way to assess the trade-off between maximizing inter-cluster distance and minimizing intra-cluster distance.
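
To make the definition concrete, the sketch below computes the coefficient by hand for each point and compares it with scikit-learn's `silhouette_samples`. The synthetic blobs and the helper name `silhouette_for_point` are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic data standing in for the scaled housing features.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)


def silhouette_for_point(i, X, labels):
    """Silhouette value of point i: (b - a) / max(a, b)."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False  # exclude the point itself from its own cluster
    if not same.any():
        return 0.0  # convention for a cluster with a single member
    a = dists[same].mean()  # mean intra-cluster distance
    b = min(  # mean distance to the nearest other cluster
        dists[labels == other].mean()
        for other in np.unique(labels)
        if other != labels[i]
    )
    return (b - a) / max(a, b)


manual = np.array([silhouette_for_point(i, X, labels) for i in range(len(X))])
print("mean silhouette (manual): ", manual.mean())
print("mean silhouette (sklearn):", silhouette_score(X, labels))
```

The overall score reported for a clustering is simply the mean of these per-point values.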



### Mean Cluster Centroid Points
The second method we used to evaluate our results was calculating the geometric center of each cluster, which is the mean position of all the points in the cluster, and checking which clusters contained the homes with the highest and lowest prices. This metric is essential for understanding the central tendency of each cluster, allowing a more intuitive grasp of cluster composition and distribution.
By analyzing the centroids, we can evaluate how well the clustering algorithm has grouped similar entities together and how distinct the clusters are from each other. This evaluation can also help identify potential anomalies or outliers within clusters. By looking at the mean feature values of each cluster, we can also see what each cluster is composed of and judge how the clusters are organized.


# Results
By leveraging unsupervised machine learning models to uncover hidden patterns within a comprehensive real estate dataset, our aim was to enhance the property search experience by effectively matching properties to potential buyers' preferences and the lowest offered price.

### Subsection 1: Dataset and Problem Analysis
Our project is built on `housing_data_df`, a dataset that was cleaned and preprocessed to uphold the integrity of our analysis. It contains roughly 9,400 observations drawn from several datasets merged together. It covers diverse property attributes, ranging from location specifics (like state and city) to physical characteristics (such as bedrooms and square footage) and financial details (including listing prices and tax values), and even local crime rates, offering a comprehensive view of what influences real estate values.

Key steps in our data preparation included imputing missing values for critical attributes with the median values of similar properties, identifying and adjusting outliers using statistical methods to prevent skew, and normalizing attributes so that each has comparable influence. The exploratory data analysis (EDA) phase was invaluable, revealing significant patterns and trends in the data, such as the distribution of property attributes indicating market segments and location-based trends affecting property values. This phase also included correlation analyses to better understand what drives property values.
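
The three preparation steps above could be sketched as follows. The toy data, the use of city as the grouping for "similar properties", the 1.5 × IQR clipping rule, and `StandardScaler` are all assumptions chosen for illustration; the text does not specify which statistical methods were actually used.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data with missing areas and one extreme price.
df = pd.DataFrame({
    "City": ["A", "A", "A", "B", "B", "B"],
    "Area": [1000.0, np.nan, 1200.0, 2000.0, 2200.0, np.nan],
    "Listed Price": [300000.0, 320000.0, 310000.0, 600000.0, 5000000.0, 620000.0],
})

# 1. Impute missing values with the median of similar properties (same city here).
df["Area"] = df["Area"].fillna(df.groupby("City")["Area"].transform("median"))

# 2. Adjust outliers by clipping to the 1.5 * IQR fences.
q1, q3 = df["Listed Price"].quantile([0.25, 0.75])
iqr = q3 - q1
df["Listed Price"] = df["Listed Price"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# 3. Standardize so every feature has zero mean and unit variance.
scaled = StandardScaler().fit_transform(df[["Area", "Listed Price"]])
print(df)
```

Clipping keeps the extreme listing in the dataset while limiting its pull on distance-based clustering; dropping such rows would be an alternative design choice.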

This groundwork ensured our dataset was prepared for unsupervised learning models, enhancing the relevance and applicability of our findings to the real-world market. The insights gained not only informed our feature selection and dimensionality reduction efforts but also helped set realistic benchmarks for model performance, laying a good foundation for effective real estate market segmentation.


### Subsection 2: Feature Selection and Data Transformation
To streamline our analysis, we used Principal Component Analysis (PCA) as our main method for feature selection and dimensionality reduction. This step was crucial for preserving the dataset's core structure while improving computational efficiency. By focusing on key property features, PCA allowed us to refine our clustering so that the analysis was both manageable and interpretable. PCA reduces the dataset's complexity while retaining most of its variance. It transforms the data into a set of uncorrelated variables, or principal components. The process involves standardizing the data, computing the covariance matrix, and then deriving its eigenvalues and eigenvectors to identify the principal components. The components are then ordered by the variance they capture, which lets us retain the most informative ones and discard the rest. We chose this approach because clustering can be computationally heavy: PCA reduces computational demands by simplifying the dataset, and it also makes clustering results easier to interpret by focusing on the directions of greatest variance.
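
The PCA steps listed above (standardize, covariance matrix, eigen-decomposition, ordering by variance) can be written out in a few lines of NumPy. This is a minimal sketch on random data, with variable names chosen for illustration, and it is checked against scikit-learn's `PCA` rather than taken from the project code.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random data with one deliberately correlated pair of columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] = 0.8 * X[:, 0] + 0.2 * X[:, 1]

# 1. Standardize each feature.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features.
cov = np.cov(Z, rowvar=False)

# 3. Eigen-decomposition (eigh is intended for symmetric matrices).
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Order components by the variance they capture, largest first.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
explained_ratio = eigvals / eigvals.sum()

# 5. Keep the top k components and project the data onto them.
k = 2
Z_reduced = Z @ eigvecs[:, :k]

print("explained variance ratio:", explained_ratio)
print("sklearn PCA ratio:       ", PCA().fit(Z).explained_variance_ratio_)
```

The cumulative sum of `explained_ratio` is the curve on which an elbow can be located to choose the number of dimensions to keep.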

As a result, using PCA for data transformation improved the efficiency and effectiveness of our clustering analysis. Below is how we obtained the elbow point from the cumulative explained variance versus the number of principal components.

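```
# find elbow point for optimal number of dimensions
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from kneed import KneeLocator
from sklearn.decomposition import PCA

# housing_df_matrix_scaled: standardized feature matrix prepared earlier
dimred = PCA()
dimred.fit(housing_df_matrix_scaled)
cumulative_variance = np.cumsum(dimred.explained_variance_ratio_)
n_components = np.arange(1, len(cumulative_variance) + 1)
# plot cumulative explained variance against the number of components
sns.lineplot(x=n_components, y=cumulative_variance)
location = KneeLocator(n_components, cumulative_variance,
                       curve="concave", direction="increasing")
plt.axvline(location.elbow, color="red", linestyle="--")
plt.xlim((0, 16));
```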

![image.png](attachment:image.png)


### Subsection 3: Base Model performance
- GMM
    - silhouette scores are relatively consistent when we run the GMM multiple times
    - the elbow point, however, fluctuates significantly when we generate the elbow plot multiple times
        - this may be a result of a different initialization point every time
    - With GMM we used 3 clusters, which had a silhouette score of 0.20, while the elbow point indicated that we should have used 4 clusters
    - the UMAP visualization looked very homogeneous between 2 clusters, but there was a clear separation with the third cluster. Note that since the GMM was run in 6 dimensions, projecting it onto a 2-dimensional UMAP may not be indicative of how good our clustering actually is.
    - Looking at the means of each cluster, we see a clear distinction between all 3 clusters
    - however, we see an exponential decrease in cluster size, as follows:
        - Cluster 0: 7597 observations
        - Cluster 1: 1266 observations
        - Cluster 2: 561 observations
- K-means
    - with K-means we see consistency between the silhouette scores and the elbow point, as both indicate that 4 clusters is optimal. 4 clusters gave a silhouette score of 0.319, which is better than GMM
    - when computing the silhouette scores and elbow point multiple times, we get roughly the same result every time, indicating that K-means lets us reproduce the clustering quite well
    - the UMAP visualization shows a clear distinction between the 4 groups
    - Looking at the mean of each cluster, we see a much clearer distinction between clusters compared to GMM
    - however, when we look at the cluster sizes, we still see this exponential decrease, although it is more gradual than with GMM, as follows:
        - Cluster 0: 2714 observations
        - Cluster 1: 6317 observations
        - Cluster 2: 377 observations
        - Cluster 3: 16 observations
    - A cluster size of 377 may be problematic, but a cluster size of 16 will definitely be problematic
    - another issue is inherent to K-means: we must assume spherical covariance with roughly equal variance among all clusters
    - Since we do not know the shape of our data, K-means may not be ideal
- DBSCAN
    - DBSCAN is the best in terms of reproducibility of clusters, as running the grid search heatmap produced the same number of clusters, the same number of noise points, and the same silhouette score every time
    - Using DBSCAN parameters eps = 1 and min_samples = 15, we got 4 clusters and a silhouette score of 0.45, which is higher than the other 3 algorithms.
    - The issue with DBSCAN, however, lies in how it generates noise points that do not cluster with the other points. Using the above parameters, we generated 619 noise points, which is around 6.6% of our data.
    - Looking at the UMAP visualization, we do not see a clear separation of clusters. However, since we are projecting the clustering from 6 dimensions down to 2 dimensions, the visualization may not be very indicative of how good our clustering actually is.
    - Looking at the mean of each cluster, we see a distinguishable relationship between clusters
    - however, there is one major issue that rules out DBSCAN: it classifies the expensive houses as noise instead of as clusters
    - looking at cluster sizes, we see this exponential decrease in size, more drastic than that of GMM, as follows:
        - Cluster 0: 8447 observations
        - Cluster 1: 314 observations
        - Cluster 2: 16 observations
        - Cluster 3: 28 observations
    - Cluster 1 may be problematic based on its size, but clusters 2 and 3 will most definitely cause issues as they are too small
    
- HDBSCAN
    - HDBSCAN is quite similar to DBSCAN; however, with our hyperparameters we were able to reduce the noise points by more than half and still maintain a fairly good silhouette score.
    - Using parameters min_cluster_size = 8 and min_samples = 9, we generated 3 clusters with 287 noise points and a silhouette score of 0.39, which is higher than GMM and K-means
    - however, it still has the same issue as DBSCAN: it classifies expensive homes as outliers
    - the cluster sizes are as follows:
        - Cluster 0: 8787 observations
        - Cluster 1: 20 observations
        - Cluster 2: 330 observations
    - Cluster 1 has too few observations and may not generalize well when using the training and testing sets
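
A comparison like the one summarized in the list above could be organized around one small helper that reports the number of clusters, noise points, silhouette score, and cluster sizes for any label vector. The sketch below uses synthetic blobs and three of the algorithms; the helper name `summarize`, the parameter values, and the choice to exclude noise points before scoring are assumptions, not the project's actual settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the PCA-reduced, scaled housing matrix.
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=0.6, random_state=1)
X = StandardScaler().fit_transform(X)


def summarize(labels, X):
    """Cluster count, noise count, silhouette score, and cluster sizes."""
    mask = labels != -1  # DBSCAN marks noise points with the label -1
    kept = labels[mask]
    ids, counts = np.unique(kept, return_counts=True)
    n_clusters = len(ids)
    # Assumption: noise points are excluded before scoring.
    score = silhouette_score(X[mask], kept) if n_clusters > 1 else float("nan")
    return {
        "n_clusters": n_clusters,
        "n_noise": int((~mask).sum()),
        "silhouette": score,
        "sizes": {int(c): int(n) for c, n in zip(ids, counts)},
    }


results = {
    "GMM": summarize(GaussianMixture(n_components=3, random_state=0).fit_predict(X), X),
    "K-means": summarize(KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X), X),
    "DBSCAN": summarize(DBSCAN(eps=0.3, min_samples=15).fit_predict(X), X),
}
for name, summary in results.items():
    print(name, summary)
```

Reporting cluster sizes next to the silhouette score makes it easier to spot cases where a high score comes from one dominant cluster plus a few tiny ones.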


### Subsection 4: Model Selection and Hyperparameter Tuning 

We will likely use K-means on the PCA-reduced data to train the model. Four clusters seem to work better for segmentation than the three used in both GMM and HDBSCAN, and K-means shows the clearest distinction between clusters of all the algorithms. In terms of silhouette score, K-means (0.319) is higher than GMM (0.20) but lower than DBSCAN (0.45). Even though DBSCAN has the best silhouette score by a clear margin, it classifies the expensive houses as noise, so we rule it out because we want expensive homes to form their own cluster. K-means also has the least exponential decrease in cluster size. However, one of its clusters has only 16 observations, which may pose an issue when splitting the data into training and testing sets. We would also like to try HDBSCAN to see how it compares to K-means, not only by silhouette score but also by how each algorithm distinguishes between clusters.

We chose K-means clustering and HDBSCAN as the models to run on our training and testing datasets, and we compared their silhouette scores to determine whether the number of clusters was sufficient for our data. We also examined the cluster means for both models to check whether the clusters make sense. With HDBSCAN we got the highest silhouette score of all the clustering models we tried, but when we looked at the cluster means, it was hard to tell which cluster contained which houses or apartments. With K-means, by contrast, the cluster means showed that one cluster contained the more expensive and larger properties, which we wanted to capture. We accepted a lower silhouette score in exchange for clusters that are easier to interpret, such as clusters grouped by price and area in which the more expensive houses fall together. This gave us K-means as our final model.


#### K-Means hyperparameter selection: silhouette scores and elbow point
```
For n_clusters = 2 The average silhouette_score is : 0.30107733546841303

For n_clusters = 3 The average silhouette_score is : 0.3155014042386957

For n_clusters = 4 The average silhouette_score is : 0.31779166993293845

For n_clusters = 5 The average silhouette_score is : 0.33178193246403226

For n_clusters = 6 The average silhouette_score is : 0.2889691588627358

For n_clusters = 7 The average silhouette_score is : 0.2447831316658193

For n_clusters = 8 The average silhouette_score is : 0.24340154560755334

For n_clusters = 9 The average silhouette_score is : 0.2459223199976946

For n_clusters = 10 The average silhouette_score is : 0.23605072721070436 
```
![image-2.png](attachment:image-2.png)
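
Output of the kind listed above could be produced by a loop along these lines, which records both the average silhouette score and the within-cluster sum of squares (inertia) for each candidate number of clusters. The synthetic data and variable names are assumptions for illustration; the project's own matrix would take the place of `X`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the PCA-reduced housing matrix.
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.7, random_state=3)

scores, inertias = {}, {}
for k in range(2, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = model.fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # used to compare values of k
    inertias[k] = model.inertia_             # used for the elbow plot
    print(f"For n_clusters = {k} The average silhouette_score is : {scores[k]}")

best_k = max(scores, key=scores.get)
print("highest average silhouette at k =", best_k)
```

The silhouette maximum and the inertia elbow do not always agree, so it helps to record both before settling on a number of clusters.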

#### HDBSCAN hyperparameters: silhouette score and noise levels

```
min cluster size: 8, min_samples: 12
number of cluster : 3
number of noise : 292
silhouette score: 0.37751624163772035
min cluster size: 8, min_samples: 13
number of cluster : 5
number of noise : 1177
silhouette score: 0.20663968256013557
min cluster size: 8, min_samples: 14
number of cluster : 5
number of noise : 1213
silhouette score: 0.20547934053731656
min cluster size: 8, min_samples: 15
number of cluster : 4
number of noise : 1259
silhouette score: 0.21462886031149633
min cluster size: 8, min_samples: 16
number of cluster : 4
number of noise : 1312
silhouette score: 0.21049778287571863
min cluster size: 9, min_samples: 12
number of cluster : 5
number of noise : 1213
silhouette score: 0.20585845204936898
min cluster size: 9, min_samples: 13
number of cluster : 5
number of noise : 1177
silhouette score: 0.20663968256013557
min cluster size: 9, min_samples: 14
number of cluster : 5
number of noise : 1213
silhouette score: 0.20547934053731656
min cluster size: 9, min_samples: 15
number of cluster : 4
number of noise : 1259
silhouette score: 0.21462886031149633
min cluster size: 9, min_samples: 16
number of cluster : 3
number of noise : 738
silhouette score: 0.43600920074599986
min cluster size: 10, min_samples: 12
number of cluster : 5
number of noise : 1213
silhouette score: 0.20585845204936898
min cluster size: 10, min_samples: 13
number of cluster : 5
number of noise : 1177
silhouette score: 0.20663968256013557
min cluster size: 10, min_samples: 14
number of cluster : 4
number of noise : 1212
silhouette score: 0.21829436769946892
min cluster size: 10, min_samples: 15
number of cluster : 4
number of noise : 1259
silhouette score: 0.21462886031149633
min cluster size: 10, min_samples: 16
number of cluster : 3
number of noise : 738
silhouette score: 0.43600920074599986
min cluster size: 11, min_samples: 12
number of cluster : 5
number of noise : 1213
silhouette score: 0.20585845204936898
min cluster size: 11, min_samples: 13
number of cluster : 5
number of noise : 1177
silhouette score: 0.20663968256013557
min cluster size: 11, min_samples: 14
number of cluster : 4
number of noise : 1212
silhouette score: 0.21829436769946892
min cluster size: 11, min_samples: 15
number of cluster : 4
number of noise : 1259
silhouette score: 0.21462886031149633
min cluster size: 11, min_samples: 16
number of cluster : 3
number of noise : 738
silhouette score: 0.43600920074599986
min cluster size: 12, min_samples: 12
number of cluster : 5
number of noise : 1213
silhouette score: 0.20585845204936898
min cluster size: 12, min_samples: 13
number of cluster : 4
number of noise : 1175
silhouette score: 0.21900595238601828
min cluster size: 12, min_samples: 14
number of cluster : 4
number of noise : 1212
silhouette score: 0.21829436769946892
min cluster size: 12, min_samples: 15
number of cluster : 4
number of noise : 1259
silhouette score: 0.21462886031149633
min cluster size: 12, min_samples: 16
number of cluster : 3
number of noise : 738
silhouette score: 0.43600920074599986
```
![image-3.png](attachment:image-3.png)


### Subsection 5: Final Model and Insights
Our modeling efforts culminated in a final model that segments the real estate market into meaningful clusters. Although this model uses only a small amount of data, it performs well, aligns with the dataset's characteristics, and does what it was designed to do.

#### Kmeans: Training

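```
import matplotlib.pyplot as plt
import umap
from sklearn import cluster

# fit K-means with 4 clusters on the training set
clusterer_kmeans_train = cluster.KMeans(n_clusters=4, random_state=42)
cluster_labels_kmeans_train = clusterer_kmeans_train.fit_predict(X_train)
# embed the training set in 2D with UMAP (for visualization only)
umap_model_train = umap.UMAP(n_neighbors=8, min_dist=0.1, metric='euclidean', random_state = 99, init = 'spectral')
umap_result_train = umap_model_train.fit_transform(X_train)
scatter = plt.scatter(umap_result_train[:, 0], umap_result_train[:, 1], c = cluster_labels_kmeans_train, cmap='viridis', s = 1)
plt.title('UMAP Embedding of housing_df_matrix_scaled: training data')
# build legend entries from the cluster colors
plt.legend(*scatter.legend_elements(), title='cluster')
plt.tight_layout()
plt.show()
```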

![image-4.png](attachment:image-4.png)


```
# training df groupby clusters
# add column to housing_df_revised_train with kmeans cluster label for each point
housing_df_revised_train['kmeans clusters'] = cluster_labels_kmeans_train
# run groupby function by 'kmeans clusters' to see mean of each cluster
housing_df_revised_train.groupby('kmeans clusters').mean()
```


#### Kmeans: Testing
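```
# assign test points to the centroids learned on the training set (no refit)
cluster_labels_kmeans_test = clusterer_kmeans_train.predict(X_test)
# project test points into the UMAP embedding fitted on the training set
umap_result_test = umap_model_train.transform(X_test)
scatter = plt.scatter(umap_result_test[:, 0], umap_result_test[:, 1], c = cluster_labels_kmeans_test, cmap='viridis', s = 1)
plt.title('UMAP Embedding of housing_df_matrix_scaled: testing data')
# build legend entries from the cluster colors
plt.legend(*scatter.legend_elements(), title='cluster')
plt.tight_layout()
plt.show()
```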
![image-5.png](attachment:image-5.png)

```
Silhouette score for testing set: 0.3221086468122243
```
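
One way to score held-out data is to keep the centroids learned on the training set and only assign the test points to them, then compute the silhouette score on the test set. The sketch below shows that flow on synthetic data; the split ratio, the data, and the variable names are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the PCA-reduced housing matrix.
X, _ = make_blobs(n_samples=1000, centers=4, cluster_std=0.8, random_state=7)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Learn the centroids on the training set only.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_train)
train_labels = kmeans.labels_

# Assign each test point to its nearest learned centroid (no refit).
test_labels = kmeans.predict(X_test)

train_score = silhouette_score(X_train, train_labels)
test_score = silhouette_score(X_test, test_labels)
print(f"Silhouette score for training set: {train_score}")
print(f"Silhouette score for testing set: {test_score}")
```

A test score close to the training score suggests the cluster structure carries over to unseen listings rather than being an artifact of one particular sample.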


#### Final model selection
The final model we use is K-means. We wanted more clusters than HDBSCAN produced: HDBSCAN gave 3 clusters on our training set but only 2 on our testing set, which shows inconsistency (refer to figures 3 and 4 in ‘Training and Testing.ipynb’). HDBSCAN also labels a good number of points as noise, which we do not want to exclude since many of those houses could be valid data points to cluster. We believe K-means does the best job of clustering our data, since it gives a reasonable silhouette score without lumping too many data points together, which is what HDBSCAN did. HDBSCAN also produced one very large cluster of properties (refer to figures 3 and 4 in ‘Training and Testing.ipynb’), which is less helpful than K-means, where points are spread a little more evenly across clusters. With K-means, the cluster means overlap less, showing distinction between the clusters (refer to figures 1 and 2 in ‘Training and Testing.ipynb’). Although HDBSCAN had a silhouette score of 0.56, we do not think the tradeoff is worth it because it does not cluster our data points as well. Our final silhouette score for K-means with 4 clusters was 0.32 (refer to ‘Train and test using HDBSCAN’ in ‘Training and Testing.ipynb’ for more HDBSCAN details).




# Discussion

### Interpreting the Results


*Main Point:*

- The application of unsupervised machine learning, particularly clustering algorithms, effectively segments the real estate market into distinct categories that reflect varying consumer preferences and market conditions. This was shown through the clear differentiation in clusters based on properties' features, such as location, price, and crime rate, which align with specific buyer segments (luxury, affordable housing, rentals).

*Secondary Points:*
- *Housing Preferences:* The variation in silhouette scores across different numbers of clusters underscores the diverse range of housing preferences and needs, highlighting the model's ability to adapt and cater to varied consumer profiles.
- *Insights into Market Dynamics:* The clustering results offer insights into underlying market dynamics, such as the concentration of luxury properties in specific locales or the distribution of affordable homes, which can inform stakeholders about investment and development opportunities.
- *Model Versatility and Scalability:* The comparison of different clustering algorithms and their performance on the dataset illustrated the versatility and scalability of our approach, showing that it can be adapted to various datasets and market conditions.


### Limitations
One limitation of our study is the small scope of the dataset, which, while comprehensive, may not capture all nuances of the global real estate market. In addition, we had to merge several datasets to obtain roughly 9,400 observations, which was quite difficult.

Expanding the dataset could lead to more granular insights into regional preferences and trends. Exploring a wider range of hyperparameters could also improve the clustering, but we treated this project as a demonstration of what more data could make possible. One consequence was very large clusters that did not split up well, which we attribute to the limited amount of data.

Other limitations came from tradeoffs we made, for instance:

- *Visualization:*

    - When we first reduced the data to 2 dimensions with UMAP and then ran clustering, algorithms such as K-means and GMM appeared to work very well because the clusters looked well separated in the visualization. But when we looked at the cluster means after the UMAP reduction (refer to ‘umap reduction with model selection.ipynb’), we could not differentiate which clusters contained which data points (i.e., we could not tell which cluster contained the more expensive houses).
    - When we instead reduced to 5 dimensions with PCA, it was hard to judge how well the clustering algorithms did from a 2D UMAP projection, since the algorithms work in the higher-dimensional space, which a 2D plot cannot fully show. This made us skeptical about using PCA, but after checking the cluster means we found it made more sense to reduce to a lower dimension than we started with rather than all the way down to 2. We used an elbow plot (refer to ‘Dimensions reduction using PCA’ in ‘Training and Testing.ipynb’) to decide how many dimensions to keep, which is why we used 5 dimensions.

- *Data Issues:*
    - Some apartment complexes list several apartments that share the same values for every variable, such as bedrooms and price.
    - These show up as duplicate data points in our dataset, which can affect our evaluation on the training and testing sets.
    - However, this only accounted for 212 data points, which should not affect our model selection much.
    - In addition, our limited computing power made these algorithms run extremely slowly, which cost us a lot of time when training the model, conducting analysis, and running simulations.
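
The duplicate-listing issue described above can be measured, and optionally removed, with a couple of pandas calls. The toy rows and column names below are assumptions for illustration; whether duplicates should be dropped or kept is a modeling decision the text leaves open.

```python
import pandas as pd

# Toy listings: three units in one complex share every recorded value.
listings = pd.DataFrame({
    "Bedroom": [2, 2, 2, 3, 1],
    "Bathroom": [1, 1, 1, 2, 1],
    "Area": [900, 900, 900, 1500, 600],
    "Listed Price": [400000, 400000, 400000, 650000, 250000],
})

# Rows identical to an earlier row across all columns.
dup_mask = listings.duplicated(keep="first")
n_duplicates = int(dup_mask.sum())

# One option: keep a single copy of each distinct listing.
deduplicated = listings.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduplicated))
```

Removing duplicates before the train/test split also prevents identical rows from landing on both sides of the split.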


### Ethics & Privacy
Real estate data collection and analysis inevitably raise privacy concerns, especially regarding the personal information of property owners.

The potential for data misuse, such as unethical targeting or discrimination, necessitates strict adherence to privacy norms and the securing of explicit consent for data usage.

To address these ethical considerations, our project adheres to established guidelines focused on fairness, accountability, and transparency. These measures include anonymizing data and using only publicly sourced data from sites such as Zillow and Kaggle.

We also want to show that obtaining clear consent, underpinned by principles like those recommended by ethical frameworks, safeguards against privacy infringements and ensures ethical data handling and model application<a name="zook"></a>[<sup>[4]</sup>](#zooknote).

### Conclusion
This project demonstrates that unsupervised machine learning can enhance our understanding of the real estate market, offering a distinct view of consumer preferences and market dynamics through data segmentation.
The results support our approach and provide insights for various stakeholders. We wanted to show that with more data and computational power, there is more potential in this space. In the context of previous work, our findings contribute to the growing body of knowledge on applying machine learning in real estate, highlighting the potential for more personalized, efficient property matchmaking.
Future work could expand the datasets, refine the model through deeper hyperparameter exploration, and address the ethical considerations inherent in using personal and commercial data in machine learning projects.

# Footnotes
<a name="wachternote"></a>[<sup>[1]</sup>](#wachter): Sarah Wachter and Akash Mittal, "The Future of Real Estate Transactions: Insights into Machine Learning Applications," Journal of Property Investment & Finance (2020), https://www.mdpi.com/2073-445X/12/4/740.

<a name="choynote"></a>[<sup>[2]</sup>](#choy): L. H. T. Choy and W. K. O. Ho, "The Use of Machine Learning in Real Estate Research," Land (2023), https://www.researchgate.net/publication/369538750_The_Use_of_Machine_Learning_in_Real_Estate_Research.

<a name="soltaninote"></a>[<sup>[3]</sup>](#soltani): Soltani, A., Heydari, M., Aghaei, F., & Pettit, C. J. (2022). Housing price prediction incorporating spatio-temporal dependency into machine learning algorithms. Cities, 131, 103941. https://doi.org/10.1016/j.cities.2022.103941

<a name="zooknote"></a>[<sup>[4]</sup>](#zook): Matthew Zook et al., "Ten Simple Rules for Responsible Big Data Research," PLOS Computational Biology 13, no. 3 (2017): e1005399, https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005399.
