# COGS 118B - Final Project

# Home Scout : Machine Learning Meets Real Estate

## Group members
- Nhathan Nguyen
- Kris Alejo 
- Elizabeth Lee
- Colin Sutedja

# Abstract 
Our objective is to improve housing search for a broad audience by using real estate data and market analysis to match individuals with housing that fits their needs in terms of affordability, location, and crime rate. The core of our approach is an algorithm that sifts through extensive listing information to locate suitable properties, bridging the gap between diverse housing needs and available offers. We judge the solution by how accurately it aligns users with properties that meet their specific criteria. To keep the model practical and adaptable in real-world scenarios, we work with a sizable yet manageable dataset that allows for realistic application and for adjusting to changing market conditions. Our main contribution is a tool that aims not only to meet but to anticipate the needs of potential homeowners or renters, presenting them with options that match their preferences and necessities. We chose an unsupervised machine learning approach, which lets us cluster listings efficiently and surface candidate matches quickly, even when the data come from scattered sources.

# Background
The exploration of unsupervised machine learning in the real estate domain has received significant attention due to its potential to change property search, valuation, and investment strategies. Before the recent rise of AI, work combining ML and real estate focused primarily on supervised learning models that predict house prices from a set of predefined features. However, these models often require extensive labeled datasets, which are time-consuming and costly to produce.

Recent advancements have shifted towards unsupervised learning, which does not require labeled data and can discover hidden patterns within the real estate market, offering insights into customer preferences, market segmentation, and predictive analysis of property values. For example, clustering algorithms have been employed to segment properties into distinct categories, enhancing personalized property recommendations.


Furthermore, dimensionality reduction techniques, such as Principal Component Analysis, have made complex real estate datasets easier to visualize and understand, enabling investors to make informed decisions by identifying key factors influencing the market. These approaches underscore the versatility of unsupervised machine learning in the real estate sector. By leveraging unsupervised models, the industry can tap into previously unexplored data dimensions, offering a more nuanced understanding of market dynamics and consumer behavior. Despite the promising applications, challenges such as data privacy, model interpretability, and the integration of domain expertise remain. Addressing these issues is crucial for the development of robust, ethical, and practical AI solutions in real estate.

Therefore, we designed a project that speaks to the further potential of unsupervised models and the scalability of these systems<a name="wachter"></a>[<sup>[1]</sup>](#wachternote)<a name="choy"></a>[<sup>[2]</sup>](#choynote)<a name="soltani"></a>[<sup>[3]</sup>](#soltaninote).

# Problem Statement
The problem we aim to solve is the significant inefficiency in pairing potential renters or buyers with the most suitable housing options available on the real estate market. There is a noticeable gap in the market for a unified system that can accurately align consumer preferences with available properties, taking into account factors such as pricing, location, and specific user desires. This inefficiency not only costs consumers time but can also lead to less than optimal housing matches, affecting overall quality of life and financial well-being. The problem can be quantified through various data points including, but not limited to, price points, geographical data, housing features, and individual consumer preferences. Success can be measured through the rate of successful matches, user satisfaction ratings, time saved in the housing search process, and the overall cost-effectiveness for both consumers and sellers.

# Data
### US Real Estate Dataset
Dataset: https://www.kaggle.com/datasets/febinphilips/us-house-listings-2023

This dataset consists of 22,681 observations and 14 variables: State, City, Street, Zipcode, Number of Bedrooms, Number of Bathrooms, Area, Price Per Square Foot, Lot Area, Market Estimate, Rent Estimate, Latitude, Longitude, and Listed Price. The numeric variables are stored as float and int.

The variables we plan to feed into our model are Bedroom, Bathroom, Area, Price Per Square Foot, Lot Area, Market Estimate, Rent Estimate, and Listed Price. The remaining variables (State, City, Street, Zipcode, Latitude, and Longitude) are used to measure location, which we plan to do in several ways. One measurement is the distance to the nearest metropolitan city, computed from the latitude and longitude coordinates. Other location measurements include the crime rate of the city, socioeconomic status, and employment rate. These can be obtained by cross-referencing the zipcode or city variable with other datasets that measure the corresponding quantities.

For data cleaning, we first dropped all rows with N/A values, which left 14,843 observations. For data transformation, we drop the State, City, Street, Zipcode, Latitude, and Longitude columns from the model inputs, but first use them to generate new variables: distance to the nearest metropolitan city, crime rate, socioeconomic status score, and employment rate. These new variables are then appended to the remaining ones. To complete the transformation, we must still define how we measure socioeconomic status.
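The distance-to-nearest-metropolitan-city feature described above can be sketched with the haversine (great-circle) formula. This is a minimal illustration, not the project's actual code: the `METRO_CITIES` table, the function names, and the choice of haversine distance are all assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical reference table; a real one could be built from a cities dataset.
METRO_CITIES = {
    "Los Angeles": (34.0522, -118.2437),
    "New York": (40.7128, -74.0060),
    "Chicago": (41.8781, -87.6298),
}


def nearest_metro(lat, lon, metros=METRO_CITIES):
    """Return (city_name, distance_km) of the closest entry in `metros`."""
    name, (mlat, mlon) = min(
        metros.items(),
        key=lambda kv: haversine_km(lat, lon, kv[1][0], kv[1][1]),
    )
    return name, haversine_km(lat, lon, mlat, mlon)


# Example: a listing located in San Diego
city, dist_km = nearest_metro(32.7157, -117.1611)
```

For the San Diego coordinates used here, the nearest entry in the toy table is Los Angeles, at roughly 180 km. Applied row by row to the Latitude and Longitude columns, a function like this would yield one distance value per listing.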

### United States Cities Database
Dataset: https://simplemaps.com/data/us-cities

This dataset consists of 31,120 observations and 17 variables. Each observation represents a particular city in the United States. We plan to merge this dataset with the US real estate dataset on the city of each listing. The 17 variables are: city, city_ascii, state_id, state_name, county_fips, county_name, lat, lng, population, density, source, military, incorporated, timezone, ranking, zips, and id. We plan to use only the following variables: city, population, density, and ranking. The ranking variable measures the importance of a city, with 1 being the most important and 5 the least. We will use these variables to define and measure the location variable. For example, cities with large populations and high density are usually more desirable, so prices there tend to be higher. The ranking of the city will also factor into how we quantify the city variable. From there we can assess whether the price of a listing is worth it for that particular location.
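A merge of this kind could look like the sketch below, using `pandas.DataFrame.merge`. The listing-side column names (`City`, `State`, `Listed Price`) and all values are made up for illustration. Joining on state as well as city is an assumption added here, since city names alone repeat across states.

```python
import pandas as pd

# Toy stand-ins for the two datasets; values are invented for illustration.
listings = pd.DataFrame({
    "City": ["San Diego", "Austin", "Smallville"],
    "State": ["CA", "TX", "KS"],
    "Listed Price": [850000, 520000, 150000],
})
cities = pd.DataFrame({
    "city": ["San Diego", "Austin"],
    "state_id": ["CA", "TX"],
    "population": [1000, 2000],
    "density": [10.0, 20.0],
    "ranking": [1, 2],
})

# Left join keeps every listing, even when no city record matches.
merged = listings.merge(
    cities[["city", "state_id", "population", "density", "ranking"]],
    how="left",
    left_on=["City", "State"],
    right_on=["city", "state_id"],
    indicator=True,  # adds a "_merge" column: "both" or "left_only"
)

# Listings without a matching city can be inspected (or dropped) explicitly.
unmatched = merged[merged["_merge"] == "left_only"]
```

The `indicator=True` flag makes it easy to count how many listings fail to match before deciding whether to drop or impute them.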

### Crime Database
Dataset: https://ucr.fbi.gov/crime-in-the-u.s/2019/crime-in-the-u.s.-2019/tables/table-8/table-8.xls/view 

### *** NEED TO ADD TO THIS ****



# Proposed Solution

Our proposed solution is to develop a machine learning system that dynamically evaluates consumer profiles against current market listings. This system will use an algorithm to propose the most suitable housing matches, simplifying the search process and improving the fit between consumers' needs and the housing options available. We want this approach to be adaptable and scalable, with potential applications across various markets and room to incorporate more data in the future.

# Evaluation Metrics

### *** NEED TO ADD TO THIS *** 

# Results
By leveraging unsupervised machine learning models to uncover hidden patterns within a comprehensive real estate dataset, our aim was to enhance the property search experience by effectively matching properties to potential buyers' preferences and the lowest offered price.

### Subsection 1: Dataset and Problem Analysis
Our project is built on `housing_data_df`, a dataset that underwent rigorous cleaning and preprocessing to uphold the integrity of our analysis. It contains 10,000 observations drawn from several datasets merged together. It covers diverse property attributes, ranging from location specifics (like state and city) to physical characteristics (such as bedrooms and square footage) and financial details (including listing prices and tax values), as well as local crime rates, offering a comprehensive view of what influences real estate values.

Key steps in our data preparation included imputing missing values for critical attributes using the median values of similar properties, identifying and adjusting outliers with statistical methods to prevent skewing, and normalizing attributes so that each contributes on a comparable scale. The exploratory data analysis (EDA) phase was invaluable, revealing significant patterns and trends within the data, such as distributions of property attributes that indicate market segments and location-based trends that affect property values. This phase also included correlation analyses to better understand what drives property values.
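The three preparation steps named above (group-median imputation, outlier adjustment, normalization) might be sketched as follows. The toy frame, the grouping by `city`, the 1.5 × IQR clipping rule, and z-score scaling are assumptions chosen for illustration; the report does not specify which statistical methods were used.

```python
import numpy as np
import pandas as pd

# Toy data; column names and values are hypothetical.
df = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "B"],
    "area": [1000.0, np.nan, 1200.0, 2000.0, 2200.0, np.nan],
    "price": [300.0, 320.0, 310.0, 500.0, 5000.0, 520.0],
})

# 1. Impute missing values with the median of similar properties (same city).
df["area"] = df["area"].fillna(df.groupby("city")["area"].transform("median"))
area_after_impute = df["area"].copy()

# 2. Adjust outliers by clipping to the 1.5 * IQR fences.
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
df["price"] = df["price"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
price_cap = df["price"].max()

# 3. Standardize so every attribute has mean 0 and unit variance.
numeric = ["area", "price"]
df[numeric] = (df[numeric] - df[numeric].mean()) / df[numeric].std(ddof=0)
```

Clipping keeps the extreme listing in the data but limits its pull on the mean and variance, which matters because distance-based clustering is sensitive to scale.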

This groundwork ensured our dataset was prepared for unsupervised learning models, enhancing the relevance and applicability of our findings to the real-world market. The insights gained not only informed our feature selection and dimensionality reduction efforts but also helped set realistic benchmarks for model performance, laying a good foundation for effective real estate market segmentation.


### Subsection 2: Feature Selection and Data Transformation
To streamline our analysis, we used Principal Component Analysis (PCA) as our main method of feature selection and dimensionality reduction. We chose it because clustering can be computationally heavy: reducing the number of dimensions lowers the cost of every distance computation. PCA reduces the dataset's complexity while retaining most of its variance. It transforms the data into a set of uncorrelated variables, the principal components, which are linear combinations of the original features. The process involves standardizing the data, computing the covariance matrix, and deriving its eigenvalues and eigenvectors to identify the principal components. The components are then ordered by the variance they capture, and we keep the leading ones and discard the rest. Working with a small number of components also makes the clustering results easier to visualize and interpret.
As a result, using PCA for data transformation improved the efficiency of our clustering analysis.
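The PCA steps described above (standardize, covariance matrix, eigendecomposition, order by variance, project) can be written out directly in NumPy. The sketch below uses synthetic data in place of `housing_data_df`, and the 90% variance threshold is an assumption for illustration rather than the value used in the project.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 rows, 4 features, the first two strongly correlated.
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# 1. Standardize each feature to mean 0 and unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features.
cov = np.cov(Z, rowvar=False)

# 3. Eigendecomposition (eigh is meant for symmetric matrices).
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Order components by the variance they capture, largest first.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
explained = eigvals / eigvals.sum()

# 5. Keep the fewest components reaching 90% cumulative variance (assumed cutoff).
k = int(np.searchsorted(np.cumsum(explained), 0.90) + 1)
scores = Z @ eigvecs[:, :k]  # data projected onto the first k components
```

Because two of the four synthetic features are nearly redundant, one direction carries almost no variance and can be dropped. In practice, `sklearn.decomposition.PCA` performs the same computation and is the more common choice.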

### Subsection 3: Base Model
Our base model used KMeans clustering and HDBSCAN, guided by silhouette scores to determine the optimal number of clusters. This approach produced the initial grouping of properties, revealing distinct categories based on their features. The silhouette score, which measures how similar an object is to its own cluster compared to other clusters, served as our primary metric for assessing the effectiveness of the clustering.
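Choosing the number of clusters by silhouette score could be sketched as below with scikit-learn's `KMeans` and `silhouette_score`. The synthetic three-blob data, the candidate range of 2 to 6 clusters, and the `n_init` and `random_state` settings are assumptions for the example, not the project's configuration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the (PCA-reduced) listings: three well-separated blobs.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + 0.5 * rng.normal(size=(50, 2)) for c in centers])

# Fit KMeans for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest silhouette score is taken as the best candidate.
best_k = max(scores, key=scores.get)
```

Silhouette values lie between -1 and 1, with values near 1 indicating compact, well-separated clusters. On real listing data the peak is usually far less pronounced than on these toy blobs, so the scores are best read alongside a visual check of the clusters.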

#### *PASTE RESULTS*

### Subsection 4: Model Selection and Hyperparameter Tuning 
Next, we ran HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) and Spectral Clustering.

These methods were evaluated for their ability to uncover the underlying structure of the real estate market data, each offering unique advantages in dealing with the complexities and subtleties of the dataset.

HDBSCAN excels in identifying clusters of varying densities, making it particularly suited for real estate data where market segments can vastly differ in density and distribution. Its minimal need for hyperparameter tuning, aside from the minimum cluster size, simplifies the process of model selection.
#### *PASTE RESULTS*
Spectral Clustering, leveraging the eigenvalues of a similarity matrix, is adept at identifying clusters based on the connectivity of the data points, potentially offering insights into more complex, non-linear relationships between properties.
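The connectivity-based behaviour of Spectral Clustering is easiest to see on data that is not linearly separable. The sketch below uses scikit-learn's `SpectralClustering` on two concentric rings from `make_circles`. The dataset, the nearest-neighbour affinity, and `n_neighbors=10` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_circles

# Two concentric rings: separable by connectivity, not by a straight line.
X, ring = make_circles(n_samples=300, factor=0.4, noise=0.03, random_state=0)

# Build the similarity matrix from a k-nearest-neighbour graph.
model = SpectralClustering(
    n_clusters=2,
    affinity="nearest_neighbors",
    n_neighbors=10,
    random_state=0,
)
labels = model.fit_predict(X)

# Cluster ids are arbitrary, so compare against the true rings up to a swap.
agreement = max(np.mean(labels == ring), np.mean(labels != ring))
```

The `agreement` value shows how closely the recovered clusters follow the two rings. Spectral Clustering requires the number of clusters up front and builds an n × n similarity matrix, which becomes expensive for larger datasets.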
#### *PASTE RESULTS*


### Subsection 5: Final Model and Insights
Our modeling efforts culminated in the selection of a final model that segments the real estate market into meaningful clusters. Although this model uses only a small amount of data, it performs well, aligns with the dataset's characteristics, and does what it was designed to do.

#### *PASTE RESULTS*
Our key insights came from a detailed analysis of silhouette scores alongside a qualitative "eyeball test" to evaluate the practical coherence of the clusters. Although HDBSCAN generally favored a configuration with fewer, more generalized clusters, it achieved high silhouette scores, indicating strong cluster separation and cohesion.



# Discussion

### Interpreting the Results


#### *Main Point:*
The application of unsupervised machine learning, particularly clustering algorithms, effectively segments the real estate market into distinct categories that reflect varying consumer preferences and market conditions. This was shown through the clear differentiation in clusters based on properties' features, such as location, price, and crime rate, which align with specific buyer segments (luxury, affordable housing, rentals).

#### *Secondary Points:*
- Housing Preferences: The variation in silhouette scores across different numbers of clusters underscores the diverse range of housing preferences and needs, highlighting the model's ability to adapt and cater to varied consumer profiles.
- Insights into Market Dynamics: The clustering results offer insights into underlying market dynamics, such as the concentration of luxury properties in specific locales or the distribution of affordable homes, which can inform stakeholders about investment and development opportunities.
- Model Versatility and Scalability: The comparison of different clustering algorithms and their performance on the dataset illustrated the versatility and scalability of our approach, showing that it can be adapted to various datasets and market conditions.

### Limitations
Expanding the dataset could lead to more granular insights into regional preferences and trends. Exploring a wider range of hyperparameters could also improve model accuracy. However, we built this project primarily to show what could be achieved with more data, so we kept the hyperparameter search limited.

In addition, our limited computing power made these algorithms run extremely slowly, which cost us considerable time when training the model, conducting analysis, and running simulations, and reduced the amount of testing we could do.

### Ethics & Privacy
Real estate data collection and analysis inevitably raise privacy concerns, especially regarding the personal information of property owners.

The potential for data misuse, such as unethical targeting or discrimination, necessitates strict adherence to privacy norms and the securing of explicit consent for data usage. 

To address these ethical considerations, our project adheres to established guidelines focused on ensuring fairness, accountability, and transparency. These measures include anonymizing data and using only publicly sourced data from sites such as Zillow and Kaggle.

We also want to show that obtaining clear consent, underpinned by the principles recommended by established ethical frameworks, helps safeguard against privacy infringements and ensures ethical data handling and model application<a name="zook"></a>[<sup>[4]</sup>](#zooknote).

### Conclusion
This project demonstrates that unsupervised machine learning can enhance our understanding of the real estate market, offering a distinct view of consumer preferences and market dynamics through effective data segmentation.
The results not only support our approach but also provide actionable insights for various stakeholders. We wanted to show that with more data and computational power, there is more potential in this space. In the context of previous work, our findings contribute to the growing body of knowledge on applying machine learning in real estate, highlighting the potential for more personalized, efficient property matchmaking.
We hope future work will expand the datasets, refine the model through deeper hyperparameter exploration, and address the ethical considerations inherent in leveraging personal and commercial data in machine learning projects.

# Footnotes
<a name="wachternote"></a>[<sup>[1]</sup>](#wachter): Sarah Wachter and Akash Mittal, "The Future of Real Estate Transactions: Insights into Machine Learning Applications," Journal of Property Investment & Finance (2020), https://www.mdpi.com/2073-445X/12/4/740.  

<a name="choynote"></a>[<sup>[2]</sup>](#choy): L. H. T. Choy and W. K. O. Ho, "The Use of Machine Learning in Real Estate Research," Land (2023), https://www.researchgate.net/publication/369538750_The_Use_of_Machine_Learning_in_Real_Estate_Research.

<a name="soltaninote"></a>[<sup>[3]</sup>](#soltani): Soltani, A., Heydari, M., Aghaei, F., & Pettit, C. J. (2022). Housing price prediction incorporating spatio-temporal dependency into machine learning algorithms. Cities, 131, 103941. https://doi.org/10.1016/j.cities.2022.103941

<a name="zooknote"></a>[<sup>[4]</sup>](#zook): Matthew Zook et al., "Ten Simple Rules for Responsible Big Data Research," PLOS Computational Biology 13, no. 3 (2017): e1005399, https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005399. 
