This code uses data from the following Kaggle dataset. <br>
[Real Estate Price Predictions](https://www.kaggle.com/datasets/quantbruce/real-estate-price-prediction)

In [None]:
# Install dependencies as needed:
# pip install kagglehub[pandas-datasets]
import kagglehub
from kagglehub import KaggleDatasetAdapter
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import pandas as pd
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt


In [None]:

# Set the path to the file you'd like to load
file_path = "Real estate.csv"

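# Load the latest version
# (recent kagglehub releases expose this function as dataset_load;
# load_dataset may emit a deprecation warning there)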
df = kagglehub.load_dataset(
  KaggleDatasetAdapter.PANDAS,
  "quantbruce/real-estate-price-prediction",
  file_path,
  pandas_kwargs={"index_col": "No"},
  # Provide any additional arguments like
  # sql_query or pandas_kwargs. See the
  # documentation for more information:
  # https://github.com/Kaggle/kagglehub/blob/main/README.md#kaggledatasetadapterpandas
)

# Start the table on its own line so the column headers stay aligned
print("First 5 records:\n", df.head())

In [None]:
df.head()

In [None]:
df.describe()

# Investigation of the data.
Because I want to figure out the price of a house based on its location, a regression-based algorithm isn't the approach I want to take here. Instead, we will investigate various clustering options for this dataset and see how they perform. <br>

The clustering algorithm we choose depends on the patterns we notice in the data, the size of the dataset, and so on. Given these factors, we need to consider the following before choosing an appropriate clustering algorithm. <br>

1. This isn't a large dataset (414 rows), so models that require a large dataset likely won't perform very well unless we generate data.
2. The documentation for this dataset is sparse, so we should do a little digging to determine what we are dealing with.
3. There isn't any missing data in this dataset (a huge plus, because we don't have to deal with nulls). However, we don't have many features to work with. Are there any features that we can generate or derive from this dataset?
4. What is the distribution of the prices? (It is common to have some outliers when dealing with numbers without a cap. How might we handle them?)
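
Items 3 and 4 can be checked directly. The block below is a minimal sketch, not the only way to do it. `iqr_outliers` is a hypothetical helper (it is not part of pandas), and the 1.5 × IQR rule it applies is an assumption on my part; the dataset itself does not say how outliers should be defined.

```python
import pandas as pd


def iqr_outliers(prices: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k * IQR, Q3 + k * IQR]."""
    q1, q3 = prices.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (prices < lower) | (prices > upper)
```

With the dataset loaded as `df` above, `df.isna().sum()` gives the null count per column, and `df.loc[iqr_outliers(df['Y house price of unit area'])]` lists the rows flagged by this rule.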


Item 1 above is more of a statement than something to investigate, but it does affect our final decision. <br>
So, I am going to start by investigating item 2. <br>
From some basic exploration of this dataset, it looks like these coordinates are in the Taipei area of Taiwan. This affects the currency and various other aspects of the data, and the dataset does not come with any additional information about them. However, we can still analyze this dataset, and I am hoping to cover concepts along the way that can be applied to other datasets.

In [None]:


# Assume df is your housing dataset
features = ["X1 transaction date",
            "X2 house age",
            "X3 distance to the nearest MRT station",
            "X4 number of convenience stores",
            "X5 latitude",
            "X6 longitude"]
X = df[features]

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply KMeans
kmeans = KMeans(n_clusters=3, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)
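
The cell above fixes `n_clusters=3` without a stated reason. One common way to sanity-check that choice is to compare silhouette scores across several values of k. The block below is a sketch under that assumption, not the method this notebook used. `silhouette_by_k` is a hypothetical helper, and `n_init=10` is set explicitly only so results are comparable across scikit-learn versions.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_by_k(X_scaled, k_values=range(2, 9), random_state=42):
    """Fit KMeans once per k and return {k: mean silhouette score}.

    Scores closer to 1 indicate tighter, better separated clusters.
    """
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X_scaled)
        scores[k] = silhouette_score(X_scaled, labels)
    return scores
```

Calling `silhouette_by_k(X_scaled)` on the standardized features returns a dictionary that can be printed or plotted. The silhouette score is only one criterion, so it is worth reading alongside the PCA scatter plot rather than treating it as a final answer.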


In [None]:
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
df['pca1'], df['pca2'] = X_pca[:, 0], X_pca[:, 1]

plt.scatter(df['pca1'], df['pca2'], c=df['cluster'], cmap='viridis')
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.title('Housing Clusters')
plt.show()


In [None]:
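# PCA is already imported in the first cell. Refit here so this cell runs on its own.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Share of the total variance captured by each of the two components
print(pca.explained_variance_ratio_)

# Component weights: how much each standardized feature contributes to each component
loadings = pd.DataFrame(
    pca.components_.T,
    columns=['PCA1', 'PCA2'],
    index=features
)

print(loadings)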

In [None]:
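# Reuse df's index ('No', starting at 1) so the price column aligns row by row.
# With the default 0-based index, the assignment below would shift prices by one row.
df_pca = pd.DataFrame(X_pca, columns=['PCA1', 'PCA2'], index=df.index)
df_pca['price'] = df['Y house price of unit area']  # append price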

# Compute correlations
correlations = df_pca.corr()
print(correlations['price'])
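
Since the goal stated earlier is to relate price to the clusters, one possible follow-up is to summarize price within each cluster. This is a sketch, not a step from the original analysis. `price_by_cluster` is a hypothetical helper, and its default column names assume the `cluster` column created in the KMeans cell and the price column name used above.

```python
import pandas as pd


def price_by_cluster(frame: pd.DataFrame,
                     cluster_col: str = "cluster",
                     price_col: str = "Y house price of unit area") -> pd.DataFrame:
    """Return count, mean, median, and standard deviation of price per cluster."""
    return frame.groupby(cluster_col)[price_col].agg(["count", "mean", "median", "std"])
```

Running `price_by_cluster(df)` after the KMeans cell gives one row per cluster. If the cluster means are clearly different, the clusters carry some information about price; if they overlap heavily, the features or the number of clusters may need another look.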