# AI/ML Assessment 2

## Imports

In [6]:
# To record the packages this notebook needs, run: !pip freeze > requirements.txt
# To install them in another environment, run: !pip install -r requirements.txt
import pandas as pd
import seaborn as sns
import plotly.express as px
import sklearn as sk
from sklearn.datasets import fetch_california_housing


# California Housing dataset 

### Summary of California Housing Dataset:

The California Housing dataset describes districts (census block groups) in California, and its target is the median house value of each district. Each row represents one district. The version loaded from scikit-learn includes the following features:

- Median Income (`MedInc`): Median income of households in the district.
- Housing Median Age (`HouseAge`): Median age of houses in the district.
- Average Rooms (`AveRooms`): Average number of rooms per household in the district.
- Average Bedrooms (`AveBedrms`): Average number of bedrooms per household in the district.
- Population (`Population`): Population of the district.
- Average Occupancy (`AveOccup`): Average number of household members in the district.
- Latitude and Longitude (`Latitude`, `Longitude`): Geographic coordinates of the district.
- Median House Value: Median house value for houses in the district, in units of $100,000 (target variable).

### Application of Supervised Machine Learning:

1. **Regression Analysis:**
   - The dataset can be used for regression analysis to predict the median house value based on the geographical and demographic features.
   - Various regression algorithms such as linear regression, decision trees, random forests, gradient boosting, or neural networks can be applied.
   - Performance metrics such as mean squared error (MSE), mean absolute error (MAE), root mean squared error (RMSE), and R-squared can be used to evaluate model performance.
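The regression workflow described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual model: `evaluate_regressor` is a hypothetical helper, linear regression is just one of the listed algorithms, and the demo data is synthetic stand-in data so the snippet runs without downloading anything.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate_regressor(X, y, test_size=0.2, seed=0):
    """Fit a linear regression on a train split and score it on a held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    return {
        "MSE": mse,
        "MAE": mean_absolute_error(y_test, y_pred),
        "RMSE": float(np.sqrt(mse)),  # RMSE is the square root of MSE
        "R2": r2_score(y_test, y_pred),
    }


# Synthetic stand-in data (NOT the housing data): 8 features, linear target plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 8))
y_demo = X_demo @ rng.normal(size=8) + rng.normal(scale=0.1, size=500)

scores = evaluate_regressor(X_demo, y_demo)
print(scores)
```

With the real dataset, the feature matrix and target returned by `fetch_california_housing()` (its `data` and `target` attributes) could be passed in place of the synthetic arrays.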

### Application of Unsupervised Machine Learning:

1. **Clustering Analysis:**
   - Apply clustering algorithms such as K-means or DBSCAN to group similar geographical regions together based on demographic features.
   - Identify clusters of regions with similar housing characteristics, which can provide insights into spatial patterns and regional disparities.
   - Visualization techniques such as heatmaps or choropleth maps can be used to visualize cluster densities and spatial patterns of housing characteristics.

2. **Dimensionality Reduction:**
   - Use techniques like Principal Component Analysis (PCA) to reduce the dimensionality of the dataset and visualize high-dimensional data in lower dimensions.
   - Explore underlying patterns or relationships between features and identify the most informative features that distinguish between districts.

3. **Anomaly Detection:**
   - Identify outliers or unusual patterns in the data, such as districts with significantly different housing characteristics compared to neighboring regions.
   - Anomaly detection techniques can help identify regions with unexpected housing trends or anomalies in the data, which may require further investigation.
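The three unsupervised techniques above can be combined in one short sketch. Everything here is an assumption for illustration: `explore_unsupervised` is a hypothetical helper, K-means with three clusters and an Isolation Forest are arbitrary choices among the options mentioned, and the demo uses random stand-in data rather than the housing features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler


def explore_unsupervised(X, n_clusters=3, seed=0):
    """Scale the features, then cluster, project to 2-D, and flag outliers."""
    # Scale first so that no single feature dominates the distance computations.
    X_scaled = StandardScaler().fit_transform(X)

    # 1. Clustering: assign each row to one of n_clusters groups.
    clusters = KMeans(
        n_clusters=n_clusters, n_init=10, random_state=seed
    ).fit_predict(X_scaled)

    # 2. Dimensionality reduction: keep the two leading principal components.
    pca = PCA(n_components=2).fit(X_scaled)
    components = pca.transform(X_scaled)

    # 3. Anomaly detection: fit_predict returns -1 for outliers and 1 for inliers.
    outliers = IsolationForest(random_state=seed).fit_predict(X_scaled) == -1

    return clusters, components, pca.explained_variance_ratio_, outliers


# Random stand-in data (NOT the housing data): 300 rows, 8 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 8))

clusters, components, explained, outliers = explore_unsupervised(X_demo)
print(components.shape, explained, int(outliers.sum()))
```

The 2-D `components` array is what one would typically scatter-plot, coloured by `clusters`, to inspect the grouping visually.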

### Summary:
The California Housing dataset supports both supervised and unsupervised machine learning. Supervised learning can be applied to predict median house values from demographic and geographical features, while unsupervised learning can reveal spatial patterns, regional disparities, and anomalies across California districts. Understanding these housing market dynamics and socio-economic patterns can inform urban planning, real estate development, and policy-making decisions.

In [10]:
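# Load the California Housing dataset (a Bunch holding data, target, and feature names)
housing = fetch_california_housing()

# Build a DataFrame of the eight feature columns (the target, housing.target, is not included)
housing_df = pd.DataFrame(housing.data, columns=housing.feature_names)

# Display the DataFrame (pandas truncates the output to the first and last rows)
housing_df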


| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | 1.5603 | 25.0 | 5.045455 | 1.133333 | 845.0 | 2.560606 | 39.48 | -121.09 |
| 20636 | 2.5568 | 18.0 | 6.114035 | 1.315789 | 356.0 | 3.122807 | 39.49 | -121.21 |
| 20637 | 1.7000 | 17.0 | 5.205543 | 1.120092 | 1007.0 | 2.325635 | 39.43 | -121.22 |
| 20638 | 1.8672 | 18.0 | 5.329513 | 1.171920 | 741.0 | 2.123209 | 39.43 | -121.32 |
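The DataFrame shown here holds only the feature columns, whereas a supervised model also needs the target. One possible way to attach it is sketched below. `bunch_to_frame` is a hypothetical helper, the column name `MedHouseVal` is the name scikit-learn uses for this target, and the tiny `Bunch` with made-up feature names and values is only a stand-in so the snippet runs offline.

```python
import numpy as np
import pandas as pd
from sklearn.utils import Bunch


def bunch_to_frame(bunch, target_name="MedHouseVal"):
    """Return a DataFrame with the Bunch's feature columns plus its target column."""
    df = pd.DataFrame(bunch.data, columns=bunch.feature_names)
    df[target_name] = bunch.target
    return df


# Tiny stand-in Bunch with placeholder values (NOT real housing data).
demo = Bunch(
    data=np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]),
    feature_names=["feat_a", "feat_b"],
    target=np.array([0.5, 1.5, 2.5]),
)

demo_df = bunch_to_frame(demo)
print(demo_df)
```

With the real data, the object returned by `fetch_california_housing()` could be passed directly, since it exposes the same `data`, `feature_names`, and `target` attributes.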


# Supervised ML Question

### Question: Can we predict the median house value (target variable) in California districts based on various demographic and geographical features?

# Unsupervised ML Question

### Question: Can the California Housing districts be clustered into economic regions based on median income (`MedInc`)?
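As a starting point for this question, one option is to run K-means on the income column alone and order the resulting clusters from lowest to highest income. The sketch below assumes three tiers, which is an arbitrary choice. `income_tiers` is a hypothetical helper, and the demo values are synthetic rather than taken from the dataset.

```python
import numpy as np
from sklearn.cluster import KMeans


def income_tiers(med_inc, n_tiers=3, seed=0):
    """Cluster 1-D income values and relabel clusters so tier 0 has the lowest income."""
    # K-means expects a 2-D array, so reshape the single column to (n_samples, 1).
    values = np.asarray(med_inc, dtype=float).reshape(-1, 1)
    km = KMeans(n_clusters=n_tiers, n_init=10, random_state=seed).fit(values)

    # K-means labels are arbitrary; rank the cluster centres so tiers rise with income.
    order = np.argsort(km.cluster_centers_.ravel())
    rank_of_cluster = np.empty(n_tiers, dtype=int)
    rank_of_cluster[order] = np.arange(n_tiers)
    return rank_of_cluster[km.labels_]


# Synthetic stand-in incomes (NOT the housing data): three well-separated groups.
rng = np.random.default_rng(0)
demo_income = np.concatenate([
    rng.normal(2.0, 0.2, 50),
    rng.normal(5.0, 0.2, 50),
    rng.normal(9.0, 0.2, 50),
])

tiers = income_tiers(demo_income)
print(np.bincount(tiers))
```

On the notebook's data this could be called as `income_tiers(housing_df["MedInc"])`. Clustering on income alone yields income bands rather than contiguous regions, so adding `Latitude` and `Longitude` as features may be worth exploring if geographic grouping is the goal.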