#**California Housing Data Set**

#**INTRODUCTION**

# California Housing Market Analysis 🏡

This project explores and analyzes housing data from California to uncover patterns that can help predict housing prices.  
Key areas include **data preprocessing**, **visualization**, and **correlation analysis** using Python.

---

## 📊 Dataset Overview

We use 10 different variables in this dataset, each representing important characteristics of housing blocks in California:

* **Longitude** – A measure of how far west a house is; more negative values indicate farther west.  
* **Latitude** – A measure of how far north a house is; higher values indicate farther north.  
* **Housing Median Age** – Median age of a house within a block; lower values represent newer buildings.  
* **Total Rooms** – Total number of rooms within a housing block.  
* **Total Bedrooms** – Total number of bedrooms within a housing block.  
* **Population** – Total number of people residing within a housing block.  
* **Households** – Total number of households (a group of people living in a housing unit) in the block.  
* **Median Income** – Median income for households in the block (measured in tens of thousands of USD).  
* **Median House Value** – Median house value for households in the block (measured in USD).  
* **Ocean Proximity** – The location of the house in relation to the ocean/sea.

---


#Types of data for each variable: nominal, ordinal, discrete, or continuous

## 🔍 Variable Types

Each variable in the dataset falls into one of the following data type categories:

* **Longitude** – Continuous  
* **Latitude** – Continuous  
* **Housing Median Age** – Discrete  
* **Total Rooms** – Discrete  
* **Total Bedrooms** – Discrete  
* **Population** – Discrete  
* **Households** – Discrete  
* **Median Income** – Continuous  
* **Median House Value** – Continuous (a monetary amount, recorded in whole USD)  
* **Ocean Proximity** – Nominal
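
One way to sanity-check a classification like this is to look at whether a column holds text, whole numbers, or fractional numbers. The sketch below is only a rough heuristic on a tiny hand-made frame: the rows are invented, the `INLAND` label is an assumed category, and `rough_type` is a hypothetical helper, not part of the original analysis.

```python
import pandas as pd

# Invented rows for illustration; column names follow the dataset description.
sample = pd.DataFrame({
    "housing_median_age": [10.0, 25.0, 40.0],
    "median_income": [2.5, 3.75, 6.1],
    "ocean_proximity": ["NEAR OCEAN", "INLAND", "INLAND"],
})

def rough_type(series):
    """Rough heuristic: text -> nominal, whole numbers -> discrete, else continuous."""
    if not pd.api.types.is_numeric_dtype(series):
        return "nominal"
    if (series.dropna() % 1 == 0).all():
        return "discrete"
    return "continuous"

types = {col: rough_type(sample[col]) for col in sample.columns}
print(types)
```

A heuristic like this cannot tell nominal from ordinal, and it will label any quantity stored as whole numbers as discrete, so the final call still needs domain judgment.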


#**Questions**

##**1. What is the average median income of the data set? Check the distribution of the data using appropriate plots and explain the distribution shown in the plot.**


###**Answer**  :

###Distribution of the data : 

###**Explanation of the distribution** :

In the histogram above, the X-axis is median_income and the Y-axis is the number of families.

The distribution is right-skewed (positively skewed): skewness = 1.6466567021344465, which is positive.

Most observations sit at lower incomes, and the number of families falls off as median_income increases, producing a long right tail.
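
To make the link between a right tail and a positive skewness value concrete, here is a minimal sketch on made-up numbers (not taken from `housing.csv`). In a right-skewed sample a few large values pull the mean above the median, and `Series.skew()` comes out positive.

```python
import pandas as pd

# Made-up, right-skewed values standing in for median_income.
income = pd.Series([1.5, 2.0, 2.5, 3.0, 3.0, 3.5, 4.0, 5.0, 8.0, 15.0])

mean_income = income.mean()
median_income = income.median()
skewness = income.skew()  # sample skewness; positive means a longer right tail

print(f"mean={mean_income:.2f}, median={median_income:.2f}, skew={skewness:.2f}")
```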

##**2. Draw an appropriate plot for housing_median_age and explain your observations.**

###**Answer** :

###**Inference** :

In the histogram above, the X-axis is housing_median_age and the Y-axis is the number of families.

The distribution is fairly symmetrical: its skewness is about 0.06, which lies between -0.5 and 0.5.
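
The -0.5 to 0.5 band used above can be written down as a small helper. This is a sketch of a common rule of thumb: only the "fairly symmetrical" band comes from the text, while the "moderately" and "highly" cut-offs at ±1 and the name `describe_skew` are assumptions added for illustration.

```python
def describe_skew(skewness):
    """Label a skewness value using a rule of thumb (the thresholds are conventions)."""
    if -0.5 <= skewness <= 0.5:
        return "fairly symmetrical"
    if -1.0 <= skewness <= 1.0:
        return "moderately skewed"   # assumed cut-off, not from the original text
    return "highly skewed"           # assumed cut-off, not from the original text

# Rounded skewness values reported for housing_median_age and median_income.
labels = {value: describe_skew(value) for value in (0.06, 1.65)}
print(labels)
```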

##**3. Show with the help of a visualization how median_income and median_house_value are related.**

###**Answer** :

###**Inference** : 

In the scatter plot above, the X-axis is median_income and the Y-axis is median_house_value.

As median_income increases, median_house_value tends to increase as well, i.e. the two are positively related.
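
A scatter plot shows the direction of a relationship; a correlation coefficient puts a number on it. The sketch below computes Pearson's r with `Series.corr` on invented values (the column names mirror the dataset, the numbers do not come from it).

```python
import pandas as pd

# Made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "median_income": [1.0, 2.0, 3.0, 4.0, 5.0],
    "median_house_value": [90000, 150000, 180000, 260000, 310000],
})

# Pearson correlation: +1 is a perfect increasing line, 0 is no linear relation.
r = toy["median_income"].corr(toy["median_house_value"])
print(f"Pearson r = {r:.3f}")
```

Running the same call on the real `df` would quantify how strongly the two columns move together.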

##**4. Create a data set by deleting the corresponding examples from the data set for which total_bedrooms are not available.**

###**Answer** :



##**5. Create a data set by filling the missing data with the mean value of the total_bedrooms in the original data set.**

###**Answer** :
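
The difference between dropping rows (question 4) and mean imputation (question 5) is easy to see on a toy frame. This is a minimal sketch with invented rows; it builds new frames rather than modifying the original, which is one possible design and not necessarily what the notebook cells do.

```python
import numpy as np
import pandas as pd

# Made-up rows; NaN marks a missing total_bedrooms value.
toy = pd.DataFrame({
    "total_rooms": [800, 1200, 950, 400],
    "total_bedrooms": [200.0, np.nan, 300.0, 100.0],
})

# Option 1: drop only the rows where total_bedrooms is missing.
dropped = toy.dropna(subset=["total_bedrooms"])

# Option 2: keep every row and fill the gap with the column mean
# (mean() skips NaN by default).
bedroom_mean = toy["total_bedrooms"].mean()
filled = toy.assign(total_bedrooms=toy["total_bedrooms"].fillna(bedroom_mean))

print(len(dropped), len(filled), bedroom_mean)
```

Dropping loses a row but keeps only observed values; filling keeps the row count but pulls the filled entries toward the centre of the distribution.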


##**6. Write a programming construct to calculate the median value of the data set wherever required.**

###**Answer** :
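
For reference, the median can also be computed by hand: sort the values, then take the middle one, or the average of the two middle ones when the count is even. The function below is a plain-Python sketch with a hypothetical name, `median_of_values`; it does not handle missing values the way pandas does.

```python
def median_of_values(values):
    """Return the median of a non-empty sequence of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sequence is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd count: the single middle element.
        return ordered[middle]
    # Even count: the average of the two middle elements.
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_result = median_of_values([5, 1, 3])      # middle of [1, 3, 5]
even_result = median_of_values([4, 1, 3, 2])  # average of 2 and 3
print(odd_result, even_result)
```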

##**7. Plot latitude versus longitude and explain your observations.**

###**Answer** :

###**Inference** :
 
In the scatter plot above, the X-axis is latitude and the Y-axis is longitude.

As latitude decreases, longitude increases (becomes less negative), i.e. the two are inversely related. The points trace the outline of California, which runs from the north-west to the south-east.

##**8. Create a data set for which the ocean_proximity is 'NEAR OCEAN'.**

###**Answer** :

##**9. Find the mean and median of the median income for the data set created in question 8.**

###**Answer** :
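
Questions 8 and 9 combine a boolean filter with two summary statistics. A minimal sketch on invented rows (the `INLAND` label is an assumed second category used only for contrast):

```python
import pandas as pd

# Made-up rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "median_income": [2.0, 3.0, 4.0, 9.0, 5.0],
    "ocean_proximity": ["NEAR OCEAN", "INLAND", "NEAR OCEAN", "NEAR OCEAN", "INLAND"],
})

# Keep only the rows whose category matches, as an independent copy.
near_ocean = toy[toy["ocean_proximity"] == "NEAR OCEAN"].copy()

subset_mean = near_ocean["median_income"].mean()
subset_median = near_ocean["median_income"].median()
print(subset_mean, subset_median)
```

The comparison is case-sensitive, so the string has to match the stored category exactly.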

##**10. Create a new column named total_bedroom_size. If the size is 10 or less, it should be labelled small; if it is 11 or more but less than 1000, it should be medium; otherwise it should be considered large.**

###**Answer** :
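
The rule in question 10 can also be applied to a whole column at once instead of row by row. The sketch below uses `numpy.select`, which returns the choice for the first condition that holds; this is an alternative technique on invented values, not the approach used in the notebook cells.

```python
import numpy as np
import pandas as pd

# Made-up bedroom counts chosen to sit on both sides of each threshold.
bedrooms = pd.Series([3, 10, 11, 999, 1000, 2500])

# Conditions are checked in order, so "< 1000" only applies to values above 10.
conditions = [bedrooms <= 10, bedrooms < 1000]
choices = ["small", "medium"]
size = pd.Series(
    np.select(conditions, choices, default="large"),
    index=bedrooms.index,
)
print(size.tolist())
```

Note that a missing value fails both comparisons and would therefore fall through to "large", so missing data should be handled before labelling.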

##**Adding a new column "total_bedroom_size" to the previous dataset**

In [None]:
# Import pandas for data analysis and matplotlib for visualization
import pandas as pd
from matplotlib import pyplot as plt

In [None]:
# Read the CSV file; pd.read_csv already returns a DataFrame
df = pd.read_csv("housing.csv")

In [None]:
# seaborn is used for the scatter plots and the correlation heatmap below
import seaborn as sns

In [None]:
df.info()  # column names, dtypes and non-null counts of the dataset

In [None]:
df.head()   # .head() returns the first 5 rows by default

In [None]:
df.tail()   # .tail() returns the last 5 rows by default

In [None]:
df["median_income"].mean()

In [None]:
# Histogram of median_income
plt.hist(df['median_income'], color="cyan", bins=10, edgecolor="black")  # color = bar fill, bins = number of bins, edgecolor = bar outline
plt.xlabel("median income")   # x-axis label
plt.ylabel("number of families")  # y-axis label
plt.title("Distribution for median_income")  # plot title

In [None]:
df['median_income'].skew()

In [None]:
# Histogram of housing_median_age
plt.hist(df['housing_median_age'], color="red", bins=10, edgecolor="black")  # color = bar fill, bins = number of bins, edgecolor = bar outline
plt.xlabel("housing_median_age")  # x-axis label
plt.ylabel("number of families")  # y-axis label
plt.title("Distribution for housing_median_age")  # plot title

In [None]:
df['housing_median_age'].skew()

In [None]:
# Scatter plot of median_income against median_house_value, coloured by ocean_proximity
sns.scatterplot(x="median_income", y="median_house_value", hue="ocean_proximity", data=df)
plt.title("median_income vs median_house_value")  # plot title

In [None]:
# Keep only the rows where total_bedrooms is available
new_data = df.dropna(subset=["total_bedrooms"])
new_data

In [None]:
# Fill the missing total_bedrooms values with the column mean
df["total_bedrooms"] = df["total_bedrooms"].fillna(df["total_bedrooms"].mean())
df

In [None]:
def median_of_column(column):
    """Return the median of the given pandas column (Series)."""
    return column.median()

# The median can be calculated for the 9 numeric columns of the dataset, as shown below.

In [None]:
median_of_column(df["latitude"])   # median of latitude


In [None]:
median_of_column(df["longitude"])  # median of longitude

In [None]:
median_of_column(df["housing_median_age"])   # median of housing_median_age

In [None]:
median_of_column(df["total_rooms"])    # median of total_rooms

In [None]:
median_of_column(df["total_bedrooms"])    # median of total_bedrooms

In [None]:
median_of_column(df["population"])      # median of population

In [None]:
median_of_column(df["households"])      # median of households

In [None]:
median_of_column(df["median_income"])   # median of median_income

In [None]:
median_of_column(df["median_house_value"])   # median of median_house_value

In [None]:
# Scatter plot of latitude against longitude, coloured by ocean_proximity
sns.scatterplot(x="latitude", y="longitude", hue="ocean_proximity", data=df)
plt.title("latitude vs longitude")   # plot title

In [None]:
new_data = df.loc[df["ocean_proximity"] == "NEAR OCEAN"]   # new_data holds the rows whose ocean_proximity is "NEAR OCEAN"

In [None]:
new_data

In [None]:
new_data["median_income"].mean()   # to find mean of the median_income from the new_data

In [None]:
new_data["median_income"].median()   #to find median of the median_income from the new_data

In [None]:
def total_bedroomssize(column):   # maps a total_bedrooms value to a size label
    if column <= 10:              # 10 or less is "small"
        return "small"
    elif column < 1000:           # above 10 and below 1000 is "medium" (also covers non-integer values between 10 and 11)
        return "medium"
    else:                         # 1000 or more is "large"
        return "large"

In [None]:
df["total_bedroom_size"] = df["total_bedrooms"].apply(total_bedroomssize)   # column name as requested in question 10
df

In [None]:

# Loop over the numeric columns instead of calling the function once per column
columns = ["latitude", "longitude", "housing_median_age", "total_rooms", 
           "total_bedrooms", "population", "households", 
           "median_income", "median_house_value"]

for col in columns:
    print(f"Median of {col}: {median_of_column(df[col])}")


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix
plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)  # numeric_only=True skips text columns such as ocean_proximity
plt.title("Correlation Matrix of Numeric Features")
plt.show()

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

X = df[['median_income']]
y = df['median_house_value']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(f"R² Score: {r2_score(y_test, y_pred):.2f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.2f}")

## 📌 Final Summary & Insights

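- 📈 **Income is a strong predictor** of housing value (correlation ≈ 0.69, so a single-feature linear fit explains roughly half of the variance; see the R² printed by the regression cell).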
- 🌊 **Houses near the ocean** tend to have higher values and incomes.
- 🛠️ Custom bedroom size categories revealed useful distribution patterns.

> ✅ This analysis demonstrates practical EDA skills, data cleaning, visualization, and even basic modeling — ideal for a data analyst portfolio!
