# King County Housing Data – Exploratory Data Analysis (EDA)

This notebook explores the King County Housing dataset.  
The goal is to understand the data, find patterns, validate hypotheses, and provide insights and recommendations for a client.

We will follow the EDA checklist:

1. Understanding  
2. Hypothesis  
3. Explore  
4. Clean  
5. Relationships  
6. Back to the Hypothesis  
7. Fine Tune  
8. Explain  


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Display settings: show all columns and use a light grid for plots
pd.set_option('display.max_columns', None)
sns.set_theme(style='whitegrid')


In [None]:
# Load both tables; adjust the paths if the files are stored elsewhere
sales = pd.read_csv("data/king_county_house_sales_202512091732.csv")
details = pd.read_csv("data/king_county_house_details_202512091731.csv")

# Preview the first rows of each table
sales.head(), details.head()


## 1. Understanding the Data

We inspect the structure of both tables: row counts, column names, data types, and non-null counts.


In [None]:
sales.info()
details.info()


## 2. Hypotheses

Before exploring the data, we state the assumptions we want to test:

1. Houses closer to the water are more expensive.
2. Houses with more bedrooms and bathrooms have higher prices.
3. Some ZIP codes have clearly higher prices than others (“rich neighborhoods”).
4. Newer and renovated houses have higher prices.

These will be tested during EDA.


## 3. Explore the Data

We look for:
- Missing values  
- Outliers  
- Strange values  
- Distribution of key columns  


In [None]:
sales.isna().sum(), details.isna().sum()
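
Raw missing-value counts are easier to judge as a share of all rows. The following is a minimal sketch, assuming a hypothetical helper `missing_summary` and a toy frame that is not taken from the King County files:

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only the columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame for illustration only; column names are assumptions
toy = pd.DataFrame({
    "price": [300000.0, 450000.0, np.nan, 520000.0],
    "waterfront": [np.nan, 0.0, np.nan, 1.0],
    "bedrooms": [3, 4, 2, 3],
})
toy_missing = missing_summary(toy)
print(toy_missing)
```

On the real tables this could be called as `missing_summary(sales)` and `missing_summary(details)`.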


In [None]:
sales.describe()


In [None]:
df = sales.merge(details, on="id", how="left")
df.head()
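
A left merge can silently duplicate rows or leave sales without details, so it may be worth checking the join before relying on `df`. The sketch below uses toy tables and assumes that `id` identifies the same entity in both tables, which should be verified against the real columns:

```python
import pandas as pd

# Toy tables for illustration only; the real column layout may differ
toy_sales = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "price": [300000, 450000, 470000, 520000],
})
toy_details = pd.DataFrame({"id": [1, 2, 3], "bedrooms": [3, 4, 2]})

merged = toy_sales.merge(
    toy_details,
    on="id",
    how="left",
    validate="many_to_one",  # raises MergeError if details has duplicate keys
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Sales rows that found no matching details row
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```

With `validate="many_to_one"`, the row count of the result equals the row count of the left table, so any growth after a merge points to duplicate keys on the right side.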


In [None]:
sns.histplot(df["price"], bins=50)
plt.title("Price Distribution")
plt.xlabel("Price")
plt.ylabel("Count")
plt.show()


## 4. Cleaning the Data

We check:
- Incorrect values  
- Missing values  
- Outliers  
- Whether we need to transform variables (log, categories, etc.)
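
Two of these checks can be sketched in a few lines: flagging outliers with the common 1.5 × IQR rule and applying a log transform to a right-skewed variable. This is one possible approach, not a fixed recipe; the helper `iqr_bounds` and the toy prices are assumptions for illustration:

```python
import numpy as np
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy prices for illustration only
prices = pd.Series([200000, 250000, 300000, 350000, 400000, 3000000], dtype=float)

lower, upper = iqr_bounds(prices)
is_outlier = (prices < lower) | (prices > upper)

# log1p compresses the long right tail; expm1 reverses it
log_price = np.log1p(prices)

print(lower, upper, is_outlier.sum())
```

Flagged rows are candidates for a closer look rather than automatic removal, since very expensive houses can be genuine sales.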


In [None]:
sns.boxplot(x=df["price"])
plt.title("Price Outliers")
plt.xlabel("Price")
plt.show()


## 5. Relationships

We explore correlations between variables.


In [None]:
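plt.figure(figsize=(12, 8))
# Restrict to numeric columns; corr() raises on non-numeric data in pandas 2.x
numeric_cols = df.select_dtypes(include=[np.number])
# Fix the color scale to [-1, 1] so that zero correlation sits at the neutral color
sns.heatmap(numeric_cols.corr(), annot=False, cmap="coolwarm", vmin=-1, vmax=1)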
plt.title("Correlation Heatmap (Numeric Columns)")
plt.show()
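
A heatmap shows the overall picture, but a ranked list of correlations with the target is often quicker to read. A minimal sketch, assuming a hypothetical helper `rank_correlations` and toy data (only `price` and `sqft_living` are names used in this notebook; the other columns are made up):

```python
import numpy as np
import pandas as pd


def rank_correlations(frame: pd.DataFrame, target: str = "price") -> pd.Series:
    """Pearson correlation of each numeric column with target, strongest first."""
    numeric = frame.select_dtypes(include=[np.number])
    corr = numeric.corr()[target].drop(target)
    # Sort by absolute value so strong negative correlations rank high too
    return corr.sort_values(key=lambda s: s.abs(), ascending=False)


# Toy frame for illustration only
toy = pd.DataFrame({
    "price": [100, 200, 300, 400],
    "sqft_living": [10, 20, 30, 40],
    "age": [40, 30, 25, 5],
    "noise": [1, -1, -1, 1],
    "label": ["a", "b", "c", "d"],  # non-numeric, ignored
})
ranked = rank_correlations(toy)
print(ranked)
```

Correlation only captures linear association, so a low value does not rule out a relationship.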


## 6. Back to the Hypotheses

We check whether the data supports each hypothesis and revise the ones it does not.


In [None]:
sns.scatterplot(data=df, x="sqft_living", y="price")
plt.title("Living Area vs Price")
plt.xlabel("Square Feet Living")
plt.ylabel("Price")
plt.show()
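
Hypotheses about categories, such as waterfront location or ZIP code, can be checked by comparing group medians. The sketch below assumes a hypothetical helper `median_price_by` and the column names `waterfront` and `zipcode`, which may differ in the real tables:

```python
import pandas as pd


def median_price_by(frame: pd.DataFrame, column: str, target: str = "price") -> pd.DataFrame:
    """Median of target per group, with group size, highest median first."""
    grouped = frame.groupby(column)[target].agg(median="median", count="count")
    return grouped.sort_values("median", ascending=False)


# Toy frame for illustration only; column names are assumptions
toy = pd.DataFrame({
    "price": [300000, 350000, 400000, 900000, 1100000],
    "waterfront": [0, 0, 0, 1, 1],
    "zipcode": ["98001", "98001", "98004", "98004", "98004"],
})

by_water = median_price_by(toy, "waterfront")
by_zip = median_price_by(toy, "zipcode")
print(by_water)
print(by_zip)
```

The median is used here because a few very expensive houses pull the mean upward; the `count` column helps spot groups that are too small to support a conclusion.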


## 7. Fine Tune

Remove unnecessary plots, make visuals clear, add labels, and prepare clean results for the client.
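
One small step toward client-ready visuals is showing prices in a compact form instead of raw or scientific notation. A minimal sketch, assuming a hypothetical helper `format_price`:

```python
def format_price(value, pos=None):
    """Format a dollar amount compactly; pos is unused but lets this act as a tick formatter."""
    if abs(value) >= 1_000_000:
        return f"${value / 1_000_000:.1f}M"
    if abs(value) >= 1_000:
        return f"${value / 1_000:.0f}k"
    return f"${value:.0f}"


labels = [format_price(v) for v in (999, 450_000, 1_500_000)]
print(labels)
```

In a plot, the function could be attached to an axis with `ax.xaxis.set_major_formatter(FuncFormatter(format_price))`, where `FuncFormatter` comes from `matplotlib.ticker`.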


## 8. Explain – Insights & Recommendations

### Insights (at least 3)
- Insight 1  
- Insight 2  
- Insight 3  

### Geographic Insight
- ZIP code or location-based finding  

### Recommendations (at least 3)
- Rec 1  
- Rec 2  
- Rec 3  

Client chosen: **(your choice)**
