We start by importing the libraries needed for data analysis and visualization. `pandas` handles data manipulation and analysis, while `matplotlib.pyplot` and `seaborn` handle plotting. With these imported, the environment is ready to load and visualize the dataset.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Next, we load the dataset from the CSV file `8_counties_housing.csv` into a DataFrame called `df`. This puts the data in a structured, tabular form for the analysis and visualization of housing data across eight counties that follows.

In [None]:
df = pd.read_csv("8_counties_housing.csv")

We then display the first few rows with `df.head()`, which returns the first five rows by default. This is a quick way to check that the file loaded correctly and to get a first look at the columns and values we will be working with.

In [None]:
df.head()
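
Beyond `df.head()`, the shape and column types are worth a look before going further. The sketch below uses a small made-up frame (`toy`, with hypothetical values) in place of the real file so that it runs on its own; the column names `county`, `price`, and `sqft` mirror the ones used later in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the housing data; values are made up for illustration.
toy = pd.DataFrame(
    {
        "county": ["A", "B", "C"],
        "price": [250000, 410000, 380000],
        "sqft": [1200, 2100, 1750],
    }
)

n_rows, n_cols = toy.shape  # (number of rows, number of columns)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

print(f"{n_rows} rows, {n_cols} columns")
print(toy.dtypes)  # one dtype per column
print("numeric columns:", numeric_cols)
```

On the real data, the same calls on `df` show whether `price` and `sqft` were parsed as numbers rather than text.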

In the following code cell, we check for missing data. The expression `df.isnull().sum()` counts the missing values in each column of `df`, which tells us which columns may need cleaning or further investigation before we proceed with the analysis.

In [None]:
df.isnull().sum()
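
Raw counts are easier to judge next to the share of rows they represent. A minimal sketch on a made-up frame (`toy` and its values are hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the housing data, with a few gaps inserted on purpose.
toy = pd.DataFrame(
    {
        "county": ["A", "A", "B", "B", "C"],
        "price": [250000.0, np.nan, 410000.0, 380000.0, np.nan],
        "sqft": [1200.0, 1500.0, np.nan, 2100.0, 1750.0],
    }
)

missing_counts = toy.isnull().sum()        # number of NaNs per column
missing_share = toy.isnull().mean() * 100  # percentage of NaNs per column

summary = pd.DataFrame({"missing": missing_counts, "percent": missing_share})
print(summary)
```

Taking the mean of the boolean mask works because `True` counts as 1 and `False` as 0.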

Now we clean the dataset by removing every row that contains at least one missing value. This ensures that the subsequent steps, such as plotting distributions and correlations, work on complete rows. Note that `dropna()` returns a new DataFrame and leaves `df` unchanged, and that it can discard a large share of the data if missing values are common, so it is worth comparing the row counts before and after.

In [None]:
df_cleaned = df.dropna()
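
To see how much data `dropna()` removes, and how a narrower rule compares, the following sketch counts rows before and after on a made-up frame. The frame `toy` and the choice of `price` as the only required column are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the housing data, with a few gaps inserted on purpose.
toy = pd.DataFrame(
    {
        "county": ["A", "A", "B", "B", "C"],
        "price": [250000.0, np.nan, 410000.0, 380000.0, np.nan],
        "sqft": [1200.0, 1500.0, np.nan, 2100.0, 1750.0],
    }
)

cleaned_all = toy.dropna()                    # drop rows with a NaN in any column
cleaned_subset = toy.dropna(subset=["price"])  # drop rows only when price is missing

rows_dropped = len(toy) - len(cleaned_all)
print(f"dropna() removed {rows_dropped} of {len(toy)} rows")
print(f"dropna(subset=['price']) kept {len(cleaned_subset)} rows")
```

Whether the stricter or the narrower rule is appropriate depends on which columns the later analysis actually uses.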

With missing values removed, we now look at the statistical summary of the cleaned dataset using `df_cleaned.describe()`. For each numeric column it reports the count, mean, standard deviation, minimum, quartiles (the 50% row is the median), and maximum. This gives a quick sense of central tendency and spread before we move on to further analysis and visualization.

In [None]:
df_cleaned.describe()
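
Comparing the mean with the median (the `50%` row) is a quick check for skew, which is common in price data. A small sketch on made-up prices (the series below is hypothetical):

```python
import pandas as pd

# Hypothetical prices with one very expensive property pulling the mean upward.
prices = pd.Series([100000, 200000, 300000, 400000, 2000000], name="price")

summary = prices.describe()  # count, mean, std, min, 25%, 50%, 75%, max
skewness = prices.skew()     # sample skewness; positive means a long right tail

print(summary)
print(f"mean - median = {summary['mean'] - summary['50%']:.0f}")
print(f"skewness = {skewness:.2f}")
```

A mean well above the median, as here, suggests that a few high values dominate the average.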

In the following code cell, we explore the distribution of housing prices in the cleaned dataset. A histogram with a kernel density estimate (KDE) overlay shows both how often different price ranges occur and the overall shape of the distribution, including any skew or long tail toward expensive properties.

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(df_cleaned["price"], kde=True)
plt.title("Distribution of Housing Prices")
plt.show()

Now we visualize the pairwise relationships between the housing features with a correlation matrix. Each cell of the heatmap shows the Pearson correlation between two columns; with the `coolwarm` colormap, warmer (red) cells correspond to higher coefficients and cooler (blue) cells to lower ones. Correlation is only defined for numeric columns. The heatmap makes it easy to spot strongly correlated features, which helps with further analysis or feature selection.

In [None]:
plt.figure(figsize=(10, 8))
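# Restrict to numeric columns; text columns such as "county" cannot be correlated.
sns.heatmap(df_cleaned.corr(numeric_only=True), annot=True, cmap="coolwarm")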
plt.title("Correlation Matrix")
plt.show()
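
Correlation coefficients are only defined for numeric columns, so a text column such as `county` has to be excluded before the matrix is computed. A minimal sketch of one way to do that, using `select_dtypes` on a made-up frame (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the housing data; values are made up for illustration.
toy = pd.DataFrame(
    {
        "county": ["A", "A", "B", "B"],
        "price": [100000, 200000, 300000, 400000],
        "sqft": [1000, 1500, 2500, 3000],
    }
)

numeric = toy.select_dtypes(include="number")  # keeps price and sqft, drops county
corr = numeric.corr()                           # Pearson correlation by default

print(corr.round(3))
```

The resulting `corr` frame is what a call like `sns.heatmap(corr, annot=True)` would draw.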

Now, we will visually compare housing prices across different counties using a boxplot. This plot, created with Seaborn and Matplotlib, displays the distribution of housing prices within each county, highlighting the median, quartiles, and potential outliers. By examining this visualization, we can easily identify which counties have higher or lower median housing prices and observe the variability of prices within each county.

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(x="county", y="price", data=df_cleaned)
plt.title("Housing Prices by County")
plt.show()
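
By default, the points a boxplot draws beyond the whiskers are those lying more than 1.5 times the interquartile range (IQR) outside the quartiles. The sketch below applies the same rule per county to list those rows explicitly. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the housing data; county A has one extreme price.
toy = pd.DataFrame(
    {
        "county": ["A"] * 5 + ["B"] * 5,
        "price": [100000, 110000, 120000, 130000, 500000,
                  200000, 210000, 220000, 230000, 240000],
    }
)

grouped = toy.groupby("county")["price"]
q1 = grouped.transform(lambda s: s.quantile(0.25))  # per-county first quartile
q3 = grouped.transform(lambda s: s.quantile(0.75))  # per-county third quartile
iqr = q3 - q1

# Flag rows outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] within their own county.
is_outlier = (toy["price"] < q1 - 1.5 * iqr) | (toy["price"] > q3 + 1.5 * iqr)
outliers = toy[is_outlier]

print(outliers)
```

Computing the bounds per county matters: a price that is ordinary in an expensive county can be an outlier in a cheaper one.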

Now, we will visually explore the relationship between the square footage of properties and their prices using a scatter plot. This plot helps us see if larger properties tend to be more expensive, which is a common hypothesis in real estate analysis. By plotting each property as a point with its square footage on the x-axis and its price on the y-axis, we can observe trends and outliers in the data.

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x="sqft", y="price", data=df_cleaned)
plt.title("Price vs. Square Footage")
plt.show()
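
The visual impression from a scatter plot can be backed by two numbers: the Pearson correlation and the slope of a least-squares line. A minimal sketch on made-up data (`toy` and its values are hypothetical; a straight-line fit is an assumption, not a claim about the real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the housing data; values are made up for illustration.
toy = pd.DataFrame(
    {
        "sqft": [1000.0, 1500.0, 2500.0, 3000.0],
        "price": [100000.0, 200000.0, 300000.0, 400000.0],
    }
)

r = toy["sqft"].corr(toy["price"])  # Pearson correlation coefficient

# Degree-1 polynomial fit: returns the slope first, then the intercept.
slope, intercept = np.polyfit(toy["sqft"], toy["price"], deg=1)

print(f"Pearson r = {r:.3f}")
print(f"price ~ {slope:.1f} * sqft + {intercept:.0f}")
```

The slope reads as the estimated price change per additional square foot under the linear assumption.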

After exploring distributions and relationships in the housing data, we now turn to the average housing price per county. In this code cell, we compute the mean price in each county and sort the results in descending order to see which counties are the most expensive. We then plot these averages as a bar chart so the counties can be compared at a glance.

In [None]:
mean_price_per_county = (
    df_cleaned.groupby("county")["price"].mean().sort_values(ascending=False)
)
mean_price_per_county.plot(
    kind="bar", figsize=(10, 6), title="Mean Housing Prices by County"
)
plt.show()
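
A mean alone can be misleading when a county has only a few sales or one very expensive property. The sketch below computes the mean, median, and number of rows per county in a single `agg` call, on a made-up frame (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the housing data; county A has one very high price.
toy = pd.DataFrame(
    {
        "county": ["A", "A", "A", "B", "B"],
        "price": [100000, 200000, 900000, 300000, 310000],
    }
)

stats = (
    toy.groupby("county")["price"]
    .agg(["mean", "median", "count"])   # one column per statistic
    .sort_values("mean", ascending=False)
)

print(stats)
```

Here county A ranks first by mean but has the lower median, which is the kind of discrepancy the bar chart of means cannot show.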

Next, we identify the five most expensive houses. `nlargest(5, "price")` returns the five rows of the cleaned dataset with the highest prices, sorted from most to least expensive, which shows which properties sit at the top of the market.

In [None]:
top_5_expensive = df_cleaned.nlargest(5, "price")
top_5_expensive
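
When several houses share the same price at the cutoff, `nlargest` keeps only the first ones it encounters by default. Passing `keep="all"` retains every tied row instead. A small sketch on made-up data (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the housing data; two houses tie for second place.
toy = pd.DataFrame(
    {
        "county": ["A", "B", "C", "D"],
        "price": [500000, 400000, 400000, 300000],
    }
)

top_2 = toy.nlargest(2, "price")                         # exactly two rows
top_2_with_ties = toy.nlargest(2, "price", keep="all")  # includes both tied rows

print(top_2)
print(top_2_with_ties)
```

With `keep="all"` the result can contain more rows than requested, so the row count is worth checking when ties are possible.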

Finally, we look at the average square footage of houses in each county. The code computes the mean square footage per county from the cleaned dataset, sorts the results in descending order, and plots them as a bar chart, making it easy to compare typical home sizes across counties.

In [None]:
avg_sqft_per_county = (
    df_cleaned.groupby("county")["sqft"].mean().sort_values(ascending=False)
)
avg_sqft_per_county.plot(
    kind="bar", figsize=(10, 6), title="Average Square Footage by County"
)
plt.show()