
Pandas Data Exploration: California Housing Dataset

This notebook provides a hands-on walkthrough of essential pandas operations using the California Housing Test dataset. It focuses on practical data manipulation techniques including filtering, descriptive statistics, grouping, and aggregation. The goal is to build a solid foundation before moving into larger data analysis and machine learning projects.

What's Covered

Importing and inspecting data

Using .loc and boolean indexing for filtering

Generating descriptive statistics with .describe()

Comparing housing prices below vs. above $500,000

Grouping values by housing_median_age

Aggregating with sum, count, mean, min, and max

Sorting and reshaping results for interpretation

In [None]:
# Import pandas
import pandas as pd

# Read the CSV file from Colab's sample_data folder into a DataFrame
cali_housing_data = pd.read_csv("/content/sample_data/california_housing_test.csv")
cali_housing_data
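
Displaying the whole DataFrame is one way to inspect it. A few other checks are often useful right after loading. The sketch below runs them on a small made-up frame: `sample` and its values are hypothetical stand-ins, and only the column names are borrowed from the housing table.

```python
import pandas as pd

# Hypothetical three-row stand-in for the housing table; the values are made up.
sample = pd.DataFrame({
    "housing_median_age": [27.0, 43.0, 27.0],
    "population": [1537.0, 809.0, 1484.0],
    "median_house_value": [344700.0, 176500.0, 270500.0],
})

n_rows, n_cols = sample.shape              # (number of rows, number of columns)
column_types = sample.dtypes               # dtype of each column
missing_per_column = sample.isna().sum()   # number of missing values per column

print(n_rows, n_cols)
print(column_types)
print(missing_per_column)
print(sample.head(2))                      # first two rows only
```

The same calls should work on `cali_housing_data` once the CSV has been loaded.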


In [None]:
# Total population across all rows of the table.
# Note: 'population' counts residents, not home owners, and each row is a census block group rather than a county.
total_population = cali_housing_data['population'].sum()
print("The total population across all block groups in the dataset is", total_population)

print("\n\n")  # Blank lines to make the results easier to read

# describe() returns the count, mean, standard deviation, min, quartiles, and max of the population column
population_stats = cali_housing_data['population'].describe()
print("Summary statistics for population per block group:\n", population_stats)
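
The result of `.describe()` on a single column is itself a Series indexed by statistic name, so individual measures can be pulled out by label. A minimal sketch on a made-up `population` Series (the values are hypothetical, chosen so the arithmetic is easy to check by hand):

```python
import pandas as pd

# Hypothetical population values, not taken from the dataset.
population = pd.Series([100.0, 200.0, 300.0, 400.0], name="population")

total = population.sum()          # sum of all values
summary = population.describe()   # Series indexed by 'count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max'

median = summary["50%"]           # the 50th percentile is the median
lower_quartile = summary["25%"]

print(total, median, lower_quartile)
```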

In [None]:
# Compare rows whose median house value is below $500,000 with rows above it

# Select median_house_value for rows below $500,000
standard_house_price = cali_housing_data.loc[cali_housing_data['median_house_value'] < 500000, 'median_house_value']
# Summary statistics for median house values below $500,000
stats_standard_housing = standard_house_price.describe()
print("Summary statistics for median house values below $500,000:\n", stats_standard_housing)

print('\n\n')  # Blank lines to make the results easier to read

# Next, rows above $500,000. A value of exactly 500,000 falls into neither group.
high_house_price = cali_housing_data.loc[cali_housing_data['median_house_value'] > 500000, 'median_house_value']
stats_high_house_price = high_house_price.describe()
print("Summary statistics for median house values above $500,000:\n", stats_high_house_price)
high_house_price  # Display the selected values as the cell output
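
Because the two filters use strict `<` and `>`, a row whose value is exactly 500,000 is dropped from both groups. One way to make the split exhaustive is to build a boolean mask once and negate it with `~`. This is a sketch on hypothetical prices, not the notebook's data:

```python
import pandas as pd

# Hypothetical prices chosen to sit on and around the threshold.
prices = pd.Series([120000.0, 499999.0, 500000.0, 500001.0], name="median_house_value")

# Strict comparisons on both sides leave the boundary value out.
below = prices[prices < 500000]
above = prices[prices > 500000]
missed = len(prices) - len(below) - len(above)   # rows in neither group

# A mask and its complement always cover every row exactly once.
is_below = prices < 500000
at_or_above = prices[~is_below]

print(missed)
print(at_or_above.tolist())
```

Whether the boundary belongs with the lower or the upper group is a judgment call; the point is only that `mask` and `~mask` never lose a row.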

In [40]:
# Group median house values by housing_median_age to see how the age of the housing relates to price
age_stats = cali_housing_data['housing_median_age'].describe()
print("Age stats are as follows", age_stats)

print("\n\n")

# For each housing_median_age value, compute the sum, count, mean, minimum, and maximum of median_house_value.
# housing_median_age is the median age of the houses in a row's area, not the age of the home owners.
house_price_by_age = cali_housing_data.groupby("housing_median_age", as_index=False)["median_house_value"].agg(['sum', 'count', 'mean', 'min', 'max'])

# as_index=False already keeps housing_median_age as a regular column, so reset_index() only adds
# the extra 'index' column that appears in the output below
table_house_price_by_age = house_price_by_age.reset_index()

# Sort by count, descending, to see which ages account for the most rows
house_price_by_age_table_oldest = table_house_price_by_age.sort_values(by='count', ascending=False)
house_price_by_age_table_oldest  # housing_median_age 52 has the highest count (173 rows)




Age stats are as follows count   3,000
mean       29
std        13
min         1
25%        18
50%        29
75%        37
max        52
Name: housing_median_age, dtype: float64





Unnamed: 0,index,housing_median_age,sum,count,mean,min,max
51,51,52,45323513,173,261986,22500,500001
34,34,35,24750906,118,209753,47500,500001
35,35,36,23573606,115,204988,39800,500001
15,15,16,21482102,107,200767,55000,500001
33,33,34,22582504,102,221397,55000,500001
16,16,17,18734401,100,187344,55000,500001
31,31,32,16961802,91,186393,41500,500001
25,25,26,20789803,88,236248,59000,500001
36,36,37,17699807,88,201134,46300,500001
24,24,25,18945406,86,220295,52600,500001
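
The table above carries an `index` column that duplicates the row labels. A possible alternative, sketched here on a made-up frame, is pandas named aggregation: it gives each statistic an explicit column name and, combined with `as_index=False`, needs no `reset_index()` call. The frame `toy` and the output column names (`total_value`, `n_rows`, and so on) are illustrative assumptions, not names used in the notebook.

```python
import pandas as pd

# Hypothetical data; only the two column names come from the housing table.
toy = pd.DataFrame({
    "housing_median_age": [52.0, 52.0, 35.0, 16.0, 35.0, 52.0],
    "median_house_value": [300000.0, 500001.0, 210000.0, 200000.0, 190000.0, 150000.0],
})

summary = (
    toy.groupby("housing_median_age", as_index=False)
       .agg(
           total_value=("median_house_value", "sum"),
           n_rows=("median_house_value", "count"),
           mean_value=("median_house_value", "mean"),
           min_value=("median_house_value", "min"),
           max_value=("median_house_value", "max"),
       )
       # Largest groups first; ignore_index renumbers the rows 0..n-1 after sorting.
       .sort_values("n_rows", ascending=False, ignore_index=True)
)

print(summary)
```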


In [39]:
import pandas as pd

# Display floats with no decimals and a thousands separator
pd.set_option('display.float_format', '{:,.0f}'.format)

# Scale median_income and divide it by the number of households in each row.
# Note: median_income is commonly documented as being in tens of thousands of dollars, in which case
# the factor below overstates incomes 100-fold; it is kept so the results match the output shown.
cali_housing_data['income'] = cali_housing_data['median_income'] * 1000000

# Ratio of the scaled median income to the household count for each row
cali_housing_data["income_per_household"] = cali_housing_data['income'] / cali_housing_data['households']
income_household_df = cali_housing_data[['income_per_household', 'income', 'households']]  # selecting columns already returns a DataFrame
sorted_household_income = income_household_df.sort_values(by='income_per_household', ascending=False)  # highest ratio first
stats_sorted_household_income = sorted_household_income.describe()
stats_sorted_household_income

Unnamed: 0,income_per_household,income,households
count,3000,3000,3000
mean,15651,3807272,490
std,53229,1854512,365
min,869,499900,2
25%,5019,2544000,273
50%,8440,3487150,410
75%,13720,4656475,597
max,1666678,15000100,4930
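
`pd.set_option` changes the float display for every later cell in the session, which is probably why the age statistics shown earlier print `3,000` rather than `3000.0`. If the formatting is wanted for one table only, `pd.option_context` can scope it to a `with` block. A minimal sketch, using a hypothetical `stats` Series with made-up values:

```python
import pandas as pd

# Hypothetical values; the stored numbers are never modified by display options.
stats = pd.Series([15651.4, 53229.2, 869.0], name="income_per_household")

# Inside the block, floats render with a thousands separator and no decimals.
with pd.option_context("display.float_format", "{:,.0f}".format):
    formatted = stats.to_string()

# Outside the block, the previous display setting applies again.
default = stats.to_string()

print(formatted)
print(default)
```

Only the rendering changes: `stats.iloc[0]` still holds the original float either way.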
