# Data manipulation

The **California Housing dataset** is a popular dataset for regression tasks, in particular for predicting housing prices from demographic and geographic features. It was derived from the 1990 U.S. Census, with one row per census block group (called a district below), and the task is to predict the median house value of each district. The dataset is available in `scikit-learn` through `fetch_california_housing` and contains several factors that can influence housing prices.

### Key Features of the Dataset:
- **MedInc**: Median income in the district (in tens of thousands of dollars).
- **HouseAge**: Median age of houses in the district (in years).
- **AveRooms**: Average number of rooms per household.
- **AveBedrms**: Average number of bedrooms per household.
- **Population**: Total population of the district.
- **AveOccup**: Average number of occupants per household.
- **Latitude**: Geographical latitude of the district.
- **Longitude**: Geographical longitude of the district.

### Target Variable:
- **MedHouseVal**: Median house value for households in the district (in hundreds of thousands of dollars).
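
Because `MedInc` and `MedHouseVal` are stored in different units, it can help to convert both to plain dollars before comparing them. The following is a minimal sketch on a hand-made frame; the name `toy`, the new column names, and the values are assumptions for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy rows in the dataset's units (made-up values, for illustration only)
toy = pd.DataFrame({
    "MedInc": [8.3, 3.5],        # tens of thousands of dollars
    "MedHouseVal": [4.5, 1.2],   # hundreds of thousands of dollars
})

# Convert both columns to plain dollars (hypothetical column names)
toy_dollars = toy.assign(
    MedIncUSD=toy["MedInc"] * 10_000,
    MedHouseValUSD=toy["MedHouseVal"] * 100_000,
)
print(toy_dollars)
```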

### Use Cases:
The dataset is primarily used for:
1. **Regression analysis**: Predicting median house prices based on features like income, house age, and population.
2. **Data exploration and visualization**: Understanding relationships between housing characteristics and geographical regions.
3. **Machine learning models**: Developing models to predict house prices, such as linear regression, decision trees, or neural networks.

The dataset contains 20,640 instances, making it a moderately sized dataset suitable for various data science projects.

It is widely used for teaching purposes in courses and tutorials on data analysis, data visualization, and machine learning. The dataset provides rich opportunities for data exploration, feature engineering, and predictive modeling.

In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing

# Loading the dataset
california_housing = fetch_california_housing(as_frame=True)
df = california_housing.frame

# Viewing the first few rows of the dataset
# df.head() # UNCOMMENT

## 1. Querying the dataset: selecting rows where 'MedInc' (Median Income) is greater than 5

In [2]:
df_query = df.query('MedInc > 5')
# print("Query Example: \n"); df_query.head(10)  # UNCOMMENT
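
The same selection can also be written with a boolean mask, and `query` can read Python variables through the `@` prefix. This is a small sketch on a made-up frame; `toy` and `threshold` are hypothetical names, not part of the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "MedInc": [2.0, 5.5, 7.1, 4.9],
    "HouseAge": [30.0, 12.0, 45.0, 8.0],
})

threshold = 5  # hypothetical cutoff, referenced inside query() with @

by_query = toy.query("MedInc > @threshold")
by_mask = toy[toy["MedInc"] > threshold]

# Both approaches select the same rows
print(by_query.equals(by_mask))

# Conditions can be combined with `and` / `or` inside the query string
combined = toy.query("MedInc > @threshold and HouseAge < 20")
print(combined)
```

`query` tends to read more cleanly when several conditions are combined, while a boolean mask needs no string parsing and works with column names that are not valid Python identifiers.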

## 2. Using loc: selecting rows with index labels 0 through 10 (inclusive) and specific columns


In [3]:
df_loc = df.loc[0:10, ['MedInc', 'HouseAge', 'AveRooms']]
#print("loc Example: \n"); df_loc # UNCOMMENT

## 3. Using iloc: selecting the first 10 rows and first 3 columns by integer position


In [4]:
df_iloc = df.iloc[0:10, [0, 1, 2]]
#print("iloc Example: \n"); df_iloc # UNCOMMENT
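
The two indexers differ in how they treat the end of a slice: `loc` slices by label and includes the end label, while `iloc` slices by position and excludes the end position. The sketch below, on a hypothetical frame `toy`, shows the difference and why it matters once rows have been reordered.

```python
import pandas as pd

toy = pd.DataFrame({"MedInc": range(20), "HouseAge": range(20, 40)})

by_label = toy.loc[0:10, ["MedInc"]]   # label slice: end label 10 is included
by_position = toy.iloc[0:10, [0]]      # positional slice: position 10 is excluded

print(len(by_label), len(by_position))  # 11 10

# After reordering, labels and positions no longer line up
shuffled = toy.sort_values("MedInc", ascending=False)
first_by_position = shuffled.iloc[0]["MedInc"]  # first row as displayed
row_with_label_0 = shuffled.loc[0, "MedInc"]    # row whose index label is 0
print(first_by_position, row_with_label_0)
```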

## 4. Sorting the dataframe by 'MedInc' (Median Income)


In [5]:
df_sorted = df.sort_values(by='MedInc', ascending=False)
# print("Sorted Data Example: \n"); df_sorted.head() # UNCOMMENT
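
Sorting can use several keys at once, and when only the top few rows are needed `nlargest` avoids sorting explicitly. A minimal sketch, assuming a made-up frame `toy`:

```python
import pandas as pd

toy = pd.DataFrame({
    "MedInc": [3.0, 8.0, 8.0, 1.5],
    "HouseAge": [10.0, 40.0, 5.0, 25.0],
})

# Sort by income (descending), breaking ties by house age (ascending)
sorted_multi = (
    toy.sort_values(by=["MedInc", "HouseAge"], ascending=[False, True])
       .reset_index(drop=True)   # renumber rows 0..n-1 after sorting
)
print(sorted_multi)

# The two rows with the highest income
top2 = toy.nlargest(2, "MedInc")
print(top2)
```

Note that `sort_values` keeps the original index labels unless `reset_index(drop=True)` is applied, which affects later label-based selection with `loc`.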

## 5. Renaming columns: renaming 'MedInc' to 'MedianIncome'


In [6]:
df_renamed = df.rename(columns={'MedInc': 'MedianIncome'})
#print("Renamed Columns Example: \n"); df_renamed.head() # UNCOMMENT

## 6. Finding unique values in a column ('HouseAge')


In [7]:
unique_ages = df['HouseAge'].unique()
#print("Unique HouseAge Values: \n"); unique_ages # UNCOMMENT
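
`unique` returns the distinct values themselves. When only the number of distinct values, or how often each one occurs, is of interest, `nunique` and `value_counts` are closer fits. A short sketch on a made-up Series named `ages`:

```python
import pandas as pd

ages = pd.Series([52.0, 41.0, 52.0, 21.0, 52.0], name="HouseAge")

distinct = ages.unique()          # distinct values, in order of first appearance
n_distinct = ages.nunique()       # number of distinct values
frequencies = ages.value_counts() # occurrences per value, most frequent first

print(distinct)
print(n_distinct)
print(frequencies)
```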

## 7. Dropping duplicate rows (if any)


In [8]:
df_deduplicated = df.drop_duplicates()
#print("Duplicates Dropped: \n"); df_deduplicated # UNCOMMENT
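
Before dropping anything, it is often useful to check how many duplicates exist. The sketch below uses a hypothetical frame `toy` with one fully duplicated row, and also shows the `subset` and `keep` parameters of `drop_duplicates`.

```python
import pandas as pd

toy = pd.DataFrame({
    "MedInc": [3.0, 3.0, 5.0, 3.0],
    "HouseAge": [10.0, 10.0, 20.0, 15.0],
})

# Number of rows that repeat an earlier row in every column
n_dupes = toy.duplicated().sum()
print(n_dupes)

# Drop fully duplicated rows (the first occurrence is kept by default)
deduped = toy.drop_duplicates()

# Treat rows as duplicates when 'MedInc' alone matches, keeping the last one
deduped_subset = toy.drop_duplicates(subset=["MedInc"], keep="last")
print(deduped_subset)
```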

## 8. Assigning a new column: creating a new column 'PricePerRoom' by dividing 'MedHouseVal' by 'AveRooms'

In [9]:
df_assigned = df.assign(PricePerRoom=df['MedHouseVal'] / df['AveRooms'])
#print("Assigned New Column Example: \n");  df_assigned.head() # UNCOMMENT
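
`assign` returns a new DataFrame and leaves the original untouched. It also accepts callables, so a later column can build on one created earlier in the same call. The following is a sketch on a made-up frame; `toy` and `PricePerRoomUSD` are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({"MedHouseVal": [4.5, 2.0], "AveRooms": [6.0, 4.0]})

result = toy.assign(
    # Same ratio as above, still in hundreds of thousands of dollars per room
    PricePerRoom=lambda d: d["MedHouseVal"] / d["AveRooms"],
    # Built from the column defined just above, converted to dollars
    PricePerRoomUSD=lambda d: d["PricePerRoom"] * 100_000,
)
print(result)

# The original frame has not gained any columns
print(list(toy.columns))
```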

## 9. Describing the dataset: summary statistics of numerical columns

In [10]:
df_description = df.describe()
#print("Description Example: \n"); df_description # UNCOMMENT

## 10. Calculating the mean of the 'MedHouseVal' column

In [11]:
mean_house_value = df['MedHouseVal'].mean()
#print("Mean House Value: \n");  mean_house_value # UNCOMMENT

## 11. Finding the maximum value in the 'AveOccup' column

In [12]:

max_occupants = df['AveOccup'].max()
# print("Max Occupants: \n") ; max_occupants # UNCOMMENT
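
The summary methods from steps 9 to 11 can be combined with a few related calls: `describe` accepts custom percentiles, and `idxmax` reports where a maximum occurs rather than its value. A minimal sketch, assuming a made-up frame `toy`:

```python
import pandas as pd

toy = pd.DataFrame({
    "MedHouseVal": [1.0, 2.0, 3.0, 4.0, 5.0],
    "AveOccup": [2.5, 3.0, 12.0, 2.0, 3.5],
})

# Summary statistics with the 10th and 90th percentiles (the median is always included)
summary = toy.describe(percentiles=[0.1, 0.9])
print(summary)

mean_value = toy["MedHouseVal"].mean()
max_occup = toy["AveOccup"].max()
max_row_label = toy["AveOccup"].idxmax()  # index label of the row holding the maximum

# Look up the full row behind the maximum
print(toy.loc[max_row_label])
```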

## 12. Sampling 5 random rows from the dataset

In [13]:
df_sample = df.sample(5)
#print("Sample Example: \n");  df_sample # UNCOMMENT
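
`sample` draws different rows on every call unless a seed is fixed. The sketch below, on a hypothetical frame `toy`, uses `random_state` to make the draw repeatable and `frac` to sample a proportion instead of a fixed number of rows.

```python
import pandas as pd

toy = pd.DataFrame({"MedInc": range(100)})

# The same seed gives the same rows on every run
first = toy.sample(5, random_state=42)
second = toy.sample(5, random_state=42)
print(first.equals(second))

# Sample 10% of the rows instead of a fixed count
tenth = toy.sample(frac=0.1, random_state=0)
print(len(tenth))
```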

## 13. Grouping by 'HouseAge' and counting the occurrences


In [14]:
df_grouped_count = df.groupby('HouseAge').count()
#print("Group by Count Example: \n"); df_grouped_count[['MedInc']].head() # UNCOMMENT
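
`count` reports the number of non-null values per column, so its result can differ from the number of rows in a group when data is missing. `size` counts rows regardless of nulls. A small sketch on a made-up frame `toy` containing one missing value:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "HouseAge": [10.0, 10.0, 20.0],
    "MedInc": [3.0, np.nan, 5.0],
})

counts = toy.groupby("HouseAge").count()  # non-null values per column
sizes = toy.groupby("HouseAge").size()    # rows per group

print(counts)
print(sizes)
```

In this toy frame the group `10.0` has two rows but only one non-null `MedInc`, so the two results disagree.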

## 14. Grouping by 'HouseAge' and summing the 'MedHouseVal'

In [15]:
df_grouped_sum = df.groupby('HouseAge')['MedHouseVal'].sum()
#print("Group by Sum Example: \n"); df_grouped_sum.head() #UNCOMMENT
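
Several aggregations can be computed in one pass with `agg`. The sketch below uses named aggregation on a made-up frame; `toy` and the output column names `total`, `average`, and `n_districts` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "HouseAge": [10.0, 10.0, 20.0],
    "MedHouseVal": [1.0, 3.0, 5.0],
})

# One output column per keyword: name="aggregation function"
stats = toy.groupby("HouseAge")["MedHouseVal"].agg(
    total="sum",
    average="mean",
    n_districts="count",
)
print(stats)
```

A sum of medians is rarely meaningful on its own; reporting the mean and the group size alongside it usually makes the result easier to interpret.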

## 15. Dropping a column: removing 'AveRooms' column

In [16]:
df_dropped = df.drop(columns=['AveRooms'])
#print("Dropped Column Example: \n"); df_dropped.head() #UNCOMMENT

## 16. Masking: replacing every value in rows where 'MedInc' is less than 2 with NaN

In [17]:
df_masked = df.mask(df['MedInc'] < 2)
#print("Masked Data Example: \n"); df_masked.head() # UNCOMMENT
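
When a boolean Series is passed to `DataFrame.mask`, the condition is applied across all columns, so entire rows become NaN. To blank out a single column only, the mask can be applied to that column instead. A minimal sketch, assuming a made-up frame `toy`:

```python
import pandas as pd

toy = pd.DataFrame({
    "MedInc": [1.5, 3.0, 0.8],
    "HouseAge": [10.0, 20.0, 30.0],
})
low_income = toy["MedInc"] < 2

# Every column becomes NaN in the rows where the condition is True
rows_masked = toy.mask(low_income)

# Only 'MedInc' becomes NaN; 'HouseAge' is left as it was
column_masked = toy.assign(MedInc=toy["MedInc"].mask(low_income))

# where() is the complement of mask(): it keeps values where the condition is True
rows_kept = toy.where(~low_income)

print(rows_masked)
print(column_masked)
```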
