# Tokyo Housing Database & Price Forecasting

> *This project analyzes Tokyo’s rental housing landscape to support strategic decisions at Student Mobilization, Inc. The process involves scraping over 1,000 listings from SUUMO.jp, storing and organizing the data in an SQLite database, and applying regression models to forecast rental prices based on features such as floor plan, area, and building age. The goal is to streamline housing logistics for new field staff by simplifying the search for affordable and well-located housing options.*
>
> **Key Questions**
> - Can we predict rental prices based on key features?
> - How does proximity to train stations influence prices?
> - What numerical features are most correlated with rent?
> - What is the 95% confidence interval for the mean rental price?
> - Are new buildings consistently priced higher than old ones?

In [None]:
#Install the 'ipython-sql' and 'prettytable' libraries using pip
!pip install ipython-sql prettytable

# Import necessary Python modules for API calls, JSON handling, SQLite, datetime, and data analysis
import requests, json, sqlite3, sys, re
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import prettytable 
from bs4 import BeautifulSoup
from sklearn.linear_model import LinearRegression 
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from scipy import stats
from matplotlib import font_manager, rc
from sklearn.preprocessing import OneHotEncoder
import warnings
from scipy.stats import norm
prettytable.DEFAULT = 'DEFAULT'

# Load SQL magic extension to run SQL queries directly in notebook cells
%load_ext sql

## Housing Data Collection & Preparation

>In this section, we build an **object-oriented pipeline** to:  
>1. **Collect raw HTML** from *SUUMO.jp*.  
>2. **Transform the HTML** into structured, readable data and housing metrics.  
>3. **Store the processed data** in an SQLite database.

In [10]:
# Path to the SQLite database where scraped housing data will be stored
db = 'tokyo_housing.db'

# Base URL of Suumo (Japanese housing site)
base_url = 'https://suumo.jp/'

# URLs for initial listings pages 
starting_url = 'https://suumo.jp/jj/chintai/ichiran/FR301FC001/?url=%2Fchintai%2Fichiran%2FFR301FC001%2F&ar=030&bs=040&pc=50&smk=&po1=25&po2=99&shkr1=03&shkr2=03&shkr3=03&shkr4=03&cb=0.0&ct=25.0&md=01&md=02&md=03&md=04&md=05&md=06&md=07&md=08&md=09&md=10&et=20&mb=0&mt=9999999&cn=9999999&ra=013&ek=035017990&ek=035026830&rn=0350&ae=03501'

### Data Collection Class: `TokyoHousingScraper`
> This class handles all **data ingestion** tasks, including:
> - Initializing local SQLite database connection
> - Scraping Tokyo listings HTML from *SUUMO.jp*
> - Parsing station information and other housing metrics from raw HTML
> - Building a robust housing dataset which includes features such as: `title`, `floor`, `area`, `rent`, `deposit`, etc.
> - Storing the resulting dataset in local SQLite table

In [11]:
# Define TokyoHousingScraper to:
# - Scrape Tokyo housing listings from Suumo.jp
# - Collect listings, parse property details
# - Store the results in SQLite database

class TokyoHousingScraper:
    
    def __init__(self, db, base_url, url):
        # Initialize DB connection
        self.db = db
        self.conn = sqlite3.connect(self.db)
        self.cursor = self.conn.cursor()

        # Base URL and starting page
        self.base_url = base_url
        self.url = url
    
    def scrape_listings(self):
		# Define list for storing HTML
        self.listings = list()
		
		# Iterate through all pages of listings
        while True:
            try:
                response = requests.get(next_page) #this will only work after the first page
            except:
                response = requests.get(self.url) # starting url 
            soup = BeautifulSoup(response.text, 'lxml')
			
			# Each listing = cassetteitem div
            cassettes = soup.select('div.cassetteitem')
            self.listings.extend(cassettes)
            print(len(self.listings))

			# Find next page link (pagination)
            try:
                current_page = soup.find('li', class_ = 'pagination-current')
                next_page_path = current_page.find_next_siblings('li')[1]
                next_page = self.base_url + next_page_path.select_one('a').get('href')
            except: break #no more pages to comb through

        print(f'{len(self.listings)} listings were successfully gathered!')

    def parse_station_info(self, item):
		
        # Extract station information (names, distances, nearest, average).
        # Returns tuple: 
            # (stations_str, nearest_station, distance_to_nearest_station, avg_distance).

		# Get raw station blocks
        stations_list = item.select('li.cassetteitem_detail-col2 div.cassetteitem_detail-text'
		) if item.select(
			'li.cassetteitem_detail-col2 div.cassetteitem_detail-text'
		) else None

		# If there is no station information, return None
        if stations_list == None:
            return (None, None, None, None)
        else: pass

		# Remove empty tags
        stations_list = [s for s in stations_list if s != '']

		# All stations as a single string (for DB storage)
        self.stations_str = ",".join([station.get_text().strip() for station in stations_list])
	
		# Extract stations and distances with regex
        stations_dict = {
			# All listed stations
			'stations': [
				re.findall(r'/(?P<station>.*?)\s*歩', station.get_text().strip())[0] 
				for station in stations_list
				if re.findall(r'/(?P<station>.*?)\s*歩', station.get_text().strip())
			],
		
			'distances': [
				re.findall(r'\d+', station.get_text().strip())[0]
				for station in stations_list
				if re.findall(r'\d+', station.get_text().strip())
			]
		}
	
		# Compute distance to nearest station
        self.distance_to_nearest_station = min([int(dist) for dist in stations_dict['distances']])
        nearest_idx = stations_dict['distances'].index(str(self.distance_to_nearest_station))
        self.nearest_station = stations_dict['stations'][nearest_idx]
	
		# Compute average distance to surrounding stations
        self.avg_distance = np.mean([float(dist) for dist in stations_dict['distances']])

        return self.stations_str, self.nearest_station, self.distance_to_nearest_station, self.avg_distance

    def build_housing_dataset(self):
		
		# Extract housing data (title, rent, floor, area, stations, etc.)
            # and save into SQLite as a DataFrame.

        self.housing_data = [
		{
			'img': item.select_one(
                'div.cassetteitem_object img'
            ).get('rel') if item.select_one(
                'div.cassetteitem_object img'
            ) else None,
            
            'title': item.select_one(
				'div.cassetteitem_content-title'
			).get_text().strip() if item.select_one(
				'div.cassetteitem_content-title'
			) else None,
		
			'address': item.select_one(
				'li.cassetteitem_detail-col1'
			).get_text().strip() if item.select_one(
				'li.cassetteitem_detail-col1'
			) else None,
		
			'rent': item.select_one(
				'span.cassetteitem_price.cassetteitem_price--rent'
			).get_text().strip() if item.select_one(
				'div.cassetteitem-item span.cassetteitem_price.cassetteitem_price--rent'
			) else None,
		
			'management_fee': item.select_one(
				'span.cassetteitem_price.cassetteitem_price--administration'
			).get_text().strip() if item.select_one(
				'span.cassetteitem_price.cassetteitem_price--administration'
			) else None,
		
			'deposit': item.select_one(
				'span.cassetteitem_price.cassetteitem_price--deposit'
			).get_text().strip() if item.select_one(
				'span.cassetteitem_price.cassetteitem_price--deposit'
			) else None,

			'key_money': item.select_one(
				'span.cassetteitem_price.cassetteitem_price--gratuity'
			).get_text().strip() if item.select_one(
				'span.cassetteitem_price.cassetteitem_price--gratuity'
			) else None,

			'floor': item.select(
				'div.cassetteitem-item tr.js-cassette_link td'
			)[2].get_text().strip() if item.select(
				'div.cassetteitem-item tr.js-cassette_link td'
			)[2] else None,
		
			'floor_plan': item.select_one(
				'span.cassetteitem_madori'
			).get_text().strip() if item.select_one(
				'span.cassetteitem_madori'
			) else None,

			'area': item.select_one(
				'span.cassetteitem_menseki'
			).get_text().strip() if item.select_one(
				'span.cassetteitem_menseki'
			) else None,

			'building_age': item.select(
				'li.cassetteitem_detail-col3 div'
			)[0].get_text().strip() if item.select(
				'li.cassetteitem_detail-col3'
			) else None,

			'building_size': item.select(
				'li.cassetteitem_detail-col3 div'
			)[1].get_text().strip() if item.select(
				'li.cassetteitem_detail-col3'
			) else None,

			'stations': self.parse_station_info(item)[0],

			'nearest_station': self.parse_station_info(item)[1],

			'distance_to_nearest_station': self.parse_station_info(item)[2],

			'avg_distance_to_stations': self.parse_station_info(item)[3]
		}
			for item in self.listings
		]

		# Save to DataFrame + SQLite
        self.housing_data_df = pd.DataFrame(self.housing_data)
        self.housing_data_df.to_sql(name = 'HOUSING_DATA', con = self.conn, if_exists = 'replace', index = False)

		# Close the DB connection
        self.conn.close()

### Initialize Scraper & Gather Data
>In this section, we **initialize the TokyoHousingScraper**, **scrape rental listings**, and **build a structured housing dataset** stored in SQLite.

In [12]:
# Initialize scraper
scraper = TokyoHousingScraper(db, base_url, starting_url)

# Scrape housing listings
scraper.scrape_listings()

# Parse listing details and save dataset to SQLite
scraper.build_housing_dataset()

50
100
150
200
250
300
350
400
450
500
550
600
650
700
750
800
850
900
950
1000
1050
1100
1150
1200
1211
1211 listings were successfully gathered!


## Extracting & Engineering Tokyo Housing Metrics 
>- Connect to local SQLite database `db` containing listing information and housing metrics.
>- Initialize SQL Magic (`%sql`) to run queries directly from the notebook. 

In [18]:
# Connect to SQLite database for querying listings 
conn = sqlite3.connect(db)
cursor = conn.cursor()

# Initialize SQL Magic with database connection
%sql sqlite:///tokyo_housing.db

### Create SQL View

>**Step 1: Standardize core listing fields**  
>- `img`, `title`, `address`: Basic identifiers  
>- `rent`, `management_fee`, `deposit`, `key_money`: Convert to numeric values  
>- `floor`: Convert floor labels to integers  
>- `floor_plan`: Normalize labels (e.g., 'ワンルーム' → '1R')  
>- `area`: Convert to numeric (square meters)  
>- `building_age`: Extract age in years  
>- `building_size`: Standardize number of floors  
>- `stations`, `nearest_station`, `distance_to_nearest_station`, `avg_distance_to_stations`: Station-related features  
>
>**Step 2: Handle missing or invalid values**  
>- Replace 0 or invalid values in `management_fee`, `deposit`, `key_money` with NULL  
>
>**Step 3: Feature engineering**  
>- `avg_rent_by_station`: Average rent per nearest station  
>- `avg_rent_by_floor_plan`: Average rent per floor plan  
>- `price_rank_by_station`: Rank rent relative to other listings near the same station  
>
>**Step 4: Build final view**  
>- Combine standardized fields and engineered features into `FEATURED_LISTINGS`  
>- Output all listings in `TOKYO_HOUSING` view

In [None]:
%%sql 
-- Remove the view if it already exists
DROP VIEW IF EXISTS TOKYO_HOUSING;

-- Create a cleaned + feature-engineered housing view
CREATE VIEW TOKYO_HOUSING AS

WITH STANDARDIZED_LISTINGS AS (
    SELECT 
        -- Basic identifiers
        img, title, address, 
        
        -- Convert rent/deposit/key money into numeric
        CAST(RTRIM(rent, '万円') AS FLOAT) * 10000 AS rent,
        CAST(RTRIM(management_fee, '円') AS INTEGER) AS management_fee,
        CAST(RTRIM(deposit, '万円') AS FLOAT) * 10000 AS deposit,
        CAST(RTRIM(key_money, '万円') AS FLOAT) * 10000 AS key_money,
        
        -- Convert floor to integer
        CAST(RTRIM(floor, '階') AS INTEGER) AS floor,
        
        -- Normalize floor plan labels 
        CASE
            WHEN floor_plan = 'ワンルーム' THEN '1R'
            ELSE floor_plan
        END AS floor_plan,
        
        -- Convert area to numeric (square meters)
        CAST(RTRIM(area, 'm2') AS FLOAT) AS area,
        
        -- Extract building age in years
        CAST(LTRIM(RTRIM(building_age, '年'), '築') AS INTEGER) AS building_age,
        
        -- Standardize building size
        CASE
            WHEN building_size LIKE '地下%' THEN 
                CAST(SUBSTR(building_size, 3, 1) AS INTEGER) +
                CAST(SUBSTR(building_size, 6, 1) AS INTEGER)
            WHEN building_size LIKE '地上%' THEN
                CAST(SUBSTR(building_size, 3, 1) AS INTEGER)
            ELSE CAST(RTRIM(building_size, '階建') AS INTEGER)
        END AS building_size,
        
        -- Station-related features
        stations,
        nearest_station,
        distance_to_nearest_station,
        ROUND(avg_distance_to_stations, 2) AS avg_distance_to_stations
    FROM HOUSING_DATA
),

FEATURED_LISTINGS AS (
    SELECT 
        img, title, address, rent, 
        
        -- Replace 0/invalid values with NULLs
        NULLIF(management_fee, 0) AS management_fee,
        NULLIF(deposit, -0.0) AS deposit,
        NULLIF(key_money, 0.0) AS key_money,
        floor, floor_plan, area, building_age,
        building_size, nearest_station,
        distance_to_nearest_station, avg_distance_to_stations,
        
        -- Feature engineering: average rents by station, floor plan, and distance to nearest station
        ROUND(AVG(rent) 
            OVER (PARTITION BY nearest_station), 2) 
            AS avg_rent_by_station, 
        ROUND(AVG(rent)
            OVER (PARTITION BY floor_plan), 2) 
            AS avg_rent_by_floor_plan,
        
        -- Price rank relative to other listings near the same station
        DENSE_RANK() 
            OVER (PARTITION BY nearest_station ORDER BY rent DESC)
            AS price_rank_by_station
    FROM STANDARDIZED_LISTINGS
)

-- Final output 
SELECT * FROM FEATURED_LISTINGS

### Load SQL View Into DataFrame
>- Use `%sql` to query `TOKYO_HOUSING` and convert results to a Dataframe for further analysis.
>- Once data is in pandas, we close the database connection. 

In [None]:
# Query the engineered SQL view into a pandas DataFrame for analysis
tokyo_housing = %sql SELECT * FROM TOKYO_HOUSING 
tokyo_housing_df = tokyo_housing.DataFrame()

# Close the DB connection 
conn.close()

## Data Cleaning & Overview

In [16]:
# Drop duplicate listings 
tokyo_housing_df.drop_duplicates(inplace = True)

In [3]:
tokyo_housing_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1207 entries, 0 to 1206
Data columns (total 18 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   img                          1202 non-null   object 
 1   title                        1207 non-null   object 
 2   address                      1207 non-null   object 
 3   rent                         1207 non-null   float64
 4   management_fee               1034 non-null   float64
 5   deposit                      825 non-null    float64
 6   key_money                    788 non-null    float64
 7   floor                        1207 non-null   int64  
 8   floor_plan                   1207 non-null   object 
 9   area                         1207 non-null   float64
 10  building_age                 1207 non-null   int64  
 11  building_size                1207 non-null   int64  
 12  nearest_station              1207 non-null   object 
 13  distance_to_neares

In [4]:
tokyo_housing_df.describe(include = 'all')

Unnamed: 0,img,title,address,rent,management_fee,deposit,key_money,floor,floor_plan,area,building_age,building_size,nearest_station,distance_to_nearest_station,avg_distance_to_stations,avg_rent_by_station,avg_rent_by_floor_plan,price_rank_by_station
count,1202,1207,1207,1207.0,1034.0,825.0,788.0,1207.0,1207,1207.0,1207.0,1207.0,1207,1207.0,1207.0,1207.0,1207.0,1207.0
unique,1202,1156,45,,,,,,15,,,,11,,,,,
top,https://img01.suumo.com/front/gazo/fr/bukken/4...,ＪＲ山手線 高田馬場駅 4階建 築3年,東京都新宿区高田馬場３,,,,,,1K,,,,中井駅,,,,,
freq,1,4,100,,,,,,517,,,,235,,,,,
mean,,,,101696.520298,6964.119923,106328.121212,115307.233503,2.717481,,26.720017,22.607291,4.649544,,5.35377,9.407026,101620.114391,101677.018028,43.293289
std,,,,42696.128135,4196.799459,59970.635488,64419.889365,2.200682,,13.045711,15.697647,3.054339,,2.520552,2.298308,7978.506092,34782.337937,24.968109
min,,,,30000.0,200.0,30000.0,30000.0,0.0,,0.0,0.0,1.0,,1.0,2.0,85730.77,63000.0,1.0
25%,,,,69000.0,3000.0,69000.0,74000.0,1.0,,19.03,9.0,2.0,,3.0,7.67,97854.65,74262.69,22.0
50%,,,,89000.0,6000.0,87000.0,94000.0,2.0,,24.61,21.0,4.0,,5.0,9.67,104994.06,87425.19,43.0
75%,,,,125000.0,10000.0,125000.0,138000.0,3.0,,31.05,35.0,6.0,,7.0,11.0,106420.47,108070.42,63.0


## Exploring Patterns & Insights from Tokyo Rentals

In [None]:
# Compute correlation matrix for all numeric features
housing_corr = tokyo_housing_df.select_dtypes(['int64', 'float64']).corr()

# Initialize figure
fig = plt.figure(figsize = (15, 8))

# Plot heatmap of correlations
sns.heatmap(data = housing_corr, cmap = 'vlag', annot = True, linecolor = 'black', linewidths = 0.5, fmt = '.2f', cbar_kws = {'label': 'Correlation Coefficient'})

# Set figure title
fig.suptitle('Correlation Heatmap of Housing Metrics', fontweight = 'bold', fontsize = 18)

plt.tight_layout()
plt.savefig('correlation.png')

![correlation.png](attachment:332183bc-8ae2-4581-9b25-2ab153e9fdfd.png)

In [None]:
# Set grid style 
sns.set_style("whitegrid")

# Set color palette 
palette = sns.color_palette("muted")

In [None]:
mean_rent = tokyo_housing_df['rent'].mean()

fig = plt.figure(figsize = (15, 8))

sns.histplot(data = tokyo_housing_df, x = 'rent', color = palette[0], stat = 'density', alpha = 0.4)
sns.kdeplot(data = tokyo_housing_df, x = 'rent', color = palette[3], fill = True, linewidth = 1.5)

plt.axvline(x = mean_rent, color = 'black', linestyle = '--', label = 'Mean Rental Price')

plt.annotate(f'Mean Rent = ¥{mean_rent:.2f}', xy = (mean_rent, 0.00001), xytext = (mean_rent + 15000, 0.00001), arrowprops = dict(facecolor = 'black', shrink = 0.05), fontweight = 'bold')


plt.title('Distribution of Rental Prices', fontsize = 18, fontweight = 'bold')
plt.xlabel('Rent', fontsize = 12)
plt.ylabel('Density', fontsize = 12)

plt.savefig('rent_dist.png')

![rent_dist.png](attachment:b5254e54-13b7-4c8f-98fe-6c70f2e7ff9c.png)

### Average Rent: Confidence Interval Analysis

In [69]:
# Confidence level -- 95%
# Significance level -- alpha = 0.05
alpha = 0.05

rent = tokyo_housing_df['rent']

# Compute the sample mean
x_bar = np.mean(rent)

# Compute the sample size
n = len(rent)
print(f'The sample size is {n}')

The sample size is 1207


>**Error source**: Random, but may not be evenly distributed (more density around smaller sized homes)

In [70]:
# Compute the standard error (SE)
se = np.std(rent)/np.sqrt(n)

# Compute the z critical value
# Compute the margin of error (MOE)
z = norm.ppf(1 - (alpha/2))
moe = z*se 
print(f'The 95% confidence interval is {x_bar:.3f} +/- {moe:.3f}')

The 95% confidence interval is 101696.520 +/- 2407.703


In [None]:
# Pearson correlation between rent and area
pearson_coeff, p_value = stats.pearsonr(x = tokyo_housing_df['area'], y = tokyo_housing_df['rent'])

# Initialize fig, ax
fig, ax = plt.subplots(figsize = (15, 8))

# Scatter plot of rent vs. area with pearson coefficient & p-value
plt.scatter(tokyo_housing_df['area'], tokyo_housing_df['rent'], color = palette[0])
plt.text(0.10, 0.95, f'Rent & Area:\npearson_coeff = {pearson_coeff}\np_value = {p_value}', transform = ax.transAxes, verticalalignment = 'top', fontweight = 'bold')

# Set title, axis labels
plt.title('Correlation of Rent with Floor Area', fontsize = 18, fontweight = 'bold')
plt.xlabel('Area (m^2)', fontsize = 12)
plt.ylabel('Rent (¥)', fontsize = 12)

plt.savefig('rent_vs_area.png')

![rent_vs_area.png](attachment:7e06c416-b18d-43ab-963c-614dc0485ea2.png)

>`building_age` is a quantitative discrete feature.

In [None]:
# Pearson correlation between rent and area
pearson_coeff, p_value = stats.pearsonr(x = tokyo_housing_df['building_age'], y = tokyo_housing_df['rent'])

# Initialize fig, ax
fig, ax = plt.subplots(figsize = (15, 8))

# Scatter plot of rent vs. area with pearson coefficient & p-value
plt.scatter(tokyo_housing_df['building_age'], tokyo_housing_df['rent'], color = palette[1])
plt.text(0.70, 0.95, f'Rent & Building Age:\npearson_coeff = {pearson_coeff}\np_value = {p_value}', transform = ax.transAxes, verticalalignment = 'top', fontweight = 'bold')

# Set title, axis labels
plt.title('Correlation of Rent with Building Age', fontsize = 18, fontweight = 'bold')
plt.xlabel('Age (Years)', fontsize = 12)
plt.ylabel('Rent (¥)', fontsize = 12)

plt.savefig('age_vs_area.png')

![age_vs_area.png](attachment:952ff537-4078-454e-976a-90ff03f2213d.png)

In [None]:
# Initialize figure
fig = plt.figure(figsize = (15, 8))

# Boxplot distribution of rent by floor plan
sns.boxplot(data = tokyo_housing_df, x = 'floor_plan', y = 'rent', palette = palette, flierprops = {'mfc' : 'black', 'marker': 'D'})

# Set title, axis labels
plt.title('Distribution of Rent by Floor Plan', fontsize = 18, fontweight = 'bold')
plt.xlabel('Floor Plan', fontsize = 12)
plt.ylabel('Rent (¥)', fontsize = 12)

plt.savefig('floor_plan_box.png')

![floor_plan_box.png](attachment:3b2b939a-1672-4c78-b591-bc2e4399089c.png)

In [None]:
# Initialize figure
fig = plt.figure(figsize = (15, 8))

# Boxplot distribution of rent by distance to nearest station
sns.boxplot(data = tokyo_housing_df, x = 'distance_to_nearest_station', y = 'rent', palette = palette, flierprops = {'mfc' : 'black', 'marker': 'D'})

# Set title, axis labels
plt.title('Distribution of Rent by Station Proximity', fontsize = 18, fontweight = 'bold')
plt.xlabel('Distance to Nearest Station (min)', fontsize = 12)
plt.ylabel('Rent (¥)', fontsize = 12)

plt.savefig('distance_box.png')

![distance_box.png](attachment:c4a7cb70-ce9a-45c0-a05d-f37a9ff1f893.png)

In [None]:
# Initialize figure
fig = plt.figure(figsize = (15, 8))

# Set a Japanese-capable font 
rc('font', family = 'Hiragino Sans')

# Boxplot distribution of rent by nearest station
sns.boxplot(data = tokyo_housing_df, x = 'nearest_station', y = 'rent', palette = palette, flierprops = {'mfc' : 'black', 'marker': 'D'})

# Set title, axis labels
plt.title('Distribution of Rent by Nearest Station', fontsize = 18, fontweight = 'bold')
plt.xlabel('Station', fontsize = 12)
plt.ylabel('Rent (¥)', fontsize = 12)

plt.savefig('nearest_station.png')

![nearest_station.png](attachment:be4ec82a-9e34-4ac6-9df9-7100805bde4b.png)

>`building_size` is a quantitative discrete feature.

In [None]:
# Initialize figure
fig = plt.figure(figsize = (15, 8))

# Boxplot distribution of rent by building_size
sns.boxplot(data = tokyo_housing_df, x = 'building_size', y = 'rent', palette = palette, flierprops = {'mfc' : 'black', 'marker': 'D'})

# Set title, axis labels
plt.title('Distribution of Rent by Building Size', fontsize = 18, fontweight = 'bold')
plt.xlabel('Building Size (# of Floors)', fontsize = 12)
plt.ylabel('Rent (¥)', fontsize = 12)

plt.savefig('building_size_box.png')

![building_size_box.png](attachment:aa8af957-7442-43e9-96b1-fed73186c45e.png)

In [None]:
# Initialize figure
fig = plt.figure(figsize = (15, 8))

# Boxplot distribution of rent by floor
sns.boxplot(data = tokyo_housing_df, x = 'floor', y = 'rent', palette = palette, flierprops = {'mfc' : 'black', 'marker': 'D'})

# Set title, axis labels
plt.title('Distribution of Rent by Floor', fontsize = 18, fontweight = 'bold')
plt.xlabel('Floor', fontsize = 12)
plt.ylabel('Rent (¥)', fontsize = 12)

plt.savefig('floor_box.png')

![floor_box.png](attachment:f9306e65-e4bb-4cf4-b915-32a7de7bcfa7.png)

## Linear Regression Modeling

In [4]:
# Define X, y for LinearRegression
X = tokyo_housing_df.drop(columns = ['rent'])
y = tokyo_housing_df['rent']

# Split data into training & testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Initialize OneHotEncoder for categorical variables (floor_plan)
# sparse_output returns a numpy array, which is easier to inspect & convert back to DataFrame
encoder = OneHotEncoder(drop = 'first', handle_unknown = 'ignore', sparse_output = False)

# OneHotEncoder may throw a UserWarning error from 'unknown categories' -- these will be encoded as zeros
warnings.filterwarnings('ignore', category = UserWarning)

# Pass DataFrame to the encoder to preserve the column name
X_train_enc = encoder.fit_transform(X_train[['floor_plan']])
X_test_enc = encoder.transform(X_test[['floor_plan']])

# Get the encoded feature names to for columns in the DataFrame
encoded_cols = list(encoder.get_feature_names_out(['floor_plan']))

# Use the index = X_train.index parameter to ensure that the indexing matches 
X_train_enc = pd.DataFrame(X_train_enc, columns = encoded_cols, index = X_train.index)
X_test_enc = pd.DataFrame(X_test_enc, columns = encoded_cols, index = X_test.index)

# Final train/test sets include area and floor_plan encoded features
X_train = pd.concat([X_train[['area', 'building_age', 'building_size', 'floor']], X_train_enc], axis = 1)
X_test = pd.concat([X_test[['area', 'building_age', 'building_size', 'floor']], X_test_enc], axis = 1)

In [5]:
# Initialize model 
lr0 = LinearRegression()

# Simple Linear Regression
X_train_0 = X_train[['area']]

# Compute scores for 5-fold cross validation on training dataset
score_0 = cross_val_score(lr0, X_train_0, y_train, cv = 5).mean()
print(f'The average of the R-squared values is {score_0: .3f}')

The average of the R-squared values is  0.718


In [6]:
# Initialize model                            
lr1 = LinearRegression()

# Initialize scaler object & standardize data for regression
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)

# Multiple Linear Regression
score_1 = cross_val_score(lr1, X_train_scaled, y_train, cv = 5).mean()
print(f'The average of the R-squared values is {score_1: .3f}')

The average of the R-squared values is  0.849


In [7]:
# Initialize Linear Regression model & fit to the training data
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)

# Make predictions on training and testing sets
y_train_pred = lr.predict(X_train_scaled)
y_test_pred = lr.predict(X_test_scaled)

# Evaluate on training set
r2_train = r2_score(y_train, y_train_pred)
print(f'R-squared value the training set: {r2_train: .3f}')

# Evaluate on testing set
r2_test = r2_score(y_test, y_test_pred)
print(f'R-squared value on the testing set: {r2_test: .3f}')

R-squared value the training set:  0.859
R-squared value on the testing set:  0.868


In [None]:
# Compute residuals
y_resid = y_test_pred - y_test

# Initialize fig, ax
fig, ax = plt.subplots(figsize = (15, 8))

# Scatter plot for residuals
ax.scatter(x = y_test_pred, y = y_resid, color = palette[0])

# Set title, axis labels
ax.set_title('Residual Plot of Predicted vs. Actual Values', fontsize = 18, fontweight = 'bold')
ax.set_xlabel('Predicted Rent (¥)', fontsize = 12)
ax.set_ylabel('Residuals (Predicted – Actual)', fontsize = 12)

plt.savefig('residuals.png')

![residuals.png](attachment:d9864415-158e-453e-ade0-326ace3e6678.png)

In [None]:
# Initialize fig, ax
fig, ax = plt.subplots(figsize = (15, 8))

# Histogram + KDE plots for predicted vs. actual rent values
sns.histplot(y_test_pred, palette = palette[1], kde = True, ax = ax, stat = 'density', alpha = 0.4)
sns.histplot(y_test, palette = palette[2], kde = True, ax = ax, stat = 'density', alpha = 0.4)

# Set title, axis labels
ax.set_title('Probability Density of Predicted vs. Actual Values', fontweight = 'bold', fontsize = 18)
ax.set_xlabel('Rent (¥)', fontsize = 12)
ax.set_ylabel('Density', fontsize = 12)

# Set legend
plt.legend(['Predicted Values', 'Actual Values'])

plt.savefig('predicted_vs_actual.png')

![predicted_vs_actual.png](attachment:68174b56-70c3-4181-8561-c9c38f22ce63.png)

In [2]:
tokyo_housing_df = pd.read_csv('tokyo_housing.csv', header = 0)

In [17]:
# Save DataFrame to CSV file
tokyo_housing_df.to_csv('tokyo_housing.csv', index = False)