# Insights into the Apartment Rental Market: Exploring Trends and Predicting Rental Prices

In [69]:
%matplotlib inline

In [70]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## 01. Data Collection

### Introduction - Data Collection:

In the ever-evolving real estate landscape, the availability of data plays a crucial role in understanding the dynamics of the rental market. This data science project aims to analyze and gain insights from a comprehensive dataset on apartments available for rent. To achieve this, we will collect data from the provided dataset hosted on the UCI Machine Learning Repository.

In [71]:
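dataset_path = 'data/apartments_for_rent_classified_10K.csv'

try:
    # The file is semicolon-delimited and encoded as Windows-1252 (cp1252).
    apartments_df = pd.read_csv(dataset_path, sep=";", encoding='cp1252')
    print(apartments_df.head(5))
except Exception as e:
    # Report the problem, then re-raise so later cells do not run without apartments_df.
    print("Error reading the dataset:", e)
    raise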

           id                category  \
0  5668626895  housing/rent/apartment   
1  5664597177  housing/rent/apartment   
2  5668626833  housing/rent/apartment   
3  5659918074  housing/rent/apartment   
4  5668626759  housing/rent/apartment   

                                               title  \
0  Studio apartment 2nd St NE, Uhland Terrace NE,...   
1                  Studio apartment 814 Schutte Road   
2  Studio apartment N Scott St, 14th St N, Arling...   
3                     Studio apartment 1717 12th Ave   
4  Studio apartment Washington Blvd, N Cleveland ...   

                                                body amenities  bathrooms  \
0  This unit is located at second St NE, Uhland T...       NaN        NaN   
1  This unit is located at 814 Schutte Road, Evan...       NaN        NaN   
2  This unit is located at N Scott St, 14th St N,...       NaN        1.0   
3  This unit is located at 1717 12th Ave, Seattle...       NaN        1.0   
4  This unit is located at Wash

### Dataset Source:

The dataset we will be working with is sourced from the UCI Machine Learning Repository and is specifically focused on apartment rentals. The dataset comprises a diverse range of attributes associated with apartments, including details about their location, physical characteristics, amenities, and, most importantly, their rental prices. This collection of information presents an exciting opportunity to explore and uncover patterns that influence rental pricing, identify key features that drive value, and ultimately, offer valuable insights to both landlords and prospective tenants.

### Data Collection Process:

The data collection process involves downloading the dataset from the UCI repository and loading it into our Python environment with the data manipulation library Pandas. Loading the file does not by itself guarantee clean data, so the sections that follow check its structure, data types, and missing values before any analysis.
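A few basic integrity checks can be run right after loading. The following is a minimal sketch, not part of the original notebook: the helper `basic_integrity_report`, the `EXPECTED_COLUMNS` set, and the toy frame are all hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical set of columns we expect to find; adjust to the actual file.
EXPECTED_COLUMNS = {"id", "price", "square_feet", "cityname", "state"}


def basic_integrity_report(df: pd.DataFrame, key: str = "id") -> dict:
    """Return a few simple integrity indicators for a freshly loaded frame."""
    return {
        "n_rows": len(df),
        # Expected columns that are absent from the frame.
        "missing_columns": sorted(EXPECTED_COLUMNS - set(df.columns)),
        # Rows whose key value already appeared earlier in the frame.
        "duplicate_keys": int(df[key].duplicated().sum()) if key in df.columns else None,
        # Rows in which every single column is missing.
        "fully_empty_rows": int(df.isna().all(axis=1).sum()),
    }


# Toy frame standing in for the real CSV (values are made up for illustration).
toy_df = pd.DataFrame({
    "id": [1, 2, 2],
    "price": [790, 425, 425],
    "square_feet": [101, 106, 106],
    "cityname": ["Washington", "Evansville", "Evansville"],
    "state": ["DC", "IN", "IN"],
})

report = basic_integrity_report(toy_df)
print(report)
```

On the real data, the same function could be called as `basic_integrity_report(apartments_df)` once the file has been read.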

## 02. Data Understanding and Exploration

Understanding and exploring the dataset are crucial steps in any data science project. In this phase, we will dive deep into the dataset obtained during the data collection phase. By thoroughly understanding the data's structure, quality, and relationships between variables, we can lay a solid foundation for further analysis and model building.

### 01. Data Overview

With the dataset loaded into a Pandas DataFrame, I start by examining the data type of each column.

In [72]:
apartments_df.dtypes

id                 int64
category          object
title             object
body              object
amenities         object
bathrooms        float64
bedrooms         float64
currency          object
fee               object
has_photo         object
pets_allowed      object
price              int64
price_display     object
price_type        object
square_feet        int64
address           object
cityname          object
state             object
latitude         float64
longitude        float64
source            object
time               int64
dtype: object

Next, I check the dimensions of the dataset (number of rows and columns).

In [73]:
apartments_df.shape

(10000, 22)

These are the first five rows of the dataset, which show its structure and content.

In [74]:
apartments_df.head(5)

|  | id | category | title | body | amenities | bathrooms | bedrooms | currency | fee | has_photo | ... | price_display | price_type | square_feet | address | cityname | state | latitude | longitude | source | time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5668626895 | housing/rent/apartment | Studio apartment 2nd St NE, Uhland Terrace NE,... | This unit is located at second St NE, Uhland T... | NaN | NaN | 0.0 | USD | No | Thumbnail | ... | $790 | Monthly | 101 | NaN | Washington | DC | 38.9057 | -76.9861 | RentLingo | 1577359415 |
| 1 | 5664597177 | housing/rent/apartment | Studio apartment 814 Schutte Road | This unit is located at 814 Schutte Road, Evan... | NaN | NaN | 1.0 | USD | No | Thumbnail | ... | $425 | Monthly | 106 | 814 Schutte Rd | Evansville | IN | 37.968 | -87.6621 | RentLingo | 1577017063 |
| 2 | 5668626833 | housing/rent/apartment | Studio apartment N Scott St, 14th St N, Arling... | This unit is located at N Scott St, 14th St N,... | NaN | 1.0 | 0.0 | USD | No | Thumbnail | ... | $1,390 | Monthly | 107 | NaN | Arlington | VA | 38.891 | -77.0816 | RentLingo | 1577359410 |
| 3 | 5659918074 | housing/rent/apartment | Studio apartment 1717 12th Ave | This unit is located at 1717 12th Ave, Seattle... | NaN | 1.0 | 0.0 | USD | No | Thumbnail | ... | $925 | Monthly | 116 | 1717 12th Avenue | Seattle | WA | 47.616 | -122.3275 | RentLingo | 1576667743 |
| 4 | 5668626759 | housing/rent/apartment | Studio apartment Washington Blvd, N Cleveland ... | This unit is located at Washington Blvd, N Cle... | NaN | NaN | 0.0 | USD | No | Thumbnail | ... | $880 | Monthly | 125 | NaN | Arlington | VA | 38.8738 | -77.1055 | RentLingo | 1577359401 |


The column names of the dataset:

In [75]:
apartments_df.columns

Index(['id', 'category', 'title', 'body', 'amenities', 'bathrooms', 'bedrooms',
       'currency', 'fee', 'has_photo', 'pets_allowed', 'price',
       'price_display', 'price_type', 'square_feet', 'address', 'cityname',
       'state', 'latitude', 'longitude', 'source', 'time'],
      dtype='object')

I will use `df.info()` to obtain information about the data types of each column and check for any missing values.

In [76]:
apartments_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             10000 non-null  int64  
 1   category       10000 non-null  object 
 2   title          10000 non-null  object 
 3   body           10000 non-null  object 
 4   amenities      6451 non-null   object 
 5   bathrooms      9966 non-null   float64
 6   bedrooms       9993 non-null   float64
 7   currency       10000 non-null  object 
 8   fee            10000 non-null  object 
 9   has_photo      10000 non-null  object 
 10  pets_allowed   5837 non-null   object 
 11  price          10000 non-null  int64  
 12  price_display  10000 non-null  object 
 13  price_type     10000 non-null  object 
 14  square_feet    10000 non-null  int64  
 15  address        6673 non-null   object 
 16  cityname       9923 non-null   object 
 17  state          9923 non-null   object 
 18  latitud

I will generate descriptive statistics using `df.describe()` to gain insights into the central tendencies, spread, and distributions of numerical variables.

In [77]:
apartments_df.describe()

|  | id | bathrooms | bedrooms | price | square_feet | latitude | longitude | time |
|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 9966.0 | 9993.0 | 10000.0 | 10000.0 | 9990.0 | 9990.0 | 10000.0 |
| mean | 5623396000.0 | 1.380544 | 1.744021 | 1486.2775 | 945.8105 | 37.695162 | -94.652247 | 1574891000.0 |
| std | 70210250.0 | 0.61541 | 0.942354 | 1076.507968 | 655.755736 | 5.495851 | 15.759805 | 3762395.0 |
| min | 5508654000.0 | 1.0 | 0.0 | 200.0 | 101.0 | 21.3155 | -158.0221 | 1568744000.0 |
| 25% | 5509248000.0 | 1.0 | 1.0 | 949.0 | 649.0 | 33.67985 | -101.3017 | 1568781000.0 |
| 50% | 5668610000.0 | 1.0 | 2.0 | 1270.0 | 802.0 | 38.8098 | -93.6516 | 1577358000.0 |
| 75% | 5668626000.0 | 2.0 | 2.0 | 1695.0 | 1100.0 | 41.3498 | -82.209975 | 1577359000.0 |
| max | 5668663000.0 | 8.5 | 9.0 | 52500.0 | 40000.0 | 61.594 | -70.1916 | 1577362000.0 |


In [78]:
apartments_df.describe().T

|  | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 10000.0 | 5623396000.0 | 70210250.0 | 5508654000.0 | 5509248000.0 | 5668610000.0 | 5668626000.0 | 5668663000.0 |
| bathrooms | 9966.0 | 1.380544 | 0.6154099 | 1.0 | 1.0 | 1.0 | 2.0 | 8.5 |
| bedrooms | 9993.0 | 1.744021 | 0.9423539 | 0.0 | 1.0 | 2.0 | 2.0 | 9.0 |
| price | 10000.0 | 1486.277 | 1076.508 | 200.0 | 949.0 | 1270.0 | 1695.0 | 52500.0 |
| square_feet | 10000.0 | 945.8105 | 655.7557 | 101.0 | 649.0 | 802.0 | 1100.0 | 40000.0 |
| latitude | 9990.0 | 37.69516 | 5.495851 | 21.3155 | 33.67985 | 38.8098 | 41.3498 | 61.594 |
| longitude | 9990.0 | -94.65225 | 15.7598 | -158.0221 | -101.3017 | -93.6516 | -82.20998 | -70.1916 |
| time | 10000.0 | 1574891000.0 | 3762395.0 | 1568744000.0 | 1568781000.0 | 1577358000.0 | 1577359000.0 | 1577362000.0 |
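The statistics above show a maximum `price` of 52500 against a 75th percentile of 1695, and a maximum `square_feet` of 40000 against a 75th percentile of 1100, which suggests a few extreme listings. One common way to flag such values is the interquartile range (IQR) rule. The sketch below is an illustration only: the helper `iqr_outlier_mask`, the multiplier `k = 1.5`, and the toy price series are assumptions, not something the notebook itself applies.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR outside the first or third quartile."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy series for illustration; the numbers echo the quartiles and maximum reported above.
toy_prices = pd.Series([949, 1100, 1270, 1400, 1695, 52500], name="price")

mask = iqr_outlier_mask(toy_prices)
print(toy_prices[mask])  # listings flagged as potential outliers
```

On the real data this could be applied as `iqr_outlier_mask(apartments_df["price"])`. Whether flagged rows should be dropped, capped, or kept is a separate modeling decision.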


## 03. Data Cleaning and Preprocessing

Data cleaning and preprocessing are essential steps to ensure that the dataset is suitable for analysis and model building. In this phase, we will address any data quality issues, handle missing values, and prepare the dataset for further exploration and modeling.

Here I identify columns with missing values using `df.isnull().sum()` to understand the extent of missing data.

In [79]:
apartments_df.isnull().sum()

id                  0
category            0
title               0
body                0
amenities        3549
bathrooms          34
bedrooms            7
currency            0
fee                 0
has_photo           0
pets_allowed     4163
price               0
price_display       0
price_type          0
square_feet         0
address          3327
cityname           77
state              77
latitude           10
longitude          10
source              0
time                0
dtype: int64
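Raw counts are easier to compare as percentages. With 10,000 rows, the counts above correspond to roughly 41.6% missing for `pets_allowed`, 35.5% for `amenities`, and 33.3% for `address`. A minimal sketch of such a summary follows; the helper name `missing_summary` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, most affected first."""
    summary = pd.DataFrame({
        "n_missing": df.isnull().sum(),
        "pct_missing": (df.isnull().mean() * 100).round(2),
    })
    return summary.sort_values("pct_missing", ascending=False)


# Toy frame for illustration only (values are made up).
toy_df = pd.DataFrame({
    "amenities": ["Pool", np.nan, np.nan, "Gym"],
    "bathrooms": [1.0, np.nan, 1.0, 2.0],
    "price": [790, 425, 1390, 925],
})

summary = missing_summary(toy_df)
print(summary)
```

Calling `missing_summary(apartments_df)` would give the same view for the full dataset.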

In this phase, we focus on making the dataset well-structured and relevant. Removing columns that add little value or are redundant streamlines the data for further analysis and model building. Based on the objectives of this analysis, I will remove columns that are irrelevant or contribute minimally to the project. This includes columns with many missing values (such as `address`, with 3,327 missing entries) and redundant information (such as `price_display`, which repeats `price` as formatted text).

In [80]:
columns_to_remove = ['id', 'category', 'body', 'fee', 'has_photo', 'price_display', 'address', 'time']
new_apartments_df = apartments_df.drop(columns=columns_to_remove)

In [81]:
new_apartments_df

|  | title | amenities | bathrooms | bedrooms | currency | pets_allowed | price | price_type | square_feet | cityname | state | latitude | longitude | source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Studio apartment 2nd St NE, Uhland Terrace NE,... | NaN | NaN | 0.0 | USD | NaN | 790 | Monthly | 101 | Washington | DC | 38.9057 | -76.9861 | RentLingo |
| 1 | Studio apartment 814 Schutte Road | NaN | NaN | 1.0 | USD | NaN | 425 | Monthly | 106 | Evansville | IN | 37.9680 | -87.6621 | RentLingo |
| 2 | Studio apartment N Scott St, 14th St N, Arling... | NaN | 1.0 | 0.0 | USD | NaN | 1390 | Monthly | 107 | Arlington | VA | 38.8910 | -77.0816 | RentLingo |
| 3 | Studio apartment 1717 12th Ave | NaN | 1.0 | 0.0 | USD | NaN | 925 | Monthly | 116 | Seattle | WA | 47.6160 | -122.3275 | RentLingo |
| 4 | Studio apartment Washington Blvd, N Cleveland ... | NaN | NaN | 0.0 | USD | NaN | 880 | Monthly | 125 | Arlington | VA | 38.8738 | -77.1055 | RentLingo |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | Five BR 5407 Abbott Place - Abbott | NaN | 4.0 | 5.0 | USD | NaN | 6000 | Monthly | 6300 | Edina | MN | 44.9000 | -93.3233 | RentLingo |
| 9996 | Six BR 256 Las Entradas | NaN | 8.0 | 6.0 | USD | NaN | 25000 | Monthly | 8716 | Montecito | CA | 34.4331 | -119.6331 | RentLingo |
| 9997 | Six BR 9908 Bentcross Drive | NaN | 8.5 | 6.0 | USD | NaN | 11000 | Monthly | 11318 | Potomac | MD | 39.0287 | -77.2409 | RentLingo |
| 9998 | One BR in New York NY 10069 | Basketball,Cable or Satellite,Doorman,Hot Tub,... | NaN | 1.0 | USD | NaN | 4790 | Monthly | 40000 | New York | NY | 40.7716 | -73.9876 | Listanza |


Checking the column types:

In [93]:
new_apartments_df.dtypes

title            object
amenities        object
bathrooms       float64
bedrooms        float64
currency         object
pets_allowed     object
price             int64
price_type       object
square_feet       int64
cityname         object
state            object
latitude        float64
longitude       float64
source           object
dtype: object

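I will use the `astype()` method to explicitly convert the `source` column to Python strings (`str`). Note that pandas still reports such a column as `object`, as the `dtypes` output further down confirms.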

In [83]:
new_apartments_df['source'] = new_apartments_df['source'].astype('str')

In [84]:
new_apartments_df

|  | title | amenities | bathrooms | bedrooms | currency | pets_allowed | price | price_type | square_feet | cityname | state | latitude | longitude | source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Studio apartment 2nd St NE, Uhland Terrace NE,... | NaN | NaN | 0.0 | USD | NaN | 790 | Monthly | 101 | Washington | DC | 38.9057 | -76.9861 | RentLingo |
| 1 | Studio apartment 814 Schutte Road | NaN | NaN | 1.0 | USD | NaN | 425 | Monthly | 106 | Evansville | IN | 37.9680 | -87.6621 | RentLingo |
| 2 | Studio apartment N Scott St, 14th St N, Arling... | NaN | 1.0 | 0.0 | USD | NaN | 1390 | Monthly | 107 | Arlington | VA | 38.8910 | -77.0816 | RentLingo |
| 3 | Studio apartment 1717 12th Ave | NaN | 1.0 | 0.0 | USD | NaN | 925 | Monthly | 116 | Seattle | WA | 47.6160 | -122.3275 | RentLingo |
| 4 | Studio apartment Washington Blvd, N Cleveland ... | NaN | NaN | 0.0 | USD | NaN | 880 | Monthly | 125 | Arlington | VA | 38.8738 | -77.1055 | RentLingo |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | Five BR 5407 Abbott Place - Abbott | NaN | 4.0 | 5.0 | USD | NaN | 6000 | Monthly | 6300 | Edina | MN | 44.9000 | -93.3233 | RentLingo |
| 9996 | Six BR 256 Las Entradas | NaN | 8.0 | 6.0 | USD | NaN | 25000 | Monthly | 8716 | Montecito | CA | 34.4331 | -119.6331 | RentLingo |
| 9997 | Six BR 9908 Bentcross Drive | NaN | 8.5 | 6.0 | USD | NaN | 11000 | Monthly | 11318 | Potomac | MD | 39.0287 | -77.2409 | RentLingo |
| 9998 | One BR in New York NY 10069 | Basketball,Cable or Satellite,Doorman,Hot Tub,... | NaN | 1.0 | USD | NaN | 4790 | Monthly | 40000 | New York | NY | 40.7716 | -73.9876 | Listanza |


In [87]:
new_apartments_df.dtypes

title            object
amenities        object
bathrooms       float64
bedrooms        float64
currency         object
pets_allowed     object
price             int64
price_type       object
square_feet       int64
cityname         object
state            object
latitude        float64
longitude       float64
source           object
dtype: object
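The `source` column is still listed as `object` after the conversion. If a dedicated text dtype is wanted, pandas also offers the `"string"` extension dtype. The sketch below contrasts the two on a toy series (the series and its missing entry are made up for illustration; the real `source` column has no missing values). How `astype("str")` treats missing entries depends on the pandas version, so no output is shown for it.

```python
import pandas as pd

# Toy stand-in for the `source` column, with one missing entry added on purpose.
toy_source = pd.Series(["RentLingo", "Listanza", None], name="source")

# Conversion to Python str values; in many pandas versions the result is
# reported as `object` and a missing entry may be turned into literal text.
as_str = toy_source.astype("str")

# pandas' dedicated string extension dtype; missing entries stay missing (<NA>).
as_string = toy_source.astype("string")

print(as_str.dtype, as_string.dtype)
print("missing after 'string' conversion:", as_string.isna().sum())
```

In the notebook this would correspond to `new_apartments_df['source'].astype('string')`, which is an alternative to the cell above, not what the original code does.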

## 04. Feature Engineering:

I will create new, meaningful features from existing ones that might improve the model's performance.
I will also extract relevant information from text fields (e.g., the description of the apartment) using NLP techniques.
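As a starting point, a few derived features can be built with plain pandas before any heavier NLP is involved. This is a minimal sketch under assumptions: the feature names `price_per_sqft`, `amenity_count`, and `is_studio` are hypothetical choices, and the keyword match on `title` is a simple stand-in for real text processing.

```python
import numpy as np
import pandas as pd


def add_basic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a few illustrative derived columns."""
    out = df.copy()
    # Price normalized by size (assumes square_feet is never zero).
    out["price_per_sqft"] = out["price"] / out["square_feet"]
    # Number of comma-separated amenities; missing amenities count as zero.
    out["amenity_count"] = out["amenities"].fillna("").apply(
        lambda text: len([item for item in text.split(",") if item.strip()])
    )
    # Simple keyword flag extracted from the listing title.
    out["is_studio"] = out["title"].str.contains("studio", case=False, na=False)
    return out


# Toy rows modeled on the listings shown earlier; the amenities text is shortened.
toy_df = pd.DataFrame({
    "title": ["Studio apartment 814 Schutte Road", "One BR in New York NY 10069"],
    "amenities": [np.nan, "Basketball,Doorman"],
    "price": [425, 4790],
    "square_feet": [106, 40000],
})

features = add_basic_features(toy_df)
print(features[["price_per_sqft", "amenity_count", "is_studio"]])
```

The same function could be applied to `new_apartments_df`, since it only relies on the `title`, `amenities`, `price`, and `square_feet` columns.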



## 05. Data Visualization:

I will visualize the geographical distribution of apartments on a map using libraries like Matplotlib or Plotly.
Plot rental prices against various features to identify trends and patterns.
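A simple way to show the geographical distribution is a longitude/latitude scatter plot colored by price. The sketch below uses Matplotlib only; the function name `plot_price_map`, the styling choices, and the three toy rows (taken from the listings shown earlier) are illustrative assumptions, and a proper base map would need an additional library.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_price_map(df: pd.DataFrame):
    """Scatter listings by coordinates, colored by monthly price."""
    fig, ax = plt.subplots(figsize=(8, 5))
    points = ax.scatter(
        df["longitude"], df["latitude"],
        c=df["price"], cmap="viridis", s=12, alpha=0.7,
    )
    fig.colorbar(points, ax=ax, label="Monthly price (USD)")
    ax.set_xlabel("Longitude")
    ax.set_ylabel("Latitude")
    ax.set_title("Apartment listings by location and price")
    return fig, ax


# Three rows from the data preview above, used here as a toy example.
toy_df = pd.DataFrame({
    "cityname": ["Washington", "Evansville", "Seattle"],
    "latitude": [38.9057, 37.9680, 47.6160],
    "longitude": [-76.9861, -87.6621, -122.3275],
    "price": [790, 425, 925],
})

# Rows without coordinates cannot be plotted, so they are dropped first.
fig, ax = plot_price_map(toy_df.dropna(subset=["latitude", "longitude"]))
```

With the full dataset, extreme prices may dominate the color scale, so clipping or log-scaling the price before plotting may be worth trying.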


## 06. Model Building:

I will define a clear objective for the analysis, such as predicting rental prices.
Split the dataset into training and testing sets.
Select appropriate regression models like Linear Regression, Random Forest Regression, or XGBoost.
Train the models on the training data and evaluate their performance using metrics like Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), etc.
Optimize hyperparameters using techniques like cross-validation and grid search.
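The steps above can be sketched with scikit-learn as follows. This is an illustration under assumptions, not the notebook's final pipeline: the data is synthetic (the coefficients used to generate `y` are invented), only Linear Regression and Random Forest are shown, and the small hyperparameter grid is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the cleaned apartment data (all values are made up).
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "square_feet": rng.uniform(300, 2000, n),
    "bedrooms": rng.integers(0, 4, n),
    "bathrooms": rng.integers(1, 3, n),
})
y = 400 + 0.9 * X["square_feet"] + 150 * X["bedrooms"] + rng.normal(0, 100, n)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = {
        "mae": float(mean_absolute_error(y_test, pred)),
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
    }
print(scores)

# Small cross-validated grid search over two Random Forest hyperparameters.
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 5]},
    cv=3,
    scoring="neg_mean_absolute_error",
)
search.fit(X_train, y_train)
print(search.best_params_)
```

For the real data, `X` would come from the cleaned and engineered columns of `new_apartments_df` and `y` from its `price` column, after missing values have been handled.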


## 07. Model Evaluation and Interpretation:

Evaluate the model's performance on the test set and analyze the results.
Interpret the model coefficients or feature importances to understand which factors most strongly influence rental prices.
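For tree-based models, one way to see which inputs matter is the `feature_importances_` attribute of a fitted scikit-learn forest. The sketch below uses synthetic data in which the target depends only on `square_feet`, so that column should come out on top. The data, the column named `noise`, and the model settings are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target is driven by square_feet; the other columns are irrelevant.
rng = np.random.default_rng(1)
n = 400
X = pd.DataFrame({
    "square_feet": rng.uniform(300, 2000, n),
    "bedrooms": rng.integers(0, 4, n).astype(float),
    "noise": rng.normal(size=n),
})
y = 1.0 * X["square_feet"] + rng.normal(0, 50, n)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances; they sum to 1 across the features.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can be biased toward features with many distinct values, so on the real data they are best read as a rough ranking and not as a precise measure of influence.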


## 08. Insights and Recommendations:

Summarize the key insights gained from the analysis.
Provide recommendations to property owners, renters, or real estate agents based on the findings.


## 09. Visualization of Results:

Create interactive visualizations and dashboards to present the results effectively.
