# Lab 03: Feature Engineering for Regression Using a Short-Term Rental Dataset


**Objective:**
In this lab, you will extract and process numerical, categorical, and textual features from the provided datasets (`train.csv.gz`, `test.csv.gz`) for your Program 2 assignment. You will then build feature vectors for a regression model that predicts listing prices.

The datasets are available on the HPC under `/WAVE/projects/CSEN-140-Sp25/data/pr2`.

In [2]:

# Import required libraries
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Load datasets
train_df = pd.read_csv("train.csv.gz", compression='gzip')
test_df = pd.read_csv("test.csv.gz", compression='gzip')

# Show a sample of the datasets
print("Train Data Sample")
display(train_df.head())

print("Test Data Sample")
display(test_df.head())

Train Data Sample


Unnamed: 0,name,description,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,...,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,availability_eoy,number_of_reviews_ly,estimated_occupancy_l365d,estimated_revenue_l365d,reviews,price
0,1 bedroom apt 10469,You'll have a great time at this comfortable p...,Jose,2019-10-01,"New York, NY",,within an hour,86%,81%,t,...,5,0,0,,271.0,0.0,0.0,0.0,,70.0
1,Spacious 3x2 for Visiting ATX,"This is a 3 bedroom, 2 bathroom condo in the h...",Susana,2019-07-17,"Austin, TX",Hey y'all! My name is Susana & I'm so happy th...,within an hour,96%,100%,f,...,47,0,0,1.08,0.0,1.0,18.0,1584.0,,88.0
2,Spacious place Palm Culver City,Bring the whole family to this great place wit...,Dee,2012-07-30,"Los Angeles, CA",I am a mental health professional in addition ...,within an hour,100%,99%,f,...,21,0,0,,304.0,0.0,0.0,0.0,,130.0
3,2-bedroom Mission Beach home with private patio,"Cute 2-bedroom, 1-bath, downstairs unit in dup...",Tracy,2014-01-14,"San Diego, CA",,within an hour,100%,100%,t,...,3,0,0,2.72,,,,,Great place to stay with little caveats. Ye wa...,231.0
4,334-cozy apt 5 mins to beach,Discover comfort in Prime Fort Lauderdale Loca...,Michael,2024-06-06,"Hollywood, FL","Hello! \nI'm a proud Floridian, having lived h...",within an hour,100%,67%,t,...,16,2,0,1.0,,,,,,129.0


Test Data Sample


Unnamed: 0,name,description,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,...,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,availability_eoy,number_of_reviews_ly,estimated_occupancy_l365d,estimated_revenue_l365d,reviews
0,Luxury private room near JFK Int. Airport New ...,Luxury spacious private room with attached ful...,Mohammed,2022-07-24,"New York, NY",Knowledge Comes Before Speech and Action ( الع...,within an hour,100%,93%,t,...,20,1,19,0,0.5,305.0,3.0,180.0,8460.0,
1,Beautiful resort 1 bedroom full kitchen,Make some memories at this unique and family-f...,Ronnie,2019-08-30,,Love people and loves God,within a day,100%,0%,f,...,2,2,0,0,0.05,,,,,
2,Specious large alcove studio,Super large specious studio loft style located...,Yael,2015-08-04,"New York, NY",,within an hour,100%,100%,f,...,17,17,0,0,0.08,276.0,1.0,0.0,0.0,
3,Quiet bedroom with private bathroom,"4 bedroom home, 3 upstairs, 1 downstairs. Mod...",Natallia,2012-02-19,"San Francisco, CA",We enjoy the Airbnb experience--what cultures ...,within an hour,100%,100%,f,...,5,2,3,0,0.63,136.0,10.0,255.0,21420.0,Highly recommend the property! The Sonder was ...
4,Sunny Good Vibes with View & Work from Home,Welcome to Sunny Good Vibes in the historic Mi...,Garret,2017-04-27,United States,I’m the owner and operator of Sunny Good Vibes...,within an hour,100%,99%,t,...,2,2,0,0,3.05,,,,,


## Step 1: Data Preprocessing

In [3]:

# Check for missing values in train and test sets
print(train_df.isnull().sum())
print(test_df.isnull().sum())


name                             0
description                   1608
host_name                      113
host_since                     115
host_location                20394
                             ...  
number_of_reviews_ly         29176
estimated_occupancy_l365d    29176
estimated_revenue_l365d      29176
reviews                      85757
price                            0
Length: 66, dtype: int64
name                             0
description                    382
host_name                       33
host_since                      34
host_location                 4962
                             ...  
availability_eoy              7140
number_of_reviews_ly          7140
estimated_occupancy_l365d     7140
estimated_revenue_l365d       7140
reviews                      21441
Length: 65, dtype: int64
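
The printed counts above are truncated with `...`, so most of the 65+ columns are hidden. One possible way to see every column is to compute the fraction of missing values and print the full, sorted result. The sketch below uses a small made-up frame (`toy_df`) in place of `train_df`, and `DROP_THRESHOLD` is a hypothetical cutoff rather than a recommended value.

```python
import pandas as pd

# Toy stand-in for train_df: column names mirror the lab data, values are made up.
toy_df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "host_location": ["New York, NY", None, None, "Austin, TX"],
    "reviews": [None, None, None, "Great place"],
    "price": [70.0, 88.0, 130.0, 231.0],
})

# Fraction of missing values per column, largest first.
missing_frac = toy_df.isnull().mean().sort_values(ascending=False)

# to_string() prints every row instead of truncating with "...".
print(missing_frac.to_string())

# Hypothetical cutoff: flag columns that are more than half empty.
DROP_THRESHOLD = 0.5
mostly_missing = missing_frac[missing_frac > DROP_THRESHOLD].index.tolist()
print("Candidates to drop:", mostly_missing)
```

A mostly empty column is not necessarily useless: for a column such as `reviews`, the fact that a value is missing may itself carry signal, so treat the flagged list as candidates to inspect rather than a final decision.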


In [4]:

# Decide which columns to drop and which to fill with default values, based on your analysis
# Should the fill values come from training data statistics, test data statistics, or both?
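
One common convention, sketched below, is to learn fill values from the training data only and reuse them on the test data, so that no test-set information leaks into preprocessing. The frames `toy_train` and `toy_test`, the `"unknown"` placeholder, and the `_missing` indicator column are illustrative assumptions, not part of the lab's starter code.

```python
import pandas as pd

# Toy stand-ins for train_df and test_df: values are made up.
toy_train = pd.DataFrame({
    "reviews_per_month": [1.0, None, 3.0, 2.0],
    "host_response_time": ["within an hour", None, "within an hour", "within a day"],
    "price": [70.0, 88.0, 130.0, 231.0],
})
toy_test = pd.DataFrame({
    "reviews_per_month": [None, 0.5],
    "host_response_time": [None, "within a day"],
})

# Learn the fill values from the training data only.
num_fill = toy_train["reviews_per_month"].median()
cat_fill = "unknown"  # hypothetical placeholder category

for df in (toy_train, toy_test):
    # Record which rows were missing before the gap is filled.
    df["reviews_per_month_missing"] = df["reviews_per_month"].isnull().astype(int)
    # Apply the same training-derived values to both sets.
    df["reviews_per_month"] = df["reviews_per_month"].fillna(num_fill)
    df["host_response_time"] = df["host_response_time"].fillna(cat_fill)

print(toy_test)
```

The median is used here because it is less sensitive to outliers than the mean; other choices (mean, zero, a model-based imputer) may suit particular columns better.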


## Step 2: Feature Extraction

In [6]:

# Numerical columns should likely be standardized
# Categorical columns should be one-hot encoded or label encoded
# Text columns should be vectorized (e.g., using TF-IDF)

# Define categorical, text, and numerical features
categorical_features = [
    'neighbourhood_group', 'room_type', 'host_response_time', 'host_is_superhost',
    # etc. (many more categorical features)
]
text_features = ['description', 'host_about', 'reviews']  # etc. (there may be others)
numerical_features = ['minimum_nights', 'number_of_reviews']  # etc. (many more numerical features)

# Standardize numerical features. This example uses StandardScaler, but you can choose other options.
# The scaler is fit on the training data only; the same transform is then applied to the test data.
scaler = StandardScaler()
train_df[numerical_features] = scaler.fit_transform(train_df[numerical_features])
test_df[numerical_features] = scaler.transform(test_df[numerical_features])
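
Some columns in the data sample look numerical but are stored as strings, for example `host_response_rate` (`86%`), `host_is_superhost` (`t`/`f`), and `host_since` (a date). A minimal sketch of converting them before scaling follows. The frame `toy_df`, the derived column `host_days`, and the reference date `REFERENCE_DATE` are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in using formats seen in the data sample: values are illustrative.
toy_df = pd.DataFrame({
    "host_response_rate": ["86%", "100%", None],
    "host_is_superhost": ["t", "f", "t"],
    "host_since": ["2019-10-01", "2012-07-30", None],
})

# "86%" -> 0.86; missing or malformed entries become NaN.
toy_df["host_response_rate"] = (
    pd.to_numeric(toy_df["host_response_rate"].str.rstrip("%"), errors="coerce") / 100.0
)

# "t"/"f" flags -> 1/0.
toy_df["host_is_superhost"] = toy_df["host_is_superhost"].map({"t": 1, "f": 0})

# Date -> number of days the host has been active, relative to a fixed date.
REFERENCE_DATE = pd.Timestamp("2025-01-01")  # hypothetical reference date
host_since = pd.to_datetime(toy_df["host_since"], errors="coerce")
toy_df["host_days"] = (REFERENCE_DATE - host_since).dt.days

print(toy_df)
```

Missing values stay as NaN after these conversions, so they still need to be dropped or filled before a model that cannot handle NaN is fit.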


In [7]:

# You may want to one-hot encode categorical features (I'll let you figure out how to do this)


In [9]:

# And figure out what to do with text features (e.g., TF-IDF vectorization, keyword extraction, sentiment analysis, etc.)
# You could use TfidfVectorizer for text features, which does the same thing we did in lab02, or take a different approach
# Remember to do some text preprocessing (remove punctuation, lowercase, etc.)

# Or you could extract sentiment based features from the reviews (# positive, # negative, neutral, etc.)
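
A minimal sketch of the TF-IDF route, under stated assumptions: the helper `clean_text` is hypothetical, it keeps only the ASCII letters a-z (so non-English text such as some `host_about` entries is discarded), and `max_features=5000` is an arbitrary cap. The two toy series stand in for a text column such as `description`.

```python
import re

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


def clean_text(text):
    """Lowercase, replace anything that is not a-z or whitespace, collapse spaces."""
    if not isinstance(text, str):
        return ""  # missing values (None/NaN) become empty documents
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


# Toy stand-ins for a text column in train_df and test_df.
toy_train_text = pd.Series([
    "Cute 2-bedroom, 1-bath unit near the beach!",
    "Spacious condo in the heart of downtown.",
    None,
])
toy_test_text = pd.Series(["Quiet bedroom near downtown"])

train_clean = toy_train_text.apply(clean_text)
test_clean = toy_test_text.apply(clean_text)

# Learn the vocabulary and IDF weights from the training text only.
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X_train_text = vectorizer.fit_transform(train_clean)
X_test_text = vectorizer.transform(test_clean)

print(X_train_text.shape, X_test_text.shape)
```

Words that appear only in the test text are ignored by `transform`, which keeps the train and test matrices the same width.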


## Step 3: Put All Features Together or Decide How to Process the Samples

In [None]:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Example below assumes your train_df and test_df only contain numerical features

# Split the data into features and target variable
X_train = train_df.drop(columns=['price'])  # Drop the target variable
y_train = train_df['price']  # Target variable
X_test = test_df

# Example regression model
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# RMSE on the training data; this is an optimistic estimate and does not measure generalization
rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
print(f"Root Mean Squared Error on Training Data: {rmse}")
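
The imports at the top of the notebook include `ColumnTransformer`, `Pipeline`, and `train_test_split`. One way they might fit together is sketched below: each feature type gets its own transformer, the results are concatenated, and a held-out validation split gives an error estimate on data the model has not seen. The frame `toy_df` is synthetic (random values, made-up categories and descriptions), and the choice of a median imputer is an assumption, not the lab's prescribed approach.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for train_df: one column of each feature type plus a target.
rng = np.random.default_rng(0)
n = 40
toy_df = pd.DataFrame({
    "minimum_nights": rng.integers(1, 30, size=n).astype(float),
    "room_type": rng.choice(["Entire home/apt", "Private room"], size=n),
    "description": rng.choice(
        ["cozy room near beach", "spacious condo downtown", "quiet studio with view"],
        size=n,
    ),
})
toy_df["price"] = 50.0 + 3.0 * toy_df["minimum_nights"] + rng.normal(0, 5, size=n)
toy_df.loc[0, "minimum_nights"] = np.nan  # simulate a missing value

# One transformer per feature type; the outputs are concatenated column-wise.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["minimum_nights"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["room_type"]),
    # A text vectorizer expects a 1-D column, so pass the name as a string, not a list.
    ("txt", TfidfVectorizer(), "description"),
])
model = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])

# Hold out part of the labeled data to estimate error on unseen samples.
X = toy_df.drop(columns=["price"])
y = toy_df["price"]
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model.fit(X_tr, y_tr)
val_pred = model.predict(X_val)
val_rmse = np.sqrt(mean_squared_error(y_val, val_pred))
print(f"Validation RMSE: {val_rmse:.2f}")
```

Because every transformer lives inside the pipeline, calling `model.fit` learns all preprocessing statistics from the training split only, and `model.predict` applies them unchanged to the validation or test data.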
