## 1) Exploratory Data Analysis

- First, some columns contained non-numeric characters (%, \$), so I cleaned them and converted them to numeric.
- There were several variations of the host listing count, so using correlations and some trial and error, only one of the host listing count predictors was selected.
- Review scores are averaged because they are very highly correlated.
- There was an extremely high price outlier that heavily skewed the model, so I filtered out the top and bottom 0.07\% of price observations.
- Similarly, there was a large outlier in minimum nights that was skewing its significance, so the top 0.1\% of minimum nights observations were dropped.
- Some columns were dates, so they were converted to datetime objects, and a 'months since' column was created for each date to make the date columns easier to use.
- Property types and neighbourhoods with few occurrences were grouped into 'Other'.
- host_is_superhost, acceptance_rate, and response_rate were imputed using simple models
    - the remaining missing values were imputed naively with the column's median value
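
As a minimal sketch of the first bullet, the snippet below strips currency and percent symbols from string columns and casts them to floats. The toy values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data standing in for the raw columns (values are invented for illustration)
df = pd.DataFrame({
    "price": ["$1,250.00", "$85.00", None],
    "host_response_rate": ["97%", None, "100%"],
})

# Remove '$' and ',' as literal characters (regex=False), then cast to float.
# Missing entries stay missing (NaN) after the cast.
df["price_num"] = (
    df["price"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

# Remove '%' and rescale from 0-100 to 0-1
df["response_rate"] = (
    df["host_response_rate"].str.replace("%", "", regex=False).astype(float) / 100
)
```

Passing `regex=False` matters for `$`, which would otherwise be read as an end-of-string anchor when the pattern is treated as a regular expression.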


## 2) Data Cleaning/Preparation

In [56]:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from datetime import date, datetime

In [57]:
raw_train = pd.read_csv('datasets/train_regression.csv')
raw_test = pd.read_csv('datasets/test_regression.csv')

## Clean and Process

### General Cleaning

In [58]:
# Create copies of the raw datasets
train = raw_train.copy()
test = raw_test.copy()

# Clean 'price' column: remove '$' and ',' characters, and convert to float
train['price'] = train['price'].str.replace(',', '').str.replace('$', '', regex=False).astype(float)

In [59]:
# Convert 'host_acceptance_rate' and 'host_response_rate' columns to float and scale by dividing by 100
train['acceptance_rate'] = train['host_acceptance_rate'].str.replace('%', '').astype(float) / 100
train['response_rate'] = train['host_response_rate'].str.replace('%', '').astype(float) / 100

test['acceptance_rate'] = test['host_acceptance_rate'].str.replace('%', '').astype(float) / 100
test['response_rate'] = test['host_response_rate'].str.replace('%', '').astype(float) / 100

# Drop unnecessary columns
train.drop(columns=['host_acceptance_rate', 'host_response_rate'], inplace=True)
test.drop(columns=['host_acceptance_rate', 'host_response_rate'], inplace=True)



# Extract the leading digits from 'bathrooms_text' and convert to float
# (raw string so '\d' is not treated as a string escape; only the integer part is captured, so '1.5 baths' becomes 1.0)
train['bathrooms_num'] = train['bathrooms_text'].str.extract(r'(\d+)').astype(float)
test['bathrooms_num'] = test['bathrooms_text'].str.extract(r'(\d+)').astype(float)

# Fill missing values in 'bathrooms_num' where 'Half-bath' is mentioned in 'bathrooms_text' with 0.5
train.loc[train['bathrooms_text'].str.contains('Half-bath', case=False, na=False) & train['bathrooms_num'].isna(), 'bathrooms_num'] = 0.5
test.loc[test['bathrooms_text'].str.contains('Half-bath', case=False, na=False) & test['bathrooms_num'].isna(), 'bathrooms_num'] = 0.5


In [60]:
# Convert date columns to datetime format
def strip_date(row):
    if isinstance(row, str):
        row = datetime.strptime(row, '%Y-%m-%d').date()
    return row

# Apply date conversion to train dataset
train['host_since'] = train['host_since'].apply(strip_date)
train['first_review'] = train['first_review'].apply(strip_date)
train['last_review'] = train['last_review'].apply(strip_date)

# Apply date conversion to test dataset
test['host_since'] = test['host_since'].apply(strip_date)
test['first_review'] = test['first_review'].apply(strip_date)
test['last_review'] = test['last_review'].apply(strip_date)

# ----- #

# Calculate months since various dates for train dataset
train['host_since_in_months'] = round(((datetime.now().date() - train['host_since']).dt.days) / 30, 2)
train['first_review_in_months'] = round(((datetime.now().date() - train['first_review']).dt.days) / 30, 2)
train['last_review_in_months'] = round(((datetime.now().date() - train['last_review']).dt.days) / 30, 2)

# Calculate months since various dates for test dataset
test['host_since_in_months'] = round(((datetime.now().date() - test['host_since']).dt.days) / 30,  2)
test['first_review_in_months'] = round(((datetime.now().date() - test['first_review']).dt.days) / 30, 2)
test['last_review_in_months'] = round(((datetime.now().date() - test['last_review']).dt.days) / 30, 2)


# Because the review scores are extremely collinear, average them into a single column
train['review_scores_avg'] = np.mean(train[['review_scores_rating', 'review_scores_value', 'review_scores_location', 'review_scores_cleanliness']], axis=1)
test['review_scores_avg'] = np.mean(test[['review_scores_rating', 'review_scores_value', 'review_scores_location', 'review_scores_cleanliness']], axis=1)

# NOTE: fillna returns a new Series and the result is not assigned here, so missing
# averages stay NaN in train/test (the cell only displays the filled test column).
# The remaining NaNs are filled later by the median imputation.
train['review_scores_avg'].fillna(value=0)
test['review_scores_avg'].fillna(value=0)

0       0.0000
1       0.0000
2       4.8200
3       4.8350
4       4.7525
         ...  
3333    4.6875
3334    4.7100
3335    4.6350
3336    4.8125
3337    4.7425
Name: review_scores_avg, Length: 3338, dtype: float64
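
The two `fillna` calls in the cell above return new Series rather than modifying `train` and `test`. A small sketch on made-up scores shows the difference between calling `fillna` and assigning its result back:

```python
import numpy as np
import pandas as pd

# Toy scores with one missing value (invented for illustration)
scores = pd.DataFrame({"review_scores_avg": [4.8, np.nan, 4.6]})

# Calling fillna without assignment only produces a filled copy
filled_copy = scores["review_scores_avg"].fillna(0)
n_missing_after_call = int(scores["review_scores_avg"].isna().sum())

# Assigning the result back is what makes the fill persist
scores["review_scores_avg"] = scores["review_scores_avg"].fillna(0)
n_missing_after_assign = int(scores["review_scores_avg"].isna().sum())

print(n_missing_after_call, n_missing_after_assign)
```

Whether 0 is a sensible fill value for a listing with no reviews is a modelling choice; the sketch only illustrates the assignment mechanics.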

In [61]:
## Identify outliers in 'price' and 'minimum_nights'

# top and bottom 0.07% of price
lower_val = np.percentile(train[['price']], 0.07)
upper_val = np.percentile(train[['price']], 99.93)
outliers_idx_price = list(train[(train['price'] >= upper_val) | (train['price'] <= lower_val)].index)
print("Price outliers:", list(outliers_idx_price))


# top 0.1% of minimum_nights
upper_lim = np.percentile(train[['minimum_nights']], 99.9)
outliers_idx_nights = list(train[train['minimum_nights'] >= upper_lim].index)
outliers_idx = list(outliers_idx_price) + list(outliers_idx_nights)
print("Min nights outliers:", list(train[train['minimum_nights'] >= upper_lim].index))


print(f"\n{len(train.iloc[outliers_idx, :]['price'])} observations dropped\n")
train.loc[outliers_idx_price, :]['price'].sort_values()

Price outliers: [523, 1626, 1823, 1848, 2067, 2380, 3129, 4865]
Min nights outliers: [227, 707, 1748, 2963, 3607]

13 observations dropped



4865       14.0
2067       15.0
1848       16.0
2380       16.0
1823     3319.0
1626     4500.0
523      5000.0
3129    99998.0
Name: price, dtype: float64
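
The percentile-based trimming above can be sketched on a small made-up sample. The 1%/99% cut-offs are an assumption chosen so that the toy data has something to trim; the notebook itself uses 0.07%/99.93% for price.

```python
import numpy as np
import pandas as pd

# Toy prices: one very low value, 98 ordinary values, one extreme value
toy = pd.DataFrame({"price": [1.0] + [100.0 + i for i in range(98)] + [99998.0]})

# Percentile cut-offs (1% and 99% are illustrative for this small sample)
lower, upper = np.percentile(toy["price"], [1, 99])

# Flag rows at or beyond either cut-off, then drop them
is_outlier = (toy["price"] <= lower) | (toy["price"] >= upper)
trimmed = toy.loc[~is_outlier].reset_index(drop=True)

print(int(is_outlier.sum()), "rows dropped;", len(trimmed), "rows kept")
```

Building a boolean mask first keeps the dropped rows easy to inspect before they are removed.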

In [62]:
unnecessary_cols = ['minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights',
                   'availability_365', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value', 
                   'host_location', 'host_response_time', 'host_verifications', 'host_has_profile_pic', 'host_identity_verified', 'has_availability', 'first_review', 'last_review', 'host_since', 'host_neighbourhood', 
                    'calculated_host_listings_count_entire_homes', 'calculated_host_listings_count_private_rooms', 'calculated_host_listings_count_shared_rooms', ]

train_clean = train.drop(outliers_idx).reset_index(drop=True)
train_clean.drop(columns=unnecessary_cols, inplace=True)

test_clean = test.drop(columns=unnecessary_cols)


### Clean/Transform Variables 

In [63]:
def clean_vars(row):
    # Check if 'shared' is in 'bathrooms_text' to identify shared bathrooms
    if 'shared' in str(row['bathrooms_text']):
        row['bathrooms_shared'] = "t"
        
    # Check if 'bathrooms_text' is empty and 'room_type' is 'Shared' to identify shared bathrooms
    elif pd.isna(row['bathrooms_text']):
        if 'Shared' in row['room_type']:
            row['bathrooms_shared'] = "t"              
        else:
            row['bathrooms_shared'] = "f"
    else: 
        row['bathrooms_shared'] = "f"
        
    # Convert 'Hotel room' room type to 'Private room'
    if row.loc['room_type'] == 'Hotel room':
        row['room_type'] = 'Private room'
        
    return row

# Apply the function to clean variables to train and test datasets
train_clean = train_clean.apply(clean_vars, axis=1)
test_clean = test_clean.apply(clean_vars, axis=1)


##### Clean Neighbourhoods 
Group neighbourhoods with few observations into 'Other'.

In [64]:
# Group small occurrences into 'Other'
neighbourhood_counts = train_clean['neighbourhood_cleansed'].value_counts()

other_hoods = [i for i in neighbourhood_counts.index if neighbourhood_counts[i] < 100]

test_only_hoods = [i for i in test_clean['neighbourhood_cleansed'].unique() 
                   if i not in neighbourhood_counts 
                   and i != 'Other']
    

In [65]:
# Create DataFrame with unique neighbourhoods
hood_df = pd.DataFrame(index=train_clean['neighbourhood_cleansed'].unique())

# Compute mean and standard deviation for each neighbourhood
grouped = train_clean.groupby('neighbourhood_cleansed')['price']
all_mean = grouped.mean()
all_std = grouped.std()

# Add mean and std to DataFrame
hood_df['mean_price'] = all_mean
hood_df['std_price'] = all_std

# Merge with counts
hood_df = hood_df.merge(neighbourhood_counts, left_index=True, right_index=True)
hood_df.rename(columns={'neighbourhood_cleansed': 'count'}, inplace=True)


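# Get the 10th percentile of the per-neighbourhood standard deviations
std_p10 = np.percentile(hood_df.dropna(how='any')['std_price'], 10)


# Keep neighbourhoods with a low price std or more than 100 listings, requiring more than 20 listings in either case
filtered_df = hood_df[((hood_df['std_price'] < std_p10) | (hood_df['count'] > 100)) & (hood_df['count'] > 20)]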

keep_hoods = filtered_df.index.tolist()


In [66]:
# Keep a neighbourhood only if it is in keep_hoods (small price std or more than 100 listings, and more than 20 listings); otherwise label it 'Other'
def clean_hoods(row):
    if row.loc['neighbourhood_cleansed'] not in keep_hoods:
        row['neighbourhood_grouped'] = 'Other'
        
    else:    
        row['neighbourhood_grouped'] = row.loc['neighbourhood_cleansed']
        
    return row

train_clean = train_clean.apply(clean_hoods, axis=1)
test_clean = test_clean.apply(clean_hoods, axis=1)
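
The row-wise `apply` above works, but the same grouping can be expressed as a single vectorized operation. This is a sketch on made-up labels with a hypothetical count threshold, not the notebook's actual keep rule:

```python
import pandas as pd

# Toy neighbourhood labels (invented for illustration)
hoods = pd.Series(["A", "A", "A", "B", "B", "C"], name="neighbourhood_cleansed")

# Categories to keep: at least 2 observations (hypothetical threshold for the toy data)
counts = hoods.value_counts()
keep = counts[counts >= 2].index

# Series.where keeps values where the condition holds and substitutes 'Other' elsewhere
grouped = hoods.where(hoods.isin(keep), "Other")

print(grouped.tolist())
```

Here `keep` plays the role of `keep_hoods`; any list of labels to retain can be passed to `isin` in the same way.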

##### Clean Property Type
Group property types with few observations into 'Other'.

In [67]:
words_to_remove = ['place', 'room', 'private', 'shared', 'entire', ' in', ' room', ' private', ' shared', ' entire', ' in',]

# remove filler and unnecessary words from property
def remove_words(text):
    text=text.lower()
    for word in words_to_remove:
        word = word.lower()
        text = text.replace(word, '')
    return text.strip()


train_clean['property_type'] = train_clean['property_type'].apply(remove_words)
test_clean['property_type'] = test_clean['property_type'].apply(remove_words)


In [68]:
# identify value counts and make a list of property types with more than 10 observations
property_counts = train_clean['property_type'].value_counts()
keep = [i for i in property_counts.index if property_counts[i] > 10]

def clean_property(row):
    if row not in keep or row == "":
        row = 'Other'
      
    return row


train_clean['property_type_cleansed'] = train_clean['property_type'].apply(clean_property)
test_clean['property_type_cleansed'] = test_clean['property_type'].apply(clean_property)

train_filter_2 = train_clean.copy()
test_filter_2 = test_clean.copy()


### Inspect and impute columns with missing values

#### Model imputation

In [69]:
# Create a temporary dataframe to manipulate
train_filter_temp = train_filter_2.copy()
test_filter_temp = test_filter_2.copy()

# Change t/f to numeric 1/0
train_filter_temp['host_is_superhost'] = train_filter_2['host_is_superhost'].replace({'f': 0, 't': 1})
test_filter_temp['host_is_superhost'] = test_filter_2['host_is_superhost'].replace({'f': 0, 't': 1})

# Create model
superhost_model = smf.logit(formula="host_is_superhost ~ calculated_host_listings_count*number_of_reviews_ltm + response_rate", data=train_filter_temp).fit()

# Predict all values 
impute_superhost_train = (superhost_model.predict(train_filter_temp) > 0.5).replace({False:'f', True:'t'})
impute_superhost_test = (superhost_model.predict(test_filter_temp) > 0.5).replace({False:'f', True:'t'})

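# Fill NAs with the corresponding value from the model imputation
# (assign the result back instead of using inplace=True on a selected column, which newer pandas versions warn about)
train_filter_2['host_is_superhost'] = train_filter_2['host_is_superhost'].fillna(impute_superhost_train)
test_filter_2['host_is_superhost'] = test_filter_2['host_is_superhost'].fillna(impute_superhost_test)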


train_filter_2['host_is_superhost']

Optimization terminated successfully.
         Current function value: 0.586516
         Iterations 8


0       f
1       f
2       f
3       f
4       t
       ..
4982    t
4983    t
4984    f
4985    f
4986    f
Name: host_is_superhost, Length: 4987, dtype: object

In [70]:
# create model to impute acceptance rate
acceptance_model = smf.logit(formula="acceptance_rate ~ calculated_host_listings_count + accommodates", data=train_filter_2).fit()


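# Predict acceptance rate for every row and use the predictions to fill missing values
# NOTE: fillna returns a new Series and the result is not assigned here, so these calls
# leave 'acceptance_rate' unchanged; its remaining NaNs are filled by the median
# imputation below. Assign the result back to apply the model predictions.
predicted_acceptance = acceptance_model.predict(train_filter_2)
train_filter_2['acceptance_rate'].fillna(predicted_acceptance)

predicted_acceptance_test = acceptance_model.predict(test_filter_2)
test_filter_2['acceptance_rate'].fillna(predicted_acceptance_test)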


# ----- #


# Create model to impute response rate
response_model = smf.logit(formula="response_rate ~ accommodates", data=train_filter_2).fit()


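# Predict response rate for every row and use the predictions to fill missing values
# NOTE: as with acceptance_rate, the fillna results are not assigned, so
# 'response_rate' is unchanged here and its NaNs are filled by the median imputation below.
predicted_response = response_model.predict(train_filter_2)
train_filter_2['response_rate'].fillna(predicted_response)

predicted_response_test = response_model.predict(test_filter_2)
test_filter_2['response_rate'].fillna(predicted_response_test)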


Optimization terminated successfully.
         Current function value: 0.192038
         Iterations 7
Optimization terminated successfully.
         Current function value: 0.067442
         Iterations 8


0       0.88
1       1.00
2       0.99
3       1.00
4       1.00
        ... 
3333    0.99
3334    1.00
3335    1.00
3336    1.00
3337    1.00
Name: response_rate, Length: 3338, dtype: float64
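
A minimal sketch of model-based imputation follows, using a straight-line fit from `np.polyfit` as a simple stand-in for the logit models above (the data and the `rate` column are invented for illustration). Predictions are computed for every row, and `fillna` uses them only where the original value is missing, because it aligns on the index.

```python
import numpy as np
import pandas as pd

# Toy data: 'rate' is missing for two rows (values invented for illustration)
df = pd.DataFrame({
    "accommodates": [1, 2, 3, 4, 5],
    "rate": [0.10, 0.20, np.nan, 0.40, np.nan],
})

# Fit a line on the rows where the target is observed
known = df.dropna(subset=["rate"])
slope, intercept = np.polyfit(known["accommodates"], known["rate"], deg=1)

# Predict for all rows, then fill only the missing entries (index-aligned)
predicted = pd.Series(intercept + slope * df["accommodates"], index=df.index)
df["rate_imputed"] = df["rate"].fillna(predicted)

print(df)
```

Observed values are left untouched; only rows 2 and 4 receive predictions.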

#### Naive imputation

In [71]:
# Fill in remaining missing values with median for numerical columns
train_filter_2.fillna(train_filter_2.median(numeric_only=True), inplace=True)
test_filter_2.fillna(test_filter_2.median(numeric_only=True), inplace=True)

# Create final DataFrames
train_final = train_filter_2.copy()
test_final = test_filter_2.copy()


## 3) Developing the Model

- Created a dictionary of correlated predictor pairs and used VIF to decide which variables to keep. Some correlated variables were kept to allow a more complex model.
- The review-related variables (last_review_in_months, reviews_per_month, number_of_reviews_ltm, review_scores_avg) were interacted with one another.
- For the accommodation-related interactions: bathrooms_num interacts with bathrooms_shared so that the effect of the number of bathrooms depends on whether they are shared; accommodates interacts with beds because the relationship between the number of beds and accommodates can indicate whether beds are meant to be shared; and bathrooms_shared interacts with accommodates because higher capacity is more valuable when bathrooms are not shared.
- acceptance_rate, response_rate, and host_is_superhost all interact with each other, so that the slopes of the acceptance and response rates can differ depending on whether the host is a superhost.
- calculated_host_listings_count interacts with number_of_reviews_ltm, review_scores_avg, and reviews_per_month to capture how closely the number of listings a host has and the number of reviews are connected, and that having more listings with a higher average score indicates a higher price than having few listings or worse reviews.
- maximum nights, minimum nights, latitude, and longitude are strong predictors of price, so squared and cubic terms were added to better capture the shape of the relationship. calculated_host_listings_count has a large range, so its log was taken to compress the scale, and higher-order terms of the log were then created. reviews_per_month and number_of_reviews_ltm each have one additional term (a square and a square root, respectively), chosen from visualizations and trial and error.


## 4) Model

## Sklearn

In [72]:
# columns used in the model
cols_for_sklearn = ['acceptance_rate', 'accommodates', 'availability_30', 'availability_90',
                    'bathrooms_num', 'bathrooms_shared', 'beds', 
                    'calculated_host_listings_count', 'last_review_in_months',
                    'host_is_superhost','host_since_in_months', 'latitude', 'longitude', 
                    'maximum_nights', 'minimum_nights', 
                    'neighbourhood_grouped', 'number_of_reviews_ltm', 
                    'price', 'property_type_cleansed', 'response_rate',  
                    'review_scores_avg', 'reviews_per_month', 'room_type',
                    'maximum_nights_avg_ntm', 'minimum_nights_avg_ntm', 'last_review_in_months']


# create list of columns excluding price that can be used for the test data
cols_for_sklearn_test = [name for name in cols_for_sklearn if name != 'price']

# subset dataset with needed columns
subset_train = train_final[cols_for_sklearn].copy()
subset_test = test_final[cols_for_sklearn_test].copy()


In [73]:
# create response and predictor objects
X_train = subset_train.drop(columns='price')
y_train = np.log(subset_train.price)

X_test = subset_test.copy()

In [74]:
# create dataframes with dummy variables
X_train_preprocessed = pd.get_dummies(X_train, drop_first=True)
X_test_preprocessed = pd.get_dummies(X_test, drop_first=True)
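
Encoding train and test separately relies on both sets containing the same categories. If a category is missing from the test set, the dummy columns differ, and `drop_first=True` may drop a different baseline level. One possible safeguard, sketched here on made-up data, is to encode the test set without `drop_first` and then reindex it to the training columns:

```python
import pandas as pd

# Toy data: the test set contains only one of the three room types
train_X = pd.DataFrame({"room_type": ["Entire home/apt", "Private room", "Shared room"]})
test_X = pd.DataFrame({"room_type": ["Private room", "Private room"]})

# Training dummies define the reference set of columns (first level dropped)
train_d = pd.get_dummies(train_X, drop_first=True)

# Encode test with all levels, then align to the training columns:
# missing columns are added as 0 and extra columns are discarded
test_aligned = (
    pd.get_dummies(test_X)
    .reindex(columns=train_d.columns, fill_value=0)
    .astype(int)
)

print(test_aligned)
```

Categories that appear only in the test set are silently dropped by the reindex, so they are treated as the baseline level.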

In [75]:
# lists of the dummy columns created for respective categorical variables
room_type_cols = [name for name in X_train_preprocessed.columns if 'room_type' in name]
neighbourhood_groups = [name for name in X_train_preprocessed.columns if 'neighbourhood_grouped' in name]
property_groups = [name for name in X_train_preprocessed.columns if 'property' in name]


# list of pairs of variables to interact
interaction_pairs = [('bathrooms_num', 'bathrooms_shared_t'),
                     ('beds', 'accommodates'),
                     ('accommodates', 'bathrooms_shared_t'),
                     
                     ('accommodates', 'availability_30'),
                     ('accommodates', 'availability_90'),
                     ('acceptance_rate', 'host_is_superhost_t'), 
                     ('response_rate', 'host_is_superhost_t'),
                     ('acceptance_rate', 'response_rate'),
                     
                    ('last_review_in_months', 'reviews_per_month'),
                    ('last_review_in_months', 'number_of_reviews_ltm'),
                    ('reviews_per_month', 'number_of_reviews_ltm'),
                    ('reviews_per_month', 'review_scores_avg'), 
                    ('number_of_reviews_ltm', 'review_scores_avg'),
                     
                    ('number_of_reviews_ltm', 'calculated_host_listings_count'),
                    ('review_scores_avg', 'calculated_host_listings_count'),
                    ('reviews_per_month', 'calculated_host_listings_count')] 
        
        
# loop for adding categorical predictors to interaction pairs list
for i in room_type_cols:
    interaction_pairs.append((i, 'beds'))
    interaction_pairs.append((i, 'availability_30'))
    
    
    for j in neighbourhood_groups: 
        interaction_pairs.append((j, i))
    
    for k in property_groups:
        interaction_pairs.append((i, k))

        
for i in property_groups:
    for j in neighbourhood_groups:
        interaction_pairs.append((i, j))

        
        
### ----- ###

# list of all the columns used in interactions
interaction_cols = []
for t in interaction_pairs:
    for item in t:
        interaction_cols.append(item)
        
        
# initialize dataframe to store interactions
interaction_df = pd.DataFrame()
interaction_df_test = pd.DataFrame()

# for each pair, get the interaction values and add to interaction df
for pair in interaction_pairs:
    interaction_columns = pair
    
    X_train_interaction_subset = X_train_preprocessed[list(interaction_columns)]
    X_test_interaction_subset = X_test_preprocessed[list(interaction_columns)]
    
    poly_features = PolynomialFeatures(interaction_only=True, include_bias=False)
    interaction_terms_train = poly_features.fit_transform(X_train_interaction_subset)
    interaction_terms_test = poly_features.transform(X_test_interaction_subset)
    
    interaction_column_names = poly_features.get_feature_names_out()
    
    interaction_df_pair = pd.DataFrame(interaction_terms_train, columns=interaction_column_names)
    interaction_df_pair_test = pd.DataFrame(interaction_terms_test, columns=interaction_column_names)
    

    interaction_df = pd.concat([interaction_df, interaction_df_pair], axis=1)
    interaction_df_test = pd.concat([interaction_df_test, interaction_df_pair_test], axis=1)

    

In [76]:
# add interaction data frame to main dataset and drop any duplicated columns
X_train_processed = pd.concat([X_train_preprocessed.drop(columns=interaction_cols), interaction_df], axis=1)
X_test_processed = pd.concat([X_test_preprocessed.drop(columns=interaction_cols), interaction_df_test], axis=1)        

X_train_processed = X_train_processed.loc[:,~X_train_processed.columns.duplicated()].copy()
X_train_processed = X_train_processed.reindex(sorted(X_train_processed.columns), axis=1)
X_test_processed = X_test_processed.loc[:,~X_test_processed.columns.duplicated()].copy()
X_test_processed = X_test_processed.reindex(sorted(X_test_processed.columns), axis=1)


In [77]:
# add higher order terms for select predictors in both the test and training data
X_train_processed['maximum_nights^2'] = X_train_processed['maximum_nights'] ** 2
X_train_processed['maximum_nights^3'] = X_train_processed['maximum_nights'] ** 3
X_train_processed['minimum_nights^2'] = X_train_processed['minimum_nights'] ** 2
X_train_processed['minimum_nights^3'] = X_train_processed['minimum_nights'] ** 3
X_train_processed['longitude^2'] = X_train_processed['longitude'] ** 2
X_train_processed['latitude^2'] = X_train_processed['latitude'] ** 2
X_train_processed['longitude^3'] = X_train_processed['longitude'] ** 3
X_train_processed['latitude^3'] = X_train_processed['latitude'] ** 3
X_train_processed['calculated_host_listings_count_log'] = np.log(X_train_processed['calculated_host_listings_count'])
X_train_processed['calculated_host_listings_count_log^2'] = np.log(X_train_processed['calculated_host_listings_count'])**2
X_train_processed['calculated_host_listings_count_log^3'] = np.log(X_train_processed['calculated_host_listings_count'])**3
X_train_processed['reviews_per_month^2'] = X_train_processed['reviews_per_month'] ** 2
X_train_processed['number_of_reviews_ltm_root'] = np.sqrt(X_train_processed['number_of_reviews_ltm'])

X_test_processed['maximum_nights^2'] = X_test_processed['maximum_nights'] ** 2
X_test_processed['maximum_nights^3'] = X_test_processed['maximum_nights'] ** 3
X_test_processed['minimum_nights^2'] = X_test_processed['minimum_nights'] ** 2
X_test_processed['minimum_nights^3'] = X_test_processed['minimum_nights'] ** 3
X_test_processed['longitude^2'] = X_test_processed['longitude'] ** 2
X_test_processed['latitude^2'] = X_test_processed['latitude'] ** 2
X_test_processed['longitude^3'] = X_test_processed['longitude'] ** 3
X_test_processed['latitude^3'] = X_test_processed['latitude'] ** 3
X_test_processed['calculated_host_listings_count_log'] = np.log(X_test_processed['calculated_host_listings_count'])
X_test_processed['calculated_host_listings_count_log^2'] = np.log(X_test_processed['calculated_host_listings_count'])**2
X_test_processed['calculated_host_listings_count_log^3'] = np.log(X_test_processed['calculated_host_listings_count'])**3
X_test_processed['reviews_per_month^2'] = X_test_processed['reviews_per_month'] ** 2
X_test_processed['number_of_reviews_ltm_root'] = np.sqrt(X_test_processed['number_of_reviews_ltm'])


In [78]:
# series of loops to create new categorical column and numerical column lists to account for the new dummy and interaction columns
cat_cols = X_train.select_dtypes(exclude='number').columns
cat_dummy_dict = {}

# Creating dummy variables for each categorical feature
for cat in cat_cols:
    dummy_df = pd.get_dummies(X_train[cat], prefix=cat, drop_first=True)
    cat_dummy_dict[cat] = list(dummy_df.columns)

# Generating combinations of categorical dummy variables
possible_cat_cat = []
for k, v in cat_dummy_dict.items():
    for j, w in cat_dummy_dict.items():
        if k != j:
            possible_cat_cat.extend([f"{x} {y}" for x in v for y in w])

# Creating a list of all dummy variables
plain_dummies = list(pd.get_dummies(X_train[cat_cols], drop_first=True).columns)

# Combining all dummy variables
binary_cols = plain_dummies + possible_cat_cat

# Filtering categorical and numerical columns
new_cat_cols = [col for col in X_train_processed.columns if col in binary_cols]
new_num_cols = [col for col in X_train_processed.columns if col not in new_cat_cols]


In [79]:
# create and fit scaler object
sc = StandardScaler()
sc.fit(X_train_processed[new_num_cols])

# reassign the numerical columns to their scaled value
X_train_processed[new_num_cols] = sc.transform(X_train_processed[new_num_cols])
X_test_processed[new_num_cols] = sc.transform(X_test_processed[new_num_cols]) 

X_train_final = X_train_processed.copy()
X_test_final = X_test_processed.copy()
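
The scaler above is fit on the training columns only and then applied to both sets, so the test data is standardized with the training mean and standard deviation. A tiny sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (values invented for illustration)
train_vals = np.array([[1.0], [2.0], [3.0]])
test_vals = np.array([[2.0], [4.0]])

# Learn mean and standard deviation from the training data only
sc = StandardScaler().fit(train_vals)

# Apply the same training statistics to both sets
train_scaled = sc.transform(train_vals)
test_scaled = sc.transform(test_vals)

print(sc.mean_, test_scaled.ravel())
```

The scaled test values need not have zero mean or unit variance, since they are expressed relative to the training distribution.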

In [80]:
# Create and fit model using transformed and scaled data
lrm = LinearRegression()
lrm.fit(X_train_final, y_train)

In [81]:
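# Predict the train data values (back-transform with np.exp() and apply the 1.05 adjustment), then calculate training RMSE and MAE
y_pred_train = np.exp(lrm.predict(X_train_final))*1.05
# Take the square root of the MSE explicitly; the 'squared' argument is deprecated in newer scikit-learn versions
rmse = np.sqrt(mean_squared_error(train_final.price, y_pred_train))
mae = mean_absolute_error(train_final.price, y_pred_train)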

print(round(rmse, 4))
print(round(mae, 4))
print("Diff rmse-mae:", round(rmse-mae, 4))

112.7342
51.5415
Diff rmse-mae: 61.1927
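
The gap between RMSE and MAE reported above reflects how strongly a few large errors influence RMSE. A short NumPy sketch with made-up prices illustrates the two formulas:

```python
import numpy as np

# Toy actual and predicted prices (invented); the last prediction misses badly
y_true = np.array([100.0, 150.0, 200.0, 1000.0])
y_pred = np.array([110.0, 140.0, 210.0, 600.0])

errors = y_true - y_pred

# MAE weights all errors linearly; RMSE squares them first, so large misses dominate
mae = np.mean(np.abs(errors))
rmse = np.sqrt(np.mean(errors ** 2))

print(round(mae, 4), round(rmse, 4))
```

RMSE is never smaller than MAE, and the difference grows as the error distribution becomes more dominated by a few large residuals.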


In [82]:
# Predict price in the test data
## Convert to true value with np.exp() and adjust with 1.05 coefficient
predicted_values = pd.DataFrame(np.exp(lrm.predict(X_test_final))*1.05, columns=['predicted'])

# add listing id to the predicted values dataframe and set the index to the id value
predicted_values = predicted_values.merge(test_final['id'], left_index=True, right_index=True).set_index('id').rename(columns={0:'predicted'})
predicted_values

predicted_values.to_csv('liner_model_predicted_values.csv')