# PRICE PREDICTION

Predict the rental price of Airbnb listings from their geographic and room details.

Dataset: [NYC Airbnb](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data)

In [95]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import math

from sklearn.model_selection import train_test_split
from sklearn.metrics import mutual_info_score, accuracy_score, mean_squared_error
from sklearn.linear_model import LogisticRegression, Ridge

In [59]:
df = pd.read_csv("AB_NYC_2019.csv")
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


In [60]:
# Copy the selected columns so later assignments do not act on a slice of df
use_df = df[["neighbourhood_group", "room_type", "latitude", "longitude", "price",
             "minimum_nights", "number_of_reviews", "reviews_per_month",
             "calculated_host_listings_count", "availability_365"]].copy()
use_df.head()

Unnamed: 0,neighbourhood_group,room_type,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
0,Brooklyn,Private room,40.64749,-73.97237,149,1,9,0.21,6,365
1,Manhattan,Entire home/apt,40.75362,-73.98377,225,1,45,0.38,2,355
2,Manhattan,Private room,40.80902,-73.9419,150,3,0,,1,365
3,Brooklyn,Entire home/apt,40.68514,-73.95976,89,1,270,4.64,1,194
4,Manhattan,Entire home/apt,40.79851,-73.94399,80,10,9,0.1,1,0


In [61]:
use_df.isna().sum()

neighbourhood_group                   0
room_type                             0
latitude                              0
longitude                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64

In [62]:
use_df["reviews_per_month"] = use_df["reviews_per_month"].fillna(0)
use_df.isna().sum()

neighbourhood_group               0
room_type                         0
latitude                          0
longitude                         0
price                             0
minimum_nights                    0
number_of_reviews                 0
reviews_per_month                 0
calculated_host_listings_count    0
availability_365                  0
dtype: int64
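
Assigning into a column of a DataFrame that was itself produced by selecting columns from another one can trigger pandas' `SettingWithCopyWarning`. A minimal sketch of one common way to avoid it, taking an explicit `.copy()` before filling missing values; the three-row frame is an illustrative stand-in, not data from the CSV:

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the listings table (illustrative values only).
df = pd.DataFrame({
    "id": [1, 2, 3],
    "price": [149, 225, 150],
    "reviews_per_month": [0.21, 0.38, np.nan],
})

# .copy() makes the selection an independent DataFrame, so the assignment
# below modifies use_df itself and leaves df untouched.
use_df = df[["price", "reviews_per_month"]].copy()
use_df["reviews_per_month"] = use_df["reviews_per_month"].fillna(0)

print(use_df.isna().sum())
```

The original `df` keeps its missing value, which is usually what is wanted when the raw table should stay as loaded.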

# Q1

In [63]:
use_df["neighbourhood_group"].mode()

0    Manhattan
dtype: object

In [64]:
X = use_df.drop(["price"], axis=1)
y = use_df["price"]

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5)
print(X_train.shape, X_val.shape, X_test.shape)

(29337, 9) (9779, 9) (9779, 9)
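
The split above is not seeded, so every run produces different train/validation/test sets and slightly different numbers downstream. A hedged sketch of a reproducible 60/20/20 split on synthetic data; the helper name `split_60_20_20` and the seed value are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and target.
rng = np.random.default_rng(0)
X = pd.DataFrame({"a": rng.normal(size=100), "b": rng.normal(size=100)})
y = pd.Series(rng.integers(50, 500, size=100), name="price")

def split_60_20_20(X, y, seed=42):
    """Hypothetical helper: 60% train, 20% validation, 20% test."""
    X_train, X_temp, y_train, y_temp = train_test_split(
        X, y, test_size=0.4, random_state=seed)
    X_val, X_test, y_val, y_test = train_test_split(
        X_temp, y_temp, test_size=0.5, random_state=seed)
    return X_train, X_val, X_test, y_train, y_val, y_test

first = split_60_20_20(X, y)
second = split_60_20_20(X, y)
print(first[0].shape, first[1].shape, first[2].shape)
```

With a fixed `random_state`, calling the helper twice returns the same rows in each part.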


# Q2

In [65]:
X_train.corr()

Unnamed: 0,latitude,longitude,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
latitude,1.0,0.089194,0.0335,-0.011515,-0.013651,0.019998,-0.011178
longitude,0.089194,1.0,-0.064367,0.057651,0.13944,-0.112306,0.082523
minimum_nights,0.0335,-0.064367,1.0,-0.086729,-0.133752,0.138567,0.146085
number_of_reviews,-0.011515,0.057651,-0.086729,1.0,0.586614,-0.07133,0.168134
reviews_per_month,-0.013651,0.13944,-0.133752,0.586614,1.0,-0.041719,0.162759
calculated_host_listings_count,0.019998,-0.112306,0.138567,-0.07133,-0.041719,1.0,0.223346
availability_365,-0.011178,0.082523,0.146085,0.168134,0.162759,0.223346,1.0


Largest correlation: **number_of_reviews** and **reviews_per_month**, at 0.586614 in the matrix above. The exact value shifts from run to run because the split is not seeded.
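
Reading the largest pair off a correlation matrix by eye gets harder as columns are added. A sketch of finding it programmatically by masking the diagonal and the duplicated lower triangle; the data here is synthetic and only borrows the column names:

```python
import numpy as np
import pandas as pd

# Synthetic data: reviews_per_month is built to correlate with number_of_reviews.
rng = np.random.default_rng(0)
n = 500
reviews = rng.poisson(20, size=n).astype(float)
frame = pd.DataFrame({
    "number_of_reviews": reviews,
    "reviews_per_month": reviews / 12 + rng.normal(scale=0.5, size=n),
    "availability_365": rng.integers(0, 366, size=n).astype(float),
})

corr = frame.corr()

# Keep the strict upper triangle so each pair appears once and the
# diagonal (always 1.0) is excluded.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank by absolute value so strong negative correlations count too.
top_pair = pairs.abs().idxmax()
print(top_pair, round(pairs.loc[top_pair], 3))
```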

In [66]:
# Flag listings priced at or above 152, the threshold used here for "above average"
use_df["above_average"] = (use_df["price"] >= 152).astype(int)
use_df.head()

Unnamed: 0,neighbourhood_group,room_type,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,above_average
0,Brooklyn,Private room,40.64749,-73.97237,149,1,9,0.21,6,365,0
1,Manhattan,Entire home/apt,40.75362,-73.98377,225,1,45,0.38,2,355,1
2,Manhattan,Private room,40.80902,-73.9419,150,3,0,0.0,1,365,0
3,Brooklyn,Entire home/apt,40.68514,-73.95976,89,1,270,4.64,1,194,0
4,Manhattan,Entire home/apt,40.79851,-73.94399,80,10,9,0.1,1,0,0
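
The binarization can be checked on the five prices shown in the table above. This sketch applies the same threshold of 152 with a vectorized comparison; the notebook does not say where 152 comes from, so treating it as a rounded mean price would be an assumption:

```python
import pandas as pd

# The five prices from the preview above.
prices = pd.Series([149, 225, 150, 89, 80], name="price")

# Threshold taken from the notebook as-is (its origin is not stated there).
threshold = 152

# Comparison yields booleans; astype(int) turns them into 0/1 labels.
above_average = (prices >= threshold).astype(int)
print(above_average.tolist())  # [0, 1, 0, 0, 0]
```

Only the 225 listing clears the threshold, matching the `above_average` column in the preview.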


In [79]:
X = use_df.drop(["price", "above_average"], axis=1)
y = use_df["above_average"]

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5)
print(X_train.shape, X_val.shape, X_test.shape)

(29337, 9) (9779, 9) (9779, 9)


# Q3

In [80]:
mutual_info_score(X_train.room_type, X_train.minimum_nights)

0.03884275657945132

In [84]:
def mutual_info_price_score(series):
    return round(mutual_info_score(series, use_df.above_average), 2)

categorical = use_df.select_dtypes(include=["object"]).columns
use_df[categorical].apply(mutual_info_price_score)

neighbourhood_group    0.05
room_type              0.14
dtype: float64

**room_type** has the higher mutual information score with **above_average** (the binarized price): 0.14, versus 0.05 for **neighbourhood_group**.
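
The scores above are computed against the full `use_df`. If the goal is to avoid looking at validation and test rows during feature selection, the same comparison can be run on the training part only. A sketch on synthetic data in which the label is constructed to depend on `room_type` and not on `neighbourhood_group`:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mutual_info_score

# Synthetic training set (illustrative, not the Airbnb data).
rng = np.random.default_rng(1)
n = 1000
room_type = rng.choice(["Entire home/apt", "Private room"], size=n)
group = rng.choice(["Brooklyn", "Manhattan", "Queens"], size=n)

# The label is more likely to be 1 for entire homes; the borough plays no role.
p_above = np.where(room_type == "Entire home/apt", 0.7, 0.1)
label = (rng.random(n) < p_above).astype(int)

X_train = pd.DataFrame({"neighbourhood_group": group, "room_type": room_type})
y_train = pd.Series(label, name="above_average")

# One mutual information score per categorical column, against the train labels.
scores = X_train.apply(lambda col: mutual_info_score(col, y_train))
print(scores.round(2).sort_values(ascending=False))
```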

# Q4

In [85]:
X_train = pd.get_dummies(data=X_train, columns=["neighbourhood_group","room_type"], 
                      prefix=["neighbourhood_group","room_type"])
X_val = pd.get_dummies(data=X_val, columns=["neighbourhood_group","room_type"], 
                      prefix=["neighbourhood_group","room_type"])
X_test = pd.get_dummies(data=X_test, columns=["neighbourhood_group","room_type"], 
                      prefix=["neighbourhood_group","room_type"])
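
Calling `pd.get_dummies` separately on each split works only if every category occurs in every split. When a category is missing from one of them, that split ends up with fewer columns and the model cannot score it. A sketch of one way to guard against this, reindexing the validation columns to match the training columns; the small frames are made up for illustration:

```python
import pandas as pd

# "Shared room" appears in train but not in val.
train = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Shared room"],
    "minimum_nights": [1, 2, 3],
})
val = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt"],
    "minimum_nights": [4, 5],
})

train_enc = pd.get_dummies(train, columns=["room_type"], prefix=["room_type"])
val_enc = pd.get_dummies(val, columns=["room_type"], prefix=["room_type"])

# Force val to have exactly the training columns, in the same order;
# dummy columns absent from val are filled with 0.
val_enc = val_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(val_enc.columns))
```

An alternative with the same effect is to fit scikit-learn's `OneHotEncoder` or `DictVectorizer` on the training split and reuse it to transform the others.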

In [86]:
model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_val)
round(accuracy_score(y_val, y_pred), 2)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.79
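
The solver message above suggests raising `max_iter` or scaling the data. A hedged sketch of both on synthetic features with very different ranges (a latitude-like column, a 0 to 365 column, a count column); the pipeline and the `max_iter=1000` value are illustrative choices, not what the notebook ran:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales.
rng = np.random.default_rng(42)
n = 400
X = np.column_stack([
    rng.normal(40.7, 0.05, n),   # latitude-like
    rng.integers(0, 366, n),     # availability-like
    rng.integers(0, 300, n),     # review-count-like
])
# The label depends on the second column plus noise.
y = (X[:, 1] + rng.normal(0, 60, n) > 180).astype(int)

# Standardize each feature, then fit with a higher iteration cap.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="lbfgs", C=1.0, max_iter=1000, random_state=42),
)
model.fit(X[:300], y[:300])

acc = model.score(X[300:], y[300:])
print(round(acc, 2))
```

Scaling changes the effective strength of the `C` regularization, so accuracies from a scaled model are not directly comparable with the unscaled ones reported here.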

# Q5

In [87]:
neighbourhood_group = [col for col in X_train.columns if col.startswith("neighbourhood_group")]
X_train_temp = X_train.drop(neighbourhood_group, axis=1)
X_val_temp = X_val.drop(neighbourhood_group, axis=1)
X_test_temp = X_test.drop(neighbourhood_group, axis=1)

model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)
model.fit(X_train_temp, y_train)
y_pred = model.predict(X_val_temp)
round(accuracy_score(y_val, y_pred), 2)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.75

In [88]:
room_type = [col for col in X_train.columns if col.startswith("room_type")]
X_train_temp = X_train.drop(room_type, axis=1)
X_val_temp = X_val.drop(room_type, axis=1)
X_test_temp = X_test.drop(room_type, axis=1)

model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)
model.fit(X_train_temp, y_train)
y_pred = model.predict(X_val_temp)
round(accuracy_score(y_val, y_pred), 2)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.72

In [89]:
X_train_temp = X_train.drop(["number_of_reviews"], axis=1)
X_val_temp = X_val.drop(["number_of_reviews"], axis=1)
X_test_temp = X_test.drop(["number_of_reviews"], axis=1)

model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)
model.fit(X_train_temp, y_train)
y_pred = model.predict(X_val_temp)
round(accuracy_score(y_val, y_pred), 2)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.78

In [90]:
X_train_temp = X_train.drop(["reviews_per_month"], axis=1)
X_val_temp = X_val.drop(["reviews_per_month"], axis=1)
X_test_temp = X_test.drop(["reviews_per_month"], axis=1)

model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)
model.fit(X_train_temp, y_train)
y_pred = model.predict(X_val_temp)
round(accuracy_score(y_val, y_pred), 2)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.79

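Removing **reviews_per_month** changes validation accuracy the least: it stays at 0.79, the same as the full model. Removing **number_of_reviews** gives 0.78, **neighbourhood_group** 0.75, and **room_type** 0.72.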
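
The four cells above repeat the same drop-fit-score steps. A sketch of the same feature elimination written as a loop, on synthetic data with one informative column and one pure-noise column; the names `val_accuracy`, `signal`, and `noise` are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic data: only "signal" carries information about the label.
rng = np.random.default_rng(0)
n = 600
X = pd.DataFrame({"signal": rng.normal(size=n), "noise": rng.normal(size=n)})
y = (X["signal"] + 0.3 * rng.normal(size=n) > 0).astype(int)

X_train, X_val = X.iloc[:400], X.iloc[400:]
y_train, y_val = y.iloc[:400], y.iloc[400:]

def val_accuracy(columns):
    """Fit on the given columns and return validation accuracy."""
    model = LogisticRegression(solver="lbfgs", C=1.0, max_iter=1000, random_state=42)
    model.fit(X_train[columns], y_train)
    return accuracy_score(y_val, model.predict(X_val[columns]))

baseline = val_accuracy(list(X.columns))

# Accuracy change when each feature is left out in turn.
diffs = {
    col: baseline - val_accuracy([c for c in X.columns if c != col])
    for col in X.columns
}

# The feature whose removal moves accuracy the least.
least_useful = min(diffs, key=lambda col: abs(diffs[col]))
print(baseline, diffs, least_useful)
```

For one-hot encoded features, the inner list would need to drop every dummy column that shares the feature's prefix, as the cells above do.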

# Q6

In [92]:
X = use_df.drop(["price", "above_average"], axis=1)
y = use_df["price"]
y = np.log1p(y)

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5)
print(X_train.shape, X_val.shape, X_test.shape)

(29337, 9) (9779, 9) (9779, 9)


In [97]:
X_train = pd.get_dummies(data=X_train, columns=["neighbourhood_group","room_type"], 
                      prefix=["neighbourhood_group","room_type"])
X_val = pd.get_dummies(data=X_val, columns=["neighbourhood_group","room_type"], 
                      prefix=["neighbourhood_group","room_type"])
X_test = pd.get_dummies(data=X_test, columns=["neighbourhood_group","room_type"], 
                      prefix=["neighbourhood_group","room_type"])

In [100]:
for a in [0, 0.01, 0.1, 1, 10]:
    ridge = Ridge(alpha=a)
    ridge.fit(X_train, y_train)
    y_pred = ridge.predict(X_val)
    print(a, round(math.sqrt(mean_squared_error(y_val, y_pred)), 3))

0 0.499
0.01 0.499
0.1 0.499
1 0.499
10 0.5


The lowest rounded RMSE (0.499) is shared by alpha = 0, 0.01, 0.1, and 1; the smallest of these is alpha = **0**.
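
Picking the winner by eye is easy to get wrong when several alphas tie after rounding. A sketch that selects the lowest rounded RMSE and breaks ties by the smallest alpha, on a synthetic log-price target; the coefficients and noise level are arbitrary assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Synthetic regression problem with a log1p-transformed price target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
log_price = 4.5 + X @ np.array([0.4, -0.2, 0.1, 0.0]) + rng.normal(scale=0.3, size=300)
y = np.log1p(np.exp(log_price))

X_train, X_val = X[:200], X[200:]
y_train, y_val = y[:200], y[200:]

results = []
for alpha in [0, 0.01, 0.1, 1, 10]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
    results.append((round(rmse, 3), alpha))

# Tuples compare by rounded RMSE first, then by alpha,
# so ties go to the smallest alpha.
best_rmse, best_alpha = min(results)
print(best_rmse, best_alpha)
```

Because the target is `log1p(price)`, the RMSE is on the log scale; `np.expm1` converts predictions back to prices.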