In [1]:
import pandas as pd
import re
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import numpy as np

The goal of this project is to suggest a price for a new listing that is added without one. The data comes from [Inside Airbnb](http://insideairbnb.com/get-the-data.html). I used the data for Prague, Czech Republic, from 29 November 2019. The columns I will be working with:
- host_response_rate: the response rate of the host
- host_acceptance_rate: share of requests to the host that convert to rentals
- host_listings_count: number of other listings the host has
- latitude: latitude part of the geographic coordinates
- longitude: longitude part of the geographic coordinates
- city: the city where the living space is located
- zipcode: the zip code where the living space is located
- state: the state where the living space is located
- accommodates: the number of guests the rental can accommodate
- room_type: the type of living space (e.g. Private room, Shared room, Hotel room or Entire home/apt)
- bedrooms: number of bedrooms included in the rental
- bathrooms: number of bathrooms included in the rental
- beds: number of beds included in the rental
- price: nightly price for the rental
- cleaning_fee: additional fee used for cleaning the living space after the guest leaves
- security_deposit: refundable security deposit, in case of damages
- minimum_nights: minimum number of nights a guest can stay for the rental
- maximum_nights: maximum number of nights a guest can stay for the rental
- number_of_reviews: number of reviews that previous guests have left

#### Define columns to import

In [2]:
cols = ['host_response_rate','host_acceptance_rate','host_listings_count',
        'latitude','longitude','city','zipcode','state','accommodates',
        'room_type','bedrooms','bathrooms','beds','price','cleaning_fee',
        'security_deposit','minimum_nights','maximum_nights','number_of_reviews']

In [3]:
praque_listings = pd.read_csv('listings.csv', usecols=cols)
praque_listings.head(2)

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,city,state,zipcode,latitude,longitude,room_type,accommodates,bathrooms,bedrooms,beds,price,security_deposit,cleaning_fee,minimum_nights,maximum_nights,number_of_reviews
0,100%,,69.0,Prague,Czech Republic,11000,50.08295,14.41623,Entire home/apt,4,1.0,1.0,2.0,"$3,736.00","$2,555.00",,1,365,19
1,100%,,69.0,Prague,Czech Republic,11000,50.08983,14.42317,Entire home/apt,4,1.0,1.0,2.0,"$2,506.00","$2,555.00",,1,365,113


In [4]:
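# Remove thousands separators and the currency sign, then convert to float.
# '$' is a regex anchor, so regex=False makes both replacements literal.
stripped_commas = praque_listings['price'].str.replace(',', '', regex=False)
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)
praque_listings['price'] = stripped_dollars.astype('float')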

Strategy to achieve the target:
- find comparable listings
- collect their prices
- use the average of those prices as the suggested price

When using the k-nearest neighbors algorithm, it is important to specify the parameter k. In this case, k is the number of similar listings whose prices will be compared.
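As a rough illustration of the role of k, the sketch below averages the prices of the k listings closest in `accommodates`. The `toy_listings` table and the `knn_price` helper are hypothetical and exist only for this example; the notebook itself works with `listings.csv`.

```python
import pandas as pd

# Hypothetical toy data; the real notebook reads listings.csv instead.
toy_listings = pd.DataFrame({
    "accommodates": [1, 2, 3, 3, 4, 6],
    "price": [500.0, 800.0, 1000.0, 1200.0, 1500.0, 3000.0],
})

def knn_price(listings, accommodates, k=5):
    """Average the prices of the k listings closest in `accommodates`."""
    # Distance is the absolute difference in the number of guests.
    distance = (listings["accommodates"] - accommodates).abs()
    # mergesort is stable, so tied listings keep their original row order.
    nearest = distance.sort_values(kind="mergesort").index[:k]
    return listings.loc[nearest, "price"].mean()

print(knn_price(toy_listings, accommodates=3, k=2))  # 1100.0
print(knn_price(toy_listings, accommodates=3, k=4))  # 1125.0
```

With k=2 only the two exact matches are averaged; with k=4 the two listings at distance 1 are included as well, which already shifts the suggested price.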

#### Similarity metric

In [5]:
acc = 3
first_living_space_value = praque_listings.iloc[0]['accommodates']
first_distance = np.abs(first_living_space_value - acc)
first_distance

1

In [6]:
diffrences = praque_listings.accommodates.apply(lambda x: np.abs(x - acc))
diffrences[:3]

0    1
1    1
2    3
Name: accommodates, dtype: int64

In [7]:
praque_listings['distance'] = diffrences

In [8]:
praque_listings.distance.value_counts()

1     8336
0     1762
3     1394
2     1254
5      482
4      281
7      199
13     120
6      119
9       97
8       63
11      38
10      28
12      11
Name: distance, dtype: int64

#### Sorting by the distance

In [9]:
praque_listings.distance.value_counts().sort_index()

0     1762
1     8336
2     1254
3     1394
4      281
5      482
6      119
7      199
8       63
9       97
10      28
11      38
12      11
13     120
Name: distance, dtype: int64

There are 1762 listings that accommodate the same number of guests as the target (3), i.e. their distance is 0.
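The same count can be read off directly with a vectorized distance, without `apply`. The values below are a hypothetical sample, not rows from `listings.csv`, so the numbers differ from the output above.

```python
import pandas as pd

# Hypothetical sample of `accommodates` values, not taken from listings.csv.
accommodates = pd.Series([4, 4, 6, 3, 2, 3], name="accommodates")
acc = 3

# Vectorized equivalent of apply(lambda x: np.abs(x - acc)).
distance = (accommodates - acc).abs()

# Number of listings at each distance, ordered by distance.
counts = distance.value_counts().sort_index()

# Listings that accommodate exactly `acc` guests.
exact_matches = int((distance == 0).sum())
print(counts)
print(exact_matches)
```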

In [10]:
praque_listings[praque_listings.distance == 0].head(3)

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,city,state,zipcode,latitude,longitude,room_type,accommodates,bathrooms,bedrooms,beds,price,security_deposit,cleaning_fee,minimum_nights,maximum_nights,number_of_reviews,distance
6,100%,,12.0,Prague,Hlavní město Praha,118 00,50.08871,14.40769,Entire home/apt,3,1.0,1.0,1.0,1624.0,$0.00,$383.00,3,365,272,0
9,100%,,3.0,Prague,Prague,130 00,50.07871,14.45213,Entire home/apt,3,1.0,1.0,3.0,1091.0,,"$1,226.00",2,1825,117,0
10,100%,,1.0,Prague,Hlavní město Praha,110 00,50.08365,14.41409,Entire home/apt,3,1.0,1.0,2.0,2181.0,"$3,194.00","$1,022.00",3,365,320,0


In [11]:
praque_listings[praque_listings.distance == 0]['accommodates'].head(3)

6     3
9     3
10    3
Name: accommodates, dtype: int64

In [12]:
praque_listings = praque_listings.sort_values('distance')
praque_listings.head(2)

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,city,state,zipcode,latitude,longitude,room_type,accommodates,bathrooms,bedrooms,beds,price,security_deposit,cleaning_fee,minimum_nights,maximum_nights,number_of_reviews,distance
3196,100%,,3.0,Praha 4,Prague,14000,50.05978,14.43548,Entire home/apt,3,1.0,3.0,3.0,2344.0,$0.00,"$1,500.00",30,1125,6,0
11792,100%,,2.0,Praha-Libuš,Hlavní město Praha,142 00,50.00329,14.46514,Hotel room,3,1.0,1.0,3.0,1300.0,,,1,1125,3,0


In [13]:
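# Redundant: price is already float. The assignment aligns on the index, so the values are unchanged.
praque_listings['price'] = stripped_dollars.astype(float)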

In [14]:
praque_listings.head(5)

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,city,state,zipcode,latitude,longitude,room_type,accommodates,bathrooms,bedrooms,beds,price,security_deposit,cleaning_fee,minimum_nights,maximum_nights,number_of_reviews,distance
3196,100%,,3.0,Praha 4,Prague,14000,50.05978,14.43548,Entire home/apt,3,1.0,3.0,3.0,2344.0,$0.00,"$1,500.00",30,1125,6,0
11792,100%,,2.0,Praha-Libuš,Hlavní město Praha,142 00,50.00329,14.46514,Hotel room,3,1.0,1.0,3.0,1300.0,,,1,1125,3,0
4907,90%,,5.0,Prague,,186 00,50.09315,14.45219,Entire home/apt,3,1.0,1.0,3.0,1183.0,,$190.00,1,1125,105,0
11784,100%,,1.0,Praha,Hlavní město Praha,150 00,50.06362,14.40943,Entire home/apt,3,1.0,1.0,2.0,1810.0,,$350.00,2,1125,17,0
11782,91%,,2.0,Praha 3,Hlavní město Praha,130 00,50.08666,14.46025,Entire home/apt,3,1.0,1.0,2.0,1810.0,$0.00,$850.00,1,1125,23,0


In [15]:
first_5 = praque_listings.price[:5]
mean_price = sum(first_5)/len(first_5)
print(mean_price)

1689.4


1689.4 is the mean of the five prices above, all from listings that accommodate the same number of guests as the target (distance 0).

Below is the same procedure as above, condensed into a single function.

In [16]:
praque_listings = pd.read_csv('listings.csv', usecols=cols)
# '$' is a regex anchor, so regex=False makes both replacements literal.
stripped_commas = praque_listings['price'].str.replace(',', '', regex=False)
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)
praque_listings['price'] = stripped_dollars.astype('float')
praque_listings = praque_listings.loc[np.random.permutation(len(praque_listings))]

def predict_price(new_listing):
    # new_listing is the number of guests the new listing accommodates
    temp_df = praque_listings.copy()
    temp_df['distance'] = (temp_df['accommodates'] - new_listing).abs()
    temp_df = temp_df.sort_values('distance')
    # Average price of the five nearest listings
    return temp_df['price'].iloc[:5].mean()

acc_one = predict_price(1)
acc_two = predict_price(2)
acc_four = predict_price(4)

In [17]:
print('1: ' ,acc_one, '2: ', acc_two, '4: ', acc_four)

1:  2790.2 2:  1192.8 4:  1786.8


These predictions are not reliable: prices are spread very widely, so averaging five neighbours matched on a single feature is not a solution for this case, only a starting point.
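To make the effect of a wide price spread concrete, here is a minimal sketch on hypothetical prices (not values from `listings.csv`). It assumes five listings of the same size, one of them far more expensive than the rest, and compares the mean with the median and the sample standard deviation.

```python
import pandas as pd

# Hypothetical prices for listings that all accommodate the same number of guests.
same_size = pd.DataFrame({
    "accommodates": [2, 2, 2, 2, 2],
    "price": [600.0, 900.0, 1100.0, 1400.0, 9000.0],
})

prices = same_size["price"]
print(prices.mean())    # 2600.0, pulled up by the single expensive listing
print(prices.median())  # 1100.0, less sensitive to that outlier
print(prices.std())     # sample standard deviation as a measure of spread
```

When the standard deviation is larger than the typical price, as in this toy case, the mean of a handful of neighbours depends heavily on which listings happen to be picked.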