This notebook is part of the Data Scientist Nanodegree from Udacity. We will explore the calendar and listings data for Airbnb homes in Seattle.

**1- The best time to visit**

**Business Understanding**

What are the most visited months? Is there any correlation between seasons and the volume of bookings?


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Let's load the calendar and listings data and take a look at it.

In [None]:
listings = pd.read_csv('listings.csv')
calendar = pd.read_csv('calendar.csv')
calendar.head()

**Data Understanding**

We need to find the mean price for each month and compare them to see the best time to visit.

**Prepare Data**

Looking at the price column, we can see that it contains NaN values, and that the dollar signs and thousands separators will prevent us from treating it as a numeric type. We'll fix that next.

In [None]:
def fix_prices(column_name, dataset_name):
    ''' Take a column name and a dataset whose column contains dollar signs and commas.
        The function updates the dataframe in place with the cleaned strings.
    '''
    # regex=False so that '$' is removed literally instead of being read as a regex anchor
    dataset_name[column_name] = (dataset_name[column_name]
                                 .str.replace('$', '', regex=False)
                                 .str.replace(',', '', regex=False))

def fix_numeric(column_name, dataset_name):
    ''' Take a column name and a dataset whose column is of object type.
        The function changes the column type to numeric in place.
    '''
    dataset_name[column_name] = pd.to_numeric(dataset_name[column_name])

In [None]:
fix_prices('price', calendar)
fix_numeric('price', calendar)
calendar.head()
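
As a quick sanity check of the cleaning step, the sketch below applies the same idea to a small hand-made Series. The sample values and the names `raw_prices` and `cleaned` are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Hypothetical sample values in the same format as the calendar price column
raw_prices = pd.Series(['$85.00', '$1,250.00', None])

# Strip the dollar sign and the thousands separator literally, then convert
cleaned = pd.to_numeric(
    raw_prices.str.replace('$', '', regex=False).str.replace(',', '', regex=False)
)
print(cleaned)
```

Missing prices stay missing (NaN) after the conversion, so they are ignored by `mean()` later on.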

We've removed the dollar signs and changed the type to numeric.
Now we'll convert the date column to datetime and look at the mean price by month.

In [None]:
calendar['date'] = pd.to_datetime(calendar['date'])

mean_of_month = calendar.groupby(calendar['date'].dt.strftime('%B'),
                                 sort=False)['price'].mean()

mean_of_month.plot(kind = 'barh' , figsize = (20,10));
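
With `sort=False` the months appear in the order they are first seen in the data. If a guaranteed January-to-December order is wanted, one option is to group by the month number and relabel afterwards. This is only a sketch on made-up rows; `sample` and `mean_by_month` are hypothetical names, and the standard-library `calendar` module is imported as `cal` so it does not clash with the `calendar` DataFrame.

```python
import calendar as cal  # aliased to avoid shadowing the notebook's `calendar` DataFrame
import pandas as pd

# Hypothetical daily prices, deliberately out of chronological order
sample = pd.DataFrame({
    'date': pd.to_datetime(['2016-03-01', '2016-01-15', '2016-02-10', '2016-01-20']),
    'price': [120.0, 90.0, 100.0, 110.0],
})

# Grouping by the month number sorts the result chronologically
mean_by_month = sample.groupby(sample['date'].dt.month)['price'].mean()

# Replace the month numbers with month names for display
mean_by_month.index = [cal.month_name[int(m)] for m in mean_by_month.index]
print(mean_by_month)
```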

**Results**

Prices rise during the summer vacation and winter holiday periods, with a clear increase in June, July and August. Taking price as a proxy for demand, the month with the least amount of travel is January.
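
Mean price only hints at booking volume. A more direct check, sketched below, would be the share of nights that are not available in each month. This assumes the calendar data has an `available` column holding `'t'`/`'f'` flags, which is not shown in this notebook; the rows and the names `sample` and `booked_share` are made up.

```python
import pandas as pd

# Hypothetical rows; the 'available' column with 't'/'f' flags is an assumption
sample = pd.DataFrame({
    'date': pd.to_datetime(['2016-01-05', '2016-01-06', '2016-07-05', '2016-07-06']),
    'available': ['t', 't', 'f', 't'],
})

# Treat an unavailable night as booked (it could also be blocked by the host)
sample['booked'] = sample['available'] == 'f'

# Share of booked nights per month number
booked_share = sample.groupby(sample['date'].dt.month)['booked'].mean()
print(booked_share)
```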

**2- Best Neighbourhoods**

**Business Understanding**

What are the 3 most expensive and least expensive neighbourhoods in Seattle for renting?

**Data Understanding**

Here I'll run an analysis to find the 3 most expensive and the 3 least expensive neighbourhoods in Seattle. First, let's take a peek at the listings.

In [None]:
listings = pd.read_csv('listings.csv')
listings.head()

In [None]:
listings.columns

**Prepare Data**

There are 93 columns, and many of them could help with this analysis, but we will focus on the following: neighbourhood, neighbourhood_cleansed, zipcode, property_type and room_type, along with id and price.

In [None]:
columns_loc = ['id', 'neighbourhood', 'neighbourhood_cleansed',
               'zipcode', 'property_type', 'room_type', 'price']

# Take an explicit copy so that later in-place edits don't raise SettingWithCopyWarning
listings_loc = listings[columns_loc].copy()
listings_loc.head()

In [None]:
listings_loc.describe(include='all')

In [None]:
count_values_cols = ['neighbourhood', 'neighbourhood_cleansed', 'property_type', 'room_type']

def plot_values_counts(col):
    print('\n', col, '\n')
    listings_loc[col].value_counts().plot(kind='barh', figsize=(20,10));

plot_values_counts('property_type')

Now we'll fix price as we did with calendar, merge the rare property types (Cabin, Camper/RV, etc.) into a single "Other" category since they have few values, and drop neighbourhood and zipcode, since neighbourhood_cleansed is the cleaner column and doesn't have any missing values.

In [None]:
fix_prices('price', listings_loc)
fix_numeric('price', listings_loc)

In [None]:
# Replace any property type that appears 30 times or fewer with "Other"

prop_tp = listings_loc['property_type'].value_counts()
rare_types = list(prop_tp[prop_tp <= 30].index)
listings_loc['property_type'] = listings_loc['property_type'].replace(rare_types, 'Other')

In [None]:
listings_loc = listings_loc.drop(['neighbourhood', 'zipcode'] , axis = 1)
listings_loc.head()

In [None]:
# Mean price and number of listings per neighbourhood, indexed by neighbourhood name
neighbourhood_info = (listings_loc.groupby('neighbourhood_cleansed')['price']
                      .agg(mean_price='mean', count='size'))

In [None]:
# The 10 least expensive neighbourhoods by mean price
neighbourhood_info.sort_values(by = 'mean_price').iloc[0:10].plot(kind = 'barh',figsize = (20,10));

In [None]:
# The 10 most expensive neighbourhoods by mean price (last rows of the ascending sort)
neighbourhood_info.sort_values(by = 'mean_price').tail(10).plot(kind = 'barh',figsize = (20,10));

**Conclusion**

Above we see the most expensive neighbourhoods in Seattle (Pioneer Square, Central Business District, Fauntleroy) and the least expensive neighbourhoods (Rainier Beach, Olympic Hills, South Delridge).
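
The top and bottom three can also be read off directly rather than from the charts. The sketch below uses made-up mean prices with placeholder neighbourhood names; on the real data the same two calls could be applied to the `mean_price` column.

```python
import pandas as pd

# Hypothetical mean prices; the neighbourhood names are placeholders
mean_price = pd.Series(
    {'A': 60.0, 'B': 150.0, 'C': 95.0, 'D': 210.0, 'E': 75.0, 'F': 180.0},
    name='mean_price',
)

most_expensive = mean_price.nlargest(3)    # three highest mean prices
least_expensive = mean_price.nsmallest(3)  # three lowest mean prices
print(most_expensive)
print(least_expensive)
```

Neighbourhoods with only a handful of listings can dominate either end of the ranking, so it is worth reading the result alongside the listing counts.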

**3- Popular Airbnb Homes**

**Business Understanding**

Are hosts with higher review scores charging renters more? We'll check whether listings with a higher review score rating come with a higher price.

**Data Understanding**

Since we need to compare prices between listings, we need a way to calculate a price value irrespective of the number of bedrooms, bathrooms, etc. One way to do that is to calculate the price per person accommodated. The data offers information on price and on the number of guests a listing accommodates (the accommodates column), so we'll create a new column 'price_per_accommodation' and fill it with price / accommodates.
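
To make the normalisation concrete, here is a minimal sketch on three made-up listings. `sample` is a hypothetical frame, and it assumes the price has already been converted to a number.

```python
import pandas as pd

# Hypothetical listings: nightly price and guest capacity
sample = pd.DataFrame({'price': [100.0, 240.0, 90.0], 'accommodates': [2, 6, 1]})

# Price per guest puts listings of different sizes on a common scale
sample['price_per_accommodation'] = sample['price'] / sample['accommodates']
print(sample)
```

In this example the 240 listing for six guests turns out cheaper per person (40) than the 100 listing for two guests (50).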

In [None]:
host_response_vals = listings['host_response_time'].value_counts()
(host_response_vals/listings.shape[0]).plot(kind="bar");

**Model Data**

Our analysis will be for the following review scores: 7, 8, 9, 10. Let's see if there is a correlation between hosts with high review rating and price of listing.

In [None]:
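# Calculate price per accommodation for each listing.
# Assumption: 'price_val' is not created anywhere above, so it is derived here
# as the numeric version of the raw 'price' strings.
listings['price_val'] = pd.to_numeric(
    listings['price'].astype(str)
    .str.replace('$', '', regex=False)
    .str.replace(',', '', regex=False))
listings['price_per_accommodation'] = listings['price_val'] / listings['accommodates']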

def get_prices_for_review_score(df):
    """Return mean prices for review score values
    df - DataFrame on which to retrieve prices"""
    prices = np.zeros(4)
    
    for i in range(4):
        review_score = 7 + i
        prices[i] = df[(df['review_scores_value'] == review_score)]['price_per_accommodation'].mean()

    return prices

prices = get_prices_for_review_score(listings)
reviews = ('7', '8', '9', '10')
reviews_pos = np.arange(len(reviews))

plt.plot(reviews, prices)
plt.xticks(reviews_pos, reviews)
plt.xlabel('Review scores value')
plt.ylabel('Price per accommodation')
plt.show()

**Results**

We can see that there's an upward trend in pricing as review score values increase. Hosts who've received higher review scores are charging more.
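
To put a number on the trend, one option is to compute a correlation coefficient and to check how many listings sit behind each mean. The sketch below uses made-up values and hypothetical names (`sample`, `corr`, `by_score`); it is not the result for the Seattle data.

```python
import pandas as pd

# Hypothetical review scores and per-guest prices
sample = pd.DataFrame({
    'review_scores_value': [7, 8, 9, 10, 10],
    'price_per_accommodation': [30.0, 32.0, 35.0, 41.0, 39.0],
})

# Pearson correlation between review score and price per guest
corr = sample['review_scores_value'].corr(sample['price_per_accommodation'])

# Mean price and number of listings for each review score
by_score = (sample.groupby('review_scores_value')['price_per_accommodation']
            .agg(['mean', 'count']))
print(corr)
print(by_score)
```

A score with very few listings behind it gives an unreliable mean, so the counts help judge how much weight each point on the line chart deserves.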