In [None]:
import pandas as pd
pd.set_option('display.max_columns', None)
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split

# Article that was published on [Medium](https://elibrunette.medium.com/factors-to-increase-rental-prices-for-airbnb-6a4cbb928e0d) discussing interesting information from analysis.

# Section 1: Business Understanding

#### The purpose of this project is to find insights in the AirBnB dataset. The questions below explore which factors correlate with how much money one might make on a rental property. 
### Question 1: Does having more pictures of the house correspond to higher overall reviews or prices for the location? 
### Question 2: Is square feet correlated to price?  
### Question 3: What feature correlates to higher prices? 
### Question 4: What features correlate to higher overall review ratings? 
### Question 5: What days are the most popular? Specifically Weekdays or Weekends? 
### Question 6: Does higher review count correspond to higher prices on the location? 
### Question 7: Does not having a picture correlate to not having a review?

# Section 2: Data Understanding

In [None]:
calendar = pd.read_csv('./data/archive/calendar.csv')
listings = pd.read_csv('./data/archive/listings.csv')
reviews = pd.read_csv('./data/archive/reviews.csv')

#### Look at the columns of the data

In [None]:
calendar.columns

In [None]:
listings.columns

In [None]:
reviews.columns

#### Does having more pictures of the house correspond to higher overall reviews or prices for the location? 

In [None]:
# Figure out how much of the data is missing.
listings[['thumbnail_url','medium_url','xl_picture_url']].isna().mean()

In [None]:
# Fill missing URLs with 0 because the cleaning step below turns each URL column into a binary flag for whether the URL is present
dataForPictureQuestion = listings[['thumbnail_url','medium_url','xl_picture_url','review_scores_value']].dropna(subset=['review_scores_value']).fillna(0)

In [None]:
#cleaning up the column values
for col in dataForPictureQuestion[['thumbnail_url','medium_url','xl_picture_url']].columns: 
    dataForPictureQuestion[col] = dataForPictureQuestion[col].apply(lambda x: 1 if x != 0 else 0)

In [None]:
# view the result from the previous cleaning method. Expecting the only values in the columns to be 1 or 0 to help with filtering
dataForPictureQuestion.head()

In [None]:
dataForPictureQuestion[(dataForPictureQuestion['thumbnail_url'] == 1)]

In [None]:
dataForPictureQuestion[(dataForPictureQuestion['medium_url'] == 1) & (dataForPictureQuestion['xl_picture_url'] == 1)]

#### The data suggests that if a listing has one picture URL, then it has all of them. This was determined by changing which columns are filtered in the previous cells and noting that every combination returns the same number of rows. About 90% of the listings have a value in these columns. 

I would like to see how the review scores relate to the picture columns. I looked at all three picture columns, even though a listing that has one of them has all of them. <br>
That means a single column carries the same information as all three, so I can keep one column and save some local RAM when creating the dummy columns. I arbitrarily chose the 'thumbnail_url' column.
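The claim that the three picture columns are always present or missing together can also be checked directly instead of by comparing row counts. Here is a minimal sketch of that check. The `toy_listings` frame and its values are made up for illustration; only the three column names come from the dataset.

```python
import pandas as pd

# Toy stand-in for the three picture URL columns (values are made up).
toy_listings = pd.DataFrame({
    "thumbnail_url": ["a.jpg", None, "c.jpg", None],
    "medium_url": ["a_m.jpg", None, "c_m.jpg", None],
    "xl_picture_url": ["a_xl.jpg", None, "c_xl.jpg", None],
})

picture_cols = ["thumbnail_url", "medium_url", "xl_picture_url"]
missing = toy_listings[picture_cols].isna()

# Count the missing URL columns per row: 0 or 3 means the columns agree for that row.
missing_per_row = missing.sum(axis=1)
columns_agree = bool(missing_per_row.isin([0, len(picture_cols)]).all())

print(columns_agree)
```

If `columns_agree` is `True` on the real `listings` frame, keeping only one of the columns loses no information.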

In [None]:
dataForPictureQuestion.head()

In [None]:
dataForPictureQuestion['review_scores_value'].unique()

In [None]:
dummyScoreValues = pd.get_dummies(dataForPictureQuestion['review_scores_value'])
#change values so that when we create a dummy column for it, the values will make sense in the resulting dataset.
# replace() on the whole column avoids chained .loc assignment, which may not modify the DataFrame in recent pandas versions
dataForPictureQuestion['thumbnail_url'] = dataForPictureQuestion['thumbnail_url'].replace({1: 'Contains Pictures', 0: 'No Pictures'})
dummyPictureValues = pd.get_dummies(dataForPictureQuestion['thumbnail_url'])

In [None]:
# Group by both columns to count the listings in each group, then reset the index so the result can be pivoted on those columns. 
# Finally, fill NA with 0 because NA means that the group had no listings to count. 
# pivot() is called with keyword arguments because newer pandas versions no longer accept them positionally.
dataForPictureHeatmap = dataForPictureQuestion[['thumbnail_url','review_scores_value','medium_url']].groupby(['thumbnail_url','review_scores_value']).count()\
                                                                            .rename({'medium_url':'count'},axis=1)\
                                                                            .reset_index()\
                                                                            .pivot(index='thumbnail_url',columns='review_scores_value',values='count')\
                                                                            .fillna(0)

In [None]:
dataForPictureHeatmap

In [None]:
import matplotlib.pyplot as plt

sns.set(rc={'figure.figsize':(20,5)})
sns.heatmap(dataForPictureHeatmap,annot=True).set_ylim([0,2])
plt.title('Heatmap of distribution of review scores vs if the listing contains pictures.')
plt.xlabel("Review Scores for listing.")
plt.ylabel("Listing posting contains pictures.")

#### This heatmap is a visual representation of the distribution of review scores for listings with and without pictures. Most of the listings that contain pictures receive a high rating. However, the listings with no pictures also receive very high reviews. So it looks like: no picture, no problem! Listings without pictures are in the minority, but the missing pictures don't seem to matter. 
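Because the two groups differ so much in size, raw counts are hard to compare between the rows of the heatmap. One option is to turn each row into shares of its own total. The sketch below does that with `pd.crosstab`. The toy frame and the `has_picture` column name are assumptions for illustration, not the real data.

```python
import pandas as pd

# Toy data: 8 listings with pictures and 2 without (scores are made up).
toy = pd.DataFrame({
    "has_picture": ["Contains Pictures"] * 8 + ["No Pictures"] * 2,
    "review_scores_value": [10, 10, 10, 10, 9, 9, 8, 7, 10, 9],
})

# normalize="index" divides every row by its own total, so each row sums to 1.
share_by_group = pd.crosstab(
    toy["has_picture"], toy["review_scores_value"], normalize="index"
)

print(share_by_group.round(2))
```

Passing a table like `share_by_group` to `sns.heatmap` would show whether the shape of the score distribution differs between the groups, independent of how many listings each group has.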

#### Time to see if this is also representative of the entire dataset. 

In [None]:
listings['review_scores_value'].hist()
plt.xlabel("Review score values")
plt.ylabel("Count of that review score")
plt.title("Histogram of review scores")

In [None]:
listings['review_scores_value'].describe()

This plot shows that the lowest review score is a 2, and that most rentals sit towards the top of the scale, just like the listings with pictures in the heatmap above. <br> 
This also shows us that the overall review score is skewed towards the high end, with most guests enjoying their stay at AirBnB rooms. Over half of the review scores are a 10. 
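The share of each score can be read off directly rather than estimated from the histogram. A small sketch, using a made-up `toy_scores` series in place of `listings['review_scores_value']`:

```python
import pandas as pd

# Made-up scores, including one missing value.
toy_scores = pd.Series([10, 10, 10, 9, 9, 8, None, 10], name="review_scores_value")

# value_counts() skips missing values by default; normalize=True returns shares instead of counts.
score_share = toy_scores.value_counts(normalize=True).sort_index(ascending=False)
share_of_tens = score_share.loc[10.0]

print(score_share)
```

In this toy series 4 of the 7 non-missing scores are a 10. The same two lines on the real column would confirm or refute the "over half" reading.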

### Is square feet correlated to price?

First thing to do is to look into the features for those two items.

In [None]:
listings['square_feet'].isna().sum()

In [None]:
listings.shape

#### This shows that there isn't much data for square feet, and we probably won't be able to infer anything about square feet from it that isn't biased... <br> 
#### However, we still might be able to infer some information from what we have.

In [None]:
not_null_square_feet = listings[listings['square_feet'].notnull()]

In [None]:
not_null_square_feet.head(2)

In [None]:
def clean_price_column(x):
    """
    Removes the $ sign and commas from a price value. 
    
    Param: x: the value you wish to adjust
    Type: String
    Rtype: String
    Return: String that can be cast to a float
    """
    return x.replace('$', '').replace(',','')

In [None]:
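# Reuse clean_price_column: str.replace('$','') treats '$' as a regex in older pandas and never strips commas.
not_null_square_feet = not_null_square_feet.copy()  # avoid writing to a slice of `listings`
not_null_square_feet['price'] = pd.to_numeric(not_null_square_feet['price'].apply(clean_price_column))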
not_null_square_feet.sort_values(['price','square_feet'])[['price','square_feet']]

In [None]:
not_null_square_feet[not_null_square_feet['square_feet'] == 3]

In [None]:
sns.set(rc={'figure.figsize':(10,10)})
sns.regplot(data=not_null_square_feet.sort_values(['price']),x='square_feet',y='price')
plt.title("Regression line for price vs square feet")
plt.ylabel("Price for rental")
plt.xlabel("Square feet for rental")

There are only 97 rows where square feet is not null, so any conclusion here is tentative. <br>
However, from the data present, the scatter plot shows a notable correlation between price and square feet. The other thing of note is the handful of very low square-feet values that push the trend line a little higher at the bottom. So we will remove those outliers from the dataset. 
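To put a number on how much a few implausibly small square-feet values can distort the relationship, the Pearson correlation can be computed with and without them. This is only a sketch on made-up numbers; the cutoff of 100 square feet is an assumption for the toy data, not a value taken from the dataset.

```python
import pandas as pd

# Made-up rentals: the first two rows mimic listings with nonsensical square-feet values.
toy = pd.DataFrame({
    "square_feet": [1, 2, 400, 600, 800, 1000, 1200],
    "price": [150, 200, 80, 100, 130, 160, 190],
})

# Series.corr computes the Pearson correlation by default.
corr_all = toy["square_feet"].corr(toy["price"])

# Hypothetical cutoff: keep only rows with a plausible square footage.
plausible = toy[toy["square_feet"] >= 100]
corr_plausible = plausible["square_feet"].corr(plausible["price"])

print(round(corr_all, 2), round(corr_plausible, 2))
```

In the toy data the two tiny values nearly erase an otherwise strong positive correlation, which is the same effect the low points have on the trend line in the plot.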

In [None]:
not_null_square_feet.sort_values('square_feet')['square_feet'].unique()

In [None]:
outliar_values = [1.0, 2.0, 3.0, 4.0]
removed_lower_limit_outliars = not_null_square_feet[~not_null_square_feet['square_feet'].isin(outliar_values)]

In [None]:
sns.set(rc={'figure.figsize':(10,10)})
sns.regplot(data=removed_lower_limit_outliars.sort_values(['price']), x='square_feet', y='price')
plt.title("Regression line for price vs square feet")
plt.ylabel("Price for rental")
plt.xlabel("Square feet for rental")

### What days are the most popular? Specifically Weekdays or Weekends?

In [None]:
from datetime import datetime

In [None]:
calendar.head(20)

In [None]:
date = datetime.strptime('2016-01-8', '%Y-%m-%d')

In [None]:
if date.weekday() > 4: 
    print("Weekend")
else: print("Weekday")

In [None]:
def is_weekend(date_in):
    """
    Returns a boolean for whether the date passed in is on the weekend or not. 
    Note that Friday is considered a weekday under this rule, because it is a normal workday. 
    
    Param: date_in: The string of the format YYYY-MM-DD representing a date.
    Type: String
    Rtype: Boolean
    Return: True if the date passed in is on a weekend, and False if it is a weekday.  
    """
    date = datetime.strptime(date_in, '%Y-%m-%d')
    # weekday() returns 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend
    return date.weekday() > 4

In [None]:
calendar['weekend'] = calendar['date'].apply(is_weekend)

In [None]:
calendar['price'].isna()

In [None]:
calendar[calendar['available'] == 'f']['price'].unique()

Note that the price is null on nights when the listing is unavailable. So we can only analyze the price for days on which the listing is available. 

In [None]:
available = calendar[calendar['available'] == 't'].copy()  # copy so the price column can be overwritten safely

In [None]:
available['price'] = pd.to_numeric(available['price'].apply(clean_price_column))

In [None]:
available['price'].mean()

In [None]:
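# copy() so that new columns can be added to each subset later without a SettingWithCopyWarning
available_weekends = available[available['weekend'] == True].copy()
available_weekdays = available[available['weekend'] == False].copy()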

In [None]:
available_weekends['price'].mean()

In [None]:
available_weekdays['price'].mean()

Interesting that the mean for the whole dataset is closer to the mean for the weekdays. However, this makes sense because there are more weekdays than weekend days. 
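The reason is that the overall mean is a weighted average of the two group means, weighted by the number of rows in each group. A quick sketch with made-up prices (5 weekday rows and 2 weekend rows) shows the arithmetic:

```python
import pandas as pd

# Made-up prices: five weekday nights at 100 and two weekend nights at 135.
toy = pd.DataFrame({
    "weekend": [False] * 5 + [True] * 2,
    "price": [100.0] * 5 + [135.0] * 2,
})

overall_mean = toy["price"].mean()

# Mean and row count per group, then the count-weighted average of the group means.
group = toy.groupby("weekend")["price"].agg(["mean", "count"])
weighted_mean = (group["mean"] * group["count"]).sum() / group["count"].sum()

print(overall_mean, weighted_mean)
```

Both values come out to 110 here, which is much closer to the weekday mean of 100 than to the weekend mean of 135 because weekdays contribute five of the seven rows.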

In [None]:
available_weekends['price'].hist(bins=100)

In [None]:
available_weekends['price'].describe()

#### Shows that the distribution has a lot of lower-priced rooms available on the weekends, seeming to cluster around the 100 range. 

In [None]:
import matplotlib.pyplot as plt

In [None]:
available[available['weekend'] == False]['price'].hist(bins=100, xlabelsize=10)

In [None]:
available_weekdays['price'].describe()

#### Shows the same result as the previous histogram, and the describe function seems to confirm it, with the 50th percentile being 105. <br> 
These distributions look roughly log-normal. So the next step is to take the log of the price and see if that assumption is accurate.

In [None]:
import math 
available_weekdays['Log Base 10 Price'] = available_weekdays['price'].apply(math.log10)

In [None]:
sns.histplot(available_weekdays, x="Log Base 10 Price", bins=100).set_title('Log10 histogram of price on weekdays')

#### Indeed, the price distribution looks roughly log-normal
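Besides eyeballing the histogram, the skewness before and after the log transform gives a rough numeric check: a right-skewed price column whose log is close to symmetric is consistent with a log-normal shape. The sketch below uses synthetic prices drawn from a log-normal distribution, so the seed, the median of 100, and the spread of 0.6 are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic prices drawn from a log-normal distribution (illustration only).
rng = np.random.default_rng(0)
toy_prices = pd.Series(np.exp(rng.normal(loc=np.log(100), scale=0.6, size=5000)))

# Sample skewness: clearly positive for the raw prices, near zero after taking the log.
skew_raw = toy_prices.skew()
skew_log = np.log10(toy_prices).skew()

print(round(skew_raw, 2), round(skew_log, 2))
```

Running the same two `skew()` calls on `available_weekdays['price']` would show whether the real data behaves the same way. A skewness near zero after the log is supporting evidence, not proof, of log-normality.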

In [None]:
available_weekends['price'].describe()

#### Curious if the weekend distribution is the same.

In [None]:
import math 
available_weekends['Log Base 10 Price'] = available_weekends['price'].apply(math.log10)

In [None]:
sns.histplot(available_weekends, x="Log Base 10 Price", bins=100).set_title('Log10 histogram of price on weekends')

### And indeed it looks that way as well. 

Now to investigate a little more of the distribution before doing data preparation for the modeling.

In [None]:
available_weekends['price'].mode()

In [None]:
available_weekdays['price'].mode()

In [None]:
len(calendar[calendar['available']=='t'])

In [None]:
### Bonus: What does number of reviews compared to price look like? 

In [None]:
listings['price'] = listings['price'].apply(clean_price_column).astype('float64')

In [None]:
sns.scatterplot(data=listings, x='reviews_per_month', y='price')

I was just checking reviews per listing because I was curious. I don't think that there is much to glean from this, except that listings with more reviews per month appear to have lower prices, most likely because those rentals are also booked during the weekdays. The listings with fewer reviews per month have higher prices, meaning that they are most likely rented mainly on the weekend. <br> 
A high number of reviews for one property also indicates that a lot of people visit that property in a month, or that guests have some sort of incentive for leaving a review. <br> 
Another option is review buffing, where the same person leaves multiple reviews for the same property. 
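The visual impression from the scatter plot could be backed up with a rank correlation, which measures whether price tends to fall as reviews per month rise without assuming a straight-line relationship. Spearman's rank correlation is the Pearson correlation of the ranks, so the sketch below computes it that way with plain pandas. The numbers are made up.

```python
import pandas as pd

# Made-up listings where price falls steadily as reviews per month rise.
toy = pd.DataFrame({
    "reviews_per_month": [0.2, 0.5, 1.0, 2.0, 4.0, 8.0],
    "price": [400.0, 250.0, 180.0, 120.0, 90.0, 60.0],
})

# Spearman's rho: rank both columns, then take the ordinary (Pearson) correlation of the ranks.
rho = toy["reviews_per_month"].rank().corr(toy["price"].rank())

print(rho)
```

The toy data is perfectly monotone, so `rho` is -1. On the real listings a value near zero would mean the pattern in the scatter plot is weaker than it looks.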

# Section 3: Data Preparation  

### Which features correlate with higher prices? We will build a model and analyze its coefficients across multiple variables. <br> 
To start, the data has to be cleaned and then inspected. 

In [None]:
#only grabbing the data that will fit the model at this moment. 
#If time allows I will add more data into the model to see if it affects the overall trend found from this analysis
startingListings = listings.select_dtypes(include=['double','int'])

In [None]:
# The goal is to predict price, so we will assign that to the y variable.
y=listings['price']

In [None]:
listings['square_feet'].isna().mean()

Dropping latitude and longitude from the dataset, because they have unique values for the entire dataset, and could be trained to be a 1-1 relationship with price.<br> 
Also dropping license, because it has all null values for this dataset. <br> 
Finally, dropping square feet from the dataset, because of the significant number of null values. 97% null will sway the dataset weights for square feet given the method for filling na values. <br>
The goal is to create a model for price and analyse the coef_ attribute to see which columns have a positive correlation with the price of the unit. 
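The decision about which columns to drop for missing values can be made from the share of nulls per column. A minimal sketch of that idea follows. The toy frame and the 0.5 threshold are assumptions for illustration, not values used in this analysis.

```python
import numpy as np
import pandas as pd

# Toy frame with one mostly complete column, one mostly empty column, and one fully empty column.
toy = pd.DataFrame({
    "bedrooms": [1, 2, np.nan, 3],
    "square_feet": [np.nan, np.nan, np.nan, 800],
    "license": [np.nan] * 4,
})

# Share of missing values per column.
missing_share = toy.isna().mean()

# Hypothetical threshold: keep only columns with at most half of their values missing.
max_missing = 0.5
kept = toy.loc[:, missing_share <= max_missing]

print(missing_share)
print(list(kept.columns))
```

With this threshold only `bedrooms` survives in the toy frame, mirroring the choice above to drop `square_feet` and `license`.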

In [None]:
x = startingListings.drop(['price','latitude','longitude','license','square_feet'],axis=1)

In [None]:
x.isna().sum()

In [None]:
x.shape

In [None]:
x.describe()

In [None]:
# Given the number of na values in the dataset, it makes sense to fill the nas with a value that doesn't make sense for the data (-1). 
# If we removed all na values, there would not be much of a dataset for modeling.
filled_na = x.fillna(-1)

In [None]:
x_train, x_test, y_train, y_test = train_test_split(filled_na, y, test_size=.33, random_state=42)

# Section 4: Modeling

Continuing the discussion above: fit a linear regression model for price and see which values help predict price. 

In [None]:
from sklearn import linear_model
reg = linear_model.LinearRegression()

In [None]:
reg.fit(x_train, y_train)

In [None]:
pred = reg.predict(x_test)

In [None]:
reg.coef_

Based on the coefficients of the linear model, it looks like the three attributes with the largest influence on price are bathrooms, bedrooms and beds. 
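Before reading too much into the coefficients, it helps to know how well the model predicts prices it has not seen. The `pred` array above is never compared with `y_test`, so here is a sketch of how predictions could be scored with R² and RMSE. It uses plain NumPy and made-up numbers, so `y_true` and `y_pred` are placeholders rather than the real test data.

```python
import numpy as np

# Placeholder values standing in for y_test and pred.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0])

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# RMSE is in the same units as the target, here dollars per night.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(round(r2, 3), rmse)
```

For the fitted scikit-learn model, `reg.score(x_test, y_test)` returns the same R² quantity. A low score would suggest the coefficients should be interpreted with caution.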

In [None]:
reg2 = linear_model.LinearRegression()
reg2.fit(filled_na, y)

In [None]:
reg2.coef_

# Section 5: Evaluation

### 1. Does having more pictures of the house correspond to higher overall reviews or prices for the location? 
   1. Which ones have a better correlation to the higher review?
      1. After some review it appears there is a good correlation between including pictures and ratings.
      2. However, this could just reflect the overall distribution of reviews in the data, without showing any clear sign of one being better than the other.      

In [None]:
listings['review_scores_value'].hist()
plt.xlabel("Review score values")
plt.ylabel("Count of that review score")
plt.title("Histogram of review scores")

This picture shows the overall review scores distribution for the data. 

In [None]:
sns.set(rc={'figure.figsize':(20,5)})
sns.heatmap(dataForPictureHeatmap,annot=True).set_ylim([0,2])
plt.title('Heatmap of distribution of review scores vs if the listing contains pictures.')
plt.xlabel("Review Scores for listing.")
plt.ylabel("Listing posting contains pictures.")

This is a heatmap of the review scores vs whether the listing contains pictures. This visualization shows that there are significantly more listings with pictures than without. However, the distribution for no pictures drops off significantly at 8s, while listings with pictures drop off around review scores of 7.

### 2. Is square feet correlated to price?  
   1. Since there are only 97 rows with square feet not null, this question cannot be answered conclusively.
   2. However, from the data present, there does appear to be a positive correlation between price and square feet.

In [None]:
sns.set(rc={'figure.figsize':(10,10)})
sns.regplot(data=not_null_square_feet.sort_values(['price']),x='square_feet',y='price')
plt.title("Does square feet increase rental price?")
plt.ylabel("Price for rental")
plt.xlabel("Square feet for rental")

This plot of square feet against price for rentals has an overall upward trend, even with some outliers. Note the listings with fewer than 100 square feet and fairly high prices for a night's stay: these values raise the low end of the trend line and flatten its slope. This chart is drawn from `not_null_square_feet`, so the implausibly small values (1 to 4 ft^2) are still included; the earlier plot of `removed_lower_limit_outliars` shows the trend without them. Despite that, there is a correlation between square feet and rental price. Another note about this graph is that there aren't many points compared to the whole dataset (N ~ 100). 

### 3. What feature correlates to higher prices? 
   1. Based on the coefficients of the linear model, it looks like the three attributes with the largest influence on price are bathrooms, bedrooms and beds.

In [None]:
linear_model_coef = pd.DataFrame()
linear_model_coef['columns'] = x.columns
linear_model_coef['coef'] = reg2.coef_
linear_model_coef['annotation'] = reg2.coef_

In [None]:
linear_model_coef[['columns','coef']]

In [None]:
plot = sns.barplot(x='coef', y='columns', data=linear_model_coef.sort_values('coef'))
plt.title("What features affect the price of the rental?")

From the table, it is apparent that the largest positive coefficients belong to bedrooms, bathrooms, beds, and review_scores_location, and the largest negative coefficient belongs to review_scores_value. 
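One caveat when ranking features by raw coefficients: a coefficient's size depends on the scale of its column, so a feature measured in the hundreds will get a small coefficient even if it matters a lot. A common remedy is to standardize the features before fitting. The sketch below illustrates this on synthetic data, using NumPy's least squares in place of scikit-learn. The feature names, the true weights, and the noise level are all made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
bedrooms = rng.integers(1, 5, size=n).astype(float)        # small-scale feature (1 to 4)
num_reviews = rng.integers(0, 300, size=n).astype(float)   # large-scale feature (0 to 299)
price = 50 + 30 * bedrooms + 0.2 * num_reviews + rng.normal(0, 5, size=n)

X = np.column_stack([bedrooms, num_reviews])

def ols_coefficients(features, target):
    """Fit ordinary least squares with an intercept and return the feature coefficients."""
    design = np.column_stack([np.ones(len(features)), features])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return beta[1:]  # drop the intercept

raw_coef = ols_coefficients(X, price)

# Standardize each column to mean 0 and standard deviation 1, then refit.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_coef = ols_coefficients(X_std, price)

print(raw_coef.round(2), std_coef.round(2))
```

The raw coefficients differ by a factor of about 150, while the standardized ones (price change per one standard deviation of the feature) are far closer together. Applying the same standardization to `filled_na` before fitting `reg2` would make the bar chart above easier to compare across columns.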

### 4. What feature correlate to higher overall review ratings? 

See visual from 3 and analysis from 3. 

### 5. What days are the most popular? Specifically Weekdays or Weekends? 
   1. The data suggests that weekends have a higher base rate for most of the summary statistics. 

In [None]:
available_weekends['price'].hist(bins=100)
plt.xlabel('Weekend price')
plt.ylabel('Count')
plt.title("What is the distribution of weekend prices?")

Histogram of the weekend price showing that there is a huge spike of prices at the lower end of the dataset. With the distribution being: 

In [None]:
available_weekends['price'].describe()

In [None]:
available_weekends['price'].mode()

In [None]:
available_weekdays['price'].hist(bins=100, xlabelsize=10)
plt.xlabel("Weekday Price")
plt.ylabel("Count")
plt.title("What is the distribution of weekday prices?")

In [None]:
available_weekdays['price'].describe()

In [None]:
available_weekdays['price'].mode()

Based on the summary statistics for these distributions, the mean is higher on the weekend, and most of the distribution is skewed upward on the weekend based on the 50th and 75th percentiles. So, overall the prices tend to be higher on the weekends. 
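The weekday and weekend statistics can also be lined up in one table, which makes the comparison easier than scrolling between separate `describe()` outputs. The sketch below flags weekends with the same rule as `is_weekend` (Saturday and Sunday only) and groups by that flag. The dates and prices are made up, and the `day_type` column name is an assumption.

```python
import pandas as pd

# Made-up calendar rows: 2016-01-04 is a Monday, 2016-01-09 and 2016-01-10 are Saturday and Sunday.
toy_calendar = pd.DataFrame({
    "date": ["2016-01-04", "2016-01-05", "2016-01-08", "2016-01-09", "2016-01-10"],
    "price": [100.0, 110.0, 120.0, 150.0, 140.0],
})

# dayofweek is 0 for Monday through 6 for Sunday, so values above 4 are the weekend.
day_of_week = pd.to_datetime(toy_calendar["date"]).dt.dayofweek
toy_calendar["day_type"] = (day_of_week > 4).map({True: "weekend", False: "weekday"})

# One row per group with the statistics side by side.
summary = toy_calendar.groupby("day_type")["price"].agg(["mean", "median", "count"])

print(summary)
```

Using `pd.to_datetime` on the whole column is also a vectorized alternative to applying `is_weekend` row by row, which may be noticeably faster on a calendar with many rows.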

### 6. Does higher review count correspond to higher prices on the location? 

Based on the analysis from 3, it does not appear so. In fact, the review scores contribute negatively to price. I believe this is because of the number of highly rated properties and listings in the dataset. 

### 7. Does not having a picture correlate to not having a review?
   1. It does not. A good portion of the listings with NA review scores have pictures. I again believe that this has something to do with the overall distribution of the data for the dataset.
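A contingency table of "has a picture" against "has a review score" is one way to check this directly. The sketch below builds one with `pd.crosstab`. The two column names come from the dataset, but the values and the label strings are made up.

```python
import pandas as pd

# Made-up listings: some lack a picture, some lack a review score.
toy = pd.DataFrame({
    "thumbnail_url": ["a.jpg", None, "c.jpg", "d.jpg", None],
    "review_scores_value": [10.0, None, None, 9.0, 8.0],
})

# Turn presence/absence into readable labels for the table.
has_picture = toy["thumbnail_url"].notna().map({True: "picture", False: "no picture"})
has_review = toy["review_scores_value"].notna().map({True: "review", False: "no review"})

table = pd.crosstab(has_picture.rename("has_picture"), has_review.rename("has_review"))

print(table)
```

If missing pictures went hand in hand with missing reviews, the "picture / no review" and "no picture / review" cells would be close to empty. In the toy data both cells are populated, which is the pattern described above.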