In [None]:
import pandas as pd
pd.set_option('display.max_columns', None)
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split

# An article discussing the interesting findings from this analysis was published on [Medium](https://elibrunette.medium.com/factors-to-increase-rental-prices-for-airbnb-6a4cbb928e0d).

# Section 1: Business Understanding

#### The purpose of this project is to find insights in the AirBnB calendar, listings, and reviews datasets. A few questions that would be interesting to explore are:
### Question 1: Does having more pictures of the house correspond to higher overall reviews or prices for the location? 
### Question 2: Is square feet correlated to price?  
### Question 3: What feature correlates to higher prices? 
### Question 4: What feature correlates to higher overall review ratings? 
### Question 5: Does a word cloud give any interesting feedback into the reviews? 
### Question 6: What days are the most popular? Specifically Weekdays or Weekends? 
### Question 7: Does higher review count correspond to higher prices on the location? 
### Question 8: Does stay length correspond to rating review? 
### Question 9: Does not having a picture correlate to not having a review?

# Section 2: Data Understanding

In [None]:
calendar = pd.read_csv('./data/archive/calendar.csv')
listings = pd.read_csv('./data/archive/listings.csv')
reviews = pd.read_csv('./data/archive/reviews.csv')

#### Look at the columns of the data

In [None]:
calendar.columns

In [None]:
listings.columns

In [None]:
reviews.columns

#### Does having more pictures of the house correspond to higher overall reviews or prices for the location? 

In [None]:
# Figure out how much of the data is missing.
listings[['thumbnail_url','medium_url','xl_picture_url']].isna().mean()

In [None]:
# Fill missing URLs with 0 because the cleaning step below uses that value to generate a binary feature column indicating whether the URL is present
dataForPictureQuestion = listings[['thumbnail_url','medium_url','xl_picture_url','review_scores_value']].dropna(subset=['review_scores_value']).fillna(0)

In [None]:
# Convert each URL column to a binary flag: 1 if a URL is present, 0 if it was missing (filled with 0 above)
for col in ['thumbnail_url','medium_url','xl_picture_url']: 
    dataForPictureQuestion[col] = dataForPictureQuestion[col].apply(lambda x: 1 if x != 0 else 0)

In [None]:
# view the result from the previous cleaning method. Expecting the only values in the columns to be 1 or 0 to help with filtering
dataForPictureQuestion.head()

In [None]:
dataForPictureQuestion[(dataForPictureQuestion['thumbnail_url'] == 1)]

In [None]:
dataForPictureQuestion[(dataForPictureQuestion['medium_url'] == 1) & (dataForPictureQuestion['xl_picture_url'] == 1)]

#### The data suggests that if a listing has one picture URL, then it has all of them. I determined this by adjusting the column filters in the previous cells and noting that each filter returns the same number of rows. Also, about 90% of the rows have a value in these columns. 

I would like to see how the review scores relate to the picture values. Because a listing that has one picture URL has all of them, a single column carries the same information as all three. <br>
That means I can keep one column, which also saves some local RAM when creating the dummy columns. I arbitrarily chose the 'thumbnail_url' column.
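
The claim that the three picture columns are always present or missing together can also be checked directly instead of by comparing row counts. The block below is a minimal sketch on a made-up `toy_listings` frame (a hypothetical stand-in, not the real data); the names `picture_cols`, `urls_per_row`, and `columns_move_together` are only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the listings DataFrame; the values are made up.
toy_listings = pd.DataFrame({
    "thumbnail_url": ["a.jpg", np.nan, "c.jpg", np.nan],
    "medium_url": ["a_m.jpg", np.nan, "c_m.jpg", np.nan],
    "xl_picture_url": ["a_xl.jpg", np.nan, "c_xl.jpg", np.nan],
})

picture_cols = ["thumbnail_url", "medium_url", "xl_picture_url"]

# True where a URL is present, False where it is missing.
has_url = toy_listings[picture_cols].notna()

# Number of picture URLs present in each row.
urls_per_row = has_url.sum(axis=1)

# If the columns move together, every row has either none or all of the URLs.
columns_move_together = bool(urls_per_row.isin([0, len(picture_cols)]).all())

print(urls_per_row.value_counts().sort_index())
print("Columns are present or missing together:", columns_move_together)
```

Running the same check against `listings` should confirm or refute the assumption before the other two columns are dropped.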

In [None]:
dataForPictureQuestion.head()

In [None]:
dataForPictureQuestion['review_scores_value'].unique()

In [None]:
dummyScoreValues = pd.get_dummies(dataForPictureQuestion['review_scores_value'])
# Relabel the binary flag so that the dummy column names make sense in the resulting dataset.
# Mapping the whole column avoids chained assignment, which pandas may apply to a temporary copy instead of the DataFrame.
dataForPictureQuestion['thumbnail_url'] = dataForPictureQuestion['thumbnail_url'].map({1: 'Contains Pictures', 0: 'No Pictures'})
dummyPictureValues = pd.get_dummies(dataForPictureQuestion['thumbnail_url'])

In [None]:
# In the chained expression below, I group by the two columns to count the items in each group, then reset the index so that I can pivot on the grouped columns. 
# Finally, I fill NA with 0 because NA means that the group had no rows to count. 
dataForPictureHeatmap = dataForPictureQuestion[['thumbnail_url','review_scores_value','medium_url']].groupby(['thumbnail_url','review_scores_value']).count()\
                                                                            .rename({'medium_url':'count'},axis=1)\
                                                                            .reset_index()\
                                                                            .pivot(index='thumbnail_url',columns='review_scores_value',values='count')\
                                                                            .fillna(0)

In [None]:
dataForPictureHeatmap

In [None]:
sns.set(rc={'figure.figsize':(20,5)})
sns.heatmap(dataForPictureHeatmap,annot=True).set_ylim([0,2])
plt.title('Heatmap of distribution of review scores vs if the listing contains pictures.')
plt.xlabel("Review Scores for listing.")
plt.ylabel("Listing posting contains pictures.")

#### This heatmap is a visual representation of the distribution of review scores for listings with and without pictures. It shows that most of the listings that contain pictures also receive a high rating. However, the listings with no pictures still receive very high reviews as well. So it looks like, no picture, no problem! You are in the minority if you post a listing without pictures, but it doesn't seem to matter. 
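
Because far fewer listings lack pictures, raw counts make the two rows of the heatmap hard to compare. One option is to convert each row to proportions so that both groups are on the same scale. The block below is a minimal sketch on made-up data; `toy` and `row_share` are hypothetical names, and the numbers are not taken from the real dataset.

```python
import pandas as pd

# Made-up example with the same two columns used for the heatmap.
toy = pd.DataFrame({
    "thumbnail_url": ["Contains Pictures"] * 6 + ["No Pictures"] * 2,
    "review_scores_value": [10, 10, 10, 9, 9, 8, 10, 9],
})

# normalize="index" divides each row by its total, so every row sums to 1.
row_share = pd.crosstab(
    toy["thumbnail_url"],
    toy["review_scores_value"],
    normalize="index",
)

print(row_share.round(2))
```

Passing a table like `row_share` to `sns.heatmap` would show the share of each review score within each group rather than the raw counts.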

#### Time to see if this is also representative of the entire dataset. 

In [None]:
listings['review_scores_value'].hist()
plt.xlabel("Review score values")
plt.ylabel("Count of that review score")
plt.title("Histogram of review scores")

In [None]:
listings['review_scores_value'].describe()

This plot shows that the lowest review score is a 2, and that most of the rentals sit toward the top of the scale, matching what the heatmap showed for rentals that have pictures attached. <br> 
This also shows us that the overall review score is skewed towards the high end, with over half of the review scores being a 10, which suggests that most guests enjoyed their stay at AirBnB rooms. 

### Does not having a picture correlate to not having a review?

In [None]:
noReviewScores = listings[listings['review_scores_value'].isna()]

In [None]:
noThumbnailURL = listings[listings['thumbnail_url'].isnull()]

In [None]:
merged = noReviewScores.merge(noThumbnailURL)

In [None]:
merged.shape

In [None]:
noThumbnailURL.shape

In [None]:
noReviewScores[noReviewScores['thumbnail_url'].isnull()].shape

From what I can tell, the investigation above gives no good evidence that having a picture correlates with a higher review rating. 
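
The overlap between missing pictures and missing reviews can be summarized in a single table instead of comparing several `shape` outputs. The block below is a minimal sketch on made-up data; `toy_listings`, `picture_status`, `review_status`, `overlap`, and `share` are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the listings DataFrame; the values are made up.
toy_listings = pd.DataFrame({
    "thumbnail_url": ["a.jpg", np.nan, "c.jpg", np.nan, "e.jpg", "f.jpg"],
    "review_scores_value": [10.0, np.nan, np.nan, 9.0, 8.0, np.nan],
})

# Label each listing by whether the picture URL and the review score are missing.
picture_status = pd.Series(
    np.where(toy_listings["thumbnail_url"].isna(), "no picture", "has picture"),
    name="picture",
)
review_status = pd.Series(
    np.where(toy_listings["review_scores_value"].isna(), "no review", "has review"),
    name="review",
)

# Counts of listings in each combination of picture and review status.
overlap = pd.crosstab(picture_status, review_status)

# Share of listings without a review inside each picture group.
share = pd.crosstab(picture_status, review_status, normalize="index")

print(overlap)
print(share.round(2))
```

If the "no review" share is similar in both picture groups, that would support the conclusion that a missing picture does not go hand in hand with a missing review.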

### Is square feet correlated to price?

In [None]:
listings['square_feet'].isna().sum()

In [None]:
listings.shape

In [None]:
not_null_square_feet = listings[listings['square_feet'].notnull()]

In [None]:
not_null_square_feet.head(2)

In [None]:
not_null_square_feet[['square_feet','price']]

In [None]:
sns.scatterplot(data=not_null_square_feet,x='square_feet',y='price')

Since there are only 97 rows where square feet is not null, this question will be excluded from further investigation. <br>
However, from the data present, there isn't any notable correlation between price and square feet.
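
The scatter plot gives a visual impression only; a correlation coefficient puts a number on it. The block below is a minimal sketch on made-up values (not the real 97 rows), assuming `price` has already been converted to a float. It computes the Pearson correlation and a rank-based correlation, which is less sensitive to outliers. `toy`, `subset`, `pearson_r`, and `rank_r` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up rows; NaN mimics the listings without a square_feet value.
toy = pd.DataFrame({
    "square_feet": [400, np.nan, 800, 1200, np.nan, 1000],
    "price": [80.0, 95.0, 120.0, 210.0, 60.0, 150.0],
})

# Keep only the rows where both values are present.
subset = toy.dropna(subset=["square_feet", "price"])

# Pearson correlation measures the strength of a linear relationship.
pearson_r = subset["square_feet"].corr(subset["price"])

# Correlating the ranks instead of the raw values gives a Spearman-style coefficient.
rank_r = subset["square_feet"].rank().corr(subset["price"].rank())

print(f"rows used: {len(subset)}")
print(f"Pearson r: {pearson_r:.3f}, rank-based r: {rank_r:.3f}")
```

With so few usable rows in the real data, any coefficient should be read as a rough indication rather than a firm result.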

### What feature correlates to higher prices? <br> 
To start, the data will have to be cleaned, and then the result can be viewed and described. 

In [None]:
def clean_price_column(x):
    """
    Removes the $ sign and the commas from a price value. 
    
    Param: x: the price value you wish to adjust, for example '$1,250.00'
    Type: String
    Rtype: String
    Return: String that can be type cast to a float
    """
    return x.replace('$', '').replace(',','')

In [None]:
listings['price'] = listings['price'].apply(clean_price_column).astype('float64')

In [None]:
y=listings['price']

In [None]:
startingListings = listings.select_dtypes(include=['double','int'])

Dropping latitude and longitude from the dataset, because they have unique values across the entire dataset, and a model could learn them as a 1-1 relationship with price.<br> 
Also dropping license, because it has all null values for this dataset. 

In [None]:
x = startingListings.drop(['price','latitude','longitude','license'],axis=1)

In [None]:
x.columns

In [None]:
x.isna().sum()

In [None]:
x.shape

In [None]:
x.describe()

In [None]:
filled_na = x.fillna(-1)

In [None]:
x_train, x_test, y_train, y_test = train_test_split(filled_na, y, test_size=.33, random_state=42)

In [None]:
from sklearn import linear_model
reg = linear_model.LinearRegression()

In [None]:
reg.fit(x_train, y_train)

In [None]:
pred = reg.predict(x_test)

In [None]:
reg.coef_

Based on the coefficients of the linear regression model, it looks like the three attributes with the largest influence on price are bathrooms, bedrooms and beds. 
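
`reg.coef_` is a bare array, so reading it requires matching each value to its column by position. Raw coefficients also depend on the scale of each feature, which makes them hard to compare directly. The block below is a minimal sketch on synthetic data (the feature names match the dataset, but the values and the price formula are made up) that standardizes the features and labels the coefficients. `toy_x`, `toy_y`, `standardized_x`, and `coefficients` are hypothetical names, and standardizing is my assumption, not a step the notebook performs.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic features; the values are generated, not taken from the listings data.
rng = np.random.default_rng(42)
n = 200
toy_x = pd.DataFrame({
    "bedrooms": rng.integers(1, 5, size=n).astype(float),
    "number_of_reviews": rng.integers(0, 300, size=n).astype(float),
})

# Made-up price: a strong bedroom effect, a weak review-count effect, plus noise.
toy_y = (
    50
    + 40 * toy_x["bedrooms"]
    + 0.05 * toy_x["number_of_reviews"]
    + rng.normal(0, 5, size=n)
)

# Standardize so that each coefficient is expressed per standard deviation of its feature.
standardized_x = (toy_x - toy_x.mean()) / toy_x.std()

model = LinearRegression().fit(standardized_x, toy_y)

# Attach the column names and order by absolute size.
coefficients = pd.Series(model.coef_, index=toy_x.columns)
coefficients = coefficients.reindex(coefficients.abs().sort_values(ascending=False).index)

print(coefficients)
```

For the notebook's own model, `pd.Series(reg.coef_, index=x_train.columns)` would label the existing coefficients in the same way.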

In [None]:
reg2 = linear_model.LinearRegression()
reg2.fit(filled_na, y)

In [None]:
reg2.coef_

In [None]:
sns.scatterplot(data=listings, x='reviews_per_month', y='price')

### What days are the most popular? Specifically Weekdays or Weekends?

In [None]:
from datetime import datetime

In [None]:
calendar.head(20)

In [None]:
date = datetime.strptime('2016-01-8', '%Y-%m-%d')

In [None]:
if date.weekday() > 4: 
    print("Weekend")
else: print("Weekday")

In [None]:
def is_weekend(date_in):
    """
    Returns a boolean for whether the date passed in falls on a weekend. 
    
    Param: date_in: The string of the format YYYY-MM-DD representing a date.
    Type: String
    Rtype: Boolean
    Return: True if the date passed in is on a weekend, and False if it is a weekday.  
    """
    date = datetime.strptime(date_in, '%Y-%m-%d')
    # weekday() returns 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend
    return date.weekday() > 4

In [None]:
calendar['weekend'] = calendar['date'].apply(is_weekend)
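
Applying `is_weekend` row by row parses every date string in Python, which can be slow on a large calendar file. A vectorized alternative is sketched below on a few made-up dates; `toy_calendar` and `parsed` are hypothetical names, and the sketch assumes the dates follow the `YYYY-MM-DD` format seen in `calendar.head(20)`.

```python
import pandas as pd

# A few made-up rows in the same date format as calendar['date'].
toy_calendar = pd.DataFrame({
    "date": ["2016-01-08", "2016-01-09", "2016-01-10", "2016-01-11"],
})

# Parse the whole column at once instead of one string at a time.
parsed = pd.to_datetime(toy_calendar["date"], format="%Y-%m-%d")

# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 mark the weekend.
toy_calendar["weekend"] = parsed.dt.dayofweek >= 5

print(toy_calendar)
```

This uses the same Saturday and Sunday definition of a weekend as `is_weekend`.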

In [None]:
calendar['price'].isna()

In [None]:
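# Take an explicit copy so that the price conversion below modifies this DataFrame and not a view of calendar
available = calendar[calendar['available'] == 't'].copy()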

In [None]:
available['price'] = available['price'].apply(clean_price_column).astype('float64')

In [None]:
available['price'].mean()

In [None]:
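# Copy each subset so that new columns can be added later without a SettingWithCopyWarning
available_weekends = available[available['weekend'] == True].copy()
available_weekdays = available[available['weekend'] == False].copy()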

In [None]:
available_weekends['price'].mean()

In [None]:
available_weekdays['price'].mean()
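
The weekend and weekday comparison can also be done in one step with a `groupby`, and adding the median next to the mean helps when a few very expensive listings pull the mean up. The block below is a minimal sketch on made-up prices; `toy_available`, `day_type`, and `summary` are hypothetical names.

```python
import pandas as pd

# Made-up prices; the 500.0 row mimics an expensive outlier on a weekday.
toy_available = pd.DataFrame({
    "weekend": [False, False, False, True, True],
    "price": [100.0, 120.0, 500.0, 130.0, 150.0],
})

# Use readable labels for the two groups.
toy_available["day_type"] = toy_available["weekend"].map({True: "weekend", False: "weekday"})

# Count, mean, and median price per group in a single table.
summary = toy_available.groupby("day_type")["price"].agg(["count", "mean", "median"])

print(summary)
```

In this toy example the weekday mean is higher than the weekend mean only because of the single outlier, while the medians point the other way, which is why looking at both is useful.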

In [None]:
available[available['weekend'] == True]['price'].hist(bins=100)

In [None]:
available[available['weekend'] == False]['price'].hist(bins=100, xlabelsize=10)

In [None]:
available_weekdays['price'].describe()

In [None]:
# np.log10 works on the whole column at once, so no row-by-row apply is needed
available_weekdays['Log Base 10 Price'] = np.log10(available_weekdays['price'])

In [None]:
sns.histplot(available_weekdays, x="Log Base 10 Price", bins=100).set_title('Log10 histogram of price on weekdays')

In [None]:
available_weekends['price'].describe()

In [None]:
# np.log10 works on the whole column at once, so no row-by-row apply is needed
available_weekends['Log Base 10 Price'] = np.log10(available_weekends['price'])

In [None]:
sns.histplot(available_weekends, x="Log Base 10 Price", bins=100).set_title('Log10 histogram of price on weekends')

In [None]:
available_weekends['price'].mode()

In [None]:
available_weekdays['price'].mode()

In [None]:
len(calendar[calendar['available']=='t'])

In [None]:
calendar = calendar.rename({'listing_id':'id', 'price':'date_price'}, axis=1)

In [None]:
calendar.merge(listings, on='id')

# Section 3: Data Preparation  

# Section 4: Modeling

# Section 5: Evaluation

1. Does having more pictures of the house correspond to higher overall reviews or prices for the location? 
   1. Which ones have a better correlation to the higher review?
      1. After some review, it appears there is a good correlation between including pictures and ratings.
      2. This could be just reflecting the overall reviews for the data though. 
2. Is square feet correlated to price?  
   1. Since there are only 97 rows where square feet is not null, this question will be excluded from further investigation
   2. However, from the data present, there isn't any notable correlation between price and square feet
3. What feature correlates to higher prices? 
   1. Based on the coefficients of the linear regression model, it looks like the three attributes with the largest influence on price are bathrooms, bedrooms and beds.
4. What feature correlates to higher overall review ratings? 
5. Does a word cloud give any interesting feedback into the reviews? 
6. What days are the most popular? Specifically Weekdays or Weekends? 
   1. The data suggests that weekends have a higher base rate than weekdays. 
7. Does higher review count correspond to higher prices on the location? 
8. Does stay length correspond to rating review? 
9. Does not having a picture correlate to not having a review?
   1. It does not. There is a good portion of na reviews that have pictures. 