## Airbnb Data Challenge 

Data taken from:

The dataset describes rental listings in Rio de Janeiro, Brazil. It supports the following challenges for candidates:

- Estimate the price variable for each observation (regression approach).
- Estimate the room type variable for each observation (classification approach).

This notebook makes a superficial analysis of the dataset, paying attention to each requirement from the **Expectation** section of readme.md.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

import re
import math

from airbnb_prediction.config import data_dir_raw
from airbnb_prediction import preprocess, objects

pd.set_option('display.max_columns', None)

In [2]:
df = pd.read_csv(data_dir_raw / 'listings.csv')

In [None]:
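# Values used to fill missing data at the end of the notebook.
# 'days_since_host' is only created in section 2.3.2, so its mode is derived
# here directly from 'host_since' to avoid a KeyError.
fillna_dict = {
    'host_response_time': 'no_info',
    'host_is_superhost': df['host_is_superhost'].mode()[0],
    'bedrooms': df['bedrooms'].mode()[0],
    'beds': df['beds'].mode()[0],
    'days_since_host': (
        pd.to_datetime('today') - pd.to_datetime(df['host_since'])
    ).dt.days.mode()[0]
}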

### 1. Basic Dataframe Info

We can check the dataframe's shape to see how many rows and columns we have. The .head() method gives us a glimpse of the data we are about to work with.

In [None]:
df.shape

In [None]:
df.head(3)

### 2. Exploratory Data Analysis

Here we check how the data behaves. A good machine learning model depends heavily on how well we understand our data and make use of it.

#### 2.1 Missing Data

Checking how much data is missing for each variable.

In [None]:
# Plotting Missing Values Info
preprocess.plot_missing_values(df)

As we can see, some columns in this dataset are entirely null. Since they carry no information, they can be dropped.

In [None]:
preprocess.dropping_empty_columns(df)
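
The helpers `plot_missing_values` and `dropping_empty_columns` belong to the project package and their implementation is not shown here. As a rough sketch of the underlying idea only, the share of missing values per column and the list of fully empty columns could be computed as follows. The toy dataframe and its values are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for the listings data (values are made up).
toy = pd.DataFrame({
    "price": ["$100.00", "$250.00", None, "$80.00"],
    "license": [None, None, None, None],  # entirely empty column
    "beds": [1.0, 2.0, 2.0, None],
})

# Fraction of missing values per column, highest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)

# Columns with no values at all are candidates for dropping.
empty_columns = missing_share[missing_share == 1.0].index.tolist()
toy_clean = toy.drop(columns=empty_columns)
```

Whether the project helper drops columns in place or returns a new dataframe is not visible from this notebook, so the sketch returns a new one.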

#### 2.2 Target Variable: Price
Price comes as a string variable with special characters. Because of this, some cleaning must be done before any kind of exploration.

In [None]:
# Converting Price to Integer
df['price'] = df['price'].apply(preprocess.convert_price_to_int)
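
The body of `convert_price_to_int` is not shown in this notebook. A minimal sketch of such a conversion, assuming the raw values look like `$1,234.00` (an assumption about the format), might be the following. The name `parse_price` is hypothetical.

```python
import re


def parse_price(raw):
    """Hypothetical helper: turn a string like '$1,234.00' into the integer 1234."""
    # Keep digits and the decimal point; drop currency symbols and thousands separators.
    cleaned = re.sub(r"[^0-9.]", "", raw)
    return int(float(cleaned))


prices = [parse_price(p) for p in ["$1,234.00", "$80.00", "$15,000.00"]]
```

Truncating to an integer discards cents, which seems acceptable here since the target is later log-transformed anyway.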

Price has a heavily asymmetric distribution by its own nature, so we shall work with its log. This is not mandatory, but the log transformation reduces the asymmetry by compressing the data range, and the result can be tested against normality assumptions. The visualizations below show how the new variable behaves.

In [None]:
df['log_price'] = np.log1p(df['price'])

The **log_price** variable changes the data scale. With less extreme values, both the mean and the standard deviation are significantly lower. Regression models can be fitted on the log values and their predictions transformed back to prices with the inverse function, np.expm1, since the transform used is np.log1p.
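
A small round-trip check of the transform can make the back-transformation concrete. It is written with the standard-library `math` module instead of NumPy purely to keep it self-contained, and the price value is made up.

```python
import math

price = 350.0
log_price = math.log1p(price)      # log(1 + price), the transform used for log_price
recovered = math.expm1(log_price)  # exp(x) - 1, the exact inverse of log1p

# A plain exp() overshoots by exactly 1 because of the "+ 1" inside log1p.
naive = math.exp(log_price)
```

The same applies to model predictions: values predicted on the log scale would go through `np.expm1` rather than `np.exp` to land back on the price scale.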

MLflow can track both experiments, using either the log or the raw price as the target variable.

In [None]:
df['price'].describe()

In [None]:
df['log_price'].describe()

In [None]:
sns.displot(df['price'], bins=100)

In [None]:
sns.displot(df['log_price'], bins=100)

In [None]:
sns.boxplot(x=df['price'])

In [None]:
# A number of outliers are more explicit here.
sns.boxplot(x=df['log_price'])

#### 2.3 Feature Engineering

Using the data at hand, we can gather further insights and create new variables to help the model's predictions. Several new variables are built below.

##### 2.3.1 String Variables
String variables can tell us a number of things about our data. The most basic feature we can engineer from them is the number of characters each one has.

In [None]:
string_variables = [
    'name',
    'description',
    'neighborhood_overview',
    'host_about'   
]

In [None]:
# Create counting string variables from string_variables list and drop the former.
preprocess.count_characters_variables(df, string_variables)
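
The implementation of `count_characters_variables` is not shown. A minimal sketch of the idea, assuming that missing text counts as zero characters (consistent with the `count_... == 0` checks used later in this notebook), could look like this. The toy data is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["Cozy flat", None],
    "host_about": ["I love Rio", ""],
})

text_columns = ["name", "host_about"]
for column in text_columns:
    # Missing text counts as zero characters.
    toy[f"count_{column}"] = toy[column].fillna("").str.len()

# Drop the original text columns, keeping only the counts.
toy = toy.drop(columns=text_columns)
```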

In [None]:
for variable in string_variables:
    print(df['count_{}'.format(variable)].describe())

- **count_description** has a seemingly uniform distribution, aside from the max value, which is probably the character limit.

- **count_name** has a rather unusual, almost bimodal distribution. Most of its data lies within the 0-39 count range, with some outliers (check the original variable to see these entries).

- Over 51% of the **host_about** variable is missing. People tend to write little about themselves: the distribution of **count_host_about** decreases as the length rises.

- Something similar happens with **count_neighborhood_overview**: over 46% of its data is missing and it has a decreasing distribution.

Some hypotheses:

- Longer texts result in higher prices.
- If the host describes themselves or provides a neighborhood overview (binary variables), the price is higher.

In [None]:
sns.displot(df['count_description'])

In [None]:
sns.displot(df['count_name'])

#### - Comparing Full vs. Filtered Variable: 'count_neighborhood_overview'

In [None]:
# Over 46% of the neighborhood_overview variable is missing.
missing_neighborhood_overview = df.query('count_neighborhood_overview==0').shape[0]
round(missing_neighborhood_overview/df.shape[0], 3)

In [None]:
sns.displot(df['count_neighborhood_overview'])

In [None]:
sns.displot(df[df['count_neighborhood_overview']>0]['count_neighborhood_overview'])

#### - Comparing Full vs. Filtered Variable: 'count_host_about'

In [None]:
# Over 51% of the host_about variable is missing.
missing_host_about = df.query('count_host_about==0').shape[0]
round(missing_host_about/df.shape[0], 3)

In [None]:
sns.displot(df['count_host_about'])

In [None]:
# Filtering observations > 0
sns.displot(df[df['count_host_about']>0]['count_host_about'])

##### 2.3.2 Numerical Variables

Construction of some numerical variables

In [None]:
# Time Since Registered as Host
# Still, we have 24 missing values for this variable.
# Check whether these are new hosts by crossing with the number of reviews and other variables.
df['days_since_host'] = (pd.to_datetime('today')-pd.to_datetime(df['host_since'])).dt.days

Bathrooms is a string variable instead of an int. Because of that, it needs some cleaning, just like the **price** variable.

Definition for Half-Bath:
- a bathroom in a private home that contains a toilet and sink but no bathtub or shower. If it's just a half bath for guests, a nice sink, a sturdy toilet, and a decorative towel rack will do.

By this definition we can make two new variables: A binary variable for the presence of a half-bath and a count variable for the number of bathrooms.

In [None]:
df['bathrooms_text'].unique()

In [None]:
# Create "bathroom_text_clean" variable containing only "bathrooms_text"'s numerical part.
df['bathroom_text_clean'] = preprocess.extract_numbers(df, 'bathrooms_text', fillna=True)

# Create the "bathrooms" count variable (integer part of the number, 0 when missing).
df['bathrooms'] = np.where(df['bathroom_text_clean'].notnull(),
                           np.floor(df['bathroom_text_clean'].astype(float)), 0)
# A non-alphanumeric value such as "1.5" contains a decimal point, which signals a half-bath.
df['half_bath'] = np.where(df['bathroom_text_clean'].str.isalnum()==False, 1, 0)
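
To make the two derived variables concrete, here is a plain-Python sketch of the same idea applied to single values. The function name `parse_bathrooms` and the example strings are assumptions for illustration; the actual `extract_numbers` helper is not shown and may behave differently, in particular for entries that carry no number.

```python
import math
import re


def parse_bathrooms(text):
    """Hypothetical helper: return (full_bathrooms, half_bath_flag) for one text value."""
    if text is None:
        return 0, 0
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        # Entries without any number: flag a half-bath only if the text says so.
        return 0, int("half" in text.lower())
    value = float(match.group())
    # The integer part counts full bathrooms; a fractional part signals a half-bath.
    return math.floor(value), int(value != math.floor(value))


parsed = [parse_bathrooms(t) for t in ["1 bath", "1.5 shared baths", "Half-bath", None]]
```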

We can calculate the delta between the minimum and maximum nights a person can spend at that place.

In [None]:
df['delta_nights'] = preprocess.creating_delta_variable(df, 'minimum_nights', 'maximum_nights')

In [None]:
df[['minimum_nights', 'maximum_nights', 'delta_nights']]

Mean number of reviews: number of reviews divided by the time span between the first and last review.

In [None]:
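df['delta_date_reviews'] = preprocess.creating_delta_date_variable(df, 'first_review', 'last_review')
# Divide by the review time span, not by the review count itself.
# Assumes delta_date_reviews is numeric (e.g. days); the +1 avoids division by zero.
df['mean_reviews'] = df['number_of_reviews']/(df['delta_date_reviews'].fillna(0)+1)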
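
As a self-contained illustration of the reviews-per-day idea, the sketch below computes the time span in days directly with pandas. It assumes the span is measured in days; the project helper `creating_delta_date_variable` is not shown, and the toy values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "first_review": ["2020-01-01", "2021-06-01", None],
    "last_review": ["2020-01-11", "2021-06-01", None],
    "number_of_reviews": [22, 1, 0],
})

# Days elapsed between the first and the last review (NaN when there are no reviews).
delta_days = (
    pd.to_datetime(toy["last_review"]) - pd.to_datetime(toy["first_review"])
).dt.days

# Adding 1 to the span avoids dividing by zero when both reviews fall on the same day.
toy["mean_reviews"] = toy["number_of_reviews"] / (delta_days.fillna(0) + 1)
```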

In [None]:
sns.displot(df['mean_reviews'])

##### 2.3.3 Categorical Variables

Converting variables to categorical, imputing missing values, and some further feature engineering.

The 'neighbourhood_cleansed' variable has 151 unique entries. Such high cardinality may add little information to the model. However, it is well known that the city's regions work like clusters, with similarities among their own neighborhoods.

We can group all neighborhoods into Zones, lowering 151 classes to only 4.

In [None]:
df['regioes'] = preprocess.creating_zones(df)
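
The mapping used by `creating_zones` is not shown in this notebook. The sketch below only illustrates the lookup pattern with a handful of neighborhoods. Apart from `zona_sul`, which appears later in this notebook, the zone labels and the `unknown` fallback are assumptions.

```python
# Illustrative lookup: only a few of the 151 neighborhoods are listed,
# and every label except "zona_sul" is an assumed name.
ZONE_BY_NEIGHBORHOOD = {
    "Copacabana": "zona_sul",
    "Ipanema": "zona_sul",
    "Barra da Tijuca": "zona_oeste",
    "Recreio dos Bandeirantes": "zona_oeste",
    "Tijuca": "zona_norte",
    "Centro": "zona_central",
}


def to_zone(neighborhood, default="unknown"):
    """Hypothetical helper: map a neighborhood to its zone, with a fallback label."""
    return ZONE_BY_NEIGHBORHOOD.get(neighborhood, default)


zones = [to_zone(n) for n in ["Ipanema", "Tijuca", "Some Unlisted Place"]]
```

On a dataframe, the same dictionary could be applied with `df['neighbourhood_cleansed'].map(ZONE_BY_NEIGHBORHOOD).fillna('unknown')`.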

Now that we have the grouped neighborhoods, we can do a quick analysis against the target variable.
One hypothesis is that Zona Sul and Zona Oeste (Barra, Recreio, etc.) have higher prices than the other regions.

However, as we can see below, the region variable is highly unbalanced. Since Zona Sul has over 60% of the observations, we could binarize the variable as well; judging by the statistics below, though, that might not be a good idea, as the zones have distinct location statistics (mean, median, quartiles).

In [None]:
# Zona Sul covers more than 60% of the rows.
round(df[df['regioes']=='zona_sul'].shape[0]/df.shape[0], 4)

As one can see below, Zona Sul does not have the highest mean price among the regions. However, we might not be capturing the "true" statistics of the other regions due to their small sample sizes.

In [None]:
df.groupby('regioes')['price'].describe()

In [None]:
preprocess.plot_configuration()
sns.boxplot(x="regioes", y="log_price", data=df)

Missing values in 'host_response_time' can be filled with a "no_info" category. Since there are many missing values, mode imputation may not be the best alternative here: assuming that all 8k missing entries are "within an hour" responses is too strong. Missing values more likely indicate hosts who never respond at all, because **host_response_rate** is missing for these cases as well.

In [None]:
df['host_response_time'] = df['host_response_time'].fillna('no_info')
df['host_response_time'].value_counts(dropna=False)

In [None]:
# The variable is mostly constant here.
df['host_response_rate'] = df['host_response_rate'].str.slice(0,-1)
df['host_response_rate'] = df['host_response_rate'].fillna(0)
df['host_response_rate'] = df['host_response_rate'].astype(int)
df[df['host_response_rate']>0]['host_response_rate'].describe()

Property type has many categories, so we can regroup it.
The "Entire apartment" entry accounts for almost 50% of the rows.

In [None]:
df['property_type'].nunique()

In [None]:
df['property_type_refactor'] = preprocess.creating_property_type_refactor(df)
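
How `creating_property_type_refactor` groups the categories is not shown. One common approach, sketched here as an assumption rather than the project's actual rule, keeps the frequent categories and collapses the rare ones into "others". The threshold and the toy values are made up.

```python
from collections import Counter

property_types = [
    "Entire apartment", "Entire apartment", "Private room in apartment",
    "Entire apartment", "Boat", "Castle", "Private room in apartment",
]


def collapse_rare(values, min_share=0.2, other_label="others"):
    """Hypothetical helper: replace categories below min_share with other_label."""
    counts = Counter(values)
    total = len(values)
    frequent = {value for value, count in counts.items() if count / total >= min_share}
    return [value if value in frequent else other_label for value in values]


refactored = collapse_rare(property_types)
```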

In [None]:
# However, the "others" category clearly needs more refining due to its larger std.
df.groupby('property_type_refactor')['price'].describe()

We can create a binary variable indicating whether the host lives in RJ. As one can see below, the price distributions of the two groups are pretty similar; still, a tree-based model may capture non-linear patterns involving this variable.

In [None]:
df['is_host_rj'] = preprocess.creating_host_location(df)
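
The logic inside `creating_host_location` is not shown. A minimal sketch, assuming the flag is derived from a free-text host location field by a simple substring match (both the source field and the matching rule are assumptions), could be:

```python
def host_lives_in_rj(host_location):
    """Hypothetical helper: 1 if the location text mentions Rio de Janeiro, else 0."""
    if not host_location:
        # Missing or empty locations are treated as "not in RJ".
        return 0
    # Note: this also matches the state of Rio de Janeiro, not only the city.
    return int("rio de janeiro" in host_location.lower())


flags = [
    host_lives_in_rj(loc)
    for loc in ["Rio de Janeiro, State of Rio de Janeiro, Brazil", "São Paulo, Brazil", None]
]
```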

In [None]:
df.groupby('is_host_rj')['price'].describe()

In [None]:
preprocess.plot_configuration()
sns.boxplot(x=df['is_host_rj'], y=df['log_price'])

In [None]:
# Dropping Unused Variables

In [None]:
df.drop(objects.to_drop, axis=1, inplace=True)

In [None]:
df.fillna(fillna_dict, inplace=True)