### __BUSA3020 Group Assignment - Predicting Airbnb Listing Prices in Sydney__

--- 

**Due Date:** Friday, 3 June 2022 @ 11.59pm (Week 13)

**Overview:**   

- In the group assignment you will form a team of up to 3 students (minimum 2) and participate in a forecasting competition on Kaggle
- The goal is to predict the listed prices of Airbnb stays from various listing characteristics, using regression models

- You will:  
    - Write a problem statement and perform Exploratory Data Analysis  
    - Clean up data, deal with categorical features and missing observations, and create new variables (feature engineering)  
    - Construct and tune forecasting models, produce forecasts and submit your predictions to Kaggle  
    - Each member of the team will record a video presentation of their work  
    - Marks will be awarded for producing a prediction in the top 3 positions of your unit as well as for reaching the highest ranking on Kaggle amongst all teams.

**Instructions:** 

- Form a team of 3 students (minimum 2 students)  
- Each team member needs to join [https://www.kaggle.com](https://www.kaggle.com/)  
- Choose a team leader and form a team in the competition [https://www.kaggle.com/t/caad5fd1f5134d86a15ab13d37d98d19](https://www.kaggle.com/t/caad5fd1f5134d86a15ab13d37d98d19)
    - Team leader to click on `team` and join and invite other team members to join
    - There are two MQBS BUSA units competing in this competition
    - Your **team's name must start** with your unit code, for instance you could have a team called BUSA3020_PR3D1CT0RS
- All team members should work on all the tasks listed below however   
    - **Choose a team member who will be responsible for one of each of the 3 tasks listed below**    

**Marks**: 

- Total Marks: 40
- Your mark will consist of:  
    - 50% x overall assignment mark + 45% x mark for the task that you are responsible for + 5% x mark received from your teammates for your effort in group work 
- 7 marks will be deducted from each Task for which there is no video presentation 

**Competition Marks:**
- 5 marks: Ranking in the top 3 places of your unit on Kaggle (make sure you name your team as instructed above)
- 2 marks: Reaching the first place in your unit  (make sure you name your team as instructed above)


**Submissions:**  

1. On Kaggle: submit your team's forecast in order to be ranked by Kaggle
    - You can do this as many times as necessary while building your model  
2. On iLearn **only team leader to submit** this Jupyter notebook re-named `Group_Assignment_MQ_ID.ipynb` where MQ_ID is team leader's MQ ID number 
    - The Jupyter notebook must contain team members names/ID numbers, and team name in the competition
    - Provide answers to the 3 Tasks below in the allocated cells including all codes/outputs/writeups 
    - One 15 minute video recording of your work 
        - Each team member to provide a 5 minute presentation of the Task that they led (it is best to jointly record your video using Zoom)
        - When recording your video make sure your face is visible, that you share your Jupyter Notebook and explain everything you've done in the submitted Jupyter notebook on screen
        - 7 marks will be deducted from each Task for which there is no video presentation or if you don't follow the above instructions
        
3. On iLearn each student needs to submit a file with their teammates' names, SID and a mark for their group effort (out of 100%)



---

**Fill out the following information**

For each team member provide name, Student ID number and which task is performed below

- Team Name on Kaggle: `(insert here)`
- Team Leader and Team Member 1: `(insert here)`
- Team Member 2: `(insert here)`
- Team Member 3: `(insert here)`

---

## Task 1: Problem Description and Initial Data Analysis

1. Read the Competition Overview on Kaggle [https://www.kaggle.com/t/caad5fd1f5134d86a15ab13d37d98d19](https://www.kaggle.com/t/caad5fd1f5134d86a15ab13d37d98d19)
2. Referring to the Competition Overview and the data provided on Kaggle, write a **Problem Description** of about 500 words, focusing on key points that will need to be addressed as first steps in Tasks 2 and 3 below, using the following headings:
    - Forecasting Problem
    - Evaluation Criteria
    - Types of Variables/Features
    - Data summary and main data characteristics
    - Missing Values (only explain what you found at this stage)
    
Total Marks: 11


In [1]:
import json
import ast

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, LabelBinarizer
from IPython import display
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")

# load data
train_df = pd.read_csv(
    "busa-2022s1/train.csv",
    index_col="ID",
    parse_dates=["host_since", "first_review", "last_review"],
)
test_df = pd.read_csv(
    "busa-2022s1/test.csv",
    index_col="ID",
    parse_dates=["host_since", "first_review", "last_review"],
)

In [2]:
df = pd.concat([train_df, test_df], keys=['train', 'test'])

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 10000 entries, ('train', 0) to ('test', 9999)
Data columns (total 60 columns):
 #   Column                                        Non-Null Count  Dtype         
---  ------                                        --------------  -----         
 0   name                                          9998 non-null   object        
 1   description                                   9846 non-null   object        
 2   neighborhood_overview                         6943 non-null   object        
 3   host_name                                     10000 non-null  object        
 4   host_since                                    10000 non-null  datetime64[ns]
 5   host_location                                 9996 non-null   object        
 6   host_about                                    6110 non-null   object        
 7   host_response_time                            6821 non-null   object        
 8   host_response_rate                            

In [4]:
df.head().T

Unnamed: 0_level_0,train,train,train,train,train
ID,0,1,2,3,4
name,Manly Harbour House,Unique Designer Rooftop Apartment in City Loca...,"Studio Yindi @ Mosman, Sydney","2br Eclectic Stylish Home, 2 mins to Bondi Beach",A little bit of Sydney - Australia
description,"Beautifully renovated, spacious and quiet, our...",Penthouse living at it best ... You will be st...,"An open plan apartment, which opens onto a spa...","Two blocks to the beach, surf and coffee. Larg...","Hello Everyone,<br /><br />We have a quiet are..."
neighborhood_overview,Balgowlah Heights is one of the most prestigio...,The location is really central and there is nu...,"Mosman is a smart, middle to upper class subur...",3 minutes to the beach and cafes. 5 minutes t...,
host_name,Heidi,Morag,John,Eilish,Bryan
host_since,2009-11-20 00:00:00,2009-12-03 00:00:00,2010-11-06 00:00:00,2010-11-25 00:00:00,2011-01-03 00:00:00
host_location,"Sydney, New South Wales, Australia","Sydney, New South Wales, Australia","Sydney, New South Wales, Australia","Bondi Beach, New South Wales, Australia","Sydney, New South Wales, Australia"
host_about,I am a Canadian who has made Australia her hom...,I am originally Scottish but I have made Sydne...,Faber est suae quisquae fortunae\r\n\r\nWe bec...,"I'm a designer. I have lived in many cities, c...",We are living in Sydney. We like to see Wine r...
host_response_time,within a few hours,within an hour,within a few hours,within a day,within an hour
host_response_rate,100%,100%,100%,100%,100%
host_acceptance_rate,69%,100%,81%,100%,89%


In [5]:
# Forecasting Problem: regression on prices of Airbnb stays based on various Airbnb characteristics

- Forecasting Problem:
Setting the rental price for an Airbnb listing is difficult for both sides of the market. Hosts want a price that is competitive without sacrificing profit, while customers want to understand the main factors that drive prices so that they can find comparably priced places. The objective of this project is to forecast Airbnb listing prices in Sydney from the characteristics of the listed properties. To predict the price, we will fit several machine learning models, including linear regression and random forest, using the scikit-learn library in Python.

In [6]:
# Evaluation Criteria: MSE

- Evaluation Criteria:
This is a supervised regression problem, so root mean squared error (RMSE) and mean absolute error (MAE) are used as the selection criteria for comparing the performance of the models. The Kaggle score itself is reported as mean squared error (MSE); RMSE is its square root, so both rank models in the same order.
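
As a minimal sketch of the two criteria named above, the functions below compute RMSE and MAE with NumPy. The function names and the example prices are hypothetical and are not taken from the competition data.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


# Hypothetical prices, for illustration only
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

rmse_value = rmse(actual, predicted)
mae_value = mae(actual, predicted)
```

Because RMSE squares the residuals before averaging, it penalises the single large error (30) more heavily than MAE does, which is why the two criteria can disagree when a model makes a few large mistakes.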

- Data Presentation:
The data set used in this study records Airbnb listings in Sydney from 2010 to 2019 and contains a total of 10,000 observations across the training and test sets. The training set consists of 7,000 entries and 60 variables, covering housing characteristics, customer feedback, price and other features. The test set contains 3,000 observations and 59 variables, because it does not include the target, `price`.

In [7]:
# check types of variables/features
col_dtype_dict = {
    dtype: df.select_dtypes(dtype).columns
    for dtype in ["int", "float", "object"]
}

In [8]:
col_dtype_dict

{'int': Index(['accommodates', 'minimum_nights', 'maximum_nights', 'availability_30',
        'availability_60', 'availability_90', 'number_of_reviews',
        'number_of_reviews_ltm', 'number_of_reviews_l30d',
        'calculated_host_listings_count',
        'calculated_host_listings_count_entire_homes',
        'calculated_host_listings_count_private_rooms',
        'calculated_host_listings_count_shared_rooms'],
       dtype='object'),
 'float': Index(['host_listings_count', 'latitude', 'longitude', 'bedrooms', 'beds',
        'minimum_minimum_nights', 'maximum_minimum_nights',
        'minimum_maximum_nights', 'maximum_maximum_nights',
        'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'availability_365',
        'review_scores_rating', 'review_scores_accuracy',
        'review_scores_cleanliness', 'review_scores_checkin',
        'review_scores_communication', 'review_scores_location',
        'review_scores_value', 'reviews_per_month'],
       dtype='object'),
 'object

- Types of Variables/Features:
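The data set contains both numeric and object variables. The numeric variables (integer and float) capture measurable attributes of a listing, such as the number of guests accommodated, bedrooms, beds, availability and review scores. Object features, by contrast, hold text such as the listing description and further information about the host and the neighbourhood. Some object features, such as `host_response_rate` (for example `100%`), contain numeric information stored as text. `price` does not appear among the integer or float columns above either, so it also needs to be converted before modelling.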

- Data summary and main data characteristics:
The following table gives summary statistics for the numeric variables. There are 33 numeric variables (13 integer and 20 float) out of a total of 60 features.

In [11]:
df.describe()

Unnamed: 0,host_listings_count,latitude,longitude,accommodates,bedrooms,beds,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,...,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
count,10000.0,10000.0,10000.0,10000.0,9162.0,9631.0,10000.0,10000.0,9945.0,10000.0,...,9155.0,9148.0,9156.0,9148.0,9146.0,10000.0,10000.0,10000.0,10000.0,9420.0
mean,12.8712,-33.853093,151.194656,3.6298,1.77745,2.171633,43.5017,828.8222,42.506083,43.9913,...,4.623815,4.835223,4.832469,4.831458,4.635984,9.9163,8.1517,1.6478,0.0544,0.797516
std,38.223115,0.087999,0.100407,2.28063,1.070244,1.64961,48.756685,451.594808,48.739458,48.155792,...,0.579326,0.402332,0.418854,0.349528,0.505872,25.341076,23.494082,9.719661,0.638656,1.151838
min,0.0,-34.10068,150.63049,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.01
25%,1.0,-33.895832,151.167258,2.0,1.0,1.0,2.0,365.0,2.0,3.0,...,4.5,4.84,4.84,4.8,4.5225,1.0,1.0,0.0,0.0,0.08
50%,2.0,-33.877735,151.212825,3.0,1.0,2.0,7.0,1125.0,7.0,10.0,...,4.81,4.98,4.99,4.95,4.76,1.0,1.0,0.0,0.0,0.33
75%,5.0,-33.814463,151.261212,4.0,2.0,3.0,90.0,1125.0,90.0,90.0,...,5.0,5.0,5.0,5.0,4.98,4.0,2.0,1.0,0.0,1.06
max,457.0,-33.39267,151.34041,16.0,18.0,39.0,1125.0,1162.0,1125.0,1125.0,...,5.0,5.0,5.0,5.0,5.0,197.0,197.0,100.0,17.0,24.27


In [12]:
# check the missing values in the dataset
missing_df = df.groupby(level=0).apply(lambda x: x.isnull().sum()).T
missing_df = missing_df.loc[missing_df.sum(axis=1) > 0]

In [15]:
missing_df["all"] = missing_df.sum(axis=1)

In [18]:
missing_df.sort_values(by="all", ascending=False)

Unnamed: 0,test,train,all
license,2393,1868,4261
host_neighbourhood,1263,2912,4175
host_about,1432,2458,3890
host_response_time,2534,645,3179
host_response_rate,2534,645,3179
neighborhood_overview,1240,1817,3057
neighbourhood,1240,1816,3056
host_acceptance_rate,2285,745,3030
price,3000,0,3000
review_scores_value,643,211,854


- Missing Values:
This section assesses the quality of the data set by analysing its missing values. The table above shows the ten variables with the largest number of missing values, split by training and test set.

As the table above shows, the five variables with the most missing values are all object features, and most of them have too many unique values to serve as useful categorical variables. As a result, we will drop the missing values in object variables that are not usable as categorical variables. Missing values in `host_acceptance_rate` also account for a large proportion of observations, notably in the test set (2,285 of 3,000 rows), so the choice between imputing and removing them should be considered carefully. Finally, the 3,000 missing values in `price` are 30% of the combined data set, but they all fall in the test set: `price` is the target to be predicted there, so these values are not imputed.
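
To make the proportions discussed above easier to read, the missing counts can be turned into shares within each split. The sketch below uses a small hypothetical frame (`toy_train`, `toy_test`) built the same way as `df`, with the `train`/`test` keys on the first index level; the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the combined frame: 4 training rows, 2 test rows
toy_train = pd.DataFrame({
    "price": [100.0, 150.0, 80.0, 120.0],
    "host_about": ["a", None, "b", None],
})
toy_test = pd.DataFrame({
    "price": [np.nan, np.nan],   # the target is absent from the test split
    "host_about": [None, "c"],
})
toy = pd.concat([toy_train, toy_test], keys=["train", "test"])

# The mean of a boolean mask is the fraction missing, computed within each split
missing_share = toy.isnull().groupby(level=0).mean().T
```

Here `missing_share.loc["price", "test"]` is 1.0 and `missing_share.loc["price", "train"]` is 0.0, which separates a target that is missing by design from a feature that genuinely needs imputing.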

---

## Task 2: Data Cleaning, Missing Observations and Feature Engineering
- In this task you will follow a set of instructions/questions listed below.
- Make sure you **explain** each step you do both in Markdown text and on your video.
    - Do not just read out your commands without explaining what they do and why you used them 

Total Marks: 11

**Task 2, Question 1**: Clean **all** numerical features and the target variable `price` so that they can be used in training algorithms. For instance, `host_response_rate` feature is in object format containing both numerical values and text. Extract numerical values (or equivalently eliminate the text) so that the numerical values can be used as a regular feature.  
(2 marks)

In [11]:
## Task 2, Question 1 Code Here

`(Task 2, Question 1 Text Here - insert more cells as required)`
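
One possible approach to Question 1 is sketched below, not as a required solution. It assumes percentages are stored as strings such as `69%` (as seen in `df.head()`) and, as a labelled assumption, that `price` is stored as a currency string such as `$1,250.00`. The helper names `percent_to_float` and `currency_to_float` are hypothetical.

```python
import pandas as pd


def percent_to_float(series):
    """Turn strings such as '69%' into floats such as 69.0; missing stays NaN."""
    return pd.to_numeric(series.str.rstrip("%"), errors="coerce")


def currency_to_float(series):
    """Strip '$' and thousands separators, then convert to float."""
    stripped = series.str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(stripped, errors="coerce")


# Hypothetical raw values; the price format is an assumption
raw = pd.DataFrame({
    "host_response_rate": ["100%", "69%", None],
    "price": ["$1,250.00", "$85.00", None],
})

clean = raw.assign(
    host_response_rate=percent_to_float(raw["host_response_rate"]),
    price=currency_to_float(raw["price"]),
)
```

Using `errors="coerce"` turns anything that cannot be parsed into `NaN`, so unexpected text shows up as a missing value to be handled in Question 3 rather than raising an error.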

**Task 2, Question 2** Create at least 4 new features from existing features which contain multiple items of information, e.g. creating `email`,  `phone`, `reviews`, `jumio`, etc. from feature `host_verifications`.  
(2 marks)

In [12]:
## Task 2, Question 2 Code Here

`(Task 2, Question 2 Text Here - insert more cells as required)`
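
A hedged sketch of one way to approach Question 2 follows. It assumes each `host_verifications` cell is a Python-style list literal stored as text (for example `['email', 'phone']`); the actual format should be checked in the data first. The helper `parse_list` and the `verified_` column prefix are hypothetical names.

```python
import ast

import pandas as pd

# Assumed raw format: each cell is a list literal stored as text
raw = pd.Series(
    ["['email', 'phone', 'reviews']", "['phone', 'jumio']", None],
    name="host_verifications",
)


def parse_list(cell):
    """Parse a list literal; treat missing or malformed cells as an empty list."""
    if not isinstance(cell, str):
        return []
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []
    return value if isinstance(value, list) else []


parsed = raw.apply(parse_list)

# One 0/1 indicator column per verification method
methods = ["email", "phone", "reviews", "jumio"]
flags = pd.DataFrame({
    f"verified_{m}": parsed.apply(lambda items, m=m: int(m in items))
    for m in methods
})
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for text read from a file.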

**Task 2, Question 3**: Impute missing values for all features in both training and test datasets.   
(2 marks)

In [13]:
## Task 2, Question 3 Code Here

`(Task 2, Question 3 Text Here - insert more cells as required)`
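
As an illustration for Question 3, the sketch below fills numeric columns with the training median and other columns with the training mode. Median and mode are one simple choice among several, not the method prescribed by the unit, and the toy frames are hypothetical. The fill values are learned from the training rows only and then applied to both sets, so no information from the test set leaks into training.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature training and test sets
toy_train = pd.DataFrame({
    "bedrooms": [1.0, 2.0, np.nan, 3.0],
    "host_response_time": ["within an hour", None, "within an hour", "within a day"],
})
toy_test = pd.DataFrame({
    "bedrooms": [np.nan, 4.0],
    "host_response_time": [None, "within a day"],
})

# Learn one fill value per column from the training rows only
fill_values = {}
for col in toy_train.columns:
    if pd.api.types.is_numeric_dtype(toy_train[col]):
        fill_values[col] = toy_train[col].median()
    else:
        fill_values[col] = toy_train[col].mode().iloc[0]

# Apply the same fill values to both sets
train_filled = toy_train.fillna(fill_values)
test_filled = toy_test.fillna(fill_values)
```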

**Task 2, Question 4**: Encode all categorical variables appropriately as discussed in class. 


Where a categorical feature contains more than 5 unique values, map the feature into its 5 most frequent values + 'other' and then encode appropriately. For instance, you could group and then map `property_type` into 5 basic types + 'other': [entire rental unit, private room, entire room, entire townhouse, shared room, other] and then encode.  
(2 marks)

In [14]:
## Task 2, Question 4 Code Here

`(Task 2, Question 4 Text Here - insert more cells as required)`
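
The "5 most frequent values + other" mapping described in Question 4 could be sketched as follows. The helper `keep_top_categories` and the category counts are hypothetical, and one-hot encoding via `pd.get_dummies` is only one of the encodings that may have been discussed in class.

```python
import pandas as pd


def keep_top_categories(series, n=5, other="other"):
    """Keep the n most frequent values and map everything else to `other`."""
    top = series.value_counts().nlargest(n).index
    return series.where(series.isin(top), other)


# Hypothetical property types with seven distinct values
property_type = pd.Series(
    ["entire rental unit"] * 6
    + ["private room"] * 5
    + ["entire room"] * 4
    + ["entire townhouse"] * 3
    + ["shared room"] * 2
    + ["boat", "tent"],
    name="property_type",
)

grouped = keep_top_categories(property_type)

# One-hot encode the six remaining levels
encoded = pd.get_dummies(grouped, prefix="property_type")
```

Note that missing values are also mapped to `other` by this helper, and that in practice the top categories would be chosen from the training rows and then applied unchanged to the test rows.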

**Task 2, Question 5**: Perform any other actions you think need to be done on the data before constructing predictive models, and clearly explain what you have done.   
(1 mark)

In [15]:
## Task 2, Question 5 Code Here

`(Task 2, Question 5 Text Here - insert more cells as required)`

**Task 2, Question 6**: Perform exploratory data analysis to measure the relationship between the features and the target and write up your findings. 
(2 marks)

In [16]:
## Task 2, Question 6 Code Here

`(Task 2, Question 6 Text Here - insert more cells as required)`
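
One simple starting point for Question 6 is to rank numeric features by their correlation with the target. The sketch below uses synthetic data (the relationship between `accommodates` and `price` is invented for illustration) and Spearman correlation, which is an assumption chosen here because it is less sensitive to skewed prices than Pearson correlation.

```python
import numpy as np
import pandas as pd

# Synthetic data: price depends on accommodates, not on noise_feature
rng = np.random.default_rng(0)
n = 200
accommodates = rng.integers(1, 9, size=n)
noise_feature = rng.normal(size=n)
price = 50 + 30 * accommodates + rng.normal(scale=10, size=n)

toy = pd.DataFrame({
    "accommodates": accommodates,
    "noise_feature": noise_feature,
    "price": price,
})

# Correlation of every feature with the target, strongest first
target_corr = (
    toy.corr(method="spearman")["price"]
    .drop("price")
    .sort_values(key=abs, ascending=False)
)
```

Correlation only captures monotonic relationships one feature at a time, so plots such as scatter plots or box plots by category remain useful alongside it.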

--- 
## Task 3: Fit and tune a forecasting model/Submit predictions/Report score and ranking

Make sure you **clearly explain each step** you do, both in text and on the recorded video.

1. Build a machine learning (ML) regression model taking into account the outcomes of Tasks 1 & 2
2. Fit the model and tune hyperparameters via cross-validation: make sure you comment and explain each step clearly
3. Create predictions using the test dataset and submit your predictions on Kaggle's competition page
4. Provide Kaggle ranking and **score** (screenshot your best submission) and comment
5. Make sure your Python code works, so that a marker can replicate all of your results and obtain the same MSE from Kaggle

- Hint: to perform well you will need to iterate Task 3, building and tuning various models in order to find the best one.

Total Marks: 11

In [10]:
#Task 3 code here

`(Task 3 - insert more cells as required)`
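
As a rough sketch of steps 1 and 2 of Task 3, the block below tunes a random forest with 5-fold cross-validation using scikit-learn's `GridSearchCV`. The data are synthetic, and the model, grid values and seeds are illustrative assumptions rather than recommended settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for the cleaned feature matrix and target
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=120)

search = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid={"max_depth": [2, 6], "min_samples_leaf": [1, 5]},
    scoring="neg_mean_squared_error",  # scikit-learn maximises, hence the sign
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

best_params = search.best_params_
cv_mse = -search.best_score_  # flip the sign back to a positive MSE
```

Fixing `random_state` in both the model and the fold splitter keeps the results reproducible, which matters for step 5. After fitting, `search.predict` uses the best model refitted on all of the training data.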