### Business Analytics Group Assignment - Predicting Airbnb Listing Prices in Melbourne {-}

--- 

**Kaggle Competition Ends:** Friday, 2 June 2023 @ 3:00pm (Week 13)  
**Assignment Due Date on iLearn:** Friday, 2 June 2023 @ 11:59pm (Week 13)   

**Overview:**   

- In the group assignment you will form a team of 3 students and participate in a forecasting competition on Kaggle
- The goal is to predict listed prices of Airbnb properties in Melbourne based on various Airbnb characteristics and regression models
- Assessment Summary:  
    - Write a problem statement and perform Exploratory Data Analysis  
    - Clean up data, deal with categorical features and missing observations, and create new explanatory variables (feature engineering)  
    - Construct and tune forecasting models, produce forecasts and submit your predictions to Kaggle  
    - Each member of the team will record a video presentation of their work  
    - Marks will be awarded for producing a prediction in the top 5 positions of your unit, as well as for reaching the highest ranking on Kaggle amongst all teams.

**Instructions:** 

- Form a team of 3 students (minimum 2 students)  
- Each team member needs to join [https://www.kaggle.com](https://www.kaggle.com/)  
- Choose a team leader and form a team in the competition [https://www.kaggle.com/t/a154f28787174b628a2b7eaa238a5c12](https://www.kaggle.com/t/a154f28787174b628a2b7eaa238a5c12)
    - Team leader to click on `team` and join and invite other team members to join
    - Your **team's name must start** with your unit code, for instance you could have a team called BUSA8001_masterful_geniuses or BUSA3020_l33t 
- All team members should work on all the tasks listed below. However:   
    - For each of the 3 tasks listed below, choose a team member who will be responsible for it    
- Your predictions must be generated by a model you develop here 
    - You will receive a mark of zero if your code provided here does not produce the forecasts you submit to Kaggle

**Marks**: 

- Total Marks: 40
- Your individual mark will consist of:  
    - 50% x overall assignment mark + 45% x mark for the task that you are responsible for + 5% x mark received from your teammates for your effort in group work 

**Competition Marks:**  

- 1 mark: Ranking in the top 5 places of your unit on Kaggle (make sure you name your team as instructed above)   
- 2 marks: Reaching the first place in your unit (make sure you name your team as instructed above)   


**Submissions:**  

1. On Kaggle: submit your team's forecast in order to be ranked by Kaggle
    - You can do this as many times as necessary while building your model  
2. On iLearn **only team leader to submit** this Jupyter notebook re-named `Group_Assignment_Team_Name.ipynb` where Team_Name is your team's name on Kaggle   
    - The Jupyter notebook must contain team members names/ID numbers, and team name in the competition
    - Provide answers to the 3 Tasks below in the allocated cells including all codes/outputs/writeups 
    - One 15 minute video recording of your work 
        - Each team member to provide a 5 minute presentation of the Task that they led (it is best to jointly record your video using Zoom)
        - When recording your video make sure your face is visible, that you share your Jupyter Notebook and explain everything you've done in the submitted Jupyter notebook on screen
        - 5 marks will be deducted from each Task for which there is no video presentation or if you don't follow the above instructions
        
3. On iLearn each student needs to submit a file with their teammates' names, ID numbers and a mark for each teammate's group effort (out of 100%)



---

**Fill out the following information**

For each team member provide name, Student ID number and which task is performed below

- Team Name on Kaggle: `BUSA8001_superhosts`
- Team Leader and Team Member 1: `Felix Rosenberger`
- Team Member 2: `John Rizk`

---

## Task 1: Problem Description and Initial Data Analysis {-}

1. Read the Competition Overview on Kaggle [https://www.kaggle.com/t/a154f28787174b628a2b7eaa238a5c12](https://www.kaggle.com/t/a154f28787174b628a2b7eaa238a5c12)
2. Referring to the Competition Overview and the data provided on Kaggle, write a **Problem Description** of about 500 words, focusing on key points that will need to be addressed as first steps in Tasks 2 and 3 below, using the following headings:
    - Forecasting Problem - explain what you are trying to do and how it could be used in the real world (i.e. why it may be important)
    - Evaluation Criteria - explain the criteria used to assess forecast performance 
    - Types of Variables/Features
    - Data summary and main data characteristics
    - Missing Values (only explain what you found at this stage)
    - Hint: you should **not** discuss any specific predictive algorithms at this stage
    - Note: This task should be completed in a single Markdown cell (text box)
    
Total Marks: 12


In [3]:
# Task 1 code here
import pandas as pd

# read in data
trainpath = "train.csv"
train = pd.read_csv(trainpath)
testpath = "test.csv"
test = pd.read_csv(testpath)

# concatenate dataframes to reduce redundancies in operations
df = pd.concat([train, test])

In [4]:
# types of variables / features
df.dtypes

ID                                                int64
source                                           object
name                                             object
description                                      object
neighborhood_overview                            object
                                                 ...   
calculated_host_listings_count_entire_homes       int64
calculated_host_listings_count_private_rooms      int64
calculated_host_listings_count_shared_rooms       int64
reviews_per_month                               float64
price                                            object
Length: 61, dtype: object

In [5]:
# data summary and characteristics
df.describe()

| | ID | host_listings_count | latitude | longitude | accommodates | bedrooms | beds | minimum_nights | maximum_nights | minimum_minimum_nights | ... | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 9562.0 | 9916.0 | 10000.0 | 10000.0 | 9945.0 | ... | 9679.0 | 9678.0 | 9678.0 | 9678.0 | 9678.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 9737.0 |
| mean | 4999.5 | 19.5714 | -37.827168 | 145.031142 | 3.8431 | 1.794081 | 2.235075 | 4.4164 | 651.0998 | 3.95093 | ... | 4.677927 | 4.804996 | 4.82302 | 4.834347 | 4.666018 | 15.1365 | 10.4353 | 4.6234 | 0.0421 | 1.544088 |
| std | 2886.89568 | 65.232256 | 0.079428 | 0.173005 | 2.494735 | 1.05162 | 1.698655 | 22.24689 | 492.450292 | 18.7931 | ... | 0.424685 | 0.34369 | 0.3472 | 0.280267 | 0.399465 | 42.980386 | 23.16327 | 29.580092 | 0.388513 | 1.740338 |
| min | 0.0 | 1.0 | -38.22411 | 144.5178 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.01 |
| 25% | 2499.75 | 1.0 | -37.856305 | 144.956828 | 2.0 | 1.0 | 1.0 | 1.0 | 90.0 | 1.0 | ... | 4.57 | 4.76 | 4.8 | 4.8 | 4.58 | 1.0 | 1.0 | 0.0 | 0.0 | 0.39 |
| 50% | 4999.5 | 2.0 | -37.819212 | 144.97916 | 4.0 | 1.0 | 2.0 | 2.0 | 700.0 | 2.0 | ... | 4.8 | 4.91 | 4.93 | 4.91 | 4.76 | 2.0 | 1.0 | 0.0 | 0.0 | 1.04 |
| 75% | 7499.25 | 10.0 | -37.801015 | 145.044953 | 5.0 | 2.0 | 3.0 | 3.0 | 1125.0 | 3.0 | ... | 4.94 | 5.0 | 5.0 | 4.98 | 4.88 | 8.0 | 7.0 | 1.0 | 0.0 | 2.17 |
| max | 9999.0 | 951.0 | -37.48645 | 145.852974 | 16.0 | 14.0 | 22.0 | 1125.0 | 10000.0 | 1000.0 | ... | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 290.0 | 282.0 | 224.0 | 7.0 | 41.75 |


In [6]:
df.loc[:, df.isnull().any()]

Unnamed: 0,name,description,neighborhood_overview,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_neighbourhood,...,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,reviews_per_month,price
0,"The Stables, Richmond",Superbly located hotel style accommodation in ...,Richmond is a great neighbourhood. A beautifu...,"Melbourne, Australia",I'm a working mum who loves being able to shar...,within an hour,100%,98%,f,Richmond,...,2023-02-18,4.88,4.91,4.97,4.94,4.93,4.93,4.82,6.11,$132.00
1,Room in Cool Deco Apartment in Brunswick East,A large air conditioned room with firm queen s...,This hip area is a crossroads between two grea...,"Melbourne, Australia",As an artist working in animation and video I ...,within a few hours,100%,98%,f,Brunswick,...,2023-03-08,4.48,4.64,3.97,4.72,4.69,4.65,4.60,1.37,$39.00
2,The Suite @ Angelus Retreat,<b>The space</b><br />Welcome to ANGELUS Retre...,,"Melbourne, Australia",I have very special interests in Life and Life...,within a few hours,100%,78%,t,,...,2022-06-13,4.75,4.88,4.75,4.88,4.50,5.00,4.75,0.09,$270.00
3,Million Dollar Views Over Melbourne,<b>The space</b><br /><b>Enjoy Million Dollar ...,,"Melbourne, Australia",Professional couple who enjoy entertaining in ...,within a day,75%,92%,f,Southbank,...,2012-01-27,4.50,4.00,4.50,4.00,4.00,5.00,4.00,0.01,"$1,000.00"
4,Melbourne - Old Trafford Apartment,After hosting many guests from all over the wo...,Our street is quiet & secluded but within walk...,"Berwick, Australia",We are an active couple who work from home and...,within a few hours,100%,87%,t,,...,2023-03-03,4.86,4.91,4.98,4.91,4.93,4.90,4.87,1.43,$116.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2995,Comfy Bedroom with Private Bathroom,The apartment is located 180 meters away from ...,Hawthorn is a very family oriented suburb and ...,,I'm a fun and avid traveller. I've been expose...,,,,f,,...,2016-07-25,4.50,4.00,5.00,5.00,5.00,4.50,4.50,0.02,
2996,Light-Filled & Cosy in the Heart of South Yarra,"Footsteps from South Yarra Train station, this...","Just 3km from the heart of Melbourne, South Ya...",,,,,,f,,...,2023-01-15,4.21,4.50,4.29,4.74,4.64,4.74,4.19,0.86,
2997,New 6 bedrooms house in Williams Landing,Have fun with the whole family at this stylish...,,,"I am a travel consultant, love to meet differe...",within an hour,100%,92%,f,,...,,,,,,,,,,
2998,Comfortable bedroom in CBD,"In Melbourne CBD, within the free tram zone. v...",,"Melbourne, Australia",,,,,f,Central Business District,...,2020-02-25,3.00,4.00,3.00,2.33,3.33,4.00,3.33,0.08,


In [21]:
# missing values: count NaNs per column, then keep only columns with at least one
# (reset_index turns the Series into a DataFrame with columns 'index' and 0)
total_miss = df.isna().sum().reset_index()
total_miss.loc[total_miss[0] > 0]

| | index | 0 |
|---|---|---|
| 2 | name | 1 |
| 3 | description | 88 |
| 4 | neighborhood_overview | 3247 |
| 7 | host_location | 2050 |
| 8 | host_about | 3711 |
| 9 | host_response_time | 737 |
| 10 | host_response_rate | 737 |
| 11 | host_acceptance_rate | 721 |
| 12 | host_is_superhost | 2 |
| 13 | host_neighbourhood | 5526 |



In [39]:
df.columns[df.isnull().any()]

Index(['name', 'description', 'neighborhood_overview', 'host_location',
       'host_about', 'host_response_time', 'host_response_rate',
       'host_acceptance_rate', 'host_is_superhost', 'host_neighbourhood',
       'neighbourhood', 'neighbourhood_cleansed', 'property_type', 'room_type',
       'bathrooms', 'bedrooms', 'beds', 'minimum_minimum_nights',
       'maximum_maximum_nights', 'availability_365', 'first_review',
       'last_review', 'review_scores_rating', 'review_scores_accuracy',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'reviews_per_month', 'price'],
      dtype='object')

In [41]:
train.columns[train.isnull().any()]

Index(['name', 'description', 'neighborhood_overview', 'host_location',
       'host_about', 'host_acceptance_rate', 'host_neighbourhood',
       'neighbourhood', 'neighbourhood_cleansed', 'property_type', 'room_type',
       'bathrooms', 'bedrooms', 'beds', 'minimum_minimum_nights',
       'maximum_maximum_nights', 'availability_365', 'review_scores_accuracy',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value'],
      dtype='object')

In [42]:
test.columns[test.isnull().any()]

Index(['description', 'neighborhood_overview', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_neighbourhood', 'neighbourhood',
       'neighbourhood_cleansed', 'property_type', 'room_type', 'bedrooms',
       'beds', 'minimum_minimum_nights', 'maximum_maximum_nights',
       'availability_365', 'first_review', 'last_review',
       'review_scores_rating', 'review_scores_accuracy',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'reviews_per_month'],
      dtype='object')

`(Task 1, Text Here)`
### Forecasting Problem
The goal is to predict the daily rental prices of Airbnb listings located in Melbourne. These predictions could be used, for example, to estimate the rental income that a property with specific characteristics in a given suburb is expected to yield. This matters most in a business case where those cash flows would be used to pay off debt, because the expected income is then a critical input to the feasibility assessment.

### Evaluation Criteria
The criterion used to assess prediction performance is the root mean squared error (RMSE). It is the square root of the average squared difference between the predictions obtained by a model and the actual target values, so large errors are penalised more heavily than small ones. The lower the RMSE, the better the prediction quality. RMSE also has the advantage of being in the same unit as the predicted variable, which makes it easy to interpret.
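
For reference, RMSE over $n$ listings is $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$. The sketch below works through this on three made-up prices; the helper name `rmse` and the numbers are illustrative assumptions, not competition data.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical daily prices in dollars (not taken from the competition data)
actual = [132.0, 39.0, 270.0]
predicted = [120.0, 45.0, 300.0]

# Errors are 12, -6 and -30, so the mean squared error is (144 + 36 + 900) / 3 = 360
score = rmse(actual, predicted)
print(round(score, 2))  # 18.97
```

The single $30 miss contributes 900 of the 1080 total squared error, which shows how strongly RMSE reacts to a few badly predicted listings.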

### Types of Variables / Features


### Data Summary and Main Data Characteristics


### Missing Values



---

## Task 2: Data Cleaning, Missing Observations and Feature Engineering {-}
- In this task you will follow a set of instructions/questions listed below.
- Make sure you **explain** each step you do both in Markdown text and on your video.
    - Do not just read out your commands without explaining what they do and why you used them 

Total Marks: 12

**Task 2, Question 1**: Clean **all** numerical features and the target variable `price` so that they can be used in training algorithms. For instance, `host_response_rate` feature is in object format containing both numerical values and text. Extract numerical values (or equivalently eliminate the text) so that the numerical values can be used as a regular feature.  
(2 marks)

In [11]:
## Task 2, Question 1 Code Here


`(Task 2, Question 1 Text Here - insert more cells as required)`
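
One possible way to approach this step is sketched below on a toy frame that mimics the string formats visible in the earlier outputs (`100%`, `$1,000.00`). The helper `to_number` and the toy values are assumptions for illustration; whether to rescale percentages to the 0-1 range is a design choice, not a requirement of the task.

```python
import pandas as pd

# Toy frame mimicking the raw string formats (illustrative values only)
raw = pd.DataFrame({
    "host_response_rate": ["100%", "75%", None],
    "price": ["$132.00", "$1,000.00", None],
})

def to_number(series, strip_chars):
    """Remove each literal character in strip_chars, then convert to float.

    Missing or unparseable entries become NaN.
    """
    cleaned = series
    for ch in strip_chars:
        cleaned = cleaned.str.replace(ch, "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")

clean = pd.DataFrame({
    # percentages rescaled to fractions (optional)
    "host_response_rate": to_number(raw["host_response_rate"], "%") / 100,
    # drop the currency symbol and the thousands separator
    "price": to_number(raw["price"], "$,"),
})
print(clean.dtypes)
```

Using `regex=False` keeps `$` from being interpreted as a regular-expression anchor.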

**Task 2, Question 2** Create at least 4 new features from existing features which contain multiple items of information, e.g. creating `email`,  `phone`, `work_email`, etc. from feature `host_verifications`.  
(2 marks)

In [12]:
## Task 2, Question 2 Code Here

`(Task 2, Question 2 Text Here)`
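
A minimal sketch of splitting a multi-item feature into indicator columns follows. It assumes that `host_verifications` stores a Python-style list written as a string (for example `"['email', 'phone']"`); that format, the helper `parse_list` and the toy rows are assumptions and should be checked against the real column first.

```python
import ast
import pandas as pd

# Toy data; the list-as-string format is an assumption about the raw column
toy = pd.DataFrame({
    "host_verifications": [
        "['email', 'phone']",
        "['phone', 'work_email']",
        "[]",
        None,
    ],
})

def parse_list(value):
    """Parse a list written as a string; missing or malformed values give []."""
    if not isinstance(value, str):
        return []
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []
    return list(parsed) if isinstance(parsed, (list, tuple)) else []

parsed = toy["host_verifications"].apply(parse_list)

# One 0/1 indicator column per verification method
for method in ["email", "phone", "work_email"]:
    toy[method] = parsed.apply(lambda items, m=method: int(m in items))

print(toy[["email", "phone", "work_email"]])
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for text read from a file.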

**Task 2, Question 3**: Impute missing values for all features in both training and test datasets. Hint: make sure you do **not** impute the price in the test dataset.
(3 marks)

In [13]:
## Task 2, Question 3 Code Here

`(Task 2, Question 3 Text Here)`
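
The sketch below shows one common imputation scheme: the training median for numerical features and the training mode for categorical ones, applied to both datasets while leaving `price` untouched. The toy frames and the choice of median/mode are assumptions for illustration; other strategies may suit individual features better.

```python
import pandas as pd

# Toy train/test frames; column names follow the assignment, values are made up
train_toy = pd.DataFrame({
    "bedrooms": [1.0, 2.0, None, 3.0],
    "room_type": ["Entire home/apt", None, "Private room", "Entire home/apt"],
    "price": [132.0, 39.0, 270.0, 116.0],
})
test_toy = pd.DataFrame({
    "bedrooms": [None, 2.0],
    "room_type": ["Private room", None],
    "price": [float("nan"), float("nan")],  # unknown by design, must stay missing
})

# Compute fill values from the training data only, and never for the target
features = [c for c in train_toy.columns if c != "price"]
fill_values = {}
for col in features:
    if pd.api.types.is_numeric_dtype(train_toy[col]):
        fill_values[col] = train_toy[col].median()
    else:
        fill_values[col] = train_toy[col].mode().iloc[0]

# A dict passed to fillna only touches the columns it names
train_imp = train_toy.fillna(value=fill_values)
test_imp = test_toy.fillna(value=fill_values)
print(test_imp)
```

Deriving the fill values from the training rows alone keeps information from the test set out of the fitted pipeline.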

**Task 2, Question 4**: Encode all categorical variables appropriately as discussed in class. 


Where a categorical feature contains more than 5 unique values, map the feature into 5 most frequent values + 'other' and then encode appropriately. For instance, you could group then map `property_type` into 5 most frequent property types + 'other'  
(2 marks)

In [14]:
## Task 2, Question 4 Code Here

`(Task 2, Question 4 Text Here)`
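
The grouping rule described above (5 most frequent values + 'other') can be sketched as follows. The helper `top_n_or_other`, the toy property types and the use of one-hot encoding via `pd.get_dummies` are assumptions; the encoding discussed in class may differ for ordinal features.

```python
import pandas as pd

# Toy column with 7 distinct property types (illustrative values only)
s = pd.Series(
    ["Entire rental unit"] * 6
    + ["Entire home"] * 5
    + ["Private room in home"] * 4
    + ["Entire condo"] * 3
    + ["Entire townhouse"] * 2
    + ["Tiny home", "Boat"],
    name="property_type",
)

def top_n_or_other(series, n=5, other="other"):
    """Keep the n most frequent categories and map everything else to `other`."""
    top = series.value_counts().nlargest(n).index
    return series.where(series.isin(top), other)

grouped = top_n_or_other(s)

# One-hot encode the 6 remaining levels (5 frequent values + 'other')
encoded = pd.get_dummies(grouped, prefix="property_type", dtype=int)
print(encoded.sum())
```

Note that missing values are not in the top-5 set, so this helper would map them to `other`; imputing before grouping avoids that.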

**Task 2, Question 5**: Perform any other actions you think need to be done on the data before constructing predictive models, and clearly explain what you have done.   
(1 mark)

In [15]:
## Task 2, Question 5 Code Here

`(Task 2, Question 5 Text Here)`

**Task 2, Question 6**: Perform exploratory data analysis to measure the relationship between the features and the target and write up your findings. 
(2 marks)

In [16]:
## Task 2, Question 6 Code Here

`(Task 2, Question 6 Text Here)`
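
As one starting point for measuring feature-target relationships, the sketch below ranks numerical features by their Pearson correlation with `price`. The toy values are made up, and correlation only captures linear association, so plots and group-wise summaries would still be needed for categorical or non-linear effects.

```python
import pandas as pd

# Toy cleaned data (made-up values); in practice use the cleaned training rows
toy = pd.DataFrame({
    "accommodates": [2, 4, 6, 8, 3],
    "bedrooms": [1, 2, 3, 4, 1],
    "minimum_nights": [1, 3, 2, 1, 5],
    "price": [90.0, 150.0, 240.0, 310.0, 110.0],
})

# Correlation of every numerical feature with the target,
# ordered by absolute strength
corr_with_price = (
    toy.select_dtypes("number")
    .corr()["price"]
    .drop("price")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_price)
```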

--- 
## Task 3: Fit and tune a forecasting model/Submit predictions/Report score and ranking {-}

Make sure you **clearly explain each step** you do, both in text and on the recorded video.

1. Build a machine learning (ML) regression model by taking into account the outcomes of Tasks 1 & 2 (Explain carefully)
2. Fit the model and tune hyperparameters via cross-validation: make sure you comment and explain each step clearly
3. Create predictions using the test dataset and submit your predictions on Kaggle's competition page
4. Provide your Kaggle ranking and **score** (screenshot your best submission) and comment on them
5. Make sure your Python code works, so that a marker can replicate all of your results and obtain the same RMSE from Kaggle

- Hint: to perform well in this assignment you will need to iterate Tasks 2 & 3, creating new features and training various models in order to find the best one.

Total Marks: 12

In [10]:
#Task 3 code here

`(Task 3 - insert more cells as required)`
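
Steps 1 and 2 above could be sketched as follows. This is only an illustration under stated assumptions: ridge regression, the `alpha` grid and the synthetic data from `make_regression` are placeholders, not the model this notebook submits. The scoring string matches the competition metric (RMSE), and the fixed `random_state` supports the replicability requirement in step 5.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned training features and prices
X, y = make_regression(n_samples=200, n_features=10, noise=15.0, random_state=0)

# Scaling lives inside the pipeline so it is re-fitted on each training fold
pipeline = make_pipeline(StandardScaler(), Ridge())
param_grid = {"ridge__alpha": [0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximises scores
    cv=5,
)
search.fit(X, y)

best_alpha = search.best_params_["ridge__alpha"]
cv_rmse = -search.best_score_  # flip the sign back to a positive RMSE
print(best_alpha, round(cv_rmse, 2))
```

After tuning, `search.predict` applies the best pipeline (refitted on all training rows) to new data, which is the step that would produce the Kaggle submission.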