Analysis of Seattle Airbnb Data

Installations

To run the project, use Jupyter Notebook. Several Python libraries are required, including NumPy, pandas, Matplotlib, Seaborn, and various models and functions from scikit-learn (sklearn).

Project Motivation

I carried out this project for an assignment in the Udacity Nanodegree program. The objective was to pick a data set and analyse it to answer three defined business questions. I decided to use the suggested Airbnb dataset containing listings for Seattle (in 2016).

The three business questions I aim to answer are:

Concerning Airbnb rental listings in Seattle (as of April 2016):

How does the neighbourhood of the property affect the price?
How does the property type affect the price?
How effectively can I predict the price of a listing using a machine learning model?

File Descriptions

Seattle_airbnb.ipynb: Jupyter notebook containing my analysis
calendar.csv: availability of each listing per day
listings.csv: full descriptions of properties listed for rental on Airbnb
reviews.csv: reviews for each listing

The three CSV data files were sourced from Kaggle (https://www.kaggle.com/airbnb/seattle/data), where more detail is available.

Interacting With The Project

The notebook is self-contained: provided you have Python and Jupyter Notebook installed, you should be able to load the ipynb file and run all cells. The data files need to be saved in a "Seattle" sub-folder for the import commands to work.
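As a rough illustration of the folder layout described above, the sketch below loads the three CSV files from a "Seattle" sub-folder and fails early if one is missing. The helper name load_seattle_data and the FILES tuple are hypothetical and are not taken from the notebook.

```python
from pathlib import Path

import pandas as pd

DATA_DIR = Path("Seattle")  # sub-folder named in this README
FILES = ("calendar", "listings", "reviews")  # file stems listed under File Descriptions


def load_seattle_data(data_dir=DATA_DIR):
    """Return a dict of DataFrames keyed by file stem (hypothetical helper)."""
    data_dir = Path(data_dir)
    # Report every missing file at once rather than failing on the first read
    missing = [name for name in FILES if not (data_dir / f"{name}.csv").exists()]
    if missing:
        raise FileNotFoundError(f"Missing in {data_dir}: {', '.join(missing)}")
    return {name: pd.read_csv(data_dir / f"{name}.csv") for name in FILES}
```

Calling load_seattle_data()["listings"] would then give the listings table, assuming the files sit next to the notebook in the "Seattle" folder.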

The Process

Business Understanding

The analysis explores the Seattle Airbnb data to gain insights into what affects the price of a listing. The three specific questions it aims to address are:

Concerning Airbnb rental listings in Seattle (as of April 2016):

How does the neighbourhood of the property affect the price?
How does the property type affect the price?
How effectively can I predict the price of a listing using a machine learning model?

Data Understanding

There are three data files: calendar, reviews and listings. Although I inspect them all, ultimately my analysis uses the listings data. The listings data contains 3,818 rows, each representing a property listed for rental in Seattle (as of 1st April 2016). There are 92 separate columns, including the price.

Some of the columns are not in an ideal format for analysis or machine learning, so I undertake a number of data cleaning and processing steps.

Data Preparation

In the notebook, jump to the "Explore Listings Data" heading. Here I undertake the following checks:

Explore Listings Data

Check Data Types
Check for missing values
Identify complete columns
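
The three checks above might look something like the following sketch. It uses a small made-up frame in place of listings.csv, so the column values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for listings.csv (values are invented for illustration)
listings = pd.DataFrame({
    "id": [1, 2, 3],
    "price": ["$85.00", "$1,000.00", None],
    "square_feet": [np.nan, np.nan, 800.0],
    "room_type": ["Entire home/apt", "Private room", "Private room"],
})

# Check data types
dtypes = listings.dtypes

# Check for missing values: share of nulls per column, worst first
missing_share = listings.isnull().mean().sort_values(ascending=False)

# Identify complete columns (no nulls at all)
complete_columns = missing_share[missing_share == 0].index.tolist()
```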

Explore Price

Clean price column: remove $ signs and convert to numeric
Remove the single row with a null price
Inspect price distribution using histogram
Drop numeric columns which will not be useful features (such as id)
Plot correlation matrix
Plot boxplot of price by property type
Plot boxplot of price by superhost status
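
A minimal sketch of the price-cleaning steps, on a toy frame. Stripping thousands separators (commas) alongside the $ sign is an assumption added here; the README only mentions $ signs.

```python
import pandas as pd

# Toy stand-in for the listings data (values are invented for illustration)
listings = pd.DataFrame({
    "id": [1, 2, 3],
    "price": ["$85.00", "$1,000.00", None],
})

# Strip "$" and "," (the comma handling is an assumption), then convert to numeric
listings["price"] = pd.to_numeric(
    listings["price"].str.replace(r"[$,]", "", regex=True),
    errors="coerce",
)

# Remove rows where the price is null
listings = listings.dropna(subset=["price"]).reset_index(drop=True)
```

After this, listings["price"] is a float column, so a histogram or boxplot can be drawn from it directly.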

Cleaning Data for Model

Host Verifications

Process host verification column to have a separate binary column for each verification type
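
One way this step could be done is sketched below. It assumes each host_verifications cell holds a string-encoded Python list such as "['email', 'phone']"; the verif_ prefix is a hypothetical naming choice, not taken from the notebook.

```python
import ast

import pandas as pd

# Toy data, assuming each cell is a string-encoded Python list
listings = pd.DataFrame({
    "host_verifications": ["['email', 'phone']", "['phone', 'reviews']", "[]"],
})

# Parse each string into a real list without executing arbitrary code
parsed = listings["host_verifications"].apply(ast.literal_eval)

# Join each list into one delimited string, then expand into 0/1 columns
verification_flags = (
    parsed.str.join("|").str.get_dummies(sep="|").add_prefix("verif_")
)
```

A row with an empty list ends up with zeros in every verification column.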

Amenities

Process amenities column to have a separate binary column for each amenity type
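
A similar sketch for amenities, assuming each cell is a brace-wrapped, comma-separated string such as {TV,"Wireless Internet",Kitchen}. The amenity_ prefix is a hypothetical naming choice.

```python
import pandas as pd

# Toy data, assuming a brace-wrapped, comma-separated string per listing
listings = pd.DataFrame({
    "amenities": ['{TV,"Wireless Internet",Kitchen}', "{Kitchen}", "{}"],
})

# Drop the surrounding braces and the quotes around multi-word amenities
cleaned = (
    listings["amenities"]
    .str.strip("{}")
    .str.replace('"', "", regex=False)
)

# One 0/1 column per amenity
amenity_flags = cleaned.str.get_dummies(sep=",").add_prefix("amenity_")
```

This simple split would mis-handle an amenity whose name itself contains a comma, so the resulting column names are worth checking against the raw data.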

Categorical Variables

Convert categorical variables cancellation_policy, property_type, room_type, host_is_superhost into dummy columns
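
A minimal sketch of the dummy encoding with pandas, on toy data. The "t"/"f" values for host_is_superhost are an assumption about how the raw column is encoded.

```python
import pandas as pd

# Toy data; "t"/"f" for host_is_superhost is an assumed encoding
listings = pd.DataFrame({
    "cancellation_policy": ["strict", "flexible", "moderate"],
    "property_type": ["House", "Apartment", "House"],
    "room_type": ["Private room", "Entire home/apt", "Private room"],
    "host_is_superhost": ["t", "f", "f"],
})

categorical = ["cancellation_policy", "property_type", "room_type", "host_is_superhost"]

# One 0/1 column per category level, prefixed with the source column name
dummies = pd.get_dummies(listings[categorical], prefix=categorical, dtype=int)
```

For linear models, passing drop_first=True to pd.get_dummies is a common way to avoid perfectly collinear dummy columns; whether the notebook does this is not stated here.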

Numeric Features

Select existing numeric features for use in model. Drop square_feet as it has too many null values.
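
The selection could be sketched as follows, on a toy frame. The columns accommodates and bathrooms are used purely for illustration; the exact feature list lives in the notebook.

```python
import numpy as np
import pandas as pd

# Toy data; column names other than square_feet are illustrative
listings = pd.DataFrame({
    "accommodates": [2, 4, 3],
    "bathrooms": [1.0, 2.0, 1.0],
    "square_feet": [np.nan, np.nan, 800.0],
    "room_type": ["Private room", "Entire home/apt", "Private room"],
})

# Keep only the numeric columns
numeric = listings.select_dtypes(include="number")

# Share of nulls per numeric column, to show why square_feet is dropped
null_share = numeric.isnull().mean()

numeric_features = numeric.drop(columns=["square_feet"])
```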

Modelling

Flag price column as the target variable
Split into training set (70%) and test set (30%)
Train multiple models and evaluate with mean absolute error, root mean squared error and R squared
- LinearRegression
- KNeighborsRegressor
- SGDRegressor
- Lasso
- ElasticNet
- Ridge
- SVR-linear
- SVR-rbf
Run cross-validation on each model, computing R squared
Select top 3 models (Ridge, Lasso and ElasticNet) and fine-tune hyperparameters using GridSearchCV
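
The modelling steps above can be sketched with scikit-learn as below. This is not the notebook's code: it uses synthetic data from make_regression in place of the prepared listings features, trains only four of the listed models to keep it short, and the alpha grid and the cv=5 setting are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Synthetic stand-in for the prepared features (X) and price target (y)
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=42)

# 70% training set, 30% test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "ElasticNet": ElasticNet(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "mae": mean_absolute_error(y_test, pred),
        "rmse": np.sqrt(mean_squared_error(y_test, pred)),
        "r2": r2_score(y_test, pred),
        # Mean R squared over 5 cross-validation folds of the training set
        "cv_r2": cross_val_score(model, X_train, y_train, cv=5, scoring="r2").mean(),
    }

# Fine-tune one of the shortlisted models over an illustrative alpha grid
search = GridSearchCV(
    Lasso(max_iter=10000),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="r2",
)
search.fit(X_train, y_train)
best_alpha = search.best_params_["alpha"]
```

Cross-validating on the training set only keeps the test set untouched until the final evaluation.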

Evaluation

Identify the best-performing model and hyperparameters: Lasso with alpha = 1.0
Check Mean Absolute Error, Root Mean Squared Error and R Squared of best model
Examine coefficients to see which features are most helpful for the model prediction
Plot coefficients on histogram
Plot model errors on histogram
Plot mean absolute errors on histogram
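
A rough sketch of the coefficient and error inspection, again on synthetic data rather than the listings features. The feature names are placeholders, and np.histogram is used to compute the bin counts that a histogram plot would display.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and price target
X, y = make_regression(
    n_samples=300, n_features=8, n_informative=3, noise=10.0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Best model reported in this README: Lasso with alpha = 1.0
best_model = Lasso(alpha=1.0).fit(X_train, y_train)

# Rank coefficients by absolute size to see which features matter most
coefficients = pd.Series(best_model.coef_, index=feature_names)
ranked = coefficients.reindex(coefficients.abs().sort_values(ascending=False).index)

# Signed and absolute prediction errors on the test set
errors = y_test - best_model.predict(X_test)
absolute_errors = np.abs(errors)

# Bin counts behind a histogram of the errors
counts, bin_edges = np.histogram(errors, bins=20)
```

Coefficient sizes are only directly comparable when the features are on similar scales, so this ranking is best read alongside any scaling applied before fitting.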

Deployment

The model does not need to be deployed, but it could be used by stakeholders or potential customers in the Seattle Airbnb market.

Acknowledgements

Author: Neale Denton
Data: sourced from Kaggle.com

This code is free to use, modify and share with or without attribution, but attribution is appreciated!
