Kaggle-RentHop

Final project for Data Analytics class, QMUL 2017. Dataset and challenge taken from Kaggle:

https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries

Using supervised machine learning, find what makes a particular apartment rental stand out and attract high interest. The project is written in Python, using the Spyder IDE. The main third-party libraries used are:

scikit-learn
pandas
Matplotlib

Below is a quick summary of the main points of the analysis. The full report is available as a PDF (Data-Analytics-Project.pdf, 31 pages).

Workflow

The first step was to understand RentHop, read the declaration from the spokesperson and explore the assignment given to get a sense of direction as to where to start. Thereafter, the following workflow was implemented until the end of the project:

Data - EDA and cleaning

Initial features

Roughly 69% of the listings are low-interest rentals and only 8% are high-interest apartments, so the dataset is heavily imbalanced.
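
The class balance can be checked with a normalised value count. The sketch below is illustrative only: it assumes the target column is named `interest_level` (as in the Kaggle data) and uses a toy frame instead of `train.json`. The 23 "medium" rows are simply the remainder implied by the two figures quoted above.

```python
import pandas as pd

# Toy stand-in for the training data; the real project would load train.json.
# 69 low and 8 high rows mirror the quoted shares; 23 medium rows are the
# assumed remainder.
listings = pd.DataFrame(
    {"interest_level": ["low"] * 69 + ["medium"] * 23 + ["high"] * 8}
)

# Share of each class as a fraction of all listings.
class_share = listings["interest_level"].value_counts(normalize=True)
print(class_share)
```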

Summary statistics computed before cleaning showed that outliers were present in the location and price data. This finding is highlighted in red.

These outliers were then removed:

Location: A boundary around New York was defined using coordinates, and any listing outside the boundary was removed.
Price: The top and bottom 1% were removed. A typical right-skewed distribution remained.
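
The two cleaning steps could be sketched as follows. This is a minimal illustration, not the project's code: the bounding-box values are hypothetical, the column names `latitude`, `longitude` and `price` are assumed from the Kaggle data, and computing the price percentiles after the location filter is also an assumption.

```python
import pandas as pd

# Hypothetical bounding box roughly covering New York City (assumed values,
# not the boundary used in the project).
LAT_MIN, LAT_MAX = 40.4, 41.0
LON_MIN, LON_MAX = -74.3, -73.6


def remove_outliers(df: pd.DataFrame) -> pd.DataFrame:
    """Drop listings outside the bounding box, then trim the price tails."""
    in_box = df["latitude"].between(LAT_MIN, LAT_MAX) & df["longitude"].between(
        LON_MIN, LON_MAX
    )
    df = df[in_box]
    # 1st and 99th percentiles of the remaining prices.
    low, high = df["price"].quantile([0.01, 0.99])
    return df[df["price"].between(low, high)]


# Toy data: 100 listings inside the box plus one at coordinates (0, 0).
toy = pd.DataFrame(
    {
        "latitude": [40.7] * 100 + [0.0],
        "longitude": [-73.9] * 100 + [0.0],
        "price": list(range(1000, 1100)) + [3000],
    }
)
cleaned = remove_outliers(toy)
print(len(toy), "->", len(cleaned))
```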

Feature Engineering

Feature engineering involves transforming and combining features in an attempt to produce better representations of the dataset for the purposes of modelling.

Created features:

Broker quality (high interest apartments / total managed apartments)
Apartment quality
Price per room
Number of photos
Common key features extracted from the features column
Neighborhood (Manhattan, Queens, Staten Island, etc.)
Time features (weekday, hour, day)
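
A few of the features listed above could be derived along these lines. This is a sketch under stated assumptions: the column names (`price`, `bedrooms`, `photos`, `created`, `manager_id`, `interest_level`) are taken from the Kaggle data, `add_features` is a hypothetical helper, and the exact definitions (for example dividing the price by bedrooms plus one) are guesses rather than the formulas used in the report.

```python
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a handful of engineered columns added."""
    out = df.copy()

    # Price per room; "+ 1" avoids dividing by zero for studios (assumption).
    out["price_per_room"] = out["price"] / (out["bedrooms"] + 1)

    # Number of photos attached to each listing.
    out["num_photos"] = out["photos"].apply(len)

    # Time features from the creation timestamp (weekday: Monday = 0).
    created = pd.to_datetime(out["created"])
    out["weekday"] = created.dt.weekday
    out["hour"] = created.dt.hour
    out["day"] = created.dt.day

    # Broker quality: high-interest listings / total listings per manager.
    is_high = out["interest_level"].eq("high").astype(float)
    out["broker_quality"] = is_high.groupby(out["manager_id"]).transform("mean")
    return out


toy = pd.DataFrame(
    {
        "price": [3000, 2000, 4000],
        "bedrooms": [2, 0, 3],
        "photos": [["a.jpg", "b.jpg"], [], ["c.jpg"]],
        "created": ["2016-06-24 07:54:24", "2016-06-12 12:19:27", "2016-04-17 03:26:41"],
        "manager_id": ["m1", "m1", "m2"],
        "interest_level": ["high", "low", "low"],
    }
)
featured = add_features(toy)
print(featured[["price_per_room", "num_photos", "hour", "broker_quality"]])
```

Because broker quality is computed from the target, it would normally be estimated on training folds only so that it does not leak label information into validation data.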

Modelling and Inference

The general approach to modelling was to apply several different algorithms to the full set of engineered features. The models selected were Logistic Regression, Random Forest (a decision tree ensemble), and Neural Networks. The reasoning was that with a rich enough feature representation a simple model such as Logistic Regression is sufficient, whereas in the absence of good features a more flexible algorithm such as a neural network can potentially compensate.

Feature Selection

The relative importance of each feature was used to determine its potential usefulness in modelling, as some methods are sensitive to the number of features included.

The importances were generated with the Extra-Trees Classifier, which averages each feature's importance across all trees in the model (1000 trees were used in this instance). Any feature below a threshold of 0.04 mean importance was discarded from the modelling process.
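
The selection step can be illustrated with scikit-learn's `ExtraTreesClassifier`. The sketch below uses synthetic data and placeholder feature names, so the resulting importances say nothing about the RentHop features; only the 1000 trees and the 0.04 threshold come from the description above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the engineered feature matrix (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# 1000 trees, as described above.
forest = ExtraTreesClassifier(n_estimators=1000, random_state=0)
forest.fit(X, y)

# Mean impurity-based importance of each feature across all trees.
importances = forest.feature_importances_

THRESHOLD = 0.04
selected = [
    name for name, importance in zip(feature_names, importances)
    if importance >= THRESHOLD
]
print(selected)
```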

Algorithm comparison

Aside from including better features, the other primary method of improving prediction was to compare various types of classification models and optimise their parameters. In each case, several parameters were tuned over a range of values using 5-fold cross-validation to minimise the risk of overfitting to the training dataset.

The models that were selected for comparison included:

Random Forest Classifier
Neural Network
Logistic Regression
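
A comparison of this kind could be set up with `GridSearchCV` and `cv=5`. The following is a minimal sketch, not the project's code: the data is synthetic, the parameter grids are hypothetical, and scoring by log loss is an assumption based on the metric used by the Kaggle competition.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class problem standing in for low / medium / high interest.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Hypothetical, deliberately small parameter grids for each model type.
candidates = {
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [5, None]},
    ),
    "neural_network": (
        make_pipeline(StandardScaler(), MLPClassifier(max_iter=500, random_state=0)),
        {"mlpclassifier__hidden_layer_sizes": [(16,), (32,)]},
    ),
    "logistic_regression": (
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        {"logisticregression__C": [0.1, 1.0, 10.0]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # 5-fold cross-validation over every parameter combination in the grid.
    search = GridSearchCV(estimator, grid, cv=5, scoring="neg_log_loss")
    search.fit(X, y)
    # Store the best parameters and the corresponding mean log loss.
    results[name] = (search.best_params_, -search.best_score_)

for name, (params, loss) in results.items():
    print(f"{name}: log loss {loss:.3f} with {params}")
```

Scaling is placed inside the pipelines so that it is refitted on each training fold, which keeps the validation folds out of the preprocessing step.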

The best-performing classifier was the Neural Network.


Conclusions

The main deliverable for this project was the predictive model. While it performed well in some instances, its primary drawback was the poor classification quality for high-interest apartments.

However, the neural network was quite accurate in predicting low-interest apartments, and the raw probabilities provide better insight than the hard class predictions.
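
The difference between hard predictions and raw probabilities can be seen with `predict` and `predict_proba`. This is an illustrative sketch on synthetic, imbalanced data; the network size and class weights are assumptions and do not reproduce the project's model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic imbalanced three-class data (shares chosen to resemble the dataset).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, n_classes=3,
    weights=[0.69, 0.23, 0.08], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
model.fit(X_train, y_train)

labels = model.predict(X_test)               # one hard label per listing
probabilities = model.predict_proba(X_test)  # one probability per class

# The hard label keeps only the most likely class; the probabilities also
# show how close the other classes were.
print(labels[:3])
print(np.round(probabilities[:3], 2))
```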

Two Sigma could use this to alert landlords who post on the site when their listing shows the characteristics of an apartment that receives low interest. It could then highlight the features that had the largest impact on the prediction and recommend how the landlord could improve the listing, driving actionable decisions.

Authors

Hesham Dar
Petur Einarsson
Christopher Ottesen
