<a href="https://colab.research.google.com/github/arangodb/interactive_tutorials/blob/master/notebooks/ArangoBnB_simple_data_exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Brief Data Exploration of the Airbnb Dataset

This notebook showcases an example of data exploration as part of a data modeling activity. The original dataset is pulled from and maintained by [insideAirbnb](http://insideairbnb.com/). The goal is to show a different approach to data analysis and exploration, compared to AQL. The two approaches should not be considered contradictory but complementary. This notebook focuses on some basic overview data and some insights into the prices of listings. A more complete look at the listings of the Airbnb dataset can be found in the original article.

This data was eventually used in the full stack JavaScript project, [ArangoBnB](https://github.com/cw00dw0rd/ArangoBnB). The ArangoBnB project has been created with the community and is always open to new contributors looking to learn more about ArangoDB, JavaScript, Vue, and/or React.

Some objectives for this data exploration activity include:
 * Learn about the included fields and data types
 * Evaluate completeness of the fields
 * Determine if the data can fulfill our application requirements
 * Attempt to gain some quick insights from listing prices
 * Discover any necessary transformations

#### Application Requirements
The following is a list of application requirements that were previously outlined in the associated data modeling article. During the activity we hope to determine the viability of some of these features. It isn't likely that we will find all features listed in this single file but it's a good idea to keep them in mind.
 * Search an Airbnb dataset to find rentals near a specified location
 * A draggable map that shows results based on position
 * Use ArangoSearch to keep everything fast
 * Search the dataset with GeoJSON coordinates
 * Filter results based on keywords, price, guests, etc.
 * Natural language search (Ex: Houses in Florida with pools.)
 * Use AQL for all queries
 * Multi-lingual support
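
As a rough illustration of the GeoJSON requirement in the list above, the sketch below builds GeoJSON `Point` objects from coordinate columns. The column names `latitude` and `longitude` and the sample values are assumptions made for this example and should be checked against the actual file. Note that GeoJSON orders coordinates as `[longitude, latitude]`.

```python
import pandas as pd

# Hypothetical sample rows; the column names are assumptions, not confirmed by this notebook
sample = pd.DataFrame({
    "id": [1, 2],
    "latitude": [52.52, 52.49],
    "longitude": [13.405, 13.42],
})

# GeoJSON Points list longitude first, then latitude
sample["location"] = [
    {"type": "Point", "coordinates": [lon, lat]}
    for lon, lat in zip(sample["longitude"], sample["latitude"])
]

print(sample["location"].iloc[0])
```

If the real data carries coordinates in this shape, a transformation like this would be one of the "necessary transformations" to note during exploration.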


## Install Prerequisites

In [None]:
%%capture
!pip install python-arango
!pip install arangopipe==0.0.70.0.0
!pip install pandas

## Download the Data
Ideally, similar activities for each of the datasets would be performed and the collective insights would be used for development. For simplicity, this notebook will only look at a single collection, the `listings` collection. 

[insideAirbnb](http://insideairbnb.com/) provides two files named `listings`; one is essentially a summary of the larger file. The larger file is used in this example as it is only 40MB and provides a more complete view of the available data.

In [None]:
!wget "http://data.insideairbnb.com/germany/be/berlin/2020-12-21/data/listings.csv.gz"
!gunzip listings.csv.gz

## Read Data

Now that we have downloaded and unzipped the `listings.csv.gz` file for Berlin from [insideAirbnb](http://insideairbnb.com/get-the-data.html), we can start working with it.

To keep things easy we mostly stick with using [Pandas](https://pandas.pydata.org/docs/index.html) and [NumPy](https://numpy.org/).

In [None]:
import pandas as pd
import numpy as np

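# Skip malformed rows instead of raising an error.
# on_bad_lines replaces error_bad_lines=False, which was removed in pandas 2.0 (requires pandas >= 1.3).
df = pd.read_csv("listings.csv", on_bad_lines="skip")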

## Generate Summaries
First, we look at the document's head and tail to see what fields exist and get a quick view of how the data looks.

Some immediate takeaways include:
 * Some values show `NaN`
 * Some fields contain HTML 
 * Price appears to be a string with symbols
 
Do you see anything else that needs attention?

In [None]:
df.head()

In [None]:
df.tail()
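
One of the takeaways above is that some fields contain HTML. The following is a minimal sketch of one way to strip tags with a regular expression. The `description` column name and the sample rows are hypothetical stand-ins, and a simple pattern like this is only a rough cleanup, not a full HTML parser.

```python
import pandas as pd

# Hypothetical sample rows standing in for a text field that contains HTML
sample = pd.DataFrame({
    "description": ["Bright flat<br />near the park", "<b>Quiet</b> room", None],
})

# Replace anything that looks like a tag with a space,
# collapse repeated whitespace, and trim the ends
sample["description_clean"] = (
    sample["description"]
    .str.replace(r"<[^>]+>", " ", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

print(sample["description_clean"].tolist())
```

Missing values pass through unchanged, so the `NaN` entries noted above remain visible after cleaning.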

We can get an idea of the data types for the various fields with the [info()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html) method which prints a summary of the dataframe. 
This summary includes:
 * Field (column) name
 * The number of non-null values
 * The field datatype

In [None]:
df.info()

A common method to start with is [describe()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) which generates descriptive statistics. This method may be more appropriate for other datasets, but it still provides some interesting stats that get us a little more familiar with the data.

For instance, the statistics reported for values like `id`, `scrape_id`, and `host_id` are not meaningful, because these fields are identifiers rather than measurements.

In [None]:
df.describe()

In [None]:
df.isna().any()
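
The check above only reports whether a column has any missing value at all. To evaluate completeness more precisely, one option is to compute the share of non-null values per column. This is a sketch on a small hypothetical frame; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with varying amounts of missing data
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "bedrooms": [1.0, np.nan, 2.0, np.nan],
    "license": [None, None, None, "ABC"],
})

# Percentage of non-null values per column, least complete first
completeness = (sample.notna().mean() * 100).sort_values()

print(completeness)
```

Applied to the real `df`, the same expression would rank the fields from least to most complete.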

In [None]:
df.duplicated()
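
`duplicated()` returns one boolean per row, which is hard to read for thousands of listings. A possible follow-up, sketched here on hypothetical data, is to count fully duplicated rows and to check a key column separately, since two rows can share an identifier without being identical.

```python
import pandas as pd

# Hypothetical sample: one fully duplicated row and one repeated id with a different name
sample = pd.DataFrame({
    "id": [10, 11, 11, 12, 12],
    "name": ["Loft", "Studio", "Studio", "Studio", "Flat"],
})

# Rows that repeat an earlier row in every column
full_row_duplicates = int(sample.duplicated().sum())

# Rows that repeat an earlier id, regardless of the other columns
duplicate_ids = int(sample.duplicated(subset="id").sum())

print(full_row_duplicates, duplicate_ids)
```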

## Listing Price Insights

In [None]:
# Removes the currency symbol and thousands separators, then converts price from string to float
df['price'] = df['price'].replace(r'[\$,]', '', regex=True).astype(float)

# Sets a sensible tick range for prices
# Ultimately not necessary, as setting c and colormap does this for us
ticks = np.arange(df['price'].min(), df['price'].max(), df['price'].max() / 25)
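
The conversion above assumes every price string is well formed. As a more defensive sketch, `pd.to_numeric` with `errors="coerce"` turns anything unparseable into `NaN` instead of raising. The sample values below are hypothetical.

```python
import pandas as pd

# Hypothetical raw price strings, including a missing and a malformed entry
raw_prices = pd.Series(["$60.00", "$1,250.00", None, "n/a"])

# Strip the currency symbol and thousands separators
cleaned = raw_prices.str.replace(r"[\$,]", "", regex=True)

# Convert to numbers; values that cannot be parsed become NaN
prices = pd.to_numeric(cleaned, errors="coerce")

print(prices.tolist())
```

Counting the resulting `NaN` values afterwards shows how many prices could not be converted.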

In [None]:
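# df.plot returns a matplotlib Axes object; naming it ax avoids confusion with the usual pyplot alias plt
ax = df.plot(x="price", y="neighbourhood_group_cleansed", s=100, kind="scatter", c="price", colormap='Set1', figsize=[20, 10], grid=True)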

The plot seems to indicate that some prices reach 8000.

Using [max()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.max.html) we can confirm this, as well as get some insights into other fields.
Some interesting fields:
 * The max number of bedrooms is 50!
 * The max number of beds is 96!
 * The minimum and maximum nights fields seem to contain outliers, with maximum values of 1124 and 9999
 * The max review score is 10 in all categories 

In [None]:
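# Restrict to numeric columns; recent pandas versions raise a TypeError on columns that mix strings and NaN
df.max(numeric_only=True)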
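
A maximum alone does not show whether a value such as 8000 is typical or an outlier. One common heuristic, sketched here on hypothetical prices, is the interquartile range (IQR) rule, which flags values more than 1.5 times the IQR above the third quartile. The threshold is a convention, not something this dataset prescribes.

```python
import pandas as pd

# Hypothetical prices with one extreme value
prices = pd.Series([20.0, 35.0, 50.0, 60.0, 75.0, 90.0, 120.0, 8000.0])

median = prices.median()

# Interquartile range and the usual upper fence
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)

# Values above the fence are candidate outliers
outliers = prices[prices > upper_fence]

print(median, upper_fence, outliers.tolist())
```

Comparing the median with the maximum in this way gives a quick sense of how skewed the price field is.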

In [None]:
# Of the 22k listings, there are 430 different prices set
len(pd.unique(df['price']))
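
Beyond the number of distinct prices, it can help to see which prices occur most often. The sketch below uses `value_counts()` on hypothetical data. Note that `nunique()` ignores `NaN` by default, whereas `pd.unique` includes it, so the two counts can differ by one when prices are missing.

```python
import pandas as pd

# Hypothetical prices
prices = pd.Series([50.0, 50.0, 35.0, 80.0, 50.0, 35.0])

# Number of distinct non-null prices
distinct = prices.nunique()

# The two most frequent prices and how often each occurs
most_common = prices.value_counts().head(2)

print(distinct)
print(most_common)
```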

This has been a quick look at using pandas to explore data.
In this notebook we:
 * Gained an overview of the available data and their types
 * Got some insights into neighborhoods and prices
 * Transformed the price from a string into a usable number
 