# Lab 5 - Managing big(ger) Airbnb data with data frames
*© 2021 Colin Conrad*

Welcome to Week 5 of INFO 6270! Last week we explored how to use _libraries_ to make our lives easier. We then used this skill to import CSV files and JSON data into our Python framework. With this knowledge in hand, you are now ready to start working with data more earnestly. This is the first lab of the second unit, nicknamed _Scrappy Data Science for Managers_. Scrappy, in this context, because we are going to learn data science tools in a seemingly ad-hoc way, in an effort to help you tackle common problems that you might experience in your information management or other broadly managerial career.

The [Pandas](https://pandas.pydata.org/docs/getting_started/index.html#getting-started) dataframe is a tool which makes it easier to navigate and analyze large datasets. Built upon numpy and other dependencies, this tool is among the most essential resources for conducting analysis on larger datasets. We will also use this tool in nearly all subsequent labs (at least, all labs in Unit 2), so be sure to explore this one closely!

**This week, we will achieve the following objectives:**
- Turn your dataset into a dataframe and build a simple query
- Observe additional features and create an advanced query
- Collect descriptive statistics from your dataframe
- Make changes to your dataframe

Weekly reading: [10 minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)

# Case: Airbnb
It's pretty likely that you know something about [Airbnb](https://www.airbnb.ca/). Airbnb has been called the [world's largest hotel chain](https://www.bizjournals.com/sanfrancisco/news/2017/08/11/airbnb-surpasses-ihg-wyn-hilton-marriott-listings.html), while owning no hotels themselves. As a crowdsourcing platform, users can list their properties and rent them out to short-term renters using the Airbnb app. Though the company is not yet 13 years old as of January 2021, it has a market valuation of $113 billion, nearly 13 times the estimated value of Hilton hotel brands (the most valuable hotel chain).

Airbnb is not without controversy. Airbnb has been identified by the [Economic Policy Institute](https://www.epi.org/publication/the-economic-costs-and-benefits-of-airbnb-no-reason-for-local-policymakers-to-let-airbnb-bypass-tax-or-regulatory-obligations/) as an important factor in rising rent and property prices, despite often escaping tax and regulation. The company [regularly releases their application data publicly](http://insideairbnb.com/get-the-data.html). Though we cannot investigate this phenomenon in one lab, this is a useful resource for learning about data science tools.

# Objective 1: Turn your dataset into a dataframe and build a simple query
As discussed in class, pandas is a framework built on top of the numpy library designed to make data science easier. Numpy is a tool for transforming your data into a multi-dimensional array, sort of like a hyper-efficient Python list. It's not great to use unless you are interested in going deep into machine learning. The pandas (PANel + DAta) library transforms our data into numerical tables (a.k.a. data frames) which are easier to calculate and sort through. We will start with Pandas because this is the tool that will be most useful for us.

To transform a CSV file into a pandas object we need to import the pandas library. We can then import a CSV file by using pandas' built-in `read_csv` function.

In [None]:
import pandas as pd # import pandas 

import numpy as np # import numpy; it's usually a good practice to import this as well

nyc = pd.read_csv('data/w5_nyc.csv') # command pandas to import the data; isn't this easier than the csv library?
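
If you would like to see what `read_csv` does without opening the full Airbnb file, the sketch below may help. It assumes a tiny, made-up CSV (the `csv_text` string is hypothetical and only borrows a few column names from the lab data) and reads it from memory rather than from disk.

```python
import io

import pandas as pd

# Hypothetical miniature CSV standing in for data/w5_nyc.csv
csv_text = """id,neighbourhood_group,price
1,Brooklyn,150
2,Manhattan,225
3,Queens,80
"""

# read_csv accepts a file path or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # the data type pandas inferred for each column
```

Checking `.shape` and `.dtypes` right after loading is a quick way to confirm that the file was read the way you expected.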

### Dataframe head
Once our data frame has been imported we can apply a few methods that can generate knowledge about the dataset. The `head()` method gives us a summary of the first five items in the dataset.

In [None]:
nyc.head()

### Dataframe series

Data frames are easily navigable compared to lists or dictionaries. If we want to retrieve all of the data from a column in the dataframe, we can access that column as an attribute of the dataframe, much like calling a method but without the parentheses. The code below returns the values for `neighbourhood_group` from the whole dataset as a pandas _series_; when displayed, only the first and last few values are shown. This is super-handy!

In [None]:
nyc.neighbourhood_group
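
As a rough illustration of column access, the sketch below compares the attribute style with the square-bracket style on a small, hypothetical dataframe called `toy`. The data is invented; only the column names mirror the lab dataset.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the lab's dataset
toy = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Queens", "Brooklyn"],
    "price": [150, 80, 95],
})

by_attribute = toy.neighbourhood_group    # attribute-style access
by_bracket = toy["neighbourhood_group"]   # bracket-style access

print(type(by_attribute))                # a pandas Series
print(by_attribute.equals(by_bracket))   # both styles return the same data
```

The bracket style is the safer habit in general, since it also works for column names that contain spaces or that clash with existing dataframe methods.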

### A transposed dataframe

Some things that are somewhat cumbersome with lists and dictionaries are also very simple with pandas. For instance, if we wish to transpose our data (make the rows columns and the columns rows) we can use the `.T` attribute (note that it takes no parentheses). This can be helpful when making calculations across entities.

In [None]:
nyc.T

### Sort values
In addition, dataframes can be easily sorted. These sorting features are similar to SQL (_Structured Query Language_) which many of you will be familiar with (and will cover in Week 10). The following code will sort the data by price starting with the highest values. 

I wonder who seriously believes that they can rent an apartment for $10 000 per night?! It must be fancy!

In [None]:
nyc.sort_values(by='price', ascending=False) # sort by price, highest values first
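
One detail worth knowing is that `sort_values` gives back a new, sorted dataframe and leaves the original untouched. The sketch below shows this on a hypothetical three-row dataframe (the names and prices are invented for illustration).

```python
import pandas as pd

# Hypothetical toy listings
toy = pd.DataFrame({"name": ["a", "b", "c"], "price": [95, 10000, 150]})

# Sort by price, highest first; the result is a new dataframe
by_price = toy.sort_values(by="price", ascending=False)

print(by_price)  # rows ordered from most to least expensive
print(toy)       # the original keeps its initial order
```

If you want to keep the sorted version, assign it to a variable as done here with `by_price`.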

## *Challenge Question 1 (0.5 points)*
Using the `sort_values` method on the `nyc` dataframe, conduct a query that can be used to discover which listing has the highest `reviews_per_month`. Provide a comment which indicates your opinion about whether this value is an anomaly.

In [None]:
# insert your code here

## *Challenge Question 2 (0.5 points)*
Using the `nyc` dataframe, conduct a simple query to retrieve the last five rows in the dataframe. There is a pandas method to do this-- be sure to read _10 minutes to pandas_ to learn more!

In [None]:
# insert your code here

# Objective 2: Observe additional features and create an advanced query
## Subsetting the data
Dataframes are for a lot more than performing large observations. Perhaps the coolest feature of a dataframe is that it facilitates efficient queries for retrieving subsets of the data. In pandas, a subset is declared by writing square brackets following the data frame-- for instance, `nyc['neighbourhood_group']` would return the values of the `neighbourhood_group` column. We can also use square brackets to conduct Boolean searches. For instance, if we wanted to retrieve only the rows where `neighbourhood_group == 'Brooklyn'` we could write a query as follows.

In [None]:
nyc[nyc.neighbourhood_group == 'Brooklyn']
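
It may help to see the two steps hidden inside a Boolean query. The sketch below, which uses a hypothetical `toy` dataframe with invented values, first builds the series of `True`/`False` values (often called a mask) and then uses it to keep the matching rows.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the lab's dataset
toy = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Queens", "Brooklyn"],
    "price": [150, 80, 95],
})

# Step 1: the comparison produces one True/False value per row
mask = toy.neighbourhood_group == "Brooklyn"
print(mask.tolist())  # [True, False, True]

# Step 2: square brackets keep only the rows where the mask is True
subset = toy[mask]
print(subset)
```

Writing `toy[toy.neighbourhood_group == "Brooklyn"]` simply performs both steps on one line.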

### Sorting subsets

Similarly to before, if we wanted to list the values from Brooklyn according to price, we can create a new data frame which is equal to this subset and sort it by price.

In [None]:
brooklyn = nyc[nyc.neighbourhood_group == 'Brooklyn']

brooklyn.sort_values(by='price', ascending=False)

### Sort by date-time
Pretty cool! Another feature of pandas is that it supports common data types which are not built-in types in Python itself, such as dates and times. In this dataset the `last_review` dates are written in `YYYY-MM-DD` format, which sorts chronologically even when stored as text. If we want to sort a search by `last_review`, for instance, we can conduct a similar query as with `price`.

In [None]:
recent_brooklyn = nyc[(nyc.neighbourhood_group == 'Brooklyn')]

recent_brooklyn.sort_values(by='last_review', ascending=False)
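
By default, `read_csv` may load a date column as plain text rather than as true dates. The sketch below shows one way to convert such a column with `pd.to_datetime`. It assumes a small, hypothetical dataframe whose dates are written as `YYYY-MM-DD`.

```python
import pandas as pd

# Hypothetical toy data with dates stored as text
toy = pd.DataFrame({"last_review": ["2019-08-06", "2018-12-31", "2019-01-15"]})

# Convert the text column into real datetime values
toy["last_review"] = pd.to_datetime(toy["last_review"])
print(pd.api.types.is_datetime64_any_dtype(toy["last_review"]))  # True

# Sorting now compares actual dates, most recent first
newest_first = toy.sort_values(by="last_review", ascending=False)
print(newest_first)

# Datetime columns also expose handy parts such as the year
print(newest_first["last_review"].dt.year.tolist())
```

Converting is optional for simple sorting of `YYYY-MM-DD` text, but it becomes useful once you want to extract years or months or compute the time between dates.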

### Query using two conditions

Queries can also be more complex. If we wish to choose a subset of data which is constrained by two conditions, we can include both conditions by using the `&` operator. The following query will retrieve the values that match `Brooklyn` which also have a `last_review` equal to `2019-08-06`, the date on which I seem to have retrieved this data.

In [None]:
recent_brooklyn = nyc[(nyc.neighbourhood_group == 'Brooklyn') & 
                      (nyc.last_review == '2019-08-06')]

recent_brooklyn.sort_values(by='price', ascending=False)

### Querying using two conditions, one of which is an OR

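Finally, we can also nest conditions. The following query searches for values which match `Brooklyn` and have a `last_review` on either `2019-08-06` or `2019-08-05`. Note the extra set of parentheses around the two conditions joined by the `|` (OR) operator.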

In [None]:
recent_brooklyn = nyc[(nyc.neighbourhood_group == 'Brooklyn') & 
                      ((nyc.last_review == '2019-08-06') | (nyc.last_review == '2019-08-05'))]

recent_brooklyn.sort_values(by='price', ascending=False)
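
Longer queries can get hard to read. One possible way to tidy them up, sketched below on a hypothetical `toy` dataframe, is to give each condition its own name and to use the `isin` method in place of several `|` comparisons against the same column.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the lab's dataset
toy = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Brooklyn", "Queens", "Brooklyn"],
    "last_review": ["2019-08-06", "2019-08-05", "2019-08-06", "2019-07-01"],
})

# Name each condition so the final query reads like a sentence
is_brooklyn = toy.neighbourhood_group == "Brooklyn"
is_recent = toy.last_review.isin(["2019-08-06", "2019-08-05"])

recent = toy[is_brooklyn & is_recent]
print(recent)

# The same query written with | and explicit parentheses
same = toy[is_brooklyn & ((toy.last_review == "2019-08-06") |
                          (toy.last_review == "2019-08-05"))]
print(recent.equals(same))  # True
```

The parentheses matter because `&` and `|` bind more tightly than `==` in Python, so leaving them out typically raises an error or gives an unintended result.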

## *Challenge Question 3 (1 point)*
Using the `nyc` dataframe, conduct a query which retrieves the following:
- Rentals only from the `Queens` neighborhood
- Rentals with either more than 100 reviews or more than 5 reviews per month
- Rentals with a price of less than 200
- Rentals which are the `Entire home/apt` room type

Sort your findings by order of price, starting with the lowest price.

In [None]:
# insert your code here

# Objective 3: Collect descriptive statistics from your dataframe
One of the most handy features of pandas dataframes is that they come with a few built-in methods for conducting descriptive analysis. For example, the `.describe()` method will give a summary of statistical measures of a given dataframe.

In [None]:
nyc.describe()

### Describe a column
This is good, but perhaps too much to be useful. Instead, we could choose to apply `.describe()` to a single column. This will give us more manageable information.

In [None]:
nyc.price.describe()

### Calculate the mean price
In addition, dataframes also have functions for calculating specific statistics such as mean, median and mode. To calculate the mean value of a column we can write the line below.

In [None]:
nyc.price.mean()

### Calculate the sum
Alternatively, if we wanted to find the sum of a column (e.g. the total number of reviews) we can use the sum function.

In [None]:
nyc.number_of_reviews.sum()

### Calculate number of unique values
Finally, there are a few other methods which are handy. For instance, the `.nunique()` method will tell us the number of unique values in a column.

In [None]:
nyc.host_id.nunique()
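
To check your understanding of these methods, it can help to run them on numbers small enough to verify by hand. The sketch below uses a hypothetical four-value series of prices.

```python
import pandas as pd

# Hypothetical prices, small enough to check by hand
prices = pd.Series([100, 150, 150, 400], name="price")

print(prices.describe())  # count, mean, std, min, quartiles and max
print(prices.mean())      # 200.0 (800 divided by 4)
print(prices.sum())       # 800
print(prices.nunique())   # 3 (the value 150 appears twice)
```

Notice how the single large value pulls the mean well above most of the individual prices, which is worth keeping in mind when interpreting the Airbnb data.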

## *Challenge Question 4 (1 point)*
Write code that calculates the median price for the property category of `'Entire home/apt'`. 

**Hint**: [This tutorial site](https://www.tutorialspoint.com/python_pandas/python_pandas_descriptive_statistics.htm) has a pretty good summary of dataframe functions.

In [None]:
# insert your code here

## *Challenge Question 5 (0.5 points)*
Write code which finds the neighborhood (*not* neighbourhood_group) with the most listings. You can probably do this in one line, though if you choose to use a more complex function, you are welcome to do so! 

**Hint:** Consider reviewing the standard descriptive statistics: mean, median and mode

In [None]:
# insert your code here

## *Challenge Question 6 (0.5 points)*
The `availability_365` column represents the number of days in the past year that the property was available to rent through the Airbnb app. Retrieve the number of listings in New York which were available every day of the previous year.

In [None]:
# insert your code here

# Objective 4: Make changes to your dataframe
In addition to being navigable, dataframes are also relatively easy to change. For instance, if we wanted to insert a column, we could use the `.insert()` method. According to the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.insert.html), this method takes four pieces of information, the last of which is optional:
- Where to insert it
- The name of the column
- The value to be inserted
- Whether to allow duplicates

The code below inserts the value "Airbnb" in a column named `dataset`. This would be handy if we acquired our data from more than one source.

In [None]:
nyc.insert(2, "dataset", "Airbnb", True)
nyc
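
Unlike most of the methods seen so far, `.insert()` changes the dataframe in place and does not return a new one. The sketch below illustrates this on a hypothetical two-row dataframe.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"id": [1, 2], "price": [150, 80]})

# Insert a constant-valued column at position 1 (the second column)
result = toy.insert(1, "dataset", "Airbnb")

print(result)             # None, because insert modifies toy directly
print(list(toy.columns))  # ['id', 'dataset', 'price']
```

Because the change happens in place, there is no need to write `toy = toy.insert(...)`; doing so would replace your dataframe with `None`.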

### Deleting data in python
This said, given that our data came from a single source, we have no need for this. To drop a column, we could choose to use the `del` keyword, which deletes objects stored in Python. Note that this keyword is not unique to pandas and can be used for virtually anything in Python.

In [None]:
del nyc['dataset']
nyc

### The drop method
The proper way to drop data in pandas, however, is to use the `.drop()` method. This method is used to drop rows or columns from a pandas dataframe. For instance, if we wished to drop the first two entries we could use the following:

In [None]:
mod_nyc = nyc.drop([0, 1]) # create a new dataframe which has the first two values dropped

mod_nyc.head()

Pandas drops rows by default so we only needed to provide the indexes to make it happen. Alternatively, to drop columns we need to provide a label and an `axis=1` value to tell pandas to search for the column with said label. If we wished to remove the host names (say, in order to better preserve privacy) we could specify the following.

In [None]:
mod_nyc = nyc.drop(labels='host_name', axis=1) # create a new dataframe which has the host_name column dropped

mod_nyc.head()
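
The sketch below, on a hypothetical `toy` dataframe with invented host names, shows that `.drop()` returns a modified copy and leaves the original alone. It also uses the `columns=` keyword, which is an alternative spelling of `labels=` combined with `axis=1`.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"host_name": ["Ana", "Ben", "Cy"], "price": [150, 80, 95]})

without_rows = toy.drop([0, 1])             # drop the rows labelled 0 and 1
without_col = toy.drop(columns="host_name") # drop a column by name

print(without_rows)
print(without_col)
print(toy.shape)  # (3, 2): the original dataframe is unchanged
```

This is why the lab stores the result in a new variable such as `mod_nyc` rather than expecting `nyc` itself to change.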

### Entering new columns
We can also add new columns to our dataframe. To create a new column, you can add the column values using a key/value format. The code below creates a new column called `reviews_to_avaliability_ratio` which calculates the number of reviews relative to the listing availability.

In [None]:
nyc['reviews_to_avaliability_ratio'] = nyc['number_of_reviews']/nyc['availability_365']

nyc.head()
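
A word of caution when dividing one column by another: rows where the denominator is zero do not raise an error. The sketch below uses a hypothetical three-row dataframe to show what pandas produces instead, along with one possible way to treat those results as missing values. The `ratio` and `ratio_clean` column names are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, including listings with zero availability
toy = pd.DataFrame({
    "number_of_reviews": [10, 5, 0],
    "availability_365": [365, 0, 0],
})

toy["ratio"] = toy["number_of_reviews"] / toy["availability_365"]
print(toy)  # 5/0 becomes inf and 0/0 becomes NaN

# One option: treat infinite ratios as missing values
toy["ratio_clean"] = toy["ratio"].replace([np.inf, -np.inf], np.nan)
print(toy["ratio_clean"].isna().sum())  # 2 rows have no usable ratio
```

Whether to replace, drop, or keep such rows is a judgment call that depends on the question you are asking of the data.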

## *Challenge Question 7 (1 point)*
One measure which might be interesting in this dataset is the ratio of price to number of reviews. Create a new column called `price_to_review_ratio` which calculates the price divided by the reviews. Once you have added this column, provide code which prints the median value of this number.

In [None]:
# insert your code here