# Tutorial 1: Loading and Visualising

In this tutorial you will:

1.  Learn how to run Python code in Jupyter notebooks
2.  Load a CSV file of Airbnb listings into a Pandas DataFrame
3.  Use Pandas methods to calculate summary statistics of listing
    attributes
4.  Use Plotly to produce simple plots to visualise the listings

## Jupyter Notebooks

> Note: Your notebooks in Google Colab will be saved to your Google
> Drive account. You can find previous notebooks from the
> `File -> Open notebook` menu.

In the empty *cell* below, type `1 + 1` and then **press the `Enter` key
*while holding* the `Shift` key** to run the code (there may be a short
delay while the notebook session starts up):

<details>
<summary>
Reveal model answer
</summary>

``` code
1 + 1
```

</details>

However, **we need to be careful about the order in which we run
cells**. Try to run the following cell:

In [None]:
days_in_year / 7

Run the following cell to define `days_in_year`, then re-run the cell
above:

In [None]:
days_in_year = 365.2425
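
The same behaviour can be reproduced inside a single cell. The sketch below is only an illustration: `hours_in_day` is a hypothetical variable (not part of the tutorial data), and the `try`/`except` simply captures the error that Python raises when a name is used before it has been defined.

```python
# `hours_in_day` is a made-up name used only for this illustration.
try:
    hours_in_day * 7  # the name is not defined yet, so this raises NameError
    outcome = 'no error'
except NameError as error:
    outcome = f'NameError: {error}'

hours_in_day = 24                 # equivalent to running the defining cell
hours_in_week = hours_in_day * 7  # the same expression now succeeds

print(outcome)
print(hours_in_week)
```

If you re-run this cell in the same notebook session, `hours_in_day` will already exist and no error is captured, which is exactly why the order (and history) of cell execution matters.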

## Loading data with Pandas

Running the following cell will ensure we have Pandas installed
(although it is already pre-installed in Google Colab):

In [None]:
%pip install pandas

Import the Pandas library:

<details>
<summary>
Reveal model answer
</summary>

``` code
import pandas as pd
```

</details>

Now, let’s load our CSV of Airbnb listings:

In [None]:
listings_df = pd.read_csv('https://ben-denham.github.io/python-eda/data/inside_airbnb_listings_nz_2023_09.csv')

Display the contents of the DataFrame:

<details>
<summary>
Reveal model answer
</summary>

``` code
listings_df
```

</details>

Only a few rows from the top and bottom of the DataFrame are shown.

Sort the DataFrame by the listing’s current rating:

<details>
<summary>
Reveal model answer
</summary>

``` code
listings_df.sort_values('review_scores_rating')
```

</details>

**This doesn’t change the original DataFrame**; it returns a new
DataFrame that is sorted.

Notice that the ratings sorted “last” are `NaN` (also known as `null`,
`None`, or “missing”). **By default `sort_values()` puts all `NaN`
values at the end, regardless of sort order.**

Indeed, these listings have a `number_of_reviews` equal to `0`. We’ll
need to consider how to handle these rating-less listings later.
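
Both points about `sort_values()` can be checked on a small example. The sketch below uses a tiny made-up DataFrame (the names and ratings are hypothetical, not taken from the listings CSV) and also shows the `na_position` parameter, which can be used to put missing values first instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: listing 'b' has no reviews, so its rating is NaN.
toy_df = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    'review_scores_rating': [4.5, np.nan, 4.9],
})

ascending = toy_df.sort_values('review_scores_rating')
descending = toy_df.sort_values('review_scores_rating', ascending=False)
nan_first = toy_df.sort_values('review_scores_rating', na_position='first')

print(ascending['name'].tolist())   # NaN row is last
print(descending['name'].tolist())  # NaN row is still last
print(nan_first['name'].tolist())   # NaN row moved to the front
print(toy_df['name'].tolist())      # the original order is unchanged
```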

Select a subset of columns from a DataFrame:

<details>
<summary>
Reveal model answer
</summary>

``` code
listings_df[['latitude', 'longitude']]
```

</details>

Now extract the `name` column on its own:

<details>
<summary>
Reveal model answer
</summary>

``` code
listings_df['name']
```

</details>
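
The two selection styles above return different types of object. As a minimal sketch on made-up data (the listing names and coordinates below are hypothetical), indexing with a list of column names gives a DataFrame, while indexing with a single column name gives a Series:

```python
import pandas as pd

# Hypothetical toy data standing in for the listings DataFrame.
toy_df = pd.DataFrame({
    'name': ['Cosy cabin', 'City flat'],
    'latitude': [-36.85, -41.29],
    'longitude': [174.76, 174.78],
})

coords = toy_df[['latitude', 'longitude']]  # list of names -> DataFrame
names = toy_df['name']                      # single name -> Series
one_col_df = toy_df[['name']]               # list with one name -> still a DataFrame

print(type(coords).__name__, coords.shape)
print(type(names).__name__, names.shape)
print(type(one_col_df).__name__, one_col_df.shape)
```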

## Summary Statistics

Get summary statistics for all numeric columns of the listings:

<details>
<summary>
Reveal model answer
</summary>

``` code
listings_df.describe()
```

</details>
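
To make the output of `describe()` easier to read, the sketch below runs it on a tiny made-up DataFrame (the values are hypothetical, not taken from the listings CSV). It illustrates two details: non-numeric columns are skipped by default, and `count` ignores `NaN` values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one text column and one missing rating.
toy_df = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'price_nzd': [100.0, 150.0, 200.0, 250.0],
    'review_scores_rating': [4.0, np.nan, 5.0, 4.5],
})

summary = toy_df.describe()

print(summary.columns.tolist())                       # 'name' is not included
print(summary.loc['count', 'review_scores_rating'])   # NaN is not counted
print(summary.loc['50%', 'price_nzd'])                # the median price
```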

Get the maximum price of any listing:

<details>
<summary>
Reveal model answer
</summary>

``` code
listings_df['price_nzd'].max()
```

</details>

Count the frequency of each room type (values are listed by descending
frequency):

<details>
<summary>
Reveal model answer
</summary>

``` code
listings_df['room_type'].value_counts()
```

</details>
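
As a small illustration of the ordering, the sketch below calls `value_counts()` on a made-up Series of room types (the values are hypothetical). It also shows the `normalize=True` option, which returns proportions instead of raw counts.

```python
import pandas as pd

# Hypothetical room types for six listings.
room_types = pd.Series(
    ['Entire home/apt', 'Private room', 'Entire home/apt',
     'Shared room', 'Entire home/apt', 'Private room'],
    name='room_type',
)

counts = room_types.value_counts()                     # most frequent first
proportions = room_types.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(proportions)
```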

## Plotting with Plotly

Running the following cell will ensure we have Plotly installed
(although it is already pre-installed in Google Colab):

In [None]:
%pip install plotly

Import Plotly Express:

<details>
<summary>
Reveal model answer
</summary>

``` code
import plotly.express as px
```

</details>

Create a scatter plot of listing latitude and longitude values:

<details>
<summary>
Reveal model answer
</summary>

``` code
px.scatter(listings_df, x='longitude', y='latitude')
```

</details>

Plotly Express makes such plots easy when we have *tidy data* - that is:

-   Each row is an *observation* or *data point* to plot
-   Each column is an *attribute* or *feature* describing some aspect of
    an observation
-   Each cell contains only a single value
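
Not every table arrives in tidy form. As a hedged sketch, the example below starts from a made-up "wide" table (the regions and counts are hypothetical) in which room types are spread across columns, and uses the Pandas `melt` method to reshape it so that each row is a single observation:

```python
import pandas as pd

# Hypothetical "wide" table: one row per region, one column per room type.
wide_df = pd.DataFrame({
    'region': ['Auckland', 'Wellington'],
    'Entire home/apt': [120, 80],
    'Private room': [40, 30],
})

# Reshape so each row is one (region, room_type) observation.
tidy_df = wide_df.melt(
    id_vars='region',
    var_name='room_type',
    value_name='listing_count',
)

print(tidy_df)
```

In the tidy version, `room_type` is an ordinary column, so it could be passed directly to a Plotly Express argument such as `x` or `color`.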

Plot the *distribution* of listing prices:

<details>
<summary>
Reveal model answer
</summary>

``` code
px.histogram(listings_df, x='price_nzd')
```

</details>

# Practice Exercises

## 1. Dataset Statistics

Using the methods we’ve looked at so far, answer the following
questions:

### 1a. Which “parent region” has the fewest listings?

<details>
<summary>
Hint
</summary>

Use the `.value_counts` method.

</details>

<details>
<summary>
Reveal model answer
</summary>

``` code
listings_df.value_counts('region_parent_name')
```

</details>

### 1b. Which listing was last reviewed the longest time ago?

<details>
<summary>
Hint
</summary>

Use the `.sort_values` method.

</details>

<details>
<summary>
Reveal model answer
</summary>

``` code
# Sort ascending so the oldest `last_review` date is in the first row.
listings_df.sort_values('last_review')
```

</details>

### 1c. What would you consider to be a “typical” minimum number of nights for a listing?

<details>
<summary>
Hint
</summary>

Look at the `mean`, median (`50%`), and `max` for `minimum_nights`.

</details>

<details>
<summary>
Reveal model answer
</summary>

``` code
listings_df.describe()
```

The mean `minimum_nights` is slightly higher than the median, which
seems to be due to an abnormally high maximum value. We could further
confirm this by plotting a histogram.

Therefore, the median of `2` is a better choice for a “typical” minimum
number of nights.
</details>
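
The effect of a single extreme value on the mean can be seen on a small example. The numbers below are made up for illustration (they are not the real `minimum_nights` values, and the effect is deliberately exaggerated): one listing with a very long minimum stay shifts the mean a long way, while the median does not move.

```python
import pandas as pd

# Hypothetical minimum-night values; 365 plays the role of the outlier.
minimum_nights = pd.Series([1, 1, 2, 2, 2, 3, 365])

mean_with_outlier = minimum_nights.mean()
median_with_outlier = minimum_nights.median()

# Drop the extreme value and recompute both statistics.
without_outlier = minimum_nights[minimum_nights < 365]
mean_without_outlier = without_outlier.mean()
median_without_outlier = without_outlier.median()

print(mean_with_outlier, median_with_outlier)
print(mean_without_outlier, median_without_outlier)
```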

## 2. Plotting Practice

### 2a. Plot the frequencies of room types

In Plotly, we can use histograms to plot the frequencies of values in
*categorical* columns (e.g. string labels or integers) as well as
*continuous* columns (e.g. numbers with a decimal point).

Plot a histogram of the categorical room type column.

<details>
<summary>
Hint
</summary>

Call `px.histogram` with `x` set to the `room_type` column.

</details>

<details>
<summary>
Reveal model answer
</summary>

``` code
px.histogram(listings_df, x='room_type')
```

</details>

### 2b. Plot the relationship between price and number of reviews

Investigating relationships between attributes (columns) is an
incredibly important part of exploratory data analysis.

Construct a scatter plot that shows the relationship between the
listing’s price and its number of reviews (you may need to zoom in on
the plot).

<details>
<summary>
Hint
</summary>

Call `px.scatter` with `x` and `y` set to the columns for price and
number of reviews.

</details>

<details>
<summary>
Reveal model answer
</summary>

``` code
px.scatter(listings_df, x='price_nzd', y='number_of_reviews')
```

</details>

What does this scatter plot tell you about listings that are very
expensive (say, greater than \$1,000 per night)?

<details>
<summary>
Reveal model answer
</summary>

These more expensive listings tend not to have more than about 100
reviews.

This might be important to consider if we plan to use review scores to
predict the price of an expensive listing.
</details>

## 3. Extra for Experts - Plot Types

Choosing the right type of plot is important. For example, say we want
to look at the relationship between the price of reasonably priced
listings (less than \$500 per night) and how many people they
accommodate. The following scatter plot doesn’t give us much insight
because the points overlap too much:

In [None]:
px.scatter(listings_df, x='accommodates', y='price_nzd', range_y=[0, 500])

However, a heatmap can help us more easily see the relationship by
representing the number of points in each cell with colour:

In [None]:
px.density_heatmap(listings_df, x='accommodates', y='price_nzd', range_y=[0, 500], nbinsy=2000)
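
To see what "the number of points in each cell" means, the counts behind a heatmap can be approximated by hand. The sketch below is an illustration on made-up data, not a description of how Plotly computes its bins: it groups prices into \$100-wide bins with `pd.cut` (the bin edges and labels are assumptions chosen for this example) and counts the listings in each combination of price bin and `accommodates` value.

```python
import pandas as pd

# Hypothetical listings: guest capacity and nightly price.
toy_df = pd.DataFrame({
    'accommodates': [2, 2, 2, 4, 4, 6],
    'price_nzd': [90, 110, 120, 180, 220, 450],
})

# Assign each price to a $100-wide bin (right edge inclusive).
price_bin = pd.cut(
    toy_df['price_nzd'],
    bins=[0, 100, 200, 300, 400, 500],
    labels=['0-100', '100-200', '200-300', '300-400', '400-500'],
)

# Count the listings in each (price bin, accommodates) cell.
cell_counts = pd.crosstab(price_bin, toy_df['accommodates'])

print(cell_counts)
```

Each number in the resulting table corresponds to one cell of a heatmap; a larger count would be drawn in a more intense colour.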

Try out some other plot types from:
https://plotly.com/python/plotly-express/

For example, try making a 3D scatter plot between 3 columns:
https://plotly.com/python/3d-scatter-plots/