# Tutorial 2: Transforming and Filtering

In this tutorial you will:

-   Learn about Pandas’ core types and column data types
-   Transform columns to clean your data and compute new columns
-   Use Boolean conditions on columns to filter DataFrame rows

In [None]:
%pip install pandas plotly nbformat

import pandas as pd
import plotly.express as px

listings_df = pd.read_csv('https://ben-denham.github.io/python-eda/data/inside_airbnb_listings_nz_2023_09.csv')

listings_df

## Data Types in Pandas

What is the *type* of a table of data?

In [None]:
type(listings_df)

What is the *type* of an individual column?

In [None]:
type(listings_df['room_type'])

Check the data type of each column in the DataFrame:
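
As a minimal sketch of what this check can look like, the `.dtypes` attribute of a DataFrame reports one data type per column. The small `toy_df` below is a hypothetical stand-in, not the real listings data, and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for a few listing columns.
toy_df = pd.DataFrame({
    'room_type': ['Entire home/apt', 'Private room'],
    'price_nzd': [120.0, 85.0],
    'minimum_nights': [2, 1],
})

# .dtypes returns a Series that maps each column name to its data type.
column_types = toy_df.dtypes
print(column_types)
```

The `.info()` method used later in this tutorial reports the same types alongside non-null counts.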

## Transforming Data

We can convert a column of datetime-like strings to a datetime data
type. First, look at the column as it was loaded:

In [None]:
listings_df['last_review']

Replace the `last_review` column with a new Series with a datetime data
type:
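
One common way to do this kind of conversion is `pd.to_datetime()`. The sketch below uses a hypothetical `toy_df` with made-up ISO-formatted date strings. The real column's string format is not shown here, so treat the format as an assumption.

```python
import pandas as pd

# Hypothetical toy column of date strings, including one missing value.
toy_df = pd.DataFrame({'last_review': ['2023-08-14', '2023-09-01', None]})

# pd.to_datetime parses the strings and returns a datetime Series;
# assigning it back replaces the original string column.
toy_df['last_review'] = pd.to_datetime(toy_df['last_review'])

print(toy_df['last_review'].dtype)
```

Missing values become `NaT` ("not a time"), the datetime equivalent of `NaN`.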

Check that the `last_review` column now has the correct data type:

In [None]:
listings_df.info()

### Maths with columns

Convert listing prices to Australian dollars:

In [None]:
nzd_to_aud = 0.93

listings_df['price_nzd']
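
Arithmetic between a Series and a single number is applied to every element. A minimal sketch with a hypothetical `toy_prices_nzd` Series (made-up prices) and the same illustrative exchange rate:

```python
import pandas as pd

nzd_to_aud = 0.93  # illustrative rate, matching the cell above

# Hypothetical toy prices in NZD.
toy_prices_nzd = pd.Series([100.0, 250.0, 80.0], name='price_nzd')

# Multiplying by a scalar converts every price at once and returns
# a new Series; the original Series is left unchanged.
toy_prices_aud = toy_prices_nzd * nzd_to_aud

print(toy_prices_aud)
```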

Add a `price_nzd_per_person` column to `listings_df`:

In [None]:
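# TODO: uncomment the line below and replace ... with an expression
# that computes the price per person.
# listings_df['price_nzd_per_person'] = ...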

listings_df
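
Arithmetic between two columns works row by row. The sketch below uses a hypothetical `toy_df` in which the `guests` column is an assumed name for the number of people a listing accommodates. The real dataset may name that column differently.

```python
import pandas as pd

# Hypothetical toy data; 'guests' is an assumed column name.
toy_df = pd.DataFrame({
    'price_nzd': [120.0, 90.0],
    'guests': [4, 2],
})

# Dividing one column by another pairs up values from the same row.
# Assigning to a new column name adds that column to the DataFrame.
toy_df['price_nzd_per_person'] = toy_df['price_nzd'] / toy_df['guests']

print(toy_df)
```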

### Applying Functions to DataFrames

The following function transforms a listing ID into a URL:

In [None]:
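def id_to_url(listing_id: str) -> str:
    """Convert a listing ID such as 'l11909616' into its Airbnb URL."""
    # Strip the leading 'l' to recover the numeric part of the ID.
    numeric_id = listing_id.removeprefix('l')
    return f'https://www.airbnb.co.nz/rooms/{numeric_id}'

id_to_url('l11909616')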

Produce a Series of listing URLs:
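
One way to run a function on every value of a Series is the `.apply()` method. In the sketch below, the function is repeated so the example is self-contained, and `toy_ids` is a hypothetical Series of made-up IDs rather than the real ID column.

```python
import pandas as pd

def id_to_url(listing_id: str) -> str:
    # Strip the leading 'l' and build the listing URL.
    numeric_id = listing_id.removeprefix('l')
    return f'https://www.airbnb.co.nz/rooms/{numeric_id}'

# Hypothetical toy Series of listing IDs.
toy_ids = pd.Series(['l11909616', 'l42'])

# .apply calls the function once per value and collects the results
# into a new Series with the same index.
listing_urls = toy_ids.apply(id_to_url)

print(listing_urls)
```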

The following function produces a description from a listing row:

In [None]:
def listing_to_description(row):
    room_type = row['room_type']
    host_name = row['host_name']
    return f'{room_type} by {host_name}'

Produce a Series of listing descriptions:
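
When a function needs several columns at once, `DataFrame.apply()` with `axis=1` passes each row to the function in turn. A minimal sketch, with the function repeated for self-containment and a hypothetical `toy_df` of made-up listings:

```python
import pandas as pd

def listing_to_description(row):
    room_type = row['room_type']
    host_name = row['host_name']
    return f'{room_type} by {host_name}'

# Hypothetical toy listings.
toy_df = pd.DataFrame({
    'room_type': ['Private room', 'Entire home/apt'],
    'host_name': ['Aroha', 'Sam'],
})

# axis=1 means "apply the function to each row"; every row arrives
# as a Series whose index is the column names.
descriptions = toy_df.apply(listing_to_description, axis=1)

print(descriptions)
```

Row-wise `.apply()` runs a Python function call per row, so column arithmetic is usually faster when it can express the same calculation.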

## Filtering Rows

Construct a *Boolean Series* that is `True` for listings in
`'Wellington City'`:

In [None]:
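# TODO: uncomment the line below and replace ... with a comparison
# that produces a Boolean Series.
# wellington_mask = ...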

Use the `wellington_mask` to get a DataFrame of listings in Wellington:

Use the `wellington_mask` to get a DataFrame of listings NOT in
Wellington:
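
The three steps above can be sketched on toy data as follows. The `region_name` column is an assumed name for the column that holds values like `'Wellington City'`, and `toy_df` is a hypothetical stand-in for `listings_df`.

```python
import pandas as pd

# Hypothetical toy listings; 'region_name' is an assumed column name.
toy_df = pd.DataFrame({
    'region_name': ['Wellington City', 'Auckland', 'Wellington City'],
    'price_nzd': [150.0, 95.0, 80.0],
})

# Comparing a column to a value gives a Boolean Series (a "mask").
wellington_mask = toy_df['region_name'] == 'Wellington City'

# Indexing with the mask keeps only the rows where it is True.
in_wellington = toy_df[wellington_mask]

# The ~ operator flips True and False, selecting the other rows.
not_in_wellington = toy_df[~wellington_mask]

print(in_wellington)
print(not_in_wellington)
```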

### Combining Filters

Construct a *Boolean Series* that is `True` for listings that cost
\$100 or less per night:

In [None]:
cheap_mask = listings_df['price_nzd'] <= 100

cheap_mask

Find cheap listings in Wellington:

Find listings that are cheap OR in Wellington (or both):
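
Boolean Series can be combined element by element with `&` (and) and `|` (or). The sketch below rebuilds both masks on a hypothetical `toy_df`, where `region_name` is again an assumed column name.

```python
import pandas as pd

# Hypothetical toy listings; 'region_name' is an assumed column name.
toy_df = pd.DataFrame({
    'region_name': ['Wellington City', 'Auckland', 'Wellington City', 'Auckland'],
    'price_nzd': [150.0, 95.0, 80.0, 300.0],
})

wellington_mask = toy_df['region_name'] == 'Wellington City'
cheap_mask = toy_df['price_nzd'] <= 100

# & keeps rows where BOTH masks are True.
cheap_and_wellington = toy_df[cheap_mask & wellington_mask]

# | keeps rows where AT LEAST ONE mask is True.
cheap_or_wellington = toy_df[cheap_mask | wellington_mask]

print(cheap_and_wellington)
print(cheap_or_wellington)
```

Python's `and` and `or` keywords do not work on Series. When writing comparisons inline, wrap each one in parentheses, as in `(a <= 100) & (b == 'x')`, because `&` and `|` bind more tightly than comparison operators.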

# Practice Exercises

## 1. Analysing listing ratings

### 1a. Plotting the ratings

Guests may review a listing on a scale of 0 to 5 stars, and the mean of
those reviews is recorded as the *rating* of the listing.

We’d like to get an idea of how listing ratings are distributed, so
your first task is to plot a histogram of the `review_scores_rating`
column:

### 1b. Thinking about distribution spikes

In your histogram, you should see that the smooth-ish shape of the
distribution is broken up by a series of spikes, especially around the
exact ratings of 0, 1, 2, 3, and 4. Why do you think that may be?

<details>
<summary>
Hint
</summary>

Listings with a small number of reviews are more likely to have a rating
that is either an exact number of stars or a simple ratio. For example,
a listing with a single review must have a rating that is equal to the
exact number of stars of its only review!

When considering summary statistics (like an average rating) it is
important to consider how many data points have made up that statistic.

</details>

### 1c. Plotting filtered data

Let’s see what the distribution looks like if we ignore listings with
few reviews.

Plot a histogram of the ratings of listings that have more than 100
reviews:

Does plotting only listings with a large number of reviews reveal any
other insights about the data?

<details>
<summary>
Hint
</summary>

No listing with more than 100 reviews has a rating less than 4.0 stars.

</details>

## 2. Transforming and Filtering Practice

Combine the techniques we’ve learned in the last two tutorials to answer
these questions:

### 2a. What is the lowest cost booking I could make?

Note: Each listing has a minimum number of nights that you must book for
(recorded in the `minimum_nights` column).

<details>
<summary>
Hint
</summary>

You may find it helpful to multiply two columns and use `.min()` as part
of your answer.

</details>

### 2b. Which parent region has more affordable listings on average, Auckland or Wellington City?

<details>
<summary>
Hint
</summary>

You may find it helpful to use filters and `.mean()` as part of your
answer.

</details>

**Extra:** The term “affordable” is open to interpretation here. Is a
listing more affordable if it can accommodate more people for the same
price? Does your answer change depending on your interpretation of
affordable?

## 3. Extra for Experts - Data Cleaning

Pandas DataFrames and Series have many other methods that apply useful
transformations. You can find references for them at these links:

-   https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
-   https://pandas.pydata.org/pandas-docs/stable/reference/series.html

Try using some of them to perform the data cleaning tasks below.

### 3a. Filter out listings that are missing a rating

Use the
[`.isna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.isna.html)
Series method to find listings that are missing a rating (i.e. the
rating is `NaN`), then keep only the listings that do have a rating.
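
As a small illustration of how `.isna()` behaves, the sketch below uses a hypothetical `toy_ratings` Series with made-up values rather than the real rating column.

```python
import pandas as pd

# Hypothetical toy ratings; None is stored as NaN in a float Series.
toy_ratings = pd.Series([4.8, None, 3.5, None], name='rating')

# .isna() returns a Boolean Series that is True where a value is missing.
missing_mask = toy_ratings.isna()

# Negating the mask keeps only the values that are present.
rated_only = toy_ratings[~missing_mask]

print(missing_mask)
print(rated_only)
```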

### 3b. Add a `filled_rating` column to `listings_df` that fills in missing ratings with the mean rating

Use the
[`.fillna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.fillna.html)
Series method to replace missing values in a Series with a given value.
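
A minimal sketch of `.fillna()` on a hypothetical `toy_ratings` Series with made-up values. Note that `.mean()` skips missing values by default, so the mean here is computed from the two known ratings only.

```python
import pandas as pd

# Hypothetical toy ratings with one missing value.
toy_ratings = pd.Series([4.0, None, 5.0], name='rating')

# .mean() ignores NaN by default, so this is the mean of 4.0 and 5.0.
mean_rating = toy_ratings.mean()

# .fillna() returns a new Series with each NaN replaced by the given
# value; the original Series is left unchanged.
filled_ratings = toy_ratings.fillna(mean_rating)

print(filled_ratings)
```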

> Filling in missing values with an average value can *sometimes* be a
> useful technique for applying machine learning models that cannot work
> when some data values are missing.

### 3c. Convert formatted price strings to numeric values

The original Inside Airbnb dataset provides each price as a formatted
string starting with a `$` and containing commas. The original formatted
values are retained in the `formatted_price` column.

Your task is to transform the `formatted_price` column into a numeric
`float` data type, just like the `price_nzd` column.

You may like to use the `.apply()` method used in the tutorial, or try using
the
[`.str.replace()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html)
and
[`.astype()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.astype.html)
Series methods.
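
A sketch of the `.str.replace()` and `.astype()` route on a hypothetical `toy_formatted` Series. The exact string format (for example, whether cents are included) is an assumption here, so check a few real values first.

```python
import pandas as pd

# Hypothetical formatted price strings; the exact format is assumed.
toy_formatted = pd.Series(['$1,250.00', '$95.00'])

toy_numeric = (
    toy_formatted
    # regex=False treats '$' as a literal character rather than
    # as a regular-expression anchor.
    .str.replace('$', '', regex=False)
    .str.replace(',', '', regex=False)
    # With the symbols removed, the strings can be parsed as floats.
    .astype(float)
)

print(toy_numeric)
```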