# **Airbnb Berlin Listings Analysis: Summary and Insights**

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## **1. Introduction**

In this analysis, we explore a dataset of Airbnb listings in Berlin to uncover key insights about pricing trends and geographical patterns. The dataset includes features such as price, location (latitude and longitude), and other attributes of each listing. Our goal is to understand how factors such as location relate to price and to identify significant insights from the data.

Throughout this analysis, we perform data cleaning, exploratory data analysis (EDA), and visualization to gain insights into pricing, popular areas, and trends within the dataset. The analysis also investigates potential relationships between features, mainly between price and the other attributes.

The insights gathered from this analysis aim to help readers understand pricing trends based on location.

## **2. Summary of the Dataset**

In [2]:
# Load Dataset
listings = pd.read_csv("data/raw/listings.csv")

The dataset used in this analysis is sourced from [Inside Airbnb](https://insideairbnb.com/get-the-data/), which provides publicly available data on Airbnb listings across various cities. The dataset specifically covers Airbnb listings in Berlin and contains several attributes related to the listings. We retrieved three datasets from the source, but only `listings.csv`, located at `data/raw/listings.csv`, is used for the analysis. Below is a brief summary of the dataset.

### **2.1 Dataset Overview**

In [3]:
listings.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
0,3176,Fabulous Flat in great Location,3718,Britta,Pankow,Prenzlauer Berg Südwest,52.53471,13.4181,Entire home/apt,105.0,63,148,2023-05-25,0.8,1,333,0,First name and Last name: Nicolas Krotz <br/> ...
1,9991,Geourgeous flat - outstanding views,33852,Philipp,Pankow,Prenzlauer Berg Südwest,52.53269,13.41805,Entire home/apt,180.0,6,7,2020-01-04,0.06,1,22,0,03/Z/RA/003410-18
2,14325,Studio Apartment in Prenzlauer Berg,55531,Chris + Oliver,Pankow,Prenzlauer Berg Nordwest,52.54813,13.40366,Entire home/apt,70.0,150,26,2023-11-30,0.15,4,270,1,
3,16644,In the Heart of Berlin - Kreuzberg,64696,Rene,Friedrichshain-Kreuzberg,nördliche Luisenstadt,52.50312,13.43508,Entire home/apt,90.0,93,48,2017-12-14,0.28,2,17,0,
4,17904,Beautiful Kreuzberg studio - 3 months minimum,68997,Matthias,Neukölln,Reuterstraße,52.49419,13.42166,Entire home/apt,28.0,92,299,2022-12-01,1.68,1,16,0,


In [4]:
listings.dtypes

id                                  int64
name                               object
host_id                             int64
host_name                          object
neighbourhood_group                object
neighbourhood                      object
latitude                          float64
longitude                         float64
room_type                          object
price                             float64
minimum_nights                      int64
number_of_reviews                   int64
last_review                        object
reviews_per_month                 float64
calculated_host_listings_count      int64
availability_365                    int64
number_of_reviews_ltm               int64
license                            object
dtype: object

#### **Attributes**

- **`id`**: A unique identifier assigned to each listing. (type `integer`)
- **`name`**: The title or name of the listing, provided by the host. (type `string`) 
- **`host_id`**: A unique identifier assigned to each host, representing the listing owner. (type `integer`)  
- **`host_name`**: The name of the host who owns the listing.  (type `string`)
- **`neighbourhood_group`**: The district or precinct in Berlin where the listing is located. (type `string`)
- **`neighbourhood`**: A more specific location within the district, pinpointing the listing’s exact area. (type `string`)
- **`latitude`** and **`longitude`**: Geographic coordinates indicating the precise location of the listing.  (type `float`)
- **`room_type`** : The type of room (e.g., entire home, private room). (type `string`)
- **`price`** : The price per night of the listing. (type `float`)
- **`minimum_nights`**: The minimum number of nights required for a stay at the listing. (type `integer`)
- **`number_of_reviews`**: The total number of reviews the listing has received. (type `integer`)
- **`last_review`**: The date of the most recent review for the listing. (type `string`)
- **`reviews_per_month`**: The average number of reviews the listing receives each month. (type `float`)
- **`calculated_host_listings_count`**: The total number of listings managed by the same host. (type `integer`)
- **`availability_365`**: The number of days in a year that the listing is available for booking. (type `integer`)
- **`number_of_reviews_ltm`**: The total number of reviews received in the last 12 months. (type `integer`)
- **`license`**: Information about the license or registration status of the listing, as required by local regulations. (type `string`)

### **2.2 Data Cleaning**

The code for data cleaning is located in `airbnb_berlin_data_cleaning`. The only dataset used is `listings.csv`, and it must be cleaned before we can proceed with the analysis. The cleaning steps are as follows.


#### **1. Remove Unnecessary Columns**

The removed columns primarily pertain to **reviews**, including `number_of_reviews`, `last_review`, `reviews_per_month`, and `number_of_reviews_ltm`. While these columns provide data on the volume and timing of reviews, they lack information about the quality of the reviews (e.g., whether they are positive or negative). As a result, we deemed them less relevant for our analysis and excluded them.

```python
listings = listings.drop(["number_of_reviews", "last_review", "reviews_per_month", "number_of_reviews_ltm"], axis="columns")
```

#### **2. Assign The Proper Data Type To Each Attribute/Column**

The data types for **`neighbourhood_group`**, **`neighbourhood`**, and **`room_type`** were converted from `string` to `category`. This change was implemented to optimize memory usage and enhance performance during data analysis, as these columns contain a limited number of unique values.

```python
listings["neighbourhood_group"] = listings["neighbourhood_group"].astype("category")
listings["neighbourhood"] = listings["neighbourhood"].astype("category")
listings["room_type"] = listings["room_type"].astype("category")
```
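The memory benefit described above can be checked directly. The following is a minimal sketch on a toy column rather than the real dataset; the repeated values and the row count are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for a low-cardinality column such as `room_type`
# (values and row count are illustrative, not taken from the dataset).
room_type = pd.Series(
    ["Entire home/apt", "Private room", "Shared room", "Hotel room"] * 1000
)

as_object = room_type.astype("object")
as_category = room_type.astype("category")

# deep=True measures the stored strings themselves, not just the references.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

A categorical column stores each distinct label once and keeps a small integer code per row, which is why the saving grows with the number of repeated values.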

#### **3. Handle Missing Values**

First, we excluded any listings without `price` data, as this attribute is essential for the analysis.
```python
listings = listings.dropna(subset=["price"])
```

Next, for the `license` attribute, we fill the missing values with the string `"Unknown"`.
```python
listings["license"] = listings["license"].fillna("Unknown")
```

After executing the previous code, the dataset's indices may no longer be sequential or aligned. To fix this, we need to reset the indices to ensure they are properly ordered and consistent.
```python
# reset_index returns a new DataFrame, so assign the result back
listings = listings.reset_index(drop=True)
```
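One way to confirm that the missing-value handling behaves as intended is to count missing entries before and after. This is a small sketch on a made-up frame; the rows are illustrative assumptions, and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with the two columns handled in this step (rows are made up).
toy = pd.DataFrame({
    "price": [105.0, np.nan, 70.0, 90.0],
    "license": ["03/Z/RA/003410-18", None, None, "Unknown"],
})

missing_before = toy.isna().sum()

# Same three operations as in the cleaning step above.
toy = toy.dropna(subset=["price"])
toy["license"] = toy["license"].fillna("Unknown")
toy = toy.reset_index(drop=True)

missing_after = toy.isna().sum()

print(missing_before)
print(missing_after)
print(toy.index.tolist())
```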

#### **4. Remove Outliers With The Interquartile Range (IQR) Method**


Lastly, we removed outliers in the `price` column using the interquartile range (IQR) method, keeping only prices that lie within 1.5 × IQR of the first and third quartiles. This step was necessary because some listings had unreasonably high or low prices, and such extreme values skew the data and distort statistical calculations. One disadvantage of this approach is that it may also exclude valid listings whose prices are reasonable but still fall outside the IQR bounds.

```python
q1 = listings["price"].quantile(0.25)
q3 = listings["price"].quantile(0.75)
iqr = q3 - q1

lower_bound = q1 - 1.5*iqr
upper_bound = q3 + 1.5*iqr

condition = (listings["price"] > lower_bound) & (listings["price"] < upper_bound)
listings_no_outliers = listings[condition]

listings = listings_no_outliers
```
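To make the bounds concrete, here is a small worked sketch of the same rule on made-up prices. The values are illustrative assumptions chosen so that the quartiles are easy to follow.

```python
import pandas as pd

# Nine made-up nightly prices, including one extreme value.
prices = pd.Series(
    [28, 42, 55, 70, 75, 90, 105, 150, 9999], name="price", dtype="float64"
)

q1 = prices.quantile(0.25)   # 55.0
q3 = prices.quantile(0.75)   # 105.0
iqr = q3 - q1                # 50.0

lower_bound = q1 - 1.5 * iqr  # -20.0
upper_bound = q3 + 1.5 * iqr  # 180.0

# Same strict comparison as in the cleaning code above.
inside = (prices > lower_bound) & (prices < upper_bound)
kept = prices[inside]
removed = prices[~inside]

print(f"bounds: ({lower_bound}, {upper_bound})")
print(f"removed: {removed.tolist()}")
```

Because the comparison is strict, a price exactly equal to a bound would also be removed.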

#### **5. Export Cleaned Dataset**

Once the data has been cleaned, it is ready for export. The cleaned dataset is saved to the file path `data/cleaned/cleaned_listings.csv`, the same path it is loaded from in the next section.
```python
listings.to_csv("data/cleaned/cleaned_listings.csv", index=False)
```

### **2.3 Cleaned Dataset Overview**

In [5]:
# Load The Cleaned Dataset
listings = pd.read_csv("data/cleaned/cleaned_listings.csv")

# Reassign Proper Columns / Attributes Types
listings["neighbourhood_group"] = listings["neighbourhood_group"].astype("category")
listings["neighbourhood"] = listings["neighbourhood"].astype("category")
listings["room_type"] = listings["room_type"].astype("category")

***Note*** : Previously, the variable `listings` contained the raw dataset. From this point onward, `listings` refers to the cleaned dataset.
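The category casts have to be repeated after loading because a CSV file does not store pandas dtypes. As an alternative sketch, the dtypes can be requested while reading. The in-memory CSV below is a made-up two-row excerpt standing in for the cleaned file.

```python
import io

import pandas as pd

# Two-row stand-in for the cleaned CSV (a subset of columns, for illustration).
csv_text = """id,neighbourhood_group,neighbourhood,room_type,price
3176,Pankow,Prenzlauer Berg Südwest,Entire home/apt,105.0
16644,Friedrichshain-Kreuzberg,nördliche Luisenstadt,Entire home/apt,90.0
"""

category_columns = ["neighbourhood_group", "neighbourhood", "room_type"]

# The dtype mapping applies the category type while parsing.
toy = pd.read_csv(
    io.StringIO(csv_text),
    dtype={column: "category" for column in category_columns},
)

print(toy.dtypes)
```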

In [6]:
listings.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,calculated_host_listings_count,availability_365,license
0,3176,Fabulous Flat in great Location,3718,Britta,Pankow,Prenzlauer Berg Südwest,52.53471,13.4181,Entire home/apt,105.0,63,1,333,First name and Last name: Nicolas Krotz <br/> ...
1,9991,Geourgeous flat - outstanding views,33852,Philipp,Pankow,Prenzlauer Berg Südwest,52.53269,13.41805,Entire home/apt,180.0,6,1,22,03/Z/RA/003410-18
2,14325,Studio Apartment in Prenzlauer Berg,55531,Chris + Oliver,Pankow,Prenzlauer Berg Nordwest,52.54813,13.40366,Entire home/apt,70.0,150,4,270,Unknown
3,16644,In the Heart of Berlin - Kreuzberg,64696,Rene,Friedrichshain-Kreuzberg,nördliche Luisenstadt,52.50312,13.43508,Entire home/apt,90.0,93,2,17,Unknown
4,17904,Beautiful Kreuzberg studio - 3 months minimum,68997,Matthias,Neukölln,Reuterstraße,52.49419,13.42166,Entire home/apt,28.0,92,1,16,Unknown


#### **Attributes**

- **`id`**: A unique identifier assigned to each listing. (type `integer`)
- **`name`**: The title or name of the listing, provided by the host. (type `string`) 
- **`host_id`**: A unique identifier assigned to each host, representing the listing owner. (type `integer`)  
- **`host_name`**: The name of the host who owns the listing.  (type `string`)
- **`neighbourhood_group`**: The district or precinct in Berlin where the listing is located. (type `category`)
- **`neighbourhood`**: A more specific location within the district, pinpointing the listing’s exact area. (type `category`)
- **`latitude`** and **`longitude`**: Geographic coordinates indicating the precise location of the listing.  (type `float`)
- **`room_type`** : The type of room (e.g., entire home, private room). (type `category`)
- **`price`** : The price per night of the listing. (type `float`)
- **`minimum_nights`**: The minimum number of nights required for a stay at the listing. (type `integer`)
- **`calculated_host_listings_count`**: The total number of listings managed by the same host. (type `integer`)
- **`availability_365`**: The number of days in a year that the listing is available for booking. (type `integer`)
- **`license`**: Information about the license or registration status of the listing, as required by local regulations. (type `string`)



## **3. Exploratory Data Analysis (EDA) Insights**

### **3.1 How Does The Location Of An Airbnb Listing Affect The Price Of Accommodation Per Night?**

<img src="plots/price_distribution.png" alt="Airbnb Price Distribution" width="400" height="300">

In [7]:
listings["price"].mean()

124.07909604519774

The majority of the listings are priced between €42 and €75, with an average listing price of €124.08. Outliers were removed beforehand using the interquartile range method, which excluded every listing priced above €342. The original aim was only to filter out a few extreme listings priced between €9,999 and €11,000.
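Statements about a price band and an average can be checked with a few summary statistics. The sketch below uses made-up prices, not the real dataset, to show how the share of listings inside a band can be compared with the mean and the median.

```python
import pandas as pd

# Ten made-up nightly prices with a long right tail.
prices = pd.Series([28, 42, 55, 60, 70, 75, 90, 105, 180, 340], dtype="float64")

# between() is inclusive on both ends by default; the mean of a
# boolean Series is the fraction of True values.
share_in_band = prices.between(42, 75).mean()

mean_price = prices.mean()
median_price = prices.median()

print(f"share priced 42-75: {share_in_band:.0%}")
print(f"mean: {mean_price:.2f}, median: {median_price:.2f}")
```

When a few expensive listings remain, the mean sits above the median, so reporting both gives a fuller picture of the distribution.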

<img src="plots/price_location_scatter.png" alt="Price per Night Based on Locations" width="400" height="300">

The geospatial visualization shows a high concentration of listings near the center of Berlin, with the density gradually decreasing toward the outskirts of the city. This could indicate that the central areas are more popular, with higher demand for accommodation, while the outer areas offer fewer listings because demand is lower or because they are less attractive to tourists and other guests.

One likely reason is that the city center is a tourist-heavy area, where demand is driven by visitors who seek proximity to landmarks. In contrast, the outer parts of the city are largely residential, which may be less appealing to short-term visitors, thus reducing the density of listings.

<img src="plots/berlin_neighborhoods_with_listings_counts_plot.png" alt="Berlin Neighbourhoods with Listing Counts" width="700" height="600">

<img src="plots/neighbourhood_group_distribution.png" alt="Berlin Neighbourhoods with Listing Counts" width="400" height="300">

The plot titled **`Scatter Plot of Berlin Neighborhoods with Listing Counts`** displays the approximate center of each neighborhood group in Berlin. The number above each dot represents the total number of listings in that neighborhood group. As shown, the neighborhood group **`Mitte`** has the highest number of listings, with **`1,835`** available. As we move further from the city center, the number of listings decreases, with **`Marzahn-Hellersdorf`** having the fewest at **`99`** listings.
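Listing counts per neighborhood group like those in the plot can be obtained with `value_counts`. This is a minimal sketch on a made-up frame; the rows are illustrative and do not reproduce the real counts.

```python
import pandas as pd

# Made-up rows; only the column name matches the dataset.
toy = pd.DataFrame({
    "neighbourhood_group": [
        "Mitte", "Mitte", "Pankow", "Mitte", "Marzahn-Hellersdorf", "Pankow",
    ]
})

# value_counts sorts from most to fewest listings.
listing_counts = toy["neighbourhood_group"].value_counts()

print(listing_counts)
print("most listings:  ", listing_counts.idxmax())
print("fewest listings:", listing_counts.idxmin())
```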

<img src="plots/berlin_neighborhoods_avg_price_plot.png" alt="Berlin Neighbourhoods with Average Price" width="700" height="600">

<img src="plots/average_price_neighbourhoods.png" alt="Average Prices Of Berlin Neighbourhoods" width="400" height="300">

Above, the location of each neighborhood group is marked with its corresponding average price. As shown, **`Mitte`** has the highest average price at **`€140.12`**. A likely explanation is that Mitte is the central district of Berlin, close to major tourist attractions, business districts, cultural landmarks, and vibrant nightlife. These factors increase the demand for accommodation, driving the price higher.

As you move away from Mitte towards the outskirts of Berlin, the average price tends to decrease. This trend is typical in most large cities, where central areas (especially business and tourist districts) command higher accommodation rates. The further a neighborhood is from the city center, the less demand there is for short-term rentals, leading to more affordable pricing.
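The per-district averages discussed above come down to a grouped mean. The following sketch uses made-up prices to show the computation; the figures are illustrative and are not the dataset's results.

```python
import pandas as pd

# Made-up prices for three neighborhood groups.
toy = pd.DataFrame({
    "neighbourhood_group": ["Mitte", "Mitte", "Pankow", "Pankow", "Spandau"],
    "price": [140.0, 160.0, 100.0, 120.0, 80.0],
})

# Average nightly price per group, most expensive first.
avg_price = (
    toy.groupby("neighbourhood_group")["price"]
    .mean()
    .sort_values(ascending=False)
    .round(2)
)

print(avg_price)
```

Because the notebook casts `neighbourhood_group` to `category`, passing `observed=True` to `groupby` on the real data keeps categories without any listings out of the result.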

### **3.2 How Does The Room Type Of An Airbnb Listing Affect The Price Of Accommodation Per Night?**

<img src="plots/average_price_room_type.png" alt="Airbnb Price Distribution" width="400" height="300">

Here we can see that **`Hotel room`** has the highest average price of all room types, at **`€219.05`**. Hotel rooms generally cost more because of the added services they offer, and guests who book them are often willing to pay more for these amenities. Additionally, hotels are typically located in prime areas or near business and tourist districts, which can drive up the price.

The average price differences between **`Entire home/apt`**, **`Private room`**, and **`Shared room`** are likely due to the varying levels of privacy and control over space each option provides. **`Entire home/apt`** offers the highest level of both privacy and control, followed by **`Private room`**, while **`Shared room`** provides the least privacy and control, resulting in the lowest prices.

<img src="plots/room_type_distribution.png" alt="Airbnb Price Distribution" width="400" height="300">


Another factor that could explain why **`Hotel room`** has a higher price than other room types, particularly **`Entire home/apt`**, is its significantly lower supply, as shown above. This limited availability could be contributing to the higher price of **`Hotel room`**.
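Average price and supply per room type can be viewed side by side with a single grouped aggregation. This is a sketch on made-up rows; the output column names `avg_price` and `listings` are chosen here for illustration.

```python
import pandas as pd

# Made-up listings covering the four room types.
toy = pd.DataFrame({
    "room_type": [
        "Hotel room",
        "Entire home/apt", "Entire home/apt", "Entire home/apt",
        "Private room", "Private room",
        "Shared room",
    ],
    "price": [220.0, 120.0, 140.0, 130.0, 60.0, 70.0, 40.0],
})

# Named aggregation: one column for the mean price, one for the listing count.
summary = (
    toy.groupby("room_type")["price"]
    .agg(avg_price="mean", listings="count")
    .sort_values("avg_price", ascending=False)
)

print(summary)
```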

## **4. Key Findings and Insights**

### **4.1 Location vs. Price**

Listings closer to the city center tend to have higher prices per night.

### **4.2 Room Type vs. Price**

**`Hotel room`** listings have the highest average price, followed by **`Entire home/apt`**, **`Private room`**, and then **`Shared room`**.

## **5. Limitations of the Analysis**

Some listings have **missing price data** and had to be excluded from the analysis, as price is the key variable in this study.

The attributes or columns related to reviews have also been removed from the analysis, as they do not provide context on whether the reviews are positive or negative.

The method used to remove outliers, the interquartile range, was intended to eliminate a few listings with exceptionally high prices (e.g., €9,999 and €11,000 per night), which may distort the dataset. However, this method also removed some listings with more reasonable prices above €342 per night.

## **6. Recommendations and Next Steps**

The analysis could be extended by incorporating data that provides context on whether reviews are positive or negative. This could be in the form of numerical values (e.g., higher numbers indicating positive reviews) or categorical/string values (e.g., "positive" and "negative"). This would allow for an analysis of the relationship between review sentiment and price.

Another potential analysis involves exploring the impact of seasonality on listing prices, which would require additional data. Understanding how different seasons affect prices could be valuable, particularly for tourists.

Further analysis could also focus on the relationship between location (latitude and longitude) and room type, as this may help explain price differences across room types. For instance, the higher price of **`Hotel room`** may be due to its proximity to the city center.
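As a starting point for that follow-up, proximity to the city center could be measured with the haversine (great-circle) distance. The sketch below is not part of the original analysis: the center coordinates are an assumed reference point, and `distance_to_center_km` is a hypothetical helper introduced for illustration.

```python
import math

# Assumed reference point for the center of Berlin (not taken from the dataset).
CENTER_LAT, CENTER_LON = 52.52, 13.405
EARTH_RADIUS_KM = 6371.0


def distance_to_center_km(lat, lon):
    """Haversine distance in km from (lat, lon) to the assumed center."""
    phi1 = math.radians(lat)
    phi2 = math.radians(CENTER_LAT)
    d_phi = phi2 - phi1
    d_lambda = math.radians(CENTER_LON - lon)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Coordinates of the first two listings shown in the dataset preview.
sample = {3176: (52.53471, 13.4181), 9991: (52.53269, 13.41805)}
distances = {
    listing_id: distance_to_center_km(lat, lon)
    for listing_id, (lat, lon) in sample.items()
}

for listing_id, km in distances.items():
    print(f"listing {listing_id}: {km:.2f} km from the assumed center")
```

Applied to every listing, the resulting distance could then be compared across room types or set against price.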