# **Data Preparation**

In [1]:
from PIL import Image
img = Image.open("data preparation flowchart.jpeg")
display(img)

FileNotFoundError: [Errno 2] No such file or directory: 'data preparation flowchart.jpeg'

# **Data Import**

In [None]:
import pandas as pd
data= pd.read_csv("listings_airbnb_munich.csv")
data

In [None]:
num_entries = data.shape[0]
num_entries


In [None]:
num_features = data.shape[1]
num_features

In [None]:
data.dtypes

# **Data Cleaning**

## **Visualizing Missing Data**

In [None]:
import missingno as msno
msno.matrix(data)

In [None]:
columns_to_drop = ['id', 'host_id', 'host_name', 'neighbourhood_group','number_of_reviews', 'last_review', 'reviews_per_month',
                   'calculated_host_listings_count', 'number_of_reviews_ltm', 'license']
new_data = data.drop(columns_to_drop, axis=1)
new_data

These columns are dropped because they are not used in the analysis that follows, which keeps the data frame focused on the variables of interest.

In [None]:
import missingno as msno
msno.matrix(new_data)

In [None]:
import pandas as pd
missing_percentage = new_data.isnull().mean()*100
missing_percentage

The remaining columns contain practically no null values, so no missing-value treatment is performed.
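One way to make "practically no null values" checkable is to compare each column's missing percentage against an explicit cut-off. The sketch below does this on a small made-up data frame; the `THRESHOLD` value and the toy data are assumptions for illustration and do not come from the Munich dataset.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only (not the Munich listings)
toy = pd.DataFrame({
    'price': [100, 120, np.nan, 90],
    'room_type': ['Private room', None, None, 'Entire home/apt'],
    'latitude': [48.1, 48.2, 48.15, 48.12],
})

# Same calculation as in the cell above: share of nulls per column, in percent
missing_pct = toy.isnull().mean() * 100

THRESHOLD = 30  # hypothetical cut-off in percent
needs_treatment = missing_pct[missing_pct > THRESHOLD]

print(missing_pct)
print("Columns above threshold:", list(needs_treatment.index))
```

If `needs_treatment` is empty, skipping the missing-value treatment is easy to justify.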

## **Outlier Detection and Treatment**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
plt.figure(figsize=(6, 4))
plt.hist(new_data['latitude'])
plt.title("latitude" + " Histogram")
plt.xlabel("latitude")
plt.ylabel("Frequency")
plt.show()

In [None]:
plt.figure(figsize=(6, 4))
plt.hist(new_data['longitude'])
plt.title("longitude" + " Histogram")
plt.xlabel("longitude")
plt.ylabel("Frequency")
plt.show()

In [None]:
plt.figure(figsize=(6,4))
plt.hist(new_data['price'])
plt.title("price" + " Histogram")
plt.xlabel("price")
plt.ylabel("Frequency")

plt.show()

In [None]:
plt.figure(figsize=(6, 4))
plt.hist(new_data['minimum_nights'])
plt.title("minimum_nights" + " Histogram")
plt.xlabel("minimum_nights")
plt.ylabel("Frequency")
plt.show()

In [None]:
plt.figure(figsize=(6, 4))
plt.hist(new_data['availability_365'])
plt.title("availability_365" + " Histogram")
plt.xlabel("availability_365")
plt.ylabel("Frequency")
plt.show()

The five histograms suggest the presence of outliers in the variables 'price' and 'minimum_nights'. A few extreme values stretch the x-axis so far that almost all observations collapse into a single bin, which hides the shape of the distribution.



In [None]:
new_data.describe()

The variable 'price' has 75% of its values below 167, but its maximum value is 96274. The maximum lies far above the bulk of the data, which strongly indicates the presence of outliers.

The variable 'price' also has a minimum value of 0. Since no one rents out a property on Airbnb for free, these zero prices are not plausible values.

The variable 'minimum_nights' has 75% of its values below 4, but its maximum value is 1095. Again, the maximum lies far above the bulk of the data, which strongly indicates the presence of outliers.
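A common way to formalise "far above the 75th percentile" is Tukey's IQR rule, which flags values outside Q1 - 1.5·IQR and Q3 + 1.5·IQR. The sketch below applies it to a small made-up price series. `iqr_bounds` is a hypothetical helper and the values are illustrative only. Note that this notebook uses fixed thresholds instead of this rule.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) Tukey fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up prices for illustration only (not from the Munich dataset)
toy_prices = pd.Series([0, 60, 80, 95, 110, 130, 150, 167, 900, 96274])

lower, upper = iqr_bounds(toy_prices)
flagged = toy_prices[(toy_prices < lower) | (toy_prices > upper)]

print(f"Fences: [{lower:.2f}, {upper:.2f}]")
print(flagged)
```

In this toy example the rule flags the two large prices but not the price of 0, so implausible zero values still need a separate check.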

**Boxplot for "Price"**

In [None]:
plt.figure(figsize=(8, 6))
plt.boxplot(new_data['price'], vert=False)
plt.title('Boxplot of ' + 'price')
plt.xlabel('price')
plt.ylabel('Values')
plt.xlim(0,10000)
plt.show()

In [None]:
import numpy as np
import pandas as pd

outliers = new_data[new_data['price'] > 1000]
outliers_count = np.sum(new_data['price'] > 1000)
outliers_ratio = round(outliers_count / len(new_data), 3)

print("Outliers:")
print(outliers)
print("\nQuantity of Outliers:", outliers_count)
print("Outliers Ratio:", outliers_ratio)


In [None]:
import numpy as np
import pandas as pd

zero_price_values = new_data[new_data['price'] == 0]

print("\nValues where 'price' is equal to 0:")
print(zero_price_values)

outliers_count = np.sum(new_data['price'] == 0)
print("\nQuantity of Outliers:", outliers_count)

outliers_ratio = round(outliers_count / len(new_data), 3)
print("Outliers Ratio:", outliers_ratio)


The boxplot above shows visually what the summary statistics already indicate for the variable 'price'.

In this analysis, any price greater than 1000 is treated as an outlier. The cells above also report the quantity and ratio of these outliers, as well as the listings where 'price' is equal to 0.

**Boxplot for "minimum_nights"**

In [None]:
plt.figure(figsize=(8, 6))
plt.boxplot(new_data['minimum_nights'], vert=False)
plt.title('Boxplot of ' + 'minimum_nights')
plt.xlabel('minimum_nights')
plt.ylabel('Values')
plt.xlim(0,1000)

plt.show()

In [None]:
import numpy as np
import pandas as pd

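# Filter on 'minimum_nights' (not 'price'): stays longer than 30 nights are treated as outliers
outliers = new_data[new_data['minimum_nights'] > 30]

outliers_count = np.sum(new_data['minimum_nights'] > 30)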
outliers_ratio = round(outliers_count / len(new_data), 3)

print("Outliers:")
print(outliers)
print("\nQuantity of Outliers:", outliers_count)
print("Outliers Ratio:", outliers_ratio)


The boxplot above shows visually what the summary statistics already indicate for the variable 'minimum_nights'.

In this analysis, any value of 'minimum_nights' greater than 30 is treated as an outlier. The cell above reports the quantity and ratio of these outliers.

We set a fixed threshold for each of the two variables to define what counts as an outlier, because the number of points flagged in the boxplots is too large.
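The count-and-ratio calculation is the same for both variables, so it can be written once and reused. This is a minimal sketch on made-up data. `summarize_above` is a hypothetical helper name, and only the thresholds 1000 and 30 are taken from the text above.

```python
import pandas as pd

def summarize_above(df: pd.DataFrame, column: str, threshold: float):
    """Return (count, ratio) of rows where df[column] is strictly above threshold."""
    mask = df[column] > threshold
    count = int(mask.sum())
    ratio = round(count / len(df), 3)
    return count, ratio

# Made-up data for illustration only (not the Munich listings)
toy = pd.DataFrame({
    'price': [50, 120, 1500, 80, 2500, 95, 0, 300],
    'minimum_nights': [1, 2, 45, 3, 1, 90, 2, 30],
})

price_summary = summarize_above(toy, 'price', 1000)
nights_summary = summarize_above(toy, 'minimum_nights', 30)

print("price > 1000:", price_summary)
print("minimum_nights > 30:", nights_summary)
```

Passing the column name as an argument makes it harder to filter on the wrong column by accident when the same check is copied between cells.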

## **Removing Outliers and Creating Data Frame for Analysis**

In [None]:
outliers = new_data[new_data['price'] > 1000]

cleaned_data = new_data.drop(outliers.index)

print("Cleaned Data:")
print(cleaned_data)

In [None]:
plt.figure(figsize=(8, 6))
plt.boxplot(cleaned_data['price'], vert=False)
plt.title('Boxplot of ' + 'price')
plt.xlabel('price')
plt.ylabel('Values')
plt.show()

In [None]:
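# Start from cleaned_data (not new_data) so the price outliers removed above stay removed
outliers = cleaned_data[cleaned_data['minimum_nights'] > 30]

# Remove outliers from the DataFrame
cleaned_data = cleaned_data.drop(outliers.index)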

# Display the cleaned data
print("Cleaned Data:")
print(cleaned_data)

In [None]:
plt.figure(figsize=(8, 6))
plt.boxplot(cleaned_data['minimum_nights'], vert=False)
plt.title('Boxplot of ' + 'minimum_nights')
plt.xlabel('minimum_nights')
plt.ylabel('Values')
plt.show()

Finally, with the clean data frame created and treated, the analysis begins.

In [None]:
cleaned_data.describe()
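The two removal steps can also be expressed as a single boolean mask, which keeps both filters in one place. The sketch below uses made-up data. The thresholds come from the text above, while dropping zero-price rows is an extra assumption based on the remark that free listings are not plausible (the notebook itself does not remove them).

```python
import pandas as pd

PRICE_MAX = 1000   # threshold for 'price' taken from the text above
NIGHTS_MAX = 30    # threshold for 'minimum_nights' taken from the text above

# Made-up data for illustration only (not the Munich listings)
toy = pd.DataFrame({
    'price': [0, 80, 120, 1500, 95, 200],
    'minimum_nights': [2, 1, 45, 3, 30, 2],
})

# Keep rows that are within both thresholds
keep = (toy['price'] <= PRICE_MAX) & (toy['minimum_nights'] <= NIGHTS_MAX)
# Assumption: also drop zero-price rows, which the text describes as implausible
keep &= toy['price'] > 0

toy_cleaned = toy[keep]
print(toy_cleaned)
```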

# **Exploratory Data Analysis**

Objective 1:

In [None]:
import pandas as pd
property_counts = new_data['room_type'].value_counts()
property_ratios = (property_counts / len(new_data) * 100).round(2).astype(str) + '%'

print("Number of Property Types:")
print(property_counts)
print()

print("Ratio of Property Types:")
print(property_ratios)

In [None]:
import matplotlib.pyplot as plt

# Use the counts computed in the previous cell instead of hard-coding them
# (original run: Entire home/apt 2462, Private room 2050, Shared room 80, Hotel room 58)
room_types = property_counts.index
counts = property_counts.values
colors = ['#FFC0CB', '#FFD700', '#ADD8E6', '#90EE90']

plt.figure(figsize=(8, 6))
plt.bar(room_types, counts, color=colors[:len(counts)])

# Write each count above its bar
for i, count in enumerate(counts):
    plt.text(i, count, str(count), ha='center', va='bottom')

plt.title("Number of Property Types")
plt.xlabel("Room Type")
plt.ylabel("Count")

plt.show()


The property types 'Shared room' and 'Hotel room' account for only a small share of the properties rented on Airbnb in Munich (80 and 58 listings, together about 3% of the listings counted above). We therefore continue the neighbourhood analysis using only the property types 'Entire home/apt' and 'Private room'.

Several factors might explain why hotel rooms and shared rooms are less common than the other property types. First, shared rooms offer little personal space or privacy, which many travellers may not want. Munich is a popular travel destination, and many of its visitors look for more private and comfortable lodgings. Munich also has a large supply of hotels with a wide range of options and amenities, which could reduce the demand for hotel rooms listed on Airbnb. Finally, local preferences and travel habits may favour entire homes or private rooms, which offer a more individual experience.


In [None]:
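import pandas as pd
import matplotlib.pyplot as plt

# grouped_data is not defined earlier in the notebook. This reconstruction is an
# assumption: the percentage of each of the two main room types per neighbourhood.
main_types = ['Entire home/apt', 'Private room']
subset = new_data[new_data['room_type'].isin(main_types)]
grouped_data = (pd.crosstab(subset['neighbourhood'], subset['room_type'], normalize='index') * 100).round(2)
grouped_data = grouped_data[main_types].add_suffix(' Ratio')

grouped_data = grouped_data.sort_values(by='Entire home/apt Ratio', ascending=False)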

fig, ax = plt.subplots(figsize=(10, 6))
grouped_data[['Entire home/apt Ratio', 'Private room Ratio']].plot(kind='bar', stacked=True, ax=ax)

plt.title('Ratio of Room Types by Neighbourhood')
plt.xlabel('Neighbourhood')
plt.ylabel('Ratio (%)')

plt.legend()

plt.xticks(rotation=90)

plt.show()


It can be seen that the distribution of property types in Munich is well balanced in most neighbourhoods. This indicates that both types of accommodations are available and in demand, providing a diverse range of options for travelers.
A well-balanced distribution of property types can be beneficial for both hosts and guests. Hosts have the flexibility to offer different types of accommodations based on their property and preferences, while guests have the opportunity to choose the type of accommodation that suits their needs and preferences. 


Below are the neighbourhoods where the proportion of the property type 'Entire home/apt' is highest. An Airbnb user who wants to stay in one of these neighbourhoods has a greater chance of finding this type of property available.

In [None]:
import matplotlib.pyplot as plt

# grouped_data is sorted by 'Entire home/apt Ratio' (descending), so head(5)
# returns the neighbourhoods with the highest share of entire homes/apartments
top_5_neighborhoods = grouped_data.head(5)
entire = top_5_neighborhoods['Entire home/apt Ratio']
private = top_5_neighborhoods['Private room Ratio']

plt.figure(figsize=(8, 6))
# Start the second bar where the first one ends (left=entire) so the bars do not overlap
plt.barh(top_5_neighborhoods.index, entire, color='blue', label='Entire home/apt')
plt.barh(top_5_neighborhoods.index, private, left=entire, color='pink', label='Private room')

# Label each segment at its centre
for i, (ratio_entire, ratio_private) in enumerate(zip(entire, private)):
    plt.text(ratio_entire / 2, i, f'{ratio_entire}%', ha='center', va='center', color='white')
    plt.text(ratio_entire + ratio_private / 2, i, f'{ratio_private}%', ha='center', va='center', color='black')

plt.title("Top 5 Neighbourhoods by 'Entire home/apt' Ratio")
plt.xlabel("Ratio (%)")
plt.ylabel("Neighbourhood")
plt.legend()

plt.show()




These neighbourhoods in Munich may have a higher concentration of residential properties or property owners who are more inclined to rent out their entire homes/apartments rather than individual rooms.


Now, we can see the neighbourhoods where the proportion of property type 'Private room' is higher, so if the Airbnb user wants to stay in one of the neighbourhoods listed below, there is a greater chance that this user will find offers for this type of property available.

In [None]:
import matplotlib.pyplot as plt

# grouped_data is sorted by 'Entire home/apt Ratio' (descending), so tail(5)
# returns the neighbourhoods with the highest share of private rooms
top_5_private = grouped_data.tail(5)
entire = top_5_private['Entire home/apt Ratio']
private = top_5_private['Private room Ratio']

plt.figure(figsize=(8, 6))
# Start the second bar where the first one ends (left=entire) so the bars do not overlap
plt.barh(top_5_private.index, entire, color='blue', label='Entire home/apt')
plt.barh(top_5_private.index, private, left=entire, color='pink', label='Private room')

# Label each segment at its centre
for i, (ratio_entire, ratio_private) in enumerate(zip(entire, private)):
    plt.text(ratio_entire / 2, i, f'{ratio_entire}%', ha='center', va='center', color='white')
    plt.text(ratio_entire + ratio_private / 2, i, f'{ratio_private}%', ha='center', va='center', color='black')

plt.title("Top 5 Neighbourhoods by 'Private room' Ratio")
plt.xlabel("Ratio (%)")
plt.ylabel("Neighbourhood")
plt.legend()

plt.show()


The unique characteristics of the neighborhoods, such as their location, amenities, atmosphere, or target audience, might make them more attractive for travelers or tenants seeking private room accommodations. For example, neighborhoods near universities or popular tourist destinations often have a higher demand for private rooms.


Objective 2:

In [None]:
selected_columns = ['neighbourhood', 'price']
selected_data = data[selected_columns]
average_price_by_neighbourhood = round(selected_data.groupby('neighbourhood')['price'].mean().sort_values(ascending=False),2)
average_price_by_neighbourhood

In [None]:
import matplotlib.pyplot as plt

neighbourhood_prices = new_data.groupby('neighbourhood')['price'].mean().sort_values(ascending=False)

plt.figure(figsize=(10, 6))
neighbourhood_prices.plot(kind='bar', color='red')
plt.title('Average Price by Neighbourhood')
plt.xlabel('Neighbourhood')
plt.ylabel('Average Price')
plt.xticks(rotation=90)
plt.show()



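The chart above shows the average property price per neighbourhood in Munich. The price difference is large: the average price is 412.64 in the most expensive neighbourhood and 95.43 in the cheapest, more than a fourfold difference. These averages are computed on data that still includes the price outliers, so a few extreme listings can inflate a neighbourhood's mean.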
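Because a mean is sensitive to extreme values, it can help to report the median and the listing count next to it. The sketch below uses made-up data with hypothetical neighbourhoods 'A' and 'B' to show how a single extreme listing shifts the mean while the median barely moves. It is an illustration, not a result from the Munich dataset.

```python
import pandas as pd

# Made-up data for illustration only: 'A' contains one extreme listing
toy = pd.DataFrame({
    'neighbourhood': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'price': [90, 100, 110, 5000, 80, 95, 105, 120],
})

# Mean, median and number of listings per neighbourhood
summary = toy.groupby('neighbourhood')['price'].agg(['mean', 'median', 'count'])
print(summary)
```

Here the mean of 'A' is more than ten times the mean of 'B', while the two medians are almost identical, so a ranking by mean and a ranking by median can tell different stories.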

In [None]:
highest_prices = round(new_data.groupby('neighbourhood')['price'].mean().nlargest(2),2)

lowest_prices = round(new_data.groupby('neighbourhood')['price'].mean().nsmallest(2),2)

print("Neighbourhoods with the Highest Prices:")
print(highest_prices)

print("\nNeighbourhoods with the Lowest Prices:")
print(lowest_prices)


Objective 3:

In [None]:
!pip install folium

In [None]:
!pip install geopandas

In [None]:
import geopandas as gpd

In [None]:
selected_columns = ['latitude', 'longitude']
selected_data = data[selected_columns]

plt.figure(figsize=(10, 8))
plt.scatter(selected_data['longitude'], selected_data['latitude'], s=5, alpha=1.0)
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Distribution of Airbnb Listings')
plt.show()

Sparse areas or gaps on the scatter plot indicate regions in Munich with fewer Airbnb listings. These areas might be underrepresented or less popular for rentals, which could provide opportunities for market analysis or potential expansion of Airbnb services.

In [None]:
import folium

selected_columns = ['latitude', 'longitude']
selected_data = data[selected_columns]

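# Centre the map on the mean coordinates of the listings ('location' is the folium.Map argument for this)
m = folium.Map(location=[selected_data['latitude'].mean(), selected_data['longitude'].mean()], zoom_start=12)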

for index, row in data.iterrows():
    folium.CircleMarker(location=[row['latitude'], row['longitude']],
                        radius=5,
                        color='blue',
                        fill=True,
                        fill_color='blue',
                        fill_opacity=0.6,
                        popup=row['neighbourhood']).add_to(m)

m

The map above shows the locations of Airbnb activity in more detail. It gives a visual representation of the geographic distribution of Airbnb listings in Munich, based on their latitude and longitude coordinates. Each marker on the map represents one listing, and the density of markers indicates the concentration of listings in different areas. Clusters of closely grouped markers point to popular or densely listed areas for rentals, while sparse areas or gaps indicate regions with lower Airbnb activity. These areas might be less popular for rentals and could be candidates for market analysis or expansion of Airbnb services.

By examining the map, we can also see how close listings are to points of interest such as tourist attractions, parks, transportation hubs, or commercial centres. This information can help travellers and property owners judge the accessibility and desirability of different locations. Because the map is interactive, users can zoom in and out and pan across areas to explore the distribution and its patterns in more detail.
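Marker density can also be quantified instead of judged by eye, for example by counting listings per grid cell. The sketch below does this with `numpy.histogram2d` on made-up coordinates. The grid size, the coordinate ranges and the points are assumptions chosen for illustration.

```python
import numpy as np

# Made-up coordinates roughly in Munich's range, for illustration only
lat = np.array([48.13, 48.14, 48.16, 48.12, 48.25, 48.05])
lon = np.array([11.57, 11.58, 11.55, 11.56, 11.65, 11.45])

# Count listings in a 3x3 grid; rows follow latitude, columns follow longitude
counts, lat_edges, lon_edges = np.histogram2d(
    lat, lon, bins=3, range=[[48.0, 48.3], [11.4, 11.7]]
)

# Locate the densest cell
row, col = np.unravel_index(counts.argmax(), counts.shape)
print(counts)
print(f"Densest cell: lat {lat_edges[row]:.1f}-{lat_edges[row + 1]:.1f}, "
      f"lon {lon_edges[col]:.1f}-{lon_edges[col + 1]:.1f}")
```

Cells with a count of zero correspond to the gaps described above, and the cell with the highest count corresponds to a hotspot.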


Objective 4:

In [None]:
selected_columns = ['minimum_nights', 'price']
selected_data = data[selected_columns]
round(selected_data.groupby('minimum_nights')['price'].mean().sort_values(ascending=False),2)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

selected_columns = ['minimum_nights', 'price']
selected_data = data[selected_columns]

average_price_by_minimum_nights = selected_data.groupby('minimum_nights')['price'].mean().reset_index()

plt.figure(figsize=(10, 6))
sns.lineplot(data=average_price_by_minimum_nights, x='minimum_nights', y='price', marker='o')
plt.title('Average Price by Minimum Nights Requirement')
plt.xlabel('Minimum Nights Requirement')
plt.ylabel('Average Price')
plt.grid(True)
plt.show()


To find a minimum nights requirement that is associated with higher rental income, we can look for points on the graph that correspond to higher average prices. The graph shows prices only and says nothing about occupancy, so a high average price at a given requirement does not by itself mean higher income.

There is no clear trend between the minimum nights requirement and rental prices. This is likely because rental prices are typically influenced by multiple factors, such as property size, location, amenities, and seasonal demand. The minimum nights requirement alone may not be the primary determinant of rental prices. We should analyse the combined effect of multiple variables to gain a more complete understanding of the factors influencing rental prices.
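Averages at individual minimum-night values can be noisy when only a few listings share a value. One option is to group the requirement into buckets and compare the median price and listing count per bucket. The sketch below uses made-up data, and the bucket edges and labels are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up data for illustration only (not the Munich listings)
toy = pd.DataFrame({
    'minimum_nights': [1, 1, 2, 3, 5, 7, 14, 30, 30, 2],
    'price': [120, 90, 110, 100, 95, 85, 70, 60, 65, 130],
})

# Hypothetical buckets; intervals are right-inclusive: (0, 1], (1, 3], (3, 7], (7, 30]
bins = [0, 1, 3, 7, 30]
labels = ['1', '2-3', '4-7', '8-30']
toy['nights_bucket'] = pd.cut(toy['minimum_nights'], bins=bins, labels=labels)

# Median price and number of listings per bucket
bucket_summary = toy.groupby('nights_bucket', observed=True)['price'].agg(['median', 'count'])
print(bucket_summary)
```

The `count` column shows how many listings support each median, which helps separate a real pattern from a value driven by one or two listings.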