### Importing the data and libraries

In [None]:
# Importing the libraries
import folium
from folium.plugins import HeatMap
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import statistics
from wordcloud import WordCloud, STOPWORDS

In [None]:
# Importing the data
df_calendario = pd.read_csv('../data/calendar.csv')
df_listin = pd.read_csv('../data/listings.csv')
df_reviews = pd.read_csv('../data/reviews.csv')

### Data Understanding

In [None]:
# Checking the shape of dataframes
print(df_calendario.shape)
print(df_listin.shape)
print(df_reviews.shape)

In [None]:
# Checking the values
df_calendario.head()

In [None]:
# Checking the values
df_listin.head()

In [None]:
# Checking the values
df_reviews.head()

In [None]:
# Checking the values
df_calendario.date.describe()

In [None]:
# Checking the columns
df_listin.columns

- <b>df_calendario:</b> This dataset covers future dates, up to 2023-09-20, and shows which days each place is available for rent and which days are already booked. It also includes features such as the minimum and maximum number of days to rent and the rental price.

- <b>df_listin:</b> This dataset describes the places themselves: number of rooms, average rating, location, neighborhood, and some information that also appears in df_calendario.

- <b>df_reviews:</b> This dataset contains reviews written by people who have already rented the places.

From this, some questions arise:

- Which areas of Rio de Janeiro have the highest ratings?

- What are the most common words in reviews?

- What does the vacancy situation look like for 2023? Are there already many reservations, and on which days?

### EDA

#### 1. Which areas of Rio de Janeiro have the highest ratings?

In [None]:
# Getting center coordinates
rj_coordinates = (df_listin.latitude.mean(), df_listin.longitude.mean())

In [None]:
# Creating the map
map_rj = folium.Map(location=rj_coordinates, zoom_start=10)

In [None]:
# Shading the areas
points = (df_listin[['latitude', 'longitude', 'review_scores_rating']]
          .groupby(['latitude', 'longitude']).mean()
          .dropna().reset_index().values.tolist())
heatmap = HeatMap(data=points, radius=11, max_zoom=13)
heatmap.add_to(map_rj)

In [None]:
# Checking the final result
map_rj

In [None]:
# Saving the map
map_rj.save('../rj_heatmap.html')

- The waterfront area usually has better scores than places farther from the beach.
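A heat map can also reflect how densely listings cluster, so a per-area table is a useful cross-check on the visual impression. The sketch below averages `review_scores_rating` per area. The helper name `mean_rating_by_area`, the column name `neighbourhood_cleansed`, and the sample values are assumptions made for illustration; check `df_listin.columns` for the actual neighborhood column.

```python
import pandas as pd

def mean_rating_by_area(listings: pd.DataFrame,
                        area_col: str = "neighbourhood_cleansed",
                        min_listings: int = 1) -> pd.DataFrame:
    """Average review score per area, highest first (hypothetical helper)."""
    # Listings without a rating cannot contribute to the average
    rated = listings.dropna(subset=["review_scores_rating"])
    summary = (
        rated.groupby(area_col)["review_scores_rating"]
        .agg(mean_rating="mean", n_listings="count")
        .reset_index()
    )
    # Optionally ignore areas with too few rated listings
    summary = summary[summary["n_listings"] >= min_listings]
    return summary.sort_values("mean_rating", ascending=False, ignore_index=True)

# Tiny illustrative frame (made-up values, not taken from listings.csv)
sample = pd.DataFrame({
    "neighbourhood_cleansed": ["Copacabana", "Copacabana", "Centro", "Centro", "Tijuca"],
    "review_scores_rating": [4.9, 4.7, 4.2, None, 4.5],
})
ranking = mean_rating_by_area(sample)
print(ranking)
```

On the real data, something like `mean_rating_by_area(df_listin, min_listings=20)` would keep areas with only a handful of rated listings from dominating the ranking; the threshold of 20 is arbitrary.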

#### 2. What are the most common words in reviews?

In [None]:
# Checking for missing values
df_reviews.isna().sum()

In [None]:
# Dropping the missing values
df_review_droppedNan = df_reviews.dropna()

In [None]:
# Putting together the words of the reviews
summary = " ".join(s for s in df_review_droppedNan.comments)

In [None]:
# Creating a set of words that will be excluded
stopwords = set(STOPWORDS)
stopwords.update(["da", "meu", "em", "você", "de", "ao", "os", "é", 'br'])

In [None]:
# Creating the word cloud
img_wordcloud = WordCloud(stopwords=stopwords,
                          background_color='black',
                          width=1600,
                          height=800).generate(summary)

In [None]:
# Plotting the word cloud
fig, ax = plt.subplots(figsize=(10,6))
ax.imshow(img_wordcloud, interpolation='bilinear')
ax.set_axis_off()

plt.show()

From this word cloud we can see the following things from the reviews:

- Reviews are commonly written in Portuguese, English, or Spanish, which suggests that many guests come from countries where these languages are spoken.
- The cleanliness and organization of the properties are usually praised.
- The location is also often praised, perhaps related to the beach and the view, which also appear in the word cloud.
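A word cloud shows relative prominence but not exact counts. As a rough complement, the sketch below tallies word frequencies with `collections.Counter`. The helper `top_words`, the sample comments, and the sample stopword set are illustrative assumptions, not part of the original analysis.

```python
import re
from collections import Counter

def top_words(comments, stopwords, n=10):
    """Return the n most frequent words across comments, ignoring stopwords."""
    counts = Counter()
    for text in comments:
        # Lowercase and keep runs of letters, including accented ones
        words = re.findall(r"[^\W\d_]+", str(text).lower())
        counts.update(w for w in words if w not in stopwords)
    return counts.most_common(n)

# Made-up comments for illustration only
sample_comments = [
    "Great location, very clean apartment",
    "Ótima localização, apartamento limpo",
    "Clean place and great view of the beach",
]
sample_stopwords = {"and", "of", "the", "very"}
top = top_words(sample_comments, sample_stopwords, n=3)
print(top)
```

With the notebook's own objects, a call such as `top_words(df_review_droppedNan.comments, stopwords, n=20)` should give a ranked list to compare against the cloud. The word cloud's own tokenization may differ slightly from this regular expression.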

#### 3. What does the vacancy situation look like for 2023? Are there already many reservations, and in which months?

In [None]:
# Checking the info about the dataframe
df_calendario.info()

In [None]:
# Changing the data type
df_calendario['date'] = pd.to_datetime(df_calendario.date)

In [None]:
# Selecting data for 2023
df_2023 = df_calendario[df_calendario.date.dt.year == 2023]

In [None]:
# Checking the min
df_2023.date.min()

In [None]:
# Checking the max
df_2023.date.max()

In [None]:
# Grouping data
# Counting days per availability flag for each listing
df_aval_2023 = (df_2023.groupby('listing_id').available
                .value_counts().rename('days').reset_index())

In [None]:
# Counting listings with at least one unavailable day
df_aval_2023[df_aval_2023.available == 'f'].shape[0]

- January through the end of September spans 273 days.
- 19,251 distinct listings already have at least one day marked as unavailable in 2023.
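Counting listings with at least one unavailable day says little about how heavily each one is booked. A minimal sketch of a per-listing share of unavailable days follows, assuming the `listing_id` and `available` columns used in this notebook with `'f'` meaning unavailable. The helper name `unavailable_share` and the sample rows are hypothetical.

```python
import pandas as pd

def unavailable_share(calendar: pd.DataFrame) -> pd.DataFrame:
    """Per listing: days covered, days unavailable, and the unavailable share."""
    # True where the day is marked 'f' (not available)
    flagged = calendar.assign(unavailable=calendar["available"].eq("f"))
    out = (
        flagged.groupby("listing_id")["unavailable"]
        .agg(days_total="count", days_unavailable="sum")
        .reset_index()
    )
    out["share_unavailable"] = out["days_unavailable"] / out["days_total"]
    return out

# Made-up rows for illustration only
sample = pd.DataFrame({
    "listing_id": [1, 1, 1, 1, 2, 2],
    "available": ["f", "f", "t", "t", "t", "t"],
})
shares = unavailable_share(sample)
print(shares)
# Number of listings with at least one unavailable day
print((shares["days_unavailable"] > 0).sum())
```

Applied to `df_2023`, the `share_unavailable` column could then be summarized with `describe()` or a histogram to see whether most listings are lightly or heavily booked.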

In [None]:
# Copying the dataset (an explicit copy, so the new column does not touch df_2023)
df_2023_day = df_2023.copy()

In [None]:
# Adding a new variable
df_2023_day['day'] = df_2023_day.date.dt.day

In [None]:
# Grouping the data
# Counting available and unavailable listings per date
df_t = (df_2023.groupby([df_2023.date]).available
        .value_counts().rename('qtd').reset_index())

In [None]:
# Plotting a graph
sns.lineplot(data = df_t, x = 'date', y = 'qtd', hue = 'available')
plt.xticks(rotation = 90)
plt.title('Availability of locations by month')

- At the beginning of January, availability drops, most likely because many people travel to Rio de Janeiro to celebrate the New Year.
- Bookings rise sharply again between February and March, most likely because of Carnival.
- There is another increase between the end of March and the beginning of April, which coincides with the run-up to Holy Week in Brazil and with the Coldplay shows on the 25th, 26th and 28th of March.
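Since the question asks about months, the daily counts can also be condensed into one figure per month. The sketch below computes the share of listing-days marked unavailable in each calendar month, assuming the `date` and `available` columns used above. The helper name `monthly_unavailable_share` and the sample rows are illustrative assumptions.

```python
import pandas as pd

def monthly_unavailable_share(calendar: pd.DataFrame) -> pd.Series:
    """Share of listing-days marked unavailable ('f') in each calendar month."""
    dates = pd.to_datetime(calendar["date"])
    is_unavailable = calendar["available"].eq("f")
    # The mean of a boolean series is the fraction of True values
    monthly = is_unavailable.groupby(dates.dt.to_period("M")).mean()
    return monthly.rename("share_unavailable")

# Made-up rows for illustration only
sample = pd.DataFrame({
    "date": ["2023-01-01", "2023-01-02", "2023-02-18",
             "2023-02-19", "2023-02-20", "2023-03-05"],
    "available": ["f", "t", "f", "f", "f", "t"],
})
monthly = monthly_unavailable_share(sample)
print(monthly)
```

Running `monthly_unavailable_share(df_2023)` and plotting the result as a bar chart would make it easier to compare months directly than reading peaks off the daily line plot.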