This assignment deals with the most recent Airbnb listings in Copenhagen. The data is collected from [Inside Airbnb](http://insideairbnb.com/copenhagen). Feel free to explore the website further in order to better understand the data. The data (*listings.csv*) has been collected as raw data and needs to be preprocessed.

**Hand-in:** Hand in as a group in Itslearning in a **single**, well-organized and easy-to-read Jupyter Notebook. If your group consists of students from different classes, upload in **both** classes.

1. First we need to remove all the redundant columns. Please keep the following 22 columns and remove all others:

    id  
    name  
    host_id  
    host_name  
    neighbourhood_cleansed  
    latitude  
    longitude  
    room_type  
    price  
    minimum_nights  
    number_of_reviews  
    last_review  
    review_scores_rating  
    review_scores_accuracy  
    review_scores_cleanliness  
    review_scores_checkin  
    review_scores_communication  
    review_scores_location  
    review_scores_value  
    reviews_per_month  
    calculated_host_listings_count  
    availability_365



2. Next we have to handle missing values. Remove all rows where `number_of_reviews = 0`. If there are still missing values, remove the rows that contain them so you have a data set with no missing values.

3. Fix the `neighbourhood_cleansed` values (some are missing 'æ ø å'), and if necessary change the price to DKK.

4. Create a fitting word cloud based on the `name` column. Feel free to remove non-descriptive stop words (e.g. since this is about Copenhagen, perhaps the word 'Copenhagen' is redundant).
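
For step 3, one possible approach is an explicit lookup table from the garbled spellings to the correct ones. The sketch below runs on toy data. The garbled keys in `NEIGHBOURHOOD_FIXES` are assumptions about how the file drops 'æ ø å', and `fix_neighbourhoods` is a hypothetical helper name, so check the keys against `fas['neighbourhood_cleansed'].unique()` before using it.

```python
import pandas as pd

# Assumed garbled spellings -> corrected names. Verify the keys against
# the unique values in your own listings.csv before relying on this.
NEIGHBOURHOOD_FIXES = {
    "Amager st": "Amager Øst",
    "Brnshj-Husum": "Brønshøj-Husum",
    "Nrrebro": "Nørrebro",
    "sterbro": "Østerbro",
    "Vanlse": "Vanløse",
}


def fix_neighbourhoods(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where known garbled neighbourhood names are replaced."""
    out = df.copy()
    # Series.replace with a dict only touches exact matches of the keys
    out["neighbourhood_cleansed"] = out["neighbourhood_cleansed"].replace(
        NEIGHBOURHOOD_FIXES
    )
    return out


# Toy data standing in for the real listings
demo = pd.DataFrame({"neighbourhood_cleansed": ["Nrrebro", "Indre By", "sterbro"]})
fixed = fix_neighbourhoods(demo)
print(fixed["neighbourhood_cleansed"].tolist())
```

An explicit mapping leaves already-correct names such as 'Indre By' untouched and makes any unexpected spelling easy to spot, since it simply stays unchanged.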

In [None]:
#install packages 

!pip install wordcloud
!pip install pandas
!pip install folium
!pip install matplotlib

In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import pandas as pd

airbnbSheet = pd.read_csv("../data/listings.csv")

desiredColumns = [
    'id',
    'name',
    'host_id',
    'host_name',
    'neighbourhood_cleansed',
    'latitude',
    'longitude',
    'room_type',
    'price',
    'minimum_nights',
    'number_of_reviews',
    'last_review',
    'review_scores_rating',
    'review_scores_accuracy',
    'review_scores_cleanliness',
    'review_scores_checkin',
    'review_scores_communication',
    'review_scores_location',
    'review_scores_value',
    'reviews_per_month',
    'calculated_host_listings_count',
    'availability_365',
]

fas = airbnbSheet[desiredColumns].copy()  # keep only these columns
# drop listings without reviews first, as the task asks
fas = fas[fas['number_of_reviews'] != 0]
fas = fas.dropna()  # then drop any rows that still contain missing values


# APARTMENT NAMES:
# Combine all the names into a single string

text = " ".join(name for name in fas['name'].astype(str))

# List of words to remove
stopwords = set(["Copenhagen", "CPH", "the", "to", "and", "of", "in", "og", "i", "at", "a"])

# Generate the word cloud
wordcloud = WordCloud(stopwords=stopwords, background_color="grey").generate(text)

# Display the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()  # show the word cloud

5. Since data science is so much fun, provide a word cloud of the names of the hosts, removing any names of non-persons. Does this more or less correspond with the distribution of names according to [Danmarks Statistik](https://www.dst.dk/da/Statistik/emner/borgere/navne/navne-i-hele-befolkningen)?

In [None]:
# HOST NAMES:

# Combine all the host names into a single string
text = " ".join(name for name in fas['host_name'].astype(str))

# List of words to remove (non-person names, or common words that might appear as host names)
# This list should be expanded based on the data and common non-person names you observe
# Replace with actual non-person names or words
stopwords = set(["Host1", "Host2", "ApartmentinCopenhagen"])

# Generate the word cloud
wordcloud = WordCloud(stopwords=stopwords,
                      background_color="grey").generate(text)

# Display the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
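
A word cloud is hard to compare with a ranked name list, so a plain frequency table may help when checking against Danmarks Statistik. The sketch below uses toy data. `top_host_names` and `NON_PERSON_HOSTS` are hypothetical names, and counting each host once (rather than once per listing) is an assumption about what the comparison should measure.

```python
import pandas as pd

# Hypothetical non-person host names; extend after inspecting the data
NON_PERSON_HOSTS = {"ApartmentinCopenhagen"}


def top_host_names(df: pd.DataFrame, n: int = 10) -> pd.Series:
    """Count distinct hosts per name, most common first."""
    hosts = df.drop_duplicates(subset="host_id")  # one row per host
    names = hosts["host_name"].astype(str).str.strip()
    names = names[~names.isin(NON_PERSON_HOSTS)]  # drop non-person names
    return names.value_counts().head(n)


# Toy data: host 1 has two listings but should be counted once
demo = pd.DataFrame({
    "host_id": [1, 1, 2, 3, 4],
    "host_name": ["Anne", "Anne", "Peter", "Anne", "ApartmentinCopenhagen"],
})
counts = top_host_names(demo)
print(counts)
```

Without the `drop_duplicates` step, hosts with many listings would dominate the counts, which is what the word cloud above shows because it is built from one name per listing.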

6. Create a new column using bins of price. Use 11 bins, evenly distributed but with the last bin $> 10,000$.

In [None]:
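# the raw price strings carry a '$' prefix and thousands separators;
# whether the amounts are USD or DKK should be checked against the
# Inside Airbnb data dictionary before converting anything
fas = fas.rename(columns={'price': 'price_in_dollars'})

# strip the '$' sign and the thousands separators, then convert to float
fas['price_in_dollars'] = fas['price_in_dollars'].str.replace(
    '$', '', regex=False).str.replace(',', '', regex=False).astype(float)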

bins = [0, 1_000, 2_000, 3_000, 4_000, 5_000, 6_000,
        7_000, 8_000, 9_000, 10_000, float('inf')]
labels = ['$0-1,000', '$1,000-2,000', '$2,000-3,000', '$3,000-4,000', '$4,000-5,000',
          '$5,000-6,000', '$6,000-7,000', '$7,000-8,000', '$8,000-9,000', '$9,000-10,000', '$10,000+']

fas['price_bin'] = pd.cut(fas['price_in_dollars'],
                          bins=bins, labels=labels, right=False)

print(fas[['price_in_dollars', 'price_bin']])
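
The bin edges and labels can also be generated instead of typed out, which avoids mismatches between the two lists. This is a minimal sketch on toy prices. The variable names `edges`, `prices` and `binned` are illustrative, and the labels omit the currency symbol on purpose.

```python
import pandas as pd

# Ten bins of width 1,000 plus one open-ended bin: 11 bins in total
edges = list(range(0, 10_001, 1_000)) + [float("inf")]
labels = [f"{lo:,}-{hi:,}" for lo, hi in zip(edges[:-2], edges[1:-1])]
labels.append("10,000+")

# Toy prices, including values that sit exactly on a bin edge
prices = pd.Series([450.0, 1_000.0, 9_999.0, 10_000.0, 25_000.0])

# right=False makes each bin include its lower edge: [lo, hi)
binned = pd.cut(prices, bins=edges, labels=labels, right=False)
print(binned.tolist())
```

Note that with `right=False` a price of exactly 10,000 lands in the last bin. If the last bin should be strictly greater than 10,000, the default `right=True` together with `include_lowest=True` is one way to get that.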

7. Using non-scaled versions of latitude and longitude, plot the listings data on a map. Use the newly created price bins as a color parameter. Also, create a plot (i.e. another plot) where you group the listings with regard to the neighbourhood.

In [None]:
import folium

color_map = {
    '$0-1,000': 'green',
    '$1,000-2,000': 'blue',
    '$2,000-3,000': 'lightblue',
    '$3,000-4,000': 'yellow',
    '$4,000-5,000': 'orange',
    '$5,000-6,000': 'darkorange',
    '$6,000-7,000': 'orangered',
    '$7,000-8,000': 'red',
    '$8,000-9,000': 'darkred',
    '$9,000-10,000': 'purple',
    '$10,000+': 'black'
}


# Create a base map centered around Copenhagen
m = folium.Map(location=[55.6761, 12.5683], zoom_start=12)

# Add listings to the map using the price bins for color
for idx, row in fas.iterrows():
    folium.CircleMarker(
        location=[row['latitude'], row['longitude']],
        radius=5,
        color=color_map[row['price_bin']],
        fill=True,
        fill_color=color_map[row['price_bin']]
    ).add_to(m)

# Display the map (in Jupyter Notebook)
m

In [None]:
import matplotlib.colors as colors
import matplotlib.cm as cm
import numpy as np

# Get a list of unique neighborhoods
neighborhoods = fas['neighbourhood_cleansed'].unique()

# Create a colormap
colors_array = cm.rainbow(np.linspace(0, 1, len(neighborhoods)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
neighborhood_color_map = dict(zip(neighborhoods, rainbow))


m2 = folium.Map(location=[55.6761, 12.5683], zoom_start=12)

# Add listings to the map using neighborhoods for color
for idx, row in fas.iterrows():
    neighborhood = row['neighbourhood_cleansed']
    folium.CircleMarker(
        location=[row['latitude'], row['longitude']],
        radius=5,
        color=neighborhood_color_map[neighborhood],
        fill=True,
        fill_color=neighborhood_color_map[neighborhood],
        fill_opacity=0.7
    ).add_to(m2)

# Display the map (if you're in Jupyter Notebook)
m2

8. Create boxplots where you have the neighbourhood on the x-axis and price on the y-axis. What does this tell you about the listings in Copenhagen? Keep the x-axis as is and put other variables of your choice on the y-axis to see how they are distributed across the neighbourhoods.
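
One way to approach the boxplots is a small helper that takes the y-axis variable as a parameter, so the same plot can be repeated for several columns. The sketch below uses toy data and pandas' built-in `DataFrame.boxplot`. `boxplot_by_neighbourhood` is a hypothetical helper name, and the `price` column here stands in for whatever the price column is called in your notebook.

```python
import matplotlib.pyplot as plt
import pandas as pd


def boxplot_by_neighbourhood(df: pd.DataFrame, column: str):
    """Draw one box per neighbourhood for the given numeric column."""
    fig, ax = plt.subplots(figsize=(12, 6))
    df.boxplot(column=column, by="neighbourhood_cleansed", ax=ax, rot=90)
    fig.suptitle("")  # drop the automatic "Boxplot grouped by ..." title
    ax.set_title(f"{column} by neighbourhood")
    ax.set_xlabel("neighbourhood")
    ax.set_ylabel(column)
    return ax


# Toy data standing in for the cleaned listings
demo = pd.DataFrame({
    "neighbourhood_cleansed": ["Indre By", "Indre By", "Valby", "Valby", "Valby"],
    "price": [1500.0, 2400.0, 700.0, 900.0, 1100.0],
})
ax = boxplot_by_neighbourhood(demo, "price")
```

Calling the helper again with, for example, `review_scores_rating` or `availability_365` gives the additional plots. If a few extreme prices flatten the boxes, a logarithmic y-axis via `ax.set_yscale("log")` may make the distributions easier to read.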