## 1. Getting Started: Airbnb Copenhagen

This assignment deals with the most recent Airbnb listings in Copenhagen. The data comes from [Inside Airbnb](http://insideairbnb.com/copenhagen); feel free to explore the website to better understand it. The file (*listings.csv*) contains raw data and must be preprocessed before use.

**Hand-in:** Hand in as a group on Itslearning, as a **single**, well-organized and easy-to-read Jupyter Notebook. If your group consists of students from different classes, upload it in **both** classes.

1. First we need to remove all the redundant columns. Please keep the following 22 columns and remove all others:

    id  
    name  
    host_id  
    host_name  
    neighbourhood_cleansed  
    latitude  
    longitude  
    room_type  
    price  
    minimum_nights  
    number_of_reviews  
    last_review  
    review_scores_rating  
    review_scores_accuracy  
    review_scores_cleanliness  
    review_scores_checkin  
    review_scores_communication  
    review_scores_location  
    review_scores_value  
    reviews_per_month  
    calculated_host_listings_count  
    availability_365



2. Next we have to handle missing values. Remove all rows where `number_of_reviews = 0`. If there are still missing values, remove the rows that contain them so you have a data set with no missing values.

3. Fix the `neighbourhood_cleansed` values: some names have lost the Danish letters 'æ', 'ø' and 'å' (e.g. `Nrrebro` instead of `Nørrebro`). If necessary, also convert the price to DKK.

4. Create a fitting word cloud based on the `name` column. Feel free to remove non-descriptive stop words (e.g. since this is about Copenhagen, perhaps the word 'Copenhagen' is redundant).

5. Since data science is so much fun, provide a word cloud of the names of the hosts, removing any names of non-persons. Does this more or less correspond with the distribution of names according to [Danmarks Statistik](https://www.dst.dk/da/Statistik/emner/borgere/navne/navne-i-hele-befolkningen)?

6. Create a new column using bins of price. Use 11 bins, evenly distributed but with the last bin $> 10,000$.

7. Using non-scaled versions of latitude and longitude, plot the listings on a map. Use the newly created price bins as the color parameter. Also create a second plot in which the listings are grouped by neighbourhood.

8. Create boxplots with the neighbourhood on the x-axis and the price on the y-axis. What does this tell you about the listings in Copenhagen? Then keep the x-axis as is and put other variables of your choice on the y-axis, creating additional plots that show how those variables are distributed across the neighbourhoods.

9. Create a bar chart of the ten hosts with the most listings. Place the host id on the x-axis and the number of listings on the y-axis.

10. Do a descriptive analysis of the neighbourhoods. Include information about room type in the analysis, as well as one other self-chosen feature. The descriptive analysis should contain the mean/average, mode, median, standard deviation/variance, minimum, maximum and quartiles.

11. Supply a list of the top 10 highest rated listings and visualize them on a map.

12. Now, use any preprocessing and feature engineering steps that you find relevant before proceeding (optional).

13. Create another new column, where the price is divided into two categories: "expensive" listings defined by all listings with a price higher than the median price, and "affordable" listings defined by all listings with a price equal to or below the median price. You can encode the affordable listings as "0" and the expensive ones as "1". All listings should now have a classification indicating either expensive listings (1) or affordable listings (0).

14. Based on self-chosen features, develop a Naïve Bayes model and a k-Nearest Neighbors model to determine whether a rental property should be classified as 0 or 1. Remember to divide your data into training data and test data. Comment on your findings.

15. Try to come up with a final conclusion to the Airbnb-Copenhagen assignment.
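
Steps 6 and 13 above both build a new column from `price`, so a small sketch may help make the wording concrete. It runs on made-up toy prices, not on the real data. It assumes that `price` arrives as a string such as `"$2,600.00"` (as in the raw file), and it reads "evenly distributed" as ten equal-width bins from 0 to 10,000 plus one open-ended bin above 10,000. An equal-frequency reading (`pd.qcut`) would be a different, equally defensible choice. The column names `price_bin` and `expensive` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy prices in the raw string format; the values are made up for illustration.
toy = pd.DataFrame(
    {"price": ["$803.00", "$2,600.00", "$1,401.00", "$12,500.00", "$450.00"]}
)

# Strip the currency symbol and thousands separators, then convert to float.
toy["price"] = toy["price"].str.replace(r"[$,]", "", regex=True).astype(float)

# Step 6 (one interpretation): ten equal-width bins covering 0 to 10,000,
# plus one open-ended bin for everything above 10,000, giving 11 bins in total.
edges = list(np.linspace(0, 10_000, 11)) + [np.inf]
toy["price_bin"] = pd.cut(toy["price"], bins=edges, include_lowest=True)

# Step 13: 1 = expensive (strictly above the median),
#          0 = affordable (equal to or below the median).
median_price = toy["price"].median()
toy["expensive"] = (toy["price"] > median_price).astype(int)

print(toy)
```

Using a strict `>` comparison places a listing priced exactly at the median in the affordable class, which matches the definition given in step 13.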


In [12]:
import pandas as pd
import sklearn as sk

# 1 Load and select columns

In [18]:
# load the data
data = pd.read_csv('listings_new.csv')
# the 22 columns required by step 1
columns_to_keep = [
    "id",
    "name",
    "host_id",
    "host_name",
    "neighbourhood_cleansed",
    "latitude",
    "longitude",
    "room_type",
    "price",
    "minimum_nights",
    "number_of_reviews",
    "last_review",
    "review_scores_rating",
    "review_scores_accuracy",
    "review_scores_cleanliness",
    "review_scores_checkin",
    "review_scores_communication",
    "review_scores_location",
    "review_scores_value",
    "reviews_per_month",
    "calculated_host_listings_count",
    "availability_365",
]
data_limited = data[columns_to_keep]
data_limited.head()

|  | id | name | host_id | host_name | neighbourhood_cleansed | latitude | longitude | room_type | price | minimum_nights | ... | review_scores_rating | review_scores_accuracy | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | reviews_per_month | calculated_host_listings_count | availability_365 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 6983 | Rental unit in Copenhagen · ★4.78 · 1 bedroom ... | 16774 | Simon | Nrrebro | 55.68641 | 12.54741 | Entire home/apt | $803.00 | 3 | ... | 4.78 | 4.79 | 4.78 | 4.86 | 4.89 | 4.73 | 4.71 | 1.03 | 1 | 0 |
| 1 | 26057 | Home in Copenhagen · ★4.91 · 4 bedrooms · 4 be... | 109777 | Kari | Indre By | 55.69307 | 12.57649 | Entire home/apt | $2,600.00 | 7 | ... | 4.91 | 4.93 | 4.96 | 4.93 | 4.86 | 4.94 | 4.81 | 0.51 | 1 | 261 |
| 2 | 334127 | Rental unit in Copenhagen · ★4.88 · 1 bedroom ... | 1702034 | Mette | Vesterbro-Kongens Enghave | 55.67059 | 12.55651 | Entire home/apt | $1,401.00 | 4 | ... | 4.88 | 4.92 | 4.85 | 4.96 | 4.98 | 4.76 | 4.78 | 0.87 | 1 | 42 |
| 3 | 338928 | Rental unit in Copenhagen · ★4.91 · 1 bedroom ... | 113348 | Samy | Nrrebro | 55.69388 | 12.54725 | Entire home/apt | $793.00 | 5 | ... | 4.91 | 4.86 | 4.92 | 4.9 | 4.94 | 4.89 | 4.8 | 0.69 | 1 | 5 |
| 4 | 26473 | Townhouse in Copenhagen · ★4.56 · 6 bedrooms ·... | 112210 | Julia | Indre By | 55.67602 | 12.5754 | Entire home/apt | $3,350.00 | 3 | ... | 4.56 | 4.64 | 4.45 | 4.79 | 4.72 | 4.89 | 4.61 | 2.11 | 10 | 109 |
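
As an alternative to loading every column and then selecting, the column selection can also happen while the file is being parsed. The sketch below illustrates this with the `usecols` parameter of `pd.read_csv`. It uses a tiny in-memory CSV as a hypothetical stand-in for the real file, and the shortened `keep` list and the `listing_url` column are assumptions made only for illustration.

```python
import io

import pandas as pd

# Hypothetical miniature stand-in for the listings file; the real file has
# many more rows and columns.
csv_text = io.StringIO(
    "id,name,host_id,listing_url,price\n"
    "6983,Rental unit,16774,https://example.com/6983,$803.00\n"
    '26057,Home,109777,https://example.com/26057,"$2,600.00"\n'
)

# Shortened list for illustration; the assignment asks for 22 columns.
keep = ["id", "name", "host_id", "price"]

# usecols tells the parser to skip every column that is not listed,
# so the unwanted columns never end up in the DataFrame.
subset = pd.read_csv(csv_text, usecols=keep)

print(subset.columns.tolist())
```

With a large raw file this can reduce memory use, but loading everything first and selecting afterwards, as done above, gives the same resulting columns.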


# 2 Handle missing values

In [19]:
# remove rows with no reviews
data_filtered = data_limited.loc[data_limited['number_of_reviews'] != 0]

# drop any rows that still contain missing values (NaN)
data_filtered = data_filtered.dropna()
data_filtered.head()

|  | id | name | host_id | host_name | neighbourhood_cleansed | latitude | longitude | room_type | price | minimum_nights | ... | review_scores_rating | review_scores_accuracy | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | reviews_per_month | calculated_host_listings_count | availability_365 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 6983 | Rental unit in Copenhagen · ★4.78 · 1 bedroom ... | 16774 | Simon | Nrrebro | 55.68641 | 12.54741 | Entire home/apt | $803.00 | 3 | ... | 4.78 | 4.79 | 4.78 | 4.86 | 4.89 | 4.73 | 4.71 | 1.03 | 1 | 0 |
| 1 | 26057 | Home in Copenhagen · ★4.91 · 4 bedrooms · 4 be... | 109777 | Kari | Indre By | 55.69307 | 12.57649 | Entire home/apt | $2,600.00 | 7 | ... | 4.91 | 4.93 | 4.96 | 4.93 | 4.86 | 4.94 | 4.81 | 0.51 | 1 | 261 |
| 2 | 334127 | Rental unit in Copenhagen · ★4.88 · 1 bedroom ... | 1702034 | Mette | Vesterbro-Kongens Enghave | 55.67059 | 12.55651 | Entire home/apt | $1,401.00 | 4 | ... | 4.88 | 4.92 | 4.85 | 4.96 | 4.98 | 4.76 | 4.78 | 0.87 | 1 | 42 |
| 3 | 338928 | Rental unit in Copenhagen · ★4.91 · 1 bedroom ... | 113348 | Samy | Nrrebro | 55.69388 | 12.54725 | Entire home/apt | $793.00 | 5 | ... | 4.91 | 4.86 | 4.92 | 4.9 | 4.94 | 4.89 | 4.8 | 0.69 | 1 | 5 |
| 4 | 26473 | Townhouse in Copenhagen · ★4.56 · 6 bedrooms ·... | 112210 | Julia | Indre By | 55.67602 | 12.5754 | Entire home/apt | $3,350.00 | 3 | ... | 4.56 | 4.64 | 4.45 | 4.79 | 4.72 | 4.89 | 4.61 | 2.11 | 10 | 109 |
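
Since `head()` only shows the first five rows, it cannot confirm that the whole data set is free of missing values. A quick check along the following lines might be useful. It is a sketch on made-up toy rows, and the names `toy`, `reviewed`, `clean` and `no_missing` are illustrative, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows: listing 2 has no reviews, listing 4 has a missing score.
toy = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "number_of_reviews": [5, 0, 12, 3],
        "review_scores_rating": [4.8, np.nan, 4.5, np.nan],
    }
)

# Same two steps as in the cell above: drop unreviewed listings, then drop
# any rows that still contain NaN.
reviewed = toy.loc[toy["number_of_reviews"] != 0]
clean = reviewed.dropna().reset_index(drop=True)

# Report how many rows each step removed.
print(f"rows: {len(toy)} -> {len(reviewed)} -> {len(clean)}")

# True only if no column contains a single missing value.
no_missing = not clean.isna().any().any()
print("no missing values left:", no_missing)
```

Applied to the real data, the same pattern would be `data_filtered.isna().sum()`, which lists the number of missing values per column and should show only zeros after step 2.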


# 3 Fix 'neighbourhood_cleansed'