# Analyzing Airbnb
## A text-based analysis of Berlin's hospitality scene

Airbnb has successfully disrupted the traditional hospitality industry as more and more travelers decide to use Airbnb as their primary accommodation provider. Since its inception in 2008, Airbnb has seen enormous growth, with the number of rentals listed on its website rising sharply year after year.

In Germany, no city is more popular with Airbnb guests than Berlin, which makes it one of the hottest Airbnb markets in Europe, with 22,552 listings as of **November 2018**. With an area of 891 km², that is roughly 25 homes per km² being rented out on Airbnb!

The following question will drive this project:

> **What do visitors like and dislike?**

<br> We will process the reviews to find out what people's likes and dislikes are. We will use Natural Language Processing (NLP), specifically **Sentiment Analysis** and **Topic Modeling**.

### The datasets

We will use the <a href='https://www.kaggle.com/brittabettendorf/berlin-airbnb-data'> reviews data </a> and combine it with some features from the detailed Berlin listings data, sourced from the Inside Airbnb website. Both datasets were scraped on November 7, 2018.

## Table of Contents
<a id='Table of contents'></a>

### <a href='#1. Obtaining and Viewing the Data'> 1. Obtaining and Viewing the Data </a>

### <a href='#2. Preprocessing the Data'> 2. Preprocessing the Data </a>
* <a href='#2.1. Dealing with Missing Values'> 2.1. Dealing with Missing Values </a>
* <a href='#2.2. Language Detection'> 2.2. Language Detection </a>

### <a href='#3. Visualizing the Data with WordClouds'> 3. Visualizing the Data with WordClouds </a>

### <a href='#4. Sentiment Analysis'> 4. Sentiment Analysis </a>
* <a href='#4.1. Get used to VADER package'> 4.1. Get used to VADER package </a>
* <a href='#4.2. Calculating Sentiment Scores'> 4.2. Calculating Sentiment Scores </a>
* <a href='#4.3. Comparing Negative and Positive Comments'> 4.3. Comparing Negative and Positive Comments </a>
* <a href='#4.4. Investigating Positive Comments'> 4.4. Investigating Positive Comments </a>
* <a href='#4.5. Investigating Negative Comments'> 4.5. Investigating Negative Comments </a>

### <a href='#5. Topic Modeling'> 5. Topic Modeling </a>

### <a href='#6. Appendix'> 6. Appendix </a>

### 1. Obtaining and Viewing the Data 
<a id='1. Obtaining and Viewing the Data'></a>

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
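plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')  # the style was renamed in matplotlib 3.6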
plt.rc('xtick', labelsize=15) 
plt.rc('ytick', labelsize=15) 
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

import time
import datetime

In [None]:
filename = r'reviews_summary.csv'
reviews = pd.read_csv(filename)

# checking shape ...
print("The dataset has {} rows and {} columns.".format(*reviews.shape))

# ... and duplicates
print("It contains {} duplicates.".format(reviews.duplicated().sum()))

In [None]:
reviews.head()

Well, it may be valuable to have more details, such as the latitude and longitude of the accommodation that has been reviewed, the neighbourhood it's in, the host id, etc. 

To get this information, let's **combine our reviews_dataframe** with the **listings_dataframe** and take only the columns we need from the latter one:
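
To see what such a merge does to the column names, here is a toy sketch with made-up frames (`reviews_toy`, `listings_toy` and all their values are illustrative assumptions, not the real data). Both frames carry an `id` column, so pandas disambiguates them with the `_x` and `_y` suffixes, and `how='left'` keeps every review even when no listing matches:

```python
import pandas as pd

# Made-up miniature versions of the two tables (illustrative values only)
reviews_toy = pd.DataFrame({
    "listing_id": [10, 10, 20, 30],
    "id": [1, 2, 3, 4],                     # review id
    "comments": ["great", "ok", "super", "nice"],
})
listings_toy = pd.DataFrame({
    "id": [10, 20],                         # listing id
    "host_id": [100, 200],
})

merged_toy = pd.merge(reviews_toy, listings_toy,
                      left_on="listing_id", right_on="id", how="left")

# 'id' exists in both frames, so pandas renames them to id_x (reviews) and id_y (listings)
print(merged_toy.columns.tolist())

# the review for listing 30 has no matching listing, so its host_id is missing
print(merged_toy["host_id"].isna().sum())
```

This is why the real merge is followed by a rename of `id_x` and a drop of `id_y`.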

In [None]:
filename = r'listings_summary.csv'
listings = pd.read_csv(filename)

# checking shape ...
print("The dataset has {} rows and {} columns.".format(*listings.shape))

# ... and duplicates
print("It contains {} duplicates.".format(listings.duplicated().sum()))

listings.head()

In [None]:
df = pd.merge(reviews, listings[['neighbourhood_group_cleansed', 'host_id', 'latitude',
                          'longitude', 'number_of_reviews', 'id', 'property_type']], 
              left_on='listing_id', right_on='id', how='left')

df.rename(columns = {'id_x':'id', 'neighbourhood_group_cleansed':'neighbourhood_group'}, inplace=True)
df.drop(['id_y'], axis=1, inplace=True)

In [None]:
df.head(3)

In [None]:
print("The dataset has {} rows and {} columns.".format(*df.shape))

**Hosts with many properties**

By the way, let's find out if any private hosts have started to run a professional business through Airbnb - at least this is what was in the press. Let's work this out:

In [None]:
properties_per_host = pd.DataFrame(df.groupby('host_id')['listing_id'].nunique())

properties_per_host.sort_values(by=['listing_id'], ascending=False, inplace=True)
properties_per_host.head(20)

Let's take a closer look at the top 3 hosts. How many properties do they have in the different areas? And are these private apartments, or something else, like a hostel?

**> No. 1 Host**

In [None]:
top1_host = df.host_id == 1625771
df[top1_host].neighbourhood_group.value_counts()

pd.DataFrame(df[top1_host].groupby('neighbourhood_group')['listing_id'].nunique().sort_values(ascending=False))

In [None]:
pd.DataFrame(df[top1_host].groupby('property_type')['listing_id'].nunique().sort_values(ascending=False))

> This host has apartments in 8 (!) districts. It looks like he managed to expand a well-functioning business into many different neighbourhoods...

**> No. 2 Host**

In [None]:
top2_host = df.host_id == 8250486
df[top2_host].neighbourhood_group.value_counts()

pd.DataFrame(df[top2_host].groupby('neighbourhood_group')['listing_id'].nunique().sort_values(ascending=False))

In [None]:
pd.DataFrame(df[top2_host].groupby('property_type')['listing_id'].nunique().sort_values(ascending=False))

> Well, looks like the second biggest player turned out to be a hostel.

**> No. 3 Host**

In [None]:
top3_host = df.host_id == 2293972
df[top3_host].neighbourhood_group.value_counts()  # counts review rows per district (not displayed: only a cell's last expression is shown)

pd.DataFrame(df[top3_host].groupby('neighbourhood_group')['listing_id'].nunique().sort_values(ascending=False))

In [None]:
pd.DataFrame(df[top3_host].groupby('property_type')['listing_id'].nunique().sort_values(ascending=False))

> And host No. 3 also seems to be a professional lodging supplier.

*Back to: <a href='#Table of contents'> Table of contents</a>*
### 2. Preprocessing the Data 
<a id='2. Preprocessing the Data'></a>

#### 2.1. Dealing with Missing Values
<a id='2.1. Dealing with Missing Values'></a>

In [None]:
df.isna().sum()

In [None]:
df.dropna(inplace=True) 
df.isna().sum()

In [None]:
df.shape

*Back to: <a href='#Table of contents'> Table of contents</a>*
#### 2.2. Language Detection
<a id='2.2. Language Detection'></a>

In [None]:
from langdetect import detect

In [None]:
def language_detection(text):
    # langdetect raises on empty or non-text input; return None in that case
    try:
        return detect(text)
    except Exception:
        return None

In [None]:
language_detection('Als Gregor Samsa eines Morgens aus unruhigen Träumen erwachte, fand er sich in seinem Bett zu einem ungeheueren Ungeziefer verwandelt.')

In [None]:
language_detection('It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.')

In [None]:
# running this cell may take a long time, you can load the processed dataset in the next cell
# df['language'] = df['comments'].apply(language_detection)
# df.to_csv('language_processed.csv', index=False)

In [None]:
filename = r'language_processed.csv'
df_lang = pd.read_csv(filename)
df_lang.head()

In [None]:
pd.DataFrame(df_lang.language.value_counts().head(10))

In [None]:
pd.DataFrame(df_lang.language.value_counts(normalize=True).head(10))

In [None]:
plot = df_lang.language.value_counts().head(6).sort_values().plot(kind='barh', figsize=(10,5),color="lightcoral");
plot.set_title("\nWhat are the most frequent languages comments are written in?\n", 
             fontsize=30,fontweight='bold')

In [None]:
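# .copy() gives independent frames, so columns can be added later without chained-assignment warnings
df_eng = df_lang[(df_lang['language']=='en')].copy()
df_de  = df_lang[(df_lang['language']=='de')].copy()
df_fr  = df_lang[(df_lang['language']=='fr')].copy()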

In [None]:
pd.set_option('display.max_colwidth', None)  # None means no truncation; -1 is no longer accepted by recent pandas
df_fr

*Back to: <a href='#Table of contents'> Table of contents</a>*
### 3. Visualizing the Data with WordClouds
<a id='3. Visualizing the Data with WordClouds'></a>

**Preparing Steps**

In [None]:
from nltk.corpus import stopwords
from wordcloud import WordCloud
from collections import Counter
from PIL import Image

import re
import string

In [None]:
def plot_wordcloud(text, language, title):
    # 'text' rather than 'string', so the imported string module is not shadowed
    
    # Generate WordCloud
    wordcloud = WordCloud(max_words=200, background_color="black", 
                      width=3000, height=2000,
                      stopwords=set(stopwords.words(language))).generate(text)

    # Plotting
    plt.figure(figsize=(12, 10))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.title(title, fontsize=18, fontweight='bold')
    plt.show()

**English WordCloud**

In [None]:
# join the comments explicitly: str() on a large NumPy array abbreviates it with '...'
english_comments = " ".join(df_eng.comments.astype(str)).lower()
plot_wordcloud(english_comments, 'english', 'English Comments\n')

**German WordCloud**

In [None]:
# join the comments explicitly: str() on a large NumPy array abbreviates it with '...'
german_comments = " ".join(df_de.comments.astype(str)).lower()
plot_wordcloud(german_comments, 'german', 'German Comments\n')

**French WordCloud**

In [None]:
# join the comments explicitly: str() on a large NumPy array abbreviates it with '...'
french_comments = " ".join(df_fr.comments.astype(str)).lower()
plot_wordcloud(french_comments, 'french', 'French Comments\n')

*Back to: <a href='#Table of contents'> Table of contents</a>*
### 4. Sentiment Analysis
<a id='4. Sentiment Analysis'></a>

**Sentiment Analysis** tries to identify and extract **opinions** within a given text. The aim of sentiment analysis is to systematically identify, extract, quantify, and study affective states and subjective information.

It is often applied to reviews (products, restaurants…), survey data, or any user-generated content that can carry opinions (e.g. tweets):

*“I loved the movie Parasite, it really deserved the Oscar”* -> Positive

*“I didn’t like the staff’s rude attitude”* -> Negative

#### 4.1. Get used to VADER package
<a id='4.1. Get used to VADER package'></a>

VADER: Valence Aware Dictionary and Sentiment Reasoner

VADER belongs to a type of sentiment analysis that is based on **lexicons** of sentiment-related words. In this approach, each of the words in the lexicon is rated as positive or negative, and in many cases, **how** positive or negative.

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

In [None]:
def sentiment_analyzer_scores(sentence):
    score = analyzer.polarity_scores(sentence)
    print("{:-<40} {}".format(sentence, str(score)))

In [None]:
sentiment_analyzer_scores("I AM HAPPY :)")

VADER produces four sentiment metrics from these word ratings, which you can see above. The first three - positive, neutral and negative - represent the proportion of the text that falls into those categories. 

The final metric, **the compound score**, is the sum of all of the lexicon ratings which have been standardised to range between -1 and 1. 
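
How a sum of word ratings ends up between -1 and 1 can be sketched in a few lines. To the best of my knowledge, the VADER reference implementation divides the sum `x` by `sqrt(x² + alpha)` with `alpha = 15`; treat both the formula and the constant as an assumption to check against the installed package. The function name `normalize` and the sample sums are illustrative:

```python
import math

def normalize(score_sum, alpha=15):
    """Squash an unbounded sum of valence ratings into the open interval (-1, 1)."""
    return score_sum / math.sqrt(score_sum * score_sum + alpha)

# a few made-up rating sums, from clearly negative to clearly positive
for total in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(f"{total:+.1f} -> {normalize(total):+.4f}")
```

The larger the absolute sum, the closer the result gets to -1 or 1 without ever reaching it, which is why long, enthusiastic reviews cluster near the upper end of the compound scale.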

*Back to: <a href='#Table of contents'> Table of contents</a>*
#### 4.2. Calculating Sentiment Scores
<a id='4.2. Calculating Sentiment Scores'></a>

Let's now have VADER produce all four scores for each of our English-language comments.
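
One way to turn a dictionary of scores per comment into columns is to name each column after its dictionary key, so that labels and values cannot drift apart. The sketch below uses a hypothetical stand-in, `fake_scores`, with made-up values in place of `analyzer.polarity_scores`, and a two-row toy frame:

```python
import pandas as pd

def fake_scores(text):
    # Hypothetical stand-in for analyzer.polarity_scores: same keys, made-up values
    return {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.5}

toy = pd.DataFrame({"comments": ["lovely flat", "nice host"]})

# Build one column per dictionary key, so the column names always follow the keys
scores = pd.DataFrame(toy["comments"].map(fake_scores).tolist(), index=toy.index)
toy = toy.join(scores.add_prefix("sentiment_"))

print(toy.columns.tolist())
```

Compared with unpacking a list of values positionally, this does not depend on the order in which the scorer happens to return its keys.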

In [None]:
def vader(text):
    score = analyzer.polarity_scores(text)
    score_list = list(score.values())
    return score_list

In [None]:
vader("I am happy")

In [None]:
# running this cell may take a long time, you can load the processed dataset in the next cell
# note: polarity_scores returns its values in the order neg, neu, pos, compound, so the columns are listed in that order
# df_eng['sentiment_neg'], df_eng['sentiment_neu'], df_eng['sentiment_pos'], df_eng['sentiment_compound'] = zip(*df_eng['comments'].map(vader))
# df_eng.to_csv('df_en_sen.csv', index=False)

In [None]:
df_en_sen = pd.read_csv('df_en_sen.csv')
df_en_sen.head()

In [None]:
# checking shape ...
print("The dataset has {} rows and {} columns.".format(*df_en_sen.shape))

Let's investigate the distribution of all scores:

In [None]:
fig, axes = plt.subplots(3, figsize=(7,10))

# plot all 3 histograms
df_en_sen.hist('sentiment_neg', bins=25, ax=axes[0], color='lightcoral', alpha=0.6)
axes[0].set_title('Negative Sentiment Score', fontsize=15)
df_en_sen.hist('sentiment_neu', bins=25, ax=axes[1], color='lightsteelblue', alpha=0.6)
axes[1].set_title('Neutral Sentiment Score', fontsize=15)
df_en_sen.hist('sentiment_pos', bins=25, ax=axes[2], color='chartreuse', alpha=0.6)
axes[2].set_title('Positive Sentiment Score', fontsize=15)

# plot common x- and y-label
fig.text(0.5, 0.04, 'Sentiment Scores',  fontweight='bold', ha='center', fontsize=15)
fig.text(-0.04, 0.5, 'Number of Reviews', fontweight='bold', va='center', rotation='vertical', fontsize=15)

# plot title
plt.suptitle('Sentiment Analysis of Airbnb Reviews for Berlin\n\n', fontsize=20, fontweight='bold');

In [None]:
df_en_sen.hist('sentiment_compound', bins=25, color='orange', alpha=0.6)

# plot title
plt.suptitle('Compound Sentiment Analysis of Airbnb Reviews for Berlin\n\n', fontsize=20, fontweight='bold');

Clearly, the bulk of the reviews are tremendously positive. Wouldn't it be interesting to know what the negative and positive comments are about? Let's have a look.

*Back to: <a href='#Table of contents'> Table of contents</a>*
#### 4.3. Comparing Negative and Positive Comments
<a id='4.3. Comparing Negative and Positive Comments'></a>

In [None]:
df_pos = df_en_sen.loc[df_en_sen.sentiment_compound >= 0.95]

pos_comments = df_pos['comments'].tolist()
len(pos_comments)

In [None]:
df_neg = df_en_sen.loc[df_en_sen.sentiment_compound < 0.0]

neg_comments = df_neg['comments'].tolist()
len(neg_comments)

Let's compare the length of both positive and negative comments:

In [None]:
df_pos['text_length'] = df_pos['comments'].apply(len)
df_neg['text_length'] = df_neg['comments'].apply(len)

In [None]:
sns.set_style("whitegrid")
sns.set(font_scale=1.5)
plt.figure(figsize=(10,7))

sns.distplot(df_pos['text_length'], bins=50, color='chartreuse')
sns.distplot(df_neg['text_length'], bins=50, color='lightcoral')

plt.title('\nDistribution Plot for Length of Comments\n')
plt.legend(['Positive Comments', 'Negative Comments'])
plt.xlabel('\nText Length')
plt.ylabel('Density\n');

The mode of the text length lies further to the right for positive comments than for negative ones, so a typical positive comment is longer than a typical negative one. The negative comments, however, have a thicker tail: very long comments are relatively more common among them.

In [None]:
print('\n\n'.join(pos_comments[10:15]))

In [None]:
print('\n\n'.join(neg_comments[10:15]))

Let's quickly check if a scatter plot may reveal some differences in the comments' sentiment with respect to the districts:

In [None]:
sns.set_style("white")
cmap = sns.cubehelix_palette(rot=-.4, as_cmap=True)
fig, ax = plt.subplots(figsize=(11,7))

ax = sns.scatterplot(x="longitude", y="latitude", size='number_of_reviews', sizes=(5, 200),
                     hue='sentiment_compound', palette=cmap,  data=df_en_sen)
ax.legend(bbox_to_anchor=(1.3, 1), borderaxespad=0.)
plt.title('\nAccommodations in Berlin by Number of Reviews & Sentiment\n', fontsize=12, fontweight='bold')

sns.despine(ax=ax, top=True, right=True, left=True, bottom=True);

Not really...

*Back to: <a href='#Table of contents'> Table of contents</a>*
#### 4.4. Investigating Positive Comments
<a id='4.4. Investigating Positive Comments'></a>

**WordCloud**

After reading some of these reviews to get a feeling for what visitors applaud or complain about, WordClouds are a great tool to help us peek behind the curtain:

In [None]:
plot_wordcloud(str(pos_comments[0:3000]).lower(), 'english', 'Positively Tuned\n')

**Frequency Distribution**

Another method for visually exploring text is with frequency distributions. In the context of a text corpus, such a distribution tells us the prevalence of certain words. Here we use the Yellowbrick library.
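
Underneath, a frequency distribution is just a table of word counts. A minimal sketch with the standard library, using two made-up comments and a tiny hand-picked stop word list (both are illustrative assumptions; the notebook itself relies on `CountVectorizer` and its built-in English stop words):

```python
import re
from collections import Counter

# Made-up comments and a tiny hand-picked stop word list (both illustrative)
toy_comments = [
    "Great location and a great host!",
    "The location was perfect, the flat was clean.",
]
toy_stop_words = {"a", "and", "the", "was"}

tokens = []
for comment in toy_comments:
    # lower-case, then keep runs of letters only
    tokens.extend(re.findall(r"[a-z]+", comment.lower()))

# count every token that is not a stop word
freq = Counter(t for t in tokens if t not in toy_stop_words)
print(freq.most_common(2))
```

The bar chart produced by Yellowbrick shows the same kind of counts, only for the whole corpus and the 30 most frequent terms.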

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from yellowbrick.text.freqdist import FreqDistVisualizer
from yellowbrick.style import set_palette

In [None]:
# vectorizing text
vectorizer = CountVectorizer(stop_words='english')
docs = vectorizer.fit_transform(pos_comments)
features = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

# preparing the plot
set_palette('pastel')
plt.figure(figsize=(18,8))
plt.title('The Top 30 most frequent words used in POSITIVE comments\n', fontweight='bold', fontsize=20)

# instantiating and fitting the FreqDistVisualizer, plotting the top 30 most frequent terms
visualizer = FreqDistVisualizer(features=features, n=30)
visualizer.fit(docs)
visualizer.show();  # the method has to be called; show() replaces the deprecated poof()

*Back to: <a href='#Table of contents'> Table of contents</a>*
### 5. Topic Modeling
<a id='5. Topic Modeling'></a>
Next we'll explore **topic modeling**, an unsupervised machine learning technique for abstracting topics from collections of documents or, in our case, for identifying which topic is being discussed in a comment. 

Put simply: 
* A document can be represented using a set of topics.
* Each topic is represented as a set of words with their probabilities of occurring in that topic
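
The two statements above can be made concrete with a small numeric sketch. All topic names and probabilities below are hypothetical numbers chosen for illustration, not output of any model:

```python
# Hypothetical numbers: two topics over a four-word vocabulary
topic_word = {
    "cleanliness": {"clean": 0.5, "bed": 0.3, "metro": 0.1, "cafe": 0.1},
    "location":    {"clean": 0.1, "bed": 0.1, "metro": 0.4, "cafe": 0.4},
}
# One document expressed as a mixture of those topics
doc_topic = {"cleanliness": 0.25, "location": 0.75}

def word_probability(word):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(doc_topic[t] * topic_word[t][word] for t in doc_topic)

doc_word = {w: word_probability(w) for w in topic_word["cleanliness"]}
print(doc_word)
```

A topic model works in the opposite direction: it only sees the words of each document and tries to infer topic mixtures and word distributions that would plausibly have produced them.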

Methods for topic modeling have evolved significantly over the last decade. In this section, we'll explore *Latent Dirichlet Allocation (LDA)*, one of the most widely used of them.

*Back to: <a href='#Table of contents'> Table of contents</a>*

#### 5.1. Cleaning and Preprocessing
<a id='5.1 Cleaning and Preprocessing'></a>

In [None]:
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string

In [None]:
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

In [None]:
def clean(doc):
    stop_free = " ".join([word for word in doc.lower().split() if word not in stop])
    punc_free = "".join(token for token in stop_free if token not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

doc_clean = [clean(comment).split() for comment in pos_comments]


*Back to: <a href='#Table of contents'> Table of contents</a>*

#### 5.2. Building the model
<a id='5.2 Building the model'></a>
First, we create a Gensim dictionary from the normalized data, then we convert this to a bag-of-words corpus, and save both dictionary and corpus for future use.
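
What a dictionary and a bag-of-words corpus contain can be sketched without Gensim. This is a plain-Python illustration on made-up tokens; the names `token2id` and `to_bow` are hypothetical, and Gensim's own id assignment may differ from the first-appearance order used here:

```python
from collections import Counter

# Made-up tokenised comments (illustrative)
toy_docs = [["great", "place", "great", "host"], ["clean", "place"]]

# token -> integer id, assigned in order of first appearance
token2id = {}
for doc in toy_docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs, ignoring tokens not in the vocabulary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in toy_docs]
print(token2id)
print(toy_corpus)
```

Word order is discarded in this representation: each document is reduced to which words occur and how often, which is all LDA needs.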

In [None]:
from gensim import corpora
import pickle 

dictionary = corpora.Dictionary(doc_clean)
corpus = [dictionary.doc2bow(text) for text in doc_clean]

In [None]:
# the with-block makes sure the file is closed after writing
with open('corpus.pkl', 'wb') as f:
    pickle.dump(corpus, f)
dictionary.save('dictionary.gensim')

In [None]:
import gensim
# let LDA find 3 topics
# running this cell may take a long time
ldamodel3 = gensim.models.ldamodel.LdaModel(corpus, num_topics=3, id2word=dictionary, passes=15)
ldamodel3.save('lda_3_topics.gensim')

topics3 = ldamodel3.print_topics(num_words=10)
for topic in topics3:
    print(topic,'\n')

- The first topic includes words like *bed*, *also*, *even*, and a mysterious *u* (perhaps u-bahn for the underground?). It seems unclear to me what this was supposed to be about. 
- The second topic combines words like *great*, *place*, *stay*, and *recommend*, which sounds like a cluster related to overall satisfaction with the home.
- The third topic includes words like *apartment*, *great*, *location*, and *minute*. This sounds like a topic about how conveniently close the accommodation is to the places visitors want to go.

In [None]:
# now let LDA find 5 topics
# running this cell may take a long time
ldamodel5 = gensim.models.ldamodel.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=15)
ldamodel5.save('lda_5_topics.gensim')

topics5 = ldamodel5.print_topics(num_words=4)
for topic in topics5:
    print(topic, '\n')

In [None]:
# and finally 10 topics
# running this cell may take a long time
ldamodel10 = gensim.models.ldamodel.LdaModel(corpus, num_topics=10, id2word=dictionary, passes=15)
ldamodel10.save('lda_10_topics.gensim')

topics10 = ldamodel10.print_topics(num_words=4)
for topic in topics10:
    print(topic, '\n')

Putting it all together - the WordCloud, the Frequency Distribution and the Topic Modelling - it is often the following criteria that make someone rate an apartment **positively:**
1. **The apartment is clean, the bathroom is clean, the bed is comfortable.**
2. **The apartment is quiet and conducive to getting sound sleep.**
3. **The area is centrally located with short walking distances, good public transport connections, and has cafes and restaurants nearby.**

Apparently, getting the last two means trying to square the circle... but this is true for tourists all over the world.

Before we move on to the negative comments, let's visualize the LDA model:

*3. Visualizing topics*

The pyLDAvis library is designed to provide a visual interface for interpreting the topics derived from a topic model by extracting information from a fitted LDA topic model.

***The following code should be run locally only!***

In [None]:
import pyLDAvis.gensim  # note: pyLDAvis 3.3+ renamed this module to pyLDAvis.gensim_models

In [None]:
# visualizing 3 topics
lda_display3 = pyLDAvis.gensim.prepare(ldamodel3, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display3)

In [None]:
pyLDAvis.save_html(lda_display3, 'lda_3_topics.html')

In short, the interface provides:

- a left panel that depicts a global view of the model (how prevalent each topic is and how topics relate to each other);
- a right panel containing a bar chart – the bars represent the terms that are most useful in interpreting the topic currently selected (what the meaning of each topic is).

On the left, the topics are plotted as circles, whose centers are defined by the computed distance between topics (projected into 2 dimensions). The prevalence of each topic is indicated by the circle’s area. On the right, two juxtaposed bars show the topic-specific frequency of each term (in red) and the corpus-wide frequency (in blueish gray). When no topic is selected, the right panel displays the top 30 most salient terms for the dataset.

In [None]:
# visualizing 5 topics
lda_display5 = pyLDAvis.gensim.prepare(ldamodel5, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display5)
pyLDAvis.save_html(lda_display5, 'lda_5_topics.html')

*Back to: <a href='#Table of contents'> Table of contents</a>*
#### 4.5. Investigating Negative Comments
<a id='4.5. Investigating Negative Comments'></a>

**WordCloud**

In [None]:
plot_wordcloud(str(neg_comments).lower(),'english', '\nNegatively Tuned')

**Frequency Distribution**

In [None]:
# vectorizing text
vectorizer = CountVectorizer(stop_words='english')
docs = vectorizer.fit_transform(neg_comments)
features = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

# preparing the plot
set_palette('pastel')
plt.figure(figsize=(18,8))
plt.title('The Top 30 most frequent words used in NEGATIVE comments\n', fontweight='bold')

# instantiating and fitting the FreqDistVisualizer, plotting the top 30 most frequent terms
visualizer = FreqDistVisualizer(features=features, n=30)
visualizer.fit(docs)
visualizer.show();  # the method has to be called; show() replaces the deprecated poof()

**Topic Modelling**

*1. Cleaning and Preprocessing*

In [None]:
doc_clean_neg = [clean(comment).split() for comment in neg_comments]

*2. LDA the Gensim way*

In [None]:
dictionary_neg = corpora.Dictionary(doc_clean_neg)
corpus_neg = [dictionary_neg.doc2bow(text) for text in doc_clean_neg]

In [None]:
# save the corpus and dictionary built from the NEGATIVE comments
with open('corpus_neg.pkl', 'wb') as f:
    pickle.dump(corpus_neg, f)
dictionary_neg.save('dictionary_neg.gensim')

In [None]:
# let LDA find 3 topics
# running this cell may take a long time
ldamodel3_neg = gensim.models.ldamodel.LdaModel(corpus_neg, num_topics=3, id2word=dictionary_neg, passes=15)
ldamodel3_neg.save('lda_3_topics_neg.gensim')

topics3_neg = ldamodel3_neg.print_topics(num_words=10)
for topic in topics3_neg:
    print(topic)

In [None]:
# now let LDA find 5 topics
# running this cell may take a long time
ldamodel5_neg = gensim.models.ldamodel.LdaModel(corpus_neg, num_topics=5, id2word=dictionary_neg, passes=15)
ldamodel5_neg.save('lda_5_topics_neg.gensim')

ldamodel5_neg = gensim.models.ldamodel.LdaModel.load('lda_5_topics_neg.gensim')
topics5_neg = ldamodel5_neg.print_topics(num_words=4)
for topic in topics5_neg:
    print(topic)

In [None]:
# and finally 10 topics
# running this cell may take a long time
ldamodel10_neg = gensim.models.ldamodel.LdaModel(corpus_neg, num_topics=10, id2word=dictionary_neg, passes=15)
ldamodel10_neg.save('lda_10_topics_neg.gensim')

ldamodel10_neg = gensim.models.ldamodel.LdaModel.load('lda_10_topics_neg.gensim')
topics10_neg = ldamodel10_neg.print_topics(num_words=4)
for topic in topics10_neg:
    print(topic)

Once again, let's put all of the visualizations together and summarize what makes someone rate an apartment **negatively:**
1. **The apartment and/or bathroom (especially the shower) are dirty.**
2. **Problems in communicating with the host, e.g. one-sided cancellations by the host or not being able to get hold of him/her when issues arise.**
3. **The area is too far away from public transport connections or doesn't meet visitors' expectations in some way.**

Before we finish analyzing the negative comments, let's visualize the LDA model:

*3. Visualizing topics*

***The following code should be run locally only!***

In [None]:
# visualizing 3 topics
lda_display3_neg = pyLDAvis.gensim.prepare(ldamodel3_neg, corpus_neg, dictionary_neg, sort_topics=False)
pyLDAvis.display(lda_display3_neg)
pyLDAvis.save_html(lda_display3_neg, 'lda_3_topics_neg.html')

In [None]:
# visualizing 5 topics
lda_display5_neg = pyLDAvis.gensim.prepare(ldamodel5_neg, corpus_neg, dictionary_neg, sort_topics=False)
pyLDAvis.display(lda_display5_neg)
pyLDAvis.save_html(lda_display5_neg, 'lda_5_topics_neg.html')

*Back to: <a href='#Table of contents'> Table of contents</a>*
### 6. Appendix 
<a id='6. Appendix'></a>

All resources used in this notebook are listed below.

Data
- Inside Airbnb: http://insideairbnb.com/get-the-data.html

WordClouds
- https://vprusso.github.io/blog/2018/natural-language-processing-python-3/
- https://www.datacamp.com/community/tutorials/wordcloud-python

Bar Charts
- http://robertmitchellv.com/blog-bar-chart-annotations-pandas-mpl.html

YellowBrick Visualization
- http://www.scikit-yb.org/en/latest/index.html

Language Detection
- TextBlob:
    - https://www.analyticsvidhya.com/blog/2018/02/natural-language-processing-for-beginners-using-textblob/
    - https://github.com/shubhamjn1/TextBlob/blob/master/Textblob.ipynb
    - https://stackoverflow.com/questions/43485469/apply-textblob-in-for-each-row-of-a-dataframe
    - https://textblob.readthedocs.io/en/dev/quickstart.html
<br>
- Spacy:
    - https://github.com/nickdavidhaynes/spacy-cld
    - https://spacy.io/usage/models
<br>
- Langdetect & LangId:
    - https://pypi.org/project/langdetect/ 
    - https://www.probytes.net/blog/python-language-detection/
    - https://github.com/hb20007/hands-on-nltk-tutorial/blob/master/8-1-The-langdetect-and-langid-Libraries.ipynb

Sentiment Analysis
- *"Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning"* (Paperback) by B. Bengfort, R. Bilbro, T. Ojeda, published by O'Reilly
- Jodie Burchell: http://t-redactyl.io/blog/2017/04/using-vader-to-handle-sentiment-analysis-with-social-media-text.html
- Jodie Burchell: http://t-redactyl.io/blog/2017/01/how-do-we-feel-about-new-years-resolutions-according-to-sentiment-analysis.html
- Jodie Burchell: https://github.com/t-redactyl/Blog-posts/blob/master/2017-04-15-sentiment-analysis-in-vader-and-twitter-api.ipynb
- http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf

- Susan Li: https://towardsdatascience.com/latent-semantic-analysis-sentiment-classification-with-python-5f657346f6a3
- Sakshi Gupta (in R): https://towardsdatascience.com/uncovering-hidden-trends-in-airbnb-reviews-11eb924f2fec
- Dmytro Iakubovskyi: https://towardsdatascience.com/digging-into-airbnb-data-reviews-sentiments-superhosts-and-prices-prediction-part1-6c80ccb26c6a
- Dmytro Iakubovskyi: https://github.com/Dima806/Airbnb_project/blob/master/airbnb_final_analysis_v3.ipynb
- Maurizio Santamicone: https://medium.com/@mauriziosantamicone/seattle-confidential-unpacking-airbnb-reviews-with-sentiment-d421c15d8b8f
- Zhenyu: https://www.kaggle.com/zhenyufan/nlp-for-yelp-reviews/notebook?utm_medium=email&utm_source=intercom&utm_campaign=datanotes-2019

Topic Modeling / LDA
- Analytics Vidhya: https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/
- Susan Li: https://towardsdatascience.com/topic-modelling-in-python-with-nltk-and-gensim-4ef03213cd21
- https://radimrehurek.com/gensim/models/ldamodel.html
- https://www.objectorientedsubject.net/2018/08/experiments-on-topic-modeling-pyldavis/

Miscellaneous
- https://data-viz-for-fun.com/2018/08/airbnb-data-viz/