### Below is the code which groups the HomeAway's properties into different categories based on customer comments. It uses Latent Dirichlet Allocation (LDA) Model. 

In the LDA model, each document is viewed as a mixture of topics. The model proposes that each word in the document is attributable to one of the document’s topics. LDA gives us the proportion of different words associated with a particular topic. Based on the words associated with the topics, we defined 4 different groups - "*Home Size*", "*Surrounding*", "*Amenities*" and "*Location*". The LDA further discovered the proportion of different groups associated with a property. A property is classified into the group which is associated the most to it. 

The words which were most associated with each of the group (topic) are listed below. As can be seen below, LDA clearly clubbed similar words together into one group (topics) and separated out the distinguishable words into different groups (topics)


**1. Home Size** - family, group, plenty, space, friends, party

**2. Surroundings** - comfortable, quiet, beautiful, barton springs

**3. Amenities** - kitchen, room, bed, towels, TV, parking, pool

**4. Location** - downtown, south congress, 6th street, restaurants, food, bars

In [4]:
import os, csv, lda, nltk
import pandas as pd
import numpy as np
from nltk.tokenize import PunktSentenceTokenizer, RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

from nltk.tokenize import PunktSentenceTokenizer,RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer


reviews_df=pd.read_excel("comments_homeaway_final.xlsx",encoding='utf8', errors='ignore')


#checking for nulls if present any
print("Number of rows with any of the empty columns:")
print(reviews_df.isnull().sum().sum())
reviews_df=reviews_df.dropna() 

property_name = input('provide the column name for property names: ')
property_review = input('provide the column name for property reviews: ')
ntopics= input('Provide the number of latent topics to be estimated: ');


word_tokenizer=RegexpTokenizer(r'\w+')
wordnet_lemmatizer = WordNetLemmatizer()
stopwords_nltk=set(stopwords.words('english'))


def tokenize_text(version_desc):
    lowercase=version_desc.lower()
    text = wordnet_lemmatizer.lemmatize(lowercase)
    tokens = word_tokenizer.tokenize(text)
    return tokens

vec_words = CountVectorizer(tokenizer=tokenize_text,stop_words=stopwords_nltk,decode_error='ignore')
total_features_words = vec_words.fit_transform(reviews_df[property_review])

print(total_features_words.shape)

model = lda.LDA(n_topics=int(ntopics), n_iter=500, random_state=1)
model.fit(total_features_words)

topic_word = model.topic_word_ 
doc_topic=model.doc_topic_
doc_topic=pd.DataFrame(doc_topic)
reviews_df=reviews_df.join(doc_topic)
properties=pd.DataFrame()

for i in range(int(ntopics)):
    topic="topic_"+str(i)
    properties[topic]=reviews_df.groupby([property_name])[i].mean()
    
properties=properties.reset_index()
topics=pd.DataFrame(topic_word)
topics.columns=vec_words.get_feature_names()
topics1=topics.transpose()
topics1.to_excel("topic_word_dist.xlsx")
properties.to_excel("properties_topic_dist.xlsx",index=False)

Number of rows with any of the empty columns:
0
provide the column name for property names: Property Link
provide the column name for property reviews: Comments
Provide the number of latent topics to be estimated: 4


INFO:lda:n_documents: 7795
INFO:lda:vocab_size: 11929
INFO:lda:n_words: 332533
INFO:lda:n_topics: 4
INFO:lda:n_iter: 500


(7795, 11929)


  if sparse and not np.issubdtype(doc_word.dtype, int):
INFO:lda:<0> log likelihood: -2966635
INFO:lda:<10> log likelihood: -2661610
INFO:lda:<20> log likelihood: -2584765
INFO:lda:<30> log likelihood: -2544893
INFO:lda:<40> log likelihood: -2524902
INFO:lda:<50> log likelihood: -2512406
INFO:lda:<60> log likelihood: -2505801
INFO:lda:<70> log likelihood: -2502112
INFO:lda:<80> log likelihood: -2499153
INFO:lda:<90> log likelihood: -2497906
INFO:lda:<100> log likelihood: -2495022
INFO:lda:<110> log likelihood: -2494129
INFO:lda:<120> log likelihood: -2494380
INFO:lda:<130> log likelihood: -2492312
INFO:lda:<140> log likelihood: -2492786
INFO:lda:<150> log likelihood: -2491992
INFO:lda:<160> log likelihood: -2492134
INFO:lda:<170> log likelihood: -2492467
INFO:lda:<180> log likelihood: -2490820
INFO:lda:<190> log likelihood: -2491255
INFO:lda:<200> log likelihood: -2491181
INFO:lda:<210> log likelihood: -2490948
INFO:lda:<220> log likelihood: -2491264
INFO:lda:<230> log likelihood: -249