# Text Mining


For this project, we use the Latent Dirichlet Allocation (LDA) model to uncover hidden structure in the description variable of each dataset. The description variable is visualized in both a word cloud and an intertopic distance map. These graphs help visualize the most frequent terms in the description of each listing.

The Wordclouds, HTML files of the Intertopic Distance Maps and raw data files can all be found in the folders on the main branch.

To run the notebook for a different city, change the dataframe assigned to the corpus variable. For example, corpus = df_tor[descript] would be used for Toronto, whereas corpus = df_van[descript] would be used for Vancouver.

In [1]:
import pandas as pd
import os
import re
import numpy as np
import nltk

from nltk import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from gensim.corpora.dictionary import Dictionary
from gensim.models.tfidfmodel import TfidfModel
from gensim.models.ldamodel import LdaModel

import itertools
from collections import Counter
from collections import defaultdict

import json
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
pyLDAvis.enable_notebook()

First, since the CSV file contains a lot of rows, I separated the description variable into its own dataframe.

In [2]:
path = ('LA_listings.csv') #Data file to load in
df = pd.read_csv(path, header=0, index_col=0)
descript = ['description']
corpus = df[descript] #Dataframe with just the description
corpus.head() #Check the df

id,description
109,"*** Unit upgraded with new bamboo flooring, br..."
2708,"Run Runyon Canyon, Our Gym & Sauna Open Beauti..."
2732,An oasis of tranquility awaits you.The spaceTh...
2864,Centrally located.... Furnished with King Size...
5728,Our home is located near Venice Beach without ...


Now that we have a dataframe containing only the descriptions, we need to clean the text using tokenization and lemmatization. Each description is split into "tokens"; unwanted elements such as punctuation and stopwords are removed, and each remaining token is reduced to its base form (lemma).

In [3]:
def preprocess_text(corpus): #Standardize the text (lowercase, drop punctuation, non-English words and stopwords, lemmatize)
    #Requires the NLTK 'words', 'stopwords' and 'wordnet' data to be downloaded (nltk.download)
    clean_corpus = []
    en_words = set(nltk.corpus.words.words())
    en_stopwords = set(stopwords.words('english'))
    wordnet_lemmatizer = WordNetLemmatizer()
    tokenizer = RegexpTokenizer(r'[\w!]+') #Runs of word characters and '!'; a '|' inside [] would match a literal pipe
    for row in corpus:
        word_tokens = tokenizer.tokenize(row)
        word_tokens_lower = [t.lower() for t in word_tokens]
        #Keep English dictionary words, plus any token that is not purely alphabetic (e.g. numbers)
        word_tokens_lower_en = [t for t in word_tokens_lower if t in en_words or not t.isalpha()]
        word_tokens_no_stops = [t for t in word_tokens_lower_en if t not in en_stopwords]
        word_tokens_no_stops_lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in word_tokens_no_stops]
        clean_corpus.append(word_tokens_no_stops_lemmatized)
    return clean_corpus
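
To make the first steps of the cleaning concrete, here is a minimal standard-library sketch of tokenizing, lowercasing and stopword removal on a single sentence. It is only an illustration: `toy_preprocess`, `TOY_STOPWORDS` and the sample sentence are made up for this example, and the English-dictionary filter and lemmatization used in the notebook are left out.

```python
import re

# Hypothetical, deliberately tiny stopword list; the notebook uses NLTK's English list
TOY_STOPWORDS = {"the", "a", "is", "to", "and", "from"}

def toy_preprocess(text):
    """Tokenize, lowercase and drop stopwords (no dictionary filter, no lemmatization)."""
    # Keep runs of word characters and '!', so other punctuation and spaces split tokens
    tokens = re.findall(r"[\w!]+", text)
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in TOY_STOPWORDS]

sample = "The apartment is a 5 minute walk from the Beach!"
print(toy_preprocess(sample))
# ['apartment', '5', 'minute', 'walk', 'beach!']
```

Note that the exclamation mark stays attached to "beach" because the tokenizer pattern treats '!' as part of a token.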

Now we want to create the document-term matrix of our description corpus. Each row of the document-term matrix is a vector representing one description, with one column for every term in the dictionary. Each entry records how often that term appears in the description. gensim stores each row sparsely, as a list of (term id, count) pairs.

In [4]:
def nlp_model_pipeline(clean_corpus): #Create a dictionary for the cleaned words
    dictionary = Dictionary(clean_corpus)
    doc_term_matrix = [dictionary.doc2bow(listing) for listing in clean_corpus]    
    return dictionary, doc_term_matrix
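
The following standard-library sketch shows roughly what the dictionary and bag-of-words steps produce. `toy_doc2bow` is a hypothetical stand-in written for this illustration, not gensim's implementation, and gensim may assign different ids to the same tokens.

```python
from collections import Counter

def toy_doc2bow(docs):
    """Build a token -> id map and a sparse (token_id, count) list per document."""
    token2id = {}
    for doc in docs:
        for token in doc:
            # Assign the next free id the first time a token is seen
            token2id.setdefault(token, len(token2id))
    bows = [sorted(Counter(token2id[t] for t in doc).items()) for doc in docs]
    return token2id, bows

docs = [["beach", "room", "beach"], ["room", "downtown"]]
token2id, bows = toy_doc2bow(docs)
print(token2id)  # {'beach': 0, 'room': 1, 'downtown': 2}
print(bows)      # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Terms that do not occur in a description are simply absent from its list, which is why the matrix stays small even with a large vocabulary.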

LDA is a topic model: the words in each document (in our case, the description of an Airbnb listing) are grouped into topics. In these examples, I used 3 topics, as that gave the least overlap between topics.

In [5]:
def LDA_topic_modelling(doc_term_matrix, dictionary, num_topics=3, passes=2): #LDA Model
    ldamodel = LdaModel(doc_term_matrix, num_topics=num_topics, id2word=dictionary, passes=passes)
    return ldamodel

def add_topics_to_df(ldamodel, doc_term_matrix, df, new_col, num_topics): #Create new Df with Topics assigned
    docTopicProbMat = ldamodel[doc_term_matrix]
    #Start from 0.0 so the columns are numeric; topics the model leaves out for a document stay at 0
    docTopicProbDf = pd.DataFrame(0.0, index=df.index, columns=range(0, num_topics))
    for i, doc in enumerate(docTopicProbMat):
        for topic in doc: #topic is a (topic id, probability) pair
            docTopicProbDf.iloc[i, topic[0]] = topic[1]
    docTopicProbDf[new_col] = docTopicProbDf.idxmax(axis=1) #Most probable topic per listing
    df_topics = docTopicProbDf[new_col]
    df_new = pd.concat([df, df_topics], axis=1)
    return df_new

corpus_description = corpus['description'].astype(str)
new_corpus = preprocess_text(corpus_description)
dictionary_description, doc_term_matrix_description = nlp_model_pipeline(new_corpus)
ldamodel_description = LDA_topic_modelling(doc_term_matrix_description, dictionary_description, num_topics=3, passes=10)
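
`add_topics_to_df` is defined above but not called in the cells shown. Assuming the per-city topic percentages come from counting each listing's most probable topic, that step can be sketched with the standard library as follows. The function names and the probabilities are made up for illustration.

```python
from collections import Counter

def dominant_topic(topic_probs):
    """Return the topic id with the highest probability for one document."""
    return max(topic_probs, key=lambda pair: pair[1])[0]

def topic_shares(all_topic_probs):
    """Percentage of documents whose dominant topic is each topic id."""
    counts = Counter(dominant_topic(tp) for tp in all_topic_probs)
    total = sum(counts.values())
    return {topic: 100 * n / total for topic, n in sorted(counts.items())}

# Made-up values in the (topic_id, probability) form that gensim returns per document
example = [
    [(0, 0.7), (1, 0.2), (2, 0.1)],
    [(0, 0.1), (1, 0.8), (2, 0.1)],
    [(0, 0.6), (2, 0.4)],
    [(0, 0.5), (1, 0.3), (2, 0.2)],
]
print(topic_shares(example))  # {0: 75.0, 1: 25.0}
```

A topic that is never the most probable one for any document (topic 2 here) does not appear in the result.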

The intertopic distance plots can be found on the main branch. They let us explore the words that were classified into each of the 3 topics.

In [6]:
ch = gensimvis.prepare(ldamodel_description, doc_term_matrix_description, dictionary_description)
pyLDAvis.save_html(ch, 'laLDA.html')


3 Topics

Budget: Contains words such as room, bed, apartment. Basic amenities most homes have.
Location: Contains words such as walking distance, subway, bus, downtown. Focus around travel and being close to the tourist-heavy areas.
Luxury: Contains words such as modern, unique, luxury. High-end features that most homes don’t have.

In addition to the intertopic distance maps, I created a word cloud to show the most frequent words in the descriptions.

In [None]:
from wordcloud import WordCloud
import matplotlib.image as mpimg
import matplotlib.pyplot as plt

canvas_width=1280
canvas_height=720
wordcloud_file = 'LA_wordcloud.png' #Change name based on dataset you're using
long_string = ','.join(list(corpus_description))
wordcloud = WordCloud(width=canvas_width,height=canvas_height,background_color="white", max_words=50000, contour_width=3, contour_color='steelblue')
wordcloud.generate(long_string)
wordcloud.to_file(wordcloud_file)

img = mpimg.imread(wordcloud_file) #Read back the same file that was just saved (file names are case-sensitive on some systems)
imgplot = plt.imshow(img)
x_axis = imgplot.axes.get_xaxis()
x_axis.set_visible(False)

y_axis = imgplot.axes.get_yaxis()
y_axis.set_visible(False)
plt.show()
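
The word cloud above is generated from the raw description text (corpus_description), so the wordcloud library does its own tokenizing and filtering. As a rough cross-check, term frequencies could also be counted directly over cleaned token lists such as new_corpus. The sketch below uses only the standard library; `top_terms` and the toy corpus are invented for illustration.

```python
from collections import Counter
from itertools import chain

def top_terms(token_lists, n=3):
    """Return the n most common tokens across all documents."""
    return Counter(chain.from_iterable(token_lists)).most_common(n)

toy_corpus = [
    ["room", "kitchen", "downtown"],
    ["room", "beach", "kitchen"],
    ["room", "subway"],
]
print(top_terms(toy_corpus, n=2))  # [('room', 3), ('kitchen', 2)]
```

With the real data, the call would look like top_terms(new_corpus, n=20).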

## Results

Common words and phrases across all eight cities center on the city name and the dwelling type. Home, House and Apartment all appear very frequently, and in certain datasets such as Toronto they appear the most. This makes sense, since the description is where the host states what type of dwelling is being listed. Words like kitchen, shower and living room also appear often, which suggests that most Airbnbs have the basic amenities a guest needs for a short-term stay. Another theme in the word clouds is location, with terms such as heart of the city, downtown, subway, walking distance and access. From this, one can conclude that a large portion of Airbnbs are located in major tourist areas of the city or close to transportation that gets people to those areas. This also makes sense: the eight cities in this report are all very large, they all have tourist areas, and the people using Airbnb there are likely not from the area and are traveling for leisure.

Both Chicago and Sydney have the license number prominent in the description, whereas the other cities do not. This is likely due to Airbnb regulations in those two cities that require hosts to include this information in their listing.

Canadian cities tend to have a higher percentage of listings in the Luxury topic than the other cities. For the non-Canadian cities, the Budget and Location groups cover almost the entire dataset, with the exception of LA. When exploring the latitude/longitude coordinates in the EDA stage, LA's properties did not concentrate in one specific cluster, so it makes sense that its Location percentage is not as high. From the intertopic distance maps, we can infer that the majority of Airbnb renters are looking for a budget property near the heart of the city.

Topic Breakdown from LDA Analysis

Toronto
Budget: 51%
Location: 25%
Luxury: 24%

Montreal
Budget: 47%
Location: 30%
Luxury: 23%

Vancouver
Budget: 47%
Location: 33%
Luxury: 20%

Barcelona
Budget: 45%
Location: 41%
Luxury: 14%

LA
Budget: 68%
Location: 18%
Luxury: 14%

Chicago
Budget: 46%
Location: 40%
Luxury: 14%

Stockholm
Budget: 50%
Location: 41%
Luxury: 9%

Sydney
Budget: 48%
Location: 45%
Luxury: 7%
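
As a quick arithmetic check on the comparison between Canadian and other cities, the percentages listed above can be averaged per group. This is a small sketch; the dictionary simply restates the numbers from the breakdown, and `mean_luxury` is a helper written for this example.

```python
# Percentages copied from the breakdown above, as (Budget, Location, Luxury)
breakdown = {
    "Toronto": (51, 25, 24),
    "Montreal": (47, 30, 23),
    "Vancouver": (47, 33, 20),
    "Barcelona": (45, 41, 14),
    "LA": (68, 18, 14),
    "Chicago": (46, 40, 14),
    "Stockholm": (50, 41, 9),
    "Sydney": (48, 45, 7),
}
canadian = {"Toronto", "Montreal", "Vancouver"}
others = set(breakdown) - canadian

def mean_luxury(cities):
    """Average Luxury share (third value) over the given cities."""
    return sum(breakdown[c][2] for c in cities) / len(cities)

print(round(mean_luxury(canadian), 1))  # 22.3
print(round(mean_luxury(others), 1))    # 11.6
```

On these figures, the average Luxury share in the three Canadian cities is roughly twice that of the other five.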

## References

https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0