## NLP Lecture for PPD 559

In this lecture, you will be introduced to using Natural Language Processing (NLP) in urban analytics.

Objectives for this lecture:

1. Understand and use common NLP Python packages.
2. Find and visualize patterns in language topics. 
3. Relate language and topics to the underlying urban landscape.

### What is NLP and how can you use it?

NLP is the ability to process text- or speech-based data with a computer, which makes it possible to work efficiently with large, potentially unruly or unstructured datasets.

In urban analytics, the uses of NLP are boundless! You can now handle large amounts of data coming from plans themselves, online open-response questionnaires, social media postings, transcripts from interviews or meetings, and more. Each of these datasets can illuminate important themes that may be difficult or time-consuming to find by hand.

The NLP processing chain is most often:
1. Preprocess data to make text as uniform as possible.
2. Decide what each "document" should be - whole body, paragraph, sentence, few words, etc.
3. Turn each document into a vector.
4. Utilize various existing tools with vectorized data.
5. Analyze results!
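
The five steps above can be sketched end to end on a couple of made-up strings. This is a minimal illustration in plain Python rather than the workflow used later in the lecture: the toy listings and the helper names `preprocess` and `vectorize` are assumptions for the example, and each listing is treated as one document.

```python
import re
from collections import Counter

# Toy listings invented for illustration
raw_docs = [
    "Sunny 2BR near the PARK!",
    "Quiet studio, near subway.",
]

def preprocess(text):
    # Step 1: lowercase, then replace anything that is not a letter or apostrophe
    return re.sub(r"[^a-z']+", " ", text.lower()).strip()

# Step 2: here each listing is one document, split into word tokens
docs = [preprocess(t).split() for t in raw_docs]

# Step 3: build a shared vocabulary and turn each document into a count vector
vocab = sorted(set(w for d in docs for w in d))

def vectorize(tokens):
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

vectors = [vectorize(d) for d in docs]

# Steps 4-5: the vectors can now feed any numeric tool; here we just inspect them
print(vocab)
for v in vectors:
    print(v)
```

Every vector has one slot per vocabulary word, so most entries are 0 once the vocabulary grows beyond a few documents.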

In [None]:
import re
import string
import nltk 
import gensim
from gensim import corpora
from collections import Counter
from itertools import chain
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import gensim.downloader as api
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

In [None]:
# Let's work with some example data from Zillow 
data = pd.read_csv('../../data/newyork_housing.csv')

### 1. Introduction to nltk

One of the most powerful tools in Python for NLP is the Natural Language Toolkit (nltk) (https://www.nltk.org/). It is rich with processes and easy to use. Often, this package is used for the preprocessing stage, where your text data may undergo any of the following:

#### - forcing lowercase and removing unwanted symbols
Ultimately, you are working with one string composed of different symbols (letters and numbers), so creating uniformity wherever possible is helpful. You want the computer to recognize "T" and "t" as the same symbol. You might also not want your application to care about "*" or "|". It all depends on what you want to pick up on.
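
As a small illustration of what this cleaning does, the sketch below applies lowercasing and a letters-and-apostrophes filter to one invented string. The helper name `clean_text` and the sample text are assumptions for the example.

```python
import re

def clean_text(text):
    # Lowercase first so "T" and "t" become the same symbol
    lowered = str(text).lower()
    # Replace every run of characters other than letters and apostrophes with one space
    return re.sub(r"[^a-z']+", " ", lowered).strip()

sample = "Sunny 2BR/1BA apt!! Won't last"
print(clean_text(sample))
```

Note that this filter also drops digits, so "2BR" becomes "br". If the numbers matter for your question, the pattern would need to keep them.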

In [None]:
#isolate the text column
bodytext = data['description']
#make all letters lowercase
bodytext = bodytext.str.lower()
#remove non alphabetic characters 
bodytext = bodytext.apply(lambda x: re.sub("[^A-Za-z']+", ' ', str(x)))

In [None]:
#view the before
print(data['description'].iloc[0])

In [None]:
#view the after
print(bodytext.iloc[0])

#### - removing "stopwords"

Stopwords are very common, usually insignificant words that you want filtered out before you do any processing.
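
A minimal sketch of stopword filtering on one invented sentence, using a tiny hand-written stopword list as a stand-in for nltk's much longer one (the list and the variable names are assumptions). It also shows that how you filter matters: a list comprehension keeps repeated words, while set subtraction collapses them.

```python
# A tiny hand-written stopword list, standing in for nltk's much longer one
toy_stop_words = {"the", "a", "is", "and", "to", "in"}

tokens = "the kitchen is bright and the bedroom is bright".split()

# A list comprehension keeps word order and repeated words
kept = [w for w in tokens if w not in toy_stop_words]

# Set subtraction also removes stopwords, but collapses repeats,
# which changes any word counts computed afterwards
unique_kept = sorted(set(tokens) - toy_stop_words, key=tokens.index)

print(kept)
print(unique_kept)
```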

In [None]:
#Take a look at some of the "stopwords"
nltk.corpus.stopwords.words('english')[0:10]

In [None]:
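#remove these from each document
#build the stopword set once instead of once per row
stop_words = set(nltk.corpus.stopwords.words('english'))
#split on any whitespace so no empty strings end up in the word lists
bodytext = bodytext.apply(lambda x: x.split())
#keep word order and repeated words; only drop the stopwords
no_stopwords = bodytext.apply(lambda x: [w for w in x if w not in stop_words])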

In [None]:
#now view our sample text without any stopwords 
print(no_stopwords.iloc[0])

#### - stemming or lemmatizing 

Stemming is the process of cutting a word down to its root, which may not be a real word. Lemmatizing is the process of changing a word to its dictionary base form (its lemma). Either step is usually performed to help your model capture variations in how people might represent words. For example, if you wanted to know how often people were talking about change in a system, you would want to capture whenever people say "change", "changing", "changes", or "changed". You can see how this would happen for stemming vs lemmatizing below.

| Stemming | Lemmatizing |
| --- | --- | 
| change $\rightarrow$ chang | change $\rightarrow$ change | 
| changes $\rightarrow$ chang | changes $\rightarrow$ change | 
| changing $\rightarrow$ chang | changing $\rightarrow$ change | 
| changed $\rightarrow$ chang | changed $\rightarrow$ change | 

In [None]:
#stem each word 
#initialize Stemmer
stemmer = nltk.stem.PorterStemmer()
#apply to each word in each document
bodytext_stemmed = bodytext.apply(lambda x: [stemmer.stem(i) for i in x])

In [None]:
#view our sample text after being stemmed
print(bodytext_stemmed.iloc[0])

In [None]:
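#lemmatize each word
#initialize Lemmatizer
wnl = nltk.stem.WordNetLemmatizer()
#apply to each word in each document
#note: lemmatize() treats every word as a noun unless you pass a part of speech,
#so verb forms such as "changing" only reduce to "change" with pos='v'
bodytext_lemm = bodytext.apply(lambda x: [wnl.lemmatize(i) for i in x])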

In [None]:
#view our sample text after being lemmatized
print(bodytext_lemm.iloc[0])

#### NLTK has other powerful accessories!

nltk can help identify the part of speech to isolate nouns, verbs, adjectives, etc. It can also identify groupings of words that most often occur together!

The nltk POS codes are: 

| Code | Part of Speech | Code | Part of Speech |
| --- | --- | --- | --- |
| CC | conjunction, coordinating | PDT | pre-determiner |
| CD | numeral, cardinal | POS | genitive marker |
| DT | determiner | PRP | pronoun, personal |
| EX | existential there | RB | adverb |
| IN | preposition or conjunction, subordinating | RP | particle |
| JJ | adjective or numeral, ordinal | TO | "to" as preposition or infinitive marker |
| JJR | adjective, comparative | UH | interjection |
| JJS | adjective, superlative | VB | verb, base form |
| LS | list item marker | VBD | verb, past tense |
| MD | modal auxiliary | VBG | verb, present participle or gerund |
| NN | noun, common, singular or mass | VBN | verb, past participle |
| NNP | noun, proper, singular | WDT | WH-determiner |
| NNS | noun, common, plural | | |

In [None]:
# Identify the part of speech and isolate adjectives, nouns, etc.
example_sentence = bodytext.iloc[0]
print(nltk.pos_tag(example_sentence)[0])

In [None]:
#look at all of the adjectives for the postings
def keep_pos(x,pos=['JJ','JJS','JJR']):
    tagged = nltk.pos_tag(x)
    words_to_keep = [t[0] for t in tagged if t[1] in pos]
    return words_to_keep

keep_pos(example_sentence, pos=['JJ','JJS','JJR'])

In [None]:
# Identify words that often appear together
number_of_words = 2
ngrams = no_stopwords.apply(lambda x: list(nltk.ngrams(x,number_of_words)))
count = Counter(list(chain.from_iterable(list(ngrams.values))))

In [None]:
count.most_common(15)

In [None]:
#Now your turn
#Identify the most common words across the whole dataset at each stage to see how the list changes
#With the original data, with lowercasing, with removing stopwords, with stemming


### 2. Introduction to TFIDF

Step 2 of the NLP process is determining what your "document" will be. This can be the whole text as one, each sentence individually, or even bi- or tri-grams of words. 
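
To make the choice concrete, the sketch below cuts one invented ad into documents at three granularities. The sample text and the `toy_` variable names are assumptions for illustration only.

```python
import re

ad = "Bright corner unit. Steps from the subway! Pets welcome"

# Option 1: the whole ad is a single document
toy_whole = [ad.lower()]

# Option 2: each sentence is a document (split on ., ! or ?)
toy_sentences = [s.strip() for s in re.split(r"[.!?]", ad.lower()) if s.strip()]

# Option 3: each pair of neighbouring words (bigram) is a document
toy_words = re.sub(r"[^a-z']+", " ", ad.lower()).split()
toy_bigrams = list(zip(toy_words, toy_words[1:]))

print(len(toy_whole), len(toy_sentences), len(toy_bigrams))
print(toy_sentences)
print(toy_bigrams[:3])
```

In this sketch the bigrams are built from the whole ad, so some pairs span a sentence boundary ("unit", "steps"). Building them per sentence instead would avoid that.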

In [None]:
#split by sentence
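def split_by_sent(text, split_criteria=('  ', '.', '!', '?', '\n')):
    #replace every sentence delimiter with a single marker, then split on it
    for x in split_criteria:
        text = str(text).replace(x, '*')
    bodylist = str(text).split('*')
    #trim stray spaces and drop empty or whitespace-only pieces
    bodylist = [w.strip() for w in bodylist if w.strip() != '']
    return bodylist    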
    
sentences = data['description'].str.lower().apply(lambda x: split_by_sent(x))
sentencedf = sentences.explode()
sentencedf = sentencedf[~sentencedf.isna()]
print(sentences.iloc[0])
print('\n', sentencedf.iloc[0])

In [None]:
#split by bigram
bigrams = no_stopwords.apply(lambda x: list(nltk.ngrams(x,2)))
bigramdf = bigrams.explode()
print(bigrams.iloc[0])
print('\n', bigramdf.iloc[0])

One method of performing step 3, turning each document into a vector, is Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF measures how important each word is to each document. 

Term frequency (tf) refers to how often a word occurs in a document; expressed as a share of the document's words, it ranges from 0 to 1. Inverse document frequency (idf) reflects how rare a word is across _all_ of the documents: it is lowest for words that appear in nearly every document (think: and, the, it) and grows as a word appears in fewer documents (think: quire, ulotrichous). The exact range of idf depends on the formula used, so it is not necessarily capped at 1.

The goal is to have a vector for each document that is 1 x n (n being the total number of words in the dataset dictionary) with values describing the tf * idf scores for each word.
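
A small worked example may help. The sketch below computes $\text{tf}(t, d) \times \text{idf}(t)$ by hand for three invented documents. It assumes the smoothed idf, $\ln\frac{1 + n}{1 + \text{df}(t)} + 1$, which is the default variant in scikit-learn's `TfidfTransformer`. Unlike scikit-learn, which uses raw counts and then normalizes each row, this sketch uses tf as a share of the document's words, so the numbers will not match the library output exactly.

```python
import math
from collections import Counter

# Three toy documents, already lowercased and tokenized
toy_docs = [
    ["sunny", "apartment", "near", "park"],
    ["apartment", "near", "subway"],
    ["apartment", "with", "garden"],
]
n_docs = len(toy_docs)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(w for d in toy_docs for w in set(d))

def idf(word):
    # Smoothed idf: a word found in every document gets exactly 1
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def toy_tfidf(doc):
    # Term frequency as a share of the document's words, times idf
    counts = Counter(doc)
    return {w: (c / len(doc)) * idf(w) for w, c in counts.items()}

toy_scores = toy_tfidf(toy_docs[0])
for word, score in sorted(toy_scores.items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {score:.3f}")
```

Words that appear in only one document ("sunny", "park") end up with the highest scores, while "apartment", which appears everywhere, scores lowest.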

In [None]:
#First, we need a vector that shows the counts of each word in each document. Most of it will be 0.
documents = bodytext.apply(lambda x: ' '.join(x))
count_vect = CountVectorizer()
data_counts = count_vect.fit_transform(documents)
#Then, we can create the tf-idf matrix
tfidf_transformer = TfidfTransformer()
data_tfidf = tfidf_transformer.fit_transform(data_counts)
#Inspect the shape of the matrix
print(data_counts.shape)
print(data_tfidf.shape)

In [None]:
#Now with the sentence dataframe
#First, we need a vector that shows the counts of each word in each document. Most of it will be 0.
count_vect_sent = CountVectorizer()
data_counts_sentences = count_vect_sent.fit_transform(sentencedf)
#Then, we can create the tf-idf matrix
tfidf_transformer_sent = TfidfTransformer()
data_tfidf_sentences = tfidf_transformer_sent.fit_transform(data_counts_sentences)
#Inspect the shape of the matrix
print(data_counts_sentences.shape)
print(data_tfidf_sentences.shape)

### 3. Introduction to word2vec

Another method of performing step 3, turning a document into a vector, is through a "word2vec" model, which as you might have guessed, turns words in2 vectors!
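
The idea behind "similar words" is that words used in similar contexts get vectors pointing in similar directions, which is measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors (real word2vec vectors have hundreds of dimensions) and a hypothetical helper `toy_most_similar` to show the ranking logic.

```python
import math

# Made-up 3-dimensional vectors, for illustration only
toy_vectors = {
    "house":     [0.9, 0.1, 0.0],
    "apartment": [0.8, 0.3, 0.1],
    "subway":    [0.1, 0.9, 0.2],
    "banana":    [0.0, 0.1, 0.9],
}

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def toy_most_similar(word, topn=2):
    # Rank every other word by its cosine similarity to the query word
    others = [(w, cosine(toy_vectors[word], vec))
              for w, vec in toy_vectors.items() if w != word]
    return sorted(others, key=lambda pair: pair[1], reverse=True)[:topn]

print(toy_most_similar("house"))
```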


In [None]:
#First, load a model pretrained on Google News articles
wv = api.load('word2vec-google-news-300')

In [None]:
#see how the word "house" is embedded in the vector space
vector = wv['house'] 
vector

In [None]:
#You can see the most similar words in the corpus
wv.most_similar('house', topn=15)

### 4. Topic Modeling

Now that we have our documents represented as a matrix (m documents x n words in dictionary OR m documents x n features in word2vec vector), we want to understand what topics are present in them.

#### Latent Dirichlet Allocation (LDA)

LDA is an unsupervised topic modeling technique. We can use this technique to create clusters, or topics, that commonly occur across all of the documents. Then, we can understand what words describe those topics. Finally, we can trace the topics back to our documents (remember, this can be the full ad or a single sentence) and see what topics appear in each document. There can be more than one topic per document!   
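
The last step, tracing sentence-level topics back to a whole ad, can be sketched without a topic model at all. The example below assumes each sentence has already been assigned a top topic; the tuples are hypothetical stand-ins for model output, not results from the data.

```python
from collections import defaultdict

# Hypothetical model output: (ad id, top topic of the sentence, words in the sentence)
toy_sentence_topics = [
    (0, 2, 8),
    (0, 2, 4),
    (0, 5, 4),
    (1, 5, 10),
]

# Sum the words per ad and topic
words_per_topic = defaultdict(lambda: defaultdict(int))
for ad_id, topic, n_words in toy_sentence_topics:
    words_per_topic[ad_id][topic] += n_words

# Convert the sums to shares of each ad
topic_shares = {}
for ad_id, topic_counts in words_per_topic.items():
    total = sum(topic_counts.values())
    topic_shares[ad_id] = {t: n / total for t, n in topic_counts.items()}

print(topic_shares)
```

Weighting by word count, rather than counting sentences, keeps one long sentence from being outvoted by several very short ones.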

In [None]:
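#create a dictionary 
#build the stopword set once, then drop stopwords while keeping repeated words
lda_stop_words = set(nltk.corpus.stopwords.words('english'))
documents = sentencedf.apply(lambda x: x.split())
documents = documents.apply(lambda x: [w for w in x if w not in lda_stop_words])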
all_text = list(documents)
all_dict = corpora.Dictionary(all_text)
doc_term_matrix = [all_dict.doc2bow(i) for i in all_text]

In [None]:
#choose number of topics and create model
num_topics = 12
ldamodel = gensim.models.ldamodel.LdaModel(corpus=doc_term_matrix,
                             id2word=all_dict,
                             num_topics=num_topics,
                             eval_every=None,
                             passes=1,
                             random_state=0)

#save the top num_words for each topic 
num_words = 15
print_topics = ldamodel.print_topics(num_topics=num_topics, num_words=num_words)


In [None]:
for topic in print_topics:
    print('Topic {}'.format(topic[0]))
    topwords = topic[1].split('"')[1::2]
    print(", ".join(topwords))

In [None]:
#find the most probable topic for each document
doc_top_topics = []
for bow in doc_term_matrix:
    #minimum_probability=0 asks for a score for every topic, so the list is never empty
    topic_probs = ldamodel.get_document_topics(bow, minimum_probability=0)
    top_topic, max_score = max(topic_probs, key=lambda tp: tp[1])
    doc_top_topics.append(top_topic)


sentencedf2 = pd.DataFrame({'adindex': sentencedf.index, 
                            'sentence': sentencedf.values, 
                            'top_topic': doc_top_topics, 
                            'sent_len': documents.apply(len)})

In [None]:
#calculate what percentage of the ad is dedicated to each topic 
import numpy as np
percentages = np.zeros((len(data),num_topics))
#groupby the ad and the topic of the sentence. Sum the number of words per ad per topic
groupeddf = sentencedf2.groupby(['adindex', 'top_topic']).sent_len.sum()
#Put into a matrix
for idx in groupeddf.index:
    percentages[idx] = groupeddf[idx]
percentages = np.transpose(np.transpose(percentages)/percentages.sum(axis=1))

In [None]:
#plot the percentage of the ads dedicated to each topic 
pd.DataFrame(data=percentages, columns = range(num_topics)).boxplot()


#### Similarity to keywords using the word2vec model

We can also use our word2vec model to find how similar our document is to a predetermined keyword or topic! We can do this by testing how similar all of the words within the document are to the keyword.
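
A minimal sketch of that scoring logic, using made-up 2-dimensional vectors in place of the pretrained model (the `toy_wv` dictionary and the function names are assumptions). Words the model does not know are skipped, and a document with no known words gets no score rather than a misleading 0.

```python
import math

# Made-up vectors standing in for a pretrained model's vocabulary
toy_wv = {
    "transit": [0.1, 0.9],
    "subway":  [0.2, 0.8],
    "garden":  [0.9, 0.1],
}

def toy_similarity(u, v):
    # Cosine similarity between two vectors
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_similarity(keyword, tokens):
    # Score only the words the model knows; skip out-of-vocabulary words
    known = [toy_similarity(toy_wv[keyword], toy_wv[w]) for w in tokens if w in toy_wv]
    # Return None when no word could be scored
    return sum(known) / len(known) if known else None

print(mean_similarity("transit", ["subway", "nearby", "garden"]))
print(mean_similarity("transit", ["zzz"]))
```

Averaging over every word dilutes the signal in long documents, so taking the maximum similarity instead of the mean is a reasonable alternative to try.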

In [None]:
#set of words to compare ad words to 
testwords = ['transit'] 

#get similarity scores of each word in each ad to keywords in testwords list
sims = []
for row in no_stopwords.values:
    tmp = []
    for w in row:
        #look at the similarity score between each word and the testwords
        try:
            tmp.append(wv.similarity(testwords[0], w))
        #not all words are in the vocabulary of the wv model; those raise a KeyError
        except KeyError:
            continue
    sims.append(tmp)

In [None]:
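#calculate mean of similarity scores for each keyword / document pair
import seaborn as sns
means = np.zeros((len(sims), 1))
#documents with no scored words get NaN instead of triggering a warning
means[:,0] = [np.mean(x) if len(x) > 0 else np.nan for x in sims]

#plot distributions of keyword similarity scores
plt.close()
plt.rcParams['savefig.dpi'] = 300
fig, ax = plt.subplots()
ax.patch.set_alpha(0)
#sns.distplot is deprecated; histplot with a KDE overlay gives a similar figure
sns.histplot(means[~np.isnan(means[:,0]), 0], kde=True, stat='density', ax=ax)
plt.xlabel('Mean Similarity Score of Words in Body Text')
plt.ylabel('Density')
plt.xlim(0.05,0.2)
plt.tight_layout()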

In [None]:
#Now your turn
#Look at the topics determined by our LDA method when using the whole ad 
#Or try out the similarity scores for another keyword
#Or try to use our tf-idf vectors in another clustering method you know (k-means, dbscan, etc.)



### 5. Putting it Together with Spatial Analysis

Once you have performed your text analysis, often you will end up with quantitative variables which can then be analyzed spatially as with any other data. 

You might now have integer values representing the most prominent topic for each document, the percent of the text dedicated to a word or topic, or even simply the boolean presence of a word or topic. If the documents contain some sort of spatial information (e.g., location of the Zillow ad), you can now perform your spatial analysis!
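
One simple way to treat a text-derived score spatially is to average it over a coarse grid. The sketch below assumes hypothetical records of latitude, longitude, and topic share; the coordinates, the cell size, and the helper `grid_cell` are all invented for illustration.

```python
import math
from collections import defaultdict

# Hypothetical records: (latitude, longitude, share of the ad devoted to one topic)
toy_ads = [
    (40.71, -74.01, 0.8),
    (40.72, -74.02, 0.6),
    (40.86, -73.91, 0.1),
]

def grid_cell(lat, lon, size=0.05):
    # Snap a coordinate to the index of a square cell `size` degrees wide
    return (math.floor(lat / size), math.floor(lon / size))

# Collect the scores that fall into each cell
cell_values = defaultdict(list)
for lat, lon, share in toy_ads:
    cell_values[grid_cell(lat, lon)].append(share)

# Average the scores per cell
cell_means = {cell: sum(v) / len(v) for cell, v in cell_values.items()}
print(cell_means)
```

A grid in degrees is only a rough stand-in for real spatial units. With geopandas, a spatial join to neighborhoods or census tracts would be the more usual choice.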

In [None]:
import folium
import branca.colormap as cm

#add the topic percentages to the original dataframe
data_new = data.join(pd.DataFrame(data=percentages, columns = range(num_topics)).fillna(0)) 
data_new = data_new[~data_new.latitude.isna()]

#create the map
centerlat = (data_new['latitude'].max() + data_new['latitude'].min()) / 2
centerlong = (data_new['longitude'].max() + data_new['longitude'].min()) / 2
center = (centerlat, centerlong)
colormap = cm.LinearColormap(colors=['green', 'yellow', 'red'], vmin=0, vmax=1)
#recent folium releases no longer bundle the Stamen tiles, so use a built-in basemap
map_nyc = folium.Map(location=center, zoom_start=10, tiles='CartoDB positron')

#topic_data1
topic_number1 = 0
for i in range(len(data_new)):
    folium.Circle(
        location=[data_new.iloc[i]['latitude'], data_new.iloc[i]['longitude']],
        radius=10,
        fill=True,
        color=colormap(data_new.iloc[i][topic_number1]),
        fill_opacity=0.2
    ).add_to(map_nyc)

# the following line adds the scale directly to our map
map_nyc.add_child(colormap)

map_nyc

In [None]:
#recent folium releases no longer bundle the Stamen tiles, so use a built-in basemap
map_nyc2 = folium.Map(location=center, zoom_start=10, tiles='CartoDB positron')

#topic_data2
topic_number2 = 10
for i in range(len(data_new)):
    folium.Circle(
        location=[data_new.iloc[i]['latitude'], data_new.iloc[i]['longitude']],
        radius=10,
        fill=True,
        color=colormap(data_new.iloc[i][topic_number2]),
        fill_opacity=0.2
    ).add_to(map_nyc2)

# the following line adds the scale directly to our map
map_nyc2.add_child(colormap)

map_nyc2

### 6. Your Turn

Work through the above examples to identify a pattern of your choosing. 
Separate the data initially and see how your topics vary. 

For example, what LDA topics emerge when you separate on listing price? on number of bedrooms? on square footage?

What keywords can you search on for similarity with the word2vec model? How do the distributions change across the above separations?

### 7. BONUS - Working with Different Languages

In [None]:
#detect the language(s) of your text along with a confidence score
from googletrans import Translator
def detect_lang(text):
    translator = Translator()
    detection = translator.detect(text)
    #return both the detected language code and its confidence
    return detection.lang, detection.confidence

In [None]:
#pass the original string; bodytext now holds lists of words, not text
detect_lang(data['description'].iloc[0])