# Part 3 - Text analysis and ethics

# 3.a Computing PMI

In this assessment you are tasked with discovering strong associations between concepts in Airbnb reviews. The starter code we provide in this notebook is for orientation only. The imports below are enough to implement a valid answer.

### Imports, data loading and helper functions

We first connect our Google Drive, import pandas, numpy and some useful nltk and collections modules, then load the dataframe and define a function for printing the current time, which is useful for logging progress in some of the tasks.

In [1]:
import pandas as pd
import numpy as np

import os
import re
from datetime import datetime
from tqdm import tqdm
from collections import defaultdict,Counter

import nltk
from nltk.tag import pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

tqdm.pandas()


In [2]:
# nltk imports, note that these outputs may be different if you are using colab or local jupyter notebooks
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\c2038737\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\c2038737\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\c2038737\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [3]:
# load stopwords
sw = set(stopwords.words('english'))

In [4]:
p = os.path.curdir
df = pd.read_csv(os.path.join(p,'reviews.csv'))
# deal with empty reviews
df.comments = df.comments.fillna('')

In [5]:
df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,2818,1191,2009-03-30,10952,Lam,Daniel is really cool. The place was nice and ...
1,2818,1771,2009-04-24,12798,Alice,Daniel is the most amazing host! His place is ...
2,2818,1989,2009-05-03,11869,Natalja,We had such a great time in Amsterdam. Daniel ...
3,2818,2797,2009-05-18,14064,Enrique,Very professional operation. Room is very clea...
4,2818,3151,2009-05-25,17977,Sherwin,Daniel is highly recommended. He provided all...


In [6]:
df.shape

(452143, 6)

### 3.a1 - Process reviews

What to implement: A function `process_reviews(df)` that takes the original dataframe as input and returns it with three additional columns: `tokenized`, `tagged` and `lower_tagged`.

In [7]:
def extend_stop_words(stopwords):
    """Add common punctuation marks to the stopwords
    :type stopwords: set
    :rtype: set
    """
    punctuation = [",", ".", "!", ";", "’", "'", ":", "\"", "..", "...", "....", "......", "+", "-", "*"]
    # Build a new set from the argument (rather than the global sw) so the
    # caller's set is left untouched and membership tests stay fast
    return set(stopwords) | set(punctuation)

In [8]:
def process_reviews(df):
    """Process the raw reviews
    :type df: dataframe
    :rtype: dataframe -- with three extra columns tokenized, tagged, lower_tagged
    """
    comments = df["comments"]
    stopwords = extend_stop_words(sw)
    tokenized_list = []
    tagged_list = []
    lower_tagged_list = []
    for comment in comments:
        if not comment: 
            tokenized = "empty"
            tagged = "empty"
            lower_tagged = "empty"
        else:
            # Lower-case the comment
            comment = comment.lower()
            # Tokenize the review
            tokenized = word_tokenize(comment)
            # Tag each token so that nouns can be told apart from adjectives and verbs
            tagged = pos_tag(tokenized)
            # Keep only meaningful words (no stopwords or punctuation) to reduce the vocabulary size
            meaningful_words = []
            for tokenized_word in tokenized:
                if tokenized_word not in stopwords:
                    meaningful_words.append(tokenized_word)
            if not meaningful_words:
                lower_tagged = "empty"
            else:
                lower_tagged = pos_tag(meaningful_words)
        tokenized_list.append(tokenized)
        tagged_list.append(tagged)
        lower_tagged_list.append(lower_tagged)
    df["tokenized"] = tokenized_list
    df["tagged"] = tagged_list
    df["lower_tagged"] = lower_tagged_list
    return df

In [9]:
df = process_reviews(df)

In [10]:
df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,tokenized,tagged,lower_tagged
0,2818,1191,2009-03-30,10952,Lam,Daniel is really cool. The place was nice and ...,"[daniel, is, really, cool, ., the, place, was,...","[(daniel, NN), (is, VBZ), (really, RB), (cool,...","[(daniel, NN), (really, RB), (cool, JJ), (plac..."
1,2818,1771,2009-04-24,12798,Alice,Daniel is the most amazing host! His place is ...,"[daniel, is, the, most, amazing, host, !, his,...","[(daniel, NN), (is, VBZ), (the, DT), (most, RB...","[(daniel, NN), (amazing, VBG), (host, NN), (pl..."
2,2818,1989,2009-05-03,11869,Natalja,We had such a great time in Amsterdam. Daniel ...,"[we, had, such, a, great, time, in, amsterdam,...","[(we, PRP), (had, VBD), (such, JJ), (a, DT), (...","[(great, JJ), (time, NN), (amsterdam, JJ), (da..."
3,2818,2797,2009-05-18,14064,Enrique,Very professional operation. Room is very clea...,"[very, professional, operation, ., room, is, v...","[(very, RB), (professional, JJ), (operation, N...","[(professional, JJ), (operation, NN), (room, N..."
4,2818,3151,2009-05-25,17977,Sherwin,Daniel is highly recommended. He provided all...,"[daniel, is, highly, recommended, ., he, provi...","[(daniel, NN), (is, VBZ), (highly, RB), (recom...","[(daniel, NN), (highly, RB), (recommended, VBD..."
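
The `"empty"` sentinel string means every later step has to filter those rows out first. As a minimal sketch of one alternative, the hypothetical helper `filter_tokens` below returns an empty list for an empty review, so downstream loops simply iterate zero times. It assumes a plain whitespace split as a stand-in for `word_tokenize` and a toy stop-word set, neither of which is what the notebook uses.

```python
def filter_tokens(comment, stop_words):
    """Lower-case, split and drop stop words; an empty review gives an empty list."""
    # str.split() is only a stand-in for nltk's word_tokenize in this sketch
    tokens = comment.lower().split()
    return [tok for tok in tokens if tok not in stop_words]


# Toy stop-word set for illustration, not nltk's English list
stop_words = {"the", "is", "was", "and"}

print(filter_tokens("The place was nice and clean", stop_words))
print(filter_tokens("", stop_words))
```

With this representation the `!= "empty"` filters used later in the notebook would no longer be needed, at the cost of keeping empty lists in the columns.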


### 3.a2 - Create a vocabulary

What to implement: A function `get_vocab(df)` which takes as input the DataFrame generated in step 1.c, and returns two lists, one for the 1,000 most frequent center words (nouns) and one for the 1,000 most frequent context words (either verbs or adjectives). 

In [11]:
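def write_lower_tagged_to_csv(lower_tagged):
    """Write (word, POS initial) pairs to a CSV file to work around memory errors
    :type lower_tagged: series<[(word, pos),(word, pos),...]>
    :rtype: None
    """
    # Skip the export if the file already exists from a previous run
    if not os.path.isfile("./lower_tagged.csv"):
        # The context manager closes the file even if a write fails
        with open("lower_tagged.csv", "w", encoding="utf-8") as f:
            f.write("word,pos\n")
            for list_words_poss in lower_tagged:
                for word, pos in list_words_poss:
                    # Keep only the first letter of the tag (N, V, J, ...)
                    f.write(word+","+pos[0]+"\n")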

In [12]:
# Work around memory errors by streaming the (word, POS) pairs to disk
lower_tagged = df["lower_tagged"][df["lower_tagged"] != "empty"].to_numpy()
write_lower_tagged_to_csv(lower_tagged)

In [28]:
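# Skip malformed rows (e.g. tokens that contain a comma); on pandas >= 1.3 use on_bad_lines="skip" instead
vocab_df = pd.read_csv("./lower_tagged.csv", error_bad_lines=False)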

In [14]:
def get_vocab(df):
    """Get 1000 most frequent nouns and 1000 most frequent verbs or adjectives
    Note: the counts are read from the global vocab_df loaded above, not from df
    :type df: dataframe
    :rtype: tuple<list, list> -- (1000 nouns, 1000 verbs or adjectives)
    """
    # center contains the 1000 most frequent nouns
    center = vocab_df["word"][vocab_df.pos == "N"].value_counts().index[0:1000].to_list()
    # context contains the 1000 most frequent verbs or adjectives
    temp = vocab_df["word"][(vocab_df.pos == "J") | (vocab_df.pos == "V")].value_counts()
    context = temp.index[0:1000].to_list()
    return center, context

In [15]:
cent_vocab, cont_vocab = get_vocab(df)
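
The vocabulary can also be counted in a single streaming pass, without the CSV round-trip. The following is a minimal sketch, not the approach used in this notebook: the function name `top_words_by_pos` and the toy input are hypothetical, and it assumes each review is a list of `(word, pos)` tuples like the `lower_tagged` column.

```python
from collections import Counter


def top_words_by_pos(tagged_reviews, k):
    """Return the k most frequent nouns and the k most frequent verbs/adjectives."""
    noun_counts, context_counts = Counter(), Counter()
    for review in tagged_reviews:
        for word, pos in review:
            if pos.startswith("N"):            # NN, NNS, NNP, ...
                noun_counts[word] += 1
            elif pos.startswith(("V", "J")):   # VB*, JJ*
                context_counts[word] += 1
    center = [word for word, _ in noun_counts.most_common(k)]
    context = [word for word, _ in context_counts.most_common(k)]
    return center, context


# Toy tagged reviews for illustration only
toy_reviews = [
    [("room", "NN"), ("clean", "JJ"), ("host", "NN")],
    [("host", "NN"), ("recommended", "VBD"), ("clean", "JJ")],
]
center, context = top_words_by_pos(toy_reviews, k=1)
print(center, context)
```

Two `Counter` objects hold one integer per distinct word, so memory grows with the vocabulary size rather than with the number of tokens.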

### 3.a3 Count co-occurrences between center and context words

What to implement: A function `get_coocs(df, center_vocab, context_vocab)` which takes as input the DataFrame generated in step 1, and the lists generated in step 2 and returns a dictionary of dictionaries, of the form in the example above. It is up to you how you define context (full review? per sentence? a sliding window of fixed size?), and how to deal with exceptional cases (center words occurring more than once, center and context words being part of your vocabulary because they are frequent both as a noun and as a verb, etc). Use comments in your code to justify your approach. 

In [16]:
def initiate_coocs_between_center_and_context(cent_vocab, cont_vocab):
    """initiate co-occurrences between center and context words
    :type cent_vocab: list -- 1000 center words
    :type cont_vocab: list -- 1000 context words
    :rtype: dict<dict> -- keys in outer dict are center words, 
    which contains an inner dict whose keys are all context words 
    and values are initial co-occurrences (0) between center and context words
    """
    inner_dict = {}
    for context_word in cont_vocab:
        # initiate co-occurrences between center and context word
        inner_dict[context_word] = 0
    outer_dict = {}
    for center_word in cent_vocab:
        # initiate co-occurrences between center word and context words
        outer_dict[center_word] = inner_dict.copy()
    return outer_dict

In [17]:
def split_into_noun_or_av_list(lower_tagged):
    """Split lower_tagged into noun list and adj or verb list
    :type lower_tagged: list<tuple> -- [(word, pos), (word, pos), ...]
    :rtype: tuple<list, list> -- noun_list, av_list
    """
    noun_list = []
    av_list = []
    for word, pos in lower_tagged:
        if pos[0] == "N":
            noun_list.append(word)
        if pos[0] == "V" or pos[0] == "J":
            av_list.append(word)
    return noun_list, av_list

In [18]:
def get_coocs(df, cent_vocab, cont_vocab):
    """Count co-occurrences between center and context words
    :type df: dataframe
    :type cent_vocab: list -- 1000 center words
    :type cont_vocab: list -- 1000 context words
    :rtype: dict<dict> -- keys in outer dict are center words, 
    which contains an inner dict whose keys are all context words 
    and values are co-occurrences between center and context words
    """
    # Pre-filling every (center, context) pair with 0 lets the inner loop
    # index the dictionary directly instead of testing vocabulary membership
    coocs = initiate_coocs_between_center_and_context(cent_vocab, cont_vocab)
    lower_taggeds = df["lower_tagged"][df["lower_tagged"] != "empty"].to_numpy()
    for lower_tagged in lower_taggeds:
        # Split by POS tag so that a word that is frequent both as a noun and
        # as a verb counts as a center word only when tagged as a noun, and
        # as a context word only when tagged as a verb or adjective
        noun_list, av_list = split_into_noun_or_av_list(lower_tagged)
        # Use a set so that a center word occurring several times in a review
        # is counted only once for that review
        noun_set = set(noun_list)
        # I treat the whole review as the context: I assume the words of the
        # whole review reveal the meaning of each center word, as well as the
        # relationships between center words
        for noun_word in noun_set:
            for av_word in av_list:
                try:
                    coocs[noun_word][av_word] += 1
                except KeyError:
                    # The pair is outside the two vocabularies, so skip it
                    pass
    return coocs

In [19]:
coocs = get_coocs(df, cent_vocab, cont_vocab)
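
The task statement also mentions a sliding window of fixed size as a possible context definition. The sketch below shows what that could look like for a single review. It is an illustration under stated assumptions, not the method used above: the name `window_coocs`, the `window` parameter and the toy review are hypothetical.

```python
from collections import defaultdict


def window_coocs(tagged_review, center_vocab, context_vocab, window=2):
    """Count context words within `window` tokens on either side of each center noun."""
    center_vocab, context_vocab = set(center_vocab), set(context_vocab)
    counts = defaultdict(lambda: defaultdict(int))
    for i, (word, pos) in enumerate(tagged_review):
        # Only nouns from the center vocabulary act as center words
        if word not in center_vocab or not pos.startswith("N"):
            continue
        start = max(0, i - window)
        stop = min(len(tagged_review), i + window + 1)
        for j in range(start, stop):
            if j == i:
                continue
            ctx_word, ctx_pos = tagged_review[j]
            # Only verbs and adjectives from the context vocabulary are counted
            if ctx_word in context_vocab and ctx_pos.startswith(("V", "J")):
                counts[word][ctx_word] += 1
    return counts


# Toy tagged review for illustration only
review = [("great", "JJ"), ("coffee", "NN"), ("machine", "NN"),
          ("provided", "VBN"), ("tiny", "JJ")]
counts = window_coocs(review, ["coffee", "machine"],
                      ["great", "provided", "tiny"], window=1)
print({center: dict(ctx) for center, ctx in counts.items()})
```

A small window ties each context word to nearby nouns only, whereas the whole-review context links every verb or adjective to every noun in the review and therefore yields denser counts.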

### 3.a4 Convert co-occurrence dictionary to 1000x1000 dataframe

What to implement: A function called `cooc_dict2df(cooc_dict)`, which takes as input the dictionary of dictionaries generated in step 3 and returns a DataFrame where each row corresponds to one center word, each column corresponds to one context word, and cells are their corresponding co-occurrence value. Some (x,y) pairs will never co-occur; you should have a 0 value for those cases.

In [20]:
def cooc_dict2df(coocs):
    """Returns a DataFrame where each row corresponds to one center word,
    each column corresponds to one context word,
    and cells are their corresponding co-occurrence value.
    """
    return pd.DataFrame(coocs).T

In [21]:
coocdf = cooc_dict2df(coocs)
coocdf.shape

(1000, 1000)
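
Because the dictionary above is pre-filled with zeros, the conversion needs no extra handling for pairs that never co-occur. If the dictionary only stored observed pairs, the missing cells would come out as NaN and would have to be filled. A minimal sketch of that case, using a made-up sparse dictionary (`sparse_coocs` is a hypothetical name):

```python
import pandas as pd

# Toy sparse dictionary: only observed (center, context) pairs are stored
sparse_coocs = {
    "coffee": {"fresh": 3, "provided": 1},
    "dog": {"friendly": 2},
}

# orient="index" makes the outer keys the rows; unseen pairs become NaN,
# so they are replaced by 0 and the counts are cast back to integers
toy_df = pd.DataFrame.from_dict(sparse_coocs, orient="index").fillna(0).astype(int)
print(toy_df)
```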

### 3.a5 Raw co-occurrences to PMI scores

What to implement: A function `cooc2pmi(df)` that takes as input the DataFrame generated in step 4, and returns a new DataFrame with the same rows and columns, but with PMI scores instead of raw co-occurrence counts. 

In [22]:
def cooc2pmi(df):
    """Returns a DataFrame where each cell contains a PMI value
    :type df: dataframe
    :rtype: DataFrame 
    """
    # row sums: total co-occurrences of each center word
    center_sums = df.sum(axis=1)
    # column sums: total co-occurrences of each context word
    context_sums = df.sum(axis=0)
    # total number of co-occurrences
    total = center_sums.sum() 
    # joint probability p(x,y)
    center_context_probability = df / total
    # center probability p(x)
    center_probability = np.array(center_sums / total)[None] # "None" makes the shape (1,1000) 
    # context probability p(y)
    context_probability = np.array(context_sums / total)[None]
    # PMI(x,y) = log(p(x,y) / (p(x)p(y)))
    # "10**(-6)" avoids division by zero; note that it also lowers PMI for rare pairs
    px_py = np.dot(center_probability.T, context_probability) + 10**(-6)
    # cells with no co-occurrence give log(0) = -inf, which is clipped to 0
    # together with all other negative values
    pmidf = np.maximum(np.log(center_context_probability / px_py), 0)
    return pmidf

In [23]:
pmidf = cooc2pmi(coocdf)
pmidf.shape


(1000, 1000)
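
To make the formula concrete, here is a small worked example of positive PMI, PMI(x,y) = log(p(x,y) / (p(x)p(y))) clipped at 0, on a made-up 2x2 count table. It is a sketch in plain Python with invented counts. Unlike `cooc2pmi` above it uses no smoothing constant, so its values are not directly comparable with the notebook's.

```python
import math

# Toy co-occurrence counts (center word -> context word -> count)
counts = {
    "coffee": {"fresh": 8, "friendly": 2},
    "dog": {"fresh": 1, "friendly": 9},
}

total = sum(sum(row.values()) for row in counts.values())
row_sum = {x: sum(row.values()) for x, row in counts.items()}
col_sum = {}
for row in counts.values():
    for y, c in row.items():
        col_sum[y] = col_sum.get(y, 0) + c


def ppmi(x, y):
    """Positive PMI of center word x and context word y."""
    c = counts[x][y]
    if c == 0:
        return 0.0  # log(0) is undefined, so treat unseen pairs as 0
    p_xy = c / total
    p_x = row_sum[x] / total
    p_y = col_sum[y] / total
    return max(math.log(p_xy / (p_x * p_y)), 0.0)


print(round(ppmi("coffee", "fresh"), 4))
print(ppmi("coffee", "friendly"))
```

Here p(coffee, fresh) = 8/20, p(coffee) = 10/20 and p(fresh) = 9/20, so the ratio is 16/9 and the PMI is positive. The pair (coffee, friendly) occurs less often than chance would predict, so its PMI is negative and is clipped to 0.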

### 3.a6 Retrieve top-k context words, given a center word

What to implement: A function `topk(df, center_word, N=10)` that takes as input: (1) the DataFrame generated in step 5, (2) a `center_word` (a string like `'towels'`), and (3) an optional named argument called `N` with a default value of 10; and returns a list of `N` strings, in order of their PMI score with the `center_word`. You do not need to handle cases for which the word `center_word` is not found in `df`.

In [24]:
def topk(df, center_word, N=10):
    """Returns a list of N strings, in ascending order of their PMI score
    with the center_word (the strongest association comes last)
    :type df: dataframe
    :type center_word: str
    :type N: int
    :rtype: list 
    """
    # Use the df argument rather than the global pmidf
    return df.loc[center_word].sort_values()[-N:].index.to_list()

In [25]:
topk(pmidf, 'coffee')

['wine',
 'cheese',
 'provided',
 'including',
 'fridge',
 'fresh',
 'microwave',
 'nespresso',
 'kettle',
 'tea']

In [26]:
topk(pmidf, 'dog')

['given',
 'loft',
 'table',
 'expect',
 'advertised',
 'good',
 'nice',
 'adorable',
 'cute',
 'friendly']
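
The `topk` above slices the tail of an ascending sort, so the strongest association appears last in each list. If a strongest-first order is preferred, one option is `heapq.nlargest`, sketched here on a plain dictionary of made-up scores (`topk_from_scores` and the placeholder words are hypothetical):

```python
import heapq


def topk_from_scores(scores, n=10):
    """Return the n keys with the highest scores, strongest first."""
    return heapq.nlargest(n, scores, key=scores.get)


# Made-up PMI scores for one center word, for illustration only
toy_scores = {"word_a": 2.1, "word_b": 1.9, "word_c": 0.0, "word_d": 1.7}
print(topk_from_scores(toy_scores, n=2))
```

For a pandas row the same ordering can be obtained with `Series.nlargest(N)`, which also returns the values in descending order.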