## NLP Tutorial

NLP - or *Natural Language Processing* - is shorthand for a wide array of techniques designed to help machines learn from text. Natural Language Processing powers everything from chatbots to search engines, and is used in diverse tasks like sentiment analysis and machine translation.

In this tutorial we'll look at this competition's dataset, use a simple technique to process it, build a machine learning model, and submit predictions for a score!

In [None]:
import pandas as pd

In [None]:
train_df = pd.read_csv("./train.csv").fillna('')
test_df = pd.read_csv("./test.csv").fillna('')
full_df = pd.concat([train_df, test_df], ignore_index=True).fillna('')

In [None]:
print(train_df.shape, test_df.shape, full_df.shape)

In [None]:
#TODO: Normalize spelling of twitter words, many informalities (include accented words and garbage characters)
# Leave for now

In [None]:
# Add new column, capturing tags
# parse text for hashtags, and remove '#' symbol in the process
import string


def extract_tags(text):
    # text is lowercased below, so only lowercase letters and digits can match
    unaccented_alnum = string.ascii_lowercase + string.digits
    tags = []
    for word in text.lower().split():
        if word.startswith('#'):
            tag = ''.join([t for t in word[1:] if t in unaccented_alnum])
            tags.append(tag)
    return tags
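
To make the hashtag parsing concrete, here is a small standalone sketch of the same idea run on a made-up tweet. The name `extract_tags_demo` and the sample text are illustrative only. Unlike the cell above, this version also skips tags that end up empty (a bare `#`, for example).

```python
import string

# Characters kept inside a tag: unaccented lowercase letters and digits.
ALLOWED = set(string.ascii_lowercase + string.digits)


def extract_tags_demo(text):
    """Return lowercased hashtags with '#' and non-alphanumerics removed."""
    tags = []
    for word in text.lower().split():
        if word.startswith('#'):
            tag = ''.join(ch for ch in word[1:] if ch in ALLOWED)
            if tag:  # skip words that were only '#' or punctuation
                tags.append(tag)
    return tags


# Made-up example text, not taken from the dataset.
sample = "Forest #Fire near La Ronge, Sask. #Canada! # #évacuation"
print(extract_tags_demo(sample))
```

Note that accented characters are dropped rather than transliterated, so `#évacuation` becomes `vacuation`. That is the kind of case the TODO above about normalizing spelling would need to handle.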

In [None]:
full_df['tags'] = full_df['text'].apply(extract_tags)

In [None]:
full_df.sample(5)

In [None]:
print(full_df.loc[full_df['id'] == 9652].text.values[0])

In [None]:
full_df.loc[full_df['id'] == 9652].values[0]

In [None]:
full_df.loc[full_df['id'] == 9652].tags.values[0]

In [None]:
import tqdm.notebook  # explicit submodule import; a bare `import tqdm` may not expose tqdm.notebook
import spacy
from spacy.lang.en import STOP_WORDS

nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("merge_entities")
# nlp.add_pipe("merge_noun_chunks")
nlp.pipe_names

In [None]:
single_quote_unicode = ord("'")
translation_table_text = str.maketrans(
    {
        '`': single_quote_unicode,
        '‘': single_quote_unicode,
        '’': single_quote_unicode,
        '“': single_quote_unicode,
        '”': single_quote_unicode,
        '-': None,
    }
)
translation_table_token = str.maketrans(
    {
        "'": None,
        '"': None,
        '.': None
    }
)

def sub_token(token):
    token_lowered = token.lower()
    if 'http' in token_lowered:
        return 'url'
    elif '@' in token_lowered:
        return 'usermention'
    elif '&amp;' in token_lowered:
        return 'and'
    elif "ain't" in token_lowered:
        return 'am not'
    elif '\x89û_' in token_lowered:
        return f'{token_lowered[:-3]} ...'
    else:
        return token

def corpus2tokens(corpus_text, *args, **kwargs):
    corpus_text = [' '.join([sub_token(token) for token in text.split()]) for text in corpus_text]
    return [doc2tokens(doc) for doc in nlp.pipe(tqdm.notebook.tqdm(corpus_text), *args, **kwargs)]

def doc2tokens(doc):
    tokens = [token for token in doc if not (token.is_punct or token.is_space)]
    return process_tokens(tokens, doc.ents)

def show_ents(ents):
    for ent in ents:
        print(f'{ent.text} - {ent.start_char} - {ent.end_char} - {ent.label_} - {spacy.explain(ent.label_)}')

def process_tokens(tokens, ents, rm_stopwords=False):
    ent_vals_to_skip = ['#', '\\\\\\']
    ent_labels_to_sub = [
        "DATE", # Absolute or relative dates or periods
        "CARDINAL", # Numerals that do not fall under another type
        "PERCENT", # Percentage, including "%"
        "TIME", # Times smaller than a day
        "MONEY", # Monetary values, including unit
        "ORDINAL", # "first", "second", etc.
    ]
    tokens_processed = []
    # Keep the filtered entities and their lowercased text in parallel lists,
    # so an index into one is valid for the other.
    kept_ents = [ent for ent in ents if ent.text not in ent_vals_to_skip]
    stringed_ents = [ent.text.lower() for ent in kept_ents]
#     print([(ent.text.lower(), ent.label_) for ent in ents])
    ent_tokens = []
    for token in tokens:
        stringed_token = token.text.lower()
        if stringed_token in stringed_ents:
            ent_tokens.append(stringed_token)
            ent_label = kept_ents[stringed_ents.index(stringed_token)].label_
            if ent_label in ent_labels_to_sub:
                tokens_processed.append(ent_label)
                continue
#             stringed_token = ent_label + "|" + stringed_token.translate(translation_table_token)
            stringed_token = stringed_token.translate(translation_table_token)
        if rm_stopwords:
            if stringed_token not in STOP_WORDS:
                tokens_processed.append(stringed_token)
        else:
            tokens_processed.append(stringed_token)
    len_ent_tokens = len(set(ent_tokens))
    len_stringed_ents = len(set(stringed_ents))
    if len_ent_tokens != len_stringed_ents:
        print(f'WARNING: Somehow the number of unique tokens which are ents ({len_ent_tokens}) does not match the total number of unique ents ({len_stringed_ents})')
        diff = list(set(stringed_ents) - set(ent_tokens))
        if diff:
            print(diff, "exist in ents but not in tokens")
        else:
            diff = list(set(ent_tokens) - set(stringed_ents))
            print(diff, "exist in tokens but not in ents")
        print("tokens: ", "\n", tokens, "\n\n")
        print("ents: ", "\n", ents, "\n\n")
    return tokens_processed
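
The per-token substitution step is easier to follow on a single tweet. Below is a simplified sketch covering only the URL, mention, and `&amp;` rules. `normalize_token` is a hypothetical stand-in for `sub_token`, and the tweet is made up.

```python
def normalize_token(token):
    """Hypothetical, simplified stand-in for the notebook's sub_token."""
    lowered = token.lower()
    if 'http' in lowered:
        return 'url'          # collapse every link to one placeholder
    if '@' in lowered:
        return 'usermention'  # collapse every @handle to one placeholder
    if '&amp;' in lowered:
        return 'and'          # undo the HTML-escaped ampersand
    return token


# Made-up example text, not taken from the dataset.
tweet = "@BBCNews Floods &amp; landslides reported http://t.co/abc123 #Nepal"
normalized = ' '.join(normalize_token(tok) for tok in tweet.split())
print(normalized)
```

Collapsing links and mentions to shared placeholders keeps the vocabulary small: thousands of unique URLs would otherwise each appear once and carry no signal.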

In [None]:
%%time

corpus_text_full = [
    item.translate(translation_table_text)
    for item in full_df.text.to_list()
]
corpus_text_tokens_full = corpus2tokens(corpus_text_full)
corpus_tags_full = full_df.tags.to_list()
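
For reference, this is roughly what the character-level translation does to a string before spaCy sees it. The sketch rebuilds the same kind of table with `str.maketrans`; the input sentence is invented for illustration.

```python
# Map assorted quote characters to a plain apostrophe and delete hyphens,
# mirroring the text-level translation table defined earlier.
apostrophe = ord("'")
table = str.maketrans({
    '`': apostrophe,
    '‘': apostrophe,
    '’': apostrophe,
    '“': apostrophe,
    '”': apostrophe,
    '-': None,  # None means "delete this character"
})

raw = "It’s a “so-called” wild-fire"
cleaned = raw.translate(table)
print(cleaned)
```

Deleting hyphens merges compounds such as `wild-fire` into `wildfire`, which helps them match the unhyphenated spelling elsewhere in the corpus, at the cost of also merging words that were joined by a hyphen for other reasons.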

In [None]:
def friendly_tag_corpus(row):
    return [
        f'id:{row.id}',
        *[f'keyword:{k}' for k in [row.keyword] if k],
        *[f'tag:{t}' for t in row.tags if t]
    ]

In [None]:
corpus_tags_friendly_full = full_df[["id", "keyword", "tags"]].apply(friendly_tag_corpus, axis=1).to_list()
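
The prefixes (`id:`, `keyword:`, `tag:`) namespace the tags, so a keyword `fire` and a hashtag `fire` stay distinct. A minimal sketch of the same construction on a few hand-written rows, using a `namedtuple` in place of a DataFrame row (the `Row` type, `friendly_tags` name, and values are assumptions for illustration):

```python
from collections import namedtuple

# Stand-in for a DataFrame row with the three columns used above.
Row = namedtuple('Row', ['id', 'keyword', 'tags'])


def friendly_tags(row):
    """Build namespaced tags, skipping empty keywords and empty hashtags."""
    return [
        f'id:{row.id}',
        *[f'keyword:{k}' for k in [row.keyword] if k],
        *[f'tag:{t}' for t in row.tags if t],
    ]


rows = [
    Row(1, '', ['earthquake']),      # no keyword
    Row(48, 'ablaze', []),           # no hashtags
    Row(50, 'arson', ['', 'fire']),  # one empty hashtag, dropped
]
tagged = [friendly_tags(row) for row in rows]
for tags in tagged:
    print(tags)
```

Every document gets its own `id:` tag, so each one has a unique vector in addition to the vectors shared through keywords and hashtags.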

In [None]:
def build_tag_id_mapping(corpus_tags):
    tags = list(set(tag for tags in corpus_tags for tag in tags))
    return {tag: idx for idx, tag in enumerate(tags)}

In [None]:
tag_id_mapping = build_tag_id_mapping(corpus_tags_friendly_full)
id_tag_mapping = {v: k for k, v in tag_id_mapping.items()}
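
A toy run of the tag-to-integer encoding may help here. This sketch sorts the unique tags before numbering them, which is an assumption that differs from the cell above: iteration order over a set of strings can change between Python sessions, so an unsorted mapping is only valid alongside the model it was built with. The helper name `build_mapping` is hypothetical.

```python
def build_mapping(corpus_tags):
    """Assign each unique tag a stable integer id (sorted for reproducibility)."""
    unique = sorted({tag for tags in corpus_tags for tag in tags})
    return {tag: idx for idx, tag in enumerate(unique)}


corpus_tags = [
    ['id:1', 'tag:fire'],
    ['id:2', 'keyword:arson', 'tag:fire'],
]
tag_to_id = build_mapping(corpus_tags)
id_to_tag = {idx: tag for tag, idx in tag_to_id.items()}

# Encode each document's tags as integers, then decode one back.
encoded = [[tag_to_id[tag] for tag in tags] for tags in corpus_tags]
print(encoded)
print([id_to_tag[idx] for idx in encoded[1]])
```

One consequence worth keeping in mind: id `0` is a valid tag, so later code should not filter tag ids with a truthiness check such as `if tag`.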

In [None]:
corpus_tags_full = [[tag_id_mapping[tag] for tag in tags] for tags in corpus_tags_friendly_full]

In [None]:
# NOTE: spaCy often mislabels #hashtags as MONEY entities
doc_idx = 3243
print('ORIGINAL_TEXT:', corpus_text_full[doc_idx])
print('TOKENS:', corpus_text_tokens_full[doc_idx])
print('KEYWORD:', full_df.iloc[doc_idx].keyword)
print('DOC2VEC TAG IDS:', corpus_tags_full[doc_idx])
print('DOC2VEC TAGS:', corpus_tags_friendly_full[doc_idx])

In [None]:
doc_idx = 21
print('ORIGINAL_TEXT:', corpus_text_full[doc_idx])
print('TOKENS:', corpus_text_tokens_full[doc_idx])
print('KEYWORD:', full_df.iloc[doc_idx].keyword)
print('DOC2VEC TAG IDS:', corpus_tags_full[doc_idx])
print('DOC2VEC TAGS:', corpus_tags_friendly_full[doc_idx])

In [None]:
doc_idx = 2936
print('ORIGINAL_TEXT:', corpus_text_full[doc_idx])
print('TOKENS:', corpus_text_tokens_full[doc_idx])
print('KEYWORD:', full_df.iloc[doc_idx].keyword)
print('DOC2VEC TAG IDS:', corpus_tags_full[doc_idx])
print('DOC2VEC TAGS:', corpus_tags_friendly_full[doc_idx])

In [None]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [None]:
def gen_tagged_docs(corpus_words, corpus_tags):
    return [TaggedDocument(doc_words, doc_tags) for doc_words, doc_tags in zip(corpus_words, corpus_tags)]

In [None]:
corpus_full = gen_tagged_docs(corpus_text_tokens_full, corpus_tags_full)

In [None]:
corpus_full[-1]

In [None]:
# https://groups.google.com/g/gensim/c/6JmSsx4iIv0
# projects with larger vocabularies tend to lean more towards negative-sampling than hierarchical-softmax
# VERY NB - https://stackoverflow.com/a/37502976/1782641
# https://radimrehurek.com/gensim/models/doc2vec.html
model = Doc2Vec(
    epochs=1000,
    workers=3
)

In [None]:
%%time
model.build_vocab(corpus_full)

In [None]:
print(f"Word 'airport' appeared {model.wv.get_vecattr('airport', 'count')} times in the full corpus.")

In [None]:
%%time
model.train(corpus_full, total_examples=model.corpus_count, epochs=model.epochs)

In [None]:
def display_similar_article_and_categories(corpus, doc_id=0, topn=10, by_article_tokens=True, by_article_tag=False):
    doc = corpus[doc_id].words
    print(' '.join(doc)[:200])

    if by_article_tokens:
        # Using words
        print("************")
        print("Get similarity based on tokens:")
        print()
        inferred_vector = model.infer_vector(doc)
        sims = model.dv.most_similar([inferred_vector], topn=topn)
        for idx, factor in sims:
            print(factor, id_tag_mapping[idx])

    if by_article_tag:
        # Using the trained vector stored for the document's first tag (its id tag)
        print("************")
        print("Get similarity based on article tag:")
        print()
        tag_vector = model.dv[corpus[doc_id].tags[0]]
        sims = model.dv.most_similar([tag_vector], topn=topn)
        for idx, factor in sims:
            print(factor, id_tag_mapping[idx])
    
    print("************")
    print("Actual known tags:")
    print()
    # Tag id 0 is valid, so do not filter tags on truthiness.
    print([id_tag_mapping.get(tag) for tag in corpus[doc_id].tags])
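
`most_similar` ranks the stored tag vectors by cosine similarity to the query vector. The plain-Python sketch below reproduces that ranking on toy 2-D vectors. The vectors, tag names, and the helper names `cosine` and `most_similar` are made up for illustration and are not gensim's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, vectors, topn=2):
    """Return the topn (name, similarity) pairs, highest similarity first."""
    scored = [(name, cosine(query, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy vectors standing in for trained tag vectors.
vectors = {
    'tag:fire': [1.0, 0.0],
    'tag:flood': [0.0, 1.0],
    'id:7': [1.0, 1.0],
}
query = [2.0, 0.0]  # same direction as 'tag:fire', different length
ranked = most_similar(query, vectors, topn=2)
print(ranked)
```

Because cosine similarity ignores vector length, the query scores a perfect match with `tag:fire` even though it is twice as long.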

In [None]:
import random
import collections
import matplotlib.pyplot as plt


def rank_by_inferredvector(corpus, sent_ids):
    ranks = []
    for sent_id in sent_ids:
        inferred_vector = model.infer_vector(corpus[sent_id].words)
        sims = model.dv.most_similar([inferred_vector], topn=len(id_tag_mapping))
        ranked_ids = [docid for docid, _ in sims]
        # Tag id 0 is valid, so do not filter tags on truthiness.
        most_similar_tag_indices = [
            ranked_ids.index(tag)
            for tag in corpus[sent_id].tags
        ]
        if most_similar_tag_indices:
            rank = min(most_similar_tag_indices)
            print(f'{sent_id}: Ranked {rank} ({id_tag_mapping[sims[rank][0]]}) out of {len(sims)}')
            ranks.append(rank)
    return ranks

            
def rank_by_random(corpus, sent_ids):
    # randrange excludes the upper bound, so ranks stay within 0..len-1
    return [random.randrange(len(id_tag_mapping)) for _ in sent_ids]


def plot_matches(corpus, rank_func=rank_by_inferredvector, take_sample=True, sample_size=50, sample_seed=42, topn_perc=0.1):
    if take_sample:
        random.seed(sample_seed)
        sent_ids = random.sample(range(0, len(corpus)), sample_size)
    else:
        sent_ids = list(range(len(corpus)))
    ranks = rank_func(corpus, sent_ids)
    counter = collections.Counter(ranks)
    group_0 = []
    group_1 = []
    group_2 = []
    for k, v in counter.items():
        if k == 0:
            group_0.append(v)
        elif k < len(id_tag_mapping) / (100 / topn_perc):
            group_1.append(v)
        else:
            group_2.append(v)
    # Total each group once, after the loop (also defined when there are no ranks)
    sum_0 = sum(group_0)
    sum_1_acceptable = sum(group_1)
    sum_all_else = sum(group_2)
    plt.bar([0,1,2], [sum_0, sum_1_acceptable, sum_all_else])
    print([sum_0, sum_1_acceptable, sum_all_else])
    print('Test example correctly matched (%): ', 100 * sum_0 / sum([sum_0, sum_1_acceptable, sum_all_else]))
    print(f'Test example matched in top {topn_perc}% (%): ', 100 * sum_1_acceptable / sum([sum_0, sum_1_acceptable, sum_all_else]))
    print('Test example badly matched (%): ', 100 * sum_all_else / sum([sum_0, sum_1_acceptable, sum_all_else]))
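
The bucketing inside `plot_matches` can be checked by hand on a short list of ranks. A minimal sketch, assuming the same rule as above (rank 0 is an exact match, ranks below `topn_perc` percent of the tag count are acceptable, everything else is a miss). The function name `bucket_ranks` and the sample ranks are hypothetical.

```python
from collections import Counter


def bucket_ranks(ranks, n_tags, topn_perc=0.1):
    """Count ranks that are exact (0), within the top topn_perc %, or worse."""
    threshold = n_tags * topn_perc / 100
    counts = Counter()
    for rank in ranks:
        if rank == 0:
            counts['exact'] += 1
        elif rank < threshold:
            counts['top'] += 1
        else:
            counts['other'] += 1
    return counts


# Made-up ranks; with 10,000 tags, the top 0.1% means ranks below 10.
ranks = [0, 0, 3, 12, 250, 0, 7]
buckets = bucket_ranks(ranks, n_tags=10_000, topn_perc=0.1)
total = sum(buckets.values())
for name in ('exact', 'top', 'other'):
    print(f'{name}: {buckets[name]} ({100 * buckets[name] / total:.1f}%)')
```

Note that `topn_perc` is already a percentage, so the default of `0.1` means the top 0.1% of tags, not the top 10%.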

In [None]:
# import simplejson

# def json_load(filename):
#     with open(filename, 'r', encoding='utf-8') as f:
#         return simplejson.load(f)

In [None]:
# from gensim.models.doc2vec import TaggedDocument

# def gen_tagged_docs_from_save(corpus):
#     return [TaggedDocument(doc["words"], doc["tags"]) for doc in corpus]

In [None]:
# from gensim.models.doc2vec import Doc2Vec
# model = Doc2Vec.load('./doc2vec.model')

In [None]:
# from gensim.models import KeyedVectors
# wv = KeyedVectors.load('./doc2vec.wv')

In [None]:
# corpus_train_full = json_load('./doc2vec.corpus.full.json')

In [None]:
# corpus_full = gen_tagged_docs_from_save(corpus_train_full)

In [None]:
# tag_id_mapping = json_load('./doc2vec.tag_id_mapping.json')
# id_tag_mapping = {v: k for k, v in tag_id_mapping.items()}

In [None]:
%matplotlib inline
plot_matches(corpus_full)

In [None]:
full_df[full_df.id == 10873]

In [None]:
# tennessee lesbian couple faked hate crime and destroyed own home with arson url lesbian
# ************
# Get similarity based on tokens:

# 0.6994612216949463 id:589
# 0.6994302868843079 tag:lesbian
# 0.6487046480178833 keyword:arson
# 0.6095342040061951 id:10539
# 0.6010091304779053 id:7573
# 0.5964393019676208 id:6186
# 0.5915811061859131 id:10873
# 0.5876895189285278 id:8394
# 0.5861586928367615 id:10562
# 0.5838110446929932 id:10817
# ************
# Get similarity based on article tag:

# 0.9999999403953552 id:589
# 0.999990701675415 tag:lesbian
# 0.4069943130016327 id:8591
# 0.40621358156204224 id:73
# 0.40453746914863586 id:8779
# 0.4021869897842407 id:9698
# 0.4013387858867645 id:567
# 0.3876795172691345 tag:weddinghour
# 0.387553870677948 id:761
# 0.3804241120815277 id:823
# ************
# Actual known tags:

# ['id:589', 'keyword:arson', 'tag:lesbian']

display_similar_article_and_categories(corpus_full, doc_id=409, by_article_tag=True)

In [None]:
model.save("./doc2vec.model")

In [None]:
wv = model.wv
wv.save('./doc2vec.wv')

In [None]:
def corpus_to_dicts(corpus):
    for doc in corpus:
        yield {
            'words': doc.words,
            'tags': doc.tags
        }

In [None]:
import simplejson


def json_save(data, filename):
    with open(filename, 'w', encoding='utf-8') as f:
        simplejson.dump(data, f, separators=(',', ':'), iterable_as_array=True)

In [None]:
json_save(corpus_to_dicts(corpus_full), './doc2vec.corpus.full.json')

In [None]:
json_save(tag_id_mapping, './doc2vec.tag_id_mapping.json')

### A quick look at our data

Let's look at our data... first, an example of what is NOT a disaster tweet.

In [None]:
train_df[train_df["target"] == 0]["text"].values[1]

And one that is:

In [None]:
train_df[train_df["target"] == 1]["text"].values[1]

In [None]:
import gensim

### Building vectors

The trained doc2vec model gives us document vectors to use as features. It can also infer a vector for any input, as long as that input is tokenized the same way as the text used to train the model.

In [None]:
n_train = len(train_df)  # full_df stacks the train rows first, then the test rows
corpus_text_tokens_train = corpus_text_tokens_full[:n_train]
corpus_text_tokens_test = corpus_text_tokens_full[n_train:]
len(corpus_text_tokens_train), len(corpus_text_tokens_test), len(corpus_text_tokens_full)
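
The split relies on `full_df` having been built with the training rows first, so the first `len(train_df)` token lists belong to train and the rest to test. A toy sketch of the same slicing with a sanity check (all names and values here are made up for illustration):

```python
# Stand-ins for the train/test frames and the combined token lists.
train_ids = [1, 4, 5]
test_ids = [0, 2]
full_tokens = [['fire'], ['flood'], ['storm'], ['lol'], ['nice', 'day']]

# The boundary is the number of training rows.
n_train = len(train_ids)
train_tokens = full_tokens[:n_train]
test_tokens = full_tokens[n_train:]

# Sanity checks: nothing lost, and each part lines up with its ids.
assert len(train_tokens) + len(test_tokens) == len(full_tokens)
assert len(train_tokens) == len(train_ids)
assert len(test_tokens) == len(test_ids)
print(len(train_tokens), len(test_tokens))
```

Checks like these are cheap and catch the case where rows were dropped or reordered between building the combined corpus and splitting it.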

In [None]:
train_vectors = [model.infer_vector(doc) for doc in corpus_text_tokens_train]

In [None]:
test_vectors = [model.infer_vector(doc) for doc in corpus_text_tokens_test]

### Our model

Our working assumption is that the words contained in each tweet are a good indicator of whether it is about a real disaster or not. The presence of a particular word (or set of words) in a tweet might link directly to whether or not that tweet describes a real disaster.

That link need not be _linear_. The classifier below is a random forest, which can pick up non-linear interactions between the dimensions of the document vectors. So let's fit one and see!

In [None]:
from sklearn import ensemble

clf = ensemble.RandomForestClassifier()

In [None]:
clf.fit(train_vectors, train_df["target"])
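
The classifier above is fitted on all training rows, so there is no local estimate of how well it does before submitting. One option is to hold out part of the training data and score predictions on it. The sketch below computes an F1 score by hand for 0/1 labels like `target`. The choice of F1 is an assumption here, the helper name `f1_score_binary` is hypothetical, and the labels are made up.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for binary 0/1 labels, treating 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up held-out labels and predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
score = f1_score_binary(y_true, y_pred)
print(f'F1 on held-out rows: {score:.2f}')
```

In practice the held-out predictions would come from a classifier fitted on the remaining training rows only, so the score reflects tweets the model has not seen.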

In [None]:
sample_submission = pd.read_csv("./sample_submission.csv")

In [None]:
sample_submission.head()

In [None]:
sample_submission["target"] = clf.predict(test_vectors)

In [None]:
sample_submission.head()

In [None]:
sample_submission.to_csv("submission.csv", index=False)

Now, in the viewer, you can submit the above file to the competition! Good luck!