# Week 1: Introduction and Overview
This notebook accompanies the week 1 lecture.

In [None]:
# setup
import sys
import subprocess
import pkg_resources
import pandas as pd
from collections import Counter
import re


required = {'spacy', 'scikit-learn', 'spacy-transformers'}
installed = {pkg.key for pkg in pkg_resources.working_set}
missing = required - installed

if missing:
    python = sys.executable
    subprocess.check_call([python, '-m', 'pip', 'install', *missing], stdout=subprocess.DEVNULL)

import spacy
import transformers
from sklearn.feature_extraction.text import CountVectorizer

## Natural Language Tool Kit vs SpaCy
For this course, we will mainly be using [SpaCy]().  There are a [number of other]() NLP libraries, and probably one of the best known is the [Natural Language Tool Kit (NLTK)]().  SpaCy and NLTK are both very powerful, but here I'll show a couple of reasons why I prefer SpaCy.

Note: You will not be able to run these on your own without installing NLTK.  Since it's not used in the rest of the course, I'm not configuring it here.

In [None]:
# spacy
from spacy.lang.en import English
en = English()
text = 'We are doing NLP.'
doc = en(text)
print(type(doc))
print([(x, type(x)) for x in doc])

In [None]:
# nltk
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
doc = word_tokenize(text)
print(type(doc))
print([(x, type(x)) for x in doc])

In [None]:
# time comparison
%timeit en(text)
%timeit nltk.tokenize.casual_tokenize(text)

In [None]:
# preprocessing: lower case and removing non-alpha
text = 'We are doing NLP.'
# spacy
%timeit  [x.lower_ for x in en(text) if x.is_alpha]
# nltk (one possible way)
%timeit nltk.RegexpTokenizer(r'\w+').tokenize(text.lower())

You can see both libraries are very powerful.  But SpaCy's syntax is a bit simpler and it's generally a bit faster.  Have you tried NLTK or other libraries? Let's discuss.

### SpaCy's language models
We'll cover this more extensively in the slides.  But here's an illustration of what's going on under the hood with SpaCy.

In [None]:
# taking a look at the different languages supported by spacy
#from spacy.lang import 

In [None]:
# what's in a language model
from spacy.lang.en import English
en = English()
print(en.tokenizer)
print(en.pipe_names)

In [None]:
# read in English model with tagging/entity pipeline components
# you will need to run the line below beforehand
#!python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')
print(nlp.pipe_names)

In [None]:
# turning text into a spacy document
doc = en('We are doing NLP')
print(doc)
print(type(doc))
doc_attrs = set(dir(doc))
print(doc_attrs)
# spans: subsets of doc
print(doc[:2])
print(type(doc[:2]))
span_attrs = set(dir(doc[:2]))
print(span_attrs - doc_attrs)
# tokens: individual units of doc
print(doc[0])
print(type(doc[0]))
print(dir(doc[0]))

In [None]:
# internally, spacy represents tokens as hash values
print(doc[0].lower)
# the vocab maps the hash back to its string
print(en.vocab[doc[0].lower].text)
# but you probably won't need that often
print(doc[0].lower_)
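
As a rough illustration of the idea of keeping strings behind integer hashes, here is a standard-library sketch.  `ToyStringStore` is a hypothetical class made up for this example, and the BLAKE2 hash is only a convenient stand-in, not the hash function spaCy uses.

```python
import hashlib


class ToyStringStore:
    """Toy two-way mapping between strings and 64-bit integer hashes."""

    def __init__(self):
        self._by_hash = {}

    def add(self, string):
        # derive a 64-bit integer from the string (stand-in hash, an assumption)
        digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
        key = int.from_bytes(digest, "little")
        self._by_hash[key] = string
        return key

    def __getitem__(self, key):
        # integer in -> string out; string in -> integer out
        if isinstance(key, int):
            return self._by_hash[key]
        return self.add(key)


store = ToyStringStore()
we_hash = store.add("we")
print(we_hash, store[we_hash])
```

Storing each distinct string once and passing integers around is one way to keep token objects small, which is the general motivation for this kind of design.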

## Tokenization
This section shows some of the considerations to make when tokenizing your data.

Token = "Useful semantic unit"

But what does that mean? The cells below work through a few concrete cases.

In [None]:
# importing different languages in spacy
# blank English model
from spacy.lang.en import English
en = English()
# blank Chinese model
# to run, will need to install jieba tokenizer (optional)
#!pip install jieba
from spacy.lang.zh import Chinese

zh = Chinese()
zh_text = '我们正在做NLP。'
print('Tokenize in Chinese:', [x.text for x in zh(zh_text)])
print('Tokenize in English:', [x.text for x in en(zh_text)])

In [None]:
# interesting tidbit: the base Chinese model uses the jieba tokenizer
zh.tokenizer
# any other languages that might require a custom tokenization strategy?

In [None]:
# lowercasing
text = 'We are doing NLP.'
print('Base python: ', text.lower())
print('SpaCy:', [x.lower_ for x in en(text)])

In [None]:
# handling non-alpha
text = 'We are doing NLP.'
# base python
strip_punct = '[^A-Za-z0-9 ]'
print(re.sub(strip_punct, '', text))
# spacy
print([x.text for x in en(text) if x.is_alpha])

In [None]:
# but what about contractions?
text = "We're doing NLP."
# base python
strip_punct = '[^A-Za-z0-9 ]'
print('Just removing punctuation:', re.sub(strip_punct, '', text))
# spacy
print('Removing non-alpha', [x.text for x in en(text) if x.is_alpha])

You can see here that the is_alpha flag is False for any tokens that have non-alpha characters.  We'll look into a better way for dealing with contractions later.
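
One possible way to keep the information in a contraction, sketched with the standard library only: split off the clitic (`'re`, `'m`, ...) as its own token and map it to a full word.  The pattern `TOKEN_RE`, the table `CLITIC_MAP` and the function name are all assumptions made for this illustration; this is not how spaCy handles contractions.

```python
import re

# words, clitics such as 're kept as their own tokens, or single punctuation marks
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|'[A-Za-z]+|[^\sA-Za-z0-9]")

# small hypothetical table of clitics and the words they stand for
CLITIC_MAP = {"'re": "are", "'m": "am", "'ll": "will", "'ve": "have"}


def normalize_with_clitics(text):
    """Lowercase, expand known clitics, drop punctuation and unknown clitics."""
    normalized = []
    for token in TOKEN_RE.findall(text):
        lowered = token.lower()
        if lowered in CLITIC_MAP:
            normalized.append(CLITIC_MAP[lowered])
        elif lowered.isalnum():
            normalized.append(lowered)
    return normalized


print(normalize_with_clitics("We're doing NLP."))
```

The table deliberately leaves out `'s` (possessive or "is"?) and does not handle `n't`, so "didn't" would come out as "didn".  Those cases are part of why a rule table like this only goes so far.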

### Exercise: Create a tokenizer
In this exercise, you will make a function that uses spaCy's base English model to tokenize a dataset according to specific parameters.  The function will take a list of documents and output a list of tokens.  In this case we're interested in outputting strings, rather than spaCy tokens.

In [None]:
# data
text_data = ["I'm taking a course at Harvard.",
            "I'm learning about Natural Language Processing.",
            "We are studying tokenization, vectorization and modelling.",
            "Check out the course on Github: https://github.com/bpben/nlp_lessons"]

### Lemmatization and Stemming
Though word tense can carry a lot of useful information, it is often useful to reduce words to their common root.  For example, the word "be" has various forms like "are", "is", "been".  We might not want our vocabulary to contain all these forms and would rather count them all as instances of "be".
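
To make the effect on the vocabulary concrete, here is a minimal lookup-table sketch using only the standard library.  The table `BE_FORMS` and the function `lookup_lemma` are hypothetical names for this example, and a plain dictionary lookup is a simplification, not a description of spaCy's lemmatizer.

```python
from collections import Counter

# hypothetical lookup table covering only forms of "be"
BE_FORMS = {"am": "be", "are": "be", "is": "be",
            "was": "be", "were": "be", "been": "be"}


def lookup_lemma(token):
    """Return the lemma from the table, or the lowercased token if unknown."""
    lowered = token.lower()
    return BE_FORMS.get(lowered, lowered)


tokens = "I am here , you are there , she is away".split()
words = [t for t in tokens if t.isalpha()]

counts_raw = Counter(w.lower() for w in words)
counts_lemma = Counter(lookup_lemma(w) for w in words)
print("Raw vocabulary size:", len(counts_raw))
print("Lemmatized vocabulary size:", len(counts_lemma))
print("Count of 'be':", counts_lemma["be"])
```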

In [None]:
# read in English model with tagging/entity pipeline components
# you will need to run the line below beforehand
#!python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')
text = 'I am taking an NLP course.'
print(text)
print([x.lemma_ for x in nlp(text)])

In [None]:
# guess how lemmatization might change this
text = 'We are recording this class with Zoom.'
guess = ''
print(text)
print([x.lemma_ for x in nlp(text)])

### Stop words
Dealing with stop words involves making some pretty impactful decisions with your data.  Refer to the slides for details.  Here, we just remove stop words based on [spaCy's default set](https://github.com/explosion/spaCy/blob/master/spacy/lang/en/stop_words.py).
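
As a small illustration of why this choice matters, the sketch below removes stop words using a tiny, made-up stop list (`TOY_STOP_WORDS` is an assumption for this example, not spaCy's set).  Because the toy list contains "not", two reviews with opposite meanings end up with identical tokens.

```python
# tiny hypothetical stop list; real lists are much longer
TOY_STOP_WORDS = {"the", "these", "a", "is", "are", "not"}


def remove_stop_words(tokens, stop_words=TOY_STOP_WORDS):
    """Keep only tokens whose lowercased form is not in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]


positive = remove_stop_words("These hot dogs are good".split())
negative = remove_stop_words("These hot dogs are not good".split())
print(positive)
print(negative)
print("Same tokens after filtering:", positive == negative)
```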

In [None]:
en = English()
text = 'In June 2020, I took a course at Harvard Extension School in Cambridge.'
print(text)
print([x.text for x in en(text) if not x.is_stop])

### Non-standard tokens (e.g. named-entities)
In text, some n-grams should not be treated as a concatenation of unigrams.  For example, New York City is fundamentally different from the individual words "new", "york" and "city".

Here we attempt to deal with some of these non-standard tokens.
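
One simple way to treat a known multi-word expression as a single token is to merge it after tokenization.  The sketch below does a greedy longest-match merge against a hand-written phrase list; the function `merge_phrases` and the phrase set are assumptions for illustration, and a real pipeline would more likely rely on a named-entity recognizer than on a fixed list.

```python
def merge_phrases(tokens, phrases):
    """Greedily merge the longest known phrase starting at each position.

    `phrases` is a set of lowercase token tuples, e.g. {("new", "york")}.
    """
    max_len = max((len(p) for p in phrases), default=0)
    merged = []
    i = 0
    while i < len(tokens):
        # try the longest candidate first, down to two tokens
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in phrases:
                merged.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # no phrase starts here, keep the single token
            merged.append(tokens[i])
            i += 1
    return merged


known_phrases = {("new", "york", "city"), ("new", "york")}
print(merge_phrases("I moved to New York City last year".split(), known_phrases))
```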

In [None]:
# urls
# base python
# regex from textacy: https://github.com/chartbeat-labs/textacy
SHORT_URL_REGEX = re.compile(
    r"(?:^|(?<![\w/.]))"
    # optional scheme
    r"(?:(?:https?://)?)"
    # domain
    r"(?:\w-?)*?\w+(?:\.[a-z]{2,12}){1,3}"
    r"/+",
    flags=re.IGNORECASE)
text = 'Check out these courses: https://www.summer.harvard.edu/'
print(text)
print(SHORT_URL_REGEX.sub('', text))
# spacy
print([x for x in en(text) if not x.like_url])
# spacy - replace with a standard token
print(['-URL-' if x.like_url else x for x in en(text)])

In [None]:
# named-entities
# read in English model with tagging/entity pipeline components
nlp = spacy.load('en_core_web_sm')
text = 'I am taking an NLP course at Harvard starting July 19th, 2020'
parsed = nlp(text)
# look at the individual tokens
tokens = [t for t in parsed]
print(tokens)
# look at the identified named-entities and their types
for e in parsed.ents:
    print(e, type(e), e.label_, spacy.explain(e.label_))

### Exercise: A comprehensive tokenization pipeline

In [None]:
# data
text_data = ["I'm taking a course at Harvard.",
            "I'm learning about Natural Language Processing.",
            "We are studying tokenization, vectorization and modelling.",
            "Check out the course on Github: https://github.com/bpben/nlp_lessons"]

## Word counts
A very basic way to use a sanitized list of tokens is to do a word count.  This unlocks a lot of insights right off and is an important step in exploratory data analysis in text.

In [None]:
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from spacy.lang.en import English
en = English()

def simple_tokenizer(doc, model=en):
    # a simple tokenizer for individual documents (different from above)
    parsed = model(doc)
    # keep lowercased alphabetic tokens that do not look like URLs
    return [t.lower_ for t in parsed if t.is_alpha and not t.like_url]

In [None]:
# data
text_data = ["I'm taking a course at Harvard.",
            "I'm learning about Natural Language Processing.",
            "We are studying tokenization, vectorization and modelling.",
            "Check out the course on Github: https://github.com/bpben/nlp_lessons"]
tokenized = [simple_tokenizer(doc) for doc in text_data]

In [None]:
# base python: create and make use of a Counter object
counts = [Counter(d) for d in tokenized]
print('List of counts:', counts)
# sum together all counts
all_counts = Counter()
for d in tokenized:
    all_counts += Counter(d)
print('\nCombined count:', all_counts)

In [None]:
# scikit-learn's countvectorizer
CountVectorizer()

In [None]:
# use our custom tokenizer
cv = CountVectorizer(tokenizer=simple_tokenizer)
# outputs sparse array, want to use a normal numpy array
v = cv.fit_transform(text_data).toarray()
# get_feature_names gets the vocabulary of the vectorizer in order
# (scikit-learn 1.0 added get_feature_names_out(); get_feature_names() was removed in 1.2)
dict(zip(cv.get_feature_names(), v.sum(axis=0)))
# result is the same as above

In [None]:
# other neat uses of CountVectorizer
# turning it into a dataframe for easier manipulation
cv = CountVectorizer()
v = cv.fit_transform(text_data).toarray()
pd.DataFrame(v, columns=cv.get_feature_names())

In [None]:
# character-level
cv = CountVectorizer(analyzer='char')
v = cv.fit_transform(text_data).toarray()
pd.DataFrame(v, columns=cv.get_feature_names())

In [None]:
# n-grams
cv = CountVectorizer(ngram_range=(1,2))
v = cv.fit_transform(text_data).toarray()
pd.DataFrame(v, columns=cv.get_feature_names())

In [None]:
# pre-specified vocabulary
cv = CountVectorizer(vocabulary=['natural', 'language', 'processing', 'harvard'])
v = cv.fit_transform(text_data).toarray()
pd.DataFrame(v, columns=cv.get_feature_names())

### Exercise: Sentiment analysis with word counts
Imagine you are a hot dog restaurant owner and you want to analyze a corpus of reviews from diners to see whether people generally think your hot dogs are "good" or "bad".  Specifically, you're going to count up the number of times the words "good" and "bad" appear.  Depending on how you process the text, you will arrive at different conclusions.  Try a couple of ways to see what I mean.

You might also want to think about whether all the reviews are relevant.  Those sorts of choices may also affect your results.  Is there an automatic way you can remove non-relevant reviews?

In [None]:
reviews = ['These hot dogs are really good.',
          'These hot dogs are really bad.',
          'Good hot dogs!',
          'The hot dogs pair well with a Good Humor bar.',
          "I didn't eat anything, I felt bad.",
          "I had a good time!"]

### Simple statistical test: Log-likelihood ratio
Above we just compared the count of the word good to the count of the word bad.  But we can actually test the significance of this difference using a log-likelihood ratio test.  Refer to the slides for a bit more information, but here's a calculation based on the above example.
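
For reference, the statistic can also be sketched with the standard library alone, taking the four counts directly: the word's count and the total word count in each of the two corpora.  The sketch also converts the statistic to a p-value, assuming the usual comparison against a chi-square distribution with one degree of freedom.  The names `g2` and `g2_p_value`, the guard for zero counts, and the counts in the usage lines are made up for illustration.

```python
from math import erfc, log, sqrt


def g2(a, b, c, d):
    """Log-likelihood ratio for one word.

    a, b: count of the word in the analysis / reference corpus
    c, d: count of all words in the analysis / reference corpus
    """
    # expected counts if the word were equally frequent in both corpora
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)

    def term(observed, expected):
        # a zero count contributes nothing (guard added for this sketch)
        return observed * log(observed / expected) if observed > 0 else 0.0

    return 2 * (term(a, e1) + term(b, e2))


def g2_p_value(g):
    """Upper-tail probability of a chi-square distribution with 1 degree of freedom."""
    return erfc(sqrt(max(g, 0.0) / 2))


# hypothetical counts, not taken from the reviews data
stat = g2(a=3, b=1, c=20, d=10)
print("G2:", stat, "p-value:", g2_p_value(stat))
print("p-value at G2=3.84:", g2_p_value(3.84))
```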

In [None]:
from numpy import log
def log_likelihood(analysis, reference, word):
    # count of word in analysis corpus
    a = analysis[word].sum()
    # count of word in reference corpus
    b = reference[word].sum()
    # count of all words in analysis corpus
    c = analysis.sum().sum()
    # count of all words in reference corpus
    d = reference.sum().sum()
    print('counts analysis:', a)
    print('counts reference:', b)
    # expected counts if the word were equally frequent in both corpora
    e1 = c*(a+b)/(c+d)
    e2 = d*(a+b)/(c+d)
    # note: a zero count gives log(0) here, so this assumes a > 0 and b > 0
    g = 2*((a*log(a/e1)) + (b*log(b/e2)))
    print('G2: ', g)

In [None]:
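# count_df is not defined in this notebook; it is assumed to be a document-term
# count DataFrame for `reviews` (one row per review, one column per word)
# analysis: hot dog texts
analysis = count_df[(count_df['hot'] + count_df['dogs'])==2]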
print(analysis.shape)
# reference: non hot dog texts
reference = count_df[(count_df['hot'] + count_df['dogs'])!=2]
print(reference.shape)
log_likelihood(analysis, reference, 'good')

In [None]:
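# this is pretty non-significant
# G2=3.84 corresponds to a 5% chance of getting this difference by chance, this is closer to 30%
# if we added a few to analysis
# (work on a copy so pandas does not warn about modifying a slice of count_df)
analysis = analysis.copy()
analysis['good'] = analysis['good']+3
log_likelihood(analysis, reference, 'good')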

## Intro to advanced models
In this section, we'll be setting up some of the requirements for the more advanced techniques we will cover later in the course.  Particularly, we'll be working with:

- [huggingface's transformers library](https://github.com/huggingface/transformers)
- [spaCy-transformers (based on the above)](https://github.com/explosion/spacy-transformers)

These require some additional downloads.  For these examples you'll need:

- [BERT base uncased model](https://github.com/google-research/bert) (`en_trf_bertbaseuncased_lg`)
- SpaCy's medium English model (`en_core_web_md`, with word vectors from GloVe)


In [None]:
# install extra models
# only need to run this once per session
#!python -m spacy download en_trf_bertbaseuncased_lg
#!python -m spacy download en_core_web_md
# you will likely need to restart your kernel to load the models
#import spacy

In [None]:
# load BERT model
nlp_bert = spacy.load("en_trf_bertbaseuncased_lg")
# load medium English model
nlp = spacy.load("en_core_web_md")

In [None]:
# the spaCy Doc-Span-Token structure is still in place
text = "This is a sentence."
for model in [nlp_bert, nlp]:
    print(model.meta['name'])
    parsed = model(text)
    print(type(parsed), type(parsed[0]))
    # but the vector representation is different
    print('Vector shape:', parsed.vector.shape)
    # documents under the transformer model have additional attributes
    print(parsed._.trf_last_hidden_state)

In [None]:
# adapted from spaCy's example
# compare two different senses of the word "Apple" based on similarity
# similarity: higher values = more similar
apple_org = "Apple sold fewer iPhones this quarter."
apple_food = "Apple pie is delicious."
for model in [nlp_bert, nlp]:
    print(model.meta['name'])
    print('Similarity between senses:', model(apple_org)[0].similarity(model(apple_food)[0]))
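
By default, spaCy's `similarity` is the cosine similarity between the two objects' vectors.  Here is a minimal standard-library sketch of that measure; the three-dimensional vectors are toy values invented for this example and have nothing to do with the real model vectors.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        # similarity with a zero vector is undefined; return 0.0 in this sketch
        return 0.0
    return dot / (norm_u * norm_v)


# toy vectors standing in for two context-dependent representations of "Apple"
apple_company = [0.9, 0.1, 0.3]
apple_fruit = [0.2, 0.8, 0.4]
print("Toy similarity between senses:", cosine_similarity(apple_company, apple_fruit))
print("Similarity of a vector with itself:", cosine_similarity(apple_company, apple_company))
```

A model with one static vector per word gives the same vector to "Apple" in both sentences, so the similarity is 1.  A contextual model produces different vectors for the two uses, so the similarity can drop below 1.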

### Additional example: SciSpaCy
I noticed a lot of people in the class are working in pharma and other settings where you'd be dealing with scientific text.  When dealing with that, it might make sense to use a model trained on that sort of data.  Enter [scispaCy](https://allenai.github.io/scispacy/).

In [None]:
# you'll need to run these on Colab
#!pip install scispacy
#!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_sm-0.2.4.tar.gz
# after this, you may need to restart the runtime
import spacy

In [None]:
nlp_sci = spacy.load("en_core_sci_sm")
nlp_web = spacy.load("en_core_web_sm")
text = """
Myeloid derived suppressor cells (MDSC) are immature 
myeloid cells with immunosuppressive activity. 
They accumulate in tumor-bearing mice and humans 
with different types of cancer, including hepatocellular 
carcinoma (HCC).
"""
doc = nlp_sci(text)
print(doc.ents)
doc = nlp_web(text)
print(doc.ents)