<h1>Week 01. Text Data Essentials<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Week-01.-Introduction-to-Text-Data" data-toc-modified-id="Week-01.-Introduction-to-Text-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Week 01. Introduction to Text Data</a></span></li><li><span><a href="#Loading-and-Inspecting-Data-with-Pandas" data-toc-modified-id="Loading-and-Inspecting-Data-with-Pandas-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Loading and Inspecting Data with Pandas</a></span><ul class="toc-item"><li><span><a href="#Iterating-over-documents-in-a-dataframe" data-toc-modified-id="Iterating-over-documents-in-a-dataframe-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Iterating over documents in a dataframe</a></span></li><li><span><a href="#Saving-data" data-toc-modified-id="Saving-data-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Saving data</a></span></li></ul></li><li><span><a href="#Web-Scraping" data-toc-modified-id="Web-Scraping-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Web Scraping</a></span><ul class="toc-item"><li><span><a href="#Downloading-URL's" data-toc-modified-id="Downloading-URL's-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Downloading URL's</a></span></li><li><span><a href="#Parsing-HTML" data-toc-modified-id="Parsing-HTML-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Parsing HTML</a></span></li><li><span><a href="#Removing-unicode-characters" data-toc-modified-id="Removing-unicode-characters-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Removing unicode characters</a></span></li></ul></li><li><span><a href="#Quantity-of-Text" data-toc-modified-id="Quantity-of-Text-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Quantity of Text</a></span></li><li><span><a href="#Dictionary-/-Matching-Methods" data-toc-modified-id="Dictionary-/-Matching-Methods-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Dictionary / Matching Methods</a></span><ul class="toc-item"><li><span><a href="#Sentiment-Analysis" 
data-toc-modified-id="Sentiment-Analysis-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Sentiment Analysis</a></span></li><li><span><a href="#Sentiment-Analysis-with-Huggingface" data-toc-modified-id="Sentiment-Analysis-with-Huggingface-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Sentiment Analysis with Huggingface</a></span></li><li><span><a href="#StopWords" data-toc-modified-id="StopWords-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>StopWords</a></span></li><li><span><a href="#RegEx" data-toc-modified-id="RegEx-5.4"><span class="toc-item-num">5.4&nbsp;&nbsp;</span>RegEx</a></span></li><li><span><a href="#WordNet" data-toc-modified-id="WordNet-5.5"><span class="toc-item-num">5.5&nbsp;&nbsp;</span>WordNet</a></span></li></ul></li></ul></div>

# Week 01. Introduction to Text Data

Natural Language Processing for Law and Social Science<br>
Elliott Ash, NYU

In [1]:
# set random seed
import numpy as np
np.random.seed(4)

# Loading and Inspecting Data with Pandas

In [None]:
# If you are using Google Colab, here's the code to load the zip file from local. 
# Or you can load from other source, see: https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92
from google.colab import files
uploaded = files.upload()

In [2]:
#import warnings; warnings.simplefilter('ignore')
# !pip install pandas
import pandas as pd
df = pd.read_csv('sc_cases.zip',compression='gzip')

In [None]:
df.info()

In [None]:
df.head()

In [None]:
# drop missing
df = df.dropna()
df.head()

In [None]:
# Number of label categories (e.g. judges)
df['authorship'].describe()

In [None]:
# tabulations of label categories 
df['authorship'].value_counts()

In [8]:
df['authorship'] = df['authorship'].str.upper()

In [None]:
df['authorship'].value_counts()

In [None]:
# keep all judges through ALITO
keep_judges = df['authorship'].value_counts().index[:11]
print(keep_judges)

In [None]:
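# reset the index so rows are labelled 0, 1, 2, ... after dropping and filtering;
# the loops below rely on these integer labels (e.g. processed[0])
df = df[df['authorship'].isin(keep_judges)].reset_index(drop=True)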
df['authorship'].value_counts()

In [None]:
df.date_standard

In [None]:
df['date_standard'] = pd.to_datetime(df['date_standard'])
df['date_standard']

In [None]:
df['year'] = df['date_standard'].dt.year
df['year'].value_counts()

In [None]:
import matplotlib
df['cite_count'].hist()

In [None]:
import numpy as np
df['log_cite_count'] = np.log1p(df['cite_count']) # log(1 + x), so opinions with zero citations do not become -inf
df['log_cite_count'].hist()

Save what we have done so far.

In [None]:
df.to_pickle('sc_cases_cleaned.pkl',compression='gzip')
print(df)

## Iterating over documents in a dataframe

In the following, we show how to iterate over the rows of a dataframe and three ways to tokenize the documents: with spaCy, with gensim, and with NLTK.

In [None]:
import spacy
# more infos at https://spacy.io/
nlp = spacy.load('en_core_web_sm')

In [19]:
processed = {} # empty python dictionary for processed data
# iterate over rows
for i, row in df.iterrows():
    if i >= 10:
        break
    docid = i # make document identifier
    text = row['opinion_text']     # get text snippet
    document = nlp(text) # get sentences/tokens
    processed[docid] = document # add to dictionary    

In [None]:
# first and second opinions
print ("opinion 1:", processed[0][:50], "\n\n", "opinion 2:", processed[1][:50])

Let's see in more detail what information we can extract from documents processed using spaCy:

In [None]:
for token in processed[0][:50]:
    print(token.text, token.pos_, token.dep_) # token text, part-of-speech tag, dependency label

Alternatively, we can preprocess with gensim:

In [None]:
from gensim.utils import simple_preprocess

processed = {} # empty python dictionary for processed data
# iterate over rows
for i, row in df.iterrows():
    docid = i # make document identifier
    text = row['opinion_text']     # get text snippet
    document = simple_preprocess(text) # lowercase, tokenize, and drop punctuation
    processed[docid] = document # add to dictionary
    if i > 100:
        break
# first and second opinions
print("opinion 1:", processed[0][:50], "\n\n", "opinion 2:", processed[1][:50]) # note how simple_preprocess drops punctuation

Or with NLTK:

In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize
processed = {} # empty python dictionary for processed data
# iterate over rows
for i, row in df.iterrows():
    docid = i # make document identifier
    text = row['opinion_text']     # get text snippet
    document = word_tokenize(text.lower()) # lowercase, then split into word tokens
    processed[docid] = document # add to dictionary
    if i > 100:
        break
# first and second opinions
print("opinion 1:", processed[0][:50], "\n\n", "opinion 2:", processed[1][:50]) # note that we just tokenize and keep all tokens, including punctuation


## Saving data

In [24]:
# save as python pickle
pd.to_pickle(processed, 'processed_corpus.pkl')
# delete it
import os 
os.remove('processed_corpus.pkl')
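
A pickle file stores arbitrary Python objects, so a dictionary of tokenized documents can be written to disk and read back unchanged. The following is a minimal sketch of that round trip using the standard-library `pickle` module; the toy dictionary and the file name `toy_corpus.pkl` are made up for illustration.

```python
import os
import pickle
import tempfile

# toy stand-in for the `processed` dictionary built above (hypothetical content)
toy_processed = {0: ["the", "court", "held"], 1: ["we", "reverse"]}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy_corpus.pkl")
    # write the object to disk
    with open(path, "wb") as f:
        pickle.dump(toy_processed, f)
    # read it back
    with open(path, "rb") as f:
        restored = pickle.load(f)

# the restored object equals the original
print(restored == toy_processed)
```

Only unpickle files that you created yourself or that come from a trusted source, because loading a pickle can execute arbitrary code.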

In [38]:
# Merging Data-frames Example
# Perform a left join:
# df_merged = pd.merge(df1, df2, on='id', how='left', validate='m:1')
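
To make the merge pattern concrete, here is a small sketch with two toy dataframes (the table names and contents are invented for illustration). The `validate='m:1'` argument of `pd.merge` checks that the join key is unique in the right-hand table and raises an error otherwise.

```python
import pandas as pd

# hypothetical tables: one row per opinion, one row per judge
opinions = pd.DataFrame({"id": [1, 2, 3],
                         "judge": ["ALITO", "SCALIA", "ALITO"]})
judges = pd.DataFrame({"judge": ["ALITO", "SCALIA"],
                       "appointed": [2006, 1986]})

# a left join keeps every opinion; validate='m:1' raises a MergeError
# if `judges` contains the same judge more than once
merged = pd.merge(opinions, judges, on="judge", how="left", validate="m:1")
print(merged)
```

A many-to-one check like this is a cheap guard against accidentally duplicating rows in the left table.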

# Web Scraping

## Downloading URL's

In [None]:
import urllib.request as urllib # Python's module for accessing web pages
url = 'https://www.example.com' # placeholder URL; replace with the page you want to download
page = urllib.urlopen(url) # open the web page

html = page.read().decode('utf-8', errors='replace') # read web page contents (bytes) and decode to a string
print(html[:400])  # print first 400 characters
print()
print(html[-400:]) # print last 400 characters
print()
print(len(html),'characters in string.')   # print length of string

## Parsing HTML

In [None]:
%pip install -U beautifulsoup4

In [None]:
# Parse raw HTML
# !pip install beautifulsoup4
from bs4 import BeautifulSoup # package for parsing HTML
soup = BeautifulSoup(html, 'html.parser') # parse html of web page (naming the parser avoids a warning)
print(soup.title) # example usage: print title item

In [None]:
# extract text
text = soup.get_text() # get text (remove HTML markup)
lines = text.splitlines() # split string into separate lines
print(len(lines)) # print number of lines

In [None]:
lines = [line for line in lines if line != ''] # drop empty lines
print(len(lines)) # print number of lines

In [None]:
print(lines[:20]) # print first 20 lines
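
To see what "removing the markup" amounts to, the sketch below does a similar extraction on a tiny hand-written page using only the standard-library `html.parser`. It is not how BeautifulSoup works internally, just an illustration of the idea; the class name `TextExtractor` and the toy HTML are invented.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text between tags, ignoring the tags themselves."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

toy_html = """<html>
<head><title>Toy page</title></head>
<body>
<p>First line.</p>

<p>Second line.</p>
</body>
</html>"""

parser = TextExtractor()
parser.feed(toy_html)
parser.close()
toy_text = "".join(parser.chunks)

# same clean-up as above: split into lines and drop the empty ones
toy_lines = [line for line in toy_text.splitlines() if line.strip() != ""]
print(toy_lines)
```

For real pages, BeautifulSoup remains the more convenient choice because it tolerates malformed HTML and lets you select specific elements.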

## Removing unicode characters

In [None]:
!pip install unidecode
from unidecode import unidecode # package for transliterating Unicode text to ASCII
unicode_str = 'Visualizations\xa0' # \xa0 is a non-breaking space
fixed = unidecode(unicode_str) # example usage
print([unicode_str],[fixed]) # print original and cleaned string (\xa0 is replaced with a regular space)
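
If installing `unidecode` is not an option, the standard-library `unicodedata` module covers the common cases. The sketch below is an alternative technique, not what `unidecode` does internally; the example string is made up.

```python
import unicodedata

raw = "Visualizations\xa0caf\u00e9"  # non-breaking space and an accented e

# NFKC maps compatibility characters such as the non-breaking space
# to their plain equivalents, but keeps accented letters
compat = unicodedata.normalize("NFKC", raw)

# NFKD splits accented letters into base letter + combining mark;
# encoding to ASCII with errors="ignore" then drops the marks
ascii_only = (unicodedata.normalize("NFKD", raw)
              .encode("ascii", "ignore")
              .decode("ascii"))

print([raw], [compat], [ascii_only])
```

Unlike `unidecode`, this approach silently drops characters that have no ASCII decomposition instead of transliterating them.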

# Quantity of Text

Count words per document.

In [None]:
def get_words_per_doc(txt):
    # split text into words and count them.
    return len(txt.split()) 

# apply to our dataframe
df['num_words'] = df['opinion_text'].apply(get_words_per_doc)
df['num_words'].hist()

In [None]:
# plot length by year
ax = df.groupby('year')['num_words'].mean().plot()
ax.set_ylabel('Average Opinion Length')
import matplotlib.pyplot as plt
plt.show()

In [None]:
df['log_words'] = np.log(df['num_words'])
import seaborn as sns
sns.jointplot(data=df,x='year', y='log_words',kind='hex')

Build a frequency distribution over words with `Counter`.

In [None]:
from collections import Counter
freqs = Counter()
for i, row in df.iterrows():
    freqs.update(row['opinion_text'].lower().split())
    if i > 100:
        break
freqs.most_common(20) # can use most frequent words as style/function words
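
The same counting logic on two toy sentences (invented for illustration) shows how `Counter.update` accumulates counts across documents, and why whitespace splitting leaves punctuation attached to words.

```python
from collections import Counter

toy_docs = ["The court held that the statute applies.",
            "The statute does not apply, the court said."]

toy_freqs = Counter()
for doc in toy_docs:
    # same whitespace tokenization as above; punctuation stays attached to words
    toy_freqs.update(doc.lower().split())

print(toy_freqs.most_common(3))
print("apply," in toy_freqs)  # the comma is part of the token
```

A proper tokenizer, such as the spaCy, gensim, or NLTK ones shown earlier, would separate or remove the punctuation before counting.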

# Dictionary / Matching Methods

## Sentiment Analysis

In [None]:
#!pip install spacytextblob
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob
print (spacy.__version__)

In [None]:
# Dictionary-Based Sentiment Analysis
nltk.download('vader_lexicon')

# textblob sentiment analysis: https://github.com/sloria/TextBlob
# pip install spacytextblob

import spacy
from spacytextblob.spacytextblob import SpacyTextBlob


nlp = spacy.load('en_core_web_sm')
# spacy_text_blob = SpacyTextBlob()
nlp.add_pipe('spacytextblob')
doc = nlp(df.iloc[0]["opinion_text"])
#from nltk.sentiment.vader import SentimentIntensityAnalyzer
#sid = SentimentIntensityAnalyzer()
#polarity = sid.polarity_scores(text)
print("polarity", doc._.blob.polarity) # for comparison, NLTK's SentimentIntensityAnalyzer gives: {'neg': 0.134, 'neu': 0.785, 'pos': 0.081, 'compound': -0.9999}
print("subjectivity", doc._.blob.subjectivity)

In [31]:
# sample 10% of the dataset
dfs = df.sample(frac=.1) 
# apply TextBlob polarity score to data-frame
def get_sentiment(snippet):
    #return sid.polarity_scores(snippet)['compound']
    return nlp(snippet)._.blob.polarity
dfs['sentiment'] = dfs['opinion_text'].apply(get_sentiment)

In [None]:
dfs.sort_values('sentiment',inplace=True)
# print beginning of most positive documents
[x[50:150] for x  in dfs[-5:]['opinion_text']]

In [None]:
# print beginning of most negative documents
[x[50:150] for x  in dfs[:5]['opinion_text']]

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
# sample 10% of the dataset
dfs = df.sample(frac=.1)

# apply compound sentiment score to data-frame
def get_sentiment(snippet):
    return sid.polarity_scores(snippet)['compound']
dfs['sentiment_vader'] = dfs['opinion_text'].apply(get_sentiment)
dfs.sort_values('sentiment_vader',inplace=True)
# print beginning of most positive documents
[x[50:150] for x  in dfs[-5:]['opinion_text']]
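
Both TextBlob and VADER are dictionary-based: they look up words in a sentiment lexicon and aggregate the scores. The sketch below shows the core idea with a four-word toy lexicon. The lexicon, its weights, and the function name `toy_polarity` are invented for illustration and are not taken from either library.

```python
import re

# hypothetical mini-lexicon: word -> polarity weight
# (NOT the VADER or TextBlob lexicon)
TOY_LEXICON = {"good": 1.0, "fair": 0.5, "bad": -1.0, "unlawful": -1.0}

def toy_polarity(text):
    """Average lexicon weight over the tokens that appear in the lexicon."""
    tokens = re.findall(r"[a-z]+", text.lower())
    weights = [TOY_LEXICON[t] for t in tokens if t in TOY_LEXICON]
    # documents without any lexicon word are scored as neutral
    return sum(weights) / len(weights) if weights else 0.0

print(toy_polarity("The ruling was fair and good."))
print(toy_polarity("The search was unlawful."))
print(toy_polarity("The court adjourned."))
```

Real analyzers add rules on top of the lookup, for example for negation and intensifiers, which a plain average like this ignores.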

## Sentiment Analysis with Huggingface 

In [None]:
#!pip install transformers
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer
pipe = pipeline("sentiment-analysis")

In [None]:
from torch.utils.data import Dataset
from tqdm.auto import tqdm

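class OpinionDataset(Dataset):
    def __init__(self, df):
        super().__init__()
        self.df = df

    def __len__(self):
        return len(self.df) # use the dataframe stored on the instance, not the global df

    def __getitem__(self, i):
        # keep the first 512 characters, which for English text is usually well under the model's 512-token limit
        return self.df.iloc[i]["opinion_text"][:512]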


dataset = OpinionDataset(df)
sentiments = []

for out in tqdm(pipe(dataset, batch_size=16), total=len(dataset)):
    # map the label and confidence to a signed score in [-1, 1]
    if out['label'] == "NEGATIVE":
        sentiments.append(-1*out['score'])
    else:
        sentiments.append(out['score'])
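
The loop above turns each pipeline result, a dictionary with a `label` and a `score`, into a single signed number. The sketch below isolates that conversion so it can be checked without downloading a model. The helper name `signed_score` and the example outputs are written by hand for illustration, not produced by the pipeline.

```python
def signed_score(result):
    """Convert {'label': ..., 'score': ...} to a number in [-1, 1]."""
    sign = -1.0 if result["label"] == "NEGATIVE" else 1.0
    return sign * result["score"]

# hypothetical pipeline outputs, written by hand for illustration
toy_outputs = [{"label": "NEGATIVE", "score": 0.98},
               {"label": "POSITIVE", "score": 0.75}]

toy_sentiments = [signed_score(o) for o in toy_outputs]
print(toy_sentiments)
```

Note that the score is the classifier's confidence in its label, so values near zero cannot occur with a two-class model: every document lands in either [-1, -0.5] or [0.5, 1].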

In [37]:
df['sentiments'] = sentiments

In [None]:
df.sort_values('sentiments',inplace=True)
# print beginning of most positive documents
[x[50:150] for x  in df[-5:]['opinion_text']]

In [None]:
# print beginning of most negative documents
[x[50:150] for x  in df[:5]['opinion_text']]

## StopWords

In [None]:
#from nltk.corpus import stopwords
#stopwords = set(stopwords.words('english'))
#stopwords
from spacy.lang.en import stop_words
print(stop_words.STOP_WORDS)

In [None]:
#stopfreq = np.sum([freqs[x] for x in stopwords])
#stopfreq # 174132 for NLTK stopwords
stopwords = stop_words.STOP_WORDS
stopfreq = np.sum([freqs[x] for x in stopwords])
stopfreq

In [None]:
otherfreq = np.sum([freqs[x] for x in freqs if x not in stopwords])
otherfreq
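
The two totals are most informative as a share of all tokens. Here is a small sketch with a toy stopword list and a single toy sentence (both invented for illustration; the spaCy list above is much longer).

```python
from collections import Counter

# hypothetical mini stopword list
toy_stopwords = {"the", "of", "a", "is", "to"}
toy_counts = Counter("the opinion of the court is final".split())

stop_total = sum(n for word, n in toy_counts.items() if word in toy_stopwords)
other_total = sum(n for word, n in toy_counts.items() if word not in toy_stopwords)

# fraction of all tokens that are stopwords
stop_share = stop_total / (stop_total + other_total)
print(stop_total, other_total, round(stop_share, 2))
```

Even in this tiny example more than half of the tokens are stopwords, which is why they are often removed before counting content words.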

## RegEx

Please refer to [RegExOne Regular Expressions Lessons](https://regexone.com) and [the Python documentation](https://docs.python.org/3/howto/regex.html).

In [None]:
import re

docs = dfs[:5]['opinion_text']

# Extract words after justice.
for doc in docs:    
    print(re.findall(r'Justice \w+ ', # pattern to match. always put 'r' in front of string so that backslashes are treated literally.
                     doc,            # string
                     re.IGNORECASE))  # ignore upper/lowercase (optional)
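
Two details of the pattern above are easy to miss: the trailing space requires a space after the name, and `re.IGNORECASE` controls whether `JUSTICE` matches. The sketch below shows both on a made-up sentence.

```python
import re

toy_sentence = ("Justice Scalia wrote; JUSTICE Breyer dissented, "
                "joined by Justice Souter.")

# pattern from above: the trailing space excludes names followed by punctuation
with_space = re.findall(r"Justice \w+ ", toy_sentence, re.IGNORECASE)

# without the trailing space, "Justice Souter." is found as well
without_space = re.findall(r"Justice \w+", toy_sentence, re.IGNORECASE)

# without re.IGNORECASE, the upper-case "JUSTICE Breyer" is skipped
case_sensitive = re.findall(r"Justice \w+", toy_sentence)

print(with_space)
print(without_space)
print(case_sensitive)
```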

In [None]:
# Extract hyphenated words
for doc in docs:    
    print(re.findall(r'[a-z]+-[a-z]+', 
                     doc,            
                     re.IGNORECASE))  

In [None]:
# extract citations
for i, doc in enumerate(docs):
    finder = re.finditer(r'\d+ [^\s]+ \d+', # pattern to match ([^\s] means non-white-space); raw string avoids invalid-escape warnings
                     doc)            # string
    for m in finder: 
        print(i, m.span(),m.group()) # location (start,end) and matching string
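
The citation pattern reads as "a number, one token without spaces, a number", which fits reporter citations such as `410 U.S. 113`. The sketch below applies it to a hand-written sentence and also shows that the pattern is loose enough to catch phrases that are not citations.

```python
import re

# number, one non-whitespace token, number (\S is shorthand for [^\s])
citation_pattern = r"\d+ \S+ \d+"

toy_opinion = ("See Roe v. Wade, 410 U.S. 113 (1973), and "
               "Miranda v. Arizona, 384 U.S. 436 (1966).")
citations = [m.group() for m in re.finditer(citation_pattern, toy_opinion)]
print(citations)

# caveat: ordinary phrases with two numbers match too
false_positives = re.findall(citation_pattern, "The vote was 5 to 4.")
print(false_positives)
```

A stricter pattern would list the reporter abbreviations explicitly, at the cost of missing reporters that are not on the list.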

In [46]:
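# Baker-Bloom-Davis economic policy uncertainty: a document must mention
# uncertainty (pattern1), the economy (pattern2), and a policy term (pattern3)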
pattern1 = r'(\b)uncertain[a-z]*'
pattern2 = r'(\b)econom[a-z]*'
pattern3 = r'(\b)congress(\b)|(\b)deficit(\b)|(\b)federal reserve(\b)|(\b)legislation(\b)|(\b)regulation(\b)|(\b)white house(\b)'



In [None]:
re.search(pattern1,'The White House tried to calm uncertainty in the markets.')

In [None]:
re.search(pattern2,'The Congress tried to calm uncertainty in the economy.')

In [49]:
re.search(pattern3,'The Congress tried to calm uncertainty in the markets.') # returns None: the search is case-sensitive and the pattern is lowercase

In [None]:
re.search(pattern3,'The Congress tried to calm uncertainty in the markets.', re.IGNORECASE) # matches 'Congress' once case is ignored

In [51]:
def indicates_uncertainty(doc):
    m1 = re.search(pattern1, doc, re.IGNORECASE)
    m2 = re.search(pattern2, doc, re.IGNORECASE)
    m3 = re.search(pattern3, doc, re.IGNORECASE)
    # True only if all three categories of terms appear in the document
    return bool(m1 and m2 and m3)

In [None]:
indicates_uncertainty('The White House tried to calm uncertainty in the economy.')

In [None]:
indicates_uncertainty('The White House tried to calm uncertainty in the markets.')
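
The two calls above differ only in the last word, yet one returns `True` and the other `False`. The self-contained sketch below reports which of the three term categories matched, which makes it easier to see why. The pattern and function names are introduced here for illustration, and the patterns are rewritten with non-capturing groups but use the same terms as above.

```python
import re

# the three term categories from above, compiled once with IGNORECASE
UNCERTAINTY = re.compile(r"\buncertain[a-z]*", re.IGNORECASE)
ECONOMY = re.compile(r"\beconom[a-z]*", re.IGNORECASE)
POLICY = re.compile(
    r"\b(?:congress|deficit|federal reserve|legislation|regulation|white house)\b",
    re.IGNORECASE)

def matched_categories(doc):
    """Return which of the three term categories occur in `doc`."""
    return {"uncertainty": bool(UNCERTAINTY.search(doc)),
            "economy": bool(ECONOMY.search(doc)),
            "policy": bool(POLICY.search(doc))}

def flags_policy_uncertainty(doc):
    # a document counts only if all three categories are present
    return all(matched_categories(doc).values())

for sentence in ["The White House tried to calm uncertainty in the economy.",
                 "The White House tried to calm uncertainty in the markets."]:
    print(flags_policy_uncertainty(sentence), matched_categories(sentence))
```

The second sentence fails because "markets" does not match the economy pattern, even though the uncertainty and policy terms are both present.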

In [54]:
df['uncertainty'] = df['opinion_text'].apply(indicates_uncertainty)

In [None]:
df.uncertainty.mean()

In [None]:
df.groupby('year')['uncertainty'].mean().plot()

## WordNet

These examples are based on the [NLTK tutorial](https://www.nltk.org/howto/wordnet.html).

In [None]:
nltk.download('wordnet')
from nltk.corpus import wordnet as wn

In [None]:
nltk.download('omw-1.4')
wn.synsets('judge')

In [None]:
wn.synsets('judge', pos='v') # can filter on part of speech

In [None]:
judge = wn.synset('judge.n.01')
judge

In [None]:
judge.definition()

In [None]:
wn.synset('estimate.v.01').examples()

In [None]:
# categories to which "judge.n.01" belongs
judge.hypernyms()

In [None]:
# the root category of "judge.n.01"
judge.root_hypernyms()

In [None]:
wn.synset('estimate.v.01').root_hypernyms()

In [None]:
# members of the "judge.n.01" category
judge.hyponyms()

In [None]:
# a "holonym" is the whole that something is a part or member of; here, the groups a juror is a member of
juror = wn.synset('juror.n.01')
juror.member_holonyms()

In [None]:
# can find "lowest common hypernyms":
judge.lowest_common_hypernyms(juror)

In [None]:
# "lemmas" are specific senses of a specific word.
judge.lemmas()

In [None]:
[lemma.name() for lemma in judge.lemmas()]

In [None]:
# lemmas have additional properties
judge_lemma = judge.lemmas()[0]
judge_lemma.derivationally_related_forms()

In [None]:
good = wn.synset('good.a.01').lemmas()[0]
good.antonyms()

In [None]:
# verb frames summarize the different syntactic contexts in which a verb can be used
judge_verb = wn.synset('estimate.v.01').lemmas()[4]
judge_verb.frame_strings()

In [None]:
# measure similarity in the dictionary between words
judge.path_similarity(wn.synset('juror.n.01'))

In [None]:
judge.path_similarity(wn.synset('cat.n.01'))

In [None]:
# Wu-Palmer similarity.
judge.wup_similarity(juror)

In [None]:
judge.wup_similarity(wn.synset('cat.n.01'))

In [None]:
# Can iterate over all synsets; e.g., all nouns:
for synset in list(wn.all_synsets('n')):
    if 'judg' in str(synset):
        print(synset)

**Exercise**. Use WordNet to expand the set of words in the Baker-Bloom-Davis dictionary and re-compute policy uncertainty scores by year.