**<center>Text as Data</center>**
***<center>Basic Text Classification</center>***

<center>Snorre Ralund</center>

# Motivation
**Quantifying the Qualitative**  
- **Behavior** (ethnography) instead of **Attitudes** (Survey)
- Dynamics of **Ideas** and Forms (Historical analysis) instead of rationalized **Answers** (Interview) and **Numbers** (Register).
- Social relationships and their content.

## Examples
- Measuring the historical development of job demand from job postings.
- Opportunism in politics:
    - Shifting political topics based on Response.
- Ethnic and Political Polarization through expressed social ties.
    - Processing the social signals in textual social media content.
    - Respect and Hostility towards the Stranger


## Overview of Computational Content Analysis
![](computational_content_analysis_overview.png)

# Agenda
**Natural Language Processing basics - Representation and Classification**
- Baseline classifiers using the BoW representation.
    - Tokenization
    - Stemming and Lemmatization
    - N-grams and Collocations.
- Lexical and rule-based approaches to classification

**Computational Content Analysis Methodology**
- Natural language processing as a ***Measurement device***.
    - Validation of automated procedure.
- Understanding and Accounting for the ***Measured Category***
    - Computationally grounded qualitative analysis


# Simple Text Classification
![](stanford_pajamas_constituency_parsing.png)

- Language Understanding is hard, but... 
    - Many useful classification tasks just need simple recognition of keywords.
    - Many interesting questions can come from simple statistical descriptions.
        - lengths of sentences or documents
        - LIX (readability index) and word length
        - the use of ALLCAPS, commas and punctuation (,!?;:."").
        - the use of emoticons/emojis
        - currency symbols: $ € kr

In [39]:
## Simple description example.
### Download data
import pandas as pd
import numpy as np
path2data = 'https://raw.githubusercontent.com/snorreralund/scraping_seminar/master/english_review_sample.csv'
df = pd.read_csv(path2data)
sentence = df.iloc[3267].reviewBody
sentence

"... be prepared to spent 30-40 minutes fighting with the unpacking, and cutting up a HUGE shipping box!!!!  I'M SERIOUS!!!  I recently ordered a frame (12x36) with matt board and plexiglass (not glass).  My wife helped me carry a HUGE box into the house, and after about half an hour of cutting through layers of very thick cardboard, the unpacking was finally complete and we had a HUGE pile of cut up cardboard and packing material... what an incredible waste!!  I contacted them and sent pictures.  I told them to please stop this unnecessary overkill... they ignored my suggestion.  I just received a smaller order (11x14) and you would think I ordered quantity 10 by the size of the box!!!  LOVE their products, I am tired of hauling piles of cardboard to the recycle center because there is so much with each order it will not fit in my recycle bin!!"

In [42]:
# ALLCAPS
print('YEAH'.isupper(),'yeah'.islower())
upper = 0
for w in sentence.split(): # naive tokenization
    upper+=w.isupper()
print(upper)
#Counting simple pattern
print(sentence.count('!'))
# wordlengths
words = sentence.split() # naive tokenization
lengths = [len(w) for w in words]
print(np.mean(lengths))

True True
12
14
4.629139072847682
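The same naive approach extends to the LIX readability index mentioned in the list above. The sketch below assumes the usual definition (words per sentence plus the percentage of words longer than six letters); the regex-based word and sentence splitting is a simplifying assumption, and the example text is made up for illustration.

```python
import re

def lix(text):
    """LIX = words / sentences + 100 * long_words / words (long = more than 6 letters)."""
    # Words approximated as runs of letters (assumption: digits and symbols are ignored).
    words = re.findall(r"[^\W\d_]+", text)
    # Sentence boundaries approximated by runs of . ! ? : (assumption).
    sentences = [s for s in re.split(r"[.!?:]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)

example = "I love bacon. Unnecessary packaging is incredibly annoying!"
print(lix(example))
```

Applied to a column of reviews (e.g. `df.reviewBody.apply(lix)`), this gives one readability number per document.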


## Simple Classification (2) - BoW Baseline
### Getting from Text to vector
![](https://image.slidesharecdn.com/bcw-cochrane-tech-2013-130927075508-phpapp02/95/text-mining-machine-learning-nlp-and-all-that-in-10-minutes-10-638.jpg?cb=1380268595)
<center>Credit: Byron C. Wallace</center>

## Vector representation 
## Bag of Words (BoW) or Document Term Matrix
* Throw out the word order.
* Let each word be a feature: map each word to a column index in a matrix.


doc1: "i really love bacon"

doc2: "i really don't like bacon"

<center> **As a Document Term Matrix** </center>

document | i | really | love | bacon | don't | like
--- | --- | --- | --- | --- | --- | ---
*doc1* | 1 | 1 | 1 | 1 | 0 | 0
*doc2* | 1 | 1 | 0 | 1 | 1 | 1
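A minimal sketch of how the two example documents can be turned into a document-term matrix in plain Python. The alphabetical column order and the helper name `to_vector` are choices made for this illustration only; libraries typically return a sparse matrix instead of lists.

```python
from collections import Counter

docs = {"doc1": "i really love bacon", "doc2": "i really don't like bacon"}

# Naive whitespace tokenization.
tokenized = {name: text.split() for name, text in docs.items()}

# Vocabulary: every unique token gets a fixed column (here: alphabetical order).
vocab = sorted({tok for toks in tokenized.values() for tok in toks})

def to_vector(tokens):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[tok] for tok in vocab]

dtm = {name: to_vector(toks) for name, toks in tokenized.items()}
print(vocab)
for name, row in dtm.items():
    print(name, row)
```

Word order is gone after this step: any permutation of the words in a document yields the same row.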


![](https://cdn-images-1.medium.com/max/1200/1*unP1qDkUPSSUsa2-530zEg.png)
- Baseline models: Naive Bayes, Logistic Regression, K-Nearest Neighbors and Support Vector Machines
- **Wang and Manning 2012** *"Baselines and Bigrams: Simple, Good Sentiment and Topic Classification"*: 
    - State-of-the-art (2012) Topic and Sentiment Classification using only atomized words as input (BoW) to a linear model. 
    - No grammar or reasoning.    
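To make the idea of a BoW baseline concrete, here is a small multinomial Naive Bayes classifier written from scratch. It is a sketch for illustration, not the setup used by Wang and Manning: the four training sentences and their labels are invented, tokenization is a plain whitespace split, and `alpha` is ordinary add-one smoothing.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes on whitespace-tokenized documents."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(docs, labels):
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    model = {}
    for label, counts in word_counts.items():
        total = sum(counts.values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(class_docs[label] / len(docs)),
            "log_lik": {w: math.log((counts[w] + alpha) / total) for w in vocab},
        }
    return model

def predict_nb(model, text):
    """Return the label with the highest log posterior; unseen words are skipped."""
    scores = {}
    for label, params in model.items():
        scores[label] = params["log_prior"] + sum(
            params["log_lik"][w] for w in text.lower().split() if w in params["log_lik"]
        )
    return max(scores, key=scores.get)

# Invented toy data, for illustration only.
train_docs = ["i really love bacon", "great tasty bacon",
              "i really don't like bacon", "awful greasy bacon"]
train_labels = ["pos", "pos", "neg", "neg"]
model = train_nb(train_docs, train_labels)
print(predict_nb(model, "i love tasty bacon"))
print(predict_nb(model, "awful greasy bacon"))
```

The model only sees word counts per class, which is exactly the "no grammar or reasoning" setting described above.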

## Bag of Words (2) - Problem with polysemy

Consider the following two documents: 

doc1: *"River A/S declared default by the bank."*

doc2: *"When camping my default is by the river bank."*

<center> **As a Document Term Matrix** </center>

document | declared | by | default | bank | river | a/s | when | camping | my
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
*doc1* | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 
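*doc2* | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1

The stop words *the* and *is* are left out of the table. *default*, *bank* and *river* get identical counts in both documents, although each word means something different in the two sentences.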


## Bag of Words (3) - Lack of word order
doc1 = 'this was not the best movie'

doc2 = 'was this not the best movie ever?'

Both documents will have very similar BoW representations, even though their meanings differ.

### Ngrams to the Rescue
#### Word Ngram 
![](https://i.stack.imgur.com/8ARA1.png)
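A short sketch of word n-grams applied to the two movie sentences from the previous slide. Jaccard overlap is used here only as a convenient way to compare the two feature sets; it is an illustrative choice, not a method prescribed by the slides.

```python
def word_ngrams(tokens, n):
    """Return all contiguous windows of n tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def jaccard(a, b):
    """Share of distinct features that the two documents have in common."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

doc1 = "this was not the best movie".split()
doc2 = "was this not the best movie ever?".split()

unigram_overlap = jaccard(doc1, doc2)
bigram_overlap = jaccard(word_ngrams(doc1, 2), word_ngrams(doc2, 2))
print(unigram_overlap, bigram_overlap)
```

With single words the documents share 6 of 7 features; with bigrams only 3 of 8, so some of the word order information is recovered.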


### Ngrams to the Rescue(2)
**Problem with dimensionality:**
4-grams, 5-grams, etc. generate a number of potential features that grows exponentially with n.

***Solution:*** Pick only the informative n-grams using statistical analysis of word co-occurrences.
Check out methods for doing so in the Natural Language Toolkit package `nltk`: `nltk.collocations`
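One common statistic for selecting n-grams is pointwise mutual information (PMI). The sketch below computes it by hand for adjacent word pairs so that the logic is visible; the toy text and the `min_count` threshold are assumptions for illustration. `nltk.collocations` offers ready-made finders and association measures for the same purpose.

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=2):
    """Score adjacent word pairs by PMI = log2( p(w1, w2) / (p(w1) * p(w2)) )."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:  # rare pairs get unreliable (inflated) PMI scores
            continue
        p_pair = count / n_bi
        p_w1, p_w2 = unigrams[w1] / n_uni, unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Invented toy text, already tokenized by whitespace.
text = ("new york is big . i love new york . the city is big . "
        "i love the city . new york is the city i love .")
ranked = pmi_bigrams(text.split())
for pair, score in ranked:
    print(pair, round(score, 2))
```

Pairs that pass a chosen PMI threshold can then be merged into single tokens (e.g. `new_york`) before building the document-term matrix.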

# Tokenization: From string to words to concepts
![](https://www.incimages.com/uploaded_files/image/970x450/getty_850704072_375370.jpg)

In [4]:
"What are the boundaries between-words and meaning?".split()

['What', 'are', 'the', 'boundaries', 'between-words', 'and', 'meaning?']

## WORDS (1)
How we "split" / locate words in text determines the number of dimensions (columns in the Document-Term Matrix). Too many dimensions lead to:
* Computational inefficiency.
* Parameters that are not shared among equivalent words.
    * This matters especially for tasks with few labelled samples (low N).

## Words (2)
**ISSUES**
* Spelling mistakes, weird uses of punctuation,
* Emoticons: </3 , (:) , :-]
* Multiwords: #no-more-work, New York, Federal Bureau of Finance, word/concept
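A hedged sketch of a regex tokenizer that handles some of the issues listed above. The pattern is an assumption made for this illustration: it covers only a handful of emoticons, keeps hyphenated and slashed words together, and still splits multiword names such as "New York".

```python
import re

TOKEN_PATTERN = re.compile(r"""
    [:;=][-']?[\)\(\]\[dDpP/]      # a few western emoticons such as :) ;-( :-]
  | <3 | </3                       # hearts
  | \#[\w-]+                       # hashtags, including hyphenated ones
  | \w+(?:[-'/]\w+)*               # words, keeping inner hyphens, apostrophes and slashes
  | [^\w\s]                        # any other single punctuation mark
""", re.VERBOSE)

def tokenize(text):
    """Split text into tokens according to TOKEN_PATTERN."""
    return TOKEN_PATTERN.findall(text)

example = "What are the boundaries between-words and meaning? </3 :-] #no-more-work New York"
print(tokenize(example))
```

Compared with `str.split()`, the question mark becomes its own token while the emoticons and the hashtag survive intact.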



## Representation (1) 
**How to encode all relevant information in our tokens?**
* lower-casing: DO YOU REALLY WANT TO IGNORE MY ALLCAPS?!?!
    * Our feature space can potentially double if we don't lowercase.
* Numbers: Infinite combinations of digits
* Filtering to reduce dimensions: Which words to lose?
    * Common or Rare?


## Representation (2) 
**Grammatical Forms: Do we need grammar?**
- Stemming
    - Rule-based: Strips common endings: 'ing', 'ly', 's'
- Lemmatization
    - Lookup in lexical resources: ran --> run, running --> run

**Trade-off between precision and coverage**
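The trade-off can be illustrated with two toy functions. Both are deliberately simplistic assumptions: `naive_stem` strips only the three endings named above, and `LEMMAS` is a hypothetical three-entry lookup table standing in for a real lexical resource.

```python
SUFFIXES = ("ing", "ly", "s")   # the endings named on the slide
LEMMAS = {"ran": "run", "running": "run", "mice": "mouse"}  # hypothetical toy lexicon

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, unless the remaining stem gets too short."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

def lookup_lemma(word):
    """Return the lemma if the word is in the lexicon, else the word unchanged."""
    return LEMMAS.get(word, word)

for w in ["running", "ran", "quickly", "cats", "sing", "mice"]:
    print(w, naive_stem(w), lookup_lemma(w))
```

The stemmer touches every word but makes mistakes ("running" becomes "runn", "ran" is missed), while the lookup is precise but only covers what the lexicon contains.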

## WORDS (4) - choosing a tokenizer and preprocessing scheme
**There are many tokenizers: which one to choose depends on the task**
- no. of samples (with few samples, consider lemmatization or stemming)
- how much subtlety you need
     - e.g. ['"truth"'] or ['"' ,'truth','"']

## Tokenizer Experiment (1) - which is best?


![](tokenizer_similarity.png)

In [445]:
print("Klatret©ªsen(Catch That Girl) is really great movie! It's a 'happy' movie. I watched this movie in 'Puchon International Fantastic Film Festival(PiFan)' on July 12nd, 2003. There is Action + Adventure + Comedy + Thrill + Happy + Romance(cute kids' love Triangle!). You must see this movie. :)")

Klatret©ªsen(Catch That Girl) is really great movie! It's a 'happy' movie. I watched this movie in 'Puchon International Fantastic Film Festival(PiFan)' on July 12nd, 2003. There is Action + Adventure + Comedy + Thrill + Happy + Romance(cute kids' love Triangle!). You must see this movie. :)


![](cv_tokenizer_performance.png)

## WORDS (4) - choosing a tokenizer and preprocessing scheme
For user generated content and social media data use:
- If enough data: `nltk.tokenize.casual.TweetTokenizer`
- If smaller data: the `spacy` lemmatizer.

For more formal text (e.g. scientific articles) test a few others.
<pre><code> 
import nltk
tweet_tokenizer = nltk.tokenize.casual.TweetTokenizer()
tweet_tokenizer.tokenize('hello I speak emoticon and #hashtag :)')
</code></pre>


## Lexical approaches to text mining
**Automatic classification based on Rules and Lookups**

**Pros**
* Comes with very low cost. 
* Fast and scalable.
* Good for prototyping results.

* Good for certain tasks (e.g. topic classification)

**Example** 
- 300 million documents, more than 5 million unique tokens. How do we explore such a corpus?

![](https://raw.githubusercontent.com/snorreralund/sds_dump/master/toneDK.png)


## Precompiled Lexicons
**Sentiment**
* [AFINN (is Danish!)](http://neuro.imm.dtu.dk/wiki/AFINN)
* [Liu Hu (lexical)](http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html)
* [VADER (lexical and rule-based)](https://github.com/cjhutto/vaderSentiment)


* **Purely lexical:** Naively matching positive words. *"You are beautiful."* 
* **Rule-based:** Can adopt hard-coded rules to handle more or less simple negations. *"You are not particularly beautiful."*
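The difference between the two approaches can be sketched with a toy scorer. The lexicon scores, the negation list and the three-word window are all invented for this example; real tools such as VADER use larger lexicons and more refined rules than a simple sign flip.

```python
# Hypothetical toy lexicon and negation list, for illustration only.
LEXICON = {"beautiful": 2, "love": 3, "great": 3, "bad": -2, "awful": -3}
NEGATIONS = {"not", "no", "never", "don't", "isn't"}

def tokenize(text):
    """Whitespace split, lowercased, with surrounding punctuation stripped."""
    return [t.strip('.,!?"').lower() for t in text.split()]

def lexical_score(text):
    """Purely lexical: sum the scores of all matched words."""
    return sum(LEXICON.get(tok, 0) for tok in tokenize(text))

def rule_based_score(text, window=3):
    """Rule-based: flip a word's score if a negation occurs within the preceding window."""
    tokens = tokenize(text)
    score = 0
    for i, tok in enumerate(tokens):
        value = LEXICON.get(tok, 0)
        if value and NEGATIONS & set(tokens[max(0, i - window):i]):
            value = -value
        score += value
    return score

for sentence in ["You are beautiful.", "You are not particularly beautiful."]:
    print(sentence, lexical_score(sentence), rule_based_score(sentence))
```

The purely lexical scorer rates both sentences as equally positive; the negation rule separates them.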

**See the notebook [lexical_mining.ipynb](https://github.com/abjer/tsds/blob/master/material/12_text2/lexical_mining.py) for a compilation of lexicon classifiers.**

### issues with the lexical-based approach (1)

*"Pretty. Pretty actresses and actors. Pretty bad script. Pretty frequent "let's strip to our undies" scenes. Pretty fair F/X. Pretty jarring location decisions (the college dorm room looks like a high-end hotel room - probably because it was shot at a hotel). Pretty bland storyline. Pretty awful dialog. Pretty locations. Pretty annoying editing, unless you like the music video flash-cut style.This one isn't a guilty pleasure - this is more an embarrassing one. If you must watch this, pick a good dance/techno album and turn the sound off on the movie - you'll see the pretty people in their pretty black undies, and probably follow the story just fine.The cast may be able to act - I doubt that anyone could look skilled given the lines/plot that they had to deal with."*

### issues with the lexical-based approach (2)
- Atomized words: How well can meaning be derived from atomized words?  
    - Less of an issue for more complex rule-based versions:
        - e.g. VADER
        - Phrase-based argument dictionaries.
    - Use in connection with parsers: e.g. to find at whom the emotion is directed. 
- What is the recall?
    - Using dictionaries without explicit validation is bad practice.
    


### use for prototyping!


### issues with the lexical-based approach (3)
**Conglomerates of words = Concept?**
- What is the theoretical validity of a collection of words, scraped from many sources, validated at some historical point in time, and given some score by a number of people (students, Amazon Mechanical Turk workers)?




## Data-driven discovery and generation of lexicons
- Efficient generation of your own lexicons using an active learning approach [(King, Lam and Roberts 2017)](https://gking.harvard.edu/publications/computer-assisted-keyword-and-document-set-discovery-fromunstructured-text) and word similarity search [(e.g. Bravo-Marquez et al. 2016)](https://www.cs.waikato.ac.nz/~eibe/pubs/emo_lex_wi.pdf).


## Python Resources for Text as Data
- `nltk`: General purpose and educational natural language processing library.
    - Includes many tokenizers, stemmers and a collocation module. 
- `spacy`: For a more advanced natural language processing pipeline.
- `stanfordnlp`: For state-of-the-art dependency parsing in more than 50 languages. 
    - Parsing of grammar, named-entity-extraction, subject-verb-object.
- `gensim`: Unsupervised learning and Clustering of text
    - Including Topic Modelling implementations, and Word2Vec style word-similarity search.

# Methodological Considerations - (i.e. EXAM)


# Computational Content Analysis (1)
![](https://matlab1.com/wp-content/uploads/2017/11/tolomerok-kalibralasa-tevekenysegek.jpg)
- NLP is a **Measurement device**
    - How well is it calibrated?
    - Does it have any biases that might be relevant to the results produced?

### For the Exam
**Discuss potential biases in classification, and their impact on results.**
- Issues could be: 
    - Missing keywords using the lexical approach
    - Sarcasm not equally distributed across social classes.
     
- Take a ***small*** random sample of the model's predictions for annotation (multiple coders). 
    - Report intercoder agreement (http://www.nltk.org/api/nltk.metrics.html) and accuracy. 
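A minimal sketch of the two numbers to report, assuming two coders who label the same sample. Cohen's kappa is used here as one common agreement measure (the slides do not prescribe a specific one), and all labels below are invented. Treating coder A as the gold standard is a simplification; in practice disagreements would be resolved first.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Agreement between two coders, corrected for agreement expected by chance."""
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    # Note: undefined (division by zero) if both coders always use the same single label.
    return (observed - expected) / (1 - expected)

def accuracy(gold, predicted):
    """Share of model predictions that match the human annotation."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

# Invented annotations for a sample of ten documents.
coder_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
coder_b = ["pos", "pos", "neg", "pos", "pos", "neg", "pos", "neg", "neg", "neg"]
model   = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos", "pos", "neg"]

print(round(cohens_kappa(coder_a, coder_b), 2))
print(accuracy(coder_a, model))
```

Low agreement between the coders signals that the category itself is poorly defined, which limits how meaningful any accuracy figure can be.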

# Computational Content Analysis (2)
![](https://systweak1.vo.llnwd.net/content/wp/systweakblogsnew/uploads_new/2018/04/98993-610x377-Why-People-Dont.jpg)
- Before we trust a **Machine**, we need to trust the **Instructor** first.


### Construct Validity in (Automated) Content Analysis (2)
Krippendorff 2004: *"Content Analysis"*

- When constructing a variable from an interpretation of text (as opposed to survey or register-based research):
    - We have the *possibility*,
    - and the *obligation* to show that our interpretations are **correct**,
    - and to be **precise**.
        - What is it that we are measuring?
        - What are we not measuring?

## Trusting the Instructor (Exam)
- A clear definition (and delimitation) of the Category one wishes to measure.
    - Give examples from the data.
    - Discuss the potential challenges involved in automating the classification.
