**<center>Text as Data 2</center>**
***<center>Datadriven Discovery</center>***

<center>Snorre Ralund</center>

## Discovery and exploratory analysis of large text data 
### Agenda

* Recap fundamentals.
* Fundamentals continued: Words and representation. 

**"Off-the-shelf" tools for prototypical analysis**
- Question-driven vs. data-driven vs. model-driven vs. tool-driven

**Data-driven exploration**
- Topic modelling and clustering in text
- "Free" off-the-shelf tools for prototyping measurements in text.
    - Lexical approaches.
    - NLP Parsers.

## Recap: Fundamentals (1)
-----

What do we expect of **social** data scientists:
- Computationally grounded analysis: 
    - Thorough exploratory analysis for both transparency and validation.
    - Knowing the breadth and depth of your dataset.
- Bias investigation before performance maximization.
    - Is the bias equally distributed across e.g. social groups?
    - How will it affect my measurements?
- Ethical conscience:
    - A reflexivity about model implications for e.g. privacy.        


## Recap: Fundamentals (2)
-----
- Regular expressions: **Define, Inspect, Refine, Repeat (DIR-R)**
     - Systematic development and transparency of the process.
- Data validation and visualization: **Transparency and quality control**
    - Random inspections (doc, sentence, context)
    - Wordclouds and Wordtrees.
- Descriptive analysis:
    - Collocations and distinctive terms: PMI, $X^2$ and Mann-Whitney.


### Using the expected frequency
If the probability of a word w, given y, p(w|y) is higher than we would expect we can use it to learn something about the data.

**Measures**: $PMI$ , $X^2$
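
A minimal sketch of the expected-frequency idea, using PMI: it compares p(w|y) with the overall p(w). The counts, the class names and the `pmi` helper are made up for illustration.

```python
import math

# Hypothetical toy counts: how often each word occurs within each class y.
counts = {
    "vegan": {"tofu": 30, "bacon": 1, "the": 100},
    "not_vegan": {"tofu": 5, "bacon": 40, "the": 120},
}

def pmi(word, label, counts):
    """log2( p(w|y) / p(w) ): positive when w is more frequent in y than expected.

    Assumes the word occurs at least once in the class (otherwise log2(0) fails).
    """
    n_label = sum(counts[label].values())
    n_total = sum(sum(c.values()) for c in counts.values())
    n_word = sum(c.get(word, 0) for c in counts.values())
    p_w_given_y = counts[label].get(word, 0) / n_label
    p_w = n_word / n_total
    return math.log2(p_w_given_y / p_w)

for w in ["tofu", "bacon", "the"]:
    print(w, round(pmi(w, "vegan", counts), 2))
```

In this toy data "tofu" gets a positive score for the vegan class, "bacon" a negative one, and a function word like "the" stays close to zero.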




####  Distinctive Terms 
**Learning about classes or subgroups in data**

## $X^2$ - chi-squared
The statistical co-occurrence strength of two or more events.
Used to find two-word phrases whose words are co-located next to each other (e.g. New York), or a word occurring when a certain group of people speak.

Consider the contingency table O:


| **O** | bacon | $\neg{bacon}$ |
|---|:---:|:---:|
| **vegan** | 10 | 990 |
| **$\neg{vegan}$** | 2500 | 10000 |
    

p(w=bacon|cat=vegan) = 10 / (10+990)

p(w=bacon|cat=not_vegan) = 2500 / (2500+10000)

p(w=bacon) = (10+2500) / (10+990+2500+10000)

$X^{2}=\sum_{i, j} \frac{\left(O_{i j}-E_{i j}\right)^{2}}{E_{i j}}$
where $i$ ranges over the rows of the table, $j$ ranges over the columns, $O_{ij}$ is the observed value for cell $(i,j)$ and $E_{ij}$ is the expected value under independence,

$E_{ij} = N \, p_i \, p_j$

with $N$ the total count, $p_i$ the marginal probability of row $i$ and $p_j$ the marginal probability of column $j$.

$X^2$ expresses the difference between the co-occurrence we observe and what we would have expected to see if the two events were independent, relative to the latter. I.e. the larger $X^2$ is, the more the observed co-occurrence deviates from what independence would predict.
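
A small sketch of the $X^2$ computation for the vegan/bacon contingency table, in plain Python. The function name `chi_squared` is just an illustration; in practice a library routine would probably be used instead.

```python
# Observed counts: rows = (vegan, not vegan), columns = (bacon, not bacon).
observed = [[10, 990], [2500, 10000]]

def chi_squared(observed):
    """Return (X^2, expected table) for a contingency table of counts."""
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    chi2 = 0.0
    expected = []
    for i, row in enumerate(observed):
        expected_row = []
        for j, o_ij in enumerate(row):
            # E_ij = N * p_i * p_j = row_total * col_total / N
            e_ij = row_totals[i] * col_totals[j] / n
            chi2 += (o_ij - e_ij) ** 2 / e_ij
            expected_row.append(e_ij)
        expected.append(expected_row)
    return chi2, expected

chi2, expected = chi_squared(observed)
print(round(chi2, 1))
print([[round(e, 1) for e in row] for row in expected])
```

The expected count for (vegan, bacon) is far above the 10 we observe, which is what drives the large statistic.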


## Fundamentals (3) -  WORDS
![](https://www.incimages.com/uploaded_files/image/970x450/getty_850704072_375370.jpg)

In [4]:
"What are the boundaries between-words and meaning?".split()

['What', 'are', 'the', 'boundaries', 'between-words', 'and', 'meaning?']
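
Whitespace splitting leaves punctuation glued to the words. As a rough sketch of one alternative, a simple regular expression can separate word characters from symbols. The pattern is an assumption chosen for illustration, not a recommended tokenizer.

```python
import re

text = "What are the boundaries between-words and meaning?"

# Hypothetical pattern: runs of word characters, or any single non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

whitespace_tokens = text.split()
regex_tokens = TOKEN_PATTERN.findall(text)

print(whitespace_tokens)
print(regex_tokens)
```

Note that the regex version also splits "between-words" into three tokens, which may or may not be what you want.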

## WORDS (1)
* How we "split" / locate words in text determines the number of dimensions
    * Computational inefficiency 
    * Parameters are not shared among equivalent words.
        * It makes a difference especially for low N tasks.

## Words (2)
**ISSUES**
* Spelling mistakes, weird uses of punctuation,
* Emoticons: </3 , (:) , :-]
* Multiwords: #no-more-work, New York, Federal Bureau of Finance, word/concept



**Representation**
How to encode all relevant information in our tokens?
* Lower-casing: DO YOU REALLY WANT TO IGNORE MY ALLCAPS?!?!
    * Our feature space can potentially double if we don't lowercase.
* Numbers: Infinite combinations of digits
* Filtering: Which words to lose?
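
To make the representation trade-off concrete, here is a small sketch that counts the vocabulary under different normalization choices. The documents, the `<NUM>` placeholder and the `vocabulary` helper are invented for illustration.

```python
import re

# Hypothetical mini-corpus.
docs = [
    "I LOVE this movie",
    "i love This Movie",
    "Rated 10 out of 10",
    "Seen it 3 times since 2003",
]

def vocabulary(docs, lowercase=False, mask_numbers=False):
    """Collect the set of unique whitespace tokens after optional normalization."""
    vocab = set()
    for doc in docs:
        for token in doc.split():
            if lowercase:
                token = token.lower()
            if mask_numbers and re.fullmatch(r"\d+", token):
                token = "<NUM>"  # collapse all digit strings into one token
            vocab.add(token)
    return vocab

raw = vocabulary(docs)
lowered = vocabulary(docs, lowercase=True)
normalized = vocabulary(docs, lowercase=True, mask_numbers=True)
print(len(raw), len(lowered), len(normalized))
```

Each step shrinks the feature space, but it also throws information away: after lowercasing, "LOVE" and "love" can no longer be told apart.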

## Words (3)
**Solutions**
- n-gram collocations (PMI)
- character-based representation (e.g. LSTM)
- subword tokenization (character n-grams)
    - Good for language detection.
    - Efficiency + captures natural similarity between different grammatical forms of the same word.
    - **Examples**: FastText (character n-grams), BERT (WordPiece), BPE (https://github.com/google/sentencepiece)
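
A minimal sketch of character n-grams in the spirit of FastText, which pads words with `<` and `>` boundary markers. This is not BPE or WordPiece, and the Jaccard comparison is only an illustrative way to show the overlap between word forms.

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def jaccard(a, b):
    """Share of n-grams two words have in common."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(char_ngrams("walk"))
print(jaccard(char_ngrams("walking"), char_ngrams("walked")))
print(jaccard(char_ngrams("walking"), char_ngrams("table")))
```

Different grammatical forms of the same word share several n-grams, while unrelated words share few or none.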
     

In [445]:
print("Klatret©ªsen(Catch That Girl) is really great movie! It's a 'happy' movie. I watched this movie in 'Puchon International Fantastic Film Festival(PiFan)' on July 12nd, 2003. There is Action + Adventure + Comedy + Thrill + Happy + Romance(cute kids' love Triangle!). You must see this movie. :)")

Klatret©ªsen(Catch That Girl) is really great movie! It's a 'happy' movie. I watched this movie in 'Puchon International Fantastic Film Festival(PiFan)' on July 12nd, 2003. There is Action + Adventure + Comedy + Thrill + Happy + Romance(cute kids' love Triangle!). You must see this movie. :)


![](tokenizer_similarity.png)

![](cv_tokenizer_performance.png)

## WORDS (4) - choosing tokenizer
**Depends on task**
- no. of samples
- do you need it to fit a pretrained model
- how much subtlety do you need
     - e.g. ['"truth"'] or ['"' ,'truth','"']


I suggest you use (from the top 3):
- nltk.tokenize.casual.TweetTokenizer
- deepmoji (compatible with the deepmoji encoder we will use next term)
- potts (simple and efficient script optimized for sentiment analysis)
    - the deepmoji and potts scripts are both Python 2 based; see below for a "quickfix" and download

<pre><code> 
import nltk
tweet_tokenizer = nltk.tokenize.casual.TweetTokenizer()
tweet_tokenizer.tokenize('hello I speak emoticon and #hashtag :)')
</code></pre>


In [None]:
### download potts tokenizer
url = 'http://sentiment.christopherpotts.net/code-data/happyfuntokenizing.py'
import requests
script = requests.get(url).text
# remove testing part in python2
#section_marker = '###############################################################################'
#script = section_marker.join(script.split(section_marker)[0:-1])
# remove python2-only code and superfluous parts
remove_parts = ['''# Fix HTML character entitites:
        s = self.__html2unicode(s)''','import htmlentitydefs','''# Try to ensure unicode:
        try:
            s = unicode(s)
        except UnicodeDecodeError:
            s = str(s).encode('string_escape')
            s = unicode(s)''']
for part in remove_parts:
    script = script.replace(part,'')
script = script.replace('return words','return list(words)')
script = script.split('    def tokenize_random_tweet(self):')[0]
# the with-block closes the file, so no explicit f.close() is needed
with open('potts_tokenizer.py','w',encoding='utf-8') as f:
    f.write(script)

In [17]:
## download deepmoji tokenizer
url = 'https://raw.githubusercontent.com/bfelbo/DeepMoji/master/deepmoji/tokenizer.py'
import requests
script = requests.get(url).text
with open('deepmoji_tokenizer.py','w',encoding='utf-8') as f:
    # Python 3 has no ur'' string prefix, so rewrite those literals as raw strings
    f.write(script.replace("ur\'","r'").replace(' = ur"',' = r"'))
import deepmoji_tokenizer

**Depends on model / representation**
- If the model is sequential (e.g. LSTM), smaller parts are better and more efficient.
    * e.g. subwords: "tion", "un"
    * "don't" or "don", "'", "t"
- If the model is based on a BOW representation:
    - Depends on the no. of samples.

## Lexical approaches to text mining
**Pros**
* Comes with very low cost. 
* Fast and scalable.
* Good for prototyping results.

* Good for certain tasks (e.g. topic classification)

**Example**
- 300 million documents, more than 5 million unique tokens. How to inquire?

![](https://raw.githubusercontent.com/snorreralund/sds_dump/master/toneDK.png)



* [Afinn (is danish!)](http://neuro.imm.dtu.dk/wiki/AFINN), 
* [Liu Hu (lexical)](http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html) and 
* [Vader (lexical and rulebased)](https://github.com/cjhutto/vaderSentiment).


* **Purely lexical:** Naively matching positive words. *"You are beautiful."*
* **Rule-based:** Can adopt hard-coded rules to counter more or less simple negations. *"You are not particularly beautiful."*
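
The difference between the two can be sketched with a toy scorer. The lexicon, the negation list and the window rule below are invented for illustration and are not the rules of AFINN, Liu Hu or VADER.

```python
# Hypothetical toy lexicon; real lexicons are far larger.
LEXICON = {"beautiful": 2, "great": 2, "bad": -2, "awful": -3}
NEGATIONS = {"not", "never", "no"}

def tokenize(text):
    return text.lower().replace(".", " ").replace(",", " ").split()

def lexical_score(text):
    """Purely lexical: sum the scores of all matched words."""
    return sum(LEXICON.get(tok, 0) for tok in tokenize(text))

def rule_based_score(text, window=3):
    """Flip a word's score if a negation occurs within `window` tokens before it."""
    tokens = tokenize(text)
    score = 0
    for i, tok in enumerate(tokens):
        value = LEXICON.get(tok, 0)
        if value and NEGATIONS & set(tokens[max(0, i - window):i]):
            value = -value
        score += value
    return score

for sentence in ["You are beautiful.", "You are not particularly beautiful."]:
    print(sentence, lexical_score(sentence), rule_based_score(sentence))
```

The purely lexical scorer rates both sentences as equally positive. The rule-based one flips the second, but only because the negation happens to fall inside the window.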

**See notebook lexical_mining.ipynb for a compilation of lexicon classifiers.**

### issues with the lexical-based approach (1)

*"Pretty. Pretty actresses and actors. Pretty bad script. Pretty frequent "let's strip to our undies" scenes. Pretty fair F/X. Pretty jarring location decisions (the college dorm room looks like a high-end hotel room - probably because it was shot at a hotel). Pretty bland storyline. Pretty awful dialog. Pretty locations. Pretty annoying editing, unless you like the music video flash-cut style.This one isn't a guilty pleasure - this is more an embarrassing one. If you must watch this, pick a good dance/techno album and turn the sound off on the movie - you'll see the pretty people in their pretty black undies, and probably follow the story just fine.The cast may be able to act - I doubt that anyone could look skilled given the lines/plot that they had to deal with."*

### issues with the lexical-based approach (2)
- Atomized words: How well can meaning be derived from atomized words?
    - Does not apply to more complex rule-based versions:
        - e.g. VADER
        - Phrase-based argument dictionaries.
    - Use in connection with parsers: e.g. at whom is the emotion directed?
- What is the recall?
    - It is bad practice to use dictionaries without explicit validation.
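
One way to make the validation explicit is to compare the dictionary against a small hand-coded sample and report precision and recall. The word list and the labelled sample below are hypothetical.

```python
# Hypothetical dictionary of "positive" words.
POSITIVE_WORDS = {"great", "good", "love", "pretty"}

# Hypothetical hand-coded sample: (text, human label), 1 = positive.
validation_sample = [
    ("great film", 1),
    ("i love it", 1),
    ("what a masterpiece", 1),
    ("pretty bad script", 0),
    ("boring and slow", 0),
    ("good grief this was dull", 0),
]

def dictionary_predict(text):
    """Predict positive (1) if any token is in the dictionary."""
    return int(any(tok in POSITIVE_WORDS for tok in text.lower().split()))

def precision_recall(sample):
    tp = fp = fn = 0
    for text, label in sample:
        pred = dictionary_predict(text)
        if pred == 1 and label == 1:
            tp += 1
        elif pred == 1 and label == 0:
            fp += 1
        elif pred == 0 and label == 1:
            fn += 1
    return tp / (tp + fp), tp / (tp + fn)

precision, recall = precision_recall(validation_sample)
print(precision, recall)
```

Here the dictionary misses "masterpiece" (lower recall) and is fooled by "pretty bad" and "good grief" (lower precision), which you would not know without the hand-coded sample.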
    


### use for prototyping! 


### issues with the lexical-based approach (3)
- Conglomerates of words 
   - What is the theoretical validity of a collection of words, scraped from many sources, validated at some historical point in time, and given some score by a number of people (students, Amazon Mechanical Turk workers)?




### Construct Validity in (Automated) Content Analysis (1)
Krippendorff 2004: *"Content Analysis"*

- When constructing a variable from an interpretation of text (as opposed to survey or register-based research):
    - We have the *possibility*,
    - and the *obligation* to show that our interpretations are **correct**.
    - Be **precise**.
        - What is it that we are measuring?
        - What are we not measuring?
    - Situate it in context.
        - E.g. negative expressions at a soccer match or at a workplace should potentially be treated differently in the analysis.


### Construct Validity in (Automated) Content Analysis (2)
- Define, delimit and expose the variation in expression, and context.
    - Give examples.
    - Border cases / hard calls.
- Show that you understand your corpus and the context that the communication is made *within*.
    - Do this in a datadriven manner.
    
    - Compare to other dictionaries and show in the context of clusters.

## Discovery and datadriven learning of text data


### Exploratory analysis on high dimensional data
- 300 million documents, more than 5 million unique tokens. How to inquire?

**Clustering to reduce dimensionality and efficiently seek out the variation in your corpus**


1. Inquiry: I want to learn about my data and not overlook important and interesting phenomena.

2. Omitted variable bias:
    * e.g. expressions of sentiment cannot be viewed as unaffected by other phenomena.
    * They should be seen in context with those phenomena.
    
3. Grounding your theoretical construct: What are you measuring?


Annotation / coding schemes / theoretical categorization are fundamentally controversial,
unless you deal with a phenomenon so common-sense that "everyone" can recognize it.
Therefore performance in relation to a human gold standard is not meaningful without a thorough and empirically grounded argument/proof that you are an expert in all its manifestations, ambivalences and problems.
+ **Transparency** in relation to this.

What is performance? What is agreement with human annotators, if there is no theoretical validity?


# Modelbased discovery

## Topic modelling

![](https://cdn-images-1.medium.com/max/1200/1*pZo_IcxW1GVuH2vQKdoIMQ.jpeg)

## Topic modelling (2)
- Widely used tool in both research and industry. 
    - powering search algorithms, document retrieval, and recommendation systems. 

- Used for both measurement and discovery in the Social Sciences. 

**Pros**
- Praised for its inductive and data-driven properties.
    - I.e. we did not come to the data with preconceived theoretical ideas about what exists and what is important.
- Goes beyond atomized words, and can handle polysemy of words.
    - But still based on the BOW assumption.

## Generative models (1)
- Define a model that you believe describes the data generation process.
    - E.g. which parameters determine the probability of a network tie,
    - or of a word in a document.
 
- Define the variables and their dependencies.
    - Network: Same school, ethnicity, culture, gender. 
    - Words: Mood, speaker, social situation.

## Generative models (2)

### Naive Bayes : p(x,y) 

Simple generative model for the probability of drawing a "word" given a categorical variable.

**Example**

p(w=Yes | y=tired)

By following me every morning, we can observe and approximate p(w=Yes | y=tired).

- Using Bayes' rule:
- p(y|x) = p(x|y)*p(y) / p(x)
- We observe p(x|y), p(y) and p(x).
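
A worked sketch of this with counts: estimate p(x|y), p(y) and p(x) from observations and combine them with Bayes' rule, p(y|x) = p(x|y) p(y) / p(x). The ten observations are made up for illustration.

```python
from collections import Counter

# Hypothetical observations: (state y, word x) recorded on ten mornings.
observations = [
    ("tired", "yes"), ("tired", "yes"), ("tired", "yes"), ("tired", "no"),
    ("rested", "yes"), ("rested", "no"), ("rested", "no"),
    ("rested", "no"), ("rested", "no"), ("rested", "no"),
]

n = len(observations)
joint = Counter(observations)
state_counts = Counter(y for y, _ in observations)
word_counts = Counter(x for _, x in observations)

def p_state(y):
    return state_counts[y] / n

def p_word(x):
    return word_counts[x] / n

def p_word_given_state(x, y):
    return joint[(y, x)] / state_counts[y]

def p_state_given_word(y, x):
    """Bayes' rule: p(y|x) = p(x|y) * p(y) / p(x)."""
    return p_word_given_state(x, y) * p_state(y) / p_word(x)

print(p_word_given_state("yes", "tired"))
print(p_state_given_word("tired", "yes"))
```

The result agrees with counting directly how many of the "yes" mornings were tired ones, which is a useful sanity check on the formula.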

## Generative models (3)
### LDA
- Variables in the generative model do not have to be observable.
    - They can be **latent** (similar to neural networks)

Model definition:
- The probability of "drawing" a word depends on the topic.
    - Topics are **Latent** - i.e. not observed.
    - Topics are distributions of word probabilities.
        - Words can be present in more than one topic.
- Documents consist of multiple topics.

**Can be extended with any latent structure as well as known variables**

![](https://cdn-images-1.medium.com/max/1500/0*II7wZlKViCt4ssBm.png)


**Latent Dirichlet Allocation**

$p\left(\beta_{1 : K}, \theta_{1 : D}, z_{1 : D}, w_{1 : D}\right)$

$=\prod_{i=1}^{K} p\left(\beta_{i}\right) \prod_{d=1}^{D} p\left(\theta_{d}\right)$

$\quad\left(\prod_{n=1}^{N} p\left(z_{d, n} | \theta_{d}\right) p\left(w_{d, n} | \beta_{1 : K}, z_{d, n}\right)\right)$

$\beta_{1 : K}$ are the topics,

where each $\beta_{k}$ is a distribution over the vocabulary.


$\theta_{d}$ are the per-document topic proportions,

where $\theta_{d, k}$ is the topic proportion
for topic $k$ in document $d$.

$z_{d, n}$ is the topic assignment of the $n$-th word in document $d$.

The observed word $w_{d, n}$ depends on the topic assignment $z_{d, n}$ and all of the topics $\beta_{1 : K}$.
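
The generative story can be sketched by sampling from it. Assumptions not in the slides: symmetric Dirichlet priors with hypothetical concentrations `ALPHA` (for $\theta$) and `ETA` (for $\beta$), a six-word toy vocabulary, and a fixed document length.

```python
import random

random.seed(0)

VOCAB = ["goal", "match", "team", "vote", "party", "election"]  # toy vocabulary
K = 2        # number of topics
ALPHA = 0.5  # assumed Dirichlet concentration for theta_d
ETA = 0.5    # assumed Dirichlet concentration for beta_k

def sample_dirichlet(concentration, size):
    """Symmetric Dirichlet draw via normalized Gamma variates."""
    draws = [random.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(n_docs, doc_length):
    # beta_k: one distribution over the vocabulary per topic
    beta = [sample_dirichlet(ETA, len(VOCAB)) for _ in range(K)]
    corpus = []
    for _ in range(n_docs):
        theta = sample_dirichlet(ALPHA, K)  # topic proportions of this document
        doc = []
        for _ in range(doc_length):
            z = random.choices(range(K), weights=theta)[0]  # topic assignment z_{d,n}
            w = random.choices(VOCAB, weights=beta[z])[0]   # observed word w_{d,n}
            doc.append((z, w))
        corpus.append((theta, doc))
    return beta, corpus

beta, corpus = generate_corpus(n_docs=3, doc_length=8)
for theta, doc in corpus:
    print([round(t, 2) for t in theta], [w for _, w in doc])
```

Fitting LDA is the reverse problem: only the words are observed, and $\beta$, $\theta$ and $z$ have to be inferred.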


**Actually $\theta$ and $\beta$ are superfluous if we know the topic assignments $z_{d,n}$.**
This is the idea behind the collapsed Gibbs sampling approach: $\theta$ and $\beta$ are integrated out, and the sampling procedure only draws the topic assignments.
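
A compact sketch of a collapsed Gibbs sampler for LDA, assuming symmetric priors `alpha` and `eta` with illustrative default values. Only the topic assignments are sampled, and count tables stand in for $\theta$ and $\beta$. It is meant to show the mechanics on toy data, not to be an efficient or tuned implementation.

```python
import random

def collapsed_gibbs_lda(docs, n_topics, alpha=0.1, eta=0.01, n_iter=50, seed=0):
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    # Count tables that replace theta (doc_topic) and beta (topic_word).
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Random initial topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                v = word_id[w]
                k_old = assignments[d][n]
                # Remove the current token from the counts.
                doc_topic[d][k_old] -= 1
                topic_word[k_old][v] -= 1
                topic_total[k_old] -= 1
                # Conditional p(z = k | all other assignments), up to a constant.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + eta) / (topic_total[k] + n_words * eta)
                    for k in range(n_topics)
                ]
                k_new = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][n] = k_new
                doc_topic[d][k_new] += 1
                topic_word[k_new][v] += 1
                topic_total[k_new] += 1
    return assignments, doc_topic, topic_word, vocab

# Hypothetical toy documents, already tokenized.
toy_docs = [
    ["goal", "match", "team"],
    ["vote", "party", "election"],
    ["goal", "team", "match", "team"],
]
assignments, doc_topic, topic_word, vocab = collapsed_gibbs_lda(toy_docs, n_topics=2)
print(doc_topic)
```

Estimates of $\theta$ and $\beta$ can be read off afterwards by normalizing the smoothed `doc_topic` and `topic_word` counts. Rerunning with a different `seed` may give a different solution.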


## But... 

# Useful Models
*"All models are wrong, but some are useful"*
- We did actually come with preconceived theoretical ideas. 
    - How inductive and datadriven is the model really?


## Still remember 
**Anscombe's famous example on why you need to plot your data**
<img src="https://d2f99xq7vri1nk.cloudfront.net/Anscombe_1_0_0.png"/>



**Models with unrealistic assumptions can produce wildly misleading results**

Lancichinetti et al. 2014:

LDA collapses different languages into the same cluster given a wrong prior $\alpha$.

**Complex models are hard to fit**
- Instability of solutions and local minima.
    - (Lancichinetti et al. 2014; Chuang et al. 2015; Roberts et al. 2016; Wilkerson and Casas 2017; Gentzkow et al. 2017: 27; Agrawal et al. 2018)


**Anscombe's dinosaur friend.**
[Figure taken from here](https://www.autodeskresearch.com/publications/samestats)

![](https://d2f99xq7vri1nk.cloudfront.net/DinoSequentialSmaller.gif)

## To the rescue (the 2020 model)
![](https://article.images.consumerreports.org/prod/content/dam/CRO%20Images%202018/Cars/October/CR-Cars-InlineHero-2018-Tesla-Model-3-f-driving-10-18)


**A network-based model** - 2018 model

Hierarchical Stochastic Block Modelling - hSBM
- Gerlach, Peixoto and Altmann 2018

Fewer assumptions and more adaptive:
- You do not need to specify K (no. of topics).
- You do not need to specify a prior $\alpha$ determining the size distribution of topics.


- Not as many prebuilt tools around it.
- Hard to install (use docker)

## Prototyping(2)
- Remember that this model is also wrong. 
- Don't use as final measurement model without explicit validation.
