## Category Development, Weak Supervision and Classification [test]
Today we are going to practice the Discovery and Exploration steps involved in a Content Analysis project. 
Finally, we are going to look at the implementation of baseline text classifiers as described in the ["baseline_classification.ipynb"](https://github.com/ulfaslak/sds_tddl_2020/blob/master/baseline_classification.ipynb) notebook. 

Overall you will learn how to
- Set up an HSBM topic model within a Docker environment (HSBM is unfortunately hard to install)
- Practice Computer-Assisted Query Building - ["Computer-Assisted Keyword and Document Set Discovery from Unstructured Text"](https://gking.harvard.edu/publications/computer-assisted-keyword-and-document-set-discovery-fromunstructured-text) and ["Building a Twitter opinion lexicon from automatically-annotated tweets"](https://www.sciencedirect.com/science/article/pii/S095070511630106X)
- Practice Weak Supervision techniques combining lexical approaches with NLP parsing systems (Stanfordnlp).


Today's exercise will use the Kaggle Toxicity Classification dataset: https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data
- Follow the URL, sign in, and download the zip file.

In [0]:
# load data
import pandas as pd
path2tox_data = 'jigsaw-unintended-bias-in-toxicity-classification/train.csv'
df = pd.read_csv(path2tox_data)
# subsample data to allow faster prototyping
df = df.sample(5000)

In [0]:
df.head()

Unnamed: 0,id,target,comment_text,severe_toxicity,obscene,identity_attack,insult,threat,asian,atheist,...,article_id,rating,funny,wow,sad,likes,disagree,sexual_explicit,identity_annotator_count,toxicity_annotator_count
1732208,6246168,0.166667,"Mostly agree. No authentic ""cowgirl"" would ev...",0.0,0.0,0.0,0.166667,0.0,,,...,394506,approved,0,0,0,0,0,0.0,0,6
321106,635727,0.0,Operational knowledge does not require a billi...,0.0,0.0,0.0,0.0,0.0,,,...,153272,approved,0,0,0,1,0,0.0,0,4
654,240885,0.0,I understand completely- I was also affected b...,0.0,0.0,0.0,0.0,0.0,,,...,32634,approved,0,0,0,0,0,0.0,0,4
1226258,5613454,0.0,Why do they do this when mango season is over?...,0.0,0.0,0.0,0.0,0.0,,,...,356626,approved,0,0,0,0,0,0.0,0,4
601897,978733,0.0,That was proved false many years ago. I could ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,167349,approved,0,0,0,3,0,0.0,4,4


# Discovery

## Setting up HSBM
HSBM is based on the network package graph-tool, which unfortunately is notoriously hard to install on an ordinary laptop. 
For those of you using Linux, you can try installing it on your own computer using commands similar to those in the Google Cloud example. 

Otherwise we have two possibilities: 
1. Create a local "server" on your own computer using Docker.
2. Use a Google Cloud server as introduced in Week 1 (see exercise 1 on how to set up Google Cloud).

### Docker solution
#### First install Docker (https://docs.docker.com/install/)
#### Enable file sharing.
Docker runs an operating system in a closed environment, so you have to explicitly share a drive to exchange files with it. 

Go to the Docker settings: right-click the Docker icon and press Settings.
Press the Resources tab.
Then the File sharing tab.
Select the drive to be shared.

### Set up the graph-tool Docker image used for HSBM.
#### Pull the image.
`docker pull tiagopeixoto/graph-tool`

#### Run the image while mounting the drive to be shared (first allow shared drives in the settings of the Docker app).
`docker run -v c:\:/mnt/c -p 8888:8888 -p 6006:6006  --name graphtool -it -u root -w /home/root tiagopeixoto/graph-tool bash`
##### In this example I shared the `c:\` drive: `c:\` specifies the path to the shared directory on the host (your computer), and `/mnt/c` refers to the path inside the Docker container.
 
#### Run Jupyter notebook and navigate to `localhost:8888` in your browser
`jupyter notebook --ip 0.0.0.0 --allow-root`

#### When using it the second time, use the following commands to simply start and attach to the container.
`docker container start graphtool`

`docker attach graphtool`



### Google Cloud solution
- Log in to the server you set up in week 1. See these instructions for setting it up (https://course.fast.ai/start_gcp.html)
- Run the following conda commands:

```
conda config --add channels conda-forge
conda config --add channels ostrokach-forge
conda config --add channels pkgw-forge
conda install gtk3 pygobject graph-tool cairo
```




## Exercise 7.1: Running an HSBM topic model.


In [0]:
# Open a new .ipynb and copy the following cell to download the TopSBM tutorial created by the authors of HSBM.
import requests
for filename in ['corpus.txt','titles.txt','sbmtm.py','TopSBM-tutorial.ipynb']:
    url = 'https://raw.githubusercontent.com/martingerlach/hSBM_Topicmodel/master/%s'%filename
    # write as UTF-8 so non-ASCII characters survive on every platform
    with open(filename,'w',encoding='utf-8') as f:
        f.write(requests.get(url).text)

Follow the tutorial 'TopSBM-tutorial.ipynb' to test out the model.



### 7.1.extra: Run the HSBM model on the Toxicity Dataset.

In [0]:
import sbmtm

""" TO DO """
# prepare documents
    # remove urls
    # remove infrequent words.

In [0]:
# create model and graph

In [0]:
# fit model

In [0]:
# investigate topics

## Exercise 7.2: Computer-Assisted Category Development
First you need to install the [Gensim package](https://radimrehurek.com/gensim/).

The package implements a range of unsupervised text methods, including classic topic models like LDA, word embedding models like Word2Vec, and an implementation of one of the early paragraph embedders, "Doc2Vec". The package furthermore comes with a set of pretrained models, including GloVe vectors, that can be used for word-based similarity search.

In this exercise you will practice using pretrained word embeddings for similarity search.


- Load a model from the Gensim [pretrained models](https://github.com/RaRe-Technologies/gensim-data) (if Gensim is not installed: `! conda install -c anaconda gensim`).
    - I suggest you use the `"glove-wiki-gigaword-300"` model.

```
## Load gensim model
import gensim
import gensim.downloader as api
glove = api.load("glove-wiki-gigaword-300")
```

- Play around with the `most_similar()` function of the model.
    - Try defining both the `positive=` and `negative=` arguments of the function. [See here for docs](https://radimrehurek.com/gensim/models/keyedvectors.html)

In [3]:
import gensim
import gensim.downloader as api
glove = api.load("glove-wiki-gigaword-300")


In [1]:
# Play around with the most_similar() function of the model.
# Each query is a (positive, negative) pair of word lists.
queries = [
    (['woman', 'king'], ['man']),
    (['cello', 'orchestra'], ['daughter']),
    (['conductor', 'orchestra'], ['driver']),
    (['bed', 'sleep'], ['desk']),
    (['sound'], ['light', 'silence']),
    (['pencil', 'paper'], ['bread']),
]

for n, (positive, negative) in enumerate(queries):
    if n > 0:
        print('\n')
    result = glove.most_similar(positive=positive, negative=negative)
    # print the nine most similar words with their similarity score
    for word, score in result[:9]:
        print("{}: {:.4f}".format(word, score))


## Build a lexicon from scratch
- Inspect 5-10 documents in the dataset to create an initial list of positive and negative words, or come up with a topic yourself (e.g. politics).
- Create an input function for evaluating new words.
    - Use the model's `most_similar(word)` function to get candidate words.
    - Run through the candidates.
    - Print the word to be evaluated.
    - Use the built-in function `input()` to get manual input.
    - Use the input to either save or discard candidate words.
    - Repeat the process until you have >20 words.
- Apply a lookup function to the dataset.
    - First tokenize the data using a tokenizer of your choice. I suggest `nltk.tokenize.casual.TweetTokenizer()`.
    - Create a function that loops through the words and counts matches with your lexicon.
- Extra: Train your own word2vec model on the full toxicity dataset.
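
One possible shape for the evaluation loop and the lookup function is sketched below. The names `build_lexicon`, `count_matches`, `get_candidates` and `ask` are hypothetical helpers invented for this sketch. The candidate source and the prompt are passed in as functions so the loop can be tried without loading the embedding model.

```python
def build_lexicon(seed_words, get_candidates, ask=input, min_words=20):
    """Grow a lexicon from seed words by manually accepting or rejecting candidates."""
    lexicon = set(seed_words)
    rejected = set()
    queue = list(seed_words)
    # Stop once the lexicon holds more than min_words words or nothing is left to expand
    while queue and len(lexicon) <= min_words:
        word = queue.pop(0)
        for candidate in get_candidates(word):
            if candidate in lexicon or candidate in rejected:
                continue
            answer = ask("Keep '{}'? [y/n] ".format(candidate))
            if answer.strip().lower() == 'y':
                lexicon.add(candidate)
                queue.append(candidate)  # accepted words are expanded later
            else:
                rejected.add(candidate)
            if len(lexicon) > min_words:
                break
    return lexicon


def count_matches(tokens, lexicon):
    """Count how many tokens of one document occur in the lexicon."""
    return sum(1 for token in tokens if token.lower() in lexicon)
```

With the GloVe model loaded, the candidate function could be something like `lambda w: [c for c, _ in glove.most_similar(w, topn=10)]`, and the tokens could come from `TweetTokenizer().tokenize(text)`.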

In [0]:
# Build a lexicon from scratch

In [0]:
# Extra: train your own word2vec model on the full toxicity dataset

## Exercise 7.3: Weak supervision combining a sentiment lexicon and a dependency parser.
In this exercise we will create a weak supervision scheme that combines the VADER sentiment analysis tool implemented in the nltk package with an NLP parser to extract relationships, i.e. who the negative sentiment is directed against.

First install the stanza package (`! pip install stanza`), formerly known as stanfordnlp.

The package supports a [multitude of languages](https://stanfordnlp.github.io/stanza/models.html), but each model must be downloaded first using the `stanza.download()` function. Here we need to download the English one: `stanza.download('en')`.

1. Import `nltk.sentiment` and initialize the VADER sentiment analyzer.

2. Apply the VADER sentiment analyzer, extracting only the 'neg' value, and plot the results in relation to the columns `['black','female','asian','homosexual_gay_or_lesbian']` as a bar chart. The columns express the social groups involved/addressed in the comment. 

3. Make a column in the dataset that expresses whether 'woman' or one of its relevant synonyms, found using the GloVe model, is mentioned in the text.

4. Compute the precision and recall of the woman identifier in relation to the 'female' column.
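
Steps 3 and 4 could look roughly like the sketch below. The keyword list, the toy texts and the 0.5 cut-off on the 'female' score are all assumptions made for illustration. In the exercise the keywords would come from the GloVe model, and the texts and scores from `df.comment_text` and `df.female`.

```python
import re

def mentions_any(text, keywords):
    """Return True if any keyword occurs as a whole word in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return not tokens.isdisjoint(keywords)

def precision_recall(predicted, actual):
    """Compute precision and recall from two equally long lists of booleans."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical keyword list and toy data
woman_words = {'woman', 'women', 'girl', 'girls', 'female'}
texts = ["Women deserve respect", "The man left",
         "She is a doctor", "That girl is smart"]
female_scores = [1.0, 0.0, 0.8, float('nan')]

predicted = [mentions_any(t, woman_words) for t in texts]
# Assumption: a comment is about women if the score is at least 0.5;
# missing scores (NaN) compare as False
actual = [score >= 0.5 for score in female_scores]
precision, recall = precision_recall(predicted, actual)
print(precision, recall)
```

A low recall here would suggest that the keyword list misses common ways of referring to women (e.g. pronouns), which is a typical limitation of lexical identifiers.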

## Stanfordnlp to match adjectives and women.

5. Initialize the parser with `stanza.Pipeline('en')` and run the NLP pipeline on the comments.

6. Transform the parsed object into a list of sentences using the function `nlp_to_dict`.
    - This will return a list of sentences, which are themselves lists of words including the parsed properties such as word class.  

7. Now we should get acquainted with the information involved. Count the most common lemmas of each word class (i.e. the 'upos' property), and print the top 10 words.

8. Apply the `extract_adj_noun()` function, which returns a dataframe of noun-adjective pairs. 

9. Keep only rows where "women" or one of its synonyms is found in the noun column.

10. Inspect the adjectives used, and compute which are significantly more common using the *chi-squared measure*:

$\chi^2$ expresses the difference between the co-occurrence we observe and the co-occurrence we would expect if the two events were independent, relative to the latter. That is, it measures how far the observed co-occurrence counts deviate from the counts expected under independence.

$\chi^{2}=\sum_{i, j} \frac{\left(O_{i j}-E_{i j}\right)^{2}}{E_{i j}}$
where $i$ ranges over the rows of the table, $j$ ranges over the columns, $O_{i j}$ is the observed value for cell $(i,j)$ and $E_{i j}$ is the expected value.
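
As a small worked example of the formula, the sketch below computes the statistic for a 2x2 table of made-up counts. The numbers and the adjective are hypothetical. The expected counts are derived from the row and column totals, which is the usual choice when testing independence.

```python
def chi_squared(observed):
    """Chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            # Expected count if rows and columns were independent
            e_ij = row_totals[i] * col_totals[j] / total
            statistic += (o_ij - e_ij) ** 2 / e_ij
    return statistic

# Hypothetical counts: rows = noun is / is not a "woman" word,
# columns = adjective is / is not "beautiful"
table = [[30, 70],
         [20, 380]]
print(round(chi_squared(table), 2))
```

Here the expected count for the top-left cell is 100 * 50 / 500 = 10, while 30 were observed, so that cell alone contributes (30 - 10)^2 / 10 = 40 to the sum. Note that the function divides by the expected count, so it assumes no row or column of the table is entirely zero.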
 

11. Match the adjectives to the positive and negative word lists precompiled in the NLTK package. 
```
from nltk.corpus import opinion_lexicon
positive_w = set(opinion_lexicon.positive())
negative_w = set(opinion_lexicon.negative())
```
12. Finally, wrap this into a function that returns true if both a "women" word (or synonym) and a negative word are present.
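
A simple bag-of-words version of the function in step 12 might look like the sketch below. The function name and the two small word sets are hypothetical. In the exercise you would pass your own woman-word list and the `negative_w` set from the opinion lexicon.

```python
import re

def is_negative_about_women(text, woman_words, negative_words):
    """Weak label: True if the text contains both a woman word and a negative word."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    has_woman = not tokens.isdisjoint(woman_words)
    has_negative = not tokens.isdisjoint(negative_words)
    return has_woman and has_negative

# Hypothetical word sets for illustration
woman_words = {'woman', 'women'}
negative_words = {'stupid', 'awful'}
print(is_negative_about_women("Women are stupid", woman_words, negative_words))
```

This version only checks co-occurrence within the comment. A stricter variant would only count negative adjectives that `extract_adj_noun()` links to a woman noun, which should give fewer but more precise labels.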



In [0]:
import stanza

In [0]:
# 5
 
# download English model
# initialize English neural pipeline
# apply to df

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/master/resources_1.0.0.json: 116kB [00:00, 18.5MB/s]                    
2020-03-18 20:06:15 INFO: Downloading default packages for language: en (English)...
Downloading http://nlp.stanford.edu/software/stanza/1.0.0/en/default.zip: 100%|██████████| 402M/402M [00:52<00:00, 7.61MB/s] 
2020-03-18 20:07:16 INFO: Finished downloading models and saved to /home/snorre/stanza_resources.
2020-03-18 20:07:16 INFO: Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | ewt       |
| pos       | ewt       |
| lemma     | ewt       |
| depparse  | ewt       |
| ner       | ontonotes |

2020-03-18 20:07:17 INFO: Use device: cpu
2020-03-18 20:07:17 INFO: Loading: tokenize
2020-03-18 20:07:17 INFO: Loading: pos
2020-03-18 20:07:18 INFO: Loading: lemma
2020-03-18 20:07:18 INFO: Loading: depparse
2020-03-18 20:07:19 INFO: Loading: ner
2020-03-18 20:07:19 INFO: Done loadi

In [0]:
"""Helper function"""

def nlp_to_dict(doc, keep_sentence=True):
    """Convert a parsed stanza document into word dictionaries.

    Returns a list of sentences (each a list of word dicts) if keep_sentence
    is True, otherwise one flat list of word dicts.
    """
    dependencies = []
    for sent in doc.sentences:
        words = []
        for word in sent.words:
            d = word.to_dict()
            # store the word id as an integer index
            d['index'] = int(d['id'])
            words.append(d)
        if keep_sentence:
            dependencies.append(words)
        else:
            dependencies.extend(words)
    return dependencies

df['nlp_parse'] = df.nlp.apply(nlp_to_dict)

In [0]:
# 6

# apply nlp_to_dict function

# 7

# count lemmas
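
One way to count the lemmas per word class is sketched below. The function name `top_lemmas_by_upos` is hypothetical, and the toy input only mimics the 'lemma' and 'upos' keys of the word dictionaries returned by `nlp_to_dict`.

```python
from collections import Counter, defaultdict

def top_lemmas_by_upos(parsed_docs, n=10):
    """Count lemmas per word class over documents parsed into lists of sentences."""
    counts = defaultdict(Counter)
    for sentences in parsed_docs:
        for sentence in sentences:
            for word in sentence:
                counts[word.get('upos')][word.get('lemma')] += 1
    # Keep the n most common lemmas for each word class
    return {upos: counter.most_common(n) for upos, counter in counts.items()}

# Toy stand-in for df['nlp_parse']: two documents with one sentence each
parsed_docs = [
    [[{'lemma': 'woman', 'upos': 'NOUN'}, {'lemma': 'be', 'upos': 'AUX'},
      {'lemma': 'smart', 'upos': 'ADJ'}]],
    [[{'lemma': 'woman', 'upos': 'NOUN'}, {'lemma': 'man', 'upos': 'NOUN'}]],
]
top = top_lemmas_by_upos(parsed_docs)
for upos, lemmas in top.items():
    print(upos, lemmas)
```

On the real data the call would be `top_lemmas_by_upos(df['nlp_parse'])`.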


In [0]:
"""Helper functions"""

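import networkx as nx
import pandas as pd


def make_network(d):
    """Build an undirected dependency graph from a parsed document.

    d is a list of sentences as returned by nlp_to_dict. Nodes are named
    '<sentence number>_<word id>'; i2d maps node names to word dicts.
    """
    i2d = {}
    g = nx.Graph()
    for num, sentence in enumerate(d):
        for word in sentence:
            id_ = '%d_%d' % (num, int(word['id']))
            i2d[id_] = word
            parent = word['head']
            # the root of a sentence has head 0 and gets no parent edge
            if parent == 0:
                continue
            parent = '%d_%d' % (num, parent)
            g.add_edge(id_, parent)
    # make full network: also connect the neighbours of each node to each other
    for i in g:
        for j in g[i]:
            for k in g[i]:
                g.add_edge(j, k)
    return g, i2d


def extract_adj_noun(d):
    """Return a dataframe of adjective-noun pairs that are linked in the graph."""
    g, i2d = make_network(d)
    pairs = []
    for i, word in i2d.items():
        if word['upos'] != 'ADJ' or i not in g:
            continue
        for n in g[i]:
            noun = i2d[n]
            if noun['upos'] == 'NOUN':
                pairs.append({'adj_text': word['text'], 'adj_lemma': word['lemma'],
                              'noun_text': noun['text'], 'noun_lemma': noun['lemma']})
    return pd.DataFrame(pairs)

# example: adjective-noun pairs of the first parsed comment
adjnoun_df = extract_adj_noun(df.nlp_parse.values[0])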

In [0]:
# 8: Apply the extract_adj_noun()

# Use the resulting df and (9) keep only rows where "women" + synonyms is found in the noun column.

# 10: Inspect the adjectives used, and compute which are significantly more common using the Chi Squared measure


## Exercise 7.4: Compiling the lexicons. 
- Here you should follow the instructions in the lexical_methods.ipynb notebook for compiling and downloading the lexicons. 
- Alternatively, you can download the precompiled lexicon functions that the notebook builds, using the link supplied on Absalon. 
- Download the lexical_mining.py script that wraps around all the individual lexicon functions.
- Import the lexical_mining script and apply it to the toxicity dataset using the `.lexical_mining()` function.
- Compare the different variables in relation to the groups `['black','female','asian','homosexual_gay_or_lesbian']` and see if the different sentiment analysis methods agree on which group is receiving the most hostility.
    - Do this by computing the mean of the different sentiment variables `['vader_neg','afinn_afinn','liu_negative_count','hedometer_happiness']` in relation to each group. 
    - Then compare the ranking made by each sentiment variable.
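
The comparison in the last two bullets could be sketched as follows, using plain dictionaries in place of dataframe rows. The helper names, the toy values and the 0.5 threshold for deciding that a comment mentions a group are assumptions. Note that for a happiness score a higher value presumably means less hostility, so the sort direction is reversed for that variable.

```python
from statistics import mean

def group_means(rows, groups, variables, threshold=0.5):
    """Mean of each sentiment variable over the comments that mention each group."""
    result = {}
    for group in groups:
        # A comment counts as mentioning the group if its score reaches the
        # threshold; missing values (None or NaN) never do.
        members = [r for r in rows
                   if r.get(group) is not None and r[group] >= threshold]
        if members:
            result[group] = {v: mean(r[v] for r in members) for v in variables}
    return result

def rank_groups(means, variable, higher_is_worse=True):
    """Order groups from most to least hostile according to one variable."""
    return sorted(means, key=lambda g: means[g][variable], reverse=higher_is_worse)

# Toy rows with made-up values
rows = [
    {'black': 1.0, 'female': 0.0, 'vader_neg': 0.4, 'hedometer_happiness': 5.0},
    {'black': 0.0, 'female': 1.0, 'vader_neg': 0.2, 'hedometer_happiness': 6.0},
    {'black': None, 'female': 0.6, 'vader_neg': 0.1, 'hedometer_happiness': 6.5},
]
means = group_means(rows, ['black', 'female'], ['vader_neg', 'hedometer_happiness'])
print(rank_groups(means, 'vader_neg'))
print(rank_groups(means, 'hedometer_happiness', higher_is_worse=False))
```

If the two printed rankings agree, the methods point at the same group. With a dataframe, `df.to_dict('records')` would give rows in this shape.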

In [0]:
import lexical_mining as lex

# your code