Table of contents:
TODO

## Removal of giveaway posts

a) __Naive Bayes classification__ of FB posts to detect viral marketing.  
b) __Removal of whole threads__ that started with a giveaway post. 

Before removal: 114,826 documents  
After removal: 59,207 documents
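
To illustrate step a), here is a minimal, dependency-free sketch of multinomial Naive Bayes with add-one smoothing. It is not the implementation behind `GiveawayClassifier`, whose internals are not shown here; the function names `train_nb` and `predict_nb` and the toy texts are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(texts, labels):
    """Fit a multinomial Naive Bayes model on whitespace-tokenized texts."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    class_counts = Counter(labels)      # label -> number of documents
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab


def predict_nb(model, text):
    """Return the label with the highest log posterior for one text."""
    word_counts, class_counts, vocab = model
    n_docs = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        total = sum(word_counts[label].values())
        score = math.log(n_label / n_docs)  # log prior
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # add-one smoothing avoids zero probabilities
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# toy training data (invented): 1 = giveaway, 0 = not a giveaway
toy_texts = [
    "konkurrence vind en rejse",
    "vind to billetter del og like",
    "hpv vaccinen har bivirkninger",
    "min datter fik vaccinen i går",
]
toy_labels = [1, 1, 0, 0]

model = train_nb(toy_texts, toy_labels)
pred_giveaway = predict_nb(model, "vind en vaccine konkurrence")
pred_other = predict_nb(model, "vaccinen har bivirkninger")
```

On this toy data the first text is scored as a giveaway and the second is not, because words such as "vind" and "konkurrence" only occur in the giveaway examples.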


In [7]:
import pandas as pd

from src.giveaway.GiveawayClassifier import GiveawayClassifier
from src.utility.general import export_serialized

In [2]:
# load the dataset you wish to work on
df = pd.read_csv(
    'data/hpv_data_reactions_copy.csv',
    parse_dates = ['time']
)



Load training data for the classifier (494 documents).  

The training set is POST-level content found to contain Marie Louise's stopwords.  
It was hand-labeled by one person.

In [3]:
labeled = (pd.read_csv('data/200414_giveaway_training.csv')
           # drops 2 rows with a missing label (496 rows in original file)
           .dropna(subset=['giveaway']))

X = labeled['text']
y = labeled['giveaway']

Train the Giveaway Classifier.

In [4]:
gc = GiveawayClassifier(X=X, y=y)
gc.train()
gc.report

| | accuracy | brier_n | brier_giveaway | recall_n | recall_giveaway | precision_n | precision_giveaway |
|---|---|---|---|---|---|---|---|
| train | 0.973913 | 0.973913 | 0.026087 | 0.992832 | 0.893939 | 0.975352 | 0.967213 |
| test | 0.973154 | 0.973154 | 0.026846 | 0.984252 | 0.909091 | 0.984252 | 0.909091 |


Classify only POST-level content in the loaded dataset.  
The model classifies short comments unreliably.

In [5]:
df_post = df.query('content_type == "POST"')

giveawas_df = (gc
               .predict_new(df_post.text, negative_for_url=True)
               .query('predicted == 1')
               .rename(columns={'index': 'id_orig'})
              )

giveawas_df

| | id_orig | text | predicted |
|---|---|---|---|
| 21 | 109826 | Et GODT svar :)\n\nhttps://www.facebook.com/g... | 1.0 |
| 608 | 110413 | VIND 2 PLADSER TIL VORES OVERDÅDIGE SKALDYRSBU... | 1.0 |
| 636 | 110441 | \*\*\* TILLYKKE TIL DEN HELDIGE VINDER : Christin... | 1.0 |
| 668 | 110473 | Velkommen til Ærø 😊\nhttps://www.facebook.com/... | 1.0 |
| 705 | 110510 | Konkurrence! I vores nye elektronikbutik, Capi... | 1.0 |
| ... | ... | ... | ... |
| 4932 | 114737 | Konkurrence: Vind et valgfrit ur fra Wooden Wo... | 1.0 |
| 4990 | 114795 | Stadig ledige pladser til årets julegave-works... | 1.0 |
| 5008 | 114813 | Yoga i bjergtagende landskaber. Et alternativt... | 1.0 |
| 5010 | 114815 | Nu er det snart jul - og det vil vi gerne fejr... | 1.0 |


Remove the flagged threads from the original dataset:  
a) find the post_ids of posts that were labeled as giveaways  
b) filter out every thread with one of those post_ids  

In [6]:
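# post_ids of the threads opened by a predicted giveaway
# (id_orig holds row labels of df)
bad_threads = df.loc[giveawas_df.id_orig, 'post_id'].tolist()

# remove bad threads: drop every row whose post_id is in the list
S1_giveaway_removed = df[~df['post_id'].isin(bad_threads)]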

# save whole dataframe
S1_giveaway_removed.to_csv('data/S1_giveaway_removed.csv')

# save texts with ID
export_serialized(
    df=S1_giveaway_removed,
    column='text',
    path='data/S2_text_id.ndjson'
)
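
The thread-level filter can be sketched without pandas on toy records. This assumes, as the steps above imply, that a post and all of its comments share the same `post_id`; the records and the `flagged_ids` list are invented for illustration.

```python
# toy records (invented): one row per post or comment, keyed by row id
rows = {
    0: {"post_id": "A", "content_type": "POST", "text": "Konkurrence! Vind en rejse"},
    1: {"post_id": "A", "content_type": "COMMENT", "text": "Deltager"},
    2: {"post_id": "B", "content_type": "POST", "text": "Nyt om HPV-vaccinen"},
    3: {"post_id": "B", "content_type": "COMMENT", "text": "Tak for info"},
}

# row ids of POSTs flagged as giveaways (assumed classifier output)
flagged_ids = [0]

# a) collect the post_id of every flagged post
bad_threads = {rows[i]["post_id"] for i in flagged_ids}

# b) drop every row, post or comment, that belongs to a flagged thread
kept = {i: r for i, r in rows.items() if r["post_id"] not in bad_threads}
```

Only the rows of thread "B" remain in `kept`, which is why flagging a single post can remove many documents at once.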

<br>

## Preprocessing
_[text_to_x](https://github.com/centre-for-humanities-computing/text_to_x)_

a) __tokens__, __lemmas__, __POS__ & __dependency parsing__ using [Stanza](https://github.com/stanfordnlp/stanza)  
b) __NER__ using [Flair](https://github.com/flairNLP/flair)

This step takes a long time to run, so it is recommended that you run it from the terminal.

```bash
cd hpv-vaccine
python3 src/preprocessing.py -p data/S2_text_id.ndjson -o data/S3_prep.ndjson --lang 'da' --jobs 4 --bugstring True
```


<br>

## Feature selection 

a) __Filter out non-meaningful parts of speech from all texts__.   
Only nouns, proper nouns, adjectives, verbs and adverbs are kept.


b) __Neural detection of phrases__.  
If two tokens appear together often, they will be concatenated into a single token.
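
A minimal sketch of steps a) and b) on toy data. It assumes tokens arrive as (lemma, POS) pairs with Universal Dependencies tags, and it swaps in a simple count-based co-occurrence score for the phrase detection, so `phraser.train` may well work differently. The function names, `min_count` and `threshold` are hypothetical.

```python
from collections import Counter

# POS tags to keep (assumed to be Universal Dependencies tags)
KEEP_POS = {"NOUN", "PROPN", "ADJ", "VERB", "ADV"}


def filter_pos(doc):
    """Keep only the lemmas whose POS tag is in KEEP_POS."""
    return [lemma for lemma, pos in doc if pos in KEEP_POS]


def join_phrases(docs, min_count=2, threshold=0.5):
    """Concatenate adjacent token pairs that co-occur often."""
    unigrams = Counter(tok for doc in docs for tok in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))

    def is_phrase(a, b):
        n_ab = bigrams[(a, b)]
        # share of the rarer word's occurrences that fall inside the pair
        return n_ab >= min_count and n_ab / min(unigrams[a], unigrams[b]) >= threshold

    out = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and is_phrase(doc[i], doc[i + 1]):
                merged.append(doc[i] + "_" + doc[i + 1])
                i += 2  # both tokens were consumed by the phrase
            else:
                merged.append(doc[i])
                i += 1
        out.append(merged)
    return out


# toy tagged documents (invented for illustration)
tagged = [
    [("den", "DET"), ("praktiserende", "ADJ"), ("læge", "NOUN"),
     ("anbefale", "VERB"), ("vaccine", "NOUN")],
    [("min", "DET"), ("praktiserende", "ADJ"), ("læge", "NOUN"),
     ("sige", "VERB"), ("nej", "INTJ")],
]
filtered = [filter_pos(doc) for doc in tagged]
phrased = join_phrases(filtered)
```

Here "praktiserende" and "læge" always occur together, so they end up as the single token `praktiserende_læge`, while the determiners and the interjection are dropped beforehand.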

In [1]:
import ndjson

from src.utility import phraser
from src.utility.general import load_data

In [2]:
# import preprocessed data
texts_id = load_data('data/infomedia_prep.ndjson')

# phraser has both a) & b) functionality
texts_phrased = phraser.train(
    texts_id,
    lang='da',
    out_path='data/S4_infomedia_phrase.ndjson'
)

# texts only
texts = [doc['text'] for doc in texts_phrased]
# ids only
ids = [doc['id'] for doc in texts_phrased]

In [3]:
# in case you don't want to run the phraser each time
# text data
import pickle

with open('data/Infomedia/da_hpv_seed_model.pcl', 'rb') as f:
    data_im = pickle.load(f)

# texts only
texts_im = data_im['data']
# ids only
ids_im = data_im['dates']

In [11]:
texts_im = [doc.split() for doc in texts_im]

<br>

## Seed selection

a) __Train a CBOW model__  
Used to find words related to the query terms.  
Intentions behind the parameters:
- words that appear together in the whole FB post (window=20)
- frequent words, so that the seeds are generalizable (min_count=100)

_comment: potentially this could be taken care of by PmiSvdEmbeddings._

b) __Enhance phrase list__  
Add synonyms and related words to a given phrase list. The enhanced list will be used to guide the topic model.

In [3]:
from gensim.models import Word2Vec, KeyedVectors

# from src.embeddings.pmisvd import PmiSvdEmbeddings
from src.embeddings.query_ops import import_query, get_related

Import desired seeds in a long csv format.  
The seeds to be enhanced are in a single column (col).

In [4]:
# import phrase list
query_list = import_query(
    ordlist_path='data/200818_hpv_query.csv',
    lang='da',
    col='term'
)

2020-08-26 22:32:06 INFO: Loading these models for language: da (Danish):
| Processor | Package |
-----------------------
| tokenize  | ddt     |
| pos       | ddt     |
| lemma     | ddt     |

2020-08-26 22:32:06 INFO: Use device: gpu
2020-08-26 22:32:06 INFO: Loading: tokenize
2020-08-26 22:32:09 INFO: Loading: pos
2020-08-26 22:32:09 INFO: Loading: lemma
2020-08-26 22:32:09 INFO: Done loading processors!


Train the CBOW model and get {topn} related words to each term.  
A related word must appear at least {cutoff} times in the dataset (here, 50 times).

In [5]:
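# train a cbow model (sg=0)
# note: `size` and `iter` are the gensim 3.x parameter names;
# gensim >= 4.0 renamed them to `vector_size` and `epochs`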
cbow_texts = Word2Vec(
    texts,
    size=100, window=20, min_count=100,
    sg=0, hs=0,
    iter=500, workers=4
)

# get a list of words similar to those in the phrase list
query_related = get_related(cbow_texts.wv, query_list, topn=10, cutoff=50)
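
The idea behind the lookup can be sketched as a cosine-similarity ranking with a frequency cutoff. This is an assumption about what `get_related` does rather than its actual code; `related_words`, the 2-d vectors and the counts below are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def related_words(term, vectors, counts, topn=10, cutoff=50):
    """Rank other words by cosine similarity to `term`, skipping rare words."""
    candidates = [
        (word, cosine(vectors[term], vec))
        for word, vec in vectors.items()
        if word != term and counts[word] >= cutoff
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:topn]


# toy 2-d embeddings and corpus counts (invented for illustration)
vectors = {
    "vaccine": [1.0, 0.1],
    "vaccination": [0.9, 0.2],
    "bivirkning": [0.2, 1.0],
    "gardasil": [1.0, 0.0],
}
counts = {"vaccine": 900, "vaccination": 400, "bivirkning": 300, "gardasil": 12}

top = related_words("vaccine", vectors, counts, topn=2, cutoff=50)
top_words = [word for word, _ in top]
```

Although "gardasil" is the closest vector in this toy example, it is excluded because it occurs fewer than `cutoff` times, which keeps rare words out of the seed list.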



The model can also be browsed from here:

In [None]:
get_related(cbow_texts.wv, ['køn'], topn=10, cutoff=50)

Add topic labels & export

In [8]:
# add topic labels to the enhanced list
topic = pd.read_csv('data/200818_hpv_query.csv')
enhanced_topic = pd.merge(query_related, topic, on='term')

# save
(enhanced_topic
 .to_csv('data/S5_infomedia_query_related.csv'))

Now the seeds have to be __manually redacted__.

<br>

## Topic modeling

In [7]:
from itertools import product

import pandas as pd

from src.lda.asymmetric import grid_search_lda_ASM
from src.lda.seeded import grid_search_lda_SED
from src.utility.general import compile_report

In [5]:
# extract topic seeds
S6_query_redacted = pd.read_csv('data/S6_query_redacted.csv')
seeds = (S6_query_redacted
         .dropna(subset=['related'])
         .groupby('topic')['related']
         .apply(list)
         .to_frame()
         .related
         .tolist())

In [6]:
len(seeds)

12
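
For orientation, the shape of `seeds` is a list with one list of seed words per topic. The same grouping can be sketched in plain Python on toy rows; the topic names and words are invented, and `None` stands in for a missing value that `dropna` would remove.

```python
from collections import defaultdict

# toy rows of a redacted seed table (invented for illustration)
rows = [
    {"topic": "side_effects", "related": "bivirkning"},
    {"topic": "side_effects", "related": "smerte"},
    {"topic": "cancer", "related": "livmoderhalskræft"},
    {"topic": "cancer", "related": None},
]

by_topic = defaultdict(list)
for row in rows:
    if row["related"] is not None:  # mirrors dropna(subset=['related'])
        by_topic[row["topic"]].append(row["related"])

# one list of seed words per topic, ordered by topic name as groupby does
toy_seeds = [by_topic[topic] for topic in sorted(by_topic)]
```

The number of inner lists equals the number of seeded topics, which is what `len(seeds)` reports.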

### Seeded LDA

a) pick a folder to save the results to (`batch_sed`)  
b) pick priors (`priors_range`). Each tuple is a pair of alpha and eta.  
c) train using `grid_search_lda_SED()`  
d) evaluate models by topic coherence using `compile_report()`  

In [7]:
# please change destination folder here
batch_sed = 'models/200826_seed_prior_test/'

In [8]:
# pick priors
alpha_range = [0.05, 0.1, 0.5, 1, 5]
eta_range = [0.05, 0.1, 0.5, 1, 5]

priors_range = list(product(alpha_range, eta_range))
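
A quick way to gauge the cost of the grid search before training: the five alpha values and five eta values give 25 prior pairs. Assuming one model is trained per combination of topic count and prior pair (an assumption about `grid_search_lda_SED`, not confirmed here), the grid grows as sketched below; `n_topics_example` is a hypothetical range used only for the arithmetic.

```python
from itertools import product

alpha_range = [0.05, 0.1, 0.5, 1, 5]
eta_range = [0.05, 0.1, 0.5, 1, 5]
priors_example = list(product(alpha_range, eta_range))

# hypothetical topic counts, only to show how the grid grows
n_topics_example = [10, 15, 20]

# every (n_topics, alpha, eta) combination would be one model to train
grid = [(n, alpha, eta) for n in n_topics_example for alpha, eta in priors_example]

n_prior_pairs = len(priors_example)
n_models = len(grid)
```

With three topic counts this already amounts to 75 models, so it can be worth narrowing either range before a long run.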

In [None]:
# train
grid_search_lda_SED(
    texts=texts,
    seed_topic_list=seeds,
    n_topics_range=[16, 17, 18, 19, 21, 22, 23, 24, 26, 27, 28, 29],
    priors_range=priors_range,
    out_dir=batch_sed,
    n_top_words=20,
    seed_confidence=0.5,
    iterations=2000,
    save_doc_top=True,
    verbose=False
)



In [10]:
# evaluate
compile_report(batch_sed + 'report_lines/')

| | model | n_top | alpha | eta | training_time | coh_score | coh_topic |
|---|---|---|---|---|---|---|---|
| 0 | 25T_005A_1E_seed | 25 | 0.05 | 1.00 | 129.797871 | 0.580126 | [0.6230598251022388, 0.3769565218385562, 0.594... |
| 1 | 12T_005A_1E_seed | 12 | 0.05 | 1.00 | 78.838708 | 0.579669 | [0.6005169490417496, 0.536123988406887, 0.5490... |
| 2 | 12T_5A_05E_seed | 12 | 5.00 | 0.50 | 101.886114 | 0.577311 | [0.456008644879629, 0.6529585071548172, 0.6124... |
| 3 | 15T_5A_01E_seed | 15 | 5.00 | 0.10 | 114.265681 | 0.576365 | [0.3722449200291431, 0.6426204464777522, 0.503... |
| 4 | 20T_05A_005E_seed | 20 | 0.50 | 0.05 | 131.331287 | 0.574940 | [0.5590831074416361, 0.6404184041254657, 0.344... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 139 | 20T_5A_5E_seed | 20 | 5.00 | 5.00 | 127.369119 | 0.359965 | [0.29378898939245285, 0.31640463774554195, 0.3... |
| 140 | 30T_1A_1E_seed | 30 | 1.00 | 1.00 | 163.682590 | 0.356737 | [0.39160667937727334, 0.3144211502160263, 0.33... |
| 141 | 30T_1A_5E_seed | 30 | 1.00 | 5.00 | 154.842928 | 0.347539 | [0.3182289190822557, 0.3421898944467148, 0.409... |
| 142 | 15T_5A_5E_seed | 15 | 5.00 | 5.00 | 104.974241 | 0.323242 | [0.3415776480756073, 0.24683031095031857, 0.25... |


### "Asymmetric" LDA

In [4]:
# please change destination folder here
batch_asm = 'models/200903_asm_infomedia/'

In [None]:
grid_search_lda_ASM(
    texts=texts_im,
    n_topics_range=range(5, 31, 1),
    iterations=2000,
    passes=2,
    out_dir=batch_asm,
    verbose=False,
    save_doc_top=True,
)

In [None]:
compile_report(batch_asm + 'report_lines/')

<br>

## Model evolution

In [None]:
import src.topicevolution.run_ntr as ntr 

In [None]:
# is there a better way of solving this?
# couldn't we use some batch_asm trick?
import ndjson

with open('models/200811_asm/doctop_mats/10T_ASM_mat.ndjson') as f:
    doctop = ndjson.load(f)

In [None]:
len(doctop) == len(ids)

In [None]:
ntr.process_windows(
    doc_top_prob=doctop,
    ID=ids,
    window=[50, 100, 200],
    out_dir='models/200811_asm/ntr/10T_ASM/'
)

<br>

## Topic usage