Table of contents: TBA

## Removal of giveaway posts

a) __Naive Bayes classification__ of FB posts to detect viral marketing.  
b) __Removal of whole threads__ that start with a giveaway post.

Before removal: 114,826 documents  
After removal: 59,207 documents
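
As an illustration of step a), here is a minimal Naive Bayes text classifier in plain Python. It is only a sketch of the general technique: the internals of `GiveawayClassifier` are not shown in this notebook, and the class name `TinyNaiveBayes`, the whitespace tokenizer and the English toy texts are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # hypothetical tokenizer: lowercase and split on whitespace
    return text.lower().split()


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter(labels)        # label -> number of documents
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        n_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            total = sum(counts.values())
            # log prior of the class
            score = math.log(self.doc_counts[label] / n_docs)
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are ignored
                    score += math.log((counts[word] + 1) / (total + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)


# toy training data (made up for illustration): 1 = giveaway, 0 = regular post
train_texts = [
    "win a free prize share this post",
    "like and share to win free tickets",
    "vaccine side effects discussed by doctors",
    "my daughter got the vaccine yesterday",
]
train_labels = [1, 1, 0, 0]

model = TinyNaiveBayes().fit(train_texts, train_labels)
giveaway_pred = model.predict("share to win a free prize")
regular_pred = model.predict("the vaccine and doctors")
```

Each class is scored by its prior plus the smoothed log-likelihood of every word, which is why short comments with few informative words are hard to classify.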

TODO: hide code

In [24]:
import pandas as pd

from src.giveaway.GiveawayClassifier import GiveawayClassifier
from src.utility.general import export_serialized

In [None]:
# load the dataset you wish to work on
df = pd.read_csv(
    'data/hpv_data_reactions_copy.csv',
    parse_dates = ['time']
)

Load the training data for the classifier (494 documents).  

The training set consists of POST-level content found to contain Marie Louise's stopwords.  
It was hand-labeled by one person.

In [None]:
labeled = (pd.read_csv('data/200414_giveaway_training.csv')
           # drops 2 rows with a missing label (496 rows in original file)
           .dropna(subset=['giveaway']))

X = labeled['text']
y = labeled['giveaway']

Train the Giveaway Classifier.

In [None]:
gc = GiveawayClassifier(X=X, y=y)
gc.train()
gc.report

Classify only POST-level content in the loaded dataset.  
The model classifies short comments unreliably.

In [None]:
df_post = df.query('content_type == "POST"')

giveawas_df = (gc
               .predict_new(df_post.text, negative_for_url=True)
               .query('predicted == 1')
               .rename(columns={'index': 'id_orig'})
              )

giveawas_df

Remove the flagged threads from the original dataset:  
a) find the `post_id` of every post that was classified as a giveaway  
b) drop all rows (posts and comments) that belong to those threads  

In [None]:
# post_ids of threads whose opening post was classified as a giveaway
# (id_orig holds the row labels of df)
bad_threads = df.loc[giveawas_df.id_orig, 'post_id'].tolist()

# remove bad threads
S1_giveaway_removed = df.query('post_id not in @bad_threads')

# save whole dataframe
S1_giveaway_removed.to_csv('data/S1_giveaway_removed.csv')

# save texts with ID
export_serialized(
    df=S1_giveaway_removed,
    column='text',
    path='data/S1_fb_texts.ndjson'
)
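
The thread-level filter can be checked on a toy frame. This is a sketch with made-up rows; `flagged_rows` stands in for the classifier output and is hypothetical.

```python
import pandas as pd

# toy dataset: three threads, two of them opened by a giveaway post
df = pd.DataFrame({
    "post_id": [1, 1, 2, 2, 3],
    "content_type": ["POST", "COMMENT", "POST", "COMMENT", "POST"],
    "text": ["win a prize", "me please", "vaccine news", "thanks", "free tickets"],
})

# hypothetical classifier output: row labels of posts flagged as giveaways
flagged_rows = [0, 4]

# thread ids to drop, then keep only rows from the remaining threads
bad_threads = df.loc[flagged_rows, "post_id"].unique()
kept = df[~df["post_id"].isin(bad_threads)]
```

Filtering on `post_id` rather than on the flagged rows is what removes the comments under a giveaway post along with the post itself.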

<br>

## Preprocessing
_[text_to_x](https://github.com/centre-for-humanities-computing/text_to_x)_

a) __tokens__, __lemmas__, __POS__ & __dependency parsing__ using [Stanza](https://github.com/stanfordnlp/stanza)  
b) __NER__ using [Flair](https://github.com/flairNLP/flair)

This step takes a long time to run, so it is recommended that you run it from the terminal.

```bash
cd hpv-vaccine
python3 src/preprocessing.py -p data/S2_text_id.ndjson -o data/S3_prep.ndjson --lang 'da' --jobs 4 --bugstring True
```


<br>

## Feature selection 

a) __Filter out non-meaningful parts of speech from all texts__.  
Only nouns, proper nouns, adjectives, verbs and adverbs (NOUN, PROPN, ADJ, VERB, ADV) are kept.


b) __Neural detection of phrases__.  
If two tokens appear together often, they will be concatenated into a single token.
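
Both steps can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the implementation behind `phraser.train`: the function names, the simple count threshold `min_count` and the toy documents are hypothetical.

```python
from collections import Counter

# universal POS tags to keep
KEEP_POS = {"NOUN", "PROPN", "ADJ", "VERB", "ADV"}


def filter_pos(doc):
    """Keep tokens whose POS tag is in KEEP_POS; doc is a list of (token, pos) pairs."""
    return [tok for tok, pos in doc if pos in KEEP_POS]


def merge_frequent_bigrams(docs, min_count=2):
    """Join adjacent token pairs seen at least min_count times into a single token."""
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    merged_docs = []
    for doc in docs:
        out, i = [], 0
        while i < len(doc):
            pair = tuple(doc[i:i + 2])
            if len(pair) == 2 and bigrams[pair] >= min_count:
                out.append("_".join(pair))
                i += 2  # skip the second half of the merged pair
            else:
                out.append(doc[i])
                i += 1
        merged_docs.append(out)
    return merged_docs


# toy tagged documents (made up for illustration)
tagged = [
    [("hpv", "NOUN"), ("vaccine", "NOUN"), ("is", "AUX"), ("safe", "ADJ")],
    [("the", "DET"), ("hpv", "NOUN"), ("vaccine", "NOUN"), ("works", "VERB")],
]

filtered = [filter_pos(doc) for doc in tagged]
phrased = merge_frequent_bigrams(filtered, min_count=2)
```

Real phrase detectors usually score a pair relative to how often its parts occur on their own, so that frequent words do not merge by chance; the raw count used here is the simplest possible stand-in.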

In [10]:
import ndjson

from src.utility import phraser
from src.utility.general import load_data

In [12]:
# import preprocessed data
texts_id = load_data('data/S3_prep.ndjson')

# phraser has both a) & b) functionality
texts_phrased = phraser.train(
    texts_id,
    lang='da',
    out_path='data/S4_fb_phrase.ndjson'
)

# texts only
texts = [doc['text'] for doc in texts_phrased]

<br>

## Seed selection

a) __Train a CBOW model__  
Used to find words related to the query terms.  
Intentions behind the parameters:
- treat words that appear together anywhere in a FB post as context (window=20)
- keep only frequent words, so that the seeds are generalizable (min_count=100)

_comment: potentially this could be taken care of by PmiSvdEmbeddings._

b) __Enhance phrase list__  
Add synonyms and related words to a given phrase list. The enhanced list will be used to guide the topic model.
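
Finding related words amounts to a nearest-neighbour search in the embedding space. A minimal sketch with cosine similarity follows; the three-dimensional toy vectors and the helper names are hypothetical, and `get_related` may work differently.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def related(vectors, query, topn=2):
    """Return the topn words most similar to query, best first."""
    scores = [
        (word, cosine(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# toy embeddings (made up for illustration)
vectors = {
    "vaccine": [0.9, 0.1, 0.0],
    "hpv": [0.8, 0.2, 0.1],
    "giveaway": [0.0, 0.1, 0.9],
}

neighbours = related(vectors, "vaccine", topn=2)
```

With a trained gensim model the same lookup is available as `model.wv.most_similar(word, topn=...)`.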

In [14]:
from gensim.models import Word2Vec, KeyedVectors

# from src.embeddings.pmisvd import PmiSvdEmbeddings
from src.embeddings.query_ops import import_query, get_related

In [15]:
# import phrase list
query_list = import_query(
    ordlist_path='data/200729_hpv_query.csv',
    lang='da',
    col='term'
)

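# train a CBOW model (sg=0)
# note: `size` and `iter` are the gensim 3.x names; gensim 4 renamed them to `vector_size` and `epochs`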
cbow_texts = Word2Vec(
    texts,
    size=100, window=20, min_count=100,
    sg=0, hs=0,
    iter=500, workers=4
)

# get a list of words similar to those in the phrase list
query_related = get_related(cbow_texts.wv, query_list, topn=10, cutoff=50)

# save the related words
query_related.to_csv('data/S5_query_related.csv')



The seeds now have to be __manually redacted__. The redacted list is read from `data/S6_query_redacted.csv` in the next section.

<br>

## Topic modeling

In [7]:
from src.lda.asymmetric import grid_search_lda_ASM
from src.lda.seeded import grid_search_lda_SED
from src.utility.general import compile_report

In [26]:
# extract topic seeds: one list of seed words per topic
# the file must contain the columns 'topic' and 'seed'
# TODO: hide this somewhere
query_redacted = pd.read_csv('data/S6_query_redacted.csv')
seeds = (query_redacted
         .dropna(subset=['seed'])
         .groupby('topic')['seed']
         .apply(list)
         .tolist())

KeyError: ['seed']
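
The `KeyError` indicates that the loaded file has no `seed` column. For reference, the extraction expects a table shaped like the toy frame below; the topic names and seed words are made up for illustration.

```python
import pandas as pd

# toy version of the redacted query list (hypothetical values)
redacted = pd.DataFrame({
    "topic": ["side_effects", "side_effects", "school", "school"],
    "seed": ["bivirkning", None, "skole", "klasse"],
})

# rows without a seed are dropped, then seeds are collected per topic
seeds = (redacted
         .dropna(subset=["seed"])
         .groupby("topic")["seed"]
         .apply(list)
         .tolist())
```

Note that `groupby` sorts the topics alphabetically, so the order of the seed lists follows the sorted topic names rather than the order of rows in the file.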

### Seeded LDA

In [None]:
grid_search_lda_SED(
    texts=texts,
    seed_topic_list=[], #feed
    n_topics_range=[5, 10, 15, 20, 25, 30],
    priors_range=[(0.1, 0.01), (0.5, 0.1)], #feed
    out_dir='models/200730_seed_no5/',
    n_top_words=10,
    vectorizer_type='count',
    iterations=500,
    save_doc_top=True,
    verbose=False
)

In [None]:
compile_report('models/test_seeded/report_lines/')

### "Asymmetric" LDA

In [16]:
grid_search_lda_ASM(
    texts=texts,
    n_topics_range=[5, 10, 15, 20, 25, 30],
    iterations=500,
    passes=2,
    out_dir='models/200811_asm/',
    verbose=False,
    save_doc_top=True,
)

In [18]:
compile_report('models/200811_asm/report_lines/')

| | model | n_top | alpha | eta | training_time | coh_score | coh_topic |
|---|---|---|---|---|---|---|---|
| 0 | 5T_ASM | 5 | [0.21222230792045593, 0.6326810717582703, 1.16... | [8.239446640014648, 3.691840648651123, 1.27305... | 22.60281 | 0.512433 | [0.41975814144694273, 0.6035394996919792, 0.55... |
| 1 | 10T_ASM | 10 | [0.8251640200614929, 0.14950509369373322, 0.05... | [0.27749359607696533, 0.3539278507232666, 0.16... | 24.408235 | 0.482693 | [0.5898778488018219, 0.33301404058971595, 0.54... |
| 2 | 15T_ASM | 15 | [0.139406219124794, 0.08482097089290619, 0.075... | [0.21338166296482086, 0.3552437126636505, 0.10... | 25.72934 | 0.452854 | [0.6071159311373029, 0.33946731414466275, 0.25... |
| 3 | 25T_ASM | 25 | [0.08592333644628525, 0.057619400322437286, 0.... | [0.05525301396846771, 0.08396890014410019, 0.0... | 29.371126 | 0.409234 | [0.47739644112148233, 0.26070581603274945, 0.3... |
| 4 | 20T_ASM | 20 | [0.06974601000547409, 0.05858924984931946, 0.0... | [0.08188818395137787, 0.09149575233459473, 0.0... | 27.228027 | 0.391767 | [0.3563420427449716, 0.2563456794991608, 0.251... |
| 5 | 30T_ASM | 30 | [0.05350644141435623, 0.0893891230225563, 0.04... | [0.03923603892326355, 0.04739493131637573, 0.0... | 32.591976 | 0.390904 | [0.2390997313596967, 0.5861149647371491, 0.307... |


<br>

## Model evolution

In [1]:
import src.topicevolution.run_ntr as ntr 

In [19]:
import ndjson

with open('models/200811_asm/doctop_mats/10T_ASM_mat.ndjson') as f:
    doctop = ndjson.load(f)

with open('data/S2_text_id.ndjson') as f:
    ids = ndjson.load(f)

In [21]:
len(ids)

59207

In [29]:
len(doctop)

51925

<br>

In [None]:
import pandas as pd
dt_df = pd.DataFrame(doctop)

In [None]:
dt_df.to_csv('10T_ASM_mat.csv')

In [None]:
ntr.kz(dt_df[0], window=3, iterations=2)
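
The name `kz` and its `window` and `iterations` arguments suggest a Kolmogorov-Zurbenko filter, which is a moving average applied repeatedly. That reading is an assumption, since the source of `ntr.kz` is not shown here. A minimal sketch of such a filter, with hypothetical function names and a shrinking window at the edges:

```python
def moving_average(values, window):
    """Centered moving average; the window shrinks at the edges (window should be odd)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def kz_filter(values, window=3, iterations=2):
    """Apply the moving average `iterations` times in a row."""
    out = list(values)
    for _ in range(iterations):
        out = moving_average(out, window)
    return out


# a single spike is spread over its neighbours after one pass
one_pass = kz_filter([0, 0, 3, 0, 0], window=3, iterations=1)
two_pass = kz_filter([0, 0, 3, 0, 0], window=3, iterations=2)
```

Each extra iteration smooths the series further, so a larger `iterations` value suppresses short-lived fluctuations in a topic's weight more strongly.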

In [None]:
with open('data/S1_fb_texts.ndjson') as f:
    tid = ndjson.load(f)

In [28]:
ids = [doc['ID'] for doc in ids]

In [32]:
ntr.calculate_ntr(
    doc_top_prob=doctop,
    ID=range(len(doctop)),  # 51,925 documents
    window=[200],
    out_dir='models/200811_asm/10T_ASM/'
)
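
Assuming `calculate_ntr` computes novelty, transience and resonance over a sliding window of documents (the source is not shown, so this is a guess based on the module name), the three quantities can be sketched as follows. Novelty is the mean divergence of a document from the `window` documents before it, transience is the mean divergence from the `window` documents after it, and resonance is their difference. The function names and toy distributions are hypothetical.

```python
import math


def kld(p, q):
    """KL divergence in bits; assumes q is non-zero wherever p is positive."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def novelty_transience_resonance(doc_top, window):
    """Return {index: (novelty, transience, resonance)} for documents with a full window."""
    results = {}
    for i in range(window, len(doc_top) - window):
        novelty = sum(
            kld(doc_top[i], doc_top[i - d]) for d in range(1, window + 1)
        ) / window
        transience = sum(
            kld(doc_top[i], doc_top[i + d]) for d in range(1, window + 1)
        ) / window
        results[i] = (novelty, transience, novelty - transience)
    return results


# toy document-topic distributions: the topic mix shifts at index 2 and then persists
doc_top = [[0.5, 0.5], [0.5, 0.5], [0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]
scores = novelty_transience_resonance(doc_top, window=1)
```

In the toy series the document at index 2 is novel (it differs from its predecessor) and not transient (its successors match it), so its resonance is positive.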

<br>

## Topic usage