Table of contents: TBA

## Removal of giveaway posts

a) __Naive Bayes classification__ of FB posts to detect viral marketing.  
b) __Remove whole threads__ that started with a giveaway post.

Before removal: 114,826 documents  
After removal: 59,207 documents

TODO: hide code

In [None]:
import pandas as pd

from src.giveaway.GiveawayClassifier import GiveawayClassifier
from src.utility.general import export_serialized

In [None]:
# load the dataset you wish to work on
df = pd.read_csv(
    'data/hpv_data_reactions_copy.csv',
    parse_dates = ['time']
)

Load training data for the classifier (494 documents).  

POST-level content found to contain Marie Louise's stopwords.  
Hand labeled by one person.

In [None]:
labeled = (pd.read_csv('data/200414_giveaway_training.csv')
           # drops 2 rows with a missing label (496 rows in original file)
           .dropna(subset=['giveaway']))

X = labeled['text']
y = labeled['giveaway']

Train the Giveaway Classifier.

In [None]:
gc = GiveawayClassifier(X=X, y=y)
gc.train()
gc.report
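The internals of `GiveawayClassifier` are not shown in this notebook. As a rough illustration of what Naive Bayes text classification does, here is a minimal, dependency-free sketch of a multinomial Naive Bayes model with add-one smoothing. The function names `train_nb` and `predict_nb` and the toy English texts are made up for this example and are not part of the project code.

```python
import math
from collections import Counter, defaultdict


def train_nb(texts, labels):
    """Fit a multinomial naive Bayes model on whitespace-tokenised texts."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    doc_counts = Counter(labels)        # label -> number of documents
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def predict_nb(model, text):
    """Return the label with the highest log-posterior for one text."""
    word_counts, doc_counts, vocab = model
    n_docs = sum(doc_counts.values())
    scores = {}
    for label, n_label in doc_counts.items():
        total = sum(word_counts[label].values())
        score = math.log(n_label / n_docs)  # log prior
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                # add-one (Laplace) smoothing
                smoothed = (word_counts[label][word] + 1) / (total + len(vocab))
                score += math.log(smoothed)
        scores[label] = score
    return max(scores, key=scores.get)


# toy, made-up training data (1 = giveaway, 0 = not a giveaway)
toy_texts = [
    "win a free gift card share and like",
    "like and share to win free tickets",
    "my daughter got the vaccine last week",
    "the doctor explained the side effects",
]
toy_labels = [1, 1, 0, 0]

toy_model = train_nb(toy_texts, toy_labels)
prediction = predict_nb(toy_model, "share this post to win a free phone")
```

Short comments contain few words, so only a handful of terms contribute to the score. That is consistent with restricting classification to POST-level content.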

Classify only POST-level content in the loaded dataset.  
The model classifies short comments unreliably.

In [None]:
df_post = df.query('content_type == "POST"')

giveawas_df = (gc
               .predict_new(df_post.text, negative_for_url=True)
               .query('predicted == 1')
               .rename(columns={'index': 'id_orig'})
              )

giveawas_df

Remove the detected giveaway threads from the original dataset:  
a) find the `post_id`s of posts that were classified as giveaways  
b) drop every row that belongs to one of those threads  

In [None]:
# thread ids (post_id) of all rows flagged as giveaways
bad_threads = df.loc[giveawas_df.id_orig, 'post_id'].tolist()

# remove bad threads
S1_giveaway_removed = df.query('post_id != @bad_threads')

# save whole dataframe
S1_giveaway_removed.to_csv('data/S1_giveaway_removed.csv')

# save texts with ID
export_serialized(
    df=S1_giveaway_removed,
    column='text',
    path='data/S1_fb_texts.ndjson'
)
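The thread-level removal can be illustrated on a few made-up rows. This sketch assumes that a post and all of its comments share the same `post_id`, as the filtering above implies. The row values and the `flagged_ids` set are hypothetical.

```python
# toy, made-up rows: one POST per thread plus its comments
rows = [
    {"id": 0, "post_id": 10, "content_type": "POST", "text": "win a free phone"},
    {"id": 1, "post_id": 10, "content_type": "COMMENT", "text": "me please"},
    {"id": 2, "post_id": 20, "content_type": "POST", "text": "vaccine info meeting"},
    {"id": 3, "post_id": 20, "content_type": "COMMENT", "text": "thanks for sharing"},
]

# row ids that a classifier flagged as giveaways (hypothetical result)
flagged_ids = {0}

# a) thread ids of the flagged posts
bad_threads = {row["post_id"] for row in rows if row["id"] in flagged_ids}

# b) keep only rows (posts and comments) from the remaining threads
kept = [row for row in rows if row["post_id"] not in bad_threads]
```

Only the post was flagged, but its comment is dropped as well because it belongs to the same thread.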

<br>

## Preprocessing
_[text_to_x](https://github.com/centre-for-humanities-computing/text_to_x)_

a) __tokens__, __lemmas__, __POS__ & __dependency parsing__ using [Stanza](https://github.com/stanfordnlp/stanza)  
b) __NER__ using [Flair](https://github.com/flairNLP/flair)

This step takes a long time to run, so it is best run from the terminal rather than from the notebook.

```bash
cd hpv-vaccine
python3 src/preprocessing.py -p data/S1_fb_texts.ndjson -o data/S2_fb_prep.ndjson --lang 'da' --jobs 4
```


<br>

## Feature selection 

a) __Filter out non-meaningful parts of speech from all texts__.  
Only nouns, proper nouns, adjectives, verbs and adverbs are kept.


b) __Neural detection of phrases__.  
If two tokens appear together often, they will be concatenated into a single token.
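Step a) can be sketched as a simple tag filter. The example assumes Universal POS tags (the values Stanza exposes as `upos`) and that lemmas are what gets kept. The `filter_pos` helper and the hand-tagged Danish tokens are illustrative only and are not taken from `phraser`.

```python
# Universal POS tags for nouns, proper nouns, adjectives, verbs and adverbs
KEEP_UPOS = {"NOUN", "PROPN", "ADJ", "VERB", "ADV"}


def filter_pos(tagged_doc, keep=KEEP_UPOS):
    """Return the lemmas of tokens whose POS tag is in `keep`."""
    return [lemma for lemma, upos in tagged_doc if upos in keep]


# hypothetical, hand-tagged document: (lemma, upos) pairs
tagged = [
    ("vaccine", "NOUN"),
    ("have", "AUX"),
    ("en", "DET"),
    ("alvorlig", "ADJ"),
    ("bivirkning", "NOUN"),
    (".", "PUNCT"),
]

content_words = filter_pos(tagged)
```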

In [6]:
import ndjson

from src.utility import phraser
from src.utility.general import load_data

In [7]:
# import preprocessed data
texts_id = load_data('data/S3_prep_SUB1.ndjson')

# phraser has both a) & b) functionality
texts_phrased = phraser.train(
    texts_id,
    lang='da',
    out_path='data/S3_fb_phrase.ndjson'
)

# texts only
texts = [doc['text'] for doc in texts_phrased]
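How `phraser.train` scores token pairs is not shown here. As a simplified stand-in for step b), the sketch below merges adjacent tokens using raw co-occurrence counts only. The function `join_frequent_bigrams`, its `min_count` threshold and the toy documents are assumptions made for illustration.

```python
from collections import Counter


def join_frequent_bigrams(docs, min_count=2, sep="_"):
    """Merge adjacent token pairs that occur at least `min_count` times."""
    pair_counts = Counter(
        pair for doc in docs for pair in zip(doc, doc[1:])
    )
    phrased = []
    for doc in docs:
        out, i = [], 0
        while i < len(doc):
            pair = tuple(doc[i:i + 2])
            if len(pair) == 2 and pair_counts[pair] >= min_count:
                out.append(sep.join(pair))
                i += 2  # skip the token that was merged
            else:
                out.append(doc[i])
                i += 1
        phrased.append(out)
    return phrased


# toy, made-up tokenised documents
docs = [
    ["hpv", "vaccine", "bivirkning"],
    ["hpv", "vaccine", "pige"],
    ["pige", "bivirkning"],
]
phrased_docs = join_frequent_bigrams(docs, min_count=2)
```

A real phrase detector would normally also account for how frequent each token is on its own, so that very common words are not merged just because they are common.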

<br>

## Seed selection

a) __Train a CBOW model__  
Used to find words related to the query terms.  
Intentions behind the parameters:
- words that appear together in the whole FB post (window=20)
- frequent words, so that the seeds are generalizable (min_count=100)

_comment: potentially this could be taken care of by PmiSvdEmbeddings._

b) __Enhance phrase list__  
Add synonyms and related words to a given phrase list. The enhanced list is then used to guide the topic model.
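Finding related words in an embedding space usually comes down to ranking the vocabulary by cosine similarity to a query word. The sketch below shows that idea on made-up 3-dimensional vectors. The `most_related` helper is hypothetical, and treating `cutoff` as a minimum similarity is an assumption about what `get_related` does, not a description of it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def most_related(vectors, query, topn=2, cutoff=0.0):
    """Rank all other words by cosine similarity to `query`."""
    scored = [
        (word, cosine(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    # assumed meaning of `cutoff`: drop words below this similarity
    scored = [(w, s) for w, s in scored if s >= cutoff]
    return sorted(scored, key=lambda ws: ws[1], reverse=True)[:topn]


# made-up vectors, for illustration only
toy_vectors = {
    "bivirkning": [1.0, 0.0, 0.0],
    "symptom": [0.9, 0.1, 0.0],
    "smerte": [0.7, 0.7, 0.0],
    "konkurrence": [0.0, 0.0, 1.0],
}
related = most_related(toy_vectors, "bivirkning", topn=2, cutoff=0.5)
```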

In [3]:
from gensim.models import Word2Vec, KeyedVectors

# from src.embeddings.pmisvd import PmiSvdEmbeddings
from src.embeddings.query_ops import import_query, get_related

In [5]:
# import phrase list
query_list = import_query(
    'data/200729_hpv_query.csv',
    'da',
    'term'
)

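# train a cbow model (sg=0)
# note: `size` and `iter` are the gensim < 4.0 argument names;
# gensim >= 4.0 renamed them to `vector_size` and `epochs`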
cbow_texts = Word2Vec(
    texts,
    size=100, window=20, min_count=100,
    sg=0, hs=0,
    iter=500, workers=4
)

# get a list of words similar to those in the phrase list
query_related = get_related(cbow_texts.wv, query_list, topn=10, cutoff=0)

# save (to_csv returns None when given a path, so it cannot be chained)
query_related.to_csv('data/S4_query_related.csv')

# display a sample
query_related.head()



| | query | related | similarity | count |
|---|---|---|---|---|
| 0 | bivirkning | bivirkning | 1.0 | 1.0 |
| 1 | bivirkning | supe | 1.0 | 1.0 |
| 2 | bivirkning | spøge | 1.0 | 1.0 |
| 3 | bivirkning | studi | 1.0 | 1.0 |
| 4 | bivirkning | argument | 1.0 | 2.0 |


The seeds now have to be __manually redacted__; the next section reads the result from `data/S5_query_redacted.csv`.

<br>

## Topic modeling

In [3]:
from src.lda.asymmetric import grid_search_lda_ASM
from src.lda.seeded import grid_search_lda_SED
from src.utility.general import compile_report

In [None]:
# extract topic seeds
S5_query_redacted = pd.read_csv('data/S5_query_redacted.csv')
# one list of seed words per topic
seeds = (S5_query_redacted
         .dropna(subset=['seed'])
         .groupby('topic')['seed']
         .apply(list)
         .tolist())
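To make the shape of `seeds` concrete, here is the same grouping done with the standard library on a small, made-up stand-in for the redacted file. Only the column names `topic` and `seed` come from the cell above; the rows are invented.

```python
import csv
import io
from collections import defaultdict

# made-up stand-in for the redacted seed file;
# an empty `seed` field marks a word that was rejected
toy_csv = io.StringIO(
    "topic,seed\n"
    "side_effects,bivirkning\n"
    "side_effects,symptom\n"
    "side_effects,\n"
    "cancer,celleforandring\n"
)

by_topic = defaultdict(list)
for row in csv.DictReader(toy_csv):
    if row["seed"]:  # skip rows without a seed, like dropna(subset=['seed'])
        by_topic[row["topic"]].append(row["seed"])

# one list of seed words per topic, ordered by topic name as groupby does
toy_seeds = [by_topic[topic] for topic in sorted(by_topic)]
```

The result is a list of lists with one inner list per topic, and the topic names themselves are dropped.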

In [4]:
grid_search_lda_SED(
    texts=texts,
    seed_topic_list=[],  # empty in this test run; pass `seeds` to use the topic seeds extracted above
    n_topics_range=[5, 10, 15],
    priors_range=[(0.1, 0.01), (0.5, 0.1)],
    out_dir='models/test_seeded/',
    n_top_words=10,
    vectorizer_type='count',
    iterations=100,
    save_doc_top=True,
    verbose=False
)



In [2]:
compile_report('models/test_seeded/report_lines/')

| | model | n_top | alpha | eta | training_time | coh_score | coh_topic |
|---|---|---|---|---|---|---|---|
| 0 | 5T_1I_ | 5 | 0.1 | 0.01 | 0.166858 | -6.194405 | [-15.067980609758628, -10.474334094788585, -5.... |
| 1 | 10T_5I_ | 10 | 0.1 | 0.01 | 0.146791 | -7.527829 | [-12.058322101790363, -15.19120677519151, 6.00... |
| 2 | 5T_5I_ | 5 | 0.5 | 0.1 | 0.149735 | -9.062353 | [-16.20098896782203, -0.2618556015388681, -9.7... |
| 3 | 10T_10I_ | 10 | 0.5 | 0.1 | 0.14686 | -9.875912 | [-5.1678523293352985, 6.000089314266196e-12, -... |
| 4 | 15T_10I_ | 15 | 0.1 | 0.01 | 0.145731 | -10.372673 | [-13.78093954490413, -5.1678523293352985, -13.... |
| 5 | 15T_15I_ | 15 | 0.5 | 0.1 | 0.147313 | -11.491019 | [-9.76149884430534, -5.1678523293352985, -9.76... |
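The six rows correspond to every combination of the three topic counts and the two prior pairs passed to `grid_search_lda_SED`. The `alpha` and `eta` columns match the first and second element of each pair. A small sketch of how such a grid can be enumerated follows; how the function loops internally is an assumption.

```python
from itertools import product

n_topics_range = [5, 10, 15]
priors_range = [(0.1, 0.01), (0.5, 0.1)]  # (alpha, eta) pairs

# every combination of topic count and prior pair
grid = [
    {"n_top": n, "alpha": alpha, "eta": eta}
    for n, (alpha, eta) in product(n_topics_range, priors_range)
]
```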


In [4]:
grid_search_lda_ASM(
    texts=texts,
    n_topics_range=[5, 10, 15],
    iterations=50,
    passes=1,
    out_dir='models/test_asm_2/',
    verbose=False,
    save_doc_top=True,
)



In [7]:
compile_report('models/test_asm_2/report_lines/')

| | model | n_top | alpha | eta | training_time | coh_score | coh_topic |
|---|---|---|---|---|---|---|---|
| 0 | 5T_ASM | 5 | [0.1476479172706604, 0.18441998958587646, 0.27... | [0.2924271523952484, 0.2244250625371933, 0.290... | 0.017658 | 0.764469 | [0.9974218827628706, 0.5100127005182109, 0.317... |
| 1 | 15T_ASM | 15 | [0.054733991622924805, 0.07854487746953964, 0.... | [0.08141887933015823, 0.0693705677986145, 0.08... | 0.005427 | 0.643493 | [0.5205976647031606, 0.5191620689348048, 0.998... |
| 2 | 10T_ASM | 10 | [0.09186872839927673, 0.07223592698574066, 0.0... | [0.11311981081962585, 0.10570827126502991, 0.1... | 0.007009 | 0.600299 | [0.5205976647031606, 0.5100127005182109, 0.998... |


<br>

## Model evolution

In [1]:
import src.topicevolution.run_ntr as ntr 

In [2]:
import ndjson
with open('models/test_asm_2/doctop_mats/5T_ASM_mat.ndjson') as f:
    doctop = ndjson.load(f)

In [3]:
import pandas as pd
dt_df = pd.DataFrame(doctop)

In [5]:
ntr.kz(dt_df[0], window=3, iterations=2).mean()

0    0.077478
1    0.077479
2    0.077644
3    0.077633
4    0.077622
5    0.077447
Name: 0, dtype: float64
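The `window` and `iterations` arguments of `ntr.kz` match the usual parameterisation of a Kolmogorov-Zurbenko filter, which is a moving average applied several times in a row. Assuming that is what `ntr.kz` computes (its source is not shown here), the smoothing can be sketched as follows. The edge handling, where the window shrinks at both ends, is one possible choice and may differ from the project code.

```python
def moving_average(values, window):
    """Centred moving average; the window shrinks at the series edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def kz_filter(values, window, iterations):
    """Apply the moving average `iterations` times in a row."""
    for _ in range(iterations):
        values = moving_average(values, window)
    return values


# a single spike in an otherwise flat, made-up signal
signal = [0.0, 0.0, 3.0, 0.0, 0.0]
once = kz_filter(signal, window=3, iterations=1)
twice = kz_filter(signal, window=3, iterations=2)
```

Each extra iteration spreads the spike over more neighbouring points, which is why repeated passes give a smoother series than a single moving average.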

In [5]:
ntr.calculate_ntr(
    doc_top_prob=doctop,
    ID=range(6),
    window=[1, 2, 3],
    out_dir='models/test_asm_2/5T_ASM/'
)