Table of contents: TBA

## Removal of giveaway posts

a) __Naive Bayes classification__ of FB posts to detect viral marketing.  
b) __Removal of whole threads__ that started with a giveaway post. 

Before removal: 114,826 documents  
After removal: 59,207 documents
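
The internals of `GiveawayClassifier` are not shown in this notebook, so the following is only a minimal sketch of the general technique it is named after: a multinomial Naive Bayes text classifier with Laplace smoothing. The function names (`train_nb`, `predict_nb`), the whitespace tokenization and the toy posts are all assumptions made for illustration, not part of the project code.

```python
import math
from collections import Counter, defaultdict


def train_nb(texts, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model (hypothetical helper)."""
    word_counts = defaultdict(Counter)  # label -> token counts
    doc_counts = Counter(labels)        # label -> number of documents
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    vocab = {tok for counts in word_counts.values() for tok in counts}
    return {"word_counts": word_counts, "doc_counts": doc_counts,
            "vocab": vocab, "alpha": alpha}


def predict_nb(model, text):
    """Return the label with the highest log posterior."""
    n_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    scores = {}
    for label, n in model["doc_counts"].items():
        counts = model["word_counts"][label]
        total = sum(counts.values())
        score = math.log(n / n_docs)  # log prior
        for tok in text.lower().split():
            if tok in model["vocab"]:  # tokens never seen in training are ignored
                score += math.log((counts[tok] + alpha) / (total + alpha * vocab_size))
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up toy posts (not from the dataset); 1 = giveaway, 0 = other
toy_texts = [
    "vind en rejse del og like",
    "konkurrence vind et ur",
    "hpv vaccinen beskytter mod kræft",
    "min datter fik vaccinen i går",
]
toy_labels = [1, 1, 0, 0]

model = train_nb(toy_texts, toy_labels)
pred_giveaway = predict_nb(model, "vind et gavekort")
pred_other = predict_nb(model, "vaccinen mod hpv")
print(pred_giveaway, pred_other)
```

A real classifier would need proper tokenization and far more training data than this toy set.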

TODO: hide code

In [1]:
import pandas as pd

from src.giveaway.GiveawayClassifier import GiveawayClassifier
from src.utility.general import export_serialized

In [2]:
# load the dataset you wish to work on
df = pd.read_csv(
    'data/hpv_data_reactions_copy.csv',
    parse_dates = ['time']
)



Load the training data for the classifier (494 documents).  

The training set consists of POST-level content that matched Marie Louise's stopword list.  
It was hand-labeled by a single annotator.

In [3]:
labeled = (pd.read_csv('data/200414_giveaway_training.csv')
           # drops 2 rows with a missing label (496 rows in original file)
           .dropna(subset=['giveaway']))

X = labeled['text']
y = labeled['giveaway']

Train the Giveaway Classifier.

In [4]:
gc = GiveawayClassifier(X=X, y=y)
gc.train()
gc.report

| split | accuracy | brier_n | brier_giveaway | recall_n | recall_giveaway | precision_n | precision_giveaway |
|---|---|---|---|---|---|---|---|
| train | 0.973913 | 0.973913 | 0.026087 | 0.992832 | 0.893939 | 0.975352 | 0.967213 |
| test | 0.973154 | 0.973154 | 0.026846 | 0.984252 | 0.909091 | 0.984252 | 0.909091 |
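
For readers unfamiliar with the report columns, the sketch below shows how accuracy and per-class precision and recall are conventionally computed from true and predicted labels. It assumes that the `_giveaway` and `_n` suffixes refer to the giveaway and non-giveaway classes; the helper name `precision_recall` and the toy labels are hypothetical, and the code is not taken from `GiveawayClassifier`.

```python
def precision_recall(y_true, y_pred, positive):
    """Accuracy plus precision/recall for one class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up labels: 1 = giveaway, 0 = not a giveaway
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

report_giveaway = precision_recall(y_true, y_pred, positive=1)
report_other = precision_recall(y_true, y_pred, positive=0)
print(report_giveaway)
print(report_other)
```

Recall for the giveaway class is the number to watch here: a missed giveaway post leaves its whole thread in the dataset.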


Classify only POST-level content in the loaded dataset.  
The model classifies short comments unreliably.

In [5]:
df_post = df.query('content_type == "POST"')

giveawas_df = (gc
               .predict_new(df_post.text, negative_for_url=True)
               .query('predicted == 1')
               .rename(columns={'index': 'id_orig'})
              )

giveawas_df

| | id_orig | text | predicted |
|---|---|---|---|
| 21 | 109826 | Et GODT svar :)\n\nhttps://www.facebook.com/g... | 1.0 |
| 608 | 110413 | VIND 2 PLADSER TIL VORES OVERDÅDIGE SKALDYRSBU... | 1.0 |
| 636 | 110441 | \*\*\* TILLYKKE TIL DEN HELDIGE VINDER : Christin... | 1.0 |
| 668 | 110473 | Velkommen til Ærø 😊\nhttps://www.facebook.com/... | 1.0 |
| 705 | 110510 | Konkurrence! I vores nye elektronikbutik, Capi... | 1.0 |
| ... | ... | ... | ... |
| 4932 | 114737 | Konkurrence: Vind et valgfrit ur fra Wooden Wo... | 1.0 |
| 4990 | 114795 | Stadig ledige pladser til årets julegave-works... | 1.0 |
| 5008 | 114813 | Yoga i bjergtagende landskaber. Et alternativt... | 1.0 |
| 5010 | 114815 | Nu er det snart jul - og det vil vi gerne fejr... | 1.0 |


Remove the detected giveaway threads from the original dataset:  
a) find the `post_id`s of the posts that were labeled as giveaways  
b) drop every row (the post and all of its comments) that shares one of those `post_id`s  

In [6]:
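# look up the post_id of every post flagged as a giveaway
# (id_orig holds the index labels of those posts in df)
bad_threads = df.loc[giveawas_df['id_orig'], 'post_id'].unique().tolist()

# remove bad threads (the flagged posts and all of their comments)
S1_giveaway_removed = df[~df['post_id'].isin(bad_threads)]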

# save whole dataframe
S1_giveaway_removed.to_csv('data/S1_giveaway_removed.csv')

# save texts with ID
export_serialized(
    df=S1_giveaway_removed,
    column='text',
    path='data/S1_fb_texts.ndjson'
)
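
To make the thread-level filter concrete, here is a toy illustration on a made-up frame. The column names `post_id` and `content_type` follow the dataset above; the rows, and the `flagged_rows` list standing in for the classifier output, are hypothetical.

```python
import pandas as pd

# Three made-up threads: post_id 10 (a giveaway), 20 and 30
toy = pd.DataFrame({
    "post_id": [10, 10, 10, 20, 20, 30],
    "content_type": ["POST", "COMMENT", "COMMENT", "POST", "COMMENT", "POST"],
    "text": ["Vind en rejse!", "Deltager", "Ja tak", "Om HPV", "Tak for info", "Nyhed"],
})

# Index labels of the posts the classifier flagged (hypothetical output)
flagged_rows = [0]

# a) post_ids of the flagged posts
bad_threads = toy.loc[flagged_rows, "post_id"].unique()

# b) drop every row that belongs to one of those threads
kept = toy[~toy["post_id"].isin(bad_threads)]
print(kept)
```

Only the first row was flagged, yet both of its comments are removed as well, because the filter operates on `post_id` rather than on individual rows.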

<br>

## Preprocessing
_[text_to_x](https://github.com/centre-for-humanities-computing/text_to_x)_

a) __tokens__, __lemmas__, __POS__ & __dependency parsing__ using [Stanza](https://github.com/stanfordnlp/stanza)  
b) __NER__ using [Flair](https://github.com/flairNLP/flair)

This step takes a long time to run, 
so it is recommended that you run it from the terminal.

```bash
cd hpv-vaccine
python3 src/preprocessing.py -p data/S1_fb_texts.ndjson -o data/S2_fb_prep.ndjson --lang 'da' --jobs 4
```


<br>

## Feature selection 

a) __Filter out non-meaningful Parts of Speech from all texts__.   
Only tokens tagged NOUN, PROPN, ADJ, VERB or ADV are kept.


b) __Neural detection of phrases__.  
If two tokens appear together often, they will be concatenated into a single token.
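
The two steps can be sketched on toy data as follows. This is not the implementation in `src.utility.phraser`, which is not shown here: the sketch assumes Universal POS tags on `(lemma, tag)` pairs and merges bigrams using a plain count threshold, and the helper names `filter_pos` and `merge_frequent_bigrams` are made up for illustration.

```python
from collections import Counter

# Assumed tag set: the Universal POS names for the categories listed above
KEEP_POS = {"NOUN", "PROPN", "ADJ", "VERB", "ADV"}


def filter_pos(doc):
    """Keep only lemmas whose POS tag is in KEEP_POS."""
    return [lemma for lemma, upos in doc if upos in KEEP_POS]


def merge_frequent_bigrams(docs, min_count=2, sep="_"):
    """Join adjacent token pairs that occur at least min_count times."""
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    frequent = {pair for pair, n in bigrams.items() if n >= min_count}
    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent:
                merged.append(doc[i] + sep + doc[i + 1])
                i += 2  # skip the second half of the merged pair
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


# Made-up, already lemmatized and tagged documents
tagged = [
    [("hpv", "NOUN"), ("vaccine", "NOUN"), ("være", "AUX"), ("sikker", "ADJ")],
    [("jeg", "PRON"), ("få", "VERB"), ("hpv", "NOUN"), ("vaccine", "NOUN")],
]

filtered = [filter_pos(doc) for doc in tagged]
phrased = merge_frequent_bigrams(filtered)
print(phrased)
```

A raw count threshold is the simplest possible criterion. Phrase detectors in libraries such as gensim instead score each pair relative to how often its tokens occur on their own, which keeps very frequent words from being merged by accident.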

In [None]:
import ndjson

from src.utility import phraser
from src.utility.general import load_data

In [None]:
# import preprocessed data
texts = load_data('fb_prep.ndjson')

# phraser has both a) & b) functionality
texts_phrased = phraser.train(texts)

# display a sample


<br>

## Seed selection

a) __Train a CBOW model__  
Used to find words related to the query phrases.  
Intentions behind the parameters:
- words that appear together in the whole FB post (window=20)
- frequent words, so that the seeds are generalizable (min_count=100)

_comment: potentially this could be taken care of by PmiSvdEmbeddings._

b) __Enhance phrase list__  
Add synonyms and related words to a given phrase list. The enhanced list will be used to guide the topic model.
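
Conceptually, enhancing the phrase list amounts to a nearest-neighbour lookup in the embedding space. The sketch below illustrates this with cosine similarity on hand-made three-dimensional vectors. It is not the code behind `get_related`, whose internals are not shown here; the helpers `related_words` and `expand_seeds` and all vectors are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_words(vectors, query, topn=2):
    """Return the topn words most similar to `query`, with their scores."""
    q = vectors[query]
    scored = [(w, cosine(q, v)) for w, v in vectors.items() if w != query]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]


def expand_seeds(vectors, seeds, topn=2):
    """Map each seed that has a vector to its topn nearest neighbours."""
    return {seed: [w for w, _ in related_words(vectors, seed, topn)]
            for seed in seeds if seed in vectors}


# Made-up toy embeddings (a real model would have hundreds of dimensions)
toy_vectors = {
    "vaccine": [0.9, 0.1, 0.0],
    "vaccination": [0.8, 0.2, 0.1],
    "bivirkning": [0.1, 0.9, 0.2],
    "smerte": [0.2, 0.8, 0.3],
    "sommer": [0.0, 0.1, 0.9],
}

expanded = expand_seeds(toy_vectors, ["vaccine", "bivirkning"], topn=1)
print(expanded)
```

Seeds missing from the vocabulary are silently skipped in this sketch. With `min_count=100`, that would affect any seed occurring fewer than 100 times in the corpus.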

In [None]:
from gensim.models import Word2Vec, KeyedVectors

# from src.embeddings.pmisvd import PmiSvdEmbeddings
from src.embeddings.query_ops import import_query, get_related

In [None]:
# import phrase list
query_list = import_query('path.csv')

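# train a CBOW model (sg=0) on the phrased texts
# note: `size` and `iter` are the gensim 3.x names;
# gensim 4+ calls them `vector_size` and `epochs`
cbow_texts = Word2Vec(texts_phrased,
                      size=300, window=20, min_count=100,
                      sg=0, hs=0,
                      iter=500, workers=4)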

# get a list of words similar to those in the phrase list
query_related = get_related(cbow_texts.wv, query_list, topn=10, cutoff=100)

# save + display a sample
# (to_csv returns None when given a path, so it cannot be chained with .head())
query_related.to_csv('path.csv')
query_related.head()

The seeds now have to be __manually edited__.

<br>

## Topic modeling

In [None]:
from src.lda.asymmetric import lda_grid_search_ASM
from src.lda.seeded import lda_grid_search_SED