Table of contents: TBA

## Removal of giveaway posts

a) __Naive Bayes classification__ of Facebook posts to detect viral marketing (giveaways).  
b) __Removal of whole threads__ that start with a giveaway post. 

Before removal: 114,826 documents  
After removal: 59,207 documents

TODO: hide code

In [None]:
import pandas as pd

from src.giveaway.GiveawayClassifier import GiveawayClassifier
from src.utility.general import export_serialized

In [None]:
# load the dataset you wish to work on
df = pd.read_csv(
    'data/hpv_data_reactions_copy.csv',
    parse_dates = ['time']
)

Load training data for the classifier (494 documents).  

The training set consists of POST-level content that contains Marie Louise's stopwords.  
It was hand-labeled by one person.

In [None]:
labeled = (pd.read_csv('data/200414_giveaway_training.csv')
           # drops 2 rows with a missing label (496 rows in original file)
           .dropna(subset=['giveaway']))

X = labeled['text']
y = labeled['giveaway']

Train the Giveaway Classifier.

In [None]:
gc = GiveawayClassifier(X=X, y=y)
gc.train()
gc.report
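
For readers unfamiliar with the technique, here is a minimal sketch of multinomial Naive Bayes with add-one smoothing on toy data. It is not the implementation inside `GiveawayClassifier`, whose internals are not shown in this notebook; the helper names `train_nb` and `predict_nb` and the example posts are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(texts, labels):
    """Count class frequencies and per-class word frequencies."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def predict_nb(text, class_counts, word_counts, vocab):
    """Return the class with the highest log posterior (add-one smoothing)."""
    n_docs = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_label / n_docs)  # log prior
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# toy, made-up training posts: 1 = giveaway, 0 = not a giveaway
toy_texts = [
    "win a free prize like and share",
    "share this post to win a gift card",
    "my daughter got the vaccine yesterday",
    "the doctor explained the side effects",
]
toy_labels = [1, 1, 0, 0]

model = train_nb(toy_texts, toy_labels)
prediction = predict_nb("like and share to win", *model)
print(prediction)
```

With only a handful of words per document the per-word evidence is thin, which is one plausible reason why a model like this struggles on short texts.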

Classify only POST-level content in the loaded dataset.  
The model classifies short comments unreliably.

In [None]:
df_post = df.query('content_type == "POST"')

giveawas_df = (gc
               .predict_new(df_post.text, negative_for_url=True)
               .query('predicted == 1')
               .rename(columns={'index': 'id_orig'})
              )

giveawas_df

Remove the detected giveaway threads from the original dataset:  
a) find the post_id's of posts that were labeled as a giveaway  
b) filter out all threads with those post_id's  

In [None]:
# post_id's of the flagged posts (id_orig is assumed to hold index labels of df)
bad_threads = df.loc[giveawas_df.id_orig, 'post_id'].tolist()

# remove bad threads
S1_giveaway_removed = df.query('post_id != @bad_threads')

# save whole dataframe
S1_giveaway_removed.to_csv('data/S1_giveaway_removed.csv')

# save texts with ID
export_serialized(
    df=S1_giveaway_removed,
    column='text',
    path='data/S1_fb_texts.ndjson'
)
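
The thread filter can be illustrated on a toy frame. This is a sketch on made-up data, assuming (as in the dataset above) that every row carries the `post_id` of the thread it belongs to; `flagged_rows` is a hypothetical stand-in for the index labels returned by the classifier.

```python
import pandas as pd

# toy, made-up data: three threads identified by post_id
toy = pd.DataFrame({
    "post_id": [10, 10, 10, 20, 20, 30],
    "content_type": ["POST", "COMMENT", "COMMENT", "POST", "COMMENT", "POST"],
    "text": ["win a prize", "me!", "shared", "vaccine info", "thanks", "question"],
})

# hypothetical: index labels of POST rows predicted to be giveaways
flagged_rows = [0]

# a) look up the thread ids of the flagged posts
bad_threads = toy.loc[flagged_rows, "post_id"].unique().tolist()

# b) drop every row (post and comments) that belongs to those threads
kept = toy[~toy["post_id"].isin(bad_threads)]

print(kept)
```

Filtering on `post_id` rather than on the flagged rows themselves is what removes the comments together with the giveaway post.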

<br>

## Preprocessing
_[text_to_x](https://github.com/centre-for-humanities-computing/text_to_x)_

a) __tokens__, __lemmas__, __POS__ & __dependency parsing__ using [Stanza](https://github.com/stanfordnlp/stanza)  
b) __NER__ using [Flair](https://github.com/flairNLP/flair)

This step takes a long time to run.  
It is recommended that you run it from the terminal.

```bash
cd hpv-vaccine
python3 src/preprocessing.py -p data/S1_fb_texts.ndjson -o data/S2_fb_prep.ndjson --lang 'da' --jobs 4
```


<br>

## Feature selection 

a) __Filter out non-meaningful parts of speech from all texts__.   
Only nouns, proper nouns, adjectives, verbs and adverbs are kept.


b) __Neural detection of phrases__.  
If two tokens appear together often, they will be concatenated into a single token.
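
The two steps can be sketched with plain counting. `src.utility.phraser` is not shown in this notebook, so this is not its implementation: the Universal POS tag set, the discounted co-occurrence score, the thresholds and the toy documents are all assumptions made for illustration.

```python
from collections import Counter

# assumption: Universal POS tags for the categories listed above
KEEP_POS = {"NOUN", "PROPN", "ADJ", "VERB", "ADV"}


def filter_pos(doc):
    """Keep only tokens whose POS tag is in KEEP_POS."""
    return [token for token, pos in doc if pos in KEEP_POS]


def merge_phrases(docs, min_count=2, threshold=0.1):
    """Join adjacent token pairs that co-occur more often than expected."""
    unigrams = Counter(t for doc in docs for t in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))

    def is_phrase(a, b):
        # discounted co-occurrence score; min_count penalises rare pairs
        score = (bigrams[(a, b)] - min_count) / (unigrams[a] * unigrams[b])
        return score > threshold

    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and is_phrase(doc[i], doc[i + 1]):
                merged.append(doc[i] + "_" + doc[i + 1])
                i += 2  # skip the second half of the merged pair
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


# toy, made-up (token, POS) documents
tagged = [
    [("the", "DET"), ("hpv", "PROPN"), ("vaccine", "NOUN"), ("works", "VERB")],
    [("hpv", "PROPN"), ("vaccine", "NOUN"), ("is", "AUX"), ("safe", "ADJ")],
    [("get", "VERB"), ("the", "DET"), ("hpv", "PROPN"), ("vaccine", "NOUN")],
]

filtered = [filter_pos(doc) for doc in tagged]
phrased = merge_phrases(filtered, min_count=1, threshold=0.1)
print(phrased)
```

In this toy corpus only the pair `hpv vaccine` clears the threshold, so it is the only one joined into a single token.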

In [1]:
import ndjson

from src.utility import phraser
from src.utility.general import load_data

In [2]:
# import preprocessed data
texts_id = load_data('data/S3_test.ndjson')

# phraser has both a) & b) functionality
texts_phrased = phraser.train(texts_id)

# texts only
texts = [doc['text'] for doc in texts_phrased]

In [4]:
# with open('data/S1_fb_texts.ndjson') as f:
#     S1_fb = ndjson.load(f)

<br>

## Seed selection

a) __Train a CBOW model__  
The model is used to find words related to the query terms.  
Intentions behind the parameters:
- words that appear together in the whole FB post (window=20)
- frequent words, so that the seeds are generalizable (min_count=100)

_comment: potentially this could be taken care of by PmiSvdEmbeddings._

b) __Enhance phrase list__  
Add synonyms and related words to a given phrase list. This will be used to guide the topic model.
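
The lookup of related words can be sketched as a nearest-neighbour search by cosine similarity over word vectors. The vectors below are made up and `related_words` is a hypothetical helper; the project's `get_related` is not shown in this notebook and may work differently. The parameters `topn` and `cutoff` are given assumed meanings here (number of neighbours to return, minimum similarity).

```python
import numpy as np

# toy, made-up 3-dimensional word vectors
vectors = {
    "bivirkning": np.array([0.9, 0.1, 0.0]),
    "smerte": np.array([0.8, 0.2, 0.1]),
    "hovedpine": np.array([0.7, 0.3, 0.0]),
    "konkurrence": np.array([0.0, 0.1, 0.9]),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def related_words(query, vectors, topn=10, cutoff=0.0):
    """Rank all other words by cosine similarity to the query word."""
    scored = [
        (word, cosine(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    # drop weak matches, then keep the topn most similar words
    scored = [(w, s) for w, s in scored if s >= cutoff]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


neighbours = related_words("bivirkning", vectors, topn=2, cutoff=0.5)
print(neighbours)
```

A non-zero cutoff keeps unrelated words out of the candidate list even when fewer than `topn` good neighbours exist.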

In [3]:
from gensim.models import Word2Vec, KeyedVectors

# from src.embeddings.pmisvd import PmiSvdEmbeddings
from src.embeddings.query_ops import import_query, get_related

In [4]:
query_list = import_query('data/200729_hpv_query.csv', 'da', 'svimmel')

2020-07-29 16:53:15 INFO: Loading these models for language: da (Danish):
| Processor | Package |
-----------------------
| tokenize  | ddt     |
| pos       | ddt     |
| lemma     | ddt     |

2020-07-29 16:53:15 INFO: Use device: cpu
2020-07-29 16:53:15 INFO: Loading: tokenize
2020-07-29 16:53:15 INFO: Loading: pos
2020-07-29 16:53:16 INFO: Loading: lemma
2020-07-29 16:53:16 INFO: Done loading processors!


In [5]:
# import phrase list
# query_list = import_query('data/200729_hpv_query.csv', 'da')

# train a cbow model
# cbow_texts = Word2Vec(texts,
#                       size=300, window=20, min_count=100,
#                       sg=0, hs=0,
#                       iter=500, workers=4)
cbow_texts = Word2Vec(texts,
                      size=2, window=20, min_count=1,
                      sg=0, hs=0,
                      iter=500, workers=4)

# get a list of words similar to those in the phrase list
query_related = get_related(cbow_texts.wv, query_list, topn=10, cutoff=0)

# save + display a sample
(query_related
 #.to_csv('S3_query_related.csv')
 .head())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  hf_related['similarity'] = round(hf_related['similarity'], 2)


|   | query | related | similarity | count |
|---|---|---|---|---|
| 0 | bivirkning | bivirkning | 1.0 | 1.0 |
| 1 | bivirkning | supe | 1.0 | 1.0 |
| 2 | bivirkning | spøge | 1.0 | 1.0 |
| 3 | bivirkning | studi | 1.0 | 1.0 |
| 4 | bivirkning | argument | 1.0 | 2.0 |


The seeds now have to be __manually edited__.

<br>

## Topic modeling

In [None]:
from src.lda.asymmetric import lda_grid_search_ASM
from src.lda.seeded import lda_grid_search_SED