# **The GC19HG Project - Dataset Preprocessing**
GC19HG (GeoCovid19 Heterogeneous Graph) is a social media research project that seeks to understand how the micro-level behaviors of millions of users combine to form macro-level topic trends. The work is based on the CrisisNLP GeoCov19 dataset, a collection of tweets about COVID-19 posted during the early months of 2020.

The work is carried out by Chris Winsor as part of the Graph Data Analytics Research Group under Professor Tingjian Ge at the University of Massachusetts Lowell.

Our approach uses heterogeneous graph machine learning, built on PyTorch Geometric (PyG), together with self-supervised learning. The code here preprocesses the raw data and makes preliminary attempts at modeling.

## Installation

The cell below installs the required libraries and clones the repository that contains the GC19HG preprocessing modules.

The full code for GC19HG can be found on [GitHub](https://github.com/cwinsor/uml_twitter.git).

In [1]:
import os
import sys
print("version 3")
if 'google.colab' in sys.modules:
    print("host is colab")
    %cd -q /content
    _ = !git clone https://github.com/cwinsor/uml_twitter.git
    %cd -q uml_twitter/
    _ = !git pull
    !pip install torch_geometric
    !pip install sentence-transformers
    !pip install ijson
else:
    print("host is traditional server")

version 3
host is colab


## Download some sample raw data (2 files)

In [28]:
import gdown
!gdown '1cKrf5K1yDE4V8W1L5ISicMeNffdCZOIB'
!gdown '1cCp4mXRvfhYVndJ9AO8RA2O-tYGBQaVU'

Downloading...
From: https://drive.google.com/uc?id=1cKrf5K1yDE4V8W1L5ISicMeNffdCZOIB
To: /content/uml_twitter/ids_geo_2020-02-01.jsonl
100% 1.25G/1.25G [00:08<00:00, 144MB/s]
Downloading...
From: https://drive.google.com/uc?id=1cCp4mXRvfhYVndJ9AO8RA2O-tYGBQaVU
To: /content/uml_twitter/ids_geo_2020-02-02.jsonl
100% 2.84G/2.84G [00:30<00:00, 91.6MB/s]


## Parse the raw data

In [35]:
%run g40_preprocess.py --do_parse \
--parse_src_folder ./ \
--parse_dst_folder ./ \
--parse_file_list \
    ids_geo_2020-02-01.jsonl \
    ids_geo_2020-02-02.jsonl

INFO:root:args: Namespace(do_parse=True, parse_src_folder='./', parse_dst_folder='./', parse_file_list=['ids_geo_2020-02-01.jsonl', 'ids_geo_2020-02-02.jsonl'], perform_merge=False, merge_src_folder=None, merge_dst_folder=None, merge_file_list=None, do_filter=False, filter_src_folder=None, filter_dst_folder=None)
INFO:root:start


read raw files...
 ids_geo_2020-02-01.jsonl
 ids_geo_2020-02-02.jsonl
write results...


INFO:root:done


done
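
The internals of `g40_preprocess.py` are not shown in this notebook. As a rough illustration of what a parse step over multi-gigabyte `.jsonl` files can look like, the sketch below reads one JSON object per line without loading the whole file into memory. It assumes each line is a hydrated tweet using Twitter API v1.1 field names (`id_str`, `user`, `text` or `full_text`, `retweeted_status`); the helper names `iter_tweets` and `summarize` are hypothetical and do not come from the repository.

```python
import json
from typing import Iterator


def iter_tweets(path: str) -> Iterator[dict]:
    """Yield one tweet dict per non-empty line of a JSONL file."""
    with open(path, "r", encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                # Skip a malformed line rather than abort a long read.
                print(f"skipping malformed line {line_no} in {path}")


def summarize(tweet: dict) -> dict:
    """Keep only a few fields (assumed Twitter API v1.1 names)."""
    retweeted = tweet.get("retweeted_status")
    return {
        "tweet_id": tweet.get("id_str"),
        "user_id": (tweet.get("user") or {}).get("id_str"),
        "text": tweet.get("full_text") or tweet.get("text"),
        # None marks an original tweet; otherwise the id of the source tweet.
        "retweet_of": retweeted.get("id_str") if retweeted else None,
    }
```

Because `iter_tweets` is a generator, memory use stays roughly constant regardless of file size, which matters for inputs like the 1.25 GB and 2.84 GB files downloaded above.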


## Filter the parsed data

In [36]:
%run g40_preprocess.py --do_filter \
--filter_src_folder ./ \
--filter_dst_folder ./

INFO:root:args: Namespace(do_parse=False, parse_src_folder=None, parse_dst_folder=None, parse_file_list=None, perform_merge=False, merge_src_folder=None, merge_dst_folder=None, merge_file_list=None, do_filter=True, filter_src_folder='./', filter_dst_folder='./')
INFO:root:start


after filtering:
number of original tweets: 2314
number of retweets: 14833
number of edges: 14838


INFO:root:done


write results...
done
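
The filter step reports counts of original tweets, retweets, and edges, but the exact filtering criteria live in `g40_preprocess.py` and are not shown here. One plausible reading is that retweets are kept only when the tweet they point to is also present, and that each kept retweet contributes an edge. The sketch below illustrates that idea only; the record layout (`tweet_id`, `user_id`, `retweet_of`) and the function name `split_and_link` are assumptions, and it is not expected to reproduce the counts above.

```python
from typing import Iterable


def split_and_link(records: Iterable[dict]):
    """Split records into originals and retweets, then build retweet edges.

    Each record is assumed to carry 'tweet_id', 'user_id' and 'retweet_of'
    (None for an original tweet). These field names are hypothetical.
    """
    records = list(records)
    originals = {r["tweet_id"]: r for r in records if r["retweet_of"] is None}
    # Keep only retweets whose source tweet is present in the data.
    retweets = [
        r for r in records
        if r["retweet_of"] is not None and r["retweet_of"] in originals
    ]
    # One directed edge per kept retweet: (retweeting user, original tweet).
    edges = [(r["user_id"], r["retweet_of"]) for r in retweets]
    return originals, retweets, edges


sample = [
    {"tweet_id": "1", "user_id": "u1", "retweet_of": None},
    {"tweet_id": "2", "user_id": "u2", "retweet_of": "1"},
    {"tweet_id": "3", "user_id": "u3", "retweet_of": "99"},  # source missing
]
originals, retweets, edges = split_and_link(sample)
print(len(originals), len(retweets), len(edges))
```

Dropping retweets whose source is absent keeps every edge endpoint resolvable, which simplifies building a graph afterwards.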


## Create embeddings and run train/test

In [44]:
%run g41_train_test.py --src_folder ./

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: bert-base-uncased


create embeddings for original tweets using BERT-base-uncased


Some weights of the model checkpoint at /root/.cache/torch/sentence_transformers/bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
INFO:sentence_transformers.SentenceTransformer:Use pytorch device: cpu


sequence encoding using SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)


Batches:   0%|          | 0/73 [00:00<?, ?it/s]

KeyboardInterrupt: ignored
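
The log above shows a `Pooling` module with `pooling_mode_mean_tokens` set to `True`, meaning each tweet's 768-dimensional embedding is the average of its token embeddings. The sketch below illustrates masked mean pooling in plain Python, along with the batch-count arithmetic: 73 batches is consistent with 2314 original tweets if the script uses the `sentence-transformers` default batch size of 32, which is an assumption since the `encode` call is not shown. The function names are hypothetical and the real library operates on tensors rather than lists.

```python
import math
from typing import List


def mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1 (padding is ignored)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    if count == 0:
        return total  # nothing to average; return zeros
    return [t / count for t in total]


def num_batches(n_items: int, batch_size: int = 32) -> int:
    """Number of batches needed to cover n_items (the last may be partial)."""
    return math.ceil(n_items / batch_size)


# Two real tokens plus one padding token, in a toy 2-dimensional space.
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
print(pooled, num_batches(2314))
```

With the model running on CPU, as the log indicates, encoding can be slow; a GPU runtime may shorten it considerably.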