# **The GC19HG Project - Preprocessing and Model Development**
GC19HG (GeoCovid19 Heterogeneous Graph) is a social media research project that seeks to understand how the micro-level behaviors of millions of users combine into macro-level topic trends. The work uses the CrisisNLP GeoCov19 dataset, a collection of COVID-19-related tweets posted from February to May 2020.

Our approach combines heterogeneous graph machine learning, pre-trained NLP models, and self-supervised learning, built on the PyTorch Geometric (PyG) library. The code here preprocesses the raw data and makes a first modeling attempt.

The work is done by Chris Winsor as part of the Graph Data Analytics Research Group under Professor Tingjian Ge at the University of Massachusetts Lowell.

## Installation

The cell below installs the required libraries and clones the repository that contains the GC19HG preprocessing modules.

The full code for GC19HG can be found on [GitHub](https://github.com/cwinsor/uml_twitter.git).

In [1]:
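import os
import sys

print("version 3")
# On Google Colab, clone the repository and install the dependencies.
# On any other host, this cell only reports the host type.
if 'google.colab' in sys.modules:
    print("host is colab")
    %cd -q /content
    # Assign the git output to `_` so it does not clutter the cell output.
    _ = !git clone https://github.com/cwinsor/uml_twitter.git
    %cd -q uml_twitter/
    _ = !git pull
    !pip install torch_geometric
    !pip install sentence-transformers
    !pip install ijson
else:
    print("host is traditional server")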

version 3
host is colab
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torch_geometric
  Downloading torch_geometric-2.3.1.tar.gz (661 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 661.6/661.6 kB 10.5 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: torch_geometric
  Building wheel for torch_geometric (pyproject.toml) ... done
  Created wheel for torch_geometric: filename=torch_geometric-2.3.1-py3-none-any.whl size=910459 sha256=67baedcf834bf97c3d1f44d407c0437ed9229b0c377d3aa182e1cc08770ee9eb
  Stored in directory: /root/.cache/pip/wheels/ac/dc/30/e2874821ff308ee67dcd7a66dbde912411e19e35a1addda028
Successfully built torch_geometric
Installing collected packages: torch_geometric
Successfull

## Download some sample raw data (2 files)

In [5]:
import gdown
!gdown '1hOvrXUkmXi76Woq8A3fVoprB-rgazHuD'
!gdown '1hMGpv6ZSVknL9u5x-BnjHEMe0x2TGc-d'
# os.rename("ids_geo_2020-02-01.jsonl", "data_raw/ids_geo_2020-02-01.jsonl")
# os.rename("ids_geo_2020-02-02.jsonl", "data_raw/ids_geo_2020-02-02.jsonl")

Downloading...
From: https://drive.google.com/uc?id=1hOvrXUkmXi76Woq8A3fVoprB-rgazHuD
To: /content/uml_twitter/ids_geo_2020-02-01.jsonl
100% 1.25G/1.25G [00:09<00:00, 126MB/s]
Downloading...
From: https://drive.google.com/uc?id=1hMGpv6ZSVknL9u5x-BnjHEMe0x2TGc-d
To: /content/uml_twitter/ids_geo_2020-02-02.jsonl
100% 2.84G/2.84G [00:42<00:00, 66.2MB/s]
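
`gdown` saves both files into the current directory (`/content/uml_twitter/`), while the parse command in the next section reads from `data_raw/`. A minimal sketch of moving the downloads into place is shown here. The helper name `move_into_folder` is hypothetical and is not part of the repository.

```python
import shutil
from pathlib import Path


def move_into_folder(filenames, dst_folder, src_folder="."):
    """Move each named file from src_folder into dst_folder.

    Hypothetical helper (not from the GC19HG repository). The destination
    folder is created if it does not exist; missing files are skipped.
    Returns the list of file names that were actually moved.
    """
    dst = Path(dst_folder)
    dst.mkdir(parents=True, exist_ok=True)
    moved = []
    for name in filenames:
        src = Path(src_folder) / name
        if src.exists():
            shutil.move(str(src), str(dst / name))
            moved.append(name)
    return moved
```

For the two sample files, a call such as `move_into_folder(["ids_geo_2020-02-01.jsonl", "ids_geo_2020-02-02.jsonl"], "data_raw")` would have the same effect as the commented-out `os.rename` lines, with the added step of creating `data_raw/` first.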


## Parse the raw data

In [None]:
%run g40_preprocess.py --do_parse \
--parse_src_folder data_raw/ \
--parse_dst_folder data_parsed/ \
--parse_file_list \
    ids_geo_2020-02-01.jsonl \
    ids_geo_2020-02-02.jsonl

INFO:root:args: Namespace(do_parse=True, parse_src_folder='./', parse_dst_folder='./', parse_file_list=['ids_geo_2020-02-01.jsonl', 'ids_geo_2020-02-02.jsonl'], perform_merge=False, merge_src_folder=None, merge_dst_folder=None, merge_file_list=None, do_filter=False, filter_src_folder=None, filter_dst_folder=None)
INFO:root:start


read raw files...
 ids_geo_2020-02-01.jsonl
 ids_geo_2020-02-02.jsonl
write results...


INFO:root:done


done
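
To give a feel for what a parse pass over a `.jsonl` file involves, here is a rough sketch that streams a file one line at a time and separates original tweets from retweets. It is not the code in `g40_preprocess.py`. It assumes one JSON object per line and the Twitter API v1.1 field names `id_str` and `retweeted_status`, and it uses the standard-library `json` module rather than `ijson`. The function names are hypothetical.

```python
import json


def iter_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a .jsonl file.

    Lines that are not valid JSON are skipped, so a single damaged record
    does not abort a multi-gigabyte pass.
    """
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue


def classify_tweet(tweet):
    """Return (kind, tweet_id, original_id) for one tweet dict.

    Assumption: a retweet carries the tweet it repeats under the
    "retweeted_status" key, as in Twitter API v1.1 payloads.
    """
    original = tweet.get("retweeted_status")
    if original is not None:
        return ("retweet", tweet.get("id_str"), original.get("id_str"))
    return ("original", tweet.get("id_str"), None)
```

Reading line by line keeps memory use flat regardless of file size, which matters for inputs like the 1.25 GB and 2.84 GB sample files.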


## Filter the parsed data

In [None]:
%run g40_preprocess.py --do_filter \
  --filter_src_folder data_parsed/ \
  --filter_dst_folder data_filtered/

INFO:root:args: Namespace(do_parse=False, parse_src_folder=None, parse_dst_folder=None, parse_file_list=None, perform_merge=False, merge_src_folder=None, merge_dst_folder=None, merge_file_list=None, do_filter=True, filter_src_folder='./', filter_dst_folder='./')
INFO:root:start


after filtering:
number of original tweets: 2314
number of retweets: 14833
number of edges: 14838


INFO:root:done


write results...
done
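
The filter log reports three counts: original tweets, retweets, and edges. The sketch below shows one plausible way such counts could arise, by keeping only originals that were retweeted and only retweet edges whose target survives. The actual rule applied by `g40_preprocess.py` is not shown in this notebook, so the function name, the `min_retweets` parameter, and the rule itself are assumptions.

```python
from collections import Counter


def filter_graph(originals, retweet_edges, min_retweets=1):
    """Keep originals with enough retweets, and the edges that point at them.

    originals:      dict mapping original tweet id -> text
    retweet_edges:  list of (retweet_id, original_id) pairs
    min_retweets:   hypothetical threshold, not taken from the repository

    Returns (kept_originals, kept_retweet_ids, kept_edges).
    """
    # Count retweets only for originals that are present in the data.
    counts = Counter(oid for _, oid in retweet_edges if oid in originals)
    kept_originals = {
        oid: text for oid, text in originals.items() if counts[oid] >= min_retweets
    }
    kept_edges = [(rt, oid) for rt, oid in retweet_edges if oid in kept_originals]
    kept_retweet_ids = {rt for rt, _ in kept_edges}
    return kept_originals, kept_retweet_ids, kept_edges
```

In this sketch the edge list can be longer than the set of retweet ids if a retweet id occurs more than once. Whether that accounts for the difference between 14838 edges and 14833 retweets in the log above is not something the log establishes.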


## Create embeddings and train

In [None]:
%run g41_train_test.py --src_folder data_filtered/

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: bert-base-uncased


create embeddings for original tweets using BERT-base-uncased


Some weights of the model checkpoint at /root/.cache/torch/sentence_transformers/bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
INFO:sentence_transformers.SentenceTransformer:Use pytorch device: cpu


sequence encoding using SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)


Batches:   0%|          | 0/73 [00:00<?, ?it/s]

KeyboardInterrupt: ignored
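
The model summary above shows `pooling_mode_mean_tokens: True` with a 768-dimensional word embedding, meaning each tweet's sentence vector is the average of its token vectors. The sketch below illustrates that pooling step in plain Python on toy data. It is an illustration of the technique, not the sentence-transformers implementation. The second function checks the progress bar: assuming a batch size of 32 (to our knowledge the default for `SentenceTransformer.encode`), the 2314 original tweets reported by the filter step yield 73 batches.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is ignored).

    token_embeddings: list of equal-length lists of floats, one per token
    attention_mask:   list of 0/1 flags, one per token
    """
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    n_real = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, value in enumerate(vec):
                total[i] += value
            n_real += 1
    if n_real == 0:
        # No real tokens: return the zero vector rather than divide by zero.
        return total
    return [t / n_real for t in total]


def num_batches(n_items, batch_size=32):
    """Number of batches needed to cover n_items (batch_size is an assumption)."""
    return math.ceil(n_items / batch_size)
```

With a real model the token vectors would have 768 components each. The logic is the same: padded positions are masked out before averaging so that short tweets are not diluted by padding.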