bilingual-MET/hybrid

Hybrid Architecture for Labelling Bilingual Māori-English Tweets

This repository contains code used to develop a hybrid architecture for labelling bilingual Māori-English tweets. A small sample file is provided for demonstration purposes, along with a script for downloading additional tweets.

Architecture

[Figure: hybrid architecture diagram (hybrid_updated)]

Collect and Pre-process Tweets

  1. Apply for a Twitter developer account if you do not have one already.
  2. Download or clone the nga-kupu repository, which is bound by the Kaitiakitanga Licence.
  3. Ensure that Python 3 is installed on your machine, then run the following commands in the terminal:
      pip install requests
      pip install requests-oauthlib
      pip install yelp_uri
      pip install beautifulsoup4
      pip install emot
  4. Copy all files in the preprocessing folder of this repository to nga-kupu-master/scripts.
  5. Update the four word lists in nga-kupu-master/taumahi/__init__.py according to the instructions in update_word_lists.txt.
  6. Run python3 setup.py install from nga-kupu-master.
  7. Configure your API bearer token by running the following command in the terminal: export 'BEARER_TOKEN'='<your_bearer_token>'
  8. Run the collect_and_clean_tweets.py script that you copied to nga-kupu-master/scripts. This script gathers the past week's tweets from a predefined list of users, cleans them, and generates the RMT labels required as input to the hybrid architecture (below).
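
The bearer token step can be sanity-checked from Python before running the collection script. The sketch below assumes the token is read from the BEARER_TOKEN environment variable and sent as a standard "Authorization: Bearer" header. The helper name bearer_headers is hypothetical, and how collect_and_clean_tweets.py reads the token internally is not shown here.

```python
import os


def bearer_headers(env_var="BEARER_TOKEN"):
    """Build HTTP headers from the bearer token stored in the environment.

    Hypothetical helper for illustration; not taken from this repository.
    """
    token = os.environ.get(env_var)
    if not token:
        # Fail early with a readable message instead of sending an empty token.
        raise RuntimeError(f"{env_var} is not set; export it before running the script")
    return {"Authorization": f"Bearer {token}"}
```

If the function raises RuntimeError, the export command was probably run in a different terminal session from the one running Python. The returned dictionary can be passed as the headers argument of requests.get.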

Run Experiments

Sample Data

tweet_id | user_id | modified_text | maori_words_rmt
1001 | x10 | Living by the Moon: Te Maramataka a Te Whānau-ā-Apanui, Wiremu Tāwhai Te Whānau-ā-Apanui, Te Whakatōhea, Ngāti Awa | 'te', 'maramataka', 'te', 'wiremu', 'tāwhai', 'te', 'te', 'whakatōhea', 'ngāti', 'awa'
1016 | x25 | oh man, take me, take me!! | 'take', 'me', 'take', 'me'
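
As a minimal sketch of working with the sample, the maori_words_rmt column can be turned into a Python list of words. This assumes each cell holds quoted, comma-separated words as displayed above, with or without surrounding brackets; the on-disk format of the sample file may differ. The function name parse_rmt_words is hypothetical.

```python
import ast


def parse_rmt_words(cell):
    """Parse a maori_words_rmt cell such as "'take', 'me'" into a list of strings."""
    text = cell.strip()
    if not text:
        return []
    # Wrap bare comma-separated words in brackets so they form a list literal.
    if not text.startswith("["):
        text = f"[{text}]"
    # literal_eval only accepts Python literals, so it is safe on untrusted text.
    return list(ast.literal_eval(text))


# Second sample row from the table above.
words = parse_rmt_words("'take', 'me', 'take', 'me'")
```

Here words holds the four tokens in their original order, with repeats kept, so token positions in the tweet remain recoverable.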

Token-level Labels

Step 1

Step 2

Tweet-level Labels

Steps