This repository contains code used to develop a hybrid architecture for labelling bilingual Māori-English tweets. A small sample file is provided for demonstration purposes, along with a script for downloading additional tweets.
- Apply for a Twitter developer account if you do not have one already.
- Download or clone the nga-kupu repository, which is bound by the Kaitiakitanga Licence.
- Ensure that Python 3 is installed on your machine, then run the following commands in the terminal:
pip install requests
pip install requests-oauthlib
pip install yelp_uri
pip install beautifulsoup4
pip install emot
- Copy all files in the `preprocessing` folder of this repository to `nga-kupu-master/scripts`.
- Update the four word lists in `nga-kupu-master/taumahi/__init__.py` according to the instructions in `update_word_lists.txt`.
- Run `python3 setup.py install` from `nga-kupu-master`.
- Configure your API bearer token by running the following command in the terminal:
export 'BEARER_TOKEN'='<your_bearer_token>'
- Run the `collect_and_clean_tweets.py` script that you moved to `nga-kupu-master/scripts`. This script gathers tweets from the past week from a predefined list of users, then cleans the tweets and generates the RMT labels that are needed as input to the hybrid architecture (below).
| tweet_id | user_id | modified_text | maori_words_rmt |
|---|---|---|---|
| 1001 | x10 | Living by the Moon: Te Maramataka a Te Whānau-ā-Apanui, Wiremu Tāwhai Te Whānau-ā-Apanui, Te Whakatōhea, Ngāti Awa | 'te', 'maramataka', 'te', 'wiremu', 'tāwhai', 'te', 'te', 'whakatōhea', 'ngāti', 'awa' |
| 1016 | x25 | oh man, take me, take me!! | 'take', 'me', 'take', 'me' |
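
The second sample row shows why rule-based labels alone are not enough: the English words "take" and "me" are also valid Māori word forms, so they are flagged. The sketch below is one way such labels could arise, and it is not the nga-kupu implementation. It assumes a word is flagged when it is built only from Māori syllables (an optional consonant, including the digraphs `ng` and `wh`, followed by a vowel). The names `MAORI_WORD`, `AMBIGUOUS` and `rule_based_labels` are hypothetical, and `AMBIGUOUS` is a one-word stand-in for the curated word lists in `taumahi`.

```python
import re
import string
import unicodedata

# Assumption: a Māori word form is a run of syllables, each an optional
# consonant (h k m n p r t w, or the digraphs ng / wh) plus one vowel.
MAORI_WORD = re.compile(r"(?:(?:ng|wh|[hkmnprtw])?[aeiouāēīōū])+")

# Hypothetical stand-in for the curated word lists; the real lists are
# much larger and are maintained in the nga-kupu repository.
AMBIGUOUS = {"a"}


def rule_based_labels(text):
    """Return the lowercased words in `text` that look like Māori words."""
    labels = []
    # NFC normalisation keeps macronised vowels as single characters.
    for token in unicodedata.normalize("NFC", text).lower().split():
        word = token.strip(string.punctuation)
        if word and word not in AMBIGUOUS and MAORI_WORD.fullmatch(word):
            labels.append(word)
    return labels


print(rule_based_labels("oh man, take me, take me!!"))
# ['take', 'me', 'take', 'me']
```

Under these assumptions the sketch reproduces the `maori_words_rmt` column for both sample rows. Hyphenated names such as Whānau-ā-Apanui are skipped because the hyphen does not match the pattern.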