This repository contains code used to develop a hybrid architecture for labelling bilingual Māori-English tweets. A small sample file is provided for demonstration purposes, along with a script for downloading additional tweets.
- Apply for a Twitter developer account if you do not have one already.
- Download or clone the nga-kupu repository, which is bound by the Kaitiakitanga Licence.
- Ensure that Python 3 is installed on your machine, then run the following commands in the terminal:

      pip install requests
      pip install requests-oauthlib
      pip install yelp_uri
      pip install beautifulsoup4
      pip install emot
- Copy all files in the `preprocessing` folder of this repository to `nga-kupu-master/scripts`.
- Update the four word lists in `nga-kupu-master/taumahi/__init__.py` according to the instructions in `update_word_lists.txt`.
- Run `python3 setup.py install` from `nga-kupu-master`.
- Configure your API bearer token by running the following command in the terminal: `export 'BEARER_TOKEN'='<your_bearer_token>'`
- Run the `collect_and_clean_tweets.py` script that you moved to `nga-kupu-master/scripts`. This script gathers tweets from the past week from a predefined list of users, then cleans the tweets and generates the RMT labels that are needed as input to the hybrid architecture (below).

| tweet_id | user_id | modified_text | maori_words_rmt |
|---|---|---|---|
| 1001 | x10 | Living by the Moon: Te Maramataka a Te Whānau-ā-Apanui, Wiremu Tāwhai Te Whānau-ā-Apanui, Te Whakatōhea, Ngāti Awa | 'te', 'maramataka', 'te', 'wiremu', 'tāwhai', 'te', 'te', 'whakatōhea', 'ngāti', 'awa' |
| 1016 | x25 | oh man, take me, take me!! | 'take', 'me', 'take', 'me' |
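
The `maori_words_rmt` column in the sample rows above holds a comma-separated run of single-quoted words. As a minimal sketch of how such a cell could be turned back into a Python list, assuming the format is exactly as shown in the table, the helper below (`parse_rmt_labels` is a hypothetical name, not part of this repository) extracts the quoted words and counts them:

```python
import re
from collections import Counter


def parse_rmt_labels(cell):
    """Split a maori_words_rmt cell such as "'take', 'me'" into a list of words.

    Assumes every label is wrapped in single quotes, as in the sample table.
    """
    # Capture the text between each pair of single quotes.
    return re.findall(r"'([^']*)'", cell)


# Cell copied from the second sample row (tweet_id 1016).
sample_cell = "'take', 'me', 'take', 'me'"

labels = parse_rmt_labels(sample_cell)
counts = Counter(labels)

print(labels)
print(counts.most_common())
```

A regular expression is used rather than `str.split(',')` so that stray whitespace around the commas does not end up in the words.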

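
For the bearer-token step in the setup list, it can help to confirm that the environment variable is visible to Python before running the collection script. The sketch below is an illustration only: `bearer_headers` is a hypothetical helper, and whether `collect_and_clean_tweets.py` reads the token in exactly this way is an assumption. It reads `BEARER_TOKEN` and builds the standard `Authorization: Bearer ...` header used for OAuth 2.0 bearer authentication.

```python
import os


def bearer_headers(env=None):
    """Build an Authorization header from the BEARER_TOKEN environment variable.

    `env` defaults to os.environ; a plain dict can be passed in for testing.
    """
    env = os.environ if env is None else env
    token = env.get("BEARER_TOKEN")
    if not token:
        # Fail early with a clear message instead of sending an unauthenticated request.
        raise RuntimeError("BEARER_TOKEN is not set; export it before running the script")
    return {"Authorization": f"Bearer {token}"}


# Example with a placeholder token (not a real credential).
example = bearer_headers({"BEARER_TOKEN": "<your_bearer_token>"})
print(example)
```

If the function raises `RuntimeError`, the `export` command was probably run in a different terminal session from the one running Python.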