Scraper, Twitter streaming API collectors and nltk scripts, used in IEM2201D Corpus Linguistics Research Project
Python Ruby
Tweet Data
sgbeat
.gitignore
README.markdown
classifier.py
engfilter.py
engfilter_sgb.py
english.rb
jb_pure_ids_left.txt
jbschema.sql
johortweet.py
notes.txt
results.txt
schema.sql
scrapy.cfg
singaporetweet.py
unigram_classifier.py

README.markdown

sgBeat Scraper

This is a Scrapy-based scraper for sgbeat.com. It is designed to grab tweets and store them in a MySQL database whose schema is defined in schema.sql. database.py contains Tornado's MySQL database wrapper.

Dependencies:

Scrapy
MySQLdb (the Python wrapper for MySQL)

Remember to create a details.py file in /sgbeat/ with the following details:

HOST_NAME = ""
MYSQL_DB_NAME = ""
MYSQL_USER_NAME = ""
MYSQL_PASSWORD = ""
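
As a minimal sketch of how these four values might be consumed, the snippet below maps them onto the keyword arguments that MySQLdb.connect() accepts (host, db, user, passwd). The helper name connection_kwargs and the placeholder values are assumptions made for illustration; they are not taken from the repository, and the scraper's own database.py may wire things up differently.

```python
from types import SimpleNamespace


def connection_kwargs(details):
    """Hypothetical helper: translate details.py constants into
    keyword arguments suitable for MySQLdb.connect()."""
    return {
        "host": details.HOST_NAME,
        "db": details.MYSQL_DB_NAME,
        "user": details.MYSQL_USER_NAME,
        "passwd": details.MYSQL_PASSWORD,
    }


# Stand-in for `import details`; every value here is a placeholder.
example_details = SimpleNamespace(
    HOST_NAME="localhost",
    MYSQL_DB_NAME="tweets",
    MYSQL_USER_NAME="scraper",
    MYSQL_PASSWORD="secret",
)

kwargs = connection_kwargs(example_details)
# With MySQLdb installed and a server running, one could then call:
#   import MySQLdb
#   conn = MySQLdb.connect(**kwargs)
```

Keeping the credentials in a separate details.py, outside version control, means the scraper code can be shared without exposing them.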

This code is used in an IEM2201D research project to build a classifier for Singaporean vs Malaysian tweets. All tweets pushed to sgBeat are Singaporean, which makes the site a good source for a Singaporean corpus.

A separate script, johortweet.py, grabs Malaysian tweets via Twitter's streaming API. Dependencies:

tweetstream

Remember to supply a details.py in /, containing a Twitter USERNAME, PASSWORD and database details, as above.
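
A quick way to catch an incomplete root-level details.py before starting the collector is to check that every expected name is present and non-empty. The sketch below assumes the six names mentioned in this README; the function missing_settings and the sample values are hypothetical and are not part of the repository.

```python
from types import SimpleNamespace

# Names this README says the root details.py should define.
REQUIRED_SETTINGS = (
    "USERNAME",
    "PASSWORD",
    "HOST_NAME",
    "MYSQL_DB_NAME",
    "MYSQL_USER_NAME",
    "MYSQL_PASSWORD",
)


def missing_settings(details):
    """Hypothetical check: return the required names that are
    absent from `details` or set to an empty string."""
    return [
        name
        for name in REQUIRED_SETTINGS
        if not getattr(details, name, "")
    ]


# Stand-in for `import details`; values are placeholders.
incomplete = SimpleNamespace(
    USERNAME="someuser",
    PASSWORD="",
    HOST_NAME="localhost",
    MYSQL_DB_NAME="tweets",
    MYSQL_USER_NAME="collector",
)

problems = missing_settings(incomplete)
if problems:
    print("details.py is missing: " + ", ".join(problems))
```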