bshurick/w205project
Steps to install:

1) Load the code into the /project folder (NOTE: this folder is required):

    git clone https://github.com/bshur2008/w205project.git
    mv w205project /project

2) Ensure that Python 2.7 is installed, then install the Python dependencies with the install_dependencies.sh script.

3) Install Redis (http://redis.io/) and run it locally in daemon mode. An example redis.conf file is included.

4) Add your Twitter credentials to ~/.bash_profile, or run the following in a bash shell:

    export TWITTER_KEY=[key]
    export TWITTER_SECRET=[secret]
    export TWITTER_OAUTH_TOKEN=[oauth token]
    export TWITTER_OAUTH_SECRET=[oauth secret]

5) Ensure that Hadoop and Hive are installed properly.

6) Create the /data/tmp/raw/tweets folder(s) on the local file system.

7) Create the /user/w205/tweets/es folder(s) within HDFS and open up access to all Linux users (this allows the cron job to move files from the local file system to HDFS):

    hdfs dfs -chmod -R 777 /user/w205/tweets/es

8) Install lein if it is not already installed (http://leiningen.org/).

9) For Wikipedia data, attach EBS volume snapshot snap-8041f2e9 and load the eswiki-20090812-pages-meta-current.xml.bz2 data file into the HDFS folder /user/w205/wp/es.

10) Use the crontab.txt example file to schedule the processes to run automatically, OR run them manually:

    ./launch.sh       # check for running Storm topology and start if not running
    ./hdfs_load.sh    # run each hour (or more) to load tweets from local files to HDFS
    ./tweet_ETLs.sh   # run ETLs for Twitter data
    ./wp_ETLs.sh      # run Wikipedia ETLs
                      # (NOTE: may need to execute sparse run from the
                      # /project/lang_analysis folder to compile job and ensure
                      # streamparse works before it will run correctly)

11) To view real-time tweets, use the serving scripts provided:

    python serving-allwords.py                               # dump all words from Redis (NOTE: use sparingly)
    python serving-allwords.py | sort -k2,2nr | head -n20    # display the top 20 words
    python serving-getword.py [word]                         # display counts for an individual word

NOTE: use the test_scripts/authorize.py script to authenticate with OAuth for the first time, and test_scripts/test.py to test your Twitter stream.
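Before launching the topology, it can help to confirm that the four Twitter variables from step 4 are actually exported. The following is a minimal sketch, not part of the repository: the helper name `missing_credentials` is hypothetical, and only the variable names come from the instructions above.

```python
import os

# Variable names taken from step 4 of the install instructions.
REQUIRED_VARS = (
    "TWITTER_KEY",
    "TWITTER_SECRET",
    "TWITTER_OAUTH_TOKEN",
    "TWITTER_OAUTH_SECRET",
)


def missing_credentials(env=None):
    """Return the required variable names that are unset or empty in env.

    Hypothetical helper for illustration; env defaults to os.environ.
    """
    if env is None:
        env = os.environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Missing Twitter credentials: " + ", ".join(missing))
    else:
        print("All Twitter credentials are set.")
```

Run it in the same shell (or as the same user) that will start the processes. Variables exported in ~/.bash_profile may not be visible to cron jobs, so a check like this is most useful when run from the environment the job will actually use.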
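The `sort -k2,2nr | head -n20` pipeline in step 11 can also be expressed in Python, which may be convenient when post-processing the word counts further. This sketch assumes each line printed by serving-allwords.py is a word followed by an integer count, separated by whitespace. That format is implied by the sort key but not confirmed here, and the function name `top_words` is hypothetical.

```python
def top_words(lines, n=20):
    """Return the n (word, count) pairs with the highest counts.

    Assumes each line looks like "<word> <count>"; lines that do not
    match that shape are skipped rather than raising an error.
    """
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or malformed line
        try:
            pairs.append((fields[0], int(fields[1])))
        except ValueError:
            continue  # second field is not an integer count
    # sorted() is stable, so words with equal counts keep their input order.
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]
```

Passing `sys.stdin` as `lines` lets the function take the place of `sort` and `head` at the end of the pipe. Like the shell version, it reads every word from Redis first, so the "use sparingly" caution still applies.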
About
Streaming storage of foreign-language tweets for use in linguistic analysis and foreign language instruction.