Description

This repository contains our entry for the Code4Venezuela Hackathon: https://www.codeforvenezuela.org/sf-hackathon

We worked specifically on the Data Ingestion Pipeline and Data Enrichment for the project: MVP-INF Using Twitter Data to Help Public Health.

What the code does

The code has four parts:

Twitter data extraction
Stream Processing (via Flink)
NLP processing (using CoreNLP)
Storage to database (MongoDB)

We provide two ways to get data from Twitter while approval for Premium API access is pending. The first polls the current free APIs and publishes the tweets to Kafka, where Flink consumes them (AppKafka.java). The second streams tweets through Flink's Twitter Connector (AppStream). Since the main solution is based on Flink, both use the same Flink pipeline (Pipeline.java).
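The idea of two sources feeding one shared pipeline can be sketched in a few lines. This is only an illustration in plain Python, not the Flink code in this repo: `polled_source`, `streaming_source`, and `pipeline` are hypothetical names, and a Python callable stands in for the sink.

```python
def polled_source(pages):
    """Stand-in for the polling path: each page is the batch returned by one API call."""
    for page in pages:
        for tweet in page:
            yield tweet


def streaming_source(stream):
    """Stand-in for the streaming path: tweets arrive one at a time."""
    for tweet in stream:
        yield tweet


def pipeline(tweets, sink):
    """Shared processing step: keep tweets that have text and hand them to the sink."""
    count = 0
    for tweet in tweets:
        text = tweet.get("text", "")
        if not text:
            continue  # skip records without a text field
        sink({"originalTweet": tweet, "tweetText": text})
        count += 1
    return count


if __name__ == "__main__":
    stored = []
    pipeline(polled_source([[{"text": "hola"}]]), stored.append)
    pipeline(streaming_source(iter([{"text": "mundo"}])), stored.append)
    print(stored)
```

Because both sources yield the same kind of record, the processing step does not need to know which one is in use.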

We trained a simple Spanish-language model (in the resources/train_data folder) using existing tweets from the project's initial data set as well as other sources. It can tag text with the following labels:

NEEDS: people needing medicine
MED: medicine names
OFFERS: people offering medicine
LOC: to indicate location
CONTACT: contact information
SICK: sickness or diseases

The Pipeline then extracts the text from the tweet, processes it using the NLP model and stores it in the DB.
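The extract, tag, and store steps can be sketched as follows. This is a toy illustration only: a hypothetical keyword lexicon (`TOY_LEXICON`) stands in for the trained CoreNLP model, a Python list stands in for the database, and the flat `tags` layout is an assumption that may differ from what the real pipeline stores.

```python
from collections import defaultdict

# Hypothetical keyword lexicon standing in for the trained CoreNLP model.
TOY_LEXICON = {
    "necesita": "NEEDS",
    "dona": "OFFERS",
    "memantina": "MED",
    "tachira": "LOC",
    "diabetes": "SICK",
}


def extract_text(tweet):
    """Pull the text out of a raw tweet record."""
    return tweet.get("text", "")


def tag_text(text):
    """Group recognized words by tag, e.g. {"MED": ["memantina"]}."""
    tags = defaultdict(list)
    for token in text.split():
        word = token.strip(".,;:!?").lower()
        tag = TOY_LEXICON.get(word)
        if tag:
            tags[tag].append(word)
    return dict(tags)


def process(tweet, store):
    """Extract the text, tag it, and append the resulting document to the store."""
    text = extract_text(tweet)
    document = {"originalTweet": tweet, "tweetText": text, "tags": tag_text(text)}
    store.append(document)
    return document


if __name__ == "__main__":
    store = []
    print(process({"text": "Juan necesita memantina en Tachira."}, store))
```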

The project includes more than what is described here, and we might expand it in the future with deduplication, a better tagging model, storage of geolocation data, and other features that could help AI and data mining.

How to run?

Run docker-compose up (this is needed to have Kafka running locally).
Start AppKafka.java (just run the class, which has a main method).
Once AppKafka is started, run TwitterConnectUtil (again, run its main method) to feed the stream.
Run KafkaToMongoCosumer.java. This saves the processed tweets to MongoDB.
You can check the data being written to MongoDB by going to localhost:9000 and opening the processedTweets collection in the admin database.

You can uncomment the line that prints to the console to see how the ingested tweets are tagged.

Sample output

You can find a sample of the processed data in src/main/resources/mongodb_export/processedTweets.

Each document has three fields: originalTweet, tweetText, and tags. The tags field contains everything that was identified by the NLP model.

Sample MongoDB queries

Get all tweets reporting that someone is sick or has a disease or condition:

db.getCollection('processedTweets').find({"tags.tags.SICK": {$exists: true}})

Get all tweets reporting that someone needs medicine:

db.getCollection('processedTweets').find({"tags.tags.NEEDS": {$exists: true}})

Get all tweets reporting that someone needs or offers a specific medicine (memantina):

db.getCollection('processedTweets').find({"tags.tags.MED": {$exists: true, $in:['Memantina']}})
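If you prefer to run these queries from Python, the same filters can be built as plain dictionaries. The sketch below assumes the pymongo driver is installed and that MongoDB listens on its default port 27017; neither is confirmed by this repo. The helper names (`has_tag`, `has_tag_value`, `open_collection`) are hypothetical.

```python
def has_tag(tag):
    """Filter for documents where the given tag exists, e.g. has_tag("SICK")."""
    return {f"tags.tags.{tag}": {"$exists": True}}


def has_tag_value(tag, values):
    """Filter for documents where the tag holds one of the given values."""
    return {f"tags.tags.{tag}": {"$exists": True, "$in": list(values)}}


def open_collection(uri="mongodb://localhost:27017"):
    """Assumption: pymongo is installed and MongoDB runs on the default port."""
    from pymongo import MongoClient  # imported lazily so the helpers above work without it

    return MongoClient(uri)["admin"]["processedTweets"]


def find_tweets(collection, query):
    """Run a filter against a collection and return the matches as a list."""
    return list(collection.find(query))
```

For example, `find_tweets(open_collection(), has_tag_value("MED", ["Memantina"]))` would mirror the last shell query above.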

Extra

Dr. Julio Castro's data has been de-duplicated. To download it, follow this link: https://drive.google.com/open?id=1tbuw0KfmNMxuwLTmRISec7k9pghIU-9e

The Python notebook provided in this repo removes duplicates from Dr. Castro's data.

The Python server provided in this repo classifies tweets in real time as new, already existing, or near-duplicates (the last category covers tweets that are just retweets of other tweets, among other cases). By default, the server starts on localhost:30303, but you can pass arguments to change the port. To run it, run:

python server.py

To test it you can run the following command:

curl -X POST -d '{"text":"juan luis necesita medicinas en el tachira"}' http://localhost:30303

It will return a JSON response indicating whether or not the message is already in the corpus.

Is the near-duplicate detection actually working? Clone the repo, run this command, and see what comes out:
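The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl command above; `build_request` and `classify` are hypothetical helper names, and the exact fields of the response are not assumed here.

```python
import json
import urllib.request

SERVER_URL = "http://localhost:30303"  # default address mentioned above


def build_request(text, url=SERVER_URL):
    """Build a POST request whose body is the JSON object {"text": ...}."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def classify(text, url=SERVER_URL):
    """Send the text to the running server and decode its JSON response."""
    with urllib.request.urlopen(build_request(text, url)) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires server.py to be running locally.
    print(classify("juan luis necesita medicinas en el tachira"))
```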

curl -X POST -d '{"text":"RT @juan RT @AgenciaCN: #3Feb #ServicioPúblico Se solicitan donantes de sangre ORH+ https://t.co/eE8twKWIut #ACN https://t.co/zK5nhyydUJ"}' http://localhost:30303
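To give an intuition for why a retweet like the one above can be matched against its original, here is one simple normalization-based approach. It is a sketch under our own assumptions, not necessarily what server.py does: it strips leading "RT @user" prefixes and URLs, lowercases the text, and then compares the results exactly. The names `normalize` and `is_near_duplicate` are hypothetical.

```python
import re

# One or more leading "RT @user" or "RT @user:" prefixes.
RT_PREFIX = re.compile(r"^(?:RT\s+@\w+:?\s*)+", re.IGNORECASE)
URL = re.compile(r"https?://\S+")


def normalize(text):
    """Drop retweet prefixes and URLs, lowercase, and collapse whitespace."""
    text = RT_PREFIX.sub("", text.strip())
    text = URL.sub("", text)
    return " ".join(text.lower().split())


def is_near_duplicate(text, corpus):
    """True if the normalized text matches any normalized text in the corpus."""
    return normalize(text) in {normalize(known) for known in corpus}


if __name__ == "__main__":
    original = "RT @AgenciaCN: #3Feb Se solicitan donantes de sangre ORH+ https://t.co/eE8twKWIut"
    retweet = "RT @juan " + original
    print(is_near_duplicate(retweet, [original]))
```

Exact matching after normalization only catches retweets and trivial edits; a similarity measure over tokens would be needed for looser matches.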

License

CoreNLP is distributed under the GPL, so we are obliged to distribute this code under the GPL as well. If you don't use any of the CoreNLP parts, you can consider the rest of the code MIT licensed (unless we have missed other restrictions from dependent libraries).

The Team

Christian Vielma
Manuel Salgado
Marcos Grillo
Karina Celis
Juan Lopez Marcano
