Project for the Web Information Retrieval course at La Sapienza, Università di Roma, taught by Prof. Andrea Vitaletti and Prof. Luca Becchetti.
The project aims to detect emergency situations in real time by scanning the tweet stream, using machine learning algorithms and treating users as sensors.
Link to the Slideshare presentation. The scientific paper is available in the repository.
- Daniele Davoli - Linkedin profile
- Danilo Marzilli - Linkedin profile
- Andrea Lombardo - Linkedin profile
To train and validate our machine learning system, we used a dataset of 5,642 manually annotated tweets in Italian. The tweets relate to 4 different natural disasters that occurred in Italy between 2009 and 2014. For each tweet, the dataset reports:
- tweet ID;
- text;
- source;
- author’s screen name;
- author’s ID;
- latitude and longitude (if available);
- time;
- disaster ID (see below);
- class.
Tweets have been manually annotated and divided into 3 classes according to the information they convey:
- damage class: tweets related to the disaster and carrying information about damage to infrastructure or to the population;
- no damage class: tweets related to the disaster but not carrying relevant information for the assessment of damages;
- not relevant class: tweets collected while building the dataset, but not related to any disaster (noise).
The dataset is also available in this repository. Validations with your own datasets are welcome :)
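As an illustration of the record layout described above, here is a minimal sketch of reading such a file with Python's standard `csv` module. The column names, the `Tweet` dataclass, the `read_tweets` helper, the disaster ID value `D1` and the three sample rows are all assumptions made up for this example; the actual header and contents of the .csv file in the repository may differ.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical header and invented rows, used only to make the sketch runnable.
SAMPLE_CSV = """tweet_id,text,source,screen_name,user_id,latitude,longitude,time,disaster_id,class
1,"Crollato un ponte, strada chiusa",web,user_a,100,44.5,11.3,2012-05-20T04:10:00,D1,damage
2,"Tutti bene dopo la scossa?",mobile,user_b,101,,,2012-05-20T04:12:00,D1,no damage
3,"Buongiorno a tutti",web,user_c,102,,,2012-05-20T08:00:00,,not relevant
"""


@dataclass
class Tweet:
    tweet_id: str
    text: str
    source: str
    screen_name: str
    user_id: str
    latitude: Optional[float]   # None when the tweet is not geotagged
    longitude: Optional[float]
    time: str
    disaster_id: str
    label: str                  # "damage", "no damage" or "not relevant"


def _to_float(value: str) -> Optional[float]:
    """Convert a coordinate field to float, mapping empty fields to None."""
    value = value.strip()
    return float(value) if value else None


def read_tweets(handle) -> List[Tweet]:
    """Read tweets from an open text handle using the assumed column names."""
    reader = csv.DictReader(handle)
    return [
        Tweet(
            tweet_id=row["tweet_id"],
            text=row["text"],
            source=row["source"],
            screen_name=row["screen_name"],
            user_id=row["user_id"],
            latitude=_to_float(row["latitude"]),
            longitude=_to_float(row["longitude"]),
            time=row["time"],
            disaster_id=row["disaster_id"],
            label=row["class"],
        )
        for row in reader
    ]


tweets = read_tweets(io.StringIO(SAMPLE_CSV))
```

With a real file, the same helper could be called on `open(path, newline="", encoding="utf-8")` once the column names are adjusted to match the actual header.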
We process our dataset in this order:
- Import data from the .csv file;
- Preprocess the tweets to remove punctuation, stop words and digits, and apply a stemming algorithm;
- Transform the tweets into vectors in a vector space whose axes are the vocabulary terms, and weight each vector component by TF-IDF (Term Frequency-Inverse Document Frequency);
- Cluster the tweets (now vectors) into main topics;
- Train an SVM classifier to distinguish relevant tweets from not relevant ones.
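The preprocessing and vectorization steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the project's actual code: the stop-word list is a tiny hand-picked sample, stemming is left out (an Italian stemmer from an external library would be needed), the sample tweets are invented, and the weight used is the basic `tf * log(N / df)` variant of TF-IDF, which may differ from the one used in the repository. The names `preprocess` and `tfidf_vectors` are hypothetical.

```python
import math
import string
from collections import Counter

# Tiny illustrative stop-word list (assumption); a real Italian list is much longer.
STOP_WORDS = {"il", "la", "un", "una", "e", "è", "di", "a", "in", "per", "che"}

# Translation table mapping every punctuation mark and digit to a space.
_REMOVE = string.punctuation + string.digits
_TABLE = str.maketrans(_REMOVE, " " * len(_REMOVE))


def preprocess(text):
    """Lowercase, strip punctuation and digits, then drop stop words.

    Stemming is intentionally omitted in this sketch.
    """
    tokens = text.lower().translate(_TABLE).split()
    return [tok for tok in tokens if tok not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) with weight tf * log(N / df) per term."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[tok] * idf[tok] for tok in vocab])
    return vocab, vectors


# Invented sample tweets, for illustration only.
sample = [
    "Crollato il ponte, 2 feriti!",
    "Il ponte è chiuso",
    "Buongiorno a tutti",
]
docs = [preprocess(t) for t in sample]
vocab, vectors = tfidf_vectors(docs)
```

Each row of `vectors` has one component per vocabulary term, so the rows can be passed to a clustering algorithm and to an SVM implementation from a machine learning library; those two steps are not covered by this sketch.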