brunowdev/disaster-response

This project uses NLP to classify disaster-related messages.

Table of Contents

  1. Requirements
  2. Project Overview
  3. File Descriptions
  4. Running
  5. Results and Discussion
  6. Licensing and dataset

Requirements

  • Python 3
  • The complete list of requirements can be found in requirements.txt

Project overview

This project uses NLP to classify disaster-related messages. The dataset is provided by Figure Eight.

Figure 1

Figure 2

Figure 3

File Descriptions

.
├── app
│   ├── run.py                        # Flask app
│   └── templates
│       ├── go.html                   # Template to display the result labels
│       └── master.html               # Main page template with the search and menu
├── data
│   ├── DisasterResponse.db           # The database containing the cleaned dataset
│   ├── disaster_categories.csv       # Raw data with the message id and categories
│   ├── disaster_messages.csv         # Raw data with the message text and genre
│   ├── process_data.py               # ETL script
│   ├── language_utils.py             # Text-related helper functions
│   ├── test_etl_pipeline.py          # Tests for the ETL pipeline
│   └── test_language_utils.py        # Tests for language_utils.py
├── models
│   ├── train_classifier.py           # Trains the model
│   ├── nlp_extractors.py             # NLP helpers (tokenizers, feature extractors)
│   ├── classifier.pkl                # The trained model
│   └── linear_model_metrics.csv      # Training scores per category (precision, recall, F1)
└── images
    └── Images for the documentation

Running

A live version of the app is available here.

To run the app locally, execute:

python app/run.py

To execute the pipeline:

python process_data.py disaster_messages.csv disaster_categories.csv DisasterResponse.db
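The ETL step follows a standard pattern: merge the two raw CSVs on the message id, split the single categories column into one binary column per category, and persist the cleaned table to SQLite. A minimal sketch of that pattern, assuming the column layout of the raw Figure Eight files (the actual process_data.py may differ in details):

```python
import sqlite3

import pandas as pd


def run_etl(messages_csv: str, categories_csv: str, database_path: str) -> pd.DataFrame:
    # Merge the raw files on the shared message id
    messages = pd.read_csv(messages_csv)
    categories = pd.read_csv(categories_csv)
    df = messages.merge(categories, on="id")

    # Split "related-1;request-0;..." into one binary column per category
    cats = df["categories"].str.split(";", expand=True)
    cats.columns = [value.split("-")[0] for value in cats.iloc[0]]
    for col in cats.columns:
        cats[col] = cats[col].str.split("-").str[1].astype(int)

    df = pd.concat([df.drop(columns=["categories"]), cats], axis=1)
    df = df.drop_duplicates()

    # Persist the cleaned table for the training step
    con = sqlite3.connect(database_path)
    df.to_sql("messages", con, index=False, if_exists="replace")
    con.close()
    return df
```

The function name and table name here are illustrative, not taken from the repository.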

To train the model:

python train_classifier.py ../data/DisasterResponse.db classifier.pkl

WARNING: The default training parameters can take a very long time to run; on my machine, training takes about 12.2 hours. Consider editing train_classifier.py to remove some parameters from the grid search.
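The simplest way to cut training time is to shrink the grid passed to GridSearchCV. The parameter names below are assumptions based on a typical TF-IDF + multi-output pipeline, not the exact names in train_classifier.py; the point is that a 2 × 2 grid means only 4 candidates per cross-validation fold:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultiOutputClassifier(LogisticRegression(max_iter=1000))),
])

# A deliberately small grid: 2 x 2 = 4 candidates instead of dozens.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__estimator__C": [0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, n_jobs=-1)
```

Dropping a parameter from the grid multiplies the total fit count down accordingly, since GridSearchCV fits every combination once per fold.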

Results and discussion

  • The dataset is imbalanced: some categories have around 100 samples, others over 10K;
  • The visualizations show that recall correlates with the number of available samples;
  • The category child_alone was removed from training, since it has no positive samples;
    • Motivation:
      • It could confuse users of the model, since it would always return false for this category.
      • Some models and classification reports require at least one true/false sample for each category.
  • While working with the data, I noticed that several messages were only partially translated or were in other languages, such as Portuguese and Spanish. So, I tried translating the messages with the Yandex Translator API.
    • In the end, the overall performance was about the same as the original model;
    • The model available in this repository was trained on the original dataset, messages.csv;
    • Regardless, the translated messages are available in messages_with_translation.csv;
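Removing a category with no positive samples, like child_alone, can be done generically before training rather than by name. A small sketch, assuming the labels live in a pandas DataFrame (the variable and function names are mine, not the repository's):

```python
import pandas as pd


def drop_constant_categories(labels: pd.DataFrame) -> pd.DataFrame:
    """Remove label columns with a single unique value (e.g. all zeros),
    since a classifier can never learn to predict the class that is absent."""
    constant = [col for col in labels.columns if labels[col].nunique() == 1]
    return labels.drop(columns=constant)
```

With this in place, child_alone (all zeros in this dataset) is dropped automatically, and any future category with the same problem is handled the same way.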

The NLP

  • I used spaCy together with NLTK, since spaCy has more stop words for English;
  • I also manually added some stop words for Portuguese/Spanish, since I had seen several messages in these languages;
  • QuestionExtractor: extracts a feature indicating whether the message contains a question (also implemented with spaCy).
  • NumericDigitExtractor: extracts a feature indicating whether the message contains any digit.
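Extractors like these are typically written as scikit-learn transformers so they can be combined with TF-IDF features in a FeatureUnion. A self-contained sketch of the two extractors named above; note the question detection here is a plain heuristic stand-in, since the repository's version uses spaCy:

```python
import re

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class NumericDigitExtractor(BaseEstimator, TransformerMixin):
    """Emit 1 if the message contains any digit, else 0."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[1 if re.search(r"\d", text) else 0] for text in X])


class QuestionExtractor(BaseEstimator, TransformerMixin):
    """Emit 1 if the message looks like a question (simplified heuristic:
    ends with '?' or starts with a question word; the original uses spaCy)."""

    QUESTION_WORDS = ("what", "where", "when", "who", "why", "how", "is", "are", "can", "do")

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        rows = []
        for text in X:
            t = text.strip().lower()
            rows.append([1 if t.endswith("?") or t.startswith(self.QUESTION_WORDS) else 0])
        return np.array(rows)
```

Because both classes implement fit/transform, they drop straight into a Pipeline or FeatureUnion alongside a TfidfVectorizer.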

Translation

  • Here is an example of message translation to English (US): the first column is the message column from the dataset, the middle column is the original column already in the dataset, and the last column is the message translated by the Yandex service.

Figure 4

  • The translation worked well for terms like SVP, an abbreviation for please, and other similar terms.
  • In general, however, the translation added several stop words, acting like a corrector for the messages. Since these words are removed during preprocessing anyway, this may be why translation did not help much.

Further improvements

  • For now, I chose not to use an oversampling technique, since in this case its only effect would be training the model with test samples;
  • A possible improvement would be to extract more samples related to these categories from sites like Twitter;

Licensing and dataset

  • The quickstart code for the web app was provided by Udacity.
  • The dataset is provided by Figure Eight.
