# Natural Language Processing with Disaster Tweets

The Disaster Tweets Classifier is an NLP project that fine-tunes DistilBERT to identify tweets about real disasters in Kaggle's "Natural Language Processing with Disaster Tweets" competition dataset. Preprocessing strips URLs and punctuation, and the model is fine-tuned for two epochs with mixed-precision training. The trained model is deployed on Hugging Face Spaces, where it classifies tweets through a simple web interface.

Data: https://www.kaggle.com/competitions/nlp-getting-started/data

Model: https://huggingface.co/alperugurcan/nlp-disaster

Hugging Face: https://huggingface.co/spaces/alperugurcan/nlp-disaster

In [1]:
!pip install pandas scikit-learn



## 1. Preprocessing

In [2]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset
import pandas as pd
import numpy as np

In [3]:
train = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
test = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')
# Remove URLs, then every character that is neither a word character nor whitespace
train['text'] = train['text'].str.replace(r'http\S+|[^\w\s]', '', regex=True)
test['text'] = test['text'].str.replace(r'http\S+|[^\w\s]', '', regex=True)
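
To see what this cleaning step does to a single tweet, the sketch below applies the same pattern with the standard-library `re` module instead of pandas. The sample tweet and the `clean` helper are made up for illustration and are not part of the notebook.

```python
import re

# Same pattern as in the notebook: a URL starting with "http",
# or any single character that is neither a word character nor whitespace
PATTERN = r'http\S+|[^\w\s]'


def clean(text: str) -> str:
    """Hypothetical helper: apply the notebook's cleaning pattern to one string."""
    return re.sub(PATTERN, '', text)


# Illustrative tweet, not taken from the dataset
sample = "Forest fire near La Ronge Sask. Canada! #wildfire http://t.co/abc123"
cleaned = clean(sample)
print(repr(cleaned))
```

Note that `#` and `@` are removed along with other punctuation, so hashtags and mentions become ordinary words, and the space that preceded the URL is left behind as trailing whitespace.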

In [4]:
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
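def tokenize(texts):
    # Pad or truncate every tweet to exactly 128 tokens so all examples have the same length
    return tokenizer(
        texts,
        padding='max_length',
        truncation=True,
        max_length=128,
        return_tensors=None  # Return plain Python lists, which Dataset.map can store
    )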

In [6]:
train_dataset = Dataset.from_dict({
    'text': train['text'].tolist(),
    'labels': train['target'].tolist()
}).map(lambda x: tokenize(x['text']), batched=True)

test_dataset = Dataset.from_dict({
    'text': test['text'].tolist()
}).map(lambda x: tokenize(x['text']), batched=True)

## 2. Model

In [7]:
# Training configuration for a short fine-tuning run
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir='./results',
        num_train_epochs=2,                    # Two passes over the training set
        per_device_train_batch_size=32,        # 32 examples per forward pass on each device
        gradient_accumulation_steps=2,         # Accumulate 2 batches per optimizer step (64 examples per device)
        warmup_ratio=0.1,                      # Warm up the learning rate over the first 10% of steps
        learning_rate=2e-4,                    # Peak learning rate
        logging_steps=100,                     # Log the training loss every 100 steps
        report_to="none",                      # Disable external experiment trackers
        fp16=True,                             # Mixed-precision training
        dataloader_num_workers=2,              # Load batches in two worker processes
        remove_unused_columns=True,            # Drop columns the model does not accept, such as 'text'
        no_cuda=False,                         # Do not force CPU; use the GPU when one is available
        load_best_model_at_end=False           # No evaluation set, so no best-checkpoint selection
    ),
    train_dataset=train_dataset
)
trainer.train()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


| Step | Training Loss |
| --- | --- |
| 100 | 0.4713 |
| 200 | 0.3262 |



TrainOutput(global_step=238, training_loss=0.37806424373338204, metrics={'train_runtime': 87.5405, 'train_samples_per_second': 173.931, 'train_steps_per_second': 2.719, 'total_flos': 504237152984064.0, 'train_loss': 0.37806424373338204, 'epoch': 2.0})
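
The `global_step=238` reported above can be reproduced by hand from the training settings. The sketch below is a rough check that assumes training ran on a single GPU; the rounding rules reflect the Trainer's usual behaviour and should be treated as an assumption, and the variable names are illustrative.

```python
import math

num_examples = 7613      # training examples (rows in train.csv)
per_device_batch = 32    # per_device_train_batch_size
grad_accum_steps = 2     # gradient_accumulation_steps
num_epochs = 2           # num_train_epochs
warmup_ratio = 0.1       # warmup_ratio
num_devices = 1          # assumption: a single GPU

# Batches the dataloader yields per epoch (the last, smaller batch is kept)
batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))

# One optimizer update for every `grad_accum_steps` batches
updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)

total_updates = num_epochs * updates_per_epoch
warmup_updates = math.ceil(total_updates * warmup_ratio)

print(batches_per_epoch, updates_per_epoch, total_updates, warmup_updates)
```

Under these assumptions the run performs 238 optimizer updates, which matches the reported `global_step`, and roughly the first 24 of them fall inside the warmup phase. With `logging_steps=100`, that also explains why only steps 100 and 200 appear in the loss log.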

## 3. Prediction and Submission

In [8]:
preds = np.argmax(trainer.predict(test_dataset).predictions, axis=1)
pd.DataFrame({'id': test['id'], 'target': preds}).to_csv('submission.csv', index=False)
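
`trainer.predict(...).predictions` holds one row of raw logits per tweet, and `np.argmax` picks the column with the larger score. The sketch below shows that step on made-up logits; the numbers and the `example_` names are hypothetical, and the softmax is included only to show that converting to probabilities does not change the chosen class.

```python
import numpy as np

# Hypothetical logits for three tweets; columns are [class 0, class 1],
# where class 1 is the "real disaster" label in this dataset
example_logits = np.array([
    [-1.2,  2.3],
    [ 0.8, -0.5],
    [ 0.1,  0.4],
])

# Numerically stable softmax: subtract the row maximum before exponentiating
shifted = example_logits - example_logits.max(axis=1, keepdims=True)
example_probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# The predicted class is the index of the largest logit in each row
example_preds = np.argmax(example_logits, axis=1)

print(example_preds)
print(example_probs.round(3))
```

Because softmax preserves the ordering within a row, taking the argmax of the logits gives the same labels as taking the argmax of the probabilities, so the notebook can skip the softmax when it only needs the 0/1 target.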



## 4. Save Model for Hugging Face Space

In [9]:
output_dir = "disaster_model"
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)

('disaster_model/tokenizer_config.json',
 'disaster_model/special_tokens_map.json',
 'disaster_model/vocab.txt',
 'disaster_model/added_tokens.json',
 'disaster_model/tokenizer.json')

In [10]:
model.save_pretrained('./model')
tokenizer.save_pretrained('./model')

('./model/tokenizer_config.json',
 './model/special_tokens_map.json',
 './model/vocab.txt',
 './model/added_tokens.json',
 './model/tokenizer.json')

In [14]:
!pip install huggingface_hub
from huggingface_hub import login

# Log in to your Hugging Face account
login()


In [15]:
from huggingface_hub import HfApi, create_repo

In [18]:
repo_name = "alperugurcan/nlp-disaster"
create_repo(repo_name)

RepoUrl('https://huggingface.co/alperugurcan/nlp-disaster', endpoint='https://huggingface.co', repo_type='model', repo_id='alperugurcan/nlp-disaster')

In [22]:
from huggingface_hub import upload_file

# Target repository on the Hugging Face Hub (replace with your username and desired model name)
repo_name = "alperugurcan/nlp-disaster"

# Model and tokenizer files saved to ./model above, uploaded in this order
model_files = [
    "model.safetensors",
    "config.json",
    "special_tokens_map.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "vocab.txt",
]

# Upload each file as its own commit
for filename in model_files:
    commit_info = upload_file(
        path_or_fileobj=f"/kaggle/working/model/{filename}",  # Path to the file on Kaggle
        path_in_repo=filename,                                # Path in the repo
        repo_id=repo_name                                     # The repo name
    )

commit_info  # Show the result of the last upload (vocab.txt)

CommitInfo(commit_url='https://huggingface.co/alperugurcan/nlp-disaster/commit/d285f524bdb4550e57428ede5416686812cb35e3', commit_message='Upload vocab.txt with huggingface_hub', commit_description='', oid='d285f524bdb4550e57428ede5416686812cb35e3', pr_url=None, repo_url=RepoUrl('https://huggingface.co/alperugurcan/nlp-disaster', endpoint='https://huggingface.co', repo_type='model', repo_id='alperugurcan/nlp-disaster'), pr_revision=None, pr_num=None)