<a href="https://colab.research.google.com/github/ali-77-gh/NLP/blob/main/processing_tweets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Natural Language Processing with Disaster Tweets (NLP)**

# Import libraries
-------------------
re: regular expressions, used to clean the text.

nltk: natural language processing, used here to remove stop words and lemmatize words.

In [1]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix, precision_score, f1_score



**Read data**: the data comes in two files, a training set and a test set.

In [2]:
train_data = pd.read_csv('/content/drive/MyDrive/train.csv')
test_data = pd.read_csv('/content/drive/MyDrive/test.csv')

# Data review

In [3]:
train_data.head()

index,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [4]:
train_data.shape

(7613, 5)

In [5]:
train_data.isnull().sum()

column,missing values
id,0
keyword,61
location,2533
text,0
target,0


# Here we **fill the missing values** with an empty string

In [6]:

train_data['keyword'] = train_data['keyword'].fillna('')
train_data['location'] = train_data['location'].fillna('')
test_data['keyword'] = test_data['keyword'].fillna('')
test_data['location'] = test_data['location'].fillna('')

In [7]:
train_data_cleaned = train_data.drop(columns=['id'])

# Download the NLTK stop-word list and the WordNet data used for lemmatization

In [8]:
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

# Text preprocessing: lowercasing, removing URLs, @mentions, the # symbol and non-letter characters, dropping stop words, and lemmatizing

In [9]:
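def preprocess_text(text):
    # lemmatizer and stop_words are module-level objects; define them before calling this function
    text = text.lower()
    # Remove URLs ('http\S+' already covers https links)
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove @mentions and the '#' sign (the hashtag word itself is kept)
    text = re.sub(r'@\w+|#', '', text)
    # Replace every non-letter character with a space
    text = re.sub(r'[^a-z]', ' ', text)
    words = text.split()
    # Drop stop words and lemmatize the remaining words
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(words)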
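
To see what each regular-expression step does, here is a minimal, standard-library-only sketch that applies the same substitutions to a made-up tweet. The sample tweet, the tiny `DEMO_STOP_WORDS` set and the helper name `clean_steps` are assumptions for illustration only; lemmatization is left out so the snippet runs without the NLTK data.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list (an assumption, far smaller).
DEMO_STOP_WORDS = {"the", "is", "a", "in", "of", "at"}


def clean_steps(text):
    """Return the text after each cleaning stage, for inspection."""
    stages = {}
    text = text.lower()
    stages["lower"] = text
    # Drop URLs.
    text = re.sub(r"http\S+|www\S+", "", text)
    stages["no_urls"] = text
    # Drop @mentions entirely; drop only the '#' sign of a hashtag.
    text = re.sub(r"@\w+|#", "", text)
    stages["no_mentions"] = text
    # Turn digits and punctuation into spaces.
    text = re.sub(r"[^a-z]", " ", text)
    stages["letters_only"] = text
    # Split on whitespace and filter stop words (no lemmatization here).
    stages["tokens"] = [w for w in text.split() if w not in DEMO_STOP_WORDS]
    return stages


sample = "Fire at the #Mall! @news_desk 3 injured https://t.co/abc123"
result = clean_steps(sample)
for name, value in result.items():
    print(f"{name:>12}: {value}")
```

Note that the hashtag word survives (only `#` is removed), while the whole `@mention` is discarded.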

# Here, a lemmatizer object and a set of stop_words are defined, then the train and test texts are cleaned.

In [10]:
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
train_data['cleaned_text'] = train_data['text'].apply(preprocess_text)
test_data['cleaned_text'] = test_data['text'].apply(preprocess_text)

**Here, a TF-IDF matrix is built from the cleaned texts. max_features=5000 limits the vocabulary to the 5000 most frequent terms in the corpus. The X variable holds the TF-IDF features of the training texts and the y variable holds the target labels.**

In [11]:
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X = tfidf_vectorizer.fit_transform(train_data['cleaned_text'])
y = train_data['target']
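
As a rough illustration of what the vectorizer computes, the sketch below works through TF-IDF by hand on a three-document toy corpus. It assumes scikit-learn's default settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) and simple whitespace tokenization; the toy corpus and the helper `tfidf_rows` are made up for this example and are not part of the notebook.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Compute L2-normalised TF-IDF weights for whitespace-tokenised documents."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in df}
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Guard against an empty document, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


toy_corpus = ["forest fire near town", "fire crew leaves town", "sunny day near lake"]
rows = tfidf_rows(toy_corpus)
for doc, row in zip(toy_corpus, rows):
    print(doc, {t: round(w, 3) for t, w in sorted(row.items())})
```

In the first document, "forest" occurs in only one document and so receives a larger weight than "fire", which occurs in two.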

# Split data for modeling

In [12]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Building an XGBoost (gradient-boosted trees) model

In [13]:
clf = XGBClassifier(random_state=42, n_estimators=300, learning_rate=0.5,
                    max_depth=7, subsample=0.9, colsample_bytree=0.9)

clf.fit(X_train, y_train)

**Evaluating the model on the validation data: accuracy, confusion matrix, precision, recall and F1 score**

In [14]:
val_predictions = clf.predict(X_val)
accuracy = accuracy_score(y_val, val_predictions)
print(f"Validation accuracy: {accuracy}")


conf_matrix = confusion_matrix(y_val, val_predictions)
print("Confusion Matrix:")
print(conf_matrix)


precision = precision_score(y_val, val_predictions)
print(f"Precision: {precision}")


recall = recall_score(y_val, val_predictions)
print(f"Recall: {recall}")

f1 = f1_score(y_val, val_predictions)
print(f"F1 Score: {f1}")



Validation accuracy: 0.7793345008756567
Confusion Matrix:
[[1079  239]
 [ 265  701]]
Precision: 0.7457446808510638
Recall: 0.7256728778467909
F1 Score: 0.7355718782791185
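
The reported scores can be re-derived from the confusion matrix alone. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]`) and uses the counts printed above; the variable names are chosen for this example.

```python
# Counts copied from the confusion matrix printed above.
tn, fp = 1079, 239
fn, tp = 265, 701

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # share of predicted disasters that are real
recall = tp / (tp + fn)     # share of real disasters that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"Validation rows: {total}")
print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 score:  {f1:.4f}")
```

The 239 false positives and 265 false negatives are of similar size, which is why precision and recall end up close to each other.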


# Now predict the labels for the test data

In [15]:
X_test = tfidf_vectorizer.transform(test_data['cleaned_text'])

test_predictions = clf.predict(X_test)

# Adding the prediction column to the test data and saving the predictions to a CSV file

In [16]:
test_data['target'] = test_predictions
output_file_path = '/content/drive/MyDrive/test_predictions.csv'
test_data[['id', 'target']].to_csv(output_file_path, index=False)

# Display the first five rows of the model predictions

In [17]:
test_data[['id', 'text', 'target']].head(5)

index,id,text,target
0,0,Just happened a terrible car crash,1
1,2,"Heard about #earthquake is different cities, s...",1
2,3,"there is a forest fire at spot pond, geese are...",1
3,9,Apocalypse lighting. #Spokane #wildfires,1
4,11,Typhoon Soudelor kills 28 in China and Taiwan,1
