## Task 1 – Text Classification using LSTM

### Import Required Libraries

This cell imports the necessary libraries:
- `TensorFlow` and `Keras` for building the LSTM model
- `pandas` and `numpy` for data manipulation
- `sklearn` for data splitting and evaluation


In [None]:
import re
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report


### Upload and Load Dataset

This block uploads a CSV file through Colab's file picker and loads it into a DataFrame.
Because the `text` column is already cleaned, it is copied unchanged into a `clean_text` column, which the later cells tokenize.


In [None]:
from google.colab import files
import pandas as pd

# Upload the file
uploaded = files.upload()

# Automatically get the uploaded filename
filename = next(iter(uploaded))

# Read the CSV
df = pd.read_csv(filename)

# Use the "text" column directly since it's already clean
df["clean_text"] = df["text"]


Saving stemmed_dataset.csv to stemmed_dataset.csv


### Tokenization and Label Encoding

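This step:
- Tokenizes the Arabic text with the Keras `Tokenizer`, using a vocabulary capped at 10,000 entries and mapping all other words to `<OOV>`
- Pads or truncates each sequence to a fixed length of 100 tokens
- Encodes the string labels as integers for classification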


In [None]:
max_words = 10000
max_len = 100

tokenizer = Tokenizer(num_words=max_words, oov_token="<OOV>")
tokenizer.fit_on_texts(df["clean_text"])
sequences = tokenizer.texts_to_sequences(df["clean_text"])
X = pad_sequences(sequences, maxlen=max_len)

# Encode string labels as integers
label_to_index = {label: idx for idx, label in enumerate(df["label"].unique())}
df["label_encoded"] = df["label"].map(label_to_index)
y = df["label_encoded"].values
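
To make the tokenization and padding steps concrete, here is a minimal pure-Python sketch of what they do: build a frequency-ranked vocabulary, map unseen or over-budget words to the OOV index, then left-pad or left-truncate to a fixed length. The helper names (`build_vocab`, `encode`, `pad_pre`) are hypothetical and this is not the Keras implementation. It skips Keras's lowercasing and punctuation filtering and only mirrors the defaults used above (index 0 for padding, index 1 for `<OOV>`, `padding='pre'`, `truncating='pre'`).

```python
from collections import Counter

PAD, OOV = 0, 1  # index 0 is padding, index 1 is the out-of-vocabulary token


def build_vocab(texts, num_words):
    """Rank words by frequency and keep only indices below num_words."""
    counts = Counter(word for text in texts for word in text.split())
    vocab = {}
    for rank, (word, _) in enumerate(counts.most_common(), start=OOV + 1):
        if rank >= num_words:
            break
        vocab[word] = rank
    return vocab


def encode(text, vocab):
    """Replace each word with its index, or with OOV if it is not in the vocabulary."""
    return [vocab.get(word, OOV) for word in text.split()]


def pad_pre(seq, maxlen):
    """Keep the last maxlen tokens and left-pad shorter sequences with PAD."""
    seq = seq[-maxlen:]
    return [PAD] * (maxlen - len(seq)) + seq


texts = ["a b a", "a c"]
vocab = build_vocab(texts, num_words=4)      # only indices 2 and 3 are kept
padded = pad_pre(encode("a c d", vocab), 5)  # "c" and "d" fall back to OOV
print(vocab, padded)
```

Because truncation happens on the left, articles longer than `max_len` keep their last 100 tokens rather than their first 100.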


### Split Dataset into Train and Test

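This block splits the dataset 80/20 into training and test sets. Stratifying on the labels keeps each class's proportion the same in both splits; it does not rebalance the classes. The split is performed on row indices so that test examples can later be traced back to the original DataFrame.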


In [None]:
from sklearn.model_selection import train_test_split

# Split and keep indices
train_indices, test_indices = train_test_split(
    np.arange(len(X)), test_size=0.2, random_state=42, stratify=y)

X_train, X_test = X[train_indices], X[test_indices]
y_train, y_test = y[train_indices], y[test_indices]
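
Stratification can be illustrated with a hand-rolled split that samples the test fraction separately within each class. This is a sketch of the idea, not scikit-learn's implementation: `stratified_split` is a hypothetical helper and the 80/20 toy labels are made up.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Split indices so that each class contributes test_size of its samples."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy, imbalanced labels for illustration only
labels = ["sports"] * 80 + ["culture"] * 20
train_idx, test_idx = stratified_split(labels)

culture_in_test = sum(labels[i] == "culture" for i in test_idx)
print(f"test size: {len(test_idx)}, culture share: {culture_in_test / len(test_idx):.2f}")
```

The minority class keeps its 20% share in the test set, which a purely random split would only achieve on average.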

### Build LSTM Model

The model includes:
- `Embedding` layer to learn word representations
- `LSTM` layer for sequence learning
- `Dropout` to reduce overfitting
- `Dense` output layer with `softmax` for multiclass classification


In [None]:
model = Sequential([
    # input_length is omitted: recent Keras versions deprecate it and infer the length from the input
    Embedding(input_dim=max_words, output_dim=128),
    LSTM(64, return_sequences=False),
    Dropout(0.5),
    Dense(64, activation='relu'),
    Dense(len(np.unique(y)), activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
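
As a rough check on model size, the trainable parameter count can be worked out by hand from the layer shapes. The sketch assumes six output classes (the number of classes shown in this notebook's classification report) and the standard Keras parameterisation of each layer; the result can be compared against `model.summary()` once the model has been built.

```python
vocab_size, embed_dim = 10_000, 128   # max_words and the Embedding output_dim
lstm_units, dense_units = 64, 64
num_classes = 6                       # assumption: six labels in the dataset

embedding = vocab_size * embed_dim
# An LSTM has 4 gates, each with input weights, recurrent weights and a bias
lstm = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
dense = lstm_units * dense_units + dense_units
output = dense_units * num_classes + num_classes
# Dropout has no trainable parameters
total = embedding + lstm + dense + output

for name, count in [("embedding", embedding), ("lstm", lstm),
                    ("dense", dense), ("output", output), ("total", total)]:
    print(f"{name:<10}{count:>10,}")
```

Under these assumptions the embedding table accounts for roughly 96% of the parameters, so shrinking `max_words` or the embedding dimension is the most direct way to make the model lighter.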




### Train the LSTM Model

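The model is trained for 5 epochs with:
- Batch size of 32
- 10% of the training data held out as validation

Keras takes the validation samples from the end of the training arrays, before shuffling. The per-epoch validation metrics make overfitting visible, but they do not prevent it.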


In [None]:
history = model.fit(X_train, y_train,
                    epochs=5,
                    batch_size=32,
                    validation_split=0.1)


Epoch 1/5
411/411 ━━━━━━━━━━━━━━━━━━━━ 6s 11ms/step - accuracy: 0.4492 - loss: 1.3142 - val_accuracy: 0.7317 - val_loss: 0.6719
Epoch 2/5
411/411 ━━━━━━━━━━━━━━━━━━━━ 5s 10ms/step - accuracy: 0.8311 - loss: 0.5160 - val_accuracy: 0.8453 - val_loss: 0.4496
Epoch 3/5
411/411 ━━━━━━━━━━━━━━━━━━━━ 3s 8ms/step - accuracy: 0.9044 - loss: 0.3209 - val_accuracy: 0.8597 - val_loss: 0.4260
Epoch 4/5
411/411 ━━━━━━━━━━━━━━━━━━━━ 3s 8ms/step - accuracy: 0.9283 - loss: 0.2453 - val_accuracy: 0.8604 - val_loss: 0.4564
Epoch 5/5
411/411 ━━━━━━━━━━━━━━━━━━━━ 4s 9ms/step - accuracy: 0.9500 - loss: 0.1750 - val_accuracy: 0.8494 - val_loss: 0.5282
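
The validation loss in the log above bottoms out at epoch 3 and then rises while the training loss keeps falling, which is the usual sign of overfitting. The sketch below applies a simple patience rule to the logged values to show where training could have stopped. `stop_epoch` is a hypothetical helper, not a Keras function.

```python
# Validation losses copied from the training log (epochs 1 to 5)
val_loss = [0.6719, 0.4496, 0.4260, 0.4564, 0.5282]


def stop_epoch(losses, patience):
    """Return (best_epoch, last_epoch_run), both 1-based, for a patience rule."""
    best_epoch, best_loss, waited = 1, losses[0], 0
    for epoch, loss in enumerate(losses[1:], start=2):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(losses)


best, stopped = stop_epoch(val_loss, patience=2)
print(f"best epoch: {best}, training would stop after epoch: {stopped}")
```

In Keras, the built-in `EarlyStopping` callback (for example with `monitor='val_loss'` and `restore_best_weights=True`) implements the same idea and can be passed to `model.fit` through the `callbacks` argument.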


### Evaluate Model with Classification Report

This cell:
- Uses the trained model to predict test set labels
- Decodes numeric predictions back to string labels
- Prints a classification report including precision, recall, and F1-score for each class


In [None]:
# Take the most probable class for each test sample
y_pred = np.argmax(model.predict(X_test), axis=1)
# Invert the label mapping; dicts keep insertion order, so names follow class indices
index_to_label = {v: k for k, v in label_to_index.items()}
print(classification_report(y_test, y_pred, target_names=list(index_to_label.values())))


115/115 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step
               precision    recall  f1-score   support

      culture       0.85      0.73      0.79       499
      economy       0.75      0.85      0.80       653
international       0.92      0.76      0.83       338
        local       0.70      0.75      0.72       648
     religion       0.96      0.97      0.96       695
       sports       0.98      0.95      0.97       819

     accuracy                           0.85      3652
    macro avg       0.86      0.84      0.85      3652
 weighted avg       0.86      0.85      0.86      3652
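
The summary rows of the report can be reproduced from the per-class rows. The sketch below recomputes the macro and weighted F1 averages from the rounded values printed above, so the results match the report only to about two decimal places.

```python
# (f1-score, support) per class, copied from the report above
report = {
    "culture": (0.79, 499),
    "economy": (0.80, 653),
    "international": (0.83, 338),
    "local": (0.72, 648),
    "religion": (0.96, 695),
    "sports": (0.97, 819),
}

total_support = sum(support for _, support in report.values())
# Macro average: every class counts equally
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)
# Weighted average: each class counts in proportion to its support
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support


def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


print(f"macro F1 = {macro_f1:.3f}, weighted F1 = {weighted_f1:.3f}")
print(f"culture F1 from P=0.85, R=0.73: {f1_score(0.85, 0.73):.3f}")
```

The weighted average sits slightly above the macro average because the two largest classes, `sports` and `religion`, are also the best-classified ones.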



### Display Sample Wrong Predictions

This cell prints the first five misclassified test examples with their true and predicted labels.  
Reading them helps identify common errors and model limitations.


In [None]:
print("\nExamples of wrong predictions:\n")
wrong_indices = np.where(y_pred != y_test)[0]
for i in wrong_indices[:5]:
    original_idx = test_indices[i]
    true_label = index_to_label[y_test[i]]
    pred_label = index_to_label[y_pred[i]]
    text_sample = df["clean_text"].iloc[original_idx]  # test_indices are positional
    print(f"Text (truncated): {text_sample[:100]}...")
    print(f"True label: {true_label}")
    print(f"Predicted label: {pred_label}\n")



Examples of wrong predictions:

Text (truncated): صلل وطن اقم فرق وعي بشر صلل خدم صرف صحي ندة عرف شرع صرف صحي صلل وذل درس خول بنت حكم علم عام صبح يوم ...
True label: economy
Predicted label: local

Text (truncated): برء من اجد حرز قبل دكتور حسن بن سعد كشب عمد كلة برء رحل ونس رضا حاج حمد الذي زار كلة صبح امس وتأ هذه...
True label: sports
Predicted label: local

Text (truncated): قعد بين ما هي همي درس ونع درس عتبرم اسس عمل و لها هدف مهم وهم انه تبن قعد بين يمكن عليها وضع خطة راد...
True label: local
Predicted label: economy

Text (truncated): هيماء من خلف بن صلح درع عقد نصر بن عبدالل عبر دير درة ربي علم نطق سطى ؤخر جمع ثني دير درس نطق لهذا ع...
True label: local
Predicted label: local

Text (truncated): كتب سلم رحب عرض فرق زون سرح خلل شرك في دور ربع هرج دلف سرح جمع ارد سرح حقق والتي الف عمد شنفر خرج وس...
True label: culture
Predicted label: local
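
Beyond reading individual examples, it can help to count which (true, predicted) pairs occur most often among the errors. The sketch below does this on small made-up label lists. In the notebook, the same function could be applied to `y_test` and `y_pred` after mapping the indices through `index_to_label`. `top_confusions` is a hypothetical helper.

```python
from collections import Counter


def top_confusions(y_true, y_pred, n=3):
    """Count (true, predicted) pairs over misclassified samples only."""
    pairs = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return pairs.most_common(n)


# Toy labels for illustration only (not taken from the dataset)
y_true = ["economy", "sports", "local", "culture", "economy", "local"]
y_pred = ["local", "local", "economy", "local", "local", "local"]

for (true_label, pred_label), count in top_confusions(y_true, y_pred):
    print(f"{true_label} -> {pred_label}: {count}")
```

A ranked list like this shows whether the errors concentrate on one pair of classes or are spread evenly, which five printed samples cannot establish on their own.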



## Conclusion

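This task demonstrated a deep learning approach to Arabic text classification using an LSTM model.  
Key observations:

- The LSTM network captured sequence-based features in Arabic texts well enough to reach 85% test accuracy (macro F1 of 0.85) across six classes.
- Results vary by class: `sports` and `religion` score highest (F1 of 0.97 and 0.96), while `local` is weakest (F1 of 0.72) and appears in most of the sample errors shown above.
- Validation loss rises after epoch 3 while training accuracy climbs to 95%, so the model begins to overfit; fewer epochs or early stopping may help.
- Performance was reasonable, though slightly below that of transformer-based models like AraBERT.
- The model remains simpler and lighter, which can be beneficial in low-resource environments.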


