# **Scenario**<br>
You are working as a Data Scientist for an e-commerce company that operates in the Middle East. The company receives thousands of customer reviews in Arabic every day. These reviews are critical for understanding customer satisfaction, identifying areas for improvement, and making data-driven decisions.

However, manually analyzing these reviews is time-consuming and inefficient. Your task is to build an automated sentiment analysis system that can classify customer reviews as positive, negative, or neutral. This system will help the company:

1. Quickly identify unhappy customers and address their concerns.

2. Analyze trends in customer feedback over time.

3. Improve products and services based on customer sentiment.



# **Installing Required Libraries** <br>
**scikit-learn**: A powerful library for machine learning in Python, providing tools for data preprocessing, model training, and evaluation.

**pandas**: A library for data manipulation and analysis, particularly useful for handling structured data.

**numpy**: A library for numerical computations, providing support for arrays and matrices.

**nltk**: A library for natural language processing tasks, such as stopword removal.

In [1]:
# Install necessary libraries
!pip install pandas numpy scikit-learn nltk



# **Importing Required Libraries** <br>

**pandas**: Used for loading and manipulating the dataset.

**train_test_split**: A function from scikit-learn used to split the dataset into training and testing sets.

**CountVectorizer**: A tool for converting text data into numerical features (bag-of-words representation).

**MultinomialNB**: A Naive Bayes classifier suitable for classification with discrete features (e.g., word counts).

**accuracy_score, classification_report, confusion_matrix**: Functions used to evaluate the performance of the model.

**nltk**: Used for preprocessing Arabic text, including stopword removal.




In [10]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import nltk
from nltk.corpus import stopwords
import re

# **Loading the Dataset** <br>
The dataset is a CSV file named 'CompanyReviews.csv'. After loading it, we:

1. Display the first few rows of the dataset to inspect its structure.

2. Check the shape of the dataset to ensure it contains the expected number of rows and columns.

3. Verify the distribution of labels ('rating') to understand class balance.


In [None]:

# Unzip the dataset
!unzip /content/archive.zip

In [11]:


# Load the dataset into a Pandas DataFrame
# (assumes the unzipped archive contains a CSV file named 'CompanyReviews.csv')
data = pd.read_csv('/content/CompanyReviews.csv')

# Display the first few rows of the dataset
print(data.head())

# Check the shape of the dataset
print(f"Dataset shape: {data.shape}")

# Check the distribution of labels
print(data['rating'].value_counts())

   Unnamed: 0                                 review_description  rating  \
0           0                                               رائع       1   
1           1  برنامج رائع جدا يساعد على تلبيه الاحتياجات بشك...       1   
2           2  التطبيق لا يغتح دائما بيعطيني لا يوجد اتصال با...      -1   
3           3                 لماذا لا يمكننا طلب من ماكدونالدز؟      -1   
4           4  البرنامج بيظهر كل المطاعم و مغلقه مع انها بتكو...      -1   

  company  
0  talbat  
1  talbat  
2  talbat  
3  talbat  
4  talbat  
Dataset shape: (40046, 4)
rating
 1    23921
-1    14200
 0     1925
Name: count, dtype: int64
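
The counts above point to a strong class imbalance. As a quick sanity check, the sketch below turns the printed counts into shares and derives the accuracy a trivial classifier would reach by always predicting the most frequent label. The variable names are illustrative and not part of the original notebook.

```python
# Label counts copied from the value_counts() output above.
label_counts = {1: 23921, -1: 14200, 0: 1925}

total = sum(label_counts.values())
shares = {label: count / total for label, count in label_counts.items()}

for label, share in shares.items():
    print(f"rating {label:>2}: {share:.1%}")

# Accuracy of a trivial model that always predicts the most frequent label
# (measured on the full dataset, before any train/test split).
majority_label = max(label_counts, key=label_counts.get)
print(f"Majority-class baseline accuracy: {shares[majority_label]:.1%}")
```

Any accuracy reported later should be read against this baseline of roughly 60%, and the neutral class makes up less than 5% of the reviews.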


# **Preprocessing the Data** <br>
We preprocess the review text to make it suitable for machine learning:

1. Ensure the input is a string (safeguard against missing or non-string values).
2. Remove characters outside the Arabic Unicode block (U+0600 to U+06FF) using a regular expression. Arabic punctuation such as ؟ lies inside this block and is kept.
3. Collapse extra whitespace to clean up the text.
4. Remove Arabic stopwords to reduce noise in the data.

*Apply the preprocessing function to the 'review_description' column in the dataset.*

In [15]:
import re
from nltk.corpus import stopwords

# Download Arabic stopwords
nltk.download('stopwords')
arabic_stopwords = set(stopwords.words('arabic'))

# Function to clean and preprocess text
def preprocess_text(text):
    # Ensure the input is a string (this is a safeguard)
    if not isinstance(text, str):
        return ''

    # Remove non-Arabic characters
    text = re.sub(r'[^\u0600-\u06FF\s]', '', text)

    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()

    # Remove stopwords
    text = ' '.join([word for word in text.split() if word not in arabic_stopwords])

    return text

# Apply preprocessing to the text column
data['cleaned_text'] = data['review_description'].apply(preprocess_text)

# Display the cleaned text
print(data[['review_description', 'cleaned_text']].head())

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


                                  review_description  \
0                                               رائع   
1  برنامج رائع جدا يساعد على تلبيه الاحتياجات بشك...   
2  التطبيق لا يغتح دائما بيعطيني لا يوجد اتصال با...   
3                 لماذا لا يمكننا طلب من ماكدونالدز؟   
4  البرنامج بيظهر كل المطاعم و مغلقه مع انها بتكو...   

                                        cleaned_text  
0                                               رائع  
1   برنامج رائع جدا يساعد تلبيه الاحتياجات بشكل اسرع  
2  التطبيق يغتح دائما بيعطيني يوجد اتصال بالشبكةم...  
3                       لماذا يمكننا طلب ماكدونالدز؟  
4  البرنامج بيظهر المطاعم مغلقه انها بتكون فاتحه ...  
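
The cleaned output shows two side effects that can matter for sentiment: the negation particle لا was removed as a stopword, and the Arabic question mark ؟ survived because it lies inside the U+0600 to U+06FF block. One possible variant is sketched below under stated assumptions: `SAMPLE_STOPWORDS` is a tiny hypothetical stand-in for NLTK's Arabic list, and `NEGATIONS` is an illustrative set of particles chosen here, not something the original pipeline defines.

```python
import re

# Hypothetical, tiny stopword set for illustration only;
# the notebook uses stopwords.words('arabic') from NLTK instead.
SAMPLE_STOPWORDS = {"لا", "من", "على", "مع", "كل"}

# Negation particles we choose to keep (an assumption, not part of the original code).
NEGATIONS = {"لا", "لم", "لن", "ليس", "مش"}

def preprocess_keep_negation(text, stopwords=SAMPLE_STOPWORDS, keep=NEGATIONS):
    """Clean Arabic text but leave negation particles in place."""
    if not isinstance(text, str):
        return ""
    # Drop diacritics (U+064B to U+0652) without splitting the word.
    text = re.sub(r"[\u064B-\u0652]", "", text)
    # Replace everything that is not an Arabic letter (roughly U+0621 to U+064A)
    # or whitespace with a space, so punctuation never glues two words together.
    text = re.sub(r"[^\u0621-\u064A\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    effective_stopwords = stopwords - keep
    return " ".join(w for w in text.split() if w not in effective_stopwords)

print(preprocess_keep_negation("لماذا لا يمكننا طلب من ماكدونالدز؟"))
```

Whether keeping negations actually helps should be checked empirically by retraining and comparing the evaluation metrics.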



To evaluate the model's performance, we split the dataset into training and testing sets:

**Training Set (80%)**: Used to train the model.

**Testing Set (20%)**: Used to evaluate the model's performance on unseen data.

*The train_test_split() function is used for this purpose, with random_state=42 ensuring reproducibility of the results*


In [16]:
# Split the data into features (X) and labels (y)
X = data['cleaned_text']
y = data['rating']

# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")

Training samples: 32036
Testing samples: 8010
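
The split above is purely random, so the share of the rare neutral class can drift between the training and testing sets. A common alternative is a stratified split via the `stratify` parameter of `train_test_split()`. The sketch below uses toy labels with a similar imbalance; the data and variable names are illustrative only.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy labels with roughly the same imbalance as the dataset (illustrative only).
labels = [1] * 60 + [-1] * 35 + [0] * 5
texts = [f"review {i}" for i in range(len(labels))]

# stratify=labels keeps the class proportions the same in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print("train:", Counter(y_tr))
print("test: ", Counter(y_te))
```

In the notebook this would amount to passing `stratify=y` in the existing call, which would change the exact samples in each subset and therefore the reported numbers.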


# **Text Vectorization** <br>
We convert the text data into numerical features using CountVectorizer:

**fit_transform()**: This method learns the vocabulary from the training data and transforms the text into a matrix of token counts.

**transform()**: This method applies the same transformation to the testing data using the vocabulary learned from the training data.

*The result is a sparse matrix where each row represents a review, and each column represents a word in the vocabulary.*

In [18]:
# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the training data
X_train_vec = vectorizer.fit_transform(X_train)

# Transform the testing data
X_test_vec = vectorizer.transform(X_test)
print(f"Shape of training data: {X_train_vec.shape}")
print(f"Shape of testing data: {X_test_vec.shape}")

Shape of training data: (32036, 39361)
Shape of testing data: (8010, 39361)
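
To see what `fit_transform()` and `transform()` do on a small scale, the sketch below vectorizes three made-up reviews. The documents and the name `toy_vectorizer` are illustrative. Note that a word seen only at test time has no column and is silently ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["التطبيق رائع جدا", "التطبيق بطيء جدا"]
test_docs = ["التطبيق ممتاز"]  # "ممتاز" never appears in the training documents

toy_vectorizer = CountVectorizer()
train_matrix = toy_vectorizer.fit_transform(train_docs)  # learns the vocabulary
test_matrix = toy_vectorizer.transform(test_docs)        # reuses that vocabulary

print(sorted(toy_vectorizer.vocabulary_))
print(train_matrix.toarray())
print(test_matrix.toarray())
```

Both matrices have the same number of columns, which is why the training and testing shapes above share the second dimension (39361).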


# **Training the Naive Bayes Model** <br>
We initialize a Multinomial Naive Bayes classifier, which is well-suited for text classification tasks. The model is then trained on the vectorized training data (X_train_vec) and corresponding labels (y_train).

*Naive Bayes is a probabilistic classifier that assumes independence between features, making it efficient and effective for text data.*
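
To make the independence assumption concrete, here is a minimal from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. It is a teaching aid, not scikit-learn's implementation; the functions `train_nb` and `predict_nb` and the toy reviews are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class word log-likelihoods."""
    vocab = {w for d in docs for w in d.split()}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_lik = {}
    for c in class_counts:
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab}
    return log_prior, log_lik

def predict_nb(doc, log_prior, log_lik):
    """Pick the class with the highest log posterior; unseen words are skipped."""
    scores = {
        c: log_prior[c] + sum(log_lik[c][w] for w in doc.split() if w in log_lik[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

docs = ["رائع جدا", "خدمة رائعة", "سيء جدا", "خدمة سيئة"]
labels = [1, 1, -1, -1]
prior, lik = train_nb(docs, labels)
print(predict_nb("رائع", prior, lik))
print(predict_nb("سيء", prior, lik))
```

Each word contributes its own log-likelihood term independently of the others, and the class prior means that frequent classes start with an advantage.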



In [19]:
# Initialize the Naive Bayes model
model = MultinomialNB()

# Train the model on the training data
model.fit(X_train_vec, y_train)

# **Making Predictions** <br>
After training the model, we use it to make predictions on the test set (X_test_vec). The predict() method returns the predicted labels for the test data, which we store in y_pred.

In [20]:
# Make predictions on the test set
y_pred = model.predict(X_test_vec)

# **Evaluating the Model** <br>
To assess the model's performance, we calculate the following metrics:

**Accuracy:** The proportion of correctly classified instances out of the total instances.

**Classification Report:** Provides precision, recall, F1-score, and support for each class.

**Confusion Matrix:** A matrix showing the actual vs. predicted classifications, helping to visualize the model's performance.

In [23]:
# Calculate accuracy once and print it as a percentage and as a fraction
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
print(f"Accuracy: {accuracy:.4f}")

# Print classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Accuracy: 81.54%
Accuracy: 0.8154

Classification Report:
              precision    recall  f1-score   support

          -1       0.78      0.79      0.79      2860
           0       0.00      0.00      0.00       418
           1       0.83      0.90      0.87      4732

    accuracy                           0.82      8010
   macro avg       0.54      0.56      0.55      8010
weighted avg       0.77      0.82      0.79      8010



# **Classifying New Reviews** <br>
Test the trained model on new, unseen comments to demonstrate its practical application:

**New comments**: We provide a list of new review comments to classify.

**Vectorization**: The new comments are transformed using the same CountVectorizer to ensure consistency with the training data.

**Prediction:** The trained model predicts whether each comment is positive, negative, or neutral.

**Output:** The predictions are displayed alongside the original comments, showcasing the model's real-world applicability.

In [29]:
# New comments to classify
new_comments = [
    "المنتج جيد جداً وأعجبني كثيراً.",
    "انتوا صفحة مقرفة وانا تعبت منكم",
    "منتج مقبول بالنسبة لسعره",
    "مفيش حد بيرد عليا وانا مش هتعامل معاكم تاني",
]

# Convert new comments to numerical features
new_comments_vec = vectorizer.transform(new_comments)

# Make predictions
predictions = model.predict(new_comments_vec)

# Display predictions
# Define a mapping dictionary
sentiment_map = {
    1: "Positive",
    -1: "Negative",
    0: "Neutral"
}
for comment, prediction in zip(new_comments, predictions):
    sentiment_label = sentiment_map[prediction]  # Map the prediction to a label
    print(f"Comment: {comment}\nPrediction: {sentiment_label}\n")

Comment: المنتج جيد جداً وأعجبني كثيراً.
Prediction: Positive

Comment: انتوا صفحة مقرفة وانا تعبت منكم
Prediction: Negative

Comment: منتج مقبول بالنسبة لسعره
Prediction: Positive

Comment: مفيش حد بيرد عليا وانا مش هتعامل معاكم تاني
Prediction: Negative
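
The cell above passes the raw comments straight to the vectorizer, whereas the training text first went through `preprocess_text()`. One way to keep the two paths consistent is to bundle cleaning, vectorization, and the classifier into a single scikit-learn `Pipeline`. The sketch below is a minimal illustration on toy data; `clean` is a simplified stand-in for `preprocess_text()` without stopword removal, and the training examples are made up.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def clean(text):
    # Simplified stand-in for preprocess_text(): keep Arabic-block characters only.
    text = re.sub(r"[^\u0600-\u06FF\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

# Toy training data (illustrative, not taken from CompanyReviews.csv).
train_texts = ["رائع جدا", "ممتاز وسريع", "سيء جدا", "بطيء ومزعج"]
train_labels = [1, 1, -1, -1]

# The vectorizer calls clean() on every document, at fit time and at predict time.
sentiment_pipeline = Pipeline([
    ("vectorizer", CountVectorizer(preprocessor=clean)),
    ("classifier", MultinomialNB()),
])
sentiment_pipeline.fit(train_texts, train_labels)

print(sentiment_pipeline.predict(["رائع!!! 100%"]))
```

Because the cleaning step lives inside the pipeline, new comments cannot accidentally skip it.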

