# **CRISP-DM Methodology for Data Science:**
The standard CRISP-DM process has six phases; this notebook merges the first two and follows 5 steps:
* ***Step 1- Business & Data Understanding***: The goal of the first step is to identify the variables (number, types, quality), the classes (number of classes) and the volume (number of samples).
* ***Step 2- Data Preparation***: This step cleans, analyzes, encodes, normalizes and splits the data.
* ***Step 3- Machine Learning***: Training machine learning algorithms on the prepared data.
* ***Step 4- Performance Evaluation***: Evaluating the performance using metrics such as precision, recall and F1-score.
* ***Step 5- Deployment***: Saving the model and exposing it through a web interface (API, service).

# **Step 1- Business & Data Understanding**


In [1]:
import pandas as pd
df=pd.read_csv("finals.csv")
df.head()


Unnamed: 0,Comment_Text_Arabic,Problem_Source
0,يا ولادي شريت تاليفون جديد، بعد جمعة البطارية ...,المنتج
1,الطلبية وصلتني ناقصة، و خدمة العملاء ما يجاوبو...,الخدمة
2,الغسالة من أول استعمال تعمل في حس غريب و ما تن...,المنتج
3,طلبت حذاء، جابولي قياس خاطئ و باش نبدل حكاية!,الخدمة
4,الخامة نتاع التيشرت هذا خايبة برشا، لبستين و ت...,المنتج


This dataset contains the following variables:


*   **Comment_Text_Arabic**: the text of each customer comment
*   **Problem_Source**: the label, with 2 classes indicating where the problem came from: product (المنتج) or service (الخدمة)



In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1036 entries, 0 to 1035
Data columns (total 2 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Comment_Text_Arabic  1036 non-null   object
 1   Problem_Source       1036 non-null   object
dtypes: object(2)
memory usage: 16.3+ KB


The data includes 1036 samples with two types of variables:
* *Feature* data:
  * Comment_Text_Arabic (1036 non-null, object)
* *Label* data:
  * Problem_Source (1036 non-null, object)
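
Step 1 also calls for looking at the classes, so a quick check of the label distribution and of duplicated comments is useful before modelling. A minimal sketch, assuming a DataFrame shaped like `df`; the four rows below are made-up stand-ins, not rows from `finals.csv`:

```python
import pandas as pd

# Toy stand-in for df (hypothetical rows; the real data comes from finals.csv)
toy_df = pd.DataFrame({
    "Comment_Text_Arabic": ["تعليق 1", "تعليق 2", "تعليق 3", "تعليق 4"],
    "Problem_Source": ["المنتج", "الخدمة", "المنتج", "المنتج"],
})

# Absolute and relative frequency of each class
class_counts = toy_df["Problem_Source"].value_counts()
class_ratios = toy_df["Problem_Source"].value_counts(normalize=True)

# Exact duplicate comments can inflate test scores if they land on both sides of a split
n_duplicates = int(toy_df["Comment_Text_Arabic"].duplicated().sum())

print(class_counts)
print(class_ratios.round(2))
print("duplicate comments:", n_duplicates)
```

On the real data, a strongly skewed ratio would be a reason to stratify the split and to read per-class metrics rather than accuracy alone.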

# **Step 2 - Data Preparation**

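There is no missing data, but the text has to be encoded as numerical vectors. This is text vectorization; here it is done with TF-IDF, which weights terms by frequency rather than learning dense word embeddings. Then, we will split the data into training and test sets.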


## **2.1. Cleaning using `re` and NLTK**

In [3]:
import re
from nltk.corpus import stopwords
import nltk

nltk.download('stopwords')
arabic_stopwords = set(stopwords.words('arabic'))

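def normalize_arabic(text):
    """Unify common Arabic letter variants and drop non-Arabic characters."""
    text = re.sub(r'[إأآا]', 'ا', text)  # alef variants -> bare alef
    text = re.sub(r'ى', 'ي', text)  # alef maqsura -> ya
    text = re.sub(r'ؤ', 'ء', text)  # waw with hamza -> hamza
    text = re.sub(r'ئ', 'ء', text)  # ya with hamza -> hamza
    text = re.sub(r'ة', 'ه', text)  # ta marbuta -> ha
    text = re.sub(r'[^؀-ۿ\s]', '', text)  # Keep only the Arabic block (U+0600 to U+06FF) and whitespace
    text = re.sub(r'\s+', ' ', text).strip()  # Collapse repeated whitespace
    return text

def clean_arabic_text(text):
    """Normalize the text, then remove NLTK's Arabic stopwords."""
    text = normalize_arabic(text)
    tokens = text.split()
    # Note: the stopword list itself is not normalized, so stopwords spelled
    # with أ, إ, ى or ة may no longer match the normalized tokens.
    tokens = [t for t in tokens if t not in arabic_stopwords]
    return ' '.join(tokens)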

# Apply preprocessing
df['cleaned'] = df['Comment_Text_Arabic'].apply(clean_arabic_text)
df.head()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\youss.YOUSSEF\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,Comment_Text_Arabic,Problem_Source,cleaned
0,يا ولادي شريت تاليفون جديد، بعد جمعة البطارية ...,المنتج,ولادي شريت تاليفون جديد، جمعه البطاريه طاحت جمله
1,الطلبية وصلتني ناقصة، و خدمة العملاء ما يجاوبو...,الخدمة,الطلبيه وصلتني ناقصه، خدمه العملاء يجاوبوش
2,الغسالة من أول استعمال تعمل في حس غريب و ما تن...,المنتج,الغساله اول استعمال تعمل حس غريب تنظفش بالباهي
3,طلبت حذاء، جابولي قياس خاطئ و باش نبدل حكاية!,الخدمة,طلبت حذاء، جابولي قياس خاطء باش نبدل حكايه
4,الخامة نتاع التيشرت هذا خايبة برشا، لبستين و ت...,المنتج,الخامه نتاع التيشرت خايبه برشا، لبستين تريّش
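
The cleaned output above still contains the Arabic comma (`،`) and the shadda in `تريّش`, because both fall inside the U+0600 to U+06FF range that `normalize_arabic` keeps. A minimal sketch of an extra step that strips them; `strip_marks` is a hypothetical helper, not part of the original notebook, and whether removing diacritics helps this classifier is untested:

```python
import re

# Tashkeel (fathatan through sukun) plus tatweel (kashida)
DIACRITICS = re.compile(r'[\u064B-\u0652\u0640]')
# Arabic comma, semicolon and question mark
ARABIC_PUNCT = re.compile(r'[\u060C\u061B\u061F]')

def strip_marks(text):
    """Remove Arabic diacritics, tatweel and punctuation, then collapse spaces."""
    text = DIACRITICS.sub('', text)
    text = ARABIC_PUNCT.sub(' ', text)
    return re.sub(r'\s+', ' ', text).strip()

print(strip_marks('برشا، لبستين تريّش'))
```

If adopted, it would be called inside `clean_arabic_text` so that training and inference apply the same cleaning.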


In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=1000)
X_tfidf = vectorizer.fit_transform(df['cleaned']).toarray()
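
To make the vectorization step concrete, the sketch below computes TF-IDF by hand with the formula scikit-learn applies under its default settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization per document. It is a simplification: tokens are split on whitespace, whereas the default `TfidfVectorizer` tokenizer also drops single-character tokens, and the two example documents are made up.

```python
import math
from collections import Counter

docs = ["البطاريه طاحت", "الخدمه طاحت برشا"]  # toy documents, not from the dataset
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: number of documents containing each term
df_counts = Counter(t for doc in tokenized for t in set(doc))
# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + df_counts[t])) + 1 for t in vocab}

def tfidf_vector(tokens):
    """Return the L2-normalized TF-IDF vector of one tokenized document."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

vectors = [tfidf_vector(doc) for doc in tokenized]
for term, weight in zip(vocab, vectors[0]):
    print(term, round(weight, 3))
```

A term that appears in every document (here `طاحت`) gets the minimum IDF of 1, so rarer terms dominate each vector.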

In [5]:
# Strip stray double quotes and surrounding whitespace from the labels
df['Problem_Source'] = df['Problem_Source'].str.replace('"', '', regex=False).str.strip()

In [6]:
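from sklearn.feature_extraction.text import TfidfVectorizer

# This vectorizer instance is the one used for training and inference below.
# Note: it is fit on the raw comments, not on df['cleaned'], while the test in
# In [11] cleans its input first. Fitting on df['cleaned'] would keep both paths consistent.
vectorizer = TfidfVectorizer(max_features=1000)
X = vectorizer.fit_transform(df['Comment_Text_Arabic']).toarray()
y = df['Problem_Source']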


In [7]:
y.unique()

array(['المنتج', 'الخدمة'], dtype=object)

## **2.2. Split of Data**

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
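
In the cells above the vectorizer is fit on all comments before the split, so the vocabulary and IDF weights are influenced by the test rows, and the split itself is neither stratified nor seeded. A sketch of one way to tighten this; the toy English texts, the `LinearSVC` classifier and `random_state=42` are illustrative assumptions, not the notebook's setup:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus (hypothetical): 10 "product" and 10 "service" comments
texts = (
    [f"battery screen broken item{i}" for i in range(10)]
    + [f"delivery late support call{i}" for i in range(10)]
)
labels = ["product"] * 10 + ["service"] * 10

# Split the raw text first; stratify keeps the class ratio, random_state fixes the shuffle
txt_train, txt_test, lab_train, lab_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# The pipeline fits TF-IDF on the training portion only
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=1000)),
    ("clf", LinearSVC()),
])
pipe.fit(txt_train, lab_train)
print("test accuracy:", pipe.score(txt_test, lab_test))
```

A side benefit of the pipeline is that a single object carries both the vectorizer and the classifier, which simplifies saving and loading later.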


# **Step 3 - Machine Learning**

In [9]:
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Initialization
nb = GaussianNB()
nn = MLPClassifier(hidden_layer_sizes=(20, 20), activation="logistic", solver='adam')
linear_svm = SVC(kernel='linear')
rbf_svm = SVC(kernel='rbf')
sgd_svm = SVC(kernel='sigmoid')  # sigmoid-kernel SVM; despite the name, no SGD is involved
poly_svm = SVC(kernel='poly', degree=2)

# Training
nb.fit(X_train, y_train)
nn.fit(X_train, y_train)
linear_svm.fit(X_train, y_train)
rbf_svm.fit(X_train, y_train)
sgd_svm.fit(X_train, y_train)
poly_svm.fit(X_train, y_train)

# Prediction
y_pred_nb = nb.predict(X_test)
y_pred_nn = nn.predict(X_test)
y_pred_rbf = rbf_svm.predict(X_test)
y_pred_linear = linear_svm.predict(X_test)
y_pred_sgd = sgd_svm.predict(X_test)
y_pred_poly = poly_svm.predict(X_test)
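
A single 80/20 split gives one estimate per model; with about a thousand samples, k-fold cross-validation gives a steadier comparison. A sketch under stated assumptions: the toy English data, the use of `MultinomialNB` (which works directly on sparse TF-IDF output, unlike `GaussianNB`) and the 5-fold setting are illustrative choices, not the notebook's setup:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy corpus (hypothetical): 10 comments per class
texts = (
    [f"battery screen broken item{i}" for i in range(10)]
    + [f"delivery late support call{i}" for i in range(10)]
)
labels = ["product"] * 10 + ["service"] * 10

# Each candidate bundles its own vectorizer, so TF-IDF is re-fit inside every fold
candidates = {
    "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "linear_svm": make_pipeline(TfidfVectorizer(), SVC(kernel="linear")),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, texts, labels, cv=cv, scoring="f1_macro")
    cv_scores[name] = scores
    print(f"{name}: mean macro-F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Reporting the mean together with the spread across folds makes it easier to tell whether a gap between two models is larger than the split-to-split noise.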



# **Step 4- Performance Evaluation**

In [10]:
from sklearn.metrics import classification_report

print("************ Performance of Naive Bayes *************")
print(classification_report(y_test, y_pred_nb))

print("************ Performance of Neural Network *************")
print(classification_report(y_test, y_pred_nn))

print("************ Performance of Linear SVM *************")
print(classification_report(y_test, y_pred_linear))

print("************ Performance of RBF SVM *************")
print(classification_report(y_test, y_pred_rbf))

print("************ Performance of Sigmoid SVM *************")
print(classification_report(y_test, y_pred_sgd))

print("************ Performance of Poly SVM *************")
print(classification_report(y_test, y_pred_poly))


************ Performance of Naive Bayes *************
              precision    recall  f1-score   support

      الخدمة       0.87      0.88      0.88        92
      المنتج       0.90      0.90      0.90       116

    accuracy                           0.89       208
   macro avg       0.89      0.89      0.89       208
weighted avg       0.89      0.89      0.89       208

************ Performance of Neural Network *************
              precision    recall  f1-score   support

      الخدمة       0.91      0.95      0.93        92
      المنتج       0.96      0.92      0.94       116

    accuracy                           0.93       208
   macro avg       0.93      0.93      0.93       208
weighted avg       0.93      0.93      0.93       208

************ Performance of Linear SVM *************
              precision    recall  f1-score   support

      الخدمة       0.91      0.91      0.91        92
      المنتج       0.93      0.93      0.93       116

    accuracy      

### ✅ Neural Network

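The neural network achieved the highest scores among the reports shown: 0.93 accuracy on the 208 test samples, with F1-scores of 0.93 (الخدمة) and 0.94 (المنتج), against 0.89 accuracy for Naive Bayes.  
It is a reasonable choice if more data becomes available or the task grows more complex later.  
It is flexible and scalable, though it requires more resources and tuning expertise than the other models.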
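
The per-class figures in the Neural Network report follow directly from confusion-matrix counts. The counts below are not printed anywhere in the notebook; they are inferred as the only integers consistent with the rounded recall and support values in the report (87 of 92 الخدمة and 107 of 116 المنتج predicted correctly), so treat them as a reconstruction:

```python
# Rows: true class, columns: predicted class (reconstructed from the report)
confusion = {
    "service": {"service": 87, "product": 5},    # الخدمة, support 92
    "product": {"service": 9, "product": 107},   # المنتج, support 116
}

def class_metrics(cls):
    """Return (precision, recall, f1) for one class."""
    tp = confusion[cls][cls]
    fn = sum(v for k, v in confusion[cls].items() if k != cls)
    fp = sum(row[cls] for k, row in confusion.items() if k != cls)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total = sum(sum(row.values()) for row in confusion.values())
accuracy = sum(confusion[c][c] for c in confusion) / total

for cls in confusion:
    p, r, f1 = class_metrics(cls)
    print(f"{cls}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

This reproduces the report's rows (0.91 / 0.95 / 0.93 for الخدمة, 0.96 / 0.92 / 0.94 for المنتج) and the 0.93 accuracy, which means 14 of the 208 test comments were misclassified.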

In [11]:
# Test the trained model on a new comment
text = "برودوي مجا شي حكايتو فارغة منغير متشريو "

cleaned_text = clean_arabic_text(text)

text_vectorized = vectorizer.transform([cleaned_text]).toarray()

prediction = nn.predict(text_vectorized)

print(f"The predicted problem source for the text is: {prediction[0]}")


The predicted problem source for the text is: المنتج


# **Step 5 - Deployment**

In [12]:
import pickle

with open("MLP_model.pkl", "wb") as f:
    pickle.dump(nn, f)
with open("tfidf_vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)
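# Note: pickle stores a function by reference (module and name), not its code,
# so clean_arabic_text must be importable wherever this file is loaded.
with open("cleaner-text.pkl", "wb") as f:
    pickle.dump(clean_arabic_text, f)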
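
On the serving side, the saved objects have to be loaded back and applied in the same order as in the test cell: clean, vectorize, predict. A sketch of that round trip under stated assumptions: a tiny stand-in model trained on made-up English text, a temporary directory instead of the notebook's file names, and a hypothetical `predict_source` helper that receives the cleaning step as a plain function argument (`cleaner`) rather than unpickling it:

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

# Stand-in training data and model (hypothetical, for the round trip only)
train_texts = ["battery died fast", "screen cracked early",
               "delivery very late", "support never answered"]
train_labels = ["product", "product", "service", "service"]
toy_vectorizer = TfidfVectorizer()
toy_model = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
toy_model.fit(toy_vectorizer.fit_transform(train_texts), train_labels)

def predict_source(text, model, vectorizer, cleaner=str.strip):
    """Clean, vectorize and classify one comment."""
    return model.predict(vectorizer.transform([cleaner(text)]))[0]

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pkl"
    vec_path = Path(tmp) / "vectorizer.pkl"
    model_path.write_bytes(pickle.dumps(toy_model))
    vec_path.write_bytes(pickle.dumps(toy_vectorizer))

    # What a web service would do once at startup
    loaded_model = pickle.loads(model_path.read_bytes())
    loaded_vectorizer = pickle.loads(vec_path.read_bytes())

sample = " battery died "
before = predict_source(sample, toy_model, toy_vectorizer)
after = predict_source(sample, loaded_model, loaded_vectorizer)
print(before, after)
```

In a real service, `cleaner` would be `clean_arabic_text` imported from a shared module, and pickle files should only be loaded from a trusted source, since unpickling can execute arbitrary code.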