# "LibnaniYa3ni?TM" #

### A predictor of Lebanese-ness. ###

# ------------------------------------------------- #

### This project uses Naive Bayes classification to label WhatsApp messages as either slang (Lebanese Slang Arabic) or formal (Fushaa Arabic). ###

### Step 0: Base packages  ###

In [116]:
import pandas as pd
import numpy as np

### Step 1: Initializing our dictionary/ "Bag of our Words" ###

In [117]:
dictionarydf = pd.read_csv("/Users/y_mehio78/Documents/LebSlangAttempt1.csv")
print(dictionarydf.value_counts())
print("----------------------------------------------")
print(dictionarydf.info())

English              Arabic          Dialect 
tea                  shay            Fushaa      2
restaurant           mateam          Fushaa      2
blue                 azra2/zar2a     Lebanese    2
                     'azraq          Fushaa      2
from a while         mundh hin       Fushaa      2
                                                ..
evil                 shar            Fushaa      1
                     na7es           Lebanese    1
everytime            kulu marih      Fushaa      1
                     kel marra       Lebanese    1
you see, how it is?  turaa kayf hu?  Fushaa      1
Length: 1468, dtype: int64
----------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1494 entries, 0 to 1493
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   English  1494 non-null   object
 1   Arabic   1494 non-null   object
 2   Dialect  1494 non-null   object
dtypes: object(3)
memory usa

### Step 2: Create encoder column ###

In [118]:
## Fushaa is our base value of 0; Lebanese slang words are 1; anything unidentifiable gets the sentinel 9999 (since there are nowhere near 9999 dialects of slang Arabic)
def slang_identifier(x): 
    if x == 'Lebanese': 
        return 1
    elif x == 'Fushaa': 
        return 0
    else:
        return 9999

In [119]:
dictionarydf['SlangDetection'] = dictionarydf['Dialect'].apply(slang_identifier)
print(dictionarydf.head())

            English           Arabic Dialect  SlangDetection
0  is sitting/doing  yajlis / yafeal  Fushaa               0
1              firm          sharika  Fushaa               0
2             tough              qas  Fushaa               0
3            coffee            qahwa  Fushaa               0
4              moon          alqamar  Fushaa               0
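
The same encoding can be written without a helper function. Below is a minimal sketch, assuming a toy frame (`toy`, a stand-in for `dictionarydf`) with a hypothetical "Unknown" label added only to show the fallback value; `codes` is a name introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for dictionarydf; "Unknown" is a hypothetical label, not from the dataset
toy = pd.DataFrame({"Dialect": ["Fushaa", "Lebanese", "Unknown"]})

# Same mapping as slang_identifier: Fushaa -> 0, Lebanese -> 1, anything else -> 9999
codes = {"Fushaa": 0, "Lebanese": 1}
toy["SlangDetection"] = toy["Dialect"].map(codes).fillna(9999).astype(int)

print(toy)
```

`Series.map` leaves unmapped labels as missing values, so `fillna(9999)` plays the role of the `else` branch.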


In [120]:
from sklearn.model_selection import train_test_split as TTS

X_train, X_test, y_train, y_test = TTS(dictionarydf.Arabic, dictionarydf.Dialect, test_size=0.2)
X_train.shape, X_test.shape, type(X_train)

((1195,), (299,), pandas.core.series.Series)
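
Note that the split above is not seeded, so the rows that land in each set (and every score further down) can change from run to run. The sketch below shows one way to make the split repeatable and to keep the Fushaa/Lebanese ratio the same in both sets. It assumes a toy frame built from a few entries of the Step 1 preview; the seed value 42 and the helper name `split` are arbitrary choices for illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for dictionarydf, using entries visible in the Step 1 preview
toy = pd.DataFrame({
    "Arabic": ["shay", "mateam", "'azraq", "mundh hin", "shar", "kulu marih",
               "azra2/zar2a", "na7es", "kel marra"],
    "Dialect": ["Fushaa"] * 6 + ["Lebanese"] * 3,
})

def split(seed):
    # random_state fixes the shuffle; stratify keeps the class ratio in both sets
    return train_test_split(
        toy.Arabic, toy.Dialect,
        test_size=3, random_state=seed, stratify=toy.Dialect,
    )

X_train, X_test, y_train, y_test = split(42)
X_train_again, *_ = split(42)

print(y_test.value_counts().to_dict())              # class counts in the test set
print(X_train.tolist() == X_train_again.tolist())   # same seed, same split
```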

In [121]:
from sklearn.feature_extraction.text import CountVectorizer as CV

m = CV()

X_train_cv = m.fit_transform(X_train.values)
## Dense count vector for the first training entry
X_train_cv.toarray()[0]

array([0, 0, 0, ..., 0, 0, 0])
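
To see what ends up in the vocabulary, it can help to look at how `CountVectorizer` tokenizes a single entry. The sketch below uses the default settings (lowercasing, tokens of two or more word characters) on a few strings taken from this notebook; `analyze`, `samples`, and `tokens` are names introduced here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the same preprocessing + tokenizing function
# that fit_transform/transform apply to every entry
analyze = CountVectorizer().build_analyzer()

samples = [
    "azra2/zar2a",        # slash-separated variants become two tokens
    "'azraq",             # the leading apostrophe is dropped
    "yajlis / yafeal",
    "Kiif the family?",   # lowercased, punctuation removed
    "I hate my life.",    # the single-character token "I" is dropped
]

tokens = {s: analyze(s) for s in samples}
for text, toks in tokens.items():
    print(f"{text!r} -> {toks}")
```

Digits count as word characters, which is why transliterations such as `2amar` or `na7es` survive as single tokens.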

In [122]:
## Provides descriptives of our dataset:
## the shape, the length of, as well as our entire dictionary, for reference.
print("Dictionary Dimensions: ", X_train_cv.shape, ".", 
"Dictionary Length: ", m.get_feature_names_out().shape)
m.vocabulary_

Dictionary Dimensions:  (1195, 1363) . Dictionary Length:  (1363,)


{'8ani': 110,
 '8aniiyi': 111,
 'akhw': 148,
 'alam': 159,
 'nuzha': 968,
 'daiiman': 420,
 'sa7': 1048,
 'hamaam': 586,
 'sibaha': 1144,
 'shughil': 1140,
 'rabtat': 1007,
 'eunuq': 508,
 'nos': 966,
 'rasim': 1015,
 'gâteau': 569,
 'rakhis': 1010,
 'almuntaji': 226,
 'jumhur': 660,
 'ghurfat': 568,
 'almaeisha': 218,
 'fi': 541,
 'bi': 371,
 'or': 972,
 'depending': 429,
 'on': 971,
 'letter': 760,
 'that': 1223,
 'follows': 549,
 '2aasi': 3,
 '2aasyi': 4,
 '5edmi': 75,
 'qarurat': 989,
 'ma': 774,
 'allawn': 209,
 'ktiir': 720,
 'ealaa': 452,
 'al': 151,
 'aqali': 291,
 'alwaqt': 261,
 '2araaiib': 8,
 'ma5raj': 778,
 'zaman': 1349,
 'fedda': 535,
 'shou': 1137,
 'yabdu': 1303,
 'el': 469,
 'mustahlek': 930,
 '2ezaez': 20,
 'special': 1162,
 'french': 553,
 'tifl': 1227,
 'kariim': 683,
 'kariimi': 684,
 'piscine': 976,
 'qalilan': 983,
 'faqat': 526,
 'mushkila': 924,
 'huma': 611,
 'mustashfaa': 932,
 'kazdoura': 689,
 'ma3': 775,
 'ma3a': 776,
 'ziara': 1356,
 'alwaqie': 260,
 'bi

### Step 3: Naive Bayes ###

In [125]:
from sklearn.naive_bayes import MultinomialNB as MNB

In [126]:
##Training the model on our training data via fitting
model = MNB()
model.fit(X_train_cv, y_train)

MultinomialNB()

In [127]:
X_test_cv = m.transform(X_test)

In [162]:
from sklearn.metrics import classification_report as CR

y_pred = model.predict(X_test_cv)

print(CR(y_test, y_pred))

              precision    recall  f1-score   support

      Fushaa       0.59      0.99      0.74       146
    Lebanese       0.96      0.35      0.52       153

    accuracy                           0.66       299
   macro avg       0.78      0.67      0.63       299
weighted avg       0.78      0.66      0.63       299
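
The summary rows of the report can be reproduced from the per-class rows. The sketch below only redoes that arithmetic with the rounded figures printed above, so the results are approximate; the variable names are introduced here for illustration.

```python
# Per-class figures copied from the report above (already rounded to 2 decimals)
support = {"Fushaa": 146, "Lebanese": 153}
precision = {"Fushaa": 0.59, "Lebanese": 0.96}
recall = {"Fushaa": 0.99, "Lebanese": 0.35}

total = sum(support.values())

# Accuracy = correctly labelled entries / all entries,
# where correct entries per class = recall * support
accuracy = sum(recall[c] * support[c] for c in support) / total

# Macro average treats both classes equally; weighted average weights by support
macro_recall = sum(recall.values()) / len(recall)
weighted_precision = sum(precision[c] * support[c] for c in support) / total

print(round(accuracy, 2), round(macro_recall, 2), round(weighted_precision, 2))
```

Read together, the rows say that nearly every Fushaa entry is recognised (recall 0.99), while about two thirds of the Lebanese entries are labelled Fushaa (recall 0.35). When the model does say "Lebanese", it is usually right (precision 0.96).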



In [129]:
##Inputting and trying out our trial data/message

WhatsappMessages = [
    'Hello my friend, kiif the family?',
    'Would you like to meet under the 2amar for some qahwa?',
    'I hate my life.'
]

In [130]:
##Testing our classifier on our messages data

WhatsappMessagesCount = m.transform(WhatsappMessages)
model.predict(WhatsappMessagesCount)

array(['Lebanese', 'Lebanese', 'Fushaa'], dtype='<U8')
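
When reading predictions like these, it can be useful to check how many tokens of each message the vectorizer actually knows. If a message contains no known tokens, its count vector is all zeros and `MultinomialNB` can only fall back on the class priors. The sketch below demonstrates this on a tiny toy model (five dictionary entries, chosen for illustration); it is one possible explanation for a prediction, not a diagnosis of the model trained above, whose vocabulary also contains some English words.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training set built from dictionary entries shown earlier (illustrative only)
train_words = ["shay", "qahwa", "alqamar", "na7es", "kel marra"]
train_labels = ["Fushaa", "Fushaa", "Fushaa", "Lebanese", "Lebanese"]

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(train_words), train_labels)

messages = ["I hate my life.", "kel marra qahwa"]
counts = vec.transform(messages)

# Number of in-vocabulary tokens per message
known_tokens = np.asarray(counts.sum(axis=1)).ravel()
proba = clf.predict_proba(counts)

for msg, n, p in zip(messages, known_tokens, proba):
    shares = {c: round(float(v), 2) for c, v in zip(clf.classes_, p)}
    print(f"{msg!r}: {n} known token(s), probabilities {shares}")
```

In this toy setup the first message has zero known tokens, so its probabilities are just the training class shares (3 of 5 Fushaa, 2 of 5 Lebanese).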

# Answering our question: running realtime data through #

### "Real-world" data application! ###

In [155]:
##Initializing the file:

#First, we read it as a df. 
messagesdf = pd.read_csv("/Users/y_mehio78/Documents/TotallyAccurateLebTweets.csv")

InputtedMessages = messagesdf['Contents'].tolist()
## No reshaping is needed: m.transform accepts a plain list of strings (and Python lists have no .reshape method)
print(InputtedMessages)


#Next, we pass it through in the cell below:

['Oh hey George, been quite a while…wadda ya say we grab some 2ahwoui? ', 'Hello my friend. Kiif the family?', 'I hate this place. The 2amar overshadows the entire city.', 'Grenade yaani 😂', "As long as you're 2a3d in the laundry room…", 'la wallah ana im not', 'yr not grsping th idea kiif ana mama would say', 'ouff shou awaii', '"The Syrian president has released a press statement today notifying the country that the oil being smuggled through the borders does not, in any way shape or form, have to do with the Ba\'ath party." Maybe it\'s time we slip a little something in his shaai, shou?  ']


In [172]:
InputtedMessagesTransformed = m.transform(InputtedMessages)
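
As an alternative to transforming and predicting in separate steps, the vectorizer and the classifier can be bundled so that raw strings go straight in. The following is a minimal sketch using scikit-learn's `make_pipeline`, trained on a toy word list for illustration rather than on the real dictionary; `pipe` and `result` are names introduced here.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (illustrative); the notebook would use X_train / y_train instead
train_words = ["shay", "qahwa", "alqamar", "na7es", "kel marra"]
train_labels = ["Fushaa", "Fushaa", "Fushaa", "Lebanese", "Lebanese"]

# The pipeline vectorizes and classifies in one call
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(train_words, train_labels)

messages = ["Hello my friend. Kiif the family?", "ouff shou awaii"]

# Pair each message with its predicted dialect for easier reading
result = pd.DataFrame({"Contents": messages, "Predicted": pipe.predict(messages)})
print(result)
```

Keeping each message next to its label in one table makes the output easier to scan than a bare array of predictions.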

In [187]:
print("Hello, and thank you for using the LibnaniYa3niTM identifier! Here are the inputted messages: ")
print("\n", InputtedMessages)
print("\n", "The inputted messages are of the following dialect(s): ", model.predict(InputtedMessagesTransformed))
print("-----------------------------------------------------------")
print("Here is an accuracy assessment of the model. Thank you for using the LibnaniYa3niTM identifier, and have a nice day!")
print("-----------------------------------------------------------")
print(CR(y_test, y_pred))

Hello, and thank you for using the LibnaniYa3niTM identifier! Here are the inputted messages: 

 ['Oh hey George, been quite a while…wadda ya say we grab some 2ahwoui? ', 'Hello my friend. Kiif the family?', 'I hate this place. The 2amar overshadows the entire city.', 'Grenade yaani 😂', "As long as you're 2a3d in the laundry room…", 'la wallah ana im not', 'yr not grsping th idea kiif ana mama would say', 'ouff shou awaii', '"The Syrian president has released a press statement today notifying the country that the oil being smuggled through the borders does not, in any way shape or form, have to do with the Ba\'ath party." Maybe it\'s time we slip a little something in his shaai, shou?  ']

 The inputted messages are of the following dialect(s):  ['Lebanese' 'Lebanese' 'Lebanese' 'Fushaa' 'Lebanese' 'Fushaa' 'Lebanese'
 'Lebanese' 'Lebanese']
-----------------------------------------------------------
Here is an accuracy assessment of the model. Thank you for using the LibnaniY