#Overview

This notebook cleans and preprocesses three paired Arabic and English datasets:

1. **Cities** – Tourist and historical locations in Saudi Arabia.

2. **Projects** – Major development projects in various regions.

3. **Historical Characters** – Key figures from Saudi history.

##Purpose:

Ensure all datasets are clean, uniform, and ready for analysis and for use in NLP/RAG systems.

Handle common data issues such as:

* Missing values

* Duplicate rows

* Inconsistent text formatting (newlines, extra spaces, special characters)

* Incorrectly decoded Arabic text (wrong file encoding)


##Key Steps in the Notebook:

1. Convert columns to string for uniformity.

2. Remove missing values.

3. Clean text formatting and remove unwanted characters.

4. Recalculate content/text length.

5. Remove duplicate entries.

6. Optional advanced cleaning (e.g., swap or correct links in Cities datasets).

7. Final inspection and save cleaned datasets in UTF-8 format.
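
The steps above can be sketched as one helper. This is a minimal illustration rather than the notebook's actual code: the function name `basic_clean`, its parameters, and the toy data are assumptions. It drops missing values before casting to string, so that a missing value is never turned into text first.

```python
import pandas as pd


def basic_clean(df, text_cols, key_cols, length_source, length_col):
    """Hypothetical helper that mirrors the key steps on one DataFrame."""
    out = df.dropna(subset=text_cols).copy()        # remove missing values first
    for col in text_cols:
        out[col] = (
            out[col]
            .astype(str)                            # uniform string type
            .str.replace(r"\s+", " ", regex=True)   # newlines and repeated spaces -> one space
            .str.strip()
        )
    out = out.drop_duplicates(subset=key_cols).reset_index(drop=True)
    out[length_col] = out[length_source].str.len()  # length of the cleaned text
    return out


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "title": ["A", "A", "B", None],
    "content": ["first\nline ", "first\nline ", "  second ", "orphan"],
})
cleaned = basic_clean(toy, ["title", "content"], ["title"], "content", "content_length")
print(cleaned)
```

Computing the length last keeps it in sync with the text that is actually saved.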

##Output

Cleaned datasets ready for data analysis, machine learning models, and retrieval-augmented generation (RAG) systems.

##Data Sources
**Cities datasets**: Collected from the **Open Data Platform** (Saudi Arabia).

**Projects datasets**: Compiled from the **Saudi Projects** database and the **Open Data Platform**.

**Historical Characters dataset**: Compiled from **Saudipedia**.

#Cities

##Feature Description

**DESTINATION**: Name of the city or region.

**NAME**: Name of the tourist or historical place.

**DESCRIPTION**: Text describing the place, its significance, or attractions.

**SM/REFERENCE LINKS**: Official website or social media links of the place.

**LOCATION**: Links to Google Maps or other references.

**DESCRIPTION_LENGTH**: Number of characters in the DESCRIPTION column.

##Load the Dataset

In [None]:
import pandas as pd

# Load datasets
ar_cities = pd.read_excel("/content/merged_cities_arabic_final_unique.xlsx")
en_cities = pd.read_excel("/content/merged_cities_english_final_unique.xlsx")

# Inspect the first rows
print("Arabic Cities Sample:")
print(ar_cities.head())
print("\nEnglish Cities Sample:")
print(en_cities.head())


Arabic Cities Sample:
  DESTINATION               NAME  \
0         جدة         بيت عنقاوي   
1         جدة           حلبة جدة   
2         جدة           بيت ذاكر   
3         جدة           بيت زنيل   
4         جدة  منتزه سيان المائي   

                                         DESCRIPTION  \
0  بيت العنقوي يقع في جدة التاريخية، ويُعدّ من ال...   
1  حلبة جدة، المعروفة أيضًا باسم “جدة ستريت سيركي...   
2  بيت ذاكر هو بيت تاريخي في جدة التاريخية، يُمثّ...   
3  بيت زينل يقع في مدخل جدة التاريخية، ويُعد من أ...   
4  منتزه سيان المائي هو وجهة ترفيهية مائية حديثة ...   

                                  SM/REFERENCE LINKS  \
0  https://www.google.com/maps/place/Angawi+House...   
1          https://maps.app.goo.gl/xHBgwz1HTeiFCHzq6   
2          https://maps.app.goo.gl/ypUxmGfT4QSdpgW37   
3          https://maps.app.goo.gl/nax3EB4xUGn69Y4k7   
4  https://www.google.com/maps/place/Cyan+Waterpa...   

                                            LOCATION  DESCRIPTION_LENGTH  
0    https:/

##Convert Columns to String

In [None]:
# Columns to convert
text_cols = ['DESTINATION', 'NAME', 'DESCRIPTION']
url_cols = ['LOCATION']  # Only convert to string, do not clean

# Convert text columns to string
for col in text_cols:
    ar_cities[col] = ar_cities[col].astype(str)
    en_cities[col] = en_cities[col].astype(str)

# Convert URL columns to string without cleaning
for col in url_cols:
    ar_cities[col] = ar_cities[col].astype(str)
    en_cities[col] = en_cities[col].astype(str)

# Verify types
print(ar_cities.dtypes)
print(en_cities.dtypes)

DESTINATION            object
NAME                   object
DESCRIPTION            object
SM/REFERENCE LINKS     object
LOCATION               object
DESCRIPTION_LENGTH    float64
dtype: object
DESTINATION            object
NAME                   object
DESCRIPTION            object
SM/REFERENCE LINKS     object
LOCATION               object
DESCRIPTION_LENGTH    float64
dtype: object
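
One caveat about the order of the steps: on pandas versions where text columns are stored as `object` (as in the output above), `astype(str)` turns a missing value into the text `'nan'`, and a later `dropna()` no longer treats it as missing. The sketch below uses invented toy data and a hypothetical helper, `drop_missing_text`, that treats both real missing values and the text `'nan'` as missing.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "NAME": ["Angawi House", np.nan, "Jeddah Circuit"],
    "DESCRIPTION": ["A historic house", "No name", np.nan],
})

# Casting first can hide missing values from dropna()
cast = toy.astype(str)
print(cast["NAME"].tolist())


def drop_missing_text(df, cols):
    """Hypothetical helper: drop rows where any column in `cols` is NaN or the text 'nan'."""
    as_missing = df[cols].replace("nan", np.nan)
    return df.loc[as_missing.notna().all(axis=1)].reset_index(drop=True)


kept = drop_missing_text(cast, ["NAME", "DESCRIPTION"])
print(kept.shape)
```

A simpler alternative is to call `dropna()` before `astype(str)`.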


##Handle Missing Values

In [None]:
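# Drop rows with missing NAME or DESCRIPTION.
# Caveat: both columns were cast with astype(str) in the previous step, which turns NaN
# into the text 'nan' on object-dtype columns, so dropna() will not flag those rows.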
ar_cities = ar_cities.dropna(subset=['NAME', 'DESCRIPTION']).reset_index(drop=True)
en_cities = en_cities.dropna(subset=['NAME', 'DESCRIPTION']).reset_index(drop=True)

# Check the shape after dropping
print("Arabic Cities Shape after removing missing values:", ar_cities.shape)
print("English Cities Shape after removing missing values:", en_cities.shape)

Arabic Cities Shape after removing missing values: (187, 6)
English Cities Shape after removing missing values: (189, 6)


##Clean Text Formatting

In [None]:
# Columns to clean
clean_cols = ['DESTINATION', 'NAME', 'DESCRIPTION']

# Replace newlines with a space and trim leading/trailing whitespace
for col in clean_cols:
    ar_cities[col] = ar_cities[col].str.replace(r'\n', ' ', regex=True).str.strip()
    en_cities[col] = en_cities[col].str.replace(r'\n', ' ', regex=True).str.strip()

# Display first few rows to verify
print("Arabic Cities Sample after text cleaning:")
print(ar_cities.head())
print("\nEnglish Cities Sample after text cleaning:")
print(en_cities.head())

Arabic Cities Sample after text cleaning:
  DESTINATION               NAME  \
0         جدة         بيت عنقاوي   
1         جدة           حلبة جدة   
2         جدة           بيت ذاكر   
3         جدة           بيت زنيل   
4         جدة  منتزه سيان المائي   

                                         DESCRIPTION  \
0  بيت العنقوي يقع في جدة التاريخية، ويُعدّ من ال...   
1  حلبة جدة، المعروفة أيضًا باسم “جدة ستريت سيركي...   
2  بيت ذاكر هو بيت تاريخي في جدة التاريخية، يُمثّ...   
3  بيت زينل يقع في مدخل جدة التاريخية، ويُعد من أ...   
4  منتزه سيان المائي هو وجهة ترفيهية مائية حديثة ...   

                                  SM/REFERENCE LINKS  \
0  https://www.google.com/maps/place/Angawi+House...   
1          https://maps.app.goo.gl/xHBgwz1HTeiFCHzq6   
2          https://maps.app.goo.gl/ypUxmGfT4QSdpgW37   
3          https://maps.app.goo.gl/nax3EB4xUGn69Y4k7   
4  https://www.google.com/maps/place/Cyan+Waterpa...   

                                            LOCATION  DESCRIPTION_L

##Remove Duplicate Rows

In [None]:
# Drop duplicates and reset index
ar_cities = ar_cities.drop_duplicates(subset=['NAME', 'DESTINATION']).reset_index(drop=True)
en_cities = en_cities.drop_duplicates(subset=['NAME', 'DESTINATION']).reset_index(drop=True)

# Check the shape after removing duplicates
print("Arabic Cities Shape after removing duplicates:", ar_cities.shape)
print("English Cities Shape after removing duplicates:", en_cities.shape)
# No duplicates were found: both shapes match the previous step, so every (NAME, DESTINATION) pair is already unique.

Arabic Cities Shape after removing duplicates: (187, 6)
English Cities Shape after removing duplicates: (189, 6)
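
Before dropping duplicates it can help to look at the rows that would be affected. The sketch below uses invented toy rows, not the real dataset, and relies only on `DataFrame.duplicated` and `DataFrame.drop_duplicates`.

```python
import pandas as pd

# Toy rows, invented for illustration only
toy = pd.DataFrame({
    "DESTINATION": ["abha", "abha", "jeddah"],
    "NAME": ["al basta district", "al basta district", "jeddah circuit"],
    "DESCRIPTION": ["short text", "a longer, edited text", "street circuit"],
})

key = ["NAME", "DESTINATION"]

# keep=False marks every member of a duplicate group, not only the later ones
dupes = toy[toy.duplicated(subset=key, keep=False)]
print(dupes)

# drop_duplicates keeps the first row of each group by default
deduped = toy.drop_duplicates(subset=key).reset_index(drop=True)
print(deduped.shape)
```

Because the first row of each group is kept, a better description in a later duplicate would be lost. Inspecting `dupes` first makes that choice explicit.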


##Recalculate Description Length

In [None]:
# Recalculate DESCRIPTION_LENGTH
ar_cities['DESCRIPTION_LENGTH'] = ar_cities['DESCRIPTION'].str.len()
en_cities['DESCRIPTION_LENGTH'] = en_cities['DESCRIPTION'].str.len()

# Check the first few rows to verify
print("Arabic Cities Sample with updated DESCRIPTION_LENGTH:")
print(ar_cities[['NAME', 'DESCRIPTION_LENGTH']].head())
print("\nEnglish Cities Sample with updated DESCRIPTION_LENGTH:")
print(en_cities[['NAME', 'DESCRIPTION_LENGTH']].head())

Arabic Cities Sample with updated DESCRIPTION_LENGTH:
                NAME  DESCRIPTION_LENGTH
0         بيت عنقاوي                 895
1           حلبة جدة                 995
2           بيت ذاكر                1399
3           بيت زنيل                 658
4  منتزه سيان المائي                 568

English Cities Sample with updated DESCRIPTION_LENGTH:
                     NAME  DESCRIPTION_LENGTH
0          al aaqaba view                 390
1       al basta district                 526
2        al dabab walkway                 454
3         al ghayl valley                 431
4  al muftaha art village                 593


##Inspect Cleaned Data

In [None]:
# Check final shapes
print("Final Arabic Cities Shape:", ar_cities.shape)
print("Final English Cities Shape:", en_cities.shape)

# Preview first few rows
print("\nArabic Cities Sample:")
print(ar_cities.head())

print("\nEnglish Cities Sample:")
print(en_cities.head())

Final Arabic Cities Shape: (187, 6)
Final English Cities Shape: (189, 6)

Arabic Cities Sample:
  DESTINATION               NAME  \
0         جدة         بيت عنقاوي   
1         جدة           حلبة جدة   
2         جدة           بيت ذاكر   
3         جدة           بيت زنيل   
4         جدة  منتزه سيان المائي   

                                         DESCRIPTION  \
0  بيت العنقوي يقع في جدة التاريخية، ويُعدّ من ال...   
1  حلبة جدة، المعروفة أيضًا باسم “جدة ستريت سيركي...   
2  بيت ذاكر هو بيت تاريخي في جدة التاريخية، يُمثّ...   
3  بيت زينل يقع في مدخل جدة التاريخية، ويُعد من أ...   
4  منتزه سيان المائي هو وجهة ترفيهية مائية حديثة ...   

                                  SM/REFERENCE LINKS  \
0  https://www.google.com/maps/place/Angawi+House...   
1          https://maps.app.goo.gl/xHBgwz1HTeiFCHzq6   
2          https://maps.app.goo.gl/ypUxmGfT4QSdpgW37   
3          https://maps.app.goo.gl/nax3EB4xUGn69Y4k7   
4  https://www.google.com/maps/place/Cyan+Waterpa...   

             

##Swap Google Maps Links Between Columns

In [None]:
import re

def swap_maps_links(df):
    new_location = []
    new_reference = []

    for loc, ref in zip(df['LOCATION'], df['SM/REFERENCE LINKS']):
        # Find maps link in reference
        maps_in_ref = re.findall(r'https?://[^\s]*maps[^\s]*', str(ref))

        if maps_in_ref:
            # Take first maps link from reference to LOCATION
            new_location.append(maps_in_ref[0])
            # Move old LOCATION value to reference, combine with original ref (without maps link)
            ref_cleaned = str(ref).replace(maps_in_ref[0], '').strip()
            combined_ref = str(loc).strip()  # str() guards against non-string LOCATION values
            if ref_cleaned:
                combined_ref += " | " + ref_cleaned  # keep other reference links
            new_reference.append(combined_ref)
        else:
            # No maps link in reference → keep as is
            new_location.append(loc)
            new_reference.append(ref)

    df['LOCATION'] = new_location
    df['SM/REFERENCE LINKS'] = new_reference
    return df

# Apply to both datasets
ar_cities = swap_maps_links(ar_cities)
en_cities = swap_maps_links(en_cities)

# Check results
print("Arabic Cities Sample after swapping maps links:")
print(ar_cities[['NAME', 'LOCATION', 'SM/REFERENCE LINKS']].head())

print("\nEnglish Cities Sample after swapping maps links:")
print(en_cities[['NAME', 'LOCATION', 'SM/REFERENCE LINKS']].head())

Arabic Cities Sample after swapping maps links:
                NAME                                           LOCATION  \
0         بيت عنقاوي  https://www.google.com/maps/place/Angawi+House...   
1           حلبة جدة          https://maps.app.goo.gl/xHBgwz1HTeiFCHzq6   
2           بيت ذاكر          https://maps.app.goo.gl/ypUxmGfT4QSdpgW37   
3           بيت زنيل          https://maps.app.goo.gl/nax3EB4xUGn69Y4k7   
4  منتزه سيان المائي  https://www.google.com/maps/place/Cyan+Waterpa...   

                                  SM/REFERENCE LINKS  
0    https://www.instagram.com/samiangawiarchitects/  
1     https://www.instagram.com/jeddahcircuit/?hl=en  
2       https://www.visitalbalad.com/en/explore/1550  
3  https://www.instagram.com/jeddahalbalad.sa?igs...  
4  https://www.instagram.com/cyanwaterpark?utm_so...  

English Cities Sample after swapping maps links:
                     NAME                                    LOCATION  \
0          al aaqaba view  https://www.traveldiv
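
To make the swap rule easier to check, here is a per-row sketch of the same logic applied to two invented URL pairs. The name `swap_row` and the example URLs are assumptions for illustration; the pattern is equivalent to the one used in `swap_maps_links`.

```python
import re

MAPS_URL = re.compile(r"https?://\S*maps\S*")


def swap_row(loc, ref):
    """Hypothetical per-row version of the swap; returns (new_location, new_reference)."""
    loc, ref = str(loc), str(ref)
    match = MAPS_URL.search(ref)
    if match is None:
        return loc, ref                   # no maps link in the reference: keep as is
    maps_link = match.group(0)
    leftover = ref.replace(maps_link, "").strip()
    new_ref = loc.strip()
    if leftover:
        new_ref += " | " + leftover       # keep any other links from the reference cell
    return maps_link, new_ref


# Invented example values, not rows from the dataset
swapped = swap_row("https://example.com/place", "https://maps.app.goo.gl/abc123")
unchanged = swap_row("https://maps.app.goo.gl/abc123", "https://example.com/place")
print(swapped)
print(unchanged)
```

The pattern matches any URL that contains the substring `maps`, so a non-map page whose address happens to include that word would also be moved. Short links without `maps` in them, such as the `g.co` style, are left where they are.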

##Arabic and English Text Cleaning

In [None]:
def clean_advanced_arabic(text):
    # Remove punctuation, emojis, and symbols; Arabic diacritics (tashkeel) are also
    # removed because \w does not match them
    text = re.sub(r'[^\w\sء-ي]', '', text)
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Apply to NAME and DESCRIPTION
ar_cities['NAME'] = ar_cities['NAME'].apply(clean_advanced_arabic)
ar_cities['DESCRIPTION'] = ar_cities['DESCRIPTION'].apply(clean_advanced_arabic)
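
A quick check of what this pattern keeps and removes, on an invented sample string. The function is repeated so the snippet runs on its own; the diacritics and the Arabic comma are written as escape codes so they are visible in the source.

```python
import re


def clean_advanced_arabic(text):
    text = re.sub(r'[^\w\sء-ي]', '', text)
    return re.sub(r'\s+', ' ', text).strip()


DAMMA, SHADDA, ARABIC_COMMA = "\u064f", "\u0651", "\u060c"

# Invented sample: diacritics, Arabic comma, Latin text, digits, ASCII punctuation
sample = "وي" + DAMMA + "عد" + SHADDA + " من أبرز البيوت" + ARABIC_COMMA + " (Jeddah) 2024!"
cleaned_sample = clean_advanced_arabic(sample)
print(cleaned_sample)
```

Latin letters and digits survive because `\w` is Unicode-aware in Python 3. Diacritics and punctuation are dropped, which changes the text length and may matter if vocalised text is needed downstream.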

In [None]:
def clean_advanced_english(text):
    # Keep only ASCII letters, digits, whitespace, periods, and commas
    # (hyphens, apostrophes, and parentheses are removed)
    text = re.sub(r'[^A-Za-z0-9\s.,]', '', text)
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Apply to NAME and DESCRIPTION
en_cities['NAME'] = en_cities['NAME'].apply(clean_advanced_english)
en_cities['DESCRIPTION'] = en_cities['DESCRIPTION'].apply(clean_advanced_english)
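
The English pattern deletes hyphens and parentheses outright, which joins the words or numbers on either side. The sketch below shows that effect and one possible variant that turns hyphens and dashes into spaces first. `clean_english_keep_word_breaks` is a hypothetical alternative, not part of the notebook, and the sample sentence is invented.

```python
import re


def clean_advanced_english(text):
    text = re.sub(r'[^A-Za-z0-9\s.,]', '', text)
    return re.sub(r'\s+', ' ', text).strip()


def clean_english_keep_word_breaks(text):
    """Hypothetical variant: replace hyphens and dashes with spaces before stripping symbols."""
    text = re.sub(r'[-\u2013\u2014]', ' ', text)
    text = re.sub(r'[^A-Za-z0-9\s.,]', '', text)
    return re.sub(r'\s+', ' ', text).strip()


sample = "King Saud (1902-1969) led large-scale projects."
original_result = clean_advanced_english(sample)
variant_result = clean_english_keep_word_breaks(sample)
print(original_result)
print(variant_result)
```

Whether the variant is preferable depends on the downstream use; for retrieval, keeping `1902` and `1969` as separate tokens is usually easier to match than `19021969`.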

In [None]:
ar_cities.head()

| | DESTINATION | NAME | DESCRIPTION | SM/REFERENCE LINKS | LOCATION | DESCRIPTION_LENGTH |
|---|---|---|---|---|---|---|
| 0 | جدة | بيت عنقاوي | بيت العنقوي يقع في جدة التاريخية ويعد من الأمث... | https://www.instagram.com/samiangawiarchitects/ | https://www.google.com/maps/place/Angawi+House... | 895 |
| 1 | جدة | حلبة جدة | حلبة جدة المعروفة أيضا باسم جدة ستريت سيركيت J... | https://www.instagram.com/jeddahcircuit/?hl=en | https://maps.app.goo.gl/xHBgwz1HTeiFCHzq6 | 995 |
| 2 | جدة | بيت ذاكر | بيت ذاكر هو بيت تاريخي في جدة التاريخية يمثل ج... | https://www.visitalbalad.com/en/explore/1550 | https://maps.app.goo.gl/ypUxmGfT4QSdpgW37 | 1399 |
| 3 | جدة | بيت زنيل | بيت زينل يقع في مدخل جدة التاريخية ويعد من أبر... | https://www.instagram.com/jeddahalbalad.sa?igs... | https://maps.app.goo.gl/nax3EB4xUGn69Y4k7 | 658 |
| 4 | جدة | منتزه سيان المائي | منتزه سيان المائي هو وجهة ترفيهية مائية حديثة ... | https://www.instagram.com/cyanwaterpark?utm_so... | https://www.google.com/maps/place/Cyan+Waterpa... | 568 |


In [None]:
en_cities.head()

| | DESTINATION | NAME | DESCRIPTION | SM/REFERENCE LINKS | LOCATION | DESCRIPTION_LENGTH |
|---|---|---|---|---|---|---|
| 0 | abha | al aaqaba view | Perched high above the city, Al Aaqaba View of... | https://g.co/kgs/htz7osq | https://www.traveldiv.com/tourism-in-abha/ | 390 |
| 1 | abha | al basta district | The historic heart of Abha, Al Basta District ... | https://discoveraseer.com/en/places/hy-albsth | https://maps.app.goo.gl/kppkmnmefytzznvz6 | 526 |
| 2 | abha | al dabab walkway | Known as the Fog Walkway, this elevated path o... | https://discoveraseer.com/en/places/mmsha-aldabab | https://maps.app.goo.gl/ugdpjbbovqdpdd6w6 | 454 |
| 3 | abha | al ghayl valley | A hidden gem near the city, Al Ghayl Valley is... | https://discoveraseer.com/en/places/oady-alghyl | https://maps.app.goo.gl/ecdwtntfgr93zqwc6 | 431 |
| 4 | abha | al muftaha art village | A cultural landmark in Abha, Al Muftaha Art Vi... | https://www.visitsaudi.com/en/aseer/attraction... | https://maps.app.goo.gl/pfvnoecorpcgtly78 | 593 |


In [None]:
# Check final shapes
print("Final Arabic Cities Shape:", ar_cities.shape)
print("Final English Cities Shape:", en_cities.shape)

Final Arabic Cities Shape: (187, 6)
Final English Cities Shape: (189, 6)
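
Note that `DESCRIPTION_LENGTH` was last computed before the advanced cleaning cells, which remove characters from `DESCRIPTION`. The stored lengths (for example 895 for the first Arabic row, both before and after cleaning) therefore describe the pre-cleaning text. A minimal sketch of the mismatch and how to refresh the column, on invented toy data:

```python
import pandas as pd

# Toy frame standing in for ar_cities / en_cities (values invented)
df = pd.DataFrame({"DESCRIPTION": ["Old town, rebuilt!", "Water park"]})
df["DESCRIPTION_LENGTH"] = df["DESCRIPTION"].str.len()

# A later cleaning step shortens the text ...
df["DESCRIPTION"] = df["DESCRIPTION"].str.replace(r"[^A-Za-z0-9\s.,]", "", regex=True)

# ... so the stored length no longer matches until it is recomputed
stale = int((df["DESCRIPTION_LENGTH"] != df["DESCRIPTION"].str.len()).sum())
print("rows with stale length:", stale)

df["DESCRIPTION_LENGTH"] = df["DESCRIPTION"].str.len()
```

Running the same last line on `ar_cities` and `en_cities` after the advanced cleaning would bring the column, and the summary statistics built on it, back in line with the saved text.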


##Summary statistics

In [None]:
# Summary statistics
print("\nArabic Cities Description Length Stats:")
print(ar_cities['DESCRIPTION_LENGTH'].describe())

print("\nEnglish Cities Description Length Stats:")
print(en_cities['DESCRIPTION_LENGTH'].describe())


Arabic Cities Description Length Stats:
count     187.000000
mean      401.454545
std       175.235270
min       119.000000
25%       286.000000
50%       400.000000
75%       468.500000
max      1399.000000
Name: DESCRIPTION_LENGTH, dtype: float64

English Cities Description Length Stats:
count     189.000000
mean      533.253968
std       191.928924
min       234.000000
25%       427.000000
50%       499.000000
75%       602.000000
max      1969.000000
Name: DESCRIPTION_LENGTH, dtype: float64


In [None]:
# Check for missing values
print("Arabic Cities Missing Values:\n", ar_cities.isnull().sum())
print("\nEnglish Cities Missing Values:\n", en_cities.isnull().sum())

# Check for duplicates based on all columns
print("\nArabic Cities Duplicate Rows:", ar_cities.duplicated().sum())
print("English Cities Duplicate Rows:", en_cities.duplicated().sum())

Arabic Cities Missing Values:
 DESTINATION           0
NAME                  0
DESCRIPTION           0
SM/REFERENCE LINKS    0
LOCATION              0
DESCRIPTION_LENGTH    0
dtype: int64

English Cities Missing Values:
 DESTINATION           0
NAME                  0
DESCRIPTION           0
SM/REFERENCE LINKS    0
LOCATION              0
DESCRIPTION_LENGTH    0
dtype: int64

Arabic Cities Duplicate Rows: 0
English Cities Duplicate Rows: 0


In [None]:
# check duplicates based on NAME and DESTINATION only
print("\nArabic Cities Duplicates by NAME & DESTINATION:", ar_cities.duplicated(subset=['NAME','DESTINATION']).sum())
print("English Cities Duplicates by NAME & DESTINATION:", en_cities.duplicated(subset=['NAME','DESTINATION']).sum())


Arabic Cities Duplicates by NAME & DESTINATION: 0
English Cities Duplicates by NAME & DESTINATION: 0


##Saving CSV files

In [None]:
# Save cleaned datasets as CSV.
# utf-8-sig writes a byte order mark (BOM) so that spreadsheet tools such as Excel
# recognise the file as UTF-8 and display the Arabic text correctly.
ar_cities.to_csv("Arabic_Cities_Cleaned.csv", index=False, encoding='utf-8-sig')
en_cities.to_csv("English_Cities_Cleaned.csv", index=False, encoding='utf-8-sig')

print("Cleaned datasets saved as CSV successfully!")

Cleaned datasets saved as CSV successfully!
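
A small round-trip check can confirm that the saved file carries the BOM and that the Arabic text survives. This is a sketch on a one-row toy frame written to a temporary directory; the file name `sample_cleaned.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# One-row toy frame, invented for illustration only
df = pd.DataFrame({"NAME": ["جدة"], "DESCRIPTION_LENGTH": [3]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_cleaned.csv")   # hypothetical file name
    df.to_csv(path, index=False, encoding="utf-8-sig")

    with open(path, "rb") as fh:
        has_bom = fh.read(3) == b"\xef\xbb\xbf"      # UTF-8 byte order mark

    # Reading with utf-8-sig strips the BOM, so the first column name stays clean
    restored = pd.read_csv(path, encoding="utf-8-sig")

print(has_bom, list(restored.columns), restored.loc[0, "NAME"])
```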


#Projects

##Feature Description
**title**: Name of the project.

**content**: Detailed description of the project, its scope, and objectives.

**content_length**: Number of characters in the content column.

##Load Projects Dataset

In [None]:
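# Load the datasets. latin1 is a first attempt: it never raises a decode error,
# but it garbles the Arabic file, which is re-read with cp1256 below.
en_projects = pd.read_csv("/content/projects_dataset_eng.csv", encoding='latin1')
ar_projects = pd.read_csv("/content/projects_dataset_ara.csv", encoding='latin1')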

# Inspect first rows
print("English Projects Sample:")
print(en_projects.head())

print("\nArabic Projects Sample:")
print(ar_projects.head())

English Projects Sample:
                                title  \
0                             Qiddiya   
1                               ROSHN   
2                          Riyadh Art   
3                  Saudi Sports Track   
4  Mohammed bin Salman Nonprofit City   

                                             content  
0  Qiddiya will be the future capital of entertai...  
1  ROSHN Group develops large-scale projects acro...  
2  Riyadh Art offers a vibrant world that nurture...  
3  The Sports Boulevard inspires residents and vi...  
4  Prince Mohammed bin Salman Nonprofit City  M...  

Arabic Projects Sample:
                             title  \
0                           ÇáÞÏíÉ   
1                             ÑæÔä   
2                       ÇáÑíÇÖ ÂÑÊ   
3                   ÇáãÓÇÑ ÇáÑíÇÖí   
4  ãÏíäÉ ãÍãÏ Èä ÓáãÇä ÛíÑ ÇáÑÈÍíÉ   

                                             content  Unnamed: 2  
0  ÓÊßæä ÇáÞÏíÉ ÚÇÕãÉ ÇáãÓÊÞÈá ááÊÑÝíå æÇáÑíÇÖÉ æ...         NaN  
1  ÊõØæÑ "

In [None]:
# Read Arabic Projects with correct Arabic encoding
ar_projects = pd.read_csv("/content/projects_dataset_ara.csv", encoding='cp1256')

# Inspect first rows again
print("Arabic Projects Sample (corrected encoding):")
print(ar_projects.head())

Arabic Projects Sample (corrected encoding):
                             title  \
0                           القدية   
1                             روشن   
2                       الرياض آرت   
3                   المسار الرياضي   
4  مدينة محمد بن سلمان غير الربحية   

                                             content  Unnamed: 2  
0  ستكون القدية عاصمة المستقبل للترفيه والرياضة و...         NaN  
1  تُطور "مجموعة روشن" مشاريع كبرى على مستوى المم...         NaN  
2  تتيح لك "الرياض آرت " عالماً نابضاً بالحياة يغ...         NaN  
3  يُلهم المسار الرياضي سكان وزائري مدينة الرياض ...         NaN  
4  مدينة محمد بن سلمان غير الربحية "مدينة مسك “هي...         NaN  
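
The garbled titles in the first load come from decoding cp1256 bytes as latin1. The following standard-library sketch reproduces the effect on one title from the output above and shows that it is reversible; it illustrates the mechanism and is not a step of the notebook.

```python
# cp1256 stores each Arabic letter in a single byte, and every byte is also a valid
# latin1 character, so decoding with the wrong codec succeeds silently.
original = "القدية"
raw = original.encode("cp1256")

garbled = raw.decode("latin1")
print(garbled)

# latin1 maps every byte to exactly one character, so the damage can be undone
repaired = garbled.encode("latin1").decode("cp1256")
print(repaired == original)
```

Because latin1 never fails, a successful `read_csv` call is not evidence that the encoding was right. Inspecting a few rows, as the notebook does, is the practical check.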


In [None]:
# Drop empty/unnecessary column in Arabic Projects
ar_projects = ar_projects.drop(columns=['Unnamed: 2'])

# Confirm changes
print("Arabic Projects Columns after dropping unnecessary column:")
print(ar_projects.columns)

Arabic Projects Columns after dropping unnecessary column:
Index(['title', 'content'], dtype='object')
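
Columns named `Unnamed: N` usually come from stray delimiters or empty cells in the source file. Instead of hard-coding the column name, they can be selected by pattern and emptiness. The sketch below uses an invented toy frame and assumes that only completely empty auto-named columns should be dropped.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration only
toy = pd.DataFrame({
    "title": ["Qiddiya", "ROSHN"],
    "content": ["Entertainment city", "Housing developer"],
    "Unnamed: 2": [np.nan, np.nan],
})

# Select columns that pandas auto-named and that hold no values at all
empty_unnamed = [
    col for col in toy.columns
    if str(col).startswith("Unnamed:") and toy[col].isna().all()
]
trimmed = toy.drop(columns=empty_unnamed)
print(list(trimmed.columns))
```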


##Convert Columns to String

In [None]:
text_cols_en = ['title', 'content']
text_cols_ar = ['title', 'content']

# Convert English columns to string
for col in text_cols_en:
    en_projects[col] = en_projects[col].astype(str)

# Convert Arabic columns to string
for col in text_cols_ar:
    ar_projects[col] = ar_projects[col].astype(str)

##Remove Missing Values

In [None]:
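# Drop rows with missing values in English Projects
# (values already cast with astype(str) above may appear as the text 'nan' and are not dropped here)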
en_projects = en_projects.dropna(subset=['title', 'content'])

# Drop rows with missing values in Arabic Projects
ar_projects = ar_projects.dropna(subset=['title', 'content'])

# Check shapes after removing missing values
print("English Projects Shape after removing missing values:", en_projects.shape)
print("Arabic Projects Shape after removing missing values:", ar_projects.shape)

English Projects Shape after removing missing values: (47, 2)
Arabic Projects Shape after removing missing values: (49, 2)


##Clean Text Formatting

In [None]:
# English Projects text cleaning
for col in ['title', 'content']:
    en_projects[col] = en_projects[col].str.replace('\n', ' ', regex=True).str.strip()

# Arabic Projects text cleaning
for col in ['title', 'content']:
    ar_projects[col] = ar_projects[col].str.replace('\n', ' ', regex=True).str.strip()

# Preview first rows after cleaning
print("English Projects Sample after text formatting:")
print(en_projects.head())

print("\nArabic Projects Sample after text formatting:")
print(ar_projects.head())

English Projects Sample after text formatting:
                                title  \
0                             Qiddiya   
1                               ROSHN   
2                          Riyadh Art   
3                  Saudi Sports Track   
4  Mohammed bin Salman Nonprofit City   

                                             content  
0  Qiddiya will be the future capital of entertai...  
1  ROSHN Group develops large-scale projects acro...  
2  Riyadh Art offers a vibrant world that nurture...  
3  The Sports Boulevard inspires residents and vi...  
4  Prince Mohammed bin Salman Nonprofit City  M...  

Arabic Projects Sample after text formatting:
                             title  \
0                           القدية   
1                             روشن   
2                       الرياض آرت   
3                   المسار الرياضي   
4  مدينة محمد بن سلمان غير الربحية   

                                             content  
0  ستكون القدية عاصمة المستقبل للترفيه والريا

##Remove Special Characters & Emojis

In [None]:
import re

# Function to clean Arabic text
def clean_advanced_arabic(text):
    text = re.sub(r'[^\w\sء-ي]', '', text)  # Keep letters, digits, underscore, whitespace; drops punctuation and diacritics
    text = re.sub(r'\s+', ' ', text).strip()  # Normalize whitespace
    return text

# Function to clean English text
def clean_advanced_english(text):
    text = re.sub(r'[^A-Za-z0-9\s.,]', '', text)  # Keep ASCII letters, digits, whitespace, periods, commas; hyphens are dropped
    text = re.sub(r'\s+', ' ', text).strip()  # Normalize whitespace
    return text

# Apply cleaning to Arabic Projects
for col in ['title', 'content']:
    ar_projects[col] = ar_projects[col].apply(clean_advanced_arabic)

# Apply cleaning to English Projects
for col in ['title', 'content']:
    en_projects[col] = en_projects[col].apply(clean_advanced_english)

# Preview cleaned datasets
print("Arabic Projects Sample after advanced cleaning:")
print(ar_projects.head())

print("\nEnglish Projects Sample after advanced cleaning:")
print(en_projects.head())

Arabic Projects Sample after advanced cleaning:
                             title  \
0                           القدية   
1                             روشن   
2                       الرياض آرت   
3                   المسار الرياضي   
4  مدينة محمد بن سلمان غير الربحية   

                                             content  
0  ستكون القدية عاصمة المستقبل للترفيه والرياضة و...  
1  تطور مجموعة روشن مشاريع كبرى على مستوى المملكة...  
2  تتيح لك الرياض آرت عالما نابضا بالحياة يغذي إب...  
3  يلهم المسار الرياضي سكان وزائري مدينة الرياض ل...  
4  مدينة محمد بن سلمان غير الربحية مدينة مسك هي أ...  

English Projects Sample after advanced cleaning:
                                title  \
0                             Qiddiya   
1                               ROSHN   
2                          Riyadh Art   
3                  Saudi Sports Track   
4  Mohammed bin Salman Nonprofit City   

                                             content  
0  Qiddiya will be the future capital of 

##Calculate Content Length

In [None]:
# English Projects content length
en_projects['content_length'] = en_projects['content'].str.len()

# Arabic Projects content length
ar_projects['content_length'] = ar_projects['content'].str.len()

# Preview first rows
print("English Projects Sample with content length:")
print(en_projects.head())

print("\nArabic Projects Sample with content length:")
print(ar_projects.head())

English Projects Sample with content length:
                                title  \
0                             Qiddiya   
1                               ROSHN   
2                          Riyadh Art   
3                  Saudi Sports Track   
4  Mohammed bin Salman Nonprofit City   

                                             content  content_length  
0  Qiddiya will be the future capital of entertai...            2647  
1  ROSHN Group develops largescale projects acros...            1944  
2  Riyadh Art offers a vibrant world that nurture...            1519  
3  The Sports Boulevard inspires residents and vi...            1573  
4  Prince Mohammed bin Salman Nonprofit City Misk...            1279  

Arabic Projects Sample with content length:
                             title  \
0                           القدية   
1                             روشن   
2                       الرياض آرت   
3                   المسار الرياضي   
4  مدينة محمد بن سلمان غير الربحية   

        

##Final Inspection & Save Cleaned Dataset

In [None]:
# Check for missing values
print("English Projects Missing Values:\n", en_projects.isnull().sum())
print("\nArabic Projects Missing Values:\n", ar_projects.isnull().sum())

# Check for duplicates
print("\nEnglish Projects Duplicate Rows:", en_projects.duplicated().sum())
print("Arabic Projects Duplicate Rows:", ar_projects.duplicated().sum())

English Projects Missing Values:
 title             0
content           0
content_length    0
dtype: int64

Arabic Projects Missing Values:
 title             0
content           0
content_length    0
dtype: int64

English Projects Duplicate Rows: 0
Arabic Projects Duplicate Rows: 1


In [None]:
# Remove duplicate rows in Arabic Projects
ar_projects = ar_projects.drop_duplicates()

# Confirm duplicates removed
print("Arabic Projects Duplicate Rows after removal:", ar_projects.duplicated().sum())

Arabic Projects Duplicate Rows after removal: 0


##Saving the CSV files

In [None]:
# utf-8-sig adds a byte order mark (BOM) so that tools such as Excel recognise the file as UTF-8 and show Arabic text correctly
en_projects.to_csv("English_Projects_Cleaned.csv", index=False, encoding='utf-8-sig')
ar_projects.to_csv("Arabic_Projects_Cleaned.csv", index=False, encoding='utf-8-sig')

print("Cleaned datasets saved successfully as CSV!")


Cleaned datasets saved successfully as CSV!


#Characters Dataset

##Feature Description
**title**: Name of the historical figure.

**content**: Biography or description including achievements, positions, and historical relevance.

**content_length**: Number of characters in the content column.

##Load Characters Dataset

In [None]:
# Load datasets
ar_characters = pd.read_excel("/content/historical characters.AR.xlsx")
en_characters = pd.read_excel("/content/characters ENG.xlsx")

In [None]:
# Inspect Arabic Characters dataset
print("Arabic Characters Columns:")
print(ar_characters.columns)
print("\nArabic Characters Sample:")
print(ar_characters.head())

Arabic Characters Columns:
Index(['title', 'content'], dtype='object')

Arabic Characters Sample:
                                   title  \
0  الملك عبدالعزيز بن عبدالرحمن آل سعود.   
1       الملك سعود بن عبدالعزيز آل سعود.   
2       الملك فيصل بن عبدالعزيز آل سعود.   
3       الملك خالد بن عبدالعزيز آل سعود.   
4        الملك فهد بن عبدالعزيز آل سعود.   

                                             content  
0  \nأبو تركي.\nالترتيب في حكام الدولة السعودية\n...  
1  \nالاسم\nالملك سعود بن عبدالعزيز آل سعود.\nتار...  
2  \n\nالاسم\n\nالملك فيصل بن عبدالعزيز آل سعود.\...  
3  \nالاسم\nالملك خالد بن عبدالعزيز آل سعود.\nالم...  
4  الاسم\nالملك فهد بن عبدالعزيز آل سعود.\nالمنصب...  


In [None]:
# Replace newlines with spaces and trim leading/trailing whitespace
# (repeated inner spaces remain until the advanced cleaning step)
ar_characters['content'] = ar_characters['content'].str.replace('\n', ' ', regex=True).str.strip()

# Preview cleaned content
print(ar_characters['content'].head())

0    أبو تركي. الترتيب في حكام الدولة السعودية التا...
1    الاسم الملك سعود بن عبدالعزيز آل سعود. تاريخ ا...
2    الاسم  الملك فيصل بن عبدالعزيز آل سعود. المنصب...
3    الاسم الملك خالد بن عبدالعزيز آل سعود. المنصب ...
4    الاسم الملك فهد بن عبدالعزيز آل سعود. المنصب خ...
Name: content, dtype: object


##Convert Columns to String

In [None]:
# Define text columns
text_cols = ['title', 'content']

# Convert Arabic columns to string
for col in text_cols:
    ar_characters[col] = ar_characters[col].astype(str)

# Convert English columns to string
for col in text_cols:
    en_characters[col] = en_characters[col].astype(str)

# Preview samples
print("Arabic Characters Sample after converting to string:")
print(ar_characters.head())

print("\nEnglish Characters Sample after converting to string:")
print(en_characters.head())

Arabic Characters Sample after converting to string:
                                   title  \
0  الملك عبدالعزيز بن عبدالرحمن آل سعود.   
1       الملك سعود بن عبدالعزيز آل سعود.   
2       الملك فيصل بن عبدالعزيز آل سعود.   
3       الملك خالد بن عبدالعزيز آل سعود.   
4        الملك فهد بن عبدالعزيز آل سعود.   

                                             content  
0  \nأبو تركي.\nالترتيب في حكام الدولة السعودية\n...  
1  \nالاسم\nالملك سعود بن عبدالعزيز آل سعود.\nتار...  
2  \n\nالاسم\n\nالملك فيصل بن عبدالعزيز آل سعود.\...  
3  \nالاسم\nالملك خالد بن عبدالعزيز آل سعود.\nالم...  
4  الاسم\nالملك فهد بن عبدالعزيز آل سعود.\nالمنصب...  

English Characters Sample after converting to string:
                               title  \
0  Abdulaziz Bin Abdulrahman Al Saud   
1         Saud Bin Abdulaziz Al Saud   
2       Faisal Bin Abdulaziz Al Saud   
3       Khalid Bin Abdulaziz Al Saud   
4         Fahd Bin Abdulaziz Al Saud   

                                             content  
0

In [None]:
# Inspect English Characters dataset
print("\nEnglish Characters Columns:")
print(en_characters.columns)
print("\nEnglish Characters Sample:")
print(en_characters.head())


English Characters Columns:
Index(['title', 'content'], dtype='object')

English Characters Sample:
                               title  \
0  Abdulaziz Bin Abdulrahman Al Saud   
1         Saud Bin Abdulaziz Al Saud   
2       Faisal Bin Abdulaziz Al Saud   
3       Khalid Bin Abdulaziz Al Saud   
4         Fahd Bin Abdulaziz Al Saud   

                                             content  
0  King Abdulaziz Bin Abdulrahman Bin Faisal Bin ...  
1  King Saud Bin Abdulaziz Al Saud (1902-1969) wa...  
2  King Faisal Bin Abdulaziz Al Saud (1906-1975) ...  
3  King Khalid Bin Abdulaziz Al Saud (1913-1982) ...  
4  King Fahd Bin Abdulaziz Al Saud (1921-2005), w...  


##Remove Missing Values

In [None]:
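# Remove missing values in Arabic Characters
# (values already cast with astype(str) above may appear as the text 'nan' and are not dropped here)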
ar_characters = ar_characters.dropna(subset=['title', 'content'])

# Remove missing values in English Characters
en_characters = en_characters.dropna(subset=['title', 'content'])

# Check the shape after removal
print("Arabic Characters Shape after removing missing values:", ar_characters.shape)
print("English Characters Shape after removing missing values:", en_characters.shape)

Arabic Characters Shape after removing missing values: (105, 2)
English Characters Shape after removing missing values: (105, 2)


##Clean Text Formatting & Advanced Cleaning

In [None]:
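import re  # required by the cleaning functions; harmless if already imported earlier

# Function for Arabic advanced cleaning
def clean_advanced_arabic(text):
    # Keep letters, digits, underscores and whitespace; drop punctuation and symbols.
    # In Python 3, \w already matches Arabic letters, so ء-ي is redundant but harmless.
    # Arabic diacritics (tashkeel) fall outside both classes, so they are removed too.
    text = re.sub(r'[^\w\sء-ي]', '', text)
    # Replace runs of whitespace (including newlines) with a single space
    text = re.sub(r'\s+', ' ', text)
    # Strip leading and trailing spaces
    return text.strip()

# Function for English advanced cleaning
def clean_advanced_english(text):
    # Keep ASCII letters, digits, whitespace, periods and commas. Everything else is
    # deleted without a replacement, so "(1902-1969)" becomes "19021969".
    text = re.sub(r'[^A-Za-z0-9\s.,]', '', text)
    # Replace runs of whitespace (including newlines) with a single space
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Apply cleaning to the content column only; titles are left unchanged
ar_characters['content'] = ar_characters['content'].apply(clean_advanced_arabic)
en_characters['content'] = en_characters['content'].apply(clean_advanced_english)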

# Preview
print("Arabic Characters Sample after advanced cleaning:")
print(ar_characters.head())

print("\nEnglish Characters Sample after advanced cleaning:")
print(en_characters.head())

Arabic Characters Sample after advanced cleaning:
                                   title  \
0  الملك عبدالعزيز بن عبدالرحمن آل سعود.   
1       الملك سعود بن عبدالعزيز آل سعود.   
2       الملك فيصل بن عبدالعزيز آل سعود.   
3       الملك خالد بن عبدالعزيز آل سعود.   
4        الملك فهد بن عبدالعزيز آل سعود.   

                                             content  
0  أبو تركي الترتيب في حكام الدولة السعودية التاس...  
1  الاسم الملك سعود بن عبدالعزيز آل سعود تاريخ ال...  
2  الاسم الملك فيصل بن عبدالعزيز آل سعود المنصب ث...  
3  الاسم الملك خالد بن عبدالعزيز آل سعود المنصب ر...  
4  الاسم الملك فهد بن عبدالعزيز آل سعود المنصب خا...  

English Characters Sample after advanced cleaning:
                               title  \
0  Abdulaziz Bin Abdulrahman Al Saud   
1         Saud Bin Abdulaziz Al Saud   
2       Faisal Bin Abdulaziz Al Saud   
3       Khalid Bin Abdulaziz Al Saud   
4         Fahd Bin Abdulaziz Al Saud   

                                             content  
0  King
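The English pattern above deletes hyphens and parentheses outright, which fuses year ranges into a single number. A minimal sketch of the effect follows, using a fragment taken from the preview output. `clean_english_keep_ranges` is a hypothetical variant (not part of the notebook) that replaces dropped characters with a space instead.

```python
import re

def clean_advanced_english(text):
    # Same pattern as the notebook cell: disallowed characters are deleted
    text = re.sub(r'[^A-Za-z0-9\s.,]', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

def clean_english_keep_ranges(text):
    # Hypothetical variant: disallowed characters become a space, so tokens
    # that were separated by punctuation stay separated
    text = re.sub(r'[^A-Za-z0-9\s.,]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

sample = "King Saud Bin Abdulaziz Al Saud (1902-1969)"
original_result = clean_advanced_english(sample)
variant_result = clean_english_keep_ranges(sample)

print(original_result)  # King Saud Bin Abdulaziz Al Saud 19021969
print(variant_result)   # King Saud Bin Abdulaziz Al Saud 1902 1969
```

Which behavior is preferable depends on the downstream use. For retrieval, keeping the two years as separate tokens is likely easier to match against queries.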

##Calculate Content Length

In [None]:
# Add a new column for content length
ar_characters['content_length'] = ar_characters['content'].apply(len)
en_characters['content_length'] = en_characters['content'].apply(len)

# Preview
print("Arabic Characters Sample with content length:")
print(ar_characters[['title', 'content_length']].head())

print("\nEnglish Characters Sample with content length:")
print(en_characters[['title', 'content_length']].head())

Arabic Characters Sample with content length:
                                   title  content_length
0  الملك عبدالعزيز بن عبدالرحمن آل سعود.           31619
1       الملك سعود بن عبدالعزيز آل سعود.           31820
2       الملك فيصل بن عبدالعزيز آل سعود.           31667
3       الملك خالد بن عبدالعزيز آل سعود.           31602
4        الملك فهد بن عبدالعزيز آل سعود.           31725

English Characters Sample with content length:
                               title  content_length
0  Abdulaziz Bin Abdulrahman Al Saud           32523
1         Saud Bin Abdulaziz Al Saud           32642
2       Faisal Bin Abdulaziz Al Saud           32631
3       Khalid Bin Abdulaziz Al Saud           32611
4         Fahd Bin Abdulaziz Al Saud           32633
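Note that `len` counts Unicode characters, not bytes, so Arabic and English lengths are comparable as character counts but not as storage sizes. The sketch below shows the difference on a hypothetical two-row frame. The `word_count` and `byte_length` columns are illustrative additions, not columns the notebook creates.

```python
import pandas as pd

# Hypothetical toy rows: one ASCII string and one Arabic string
demo = pd.DataFrame({
    "title": ["Short entry", "Arabic entry"],
    "content": ["abc", "الملك سعود"],
})

# Character count, equivalent to .apply(len) for string columns
demo["content_length"] = demo["content"].str.len()
# Whitespace-separated token count, a rough proxy for text size
demo["word_count"] = demo["content"].str.split().str.len()
# Size in bytes once encoded as UTF-8 (Arabic letters take two bytes each)
demo["byte_length"] = demo["content"].map(lambda s: len(s.encode("utf-8")))

print(demo[["title", "content_length", "word_count", "byte_length"]])
```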


##Remove Duplicates

In [None]:
# Remove duplicate rows
ar_characters = ar_characters.drop_duplicates()
en_characters = en_characters.drop_duplicates()

# Check shapes after removing duplicates
print("Arabic Characters Shape after removing duplicates:", ar_characters.shape)
print("English Characters Shape after removing duplicates:", en_characters.shape)

Arabic Characters Shape after removing duplicates: (101, 3)
English Characters Shape after removing duplicates: (101, 3)
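`drop_duplicates()` without arguments removes only rows that match in every column, so two entries with the same title but slightly different content both survive. The sketch below shows one way to surface such near-duplicates on a hypothetical toy frame. Whether to deduplicate on `title` alone is a judgment call that depends on the data.

```python
import pandas as pd

# Hypothetical toy frame: one exact duplicate and one same-title variant
demo = pd.DataFrame({
    "title": ["King A", "King A", "King B", "King A"],
    "content": ["Bio text", "Bio text", "Other bio", "Bio text with an extra note"],
})

# Exact duplicates only, as in the notebook cell
exact = demo.drop_duplicates()

# Rows that still share a title after exact deduplication, for manual review
same_title = exact[exact.duplicated(subset=["title"], keep=False)]

# Optional stricter pass: keep the first row per title and renumber the index
deduped = exact.drop_duplicates(subset=["title"], keep="first").reset_index(drop=True)

print("After exact deduplication:", exact.shape)
print("Rows sharing a title:", len(same_title))
print("After title-based deduplication:", deduped.shape)
```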


In [None]:
# Check for missing values
print("English Characters Missing Values:\n", en_characters.isnull().sum())
print("\nArabic Characters Missing Values:\n", ar_characters.isnull().sum())

English Characters Missing Values:
 title             0
content           0
content_length    0
dtype: int64

Arabic Characters Missing Values:
 title             0
content           0
content_length    0
dtype: int64


##Final Inspection & Save

In [None]:
# Check for missing values one last time
print("Arabic Characters Missing Values:\n", ar_characters.isnull().sum())
print("\nEnglish Characters Missing Values:\n", en_characters.isnull().sum())

# Save cleaned datasets as CSV.
# utf-8-sig prepends a byte order mark (BOM) so that Excel detects UTF-8 and renders Arabic text correctly.
ar_characters.to_csv("Arabic_Characters_Cleaned.csv", index=False, encoding='utf-8-sig')
en_characters.to_csv("English_Characters_Cleaned.csv", index=False, encoding='utf-8-sig')

print("\nDatasets saved successfully!")

Arabic Characters Missing Values:
 title             0
content           0
content_length    0
dtype: int64

English Characters Missing Values:
 title             0
content           0
content_length    0
dtype: int64

Datasets saved successfully!
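A quick round-trip check can confirm that the saved files reload with the Arabic text intact. The sketch below writes a hypothetical one-row frame to a temporary directory rather than touching the real output files. The file name `demo_characters.csv` and the sample strings are placeholders.

```python
import os
import tempfile

import pandas as pd

# Hypothetical one-row frame with Arabic text (placeholder values)
demo = pd.DataFrame({"title": ["الملك سعود"], "content": ["نص تجريبي"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_characters.csv")
    demo.to_csv(path, index=False, encoding="utf-8-sig")

    # The file should start with the three-byte UTF-8 BOM
    with open(path, "rb") as fh:
        has_bom = fh.read(3) == b"\xef\xbb\xbf"

    # Reading with utf-8-sig strips the BOM, so the first column name stays clean
    reloaded = pd.read_csv(path, encoding="utf-8-sig")

round_trip_ok = (
    list(reloaded.columns) == list(demo.columns)
    and reloaded["title"].tolist() == demo["title"].tolist()
    and reloaded["content"].tolist() == demo["content"].tolist()
)
print("BOM present:", has_bom)
print("Round trip preserved the text:", round_trip_ok)
```

When loading the cleaned CSVs elsewhere, passing `encoding="utf-8-sig"` avoids a stray BOM character at the start of the first column name.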
