# **PANDAS**

#### **Data Processing using Pandas library (Benchmark)**

In [None]:
import pandas as pd
import re
import time
import psutil


> Standalone function to calculate and return processing metrics.
- Parameters:
    - df: pandas DataFrame being processed
    
- Returns:
    - A formatted string reporting the row count, processing time, CPU usage, memory usage, and throughput.



In [None]:
def calculate_processing_metrics(df):

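    # Record the start time. Note: the timed window below also contains the
    # blocking psutil.cpu_percent(interval=1) call, so every reported
    # processing time includes about one second of sampling overhead.
    start_time = time.time()

    # Get initial system metrics
    initial_cpu = psutil.cpu_percent(interval=1)
    initial_memory = psutil.virtual_memory().percent

    # Benchmark workload: count the duplicate rows in the DataFrame
    duplicate_count = df.duplicated().sum()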

    # Record the end time of the operation
    end_time = time.time()

    # Get final system metrics
    final_cpu = psutil.cpu_percent(interval=1)
    final_memory = psutil.virtual_memory().percent

    # Calculate processing time
    processing_time = end_time - start_time

    # Calculate throughput (assuming rows processed are equal to the DataFrame rows)
    throughput = len(df) / processing_time if processing_time > 0 else 0

    # Return the metrics as a formatted, human-readable string
    metrics = (
        f"Total Rows Processed: {len(df):,} records\n"
        f"Total Processing Time: {processing_time:.4f} seconds\n"
        f"Initial CPU Usage: {initial_cpu:.2f}%\n"
        f"Final CPU Usage: {final_cpu:.2f}%\n"
        f"Memory Usage: {final_memory:.2f}%\n"
        f"Throughput (Records per Second): {throughput:.2f} records/sec"
    )

    return metrics
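
The function above starts its timer before the blocking CPU sample and always measures `df.duplicated()`, so the reported time reflects the sampling interval more than the cleaning step itself. A minimal sketch of one alternative, assuming a hypothetical helper named `time_step` (not part of the original notebook) that times only the callable it is given:

```python
import time

import pandas as pd


def time_step(step, df):
    """Run step(df) and return (result, elapsed_seconds).

    Hypothetical helper for illustration; only the call to step() is timed.
    """
    start = time.perf_counter()
    result = step(df)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Toy data standing in for the real dataset
sample = pd.DataFrame({"Title": ["a", "a", "b"], "Teaser": ["x", "x", "y"]})

deduped, seconds = time_step(lambda frame: frame.drop_duplicates(), sample)
print(f"{len(deduped)} rows kept in {seconds:.6f} seconds")
```

`time.perf_counter()` is used here because it is a monotonic, high-resolution clock intended for measuring short durations.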

> **1. Loading Data**

We load the raw dataset from the `NST_News_Articles.xlsx` file. The dataset contains each article's title, teaser, URL, and category.

In [None]:
rd = pd.read_excel("NST_News_Articles.xlsx")

print(calculate_processing_metrics(rd))

Total Rows Processed: 110,641 records
Total Processing Time: 1.1632 seconds
Initial CPU Usage: 62.50%
Final CPU Usage: 5.00%
Memory Usage: 21.50%
Throughput (Records per Second): 95121.45 records/sec


> **2. Handle Duplicated Data**

We remove any duplicate rows from the dataset to avoid redundant data.



In [None]:
df_cleaned = rd.drop_duplicates()
print(calculate_processing_metrics(df_cleaned))


Total Rows Processed: 106,473 records
Total Processing Time: 1.1389 seconds
Initial CPU Usage: 5.10%
Final CPU Usage: 5.50%
Memory Usage: 21.50%
Throughput (Records per Second): 93488.10 records/sec
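
As a small illustration of this step, the sketch below counts duplicate rows before and after `drop_duplicates()`. The toy rows are made up for the example and are not taken from the dataset.

```python
import pandas as pd

# Toy rows; the first two are exact duplicates of each other
toy = pd.DataFrame({
    "Title": ["Flood alert", "Flood alert", "New bridge"],
    "URL": ["/a", "/a", "/b"],
})

dup_before = int(toy.duplicated().sum())         # rows that repeat an earlier row
toy_unique = toy.drop_duplicates()               # keeps the first occurrence by default
dup_after = int(toy_unique.duplicated().sum())

print(dup_before, dup_after)
```

By default both methods compare all columns; the `subset` parameter restricts the comparison to selected columns if that is what a given analysis needs.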


> **3. Handle Missing Data**

We drop every row that has a missing value in any column to maintain data quality.


In [None]:
df_cleaned = df_cleaned.dropna()
print(calculate_processing_metrics(df_cleaned))

Total Rows Processed: 105,392 records
Total Processing Time: 1.1954 seconds
Initial CPU Usage: 10.00%
Final CPU Usage: 4.60%
Memory Usage: 21.50%
Throughput (Records per Second): 88163.65 records/sec
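
If only certain columns are essential, `dropna` also accepts a `subset` argument. The sketch below contrasts the two behaviours, assuming `Title` and `Teaser` are the key columns (an assumption for illustration; the notebook cell above drops rows with a missing value in any column).

```python
import pandas as pd

toy = pd.DataFrame({
    "Title": ["A", None, "C"],
    "Teaser": ["KL: text", "JB: text", "PJ: text"],
    "Category": ["Nation", "Nation", None],
})

key_columns = ["Title", "Teaser"]  # assumed choice of key columns

any_missing_dropped = toy.dropna()                    # drops rows 1 and 2
key_missing_dropped = toy.dropna(subset=key_columns)  # drops only row 1

print(len(any_missing_dropped), len(key_missing_dropped))
```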


> **4. Clean the Teaser Column**

We clean the `Teaser` column by removing every character other than letters, digits, colons, commas, and spaces. We then keep only the rows whose teaser contains a colon, because the text before the first colon is the place and the text after it is the content.

In [None]:
df_cleaned['Teaser'] = df_cleaned['Teaser'].str.replace(r'[^a-zA-Z0-9: ,]', '', regex=True)
print(calculate_processing_metrics(df_cleaned))

Total Rows Processed: 105,392 records
Total Processing Time: 1.1841 seconds
Initial CPU Usage: 5.10%
Final CPU Usage: 5.50%
Memory Usage: 21.50%
Throughput (Records per Second): 89002.29 records/sec


In [None]:
df_cleaned = df_cleaned[df_cleaned['Teaser'].str.contains(':')]
print(calculate_processing_metrics(df_cleaned))

Total Rows Processed: 103,070 records
Total Processing Time: 1.1749 seconds
Initial CPU Usage: 5.60%
Final CPU Usage: 27.80%
Memory Usage: 21.40%
Throughput (Records per Second): 87724.97 records/sec
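
To see what the two cells above do to individual values, here is a minimal sketch on two made-up teasers. It uses the same character class as the notebook; `regex=False` in `str.contains` is an added choice that treats the colon as a literal.

```python
import pandas as pd

teasers = pd.Series([
    "KUALA LUMPUR: Prices rose 5% (year-on-year)!",
    "No dateline in this teaser.",
])

# Keep only letters, digits, colons, spaces, and commas
cleaned = teasers.str.replace(r"[^a-zA-Z0-9: ,]", "", regex=True)

# Keep only teasers that still contain a colon
has_dateline = cleaned.str.contains(":", regex=False)
kept = cleaned[has_dateline]

print(kept.tolist())
```

Because the pattern also removes periods and hyphens, a value such as `2.38` becomes `238` and `year-on-year` becomes `yearonyear`, which is worth keeping in mind if the teaser text is analysed later.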


> **5. Split the Place from the `Teaser` Column**


In [None]:
df_cleaned[['Place', 'Teaser']] = df_cleaned['Teaser'].str.split(':', n=1, expand=True)
print(calculate_processing_metrics(df_cleaned))

# Example teaser: KUALA LUMPUR: The Central Database Hub (PADU) system has recorded a total of 2.38 million individual information updates

Total Rows Processed: 103,070 records
Total Processing Time: 1.2900 seconds
Initial CPU Usage: 57.80%
Final CPU Usage: 33.30%
Memory Usage: 21.50%
Throughput (Records per Second): 79902.15 records/sec
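
The role of `n=1` is easiest to see on a teaser that contains more than one colon. The sketch below uses a made-up teaser; the final `str.strip()` is an optional extra that the notebook cell does not perform.

```python
import pandas as pd

frame = pd.DataFrame({"Teaser": ["KUALA LUMPUR: Talks resume at 10:30 today"]})

# Split on the first colon only, so later colons stay in the content
parts = frame["Teaser"].str.split(":", n=1, expand=True)
parts.columns = ["Place", "Teaser"]

# Optional: remove the space left after the colon
parts["Teaser"] = parts["Teaser"].str.strip()

print(parts.iloc[0].tolist())
```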


> **6. Extract and Standardize Place Names**

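We convert the place names to uppercase, correct known misspellings and variants with a lookup table (which also drops trailing country names), keep only the text before the first comma, and strip any remaining non-letter characters.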

In [None]:
place_corrections = {
    'ALOR STAR': 'ALOR SETAR', 'AOR SETAR': 'ALOR SETAR','LOR STAR':'ALOR SETAR','ASTANA KAZAKHSTAN':'ASTANA',
    'BALIK PULAI':'BALIK PULAU','BATANG AI': 'BATANG KALI', 'BAGAN DATOH':'BAGAN DATUK',
    'CAMERON HIGHLAND': 'CAMERON HIGHLANDS','CHIANGMAI': 'CHIANG MAI','COLOMBO SRI LANKA': 'COLOMBO',
    'FRANK': 'FRANKFURT',
    'GUAMUSANG': 'GUA MUSANG','GUA MUSANG POS SIMPOR': 'GUA MUSANG',
    'DANANG': 'DA NANG',
    'GEOGE TOWN': 'GEORGE TOWN','GEORGETOWN': 'GEORGE TOWN','JERTIH':'JERTEH',
    'JOHOR BARU': 'JOHOR BAHRU', 'JOHOR BAHU': 'JOHOR BAHRU','JOHOR BHARU': 'JOHOR BAHRU','JOHOR BARY': 'JOHOR BAHRU','JOHOR BAHARU': 'JOHOR BAHRU',
    'JOHOR BARU KUALA LUMPUR':'JOHOR BAHRU','JOHOR BARUSINGAPORE':'JOHOR BAHRU',
    'KUALA KUBU BARU':'KUALA KUBU BAHRU','KUALA KUBU BAHARU':'KUALA KUBU BAHRU',
    'UALA LUMPUR': 'KUALA LUMPUR','KUALKUALA LUMPUR':'KUALA LUMPUR','SEPT  KUALA LUMPUR': 'KUALA LUMPUR','KKUALA LUMUR': 'KUALA LUMPUR',
    'KIALA LUMUPUR': 'KUALA LUMPUR', 'IKUALA LUMPUR': 'KUALA LUMPUR','KUALAA LUMPUR':'KUALA LUMPUR','KUALALUMPUR':'KUALA LUMPUR',
    'KUALA LUMUR':'KUALA LUMPUR','KUALA LUMPU':'KUALA LUMPUR','KUALA LUMPURHONG KONG':'KUALA LUMPUR','KUALA LIMPUR':'KUALA LUMPUR',
    'KUALA LUMPURJAKARTA':'KUALA LUMPUR','KUALA KUMPUR':'KUALA LUMPUR','KUALA NERUS TERENGGANU':'KUALA NERUS',
    'KUALATERENGGANU':'KUALA TERENGGANU','KUALA TERENGANU':'KUALA TERENGGANU','KUALA TERENGAGNU':'KUALA TERENGGANU','KUALA TENGGANU':'KUALA TERENGGANU',
    'KULA LUMPUR':'KUALA LUMPUR','KUCHINGL':'KUCHING','KUANG':'KLUANG',
    'KUAL LUMPUR':'KUALA LUMPUR','KUALA  LUMPUR':'KUALA LUMPUR',
    'UALA TERENGGANU': 'KUALA TERENGGANU','KKOTA KINABALU':'KOTA KINABALU','KOTA KINABAU':'KOTA KINABALU','KOTA KINBALU':'KOTA KINABALU','KOTA  KINABALU':'KOTA KINABALU',
    'KOTA  BARU':'KOTA BAHRU', 'KOTA BAHARU':'KOTA BAHRU','KOTA BARU':'KOTA BAHRU','KOTA BARUGEORGE TOWN':'KOTA BAHRU',
    'LABUAN BAJO INDONESIA':'LABUAN BAJO','LONDONKUALA LUMPUR':'LONDON','LONDON TUES':'LONDON','LENGONG':'LENGGONG','LAMGKAWI':'LANGKAWI',
    'MARNG':'MARANG','MELAKA': 'MALACCA','MEKALA':'MALACCA','MANAMA BAHRAIN': 'MANAMA',
    'NIBONG TEBA':'NIBONG TEBAL','NEW DELHI INDIA':'NEW DELHI','NEW DELH':'NEW DELHI','NARATHIWAT SOUTHERN THAILAND':'NARATHIWAT','MUNDOK SOUTHERN THAILAND':'MUNDOK',
    'PARISBEIJING':'PARIS',
    'PUTRAJAYAS': 'PUTRAJAYA','PUTRAYAJA': 'PUTRAJAYA','PUTRJAYA': 'PUTRAJAYA','PPUTRAJAYA': 'PUTRAJAYA','PATTANI THAILAND':'PATTANI','PASIR PUTIH':'PASIR PUTEH',
    'PORT MORESBY PAPUA NEW GUINEA': 'PORT MORESBY','PANGKOR ISLAND':'PANGKOR','PULAU PERHENTIAN KECIL TERENGGANU':'PULAU PERHENTIAN',
    'SEBERANG PERAI': 'SEBERANG PRAI','SUNNYLANDS CALIFORNIA': 'SUNNYLANDS','SUNGAI GOLOK THAILAND':'SUNGAI GOLOK',
    'SUBANG': 'SUBANG JAYA','SONGKLA': 'SONGKHLA','SHAH  ALAM': 'SHAH ALAM','SEMENYEH': 'SEMENYIH','SELANGAU': 'SELANGOR','SARI': 'SARIKEI',
    'SAMARAHAN': 'SAMARKAND','SADAO THAILAND': 'SADAO',   'ALSHAH ALAM': 'SHAH ALAM',
    'THE HAGUE NETHERLANDS': 'THE HAGUE','TASHKENTL': 'TASHKENT','TAKBAI SOUTHERN THAILAND': 'TAKBAI','TAK': 'TAK THAILAND',
    'VALLETTA MALTA': 'VALLETTA','VIENTIANE LAOS': 'VIENTIANE','VLADIVOSTOK RUSSIA': 'VLADIVOSTOK','VALETTA':'VALLETTA',
    'ULAANBAATAR  MONGOLIA': 'ULAANBAATAR','ULAANBAATAR MONGOLIA': 'ULAANBAATAR','ULAANBAATAAR': 'ULAANBAATAR',
    'WASHINGTON DC': 'WASHINGTON',

}

df_cleaned['Place'] = df_cleaned['Place'].str.upper()
df_cleaned['Place'] = df_cleaned['Place'].replace(place_corrections)
df_cleaned['Place'] = df_cleaned['Place'].str.split(',').str[0]
df_cleaned['Place'] = df_cleaned['Place'].str.replace(r'[^a-zA-Z\s]+', '', regex=True)

print(calculate_processing_metrics(df_cleaned))


Total Rows Processed: 103,070 records
Total Processing Time: 1.1932 seconds
Initial CPU Usage: 4.50%
Final CPU Usage: 5.50%
Memory Usage: 21.50%
Throughput (Records per Second): 86382.48 records/sec
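
`Series.replace` with a dictionary only substitutes values that match a key exactly, so stray whitespace or a trailing country name prevents a match. One possible variation, sketched below under that assumption, normalizes the text first and applies the lookup last. The `corrections` dictionary is a two-entry subset of the mapping above, and the sample values are made up.

```python
import pandas as pd

# Two entries taken from the notebook's place_corrections mapping
corrections = {"JOHOR BARU": "JOHOR BAHRU", "KUALALUMPUR": "KUALA LUMPUR"}

places = pd.Series([" johor  baru ", "KualaLumpur, Malaysia", "PUTRAJAYA"])

normalized = (
    places.str.upper()
    .str.split(",").str[0]                      # keep text before the first comma
    .str.replace(r"[^A-Z\s]+", "", regex=True)  # drop non-letter characters
    .str.replace(r"\s+", " ", regex=True)       # collapse repeated whitespace
    .str.strip()
)

# Exact-match lookup on the normalized values
standardized = normalized.replace(corrections)

print(standardized.tolist())
```

Reordering the steps this way would change which rows match the lookup, so the row counts reported in this notebook would not necessarily be reproduced.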


In [None]:
# Count the number of articles per city
city_counts = df_cleaned['Place'].value_counts()

# Set a threshold: keep only cities with at least N articles (e.g., N=2)
threshold = 2
valid_cities = city_counts[city_counts >= threshold].index

# Save the valid cities to a CSV file
pd.Series(valid_cities).to_csv('valid_cities.csv', index=False, header=False)

# Filter the DataFrame to keep only valid cities
df_cleaned = df_cleaned[df_cleaned['Place'].isin(valid_cities)]

print(calculate_processing_metrics(df_cleaned))

Total Rows Processed: 102,768 records
Total Processing Time: 1.1880 seconds
Initial CPU Usage: 5.10%
Final CPU Usage: 5.00%
Memory Usage: 21.50%
Throughput (Records per Second): 86501.67 records/sec
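
The frequency threshold acts as a simple filter for one-off misspellings that the lookup table missed. A minimal sketch on made-up place values:

```python
import pandas as pd

places = pd.Series(
    ["KUALA LUMPUR", "KUALA LUMPUR", "IPOH", "IPOH", "KUALA LUMPR"]
)

counts = places.value_counts()
threshold = 2
valid = counts[counts >= threshold].index   # places seen at least twice

filtered = places[places.isin(valid)]       # the single misspelled row is dropped

print(sorted(valid), len(filtered))
```

The trade-off is that a correctly spelled place that appears only once is removed as well.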


In [None]:
# Re-apply the valid-city filter; the previous cell already applied it, so the row count is unchanged
df_cleaned = df_cleaned[df_cleaned['Place'].isin(valid_cities)]
print(calculate_processing_metrics(df_cleaned))

Total Rows Processed: 102,768 records
Total Processing Time: 1.1859 seconds
Initial CPU Usage: 5.10%
Final CPU Usage: 5.50%
Memory Usage: 21.50%
Throughput (Records per Second): 86658.99 records/sec


> **7. Extract Date from URL**

We extract the date in `YYYY/MM` format from the URL and add it as a separate column in the dataset.

In [None]:
df_cleaned['Date'] = df_cleaned['URL'].str.extract(r'(\d{4}/\d{2})')
print("Date column extracted from the URL.")
print(calculate_processing_metrics(df_cleaned))  # Metrics after extracting the Date column
print("")

Date column extracted from the URL.
Total Rows Processed: 102,768 records
Total Processing Time: 1.2172 seconds
Initial CPU Usage: 5.50%
Final CPU Usage: 57.40%
Memory Usage: 21.60%
Throughput (Records per Second): 84433.14 records/sec
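
The sketch below applies the same pattern to two hypothetical URLs (the `example.com` addresses are placeholders, not the dataset's real URLs). It passes `expand=False` so that a Series is returned, and the conversion with `pd.to_datetime` is an optional extra step that the notebook does not perform.

```python
import pandas as pd

urls = pd.Series([
    "https://example.com/news/nation/2024/03/1029384/sample-headline",
    "https://example.com/about",  # no date segment
])

# First occurrence of four digits, a slash, and two digits
dates = urls.str.extract(r"(\d{4}/\d{2})", expand=False)

# Optional: parse into timestamps (first day of the month); no match becomes NaT
parsed = pd.to_datetime(dates, format="%Y/%m")

print(dates.tolist())
```

The pattern matches the first `dddd/dd` sequence anywhere in the URL, so it relies on the date segment appearing before any other digits of that shape.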



> **8. Final Dataset**

After cleaning and transforming the data, we rearrange the DataFrame columns, sort the rows by place, and export the cleaned dataset to a new CSV file (`finalData.csv`).

In [None]:
# Rearrange the columns to the desired order
df_cleaned = df_cleaned[['Place', 'Date', 'Category', 'Title','Teaser']]
print(calculate_processing_metrics(df_cleaned))

Total Rows Processed: 102,768 records
Total Processing Time: 1.2065 seconds
Initial CPU Usage: 60.60%
Final CPU Usage: 14.60%
Memory Usage: 21.60%
Throughput (Records per Second): 85178.44 records/sec


In [None]:
sorted_df = df_cleaned.sort_values(by=['Place'])
sorted_df.to_csv('finalData.csv',index=False)
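
A quick way to sanity-check the export is to read it back and confirm the column order and sort order. The sketch below does this on made-up rows, writing to an in-memory buffer instead of `finalData.csv`.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "Title": ["T1", "T2"],
    "Teaser": ["text one", "text two"],
    "URL": ["/2024/03/1", "/2024/04/2"],
    "Category": ["Nation", "World"],
    "Place": ["PUTRAJAYA", "IPOH"],
    "Date": ["2024/03", "2024/04"],
})

# Same column order and sort key as the notebook
final = toy[["Place", "Date", "Category", "Title", "Teaser"]].sort_values(by=["Place"])

buffer = io.StringIO()            # stands in for the CSV file on disk
final.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(list(reloaded.columns), reloaded["Place"].tolist())
```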