<a href="https://colab.research.google.com/github/chungbrandon-ai/Virtual-Global-Colllaboration/blob/main/Copy_of_VGC_Group_8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tourist Destinations in Pakistan
Dataset: https://opendata.com.pk/dataset/tourist-destinations-in-pakistan

#### Data Cleaning Tasks
1. Remove photo credits from descriptions.

Cammy

2. Standardize category names

3. Trim extra spaces and line breaks in all text fields.

4. Clean the _key column by removing spaces or replacing them with underscores.

5. Convert latitude and longitude from strings to numeric values

Cammy

6. Remove broken or unnecessary quotes inside the description field.

7. Extract key information from the description column.

Cammy

8. Add a "province" column based on district.

9. Remove duplicate rows using the _key column.

Step 1: Import Libraries and Load the Data
This first step imports the necessary libraries (pandas, re, numpy) and reads your CSV file into a DataFrame called df. It then prints the basic information and the first few rows to confirm it loaded correctly.

In [31]:
# Import necessary libraries
import pandas as pd
import re
import numpy as np

# Define the file name
file_name = 'tourist-destinations-in-pakistan.csv'

try:
    # Read the CSV file into a pandas DataFrame
    df = pd.read_csv(file_name)
    print(f"Successfully loaded '{file_name}'")

    print("\n--- Original Data Info ---")
    # Display summary information about the DataFrame
    df.info()

    print("\n--- First 5 Rows of Original Data ---")
    # Display the first 5 rows
    print(df.head())

except FileNotFoundError:
    print(f"ERROR: File '{file_name}' was not found.")
    print("Please make sure you have UPLOADED the file to the Colab session.")
except Exception as e:
    print(f"An error occurred: {e}")

Successfully loaded 'tourist-destinations-in-pakistan.csv'

--- Original Data Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69 entries, 0 to 68
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   _key       69 non-null     object 
 1   Desc       69 non-null     object 
 2   category   69 non-null     object 
 3   district   69 non-null     object 
 4   latitude   69 non-null     float64
 5   longitude  69 non-null     float64
dtypes: float64(2), object(4)
memory usage: 3.4+ KB

--- First 5 Rows of Original Data ---
              _key                                               Desc  \
0       Ansoo Lake  Ansoo Lake is situated in Kaghan Valley of Pak...   
1    Astola Island  Astola Island, also known as Jezira Haft Talar...   
2     Attabad Lake  Attabad Lake is a lake in the Gojal Valley of ...   
3  Badshahi Mosque  The Badshahi Mosque is a Mughal-era congregati...   
4  Baltoro Glacier  The Baltoro Gla
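
Optional check: before cleaning, it can help to confirm that the file contains the columns the later steps rely on. The sketch below uses a small in-memory sample (`sample_csv`, a made-up stand-in for the real file) and a hypothetical `expected_columns` list taken from the `df.info()` output above.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for the real CSV file
sample_csv = io.StringIO(
    "_key,Desc,category,district,latitude,longitude\n"
    "Ansoo Lake,Sample text,Lake,Khyber Pakhtunkhwa,34.814119,73.676428\n"
    "Astola Island,Sample text,Island,Balochistan,25.122321,63.847948\n"
)
sample_df = pd.read_csv(sample_csv)

# Column names taken from the df.info() output of Step 1
expected_columns = ['_key', 'Desc', 'category', 'district', 'latitude', 'longitude']
missing_columns = [c for c in expected_columns if c not in sample_df.columns]

if missing_columns:
    raise ValueError(f"Missing expected columns: {missing_columns}")
print("All expected columns are present.")
```

To run the same check on the real data, replace `sample_df` with `df`.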

Step 2: Remove Duplicate Rows
This step checks for any duplicate rows based on the _key column. It keeps the first instance of each unique _key and removes any others.

Add a new cell (click the + Code button) and run this:

In [32]:
# Check if the DataFrame 'df' exists
if 'df' in locals():
    print("Task: Removing duplicate rows...")

    # Get the number of rows before removing duplicates
    original_rows = len(df)

    # Remove duplicates based on the '_key' column, keeping the first occurrence
    df.drop_duplicates(subset=['_key'], keep='first', inplace=True)

    # Get the number of rows after removing duplicates
    new_rows = len(df)

    print(f" - Original row count: {original_rows}")
    print(f" - New row count: {new_rows}")
    print(f" - Total duplicates removed: {original_rows - new_rows}")
else:
    print("Error: DataFrame 'df' not found. Please run Step 1 first.")

Task: Removing duplicate rows...
 - Original row count: 69
 - New row count: 69
 - Total duplicates removed: 0
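
Optional check: `drop_duplicates` compares keys exactly, so two keys that differ only in case or spacing are not treated as duplicates. The sketch below uses made-up keys to show one way to look for such near-duplicates by comparing a normalized copy of the column. This is an illustration, not part of the original cleaning steps.

```python
import pandas as pd

# Hypothetical keys: the second differs from the first only by case and spacing
keys = pd.DataFrame({'_key': ['Ansoo Lake', ' ansoo  lake', 'Attabad Lake']})

# Exact comparison finds no duplicates
exact_dupes = keys.duplicated(subset=['_key']).sum()

# Compare on a normalized copy of the key instead
normalized = (
    keys['_key']
    .str.strip()
    .str.replace(r'\s+', ' ', regex=True)
    .str.lower()
)
near_dupes = normalized.duplicated().sum()

print(f"Exact duplicates: {exact_dupes}")
print(f"Duplicates after normalizing: {near_dupes}")
```

If near-duplicates turn up in real data, one option is to run the text cleaning before removing duplicates.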


Step 3: Clean Text Fields (Trim Spaces and Line Breaks)
This step cleans all the main text columns. It removes leading/trailing spaces (whitespace) and replaces any line breaks (\n) or multiple spaces with a single space.

Add a new cell and run this:

In [33]:
# Check if the DataFrame 'df' exists
if 'df' in locals():
    print("Task: Trimming whitespace and removing line breaks from text columns...")

    # List of text columns to clean
    text_columns = ['_key', 'Desc', 'category', 'district']

    for col in text_columns:
        if col in df.columns:
            # Convert column to string type to handle potential mixed types
            df[col] = df[col].astype(str)
            # 1. Remove leading/trailing spaces
            df[col] = df[col].str.strip()
            # 2. Replace newline/return characters with a single space
            df[col] = df[col].str.replace(r'[\r\n]+', ' ', regex=True)
            # 3. Replace multiple spaces with a single space
            df[col] = df[col].str.replace(r'\s+', ' ', regex=True)
            print(f" - Column '{col}' cleaned.")

    print("Text cleaning complete.")
else:
    print("Error: DataFrame 'df' not found. Please run the previous steps.")

Task: Trimming whitespace and removing line breaks from text columns...
 - Column '_key' cleaned.
 - Column 'Desc' cleaned.
 - Column 'category' cleaned.
 - Column 'district' cleaned.
Text cleaning complete.
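
A side note on the approach above: `astype(str)` turns missing values into the literal text 'nan', which is why Step 7 later converts 'nan' back to NaN. The sketch below, on a made-up Series, shows an alternative that leaves missing values untouched, because the `.str` methods skip NaN. It also relies on `\s+` matching line breaks as well as spaces, so a single replacement covers both cases.

```python
import numpy as np
import pandas as pd

# Hypothetical messy values, including one missing entry
messy = pd.Series(['  Hill\nStation ', 'National   Park', np.nan])

# \s matches spaces, tabs, and line breaks, so one pattern handles them all
tidy = messy.str.replace(r'\s+', ' ', regex=True).str.strip()

print(tidy.tolist())
```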


Step 4: Clean _key Column (Replace Spaces with Underscores)
This step specifically targets the _key column and replaces any spaces with underscores, which is a common practice for key columns.

Add a new cell and run this:

In [34]:
# Check if the DataFrame 'df' exists
if 'df' in locals():
    print("Task: Replacing spaces with underscores in '_key' column...")

    # Replace all spaces ' ' with underscores '_'
    df['_key'] = df['_key'].str.replace(' ', '_')

    print("\nSample of cleaned '_key' column:")
    print(df['_key'].head())
else:
    print("Error: DataFrame 'df' not found. Please run the previous steps.")

Task: Replacing spaces with underscores in '_key' column...

Sample of cleaned '_key' column:
0         Ansoo_Lake
1      Astola_Island
2       Attabad_Lake
3    Badshahi_Mosque
4    Baltoro_Glacier
Name: _key, dtype: object
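
Optional check: after rewriting a key column, it is worth confirming that the keys are still unique and contain no leftover whitespace. The sketch below does this on three keys from the sample output. Passing `regex=False` is an added precaution so the space is always treated as literal text.

```python
import pandas as pd

keys = pd.Series(['Ansoo Lake', 'Astola Island', 'Attabad Lake'])

# regex=False treats ' ' as a literal string rather than a pattern
clean_keys = keys.str.replace(' ', '_', regex=False)

# Keys should remain unique and contain no whitespace after cleaning
keys_are_unique = clean_keys.is_unique
has_whitespace = clean_keys.str.contains(r'\s', regex=True).any()

print(clean_keys.tolist())
print(f"Unique: {keys_are_unique}, whitespace left: {has_whitespace}")
```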


Step 5: Clean Desc Column (Remove Photo Credits and Fix Quotes)
This step performs two cleaning tasks on the Desc (description) column:

1. Removes any text that looks like (Photo Credit: ...) using a regular expression.

2. Collapses doubled double quotes ("") into one double-quote character (").

Add a new cell and run this:

In [35]:
# Check if the DataFrame 'df' exists
if 'df' in locals():
    print("Task: Cleaning 'Desc' column (removing photo credits and fixing quotes)...")

    # 1. Remove photo credits using regex
    # This looks for "(Photo Credit:" followed by any characters until a ")"
    df['Desc'] = df['Desc'].str.replace(r'\(Photo Credit:.*?\)', '', regex=True).str.strip()

    # 2. Fix broken or unnecessary quotes (e.g., "" becomes ")
    df['Desc'] = df['Desc'].str.replace('""', '"', regex=False)

    print(" - 'Desc' column cleaned.")
    print("\nSample of cleaned 'Desc' column:")
    print(df['Desc'].head())
else:
    print("Error: DataFrame 'df' not found. Please run the previous steps.")

Task: Cleaning 'Desc' column (removing photo credits and fixing quotes)...
 - 'Desc' column cleaned.

Sample of cleaned 'Desc' column:
0    Ansoo Lake is situated in Kaghan Valley of Pak...
1    Astola Island, also known as Jezira Haft Talar...
2    Attabad Lake is a lake in the Gojal Valley of ...
3    The Badshahi Mosque is a Mughal-era congregati...
4    The Baltoro Glacier, at 63 km (39 mi) in lengt...
Name: Desc, dtype: object
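
To see what the two replacements do, the sketch below applies the same pattern to a single made-up description using Python's `re` module. The text and the credit name are invented for illustration and do not come from the dataset.

```python
import re

# Hypothetical description text; the credit name is made up
raw_desc = 'A ""scenic"" lake in the mountains. (Photo Credit: Example Name)'

# .*? is non-greedy, so the match stops at the first closing parenthesis
no_credit = re.sub(r'\(Photo Credit:.*?\)', '', raw_desc).strip()

# Collapse doubled double-quote characters into one
clean_desc = no_credit.replace('""', '"')

print(clean_desc)
```

The pattern is case-sensitive, so a credit written as "(photo credit: ...)" would not match. Passing `flags=re.IGNORECASE` to `re.sub` is one way to cover that case.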


Step 6: Standardize the category Column
This step converts all values in the category column to "Title Case" (e.g., "hill station" becomes "Hill Station"). This makes the categories consistent.

Add a new cell and run this:

In [36]:
# Check if the DataFrame 'df' exists
if 'df' in locals():
    print("Task: Standardizing 'category' column to Title Case...")

    print("\nUnique categories BEFORE cleaning:")
    print(df['category'].unique())

    # Convert the entire column to title case
    df['category'] = df['category'].str.title().str.strip()

    print("\nUnique categories AFTER cleaning:")
    print(df['category'].unique())
else:
    print("Error: DataFrame 'df' not found. Please run the previous steps.")

Task: Standardizing 'category' column to Title Case...

Unique categories BEFORE cleaning:
['Lake' 'Island' 'Mosque' 'Mountainous' 'Hill Station' 'Waterfall'
 'National Park' 'Fort' 'Coastal' 'Valley' 'Temple' 'Mine' 'Monument'
 'Museum' 'Resort' 'Desert']

Unique categories AFTER cleaning:
['Lake' 'Island' 'Mosque' 'Mountainous' 'Hill Station' 'Waterfall'
 'National Park' 'Fort' 'Coastal' 'Valley' 'Temple' 'Mine' 'Monument'
 'Museum' 'Resort' 'Desert']
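
In this dataset the categories were already consistent, so the output is unchanged. Title-casing only fixes differences in capitalization. The sketch below uses made-up values and a hypothetical `category_aliases` mapping to show how variants that differ in spelling could be merged as well.

```python
import pandas as pd

# Hypothetical raw values; the real column was already consistent
raw_categories = pd.Series(['hill station', 'Hill Stations', 'LAKE', 'Lake'])

# Casing alone merges 'LAKE' and 'Lake' but not the plural variant
title_cased = raw_categories.str.strip().str.title()

# Hypothetical mapping for variants that casing cannot merge
category_aliases = {'Hill Stations': 'Hill Station'}
standardized = title_cased.replace(category_aliases)

print(sorted(standardized.unique()))
```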


Step 7: Rename district to province and Standardize Values
This step renames the district column to province, as it contains province-level data. It also standardizes the spelling of 'Gilgit-Baltistan', which appeared in two different ways: once with a Unicode minus sign (U+2212) and once with an ASCII hyphen (U+002D).

Add a new cell and run this:

In [37]:
# Check if the DataFrame 'df' exists
if 'df' in locals():
    print("Task: Renaming 'district' to 'province' and standardizing values...")

    # 1. Rename the column
    df.rename(columns={'district': 'province'}, inplace=True)
    print(" - Column 'district' renamed to 'province'.")

    # 2. Replace 'nan' strings (from Step 3) with actual NumPy NaN
    # Assign the result back; replace(..., inplace=True) on df['province'] acts on a copy
    df['province'] = df['province'].replace('nan', np.nan)

    # 3. Standardize different spellings
    # Note: '\u2212' is the Unicode minus sign (U+2212); '-' is the ASCII hyphen (U+002D)
    df['province'] = df['province'].str.replace('Gilgit\u2212Baltistan', 'Gilgit-Baltistan', regex=False)

    print("\nUnique values in 'province' column after cleaning:")
    print(df['province'].unique())
else:
    print("Error: DataFrame 'df' not found. Please run the previous steps.")

Task: Renaming 'district' to 'province' and standardizing values...
 - Column 'district' renamed to 'province'.

Unique values in 'province' column after cleaning:
['Khyber Pakhtunkhwa' 'Balochistan' 'Gilgit-Baltistan' 'Punjab'
 'Islamabad' 'Azad Kashmir' 'Sindh']


Note: pandas emits a FutureWarning for the chained in-place form `df['province'].replace('nan', np.nan, inplace=True)`. The call operates on an intermediate object that behaves as a copy, and the behavior will change in pandas 3.0. Assigning the result back, as in `df['province'] = df['province'].replace('nan', np.nan)`, avoids the warning.
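
The two spellings of 'Gilgit-Baltistan' look identical on screen but use different dash characters. The sketch below, on a made-up Series, shows that pandas counts them as separate values and how several dash-like characters could be mapped to the ASCII hyphen in one pass. The `dash_variants` character set is an illustrative choice, not an exhaustive list.

```python
import pandas as pd

# '\u2212' is the Unicode minus sign; '-' is the ASCII hyphen (U+002D)
provinces = pd.Series(['Gilgit\u2212Baltistan', 'Gilgit-Baltistan', 'Punjab'])

# The two spellings count as different values
distinct_before = provinces.nunique()

# Map several dash-like characters to the ASCII hyphen in one pass
dash_variants = '[\u2010\u2011\u2012\u2013\u2014\u2212]'
normalized = provinces.str.replace(dash_variants, '-', regex=True)

print(f"Distinct values before: {distinct_before}")
print(normalized.unique().tolist())
```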


Step 8: Convert latitude and longitude to Numeric Types
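This step ensures the latitude and longitude columns hold numeric (float) values, which is necessary for any calculations or mapping. In this dataset both columns already load as float64 (see the Step 1 output), so the conversion acts as a safeguard and leaves the values unchanged.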

Add a new cell and run this:

In [38]:
# Check if the DataFrame 'df' exists
if 'df' in locals():
    print("Task: Converting 'latitude' and 'longitude' to numeric values...")

    # Convert columns to numeric; errors='coerce' turns unparseable values into NaN (Not a Number)
    df['latitude'] = pd.to_numeric(df['latitude'], errors='coerce')
    df['longitude'] = pd.to_numeric(df['longitude'], errors='coerce')

    print("\nData types after conversion:")
    # .info() will now show latitude and longitude as 'float64'
    df.info()
else:
    print("Error: DataFrame 'df' not found. Please run the previous steps.")

Task: Converting 'latitude' and 'longitude' to numeric values...

Data types after conversion:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69 entries, 0 to 68
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   _key       69 non-null     object 
 1   Desc       69 non-null     object 
 2   category   69 non-null     object 
 3   province   69 non-null     object 
 4   latitude   69 non-null     float64
 5   longitude  69 non-null     float64
dtypes: float64(2), object(4)
memory usage: 3.4+ KB
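
Because `errors='coerce'` silently turns bad values into NaN, it is worth counting how many values were lost and checking that the rest fall in a plausible range. The sketch below uses made-up string input, including one unparseable entry, to show both checks.

```python
import pandas as pd

# Hypothetical raw values, including one that cannot be parsed
raw_latitude = pd.Series(['34.814119', '25.122321', 'unknown'])

latitude = pd.to_numeric(raw_latitude, errors='coerce')
unparseable_count = latitude.isna().sum()

# Valid latitudes lie in [-90, 90]; longitudes would use [-180, 180]
in_range_count = latitude.between(-90, 90).sum()

print(f"Unparseable values: {unparseable_count}")
print(f"Values within the valid range: {in_range_count}")
```

NaN values fail the `between` test, so they are counted as out of range here.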


Step 9: Save the Cleaned Data to a New CSV File
This is the final step. It saves your fully cleaned DataFrame (df) into a new file named cleaned_tourist_destinations.csv.

Add a new cell and run this:

In [39]:
# Check if the DataFrame 'df' exists
if 'df' in locals():
    print("Task: Saving cleaned data to a new CSV file...")

    cleaned_file_name = 'cleaned_tourist_destinations.csv'

    # Save the DataFrame to a new CSV, index=False means row numbers are not saved
    df.to_csv(cleaned_file_name, index=False)

    print(f"\n--- SUCCESS! ---")
    print(f"All cleaning tasks are complete.")
    print(f"Cleaned data saved to '{cleaned_file_name}'.")

    print("\nFinal Cleaned Data Sample:")
    print(df.head())

    print(f"\n\nYou can now download '{cleaned_file_name}' from the 'Files' (📁) panel on the left.")
else:
    print("Error: DataFrame 'df' not found. Please run the previous steps.")

Task: Saving cleaned data to a new CSV file...

--- SUCCESS! ---
All cleaning tasks are complete.
Cleaned data saved to 'cleaned_tourist_destinations.csv'.

Final Cleaned Data Sample:
              _key                                               Desc  \
0       Ansoo_Lake  Ansoo Lake is situated in Kaghan Valley of Pak...   
1    Astola_Island  Astola Island, also known as Jezira Haft Talar...   
2     Attabad_Lake  Attabad Lake is a lake in the Gojal Valley of ...   
3  Badshahi_Mosque  The Badshahi Mosque is a Mughal-era congregati...   
4  Baltoro_Glacier  The Baltoro Glacier, at 63 km (39 mi) in lengt...   

      category            province   latitude  longitude  
0         Lake  Khyber Pakhtunkhwa  34.814119  73.676428  
1       Island         Balochistan  25.122321  63.847948  
2         Lake    Gilgit-Baltistan  36.345827  74.865436  
3       Mosque              Punjab  31.588126  74.309322  
4  Mountainous    Gilgit-Baltistan  35.710642  76.553142  


You can now download 
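
Optional check: reading the saved file back confirms that it was written as expected. The sketch below does this for a small made-up DataFrame in a temporary directory, so it does not touch the real output file. The file name `roundtrip_check.csv` is a placeholder.

```python
import os
import tempfile
import pandas as pd

# Small made-up frame using two keys and latitudes from the sample output
sample = pd.DataFrame({
    '_key': ['Ansoo_Lake', 'Astola_Island'],
    'latitude': [34.814119, 25.122321],
})

# Write to a temporary directory so the sketch does not touch real files
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'roundtrip_check.csv')
    sample.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Raises an AssertionError if the reloaded data differs from the original
pd.testing.assert_frame_equal(sample, reloaded)
print("Round trip preserved the data.")
```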