# 🌍 Cleaning, Extraction and Feature Engineering for Locations Data

## 📌 Project Overview
This notebook focuses on preparing a dataset of travel and tourism locations for downstream machine learning tasks, including semantic analysis and recommendation systems. The dataset includes various attributes of tourist spots, such as:

- **name**: Name of the location
- **state**: The state where the location is situated
- **description**: A textual summary of the place
- **latitude** and **longitude**: Geographic coordinates
- **category**: Type of destination (e.g., spiritual, adventure, coastal)
- **activities**: Activities available at the location
- **popularity_score**: A numerical score indicating the popularity
- **places_to_visit**: Notable nearby attractions or points of interest

## 🎯 Objective
The goal of this notebook is to perform comprehensive **data cleaning**, **feature engineering**, and **text-based semantic processing** on the dataset, particularly focusing on generating semantic embeddings using the `SentenceTransformer` library. These embeddings can be used for:
- Similarity search
- Clustering locations
- Feeding into recommendation engines
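
As a minimal sketch of the first use case, similarity search over embeddings comes down to ranking rows by cosine similarity to a query vector. The toy vectors and the helper name `top_k_similar` are assumptions for illustration and are not part of the dataset or the notebook.

```python
import numpy as np

def top_k_similar(embeddings: np.ndarray, query_idx: int, k: int = 2) -> list:
    """Return indices of the k rows most similar to row `query_idx` (cosine)."""
    # Normalize rows so that a dot product equals cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    scores = unit @ unit[query_idx]
    scores[query_idx] = -np.inf  # exclude the query row itself
    return np.argsort(-scores)[:k].tolist()

# Toy 3-dimensional "embeddings" for four hypothetical locations
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0],
])
nearest = top_k_similar(toy_embeddings, query_idx=0, k=2)
print(nearest)  # [1, 2]
```

The same function would apply unchanged to any embedding matrix with one row per location.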

## 🧹 Cleaning Steps
Key data preprocessing tasks include:
- Handling **missing values** appropriately
- Filtering out rows whose **latitude and longitude** fall outside valid geographic bounds
- Normalizing or transforming features where necessary
- Ensuring consistent data formats across text columns

## 🔍 Feature Engineering
We will extract and generate features from text-based fields such as:
- `description`
- `activities_offered`
- `places_to_visit`

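In the cells below, `description` is converted into a dense embedding using TF-IDF followed by truncated SVD, while the activity and places fields are summarized as item counts. **SentenceTransformer** embeddings, the approach named in the objective, can replace the TF-IDF step to obtain semantic vectors for similarity, classification, or recommendation tasks.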

---

📁 Let's start by loading the data and performing exploratory analysis.

In [1]:
# Importing necessary libraries
import numpy as np
import pandas as pd
import os
from sklearn.cluster import KMeans
from textblob import TextBlob

In [2]:
# Loading the CSV File for Location table
location_df = pd.read_csv("C:/Users/Gharat/Downloads/recommendation-engine/data/raw/locations.csv")

In [3]:
location_df.head()  # Viewing the first 5 rows of the dataframe

Unnamed: 0,Location_id,Name,State,Description,latitude,longitude,Category,activities_offered,popularity_score,Places_to_visit
0,1,Chitkul,Himachal Pradesh,The last inhabited village near the Indo-China...,75.74782,154.406871,Remote nature-based locations,"Trek the Charang Chitkul Pass, Enjoy Baspa Riv...",7200,"Mathi Temple,Kagyupa Temple,Kamru Fort,Bering ..."
1,2,Landour,Uttarakhand,"A quieter, more charming cantonment town adjac...",77.349472,-61.657091,Remote nature-based locations,"Char Dukan, Landour Bakehouse, Lal Tibba, Chuk...",9400,"Char Dukaan,Landour Bakehouse,Lal Tibba,St. Pa..."
2,3,Gurez Valley,Jammu and Kashmir,"A picturesque valley nestled in the Himalayas,...",4.011962,73.884906,Remote nature-based locations,"Shikara Ride on Dal Lake,Gulmarg Gondola Ride,...",6800,"Srinagar,Gulmarg,Pahalgam,Sonamarg, Vaishno De..."
3,4,Shojha,Himachal Pradesh,"A serene village in the Seraj Valley, known fo...",-13.159456,-168.705447,Remote nature-based locations,"Trekking to Jalori Pass,Visiting Serolsar Lake...",8400,"Jalori Pass,Waterfall Point,Serolsar Lake,Tirt..."
4,5,Pangong Tso West Bank Villages,Ladakh,"While Pangong Tso itself is popular, exploring...",-15.533796,110.153536,Remote nature-based locations,"river rafting, Trekking, Go on a Camel Safari ...",4600,"shanti stupa, thiksey monastery,pangong tso,nu..."


In [4]:
# Dropping duplicate records (consistency requirement)
location_df = location_df.drop_duplicates()

In [5]:
# Remove rows with missing critical fields
location_df.dropna(subset=['Location_id', 'Name', 'Category'], inplace=True)

In [6]:
# Standardizing the text columns: trim whitespace and convert to lowercase
for col in ['Name', 'State', 'Description', 'Category', 'activities_offered', 'Places_to_visit']:
    if col in location_df.columns:
        location_df[col] = location_df[col].astype(str).str.strip().str.lower()

In [7]:
location_df["Category"].unique()  # Identifying unique categories

array(['remote nature-based locations', 'local or weekend getaways',
       'remote nature-based locations, spiritual and religious retreats',
       'beachfront and coastal retreats',
       'local or weekend getaways, spiritual and religious retreats',
       'spiritual and religious retreats'], dtype=object)

In [8]:
# Mapping the long category descriptions to short codes
type_mapping = {
    'remote nature-based locations': 'remote_nature',
    'local or weekend getaways': 'local_getaway',
    'remote nature-based locations, spiritual and religious retreats': 'remote_nature_religious',
    'beachfront and coastal retreats': 'beach_retreat',
    'local or weekend getaways, spiritual and religious retreats': 'local_religious',
    'spiritual and religious retreats': 'religious_retreat'
}

# Performing the map operation
location_df["category_mapped"] = location_df["Category"].map(type_mapping)
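
`Series.map` returns `NaN` for any value that is missing from the dictionary, so a quick check after mapping can catch categories that were not anticipated. The sketch below uses a toy Series and a shortened, hypothetical mapping to show the idea.

```python
import pandas as pd

# Shortened, illustrative mapping (not the full one used above)
toy_mapping = {
    "local or weekend getaways": "local_getaway",
    "spiritual and religious retreats": "religious_retreat",
}
toy_categories = pd.Series([
    "local or weekend getaways",
    "spiritual and religious retreats",
    "desert camps",  # hypothetical value with no mapping entry
])

mapped = toy_categories.map(toy_mapping)

# Values that failed to map show up as NaN in the mapped column
unmapped = toy_categories[mapped.isna()].unique().tolist()
print(unmapped)  # ['desert camps']
```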

In [9]:
# Keeping only rows whose coordinates fall within valid global bounds
# (latitude in [-90, 90], longitude in [-180, 180]); this is not an India-specific filter
location_df = location_df[(location_df['latitude'].between(-90, 90)) & (location_df['longitude'].between(-180, 180))]
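
The filter above only enforces global coordinate limits, so a longitude such as -61.66 still passes. If the intent is to keep only points that could plausibly lie in India, a tighter bounding box is one option. The bounds below are rough approximations chosen for illustration (an assumption, not taken from the notebook), and `in_india_bbox` is a hypothetical helper.

```python
import pandas as pd

# Approximate bounding box for India (assumed, illustrative values)
LAT_MIN, LAT_MAX = 6.0, 38.0
LON_MIN, LON_MAX = 68.0, 98.0

def in_india_bbox(df: pd.DataFrame) -> pd.Series:
    """Boolean mask: True where (latitude, longitude) falls inside the box."""
    lat_ok = df["latitude"].between(LAT_MIN, LAT_MAX)
    lon_ok = df["longitude"].between(LON_MIN, LON_MAX)
    return lat_ok & lon_ok

toy = pd.DataFrame({
    "name": ["plausible point", "outside box"],
    "latitude": [31.35, 77.349472],
    "longitude": [78.44, -61.657091],
})
mask = in_india_bbox(toy)
print(toy[mask]["name"].tolist())  # ['plausible point']
```

Rows that fail such a check may be worth inspecting rather than dropping silently, since they can point to swapped or corrupted coordinates in the raw file.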

In [10]:
location_df.head()

Unnamed: 0,Location_id,Name,State,Description,latitude,longitude,Category,activities_offered,popularity_score,Places_to_visit,category_mapped
0,1,chitkul,himachal pradesh,the last inhabited village near the indo-china...,75.74782,154.406871,remote nature-based locations,"trek the charang chitkul pass, enjoy baspa riv...",7200,"mathi temple,kagyupa temple,kamru fort,bering ...",remote_nature
1,2,landour,uttarakhand,"a quieter, more charming cantonment town adjac...",77.349472,-61.657091,remote nature-based locations,"char dukan, landour bakehouse, lal tibba, chuk...",9400,"char dukaan,landour bakehouse,lal tibba,st. pa...",remote_nature
2,3,gurez valley,jammu and kashmir,"a picturesque valley nestled in the himalayas,...",4.011962,73.884906,remote nature-based locations,"shikara ride on dal lake,gulmarg gondola ride,...",6800,"srinagar,gulmarg,pahalgam,sonamarg, vaishno de...",remote_nature
3,4,shojha,himachal pradesh,"a serene village in the seraj valley, known fo...",-13.159456,-168.705447,remote nature-based locations,"trekking to jalori pass,visiting serolsar lake...",8400,"jalori pass,waterfall point,serolsar lake,tirt...",remote_nature
4,5,pangong tso west bank villages,ladakh,"while pangong tso itself is popular, exploring...",-15.533796,110.153536,remote nature-based locations,"river rafting, trekking, go on a camel safari ...",4600,"shanti stupa, thiksey monastery,pangong tso,nu...",remote_nature


In [11]:
# Standardizing the column names
location_df.columns = location_df.columns.str.strip().str.lower().str.replace(' ', '_')

In [12]:
# Pre-processing the text data for the following columns
location_df['state'] = location_df['state'].str.title()
location_df['name'] = location_df['name'].str.title()
location_df['description'] = location_df['description'].str.strip()
location_df['category'] = location_df['category'].str.strip()
location_df['category_mapped'] = location_df['category_mapped'].str.strip()

In [13]:
# Resolving data-type conflicts
location_df['popularity_score'] = pd.to_numeric(location_df['popularity_score'], errors='coerce')

In [14]:
# Creating count features: number of activities offered and number of places to visit
location_df['num_activities'] = location_df['activities_offered'].apply(lambda x: len(str(x).split(',')) if pd.notnull(x) else 0)
location_df['num_places_to_visit'] = location_df['places_to_visit'].apply(lambda x: len(str(x).split(',')) if pd.notnull(x) else 0)
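
Splitting on commas counts an empty string as one item, and a trailing comma adds a phantom entry. A slightly more defensive counter, sketched here with a hypothetical helper `count_items`, strips whitespace and ignores empty pieces.

```python
import pandas as pd

def count_items(value) -> int:
    """Count non-empty, comma-separated items; missing values count as 0."""
    if pd.isnull(value):
        return 0
    return sum(1 for part in str(value).split(",") if part.strip())

toy = pd.Series(["trekking, camping,rafting", "", "boating,", None])
counts = toy.apply(count_items).tolist()
print(counts)  # [3, 0, 1, 0]
```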

In [15]:
# Categorizing the popularity levels
location_df['popularity_level'] = pd.cut(location_df['popularity_score'], 
                                 bins=[0, 5000, 10000, 20000],
                                 labels=['Low', 'Medium', 'High'])
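
By default `pd.cut` builds right-closed intervals, here (0, 5000], (5000, 10000] and (10000, 20000], so a score of exactly 0 or anything above 20000 falls outside every bin and becomes `NaN`. The toy scores below are assumptions used only to show that behaviour, along with `include_lowest=True` as one way to capture the lower edge.

```python
import pandas as pd

toy_scores = pd.Series([0, 4600, 5000, 7200, 25000])
bins = [0, 5000, 10000, 20000]
labels = ["Low", "Medium", "High"]

default_levels = pd.cut(toy_scores, bins=bins, labels=labels)
# include_lowest=True makes the first interval include its left edge (0)
inclusive_levels = pd.cut(toy_scores, bins=bins, labels=labels, include_lowest=True)

print(default_levels.isna().tolist())    # [True, False, False, False, True]
print(inclusive_levels.isna().tolist())  # [False, False, False, False, True]
```

Scores above the last edge stay unbinned in both cases; extending the final bin edge would be needed to label them.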

In [16]:
# Identification of null values
missing_summary = location_df.isnull().sum()
print("Missing values per column:\n", missing_summary)

Missing values per column:
 location_id            0
name                   0
state                  0
description            0
latitude               0
longitude              0
category               0
activities_offered     0
popularity_score       0
places_to_visit        0
category_mapped        0
num_activities         0
num_places_to_visit    0
popularity_level       1
dtype: int64


In [17]:
location_df.head()

Unnamed: 0,location_id,name,state,description,latitude,longitude,category,activities_offered,popularity_score,places_to_visit,category_mapped,num_activities,num_places_to_visit,popularity_level
0,1,Chitkul,Himachal Pradesh,the last inhabited village near the indo-china...,75.74782,154.406871,remote nature-based locations,"trek the charang chitkul pass, enjoy baspa riv...",7200,"mathi temple,kagyupa temple,kamru fort,bering ...",remote_nature,6,8,Medium
1,2,Landour,Uttarakhand,"a quieter, more charming cantonment town adjac...",77.349472,-61.657091,remote nature-based locations,"char dukan, landour bakehouse, lal tibba, chuk...",9400,"char dukaan,landour bakehouse,lal tibba,st. pa...",remote_nature,5,5,Medium
2,3,Gurez Valley,Jammu And Kashmir,"a picturesque valley nestled in the himalayas,...",4.011962,73.884906,remote nature-based locations,"shikara ride on dal lake,gulmarg gondola ride,...",6800,"srinagar,gulmarg,pahalgam,sonamarg, vaishno de...",remote_nature,4,5,Medium
3,4,Shojha,Himachal Pradesh,"a serene village in the seraj valley, known fo...",-13.159456,-168.705447,remote nature-based locations,"trekking to jalori pass,visiting serolsar lake...",8400,"jalori pass,waterfall point,serolsar lake,tirt...",remote_nature,5,5,Medium
4,5,Pangong Tso West Bank Villages,Ladakh,"while pangong tso itself is popular, exploring...",-15.533796,110.153536,remote nature-based locations,"river rafting, trekking, go on a camel safari ...",4600,"shanti stupa, thiksey monastery,pangong tso,nu...",remote_nature,5,5,Low


In [18]:
location_df["popularity_level"].value_counts()

popularity_level
Low       28
Medium    25
High       0
Name: count, dtype: int64

In [19]:
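# Imputing the missing values with the mode (the most frequent level)
# Assigning the result back avoids relying on in-place modification of a column selection
mode_level = location_df["popularity_level"].mode()[0]
location_df["popularity_level"] = location_df["popularity_level"].fillna(mode_level)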

In [20]:
# Resolving data-type conflicts and errors
location_df['popularity_score'] = pd.to_numeric(location_df['popularity_score'], errors='coerce')
location_df['latitude'] = pd.to_numeric(location_df['latitude'], errors='coerce')
location_df['longitude'] = pd.to_numeric(location_df['longitude'], errors='coerce')

In [21]:
# Keeping only the rows that have both latitude and longitude
# (.copy() so that adding a column later does not trigger SettingWithCopyWarning)
geo_df = location_df.dropna(subset=['latitude', 'longitude']).copy()

In [22]:
# Clustering based on the geo-location
kmeans = KMeans(n_clusters=5, random_state=0)
geo_df['geo_cluster'] = kmeans.fit_predict(geo_df[['latitude', 'longitude']])
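
K-means on raw latitude and longitude treats degrees as flat Euclidean units, which is a reasonable approximation over small areas but distorts distances over larger ones. For reference, the great-circle distance between two points can be computed with the haversine formula. The helper below is an illustrative sketch, and the 6371 km Earth radius is the usual mean-radius approximation.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (approximation)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# One degree of longitude spans fewer kilometres as latitude increases
equator_km = haversine_km(0, 0, 0, 1)
north_km = haversine_km(60, 0, 60, 1)
print(round(equator_km, 1))  # about 111.2 km at the equator
print(round(north_km, 1))    # about 55.6 km at 60 degrees north
```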



In [23]:
# Merging the cluster labels back into the main dataframe (rows without coordinates get NaN)
location_df = location_df.merge(geo_df[['location_id', 'geo_cluster']], on='location_id', how='left')

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# TF-IDF vectorization
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(location_df['description'])

# Dimensionality reduction using SVD 
svd = TruncatedSVD(n_components=100)
embeddings = svd.fit_transform(tfidf_matrix)

# Add to dataframe
location_df['description_embedding'] = embeddings.tolist()

# View result
print(location_df[['description', 'description_embedding']])

                                          description  \
0   the last inhabited village near the indo-china...   
1   a quieter, more charming cantonment town adjac...   
2   a picturesque valley nestled in the himalayas,...   
3   a serene village in the seraj valley, known fo...   
4   while pangong tso itself is popular, exploring...   
5   a scenic village in the sutlej river valley, f...   
6   a base for treks into the johar valley, offeri...   
7   a hidden gem in the kullu district, known for ...   
8   a lake with nine corners, located near nainita...   
9   known as the 'open-air art gallery' of rajasth...   
10  a popular trekking circuit offering panoramic ...   
11  a picturesque village in the spiti valley, kno...   
12  often referred to as the 'grand canyon of indi...   
13  a ghost town at the southeastern tip of pamban...   
14  majestic waterfalls cascading down amidst lush...   
15  beyond the iconic temples of hampi, explore th...   
16  while coorg is known for it
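
With only a few dozen descriptions, the TF-IDF matrix has rank at most equal to the number of rows, so requesting 100 components cannot add information beyond that. A hedged sketch on toy descriptions (assumed text, not from the dataset) caps the component count and fixes `random_state` so the embeddings are reproducible.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

toy_descriptions = [
    "a serene village in the himalayas known for trekking",
    "a quiet beach town with coastal temples",
    "a picturesque valley with river rafting and trekking",
    "an ancient temple town popular with pilgrims",
]

tfidf = TfidfVectorizer().fit_transform(toy_descriptions)

# The number of useful components cannot exceed the rank of the matrix,
# so cap it below both the number of documents and the vocabulary size
n_components = min(100, min(tfidf.shape) - 1)
svd = TruncatedSVD(n_components=n_components, random_state=0)
toy_embeddings = svd.fit_transform(tfidf)

print(toy_embeddings.shape)  # (4, 3)
print(svd.explained_variance_ratio_.sum())  # share of variance retained
```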

In [25]:
# One-hot encode category_mapped
# Note: the result is stored in a separate dataframe `df`; `location_df` itself is left unchanged
category_dummies = pd.get_dummies(location_df['category_mapped'], prefix='category')
df = pd.concat([location_df, category_dummies], axis=1)
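
Some categories in this dataset are combinations, such as 'local or weekend getaways, spiritual and religious retreats'. One-hot encoding the mapped label treats each combination as its own class. An alternative sketch, assuming the parts are always separated by a comma followed by a space, uses `Series.str.get_dummies` to produce one indicator per underlying category.

```python
import pandas as pd

toy_categories = pd.Series([
    "local or weekend getaways",
    "local or weekend getaways, spiritual and religious retreats",
    "spiritual and religious retreats",
])

# Split each entry on ", " and create one 0/1 column per distinct part
multi_hot = toy_categories.str.get_dummies(sep=", ")
print(multi_hot.columns.tolist())
# ['local or weekend getaways', 'spiritual and religious retreats']
print(multi_hot.values.tolist())
# [[1, 0], [1, 1], [0, 1]]
```

With this encoding a combined location shares indicator columns with its single-category neighbours, which may suit similarity-based recommendation better than isolated combination classes.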

In [26]:
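from textblob import TextBlob

def get_sentiment(text):
    """Return the TextBlob polarity in [-1, 1]; missing text is treated as neutral (0)."""
    if pd.isnull(text):
        return 0
    return TextBlob(text).sentiment.polarity

# Polarity score for each description
location_df['description_sentiment'] = location_df['description'].apply(get_sentiment)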

In [27]:
location_df.head() 

Unnamed: 0,location_id,name,state,description,latitude,longitude,category,activities_offered,popularity_score,places_to_visit,category_mapped,num_activities,num_places_to_visit,popularity_level,geo_cluster,description_embedding,description_sentiment
0,1,Chitkul,Himachal Pradesh,the last inhabited village near the indo-china...,75.74782,154.406871,remote nature-based locations,"trek the charang chitkul pass, enjoy baspa riv...",7200,"mathi temple,kagyupa temple,kamru fort,bering ...",remote_nature,6,8,Medium,4,"[0.35118480791698947, 0.10395288659379161, -0....",0.067033
1,2,Landour,Uttarakhand,"a quieter, more charming cantonment town adjac...",77.349472,-61.657091,remote nature-based locations,"char dukan, landour bakehouse, lal tibba, chuk...",9400,"char dukaan,landour bakehouse,lal tibba,st. pa...",remote_nature,5,5,Medium,0,"[0.2814684583527796, -0.12211920256857164, -0....",0.4
2,3,Gurez Valley,Jammu And Kashmir,"a picturesque valley nestled in the himalayas,...",4.011962,73.884906,remote nature-based locations,"shikara ride on dal lake,gulmarg gondola ride,...",6800,"srinagar,gulmarg,pahalgam,sonamarg, vaishno de...",remote_nature,4,5,Medium,1,"[0.44815913762248166, 0.059113783048558105, 0....",0.49375
3,4,Shojha,Himachal Pradesh,"a serene village in the seraj valley, known fo...",-13.159456,-168.705447,remote nature-based locations,"trekking to jalori pass,visiting serolsar lake...",8400,"jalori pass,waterfall point,serolsar lake,tirt...",remote_nature,5,5,Medium,2,"[0.34240346667979366, 0.3695099873928381, 0.07...",0.033333
4,5,Pangong Tso West Bank Villages,Ladakh,"while pangong tso itself is popular, exploring...",-15.533796,110.153536,remote nature-based locations,"river rafting, trekking, go on a camel safari ...",4600,"shanti stupa, thiksey monastery,pangong tso,nu...",remote_nature,5,5,Low,3,"[0.37358315828054134, -0.21770073681821656, -0...",0.188889


In [28]:
# Save the data in the processed folder
location_df.to_csv("C:/Users/Gharat/Downloads/recommendation-engine/data/processed/locations.csv", index=False)
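
Because `description_embedding` holds Python lists, `to_csv` writes each one as its string representation, and `read_csv` returns it as text. A minimal sketch of restoring the vectors on load follows, using an in-memory buffer and toy values as stand-ins for the real file.

```python
import io
import json

import pandas as pd

toy = pd.DataFrame({
    "location_id": [1, 2],
    "description_embedding": [[0.35, 0.10], [0.28, -0.12]],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
# After loading, each embedding is a string such as "[0.35, 0.1]"
was_string = isinstance(loaded.loc[0, "description_embedding"], str)

# A list of finite floats prints as valid JSON, so json.loads can parse it back
loaded["description_embedding"] = loaded["description_embedding"].apply(json.loads)
print(loaded.loc[0, "description_embedding"])  # [0.35, 0.1]
```

For larger embedding matrices, a binary format that preserves list or array columns may be more convenient than CSV.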