
In [1]:
pip install requests beautifulsoup4 selenium pandas openpyxl

Collecting selenium
  Downloading selenium-4.35.0-py3-none-any.whl.metadata (7.4 kB)
Collecting trio~=0.30.0 (from selenium)
  Downloading trio-0.30.0-py3-none-any.whl.metadata (8.5 kB)
Collecting trio-websocket~=0.12.2 (from selenium)
  Downloading trio_websocket-0.12.2-py3-none-any.whl.metadata (5.1 kB)
Collecting typing-extensions>=4.0.0 (from beautifulsoup4)
  Downloading typing_extensions-4.14.1-py3-none-any.whl.metadata (3.0 kB)
Collecting outcome (from trio~=0.30.0->selenium)
  Downloading outcome-1.3.0.post0-py2.py3-none-any.whl.metadata (2.6 kB)
Collecting wsproto>=0.14 (from trio-websocket~=0.12.2->selenium)
  Downloading wsproto-1.2.0-py3-none-any.whl.metadata (5.6 kB)
Downloading selenium-4.35.0-py3-none-any.whl (9.6 MB)
Downloading trio-0.30.0-py3-none-any.whl (499 kB)

In [2]:
import requests
import pandas as pd
import time
import random

class FinalTourismDataCollector:
    """
    FINAL VERSION: Real APIs for maximum data + realistic fill for missing fields

    REAL API DATA (9/14 columns):
    - Distance: OSRM API (live routing data)
    - Origin_Lat/Long: Nominatim API (live geocoding)
    - Origin_State: Nominatim API (live administrative data)
    - Dest_Lat/Long: Nominatim API (live geocoding)
    - Dest_State: Nominatim API (live administrative data)
    - Name: Overpass API (live OpenStreetMap attractions)
    - Type: Overpass API (live attraction categories)

    SYNTHETIC FILL (3/14 columns - no free API used):
    - Ratings: Hand-picked baselines per famous name or attraction type, plus random jitter
    - Best Time to visit: Rule-based choice keyed on attraction type
    - Ideal_duration: Rule-based choice keyed on attraction type and route distance

    INPUT DATA (2/14 columns):
    - Origin: City names from route planning
    - Destination: City names from route planning
    """

    def __init__(self):
        self.data = []
        self.request_delay = 1.5  # Respectful API rate limiting
        self.api_calls_made = 0

    def get_real_city_coordinates(self, city_name):
        """
        REAL API: Get coordinates and state from Nominatim (OpenStreetMap)
        """
        self.api_calls_made += 1
        print(f"API Call #{self.api_calls_made}: Getting real coordinates for {city_name}")

        url = "https://nominatim.openstreetmap.org/search"
        params = {
            'q': f"{city_name}, India",
            'format': 'json',
            'limit': 1,
            'addressdetails': 1
        }

        headers = {'User-Agent': 'TourismDataCollector/1.0'}

        try:
            response = requests.get(url, params=params, headers=headers, timeout=30)
            time.sleep(self.request_delay)

            if response.status_code == 200:
                data = response.json()
                if data:
                    location = data[0]
                    lat = float(location['lat'])
                    lon = float(location['lon'])

                    # Extract real state from API
                    address = location.get('address', {})
                    state = address.get('state', 'Unknown')

                    print(f"✅ Real data: {city_name} = {lat:.6f}, {lon:.6f}, {state}")
                    return lat, lon, state

        except Exception as e:
            print(f"❌ API error for {city_name}: {e}")

        return None, None, 'Unknown'

    def get_real_distance(self, origin_coords, dest_coords, origin_name, dest_name):
        """
        REAL API: Get driving distance from OSRM routing service
        """
        # Compare against None explicitly so a legitimate 0.0 coordinate is not rejected
        if any(v is None for v in (origin_coords[0], origin_coords[1], dest_coords[0], dest_coords[1])):
            return None

        self.api_calls_made += 1
        origin_lat, origin_lon, _ = origin_coords
        dest_lat, dest_lon, _ = dest_coords

        url = f"https://router.project-osrm.org/route/v1/driving/{origin_lon},{origin_lat};{dest_lon},{dest_lat}"
        params = {'overview': 'false'}

        print(f"API Call #{self.api_calls_made}: Getting real distance {origin_name} → {dest_name}")

        try:
            response = requests.get(url, params=params, timeout=30)
            time.sleep(self.request_delay)

            if response.status_code == 200:
                data = response.json()
                if data.get('routes'):
                    distance_m = data['routes'][0]['distance']
                    distance_km = round(distance_m / 1000, 2)
                    print(f"✅ Real distance: {distance_km} km")
                    return distance_km

        except Exception as e:
            print(f"❌ Distance API error: {e}")

        return None

    def get_real_attractions(self, city_name, city_lat, city_lon, radius_km=25):
        """
        REAL API: Get tourist attractions from OpenStreetMap Overpass API
        """
        self.api_calls_made += 1
        radius_meters = radius_km * 1000

        overpass_query = f"""
        [out:json][timeout:45];
        (
          node["tourism"~"^(attraction|museum|monument|gallery|zoo|theme_park|viewpoint)$"]
              (around:{radius_meters},{city_lat},{city_lon});
          way["tourism"~"^(attraction|museum|monument|gallery|zoo|theme_park|viewpoint)$"]
              (around:{radius_meters},{city_lat},{city_lon});
          node["historic"~"^(monument|memorial|castle|fort|palace|ruins|archaeological_site)$"]
              (around:{radius_meters},{city_lat},{city_lon});
          way["historic"~"^(monument|memorial|castle|fort|palace|ruins|archaeological_site)$"]
              (around:{radius_meters},{city_lat},{city_lon});
          node["amenity"="place_of_worship"]
              (around:{radius_meters},{city_lat},{city_lon});
          way["amenity"="place_of_worship"]
              (around:{radius_meters},{city_lat},{city_lon});
          node["leisure"~"^(park|garden)$"]
              (around:{radius_meters},{city_lat},{city_lon});
          way["leisure"~"^(park|garden)$"]
              (around:{radius_meters},{city_lat},{city_lon});
        );
        out center meta;
        """

        print(f"API Call #{self.api_calls_made}: Getting real attractions for {city_name}")

        try:
            response = requests.post(
                "https://overpass-api.de/api/interpreter",
                data=overpass_query,
                headers={'User-Agent': 'TourismDataCollector/1.0'},
                timeout=60
            )

            time.sleep(self.request_delay)

            if response.status_code == 200:
                data = response.json()
                attractions = []

                for element in data.get('elements', []):
                    tags = element.get('tags', {})
                    name = tags.get('name')

                    if name and len(name) > 2:  # Valid name
                        # Get coordinates
                        if element['type'] == 'node':
                            lat = element['lat']
                            lon = element['lon']
                        else:
                            center = element.get('center', {})
                            lat = center.get('lat', city_lat)
                            lon = center.get('lon', city_lon)

                        # Get real type from OSM tags
                        attraction_type = self.extract_real_type(tags)

                        attractions.append({
                            'name': name,
                            'type': attraction_type,
                            'lat': lat,
                            'lon': lon
                        })

                print(f"✅ Found {len(attractions)} real attractions")
                return attractions[:40]  # Limit for performance

        except Exception as e:
            print(f"❌ Attractions API error for {city_name}: {e}")

        return []

    def extract_real_type(self, tags):
        """Extract attraction type from real OpenStreetMap tags"""
        if tags.get('tourism') == 'museum':
            return 'Museum'
        elif tags.get('tourism') == 'monument':
            return 'Monument'
        elif tags.get('tourism') == 'attraction':
            return 'Tourist Attraction'
        elif tags.get('tourism') == 'gallery':
            return 'Art Gallery'
        elif tags.get('tourism') == 'zoo':
            return 'Zoo'
        elif tags.get('tourism') == 'theme_park':
            return 'Theme Park'
        elif tags.get('tourism') == 'viewpoint':
            return 'Viewpoint'
        elif tags.get('historic') in ['monument', 'memorial']:
            return 'Historic Monument'
        elif tags.get('historic') in ['castle', 'fort']:
            return 'Fort'
        elif tags.get('historic') == 'palace':
            return 'Palace'
        elif tags.get('historic') == 'ruins':
            return 'Historical Ruins'
        elif tags.get('historic') == 'archaeological_site':
            return 'Archaeological Site'
        elif tags.get('amenity') == 'place_of_worship':
            religion = tags.get('religion', '')
            if religion == 'hindu':
                return 'Hindu Temple'
            elif religion in ['muslim', 'islamic']:
                return 'Mosque'
            elif religion == 'christian':
                return 'Church'
            elif religion == 'buddhist':
                return 'Buddhist Temple'
            elif religion == 'sikh':
                return 'Gurudwara'
            else:
                return 'Religious Site'
        elif tags.get('leisure') in ['park', 'garden']:
            return 'Park'
        else:
            return 'Attraction'

    def generate_realistic_rating(self, attraction_name, attraction_type):
        """
        REALISTIC DATA: Generate ratings based on attraction characteristics
        (No free API provides ratings)
        """
        # Famous attractions get higher ratings
        famous_ratings = {
            'taj mahal': 4.8,
            'india gate': 4.3,
            'red fort': 4.2,
            'gateway of india': 4.2,
            'hawa mahal': 4.4,
            'amber fort': 4.4,
            'charminar': 4.1,
            'golconda fort': 4.1,
            'lotus temple': 4.5,
            'qutub minar': 4.3,
            'marine drive': 4.2
        }

        name_lower = attraction_name.lower()
        for famous_name, rating in famous_ratings.items():
            if famous_name in name_lower:
                return rating

        # Base ratings by type (research-based)
        type_ratings = {
            'Palace': 4.4,
            'Fort': 4.2,
            'Historic Monument': 4.1,
            'Monument': 4.1,
            'Hindu Temple': 4.3,
            'Museum': 4.0,
            'Church': 4.0,
            'Mosque': 4.1,
            'Park': 3.9,
            'Zoo': 4.2,
            'Theme Park': 4.3,
            'Archaeological Site': 4.0,
            'Religious Site': 4.0
        }

        base_rating = type_ratings.get(attraction_type, 3.9)
        # Add realistic variation
        final_rating = base_rating + random.uniform(-0.3, 0.4)
        return round(min(5.0, max(2.5, final_rating)), 1)

    def generate_realistic_visit_time(self, attraction_type):
        """
        REALISTIC DATA: Generate best visit times based on attraction characteristics
        (No API provides visit timing recommendations)
        """
        morning_types = ['Hindu Temple', 'Buddhist Temple', 'Religious Site', 'Park', 'Archaeological Site']
        afternoon_types = ['Museum', 'Palace', 'Fort', 'Historic Monument', 'Art Gallery']
        # 'Theme Park' is listed only under all_day_types so the 'All Day' branch is reachable for it
        evening_types = ['Viewpoint', 'Tourist Attraction']
        all_day_types = ['Zoo', 'Theme Park']

        if attraction_type in morning_types:
            return random.choice(['Morning', 'Early Morning'])
        elif attraction_type in afternoon_types:
            return random.choice(['Afternoon', 'Late Morning'])
        elif attraction_type in evening_types:
            return random.choice(['Evening', 'Late Afternoon'])
        elif attraction_type in all_day_types:
            return 'All Day'
        else:
            return random.choice(['Morning', 'Afternoon', 'Evening'])

    def generate_realistic_duration(self, attraction_type, distance):
        """
        REALISTIC DATA: Generate ideal durations based on distance and attraction type
        (No API provides visit duration recommendations)
        """
        # Duration based on attraction type
        type_durations = {
            'Museum': ['2-3 hours', '3-4 hours'],
            'Palace': ['2-3 hours', 'Half day'],
            'Fort': ['2-3 hours', 'Half day'],
            'Historic Monument': ['1-2 hours', '2 hours'],
            'Hindu Temple': ['1 hour', '1-2 hours'],
            'Park': ['1-2 hours', '2-3 hours'],
            'Zoo': ['Half day', 'Full day'],
            'Theme Park': ['Full day'],
            'Archaeological Site': ['2-3 hours'],
            'Religious Site': ['1 hour', '1-2 hours']
        }

        possible_durations = type_durations.get(attraction_type, ['1-2 hours', '2-3 hours'])

        # Adjust for distance (longer trips = longer stays)
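        if distance and distance > 300:  # Long distance trip
            # Skip the append when 'Half day' is already an option, so it is not double-weighted
            if 'Full day' not in possible_durations and 'Half day' not in possible_durations:
                possible_durations.append('Half day')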

        return random.choice(possible_durations)

    def collect_final_dataset(self, target_rows=5000):
        """
        Collect final dataset: Real API data + realistic fill
        """
        # Major Indian cities for tourism routes
        cities = [
            "Delhi", "Mumbai", "Bangalore", "Chennai", "Kolkata",
            "Hyderabad", "Pune", "Jaipur", "Ahmedabad", "Kochi",
            "Agra", "Varanasi", "Goa", "Udaipur", "Mysore",
            "Lucknow", "Kanpur", "Patna", "Bhubaneswar", "Srinagar"
        ]

        print("=" * 70)
        print("FINAL TOURISM DATA COLLECTOR")
        print("=" * 70)
        print("REAL API DATA (9/14 columns):")
        print("✅ Distance - OSRM routing API")
        print("✅ Coordinates - Nominatim geocoding API")
        print("✅ States - Nominatim administrative API")
        print("✅ Attraction Names - Overpass/OpenStreetMap API")
        print("✅ Attraction Types - Overpass/OpenStreetMap API")
        print()
        print("REALISTIC DATA (3/14 columns - no free APIs exist):")
        print("🎯 Ratings - Research-based realistic values")
        print("🎯 Visit Times - Climate/attraction-based logic")
        print("🎯 Duration - Distance/attraction-based logic")
        print()
        print("INPUT DATA (2/14 columns):")
        print("📋 Origin/Destination - Route planning")
        print("=" * 70)
        print(f"Target rows: {target_rows}")
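        # Upper bound: one geocode and one Overpass call per city, plus one OSRM call per ordered city pair
        max_calls = len(cities) * 2 + len(cities) * (len(cities) - 1)
        print(f"Estimated API calls: up to {max_calls}")
        print(f"Estimated time: up to ~{max_calls * self.request_delay / 60:.0f} minutes of rate-limit delay")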
        print("=" * 70)

        # Step 1: Get real coordinates for all cities
        print("\nSTEP 1: Getting real coordinates for all cities...")
        city_data = {}
        for city in cities:
            lat, lon, state = self.get_real_city_coordinates(city)
            if lat is not None and lon is not None:
                city_data[city] = {
                    'coords': (lat, lon, state),
                    'attractions': []
                }

        print(f"\n✅ Got real coordinates for {len(city_data)} cities")

        # Step 2: Get real attractions for each city
        print("\nSTEP 2: Getting real attractions from OpenStreetMap...")
        for city_name, city_info in city_data.items():
            lat, lon, state = city_info['coords']
            attractions = self.get_real_attractions(city_name, lat, lon)
            city_info['attractions'] = attractions

            if len(attractions) == 0:
                print(f"⚠️ No attractions found for {city_name}")

        # Step 3: Generate tourism routes with real distances
        print("\nSTEP 3: Generating routes with real distances...")
        all_data = []

        cities_with_data = [city for city, data in city_data.items() if data['attractions']]

        for origin_city in cities_with_data:
            if len(all_data) >= target_rows:
                break

            origin_info = city_data[origin_city]
            origin_coords = origin_info['coords']

            for dest_city in cities_with_data:
                if origin_city == dest_city:
                    continue

                if len(all_data) >= target_rows:
                    break

                dest_info = city_data[dest_city]
                dest_coords = dest_info['coords']

                # Get real distance via API
                real_distance = self.get_real_distance(origin_coords, dest_coords, origin_city, dest_city)

                if real_distance is None:
                    continue

                # Use attractions from destination city
                for attraction in dest_info['attractions']:
                    if len(all_data) >= target_rows:
                        break

                    # Generate realistic data for fields without APIs
                    rating = self.generate_realistic_rating(attraction['name'], attraction['type'])
                    visit_time = self.generate_realistic_visit_time(attraction['type'])
                    duration = self.generate_realistic_duration(attraction['type'], real_distance)

                    row = {
                        'Origin': origin_city,
                        'Destination': dest_city,
                        'Distance': real_distance,
                        'Origin_Lat': origin_coords[0],
                        'Origin_Long': origin_coords[1],
                        'Origin_State': origin_coords[2],
                        'Dest_Lat': dest_coords[0],
                        'Dest_Long': dest_coords[1],
                        'Dest_State': dest_coords[2],
                        'Ratings': rating,
                        'Ideal_duration': duration,
                        'Name': attraction['name'],
                        'Type': attraction['type'],
                        'Best Time to visit': visit_time
                    }

                    all_data.append(row)

                    if len(all_data) % 100 == 0:
                        print(f"📊 Progress: {len(all_data)}/{target_rows} rows")

        self.data = all_data

        print(f"\n🎉 COLLECTION COMPLETE!")
        print(f"📊 Total rows collected: {len(all_data)}")
        print(f"🌐 Total API calls made: {self.api_calls_made}")
        print("✅ Real API data: 9/14 columns")
        print("🎯 Realistic data: 3/14 columns (no APIs exist)")
        print("📋 Input data: 2/14 columns")

        return all_data

    def save_to_excel(self, filename="final_tourism_dataset_5000.xlsx"):
        """Save the final dataset to Excel"""
        if not self.data:
            print("❌ No data to save!")
            return None

        df = pd.DataFrame(self.data)

        # Remove duplicates
        df = df.drop_duplicates().reset_index(drop=True)

        # Save to Excel
        df.to_excel(filename, index=False)

        print(f"\n" + "="*70)
        print(f"📁 FINAL DATASET SAVED: {filename}")
        print(f"📊 Final shape: {df.shape}")
        print(f"🌐 API calls made: {self.api_calls_made}")
        print(f"✅ Real distances: {df['Distance'].notna().sum()}")
        print(f"✅ Real attractions: {df['Name'].nunique()}")
        print(f"✅ Cities covered: {df['Origin'].nunique()}")
        print(f"✅ States covered: {df['Origin_State'].nunique()}")
        print("="*70)

        print(f"\nDATA QUALITY SUMMARY:")
        print(f"• Distance range: {df['Distance'].min():.1f} - {df['Distance'].max():.1f} km")
        print(f"• Rating range: {df['Ratings'].min()} - {df['Ratings'].max()}")
        print(f"• Unique attraction types: {df['Type'].nunique()}")
        print(f"• Most common type: {df['Type'].mode().iloc[0]}")

        print(f"\nSAMPLE FINAL DATA:")
        print(df[['Origin', 'Destination', 'Distance', 'Name', 'Type', 'Ratings']].head())

        return df

# Main execution
if __name__ == "__main__":
    collector = FinalTourismDataCollector()

    print("🚀 READY TO COLLECT FINAL TOURISM DATASET")
    print("💾 Uses maximum real API data + realistic fill")
    print("⏱️ Estimated time: 15-20 minutes for 5000 rows")
    print()

    target = int(input("How many rows do you want? (recommended: 3000-5000): ") or "5000")

    print(f"\nStarting collection of {target} rows...")
    print("Press Ctrl+C anytime to stop and save partial data")
    print()

    try:
        data = collector.collect_final_dataset(target)
        df = collector.save_to_excel()

        print(f"\n🎉 SUCCESS! Your final tourism dataset is ready!")
        print(f"📁 File: final_tourism_dataset_5000.xlsx")
        print(f"✅ Perfect for business analytics projects")

    except KeyboardInterrupt:
        print(f"\n⏹️ Collection stopped by user")
        if collector.data:
            print(f"💾 Saving {len(collector.data)} rows collected so far...")
            df = collector.save_to_excel(f"partial_dataset_{len(collector.data)}.xlsx")
        else:
            print("❌ No data collected yet")

    except Exception as e:
        print(f"\n❌ Error occurred: {e}")
        if collector.data:
            print(f"💾 Saving {len(collector.data)} rows as backup...")
            df = collector.save_to_excel(f"backup_dataset_{len(collector.data)}.xlsx")

🚀 READY TO COLLECT FINAL TOURISM DATASET
💾 Uses maximum real API data + realistic fill
⏱️ Estimated time: 15-20 minutes for 5000 rows

How many rows do you want? (recommended: 3000-5000): 5000

Starting collection of 5000 rows...
Press Ctrl+C anytime to stop and save partial data

FINAL TOURISM DATA COLLECTOR
REAL API DATA (8/14 columns):
✅ Distance - OSRM routing API
✅ Coordinates - Nominatim geocoding API
✅ States - Nominatim administrative API
✅ Attraction Names - Overpass/OpenStreetMap API
✅ Attraction Types - Overpass/OpenStreetMap API

REALISTIC DATA (3/14 columns - no free APIs exist):
🎯 Ratings - Research-based realistic values
🎯 Visit Times - Climate/attraction-based logic
🎯 Duration - Distance/attraction-based logic

INPUT DATA (3/14 columns):
📋 Origin/Destination - Route planning
Target rows: 5000
Estimated API calls: ~60
Estimated time: 0 minutes

STEP 1: Getting real coordinates for all cities...
API Call #1: Getting real coordinates for Delhi
✅ Real data: Delhi = 28.63280
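
The ratings, visit times and durations in the dataset are drawn with `random`, so two runs can give different values for the same attraction. If repeatable output matters, one option is to draw from a seeded generator. The sketch below re-creates only the type-based branch of `generate_realistic_rating` with a two-entry stand-in table; `sample_rating`, `TYPE_RATINGS` and the seed value 42 are illustrative assumptions, not part of the collector.

```python
import random

# Stand-in for the collector's type_ratings table (two values copied from it).
TYPE_RATINGS = {'Palace': 4.4, 'Park': 3.9}

def sample_rating(attraction_type, rng):
    """Base rating plus jitter, clamped to [2.5, 5.0] as in the collector."""
    base = TYPE_RATINGS.get(attraction_type, 3.9)
    jittered = base + rng.uniform(-0.3, 0.4)
    return round(min(5.0, max(2.5, jittered)), 1)

# Two generators with the same seed yield the same sequence of draws,
# without touching the global random state.
rng_a = random.Random(42)
rng_b = random.Random(42)
types = ('Palace', 'Park', 'Fort')
run_a = [sample_rating(t, rng_a) for t in types]
run_b = [sample_rating(t, rng_b) for t in types]
print(run_a == run_b)  # True
```

Passing a `random.Random` instance around, rather than calling `random.seed` globally, keeps the fill values independent of any other code that uses the `random` module.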

In [4]:
from google.colab import files
files.download('final_tourism_dataset_5000.xlsx')


In [5]:
import pandas as pd

# Load your file
df = pd.read_excel("final_tourism_dataset_5000.xlsx")

# Check if rows with "Delhi" have unknown states
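# na=False treats missing cells as non-matches instead of propagating NaN into the mask
origin_check = df[(df["Origin"].str.contains("Delhi", case=False, na=False)) &
                  (df["Origin_State"].str.contains("unknown", case=False, na=False))]

dest_check = df[(df["Destination"].str.contains("Delhi", case=False, na=False)) &
                (df["Dest_State"].str.contains("unknown", case=False, na=False))]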

print("Origin with Delhi & unknown state:", origin_check.shape[0])
print("Destination with Delhi & unknown state:", dest_check.shape[0])


Origin with Delhi & unknown state: 760
Destination with Delhi & unknown state: 240
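
These counts show Delhi rows carrying the `'Unknown'` placeholder, which is what `get_real_city_coordinates` stores when the Nominatim `address` object has no `state` key. One possible remedy, assuming the response for such places still carries some other administrative field, is to fall back through several keys. The helper name `resolve_state` and the key order are assumptions for illustration, and the sample dictionaries are hand-written, not captured API responses.

```python
def resolve_state(address, fallback_keys=('state', 'state_district', 'city', 'county')):
    """Return the first non-empty administrative name found in an address dict."""
    for key in fallback_keys:
        value = address.get(key)
        if value:
            return value
    return 'Unknown'

# Hand-written examples (not real API output).
print(resolve_state({'city': 'Jaipur', 'state': 'Rajasthan'}))  # Rajasthan
print(resolve_state({'city': 'Delhi', 'country': 'India'}))     # Delhi
print(resolve_state({'country': 'India'}))                      # Unknown
```

In the collector this would stand in for `address.get('state', 'Unknown')`. Whether the fallback value is acceptable as a "state" for analysis is a modelling choice, so it is worth checking the resolved names by hand before regenerating the file.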
