# Danish Tourism Patterns - Analysis Notebook

## 1. Motivation

### What is your dataset?

We analyze Danish tourism patterns using data from several sources, all listed below:

1. **Danish Tourism Statistics** (Statistics Denmark - Statistik Banken)
   - Demographics data: Socioeconomic (FU14), Regional (FU17), Age (FU18): https://www.statistikbanken.dk/statbank5a/SelectVarVal/Define.asp?MainTable=FU14&PLanguage=0&PXSId=0&wsid=cftree

2. **Air Travel Comparisons**
   - World Bank tourism data: https://databank.worldbank.org/id/7ad54403

3. **Global Reference Data**
   - TripAdvisor Restaurants (31 European cities): https://www.kaggle.com/datasets/damienbeneschi/krakow-ta-restaurans-data-raw/code
   - Global Peace Index (GPI): https://www.kaggle.com/datasets/natalyreguerin/global-peace-index-gpi
   - Cost of Living Index: https://www.numbeo.com/cost-of-living/rankings_by_country.jsp
   - Mean Temperature by Country: https://www.kaggle.com/datasets/palinatx/mean-temperature-for-countries-by-year-2014-2022?select=combined_temperature.csv
   - CO2 Emissions: https://flightemissionmap.org/#Copenhagen/55.67,12.57/127/20000

The Danish data covers demographics, with detailed statistics on who travels and how much they spend. The global data helps explain why Danes choose certain destinations.

### Why did you choose these datasets?

We wanted to answer questions like:
- How has Danish tourism changed over time?
- What factors make destinations popular?
- Do different demographics travel differently?
- More generally, what else can the data tell us about travel, with a focus on Danes?

We combined government statistics with global data to get both local patterns and international context. The idea was to create something that helps people understand Danish travel trends and maybe plan their own trips.

### What was your goal for the end user's experience?

Our goal is for the end user to be able to see how Danish tourism has evolved, compare countries on different metrics, explore demographic patterns, and ultimately make informed travel decisions, including where to travel next.
We went for a magazine-style design that guides users through key insights but also lets them explore freely.

## 2. Basic stats

### Data cleaning process

Each visualization needed its own cleaning approach:

**For the bubble chart:**
```python
import pandas as pd
import numpy as np

# Load World Bank tourism data
df = pd.read_csv('../data/bubble_plot.csv')

# Clean currency and numeric fields
df['GDP'] = pd.to_numeric(df['GDP, PPP (current international $) [NY.GDP.MKTP.PP.CD]'].str.replace(',', ''), errors='coerce')
df['Population'] = pd.to_numeric(df['Population, total [SP.POP.TOTL]'], errors='coerce')
df['Departures'] = pd.to_numeric(df['International tourism, number of departures [ST.INT.DPRT]'].str.replace(',', ''), errors='coerce')
df['PerCapita'] = pd.to_numeric(df['International Tourism Departures per capita'], errors='coerce')

# Add continent categorization
continent_mapping = {
    'Denmark': 'Europe', 'Sweden': 'Europe', 'Norway': 'Europe', 
    'United Kingdom': 'Europe', 'Germany': 'Europe', 'France': 'Europe',
    'United States': 'North America', 'Canada': 'North America',
    'China': 'Asia', 'Japan': 'Asia', 'India': 'Asia',
    'Australia': 'Oceania', 'New Zealand': 'Oceania',
    'South Africa': 'Africa', 'Egypt, Arab Rep.': 'Africa',
    'Brazil': 'South America', 'Argentina': 'South America',
    'Saudi Arabia': 'Middle East', 'United Arab Emirates': 'Middle East'
}

df['Continent'] = df['Country Name'].map(continent_mapping).fillna('Other')
```
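The effect of the numeric cleaning above can be illustrated on a tiny frame. This is a minimal sketch, not the notebook's actual data: the rows are made up, the short column names and the `to_number` helper are hypothetical, and it assumes missing values appear as `..`, which is how World Bank DataBank exports usually mark them.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from bubble_plot.csv
sample = pd.DataFrame({
    "Country Name": ["Country A", "Country B", "Country C"],
    "Departures": ["1,234,000", "..", "560,000"],
    "Population": ["5000000", "1000000", ".."],
})

def to_number(series: pd.Series) -> pd.Series:
    """Hypothetical helper: strip thousands separators, coerce the rest to NaN."""
    cleaned = series.astype(str).str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")

sample["Departures"] = to_number(sample["Departures"])
sample["Population"] = to_number(sample["Population"])

# Per-capita departures; rows with a missing input stay NaN instead of raising
sample["PerCapita"] = sample["Departures"] / sample["Population"]

print(sample)
```

Because `errors='coerce'` turns unparseable cells into `NaN` silently, it can be worth counting the `NaN` values per column after cleaning to see how many rows were affected.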

**For the choropleth map:**
```python
import folium
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore', category=pd.errors.SettingWithCopyWarning)
warnings.filterwarnings('ignore', category=FutureWarning)

# Load all datasets
cost_df = pd.read_csv("../data/CostOfLiving.csv")
gpi_df = pd.read_csv("../data/GPI.csv")
temp_df = pd.read_csv("../data/combined_temperature.csv")
restaurants_df = pd.read_csv("../data/TA_restaurants_europa.csv")
CO2_df = pd.read_csv("../data/CO2Emission.csv")

# Clean and standardize country names
cost_df.columns = cost_df.columns.str.strip()
cost_df['Country'] = cost_df['Country'].str.strip()

# Country name mapping for standardization
country_name_map = {
    'United States': 'United States of America',
    'Russia': 'Russian Federation',
    'South Korea': 'Korea, Republic of',
    'Bosnia And Herzegovina': 'Bosnia and Herzegovina',
    'United Kingdom': 'United Kingdom'
    # ... (extensive mapping)
}
```
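One way the name mapping could be applied before joining the datasets is sketched below. The frames, values, and the `GPI Score` column are invented for illustration, and the mapping is deliberately left incomplete so the check has something to report; the notebook's real merge may look different.

```python
import pandas as pd

# Made-up miniature frames; values and the "GPI Score" column are hypothetical
cost = pd.DataFrame({
    "Country": [" United States", "Russia ", "Denmark", "South Korea"],
    "Cost of Living Index": [70.0, 35.0, 80.0, 60.0],
})
gpi = pd.DataFrame({
    "Country": ["United States of America", "Russian Federation",
                "Denmark", "Korea, Republic of"],
    "GPI Score": [2.4, 3.1, 1.3, 1.8],
})

# Deliberately incomplete mapping, to show how gaps surface
name_map = {
    "United States": "United States of America",
    "Russia": "Russian Federation",
}

# Strip whitespace first, then swap exact matches for the standard name
cost["Country"] = cost["Country"].str.strip().replace(name_map)

# An outer merge with indicator=True flags names present on one side only
merged = cost.merge(gpi, on="Country", how="outer", indicator=True)
unmatched = merged.loc[merged["_merge"] != "both", "Country"].tolist()

print(unmatched)
```

Here both spellings of South Korea end up in `unmatched`, which signals that the mapping needs another entry.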

**For the demographics data:**
```python
# Process socioeconomic spending data
socio_translations = {
    "Gennemsnitshusstand": "Average Household",
    "Selvstændig": "Self-employed",
    "Lønmodtager på højeste niveau": "High Income",
    "Lønmodtager på mellemniveau": "Medium Income",
    "Lønmodtager på grundniveau": "Basic Income",
    "Arbejdsløs": "Unemployed",
    "Uddannelsessøgende": "Student",
    "Pensionist": "Pensioner",
    "Ude af erhverv i øvrigt": "Not in Workforce"
}

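# `socioeconomic` is assumed to be the FU14 DataFrame loaded earlier in the notebook
socio_df = socioeconomic.copy()

# Translate the Danish group labels; labels missing from the dictionary become NaN
socio_df['Group_EN'] = socio_df['Group'].map(socio_translations)

# Total travel-related spending per group
socio_df['Total'] = socio_df['Packages'] + socio_df['Restaurants'] + socio_df['Accommodation']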
```
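Since `Series.map` returns `NaN` for any label that is not in the dictionary, a quick check for untranslated groups can catch spelling differences in the source table. The sketch below uses a shortened dictionary and a made-up frame; `Ukendt gruppe` is a hypothetical label standing in for anything the dictionary does not cover.

```python
import pandas as pd

# Shortened copy of the translation dictionary, for illustration
translations = {
    "Pensionist": "Pensioner",
    "Arbejdsløs": "Unemployed",
}

# Made-up frame; "Ukendt gruppe" is a hypothetical label not in the dictionary
demo = pd.DataFrame({"Group": ["Pensionist", "Arbejdsløs", "Ukendt gruppe"]})

demo["Group_EN"] = demo["Group"].map(translations)

# Labels that failed to translate
missing = demo.loc[demo["Group_EN"].isna(), "Group"].unique().tolist()
print(missing)

# Optional fallback: keep the Danish label rather than leaving NaN
demo["Group_EN"] = demo["Group_EN"].fillna(demo["Group"])
```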

### Dataset statistics

| Visualization | Dataset | Size | Coverage | 
|--------------|---------|------|----------|
| Bubble Chart | Tourism data | 5,000+ rows | 1996-2019, 200+ countries |
| Choropleth | Cost/Climate/Safety | 573 countries | 2023-2024 |
| Radar Charts | Demographics | 3 dimensions | Current data |

Key findings from the datasets:
- Danish tourism departures per capita grew from 0.957 in 1996 to 1.563 in 2019
- Cost of living correlates more strongly with destination choice than safety does
- High-income groups spend 3x more on package holidays
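
To put the first finding in perspective, the two per-capita figures can be turned into a total and an average annual growth rate. This is a rough sketch that assumes the endpoints are the 1996 and 2019 values and that growth compounds smoothly, which real year-to-year data will not do.

```python
# Departures per capita, taken from the first key finding
start_value, end_value = 0.957, 1.563
start_year, end_year = 1996, 2019

# Total relative growth over the whole period
total_growth = end_value / start_value - 1

# Compound annual growth rate (assumes smooth compounding between endpoints)
years = end_year - start_year
annual_growth = (end_value / start_value) ** (1 / years) - 1

print(f"Total growth: {total_growth:.1%}")
print(f"Average annual growth: {annual_growth:.2%}")
```

This works out to roughly 63% growth overall, or about 2.2% per year on average.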