In [1]:
import plotly.io as pio

pio.renderers.default = "vscode+jupyterlab+notebook_connected"

# **Project 1: Tourist/visitors arrivals by country**

This project analyzes tourist and visitor arrivals by country for a specified year using data from the United Nations. The goal is to understand patterns in international tourism by examining the volume of visitors arriving in various countries.

Data source: [UNdata - Tourist/visitor arrivals and tourism expenditure](https://data.un.org/)

### **1. Using Pandas**

**Step1: Read in the Data**

Use pandas to load the dataset, ensuring the encoding is set to ISO-8859-1 to properly process any special characters in the file.
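
As a small illustration of why the encoding matters, the sketch below decodes a hand-made byte string (not read from the dataset) in two ways. It assumes the file stores accented characters as single ISO-8859-1 bytes, as the instruction above implies.

```python
# Hypothetical sample: "Türkiye" as it would be stored in an ISO-8859-1 file,
# where "ü" is the single byte 0xFC.
raw = b"T\xfcrkiye"

# Decoding with the matching encoding recovers the name.
latin1_text = raw.decode("ISO-8859-1")
print(latin1_text)

# The same bytes are not valid UTF-8, so a UTF-8 decoder fails on them.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print("Valid UTF-8:", utf8_ok)
```

Since `pd.read_csv` decodes as UTF-8 by default, passing `encoding='ISO-8859-1'` avoids this kind of decoding error for such a file.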

In [2]:
import pandas as pd

df = pd.read_csv('SYB66_176_202310_Tourist-Visitors Arrival and Expenditure.csv', encoding='ISO-8859-1')

In [None]:
df.head()

Unnamed: 0,T31,Tourist/visitor arrivals and tourism expenditure,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8
0,Region/Country/Area,,Year,Series,Tourism arrivals series type,Tourism arrivals series type footnote,Value,Footnotes,Source
1,4,Afghanistan,2010,Tourism expenditure (millions of US dollars),,,147,,"World Tourism Organization (UNWTO), Madrid, th..."
2,4,Afghanistan,2019,Tourism expenditure (millions of US dollars),,,85,,"World Tourism Organization (UNWTO), Madrid, th..."
3,4,Afghanistan,2020,Tourism expenditure (millions of US dollars),,,75,,"World Tourism Organization (UNWTO), Madrid, th..."
4,8,Albania,2010,Tourist/visitor arrivals (thousands),TF,,2191,Excluding nationals residing abroad.,"World Tourism Organization (UNWTO), Madrid, th..."


**Step2: Data Cleaning**

a. Use the row holding the column names (index 0 in the DataFrame above) as the header by re-reading the file with `header=1`

In [None]:
import pandas as pd

df = pd.read_csv('SYB66_176_202310_Tourist-Visitors Arrival and Expenditure.csv', encoding='ISO-8859-1', header=1)

In [None]:
df.head()

Unnamed: 0,Region/Country/Area,Unnamed: 1,Year,Series,Tourism arrivals series type,Tourism arrivals series type footnote,Value,Footnotes,Source
0,4,Afghanistan,2010,Tourism expenditure (millions of US dollars),,,147,,"World Tourism Organization (UNWTO), Madrid, th..."
1,4,Afghanistan,2019,Tourism expenditure (millions of US dollars),,,85,,"World Tourism Organization (UNWTO), Madrid, th..."
2,4,Afghanistan,2020,Tourism expenditure (millions of US dollars),,,75,,"World Tourism Organization (UNWTO), Madrid, th..."
3,8,Albania,2010,Tourist/visitor arrivals (thousands),TF,,2191,Excluding nationals residing abroad.,"World Tourism Organization (UNWTO), Madrid, th..."
4,8,Albania,2019,Tourist/visitor arrivals (thousands),TF,,6128,Excluding nationals residing abroad.,"World Tourism Organization (UNWTO), Madrid, th..."


b. Rename the column "Unnamed: 1" to "Country name"

In [None]:
df = df.rename(columns={"Unnamed: 1": "Country name"})

In [None]:
df.head()

Unnamed: 0,Region/Country/Area,Country name,Year,Series,Tourism arrivals series type,Tourism arrivals series type footnote,Value,Footnotes,Source
0,4,Afghanistan,2010,Tourism expenditure (millions of US dollars),,,147,,"World Tourism Organization (UNWTO), Madrid, th..."
1,4,Afghanistan,2019,Tourism expenditure (millions of US dollars),,,85,,"World Tourism Organization (UNWTO), Madrid, th..."
2,4,Afghanistan,2020,Tourism expenditure (millions of US dollars),,,75,,"World Tourism Organization (UNWTO), Madrid, th..."
3,8,Albania,2010,Tourist/visitor arrivals (thousands),TF,,2191,Excluding nationals residing abroad.,"World Tourism Organization (UNWTO), Madrid, th..."
4,8,Albania,2019,Tourist/visitor arrivals (thousands),TF,,6128,Excluding nationals residing abroad.,"World Tourism Organization (UNWTO), Madrid, th..."


c. Convert the `Value` column from text (`object` dtype) to numeric values

In [None]:
df['Year'].dtype

dtype('int64')

In [None]:
df['Value'].dtype

dtype('O')

In [None]:
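# Strip comma thousands separators (e.g. "1,234" -> "1234") before converting
df['Value'] = df['Value'].replace({',': ''}, regex=True)
# Entries that cannot be parsed become NaN instead of raising an error
df['Value'] = pd.to_numeric(df['Value'], errors='coerce')
# Note: any missing value is counted as 0, which affects the mean, median, and mode
df['Value'] = df['Value'].fillna(0)
print(df['Value'].dtype)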

int64


**Step3: Select the data**

a. Select the relevant columns

In [None]:
df[['Country name', 'Year', 'Series', 'Value']].head(10)

Unnamed: 0,Country name,Year,Series,Value
0,Afghanistan,2010,Tourism expenditure (millions of US dollars),147
1,Afghanistan,2019,Tourism expenditure (millions of US dollars),85
2,Afghanistan,2020,Tourism expenditure (millions of US dollars),75
3,Albania,2010,Tourist/visitor arrivals (thousands),2191
4,Albania,2019,Tourist/visitor arrivals (thousands),6128
5,Albania,2020,Tourist/visitor arrivals (thousands),2604
6,Albania,2021,Tourist/visitor arrivals (thousands),5515
7,Albania,1995,Tourism expenditure (millions of US dollars),70
8,Albania,2005,Tourism expenditure (millions of US dollars),880
9,Albania,2010,Tourism expenditure (millions of US dollars),1778


b. Filter the dataset to extract the data focused on Tourist/Visitor arrivals (thousands) in 2021

In [None]:
df_arrivals = df[df['Series'] == 'Tourist/visitor arrivals (thousands)']

df_arrivals[['Country name', 'Year', 'Series', 'Value']]

Unnamed: 0,Country name,Year,Series,Value
3,Albania,2010,Tourist/visitor arrivals (thousands),2191
4,Albania,2019,Tourist/visitor arrivals (thousands),6128
5,Albania,2020,Tourist/visitor arrivals (thousands),2604
6,Albania,2021,Tourist/visitor arrivals (thousands),5515
13,Algeria,1995,Tourist/visitor arrivals (thousands),520
...,...,...,...,...
2204,Zimbabwe,2005,Tourist/visitor arrivals (thousands),1559
2205,Zimbabwe,2010,Tourist/visitor arrivals (thousands),2239
2206,Zimbabwe,2019,Tourist/visitor arrivals (thousands),2294
2207,Zimbabwe,2020,Tourist/visitor arrivals (thousands),639


In [None]:
df_arrivals_2021 = df_arrivals[df_arrivals['Year'] == 2021]
df_arrivals_2021[['Country name', 'Series', 'Value']]

Unnamed: 0,Country name,Series,Value
6,Albania,Tourist/visitor arrivals (thousands),5515
18,Algeria,Tourist/visitor arrivals (thousands),125
33,Andorra,Tourist/visitor arrivals (thousands),1949
41,Angola,Tourist/visitor arrivals (thousands),64
53,Anguilla,Tourist/visitor arrivals (thousands),28
...,...,...,...
2123,United States of America,Tourist/visitor arrivals (thousands),22100
2155,Uzbekistan,Tourist/visitor arrivals (thousands),1881
2182,Viet Nam,Tourist/visitor arrivals (thousands),157
2197,Zambia,Tourist/visitor arrivals (thousands),554


**Step4: Compute the Mean, Median, and Mode**

Calculate the mean, median, and mode for Tourist/Visitor arrivals (thousands) in 2021

a. The mean

In [None]:
mean = df_arrivals_2021['Value'].mean()
print(f"Mean: {mean:.2f}")

Mean: 2604.87


b. The median

In [None]:
median = df_arrivals_2021['Value'].median()
print(f"Median: {median:.2f}")

Median: 430.00


c. The mode

In [None]:
mode = float(df_arrivals_2021['Value'].mode().iloc[0])
print(f"Mode: {mode}")

Mode: 0.0


### **2. Using only the Python standard library**

**Step1: Read in the Data and Data Cleaning**

- Library Imports: The code uses the `csv` module to read data from the CSV file.

- File Opening and Reading: The CSV file is opened and read with `csv.reader`. The title line is skipped, and each remaining row is paired with the header names via `dict(zip(headers, row))` for easier data handling.

- Data cleaning and filtering: Replace the empty column name in the header with "Country name", convert `Value` to a number (removing commas), and keep only the 2021 "Tourist/visitor arrivals (thousands)" rows.

In [None]:
import csv

filename = 'SYB66_176_202310_Tourist-Visitors Arrival and Expenditure.csv'

data = []
with open(filename, newline='', encoding='ISO-8859-1') as csvfile:
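    reader = csv.reader(csvfile)
    next(reader)  # skip the title line above the real header
    headers = next(reader)
    # the country-name column has an empty header in the file
    headers = ['Country name' if col == '' else col for col in headers]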
    for row in reader:
        row_dict = dict(zip(headers, row))
        try:
            row_dict['Value'] = float(row_dict['Value'].replace(',', '')) if row_dict['Value'] else 0.0
        except ValueError:
            row_dict['Value'] = 0.0
        if row_dict['Series'] == 'Tourist/visitor arrivals (thousands)' and row_dict['Year'] == '2021':
            filtered_row = {
                'Country name': row_dict['Country name'],
                'Year': row_dict['Year'],
                'Series': row_dict['Series'],
                'Value': row_dict['Value']
            }
            data.append(filtered_row)
            
for row in data[:5]:  
    print(row)

{'Country name': 'Albania', 'Year': '2021', 'Series': 'Tourist/visitor arrivals (thousands)', 'Value': 5515.0}
{'Country name': 'Algeria', 'Year': '2021', 'Series': 'Tourist/visitor arrivals (thousands)', 'Value': 125.0}
{'Country name': 'Andorra', 'Year': '2021', 'Series': 'Tourist/visitor arrivals (thousands)', 'Value': 1949.0}
{'Country name': 'Angola', 'Year': '2021', 'Series': 'Tourist/visitor arrivals (thousands)', 'Value': 64.0}
{'Country name': 'Anguilla', 'Year': '2021', 'Series': 'Tourist/visitor arrivals (thousands)', 'Value': 28.0}
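
The value-cleaning rule used in the loop can be isolated and checked on a few inputs. The helper name `parse_value` and the sample CSV text below are made up for illustration (including the placeholder row "Example"). Only the cleaning rule itself (strip commas, fall back to 0.0) comes from the cell above.

```python
import csv
import io

def parse_value(text):
    """Convert a string such as '1,234' to a float; empty or invalid text becomes 0.0."""
    try:
        return float(text.replace(',', '')) if text else 0.0
    except ValueError:
        return 0.0

# Hypothetical sample laid out like the real file: title line, header line, data rows.
sample = io.StringIO(
    'T31,Tourist/visitor arrivals and tourism expenditure,,,\n'
    'Region/Country/Area,,Year,Series,Value\n'
    '250,France,2021,Tourist/visitor arrivals (thousands),"48,395"\n'
    '999,Example,2021,Tourist/visitor arrivals (thousands),\n'
)
reader = csv.reader(sample)
next(reader)  # skip the title line
headers = ['Country name' if col == '' else col for col in next(reader)]
rows = [dict(zip(headers, row)) for row in reader]
parsed = [(r['Country name'], parse_value(r['Value'])) for r in rows]
print(parsed)
```

Mapping missing entries to 0.0 keeps every row but pulls the mean and median down and can make 0 the most frequent value. Skipping such rows would be an alternative design.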


**Step2: Compute the Mean, Median, and Mode**

a. The mean

To calculate the mean:
- Sum the Values: Add up all the `Value` entries (tourist arrivals) from the `data` list.
- Count the Entries: Count how many data points (countries) are in the `data` list.
- Divide the Sum by the Count: Divide the total sum by the number of entries to get the mean.

In [None]:
total = sum(row['Value'] for row in data)  
count = len(data)  
mean = total / count if count > 0 else 0  
print(f"Mean: {mean:.2f}")

Mean: 2604.87


b. The median

To calculate the median:
- Extract the Values: Collect all the `Value` entries (tourist arrivals) from the `data` list.
- Sort the Values: Arrange the values in ascending order.
- Find the Middle Value:
  - If the number of values is odd, the median is the middle value in the sorted list.
  - If the number of values is even, the median is the average of the two middle values.

In [None]:
values = [row['Value'] for row in data]  
values.sort()  

count = len(values)
if count % 2 == 1:
    median = values[count // 2]  
else:
    median = (values[count // 2 - 1] + values[count // 2]) / 2 
print(f"Median: {median:.2f}")

Median: 430.00


c. The mode

To calculate the mode:
- Extract the Values: Collect all the `Value` entries (tourist arrivals) from the `data` list.
- Count the Frequency: Create a dictionary to track how many times each value appears.
- Identify the Mode: The mode is the value with the highest frequency.

In [None]:
values = [row['Value'] for row in data]  
value_counts = {}

for value in values:
    if value in value_counts:
        value_counts[value] += 1
    else:
        value_counts[value] = 1

max_count = max(value_counts.values())
modes = [key for key, count in value_counts.items() if count == max_count]
mode_value = modes[0] if modes else None  

print(f"Mode: {mode_value}")

Mode: 0.0
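
As a cross-check, Python's standard `statistics` module provides these three measures directly. The sketch below uses only the five 2021 values printed in Step1 (a small subset chosen for illustration, not the full `data` list), so its results differ from the outputs above.

```python
import statistics

# The first five 2021 values shown earlier
# (Albania, Algeria, Andorra, Angola, Anguilla), in thousands
sample_values = [5515.0, 125.0, 1949.0, 64.0, 28.0]

sample_mean = statistics.mean(sample_values)
sample_median = statistics.median(sample_values)
# multimode returns every value tied for the highest frequency,
# in the order first encountered
sample_modes = statistics.multimode(sample_values)

print(f"Mean: {sample_mean:.2f}")
print(f"Median: {sample_median:.2f}")
print(f"Modes: {sample_modes}")
```

When several values tie for the highest frequency, the manual code above reports `modes[0]`, the first tied value encountered, whereas pandas' `mode()` returns the tied values sorted, so `.iloc[0]` is the smallest. On the full 2021 data both approaches report 0.0.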


### **3. Data Visualization**

- Filter data for the specific metric and year, then sort by value for display:
  - Use the data already filtered for "Tourist/visitor arrivals (thousands)" in 2021.
  - Sort the rows in descending order of value.

- Determine the maximum bar length:
  - Cap the bars at 50 characters to keep the chart compact and readable on narrow screens.
  - Find the highest value in the dataset to scale the bars proportionally.

- Generate the textual bar chart:
  - Scale each bar's length relative to the maximum value.
  - Format and display each country's name (truncated to 20 characters) alongside its bar.
  - Use '■' to draw the bars.
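
The scaling rule can be checked by hand on a few values. This sketch assumes the 50-character cap and the integer truncation used in `generate_bar_chart`. The helper name `bar_length` is introduced here for illustration, and the three values are taken from the chart output.

```python
MAX_BAR_LENGTH = 50  # same cap as in generate_bar_chart

def bar_length(value, max_value, max_bar_length=MAX_BAR_LENGTH):
    """Scale value against max_value and truncate to a whole number of characters."""
    if max_value <= 0:
        return 0
    return int((value / max_value) * max_bar_length)

# 2021 arrivals in thousands
arrivals = {"France": 48395, "Mexico": 31860, "Italy": 26888}
max_value = max(arrivals.values())

lengths = {country: bar_length(v, max_value) for country, v in arrivals.items()}
for country, n in lengths.items():
    print(f"{country:<10} | {'■' * n} {arrivals[country]}K")
```

Because `int()` truncates, any country with fewer than one fiftieth of the maximum (about 968K when France is the maximum) gets an empty bar and is distinguished only by the number printed after it.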

In [None]:
def generate_bar_chart(data):
    max_value = max(row['Value'] for row in data)  
    max_bar_length = 50  
    
    data = sorted(data, key=lambda x: x['Value'], reverse=True)
    
    print("Tourist/Visitor Arrivals (Thousands) in 2021 by Country")
    print("=" * 80)
    for row in data:
        country = row['Country name']
        value = row['Value']
        
        bar_length = int((value / max_value) * max_bar_length) if max_value > 0 else 0
        bar = "■" * bar_length  
        
        print(f"{country[:20]:<20} | {bar} {value:.0f}K")
    print("=" * 80)

generate_bar_chart(data)

Tourist/Visitor Arrivals (Thousands) in 2021 by Country
France               | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 48395K
Mexico               | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 31860K
Spain                | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 31181K
Türkiye              | ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 29925K
Italy                | ■■■■■■■■■■■■■■■■■■■■■■■■■■■ 26888K
United States of Ame | ■■■■■■■■■■■■■■■■■■■■■■ 22100K
Denmark              | ■■■■■■■■■■■■■■■■■■■ 18405K
Greece               | ■■■■■■■■■■■■■■■ 14705K
Austria              | ■■■■■■■■■■■■■ 12728K
Germany              | ■■■■■■■■■■■■ 11688K
United Arab Emirates | ■■■■■■■■■■■ 11479K
Croatia              | ■■■■■■■■■■ 10641K
Poland               | ■■■■■■■■■■ 9722K
Hungary              | ■■■■■■■■ 7929K
India                | ■■■■■■■ 7010K
Romania              | ■■■■■■■ 6789K
Portugal             | ■■■■■■ 6345K
United Kingdom       | ■■■■■■ 6287K
Netherlands (Kingdom | ■■■■■■ 6248K
Albania              | ■■■■■ 5515K
Domi

### **4. What I found**

- The data covers 2021, when the COVID-19 pandemic still constrained travel, so the picture may look different now that travel restrictions and other factors have changed.
- France recorded the highest number of tourist arrivals in 2021 with over 48 million visitors, showing its sustained popularity as a global tourism leader. Mexico (31.9 million) and Spain (31.2 million) also saw substantial visitor numbers, indicating their draw as premier vacation destinations.
- The United States (22.1 million) and Denmark (18.4 million) showed solid visitor counts, while emerging regions like the Middle East (United Arab Emirates, 11.5 million) and certain Asian nations (India, Thailand) had notable arrivals, pointing to an expansion of tourism beyond traditionally dominant destinations.
- Countries in the Asia-Pacific, such as Australia (246K) and Japan (246K), had lower arrival figures, potentially due to strict travel restrictions and delayed reopenings during the pandemic. 