# Travel and Tourism Reform Project

### Documentation

**Dataframes:**
- df_qcontcust_2009_2019 -> contains data for all years from 2009 to 2019
- df_qcontcust_2009, df_qcontcust_2010, ..., df_qcontcust_2019 -> filtered from df_qcontcust_2009_2019, one per year
- df_qcontcust_2022 -> contains data for 2022
- df_qreg_2013, df_qreg_2014, ..., df_qreg_2019, df_qreg_2022 -> qreg data for each year (not available for 2009-2012)

**Dictionaries:**
- flow_dict -> contains flow codes (arrival/departure, foreign/UK) for all years
- Purpose_value_map_0919 -> Purpose of visit mapping for the years 2009 to 2019
- Purpose_value_map_22 -> Purpose of visit mapping for 2022
- Nationality_value_map_0919 -> mapping for Nationality of respondent - NEW CODES (2009-2019)
- Nationality_value_map_22 -> mapping for Nationality of respondent - NEW CODES (2022)

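**New variables created:**
- Flow_Label -> text label for the Flow code, mapped through flow_dict
- Purpose_Label -> text label for the Purpose code, mapped through Purpose_value_map_0919 or Purpose_value_map_22 ("Unknown" when the code has no mapping)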

***


## Importing Packages

In [2]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

import statsmodels.api as sm
from statsmodels.stats.multitest import multipletests

import scipy.stats as ss
from scipy.stats import kruskal
from scipy.stats import mannwhitneyu
from scipy.stats import chi2_contingency

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix, multilabel_confusion_matrix
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.utils.class_weight import compute_class_weight
from scikit_posthocs import posthoc_dunn

from itertools import product

from imblearn.over_sampling import RandomOverSampler

from tabulate import tabulate

## Loading Data

In [13]:
df_qcontcust_2009_2019 = pd.read_csv("C:\\Users\\medasud\\Downloads\\2013-UKDA-7380-tab\\tab\\qcontcust_2009_2019.tab", delimiter='\t')
#filtering the dataset into different years
#.copy() makes each yearly frame independent, so adding columns later does not raise SettingWithCopyWarning
df_qcontcust_2009 = df_qcontcust_2009_2019[df_qcontcust_2009_2019['Year'] == 2009].copy()
df_qcontcust_2010 = df_qcontcust_2009_2019[df_qcontcust_2009_2019['Year'] == 2010].copy()
df_qcontcust_2011 = df_qcontcust_2009_2019[df_qcontcust_2009_2019['Year'] == 2011].copy()
df_qcontcust_2012 = df_qcontcust_2009_2019[df_qcontcust_2009_2019['Year'] == 2012].copy()
df_qcontcust_2013 = df_qcontcust_2009_2019[df_qcontcust_2009_2019['Year'] == 2013].copy()
df_qcontcust_2014 = df_qcontcust_2009_2019[df_qcontcust_2009_2019['Year'] == 2014].copy()
df_qcontcust_2015 = df_qcontcust_2009_2019[df_qcontcust_2009_2019['Year'] == 2015].copy()
df_qcontcust_2016 = df_qcontcust_2009_2019[df_qcontcust_2009_2019['Year'] == 2016].copy()
df_qcontcust_2017 = df_qcontcust_2009_2019[df_qcontcust_2009_2019['Year'] == 2017].copy()
df_qcontcust_2018 = df_qcontcust_2009_2019[df_qcontcust_2009_2019['Year'] == 2018].copy()
df_qcontcust_2019 = df_qcontcust_2009_2019[df_qcontcust_2009_2019['Year'] == 2019].copy()
df_qcontcust_2022 = pd.read_csv("C:\\Users\\medasud\\Downloads\\2022-UKDA-9122-tab\\tab\\qcontcust2022.tab", delimiter='\t')


df_qreg_2013 = pd.read_csv("C:\\Users\\medasud\\Downloads\\2013-UKDA-7380-tab\\tab\\qreg_2013.tab", delimiter='\t')
df_qreg_2014 = pd.read_csv("C:\\Users\\medasud\\Downloads\\2014-UKDA-7534-tab\\tab\\qreg_2014.tab", delimiter='\t')
df_qreg_2015 = pd.read_csv("C:\\Users\\medasud\\Downloads\\2015-UKDA-7754-tab\\tab\\qreg_2015.tab", delimiter='\t')
df_qreg_2016 = pd.read_csv("C:\\Users\\medasud\\Downloads\\2016-UKDA-8016-tab\\tab\\qreg_2016.tab", delimiter='\t')
df_qreg_2017 = pd.read_csv("C:\\Users\\medasud\\Downloads\\2017-UKDA-8286-tab\\tab\\qreg_2017.tab", delimiter='\t')
df_qreg_2018 = pd.read_csv("C:\\Users\\medasud\\Downloads\\2018-UKDA-8468-tab\\tab\\qreg_2018.tab", delimiter='\t')
df_qreg_2019 = pd.read_csv("C:\\Users\\medasud\\Downloads\\2019-UKDA-8575-tab\\tab\\qreg_2019.tab", delimiter='\t')
df_qreg_2022 = pd.read_csv("C:\\Users\\medasud\\Downloads\\2022-UKDA-9122-tab\\tab\\qreg_2022.tab", delimiter='\t')
#qreg is not available for 2009-2012
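
The per-year filtering above can also be done in a single pass over the combined frame. The sketch below is a minimal illustration on a toy frame, assuming only that the data has a `Year` column as in the cell above. The helper name `split_by_year` and the toy values are hypothetical and not part of the notebook.

```python
import pandas as pd

def split_by_year(df, year_col="Year"):
    """Return {year: independent copy of that year's rows}."""
    # .copy() makes each piece a standalone DataFrame, so adding columns
    # to it later does not touch the combined frame.
    return {int(year): group.copy() for year, group in df.groupby(year_col)}

# Toy stand-in for df_qcontcust_2009_2019 (values are made up)
toy = pd.DataFrame({"Year": [2009, 2009, 2010], "Flow": [1.0, 3.0, 2.0]})
frames_by_year = split_by_year(toy)

# Adding a column to one year's frame leaves the combined frame unchanged
frames_by_year[2009]["Flag"] = True
```

A dictionary keyed by year keeps the yearly frames addressable in loops (for example `frames_by_year[2015]`) without maintaining a separate variable per year.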




## Creating New Variables from Mappings

In [14]:
#dictionary for flow
flow_dict = {
    1.0: "Air Departure Foreign",
    2.0: "Air Departure UK",
    3.0: "Air Arrival Foreign",
    4.0: "Air Arrival UK",
    5.0: "Sea Departure Foreign",
    6.0: "Sea Departure UK",
    7.0: "Sea Arrival Foreign",
    8.0: "Sea Arrival UK"
}

In [78]:
#function to create Flow_Label column for all years

def create_flow_label_column(df):
    #convert blank entries in the Flow column to NaN and cast the codes to float
    df['Flow'] = df['Flow'].replace(' ', np.nan).astype(float)
    #treat the -1 placeholder as missing as well
    df['Flow'] = df['Flow'].replace(-1, np.nan)

    #map each numeric flow code to its text label
    df['Flow_Label'] = df['Flow'].map(flow_dict)

#call this function for df_qcontcust of each year
dataframes = [df_qcontcust_2009, df_qcontcust_2010, df_qcontcust_2011, df_qcontcust_2012,
              df_qcontcust_2013, df_qcontcust_2014, df_qcontcust_2015, df_qcontcust_2016,
              df_qcontcust_2017, df_qcontcust_2018, df_qcontcust_2019, df_qcontcust_2022]

#iterate over the list of dataframes and apply the function
for df in dataframes:
    create_flow_label_column(df)
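
To see what the cleaning and mapping steps do, the sketch below applies the same idea to a small made-up Series. It assumes blank strings mark missing values, as the `replace(' ', np.nan)` call in the function suggests, and it swaps in `pd.to_numeric(..., errors="coerce")` for the replace-and-cast pair. `flow_excerpt` is a two-entry excerpt of `flow_dict`, and the sample values are hypothetical.

```python
import pandas as pd

# Two-entry excerpt of flow_dict, for illustration only
flow_excerpt = {1.0: "Air Departure Foreign", 4.0: "Air Arrival UK"}

# Made-up raw column: numeric strings plus a blank for a missing value
raw = pd.Series(["1", " ", "4"])

# Non-numeric entries such as the blank become NaN; the rest become floats,
# which matches the float keys of the dictionary
flow = pd.to_numeric(raw, errors="coerce")

# Codes without a dictionary entry (including NaN) map to NaN
labels = flow.map(flow_excerpt)
print(labels.tolist())
```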


In [64]:
#creating a mapping of the values in Purpose for 2009-2019

Purpose_value_map_0919 = {
    10.0: "Holiday/pleasure",
    11.0: "Visit family (priority)",
    12.0: "Visit friends",
    13.0: "Getting married",
    14.0: "Play amateur sport",
    15.0: "Watch sport",
    16.0: "Personal shopping",
    17.0: "Cruise 0-2 nights ashore - UK",
    18.0: "Cruise 0-2 nights ashore - For",
    20.0: "Business; Work",
    21.0: "Visit trade fair",
    22.0: "Conference 20+ people",
    23.0: "Definite job to go to",
    24.0: "International commuter",
    25.0: "Looking for work",
    26.0: "Au Pair",
    40.0: "Formal course (check residence and definition)",
    41.0: "Medical treatment",
    44.0: "Accompany / join",
    45.0: "OTHER",
    51.0: "Immigrating/Emigrating",
    60.0: "Other formal study",
    61.0: "First/foundation degree",
    62.0: "Higher/PostGrad degree",
    63.0: "English language course",
    64.0: "Course between school and degree",
    65.0: "Secondary education",
    66.0: "Professional qualification",
    70.0: "Overnight transit",
    71.0: "Same day transit",
    80.0: "Military (serving on duty)",
    81.0: "Merchant navy (joining or leaving ship)",
    82.0: "Airline crew (positioning)",
    83.0: "Unacc schoolchild (16 or under, school to parents)",
    97.0: "Coding query",
}

In [107]:
#creating a mapping of the values in Purpose for 2022 (different from 2009-2019)

Purpose_value_map_22 = {
    10.0: "Holiday/pleasure",
    11.0: "Visit family (priority)",
    12.0: "Visit friends",
    13.0: "Getting married",
    14.0: "Play amateur sport",
    15.0: "Watch sport",
    16.0: "Personal shopping",
    17.0: "Cruise 0-2 nights ashore - UK",
    18.0: "Cruise 0-2 nights ashore - For",
    20.0: "Business; Work",
    21.0: "Visit trade fair",
    22.0: "Conference 20+ people",
    23.0: "Definite job to go to",
    24.0: "International commuter",
    25.0: "Looking for work",
    26.0: "Au Pair",
    27.0: "Working Holiday",
    30.0: "Olympics/Paralympics Participate",
    31.0: "Olympics/Paralympics Work",
    32.0: "Olympics/Paralympics Watch",
    41.0: "Medical Treatment",
    43.0: "Joining another traveller",
    44.0: "Accompany another traveller",
    45.0: "OTHER",
    46.0: "Religious Pilgrimage",
    47.0: "University Degree or Diploma",
    50.0: "Asylum Seeker",
    51.0: "Immigrating/Emigrating",
    52.0: "Returning Home To Live",
    60.0: "Formal Course",
    61.0: "First or Foundation Degree",
    62.0: "Higher or Postgraduate Degree",
    63.0: "English language course (not degree level)",
    64.0: "Other Course Below Degree Level & Above Secondary Education",
    65.0: "Secondary education",
    66.0: "Professional qualification",
    70.0: "Overnight transit",
    71.0: "Same day transit",
    80.0: "Military or embassy (serving on duty)",
    81.0: "Merchant navy (joining or leaving ship)",
    82.0: "Airline crew (positioning)",
    83.0: "Unacc schoolchild (16 or under, school to parents)",
    84.0: "Embassy Personnel"
}
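
Because the 2022 Purpose codes differ from the 2009-2019 ones, it can help to list the differences explicitly before comparing labels across years. The sketch below is one possible way to do that. The two dictionaries are three-entry excerpts of the mappings above rather than the full codebooks, and `compare_maps` is a hypothetical helper, not part of the notebook.

```python
# Excerpts of the two mappings shown above (not the full dictionaries)
purpose_0919_excerpt = {
    10.0: "Holiday/pleasure",
    40.0: "Formal course (check residence and definition)",
    60.0: "Other formal study",
}
purpose_22_excerpt = {
    10.0: "Holiday/pleasure",
    27.0: "Working Holiday",
    60.0: "Formal Course",
}

def compare_maps(old, new):
    """Summarise how two code-to-label dictionaries differ."""
    shared = old.keys() & new.keys()
    return {
        # codes present in only one of the two mappings
        "only_old": sorted(old.keys() - new.keys()),
        "only_new": sorted(new.keys() - old.keys()),
        # same code, different label text
        "relabelled": sorted(code for code in shared if old[code] != new[code]),
    }

diff = compare_maps(purpose_0919_excerpt, purpose_22_excerpt)
print(diff)
```

Running the same helper on the full `Purpose_value_map_0919` and `Purpose_value_map_22` would show which codes need harmonising before the yearly `Purpose_Label` columns are pooled.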

In [66]:
#function to create Purpose_Label column for 2009-2019

def create_purpose_label_column(df):
    #treat blank entries and the '-1' placeholder as missing
    df['Purpose'] = df['Purpose'].replace([' ', '-1'], np.nan)

    #create a new column "Purpose_Label" by mapping the codes; unmapped codes become "Unknown"
    df['Purpose_Label'] = df['Purpose'].map(Purpose_value_map_0919).fillna("Unknown")

#call this function for df_qcontcust of each year
dataframes = [df_qcontcust_2009, df_qcontcust_2010, df_qcontcust_2011, df_qcontcust_2012,
              df_qcontcust_2013, df_qcontcust_2014, df_qcontcust_2015, df_qcontcust_2016,
              df_qcontcust_2017, df_qcontcust_2018, df_qcontcust_2019]

#iterate over the list of dataframes and apply the function for 2009-2019
for df in dataframes:
    create_purpose_label_column(df)



In [108]:
#create the column for 2022

#Purpose column was not numeric in 2022: convert it, turning blanks into NaN
df_qcontcust_2022['Purpose'] = pd.to_numeric(df_qcontcust_2022['Purpose'], errors='coerce')
#treat the -1 placeholder as missing
df_qcontcust_2022['Purpose'] = df_qcontcust_2022['Purpose'].replace(-1, np.nan)
#map the codes to labels; unmapped codes become "Unknown"
df_qcontcust_2022['Purpose_Label'] = df_qcontcust_2022['Purpose'].map(Purpose_value_map_22).fillna("Unknown")
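
The 2009-2019 function and the 2022 cell follow the same pattern: clean the code column, map it through a dictionary, and fall back to "Unknown". A minimal sketch of a shared helper is shown below. `add_label_column` and the toy data are hypothetical. Unlike the notebook code, the helper returns a new frame instead of modifying its input.

```python
import pandas as pd

def add_label_column(df, code_col, mapping, label_col, unknown="Unknown"):
    """Return a copy of df with label_col derived from code_col via mapping."""
    out = df.copy()
    # Coerce to numeric so blanks and other non-numeric entries become NaN
    codes = pd.to_numeric(out[code_col], errors="coerce")
    # Codes without a dictionary entry (including NaN) get the fallback label
    out[label_col] = codes.map(mapping).fillna(unknown)
    return out

# Made-up input: one known code, one blank, one code missing from the mapping
toy = pd.DataFrame({"Purpose": ["10", " ", "99"]})
labelled = add_label_column(toy, "Purpose", {10.0: "Holiday/pleasure"}, "Purpose_Label")
print(labelled["Purpose_Label"].tolist())
```

Passing the mapping as an argument lets the same function serve both codebooks, for example `Purpose_value_map_0919` for the 2009-2019 frames and `Purpose_value_map_22` for 2022.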
