## AI4D-LAB-TANZANIA-TOURISM-CLASSIFICATION-CHALLENGE

Can you use tourism survey data and ML to classify the range of expenditures a tourist spends in Tanzania?

The Tanzanian tourism sector plays a significant role in the Tanzanian economy, contributing about 17% to the country’s GDP and 25% of all foreign exchange revenues. The sector, which provides direct employment for more than 600,000 people and up to 2 million people indirectly, generated approximately $2.4 billion in 2018 according to government statistics. Tanzania received a record 1.1 million international visitor arrivals in 2014, mostly from Europe, the US and Africa.

Tanzania is the only country in the world which has allocated more than 25% of its total area for wildlife, national parks, and protected areas. There are 16 national parks in Tanzania, 28 game reserves, 44 game-controlled areas, two marine parks and one conservation area.

Tanzania’s tourist attractions include the Serengeti plains, which host the largest terrestrial mammal migration in the world; the Ngorongoro Crater, the world’s largest intact volcanic caldera and home to the highest density of big game in Africa; Kilimanjaro, Africa’s highest mountain; and the Mafia Island marine park; among many others. The scenery, topography, rich culture and very friendly people provide for excellent cultural tourism, beach holidays, honeymooning, game hunting, historical and archaeological ventures – and certainly the best wildlife photography safaris in the world.

The objective of this hackathon is to develop a machine learning model that can classify the range of expenditures a tourist spends in Tanzania. The model can be used by different tour operators and the Tanzania Tourism Board to automatically help tourists across the world estimate their expenditure before visiting Tanzania.

https://zindi.africa/competitions/ai4d-lab-tanzania-tourism-classification-challenge

AI Squad Team Members

>@Ebiendele (Team Leader), <br>@Mike_ade, <br>@D-Prof

### Downloading Dataset from Zindi using the Zindi package

In [1]:
# !pip -q install git+https://github.com/eaedk/testing-zindi-package.git
# from zindi.user import Zindian
# USERNAME = "adetoromichael346@gmail.com" #@param {type : "string"}
# user = Zindian(username = USERNAME)
# user.select_a_challenge(reward = 'all', kind = 'competition', active = 'true')
# user.download_dataset(destination = "dataset")
# I can't find the AI4D... competition in the list of challenges, so I will import the data manually.

### LOAD NECESSARY LIBRARIES

In [None]:
!pip install catboost optuna # installing catboost and optuna libraries

In [3]:
# dependencies
import re
import pandas as pd
pd.options.display.max_columns = 999
pd.options.display.max_rows = 999
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, StratifiedKFold, KFold
from catboost import Pool, CatBoostClassifier
from sklearn.cluster import KMeans
from lightgbm import LGBMClassifier
from geopy.geocoders import Nominatim
import optuna
from sklearn.metrics import log_loss

import warnings
warnings.filterwarnings('ignore')

In [5]:
#@markdown <br><center><img src='https://upload.wikimedia.org/wikipedia/commons/thumb/d/da/Google_Drive_logo.png/600px-Google_Drive_logo.png' height="150" alt="Gdrive-logo"/></center>
#@markdown <center><h2>Mount GDrive to /content/drive</h2></center><br>
MODE = "MOUNT" #@param ["MOUNT", "UNMOUNT"]
#Mount your Gdrive! 
from google.colab import drive
drive.mount._DEBUG = False
if MODE == "MOUNT":
  drive.mount('/content/drive', force_remount=True)
elif MODE == "UNMOUNT":
  try:
    drive.flush_and_unmount()
  except ValueError:
    pass
  get_ipython().system_raw("rm -rf /root/.config/Google/DriveFS")

Mounted at /content/drive


### LOAD DATA (and other relevant datasets: continent, latitude and longitude)

In [6]:
# load the data
path = '/content/drive/MyDrive/DATASET/dataset_/AI4D DATA'
# path to the folder where the datasets are located in my google drive
cont_coun = pd.read_csv(f'{path}/Countries-Continents.csv')
# downloaded from 'https://github.com/dbouquin/IS_608/blob/master/NanosatDB_munging/Countries-Continents.csv'
lat_lon = pd.read_csv(f'{path}/world_country_and_usa_states_latitude_and_longitude_values.csv')
# downloaded from 'https://www.kaggle.com/datasets/paultimothymooney/latitude-and-longitude-for-every-country-and-state'
Train = pd.read_csv(f'{path}/Train.csv')
Test = pd.read_csv(f'{path}/Test.csv')
sub = pd.read_csv(f'{path}/SampleSubmission.csv')
random_seed = 2001 # random seed for all computations
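
The country names in the survey are free text, so some of them will not match the external lookup tables loaded above. A minimal sketch for listing the unmatched names before writing a replacement map is shown here. It relies on `merge` with `indicator=True`; the toy frames and the helper name `unmatched_countries` are illustrative assumptions, and only the column names `country` and `Continent` mirror the notebook.

```python
import pandas as pd

# Toy stand-ins for the survey data and the country-to-continent table (values are invented).
survey = pd.DataFrame({"country": ["KENYA", "SWIZERLAND", "ITALY", "UKRAIN"]})
lookup = pd.DataFrame({"country": ["Kenya", "Switzerland", "Italy", "Ukraine"],
                       "Continent": ["Africa", "Europe", "Europe", "Europe"]})

def unmatched_countries(data, table, key="country"):
    """Return the sorted, title-cased names in `data` that have no row in `table`."""
    names = data[[key]].assign(**{key: data[key].str.title()}).drop_duplicates()
    merged = names.merge(table[[key]].drop_duplicates(), on=key, how="left", indicator=True)
    return sorted(merged.loc[merged["_merge"] == "left_only", key])

missing = unmatched_countries(survey, lookup)
print(missing)  # ['Swizerland', 'Ukrain']
```

Each name in the result is a candidate key for the `replace` dictionaries used in the feature engineering step.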

### COMBINE DATA INTO A WHOLE DATASET

In [7]:
all_data = pd.concat([Train, Test], sort = False).reset_index(drop = True)
all_data.shape

(24675, 21)

### PREPROCESSING

I fill missing values in travel_with with a separate placeholder category, 'Free-bus', rather than 'Alone'. Where travel_with is missing, total_male plus total_female equals 1 in almost all cases, which suggests those tourists were most likely travelling alone. I still keep them separate, because there must be a reason why the value is missing even though they appear to have travelled alone.

I treat missing values in total_male and total_female as meaning that no male or female tourists were part of that particular trip, so I fill them with 0.
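
The claim about group size can be checked directly before choosing a fill value. The sketch below uses invented rows and a hypothetical helper, `share_single_when_missing`; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy rows; the column names follow the notebook, the values are invented.
toy = pd.DataFrame({
    "travel_with": ["Alone", np.nan, "Spouse", np.nan, np.nan],
    "total_male": [1.0, 1.0, 1.0, 0.0, 2.0],
    "total_female": [0.0, 0.0, 1.0, 1.0, np.nan],
})

def share_single_when_missing(data):
    """Among rows with missing travel_with, the fraction whose group size is exactly 1."""
    missing = data[data["travel_with"].isna()]
    group_size = missing["total_male"].fillna(0) + missing["total_female"].fillna(0)
    return (group_size == 1).mean()

share = share_single_when_missing(toy)
print(round(share, 3))  # 0.667 for the toy rows
```

Running the same helper on `all_data` before preprocessing would show how strong the "most likely alone" pattern is in the real data.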

In [8]:
le = LabelEncoder()
def preprocess(data):
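  # assign the result back instead of using inplace on a column, which newer pandas versions warn about
  data['travel_with'] = data['travel_with'].fillna('Free-bus')
  data['total_female'] = data['total_female'].fillna(0)
  data['total_male'] = data['total_male'].fillna(0)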
  # label encoding packages, tour_arrangement and first_trip_tz
  LE_cols = ['package_transport_int', 'package_accomodation', 'package_food', 'tour_arrangement',\
                  'package_transport_tz', 'package_sightseeing', 'package_guided_tour', 'package_insurance', 'first_trip_tz']
  for le_col in LE_cols:
      data[le_col] = le.fit_transform(data[le_col])
  return data

preprocessed_data = preprocess(all_data)

### FEATURE INTERACTION AND ENGINEERING

In [9]:
def feature_engineering(data, cont_coun, lat_lon):
  # mapping continent to country
  data['country_'] = data['country'].str.title()
  cont_coun.rename(columns = {'Country' : 'country_'}, inplace = True)
  data.country_.replace({'United States Of America' : 'US', 'Drc' : 'Congo, Democratic Republic of', 'Swizerland' : 'Switzerland', 'Morroco' : 'Morocco', 'Uae': 'United Arab Emirates', 'Saud Arabia' : 'Saudi Arabia', 'Myanmar' : 'Burma (Myanmar)', 'Russia' : 'Russian Federation',\
                                  'Korea' : 'Korea, South', 'Czech Republic' : 'CZ', 'Taiwan' : 'China', 'Djibout' : 'Djibouti', 'Ukrain' : 'Ukraine', 'Malt' : 'Malta', 'Costarica' : 'Costa Rica', 'Burgaria' : 'Bulgaria', 'Comoro' : 'Comoros', 'Philipines' : 'Philippines', 'Somali' : 'Somalia',\
                                  'Ecuado' : 'Ecuador', 'Monecasque' : 'Monaco', 'Trinidad Tobacco' : 'Trinidad and Tobago', 'Bosnia' : 'Bosnia and Herzegovina'}, inplace=True)
  data = data.merge(cont_coun, on = 'country_', how = 'left')
  data.Continent = np.where((data.country_ == 'Scotland') & (data.Continent.isna()), 'Europe', data.Continent)
  data.Continent = np.where((data.country_ == 'Bermuda') & (data.Continent.isna()), 'North America', data.Continent)
  data.drop(['country_'], axis = 1, inplace = True)

  # mapping latitude and longitude to country
  lat_lon = lat_lon[['country', 'latitude', 'longitude']]
  data['country'] = data['country'].str.title()
  data.country.replace({'United States Of America' : 'United States', 'Drc' : 'Congo [DRC]', 'Congo' : 'Congo [Republic]', 'Swizerland' : 'Switzerland', 'Morroco' : 'Morocco', 'Uae': 'United Arab Emirates', 'Saud Arabia' : 'Saudi Arabia', 'Myanmar' : 'Myanmar [Burma]', \
                                  'Korea' : 'South Korea', 'Ivory Coast' : 'Côte d\'Ivoire', 'Djibout' : 'Djibouti', '\tDjibouti' : 'Djibouti', 'Ukrain' : 'Ukraine', 'Malt' : 'Malta', 'Costarica' : 'Costa Rica', 'Burgaria' : 'Bulgaria', 'Comoro' : 'Comoros', 'Philipines' : 'Philippines', 'Somali' : 'Somalia', \
                                  'Ecuado' : 'Ecuador', 'Macedonia' : 'Macedonia [FYROM]', 'Monecasque' : 'Monaco', 'Trinidad Tobacco' : 'Trinidad and Tobago', 'Bosnia' : 'Bosnia and Herzegovina', 'Scotland' : ''}, inplace=True)
  data = data.merge(lat_lon, on = 'country', how = 'left')
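  # any country still unmatched after the merge (including Scotland, mapped to '' above) gets Scotland's coordinates
  data['latitude'] = data['latitude'].fillna(56.78611112)
  data['longitude'] = data['longitude'].fillna(-4.1140518)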
  
  # mapping age
  map_age = {'<18' : 18, '18-24' : 24, '25-44' : 44, '45-64' : 64, '65+' : 75}
  data.age_group = data.age_group.map(map_age)
  
  # feature interaction
  # adding nights together
  data['total_nights'] = data.night_mainland + data.night_zanzibar 
  # adding packages together
  data['total_packages'] = data.package_transport_int + data.package_accomodation + data.package_food + data.package_transport_tz + data.package_sightseeing + data.package_guided_tour + data.package_insurance
  # adding people together
  data['total_people'] = data.total_male + data.total_female
  # dividing packages by people available
  data["packages_per_people"] = data["total_packages"] / data["total_people"]

  # frequency encoding, since the column has many unique values
  cols = ['country']
  for col in cols:
    data[col] = data[col].map(data.groupby(col).size() / len(data))
  
  # Groupby features by mean
  data['country_by_people'] = data['total_people'].groupby(data['country']).transform('mean')
  data['people_by_packages'] = data['total_packages'].groupby(data['total_people']).transform('mean')
  data['people_by_night'] = data['total_nights'].groupby(data['total_people']).transform('mean')
  data['age_by_packages'] = data['total_packages'].groupby(data['age_group']).transform('mean')
  
  # one hot encoding
  cols2dum = ['info_source', 'main_activity', 'purpose', 'travel_with', 'Continent']
  data = pd.get_dummies(data, prefix_sep = '_', columns = cols2dum)
  
  # handling inf values and missing values
  data = data.replace([np.inf], np.nan)
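  # numeric_only avoids a TypeError on the string columns (Tour_ID, cost_category) in pandas >= 2.0
  data.fillna(data.mean(numeric_only = True), inplace = True)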
  
  # clustering the dataset into 6 different clusters since we have 6 different classes to classify
  data_km = data.drop(['Tour_ID', 'cost_category'], axis = 1)
  km = KMeans(n_clusters = 6, random_state = random_seed)
  data['cluster'] = km.fit_predict(data_km)
  
  # renaming column names, by removing char in the string
  data.rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', x), inplace=True)
  data.drop(['Tour_ID'], axis = 1, inplace = True)
  
  # getting train and test dataset
  train = data[data.cost_category.notnull()].reset_index(drop = True)
  test = data[data.cost_category.isna()].reset_index(drop = True)
  return train, test

train, test = feature_engineering(preprocessed_data, cont_coun, lat_lon)
train.shape, test.shape

((18506, 65), (6169, 65))
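
Two of the steps above, frequency encoding and group means broadcast back to each row, are easier to see on a small frame. This is a minimal sketch on invented data, not the notebook's exact code.

```python
import pandas as pd

# Toy data; the column names follow the notebook, the values are invented.
toy = pd.DataFrame({
    "country": ["Kenya", "Kenya", "Italy", "Kenya"],
    "total_people": [2, 4, 1, 3],
})

# Frequency encoding: replace each category with its share of rows.
freq = toy["country"].map(toy["country"].value_counts(normalize=True))

# Group mean broadcast back to every row, so the result has the same length as the input.
mean_people = toy.groupby("country")["total_people"].transform("mean")

print(freq.tolist())         # [0.75, 0.75, 0.25, 0.75]
print(mean_people.tolist())  # [3.0, 3.0, 1.0, 3.0]
```

Note that in `feature_engineering` the `country` column has already been replaced by its frequency when `country_by_people` is computed, so countries that share the same frequency end up in the same group.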

### MODELLING and evaluation

In [10]:
X, y = train.drop('cost_category', axis = 1), train['cost_category']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .15, shuffle = True, random_state = random_seed)

#### BASELINE MODEL

##### CATBOOST

In [11]:
d_train, d_test  = Pool(X_train, y_train), Pool(X_test, y_test)
cb_model_ = CatBoostClassifier(l2_leaf_reg = 9.441413522475084, depth = 7, bootstrap_type = 'Bayesian', learning_rate = 0.01772339213540557, n_estimators = 3167, use_best_model = True,
                                                 leaf_estimation_iterations = 1, random_strength = 0.17095032711212016, loss_function = 'MultiClass', verbose = 0, random_state = random_seed)
cb_model_.fit(d_train, eval_set = [(d_test)], verbose = 0, early_stopping_rounds = 500)
preds_ = cb_model_.predict_proba(d_test)
round(log_loss(y_test, preds_), 5)

1.03045

##### LIGHTGBM

In [12]:
lgb_model_ = LGBMClassifier(boosting_type = 'gbdt', objective = 'multiclass', metric = 'multi_logloss', n_estimators = 3000, learning_rate = 0.01,  # use_best_model is a CatBoost parameter, so it is dropped here
                                             num_leaves = 45, colsample_bytree = 0.5, subsample = 0.9, subsample_freq = 1, max_depth = 6, reg_alpha = 0.8, reg_lambda = 0.8,
                                             min_split_gain = 0.05, min_child_weight = 0.05, random_state = random_seed, num_class = 6, silent = -1, verbose = -1)
lgb_model_.fit(X_train, y_train, eval_set = [(X_train, y_train), (X_test, y_test)], early_stopping_rounds = 500, eval_metric = 'logloss', verbose = 0)
preds_ = lgb_model_.predict_proba(X_test)
round(log_loss(y_test, preds_), 5)

1.03165
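
To put these scores in context: multi-class log loss is the mean negative log-probability assigned to the true class, so predicting a uniform 1/6 for each of the six classes scores ln 6 ≈ 1.792. The sketch below checks this on invented labels with a hand-written helper, `manual_log_loss`, which is an illustration rather than part of the notebook.

```python
import numpy as np
from sklearn.metrics import log_loss

# The six class names, in sorted order.
labels = ["High Cost", "Higher Cost", "Highest Cost", "Low Cost", "Lower Cost", "Normal Cost"]
y_true = np.array(["Low Cost", "Normal Cost", "High Cost", "Higher Cost"])  # invented labels

# Uniform prediction: probability 1/6 for every class.
uniform = np.full((len(y_true), len(labels)), 1 / len(labels))
uniform_loss = log_loss(y_true, uniform, labels=labels)

def manual_log_loss(y, proba, classes):
    """Mean negative log-probability assigned to the true class."""
    col = {c: i for i, c in enumerate(classes)}
    picked = np.array([proba[i, col[label]] for i, label in enumerate(y)])
    return -np.mean(np.log(picked))

manual_loss = manual_log_loss(y_true, uniform, labels)
print(uniform_loss, manual_loss, np.log(6))
```

All three printed values agree, so a validation loss near 1.03 is a clear improvement over uniform guessing.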

### CROSS-VALIDATION

In [13]:
# creating folds to be used for cross-validation
TARGET_COL = 'cost_category'
remove_features = ['cost_category', 'folds']
features_columns = [col for col in train.columns if col not in remove_features]
def create_folds(data):
    data["folds"] = -1
    # shuffle with a fixed seed so the folds are reproducible
    data = data.sample(frac = 1, random_state = random_seed).reset_index(drop = True)
    # encode the target after shuffling so that each label lines up with its row
    cat = le.fit_transform(data[TARGET_COL])
    num_bins = np.floor(1 + np.log2(len(data))).astype(int)
    data.loc[:, "bins"] = pd.cut(cat, bins = num_bins, labels = False)
    kf = StratifiedKFold(n_splits = 15)
    for f, (t_, v_) in enumerate(kf.split(X = data, y = data.bins.values)):
        data.loc[v_, "folds"] = f
    data.drop("bins", axis = 1, inplace = True)
    return data
train = create_folds(train)
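
Because the target is already a six-level category, the binning step is optional: `StratifiedKFold` can stratify on the labels directly. A minimal sketch of that alternative, on invented data and with a hypothetical helper name, might look like this.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy target with an imbalanced class mix (values are invented).
toy = pd.DataFrame({"cost_category": ["Low Cost"] * 30 + ["Normal Cost"] * 60 + ["High Cost"] * 10})

def add_stratified_folds(data, target, n_splits=5, seed=2001):
    """Return a copy of `data` with a `folds` column stratified on `target`."""
    data = data.reset_index(drop=True).copy()
    data["folds"] = -1
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fold, (_, valid_idx) in enumerate(skf.split(data, data[target])):
        data.loc[valid_idx, "folds"] = fold
    return data

folded = add_stratified_folds(toy, "cost_category")
counts = pd.crosstab(folded["folds"], folded["cost_category"])
print(counts)  # every fold holds 2 High, 6 Low and 12 Normal rows
```

The cross-tabulation is a quick way to confirm that every fold keeps the class mix of the full training set.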

##### CATBOOST

In [14]:
log_loss_score_ = []
print("-" * 30)
n_splits = 15
for fold in range(n_splits):
  x_train_, y_train_ = train[train['folds']!=fold][features_columns] , train[train['folds']!=fold][TARGET_COL] 
  x_test_, y_test_ = train[train['folds']==fold][features_columns] , train[train['folds']==fold][TARGET_COL] 
  d_train = Pool(x_train_, y_train_)
  d_test = Pool(x_test_, y_test_)
  model_cb = CatBoostClassifier(l2_leaf_reg = 9.441413522475084, depth = 7, bootstrap_type = 'Bayesian', learning_rate = 0.01772339213540557, n_estimators = 3167, use_best_model = True,
                                                 leaf_estimation_iterations = 1, random_strength = 0.17095032711212016, loss_function = 'MultiClass', verbose = 0, random_state = random_seed)
  model_cb.fit(d_train, eval_set = [(d_train), (d_test)], verbose = 0, early_stopping_rounds = 500)
  preds_ = model_cb.predict_proba(d_test)
  log_loss_ = log_loss(y_test_, preds_)
  print(f'LOG_LOSS_{fold + 1}: {log_loss_}')
  log_loss_score_.append(log_loss_)
  print("-" * 30)
print('\n')
print(f"LOG_LOSS_CV_CB: {np.mean(log_loss_score_).round(5)}")

------------------------------
LOG_LOSS_1: 1.0557587942913207
------------------------------
LOG_LOSS_2: 1.0925862059968245
------------------------------
LOG_LOSS_3: 1.0402501930017034
------------------------------
LOG_LOSS_4: 1.034107627439311
------------------------------
LOG_LOSS_5: 1.0593547324176908
------------------------------
LOG_LOSS_6: 1.0549207403493077
------------------------------
LOG_LOSS_7: 1.0407737579244611
------------------------------
LOG_LOSS_8: 1.0869126688266508
------------------------------
LOG_LOSS_9: 1.0344020700567562
------------------------------
LOG_LOSS_10: 1.0552004593679924
------------------------------
LOG_LOSS_11: 1.0397300234388287
------------------------------
LOG_LOSS_12: 1.0344587168314696
------------------------------
LOG_LOSS_13: 1.0602824873411716
------------------------------
LOG_LOSS_14: 1.0524757646426146
------------------------------
LOG_LOSS_15: 1.0534895988112714
------------------------------


LOG_LOSS_CV_CB: 1.05298


##### LIGHTGBM

In [15]:
log_loss_score_lgb = []
print("-" * 30)
n_splits = 15
for fold in range(n_splits):
  x_train_, y_train_ = train[train['folds']!=fold][features_columns] , train[train['folds']!=fold][TARGET_COL] 
  x_test_, y_test_ = train[train['folds']==fold][features_columns] , train[train['folds']==fold][TARGET_COL] 
  model_lgb = LGBMClassifier(boosting_type = 'gbdt', objective = 'multiclass', metric = 'multi_logloss', n_estimators = 3000, learning_rate = 0.01, 
                              num_leaves = 45, colsample_bytree = 0.8, subsample = 0.9, subsample_freq = 1, max_depth = 6, reg_alpha = 0.5, reg_lambda = 0.5, 
                              min_split_gain = 0.05, min_child_weight = 0.05, random_state = random_seed, num_class = 6, silent = -1, verbose = -1)
  model_lgb.fit(x_train_, y_train_, eval_set = [(x_train_, y_train_), (x_test_, y_test_)], early_stopping_rounds = 300, eval_metric = 'logloss', verbose = 0)
  preds_ = model_lgb.predict_proba(x_test_)
  log_loss_ = log_loss(y_test_, preds_)
  print(f'LOG_LOSS_{fold + 1}: {log_loss_}')
  log_loss_score_lgb.append(log_loss_)
  print("-" * 30)
print('\n')
print(f"LOG_LOSS_CV_LGB: {np.mean(log_loss_score_lgb).round(5)}")

------------------------------
LOG_LOSS_1: 1.0620548180422837
------------------------------
LOG_LOSS_2: 1.09405293911991
------------------------------
LOG_LOSS_3: 1.0400865489768625
------------------------------
LOG_LOSS_4: 1.0370939084571578
------------------------------
LOG_LOSS_5: 1.0607797717165917
------------------------------
LOG_LOSS_6: 1.060595525158828
------------------------------
LOG_LOSS_7: 1.0423544809683352
------------------------------
LOG_LOSS_8: 1.094199925132832
------------------------------
LOG_LOSS_9: 1.0407448597036435
------------------------------
LOG_LOSS_10: 1.058403559178775
------------------------------
LOG_LOSS_11: 1.0510352523487851
------------------------------
LOG_LOSS_12: 1.0388229842521257
------------------------------
LOG_LOSS_13: 1.0672714542995443
------------------------------
LOG_LOSS_14: 1.055973995534415
------------------------------
LOG_LOSS_15: 1.0514247197482978
------------------------------


LOG_LOSS_CV_LGB: 1.05699
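
After the loops above, `model_cb` and `model_lgb` hold only the model from the last fold. A common alternative, which this notebook does not use, is to predict the test set with every fold model and average the probabilities. The sketch below shows the idea with synthetic data and `LogisticRegression` as a lightweight stand-in for CatBoost or LightGBM.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic 3-class problem; X_new stands in for the unlabeled test set.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_new = X[:20]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_preds = []
for train_idx, _ in skf.split(X, y):
    # Fit one model per fold and keep its probabilities for the new rows.
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    fold_preds.append(model.predict_proba(X_new))

# Average across folds; each row is still a probability distribution.
avg_pred = np.mean(fold_preds, axis=0)
print(avg_pred.shape)  # (20, 3)
```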


### HYPERPARAMETER TUNING USING OPTUNA

#### CATBOOST

In [16]:
# def objective(trial):
#     d_train = Pool(X_train, y_train)
#     d_test = Pool(X_test, y_test)
#     param = {'l2_leaf_reg' : trial.suggest_float('l2_leaf_reg', 9, 10),
#                    'depth' : trial.suggest_int('depth', 7, 9),
#                    'learning_rate' : trial.suggest_float('learning_rate', 0.01, 0.02),
#                    'n_estimators' : trial.suggest_int('n_estimators', 3000, 3500),
#                    'random_strength' : trial.suggest_float('random_strength', 0.1, 0.2)}

#     cat = CatBoostClassifier(**param, bootstrap_type = 'Bayesian', loss_function = 'MultiClass', leaf_estimation_iterations = 1, random_state = random_seed, verbose = 0, use_best_model = True,)
#     cat.fit(d_train, eval_set = [(d_test)], verbose = 0, early_stopping_rounds = 500)
#     pred = cat.predict_proba(d_test)
#     return log_loss(y_test, pred)

# study = optuna.create_study(direction = "minimize")
# study.optimize(objective, n_trials = 50)

In [17]:
# trial = study.best_trial
# print('LOG_LOSS: {}'.format(trial.value))
# print('Best Parameters: {}'.format(trial.params))

### SUBMISSION

In [18]:
test.drop('cost_category', axis = 1, inplace = True)
def predict_and_submit(test_, filename):
    d_ = {'Tour_ID' : sub['Tour_ID'], 'High Cost' : test_[:, 0], 'Higher Cost' : test_[:, 1], 'Highest Cost' : test_[:, 2], 'Low Cost' : test_[:, 3], 'Lower Cost' : test_[:, 4], 'Normal Cost' : test_[:, 5]}
    df_ = pd.DataFrame(data = d_)
    df_ = df_[['Tour_ID', 'High Cost', 'Higher Cost', 'Highest Cost', 'Low Cost', 'Lower Cost', 'Normal Cost']]
    df_.to_csv(f'{path}/{filename}.csv', index = False)
    return df_.shape
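
`predict_and_submit` hard-codes the column order, which works because the fitted classifiers sort the class labels alphabetically. A slightly more defensive sketch reads the order from the model's `classes_` attribute instead. The toy data, the `tour_*` IDs and the helper `to_submission` are illustrative assumptions, with `LogisticRegression` standing in for the boosted models.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy features and labels (invented); only three of the six classes are used here.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = np.array(["Normal Cost", "High Cost", "Low Cost"] * 20)

model = LogisticRegression(max_iter=500).fit(X, y)
proba = model.predict_proba(X[:4])

def to_submission(ids, proba, classes):
    """Build a submission frame whose columns are named from the fitted class order."""
    out = pd.DataFrame(proba, columns=list(classes))
    out.insert(0, "Tour_ID", list(ids))
    return out

submission = to_submission(["tour_1", "tour_2", "tour_3", "tour_4"], proba, model.classes_)
print(list(submission.columns))  # ['Tour_ID', 'High Cost', 'Low Cost', 'Normal Cost']
```

Both `CatBoostClassifier` and `LGBMClassifier` expose `classes_` after fitting, so the same pattern should carry over to the models used above.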

In [19]:
y_a = cb_model_.predict_proba(test)
y_b = lgb_model_.predict_proba(test)
y_c = model_cb.predict_proba(test)
y_d = model_lgb.predict_proba(test)

In [20]:
pred = (y_a * 0.5 + y_b * 0.5) * 0.5 + (y_c * 0.5 + y_d * 0.5) * 0.5
predict_and_submit(pred, 'ai_cat_1.06_')

(6169, 7)

##### MORE ENSEMBLING

In [21]:
a = pd.read_csv(f'{path}/ai_cat_1.06_.csv')
b = pd.read_csv(f'{path}/ai_cat_1.06_.csv')
c = pd.read_csv(f'{path}/ai_cat_1.06_.csv')
a = a.drop('Tour_ID', axis=1)
b = b.drop('Tour_ID', axis=1)
c = c.drop('Tour_ID', axis=1)

In [22]:
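# a, b and c all load the same file above; the weights only matter once they point to different submissions
stack2 = (0.9 * a + 0.14 * b + 0.1 * c)
# the weights sum to 1.14, so rescale each row to sum to 1 before submitting
stack2 = stack2.div(stack2.sum(axis = 1), axis = 0).round(5)
# submit the blended probabilities (stack2) rather than the earlier pred
predict_and_submit(stack2.values, 'ense210000')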

(6169, 7)
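
When blending probability matrices, the weights should sum to 1 so that each row stays a valid distribution. A minimal sketch of a blend that rescales the weights itself follows; the function name `blend` and the toy matrices are assumptions for illustration.

```python
import numpy as np

def blend(probas, weights):
    """Weighted average of probability matrices, with weights rescaled to sum to 1."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    stacked = np.stack(probas)  # shape: (n_models, n_rows, n_classes)
    return np.tensordot(weights, stacked, axes=1)

# Two invented prediction matrices for two rows and three classes.
p1 = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
p2 = np.array([[0.5, 0.3, 0.2], [0.3, 0.4, 0.3]])

blended = blend([p1, p2], weights=[0.9, 0.3])  # rescaled to 0.75 and 0.25
print(blended[0])           # approximately [0.65, 0.225, 0.125]
print(blended.sum(axis=1))  # each row sums to 1
```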

References:
>Discussion forum: https://zindi.africa/competitions/ai4d-lab-tanzania-tourism-classification-challenge/discussions/12021
> External dataset for country to latitude and longitude mapping: <br>https://www.kaggle.com/datasets/paultimothymooney/latitude-and-longitude-for-every-country-and-state<br>
> External dataset for country to continent mapping: https://github.com/dbouquin/IS_608/blob/master/NanosatDB_munging/Countries-Continents.csv