## Project Information:
The Tanzanian tourism sector plays a significant role in the Tanzanian economy, contributing about 17% to the country’s GDP and 25% of all foreign exchange revenues. The sector, which provides direct employment for more than 600,000 people and up to 2 million people indirectly, generated approximately $2.4 billion in 2018 according to government statistics. Tanzania received a record 1.1 million international visitor arrivals in 2014, mostly from Europe, the US and Africa.

Tanzania is the only country in the world which has allocated more than 25% of its total area for wildlife, national parks, and protected areas. There are 16 national parks in Tanzania, 28 game reserves, 44 game-controlled areas, two marine parks and one conservation area.

Tanzania’s tourist attractions include the Serengeti plains, which hosts the largest terrestrial mammal migration in the world; the Ngorongoro Crater, the world’s largest intact volcanic caldera and home to the highest density of big game in Africa; Kilimanjaro, Africa’s highest mountain; and the Mafia Island marine park; among many others. The scenery, topography, rich culture and very friendly people provide for excellent cultural tourism, beach holidays, honeymooning, game hunting, historical and archaeological ventures – and certainly the best wildlife photography safaris in the world.

The objective of this hackathon is to develop a machine learning model to predict what a tourist will spend when visiting Tanzania. The model can be used by different tour operators and the Tanzania Tourism Board to automatically help tourists across the world estimate their expenditure before visiting Tanzania.

## Available Data
The dataset contains 6476 rows of up-to-date information on tourist expenditure collected by the National Bureau of Statistics (NBS) in Tanzania. The dataset was collected to gain a better understanding of the status of the tourism sector and provide an instrument that will enable sector growth.

Your goal is to accurately predict tourist expenditure when visiting Tanzania.

The majority of the visitors under the age group of 25-44 came for business (18.5%), or leisure and holidays (53.2%), which is consistent with the fact that they are economically more productive. Those at the age group of 45-64 were more prominent in holiday making and visiting friends and relatives. The results further reveal that most visitors belonging to the age group of 18-24 came for leisure and holidays (55.3%) as well as volunteering (13.7%). The majority of senior citizens (65 and above) came for leisure and holidays (80.9%) and visiting friends and relatives (9.5%).

The survey covers seven departure points, namely: Julius Nyerere International Airport, Kilimanjaro International Airport, Abeid Amani Karume International Airport, and the Namanga, Tunduma, Mtukula and Manyovu border points.

In [1]:
#Import the necessary packages to help in simple data analysis
# Helps reading/loading and manipulating dataframes or comma separated files
import pandas as pd 

#Array manipulations
import numpy as np

#visualization packages
import matplotlib.pyplot as plt
import seaborn as sns

## Load the training and testing data

In [2]:
train=pd.read_csv('Train.csv')
test=pd.read_csv('Test.csv')

In [3]:
#set ID as the index for easier Identification
train=train.set_index('ID')
test=test.set_index('ID')

In [4]:
test.describe()

,total_female,total_male,night_mainland,night_zanzibar
count,1600.0,1599.0,1601.0,1601.0
mean,0.925625,1.056911,8.741412,2.495315
std,1.169807,1.309879,19.78849,6.266489
min,0.0,0.0,0.0,0.0
25%,0.0,1.0,2.0,0.0
50%,1.0,1.0,5.0,0.0
75%,1.0,1.0,10.0,4.0
max,20.0,40.0,664.0,174.0


## Data Analysis: Preprocessing
In this section we analyse the data: we look out for categorical features (which must be encoded as numerical values) and deal with missing values (by dropping or imputing them). We will also perform data visualization to identify distributions in the data. This will help us choose a good machine learning model.

In [5]:
#Check the shapes of our training data
train.shape

(4809, 22)

In [6]:
#Check the percentage of missing values in each column
print('Percentage of Missing Data for each Column')
train.isna().sum()/train.shape[0]*100

Percentage of Missing Data for each Column


country                   0.000000
age_group                 0.000000
travel_with              23.164899
total_female              0.062383
total_male                0.103972
purpose                   0.000000
main_activity             0.000000
info_source               0.000000
tour_arrangement          0.000000
package_transport_int     0.000000
package_accomodation      0.000000
package_food              0.000000
package_transport_tz      0.000000
package_sightseeing       0.000000
package_guided_tour       0.000000
package_insurance         0.000000
night_mainland            0.000000
night_zanzibar            0.000000
payment_mode              0.000000
first_trip_tz             0.000000
most_impressing           6.508630
total_cost                0.000000
dtype: float64

#### From the output above we note that the travel_with column has 1114 missing values, which accounts for 23.16% of the rows. We drop this column since there is no reliable way of imputing the missing values. The most_impressing column has 6.5% missing and we drop it as well. Note that total_female and total_male are float values and could be imputed using a mean or median, but we choose to drop the affected rows instead. This choice is based on the fact that only a few rows are missing and imputing them would most probably need domain knowledge. We could also bring the dropped features back if the model turns out to be unsatisfactory.
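As a point of comparison with dropping rows, the median imputation mentioned above could look roughly like the following sketch. The toy frames and their values are invented for illustration; only the column names `total_female` and `total_male` come from the dataset. The medians are learned from the training frame and reused for the test frame, so no information flows from test to train.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the train/test frames (values are invented for illustration)
toy_train = pd.DataFrame({
    "total_female": [1.0, 0.0, np.nan, 2.0],
    "total_male": [1.0, np.nan, 1.0, 3.0],
})
toy_test = pd.DataFrame({
    "total_female": [np.nan, 1.0],
    "total_male": [0.0, np.nan],
})

count_cols = ["total_female", "total_male"]

# Learn the per-column medians on the training data only
medians = toy_train[count_cols].median()

# Fill the gaps in both frames with the training medians (matched by column name)
toy_train_imputed = toy_train.fillna(medians)
toy_test_imputed = toy_test.fillna(medians)

print(medians)
print(toy_test_imputed)
```

Unlike `dropna`, this keeps every test row, which matters when a prediction is required for each ID.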

In [7]:
#Drop the travel_with and most_impressing columns
train1=train.drop(['travel_with','most_impressing'],axis=1)

#The same applies to the test data, since these columns no longer exist in train.
test1=test.drop(['travel_with','most_impressing'],axis=1)


In [8]:
#we can now drop the rows with missing values (test and train)
test1.dropna(inplace=True)
train1.dropna(inplace=True)

In [9]:
#We get the general statistics of the data. This helps us identify the min, max, mean etc. of the data.
#Most importantly we get the unique counts for the categorical data.
train1.describe(include='all').T

,count,unique,top,freq,mean,std,min,25%,50%,75%,max
country,4801.0,105.0,UNITED STATES OF AMERICA,694.0,,,,,,,
age_group,4801.0,4.0,25-44,2482.0,,,,,,,
total_female,4801.0,,,,0.923349,1.278328,0.0,0.0,1.0,1.0,49.0
total_male,4801.0,,,,1.008956,1.138403,0.0,1.0,1.0,1.0,44.0
purpose,4801.0,7.0,Leisure and Holidays,2835.0,,,,,,,
main_activity,4801.0,9.0,Wildlife tourism,2253.0,,,,,,,
info_source,4801.0,8.0,"Travel, agent, tour operator",1909.0,,,,,,,
tour_arrangement,4801.0,2.0,Independent,2568.0,,,,,,,
package_transport_int,4801.0,2.0,No,3353.0,,,,,,,
package_accomodation,4801.0,2.0,No,2600.0,,,,,,,


In [10]:
#We note that most of the features are categorical and therefore we need to convert them to numerical values,
#a process called feature encoding. We will deal with each categorical variable at a time. The encoding will
#be done both on training and testing. We use inbuilt methods to filter categorical variables
from sklearn.compose import make_column_selector as selector

categorical_columns_selector = selector(dtype_include=object)
categorical_columns_train = categorical_columns_selector(train1)
categorical_columns_test = categorical_columns_selector(test1)

print('train: ',categorical_columns_train)
print('test: ',categorical_columns_test)

#confirm that we have the same columns
assert categorical_columns_train==categorical_columns_test

train:  ['country', 'age_group', 'purpose', 'main_activity', 'info_source', 'tour_arrangement', 'package_transport_int', 'package_accomodation', 'package_food', 'package_transport_tz', 'package_sightseeing', 'package_guided_tour', 'package_insurance', 'payment_mode', 'first_trip_tz']
test:  ['country', 'age_group', 'purpose', 'main_activity', 'info_source', 'tour_arrangement', 'package_transport_int', 'package_accomodation', 'package_food', 'package_transport_tz', 'package_sightseeing', 'package_guided_tour', 'package_insurance', 'payment_mode', 'first_trip_tz']


In [11]:
#Encoding the age_group: it has four age ranges, which we treat as an ordinal variable. We set
#1-24 to 0, 25-44 to 1, 45-64 to 2 and 65+ to 3. Note that the youngest range appears as '1-24' in training
#but as '24-Jan' in test; both denote the same range. We apply the DataFrame replace method.
train1['age_group']=train1['age_group'].replace(['45-64','25-44', '1-24', '65+'],[2,1,0,3])
test1['age_group']=test1['age_group'].replace(['45-64', '25-44', '24-Jan', '65+'],[2,1,0,3])

print('train',train1['age_group'].unique())
print('test1',test1['age_group'].unique())

train [2 1 0 3]
test1 [2 1 0 3]
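The same ordinal encoding can also be written as a single dictionary shared by train and test. This is a minimal sketch, not the approach used in the cell above: the helper name `encode_age_group` is hypothetical, and treating `'24-Jan'` as a mangled spelling of `'1-24'` is an assumption carried over from the comment above. The function fails loudly if an unexpected label appears instead of silently leaving it unencoded.

```python
import pandas as pd

# Ordinal codes for the age bands; '24-Jan' is assumed to be a mangled form of '1-24'
AGE_CODES = {"1-24": 0, "24-Jan": 0, "25-44": 1, "45-64": 2, "65+": 3}

def encode_age_group(series: pd.Series) -> pd.Series:
    """Map age-band labels to ordinal integers, raising on unknown labels."""
    encoded = series.map(AGE_CODES)
    unknown = series[encoded.isna()].unique()
    if len(unknown) > 0:
        raise ValueError(f"Unmapped age_group labels: {list(unknown)}")
    return encoded.astype(int)

toy = pd.Series(["45-64", "25-44", "24-Jan", "65+", "1-24"])
print(encode_age_group(toy).tolist())
```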


In [12]:
#The country column contains 105 unique entries. We group the countries by continent
#to reduce the number of distinct values. We first correct the misspelt names, then use the
#python package awoc to convert countries to continents
import awoc
my_world=awoc.AWOC()

old_countries_train=['SWIZERLAND', 'UNITED KINGDOM', 'CHINA', 'SOUTH AFRICA',
       'UNITED STATES OF AMERICA', 'NIGERIA', 'INDIA', 'BRAZIL', 'CANADA',
       'MALT', 'MOZAMBIQUE', 'RWANDA', 'AUSTRIA', 'MYANMAR', 'GERMANY',
       'KENYA', 'ALGERIA', 'IRELAND', 'DENMARK', 'SPAIN', 'FRANCE',
       'ITALY', 'EGYPT', 'QATAR', 'MALAWI', 'JAPAN', 'SWEDEN',
       'NETHERLANDS', 'UAE', 'UGANDA', 'AUSTRALIA', 'YEMEN',
       'NEW ZEALAND', 'BELGIUM', 'NORWAY', 'ZIMBABWE', 'ZAMBIA', 'CONGO',
       'BURGARIA', 'PAKISTAN', 'GREECE', 'MAURITIUS', 'DRC', 'OMAN',
       'PORTUGAL', 'KOREA', 'SWAZILAND', 'TUNISIA', 'KUWAIT', 'DOMINICA',
       'ISRAEL', 'FINLAND', 'CZECH REPUBLIC', 'UKRAIN', 'ETHIOPIA',
       'BURUNDI', 'SCOTLAND', 'RUSSIA', 'GHANA', 'NIGER', 'MALAYSIA',
       'COLOMBIA', 'LUXEMBOURG', 'NEPAL', 'POLAND', 'SINGAPORE',
       'LITHUANIA', 'HUNGARY', 'INDONESIA', 'TURKEY', 'TRINIDAD TOBACCO',
       'IRAQ', 'SLOVENIA', 'UNITED ARAB EMIRATES', 'COMORO', 'SRI LANKA',
       'IRAN', 'MONTENEGRO', 'ANGOLA', 'LEBANON', 'SLOVAKIA', 'ROMANIA',
       'MEXICO', 'LATVIA', 'CROATIA', 'CAPE VERDE', 'SUDAN', 'COSTARICA',
       'CHILE', 'NAMIBIA', 'TAIWAN', 'SERBIA', 'LESOTHO', 'GEORGIA',
       'PHILIPINES', 'IVORY COAST', 'MADAGASCAR', 'DJIBOUT', 'CYPRUS',
       'ARGENTINA', 'URUGUAY', 'MORROCO', 'THAILAND', 'BERMUDA',
       'ESTONIA']
new_countries_train=['SWITZERLAND', 'UNITED KINGDOM', 'CHINA', 'SOUTH AFRICA',
       'UNITED STATES', 'NIGERIA', 'INDIA', 'BRAZIL', 'CANADA',
       'MALTA', 'MOZAMBIQUE', 'RWANDA', 'AUSTRIA', 'MYANMAR', 'GERMANY',
       'KENYA', 'ALGERIA', 'IRELAND', 'DENMARK', 'SPAIN', 'FRANCE',
       'ITALY', 'EGYPT', 'QATAR', 'MALAWI', 'JAPAN', 'SWEDEN',
       'NETHERLANDS', 'United Arab Emirates', 'UGANDA', 'AUSTRALIA', 'YEMEN',
       'NEW ZEALAND', 'BELGIUM', 'NORWAY', 'ZIMBABWE', 'ZAMBIA', 'Democratic Republic of the Congo',
       'BULGARIA', 'PAKISTAN', 'GREECE', 'MAURITIUS', 'Democratic Republic of the Congo', 'OMAN',
       'PORTUGAL', 'North KOREA', 'SWAZILAND', 'TUNISIA', 'KUWAIT', 'DOMINICA',
       'ISRAEL', 'FINLAND', 'CZECH REPUBLIC', 'UKRAINE', 'ETHIOPIA',
       'BURUNDI', 'United Kingdom', 'RUSSIA', 'GHANA', 'NIGER', 'MALAYSIA',
       'COLOMBIA', 'LUXEMBOURG', 'NEPAL', 'POLAND', 'SINGAPORE',
       'LITHUANIA', 'HUNGARY', 'INDONESIA', 'TURKEY', 'Trinidad and Tobago',
       'IRAQ', 'SLOVENIA', 'UNITED ARAB EMIRATES', 'COMOROS', 'SRI LANKA',
       'IRAN', 'MONTENEGRO', 'ANGOLA', 'LEBANON', 'SLOVAKIA', 'ROMANIA',
       'MEXICO', 'LATVIA', 'CROATIA', 'CAPE VERDE', 'SUDAN', 'COSTA RICA',
       'CHILE', 'NAMIBIA', 'TAIWAN', 'SERBIA', 'LESOTHO', 'GEORGIA',
       'PHILIPPINES', 'IVORY COAST', 'MADAGASCAR', 'DJIBOUTI', 'CYPRUS',
       'ARGENTINA', 'URUGUAY', 'MOROCCO', 'THAILAND', 'BERMUDA',
       'ESTONIA']
old_test=['AUSTRALIA', 'SOUTH AFRICA', 'GERMANY', 'CANADA', 'UNITED KINGDOM',
       'DENMARK', 'RUSSIA', 'FRANCE', 'SPAIN', 'SWIZERLAND',
       'UNITED STATES OF AMERICA', 'CHINA', 'INDIA', 'ZAMBIA',
       'NEW ZEALAND', 'COMORO', 'NETHERLANDS', 'MALAYSIA', 'KENYA',
       'ITALY', 'FINLAND', 'MALAWI', 'BELGIUM', 'NORWAY', 'MALT',
       'ETHIOPIA', 'OMAN', 'CZECH REPUBLIC', 'GHANA', 'UAE', 'PORTUGAL',
       'SINGAPORE', 'SWEDEN', 'UGANDA', 'BRAZIL', 'QATAR', 'UKRAIN',
       'ROMANIA', 'DRC', 'HUNGARY', 'RWANDA', 'AUSTRIA', 'BOTSWANA',
       'ZIMBABWE', 'IRELAND', 'JAPAN', 'IRAN', 'MOZAMBIQUE', 'SWAZILAND',
       'BULGARIA', 'ISRAEL', 'CHILE', 'SUDAN', 'BANGLADESH', 'SLOVAKIA',
       'COSTARICA', 'NAMIBIA', 'POLAND', 'DOMINICA', 'SCOTLAND', 'HAITI',
       'PAKISTAN', 'TAIWAN', 'PHILIPINES', 'VIETNAM', 'SERBIA', 'BURUNDI',
       'BOSNIA', 'LIBERIA', 'PERU', 'GREECE', 'INDONESIA', 'LEBANON',
       'CAPE VERDE', 'JAMAICA', 'UNITED ARAB EMIRATES', 'MORROCO',
       'EGYPT', 'CYPRUS', 'MACEDONIA', 'CONGO', 'GUINEA', 'ARGENTINA',
       'YEMEN', 'SOMALI', 'KOREA', 'SAUD ARABIA']
new_test=['AUSTRALIA', 'SOUTH AFRICA', 'GERMANY', 'CANADA', 'UNITED KINGDOM',
       'DENMARK', 'RUSSIA', 'FRANCE', 'SPAIN', 'SWITZERLAND',
       'UNITED STATES', 'CHINA', 'INDIA', 'ZAMBIA',
       'NEW ZEALAND', 'COMOROS', 'NETHERLANDS', 'MALAYSIA', 'KENYA',
       'ITALY', 'FINLAND', 'MALAWI', 'BELGIUM', 'NORWAY', 'MALTA',
       'ETHIOPIA', 'OMAN', 'CZECH REPUBLIC', 'GHANA', 'United Arab Emirates', 'PORTUGAL',
       'SINGAPORE', 'SWEDEN', 'UGANDA', 'BRAZIL', 'QATAR', 'UKRAINE',
       'ROMANIA', 'Democratic Republic of the Congo', 'HUNGARY', 'RWANDA', 'AUSTRIA', 'BOTSWANA',
       'ZIMBABWE', 'IRELAND', 'JAPAN', 'IRAN', 'MOZAMBIQUE', 'SWAZILAND',
       'BULGARIA', 'ISRAEL', 'CHILE', 'SUDAN', 'BANGLADESH', 'SLOVAKIA',
       'COSTA RICA', 'NAMIBIA', 'POLAND', 'DOMINICA', 'United Kingdom', 'HAITI',
       'PAKISTAN', 'TAIWAN', 'PHILIPPINES', 'VIETNAM', 'SERBIA', 'BURUNDI',
       'Bosnia and Herzegovina', 'LIBERIA', 'PERU', 'GREECE', 'INDONESIA', 'LEBANON',
       'CAPE VERDE', 'JAMAICA', 'UNITED ARAB EMIRATES', 'MOROCcO',
       'EGYPT', 'CYPRUS', 'MACEDONIA', 'Democratic Republic of the Congo', 'GUINEA', 'ARGENTINA',
       'YEMEN', 'SOMALIA', 'North KOREA', 'SAUDi ARABIA']
train_continents=[]
for country in new_countries_train:
    data = my_world.get_country_data(country)
    train_continents.append(data.get('Continent Name'))
    
test_continents=[]
for country in new_test:
    data = my_world.get_country_data(country)
    test_continents.append(data.get('Continent Name'))

#We can now replace the countries with continents both for test and train
train1['country']=train1['country'].replace(old_countries_train,train_continents)
test1['country']=test1['country'].replace(old_test,test_continents)
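Two parallel lists are easy to get out of step when a name is added or removed. One possible alternative, sketched here under assumptions, is a single dictionary keyed by the raw spelling, with a visible fallback for names that have no entry yet. The lookup below is hand-written, covers only a few of the raw spellings, and does not use `awoc`; `to_continent` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical hand-written lookup covering only a few raw spellings from the data
RAW_TO_CONTINENT = {
    "SWIZERLAND": "Europe",
    "UNITED STATES OF AMERICA": "North America",
    "KENYA": "Africa",
    "UAE": "Asia",
    "UNITED ARAB EMIRATES": "Asia",
    "BRAZIL": "South America",
    "AUSTRALIA": "Oceania",
}

def to_continent(countries: pd.Series) -> pd.Series:
    """Replace raw country spellings with continents; unmapped names become 'Unknown'."""
    return countries.map(RAW_TO_CONTINENT).fillna("Unknown")

toy = pd.Series(["SWIZERLAND", "UAE", "KENYA", "ATLANTIS"])
mapped = to_continent(toy)
print(mapped.tolist())

# Names that still need an entry in the lookup
print(toy[mapped == "Unknown"].tolist())
```

Listing the unmapped names makes it easy to spot a country that appears only in the test data.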

In [13]:
#Now that we have transformed the countries to continents,
#we can check the other categorical variables to ensure that we don't have misspelt names or unknown values.
for a in categorical_columns_train:
    print(a)
    print("train",train1[a].unique())
    print("test",test1[a].unique())
    print(sorted (train1[a].unique())==sorted (test1[a].unique()))
    print()

country
train ['Europe' 'Asia' 'Africa' 'North America' 'South America' 'Oceania']
test ['Oceania' 'Africa' 'Europe' 'North America' 'Asia' 'South America']
True

age_group
train [2 1 0 3]
test [2 1 0 3]
True

purpose
train ['Leisure and Holidays' 'Visiting Friends and Relatives' 'Business'
 'Meetings and Conference' 'Volunteering' 'Scientific and Academic'
 'Other']
test ['Leisure and Holidays' 'Business' 'Volunteering'
 'Meetings and Conference' 'Visiting Friends and Relatives'
 'Scientific and Academic' 'Other']
True

main_activity
train ['Wildlife tourism' 'Cultural tourism' 'Mountain climbing' 'Beach tourism'
 'Conference tourism' 'Hunting tourism' 'Bird watching' 'business'
 'Diving and Sport Fishing']
test ['Wildlife tourism' 'Beach tourism' 'Cultural tourism' 'Mountain climbing'
 'business' 'Hunting tourism' 'Conference tourism' 'Bird watching'
 'Diving and Sport Fishing']
True

info_source
train ['Friends, relatives' 'others' 'Travel, agent, tour operator'
 'Radio, TV, Web' 'T

In [14]:
#from the output above we note that both the train and test have the same  values for categorical variables and 
#therefore we can apply one hot encoding to the whole dataset using the one hot encoder from sklearn
from sklearn.preprocessing import OneHotEncoder as OHE 

#get the categorical variables in both testing and training
categorical_columns_selector = selector(dtype_include=object)
categorical_columns_train = train1[categorical_columns_selector(train1)]
categorical_columns_test = test1[categorical_columns_selector(test1)]

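#Fit the encoder on the training data only, then reuse the fitted encoder for the test data
#(on scikit-learn >= 1.2 use sparse_output=False and get_feature_names_out() instead)
encoder=OHE(sparse=False,handle_unknown='ignore')
encoder.fit(categorical_columns_train)
tr=encoder.transform(categorical_columns_train)
te=encoder.transform(categorical_columns_test)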

train_enc=pd.DataFrame(tr)
train_enc.columns=encoder.get_feature_names()
train_enc.index=train1.index

test_enc=pd.DataFrame(te)
test_enc.columns=encoder.get_feature_names()

In [15]:
train_enc

ID,x0_Africa,x0_Asia,x0_Europe,x0_North America,x0_Oceania,x0_South America,x1_Business,x1_Leisure and Holidays,x1_Meetings and Conference,x1_Other,...,x10_No,x10_Yes,x11_No,x11_Yes,x12_Cash,x12_Credit Card,x12_Other,x12_Travellers Cheque,x13_No,x13_Yes
tour_0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
tour_10,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
tour_1000,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
tour_1002,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
tour_1004,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
tour_993,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
tour_994,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0
tour_995,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
tour_997,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0


In [16]:
train1.shape

(4801, 20)

In [17]:
train_enc.shape

(4801, 52)

In [18]:
#We can now concatenate the encoded columns to the original data and drop all categorical features
#concatenating train1 and train_enc along columns
train2 = pd.concat([train1, train_enc],axis=1)
train3=train2.drop(list(categorical_columns_train),axis=1)

#test: give test_enc the same index as test1 (same row order), then concatenate along columns
test_enc.index=test1.index
test2 = pd.concat([test1, test_enc], axis=1)
test3=test2.drop(list(categorical_columns_test),axis=1)

In [19]:
train2.head()

ID,country,age_group,total_female,total_male,purpose,main_activity,info_source,tour_arrangement,package_transport_int,package_accomodation,...,x10_No,x10_Yes,x11_No,x11_Yes,x12_Cash,x12_Credit Card,x12_Other,x12_Travellers Cheque,x13_No,x13_Yes
tour_0,Europe,2,1.0,1.0,Leisure and Holidays,Wildlife tourism,"Friends, relatives",Independent,No,No,...,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
tour_10,Europe,1,1.0,0.0,Leisure and Holidays,Cultural tourism,others,Independent,No,No,...,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
tour_1000,Europe,1,0.0,1.0,Visiting Friends and Relatives,Cultural tourism,"Friends, relatives",Independent,No,No,...,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
tour_1002,Europe,1,1.0,1.0,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Package Tour,No,Yes,...,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
tour_1004,Asia,0,1.0,0.0,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Independent,No,No,...,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0


In [20]:
train2.shape

(4801, 72)

## Modeling
 Create a baseline model. We use a Random Forest regressor with default hyperparameters as our baseline, as it is easy to implement.

In [21]:
#From here we split the training data into training and validation sets using the train_test_split method in sklearn
from sklearn.model_selection import train_test_split
X=train3.drop('total_cost',axis=1)
y=train3['total_cost']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.35)

In [22]:
#We apply Random Forest as our model of choice.
#We could apply many different algorithms, but Random Forest, which is an ensemble of decision trees, is sufficient
#for this task
from sklearn.ensemble import RandomForestRegressor
d_trees=RandomForestRegressor()
d_trees.fit(X_train,y_train)
print("Train score: ",d_trees.score(X_train,y_train))
print("Test score:", d_trees.score(X_test,y_test))

Train score:  0.8961444821306915
Test score: 0.3576085025027679
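For scikit-learn regressors, `score` returns the coefficient of determination (R squared), so the two numbers above are R squared on the training and held-out splits. The sketch below recomputes that quantity by hand on invented numbers and also shows the mean absolute error, which is expressed in the same units as `total_cost`. The function names are hypothetical helpers, not scikit-learn functions.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mae_manual(y_true, y_pred):
    """Mean absolute error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Invented targets and predictions, for illustration only
y_true_toy = [100.0, 200.0, 300.0, 400.0]
y_pred_toy = [110.0, 190.0, 310.0, 390.0]

print("R^2:", r2_manual(y_true_toy, y_pred_toy))
print("MAE:", mae_manual(y_true_toy, y_pred_toy))
```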


In [24]:
predictions = d_trees.predict(X_test)
predictions

array([30617059.42      ,  1267540.803     ,  9311457.022     , ...,
        5019577.8464    ,   510300.27175902,   715504.296     ])

In [25]:
#From the outputs above we note that our model is overfitting, i.e. the train score is much higher than
#the test score. Two possible approaches can mitigate this problem: 1) add more data points for training,
#2) do hyperparameter optimization. Here we do hyperparameter optimization using a randomized search.


In [26]:
#We now define the hyperparameter grid for the randomized search
from sklearn.model_selection import RandomizedSearchCV
import pprint
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
print(random_grid)

{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'bootstrap': [True, False]}

In [27]:
# Use the random grid to search for best hyperparameters

# Random search of parameters, using 3-fold cross-validation,
# searching across 100 different combinations and using all available cores
rf_random = RandomizedSearchCV(estimator = d_trees, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train, y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


RandomizedSearchCV(cv=3, estimator=RandomForestRegressor(), n_iter=100,
                   n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [10, 20, 30, 40, 50, 60,
                                                      70, 80, 90, 100, 110,
                                                      None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [200, 400, 600, 800,
                                                         1000, 1200, 1400, 1600,
                                                         1800, 2000]},
                   random_state=42, verbose=2)

In [28]:
#we check for the best parameters for the model 
rf_random.best_params_

{'n_estimators': 400,
 'min_samples_split': 10,
 'min_samples_leaf': 4,
 'max_features': 'auto',
 'max_depth': 70,
 'bootstrap': True}
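With the default `refit=True`, a fitted `RandomizedSearchCV` exposes the winning model directly as `best_estimator_`, so the chosen values do not have to be copied by hand. The sketch below illustrates this on synthetic data from `make_regression` with a deliberately tiny grid so that it runs in seconds. The data, the grid values, and the variable names are all assumptions for illustration and are unrelated to the tourism dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic regression data (not the tourism dataset)
X_toy, y_toy = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.35, random_state=0)

# A deliberately small grid so the search finishes quickly
small_grid = {
    "n_estimators": [20, 40],
    "max_depth": [3, 6, None],
    "min_samples_leaf": [1, 4],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=small_grid,
    n_iter=4,
    cv=3,
    random_state=42,
)
search.fit(X_tr, y_tr)

# best_estimator_ is already refit on all of X_tr with the best parameters found
tuned = search.best_estimator_
print(search.best_params_)
print("Validation R^2:", tuned.score(X_va, y_va))
```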

In [29]:
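#Refit the model with tuned hyperparameters and check for improvements.
#Note: the values set below differ from the rf_random.best_params_ output shown above;
#rf_random.best_estimator_ holds the model refit with the searched values.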
d_trees=RandomForestRegressor(n_estimators=800,min_samples_split=10,
 min_samples_leaf=2,
 max_features='sqrt',
 max_depth=20,
 bootstrap=False,verbose=0)
d_trees.fit(X_train,y_train)
print("Train score: ",d_trees.score(X_train,y_train))
print("Test score:", d_trees.score(X_test,y_test))

Train score:  0.6770495639891421
Test score: 0.3988870991626432


In [30]:
#We note that the test score improved (0.358 to 0.399) while the train score dropped (0.896 to 0.677),
#so the gap between the two narrowed and the model overfits less.
#More can be done through feature engineering informed by domain knowledge.
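As one illustration of what such feature engineering might look like, the sketch below derives a party size and a total trip length from the existing count columns and log-transforms the target. These derived features are hypothetical suggestions that were not evaluated in this notebook, `add_trip_features` is a made-up helper name, and the rows are invented; only the input column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Invented rows; only the column names come from the dataset
toy = pd.DataFrame({
    "total_female": [1.0, 0.0, 2.0],
    "total_male": [1.0, 1.0, 2.0],
    "night_mainland": [7.0, 3.0, 10.0],
    "night_zanzibar": [0.0, 4.0, 5.0],
    "total_cost": [500000.0, 1200000.0, 3000000.0],
})

def add_trip_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with party size and total nights added as new columns."""
    out = frame.copy()
    out["total_people"] = out["total_female"] + out["total_male"]
    out["total_nights"] = out["night_mainland"] + out["night_zanzibar"]
    return out

features = add_trip_features(toy)

# A log transform compresses the long right tail of a cost-like target;
# np.expm1 maps predictions back to the original scale
log_cost = np.log1p(features["total_cost"])
recovered = np.expm1(log_cost)

print(features[["total_people", "total_nights"]])
```

Whether any of these help would need to be checked against the held-out score.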

In [31]:
predictions = d_trees.predict(X_test)

In [32]:
predictions

array([25522813.23604026,  1019354.040043  , 11334016.42335319, ...,
        4320001.52173004,  1166098.59814054,  1010072.84527681])