# Zindi's Tanzania Tourism Prediction

The question is: Can you use tourism survey data and ML to predict how much money a tourist will spend when visiting Tanzania?

I will use regression to answer this question.

In [251]:
# import libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

This notebook will be divided into the following sections:
1. Loading and cleaning the data
2. EDA
3. Feature Engineering
4. Model Building
5. Model Evaluation and Visualization

## 1. Load and Clean the Data

Both the train and test datasets are available, so we load them together with the variable definitions and the sample submission.

In [252]:
train = pd.read_csv('../Data/Tz-Tourist-Prediction/Train.csv')
test = pd.read_csv('../Data/Tz-Tourist-Prediction/Test.csv')
VariableDefinitions = pd.read_csv('../Data/Tz-Tourist-Prediction/VariableDefinitions.csv')
SampleSubmission = pd.read_csv('../Data/Tz-Tourist-Prediction/SampleSubmission.csv')

With the data frames loaded, I will inspect what each one contains.

In [253]:
train.dtypes

ID                        object
country                   object
age_group                 object
travel_with               object
total_female             float64
total_male               float64
purpose                   object
main_activity             object
info_source               object
tour_arrangement          object
package_transport_int     object
package_accomodation      object
package_food              object
package_transport_tz      object
package_sightseeing       object
package_guided_tour       object
package_insurance         object
night_mainland           float64
night_zanzibar           float64
payment_mode              object
first_trip_tz             object
most_impressing           object
total_cost               float64
dtype: object

In [254]:
test.dtypes

ID                        object
country                   object
age_group                 object
travel_with               object
total_female             float64
total_male               float64
purpose                   object
main_activity             object
info_source               object
tour_arrangement          object
package_transport_int     object
package_accomodation      object
package_food              object
package_transport_tz      object
package_sightseeing       object
package_guided_tour       object
package_insurance         object
night_mainland             int64
night_zanzibar             int64
payment_mode              object
first_trip_tz             object
most_impressing           object
dtype: object

In [255]:
train.shape

(4809, 23)

In [256]:
SampleSubmission.head()

Unnamed: 0,ID,total_cost
0,tour_1,0
1,tour_100,0
2,tour_1001,0
3,tour_1006,0
4,tour_1009,0


What we are trying to predict is the total_cost column. It is our y (target). Now we must check each of the features in the train df to determine which ones are relevant to the prediction.
Of the columns in train, the submission keeps only ID and total_cost.

In [257]:
# check what is in the train dataframe
train.head()

Unnamed: 0,ID,country,age_group,travel_with,total_female,total_male,purpose,main_activity,info_source,tour_arrangement,...,package_transport_tz,package_sightseeing,package_guided_tour,package_insurance,night_mainland,night_zanzibar,payment_mode,first_trip_tz,most_impressing,total_cost
0,tour_0,SWIZERLAND,45-64,Friends/Relatives,1.0,1.0,Leisure and Holidays,Wildlife tourism,"Friends, relatives",Independent,...,No,No,No,No,13.0,0.0,Cash,No,Friendly People,674602.5
1,tour_10,UNITED KINGDOM,25-44,,1.0,0.0,Leisure and Holidays,Cultural tourism,others,Independent,...,No,No,No,No,14.0,7.0,Cash,Yes,"Wonderful Country, Landscape, Nature",3214906.5
2,tour_1000,UNITED KINGDOM,25-44,Alone,0.0,1.0,Visiting Friends and Relatives,Cultural tourism,"Friends, relatives",Independent,...,No,No,No,No,1.0,31.0,Cash,No,Excellent Experience,3315000.0
3,tour_1002,UNITED KINGDOM,25-44,Spouse,1.0,1.0,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Package Tour,...,Yes,Yes,Yes,No,11.0,0.0,Cash,Yes,Friendly People,7790250.0
4,tour_1004,CHINA,1-24,,1.0,0.0,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Independent,...,No,No,No,No,7.0,4.0,Cash,Yes,No comments,1657500.0


In [258]:
train.nunique()

ID                       4809
country                   105
age_group                   4
travel_with                 5
total_female               14
total_male                 14
purpose                     7
main_activity               9
info_source                 8
tour_arrangement            2
package_transport_int       2
package_accomodation        2
package_food                2
package_transport_tz        2
package_sightseeing         2
package_guided_tour         2
package_insurance           2
night_mainland             64
night_zanzibar             34
payment_mode                4
first_trip_tz               2
most_impressing             7
total_cost               1637
dtype: int64

In [259]:
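# Check for empty values
# NOTE: isnull() is applied twice here, so the second call tests the boolean
# mask (which never holds nulls) and every count comes out as 0.
# train.isnull().sum() gives the real per-column counts.
train.isnull().isnull().sum()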

ID                       0
country                  0
age_group                0
travel_with              0
total_female             0
total_male               0
purpose                  0
main_activity            0
info_source              0
tour_arrangement         0
package_transport_int    0
package_accomodation     0
package_food             0
package_transport_tz     0
package_sightseeing      0
package_guided_tour      0
package_insurance        0
night_mainland           0
night_zanzibar           0
payment_mode             0
first_trip_tz            0
most_impressing          0
total_cost               0
dtype: int64

We cannot proceed with missing values, which occur in travel_with, total_female, total_male and most_impressing. I'll use a different strategy for each.
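As a minimal sketch of the null check, the toy frame below (made-up values, not the competition data) shows why the exact call matters: `isnull()` returns a boolean mask that never contains nulls itself, so applying it twice always reports zero.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey data; the values are made up for illustration
toy = pd.DataFrame({
    "travel_with": ["Alone", None, "Spouse"],
    "total_female": [1.0, np.nan, 0.0],
    "night_mainland": [13.0, 14.0, 1.0],
})

# Correct: count the missing entries in each column
missing_counts = toy.isnull().sum()

# Doubled call: the boolean mask has no nulls, so every count is 0
doubled_counts = toy.isnull().isnull().sum()

print(missing_counts)
print(doubled_counts)
```

Running the single call on train should surface the columns that actually need a fill strategy.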

In [260]:
# forward fill
# train = train.fillna(method='ffill', axis=0)

In [261]:
# train.isnull().sum()

In [262]:
VariableDefinitions

Unnamed: 0,Column Name,Definition
0,id,Unique identifier for each tourist
1,country,The country a tourist coming from.
2,age_group,The age group of a tourist.
3,travel_with,The relation of people a tourist travel with t...
4,total_female,Total number of females
5,total_male,Total number of males
6,purpose,The purpose of visiting Tanzania
7,main_activity,The main activity of tourism in Tanzania
8,infor_source,The source of information about tourism in Tan...
9,tour_arrangment,The arrangment of visiting Tanzania


In [284]:
# Yes/No package columns ('tour_arrangement' is left out for now)
package_cols = ['package_transport_int',
                'package_accomodation',
                'package_food',
                'package_transport_tz',
                'package_sightseeing',
                'package_guided_tour',
                'package_insurance']

# Encode Yes/No as 1/0 on a new frame instead of modifying a slice of train in place
new_data = train[package_cols].replace({'Yes': 1, 'No': 0})

# Numeric columns that need no encoding
old_df = train[['total_female', 'total_male']]
older_df = train[['night_mainland', 'night_zanzibar']]

# Combine everything, including the traveller counts, into the feature matrix
encoded_data = pd.concat([old_df, new_data, older_df], axis=1)

encoded_data

Unnamed: 0,total_female,total_male,package_transport_int,package_accomodation,package_food,package_transport_tz,package_sightseeing,package_guided_tour,package_insurance,night_mainland,night_zanzibar
0,1.0,1.0,0,0,0,0,0,0,0,13.0,0.0
1,1.0,0.0,0,0,0,0,0,0,0,14.0,7.0
2,0.0,1.0,0,0,0,0,0,0,0,1.0,31.0
3,1.0,1.0,0,1,1,1,1,1,0,11.0,0.0
4,1.0,0.0,0,0,0,0,0,0,0,7.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...
4804,0.0,1.0,0,0,0,0,0,0,0,2.0,0.0
4805,1.0,1.0,1,1,1,1,1,1,1,11.0,0.0
4806,1.0,0.0,0,0,0,0,0,0,0,3.0,7.0
4807,1.0,1.0,1,1,1,0,0,0,0,5.0,0.0


In [283]:
# # Time to one hot encode the above
# # We first need to convert them into categorical labels

# def preprocessing(mydata):
#     """Preprocessing done here"""
#     # Fill empty rows
#     mydata = mydata.fillna(method='ffill', axis=0)
    
#     """Encoding formula"""
#     new_data = pd.get_dummies(data=mydata, columns = [#'country',
#                                                      'age_group',
#                                                     # 'travel_with',
#                                                     # 'purpose',
#                                                     # 'main_activity',
#                                                     # 'info_source',
#                                                     'tour_arrangement',
#                                                     'package_transport_int',
#                                                     'package_accomodation',
#                                                     'package_food',
#                                                     'package_transport_tz',
#                                                     'package_sightseeing',
#                                                     'package_guided_tour',
#                                                     'package_insurance',
#                                                     # 'night_mainland',
#                                                     # 'night_zanzibar',
#                                                     # 'payment_mode',
#                                                     #'most_impressing'
#                                                     ])
    
#     old_df = mydata[['night_mainland', 'night_zanzibar']]
    
#     encoded_data = pd.concat([new_data, old_df], axis=1, join='inner')
#     return encoded_data

In [276]:
# encoded_data = preprocessing(train)
# encoded_data

The encoded data is now our training set


In [291]:
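# Fill null values with 0
# NOTE: fillna() returns a new DataFrame; the result is not assigned here,
# so encoded_data itself still contains NaN afterwards.
encoded_data.fillna(0)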

Unnamed: 0,total_female,total_male,package_transport_int,package_accomodation,package_food,package_transport_tz,package_sightseeing,package_guided_tour,package_insurance,night_mainland,night_zanzibar
0,1.0,1.0,0,0,0,0,0,0,0,13.0,0.0
1,1.0,0.0,0,0,0,0,0,0,0,14.0,7.0
2,0.0,1.0,0,0,0,0,0,0,0,1.0,31.0
3,1.0,1.0,0,1,1,1,1,1,0,11.0,0.0
4,1.0,0.0,0,0,0,0,0,0,0,7.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...
4804,0.0,1.0,0,0,0,0,0,0,0,2.0,0.0
4805,1.0,1.0,1,1,1,1,1,1,1,11.0,0.0
4806,1.0,0.0,0,0,0,0,0,0,0,3.0,7.0
4807,1.0,1.0,1,1,1,0,0,0,0,5.0,0.0


I'll skip EDA and feature engineering for now and go straight to model building.

## 4. Model Building

In [292]:
encoded_data.nunique()

total_female             14
total_male               14
package_transport_int     2
package_accomodation      2
package_food              2
package_transport_tz      2
package_sightseeing       2
package_guided_tour       2
package_insurance         2
night_mainland           64
night_zanzibar           34
dtype: int64

In [293]:
encoded_data.isnull().sum()

total_female             3
total_male               5
package_transport_int    0
package_accomodation     0
package_food             0
package_transport_tz     0
package_sightseeing      0
package_guided_tour      0
package_insurance        0
night_mainland           0
night_zanzibar           0
dtype: int64

In [294]:
# Our algorithm
lr = LinearRegression()

X = encoded_data
y = train['total_cost']

# Algorithm training
lr.fit(X, y)



ValueError: Input X contains NaN.
LinearRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
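The error message points to imputing inside a pipeline as one way around the NaN values. The sketch below illustrates that route on a toy frame with made-up values standing in for encoded_data. The median strategy is an assumption chosen for illustration; the notebook itself only attempted `fillna(0)`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Toy features with gaps, standing in for encoded_data (values are made up)
X_toy = pd.DataFrame({
    "total_female": [1.0, np.nan, 0.0, 2.0],
    "total_male": [1.0, 0.0, np.nan, 1.0],
    "night_mainland": [13.0, 14.0, 1.0, 11.0],
})
y_toy = pd.Series([674602.5, 3214906.5, 3315000.0, 7790250.0])

# Median imputation is an assumption here, not the notebook's chosen strategy
model = make_pipeline(SimpleImputer(strategy="median"), LinearRegression())
model.fit(X_toy, y_toy)

# The same imputation is applied automatically at prediction time
preds = model.predict(X_toy)
print(preds.shape)
```

Keeping the imputer inside the pipeline means the test set would be filled with statistics learned from the training data, rather than being handled in a separate step.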

In [295]:
test.nunique()

ID                       1601
country                    87
age_group                   4
travel_with                 5
total_female               13
total_male                 10
purpose                     7
main_activity               9
info_source                 8
tour_arrangement            2
package_transport_int       2
package_accomodation        2
package_food                2
package_transport_tz        2
package_sightseeing         2
package_guided_tour         2
package_insurance           2
night_mainland             51
night_zanzibar             25
payment_mode                4
first_trip_tz               2
most_impressing             7
dtype: int64

In [215]:
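# This cell comes from an earlier run, when preprocessing() (shown commented out above) was still defined
encoded_test_data = preprocessing(test)
encoded_test_data.iloc[:,11:41].dtypes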

most_impressing                  object
age_group_24-Jan                  uint8
age_group_25-44                   uint8
age_group_45-64                   uint8
age_group_65+                     uint8
tour_arrangement_Independent      uint8
tour_arrangement_Package Tour     uint8
package_transport_int_No          uint8
package_transport_int_Yes         uint8
package_accomodation_No           uint8
package_accomodation_Yes          uint8
package_food_No                   uint8
package_food_Yes                  uint8
package_transport_tz_No           uint8
package_transport_tz_Yes          uint8
package_sightseeing_No            uint8
package_sightseeing_Yes           uint8
package_guided_tour_No            uint8
package_guided_tour_Yes           uint8
package_insurance_No              uint8
package_insurance_Yes             uint8
first_trip_tz_No                  uint8
first_trip_tz_Yes                 uint8
night_mainland                    int64
night_zanzibar                    int64
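The `age_group_24-Jan` column suggests that the `1-24` age band in the test file was turned into a date at some point, possibly by a spreadsheet tool. That cause is an assumption, since the source does not confirm it. A minimal sketch of mapping the label back before encoding, on a toy frame:

```python
import pandas as pd

# Toy test frame; "24-Jan" mimics the mangled label seen in the output above
toy_test = pd.DataFrame({"age_group": ["24-Jan", "25-44", "45-64", "65+"]})

# Map the mangled label back to the one used in the training data
toy_test["age_group"] = toy_test["age_group"].replace({"24-Jan": "1-24"})

labels = sorted(toy_test["age_group"].unique())
print(labels)
```

It is worth confirming the raw values with `test['age_group'].unique()` before applying a mapping like this.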


In [223]:
# Prediction time. Something is wrong here: this slice of the pre-processed
# test dataset still includes the text column most_impressing (see the dtypes above)

X_test = encoded_test_data.iloc[:,11:41]

In [224]:
y_pred = lr.predict(X_test)

Feature names unseen at fit time:
- age_group_24-Jan
- most_impressing
Feature names seen at fit time, yet now missing:
- age_group_1-24
- total_cost



ValueError: could not convert string to float: ' Wildlife'
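The warning and error above indicate that the test matrix does not match the features the model was fitted on: a text column (most_impressing) reached the model, and total_cost appears among the fit-time features, which means the target was part of the training matrix. One possible remedy, sketched on toy frames with made-up values, is to build the dummy columns for both sets and then reindex the test matrix to the training columns.

```python
import pandas as pd

# Toy train/test frames (made-up values); the test set lacks one category
toy_train = pd.DataFrame({
    "age_group": ["1-24", "25-44", "45-64"],
    "night_mainland": [7.0, 14.0, 13.0],
})
toy_test = pd.DataFrame({
    "age_group": ["25-44", "45-64"],
    "night_mainland": [3, 5],
})

X_train = pd.get_dummies(toy_train, columns=["age_group"])
X_test = pd.get_dummies(toy_test, columns=["age_group"])

# Give the test matrix exactly the training columns, in the same order;
# dummy columns missing from the test set are filled with 0
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

print(list(X_test.columns))
```

For this to work on the real data, the training matrix must contain only numeric features and must exclude total_cost, which belongs in y.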

In [219]:
y_pred

array([6.43359522e+19, 9.87797188e+19, 5.19147582e+19, ...,
       4.68884957e+19, 2.48717017e+19, 6.93119688e+19])

In [220]:
submission = pd.DataFrame({"ID": test["ID"],
                           "total_cost": y_pred})

In [221]:
submission.sample(5)

Unnamed: 0,ID,total_cost
1225,tour_5479,-8.998726e+17
280,tour_1983,2.375596e+19
1267,tour_5610,1.699053e+19
523,tour_2799,6.68508e+19
1186,tour_5342,3.695882e+19


In [24]:
submission.to_csv('first_submission.csv', index = False)

## 5. Model Evaluation