# Zindi's Tanzania Tourism Prediction

The question is: Can you use tourism survey data and ML to predict how much money a tourist will spend when visiting Tanzania?

I will use regression to answer this question, since the target is a continuous amount.

In [1]:
# import libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

This notebook will be divided into the following sections:
1. Loading and cleaning the data
2. EDA
3. Feature Engineering
4. Model Building
5. Model Evaluation and Visualization

## 1. Load and Clean the Data

We have both the train and test datasets, so we load them both, along with the variable definitions and the sample submission.

In [2]:
train = pd.read_csv('../Data/Tz-Tourist-Prediction/Train.csv')
test = pd.read_csv('../Data/Tz-Tourist-Prediction/Test.csv')
VariableDefinitions = pd.read_csv('../Data/Tz-Tourist-Prediction/VariableDefinitions.csv')
SampleSubmission = pd.read_csv('../Data/Tz-Tourist-Prediction/SampleSubmission.csv')

Now that the data frames are loaded, I will look at what each of them contains.

In [3]:
train.dtypes

ID                        object
country                   object
age_group                 object
travel_with               object
total_female             float64
total_male               float64
purpose                   object
main_activity             object
info_source               object
tour_arrangement          object
package_transport_int     object
package_accomodation      object
package_food              object
package_transport_tz      object
package_sightseeing       object
package_guided_tour       object
package_insurance         object
night_mainland           float64
night_zanzibar           float64
payment_mode              object
first_trip_tz             object
most_impressing           object
total_cost               float64
dtype: object

In [4]:
train.shape

(4809, 23)

In [5]:
SampleSubmission.head()

Unnamed: 0,ID,total_cost
0,tour_1,0
1,tour_100,0
2,tour_1001,0
3,tour_1006,0
4,tour_1009,0


What we are trying to predict is the total_cost column. It is our y (target). Now we must check each of the features in the train df to determine which ones are relevant in making the prediction.
Of the columns in train, the submission file contains only the ID and the total_cost.

In [6]:
# check what is in the train dataframe
train.head()

Unnamed: 0,ID,country,age_group,travel_with,total_female,total_male,purpose,main_activity,info_source,tour_arrangement,...,package_transport_tz,package_sightseeing,package_guided_tour,package_insurance,night_mainland,night_zanzibar,payment_mode,first_trip_tz,most_impressing,total_cost
0,tour_0,SWIZERLAND,45-64,Friends/Relatives,1.0,1.0,Leisure and Holidays,Wildlife tourism,"Friends, relatives",Independent,...,No,No,No,No,13.0,0.0,Cash,No,Friendly People,674602.5
1,tour_10,UNITED KINGDOM,25-44,,1.0,0.0,Leisure and Holidays,Cultural tourism,others,Independent,...,No,No,No,No,14.0,7.0,Cash,Yes,"Wonderful Country, Landscape, Nature",3214906.5
2,tour_1000,UNITED KINGDOM,25-44,Alone,0.0,1.0,Visiting Friends and Relatives,Cultural tourism,"Friends, relatives",Independent,...,No,No,No,No,1.0,31.0,Cash,No,Excellent Experience,3315000.0
3,tour_1002,UNITED KINGDOM,25-44,Spouse,1.0,1.0,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Package Tour,...,Yes,Yes,Yes,No,11.0,0.0,Cash,Yes,Friendly People,7790250.0
4,tour_1004,CHINA,1-24,,1.0,0.0,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Independent,...,No,No,No,No,7.0,4.0,Cash,Yes,No comments,1657500.0


In [7]:
train.nunique()

ID                       4809
country                   105
age_group                   4
travel_with                 5
total_female               14
total_male                 14
purpose                     7
main_activity               9
info_source                 8
tour_arrangement            2
package_transport_int       2
package_accomodation        2
package_food                2
package_transport_tz        2
package_sightseeing         2
package_guided_tour         2
package_insurance           2
night_mainland             64
night_zanzibar             34
payment_mode                4
first_trip_tz               2
most_impressing             7
total_cost               1637
dtype: int64
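As a rough sketch of how the dtype and cardinality checks above could feed into feature selection, the snippet below splits columns into "to encode" and "numeric" on a tiny made-up frame. The rows reuse a few values from `train.head()`, and the rule of skipping any column with one distinct value per row (such as `ID`) is an assumption for illustration, not a step taken in this notebook.

```python
import pandas as pd

# Tiny illustrative frame; a few values copied from train.head()
toy = pd.DataFrame({
    "ID": ["tour_0", "tour_10", "tour_1000"],
    "age_group": ["45-64", "25-44", "25-44"],
    "night_mainland": [13.0, 14.0, 1.0],
    "total_cost": [674602.5, 3214906.5, 3315000.0],
})

# Non-numeric columns are candidates for one-hot encoding
categorical = toy.select_dtypes(exclude="number").columns

# Assumed rule: a column with one distinct value per row is an identifier,
# so it is skipped rather than encoded
to_encode = [c for c in categorical if toy[c].nunique() < len(toy)]

# Numeric columns can be passed to the model as they are
numeric = list(toy.select_dtypes(include="number").columns)

print(to_encode)
print(numeric)
```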

In [8]:
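# Check for empty values
# Caution: the second .isnull() runs on an all-boolean frame, which never
# contains nulls, so every count printed below is 0 regardless of the data.
# A single train.isnull().sum() gives the real counts.
train.isnull().isnull().sum()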

ID                       0
country                  0
age_group                0
travel_with              0
total_female             0
total_male               0
purpose                  0
main_activity            0
info_source              0
tour_arrangement         0
package_transport_int    0
package_accomodation     0
package_food             0
package_transport_tz     0
package_sightseeing      0
package_guided_tour      0
package_insurance        0
night_mainland           0
night_zanzibar           0
payment_mode             0
first_trip_tz            0
most_impressing          0
total_cost               0
dtype: int64

The zeros above do not mean the data is complete. We cannot proceed with columns that contain empty values, i.e. travel_with, total_female, total_male and most_impressing. I'll have different strategies for them.
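A minimal sketch of how the missing-value check behaves, on a small made-up frame (the values are illustrative, not taken from Train.csv): a single `isnull()` counts missing cells per column, whereas chaining it twice, as in cell In [8], always reports zeros.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "travel_with": ["Alone", np.nan, "Spouse"],
    "total_male": [1.0, np.nan, np.nan],
})

# One isnull(): counts the missing cells in each column
missing_counts = toy.isnull().sum()

# Two isnull() calls: the boolean mask itself has no NaN, so every count is 0
double_counts = toy.isnull().isnull().sum()

print(missing_counts)
print(double_counts)
```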

In [9]:
# forward fill
# train = train.fillna(method='ffill', axis=0)

In [10]:
# train.isnull().sum()
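One possible set of per-column strategies, sketched on made-up rows: fill categorical gaps with an explicit placeholder and fill the numeric counts with the column median. The placeholder label `"Unknown"` and the choice of the median are assumptions for illustration; the preprocessing function later in this notebook forward-fills instead.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the four columns with missing values
toy = pd.DataFrame({
    "travel_with": ["Alone", np.nan, "Spouse", "Alone"],
    "most_impressing": ["Friendly People", "No comments", np.nan, "Good service"],
    "total_female": [1.0, np.nan, 1.0, 3.0],
    "total_male": [0.0, 1.0, np.nan, 2.0],
})

filled = toy.copy()

# Categorical columns: mark the gap explicitly (assumed placeholder label)
for col in ["travel_with", "most_impressing"]:
    filled[col] = filled[col].fillna("Unknown")

# Numeric counts: use the column median, which is robust to large groups
for col in ["total_female", "total_male"]:
    filled[col] = filled[col].fillna(filled[col].median())

print(filled)
```

Unlike a forward fill, this does not depend on the row order of the file.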

In [11]:
VariableDefinitions

Unnamed: 0,Column Name,Definition
0,id,Unique identifier for each tourist
1,country,The country a tourist coming from.
2,age_group,The age group of a tourist.
3,travel_with,The relation of people a tourist travel with t...
4,total_female,Total number of females
5,total_male,Total number of males
6,purpose,The purpose of visiting Tanzania
7,main_activity,The main activity of tourism in Tanzania
8,infor_source,The source of information about tourism in Tan...
9,tour_arrangment,The arrangment of visiting Tanzania


In [21]:
# Time to one-hot encode the categorical columns

def preprocessing(mydata):
    """Forward-fill missing values, then one-hot encode the listed columns."""
    # Fill empty rows with the value from the previous row
    # (DataFrame.ffill() replaces the deprecated fillna(method='ffill'))
    mydata = mydata.ffill()

    # 'country' is not encoded; night_mainland and night_zanzibar are numeric
    # but are encoded as categories here
    columns_to_encode = [
        'age_group',
        'travel_with',
        'purpose',
        'main_activity',
        'info_source',
        'tour_arrangement',
        'package_transport_int',
        'package_accomodation',
        'package_food',
        'package_transport_tz',
        'package_sightseeing',
        'package_guided_tour',
        'package_insurance',
        'night_mainland',
        'night_zanzibar',
        'payment_mode',
        'first_trip_tz',
        'most_impressing',
    ]
    encoded_data = pd.get_dummies(data=mydata, columns=columns_to_encode)

    return encoded_data

In [22]:
encoded_data = preprocessing(train)

The encoded data is now our training set


In [23]:
# Check for null values
encoded_data.isnull().sum()

ID                                                      0
country                                                 0
total_female                                            0
total_male                                              0
total_cost                                              0
                                                       ..
most_impressing_Friendly People                         0
most_impressing_Good service                            0
most_impressing_No comments                             0
most_impressing_Satisfies and Hope Come Back            0
most_impressing_Wonderful Country, Landscape, Nature    0
Length: 165, dtype: int64

I'll skip to model building now

## 4. Model Building

In [31]:
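# Our algorithm
lr = LinearRegression()

# Caution: positions 2..20 of encoded_data are total_female, total_male,
# total_cost and the first dummy columns, so the target is among the features
X = encoded_data.iloc[:, 2:21]
y = train['total_cost']

# Algorithm training
lr.fit(X, y)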
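Because `total_cost` sits at position 4 of `encoded_data`, a positional slice starting at 2 passes the target to the model as a feature. A hedged sketch of selecting features by name instead, on a made-up frame (the rows and the `non_features` list are illustrative assumptions, not the notebook's actual data):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up rows with the same leading columns as encoded_data
toy = pd.DataFrame({
    "ID": ["tour_a", "tour_b", "tour_c", "tour_d"],
    "country": ["KENYA", "CHINA", "KENYA", "ITALY"],
    "total_female": [1.0, 0.0, 2.0, 1.0],
    "total_male": [1.0, 1.0, 0.0, 3.0],
    "total_cost": [100.0, 50.0, 120.0, 300.0],
})

target = "total_cost"
# Identifier, unencoded text column and the target are all excluded by name
non_features = ["ID", "country", target]

X = toy.drop(columns=non_features)
y = toy[target]

model = LinearRegression().fit(X, y)
feature_names = list(X.columns)
print(feature_names)
```

Dropping by name keeps working if the column order changes, which a positional slice does not.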

In [43]:
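# Prediction time. Something is wrong here: test has no total_cost column, so
# the same positional slice picks a different set of columns than in training.
# Pre-processing the test dataset

X_test = preprocessing(test).iloc[:,2:21]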

In [44]:
y_pred = lr.predict(X_test)

Feature names unseen at fit time:
- age_group_24-Jan
- main_activity_Beach tourism
Feature names seen at fit time, yet now missing:
- age_group_1-24
- total_cost
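The message above says the train and test feature sets differ: some dummy columns exist on only one side. A common remedy, sketched here on made-up rows, is to encode both frames and then reindex the test columns to match the training columns. Treating `24-Jan` as a mangled `1-24` label is an assumption suggested by the two lists in the message, not something verified against the raw files.

```python
import pandas as pd

# Made-up rows reproducing the two kinds of mismatch in the message
train_toy = pd.DataFrame({
    "age_group": ["1-24", "25-44"],
    "main_activity": ["Wildlife tourism", "Cultural tourism"],
})
test_toy = pd.DataFrame({
    "age_group": ["24-Jan", "25-44"],
    "main_activity": ["Beach tourism", "Wildlife tourism"],
})

# Assumption: '24-Jan' is the '1-24' label after being read as a date
test_toy["age_group"] = test_toy["age_group"].replace({"24-Jan": "1-24"})

X_train = pd.get_dummies(train_toy)
X_test = pd.get_dummies(test_toy)

# Keep exactly the training columns, in the same order: categories unseen in
# training are dropped, categories missing from test are added as all zeros
X_test_aligned = X_test.reindex(columns=X_train.columns, fill_value=0)

print(list(X_test_aligned.columns))
```

The target column never belongs in this alignment, since it is absent from the test set by design.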



In [38]:
y_pred

array([ 2.96169740e-09, -5.39680325e-09,  7.86630060e-09, ...,
        2.18256830e-09,  1.00000000e+00,  2.94653538e-09])

In [39]:
submission = pd.DataFrame({"ID": test["ID"],
                           "total_cost": y_pred})

In [40]:
submission.sample(5)

Unnamed: 0,ID,total_cost
834,tour_3989,1.588848e-09
70,tour_1240,2.114601e-09
307,tour_2056,2.964686e-09
661,tour_334,5.77791e-09
1339,tour_5856,1.0
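These predictions are close to 0 or 1, while the `total_cost` values shown in `train.head()` are in the hundreds of thousands or millions. A small sanity check could catch such a mismatch before a file is written. The helper name `looks_plausible` and the tenfold tolerance are hypothetical choices for this sketch.

```python
import numpy as np

# total_cost values from train.head() and predictions from y_pred above
y_train_sample = np.array([674602.5, 3214906.5, 3315000.0, 7790250.0, 1657500.0])
y_pred_sample = np.array([2.96169740e-09, -5.39680325e-09, 7.86630060e-09,
                          2.18256830e-09, 1.00000000e+00, 2.94653538e-09])

def looks_plausible(preds, targets, tolerance=10.0):
    """Return True if the median magnitudes are within `tolerance`x of each other."""
    med_pred = np.median(np.abs(preds))
    med_target = np.median(np.abs(targets))
    if med_pred == 0:
        return False
    ratio = med_target / med_pred
    return bool(1.0 / tolerance <= ratio <= tolerance)

print(looks_plausible(y_pred_sample, y_train_sample))
```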


In [42]:
submission.to_csv('first_submission.csv', index = False)