# Data Preprocessing

Train/dev/test splits, outlier trimming, filling in missing values, etc. should be done here. Save the new DataFrame to a file when done.

In [31]:
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [32]:
all_data = pd.read_csv("data/Train.csv", index_col="Tour_ID").sort_index()
descriptions = pd.read_csv("data/VariableDefinitions.csv")
descriptions

| Unnamed: 0 | Column Name | Definition |
|---|---|---|
| 0 | id | Unique identifier for each tourist |
| 1 | country | The country a tourist coming from. |
| 2 | age_group | The age group of a tourist. |
| 3 | travel_with | The relation of people a tourist travel with t... |
| 4 | total_female | Total number of females |
| 5 | total_male | Total number of males |
| 6 | purpose | The purpose of visiting Tanzania |
| 7 | main_activity | The main activity of tourism in Tanzania |
| 8 | infor_source | The source of information about tourism in Tan... |
| 9 | tour_arrangment | The arrangment of visiting Tanzania |


In [33]:
all_data.shape

(18506, 20)

### Data Cleaning

We need to replace the NaN values in the data before modelling. Here, each missing value is filled with the mean of its column, which only applies to numeric columns.
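The notebook does not show the imputation step itself, so the following is only a minimal sketch of the idea on a toy DataFrame. It assumes the column mean for numeric columns and, as an added assumption, the most frequent value for non-numeric columns. The helper name `fill_missing` and the toy values are hypothetical. The fill statistics come from a separate `reference` frame (typically the training split), so that held-out rows do not influence them.

```python
import numpy as np
import pandas as pd


def fill_missing(df, reference):
    """Fill NaNs in df using statistics computed on reference (hypothetical helper)."""
    filled = df.copy()
    for col in filled.columns:
        if pd.api.types.is_numeric_dtype(reference[col]):
            # Numeric column: use the column mean.
            fill_value = reference[col].mean()
        else:
            # Assumption: non-numeric columns use the most frequent value.
            fill_value = reference[col].mode().iloc[0]
        filled[col] = filled[col].fillna(fill_value)
    return filled


# Toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "total_female": [1.0, np.nan, 3.0],
    "travel_with": ["Alone", None, "Alone"],
})
toy_filled = fill_missing(toy, reference=toy)
```

On the real data, this would be called once per split, for example `fill_missing(X_test, reference=X_train)`.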

### Train, Dev, Test split

In [34]:
# Features are every column except the last; the target (cost_category) is the last column.
X_train, X_test, y_train, y_test = train_test_split(
    all_data.iloc[:, :-1],
    all_data.iloc[:, -1],
    test_size=0.3,
    random_state=13,
)
y_test

Tour_ID
tour_idoz4s8ore     Higher Cost
tour_idmrpgckx9      Lower Cost
tour_idb9tw1xhe     Higher Cost
tour_id15p5ryu5       High Cost
tour_idbt8hghl6      Lower Cost
                       ...     
tour_id669wplzi        Low Cost
tour_idkky33aii     Higher Cost
tour_idgiqq4jp5    Highest Cost
tour_id5pfwrcvu        Low Cost
tour_idoh2wy0j4       High Cost
Name: cost_category, Length: 5552, dtype: object

Now the training features and labels are in `X_train` and `y_train`, while the held-out 30% of the rows are in `X_test` and `y_test`. No separate dev set is created in this cell.
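If a separate dev set is wanted, one common option is a second call to `train_test_split` on the training portion. The sketch below uses toy data. The `*_toy` names and the 0.2 dev fraction are assumptions, not part of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real features and labels.
toy_X = pd.DataFrame({"feature": range(100)})
toy_y = pd.Series([i % 2 for i in range(100)], name="label")

# First split: hold out a test set, as in the cell above.
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=13
)

# Second split: carve a dev set out of the remaining training rows.
X_fit_toy, X_dev_toy, y_fit_toy, y_dev_toy = train_test_split(
    X_train_toy, y_train_toy, test_size=0.2, random_state=13
)
```

The three resulting sets are disjoint, so the dev set can be used for model selection while the test set stays untouched until the end.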

### One-Hot Encoding

One-hot encoding is a way to represent categorical variables as numbers, which makes them compatible with an ML model. Each distinct category in a column becomes a column of its own; a row gets a 1 in the column for its category and a 0 in the others.
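As a small illustration of the idea, the sketch below encodes a toy DataFrame with `pd.get_dummies`. The category values are made up, and `dtype=int` is passed so the output shows 0/1 rather than booleans.

```python
import pandas as pd

# Toy data; the age_group values are hypothetical.
toy = pd.DataFrame({
    "age_group": ["1-24", "25-44", "1-24"],
    "total_male": [1, 2, 0],
})

# Each distinct age_group value becomes its own 0/1 column;
# columns not listed in `columns` are passed through unchanged.
toy_encoded = pd.get_dummies(toy, columns=["age_group"], dtype=int)
print(toy_encoded)
```

The result keeps `total_male` and replaces `age_group` with one column per category, named `age_group_1-24` and `age_group_25-44`.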

In [37]:
# Categorical columns to one-hot encode
categorical_cols = ['country', 'age_group', 'travel_with', 'purpose', 'main_activity',
                    'info_source', 'tour_arrangement', 'package_transport_int',
                    'package_accomodation', 'package_food', 'package_transport_tz',
                    'package_sightseeing', 'package_guided_tour', 'package_insurance',
                    'first_trip_tz']

train_encoded = pd.get_dummies(X_train, columns=categorical_cols)
test_encoded = pd.get_dummies(X_test, columns=categorical_cols)

# A category present in only one split would give the two frames different columns,
# so align the test columns to the training columns (missing ones are filled with 0).
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

# y: map each cost category to an ordinal integer
cost_map = {
    'Lower Cost': 0,
    'Low Cost': 1,
    'Normal Cost': 2,
    'High Cost': 3,
    'Higher Cost': 4,
    'Highest Cost': 5
}

train_y_encoded = y_train.map(cost_map)

test_y_encoded = y_test.map(cost_map)

In [38]:
test_y_encoded.head()

Tour_ID
tour_idoz4s8ore    4
tour_idmrpgckx9    0
tour_idb9tw1xhe    4
tour_id15p5ryu5    3
tour_idbt8hghl6    0
Name: cost_category, dtype: int64
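`Series.map` with a dict turns any label that is missing from the dict into NaN instead of raising an error, so a quick check after mapping can catch misspelled or unexpected categories. A minimal sketch on toy labels follows; the helper name `encode_labels` is hypothetical.

```python
import pandas as pd

cost_map = {
    'Lower Cost': 0,
    'Low Cost': 1,
    'Normal Cost': 2,
    'High Cost': 3,
    'Higher Cost': 4,
    'Highest Cost': 5,
}


def encode_labels(labels, mapping):
    """Map labels to integers and fail loudly on labels missing from the mapping."""
    encoded = labels.map(mapping)
    # Labels that were present in the input but became NaN after mapping.
    unmapped = labels[encoded.isna() & labels.notna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Labels missing from mapping: {list(unmapped)}")
    return encoded


toy_labels = pd.Series(["Low Cost", "Highest Cost", "Normal Cost"])
toy_encoded_labels = encode_labels(toy_labels, cost_map)
```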

Saving train and test sets to files so we can just reload them later:

In [39]:
train_encoded.to_csv("data/X_train.csv", index=False)
test_encoded.to_csv("data/X_test.csv", index=False)
train_y_encoded.to_csv("data/y_train.csv", index=False)
test_y_encoded.to_csv("data/y_test.csv", index=False)
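For reloading later, here is a sketch that assumes the files were written with `index=False` as above, so the `Tour_ID` index is not stored and each label file holds a single `cost_category` column. The toy frames and the temporary directory are illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-ins for the encoded features and labels.
toy_X = pd.DataFrame({"total_male": [1, 0], "age_group_1-24": [1, 0]})
toy_y = pd.Series([4, 0], name="cost_category")

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    toy_X.to_csv(tmp / "X_train.csv", index=False)
    toy_y.to_csv(tmp / "y_train.csv", index=False)

    X_loaded = pd.read_csv(tmp / "X_train.csv")
    # The label file has one column; selecting it returns a Series.
    y_loaded = pd.read_csv(tmp / "y_train.csv")["cost_category"]
```

Because the index is dropped, features and labels are matched by row order only, so the two files should always be written and read together.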