# Data Preprocessing

Train/dev/test splitting, outlier trimming, filling in missing values, and similar preprocessing steps should be done here. Save the resulting DataFrame to a file when done.

In [16]:
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [17]:
all_data = pd.read_csv("data/Train.csv", index_col="Tour_ID").sort_index()
descriptions = pd.read_csv("data/VariableDefinitions.csv")
descriptions

Unnamed: 0,Column Name,Definition
0,id,Unique identifier for each tourist
1,country,The country a tourist coming from.
2,age_group,The age group of a tourist.
3,travel_with,The relation of people a tourist travel with t...
4,total_female,Total number of females
5,total_male,Total number of males
6,purpose,The purpose of visiting Tanzania
7,main_activity,The main activity of tourism in Tanzania
8,infor_source,The source of information about tourism in Tan...
9,tour_arrangment,The arrangment of visiting Tanzania


In [19]:
all_data.shape

(18506, 20)

### Data Cleaning

We need to replace the NaN values in the data with something else. In this case, it will be the mean of that column (this only applies to numeric columns).
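A minimal sketch of this kind of imputation is shown here on a toy frame rather than on `all_data`. The toy values are made up, and filling categorical gaps with the most frequent value is an assumption for illustration, not something the notebook specifies.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for all_data (values are invented for illustration).
toy = pd.DataFrame({
    "total_female": [1.0, np.nan, 3.0, 2.0],
    "total_male": [2.0, 2.0, np.nan, 2.0],
    "travel_with": ["Alone", "Alone", None, "Spouse"],
})

# Mean imputation is only defined for numeric columns.
numeric_cols = toy.select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())

# Assumption: categorical gaps are filled with the most frequent value instead.
categorical_cols = toy.select_dtypes(exclude="number").columns
for col in categorical_cols:
    toy[col] = toy[col].fillna(toy[col].mode().iloc[0])

print(toy)
```

If the means are computed before splitting the data, information from the held-out rows influences the training features, so it may be preferable to compute them on the training rows only.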

### One-Hot Encoding

One-hot encoding represents categorical variables as numbers, which makes them compatible with an ML model. Each distinct category in a column becomes a column of its own; a row gets a 1 in the column for its category and a 0 in all the others.
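To make the convention concrete, here is a small sketch using `pd.get_dummies` on a toy column. The category labels are illustrative and not necessarily the ones in `Train.csv`.

```python
import pandas as pd

# Toy column with two distinct categories (labels are illustrative).
toy = pd.DataFrame({"age_group": ["25-44", "45-64", "25-44"]})

# Each category becomes its own 0/1 column named <column>_<category>.
encoded = pd.get_dummies(toy, columns=["age_group"], dtype=int)
print(encoded)
```

The original `age_group` column is dropped and replaced by one indicator column per category.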

In [20]:
X = all_data.iloc[:, :-1]
y = all_data.iloc[:,-1]

y.head()

Tour_ID
tour_id000yfpco     Lower Cost
tour_id000zcjd9     Lower Cost
tour_id003q62x6      High Cost
tour_id0071kq5v      High Cost
tour_id00bp42je    Higher Cost
Name: cost_category, dtype: object

In [21]:
X_encoded = pd.get_dummies(X, columns=[
    'country', 'age_group', 'travel_with', 'purpose', 'main_activity',
    'info_source', 'tour_arrangement', 'package_transport_int',
    'package_accomodation', 'package_food', 'package_transport_tz',
    'package_sightseeing', 'package_guided_tour', 'package_insurance',
    'first_trip_tz',
])

# y: remap cost categories to ordinal integers
cost_map = {
    'Lower Cost': 0,
    'Low Cost': 1,
    'Normal Cost': 2,
    'High Cost': 3,
    'Higher Cost': 4,
    'Highest Cost': 5
}

y_encoded = y.map(cost_map)
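One thing worth checking after a dictionary-based `map` is that no label was left out, since `Series.map` turns any value missing from the dictionary into NaN without raising an error. A minimal sketch on a toy series (the misspelled label is deliberate and hypothetical):

```python
import pandas as pd

cost_map = {
    'Lower Cost': 0,
    'Low Cost': 1,
    'Normal Cost': 2,
    'High Cost': 3,
    'Higher Cost': 4,
    'Highest Cost': 5,
}

# Toy labels; the last one is deliberately misspelled to show the failure mode.
toy_y = pd.Series(["Low Cost", "Highest Cost", "low cost"])
toy_encoded = toy_y.map(cost_map)

# Labels that were not found in cost_map end up as NaN after the mapping.
unmapped = toy_y[toy_encoded.isna()].unique()
print(list(unmapped))
```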

### Train, Dev, Test split

In [33]:
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y_encoded, test_size=0.3, random_state=12)
y_test

Tour_ID
tour_idsbyhxvdd    4
tour_idow4l3eyx    0
tour_idymdnwrph    2
tour_idoo7t7yy4    3
tour_id4xae89ya    2
                  ..
tour_id60ifd9h8    3
tour_idomwrwgkz    1
tour_idy63entdc    1
tour_idbrvvt0ty    3
tour_id58lyplbs    4
Name: cost_category, Length: 5552, dtype: int64

The training set is now in `X_train` and `y_train`, and the held-out set (30% of the rows) is in `X_test` and `y_test`. No separate dev set is created at this point.
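The call above produces only two partitions. If a separate dev set is wanted, one common option is to call `train_test_split` twice. The sketch below uses toy arrays, and the 70/15/15 ratio and the `*_toy` names are assumptions for illustration, not the notebook's settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for X_encoded / y_encoded.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100) % 2

# First split: 70% train, 30% held out.
X_train_toy, X_hold_toy, y_train_toy, y_hold_toy = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=12
)

# Second split: divide the held-out 30% evenly into dev and test.
X_dev_toy, X_test_toy, y_dev_toy, y_test_toy = train_test_split(
    X_hold_toy, y_hold_toy, test_size=0.5, random_state=12
)

print(len(X_train_toy), len(X_dev_toy), len(X_test_toy))
```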

Save the train and test sets to files so they can be reloaded later:

In [35]:
X_train.to_csv("data/X_train.csv", index=False)
X_test.to_csv("data/X_test.csv", index=False)
y_train.to_csv("data/y_train.csv", index=False)
y_test.to_csv("data/y_test.csv", index=False)
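Reloading could look roughly like the sketch below, which round-trips a toy frame through a temporary directory instead of the real `data/` files. Note that with `index=False` the `Tour_ID` index is not written, so the reloaded frames get a default integer index; rows in the X and y files still line up by position.

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins for X_train / y_train (IDs and values are invented).
idx = pd.Index(["tour_a", "tour_b"], name="Tour_ID")
X_toy = pd.DataFrame({"total_female": [1, 0], "total_male": [1, 2]}, index=idx)
y_toy = pd.Series([0, 3], index=idx, name="cost_category")

with tempfile.TemporaryDirectory() as tmp:
    X_path = os.path.join(tmp, "X_train.csv")
    y_path = os.path.join(tmp, "y_train.csv")
    X_toy.to_csv(X_path, index=False)
    y_toy.to_csv(y_path, index=False)

    # read_csv returns a DataFrame; select the single column to get a Series back.
    X_back = pd.read_csv(X_path)
    y_back = pd.read_csv(y_path)["cost_category"]

print(X_back)
print(y_back)
```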