## Part 2: Exploratory Data Analysis

This part of the project is where we try to understand the data and pick the predictors we will use for our model. Before we dive in, let's brainstorm some theories and analytical questions we want the data to answer. More will come up along the way.

#### Theories
1. Most tourists come from Europe.
2. High-spending tourists come from Europe.
3. Visiting Zanzibar costs significantly more than visiting the mainland.
4. Tourists from Western countries to Africa are mostly interested in wildlife tourism.
5. From 45 years and above, it is mostly women who visit Africa.


#### Questions

1. Which country has the highest number of tourists?
2. Which age group has the highest number of tourists?
3. Between male and female, which gender travels the most to Africa?


In [52]:
import pandas as pd
import numpy as np
import plotly.express as px
from plotly.offline import plot, iplot, init_notebook_mode
import plotly.graph_objs as go
init_notebook_mode(connected=True)
pd.set_option('display.max_columns', None)

In [53]:
df = pd.read_csv(r'clean_data.csv',index_col=[0])

In [54]:
df.head()

Unnamed: 0,ID,country,age_group,travel_with,total_female,total_male,purpose,main_activity,info_source,tour_arrangement,package_transport_int,package_accomodation,package_food,package_transport_tz,package_sightseeing,package_guided_tour,package_insurance,night_mainland,night_zanzibar,payment_mode,first_trip_tz,most_impressing,total_cost,Total_travelers
0,tour_0,SWIZERLAND,45-64,Friends/Relatives,1.0,1.0,Leisure and Holidays,Wildlife tourism,"Friends, relatives",Independent,No,No,No,No,No,No,No,13.0,0.0,Cash,No,Friendly People,674602.5,2.0
1,tour_10,UNITED KINGDOM,25-44,Alone,1.0,0.0,Leisure and Holidays,Cultural tourism,others,Independent,No,No,No,No,No,No,No,14.0,7.0,Cash,Yes,"Wonderful Country, Landscape, Nature",3214906.5,1.0
2,tour_1000,UNITED KINGDOM,25-44,Alone,0.0,1.0,Visiting Friends and Relatives,Cultural tourism,"Friends, relatives",Independent,No,No,No,No,No,No,No,1.0,31.0,Cash,No,Excellent Experience,3315000.0,1.0
3,tour_1002,UNITED KINGDOM,25-44,Spouse,1.0,1.0,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Package Tour,No,Yes,Yes,Yes,Yes,Yes,No,11.0,0.0,Cash,Yes,Friendly People,7790250.0,2.0
4,tour_1004,CHINA,1-24,Alone,1.0,0.0,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Independent,No,No,No,No,No,No,No,7.0,4.0,Cash,Yes,No comments,1657500.0,1.0


1. Which are the top 10 countries visiting Tanzania?

In [55]:
px.histogram(df[['country','Total_travelers']].groupby('country').agg({"Total_travelers": sum}) .sort_values('Total_travelers',ascending=False).head(10).reset_index(),
x='country', y = 'Total_travelers',text_auto=True,title='Top 10 countries that visit Tanzania',
labels=dict(Total_travelers='Total Tourists'))

As shown in the visualization above, most tourists do come from Europe: 5 of the top 10 countries are European.
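Counting European countries in the top 10 does not show what share of all travellers they account for. The sketch below is one way to check that directly. The `CONTINENT` lookup and the function name are hypothetical and deliberately partial, and the toy rows are made up for illustration; a real run would need a mapping that covers every value in `df['country']`.

```python
import pandas as pd

# Hypothetical, partial country-to-continent lookup (an assumption, not part
# of the dataset). Extend it before trusting the result on the real data.
CONTINENT = {
    "UNITED KINGDOM": "Europe",
    "SWIZERLAND": "Europe",  # spelled as it appears in the dataset
    "CHINA": "Asia",
    "ZIMBABWE": "Africa",
}

def travellers_share_by_continent(frame: pd.DataFrame) -> pd.Series:
    """Return each continent's share of Total_travelers, largest first."""
    # Countries missing from the lookup are grouped under "Unmapped".
    continent = frame["country"].map(CONTINENT).fillna("Unmapped")
    totals = frame.groupby(continent)["Total_travelers"].sum()
    return (totals / totals.sum()).sort_values(ascending=False)

# Toy rows, made up purely to show the call.
toy = pd.DataFrame({
    "country": ["UNITED KINGDOM", "CHINA", "ZIMBABWE", "SWIZERLAND", "ATLANTIS"],
    "Total_travelers": [4.0, 2.0, 1.0, 2.0, 1.0],
})
shares = travellers_share_by_continent(toy)
print(shares)
```

On the real data the call would be `travellers_share_by_continent(df)`; a large "Unmapped" share is a sign the lookup needs more entries.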

2. Which top 10 countries spend the most in Tanzania?

In [56]:
px.histogram(df[['country','total_cost']].groupby('country').sum().sort_values('total_cost',ascending=False).head(10).reset_index(),
x='country', y = 'total_cost',text_auto=True,title='Top 10 countries that spend most in Tanzania',
labels=dict(country="Country",total_cost='Number of Total Spent'))

Yes, most of the top-spending countries are European (6 of 10). However, it is interesting that Zimbabwe is among the top 10 visiting countries but not among the top 10 spenders. Let's increase the number of countries to 20 to see if it appears on the list.

In [57]:
px.histogram(df[['country','total_cost']].groupby('country').sum().sort_values('total_cost',ascending=False).head(20).reset_index(),
x='country', y = 'total_cost',text_auto=True,title='Top 20 countries that spend most in Tanzania',
labels=dict(country="Country",total_cost='Number of Total Spent'))

Interesting: Zimbabwe is in the top 10 for visitors but only 17th for spending. Now I am keen to know why Zimbabweans visit Tanzania.
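Instead of reading positions off two separate charts, both rankings can be put side by side. This is a minimal sketch: the function name `rank_countries` and the toy rows are assumptions made for illustration, and only the column names come from the dataset.

```python
import pandas as pd

def rank_countries(frame: pd.DataFrame) -> pd.DataFrame:
    """Rank each country by travellers and by spend (1 = highest)."""
    totals = frame.groupby("country")[["Total_travelers", "total_cost"]].sum()
    ranks = totals.rank(ascending=False, method="min").astype(int)
    ranks.columns = ["visitor_rank", "spend_rank"]
    # Positive gap: the country ranks lower on spend than on visitors.
    ranks["rank_gap"] = ranks["spend_rank"] - ranks["visitor_rank"]
    return ranks.sort_values("visitor_rank")

# Toy rows, made up purely to show the call.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "C"],
    "Total_travelers": [6.0, 4.0, 8.0, 5.0],
    "total_cost": [60.0, 40.0, 500.0, 300.0],
})
ranks = rank_countries(toy)
print(ranks)
```

Sorting the result of `rank_countries(df)` by `rank_gap` would surface every country that visits often but spends little, not just the one spotted by eye.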

In [58]:
country_group=df.groupby('country')


In [59]:
px.histogram(country_group.get_group('ZIMBABWE'),x='purpose',title='Why Zimbabweans visit Tanzania',text_auto=True)

In [60]:
# Work on an explicit copy so that adding a column does not raise SettingWithCopyWarning
zimbabwe_group=country_group.get_group('ZIMBABWE').copy()
zimbabwe_group['total_days']= zimbabwe_group.night_mainland + zimbabwe_group.night_zanzibar

In [61]:
px.box(zimbabwe_group,x='total_days')

The two graphs explain why Zimbabweans visit Tanzania yet spend less.
1. The first graph shows that most of these tourists are on business.
2. The second graph shows that 75% of them stay 3 nights or fewer.

As a Zimbabwean, I know this is true because most people who travel to Tanzania are shop owners who go to buy clothing stock.
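The quartiles read off the box plot can also be computed directly. The sketch below assumes a hypothetical helper name, `stay_quartiles`, and uses made-up toy rows; it takes an explicit `.copy()` of the subset before adding the `total_days` column so the original frame is left untouched.

```python
import pandas as pd

def stay_quartiles(frame: pd.DataFrame, country: str) -> pd.Series:
    """Quartiles of total nights (mainland + Zanzibar) for one country."""
    # .copy() makes the subset independent of the original frame.
    subset = frame.loc[frame["country"] == country].copy()
    subset["total_days"] = subset["night_mainland"] + subset["night_zanzibar"]
    return subset["total_days"].quantile([0.25, 0.5, 0.75])

# Toy rows, made up purely to show the call.
toy = pd.DataFrame({
    "country": ["ZIMBABWE"] * 5 + ["CHINA"],
    "night_mainland": [1.0, 2.0, 3.0, 3.0, 10.0, 7.0],
    "night_zanzibar": [0.0, 0.0, 0.0, 1.0, 0.0, 4.0],
})
quartiles = stay_quartiles(toy, "ZIMBABWE")
print(quartiles)
```

The value at `0.75` is the upper edge of the box in `px.box`, so `stay_quartiles(df, 'ZIMBABWE')` gives the number behind the "75%" statement.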

3. Which age group visits Tanzania the most?

In [62]:
px.histogram(df[['age_group','Total_travelers']].groupby('age_group').agg({"Total_travelers": sum}) .sort_values('Total_travelers',ascending=False).reset_index(),
x='age_group', y = 'Total_travelers',text_auto=True,title='Number of tourists by age group',
labels=dict(Total_travelers='Total Tourists'))

In [63]:
px.histogram(df[['age_group','total_cost']].groupby('age_group').sum().sort_values('total_cost',ascending=False).head(10).reset_index(),
x='age_group', y = 'total_cost',text_auto=True,title='Total spent by age group',
labels=dict(age_group="Age Group",total_cost='Number of Total Spent'))

In [64]:
px.histogram(df[['age_group','Total_travelers','travel_with']].groupby(['age_group','travel_with']).agg({"Total_travelers": sum}).sort_values('Total_travelers',ascending=False).reset_index(),
x='age_group', y = 'Total_travelers',color='travel_with',text_auto=True,title='Number of tourists by age group',
labels=dict(Total_travelers='Total Tourists'))

In [65]:
df.columns

Index(['ID', 'country', 'age_group', 'travel_with', 'total_female',
       'total_male', 'purpose', 'main_activity', 'info_source',
       'tour_arrangement', 'package_transport_int', 'package_accomodation',
       'package_food', 'package_transport_tz', 'package_sightseeing',
       'package_guided_tour', 'package_insurance', 'night_mainland',
       'night_zanzibar', 'payment_mode', 'first_trip_tz', 'most_impressing',
       'total_cost', 'Total_travelers'],
      dtype='object')

In [66]:
px.histogram(df[['travel_with','Total_travelers']].groupby('travel_with').agg({"Total_travelers": sum}) .sort_values('Total_travelers',ascending=False).reset_index(),
x='travel_with', y = 'Total_travelers',text_auto=True,title='Number of tourists by companion',
labels=dict(Total_travelers='Total Tourists'))

Most tourists travel with friends/relatives, which makes sense because those groups are pretty large. Let's see which group spends the most.

In [67]:
px.histogram(df[['travel_with','total_cost']].groupby('travel_with').sum().sort_values('total_cost',ascending=False).head(10).reset_index(),
x='travel_with', y = 'total_cost',text_auto=True,title='Total spent by companions',
labels=dict(travel_with="Companion",total_cost='Number of Total Spent'))

Interesting: it seems that tourists travelling with their spouse spend more than those travelling with friends or relatives. From this I am now assuming that the number of people travelling may not be a deciding factor in how much people spend. Let's dig deeper and see which activity, by companion, brings in more revenue.
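A summed bar mixes two things: how many trips fall in a category and how much each trip costs. One way to separate them is to look at the median spend per trip and the spend per traveller next to the totals. This is a sketch under assumptions: `spend_summary` is a hypothetical helper, the toy rows are made up, and each row of the data is treated as one trip.

```python
import pandas as pd

def spend_summary(frame: pd.DataFrame, by: str) -> pd.DataFrame:
    """Totals plus per-trip and per-traveller spend for one categorical column."""
    out = frame.groupby(by).agg(
        trips=("total_cost", "size"),
        travellers=("Total_travelers", "sum"),
        total_spend=("total_cost", "sum"),
        median_spend_per_trip=("total_cost", "median"),
    )
    out["spend_per_traveller"] = out["total_spend"] / out["travellers"]
    return out.sort_values("total_spend", ascending=False)

# Toy rows, made up purely to show the call.
toy = pd.DataFrame({
    "travel_with": ["Spouse", "Spouse", "Friends/Relatives",
                    "Friends/Relatives", "Friends/Relatives"],
    "Total_travelers": [2.0, 2.0, 4.0, 4.0, 4.0],
    "total_cost": [100.0, 300.0, 50.0, 50.0, 50.0],
})
summary = spend_summary(toy, "travel_with")
print(summary)
```

If `spend_summary(df, 'travel_with')` shows a higher `spend_per_traveller` for spouses as well as a higher total, the conclusion does not depend on group size alone.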

In [68]:
px.histogram(df[['travel_with','total_cost','main_activity']].groupby(['travel_with','main_activity']).sum().sort_values('total_cost',ascending=False).head(10).reset_index(),
x='travel_with', y = 'total_cost',color='main_activity',text_auto=True,title='Companions spent vs Main Activity',
labels=dict(travel_with="Companion",total_cost='Number of Total Spent',main_activity='Main Activity'))

As expected, most tourists come to Africa for wildlife tourism, and it actually brings in more revenue than beach tourism. I am now starting to think that the mainland brings in more revenue than Zanzibar. Next I want to know which age group of spouses brings in more revenue. I am guessing mature couples aged 45-64.

In [69]:
px.histogram(df[['travel_with','total_cost','age_group']].groupby(['travel_with','age_group']).sum().sort_values('total_cost',ascending=False).head(10).reset_index(),
x='travel_with', y = 'total_cost',color='age_group',text_auto=True,title='Companions spent vs Age Group',
labels=dict(travel_with="Companion",total_cost='Number of Total Spent',age_group='Age Group'))

Hmm, there isn't much of a difference in the amount spent by the 25-44 and 45-64 age groups. But it is still interesting that tourists spend more when travelling with their spouses than with friends/relatives, regardless of group size. This makes travel_with a good predictor.

In [70]:
df.head()

Unnamed: 0,ID,country,age_group,travel_with,total_female,total_male,purpose,main_activity,info_source,tour_arrangement,package_transport_int,package_accomodation,package_food,package_transport_tz,package_sightseeing,package_guided_tour,package_insurance,night_mainland,night_zanzibar,payment_mode,first_trip_tz,most_impressing,total_cost,Total_travelers
0,tour_0,SWIZERLAND,45-64,Friends/Relatives,1.0,1.0,Leisure and Holidays,Wildlife tourism,"Friends, relatives",Independent,No,No,No,No,No,No,No,13.0,0.0,Cash,No,Friendly People,674602.5,2.0
1,tour_10,UNITED KINGDOM,25-44,Alone,1.0,0.0,Leisure and Holidays,Cultural tourism,others,Independent,No,No,No,No,No,No,No,14.0,7.0,Cash,Yes,"Wonderful Country, Landscape, Nature",3214906.5,1.0
2,tour_1000,UNITED KINGDOM,25-44,Alone,0.0,1.0,Visiting Friends and Relatives,Cultural tourism,"Friends, relatives",Independent,No,No,No,No,No,No,No,1.0,31.0,Cash,No,Excellent Experience,3315000.0,1.0
3,tour_1002,UNITED KINGDOM,25-44,Spouse,1.0,1.0,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Package Tour,No,Yes,Yes,Yes,Yes,Yes,No,11.0,0.0,Cash,Yes,Friendly People,7790250.0,2.0
4,tour_1004,CHINA,1-24,Alone,1.0,0.0,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Independent,No,No,No,No,No,No,No,7.0,4.0,Cash,Yes,No comments,1657500.0,1.0


In [71]:
px.histogram(df[['purpose','Total_travelers']].groupby('purpose').agg({"Total_travelers": sum}) .sort_values('Total_travelers',ascending=False).reset_index(),
x='purpose', y = 'Total_travelers',text_auto=True,title='Number of tourists by purpose of visit',
labels=dict(Total_travelers='Total Tourists'))

In [72]:
px.histogram(df[['purpose','total_cost']].groupby('purpose').sum().sort_values('total_cost',ascending=False).head(10).reset_index(),
x='purpose', y = 'total_cost',text_auto=True,title='Total spent by purpose of visit',
labels=dict(travel_with="Companion",total_cost='Number of Total Spent'))

This huge difference shows that the purpose of visit is a great predictor for our model. Most revenue comes from Leisure and Holidays.

In [73]:
px.histogram(df[['main_activity','Total_travelers']].groupby('main_activity').agg({"Total_travelers": sum}) .sort_values('Total_travelers',ascending=False).reset_index(),
x='main_activity', y = 'Total_travelers',text_auto=True,title='Number of tourists by main activity',
labels=dict(Total_travelers='Total Tourists'))

In [74]:
px.histogram(df[['main_activity','total_cost']].groupby('main_activity').sum().sort_values('total_cost',ascending=False).head(10).reset_index(),
x='main_activity', y = 'total_cost',text_auto=True,title='Total spent by Main Activity',
labels=dict(travel_with="Companion",total_cost='Number of Total Spent'))

As expected, wildlife tourism brings in a lot of revenue, making main_activity a good predictor.

In [75]:
px.histogram(df[['info_source','Total_travelers']].groupby('info_source').agg({"Total_travelers": sum}) .sort_values('Total_travelers',ascending=False).reset_index(),
x='info_source', y = 'Total_travelers',text_auto=True,title='Number of tourists by Info Source',
labels=dict(Total_travelers='Total Tourists'))

Most tourists find information through a travel agent or tour operator. I was expecting social media to bring in a lot of tourists. Now let's see if the information source has an effect on the total amount spent.

In [76]:
px.histogram(df[['info_source','total_cost']].groupby('info_source').sum().sort_values('total_cost',ascending=False).head(10).reset_index(),
x='info_source', y = 'total_cost',text_auto=True,title='Total spent by Information Source',
labels=dict(travel_with="Companion",total_cost='Number of Total Spent'))

Wow, this is interesting. Tourists who come through travel agents are less than twice as many as those referred by friends/relatives, yet they spend about 4x as much. This makes the information source a good predictor. I would recommend that the Tourism minister give more attention to travel agents and tour operators.
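The comparison above is between a category's share of travellers and its share of spend. A small table makes that ratio explicit. This is a sketch only: `share_table` and the `spend_index` column are hypothetical names, and the toy rows are made up for illustration.

```python
import pandas as pd

def share_table(frame: pd.DataFrame, by: str) -> pd.DataFrame:
    """Share of travellers, share of spend, and their ratio per category."""
    totals = frame.groupby(by)[["Total_travelers", "total_cost"]].sum()
    shares = totals / totals.sum()
    shares.columns = ["traveller_share", "spend_share"]
    # Above 1: the category spends more than its head count would suggest.
    shares["spend_index"] = shares["spend_share"] / shares["traveller_share"]
    return shares.sort_values("spend_index", ascending=False)

# Toy rows, made up purely to show the call.
toy = pd.DataFrame({
    "info_source": ["Travel, agent, tour operator", "Travel, agent, tour operator",
                    "Friends, relatives", "Friends, relatives"],
    "Total_travelers": [4.0, 2.0, 3.0, 1.0],
    "total_cost": [500.0, 300.0, 150.0, 50.0],
})
table = share_table(toy, "info_source")
print(table)
```

The same call works for any categorical column, for example `share_table(df, 'tour_arrangement')`.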

In [77]:
px.histogram(df[['tour_arrangement','Total_travelers']].groupby('tour_arrangement').agg({"Total_travelers": sum}) .sort_values('Total_travelers',ascending=False).reset_index(),
x='tour_arrangement', y = 'Total_travelers',text_auto=True,title='Number of tourists by Tour Arrangement',
labels=dict(Total_travelers='Total Tourists'))

In [78]:
px.histogram(df[['tour_arrangement','total_cost']].groupby('tour_arrangement').sum().sort_values('total_cost',ascending=False).head(10).reset_index(),
x='tour_arrangement', y = 'total_cost',text_auto=True,title='Total spent by Tour Arrangement',
labels=dict(travel_with="Companion",total_cost='Number of Total Spent'))

Tourists on a package tour spend 4x more than independent tourists. This makes tour arrangement a good predictor.

In [79]:
df.columns

Index(['ID', 'country', 'age_group', 'travel_with', 'total_female',
       'total_male', 'purpose', 'main_activity', 'info_source',
       'tour_arrangement', 'package_transport_int', 'package_accomodation',
       'package_food', 'package_transport_tz', 'package_sightseeing',
       'package_guided_tour', 'package_insurance', 'night_mainland',
       'night_zanzibar', 'payment_mode', 'first_trip_tz', 'most_impressing',
       'total_cost', 'Total_travelers'],
      dtype='object')

In [80]:
px.histogram(df[['package_transport_int','Total_travelers']].groupby('package_transport_int').agg({"Total_travelers": sum}) .sort_values('Total_travelers',ascending=False).reset_index(),
x='package_transport_int', y = 'Total_travelers',text_auto=True,title='Number of tourists by International Package Transport',
labels=dict(Total_travelers='Total Tourists'))

In [81]:
px.histogram(df[['package_transport_int','total_cost']].groupby('package_transport_int').sum().sort_values('total_cost',ascending=False).head(10).reset_index(),
x='package_transport_int', y = 'total_cost',text_auto=True,title='Total spent by International Package Transport',
labels=dict(travel_with="Companion",total_cost='Number of Total Spent'))

From the graph above, a tour package that includes international transport does not have a huge impact on total revenue. This will not be a good predictor, so I will not include it in the model.

In [82]:
px.histogram(df[['package_accomodation','Total_travelers']].groupby('package_accomodation').agg({"Total_travelers": sum}) .sort_values('Total_travelers',ascending=False).reset_index(),
x='package_accomodation', y = 'Total_travelers',text_auto=True,title='Number of tourists by Package Accomodation',
labels=dict(Total_travelers='Total Tourists'))

In [83]:
px.histogram(df[['package_accomodation','total_cost']].groupby('package_accomodation').sum().sort_values('total_cost',ascending=False).head(10).reset_index(),
x='package_accomodation', y = 'total_cost',text_auto=True,title='Total spent by Package Accomodation',
labels=dict(travel_with="Companion",total_cost='Number of Total Spent'))

Whether a tour package includes accommodation really has an impact on total revenue. The split between tourists with and without accommodation is close to 50:50, as shown in the first plot, but the difference in total revenue is more than 9x, which makes the variable a good predictor.
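All of the `package_*` flags can be compared in one pass instead of one pair of charts per flag. The sketch below is an assumption-laden alternative, not the method used in this notebook: it compares the mean `total_cost` per record for "Yes" against "No", where the charts compare sums. The helper name `package_effect` and the toy rows are made up.

```python
import pandas as pd

def package_effect(frame: pd.DataFrame, target: str = "total_cost") -> pd.DataFrame:
    """Mean target for Yes vs No on every package_* column, with their ratio."""
    flags = [c for c in frame.columns if c.startswith("package_")]
    rows = []
    for col in flags:
        means = frame.groupby(col)[target].mean()
        yes, no = means.get("Yes"), means.get("No")
        # Ratio is undefined when a level is missing or the "No" mean is zero.
        ratio = yes / no if yes is not None and no else float("nan")
        rows.append({"flag": col, "mean_yes": yes, "mean_no": no, "ratio": ratio})
    return pd.DataFrame(rows).set_index("flag").sort_values("ratio", ascending=False)

# Toy rows, made up purely to show the call.
toy = pd.DataFrame({
    "package_food": ["Yes", "Yes", "No", "No"],
    "package_insurance": ["No", "Yes", "No", "No"],
    "total_cost": [300.0, 500.0, 100.0, 100.0],
})
effect = package_effect(toy)
print(effect)
```

A ratio close to 1 suggests the flag carries little information about spend; `package_effect(df)` would rank all seven flags at once.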

In [84]:
px.histogram(df[['package_food','Total_travelers']].groupby('package_food').agg({"Total_travelers": sum}) .sort_values('Total_travelers',ascending=False).reset_index(),
x='package_food', y = 'Total_travelers',text_auto=True,title='Number of tourists with a Package that includes Food',
labels=dict(Total_travelers='Total Tourists'))

In [85]:
px.histogram(df[['package_food','total_cost']].groupby('package_food').sum().sort_values('total_cost',ascending=False).head(10).reset_index(),
x='package_food', y = 'total_cost',text_auto=True,title='Total spent with tourists that have a Package that includes food',
labels=dict(travel_with="Companion",total_cost='Number of Total Spent'))

Whether a tour package includes food really has an impact on total revenue. The split between tourists with and without a food package is close to 50:50, as shown in the first plot, but the difference in total revenue is more than 3x, which makes the variable a good predictor.

In [86]:
px.histogram(df[['package_transport_tz','Total_travelers']].groupby('package_transport_tz').agg({"Total_travelers": sum}) .sort_values('Total_travelers',ascending=False).reset_index(),
x='package_transport_tz', y = 'Total_travelers',text_auto=True,title='Number of tourists with a Package that includes Tanzanian Transport',
labels=dict(Total_travelers='Total Tourists'))

In [87]:
px.histogram(df[['package_transport_tz','total_cost']].groupby('package_transport_tz').sum().sort_values('total_cost',ascending=False).head(10).reset_index(),
x='package_transport_tz', y = 'total_cost',text_auto=True,title='Total spent with tourists that have a Package that includes Tanzanian Transport',
labels=dict(travel_with="Companion",total_cost='Number of Total Spent'))

Compared to the package that included international transport, it looks like most tourists want a package that provides local transport.

In [88]:
px.histogram(df[['package_sightseeing','Total_travelers']].groupby('package_sightseeing').agg({"Total_travelers": sum}) .sort_values('Total_travelers',ascending=False).reset_index(),
x='package_sightseeing', y = 'Total_travelers',text_auto=True,title='Number of tourists with a Package that includes Sight Seeing',
labels=dict(Total_travelers='Total Tourists'))

In [89]:
px.histogram(df[['package_sightseeing','total_cost']].groupby('package_sightseeing').sum().sort_values('total_cost',ascending=False).head(10).reset_index(),
x='package_sightseeing', y = 'total_cost',text_auto=True,title='Total spent with tourists that have a Package that includes Sight Seeing',
labels=dict(travel_with="Companion",total_cost='Number of Total Spent'))

I was expecting to see a huge margin for packages that include sightseeing. Since the difference is less than 2x, this is a weak predictor. It will not be included in our model.

In [90]:
px.histogram(df[['package_guided_tour','Total_travelers']].groupby('package_guided_tour').agg({"Total_travelers": sum}) .sort_values('Total_travelers',ascending=False).reset_index(),
x='package_guided_tour', y = 'Total_travelers',text_auto=True,title='Number of tourists with a Package that includes Tour Guide',
labels=dict(Total_travelers='Total Tourists'))

In [101]:
px.histogram(df[['package_guided_tour','total_cost']].groupby('package_guided_tour').sum().sort_values('total_cost',ascending=False).head(10).reset_index(),
x='package_guided_tour', y = 'total_cost',text_auto=True,title='Total spent with tourists that have a Package that includes Tour Guide',
labels=dict(travel_with="Companion",total_cost='Number of Total Spent'))

I was expecting to see a huge margin for packages that include a tour guide. Since the difference is less than 2x, this is a weak predictor. It will not be included in our model.

In [92]:
px.histogram(df[['package_insurance','Total_travelers']].groupby('package_insurance').agg({"Total_travelers": sum}) .sort_values('Total_travelers',ascending=False).reset_index(),
x='package_insurance', y = 'Total_travelers',text_auto=True,title='Number of tourists with a Package that includes Insurance',
labels=dict(Total_travelers='Total Tourists'))

From the visualization above, most tourists are not interested in package insurance.

In [93]:
px.histogram(df[['package_insurance','total_cost']].groupby('package_insurance').sum().sort_values('total_cost',ascending=False).head(10).reset_index(),
x='package_insurance', y = 'total_cost',text_auto=True,title='Total spent with tourists that have a Package that includes Package Insurance',
labels=dict(travel_with="Companion",total_cost='Number of Total Spent'))

We will definitely not add this to our model because it doesn't bring in revenue.

In [94]:
px.histogram(df[['payment_mode','Total_travelers']].groupby('payment_mode').agg({"Total_travelers": sum}) .sort_values('Total_travelers',ascending=False).reset_index(),
x='payment_mode', y = 'Total_travelers',text_auto=True,title='Number of tourists by payment channel',
labels=dict(Total_travelers='Total Tourists'))

In [95]:
px.histogram(df[['payment_mode','total_cost']].groupby('payment_mode').sum().sort_values('total_cost',ascending=False).head(10).reset_index(),
x='payment_mode', y = 'total_cost',text_auto=True,title='Total spent by payment Channel',
labels=dict(travel_with="Companion",total_cost='Number of Total Spent'))

Tourists who bring in a lot of revenue pay in cash, so this is a good predictor.

In [96]:
px.histogram(df[['first_trip_tz','Total_travelers']].groupby('first_trip_tz').agg({"Total_travelers": sum}) .sort_values('Total_travelers',ascending=False).reset_index(),
x='first_trip_tz', y = 'Total_travelers',text_auto=True,title='Number of tourists by First Trip',
labels=dict(Total_travelers='Total Tourists'))

In [97]:
px.histogram(df[['first_trip_tz','total_cost']].groupby('first_trip_tz').sum().sort_values('total_cost',ascending=False).head(10).reset_index(),
x='first_trip_tz', y = 'total_cost',text_auto=True,title='Total spent First Trip',
labels=dict(travel_with="Companion",total_cost='Number of Total Spent'))

When it is a tourist's first trip, they are likely to bring in more revenue than returning tourists. This is a good predictor.

In [98]:
px.histogram(df[['most_impressing','Total_travelers']].groupby('most_impressing').agg({"Total_travelers": sum}) .sort_values('Total_travelers',ascending=False).reset_index(),
x='most_impressing', y = 'Total_travelers',text_auto=True,title='Number of tourists Most impressed by',
labels=dict(Total_travelers='Total Tourists'))

In [99]:
px.histogram(df[['most_impressing','total_cost']].groupby('most_impressing').sum().sort_values('total_cost',ascending=False).head(10).reset_index(),
x='most_impressing', y = 'total_cost',text_auto=True,title='Total spent Impression',
labels=dict(travel_with="Companion",total_cost='Number of Total Spent'))

most_impressing is a weak predictor, as shown above. We will not include it in our model.

In [125]:
# Work on an explicit copy so the column assignments below do not raise SettingWithCopyWarning
gender_data=df[['total_female','total_male','Total_travelers','total_cost']].copy()
gender_data['travelers_count']=gender_data['Total_travelers']
gender_data[['total_female','total_male','travelers_count']]=gender_data[['total_female','total_male','travelers_count']].astype(str)

In [169]:
gender_data.groupby(['travelers_count','total_female','total_male']).sum().sort_values('total_cost',ascending=False).head(5).reset_index()


Unnamed: 0,travelers_count,total_female,total_male,Total_travelers,total_cost
0,2.0,1.0,1.0,2604.0,15205860000.0
1,1.0,0.0,1.0,1443.0,4295121000.0
2,1.0,1.0,0.0,930.0,4272954000.0
3,4.0,2.0,2.0,540.0,2498488000.0
4,3.0,2.0,1.0,402.0,1582563000.0


In [174]:
px.histogram(gender_data.groupby(['travelers_count','total_female','total_male']).sum().sort_values('Total_travelers',ascending=False).head(12).reset_index(),
x=['total_female','total_male'], y = 'Total_travelers',facet_col="travelers_count",barmode="group",width=1200, height=550,
category_orders={"travelers_count": ["2.0", "1.0", "4.0"]},
text_auto=True,title='Number of tourists by total male and female',
labels=dict(Total_travelers='Total Tourists',value='Group Composition',variable='Gender',travelers_count='Travellers Group'))

The visualization above shows the total number of females and males per group. For example, if the travelling group has 2 people, there is either one male and one female, 2 males, or 2 females. The graph shows the group size as travelers_count and the distribution of genders within that group. From the way I see it, the numbers of males and females who travel to Tanzania are similar, except when travelling alone, where more males seem to enjoy travelling alone to Africa. As the group size increases, the distribution is more or less the same.

In [173]:
px.histogram(gender_data.groupby(['travelers_count','total_female','total_male']).sum().sort_values('total_cost',ascending=False).head(12).reset_index(),
x=['total_female','total_male'], y = 'total_cost',facet_col="travelers_count",barmode="group",width=1200, height=550,
text_auto=True,title='Total spent by total male and female',
labels=dict(total_cost='Total Spent',value='Group Composition',variable='Gender',travelers_count='Travellers Group'))

The trend is similar here: males and females seem to spend more or less the same. This shows that total_female and total_male are weak predictors, so they will be excluded from our model.

In [175]:
px.histogram(gender_data.groupby(['travelers_count','total_female','total_male']).sum().sort_values('Total_travelers',ascending=False).head(20).reset_index(),
x='travelers_count', y = 'Total_travelers',text_auto=True,title='Number of tourists by Group Composition',
labels=dict(Total_travelers='Total Tourists'))

Most tourists prefer travelling in pairs; as the group size increases, the number of tourists decreases. Let's see if the same trend holds for spending.

In [176]:
px.histogram(gender_data.groupby(['travelers_count','total_female','total_male']).sum().sort_values('total_cost',ascending=False).head(20).reset_index(),
x='travelers_count', y = 'total_cost',text_auto=True,title='Total spent by Group Composition',
labels=dict(Total_travelers='Total Tourists'))

Yes, the trend is pretty much the same. This also reinforces what we established earlier, that spouses spend more than any other group. However, I don't think the total number of travellers will be a good predictor in our model.
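Whether group size tracks spend at the level of individual trips can be checked with a rank correlation rather than from summed bars. The sketch below computes Spearman's rho as the Pearson correlation of the ranks, which needs only pandas. The helper name `rank_correlation` and the toy values are assumptions for illustration.

```python
import pandas as pd

def rank_correlation(frame: pd.DataFrame, x: str, y: str) -> float:
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    return frame[x].rank().corr(frame[y].rank())

# Toy rows, made up purely to show the call.
toy = pd.DataFrame({
    "Total_travelers": [1.0, 2.0, 3.0, 4.0, 5.0],
    "total_cost": [10.0, 30.0, 20.0, 50.0, 40.0],
})
rho = rank_correlation(toy, "Total_travelers", "total_cost")
print(round(rho, 3))  # 0.8
```

A value near 0 from `rank_correlation(df, 'Total_travelers', 'total_cost')` would support leaving group size out of the model; a value near 1 or -1 would argue against it.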

In [177]:
# Work on an explicit copy so the column assignment below does not raise SettingWithCopyWarning
stay_data=df[['night_mainland','night_zanzibar','Total_travelers','total_cost']].copy()
stay_data[['night_mainland','night_zanzibar']]=stay_data[['night_mainland','night_zanzibar']].astype(str)

In [178]:
stay_data.groupby(['night_mainland','night_zanzibar']).sum().sort_values('total_cost',ascending=False).head(5).reset_index()

Unnamed: 0,night_mainland,night_zanzibar,Total_travelers,total_cost
0,0.0,7.0,522.0,2619956000.0
1,7.0,0.0,448.0,2314093000.0
2,6.0,0.0,405.0,2184085000.0
3,10.0,0.0,359.0,1760148000.0
4,14.0,0.0,315.0,1443527000.0


In [180]:
px.histogram(stay_data.groupby(['night_mainland','night_zanzibar']).sum().sort_values('Total_travelers',ascending=False).head(20).reset_index(),
x=['night_mainland','night_zanzibar'], y = 'Total_travelers',barmode="group",width=1200, height=550,
text_auto=True,title='Number of tourists by Number of nights a tourist spent in Zanzibar and/or Mainland',
labels=dict(Total_travelers='Total Tourists'))

This distribution shows that most tourists prefer mainland Tanzania to Zanzibar, which explains why most of them come for wildlife. As the number of nights spent on the mainland increases, the number of tourists decreases. It also shows that 4,440 tourists did not visit Zanzibar at all. This is a huge number.
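The number of tourists who skipped a destination can be computed directly instead of being read off the chart. This sketch assumes the night columns are still numeric, as they are in `df` (in `stay_data` they were cast to strings). The helper name `travellers_skipping` and the toy rows are made up for illustration.

```python
import pandas as pd

def travellers_skipping(frame: pd.DataFrame, nights_col: str) -> tuple:
    """Travellers with zero nights in nights_col, and their share of all travellers."""
    skipped = frame.loc[frame[nights_col] == 0, "Total_travelers"].sum()
    return skipped, skipped / frame["Total_travelers"].sum()

# Toy rows, made up purely to show the call.
toy = pd.DataFrame({
    "night_mainland": [13.0, 14.0, 0.0, 11.0],
    "night_zanzibar": [0.0, 7.0, 5.0, 0.0],
    "Total_travelers": [2.0, 1.0, 3.0, 2.0],
})
count, share = travellers_skipping(toy, "night_zanzibar")
print(count, share)  # 4.0 0.5
```

Calling it once with `'night_zanzibar'` and once with `'night_mainland'` on `df` gives a like-for-like comparison of the two destinations.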

In [181]:
px.histogram(stay_data.groupby(['night_mainland','night_zanzibar']).sum().sort_values('total_cost',ascending=False).head(20).reset_index(),
x=['night_mainland','night_zanzibar'], y = 'total_cost',barmode="group",width=1200, height=550,
text_auto=True,title='Total spent by Number of nights a tourist spent in Zanzibar and/or Mainland',
labels=dict(Total_travelers='Total Tourists'))

The mainland brings in more revenue than Zanzibar. Let's dig deeper and see if the number of nights spent on the mainland can be a strong predictor in our model.

In [183]:
px.histogram(stay_data.groupby(['night_mainland','night_zanzibar']).sum().sort_values('total_cost',ascending=False).head(20).reset_index(),
x='night_mainland', y = 'total_cost',barmode="group",width=1200, height=550,
text_auto=True,title='Total spent by Number of nights a tourist spent on Mainland',
labels=dict(Total_travelers='Total Tourists'))

No, the distribution says otherwise. This is a weak predictor.

This concludes our exploratory data analysis. Let's choose the columns for our final model.

In [191]:
model_df=df[['country','age_group','purpose','main_activity','info_source',
'tour_arrangement','package_accomodation','package_food','package_transport_tz',
'package_insurance','payment_mode','first_trip_tz','total_cost']]

Out of 24 columns we have chosen 12 as predictors for the model. The next part of this project is to prepare the data for modelling.
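The selection above was made by eye from the charts. As a cross-check, a single number per categorical column can be computed with the correlation ratio (eta squared): the fraction of the variance in `total_cost` explained by the group means. This is a swapped-in technique, not what this notebook used, and the helper name `correlation_ratio` and the toy rows are assumptions for illustration.

```python
import pandas as pd

def correlation_ratio(frame: pd.DataFrame, cat: str, target: str = "total_cost") -> float:
    """Eta squared: between-group sum of squares over total sum of squares."""
    y = frame[target]
    grand_mean = y.mean()
    groups = frame.groupby(cat)[target]
    ss_between = (groups.size() * (groups.mean() - grand_mean) ** 2).sum()
    ss_total = ((y - grand_mean) ** 2).sum()
    return ss_between / ss_total if ss_total else float("nan")

# Toy rows, made up purely to show the call.
toy = pd.DataFrame({
    "purpose": ["Leisure", "Leisure", "Business", "Business"],
    "payment_mode": ["Cash", "Card", "Cash", "Card"],
    "total_cost": [10.0, 10.0, 2.0, 2.0],
})
scores = pd.Series(
    {col: correlation_ratio(toy, col) for col in ["purpose", "payment_mode"]}
).sort_values(ascending=False)
print(scores)
```

Values run from 0 (the column explains none of the variance) to 1 (it explains all of it). Columns with many levels, such as `country`, tend to score higher for that reason alone, so the ranking is a guide rather than a verdict.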

In [192]:
model_df.to_csv('model_data.csv')
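By default `to_csv` also writes the index, and reading the file back without `index_col` produces an extra `Unnamed: 0` column. The sketch below shows the round trip with and without `index=False`, using an in-memory buffer in place of a file; the two rows reuse values shown in the output of this notebook.

```python
import io
import pandas as pd

sample = pd.DataFrame({
    "country": ["CHINA", "UAE"],
    "total_cost": [1657500.0, 3315000.0],
})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
reloaded = pd.read_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
clean = pd.read_csv(without_index)

print(list(reloaded.columns))  # ['Unnamed: 0', 'country', 'total_cost']
print(list(clean.columns))     # ['country', 'total_cost']
```

Either approach works as long as the reader matches the writer: keep the index and load with `index_col=[0]`, as done at the top of this part, or write with `index=False` and load without it.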

In [193]:
model_df

Unnamed: 0,country,age_group,purpose,main_activity,info_source,tour_arrangement,package_accomodation,package_food,package_transport_tz,package_insurance,payment_mode,first_trip_tz,total_cost
0,SWIZERLAND,45-64,Leisure and Holidays,Wildlife tourism,"Friends, relatives",Independent,No,No,No,No,Cash,No,674602.5
1,UNITED KINGDOM,25-44,Leisure and Holidays,Cultural tourism,others,Independent,No,No,No,No,Cash,Yes,3214906.5
2,UNITED KINGDOM,25-44,Visiting Friends and Relatives,Cultural tourism,"Friends, relatives",Independent,No,No,No,No,Cash,No,3315000.0
3,UNITED KINGDOM,25-44,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Package Tour,Yes,Yes,Yes,No,Cash,Yes,7790250.0
4,CHINA,1-24,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Independent,No,No,No,No,Cash,Yes,1657500.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4804,UAE,45-64,Business,Hunting tourism,"Friends, relatives",Independent,No,No,No,No,Credit Card,No,3315000.0
4805,UNITED STATES OF AMERICA,25-44,Leisure and Holidays,Wildlife tourism,"Travel, agent, tour operator",Package Tour,Yes,Yes,Yes,Yes,Cash,Yes,10690875.0
4806,NETHERLANDS,1-24,Leisure and Holidays,Wildlife tourism,others,Independent,No,No,No,No,Cash,Yes,2246636.7
4807,SOUTH AFRICA,25-44,Business,Beach tourism,"Travel, agent, tour operator",Independent,Yes,Yes,No,No,Credit Card,No,1160250.0
