# Final project case: building a personalised tourism recommender - EDA part

#### Problem

The government of Spain wants to build a public online platform where its citizens can discover the country, with the goal of promoting national tourism. Although international tourists are the main source of income for the country, the government wants to change this paradigm and build a more sustainable tourism model around its own citizens.

This platform will include a personalised recommender that users can use to discover sites for their next trip, according to their age, how they are travelling (solo, in family, in couple, in group) and how they feel about the trip.

This first MVP is aimed at new users and is based on the average preferences of other users for each site.


#### Solution

I am a junior data analyst, and the government has given me data about places in several cities and regions (Madrid, Vigo, Barcelona, and the regions of Euskadi and La Palma) to create an MVP of this platform.

For new users, the government has already started a focus group to collect information about places, and it will give me the average data for each site.

# Loading the libraries

In [3]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt

# Loading the data

In reality, I retrieved the data from datos.gob.es, in the tourist points of interest section.
I then looked up each site through the Google Maps API to get its latitude, longitude and rating.

After that, I created some more features to make the recommender personalised.

In [9]:
data = pd.read_excel("data_final_not_encoded.xlsx")

In [10]:
data

| | id | name | average_sentiment | cat_detailed | cat_reduced | latitud | longitud | address | area | age | way_travel | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Teatro Flamenco Madrid | history_freak | theater | experiences, cultural centers, theaters and music | 40.423258 | -3.704502 | C. del Pez, 10, 28004 Madrid, España | Madrid | adult | in couple | 4.7 |
| 1 | 1 | Urban Safari | adventurous | experience | experiences, cultural centers, theaters and music | 40.404300 | -3.694670 | Calle de las Delicias, 9, 28045 Madrid, España | Madrid | old | in couple | 4.9 |
| 2 | 2 | Museo Gran Via 15 | artsy | museum | historic building, museums and archaeological ... | 40.419750 | -3.700222 | C/ Gran Vía, 15, Local, 28013 Madrid, España | Madrid | adult | in family | 4.4 |
| 3 | 3 | Parque Princesa Leonor | relax | park | town, parks or lookouts | 40.488108 | -3.619405 | F9RJ+4F, 28055 Madrid, España | Madrid | young | in couple | 4.4 |
| 4 | 4 | Parque Juan Pablo II | relax | park | town, parks or lookouts | 40.454777 | -3.626851 | Av. de Machupichu, 1, 28043 Madrid, España | Madrid | young | in couple | 4.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3793 | 3793 | Parketxe Arantzazu | curious | museum | historic building, museums and archaeological ... | 42.980086 | -2.401119 | Arantzazu auzoa, Gandiaga II eraikina, 20567 O... | Euskadi | adult | in family | 4.6 |
| 3794 | 3794 | Kantauri Ondare Museoa | curious | museum | historic building, museums and archaeological ... | 43.295369 | -2.256645 | Kantauri Plaza, 5, 20750 Zumaia, Gipuzkoa, España | Euskadi | young | in group | 5.0 |
| 3795 | 3795 | Ekoetxea Meatzaldea | curious | museum | historic building, museums and archaeological ... | 43.282968 | -3.075280 | Carretera forestal La Arboleda, Muskiz km 1,6,... | Euskadi | adult | in family | 4.4 |
| 3796 | 3796 | Mufomi - Museo de Fósiles y Minerales de Elgoibar | curious | museum | historic building, museums and archaeological ... | 43.218636 | -2.407862 | correos, Artetxe kalea, s/n, Apdo, Nº 20, 2087... | Euskadi | adult | in family | 4.9 |


# Brief context of the data

A long process came before loading this data. In the case study I said that the government gave me this data, collected from a focus group it ran.

The reality is more complex. You can follow how I did this process in the folders: 1. Gathered data from datos.gob.es and 2. Creating the dataset.

**-** First, I gathered the data from datos.gob.es, an open data platform run by the government of Spain. Each city or region has multiple datasets for each point of interest category, so I downloaded many datasets from various regions and analysed the quality of the data. In the end, I kept the data from Euskadi and La Palma as regions, and from Madrid, Vigo and Barcelona as cities.

**-** I built a single dataset with common categories that I defined myself. I manually reviewed every site to assign it an "average sentiment" or mood. I also deleted the duplicated sites.

**-** At this point I had more than 5000 sites. I ran them through the Google Maps API and kept only the ones that had a latitude, longitude and address, which left me with around 3500 sites.

**-** I translated the Spanish columns that were inherited from the original Spanish datasets.

**-** In the model iterations I realised that the model performed better with fewer categories, so I reduced them.

**-** To make the recommender personalised, I needed to create more features for the sites. Let's break this down.

    - Common travel platforms such as Tripadvisor don't provide this type of data. Their ratings are based on what all visitors think.
    
    - I want to segment places by the mood of the site (created previously in step 2), the average age of the typical traveller and the way they visit that place.
    
    - Since I couldn't find this data, because this is a new approach to travel recommendations, I needed to create it myself.
    
    - Check the folder 2. Creating the dataset > 4.age and way travel column creation to see how I did it. To justify this in the project, I assumed that the government ran a focus group to collect this data and gave it to me so I could build these columns and run the MVP.
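
The deduplication and geocoding filter described in the steps above can be sketched as follows. This is only an illustration on invented toy rows, not the code from the project folders; the names `raw`, `deduplicated` and `clean` are hypothetical, and I assume that a failed Google Maps lookup shows up as a missing value.

```python
import pandas as pd

# Toy stand-in for the geocoded sites; every row is invented for illustration.
raw = pd.DataFrame(
    {
        "name": ["Site A", "Site A", "Site B", "Site C"],
        "latitud": [40.42, 40.42, None, 43.29],
        "longitud": [-3.70, -3.70, -2.25, -2.26],
        "address": ["Street 1", "Street 1", "Street 2", None],
    }
)

# Drop sites that appear more than once (same name and coordinates).
deduplicated = raw.drop_duplicates(subset=["name", "latitud", "longitud"])

# Keep only sites for which latitude, longitude and address are all present.
clean = deduplicated.dropna(subset=["latitud", "longitud", "address"])

print(len(raw), len(deduplicated), len(clean))
```

On the toy rows, only Site A survives both steps.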

#  Analyzing the data that we have

In [12]:
data.shape

(3798, 12)

In [13]:
data.columns

Index(['id', 'name', 'average_sentiment', 'cat_detailed', 'cat_reduced',
       'latitud', 'longitud', 'address', 'area', 'age', 'way_travel',
       'rating'],
      dtype='object')

**Columns description**
- id: id of the site
- name: name of the site
- average_sentiment: how people feel when they visit the site.
- cat_detailed: detailed category of the site.
- cat_reduced: the reduced set of categories that I use for the model.
- latitud and longitud: latitude and longitude of the site, retrieved from Google Maps using its API.
- address: address of the site, retrieved from Google Maps using its API.
- area: area of the site.
- age: age group of the typical traveller: young, adult, old.
- way_travel: the typical way in which travellers visit the site: alone, in group, in couple, in family.
- rating: rating of the site, retrieved from Google Maps using its API.
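
To show how these columns could serve a new user, here is a minimal sketch that filters sites by the three personalisation labels and ranks the matches by rating. It is not the model built in the project: the function `match_profile`, the `top_n` parameter and the toy rows are all hypothetical.

```python
import pandas as pd

# Toy rows that reuse the column names described above; values are invented.
sites = pd.DataFrame(
    {
        "name": ["Site A", "Site B", "Site C", "Site D"],
        "average_sentiment": ["relax", "relax", "artsy", "relax"],
        "age": ["young", "young", "adult", "old"],
        "way_travel": ["in couple", "in couple", "in family", "in couple"],
        "rating": [4.4, 4.8, 4.6, 4.9],
    }
)


def match_profile(df, sentiment, age, way_travel, top_n=3):
    """Return the best-rated sites whose labels equal the user's answers."""
    mask = (
        (df["average_sentiment"] == sentiment)
        & (df["age"] == age)
        & (df["way_travel"] == way_travel)
    )
    return df[mask].sort_values("rating", ascending=False).head(top_n)


picks = match_profile(sites, "relax", "young", "in couple")
print(picks["name"].tolist())
```

Exact matching is the simplest possible rule. It returns nothing when no site carries all three labels, so a real recommender would need a fallback.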


#### How the data is distributed

In [14]:
data["area"].value_counts()

area
Euskadi      1272
Barcelona    1250
Madrid        875
Vigo          373
La Palma       28
Name: count, dtype: int64
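
The raw counts are easier to compare as shares of the whole catalogue. The sketch below rebuilds the counts printed above as a standalone Series (the names `area_counts` and `area_share` are mine) and converts them to percentages. On the real frame, `data["area"].value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
area_counts = pd.Series(
    {"Euskadi": 1272, "Barcelona": 1250, "Madrid": 875, "Vigo": 373, "La Palma": 28},
    name="count",
)

# Share of the catalogue per area, as a percentage rounded to one decimal.
area_share = (area_counts / area_counts.sum() * 100).round(1)
print(area_share)
```

La Palma accounts for less than 1% of the sites, so recommendations for that region will draw on very few candidates.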

In [15]:
data["average_sentiment"].value_counts()

average_sentiment
history_freak    1297
curious          1011
relax             859
artsy             442
adventurous       189
Name: count, dtype: int64

In [16]:
data["cat_detailed"].value_counts()

cat_detailed
historic_building        766
religious                367
point_of_interest        296
museum                   281
town                     277
park                     261
experience               257
route                    213
contemporary_building    182
square                   164
cultural_center          157
art_gallery              121
theater                   68
sculpture                 60
urban_route               58
food                      52
lookout                   47
seaside                   46
other                     41
music                     29
archaeological_rest       29
sport                     26
Name: count, dtype: int64

In [17]:
site_distribution = data.groupby('area')['cat_detailed'].value_counts()
site_distribution

area       cat_detailed         
Barcelona  historic_building        235
           square                   164
           point_of_interest        163
           park                     158
           religious                126
           contemporary_building     86
           urban_route               58
           museum                    57
           cultural_center           39
           seaside                   33
           food                      32
           lookout                   30
           sport                     24
           sculpture                 12
           art_gallery               10
           theater                   10
           town                       7
           other                      6
Euskadi    historic_building        318
           town                     270
           experience               219
           route                    213
           museum                   136
           religious                 83
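
The long output of the grouped `value_counts()` is hard to scan. One alternative is a wide table with one row per area and one column per category, built with `pd.crosstab`. The sketch below uses invented toy rows, not the real file.

```python
import pandas as pd

# Toy rows with the same two columns; the values are invented for illustration.
toy = pd.DataFrame(
    {
        "area": ["Madrid", "Madrid", "Vigo", "Vigo", "Vigo"],
        "cat_detailed": ["park", "museum", "park", "park", "seaside"],
    }
)

# Rows are areas, columns are categories, cells are site counts (0 when absent).
table = pd.crosstab(toy["area"], toy["cat_detailed"])
print(table)
```

Unlike the grouped counts, the wide table shows explicit zeros for categories that an area does not have.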
       

In [18]:
data.isna().sum()

id                   0
name                 0
average_sentiment    0
cat_detailed         0
cat_reduced          0
latitud              0
longitud             0
address              0
area                 0
age                  0
way_travel           0
rating               0
dtype: int64
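
Missing values are only one kind of problem. Two further sanity checks can be sketched as below, again on invented toy rows. I assume here that Google Maps ratings range from 1 to 5; the variable names are hypothetical.

```python
import pandas as pd

# Toy rows; the third one deliberately has an out-of-range rating.
toy = pd.DataFrame(
    {
        "name": ["Site A", "Site B", "Site C"],
        "rating": [4.7, 5.0, 0.0],
        "latitud": [40.42, 43.29, 28.68],
        "longitud": [-3.70, -2.25, -17.76],
    }
)

# Ratings outside the assumed 1 to 5 range are suspicious.
bad_rating = ~toy["rating"].between(1, 5)

# Coordinates must fall inside the valid latitude and longitude ranges.
bad_coords = ~(toy["latitud"].between(-90, 90) & toy["longitud"].between(-180, 180))

suspicious = toy[bad_rating | bad_coords]
print(suspicious["name"].tolist())
```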

In [19]:
category_avg_ratings = data.groupby('cat_detailed')['rating'].mean()

In [20]:
category_avg_ratings

cat_detailed
archaeological_rest      3.714286
art_gallery              4.266116
contemporary_building    3.909317
cultural_center          4.414286
experience               4.434894
food                     4.292308
historic_building        3.897147
lookout                  4.412195
museum                   4.316129
music                    4.293103
other                    3.945000
park                     4.284746
point_of_interest        3.957028
religious                4.242458
route                    4.333333
sculpture                4.186207
seaside                  3.775556
sport                    4.330769
square                   4.214286
theater                  4.473529
town                     4.450000
urban_route              3.021053
Name: rating, dtype: float64
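
A mean on its own hides how many sites it is based on. Categories such as music (29 sites) or sport (26 sites) are small, so their averages are less stable than the one for historic_building (766 sites). A minimal sketch that reports both numbers together, using invented toy ratings:

```python
import pandas as pd

# Toy ratings; the values are invented for illustration.
toy = pd.DataFrame(
    {
        "cat_detailed": ["park", "park", "park", "music", "museum", "museum"],
        "rating": [4.0, 4.5, 4.4, 5.0, 4.2, 4.6],
    }
)

# Mean rating and number of sites per category, best-rated first.
summary = (
    toy.groupby("cat_detailed")["rating"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

In the toy data, music ranks first but rests on a single site, which is exactly the case the `count` column is meant to expose.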

The data visualisation was done in Tableau, and the cleaning took place while I was retrieving the data, so no further EDA is needed here.