# Local Satisfaction data preparation  
This notebook reshapes the survey data from **wide format** (one row per observation, with one column per question) to **long format** (one row per observation and question, with one column for the question and one for the answer).

In [1]:
import pandas as pd 
import numpy as np

## 1) Load data  
Load the original data from the Google Sheet.

In [11]:
sheet_id = '1iXFCOE7iAhpajY9v2GjtZM21GDDTjWFRzeO8Q_InP7E'
sheet_name = 'simplified_data'
url = f'https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}'

original_data = pd.read_csv(url)
original_data['id'] = original_data.index
original_data

Unnamed: 0,Satisfaction with life,Sense of belonging in community,Satisfaction with tourism,Satisfaction with tourism .1,Tourism,Jobs with tourism,Entrepreneurship with tourism,Local culture from tourism,Production of local productions with tourism,Views on policies on tourism,Satisfaction with the quality of basic educational services,Satisfaction with the quality of basic healh services,Satisfaction with access to recreation,Satisfaction with access to cultural activities,Satisfaction with safety,Sense of a healthy environment,id
0,80,75.0,70.0,Increase,,,100.0,100.0,100.0,100.0,100.0,,75.0,100.0,50.0,75.0,0
1,60,0.0,,,,,,,,,,,75.0,50.0,100.0,75.0,1
2,80,100.0,,Increase,,,100.0,100.0,100.0,100.0,100.0,,100.0,0.0,100.0,75.0,2
3,70,50.0,,,,,,,,,,,50.0,75.0,75.0,50.0,3
4,70,100.0,90.0,Increase,Nearby BC communities | All of BC | Ot...,British Columbia Visitors | Canadian V...,75.0,75.0,75.0,75.0,75.0,Yes,75.0,75.0,50.0,50.0,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1360,90,75.0,70.0,Increase,All of BC | Other Canadian provinces | United ...,Canadian Visitors | United States of America ...,75.0,75.0,75.0,75.0,0.0,Yes,25.0,25.0,,75.0,1360
1361,20,,,,,,,,,,,,,,,,1361
1362,70,75.0,,Stay the same,Nearby BC communities | All of BC | Ot...,British Columbia Visitors | Canadian V...,75.0,50.0,50.0,75.0,50.0,Yes,75.0,75.0,75.0,75.0,1362
1363,80,0.0,10.0,I do not have an opinion.,Nearby BC communities | All of BC | Ot...,British Columbia Visitors,25.0,25.0,25.0,25.0,50.0,No,25.0,25.0,0.0,50.0,1363
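
The export URL used above is assembled by hand from the sheet id and the tab name. As a minimal sketch, the same pattern can be wrapped in a small helper; `gviz_csv_url` is a hypothetical name, and the URL quoting is an assumption added here so that tab names containing spaces still produce a valid URL.

```python
from urllib.parse import quote


def gviz_csv_url(sheet_id: str, sheet_name: str) -> str:
    """Build the CSV export URL for one tab of a Google Sheet (hypothetical helper)."""
    # quote() percent-encodes spaces and other reserved characters in the tab name
    return (
        f"https://docs.google.com/spreadsheets/d/{sheet_id}"
        f"/gviz/tq?tqx=out:csv&sheet={quote(sheet_name)}"
    )


url = gviz_csv_url("1iXFCOE7iAhpajY9v2GjtZM21GDDTjWFRzeO8Q_InP7E", "simplified_data")
# pd.read_csv(url) would then load the tab, as in the cell above
```

The same helper could be reused for the `questions_table` tab loaded later in the notebook.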


## Transform to long format

In [12]:
df_long = pd.melt(original_data, id_vars=['id'], var_name='Topic', value_name='Answer')
df_long

Unnamed: 0,id,Topic,Answer
0,0,Satisfaction with life,80
1,1,Satisfaction with life,60
2,2,Satisfaction with life,80
3,3,Satisfaction with life,70
4,4,Satisfaction with life,70
...,...,...,...
21835,1360,Sense of a healthy environment,75.0
21836,1361,Sense of a healthy environment,
21837,1362,Sense of a healthy environment,75.0
21838,1363,Sense of a healthy environment,50.0
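
To make the reshaping concrete, here is a small sketch of `pd.melt` on a toy frame. The two respondents and their values are made up for illustration; only the column names are borrowed from the survey.

```python
import pandas as pd

# Toy wide table: one row per respondent, one column per question
toy_wide = pd.DataFrame({
    "id": [0, 1],
    "Satisfaction with life": [80, 60],
    "Satisfaction with safety": [50.0, 100.0],
})

# Long table: one row per (respondent, question) pair
toy_long = pd.melt(toy_wide, id_vars=["id"], var_name="Topic", value_name="Answer")
print(toy_long)
```

Two respondents and two questions give four rows in the long table, each carrying the respondent `id`, the `Topic`, and the `Answer`.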


## 2) Modify / group data  
The following topics (questions) need cleaning, either to split multiple-choice answers into separate rows or to remove invalid answers:  
- Jobs with tourism  
- Tourism  
- Satisfaction with tourism .1  

For each topic, the process is to take the subset of rows, modify it, and then replace the original rows in the long table with the cleaned ones.
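
The subset, modify, and replace pattern can be sketched on a toy table as follows. The topics `A` and `B ` and their answers are placeholders, not survey values.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [0, 1, 0, 1],
    "Topic": ["A", "A", "B ", "B "],
    "Answer": ["1", "2", " x ", None],
})

# 1) Take the subset of rows for the topic that needs cleaning
subset = toy.loc[toy["Topic"] == "B "].copy()

# 2) Modify it: drop missing answers, trim whitespace, tidy the topic label
subset = subset.dropna(subset=["Answer"])
subset["Answer"] = subset["Answer"].str.strip()
subset["Topic"] = "B"

# 3) Remove the original rows and put the cleaned ones back
rest = toy.loc[toy["Topic"] != "B "]
cleaned = pd.concat([rest, subset], ignore_index=True)
```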

### Jobs with tourism 
**Question:**  
_If you saw an advertisement promoting your region as a place for tourists to visit would you agree it was appropriate for the following locations?  (Check all that apply)_

In [13]:
jobs = df_long.loc[df_long['Topic'] == 'Jobs with tourism '].copy()
jobs

Unnamed: 0,id,Topic,Answer
6825,0,Jobs with tourism,
6826,1,Jobs with tourism,
6827,2,Jobs with tourism,
6828,3,Jobs with tourism,
6829,4,Jobs with tourism,British Columbia Visitors | Canadian V...
...,...,...,...
8185,1360,Jobs with tourism,Canadian Visitors | United States of America ...
8186,1361,Jobs with tourism,
8187,1362,Jobs with tourism,British Columbia Visitors | Canadian V...
8188,1363,Jobs with tourism,British Columbia Visitors


In [15]:
jobs.Answer.unique()

array([nan,
       'British Columbia Visitors         | Canadian Visitors | United States of America        | Other countries',
       'British Columbia Visitors        ',
       'British Columbia Visitors         | Canadian Visitors | United States of America       ',
       'Canadian Visitors',
       'British Columbia Visitors         | Canadian Visitors | Other countries',
       'Other countries',
       'British Columbia Visitors         | Canadian Visitors',
       'United States of America       ',
       'British Columbia Visitors         | United States of America        | Other countries',
       'Canadian Visitors | United States of America       ',
       'Canadian Visitors | Other countries',
       'Canadian Visitors | United States of America        | Other countries',
       'British Columbia Visitors         | Other countries',
       'British Columbia Visitors         | United States of America       ',
       'United States of America        | Other countries'], dtype=object)

Split the choices into separate columns, then reshape them to long format.

In [18]:
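# Split each multiple-choice answer on '|' into one column per choice
jobs_exp = jobs['Answer'].dropna().str.split('|', expand=True)
# Note: this index is the row label from df_long, not the respondent id
jobs_exp['id'] = jobs_exp.index
jobs_exp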

Unnamed: 0,0,1,2,3,id
6829,British Columbia Visitors,Canadian Visitors,United States of America,Other countries,6829
6830,British Columbia Visitors,Canadian Visitors,United States of America,Other countries,6830
6831,British Columbia Visitors,Canadian Visitors,United States of America,Other countries,6831
6832,British Columbia Visitors,Canadian Visitors,United States of America,Other countries,6832
6833,British Columbia Visitors,Canadian Visitors,United States of America,Other countries,6833
...,...,...,...,...,...
8184,British Columbia Visitors,Canadian Visitors,United States of America,Other countries,8184
8185,Canadian Visitors,United States of America,Other countries,,8185
8187,British Columbia Visitors,Canadian Visitors,,,8187
8188,British Columbia Visitors,,,,8188


In [26]:
jobs_long = pd.melt(jobs_exp, id_vars=['id'], var_name='Topic', value_name='Answer')
jobs_long['Topic'] = 'Jobs with tourism'
# Drop the empty slots of respondents who picked fewer choices; copy to avoid SettingWithCopyWarning
jobs_long = jobs_long[~jobs_long['Answer'].isna()].copy()
jobs_long['Answer'] = jobs_long['Answer'].str.strip()
jobs_long

Unnamed: 0,id,Topic,Answer
0,6829,Jobs with tourism,British Columbia Visitors
1,6830,Jobs with tourism,British Columbia Visitors
2,6831,Jobs with tourism,British Columbia Visitors
3,6832,Jobs with tourism,British Columbia Visitors
4,6833,Jobs with tourism,British Columbia Visitors
...,...,...,...
4097,8173,Jobs with tourism,Other countries
4100,8178,Jobs with tourism,Other countries
4101,8180,Jobs with tourism,Other countries
4102,8181,Jobs with tourism,Other countries
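
In the output above, `id` holds values such as 6829. These are row labels of `df_long`, not the respondent ids seen in `original_data`. If the respondent id is needed downstream, one possible alternative is to split and `explode` the answers while keeping the existing `id` column. This is a sketch on made-up rows, not the notebook's method.

```python
import pandas as pd

# Toy rows shaped like the `jobs` subset: respondent id in a column,
# df_long row labels in the index
toy_jobs = pd.DataFrame(
    {
        "id": [4, 5, 6],
        "Topic": ["Jobs with tourism "] * 3,
        "Answer": [
            "British Columbia Visitors         | Canadian Visitors",
            None,
            "Other countries",
        ],
    },
    index=[6829, 6830, 6831],
)

toy_jobs_long = (
    toy_jobs.dropna(subset=["Answer"])
    .assign(
        Topic="Jobs with tourism",
        # Turn each answer string into a list of choices
        Answer=lambda d: d["Answer"].str.split("|"),
    )
    .explode("Answer")  # one row per choice; the respondent id is repeated
)
toy_jobs_long["Answer"] = toy_jobs_long["Answer"].str.strip()
toy_jobs_long = toy_jobs_long.reset_index(drop=True)
```

Each choice ends up on its own row, and `id` still identifies the respondent (4, 4, and 6 in this toy case).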


### Tourism  
**Question:**  
_I would welcome visitors from: (Check all that apply)_

The process is the same as for *Jobs with tourism*.

In [30]:
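tourism = df_long.loc[df_long['Topic'] == 'Tourism '].copy()
# Split each multiple-choice answer on '|' into one column per choice
tourism_exp = tourism['Answer'].dropna().str.split('|', expand=True)
# Note: this index is the row label from df_long, not the respondent id
tourism_exp['id'] = tourism_exp.index
tourism_exp.head()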



Unnamed: 0,0,1,2,3,4,id
5464,Nearby BC communities,All of BC,Other Canadian provinces,United States of America,Other countries,5464
5465,Nearby BC communities,All of BC,Other Canadian provinces,United States of America,Other countries,5465
5466,Nearby BC communities,All of BC,Other Canadian provinces,United States of America,Other countries,5466
5467,Nearby BC communities,All of BC,Other Canadian provinces,United States of America,Other countries,5467
5468,Nearby BC communities,All of BC,Other Canadian provinces,United States of America,Other countries,5468


In [32]:
tourism_long = pd.melt(tourism_exp, id_vars=['id'], var_name='Topic', value_name='Answer')
tourism_long['Topic'] = 'Tourism'
# Drop the empty slots of respondents who picked fewer choices; copy to avoid SettingWithCopyWarning
tourism_long = tourism_long[~tourism_long['Answer'].isna()].copy()
tourism_long['Answer'] = tourism_long['Answer'].str.strip()
tourism_long

Unnamed: 0,id,Topic,Answer
0,5464,Tourism,Nearby BC communities
1,5465,Tourism,Nearby BC communities
2,5466,Tourism,Nearby BC communities
3,5467,Tourism,Nearby BC communities
4,5468,Tourism,Nearby BC communities
...,...,...,...
5189,6816,Tourism,Other countries
5190,6819,Tourism,Other countries
5192,6822,Tourism,Other countries
5193,6823,Tourism,Other countries


### Satisfaction with tourism .1

In [34]:
df_long[df_long['Topic'] == 'Satisfaction with tourism .1']['Answer'].unique()

array(['Increase ', nan, 'Stay the same ', 'I do not have an opinion.',
       'Decrease', '特に意見はない', 'Meningkat'], dtype=object)

In [36]:
# Keep only the four expected answer options; some raw values carry trailing spaces
satisfaction = df_long.loc[
    (df_long['Topic'] == 'Satisfaction with tourism .1')
    & df_long['Answer'].isin(['Increase ', 'Stay the same ', 'I do not have an opinion.', 'Decrease'])
].copy()
satisfaction

Unnamed: 0,id,Topic,Answer
4095,0,Satisfaction with tourism .1,Increase
4097,2,Satisfaction with tourism .1,Increase
4099,4,Satisfaction with tourism .1,Increase
4100,5,Satisfaction with tourism .1,Stay the same
4102,7,Satisfaction with tourism .1,Increase
...,...,...,...
5454,1359,Satisfaction with tourism .1,Stay the same
5455,1360,Satisfaction with tourism .1,Increase
5457,1362,Satisfaction with tourism .1,Stay the same
5458,1363,Satisfaction with tourism .1,I do not have an opinion.
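
The filter above lists the valid answers exactly as they appear in the raw data, trailing spaces included. A possible variation, sketched here on made-up rows, strips whitespace first so the list of valid options can be written cleanly. The name `VALID_ANSWERS` is introduced for illustration only.

```python
import pandas as pd

VALID_ANSWERS = ["Increase", "Stay the same", "I do not have an opinion.", "Decrease"]

toy = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "Topic": ["Satisfaction with tourism .1"] * 4,
    "Answer": ["Increase ", None, "Meningkat", "Stay the same "],
})

# Trim whitespace, then keep only rows whose answer is one of the valid options
stripped = toy["Answer"].str.strip()
kept = toy.loc[stripped.isin(VALID_ANSWERS)].assign(Answer=stripped)
```

Missing answers and answers outside the list are dropped, and the kept rows carry the trimmed text.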


### Combine with dataset  

In [37]:
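# Drop the raw rows for the three cleaned topics, then add the cleaned versions
df_long_updated = df_long.loc[~(df_long['Topic'].isin(['Satisfaction with tourism .1', 'Tourism ', 'Jobs with tourism ']))]
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
df_long_updated = pd.concat([df_long_updated, satisfaction, jobs_long, tourism_long])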

In [43]:
df_long_updated[df_long_updated['Topic'] == 'Jobs with tourism']['Answer'].value_counts()

British Columbia Visitors    941
Canadian Visitors            858
Other countries              719
United States of America     678
Name: Answer, dtype: int64

In [44]:
df_long_updated['Topic'] = df_long_updated['Topic'].str.strip()

## 3) Add questions to dataset  
Load the `questions_table` sheet from the same workbook and join it to the long data by topic.

In [45]:
sheet_name = 'questions_table'
url = f'https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}'

question_data = pd.read_csv(url)
question_data['Topic'] = question_data['Topic'].str.strip()
question_data


Unnamed: 0,Topic,Question
0,Satisfaction with life,"Overall, how satisfied are you with your life ..."
1,Sense of belonging in community,How would you describe your feeling of belongi...
2,Satisfaction with tourism,How satisfied are you with the state of touris...
3,Satisfaction with tourism,"Overall, the number of tourists to my site sho..."
4,Tourism,I would welcome visitors from: (Check all that...
5,Jobs with tourism,If you saw an advertisement promoting your reg...
6,Entrepreneurship with tourism,Tourism creates jobs for local people at my site.
7,Local culture from tourism,Tourism promotes local entrepreneurship at my ...
8,Production of local productions with tourism,Tourism promotes the local culture at my site.
9,Views on policies on tourism,Tourism promotes production of local products ...


In [47]:
df_long_updated.Topic.unique()

array(['Satisfaction with life', 'Sense of belonging in community',
       'Satisfaction with tourism', 'Entrepreneurship with tourism',
       'Local culture from tourism',
       'Production of local productions with tourism',
       'Views on policies on tourism',
       'Satisfaction with the quality of basic educational services',
       'Satisfaction with the quality of basic healh services',
       'Satisfaction with access to recreation',
       'Satisfaction with access to cultural activities',
       'Satisfaction with safety', 'Sense of a healthy environment',
       'Satisfaction with tourism .1', 'Jobs with tourism', 'Tourism'],
      dtype=object)

In [48]:
# Row 3 is the second 'Satisfaction with tourism' entry; rename it to match the '.1' topic
question_data.loc[3, 'Topic'] = 'Satisfaction with tourism .1'
question_data

Unnamed: 0,Topic,Question
0,Satisfaction with life,"Overall, how satisfied are you with your life ..."
1,Sense of belonging in community,How would you describe your feeling of belongi...
2,Satisfaction with tourism,How satisfied are you with the state of touris...
3,Satisfaction with tourism .1,"Overall, the number of tourists to my site sho..."
4,Tourism,I would welcome visitors from: (Check all that...
5,Jobs with tourism,If you saw an advertisement promoting your reg...
6,Entrepreneurship with tourism,Tourism creates jobs for local people at my site.
7,Local culture from tourism,Tourism promotes local entrepreneurship at my ...
8,Production of local productions with tourism,Tourism promotes the local culture at my site.
9,Views on policies on tourism,Tourism promotes production of local products ...


In [49]:
df_long = pd.merge(df_long_updated, question_data, on='Topic', how='left')
df_long

Unnamed: 0,id,Topic,Answer,Question
0,0,Satisfaction with life,80,"Overall, how satisfied are you with your life ..."
1,1,Satisfaction with life,60,"Overall, how satisfied are you with your life ..."
2,2,Satisfaction with life,80,"Overall, how satisfied are you with your life ..."
3,3,Satisfaction with life,70,"Overall, how satisfied are you with your life ..."
4,4,Satisfaction with life,70,"Overall, how satisfied are you with your life ..."
...,...,...,...,...
26051,6816,Tourism,Other countries,I would welcome visitors from: (Check all that...
26052,6819,Tourism,Other countries,I would welcome visitors from: (Check all that...
26053,6822,Tourism,Other countries,I would welcome visitors from: (Check all that...
26054,6823,Tourism,Other countries,I would welcome visitors from: (Check all that...
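
A left join keeps every answer row even when its topic has no entry in the questions table, so gaps are easy to miss. As a hedged sketch on toy data, the `validate` and `indicator` arguments of `pd.merge` can surface both duplicate topics in the questions table and topics without a question. The question text below is a placeholder.

```python
import pandas as pd

answers = pd.DataFrame({
    "id": [0, 1, 0],
    "Topic": ["Satisfaction with life", "Satisfaction with life", "Satisfaction with safety"],
    "Answer": [80, 60, 50],
})
questions = pd.DataFrame({
    "Topic": ["Satisfaction with life"],
    "Question": ["placeholder question text"],
})

merged = pd.merge(
    answers,
    questions,
    on="Topic",
    how="left",
    validate="many_to_one",  # raises MergeError if a topic appears twice in `questions`
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# Topics that found no matching question
unmatched = merged.loc[merged["_merge"] == "left_only", "Topic"].unique()
```

With `validate="many_to_one"`, a duplicated topic label would raise an error instead of silently duplicating answer rows. The two `Satisfaction with tourism` rows, before row 3 is renamed, are an example of such a duplicate.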


## 4) Save data  
The dataset is now ready for use in visualizations (Tableau or Python), or for further formatting for the API once the data model is ready.

In [50]:
df_long.to_csv('../data/local_satisfaction_long_data.csv', index=False)