# Analysing the data

Extracting insights from the complete, tidy dataset to turn into charts on **Datawrapper**.

In [20]:
# Import libraries
import pandas as pd
import warnings

In [21]:
# Open dataset
df = pd.read_csv("../04_tidy_data/china_cleaned.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 164 entries, 0 to 163
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   date                  164 non-null    object 
 1   city                  164 non-null    object 
 2   expense               164 non-null    object 
 3   payment_source        164 non-null    object 
 4   payment_type          164 non-null    object 
 5   category              164 non-null    object 
 6   price                 164 non-null    float64
 7   price_usd             164 non-null    float64
 8   price_usd_per_capita  164 non-null    float64
dtypes: float64(3), object(6)
memory usage: 11.7+ KB


In [22]:
# After cleaning, the date column is plain text again. Parse it so we can sort chronologically,
# then store it back as "Mon-DD" text for display:
df["date"] = pd.to_datetime(df["date"], format='%b-%d')
df = df.sort_values("date", ascending=True)
df["date"] = df["date"].dt.strftime('%b-%d')
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 164 entries, 0 to 163
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   date                  164 non-null    object 
 1   city                  164 non-null    object 
 2   expense               164 non-null    object 
 3   payment_source        164 non-null    object 
 4   payment_type          164 non-null    object 
 5   category              164 non-null    object 
 6   price                 164 non-null    float64
 7   price_usd             164 non-null    float64
 8   price_usd_per_capita  164 non-null    float64
dtypes: float64(3), object(6)
memory usage: 12.8+ KB
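
The parse, sort and format-back steps come up several times in this notebook. As a minimal sketch, they could be wrapped in a small helper that sorts by the parsed dates without touching the stored strings. The helper name `sort_by_month_day` and the toy rows are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd


def sort_by_month_day(frame, column="date", fmt="%b-%d"):
    """Return `frame` sorted chronologically by a month-day text column.

    The column stays as text; it is only parsed to decide the row order.
    """
    return frame.sort_values(
        column,
        key=lambda s: pd.to_datetime(s, format=fmt),
        kind="stable",  # rows that share a date keep their original order
    )


# Toy rows (not the real dataset), only to show the ordering
sample = pd.DataFrame(
    {"date": ["Jun-02", "May-30", "May-11"], "price": [1.0, 2.0, 3.0]}
)
sorted_sample = sort_by_month_day(sample)
print(sorted_sample["date"].tolist())
```

A plain text sort would put `Jun-02` before `May-11`; sorting on the parsed values avoids that.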


In [23]:
df.head(5)

Unnamed: 0,date,city,expense,payment_source,payment_type,category,price,price_usd,price_usd_per_capita
0,May-11,Beijing,Taxi from the airport to Renata's house,Carol,apps,Transportation,87.0,12.08,6.04
117,May-11,China,Round-trip flight from São Paulo to Beijing,Paula,credit card,Transportation,23791.9,3304.43,1652.21
11,May-12,Beijing,Didi to the Pearl Market (Hongqiao Market),Carol,apps,Transportation,13.3,1.85,0.92
10,May-12,Beijing,Mountain Coffee,Carol,apps,Food,25.0,3.47,1.74
8,May-12,Beijing,Didi home,Carol,apps,Transportation,45.0,6.25,3.12


## 1- Comparing expenses in general

#### What were our 10 largest individual expenses?

In [25]:
# nlargest already returns the rows in descending order, so no prior sort is needed
top10 = df.nlargest(10, "price_usd_per_capita")
top10

Unnamed: 0,date,city,expense,payment_source,payment_type,category,price,price_usd,price_usd_per_capita
117,May-11,China,Round-trip flight from São Paulo to Beijing,Paula,credit card,Transportation,23791.9,3304.43,1652.21
114,May-24,Lhasa,Tibet tour package,Paula,credit card,Tour Agency,15183.0,2108.75,702.92
112,May-24,Lhasa,Round-trip flight from Beijing to Lhasa,Paula,credit card,Transportation,9169.11,1273.49,424.5
149,May-31,Guangzhou,Round-trip flight from Beijing to Guangzhou,Renata,apps,Transportation,6150.0,854.17,284.72
87,May-20,Shanghai,Flight from Shanghai to Beijing,Renata,apps,Transportation,1460.0,202.78,101.39
84,May-19,Shanghai,Uniqlo haul,Paula,credit card,Shopping,1162.0,161.39,80.69
147,May-30,Beijing,Silk scarves,Paula,credit card,Shopping,1145.0,159.03,79.51
15,May-13,Datong,Datong tourism package,Paula,credit card,Tour Agency,1080.0,150.0,75.0
70,May-18,Shanghai,Homeinn Hotel,Renata,apps,Hotel,970.0,134.72,67.36
148,May-31,Guangzhou,SunYat Sen University Kaifeng Hotel,Renata,apps,Hotel,1382.0,191.94,63.98


Predictable: **airfares** are the most expensive items, with the long international flight at the top.
<br>
The **tour packages** (especially the longest and most distant one, to Tibet) are also there.
<br>
Then come some of the pricier **hotels**, plus *two specific purchases* that stand out and that we can explore in the dataviz.

## 2- Comparing expenses over time

#### How much did we spend per person per day?

In [26]:
# Group by date and sum what was spent on each day
by_day = df.groupby("date")["price_usd_per_capita"].sum().reset_index()

# groupby sorts the date strings alphabetically, so parse them to restore chronological order
by_day["date"] = pd.to_datetime(by_day["date"], format='%b-%d')
by_day = by_day.sort_values("date", ascending=True)
by_day["date"] = by_day["date"].dt.strftime('%b-%d')

by_day

Unnamed: 0,date,price_usd_per_capita
3,May-11,1658.25
4,May-12,36.56
5,May-13,152.83
6,May-14,43.38
7,May-15,37.69
8,May-16,67.06
9,May-17,93.47
10,May-18,125.11
11,May-19,145.97
12,May-20,155.93


#### How much was spent by day and by category?

In [28]:
# Group by date and category and sum what was spent in each category on each day
by_day_category = df.groupby(["date", "category"])["price_usd_per_capita"].sum().reset_index()

# groupby sorts the date strings alphabetically, so parse them to restore chronological order
by_day_category["date"] = pd.to_datetime(by_day_category["date"], format='%b-%d')
by_day_category = by_day_category.sort_values("date", ascending=True)
by_day_category["date"] = by_day_category["date"].dt.strftime('%b-%d')

by_day_category

Unnamed: 0,date,category,price_usd_per_capita
9,May-11,Transportation,1658.25
12,May-12,Transportation,11.63
11,May-12,Shopping,18.06
10,May-12,Food,6.87
13,May-13,Food,13.69
...,...,...,...
5,Jun-02,Tickets,0.46
4,Jun-02,Shopping,1.22
3,Jun-02,Food,2.08
8,Jun-03,Transportation,6.25


We'll pivot this table to get a dataframe that we can turn into a **heatmap** or a **streamflow** on our website.

In [64]:
# Pivot the dataframe to a wide format
by_day_category_wide = by_day_category.pivot(index="date", columns="category", values="price_usd_per_capita")

# Flatten the dataframe
by_day_category_wide = pd.DataFrame(by_day_category_wide.to_records())

# Fix the date column (again!) and order by day
by_day_category_wide["date"] = pd.to_datetime(by_day_category_wide["date"], format='%b-%d')
by_day_category_wide = by_day_category_wide.sort_values("date", ascending=True)
by_day_category_wide["date"] = by_day_category_wide["date"].dt.strftime('%b-%d')

# Replace all NAs with 0 (fine in this case: a missing value means nothing was spent in that category on that day)
by_day_category_wide = by_day_category_wide.fillna(0)

# Create a new column for the total expense on each day
by_day_category_wide["total_expenses"] = by_day_category_wide["Food"] + by_day_category_wide["Hotel"] + by_day_category_wide["Shopping"] + by_day_category_wide["Tickets"] + by_day_category_wide["Tour Agency"] + by_day_category_wide["Transportation"]

# Rename the columns to snake_case so they follow tidy-data naming practices:
by_day_category_wide.rename(columns={ "Food":"food",
                                      "Hotel":"hotel",
                                      "Shopping":"shopping",
                                      "Tickets":"tickets",
                                      "Tour Agency":"tour_agency",
                                      "Transportation":"transportation"},
                            inplace=True)

by_day_category_wide

Unnamed: 0,date,food,hotel,shopping,tickets,tour_agency,transportation,total_expenses
3,May-11,0.0,0.0,0.0,0.0,0.0,1658.25,1658.25
4,May-12,6.87,0.0,18.06,0.0,0.0,11.63,36.56
5,May-13,13.69,20.28,5.42,2.36,75.0,36.08,152.83
6,May-14,17.5,0.0,15.11,7.99,0.0,2.78,43.38
7,May-15,2.08,0.0,6.04,0.0,0.0,29.57,37.69
8,May-16,55.64,0.0,0.0,6.95,0.0,4.47,67.06
9,May-17,29.51,0.0,2.78,0.0,0.0,61.18,93.47
10,May-18,7.81,97.73,0.46,9.26,0.0,9.85,125.11
11,May-19,17.44,0.0,93.24,30.28,0.0,5.01,145.97
12,May-20,12.23,0.0,23.42,4.17,0.0,116.11,155.93
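
As a hedged alternative to the cell above, `pivot_table` can fill the empty cells while pivoting, and a row-wise sum avoids naming every category by hand. This is only a sketch on toy rows; the names `long_sample`, `wide_sample` and `category_cols` are made up for the example.

```python
import pandas as pd

# Toy long-format rows (not the real dataset)
long_sample = pd.DataFrame(
    {
        "date": ["May-12", "May-12", "May-13"],
        "category": ["Food", "Shopping", "Food"],
        "price_usd_per_capita": [6.87, 18.06, 13.69],
    }
)

wide_sample = (
    long_sample.pivot_table(
        index="date",
        columns="category",
        values="price_usd_per_capita",
        aggfunc="sum",
        fill_value=0,  # a missing date/category pair means nothing was spent
    )
    .rename_axis(columns=None)  # drop the "category" label on the column axis
    .reset_index()  # turn "date" back into a regular column
)

# Sum across whatever category columns exist instead of listing them one by one
category_cols = [c for c in wide_sample.columns if c != "date"]
wide_sample["total_expenses"] = wide_sample[category_cols].sum(axis=1)
print(wide_sample)
```

Summing over `category_cols` keeps working if a category is added or is absent from a subset of the data.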


Now let's convert the values into each category's share of the day's total (proportions between 0 and 1).

In [84]:
# Divide each category column by the day's total
by_day_category_wide["pct_food"] = (by_day_category_wide["food"] / by_day_category_wide["total_expenses"]).round(3)
by_day_category_wide["pct_hotel"] = (by_day_category_wide["hotel"] / by_day_category_wide["total_expenses"]).round(3)
by_day_category_wide["pct_shopping"] = (by_day_category_wide["shopping"] / by_day_category_wide["total_expenses"]).round(3)
by_day_category_wide["pct_tickets"] = (by_day_category_wide["tickets"] / by_day_category_wide["total_expenses"]).round(3)
by_day_category_wide["pct_tour_agency"] = (by_day_category_wide["tour_agency"] / by_day_category_wide["total_expenses"]).round(3)
by_day_category_wide["pct_transportation"] = (by_day_category_wide["transportation"] / by_day_category_wide["total_expenses"]).round(3)

# Select only the pct columns and the total
by_day_category_wide_pct = by_day_category_wide[["date",
                                                 "pct_food",
                                                 "pct_hotel",
                                                 "pct_shopping",
                                                 "pct_tickets",
                                                 "pct_tour_agency",
                                                 "pct_transportation",
                                                 "total_expenses"]]

by_day_category_wide_pct

Unnamed: 0,date,pct_food,pct_hotel,pct_shopping,pct_tickets,pct_tour_agency,pct_transportation,total_expenses
3,May-11,0.0,0.0,0.0,0.0,0.0,1.0,1658.25
4,May-12,0.188,0.0,0.494,0.0,0.0,0.318,36.56
5,May-13,0.09,0.133,0.035,0.015,0.491,0.236,152.83
6,May-14,0.403,0.0,0.348,0.184,0.0,0.064,43.38
7,May-15,0.055,0.0,0.16,0.0,0.0,0.785,37.69
8,May-16,0.83,0.0,0.0,0.104,0.0,0.067,67.06
9,May-17,0.316,0.0,0.03,0.0,0.0,0.655,93.47
10,May-18,0.062,0.781,0.004,0.074,0.0,0.079,125.11
11,May-19,0.119,0.0,0.639,0.207,0.0,0.034,145.97
12,May-20,0.078,0.0,0.15,0.027,0.0,0.745,155.93
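
The six near-identical divisions can also be written as one row-wise operation. The sketch below uses `DataFrame.div` with `axis=0` on two rows copied from the table above (zero-only columns left out for brevity); `wide_sample`, `shares` and `pct_sample` are illustrative names, not objects from the notebook.

```python
import pandas as pd

# Two rows taken from the wide table above, without the all-zero columns
wide_sample = pd.DataFrame(
    {
        "date": ["May-12", "May-14"],
        "food": [6.87, 17.5],
        "shopping": [18.06, 15.11],
        "tickets": [0.0, 7.99],
        "transportation": [11.63, 2.78],
    }
)

category_cols = ["food", "shopping", "tickets", "transportation"]
total = wide_sample[category_cols].sum(axis=1)

shares = (
    wide_sample[category_cols]
    .div(total, axis=0)  # divide every row by that row's own total
    .round(3)
    .add_prefix("pct_")
)

pct_sample = pd.concat([wide_sample[["date"]], shares], axis=1)
pct_sample["total_expenses"] = total.round(2)
print(pct_sample)
```

Note that a day whose total is zero would produce `NaN` shares here, so such rows would need separate handling.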


## 3- Comparing expenses by city

Let's work out how much we spent per person per day, on average, in each city:

In [34]:
# Get the total for each city
by_city_expenses = df.groupby("city")["price_usd_per_capita"].sum().reset_index().sort_values(by="price_usd_per_capita", ascending=False)
by_city_expenses

Unnamed: 0,city,price_usd_per_capita
1,China,1652.21
4,Lhasa,1240.2
3,Guangzhou,386.95
0,Beijing,371.23
6,Shanghai,365.53
2,Datong,217.33
8,Suzhou,114.41
5,Mutianyu,65.24
7,Shigatse,39.49


In [40]:
# Get the number of days we spent in each city
by_city_days = df.groupby("city")["date"].nunique().reset_index().sort_values(by="date", ascending=False)
by_city_days

Unnamed: 0,city,date
0,Beijing,13
4,Lhasa,4
2,Datong,3
3,Guangzhou,3
6,Shanghai,3
7,Shigatse,2
8,Suzhou,2
1,China,1
5,Mutianyu,1
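
The totals and the day counts computed above could also come from a single `groupby` with named aggregation, which makes a merge unnecessary. A minimal sketch on toy rows follows; the column names `total` and `days` are assumptions chosen for the example.

```python
import pandas as pd

# Toy rows (not the real dataset)
sample = pd.DataFrame(
    {
        "city": ["Beijing", "Beijing", "Beijing", "Datong"],
        "date": ["May-11", "May-12", "May-12", "May-13"],
        "price_usd_per_capita": [6.0, 4.0, 10.0, 75.0],
    }
)

by_city_sample = (
    sample.groupby("city")
    .agg(
        total=("price_usd_per_capita", "sum"),  # money spent in the city
        days=("date", "nunique"),  # distinct dates with at least one expense
    )
    .reset_index()
)
by_city_sample["daily_average"] = (
    by_city_sample["total"] / by_city_sample["days"]
).round(0)
print(by_city_sample)
```

Either way, `days` only counts dates on which an expense was logged for that city, so a day spent somewhere without spending anything would not be counted.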


In [42]:
# Merge these two tables and create a new column with the simple average
by_city = by_city_expenses.merge(by_city_days, on="city", how='inner')
by_city["daily_average"] = by_city["price_usd_per_capita"]/by_city["date"]
by_city["daily_average"] = by_city["daily_average"].round(0)

# We'll remove the row that says "China" because it only has the airfare from São Paulo to Beijing
by_city = by_city.query("city != 'China'").sort_values(by="daily_average", ascending=False)
by_city

Unnamed: 0,city,price_usd_per_capita,date,daily_average
1,Lhasa,1240.2,4,310.0
2,Guangzhou,386.95,3,129.0
4,Shanghai,365.53,3,122.0
5,Datong,217.33,3,72.0
7,Mutianyu,65.24,1,65.0
6,Suzhou,114.41,2,57.0
3,Beijing,371.23,13,29.0
8,Shigatse,39.49,2,20.0


This is a good table for a **bar chart** comparing expenses.
<br>
We can also get a **heatmap** or a **streamflow** if we get expenses by city and category:

In [12]:
# Getting the data in long format
by_city_category = df.groupby(["city", "category"])["price_usd_per_capita"].sum().reset_index()
by_city_category

Unnamed: 0,city,category,price_usd_per_capita
0,Beijing,Food,98.86
1,Beijing,Shopping,189.99
2,Beijing,Tickets,13.48
3,Beijing,Transportation,68.9
4,China,Transportation,1652.21
5,Datong,Food,32.71
6,Datong,Hotel,20.28
7,Datong,Shopping,21.15
8,Datong,Tickets,7.99
9,Datong,Tour Agency,75.0


In [81]:
# Pivot the dataframe to a wide format
by_city_category_wide = by_city_category.pivot(index="city", columns="category", values="price_usd_per_capita")

# Flatten the dataframe
by_city_category_wide = pd.DataFrame(by_city_category_wide.to_records())

# Replace all NAs with 0 (fine in this case: a missing value means nothing was spent in that category in that city)
by_city_category_wide = by_city_category_wide.fillna(0)

# Create a new column for the total expense in each city
by_city_category_wide["total_expenses"] = by_city_category_wide["Food"] + by_city_category_wide["Hotel"] + by_city_category_wide["Shopping"] + by_city_category_wide["Tickets"] + by_city_category_wide["Tour Agency"] + by_city_category_wide["Transportation"]

# We'll remove the row that says "China" because it only has the airfare from São Paulo to Beijing
by_city_category_wide = by_city_category_wide.query("city != 'China'").sort_values(by="total_expenses", ascending=False)

# Rename the columns to snake_case so they follow tidy-data naming practices:
by_city_category_wide.rename(columns={ "Food":"food",
                                      "Hotel":"hotel",
                                      "Shopping":"shopping",
                                      "Tickets":"tickets",
                                      "Tour Agency":"tour_agency",
                                      "Transportation":"transportation"},
                            inplace=True)

by_city_category_wide

Unnamed: 0,city,food,hotel,shopping,tickets,tour_agency,transportation,total_expenses
4,Lhasa,23.2,0.0,38.84,44.45,702.92,430.79,1240.2
3,Guangzhou,11.81,63.98,1.22,17.54,0.0,292.4,386.95
0,Beijing,98.86,0.0,189.99,13.48,0.0,68.9,371.23
6,Shanghai,36.55,67.36,116.66,34.45,0.0,110.51,365.53
2,Datong,32.71,20.28,21.15,7.99,75.0,60.2,217.33
8,Suzhou,2.39,30.37,3.24,9.26,0.0,69.15,114.41
5,Mutianyu,7.5,0.0,2.78,27.78,0.0,27.18,65.24
7,Shigatse,12.42,0.0,27.07,0.0,0.0,0.0,39.49


Getting each category's share of the city's total:

In [115]:
# Divide each category column by the city's total
by_city_category_wide["pct_food"] = (by_city_category_wide["food"] / by_city_category_wide["total_expenses"]).round(3)
by_city_category_wide["pct_hotel"] = (by_city_category_wide["hotel"] / by_city_category_wide["total_expenses"]).round(3)
by_city_category_wide["pct_shopping"] = (by_city_category_wide["shopping"] / by_city_category_wide["total_expenses"]).round(3)
by_city_category_wide["pct_tickets"] = (by_city_category_wide["tickets"] / by_city_category_wide["total_expenses"]).round(3)
by_city_category_wide["pct_tour_agency"] = (by_city_category_wide["tour_agency"] / by_city_category_wide["total_expenses"]).round(3)
by_city_category_wide["pct_transportation"] = (by_city_category_wide["transportation"] / by_city_category_wide["total_expenses"]).round(3)

# Select only the pct columns and the total
by_city_category_wide_pct = by_city_category_wide[["city",
                                                 "pct_food",
                                                 "pct_hotel",
                                                 "pct_shopping",
                                                 "pct_tickets",
                                                 "pct_tour_agency",
                                                 "pct_transportation",
                                                 "total_expenses"]]

by_city_category_wide_pct

Unnamed: 0,city,pct_food,pct_hotel,pct_shopping,pct_tickets,pct_tour_agency,pct_transportation,total_expenses
4,Lhasa,0.019,0.0,0.031,0.036,0.567,0.347,1240.2
3,Guangzhou,0.031,0.165,0.003,0.045,0.0,0.756,386.95
0,Beijing,0.266,0.0,0.512,0.036,0.0,0.186,371.23
6,Shanghai,0.1,0.184,0.319,0.094,0.0,0.302,365.53
2,Datong,0.151,0.093,0.097,0.037,0.345,0.277,217.33
8,Suzhou,0.021,0.265,0.028,0.081,0.0,0.604,114.41
5,Mutianyu,0.115,0.0,0.043,0.426,0.0,0.417,65.24
7,Shigatse,0.315,0.0,0.685,0.0,0.0,0.0,39.49


## 4- Breakdown by city

### 4.1- Beijing

In [143]:
# All expenses from Beijing
df_beijing = df[df["city"] == "Beijing"]

In [144]:
# Top expenses from Beijing
top5_beijing = df_beijing.nlargest(5, "price_usd_per_capita")
top5_beijing

Unnamed: 0,date,city,expense,payment_source,payment_type,category,price,price_usd,price_usd_per_capita
147,May-30,Beijing,Silk scarves,Paula,credit card,Shopping,1145.0,159.03,79.51
161,Jun-03,Beijing,Taobao and Meituan,Renata,apps,Shopping,840.47,116.73,58.37
42,May-16,Beijing,Dinner at Migas (Spanish),Diva,apps,Food,645.0,89.58,44.79
61,May-17,Beijing,Dinner,Tica,apps,Food,303.0,42.08,21.04
2,May-12,Beijing,Necklace and earrings Carol + gift earring fro...,Carol,apps,Shopping,170.0,23.61,11.81


In [145]:
# Expenses in Beijing by category
category_beijing = df_beijing.groupby("category")["price_usd_per_capita"].sum().reset_index().sort_values(by="price_usd_per_capita", ascending=False)
category_beijing

Unnamed: 0,category,price_usd_per_capita
1,Shopping,189.99
0,Food,98.86
3,Transportation,68.9
2,Tickets,13.48


### 4.2- Datong

In [146]:
# All expenses from Datong
df_datong = df[df["city"] == "Datong"]

In [147]:
# Top expenses from Datong
top5_datong = df_datong.nlargest(5, "price_usd_per_capita")
top5_datong

Unnamed: 0,date,city,expense,payment_source,payment_type,category,price,price_usd,price_usd_per_capita
15,May-13,Datong,Datong tourism package,Paula,credit card,Tour Agency,1080.0,150.0,75.0
12,May-13,Datong,Train from Beijing to Datong,Renata,apps,Transportation,378.0,52.5,26.25
38,May-15,Datong,Train from Datong to Beijing,Renata,apps,Transportation,366.0,50.83,25.42
22,May-13,Datong,Da Tong Weidu International Hotel,Diva,apps,Hotel,292.0,40.56,20.28
33,May-14,Datong,Lunch at Yunjinhui,Diva,apps,Food,232.0,32.22,16.11


In [148]:
# Expenses in Datong by category
category_datong = df_datong.groupby("category")["price_usd_per_capita"].sum().reset_index().sort_values(by="price_usd_per_capita", ascending=False)
category_datong

Unnamed: 0,category,price_usd_per_capita
4,Tour Agency,75.0
5,Transportation,60.2
0,Food,32.71
2,Shopping,21.15
1,Hotel,20.28
3,Tickets,7.99


### 4.3- Suzhou

In [149]:
# All expenses from Suzhou
df_suzhou = df[df["city"] == "Suzhou"]

In [150]:
# Top expenses from Suzhou
top5_suzhou = df_suzhou.nlargest(5, "price_usd_per_capita")
top5_suzhou

Unnamed: 0,date,city,expense,payment_source,payment_type,category,price,price_usd,price_usd_per_capita
55,May-17,Suzhou,Train from Beijing to Suzhou,Renata,apps,Transportation,1224.0,170.0,56.67
59,May-18,Suzhou,HanTin Premium Hotel,Renata,apps,Hotel,656.0,91.11,30.37
63,May-18,Suzhou,Entrance Humble Administrator's Garden,Carol,apps,Tickets,200.0,27.78,9.26
62,May-18,Suzhou,Train from Suzhou to Shanghai,Renata,apps,Transportation,168.0,23.33,7.78
54,May-17,Suzhou,Didi to Beijingnan station,Diva,apps,Transportation,47.0,6.53,3.26


In [151]:
# Expenses in Suzhou by category
category_suzhou = df_suzhou.groupby("category")["price_usd_per_capita"].sum().reset_index().sort_values(by="price_usd_per_capita", ascending=False)
category_suzhou

Unnamed: 0,category,price_usd_per_capita
4,Transportation,69.15
1,Hotel,30.37
3,Tickets,9.26
2,Shopping,3.24
0,Food,2.39


### 4.4- Shanghai

In [152]:
# All expenses from Shanghai
df_shanghai = df[df["city"] == "Shanghai"]

In [153]:
# Top expenses from Shanghai
top5_shanghai = df_shanghai.nlargest(5, "price_usd_per_capita")
top5_shanghai

Unnamed: 0,date,city,expense,payment_source,payment_type,category,price,price_usd,price_usd_per_capita
87,May-20,Shanghai,Flight from Shanghai to Beijing,Renata,apps,Transportation,1460.0,202.78,101.39
84,May-19,Shanghai,Uniqlo haul,Paula,credit card,Shopping,1162.0,161.39,80.69
70,May-18,Shanghai,Homeinn Hotel,Renata,apps,Hotel,970.0,134.72,67.36
74,May-19,Shanghai,Oriental Pearl Tower ticket,Diva,apps,Tickets,256.0,35.56,17.78
78,May-19,Shanghai,MAP Museum of Art of Pudong,Diva,apps,Tickets,180.0,25.0,12.5


In [154]:
# Expenses in Shanghai by category
category_shanghai = df_shanghai.groupby("category")["price_usd_per_capita"].sum().reset_index().sort_values(by="price_usd_per_capita", ascending=False)
category_shanghai

Unnamed: 0,category,price_usd_per_capita
2,Shopping,116.66
4,Transportation,110.51
1,Hotel,67.36
0,Food,36.55
3,Tickets,34.45


### 4.5- Lhasa

In [162]:
# All expenses from Lhasa
df_lhasa = df[df["city"] == "Lhasa"]
df_lhasa.info()

<class 'pandas.core.frame.DataFrame'>
Index: 17 entries, 114 to 137
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   date                  17 non-null     object 
 1   city                  17 non-null     object 
 2   expense               17 non-null     object 
 3   payment_source        17 non-null     object 
 4   payment_type          17 non-null     object 
 5   category              17 non-null     object 
 6   price                 17 non-null     float64
 7   price_usd             17 non-null     float64
 8   price_usd_per_capita  17 non-null     float64
dtypes: float64(3), object(6)
memory usage: 1.3+ KB


In [163]:
# Top expenses from Lhasa
top5_lhasa = df_lhasa.nlargest(5, "price_usd_per_capita")
top5_lhasa

Unnamed: 0,date,city,expense,payment_source,payment_type,category,price,price_usd,price_usd_per_capita
114,May-24,Lhasa,Tibet tour package,Paula,credit card,Tour Agency,15183.0,2108.75,702.92
112,May-24,Lhasa,Round-trip flight from Beijing to Lhasa,Paula,credit card,Transportation,9169.11,1273.49,424.5
115,May-24,Lhasa,Princess Wejcheng Show,Tica,apps,Tickets,840.0,116.67,38.89
113,May-24,Lhasa,Surprise hot pot lunch,Tica,apps,Food,438.0,60.83,20.28
140,May-29,Lhasa,Shopping at Norbulingka,Diva,apps,Shopping,308.0,42.78,14.26


In [164]:
# Expenses in Lhasa by category
category_lhasa = df_lhasa.groupby("category")["price_usd_per_capita"].sum().reset_index().sort_values(by="price_usd_per_capita", ascending=False)
category_lhasa

Unnamed: 0,category,price_usd_per_capita
3,Tour Agency,702.92
4,Transportation,430.79
2,Tickets,44.45
1,Shopping,38.84
0,Food,23.2


### 4.6- Shigatse

In [165]:
# All expenses from Shigatse
df_shigatse = df[df["city"] == "Shigatse"]
df_shigatse.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11 entries, 127 to 129
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   date                  11 non-null     object 
 1   city                  11 non-null     object 
 2   expense               11 non-null     object 
 3   payment_source        11 non-null     object 
 4   payment_type          11 non-null     object 
 5   category              11 non-null     object 
 6   price                 11 non-null     float64
 7   price_usd             11 non-null     float64
 8   price_usd_per_capita  11 non-null     float64
dtypes: float64(3), object(6)
memory usage: 880.0+ bytes


In [166]:
# Top expenses from Shigatse
top5_shigatse = df_shigatse.nlargest(5, "price_usd_per_capita")
top5_shigatse

Unnamed: 0,date,city,expense,payment_source,payment_type,category,price,price_usd,price_usd_per_capita
127,May-28,Shigatse,Shopping at the Old Market,Carol,apps,Shopping,125.0,17.36,8.68
123,May-27,Shigatse,Tip for Samsara and Kunga,Carol,apps,Shopping,100.0,13.89,6.94
131,May-28,Shigatse,Dinner at Pizza Hut,Diva,apps,Food,57.0,7.92,3.96
124,May-27,Shigatse,Dinner,Carol,apps,Food,52.0,7.22,3.61
130,May-28,Shigatse,Shopping,Carol,apps,Shopping,50.0,6.94,3.47


In [133]:
# Expenses in Shigatse by category
category_shigatse = df_shigatse.groupby("category")["price_usd_per_capita"].sum().reset_index().sort_values(by="price_usd_per_capita", ascending=False)
category_shigatse

Unnamed: 0,category,price_usd_per_capita
1,Shopping,27.07
0,Food,12.42


### 4.7- Guangzhou

In [167]:
# All expenses from Guangzhou
df_guangzhou = df[df["city"] == "Guangzhou"]
df_guangzhou.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12 entries, 149 to 157
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   date                  12 non-null     object 
 1   city                  12 non-null     object 
 2   expense               12 non-null     object 
 3   payment_source        12 non-null     object 
 4   payment_type          12 non-null     object 
 5   category              12 non-null     object 
 6   price                 12 non-null     float64
 7   price_usd             12 non-null     float64
 8   price_usd_per_capita  12 non-null     float64
dtypes: float64(3), object(6)
memory usage: 960.0+ bytes


In [168]:
# Top expenses from Guangzhou
top5_guangzhou = df_guangzhou.nlargest(5, "price_usd_per_capita")
top5_guangzhou

Unnamed: 0,date,city,expense,payment_source,payment_type,category,price,price_usd,price_usd_per_capita
149,May-31,Guangzhou,Round-trip flight from Beijing to Guangzhou,Renata,apps,Transportation,6150.0,854.17,284.72
148,May-31,Guangzhou,SunYat Sen University Kaifeng Hotel,Renata,apps,Hotel,1382.0,191.94,63.98
150,Jun-01,Guangzhou,Pearl River Cruise,Renata,apps,Tickets,369.0,51.25,17.08
151,Jun-01,Guangzhou,Car from airport to hotel,Renata,apps,Transportation,130.0,18.06,6.02
153,Jun-01,Guangzhou,Dinner,Diva,apps,Food,120.0,16.67,5.56


In [169]:
# Expenses in Guangzhou by category
category_guangzhou = df_guangzhou.groupby("category")["price_usd_per_capita"].sum().reset_index().sort_values(by="price_usd_per_capita", ascending=False)
category_guangzhou

Unnamed: 0,category,price_usd_per_capita
4,Transportation,292.4
1,Hotel,63.98
3,Tickets,17.54
0,Food,11.81
2,Shopping,1.22
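
The per-city cells in this section repeat the same two steps: pick the top expenses, then total by category. As a rough sketch, both could be produced for every city in one loop. The function name `city_breakdown` and the toy rows are hypothetical and only illustrate the pattern.

```python
import pandas as pd


def city_breakdown(frame, top_n=5):
    """Map each city to (its top-N expenses, its per-category totals)."""
    result = {}
    for city, group in frame.groupby("city"):
        top = group.nlargest(top_n, "price_usd_per_capita")
        by_category = (
            group.groupby("category")["price_usd_per_capita"]
            .sum()
            .sort_values(ascending=False)
            .reset_index()
        )
        result[city] = (top, by_category)
    return result


# Toy rows (not the real dataset)
sample = pd.DataFrame(
    {
        "city": ["Beijing", "Beijing", "Datong"],
        "category": ["Food", "Shopping", "Food"],
        "expense": ["Dinner", "Scarves", "Lunch"],
        "price_usd_per_capita": [21.04, 79.51, 16.11],
    }
)

breakdown = city_breakdown(sample, top_n=1)
top_beijing, categories_beijing = breakdown["Beijing"]
print(top_beijing)
print(categories_beijing)
```

Keeping the results in a dictionary keyed by city makes it easy to export one table per city later on.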


## 5- Comparing expenses by type of payment

#### How much did we spend using each payment type?

In [171]:
total_by_type = df.groupby("payment_type")["price"].sum().reset_index()
total_by_type

Unnamed: 0,payment_type,price
0,apps,24443.66
1,credit card,52149.01


#### How many payments did we make with each type?

In [172]:
count_payment_type = df["payment_type"].value_counts().reset_index()
count_payment_type

Unnamed: 0,payment_type,count
0,apps,154
1,credit card,10


#### What was the average price per capita of expenses from each type of payment?

In [174]:
avg_payment_type = df.groupby("payment_type")["price_usd_per_capita"].mean().reset_index()
avg_payment_type

Unnamed: 0,payment_type,price_usd_per_capita
0,apps,9.057468
1,credit card,305.774
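
The three questions above (total, count and average per payment type) could also be answered in one table with named aggregation. The sketch below runs on toy rows, and the output column names are assumptions picked for readability.

```python
import pandas as pd

# Toy rows (not the real dataset)
sample = pd.DataFrame(
    {
        "payment_type": ["apps", "apps", "credit card"],
        "price": [87.0, 13.3, 1162.0],
        "price_usd_per_capita": [6.04, 0.92, 80.69],
    }
)

payment_summary = (
    sample.groupby("payment_type")
    .agg(
        total_price=("price", "sum"),  # total in the original currency
        n_payments=("price", "count"),  # number of payments
        avg_usd_per_capita=("price_usd_per_capita", "mean"),
    )
    .reset_index()
)
print(payment_summary)
```

Having the three measures side by side makes the contrast easier to read: few but large card payments against many small app payments.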


## 6- Curious insights

#### What did we buy the most?
Making a **word cloud** with this data.

In [183]:
# Split every expense description into lowercase words and count them
words = df["expense"]
word_count = words.str.lower().str.findall(r'\b\w+\b').explode().value_counts().reset_index()
word_count.head(20)

Unnamed: 0,expense,count
0,to,39
1,didi,26
2,the,23
3,at,18
4,from,14
5,hotel,13
6,subway,12
7,shopping,12
8,lunch,12
9,and,10
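
The top of this count is dominated by function words such as "to", "the" and "at", which would crowd a word cloud. One possible fix is to drop a small stop-word list before counting. The list below is a hypothetical starting point, and the three descriptions are toy data.

```python
import pandas as pd

# Hypothetical stop-word list; extend it as needed
STOP_WORDS = {"to", "the", "at", "from", "and", "of", "in", "a"}

# Toy expense descriptions (not the real dataset)
expenses = pd.Series(["Didi to the hotel", "Lunch at the hotel", "Didi home"])

tokens = expenses.str.lower().str.findall(r"\b\w+\b").explode()

word_count_sample = (
    tokens[~tokens.isin(STOP_WORDS)]  # keep only the meaningful words
    .value_counts()
    .rename_axis("word")
    .reset_index(name="count")
)
print(word_count_sample)
```

Only the filter is new; the tokenisation is the same as in the cell above.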


#### Who was the biggest shopper?
Two interesting purchases stand out.

In [196]:
shopping = df[df["category"] == "Shopping"].reset_index().sort_values(by="price_usd_per_capita", ascending=False)
shopping

Unnamed: 0,index,date,city,expense,payment_source,payment_type,category,price,price_usd,price_usd_per_capita
0,84,May-19,Shanghai,Uniqlo haul,Paula,credit card,Shopping,1162.0,161.39,80.69
1,147,May-30,Beijing,Silk scarves,Paula,credit card,Shopping,1145.0,159.03,79.51
2,161,Jun-03,Beijing,Taobao and Meituan,Renata,apps,Shopping,840.47,116.73,58.37
3,140,May-29,Lhasa,Shopping at Norbulingka,Diva,apps,Shopping,308.0,42.78,14.26
4,2,May-12,Beijing,Necklace and earrings Carol + gift earring fro...,Carol,apps,Shopping,170.0,23.61,11.81
5,108,May-22,Beijing,Little shopping at museum,Diva,apps,Shopping,165.0,22.92,11.46
6,162,Jun-03,Beijing,Foot massage,Carol,apps,Shopping,156.0,21.67,10.83
7,99,May-20,Shanghai,Chocolates and Inácio’s toy car,Diva,apps,Shopping,128.0,17.78,8.89
8,127,May-28,Shigatse,Shopping at the Old Market,Carol,apps,Shopping,125.0,17.36,8.68
9,97,May-20,Shanghai,Little shopping,Diva,apps,Shopping,120.0,16.67,8.33
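
To put a number on the question in the heading, the shopping rows could be totalled by `payment_source`. The sketch below uses a few rows copied from the table above plus one non-shopping row. It assumes that the person who paid is the person who shopped, which the data does not guarantee.

```python
import pandas as pd

# A few rows from the table above, plus one Food row that should be ignored
sample = pd.DataFrame(
    {
        "payment_source": ["Paula", "Paula", "Renata", "Diva", "Diva"],
        "category": ["Shopping", "Shopping", "Shopping", "Shopping", "Food"],
        "price_usd_per_capita": [80.69, 79.51, 58.37, 14.26, 5.56],
    }
)

shopping_by_payer = (
    sample[sample["category"] == "Shopping"]
    .groupby("payment_source")["price_usd_per_capita"]
    .sum()
    .sort_values(ascending=False)
    .reset_index()
)
print(shopping_by_payer)
```

On the full dataset the same grouping would rank every payer, which could feed a simple bar chart.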


#### The priceless things
Things that cost zero yuan but created invaluable memories.