# Data cleaning
We have our raw dataset. Now we need to get it ready for analysis.

### Import libraries

In [245]:
import pandas as pd
import numpy as np
import json
import warnings

### Load raw dataset

In [246]:
raw = pd.read_csv("../00_raw/china_raw.csv", dtype={"city": str})
raw

Unnamed: 0,day,city,expense,price,payment_source,category
0,12-May,Pequim,Didi pro templo do céu (tava fechado),38.42,Carol,Transporte
1,12-May,Pequim,Didi pro shopping das Pérolas (Hungqiao Market),13.30,Carol,Transporte
2,12-May,Pequim,Colar e brincos Carol + brinco presente da Lara,170.00,Carol,Compras/Presentes
3,12-May,Pequim,Didi pra Qianmen,14.80,Carol,Transporte
4,12-May,Pequim,Almoço Qianmen,64.00,Carol,Alimentação
...,...,...,...,...,...,...
124,29-May,Lhasa,Didi para Bokar,,Diva,Transporte
125,29-May,Lhasa,Almoço,33.00,Diva,Alimentação
126,29-May,Lhasa,Compras Bokar Supermarket,105.00,Carol,Compras/Presentes
127,29-May,Lhasa,Massagem no aeroporto,30.00,Diva,Compras/Presentes


### Add missing rows
When checking the AI-generated dataset, we noticed two issues:
<br>
1- Some values are null
<br>
2- Others are simply absent (from the dataset and the original source)
<br>
There's also a third issue:
<br>
3- The price is sometimes for two people (Carol and Diva) and sometimes for three people (Carol, Diva and Renata). Since the goal is to find out how much one person spends travelling to and in China, the prices need to be converted to a per-person value.
<br><br>
The first issue was solved with **.apply()**:

In [247]:
# Issue 1: some expenses have an NA on the "price" column
no_price = raw.loc[pd.isna(raw["price"])]
no_price

Unnamed: 0,day,city,expense,price,payment_source,category
51,17-May,Pequim,Metrô,,Diva,Transporte
96,22-May,Pequim,Didi para Tiannanmen,,Diva,Transporte
97,22-May,Pequim,Almoço no museu,,Diva,Alimentação
98,22-May,Pequim,Comprinhas museu,,Diva,Compras/Presentes
102,24-May,Lhasa,Didi pro aeroporto,,Carol,Transporte
111,27-May,Shigatse,Jantar,,Carol,Alimentação
124,29-May,Lhasa,Didi para Bokar,,Diva,Transporte


In [248]:
# Solution: apply and lambda
raw.loc[raw["price"].isna(), "price"] = raw.loc[raw["price"].isna()].apply(
    lambda row: 42 if "Didi para Tiannanmen" in row["expense"] else pd.NA, axis=1)
raw.loc[raw["price"].isna(), "price"] = raw.loc[raw["price"].isna()].apply(
    lambda row: 8 if "Metrô" in row["expense"] else pd.NA, axis=1)
raw.loc[raw["price"].isna(), "price"] = raw.loc[raw["price"].isna()].apply(
    lambda row: 165 if "Comprinhas museu" in row["expense"] else pd.NA, axis=1)
raw.loc[raw["price"].isna(), "price"] = raw.loc[raw["price"].isna()].apply(
    lambda row: 26 if "Almoço no museu" in row["expense"] else pd.NA, axis=1)
raw.loc[raw["price"].isna(), "price"] = raw.loc[raw["price"].isna()].apply(
    lambda row: 107 if "Didi pro aeroporto" in row["expense"] else pd.NA, axis=1)
raw.loc[raw["price"].isna(), "price"] = raw.loc[raw["price"].isna()].apply(
    lambda row: 52 if "Jantar" in row["expense"] else pd.NA, axis=1)
raw.loc[raw["price"].isna(), "price"] = raw.loc[raw["price"].isna()].apply(
    lambda row: 14 if "Didi para Bokar" in row["expense"] else pd.NA, axis=1)
raw


Unnamed: 0,day,city,expense,price,payment_source,category
0,12-May,Pequim,Didi pro templo do céu (tava fechado),38.42,Carol,Transporte
1,12-May,Pequim,Didi pro shopping das Pérolas (Hungqiao Market),13.3,Carol,Transporte
2,12-May,Pequim,Colar e brincos Carol + brinco presente da Lara,170.0,Carol,Compras/Presentes
3,12-May,Pequim,Didi pra Qianmen,14.8,Carol,Transporte
4,12-May,Pequim,Almoço Qianmen,64.0,Carol,Alimentação
...,...,...,...,...,...,...
124,29-May,Lhasa,Didi para Bokar,14,Diva,Transporte
125,29-May,Lhasa,Almoço,33.0,Diva,Alimentação
126,29-May,Lhasa,Compras Bokar Supermarket,105.0,Carol,Compras/Presentes
127,29-May,Lhasa,Massagem no aeroporto,30.0,Diva,Compras/Presentes
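
The repeated `.apply()` calls above can also be collapsed into a single lookup. This is a minimal sketch, not the notebook's actual code: the `demo` frame and the `missing_prices` name are made up for illustration (the prices are copied from the cell above), and it assumes an exact match on the expense description rather than the substring match used above.

```python
import pandas as pd

# Toy stand-in for the raw data (hypothetical rows, same column names)
demo = pd.DataFrame({
    "expense": ["Metrô", "Almoço Qianmen", "Jantar", "Didi para Bokar"],
    "price": [None, 64.0, None, None],
}).astype({"price": "float64"})

# Prices recovered by hand, keyed by the expense description
missing_prices = {"Metrô": 8, "Jantar": 52, "Didi para Bokar": 14}

# Look up each description; rows without a match get NaN from .map()
lookup = demo["expense"].map(missing_prices)

# Keep the existing prices and fill only the gaps
demo["price"] = demo["price"].fillna(lookup)
print(demo)
```

Because `.fillna()` only touches missing values, prices that were already present stay as they are, and the column keeps its float dtype.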


I solved the second issue (relevant expenses that weren't in the original notes) by creating those rows myself, including major expenses that weren't kept in the original file, such as airfares, train tickets, hotels and tour agency packages. These were paid for in advance, but I dated each one by the day it was used so the timeline stays consistent.
<br><br>
I saved those rows as a JSON file called **missing_data.json**, so now I can add them to the raw dataset and get a more complete dataframe:

In [249]:
# Open the json with the data that was still missing:
missing_data = pd.read_json("../00_raw/missing_data.json", dtype={"city": str})
missing_data.head()

Unnamed: 0,day,city,expense,price,payment_source,category
0,13-May,Datong,Da Tong Weidu International Hotel,292.0,Diva,Hotel
1,17-May,Suzhou,HanTin Premium Hotel,656.0,Renata,Hotel
2,18-May,Xangai,Homeinn Hotel,970.0,Renata,Hotel
3,31-May,Guangzhou,SunYat Sen University Kaifeng Hotel,1382.0,Renata,Hotel
4,13-May,Datong,trem de Pequim para Datong,378.0,Renata,Transporte


In [250]:
# Now let's concatenate both datasets
df_concat = pd.concat([raw, missing_data])
df_concat

Unnamed: 0,day,city,expense,price,payment_source,category
0,12-May,Pequim,Didi pro templo do céu (tava fechado),38.42,Carol,Transporte
1,12-May,Pequim,Didi pro shopping das Pérolas (Hungqiao Market),13.3,Carol,Transporte
2,12-May,Pequim,Colar e brincos Carol + brinco presente da Lara,170.0,Carol,Compras/Presentes
3,12-May,Pequim,Didi pra Qianmen,14.8,Carol,Transporte
4,12-May,Pequim,Almoço Qianmen,64.0,Carol,Alimentação
...,...,...,...,...,...,...
31,02-Jun,Guangzhou,Didi,18.79,Carol,Transporte
32,02-Jun,Pequim,Didi do aeroporto,110.0,Renata,Transporte
33,03-Jun,Pequim,Massagem nos pés,156.0,Carol,Compras/presentes
34,03-Jun,Pequim,Didi para o aeroporto,90.0,Renata,Transporte


In [251]:
# And reorder the rows by day
df = df_concat.sort_values("day")
df

Unnamed: 0,day,city,expense,price,payment_source,category
27,01-Jun,Guangzhou,Jantar,120.0,Diva,Alimentação
26,01-Jun,Guangzhou,Cruzeiro Rio das Pérolas,369.0,Renata,Ingressos
23,01-Jun,Guangzhou,Almoço,90.0,Diva,Alimentação
22,01-Jun,Guangzhou,Carro do aeroporto para o hotel,130.0,Renata,Transporte
32,02-Jun,Pequim,Didi do aeroporto,110.0,Renata,Transporte
...,...,...,...,...,...,...
17,30-May,Pequim,Almoço no bairro da Renata,66.0,Diva,Alimentação
20,30-May,Pequim,Supermercado,19.39,Carol,Alimentação
16,30-May,Pequim,Supermercado,21.3,Carol,Compras/presentes
10,31-May,Guangzhou,Avião ida e volta de Pequim para Guangzhou,6150.0,Renata,Transporte


As for the third issue, I'm going to solve it using this logic:
- In the cities only my mother and I visited (Beijing, Datong, Shanghai and Shigatse), I'll divide the price of each expense by 2.
- In the other places (Suzhou, Lhasa and Guangzhou), I'll divide it by 3.
<br>
But first, let's convert the values to USD for a broader sense of the costs.

In [252]:
# Create a column converting the prices from yuan (or RMB, renminbi) to USD.
# We'll use an exchange rate of 7.20 RMB per dollar, roughly the rate during May 2025.
df["price_usd"] = df["price"]/7.20
df

Unnamed: 0,day,city,expense,price,payment_source,category,price_usd
27,01-Jun,Guangzhou,Jantar,120.0,Diva,Alimentação,16.666667
26,01-Jun,Guangzhou,Cruzeiro Rio das Pérolas,369.0,Renata,Ingressos,51.25
23,01-Jun,Guangzhou,Almoço,90.0,Diva,Alimentação,12.5
22,01-Jun,Guangzhou,Carro do aeroporto para o hotel,130.0,Renata,Transporte,18.055556
32,02-Jun,Pequim,Didi do aeroporto,110.0,Renata,Transporte,15.277778
...,...,...,...,...,...,...,...
17,30-May,Pequim,Almoço no bairro da Renata,66.0,Diva,Alimentação,9.166667
20,30-May,Pequim,Supermercado,19.39,Carol,Alimentação,2.693056
16,30-May,Pequim,Supermercado,21.3,Carol,Compras/presentes,2.958333
10,31-May,Guangzhou,Avião ida e volta de Pequim para Guangzhou,6150.0,Renata,Transporte,854.166667


In [253]:
# Create a dataset only with the expenses in Beijing, Datong, Shanghai and Shigatse and divide the price by TWO.
# .copy() makes trip_for_two an independent dataframe, so adding a column no longer raises SettingWithCopyWarning.
trip_for_two = df[df["city"].isin(["Pequim", "Datong", "Xangai", "Shigatse"])].copy()
trip_for_two["price_usd"] = trip_for_two["price_usd"].astype(float)
trip_for_two["price_usd_per_capita"] = trip_for_two["price_usd"] / 2
trip_for_two

Unnamed: 0,day,city,expense,price,payment_source,category,price_usd,price_usd_per_capita
32,02-Jun,Pequim,Didi do aeroporto,110.0,Renata,Transporte,15.277778,7.638889
35,03-Jun,Pequim,Taobao e Meituan,840.47,Renata,Compras/presentes,116.731944,58.365972
33,03-Jun,Pequim,Massagem nos pés,156.0,Carol,Compras/presentes,21.666667,10.833333
34,03-Jun,Pequim,Didi para o aeroporto,90.0,Renata,Transporte,12.5,6.25
14,11-May,Pequim,Táxi do aeroporto para a casa da Renata,87.0,Carol,Transporte,12.083333,6.041667
...,...,...,...,...,...,...,...,...
19,30-May,Pequim,Brincos e colar de pérola,70.0,Diva,Compras/presentes,9.722222,4.861111
18,30-May,Pequim,Lenços de seda,1145.0,Paula,Compras/presentes,159.027778,79.513889
17,30-May,Pequim,Almoço no bairro da Renata,66.0,Diva,Alimentação,9.166667,4.583333
20,30-May,Pequim,Supermercado,19.39,Carol,Alimentação,2.693056,1.346528


In [254]:
# Create a dataset only with the expenses in Suzhou, Lhasa and Guangzhou and divide the price by THREE.
# .copy() makes trip_for_three an independent dataframe, so adding a column no longer raises SettingWithCopyWarning.
trip_for_three = df[df["city"].isin(["Suzhou", "Lhasa", "Guangzhou"])].copy()
trip_for_three["price_usd"] = trip_for_three["price_usd"].astype(float)
trip_for_three["price_usd_per_capita"] = trip_for_three["price_usd"] / 3
trip_for_three

Unnamed: 0,day,city,expense,price,payment_source,category,price_usd,price_usd_per_capita
27,01-Jun,Guangzhou,Jantar,120.0,Diva,Alimentação,16.666667,5.555556
26,01-Jun,Guangzhou,Cruzeiro Rio das Pérolas,369.0,Renata,Ingressos,51.25,17.083333
23,01-Jun,Guangzhou,Almoço,90.0,Diva,Alimentação,12.5,4.166667
22,01-Jun,Guangzhou,Carro do aeroporto para o hotel,130.0,Renata,Transporte,18.055556,6.018519
31,02-Jun,Guangzhou,Didi,18.79,Carol,Transporte,2.609722,0.869907
30,02-Jun,Guangzhou,Comprinhas Miniso,15.0,Carol,Compras/presentes,2.083333,0.694444
29,02-Jun,Guangzhou,Didi,17.1,Carol,Transporte,2.375,0.791667
28,02-Jun,Guangzhou,Museu,10.0,Carol,Ingressos,1.388889,0.462963
25,02-Jun,Guangzhou,Café latte no M Stand,45.0,Diva,Alimentação,6.25,2.083333
24,02-Jun,Guangzhou,Livraria,11.5,Carol,Compras/presentes,1.597222,0.532407


In [255]:
# Concatenate the rows from the two datasets
df = pd.concat([trip_for_two, trip_for_three]).sort_values("day")
df

Unnamed: 0,day,city,expense,price,payment_source,category,price_usd,price_usd_per_capita
27,01-Jun,Guangzhou,Jantar,120.0,Diva,Alimentação,16.666667,5.555556
23,01-Jun,Guangzhou,Almoço,90.0,Diva,Alimentação,12.5,4.166667
22,01-Jun,Guangzhou,Carro do aeroporto para o hotel,130.0,Renata,Transporte,18.055556,6.018519
26,01-Jun,Guangzhou,Cruzeiro Rio das Pérolas,369.0,Renata,Ingressos,51.25,17.083333
31,02-Jun,Guangzhou,Didi,18.79,Carol,Transporte,2.609722,0.869907
...,...,...,...,...,...,...,...,...
18,30-May,Pequim,Lenços de seda,1145.0,Paula,Compras/presentes,159.027778,79.513889
19,30-May,Pequim,Brincos e colar de pérola,70.0,Diva,Compras/presentes,9.722222,4.861111
21,30-May,Pequim,Metrô,14.0,Paula,Transporte,1.944444,0.972222
10,31-May,Guangzhou,Avião ida e volta de Pequim para Guangzhou,6150.0,Renata,Transporte,854.166667,284.722222
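
The split-and-concatenate approach keeps only rows whose city appears in one of the two lists; a row from any other city is left out without notice. A possible alternative, sketched here on made-up rows, maps each city to the number of travellers and divides in one pass. The `demo` frame, the `party_size` name and the "Mutianyu" row are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical rows; "Mutianyu" is deliberately absent from the mapping below
demo = pd.DataFrame({
    "city": ["Pequim", "Lhasa", "Mutianyu"],
    "price_usd": [12.0, 9.0, 30.0],
})

# Number of travellers sharing the costs in each city (from the rule above)
party_size = {
    "Pequim": 2, "Datong": 2, "Xangai": 2, "Shigatse": 2,
    "Suzhou": 3, "Lhasa": 3, "Guangzhou": 3,
}

demo["party_size"] = demo["city"].map(party_size)
demo["price_usd_per_capita"] = demo["price_usd"] / demo["party_size"]

# Cities missing from the mapping show up as NaN instead of being dropped
unmapped = demo.loc[demo["party_size"].isna(), "city"].unique()
print(demo)
print("Cities without a rule:", list(unmapped))
```

With this version every row survives, and the list of unmapped cities shows which ones still need a rule.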


Now, for the final part of getting the data tidy, I'm going to specify the payment type.
<br><br>
This is important because China, unlike the US and Brazil, has so-called superapps. Alipay is used to call a Didi, take the metro, order at restaurants and pay for things in stores. WeChat is more common in remote locations, where a business or vendor might not accept Alipay.
<br><br>
I can do this because I know that all expenses that have "Paula" as payment_source were paid for using our credit cards connected to Paula's bank account in the US.
<br><br>
And all the other names mean that the payment was made using Alipay, WeChat or other app purchases, like Taobao and Meituan. The money comes from Renata's bank account in China and stays in a digital wallet inside the apps.
<br><br>
I want to see **how much we've spent in total from each payment type**, and the *average spent in each purchase* from these two groups.

In [256]:
# Create a new column "payment_type" and fill it according to the values in column "payment_source"
df["payment_type"] = df["payment_source"].apply(lambda x: "credit card" if x == "Paula" else "apps")
df

Unnamed: 0,day,city,expense,price,payment_source,category,price_usd,price_usd_per_capita,payment_type
27,01-Jun,Guangzhou,Jantar,120.0,Diva,Alimentação,16.666667,5.555556,apps
23,01-Jun,Guangzhou,Almoço,90.0,Diva,Alimentação,12.5,4.166667,apps
22,01-Jun,Guangzhou,Carro do aeroporto para o hotel,130.0,Renata,Transporte,18.055556,6.018519,apps
26,01-Jun,Guangzhou,Cruzeiro Rio das Pérolas,369.0,Renata,Ingressos,51.25,17.083333,apps
31,02-Jun,Guangzhou,Didi,18.79,Carol,Transporte,2.609722,0.869907,apps
...,...,...,...,...,...,...,...,...,...
18,30-May,Pequim,Lenços de seda,1145.0,Paula,Compras/presentes,159.027778,79.513889,credit card
19,30-May,Pequim,Brincos e colar de pérola,70.0,Diva,Compras/presentes,9.722222,4.861111,apps
21,30-May,Pequim,Metrô,14.0,Paula,Transporte,1.944444,0.972222,credit card
10,31-May,Guangzhou,Avião ida e volta de Pequim para Guangzhou,6150.0,Renata,Transporte,854.166667,284.722222,apps
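
With `payment_type` in place, the total and the average purchase per payment type can be obtained with a `groupby`. This is only a sketch on invented numbers (the `demo` frame and the `summary` name are not part of the notebook), to show the shape of the result:

```python
import pandas as pd

# Hypothetical purchases, using the same column names as the dataframe above
demo = pd.DataFrame({
    "payment_type": ["apps", "apps", "credit card", "credit card"],
    "price_usd": [10.0, 20.0, 100.0, 50.0],
})

# One row per payment type, with the total spent and the average purchase
summary = demo.groupby("payment_type")["price_usd"].agg(total="sum", average="mean")
print(summary)
```

The same call on `price_usd_per_capita` would give the per-person figures instead.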


Just some final steps to get the dataframe as tidy as possible:

In [257]:
# The "day" strings (e.g. "12-May") are sorted alphabetically, so June ends up before May.
# Let's fix it: parse them as dates, sort chronologically, then store them as "Mon-DD" strings.
df["date"] = pd.to_datetime(df["day"], format="%d-%b")
df = df.sort_values("date", ascending=True)
df["date"] = df["date"].dt.strftime("%b-%d")
df = df.drop(columns="day")
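
To see why the string sort goes wrong, here is a small sketch with three made-up dates. Text is compared character by character, so "01-Jun" sorts ahead of "12-May"; parsing the values first gives the chronological order. Since the format has no year, pandas fills in 1900, which is harmless when the parsed dates are only used for ordering.

```python
import pandas as pd

# Three sample values in the same "DD-Mon" format as the "day" column
days = pd.Series(["12-May", "01-Jun", "30-May"])

# Sorting the raw strings compares characters, so "01-Jun" comes first
print(days.sort_values().tolist())

# Parse to datetimes, sort those, and use the resulting order on the strings
parsed = pd.to_datetime(days, format="%d-%b")
chronological = days.loc[parsed.sort_values().index]
print(chronological.tolist())
```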

In [258]:
# Make sure the "price", "price_usd" and "price_usd_per_capita" columns are floats, rounded to 2 decimals
df["price"] = df["price"].astype(float).round(2)
df["price_usd"] = df["price_usd"].astype(float).round(2)
df["price_usd_per_capita"] = df["price_usd_per_capita"].astype(float).round(2)

# Checking if it worked
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 164 entries, 14 to 34
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   city                  164 non-null    object 
 1   expense               164 non-null    object 
 2   price                 164 non-null    float64
 3   payment_source        164 non-null    object 
 4   category              164 non-null    object 
 5   price_usd             164 non-null    float64
 6   price_usd_per_capita  164 non-null    float64
 7   payment_type          164 non-null    object 
 8   date                  164 non-null    object 
dtypes: float64(3), object(6)
memory usage: 12.8+ KB


In [259]:
df.head()

Unnamed: 0,city,expense,price,payment_source,category,price_usd,price_usd_per_capita,payment_type,date
14,Pequim,Táxi do aeroporto para a casa da Renata,87.0,Carol,Transporte,12.08,6.04,apps,May-11
10,Pequim,Didi pro restaurante de dumpling fritos,49.93,Carol,Transporte,6.93,3.47,apps,May-12
2,Pequim,Colar e brincos Carol + brinco presente da Lara,170.0,Carol,Compras/Presentes,23.61,11.81,apps,May-12
8,Pequim,3 Baralhos,90.0,Carol,Compras/Presentes,12.5,6.25,apps,May-12
6,Pequim,Metrô,6.0,Carol,Transporte,0.83,0.42,apps,May-12


Okay, now my dataframe is almost ready for the analyses I wanna do.
<br>
But it seems clear that it will be much better 100% in English.
<br>
So I'm going to save it as a json, to see if DeepSeek would translate it for me.

In [260]:
df.to_json("../04_tidy_data/china_df.json", orient="records", force_ascii=False, indent=4)

Should I have translated everything from the get go? **Maybe**.
<br>
Can we go back in time? **No**.
<br>
Am I learning something new every day and becoming a Markdown genius? **Abso-*freaking*-lutely**!
<br>
Now let's get the translated json and check if the translation worked.

In [261]:
# Opening the new json into a dataframe
translated = pd.read_json("../04_tidy_data/china_df_t.json", dtype={"city": str})
translated

Unnamed: 0,date,city,expense,payment_source,payment_type,category,price,price_usd,price_usd_per_capita
0,May-11,Beijing,Taxi from the airport to Renata's house,Carol,apps,Transport,87.00,12.08,6.04
1,May-12,Beijing,Didi to the fried dumpling restaurant,Carol,apps,Transport,49.93,6.93,3.47
2,May-12,Beijing,Necklace and earrings Carol + gift earring fro...,Carol,apps,Shopping/Gifts,170.00,23.61,11.81
3,May-12,Beijing,3 Decks of cards,Carol,apps,Shopping/Gifts,90.00,12.50,6.25
4,May-12,Beijing,Subway,Carol,apps,Transport,6.00,0.83,0.42
...,...,...,...,...,...,...,...,...,...
159,Jun-02,Guangzhou,Shopping at Miniso,Carol,apps,Shopping/Gifts,15.00,2.08,0.69
160,Jun-02,Guangzhou,Didi,Carol,apps,Transport,18.79,2.61,0.87
161,Jun-03,Beijing,Taobao and Meituan,Renata,apps,Shopping/Gifts,840.47,116.73,58.37
162,Jun-03,Beijing,Foot massage,Carol,apps,Shopping/Gifts,156.00,21.67,10.83


In [262]:
# Comparing the shape of both dataframes
df.shape

(164, 9)

In [263]:
translated.shape

(164, 9)

We have the same number of rows and columns, great.
<br>
Now let's see if the content has been translated by looking at columns with few possible values:

In [264]:
# Cities 
translated.city.unique()

array(['Beijing', 'Datong', 'Suzhou', 'Shanghai', 'Mutianyu', 'Lhasa',
       'China', 'Shigatse', 'Guangzhou'], dtype=object)

In [265]:
# Payment type
translated.payment_type.unique()

array(['apps', 'credit card'], dtype=object)

In [266]:
# Categories 
translated.category.unique()

array(['Transport', 'Shopping/Gifts', 'Food', 'Agency', 'Hotel',
       'Tickets'], dtype=object)
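
Matching shapes and sensible unique values don't guarantee that the numbers survived the translation. One extra check, sketched below on two tiny made-up frames, compares a numeric column before and after. It assumes the translated file keeps the rows in the same order; `before` and `after` stand in for `df` and `translated`.

```python
import pandas as pd

# Hypothetical before/after frames; only the text column was translated
before = pd.DataFrame({"expense": ["Metrô", "Almoço"], "price": [6.0, 33.0]}, index=[6, 125])
after = pd.DataFrame({"expense": ["Subway", "Lunch"], "price": [6.0, 33.0]})

# The translated file has a fresh 0..n-1 index, so compare by position
same_prices = before["price"].reset_index(drop=True).equals(after["price"])
print(same_prices)
```

If this prints `False`, at least one value changed or the rows were reordered.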

In [267]:
# I didn't like how it translated some category names, so I'll replace them
translated["category"] = translated["category"].str.replace("Transport", "Transportation")
translated["category"] = translated["category"].str.replace("Shopping/Gifts", "Shopping")
translated["category"] = translated["category"].str.replace("Agency", "Tour Agency")
translated.rename(columns={ "date" : "month_day"}, inplace=True)
translated

Unnamed: 0,month_day,city,expense,payment_source,payment_type,category,price,price_usd,price_usd_per_capita
0,May-11,Beijing,Taxi from the airport to Renata's house,Carol,apps,Transportation,87.00,12.08,6.04
1,May-12,Beijing,Didi to the fried dumpling restaurant,Carol,apps,Transportation,49.93,6.93,3.47
2,May-12,Beijing,Necklace and earrings Carol + gift earring fro...,Carol,apps,Shopping,170.00,23.61,11.81
3,May-12,Beijing,3 Decks of cards,Carol,apps,Shopping,90.00,12.50,6.25
4,May-12,Beijing,Subway,Carol,apps,Transportation,6.00,0.83,0.42
...,...,...,...,...,...,...,...,...,...
159,Jun-02,Guangzhou,Shopping at Miniso,Carol,apps,Shopping,15.00,2.08,0.69
160,Jun-02,Guangzhou,Didi,Carol,apps,Transportation,18.79,2.61,0.87
161,Jun-03,Beijing,Taobao and Meituan,Renata,apps,Shopping,840.47,116.73,58.37
162,Jun-03,Beijing,Foot massage,Carol,apps,Shopping,156.00,21.67,10.83
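
One thing to keep in mind: `.str.replace()` works on substrings, so running the cell above a second time would turn "Transportation" into "Transportationation" and "Tour Agency" into "Tour Tour Agency". A sketch of an alternative that is safe to re-run, using `Series.replace` with a dictionary so that only whole values are swapped (the `categories` series and the `new_names` dictionary are illustrative):

```python
import pandas as pd

# Sample category values as they came back from the translation
categories = pd.Series(["Transport", "Shopping/Gifts", "Agency", "Food"])

new_names = {
    "Transport": "Transportation",
    "Shopping/Gifts": "Shopping",
    "Agency": "Tour Agency",
}

# Series.replace with a dict swaps whole values only
once = categories.replace(new_names)
twice = once.replace(new_names)
print(once.tolist())
print(twice.tolist())  # unchanged: the new names are not keys in new_names
```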


In [268]:
# Get a full date to use on charts afterwards

# Work on a copy, so the "translated" dataframe itself is left unchanged
df_cleaned = translated.copy()

# Add a column where every row is the year 2025 (as a string)
df_cleaned["year"] = "2025"

# Fetch from the "month_day" column only the first three characters (the name of the month)
# Then replace the name with the number of the month (the data only covers May and June)
df_cleaned["month1"] = df_cleaned["month_day"].str.slice(start=0, stop=3)
df_cleaned["month2"] = df_cleaned["month1"].apply(lambda x: "05" if x == "May" else "06")

# Fetch from the "month_day" column only the last two characters (the day of the month)
df_cleaned["day"] = df_cleaned["month_day"].str.slice(start=-2)

# Create a new column called "date" and concatenate the values of year, month and day, with a hyphen between them
df_cleaned["date"] = df_cleaned["year"] + "-" + df_cleaned["month2"] + "-" + df_cleaned["day"]

df_cleaned.head(3)

Unnamed: 0,month_day,city,expense,payment_source,payment_type,category,price,price_usd,price_usd_per_capita,year,month1,month2,day,date
0,May-11,Beijing,Taxi from the airport to Renata's house,Carol,apps,Transportation,87.0,12.08,6.04,2025,May,5,11,2025-05-11
1,May-12,Beijing,Didi to the fried dumpling restaurant,Carol,apps,Transportation,49.93,6.93,3.47,2025,May,5,12,2025-05-12
2,May-12,Beijing,Necklace and earrings Carol + gift earring fro...,Carol,apps,Shopping,170.0,23.61,11.81,2025,May,5,12,2025-05-12


In [269]:
# Remove columns we don't need anymore

df_cleaned = df_cleaned[["date",
                         "city",
                         "expense",
                         "payment_source",
                         "payment_type",
                         "category",
                         "price",
                         "price_usd",
                         "price_usd_per_capita"
                        ]]

df_cleaned.head(3)

Unnamed: 0,date,city,expense,payment_source,payment_type,category,price,price_usd,price_usd_per_capita
0,2025-05-11,Beijing,Taxi from the airport to Renata's house,Carol,apps,Transportation,87.0,12.08,6.04
1,2025-05-12,Beijing,Didi to the fried dumpling restaurant,Carol,apps,Transportation,49.93,6.93,3.47
2,2025-05-12,Beijing,Necklace and earrings Carol + gift earring fro...,Carol,apps,Shopping,170.0,23.61,11.81
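
The same full date can probably be built without the helper columns, by prepending the year and letting pandas parse the month name. A minimal sketch on three sample values, assuming English month abbreviations and that every row belongs to 2025:

```python
import pandas as pd

# Sample values in the same "Mon-DD" format as the "month_day" column
month_day = pd.Series(["May-11", "May-12", "Jun-03"])

# Prepend the year, parse, then format as ISO "YYYY-MM-DD" strings
full_date = pd.to_datetime("2025-" + month_day, format="%Y-%b-%d").dt.strftime("%Y-%m-%d")
print(full_date.tolist())
```

This also keeps working if the data ever includes months other than May and June.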


Okay, I'm satisfied. I'll save this in a .csv file and in a .json file and move on to the analyses.

In [270]:
df_cleaned.to_csv("../04_tidy_data/china_cleaned.csv", index=False)

In [271]:
df_cleaned.to_json("../04_tidy_data/china_cleaned.json", orient="records", force_ascii=False, indent=4)