# Data cleaning
We have the raw dataset; now we need to prepare it for analysis.

### Import libraries

In [38]:
import pandas as pd
import json
import warnings

### Load raw dataset

In [16]:
# Read the raw expenses; the dtype key must match the actual column name ("city")
raw = pd.read_csv("../00_raw/china_raw.csv", dtype={"city": str})
raw

Unnamed: 0,day,city,expense,price,payment_source,category
0,12-May,Pequim,Didi pro templo do céu (tava fechado),38.42,Carol,Transporte
1,12-May,Pequim,Didi pro shopping das Pérolas (Hungqiao Market),13.30,Carol,Transporte
2,12-May,Pequim,Colar e brincos Carol + brinco presente da Lara,170.00,Carol,Compras/Presentes
3,12-May,Pequim,Didi pra Qianmen,14.80,Carol,Transporte
4,12-May,Pequim,Almoço Qianmen,64.00,Carol,Alimentação
...,...,...,...,...,...,...
125,29-May,Lhasa,Didi para Bokar,,Diva,Transporte
126,29-May,Lhasa,Almoço,33.00,Diva,Alimentação
127,29-May,Lhasa,Compras Bokar Supermarket,105.00,Carol,Compras/Presentes
128,29-May,Lhasa,Massagem no aeroporto,30.00,Diva,Compras/Presentes


### Add missing rows
When checking the AI-generated dataset, we noticed two issues:
<br>
1- Some values are null
<br>
2- Some expenses are missing entirely (from both the dataset and the original source)
<br><br>
I solved the second issue by adding those rows myself, along with other major expenses that weren't recorded in the original file, such as airfares, train tickets, hotels and tour agency packages. These were paid for in advance, but I dated them by the day they were used so the timeline reads better.
<br><br>
I saved those rows as a JSON file.
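
As a rough illustration of that step, the sketch below builds a couple of hand-written rows and round-trips them through JSON. The two rows copy values shown in the output further down; the variable names and the choice of `orient="records"` are assumptions, not necessarily how the original file was produced.

```python
import io
import pandas as pd

# Hand-written rows (illustrative; values copied from the notebook output)
manual_rows = [
    {"day": "13-May", "city": "Datong", "expense": "Da Tong Weidu International Hotel",
     "price": 292.0, "payment_source": "Diva", "category": "Hotel"},
    {"day": "13-May", "city": "Datong", "expense": "trem de Pequim para Datong",
     "price": 378.0, "payment_source": "Renata", "category": "Transporte"},
]
manual_df = pd.DataFrame(manual_rows)

# Serialize as a list of records; force_ascii=False keeps accented characters readable
json_text = manual_df.to_json(orient="records", force_ascii=False)

# Read it back the same way the notebook reads the real file (here from memory, not disk)
restored = pd.read_json(io.StringIO(json_text), orient="records", dtype={"city": str})
```

Keeping the same six columns as the raw CSV is what lets the two tables be stacked later without producing extra, mostly empty columns.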

In [39]:
# Open the JSON file with the data that was still missing:
missing_data = pd.read_json("../00_raw/missing_data.json", dtype={"city": str})
missing_data.head(10)

Unnamed: 0,day,city,expense,price,payment_source,category
0,13-May,Datong,Da Tong Weidu International Hotel,292.0,Diva,Hotel
1,17-May,Suzhou,HanTin Premium Hotel,656.0,Renata,Hotel
2,18-May,Xangai,Homeinn Hotel,970.0,Renata,Hotel
3,18-May,Guangzhou,SunYat Sen University Kaifeng Hotel,1382.0,Renata,Hotel
4,13-May,Datong,trem de Pequim para Datong,378.0,Renata,Transporte
5,15-May,Datong,trem de Datong para Pequim,366.0,Renata,Transporte
6,17-May,Suzhou,trem de Pequim para Suzhou,1224.0,Renata,Transporte
7,18-May,Suzhou,trem de Suzhou para Xangai,168.0,Renata,Transporte
8,20-May,Xangai,Avião de Xangai para Pequim,1460.0,Renata,Transporte
9,24-May,Lhasa,Avião ida e volta de Pequim para Lhasa,9169.11,Paula,Transporte


In [107]:
# Now let's concatenate both datasets
df_concat = pd.concat([raw, missing_data])
df_concat

Unnamed: 0,day,city,expense,price,payment_source,category
0,12-May,Pequim,Didi pro templo do céu (tava fechado),38.42,Carol,Transporte
1,12-May,Pequim,Didi pro shopping das Pérolas (Hungqiao Market),13.30,Carol,Transporte
2,12-May,Pequim,Colar e brincos Carol + brinco presente da Lara,170.00,Carol,Compras/Presentes
3,12-May,Pequim,Didi pra Qianmen,14.80,Carol,Transporte
4,12-May,Pequim,Almoço Qianmen,64.00,Carol,Alimentação
...,...,...,...,...,...,...
31,02-Jun,Guangzhou,Didi,18.79,Carol,Transporte
32,02-Jun,Pequim,Didi do aeroporto,110.00,Renata,Transporte
33,03-Jun,Pequim,Massagem nos pés,156.00,Carol,Compras/presentes
34,03-Jun,Pequim,Didi para o aeroporto,90.00,Renata,Transporte
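
Note that the concatenated table keeps each source's original row labels, so labels such as 31 to 34 can appear twice. The toy example below (made-up frames, not the real data) shows the effect and one possible way to renumber the rows with `ignore_index=True`; the notebook itself keeps the original labels.

```python
import pandas as pd

# Two tiny stand-in tables, each with its own default 0-based index
a = pd.DataFrame({"expense": ["Almoço", "Metrô"]})
b = pd.DataFrame({"expense": ["Hotel"]})

# Plain concat keeps the original labels, so 0 appears twice
stacked = pd.concat([a, b])

# ignore_index=True discards the old labels and numbers rows 0..n-1
renumbered = pd.concat([a, b], ignore_index=True)
```

Duplicate labels are harmless for sorting and grouping, but they make label-based lookups such as `stacked.loc[0]` return more than one row.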


In [108]:
# And reorder the rows by day
df_concat = df_concat.sort_values("day")
df_concat

Unnamed: 0,day,city,expense,price,payment_source,category
27,01-Jun,Guangzhou,Jantar,120.0,Diva,Alimentação
26,01-Jun,Guangzhou,Cruzeiro Rio das Pérolas,369.0,Renata,Ingressos
23,01-Jun,Guangzhou,Almoço,90.0,Diva,Alimentação
22,01-Jun,Guangzhou,Carro do aeroporto para o hotel,130.0,Renata,Transporte
32,02-Jun,Pequim,Didi do aeroporto,110.0,Renata,Transporte
...,...,...,...,...,...,...
18,30-May,Pequim,Lenços de seda,1145.0,Paula,Compras/presentes
17,30-May,Pequim,Almoço no bairro da Renata,1234.0,Diva,Alimentação
21,30-May,Pequim,Metrô,14.0,Paula,Transporte
16,30-May,Pequim,Supermercado,21.3,Carol,Compras/presentes


In [109]:
# Sorting the "day" strings alphabetically puts "01-Jun" before "12-May",
# so the order breaks once two months are involved. Let's fix it by parsing the dates:
df_concat["day_cleaned"] = pd.to_datetime(df_concat["day"], format="%d-%b")  # year defaults to 1900
df_concat = df_concat.sort_values("day_cleaned", ascending=True)
# Format back to text for display; the rows are already in chronological order
df_concat["day_cleaned"] = df_concat["day_cleaned"].dt.strftime("%b-%d")
df_concat = df_concat.drop(columns="day")
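
To make the sorting problem concrete, here is a small standalone sketch with four made-up dates. It assumes an English locale for the `%b` month abbreviations. It also hints at a caveat: once the dates are formatted back to strings, any later sort on that column would be alphabetical again.

```python
import pandas as pd

days = pd.Series(["30-May", "01-Jun", "12-May", "03-Jun"])

# Text sort compares characters left to right, so June days come first
as_text = days.sort_values().tolist()

# Parsing gives real timestamps (the year defaults to 1900 because none is given)
parsed = pd.to_datetime(days, format="%d-%b")

# Reorder the original strings by the parsed dates
chronological = days.loc[parsed.sort_values().index].tolist()
```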

In [114]:
df = df_concat[["day_cleaned",
         "city",
         "expense",
         "price",
         "payment_source",
         "category"]]
df

Unnamed: 0,day_cleaned,city,expense,price,payment_source,category
14,May-11,Pequim,Táxi do aeroporto para a casa da Renata,87.00,Carol,Transporte
10,May-12,Pequim,Didi pro restaurante de dumpling fritos,49.93,Carol,Transporte
2,May-12,Pequim,Colar e brincos Carol + brinco presente da Lara,170.00,Carol,Compras/Presentes
8,May-12,Pequim,3 Baralhos,90.00,Carol,Compras/Presentes
7,May-12,Pequim,Mountain Coffee,25.00,Carol,Alimentação
...,...,...,...,...,...,...
31,Jun-02,Guangzhou,Didi,18.79,Carol,Transporte
32,Jun-02,Pequim,Didi do aeroporto,110.00,Renata,Transporte
33,Jun-03,Pequim,Massagem nos pés,156.00,Carol,Compras/presentes
35,Jun-03,Pequim,Taobao e Meituan,840.47,Renata,Compras/presentes


The first issue (null prices) will go unsolved in this project due to deadline constraints:

In [148]:
no_price = df.loc[pd.isna(df["price"])]
no_price

Unnamed: 0,day_cleaned,city,expense,price,payment_source,category,payment_type
51,May-17,Pequim,Metrô,,Diva,Transporte,balance account
99,May-22,Pequim,Comprinhas museu,,Diva,Compras/Presentes,balance account
97,May-22,Pequim,Didi para Tiannanmen,,Diva,Transporte,balance account
98,May-22,Pequim,Almoço no museu,,Diva,Alimentação,balance account
103,May-24,Lhasa,Didi pro aeroporto,,Carol,Transporte,balance account
112,May-27,Shigatse,Jantar,,Carol,Alimentação,balance account
125,May-29,Lhasa,Didi para Bokar,,Diva,Transporte,balance account
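
Even without fixing the nulls, it can help to know how much of the data they affect. The sketch below counts missing prices per category on a tiny illustrative sample; the sample rows are stand-ins, not the full dataset.

```python
import pandas as pd

# Illustrative sample: three rows without a price, one with
sample = pd.DataFrame({
    "expense": ["Metrô", "Almoço", "Didi para Bokar", "Jantar"],
    "price": [None, 33.0, None, None],
    "category": ["Transporte", "Alimentação", "Transporte", "Alimentação"],
})

# True where the price is missing
is_missing = sample["price"].isna()

# Number of missing prices in each category
missing_by_category = is_missing.groupby(sample["category"]).sum()

# Fraction of all rows that lack a price
share_missing = is_missing.mean()
```

Keep in mind that pandas skips nulls in `sum()` and `mean()` by default, so totals computed later will understate spending in the affected categories.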


### Comparing payment types
All expenses with "Paula" as payment_source were paid with our credit cards, which are linked to Paula's bank account in the US.
<br><br>
Every other name means the payment was made through Alipay, the Chinese super app we used to call a Didi, take the metro, order at restaurants and pay in stores, or through WeChat whenever a store or vendor wouldn't accept Alipay. The money comes from Renata's bank account in China and sits in a digital wallet inside the apps.
<br><br>
I want to see how much we spent with each payment type, and the average amount per purchase in each of these two groups.

In [152]:
# Create a new column "payment_type" and fill it according to the values in "payment_source".
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
payment_types = {"Carol": "Alipay", "Renata": "Alipay", "Diva": "Alipay", "Paula": "credit card"}
df = df.assign(payment_type=df["payment_source"].replace(payment_types))
df


Unnamed: 0,day_cleaned,city,expense,price,payment_source,category,payment_type
14,May-11,Pequim,Táxi do aeroporto para a casa da Renata,87.00,Carol,Transporte,Alipay
10,May-12,Pequim,Didi pro restaurante de dumpling fritos,49.93,Carol,Transporte,Alipay
2,May-12,Pequim,Colar e brincos Carol + brinco presente da Lara,170.00,Carol,Compras/Presentes,Alipay
8,May-12,Pequim,3 Baralhos,90.00,Carol,Compras/Presentes,Alipay
7,May-12,Pequim,Mountain Coffee,25.00,Carol,Alimentação,Alipay
...,...,...,...,...,...,...,...
31,Jun-02,Guangzhou,Didi,18.79,Carol,Transporte,Alipay
32,Jun-02,Pequim,Didi do aeroporto,110.00,Renata,Transporte,Alipay
33,Jun-03,Pequim,Massagem nos pés,156.00,Carol,Compras/presentes,Alipay
35,Jun-03,Pequim,Taobao e Meituan,840.47,Renata,Compras/presentes,Alipay
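
With `payment_type` in place, the comparison described above (total spent and average per purchase for each payment type) could be computed along these lines. This is a minimal sketch on a four-row sample rather than the real `df`, and the output column names `total`, `average` and `purchases` are my own.

```python
import pandas as pd

# Small stand-in for df; one row has a null price, as in the real data
sample = pd.DataFrame({
    "price": [87.00, 49.93, 9169.11, None],
    "payment_type": ["Alipay", "Alipay", "credit card", "Alipay"],
})

# One row per payment type: total spent, mean per purchase, and number of priced purchases
summary = sample.groupby("payment_type")["price"].agg(
    total="sum",
    average="mean",
    purchases="count",
)
```

Rows with a null price are left out of all three figures, so `purchases` counts only the expenses that have a recorded amount.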
