# Fact-checking raw data
Making sure the LLM didn't corrupt the content from the data source [while creating the dictionary](../01_prompt/prompt.md)

### Part 1: fetching and preparing the data

In [99]:
# Import libraries
import pandas as pd
import json
import warnings

In [100]:
# Open raw dataset and save it in an object called "raw"
raw = pd.read_json("../00_raw/china_raw.json", dtype={"cidade": str})
raw

Unnamed: 0,data,cidade,descricao,valor,quem_pagou,categoria
0,12/05,Pequim,Didi pro templo do céu (tava fechado),38.42,Carol,Transporte
1,12/05,Pequim,Didi pro shopping das Pérolas (Hungqiao Market),13.30,Carol,Transporte
2,12/05,Pequim,Colar e brincos Carol + brinco presente da Lara,170.00,Carol,Compras/Presentes
3,12/05,Pequim,Didi pra Qianmen,14.80,Carol,Transporte
4,12/05,Pequim,Almoço Qianmen,64.00,Carol,Alimentação
...,...,...,...,...,...,...
124,29/05,Lhasa,Didi para Bokar,,Diva,Transporte
125,29/05,Lhasa,Almoço,33.00,Diva,Alimentação
126,29/05,Lhasa,Compras Bokar Supermarket,105.00,Carol,Compras/Presentes
127,29/05,Lhasa,Massagem no aeroporto,30.00,Diva,Compras/Presentes


In [101]:
# Inspect raw dataset
raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129 entries, 0 to 128
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   data        129 non-null    object 
 1   cidade      129 non-null    object 
 2   descricao   129 non-null    object 
 3   valor       122 non-null    float64
 4   quem_pagou  129 non-null    object 
 5   categoria   129 non-null    object 
dtypes: float64(1), object(5)
memory usage: 6.2+ KB


In [102]:
# Use "rename" to translate the column names into clearer English ones
raw.rename(columns={"data": "day",
                    "cidade": "city",
                    "descricao": "expense",
                    "valor": "price",
                    "quem_pagou": "payment_source",
                    "categoria": "category"
                   }, inplace=True)
raw.head(3)

Unnamed: 0,day,city,expense,price,payment_source,category
0,12/05,Pequim,Didi pro templo do céu (tava fechado),38.42,Carol,Transporte
1,12/05,Pequim,Didi pro shopping das Pérolas (Hungqiao Market),13.3,Carol,Transporte
2,12/05,Pequim,Colar e brincos Carol + brinco presente da Lara,170.0,Carol,Compras/Presentes


In [103]:
# Make the "day" column look better using "str.replace"
# str.replace returns a Series, which can be assigned straight back to the column
raw["day"] = raw["day"].str.replace("/05", "-May", regex=False)
raw.head()

Unnamed: 0,day,city,expense,price,payment_source,category
0,12-May,Pequim,Didi pro templo do céu (tava fechado),38.42,Carol,Transporte
1,12-May,Pequim,Didi pro shopping das Pérolas (Hungqiao Market),13.3,Carol,Transporte
2,12-May,Pequim,Colar e brincos Carol + brinco presente da Lara,170.0,Carol,Compras/Presentes
3,12-May,Pequim,Didi pra Qianmen,14.8,Carol,Transporte
4,12-May,Pequim,Almoço Qianmen,64.0,Carol,Alimentação


### Part 2: fact-checking the LLM work

#### 2.1: Price values
When making the table, DeepSeek had already turned the column into a float and detected the values that weren't valid numbers. For example, **"12.XX"**, which I wrote when I didn't remember how many cents. Those were replaced by the word **"None"**.
<br><br>
When it turned the dictionary into a json, DeepSeek replaced "None" with **null**.
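
A minimal sketch of that conversion, using only Python's standard `json` module. The two records are toy rows copied from the table above, not the full dictionary DeepSeek produced. When pandas later loads the file, `null` becomes `NaN`, which is why `pd.isna` finds these rows in the next cell.

```python
import json

# Toy records in the raw schema (rows 124 and 127 of the table above).
records = [
    {"descricao": "Didi para Bokar", "valor": None},
    {"descricao": "Massagem no aeroporto", "valor": 30.0},
]

# Python's None is written out as JSON null...
as_json = json.dumps(records)
print(as_json)

# ...and null is read back as None when the JSON is parsed again.
round_trip = json.loads(as_json)
missing = [r["descricao"] for r in round_trip if r["valor"] is None]
print(missing)  # ['Didi para Bokar']
```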

In [104]:
# Check how many rows in the "price" column are now NA:
no_price = raw.loc[pd.isna(raw["price"])]
no_price

Unnamed: 0,day,city,expense,price,payment_source,category
51,17-May,Pequim,Metrô,,Diva,Transporte
96,22-May,Pequim,Didi para Tiannanmen,,Diva,Transporte
97,22-May,Pequim,Almoço no museu,,Diva,Alimentação
98,22-May,Pequim,Comprinhas museu,,Diva,Compras/Presentes
102,24-May,Lhasa,Didi pro aeroporto,,Carol,Transporte
111,27-May,Shigatse,Jantar,,Carol,Alimentação
124,29-May,Lhasa,Didi para Bokar,,Diva,Transporte
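
To see how the missing prices are spread across days, the filtered rows can be counted per day. The sketch below rebuilds just the "day" column from the seven rows printed above; on the real data the equivalent call would be `no_price["day"].value_counts()`.

```python
import pandas as pd

# Days of the seven rows with a missing price, copied from the output above.
no_price_days = pd.Series(
    ["17-May", "22-May", "22-May", "22-May", "24-May", "27-May", "29-May"],
    name="day",
)

# Count how many missing prices fall on each day.
per_day = no_price_days.value_counts()
print(per_day.idxmax(), per_day.max())  # 22-May 3
```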


There aren't many results, and three of the seven are from May 22nd. We can manually use Ctrl+F on the raw data to check a couple of samples. For example: "Didi para Tiannanmen":
<br><br>
![image](../02_check/check1.png)
<br><br>
The screenshot also shows the May 24th expense without a value, as well as a couple of entries from May 23rd with very little information. Let's filter that day to learn more:

In [105]:
# Filter to get only the rows from May 23rd
may_23 = raw[raw["day"] == "23-May"]
may_23

Unnamed: 0,day,city,expense,price,payment_source,category
100,23-May,Pequim,Metrô pra Cidade Proibida,9.0,Carol,Transporte
101,23-May,Pequim,Entrada Cidade Proibida,60.0,Diva,Ingressos


Comparing with the actual notes:
![image](../02_check/check2.png)
<br> So clearly DeepSeek made a couple of editorial decisions:
<br><br>
1- The subway ride to the Forbidden City was either 8 or 10 yuan, so it simply calculated the average (9). That's incorrect, since the amount was the total for two people, but the error is too small to inflate or deflate our totals by much. So we can let it slide.
<br><br>
2- There are also three lines of incoherent notes about shopping and the metro ride home, with no other information such as who paid or how much it cost. DeepSeek just decided to ignore them. Since I can't remember the missing information, and since I'm not planning on diving into how incomplete the dataset is, I have to agree those lines are of little use for the analysis.
<br><br>
Now let's try the same for May 24th, which also had interesting ways of showing prices:

In [106]:
# Filter to get only the rows from May 24th
may_24 = raw[raw["day"] == "24-May"]
may_24

Unnamed: 0,day,city,expense,price,payment_source,category
102,24-May,Lhasa,Didi pro aeroporto,,Carol,Transporte
103,24-May,Lhasa,Almoço hot pot surpresa,438.0,Tica,Alimentação
104,24-May,Lhasa,Show Princesa Wejcheng,840.0,Tica,Ingressos
105,24-May,Lhasa,Comidas,30.0,Carol/Renata,Alimentação


Focus on row 104, the Princess Wejcheng show. There were three of us and the ticket prices were entered separately, joined by plus (+) signs.
<br><br>
![image](../02_check/check3.png)
<br><br>
DeepSeek took the liberty of adding up the values and got the math right (320 + 320 + 200 = 840). It kind of saved us the hassle of finding the same expense duplicated and adding the values ourselves.
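
For comparison, here is a rough sketch of how the same normalization could be done by hand. `parse_price` is a hypothetical helper, not what DeepSeek actually ran: it assumes the parts of a price are separated by "+" and that anything non-numeric (such as "12.XX") means the price is unknown.

```python
def parse_price(text):
    """Turn a raw price note into a float, or None when it can't be read.

    Hypothetical helper: parts are assumed to be separated by "+", and any
    part that isn't a number makes the whole price unknown.
    """
    try:
        return sum(float(part) for part in text.split("+"))
    except ValueError:
        return None


print(parse_price("320 + 320 + 200"))  # 840.0
print(parse_price("12.XX"))            # None
```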

#### 2.2: Classification
Now let's see how good DeepSeek was at classifying our expenses into the categories I defined:
<br>
*Transportation ("Transporte")
<br>
Food ("Alimentação")
<br>
Hotels ("Hospedagem")
<br>
Entrance tickets ("Ingresso")
<br>
Shopping, presents and souvenirs ("Compras/Presentes")*

In [107]:
# Search for all the rows where the description of the expense starts with "Didi".
# This is a clear sign of a transportation expense (Didi is China's Lyft).
# Then we will group by category and check how many there are.
# na=False keeps rows with a missing description out of the matches.
didi = raw.loc[raw.expense.str.startswith("Didi", na=False)]\
.groupby("category")["category"].value_counts()
didi

category
Transporte    22
Name: count, dtype: int64

Perfect, it seems it was able to identify every Didi trip as a transportation expense!
<br><br>
Let's try the same for the words "Almoço" (lunch), "Jantar" (dinner), "Café" (coffee), "Água" (water) and "Suco" (juice).

In [108]:
# Search for all the rows that contain the words above:

# First, set a pattern with the words
pattern = "Almoço|Jantar|Café|Água|Suco"

# Then filter the dataset to get the rows that contain one of the words in your pattern
# And group your new dataset by category, then count how many rows are of each category
food = raw.loc[raw.expense.str.contains(pattern, na=False)]\
.groupby("category")["category"].value_counts()
food

category
Alimentação    29
Name: count, dtype: int64

Great: all expenses with words relating to food or meals were correctly classified.
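
The two checks above can be folded into a single loop that lists the rows whose category disagrees with a keyword. This is only a sketch: the `expected` mapping is a hypothetical illustration, and the toy frame reuses descriptions from the dataset with one row deliberately misclassified so there is something to find.

```python
import pandas as pd

# Toy rows in the notebook's schema; the last one is deliberately misclassified.
sample = pd.DataFrame({
    "expense": ["Didi pra Qianmen", "Almoço Qianmen", "Jantar Zini 369", "Didi para Bokar"],
    "category": ["Transporte", "Alimentação", "Alimentação", "Alimentação"],
})

# Keyword pattern -> category we expect (hypothetical mapping for illustration).
expected = {
    r"^Didi": "Transporte",
    r"Almoço|Jantar|Café|Água|Suco": "Alimentação",
}

# Mark every row that matches a pattern but carries a different category.
bad = pd.Series(False, index=sample.index)
for pattern, category in expected.items():
    hit = sample["expense"].str.contains(pattern, na=False)
    bad |= hit & (sample["category"] != category)

mismatches = sample[bad]
print(mismatches["expense"].tolist())  # ['Didi para Bokar']
```

An empty `mismatches` frame would mean every keyword row landed in the expected category, which is what the real data showed.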

#### 2.3: Math check
Now, let's simply check a few of the values in the "price" column, just to see if any hallucination has left us with crazy prices for anything:

In [109]:
# Arrange the expenses according to their price
# But first, remove the NAs so we can see more details
raw.dropna(subset=["price"]).sort_values("price", ascending=False)

Unnamed: 0,day,city,expense,price,payment_source,category
59,18-May,Xangai,Passagem avião Xangai-Pequim,1460.0,Tica,Transporte
74,19-May,Xangai,Uniqlo haul,1162.0,Paula,Compras/Presentes
104,24-May,Lhasa,Show Princesa Wejcheng,840.0,Tica,Ingressos
44,16-May,Pequim,Jantar no Migas (espanhol),645.0,Diva,Alimentação
103,24-May,Lhasa,Almoço hot pot surpresa,438.0,Tica,Alimentação
...,...,...,...,...,...,...
70,19-May,Xangai,Coquinha,6.0,Diva,Alimentação
30,14-May,Datong,Lencinhos,5.4,Diva,Compras/Presentes
55,18-May,Suzhou,Metro,4.0,Diva,Transporte
64,19-May,Xangai,Barco,4.0,Cada uma,Transporte


Sounds about right that two plane tickets, a shopping spree from a Uniqlo first timer, a show and two special meals in good restaurants would be in the top 5 highest expenses.
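
Eyeballing the sorted table works at this size; the same idea can also be written as an explicit range check. In this sketch the prices are the nine values visible in the output above, and the 5000 yuan ceiling is an arbitrary assumption chosen for illustration.

```python
import pandas as pd

# The nine prices visible in the sorted output above.
prices = pd.Series([1460.0, 1162.0, 840.0, 645.0, 438.0, 6.0, 5.4, 4.0, 4.0])

# Flag anything non-positive or above an arbitrary ceiling (assumed: 5000 yuan).
suspicious = prices[(prices <= 0) | (prices > 5000)]
print(len(suspicious))              # 0
print(prices.nlargest(3).tolist())  # [1460.0, 1162.0, 840.0]
```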

#### 2.4: Discrete variables
Finally, let's check that the variables with only a few possible values (city, payment_source and category) don't contain anything unexpected:

In [112]:
# Get unique values of cities:
raw.city.unique()

array(['Pequim', 'Datong', 'Suzhou', 'Xangai', 'Lhasa', 'Shigatse'],
      dtype=object)

In [113]:
# Get unique values of who paid for what:
raw.payment_source.unique()

array(['Carol', 'Diva', 'Tica', 'Cada uma', 'Paula', 'Carol/Renata',
       'Renata'], dtype=object)

In [114]:
# Get unique values of categories:
raw.category.unique()

array(['Transporte', 'Compras/Presentes', 'Alimentação', 'Ingressos'],
      dtype=object)

Now, here's a *tiny* problem: one of the categories was supposed to be hotels. I know there was at least one hotel expense, but there is no such category listed. (DeepSeek also renamed my "Compras/presentes/souvenirs" category to simply "Compras/Presentes", but that's neither a hallucination nor a serious crime.)
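
The same gap can be found without reading the array by eye, by comparing the values against the expected list with a set difference. A sketch under one assumption: the expected names below use the spellings that appear in the data ("Ingressos", "Compras/Presentes") rather than the exact wording of my prompt.

```python
# Categories requested in the prompt (assumed spellings, see above).
expected_categories = {
    "Transporte", "Alimentação", "Hospedagem", "Ingressos", "Compras/Presentes",
}

# Values returned by raw.category.unique() above.
found_categories = {"Transporte", "Compras/Presentes", "Alimentação", "Ingressos"}

missing = expected_categories - found_categories
unexpected = found_categories - expected_categories
print(sorted(missing))     # ['Hospedagem']
print(sorted(unexpected))  # []
```

On the real DataFrame, `found_categories` would be `set(raw["category"].unique())`.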
<br><br>
Let's find out how it classified any hotels:

In [115]:
# Search for all the rows where the description of the expense contains "Hotel" or "hotel".
# This is a clear sign of a hotel expense.
# In Portuguese, I asked it to name the category "Hospedagem", which means "Lodging".
# Afterwards, group the results by category and check how many there are:

hotels = "Hotel|hotel"

# na=False keeps rows with a missing description out of the matches
hotel = raw.loc[raw.expense.str.contains(hotels, na=False)]\
.groupby("category")["category"].value_counts()
hotel

category
Alimentação          3
Compras/Presentes    1
Transporte           3
Name: count, dtype: int64

**Note:** This broader search uses "str.contains", so it also picks up expenses such as "taxi to the hotel", which aren't hotel spending per se. Either way, there should have been a category called "Hospedagem".
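
The two methods differ in more than where they look: `str.contains` treats its pattern as a regular expression by default, while `str.startswith` only compares literal prefixes. A small sketch on made-up descriptions (the third one is invented for illustration), assuming pandas 1.5 or later, where `str.startswith` accepts a tuple of prefixes:

```python
import pandas as pd

# Descriptions modeled on this dataset; "Hotel 2 noites" is made up.
expenses = pd.Series([
    "Didi pro Hotel Weidu em Datong",
    "Dois cafés da manhã do hotel",
    "Hotel 2 noites",
    None,
])

# str.contains interprets "Hotel|hotel" as a regex: either word, anywhere.
anywhere = expenses.str.contains("Hotel|hotel", na=False)

# str.startswith takes literal prefixes; a tuple means "any of these".
at_start = expenses.str.startswith(("Hotel", "hotel"), na=False)

# A regex string is compared literally by startswith, so nothing matches.
literal = expenses.str.startswith("Hotel|hotel", na=False)

print(anywhere.tolist())  # [True, True, True, False]
print(at_start.tolist())  # [False, False, True, False]
print(literal.tolist())   # [False, False, False, False]
```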
<br><br>
We'll try again, this time with "str.startswith":

In [116]:
# str.startswith doesn't interpret regex, so pass the prefixes as a tuple
hotel_beginning = raw.loc[raw.expense.str.startswith(("Hotel", "hotel"), na=False)]\
.groupby("category")["category"].value_counts()
hotel_beginning

Series([], Name: count, dtype: int64)

Alright, there are no rows in the "expense" column starting with "Hotel" or "hotel". So I'll go back to a manual Ctrl+F in my raw .txt file to see what's going on:
<br><br>
![image](../02_check/check4.png)
<br><br>
Problem found: there is only one entry for an actual hotel expense, and it was one of the worst inputs of the messy dataset:
<br>
1- There's one line above saying simply "Hotel".
<br>
2- And the line with the hotel spending description is completely different from the other patterns I tried to use. I detailed how many people, how many nights and even the price of the breakfast that was included (I must have felt so creative at the time!).
<br>
And the description doesn't even start with the word "Hotel". No wonder DeepSeek didn't recognize this as a hotel expense.
<br><br>
The Ctrl+F search also shows that the other entries citing a hotel are clearly not a hotel expense.
<br><br>
So I only have one row to fix. Let's go find it!

In [117]:
missing_hotel = raw.loc[raw.expense.str.contains("International Hotel", na=False)]
missing_hotel

Unnamed: 0,day,city,expense,price,payment_source,category


Bad sign: there seems to be no row with the words "International Hotel"...
<br><br>
Maybe DeepSeek was so distraught by the input it simply decided to ignore it?

In [118]:
# Fetch all expenses from May 13th
may_13 = raw[raw["day"] == "13-May"]
may_13

Unnamed: 0,day,city,expense,price,payment_source,category
11,13-May,Pequim,Didi pro Templo do Céu,45.0,Carol,Transporte
12,13-May,Pequim,Ingresso Templo do Céu,34.0,Carol,Ingressos
13,13-May,Pequim,Icetea de limão,8.0,Carol,Alimentação
14,13-May,Pequim,Bolsa e marcador de página,78.0,Carol,Compras/Presentes
15,13-May,Pequim,Metrô,10.0,Carol,Transporte
16,13-May,Pequim,Didi pra Beijingbei,53.73,Carol,Transporte
17,13-May,Datong,Didi pro Hotel Weidu em Datong,18.2,Carol,Transporte
18,13-May,Datong,Dois cafés da manhã do hotel,116.0,Diva,Alimentação
19,13-May,Datong,Didi pro restaurante Zini 369,7.3,Diva,Transporte
20,13-May,Datong,Jantar Zini 369,66.0,Diva,Alimentação


Yup, looks like the only hotel on our original list is missing from the AI-generated version.
<br><br>
(There is only one hotel in my notes because in some cities the hotels were included in the tour package or the reservation was made by my sister, and I didn't bother fetching the price to write it down.)
<br><br>
**Conclusion:** All in all, the job was pretty well done for a task that didn't really require the most extreme precision. And it seems we have enough interesting data to go ahead with the cleaning part. Let's save this as a CSV and move to the [data cleaning notebook](../03_notebooks/01_data_cleaning.ipynb):

In [120]:
raw.to_csv("../00_raw/china_raw.csv", index=False)
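
Before moving on, it can be worth confirming that the CSV reads back the same way it was written. The sketch below does the round trip in memory on a toy frame (two rows copied from the tables above) so it doesn't touch the real file; the same comparison could be run against `../00_raw/china_raw.csv`.

```python
import io

import pandas as pd

# Toy frame in the notebook's schema (rows 124 and 127 from the tables above).
sample = pd.DataFrame({
    "day": ["29-May", "29-May"],
    "expense": ["Didi para Bokar", "Massagem no aeroporto"],
    "price": [None, 30.0],
})

# Write to an in-memory buffer instead of disk, then read it back.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Same shape, same columns, and the missing price is still missing.
print(reloaded.shape == sample.shape)       # True
print(int(reloaded["price"].isna().sum()))  # 1
```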