# Analysing data

This notebook extracts insights from the complete, tidy dataset; the results will be plotted as charts in **Datawrapper**.

In [2]:
# Import libraries
import pandas as pd
import warnings

In [3]:
# Open dataset
df = pd.read_csv("../04_tidy_data/china_df.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 134 entries, 0 to 133
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   date              134 non-null    object 
 1   city              134 non-null    object 
 2   expense           134 non-null    object 
 3   price             134 non-null    float64
 4   payment_source    134 non-null    object 
 5   category          134 non-null    object 
 6   price_per_capita  134 non-null    float64
 7   payment_type      134 non-null    object 
dtypes: float64(2), object(6)
memory usage: 8.5+ KB


In [5]:
# The date column comes back from the CSV as plain text (object dtype).
# Parse it so the rows can be sorted chronologically, then store it as 'Mon-DD' text again for display:
df["date"] = pd.to_datetime(df["date"], format='%b-%d')
df = df.sort_values("date", ascending=True)
df["date"] = df["date"].dt.strftime('%b-%d')
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 134 entries, 0 to 133
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   date              134 non-null    object 
 1   city              134 non-null    object 
 2   expense           134 non-null    object 
 3   price             134 non-null    float64
 4   payment_source    134 non-null    object 
 5   category          134 non-null    object 
 6   price_per_capita  134 non-null    float64
 7   payment_type      134 non-null    object 
dtypes: float64(2), object(6)
memory usage: 9.4+ KB
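Month-day strings sort alphabetically ("Jun" comes before "May"), which is why the cell above parses them before sorting. As a minimal sketch of an alternative, the column can be left untouched and parsed only inside a sort `key`. The helper name `sort_by_month_day`, the toy frame, and the placeholder year 2000 are assumptions made for illustration; the dataset itself carries no year.

```python
import pandas as pd


def sort_by_month_day(frame: pd.DataFrame, column: str = "date") -> pd.DataFrame:
    """Return a copy of `frame` ordered chronologically by a 'Mon-DD' text column."""

    def as_datetime(values: pd.Series) -> pd.Series:
        # Assumption: a placeholder year (2000) is prepended, used only for ordering.
        return pd.to_datetime("2000-" + values, format="%Y-%b-%d")

    # kind="stable" keeps the original order of rows that share a date.
    return frame.sort_values(column, key=as_datetime, kind="stable")


# Tiny hypothetical frame, not the real dataset.
sample = pd.DataFrame(
    {"date": ["Jun-02", "May-12", "May-11"], "price": [18.79, 64.0, 87.0]}
)
sorted_sample = sort_by_month_day(sample)
print(sorted_sample["date"].tolist())
```

The `date` column stays as text throughout, so no round trip through `strftime` is needed.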


In [6]:
df.head(5)

,date,city,expense,price,payment_source,category,price_per_capita,payment_type
0,May-11,Pequim,Táxi do aeroporto para a casa da Renata,87.0,Carol,Transporte,43.5,apps
5,May-12,Pequim,Almoço Qianmen,64.0,Carol,Alimentação,32.0,apps
1,May-12,Pequim,Didi pro shopping das Pérolas (Hungqiao Market),13.3,Carol,Transporte,6.65,apps
3,May-12,Pequim,Didi pra casa,45.0,Carol,Transporte,22.5,apps
4,May-12,Pequim,Didi pra Qianmen,14.8,Carol,Transporte,7.4,apps


## 1- Comparing expenses in general

#### What were the 5 largest individual expenses?

In [10]:
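# Note: sort_values() returns a new DataFrame and does not modify df in place.
# The sorted result is discarded here, so `df` below still shows the rows in date order.
df.sort_values(by='price_per_capita', ascending=False)
df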

,date,city,expense,price,payment_source,category,price_per_capita,payment_type
0,May-11,Pequim,Táxi do aeroporto para a casa da Renata,87.00,Carol,Transporte,43.50,apps
5,May-12,Pequim,Almoço Qianmen,64.00,Carol,Alimentação,32.00,apps
1,May-12,Pequim,Didi pro shopping das Pérolas (Hungqiao Market),13.30,Carol,Transporte,6.65,apps
3,May-12,Pequim,Didi pra casa,45.00,Carol,Transporte,22.50,apps
4,May-12,Pequim,Didi pra Qianmen,14.80,Carol,Transporte,7.40,apps
...,...,...,...,...,...,...,...,...
129,Jun-02,Guangzhou,Livraria,11.50,Carol,Compras/presentes,3.83,apps
130,Jun-02,Guangzhou,Didi,18.79,Carol,Transporte,6.26,apps
131,Jun-03,Pequim,Didi para o aeroporto,90.00,Renata,Transporte,45.00,apps
132,Jun-03,Pequim,Taobao e Meituan,840.47,Renata,Compras/presentes,420.24,apps
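Because `sort_values` returns a new DataFrame rather than sorting in place, its result has to be kept or displayed directly to answer the question. A minimal sketch using `DataFrame.nlargest`, which sorts and truncates in one step; the helper name `top_expenses` is hypothetical, and the sample holds only a handful of rows copied from the output above, so the ranking it prints is illustrative and not the real top 5 of the full dataset.

```python
import pandas as pd


def top_expenses(frame: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Return the n rows with the highest price_per_capita, largest first."""
    return frame.nlargest(n, "price_per_capita")


# A few rows copied from the output above; NOT the full 134-row dataset.
sample = pd.DataFrame(
    {
        "expense": [
            "Táxi do aeroporto para a casa da Renata",
            "Almoço Qianmen",
            "Didi pra casa",
            "Livraria",
            "Didi",
            "Didi para o aeroporto",
            "Taobao e Meituan",
        ],
        "price_per_capita": [43.5, 32.0, 22.5, 3.83, 6.26, 45.0, 420.24],
    }
)

top5 = top_expenses(sample)
print(top5)
```

On the real data the equivalent call would be `top_expenses(df)`, or `df.sort_values(by="price_per_capita", ascending=False).head(5)`.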


In [9]:
##### Everything below this point is still incomplete

## 2- Comparing expenses by city

## 3- Comparing expenses by category

## 4- Comparing expenses by type of payment

#### How much did we spend using each payment type?

In [35]:
df1 = df.groupby("payment_type")["price"].sum()
total_payment_type = pd.DataFrame(data=df1)
total_payment_type

payment_type,price
apps,23700.56
credit card,49907.01


#### How many payments did we make with each type?

In [36]:
# size() counts the rows in each group; there is no need to group and count the same column twice
df2 = df.groupby("payment_type").size()
count_payment_type = df2.to_frame(name="count")
count_payment_type

payment_type,count
apps,126
credit card,8


#### What was the average price per capita of an expense for each payment type?

In [37]:
df3 = df.groupby("payment_type")["price_per_capita"].mean()
avg_payment_type = pd.DataFrame(data=df3)
avg_payment_type

payment_type,price_per_capita
apps,76.858849
credit card,2116.187917


## 5- Curious insights

#### What did we buy the most?