# Tidy Data by Hadley Wickham
### Journal of Statistical Software, August 2014, Volume 59, Issue 10
### https://www.jstatsoft.org/article/view/v059i10

# 2. Defining tidy data

## 2.1. Data structure (physical layout)

- Most statistical datasets are rectangular **tables** made up of **rows** and **columns**.

|              | treatmenta | treatmentb |
| ------------ | ---------: | ---------: |
| John Smith   |        NaN |          2 |
| Jane Doe     |         16 |         11 |
| Mary Johnson |          3 |          1 |


## 2.2. Data semantics (meaning)

- A **dataset** is a collection of **values**, usually either numbers (if quantitative) or strings (if qualitative).
- Every **value** belongs to a **variable (feature)** and an **observation (sample)**.
- A **variable** contains all values that measure the same underlying attribute (like height, temperature, duration) across units.
- An **observation** contains all values measured on the same unit (like a person, or a day, or a race) across attributes.


## 2.3. Tidy data

**Tidy data** is a standard way of mapping the meaning of a dataset to its structure. A dataset is
messy or tidy depending on how rows, columns and tables are matched up with observations,
variables and types. In tidy data:

1. Each **variable** forms a **column**.
2. Each **observation** forms a **row**.
3. Each **type** of observational unit forms a **table**.

| name         | trt  | result |
| ------------ | :--- | -----: |
| John Smith   | a    |    NaN |
| Jane Doe     | a    |     16 |
| Mary Johnson | a    |      3 |
| John Smith   | b    |      2 |
| Jane Doe     | b    |     11 |
| Mary Johnson | b    |      1 |
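As a minimal sketch of how the wide table in Section 2.1 maps onto the tidy table above, the following rebuilds both in pandas. The values come from the two tables; the frame names `wide` and `tidy`, and the use of a `name` column for the row labels, are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Wide layout from Section 2.1: one column per treatment.
wide = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Mary Johnson'],
    'treatmenta': [np.nan, 16, 3],
    'treatmentb': [2, 11, 1],
})

# Melt the treatment columns into (trt, result) pairs, one row per observation.
tidy = wide.melt(id_vars='name', var_name='trt', value_name='result')

# Strip the 'treatment' prefix so that trt holds only 'a' or 'b'.
tidy['trt'] = tidy['trt'].str.replace('treatment', '', regex=False)
print(tidy)
```

Each row of `tidy` is now one (person, treatment) observation, and each column is one variable.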

# 3. Tidying messy datasets

## 3.1. Column headers are values, not variable names

```
import numpy as np
import pandas as pd
```

The dataset in Table 4 (p.6) explores the relationship between income and religion in the US.
- It comes from a report produced by the Pew Research Center, an American think-tank that collects data on attitudes to topics ranging from religion to the internet, and produces many reports that contain datasets in this format.

```
pew = pd.read_csv('../data/pew.csv')
pew
```

- This dataset has three variables, `religion`, `income`, and `frequency`.
- To tidy it, we need to **melt**, or **stack** it. In other words, we need to turn columns into rows.
- Melting is parameterized by a list of columns that are already variables, or **colvar**s for short. In pandas, these are passed to `melt` as `id_vars`.

> [**Reference**] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.melt.html

> `df.melt(id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)`

```
pew.melt(id_vars='religion')
pew_tidy = pew.melt(id_vars='religion', var_name='income', value_name='freq')
pew_tidy.head()
```

Another common use of this data format is to record regularly spaced observations over time.
- The Billboard dataset shown in Table 7 records the date a song first entered the Billboard Top 100.
- It has variables for `artist`, `track`, `date.entered`, `rank`, and `week`.
- The `rank` in each `week` after it enters the top 100 is recorded in 75 columns, `wk1` to `wk75`.
- If a song is in the Top 100 for fewer than 75 weeks, the remaining columns are filled with missing values.

```
billboard = pd.read_csv('../data/billboard.csv')
billboard.head()
```

- Melting yields Table 8.

```
billboard.melt(id_vars=['year', 'artist', 'track', 'time', 'date.entered'],
               value_name='rank',
               var_name='week')
```

- What does the tidy data look like?

> [**Reference**] https://pandas.pydata.org/pandas-docs/stable/reference/series.html#api-series-str

```
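billboard_tidy = billboard.melt(id_vars=['year', 'artist', 'track', 'time', 'date.entered'],
                                value_name='rank',
                                var_name='week')
# Turn 'wk1' ... 'wk75' into the integers 1 ... 75.
# expand=False returns a Series rather than a one-column DataFrame.
billboard_tidy['week'] = (billboard_tidy['week']
                          .str.extract(r'wk(\d+)', expand=False)
                          .astype(int))
billboard_tidy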
```
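The melted frame keeps one row for every week column, including the weeks in which a song was no longer on the chart. The sketch below shows one way to drop those padding rows. It uses a made-up three-week miniature of the Billboard layout (the artists, tracks, and ranks are hypothetical), and the `dropna` step is an addition that the code above does not perform.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Billboard layout (not the real data).
mini = pd.DataFrame({
    'artist': ['A', 'B'],
    'track': ['x', 'y'],
    'wk1': [87, 91],
    'wk2': [82, np.nan],
    'wk3': [np.nan, np.nan],
})

mini_long = mini.melt(id_vars=['artist', 'track'],
                      var_name='week', value_name='rank')

# Turn 'wk1' into the integer 1; expand=False returns a Series.
mini_long['week'] = (mini_long['week']
                     .str.extract(r'wk(\d+)', expand=False)
                     .astype(int))

# Drop the weeks in which the song was not on the chart, then order the rows.
mini_tidy = (mini_long
             .dropna(subset=['rank'])
             .sort_values(['artist', 'week'])
             .reset_index(drop=True))
print(mini_tidy)
```

Whether to drop these rows depends on the analysis: here the missing values only pad the table out to a fixed width, so removing them loses no information.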

- Compute the mean `rank` for each artist.

```
(billboard_tidy
 .groupby('artist')['rank']
 .mean()
)
```

- Another dataset: the Ebola virus...

```
ebola = pd.read_csv('../data/country_timeseries.csv')
ebola
```

- How do we melt the dataset above?

```
ebola_long = ebola.melt(id_vars=['Date', 'Day'],
                        var_name='cd_country',
                        value_name='count')
ebola_long
```

- What values does `cd_country` contain?

> [**Reference**] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html

```
ebola_long.cd_country.unique()
```

- How do we split `cd_country` into two columns, `status` and `country`?

> [**Reference**] https://pandas.pydata.org/pandas-docs/stable/reference/series.html#api-series-str

```
ebola_long.cd_country.str.split('_')
ebola_split = ebola_long.cd_country.str.split('_', expand=True)
ebola_split.head()
```

- How do we add `ebola_split` to `ebola_long`?

```
ebola_long[['status', 'country']] = ebola_split
ebola_long.head()
```

- Another attempt?

> [**Reference**] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html

```
ebola_split_rev = (ebola_long
                   .cd_country
                   .str.split('_', expand=True)
                   .rename(columns={0: 'country', 1: 'status'})
                  )
ebola_split_rev.head()
```

- Why are the results different? (**Caution**)

```
ebola_long[['status', 'country']] = ebola_split_rev
ebola_long.head()

ebola_long.loc[:, ['status', 'country']] = ebola_split_rev
ebola_long.head()
```
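The difference between the two assignments can be reproduced on a toy frame. The sketch below assumes that `cd_country` labels have the form `<status>_<country>` (the two labels used are illustrative) and copies the deliberately swapped `rename` from `ebola_split_rev`. It relies on two behaviours: `.loc` assignment aligns on labels, while `df[[...]] = frame` pairs columns by position. The second is how pandas appears to behave and may vary between versions, so treat this as something to verify locally.

```python
import pandas as pd

# Toy stand-in for ebola_long; the two labels are illustrative only.
toy = pd.DataFrame({'cd_country': ['Cases_Guinea', 'Deaths_Liberia']})

# Same deliberately swapped labels as ebola_split_rev.
parts = (toy['cd_country']
         .str.split('_', expand=True)
         .rename(columns={0: 'country', 1: 'status'}))

# df[[...]] = frame pairs columns by position and ignores the labels of parts.
by_position = toy.copy()
by_position[['status', 'country']] = parts

# .loc[:, [...]] = frame aligns on column labels before assigning.
by_label = by_position.copy()
by_label.loc[:, ['status', 'country']] = parts

print(by_position)
print(by_label)
```

In `by_position` the `status` column holds `Cases`/`Deaths` despite the wrong labels on `parts`, whereas in `by_label` the wrong labels are honoured and `status` ends up holding country names.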

## 3.3. Variables are stored in both rows and columns

The most complicated form of messy data occurs when variables are stored in both rows and columns.
- Table 11 shows daily weather data from the Global Historical Climatology Network for one weather station (MX17004) in Mexico for five months in 2010.
- It has variables in individual columns (`id`, `year`, `month`), spread across columns (`d1` to `d31`), and spread across rows (`tmin` and `tmax`, the minimum and maximum temperature).
- Months with fewer than 31 days have structural missing values for the last day(s) of the month.
- The element column is not a variable; it stores the names of variables.

```
weather = pd.read_csv('../data/weather.csv')
weather
```

- To tidy this dataset we first melt it with colvars `id`, `year`, `month` and the column that contains variable names, `element`. This yields Table 12(a).

```
weather_long = weather.melt(id_vars=['id', 'year', 'month', 'element'],
                            var_name='day',
                            value_name='temp')
weather_long.head()
```

- This dataset is mostly tidy, but we have two variables stored in rows: `tmin` and `tmax`, the type of observation.
- Fixing the issue with the type of observation requires the **cast**, or **unstack**, operation. This performs the inverse of melting by rotating the element variable back out into the columns (Table 12(b)).
- This form is tidy. There is one variable in each column, and each row represents a day's observations.

> [**Reference**] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html

```
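# Cast: rotate the values of `element` (tmin, tmax) back out into columns.
weather_long.pivot_table(index=['id', 'year', 'month', 'day'],
                         columns='element',
                         values='temp')

# Same cast, then move the index levels back into ordinary columns.
(weather_long.pivot_table(index=['id', 'year', 'month', 'day'],
                          columns='element',
                          values='temp')
 .reset_index()
)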
```
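The melt-then-cast sequence can be traced end to end on a small frame. The sketch below uses a hypothetical two-day extract in the weather layout (the temperatures are made up) and drops the structurally missing day explicitly before casting, an extra step chosen here so the result does not depend on how `pivot_table` treats all-missing rows.

```python
import numpy as np
import pandas as pd

# Hypothetical two-day extract in the weather layout (values are made up).
mini = pd.DataFrame({
    'id': ['MX17004'] * 2,
    'year': [2010] * 2,
    'month': [1, 1],
    'element': ['tmax', 'tmin'],
    'd1': [27.8, 14.5],
    'd2': [np.nan, np.nan],
})

# Step 1: melt the day columns into rows.
mini_long = mini.melt(id_vars=['id', 'year', 'month', 'element'],
                      var_name='day', value_name='temp')

# Step 2: drop structurally missing days explicitly.
mini_long = mini_long.dropna(subset=['temp'])

# Step 3: cast the element values (tmax, tmin) back out into columns.
mini_tidy = (mini_long
             .pivot_table(index=['id', 'year', 'month', 'day'],
                          columns='element', values='temp')
             .reset_index())
mini_tidy.columns.name = None  # drop the leftover 'element' axis label
print(mini_tidy)
```

Note that `pivot_table` aggregates with the mean by default, so duplicate (`id`, `year`, `month`, `day`, `element`) combinations would be averaged silently rather than reported.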

### **Exercise**: Combine `year`, `month`, and `day` into a single `date` column (e.g., `2019-01-30`).
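One possible approach to the exercise, sketched on hypothetical rows rather than the real weather data: strip the `d` prefix left over from melting, then let `pd.to_datetime` assemble the date from columns named `year`, `month`, and `day`. The frame `df` and its values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical tidy rows; 'day' still carries the 'd' prefix from melting.
df = pd.DataFrame({
    'year': [2010, 2010],
    'month': [1, 2],
    'day': ['d30', 'd2'],
})

# Keep only the digits of the day label and convert them to integers.
df['day'] = df['day'].str.extract(r'd(\d+)', expand=False).astype(int)

# pd.to_datetime assembles dates from columns named year, month and day.
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
print(df['date'].dt.strftime('%Y-%m-%d').tolist())
```

Impossible dates such as February 30 raise an error with this call, so rows for structurally missing days should be dropped beforehand, or `errors='coerce'` can be passed to turn them into `NaT`.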