# pandas: Python Data Analysis Library
### https://pandas.pydata.org
### pandas documentation: https://pandas.pydata.org/pandas-docs/stable/

- Which pandas version is currently installed? 0.24.2 or 0.25?

```
import pandas as pd
pd.__version__
```

- The typical way to load data is from a file. So where is the practice data?

```
%ls ../data
```

- Let's load data from a CSV (Comma-Separated Values) or TSV (Tab-Separated Values) file.

> [**Reference**] https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

```
df = pd.read_csv('../data/gapminder.tsv', sep='\t') # TSV: the default sep=',' would read each line as a single column
df
```
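The effect of the separator can be sketched with an in-memory string standing in for a file. The sample text below is made up for illustration and is not the contents of `gapminder.tsv`.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample; stands in for a real .tsv file.
tsv_text = "country\tyear\tpop\nA\t1952\t10\nB\t1957\t20\n"

# With the default sep=',' each line is parsed as one single column.
wrong = pd.read_csv(io.StringIO(tsv_text))

# With sep='\t' the three columns are split as intended.
right = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(wrong.shape)
print(right.shape)
print(list(right.columns))
```

If the loaded frame shows only one column whose name contains tab characters, a missing `sep='\t'` is the likely cause.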

- `Series` is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the **index**.

- `DataFrame` is a two-dimensional labeled data structure with columns of potentially different types. Along with the data, you can optionally pass **index** (row labels) and **columns** (column labels) arguments.

- Is the data loaded from `gapminder.tsv` a `Series` or a `DataFrame`?

```
type(df)
```
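To make the two definitions above concrete, here is a minimal sketch that builds one of each by hand. The values and labels are arbitrary placeholders.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A DataFrame: two-dimensional, with row labels (index) and column labels (columns).
frame = pd.DataFrame(
    [[1, "x"], [2, "y"]],
    index=["row1", "row2"],
    columns=["num", "letter"],
)

print(type(s).__name__)
print(type(frame).__name__)

# Each column of a DataFrame is itself a Series sharing the frame's index.
print(type(frame["num"]).__name__)
```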

- So where exactly is `pandas.core.frame.DataFrame` defined?

```
pd.DataFrame?
```

- What attributes does the `DataFrame` class have?

> [**Reference**] https://pandas.pydata.org/pandas-docs/stable/reference/frame.html

```
df.shape
df.dtypes # plural: a DataFrame has one dtype per column
df.columns
df.index
```
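A small sketch of these attributes on a toy frame (the column names and values are placeholders). It also shows why the plural `dtypes` is needed on a `DataFrame` while a single column has the singular `dtype`.

```python
import pandas as pd

toy = pd.DataFrame({"country": ["A", "B"], "year": [1952, 1957]})

print(toy.shape)          # (number of rows, number of columns)
print(toy.dtypes)         # a Series holding one dtype per column
print(list(toy.columns))  # column labels
print(toy.index)          # row labels

# A single column is a Series, which has the singular .dtype attribute.
year_dtype = toy["year"].dtype
print(year_dtype)
```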

- How can we see a summary of the `DataFrame` all at once?
> [**Reference**] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html

```
df.info()
```

- What about the values of the `DataFrame`? The pandas documentation recommends `df.to_numpy()` over `df.values`!
> [**Reference**] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.values.html
> [**Reference**] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html
>  > `result = np.array(self.values, dtype=dtype, copy=copy)`

```
df.to_numpy()
```
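As a rough illustration of what `to_numpy()` returns, the sketch below uses a toy frame with one integer and one float column (placeholder values). Mixed numeric columns end up in a single array with a common dtype, and the `dtype` argument can request a different one.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"year": [1952, 1957], "score": [1.5, 2.5]})

arr = toy.to_numpy()
print(type(arr))
print(arr.dtype)  # int and float columns are upcast to one common dtype

# The dtype argument converts during the export.
arr32 = toy.to_numpy(dtype="float32")
print(arr32.dtype)
```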

- That's too much data to look at...

```
df.head()
df.tail()
```

- Tedious as it may be, [**Indexing and selecting data**](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) is worth reading at least once.

### Basic indexing
- The primary function of indexing with `[]` (a.k.a. `__getitem__`) is selecting out lower-dimensional slices.
  - Series: `s[label]` returns scalar value
  - DataFrame: `df[colname]` returns Series corresponding to `colname`

```
country = df['country'] # or df.__getitem__('country')
country
type(country)
```

- You can pass a list of columns to `[]` to select columns in **that order**.

```
continent_and_country = df[['continent', 'country']]
continent_and_country

country_and_continent = df[['country', 'continent']]
country_and_continent
```

- What if the list contains just one column name instead of several?

```
continent = df[['continent']]
continent.head()
type(continent)
```
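The difference between a string key and a one-element list can be sketched on a toy frame (placeholder data): the former yields a `Series`, the latter a one-column `DataFrame`.

```python
import pandas as pd

toy = pd.DataFrame(
    {"country": ["A", "B"], "continent": ["X", "Y"], "year": [1952, 1957]}
)

as_series = toy["continent"]   # string key -> Series
as_frame = toy[["continent"]]  # one-element list -> DataFrame with one column

print(as_series.shape)
print(as_frame.shape)

# A list of names also controls the column order of the result.
reordered = toy[["year", "country"]]
print(list(reordered.columns))
```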

- If `colname` does not exist, reading it (`__getitem__`) raises `KeyError`, but writing to it (`__setitem__`) creates a new column.

```
df['const_one'] # KeyError: the column does not exist yet
df['const_one'] = 1
df.head()
```
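Because the failing read stops execution, a version that runs from top to bottom might catch the error explicitly. This is a sketch on a toy frame, not the notebook's data.

```python
import pandas as pd

toy = pd.DataFrame({"country": ["A", "B"]})

# Reading a missing column raises KeyError.
try:
    toy["const_one"]
    read_failed = False
except KeyError:
    read_failed = True

# Writing to a missing column creates it; the scalar is broadcast to every row.
toy["const_one"] = 1

print(read_failed)
print(list(toy.columns))
```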

- How do we drop rows or columns?
> [**Reference**] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html

```
df_dropped = df.drop([1, 3]) # default: axis=0
#df_dropped = df.drop(index=[1, 3])
df_dropped.head()

df_dropped = df.drop(['continent', 'country'], axis=1)
#df_dropped = df.drop(columns=['continent', 'country'])
df_dropped.head()
```
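One detail worth seeing in isolation: `drop` returns a new object by default and leaves the original untouched. A minimal sketch with placeholder data:

```python
import pandas as pd

toy = pd.DataFrame({"country": ["A", "B", "C"], "year": [1952, 1957, 1962]})

no_row = toy.drop(index=[1])         # drop the row labeled 1
no_col = toy.drop(columns=["year"])  # drop the column named 'year'

# The original frame still has all rows and columns.
print(toy.shape, no_row.shape, no_col.shape)
print(list(no_row.index))
```

Note that the remaining row labels are not renumbered after a row is dropped.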

### Different choices for indexing

- `.loc` is primarily **label** based, but may also be used with a boolean array. `.loc` will raise `KeyError` when the items are not found. Allowed inputs are:
  - A single label `5` or `'a'`
  - A list or array of labels `['a', 'b', 'c']`.
  - A slice object with labels `'a':'f'` (Both the start and the stop are **included**, when present in the index!)
  - A boolean array
  - A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).
  
> [**Reference**] https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

```
df.loc[0]
df.loc[[0, 1]]
df.loc[1:3]
df.loc[-1] # KeyError
```
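With the default integer index, labels and positions look the same, which hides what `.loc` is doing. The sketch below uses string labels (an assumption made for illustration) so that the label-based behavior is visible.

```python
import pandas as pd

toy = pd.DataFrame({"value": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

one = toy.loc["b"]         # single label -> the row as a Series
sliced = toy.loc["b":"d"]  # label slice: both 'b' and 'd' are included

# A label that is not in the index raises KeyError.
try:
    toy.loc["z"]
    missing = False
except KeyError:
    missing = True

print(one["value"])
print(list(sliced.index))
print(missing)
```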

- `.iloc` is primarily **integer position** based (from 0 to length-1 of the axis), but may also be used with a boolean array. `.iloc` will raise `IndexError` if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing. Allowed inputs are:
  - An integer `5`.
  - A list or array of integers `[4, 3, 0]`.
  - A slice object with integers `1:7` (The stop is **excluded**!).
  - A boolean array
  - A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).

> [**Reference**] https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

```
df.iloc[-1]
df.iloc[[0, 1, -1]]
df.iloc[1:3]
```
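The same slice `1:3` selects a different number of rows under `.loc` and `.iloc`. A sketch on a toy frame with the default index shows this, along with the out-of-bounds rules quoted above.

```python
import pandas as pd

toy = pd.DataFrame({"value": [10, 20, 30, 40]})  # default index 0..3

by_label = toy.loc[1:3]      # labels 1, 2, 3 (stop included)
by_position = toy.iloc[1:3]  # positions 1, 2 (stop excluded)
last = toy.iloc[-1]          # negative positions count from the end

# Slices may run past the end; a single out-of-bounds integer may not.
oob_slice = toy.iloc[2:100]
try:
    toy.iloc[100]
    out_of_bounds = False
except IndexError:
    out_of_bounds = True

print(len(by_label), len(by_position))
print(last["value"], len(oob_slice), out_of_bounds)
```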

- Getting values from an object with multi-axes selection uses the following notation.
  - Any of the axes accessors may be the null slice `:`.
  - Axes left out of the specification are assumed to be `:`, e.g. `df.loc['a']` is equivalent to `df.loc['a', :]`.
  - Series: `s.loc[indexer]`
  - DataFrame: `df.loc[row_indexer, column_indexer]`
  
```
subset = df.loc[:, ['year', 'pop']]
subset.head()

subset = df.iloc[:, [2, 4]]
subset.head()
```
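As a sketch of the `row_indexer, column_indexer` notation, the toy example below (placeholder data) selects the same block once by label and once by position, then picks out a single cell.

```python
import pandas as pd

toy = pd.DataFrame(
    {"country": ["A", "B", "C"], "year": [1952, 1957, 1962], "pop": [1, 2, 3]}
)

# Rows labeled 0..1 (inclusive) and two named columns.
by_label = toy.loc[0:1, ["year", "pop"]]

# Row positions 0..1 (stop excluded) and column positions 1 and 2.
by_position = toy.iloc[0:2, [1, 2]]

# Scalars on both axes return a single value.
cell = toy.loc[2, "year"]

print(by_label.equals(by_position))
print(cell)
```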

### Boolean indexing

- Another common operation is the use of boolean vectors to filter the data. The operators are: `|` for or, `&` for and, and `~` for not. **These must be grouped by using parentheses**!

> [**Reference**] https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-indexing

```
df['country'] == 'United States'
df.loc[df['country'] == 'United States']
df.loc[(df['country'] == 'United States') & (df['year'] == 1982)]
df.iloc[df['country'] == 'United States'] # NotImplementedError
```

- `.iloc` only accepts a boolean indexer that carries no index (for example a plain list or a NumPy array), not a boolean `Series`!

```
df.iloc[[True] + [False] * (len(df) - 2) + [True]]
```

- For boolean indexing, use `df.loc[...]` or simply `df[...]`!
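A compact sketch of the three operators on a toy frame (placeholder data). Parentheses are needed around inline comparisons because `&` and `|` bind more tightly than `==` in Python. The last line shows one way to feed a mask to `.iloc`, by stripping its index with `to_numpy()`.

```python
import pandas as pd

toy = pd.DataFrame({"country": ["A", "A", "B"], "year": [1952, 1982, 1982]})

is_a = toy["country"] == "A"  # a boolean Series aligned with toy's index

both = toy[(toy["country"] == "A") & (toy["year"] == 1982)]    # and
either = toy[(toy["country"] == "A") | (toy["year"] == 1982)]  # or
not_a = toy[~is_a]                                             # not

# .iloc rejects a boolean Series but accepts the underlying array.
via_iloc = toy.iloc[is_a.to_numpy()]

print(list(both.index), len(either), list(not_a.index))
print(via_iloc.equals(toy.loc[is_a]))
```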

### Selection by callable

- Indexing with a callable also works with plain `df[...]`!
- However, when both a row_indexer and a column_indexer are needed, use `df.loc`!

> [**Reference**] https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#selection-by-callable

```
df[lambda df: df.index < 5]
df[lambda df: df.columns[1:3]]
df.loc[lambda df: [0, 2, 4], lambda df: ['country', 'continent']]
```
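Callables are most useful in method chains, where the intermediate frame has no name to refer to. The sketch below uses a toy frame; the `decade` column is a hypothetical helper introduced only for this illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {"country": ["A", "B", "C"], "continent": ["X", "X", "Y"], "year": [1952, 1957, 1962]}
)

# Row filter with a callable passed to [].
recent = toy[lambda d: d["year"] > 1960]

# Row and column callables together require .loc.
picked = toy.loc[lambda d: d["year"] > 1960, lambda d: ["country", "year"]]

# In a chain, the callable receives the frame produced by the previous step.
fifties = (
    toy.assign(decade=lambda d: d["year"] // 10 * 10)
       .loc[lambda d: d["decade"] == 1950]
)

print(list(recent.index), picked.shape, list(fifties.index))
```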