# pandas in 10 minutes
## Introduction
This article works through the official pandas tutorial "10 minutes to pandas", retyping each example and adding commentary.

It is based on the following URL:
https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html


## Environment
- Python 3.8
- Jupyter Lab


## Importing the libraries

In [178]:
import numpy as np
import pandas as pd

In [179]:
np

<module 'numpy' from 'C:\\Users\\user\\AppData\\Local\\Programs\\Python\\Python37\\lib\\site-packages\\numpy\\__init__.py'>

In [180]:
pd

<module 'pandas' from 'C:\\Users\\user\\AppData\\Roaming\\Python\\Python37\\site-packages\\pandas\\__init__.py'>

## [1. Object creation](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#object-creation)

### Series class
Passing a list to the [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series) class is an easy way to create data. pandas assigns a default integer index.


In [181]:
# Create a single column of values
s = pd.Series(data=[1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

### date_range function
[date_range()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html) creates a DatetimeIndex covering a given period.

In [182]:
# Six consecutive days starting from January 1, 2020
dates = pd.date_range("20200101", periods=6)
dates

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06'],
              dtype='datetime64[ns]', freq='D')

### DataFrame class
The **index argument** of the pandas [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas-dataframe) class sets the row index.

In [183]:
# Use the dates starting from January 1, 2020 as the row index
# Fill every cell with a random number
df = pd.DataFrame(np.random.randn(6, 4), index=dates)
df

Unnamed: 0,0,1,2,3
2020-01-01,-1.958945,-0.639418,-1.159997,0.457915
2020-01-02,0.34538,-0.596611,-0.618554,-0.224183
2020-01-03,-0.767764,0.79257,-0.080957,-0.233773
2020-01-04,-1.820982,-0.466856,0.16563,-1.001596
2020-01-05,-0.625027,-0.292947,-1.342502,-0.768661
2020-01-06,0.181846,1.005929,-0.170266,0.495846


Likewise, the **columns argument** of the DataFrame class sets the column names.

In [184]:
# Set the column names to A, B, C and D
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df

Unnamed: 0,A,B,C,D
2020-01-01,-0.190956,-0.77678,1.462762,0.603106
2020-01-02,-0.751889,0.704933,-0.036482,-1.690382
2020-01-03,0.725349,0.467768,0.002753,0.827767
2020-01-04,-0.241295,-0.011085,0.587052,-0.535605
2020-01-05,0.876456,0.959313,-2.127015,0.454854
2020-01-06,0.185534,0.011554,-0.133694,-0.003258


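When you pass a dict to the DataFrame class, the dict keys become the column names. Scalar values such as `1.` and `"foo"` are repeated for every row.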

In [185]:
df2 = pd.DataFrame(
    {
        "A": 1.,
        "B": pd.Timestamp("20200101"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2020-01-01,1.0,3,test,foo
1,1.0,2020-01-01,1.0,3,train,foo
2,1.0,2020-01-01,1.0,3,test,foo
3,1.0,2020-01-01,1.0,3,train,foo


### DataFrame.dtypes property
The **dtypes property** shows the data type of each column.

In [186]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

## [2. Viewing data](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#viewing-data)

### DataFrame.head method
[head()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html#pandas.DataFrame.head) of the DataFrame class displays the first rows of the data. The argument is the number of rows to show.

In [187]:
df.head(2)

Unnamed: 0,A,B,C,D
2020-01-01,-0.190956,-0.77678,1.462762,0.603106
2020-01-02,-0.751889,0.704933,-0.036482,-1.690382


### DataFrame.tail method
Likewise, [tail()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html#pandas-dataframe-tail) of the DataFrame class displays the last rows of the data.

In [188]:
df.tail(2)

Unnamed: 0,A,B,C,D
2020-01-05,0.876456,0.959313,-2.127015,0.454854
2020-01-06,0.185534,0.011554,-0.133694,-0.003258


### DataFrame.index property
[index](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.index.html#pandas-dataframe-index) of the DataFrame class returns the row index of the data.


In [189]:
df.index

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06'],
              dtype='datetime64[ns]', freq='D')

In [190]:
df2.index

Int64Index([0, 1, 2, 3], dtype='int64')

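### DataFrame.to_numpy method
[to_numpy()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html#pandas-dataframe-to-numpy) of the DataFrame class converts the data into a NumPy array, which is convenient for working with NumPy. The row index and column names are not included in the result.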


In [191]:
df.to_numpy()

array([[-0.19095613, -0.7767801 ,  1.46276182,  0.60310614],
       [-0.75188936,  0.70493296, -0.03648236, -1.69038241],
       [ 0.72534937,  0.46776792,  0.00275313,  0.82776687],
       [-0.24129548, -0.01108495,  0.58705207, -0.53560483],
       [ 0.8764557 ,  0.95931337, -2.12701477,  0.45485375],
       [ 0.18553417,  0.01155389, -0.13369352, -0.00325833]])

In [192]:
df2.to_numpy()

array([[1.0, Timestamp('2020-01-01 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2020-01-01 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2020-01-01 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2020-01-01 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)
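
The two results above differ in dtype because a NumPy array holds a single dtype for all of its elements. The following is a minimal sketch of that behavior; the frames `numeric` and `mixed` are hypothetical and not part of the tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical frames for illustration only.
numeric = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})
mixed = pd.DataFrame({"x": [1.0, 2.0], "label": ["a", "b"]})

numeric_array = numeric.to_numpy()  # every column is float64
mixed_array = mixed.to_numpy()      # floats and strings have no common numeric dtype

print(numeric_array.dtype)
print(mixed_array.dtype)
```

When the columns do not share a dtype, pandas falls back to `object`, which is generally slower to work with in NumPy.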

In [193]:
df

Unnamed: 0,A,B,C,D
2020-01-01,-0.190956,-0.77678,1.462762,0.603106
2020-01-02,-0.751889,0.704933,-0.036482,-1.690382
2020-01-03,0.725349,0.467768,0.002753,0.827767
2020-01-04,-0.241295,-0.011085,0.587052,-0.535605
2020-01-05,0.876456,0.959313,-2.127015,0.454854
2020-01-06,0.185534,0.011554,-0.133694,-0.003258


### DataFrame.describe method
[describe()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html#pandas-dataframe-describe) of the DataFrame class gives quick summary statistics for each column of the data.


In [194]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,0.100533,0.225951,-0.040771,-0.057253
std,0.620985,0.621827,1.184714,0.935995
min,-0.751889,-0.77678,-2.127015,-1.690382
25%,-0.228711,-0.005425,-0.109391,-0.402518
50%,-0.002711,0.239661,-0.016865,0.225798
75%,0.590396,0.645642,0.440977,0.566043
max,0.876456,0.959313,1.462762,0.827767


### DataFrame.T property
[T](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.T.html#pandas-dataframe-t) of the DataFrame class gives access to the data with rows and columns swapped.

In [195]:
df.T

Unnamed: 0,2020-01-01,2020-01-02,2020-01-03,2020-01-04,2020-01-05,2020-01-06
A,-0.190956,-0.751889,0.725349,-0.241295,0.876456,0.185534
B,-0.77678,0.704933,0.467768,-0.011085,0.959313,0.011554
C,1.462762,-0.036482,0.002753,0.587052,-2.127015,-0.133694
D,0.603106,-1.690382,0.827767,-0.535605,0.454854,-0.003258


### DataFrame.transpose method
[transpose()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transpose.html#pandas-dataframe-transpose) of the DataFrame class returns the same transposed data.

In [196]:
df.transpose()

Unnamed: 0,2020-01-01,2020-01-02,2020-01-03,2020-01-04,2020-01-05,2020-01-06
A,-0.190956,-0.751889,0.725349,-0.241295,0.876456,0.185534
B,-0.77678,0.704933,0.467768,-0.011085,0.959313,0.011554
C,1.462762,-0.036482,0.002753,0.587052,-2.127015,-0.133694
D,0.603106,-1.690382,0.827767,-0.535605,0.454854,-0.003258


### DataFrame.sort_index method

[sort_index()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_index.html#pandas-dataframe-sort-index) of the DataFrame class sorts the rows or the columns by their labels.

In [197]:
df.sort_index()

Unnamed: 0,A,B,C,D
2020-01-01,-0.190956,-0.77678,1.462762,0.603106
2020-01-02,-0.751889,0.704933,-0.036482,-1.690382
2020-01-03,0.725349,0.467768,0.002753,0.827767
2020-01-04,-0.241295,-0.011085,0.587052,-0.535605
2020-01-05,0.876456,0.959313,-2.127015,0.454854
2020-01-06,0.185534,0.011554,-0.133694,-0.003258


Setting the **axis argument** to 0 or "index" sorts by the row labels, and 1 or "columns" sorts by the column labels (default 0).  
Setting the **ascending argument** to False sorts in descending order (default True).

In [198]:
df.sort_index(axis="columns", ascending=False)

Unnamed: 0,D,C,B,A
2020-01-01,0.603106,1.462762,-0.77678,-0.190956
2020-01-02,-1.690382,-0.036482,0.704933,-0.751889
2020-01-03,0.827767,0.002753,0.467768,0.725349
2020-01-04,-0.535605,0.587052,-0.011085,-0.241295
2020-01-05,0.454854,-2.127015,0.959313,0.876456
2020-01-06,-0.003258,-0.133694,0.011554,0.185534


In [199]:
df.sort_index(axis=0, ascending=False)

Unnamed: 0,A,B,C,D
2020-01-06,0.185534,0.011554,-0.133694,-0.003258
2020-01-05,0.876456,0.959313,-2.127015,0.454854
2020-01-04,-0.241295,-0.011085,0.587052,-0.535605
2020-01-03,0.725349,0.467768,0.002753,0.827767
2020-01-02,-0.751889,0.704933,-0.036482,-1.690382
2020-01-01,-0.190956,-0.77678,1.462762,0.603106


### DataFrame.sort_values method
[sort_values()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html#pandas-dataframe-sort-values) of the DataFrame class sorts the rows by the values in a column, or the columns by the values in a row.


In [200]:
df.sort_values(by="B")

Unnamed: 0,A,B,C,D
2020-01-01,-0.190956,-0.77678,1.462762,0.603106
2020-01-04,-0.241295,-0.011085,0.587052,-0.535605
2020-01-06,0.185534,0.011554,-0.133694,-0.003258
2020-01-03,0.725349,0.467768,0.002753,0.827767
2020-01-02,-0.751889,0.704933,-0.036482,-1.690382
2020-01-05,0.876456,0.959313,-2.127015,0.454854


In [201]:
df.sort_values(by="2020-01-01", axis=1)

Unnamed: 0,B,A,D,C
2020-01-01,-0.77678,-0.190956,0.603106,1.462762
2020-01-02,0.704933,-0.751889,-1.690382,-0.036482
2020-01-03,0.467768,0.725349,0.827767,0.002753
2020-01-04,-0.011085,-0.241295,-0.535605,0.587052
2020-01-05,0.959313,0.876456,0.454854,-2.127015
2020-01-06,0.011554,0.185534,-0.003258,-0.133694
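
As a small supplementary sketch, `sort_values()` also accepts the `ascending` argument and a list of keys in `by`. The frame and its column names `group` and `score` are hypothetical and chosen only to make the result easy to check by hand.

```python
import pandas as pd

# Hypothetical data for illustration only.
frame = pd.DataFrame({"group": ["b", "a", "b", "a"], "score": [1, 4, 3, 2]})

# Largest score first.
by_score_desc = frame.sort_values(by="score", ascending=False)

# Sort by group, then by score within each group.
by_group_then_score = frame.sort_values(by=["group", "score"])

print(by_score_desc)
print(by_group_then_score)
```

The original row labels travel with the rows, so the index of the result shows where each row came from.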


## [3. Selection](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#selection)
### [3.1 Getting](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#getting)

**df["A"]** or **df.A** selects a single column, returned as a Series.

In [202]:
df["A"]

2020-01-01   -0.190956
2020-01-02   -0.751889
2020-01-03    0.725349
2020-01-04   -0.241295
2020-01-05    0.876456
2020-01-06    0.185534
Freq: D, Name: A, dtype: float64

In [203]:
df.A

2020-01-01   -0.190956
2020-01-02   -0.751889
2020-01-03    0.725349
2020-01-04   -0.241295
2020-01-05    0.876456
2020-01-06    0.185534
Freq: D, Name: A, dtype: float64
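
The two forms are not always interchangeable. The sketch below, using a hypothetical frame, shows cases where only the bracket form reaches the column: a name containing a space, and a name that collides with an existing DataFrame method (`count`).

```python
import pandas as pd

# Hypothetical frame for illustration only.
frame = pd.DataFrame({"A": [1, 2], "unit price": [10, 20], "count": [3, 4]})

# For a simple name, both forms return the same Series.
same = frame["A"].equals(frame.A)

# A name with a space cannot be written as an attribute, so brackets are needed.
by_bracket = frame["unit price"]

# frame.count is the DataFrame.count method, not the "count" column.
count_is_method = callable(frame.count)
count_column = frame["count"]

print(same, count_is_method)
```

For that reason the bracket form is the safer default when column names are not under your control.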

When a Python slice is passed to **[]**, it selects rows by position.

A range of index labels can also be selected.

In [204]:
print("First 3 rows")
df[0:3]

First 3 rows


Unnamed: 0,A,B,C,D
2020-01-01,-0.190956,-0.77678,1.462762,0.603106
2020-01-02,-0.751889,0.704933,-0.036482,-1.690382
2020-01-03,0.725349,0.467768,0.002753,0.827767


In [205]:
# Show January 2, 2020 through January 4, 2020
df['20200102':'20200104']

Unnamed: 0,A,B,C,D
2020-01-02,-0.751889,0.704933,-0.036482,-1.690382
2020-01-03,0.725349,0.467768,0.002753,0.827767
2020-01-04,-0.241295,-0.011085,0.587052,-0.535605
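
Note that the two slices above treat the end point differently: a positional slice excludes the end, while a label slice includes it. A minimal sketch with a hypothetical frame whose values are easy to track:

```python
import pandas as pd

# Hypothetical frame: the value column simply numbers the rows 0 to 5.
idx = pd.date_range("20200101", periods=6)
frame = pd.DataFrame({"value": list(range(6))}, index=idx)

by_position = frame[0:3]                 # positions 0, 1, 2 (end excluded)
by_label = frame["20200102":"20200104"]  # January 2, 3 and 4 (end included)

print(by_position)
print(by_label)
```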


### [3.2 Selection by label](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#selection-by-label)

Passing index labels (`dates` in this case) to [loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html#pandas-dataframe-loc) of the DataFrame class selects rows by label. A single label returns that row as a Series. Note that `loc` is used with square brackets, not parentheses.

In [206]:
df.loc[dates]

Unnamed: 0,A,B,C,D
2020-01-01,-0.190956,-0.77678,1.462762,0.603106
2020-01-02,-0.751889,0.704933,-0.036482,-1.690382
2020-01-03,0.725349,0.467768,0.002753,0.827767
2020-01-04,-0.241295,-0.011085,0.587052,-0.535605
2020-01-05,0.876456,0.959313,-2.127015,0.454854
2020-01-06,0.185534,0.011554,-0.133694,-0.003258


In [207]:
df.loc[dates[0]]

A   -0.190956
B   -0.776780
C    1.462762
D    0.603106
Name: 2020-01-01 00:00:00, dtype: float64

[loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html#pandas-dataframe-loc) can also select multiple columns.

In [208]:
df.loc[:, ["A", "B"]]

Unnamed: 0,A,B
2020-01-01,-0.190956,-0.77678
2020-01-02,-0.751889,0.704933
2020-01-03,0.725349,0.467768
2020-01-04,-0.241295,-0.011085
2020-01-05,0.876456,0.959313
2020-01-06,0.185534,0.011554


Combining [loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html#pandas-dataframe-loc) with slices selects multiple rows and multiple columns at once.

In [209]:
df.loc['20200102':'20200104', ['A', 'B']]

Unnamed: 0,A,B
2020-01-02,-0.751889,0.704933
2020-01-03,0.725349,0.467768
2020-01-04,-0.241295,-0.011085


Passing a single row label and a single column label to [loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html#pandas-dataframe-loc) returns a scalar value.

In [210]:
df.loc[dates[0], 'A']

-0.19095612966196193

[at](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.at.html) retrieves a scalar value faster.

In [211]:
df.at[dates[0], 'A']

-0.19095612966196193

### [3.3 Selection by position](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#selection-by-position)

[iloc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) of the DataFrame class selects data by integer position.


In [212]:
df

Unnamed: 0,A,B,C,D
2020-01-01,-0.190956,-0.77678,1.462762,0.603106
2020-01-02,-0.751889,0.704933,-0.036482,-1.690382
2020-01-03,0.725349,0.467768,0.002753,0.827767
2020-01-04,-0.241295,-0.011085,0.587052,-0.535605
2020-01-05,0.876456,0.959313,-2.127015,0.454854
2020-01-06,0.185534,0.011554,-0.133694,-0.003258


In [213]:
df.iloc[3] # Select the 4th row as a Series

A   -0.241295
B   -0.011085
C    0.587052
D   -0.535605
Name: 2020-01-04 00:00:00, dtype: float64

In [214]:
df.iloc[3:5, 0:2] # Select rows 4 to 5 and columns 1 to 2

Unnamed: 0,A,B
2020-01-04,-0.241295,-0.011085
2020-01-05,0.876456,0.959313


In [215]:
df.iloc[[1, 2, 4], [0, 2]] # Select rows 2, 3 and 5, and columns 1 and 3

Unnamed: 0,A,C
2020-01-02,-0.751889,-0.036482
2020-01-03,0.725349,0.002753
2020-01-05,0.876456,-2.127015


Passing a slice with both start and end omitted (just `:`) to [iloc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) of the DataFrame class selects all rows or all columns.

In [216]:
df.iloc[1:3, :] # Select rows 2 to 3, all columns


Unnamed: 0,A,B,C,D
2020-01-02,-0.751889,0.704933,-0.036482,-1.690382
2020-01-03,0.725349,0.467768,0.002753,0.827767


In [217]:
df.iloc[:, 1:3] # Select columns 2 to 3, all rows

Unnamed: 0,B,C
2020-01-01,-0.77678,1.462762
2020-01-02,0.704933,-0.036482
2020-01-03,0.467768,0.002753
2020-01-04,-0.011085,0.587052
2020-01-05,0.959313,-2.127015
2020-01-06,0.011554,-0.133694


Passing only integers to [iloc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) of the DataFrame class selects a scalar value.

In [218]:
df.iloc[1, 1]

0.7049329599637798

As with [at](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.at.html), [iat](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iat.html) retrieves a scalar value faster.

In [219]:
df.iat[1, 1]

0.7049329599637798

### [3.4 Boolean indexing](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#boolean-indexing)

In [220]:
df

Unnamed: 0,A,B,C,D
2020-01-01,-0.190956,-0.77678,1.462762,0.603106
2020-01-02,-0.751889,0.704933,-0.036482,-1.690382
2020-01-03,0.725349,0.467768,0.002753,0.827767
2020-01-04,-0.241295,-0.011085,0.587052,-0.535605
2020-01-05,0.876456,0.959313,-2.127015,0.454854
2020-01-06,0.185534,0.011554,-0.133694,-0.003258


To select the rows where the value in column A is greater than 0, write the following.

In [221]:
df[df["A"] > 0] 

Unnamed: 0,A,B,C,D
2020-01-03,0.725349,0.467768,0.002753,0.827767
2020-01-05,0.876456,0.959313,-2.127015,0.454854
2020-01-06,0.185534,0.011554,-0.133694,-0.003258


Applying a condition to the whole DataFrame keeps only the values that satisfy it. Values that do not satisfy the condition become NaN.

In [222]:
df[df > 0]

Unnamed: 0,A,B,C,D
2020-01-01,,,1.462762,0.603106
2020-01-02,,0.704933,,
2020-01-03,0.725349,0.467768,0.002753,0.827767
2020-01-04,,,0.587052,
2020-01-05,0.876456,0.959313,,0.454854
2020-01-06,0.185534,0.011554,,
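
Indexing with a boolean DataFrame behaves like `DataFrame.where()`: the shape is kept and non-matching cells become NaN. A minimal sketch on a hypothetical two-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration only.
frame = pd.DataFrame({"A": [-1.0, 2.0], "B": [3.0, -4.0]})

masked = frame[frame > 0]            # boolean indexing with a DataFrame
with_where = frame.where(frame > 0)  # the same result written explicitly

print(masked)
print(masked.equals(with_where))
```

This differs from `df[df["A"] > 0]`, where the condition is a Series and whole rows are dropped.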


[isin()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.isin.html#pandas-series-isin) filters rows by checking whether each value is contained in a given list.

In [223]:
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2

Unnamed: 0,A,B,C,D,E
2020-01-01,-0.190956,-0.77678,1.462762,0.603106,one
2020-01-02,-0.751889,0.704933,-0.036482,-1.690382,one
2020-01-03,0.725349,0.467768,0.002753,0.827767,two
2020-01-04,-0.241295,-0.011085,0.587052,-0.535605,three
2020-01-05,0.876456,0.959313,-2.127015,0.454854,four
2020-01-06,0.185534,0.011554,-0.133694,-0.003258,three


In [224]:
df2[df2['E'].isin(['two', 'four'])]

Unnamed: 0,A,B,C,D,E
2020-01-03,0.725349,0.467768,0.002753,0.827767,two
2020-01-05,0.876456,0.959313,-2.127015,0.454854,four
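
`isin()` returns a boolean Series, so it can be inverted with `~` to exclude values instead. The sketch below reuses the values of column E on a hypothetical single-column frame.

```python
import pandas as pd

# Hypothetical frame that reuses the values of column E above.
frame = pd.DataFrame({"E": ["one", "one", "two", "three", "four", "three"]})

mask = frame["E"].isin(["two", "four"])  # True where E is "two" or "four"
selected = frame[mask]                   # rows that match
excluded = frame[~mask]                  # rows that do not match

print(mask.tolist())
print(excluded)
```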


### [3.5 Setting](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#setting)

When you set a new column, the data is automatically aligned by index.

In [225]:
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20200102', periods=6)) # Create the data for the new column
s1

2020-01-02    1
2020-01-03    2
2020-01-04    3
2020-01-05    4
2020-01-06    5
2020-01-07    6
Freq: D, dtype: int64

In [226]:
df['F'] = s1 # Add s1 to the DataFrame as column F
df

Unnamed: 0,A,B,C,D,F
2020-01-01,-0.190956,-0.77678,1.462762,0.603106,
2020-01-02,-0.751889,0.704933,-0.036482,-1.690382,1.0
2020-01-03,0.725349,0.467768,0.002753,0.827767,2.0
2020-01-04,-0.241295,-0.011085,0.587052,-0.535605,3.0
2020-01-05,0.876456,0.959313,-2.127015,0.454854,4.0
2020-01-06,0.185534,0.011554,-0.133694,-0.003258,5.0
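
In the result above, 2020-01-01 has no value in `s1` and becomes NaN, while 2020-01-07 is not in `df` and is dropped. The same alignment can be sketched with short string labels; the frame and the Series `new_col` are hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration only.
frame = pd.DataFrame({"A": [10, 20, 30]}, index=["a", "b", "c"])
new_col = pd.Series([1, 2, 3], index=["b", "c", "d"])

# Values are matched by label, not by position.
frame["F"] = new_col

print(frame)
```

Because a NaN has to be stored, the integer values are converted to float64, which is why column F shows 1.0 rather than 1.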


Values can also be set by label.

In [227]:
print(dates)
df.at[dates[0], 'A'] = 0 # Set the value in row 1, column A to 0
df

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06'],
              dtype='datetime64[ns]', freq='D')


Unnamed: 0,A,B,C,D,F
2020-01-01,0.0,-0.77678,1.462762,0.603106,
2020-01-02,-0.751889,0.704933,-0.036482,-1.690382,1.0
2020-01-03,0.725349,0.467768,0.002753,0.827767,2.0
2020-01-04,-0.241295,-0.011085,0.587052,-0.535605,3.0
2020-01-05,0.876456,0.959313,-2.127015,0.454854,4.0
2020-01-06,0.185534,0.011554,-0.133694,-0.003258,5.0


Values can also be set by position.

In [228]:
df.iat[0, 1] = 0 # Set the value in row 1, column B to 0
df

Unnamed: 0,A,B,C,D,F
2020-01-01,0.0,0.0,1.462762,0.603106,
2020-01-02,-0.751889,0.704933,-0.036482,-1.690382,1.0
2020-01-03,0.725349,0.467768,0.002753,0.827767,2.0
2020-01-04,-0.241295,-0.011085,0.587052,-0.535605,3.0
2020-01-05,0.876456,0.959313,-2.127015,0.454854,4.0
2020-01-06,0.185534,0.011554,-0.133694,-0.003258,5.0


Values can also be set with a NumPy array.

In [229]:
df.loc[:, 'D'] = np.array([5] * len(df)) # Assign a NumPy array to column D
df

Unnamed: 0,A,B,C,D,F
2020-01-01,0.0,0.0,1.462762,5,
2020-01-02,-0.751889,0.704933,-0.036482,5,1.0
2020-01-03,0.725349,0.467768,0.002753,5,2.0
2020-01-04,-0.241295,-0.011085,0.587052,5,3.0
2020-01-05,0.876456,0.959313,-2.127015,5,4.0
2020-01-06,0.185534,0.011554,-0.133694,5,5.0


Values can also be set on the data selected by a condition.

In [230]:
df2 = df.copy()
df2[df2 > 0] = 9999 # Replace every value greater than 0 with 9999
df2

Unnamed: 0,A,B,C,D,F
2020-01-01,0.0,0.0,9999.0,9999,
2020-01-02,-0.751889,9999.0,-0.036482,9999,9999.0
2020-01-03,9999.0,9999.0,9999.0,9999,9999.0
2020-01-04,-0.241295,-0.011085,9999.0,9999,9999.0
2020-01-05,9999.0,9999.0,-2.127015,9999,9999.0
2020-01-06,9999.0,9999.0,-0.133694,9999,9999.0


## [4. Missing data](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#missing-data)

pandas primarily uses [np.nan](https://docs.scipy.org/doc/numpy/reference/constants.html#numpy.nan) to represent missing data.  
[reindex()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html) of the DataFrame class returns a new DataFrame whose rows and columns have been changed, added, or removed to match the given labels. Labels that did not exist before are filled with NaN.

In [231]:
df

Unnamed: 0,A,B,C,D,F
2020-01-01,0.0,0.0,1.462762,5,
2020-01-02,-0.751889,0.704933,-0.036482,5,1.0
2020-01-03,0.725349,0.467768,0.002753,5,2.0
2020-01-04,-0.241295,-0.011085,0.587052,5,3.0
2020-01-05,0.876456,0.959313,-2.127015,5,4.0
2020-01-06,0.185534,0.011554,-0.133694,5,5.0


In [232]:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])

In [233]:
df1.loc[dates[0]:dates[1], 'E'] = 1
df1

Unnamed: 0,A,B,C,D,F,E
2020-01-01,0.0,0.0,1.462762,5,,1.0
2020-01-02,-0.751889,0.704933,-0.036482,5,1.0,1.0
2020-01-03,0.725349,0.467768,0.002753,5,2.0,
2020-01-04,-0.241295,-0.011085,0.587052,5,3.0,


[dropna()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) of the DataFrame class drops the rows that contain missing data (NaN).

In [234]:
df1.dropna(how='any') # Drop every row that contains at least one NaN

Unnamed: 0,A,B,C,D,F,E
2020-01-02,-0.751889,0.704933,-0.036482,5,1.0,1.0
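
For comparison, `how='all'` drops a row only when every value in it is missing, and `fillna()` replaces missing values instead of dropping rows. The sketch below uses a hypothetical frame; the fill value 5 is arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 is complete, row 1 is partly missing, row 2 is all NaN.
frame = pd.DataFrame({"x": [1.0, np.nan, np.nan], "y": [1.0, 2.0, np.nan]})

drop_any = frame.dropna(how="any")  # keeps only rows without any NaN
drop_all = frame.dropna(how="all")  # drops only rows that are entirely NaN
filled = frame.fillna(value=5)      # replaces every NaN with 5

print(drop_any)
print(drop_all)
print(filled)
```

All three calls return a new DataFrame and leave `frame` unchanged.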


The pandas [isna()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isna.html) function returns a boolean mask that marks which values are missing.

In [235]:
pd.isna(df1)

Unnamed: 0,A,B,C,D,F,E
2020-01-01,False,False,False,False,True,False
2020-01-02,False,False,False,False,False,False
2020-01-03,False,False,False,False,False,True
2020-01-04,False,False,False,False,False,True
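
Because the mask consists of booleans, it can be summed to count missing values. This is a minimal sketch on a hypothetical frame; `missing_per_column` and `rows_with_missing` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration only.
frame = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, np.nan, 6.0]})

mask = frame.isna()

# True counts as 1, so the column-wise sum is the number of NaN values.
missing_per_column = mask.sum()

# True for each row that has at least one NaN.
rows_with_missing = mask.any(axis=1)

print(missing_per_column)
print(rows_with_missing)
```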


## [5. Operations](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#operations)

### [5.1 Stats](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#stats)

mean() of the DataFrame class computes the mean of each column or each row. Operations like this generally exclude missing data.

In [237]:
df

Unnamed: 0,A,B,C,D,F
2020-01-01,0.0,0.0,1.462762,5,
2020-01-02,-0.751889,0.704933,-0.036482,5,1.0
2020-01-03,0.725349,0.467768,0.002753,5,2.0
2020-01-04,-0.241295,-0.011085,0.587052,5,3.0
2020-01-05,0.876456,0.959313,-2.127015,5,4.0
2020-01-06,0.185534,0.011554,-0.133694,5,5.0


In [247]:
df.mean() # Mean of each column

A    0.132359
B    0.355414
C   -0.040771
D    5.000000
F    3.000000
dtype: float64

In [248]:
df.mean(axis=1) # Mean of each row

2020-01-01    1.615690
2020-01-02    1.183312
2020-01-03    1.639174
2020-01-04    1.666934
2020-01-05    1.741751
2020-01-06    2.012679
Freq: D, dtype: float64
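
In the row means above, 2020-01-01 is averaged over four values because the NaN in column F is skipped. The sketch below shows the same effect on a hypothetical frame, along with `skipna=False`, which keeps NaN in the result.

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration only.
frame = pd.DataFrame({"A": [1.0, 2.0, 3.0], "F": [np.nan, 4.0, 6.0]})

col_mean = frame.mean()            # F is averaged over 4.0 and 6.0 only
row_mean = frame.mean(axis=1)      # the first row is averaged over A only
strict = frame.mean(skipna=False)  # any NaN makes the result NaN

print(col_mean)
print(row_mean)
print(strict)
```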

In pandas, you can apply an operation between a one-dimensional Series and a two-dimensional DataFrame. pandas aligns the two on the axis you specify.

In [256]:
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
s

2020-01-01    NaN
2020-01-02    NaN
2020-01-03    1.0
2020-01-04    3.0
2020-01-05    5.0
2020-01-06    NaN
Freq: D, dtype: float64

In [257]:
df

Unnamed: 0,A,B,C,D,F
2020-01-01,0.0,0.0,1.462762,5,
2020-01-02,-0.751889,0.704933,-0.036482,5,1.0
2020-01-03,0.725349,0.467768,0.002753,5,2.0
2020-01-04,-0.241295,-0.011085,0.587052,5,3.0
2020-01-05,0.876456,0.959313,-2.127015,5,4.0
2020-01-06,0.185534,0.011554,-0.133694,5,5.0


Compute df - s:

In [260]:
df.sub(s, axis='index') # Subtract s from every column of df, matching on the row index

Unnamed: 0,A,B,C,D,F
2020-01-01,,,,,
2020-01-02,,,,,
2020-01-03,-0.274651,-0.532232,-0.997247,4.0,1.0
2020-01-04,-3.241295,-3.011085,-2.412948,2.0,0.0
2020-01-05,-4.123544,-4.040687,-7.127015,0.0,-1.0
2020-01-06,,,,,
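
The rows where `s` is NaN become NaN in every column. The same broadcasting can be checked by hand on a smaller case; the frame and the Series `offsets` below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
frame = pd.DataFrame(
    {"A": [10.0, 20.0, 30.0], "B": [1.0, 2.0, 3.0]},
    index=list("xyz"),
)
offsets = pd.Series([1.0, np.nan, 3.0], index=list("xyz"))

# Each row label's offset is subtracted from every column in that row.
result = frame.sub(offsets, axis="index")

print(result)
```

Without `axis="index"`, pandas would try to match the Series labels against the column names instead.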


## [6. Merge](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#merge)

## [7. Grouping](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#grouping)

## [8. Reshaping](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#reshaping)

## [9. Time series](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#time-series)

## [10. Categoricals](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#categoricals)

## [11. Plotting](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#plotting)

## [12. Getting data in/out](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#getting-data-in-out)

## [13. Gotchas](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#gotchas)