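# pandas in 10 Minutes
## Introduction
This article works through the official pandas tutorial "10 minutes to pandas", reproducing each example and adding commentary.

It is based on the following URL:
https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html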


## Environment
- Python 3.8
- Jupyter Lab


## First, the imports

In [29]:
import numpy as np
import pandas as pd

In [30]:
np

<module 'numpy' from '/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/numpy/__init__.py'>

In [31]:
pd

<module 'pandas' from '/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/__init__.py'>

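## [1. Creating objects](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#object-creation)

### The Series class
Passing a list to the [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series) class is an easy way to create a one-dimensional column of data. If no index is given, pandas assigns a default integer index starting at 0.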


In [32]:
# Create a single column of data
s = pd.Series(data=[1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
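
The output above shows `dtype: float64` even though most of the inputs were integers. The following is a minimal sketch of why: `np.nan` is a floating-point value, so the whole Series is stored as floats. The `labeled` Series and its labels `"a"`, `"b"`, `"c"` are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

# A list containing np.nan is stored as float64, because NaN is a float value.
s = pd.Series(data=[1, 3, 5, np.nan, 6, 8])
print(s.dtype)
print(s.isna().sum())  # number of missing values

# Hypothetical example: custom labels passed through the index argument.
labeled = pd.Series(data=[10, 20, 30], index=["a", "b", "c"])
print(labeled["b"])
```

Without the `np.nan` entry, the same list would be stored with an integer dtype.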

### The date_range function
[date_range()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html) creates a DatetimeIndex covering a given period.

In [33]:
# Six consecutive days starting from January 1, 2020
dates = pd.date_range("20200101", periods=6)
dates

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06'],
              dtype='datetime64[ns]', freq='D')
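
As a small sketch of how the arguments of `date_range()` interact: the same six days can be requested either by a number of periods or by an end date, and the `freq` argument changes the step. The variable names here are illustrative, not from the original notebook.

```python
import pandas as pd

# Six daily timestamps, specified by count...
by_periods = pd.date_range("2020-01-01", periods=6)
# ...or by an inclusive end date.
by_end = pd.date_range(start="2020-01-01", end="2020-01-06")
print(by_periods.equals(by_end))

# freq="2D" steps two days at a time instead of the default one day.
every_other_day = pd.date_range("2020-01-01", periods=3, freq="2D")
print(every_other_day)
```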

### The DataFrame class
Passing the **index argument** to the pandas [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas-dataframe) class sets the row index.

In [34]:
# Use the dates starting from January 1, 2020 as the row index
# Fill every cell with a random number
df = pd.DataFrame(np.random.randn(6, 4), index=dates)
df

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 2020-01-01 | 1.072849 | 0.753893 | 2.262836 | 0.782191 |
| 2020-01-02 | 0.116488 | 2.583145 | -0.119309 | -0.491453 |
| 2020-01-03 | 0.684796 | -1.820642 | 0.181518 | 0.118552 |
| 2020-01-04 | 0.055149 | -0.449824 | -1.187623 | -1.655626 |
| 2020-01-05 | -0.28339 | 0.81176 | 1.437833 | 0.938363 |
| 2020-01-06 | -0.49252 | -0.908324 | -1.539927 | -0.576967 |


Likewise, passing the **columns argument** to the DataFrame class sets the column names.

In [35]:
# Set the column names to A, B, C, D
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df

| | A | B | C | D |
|---|---|---|---|---|
| 2020-01-01 | -0.210154 | -2.356801 | 0.888426 | 0.904716 |
| 2020-01-02 | 0.931794 | -1.466138 | -0.181628 | -0.325241 |
| 2020-01-03 | 0.652703 | -0.636439 | 0.746821 | -0.58681 |
| 2020-01-04 | -1.302574 | 0.526551 | 0.051043 | -0.448008 |
| 2020-01-05 | -1.148398 | 0.397078 | 0.730502 | -0.283657 |
| 2020-01-06 | -1.492051 | -0.072874 | -1.45073 | 1.304447 |


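When a dictionary is passed to the DataFrame class, its keys become the column names. Scalar values such as `1.` or `"foo"` are repeated for every row.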

In [36]:
df2 = pd.DataFrame(
    {
        "A": 1.,
        "B": pd.Timestamp("20200101"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df2

| | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 2020-01-01 | 1.0 | 3 | test | foo |
| 1 | 1.0 | 2020-01-01 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2020-01-01 | 1.0 | 3 | test | foo |
| 3 | 1.0 | 2020-01-01 | 1.0 | 3 | train | foo |


### The DataFrame.dtypes property
The **dtypes property** shows the data type of each column.

In [37]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
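
Once the column types are known, they can be used to pick or convert columns. The following is a hedged sketch that rebuilds the same `df2` and uses `select_dtypes()` and `astype()`, two methods the original article does not cover; the variable names `numeric` and `as_float` are illustrative.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20200101"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)

# Keep only the numeric columns (float64, float32, int32 here).
numeric = df2.select_dtypes(include="number")
print(list(numeric.columns))  # ['A', 'C', 'D']

# astype returns a converted copy; df2 itself is not modified.
as_float = df2["D"].astype("float64")
print(as_float.dtype, df2["D"].dtype)
```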

## [2. Viewing data](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#viewing-data)


### The DataFrame.head method
The [head()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html#pandas.DataFrame.head) method of the DataFrame class displays the first rows of the data.

In [38]:
df.head(2)

| | A | B | C | D |
|---|---|---|---|---|
| 2020-01-01 | -0.210154 | -2.356801 | 0.888426 | 0.904716 |
| 2020-01-02 | 0.931794 | -1.466138 | -0.181628 | -0.325241 |


### The DataFrame.tail method
Likewise, the [tail()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html#pandas-dataframe-tail) method of the DataFrame class displays the last rows of the data.

In [39]:
df.tail(2)

| | A | B | C | D |
|---|---|---|---|---|
| 2020-01-05 | -1.148398 | 0.397078 | 0.730502 | -0.283657 |
| 2020-01-06 | -1.492051 | -0.072874 | -1.45073 | 1.304447 |
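
Both methods take the number of rows as an optional argument `n`. The sketch below, using a hypothetical ten-row `frame`, illustrates the default of five rows and the behavior of a negative `n`.

```python
import numpy as np
import pandas as pd

# Hypothetical ten-row frame with values 0..9.
frame = pd.DataFrame({"x": np.arange(10)})

print(len(frame.head()))            # 5 rows when n is omitted
print(frame.tail(3)["x"].tolist())  # [7, 8, 9]

# A negative n means "everything except the last |n| rows" for head().
print(len(frame.head(-2)))          # 8
```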


### The DataFrame.index property
Accessing the [index](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.index.html#pandas-dataframe-index) property of the DataFrame class shows the row index of the data.


In [40]:
df.index

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06'],
              dtype='datetime64[ns]', freq='D')

In [41]:
df2.index

Int64Index([0, 1, 2, 3], dtype='int64')

### The DataFrame.to_numpy method
The [to_numpy()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html#pandas-dataframe-to-numpy) method of the DataFrame class converts the data into a NumPy array, which is convenient for further processing with NumPy.


In [42]:
df.to_numpy()

array([[-0.21015439, -2.3568008 ,  0.88842576,  0.90471573],
       [ 0.93179427, -1.46613759, -0.1816276 , -0.32524143],
       [ 0.65270251, -0.63643867,  0.74682134, -0.58681003],
       [-1.30257365,  0.52655146,  0.05104261, -0.44800838],
       [-1.14839775,  0.39707812,  0.73050198, -0.28365727],
       [-1.49205076, -0.07287358, -1.45072958,  1.30444654]])

In [43]:
df2.to_numpy()

array([[1.0, Timestamp('2020-01-01 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2020-01-01 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2020-01-01 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2020-01-01 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)
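
The two outputs above differ in dtype: `df` gives a float array, while `df2` gives `dtype=object`. A minimal sketch of the reason, using two small hypothetical frames: a NumPy array has a single dtype, so pandas picks one that can hold every column.

```python
import pandas as pd

# Hypothetical frames: one with only floats, one mixing floats and strings.
numeric = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
mixed = pd.DataFrame({"a": [1.0, 2.0], "b": ["x", "y"]})

print(numeric.to_numpy().dtype)  # float64
print(mixed.to_numpy().dtype)    # object

# Integers and floats together are upcast to float64.
int_and_float = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})
print(int_and_float.to_numpy().dtype)  # float64
```

Converting a frame with mixed column types can therefore be comparatively expensive, since every value has to fit the common dtype.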

In [44]:
df

| | A | B | C | D |
|---|---|---|---|---|
| 2020-01-01 | -0.210154 | -2.356801 | 0.888426 | 0.904716 |
| 2020-01-02 | 0.931794 | -1.466138 | -0.181628 | -0.325241 |
| 2020-01-03 | 0.652703 | -0.636439 | 0.746821 | -0.58681 |
| 2020-01-04 | -1.302574 | 0.526551 | 0.051043 | -0.448008 |
| 2020-01-05 | -1.148398 | 0.397078 | 0.730502 | -0.283657 |
| 2020-01-06 | -1.492051 | -0.072874 | -1.45073 | 1.304447 |


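### The DataFrame.describe method
The [describe()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html#pandas-dataframe-describe) method of the DataFrame class computes summary statistics (count, mean, standard deviation, minimum, quartiles, maximum) for each column.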


In [45]:
df.describe()

| | A | B | C | D |
|---|---|---|---|---|
| count | 6.0 | 6.0 | 6.0 | 6.0 |
| mean | -0.428113 | -0.601437 | 0.130739 | 0.094241 |
| std | 1.046945 | 1.129514 | 0.885249 | 0.799763 |
| min | -1.492051 | -2.356801 | -1.45073 | -0.58681 |
| 25% | -1.26403 | -1.258713 | -0.12346 | -0.417317 |
| 50% | -0.679276 | -0.354656 | 0.390772 | -0.304449 |
| 75% | 0.436988 | 0.27959 | 0.742742 | 0.607622 |
| max | 0.931794 | 0.526551 | 0.888426 | 1.304447 |
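
The result of `describe()` is itself a DataFrame, so individual statistics can be read back with `.loc`. The following sketch uses a small hypothetical `frame` with known values so the numbers are easy to check by hand.

```python
import pandas as pd

# Hypothetical data: mean and median of 1, 2, 3, 4 are both 2.5.
frame = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})
summary = frame.describe()

print(summary.loc["count", "x"])  # 4.0
print(summary.loc["mean", "x"])   # 2.5
print(summary.loc["50%", "x"])    # 2.5 (the median)

# The "std" row matches Series.std(), the sample standard deviation.
print(summary.loc["std", "x"], frame["x"].std())
```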


### The DataFrame.T property
Accessing the [T](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.T.html#pandas-dataframe-t) property of the DataFrame class gives the data with its rows and columns swapped.

In [46]:
df.T

| | 2020-01-01 | 2020-01-02 | 2020-01-03 | 2020-01-04 | 2020-01-05 | 2020-01-06 |
|---|---|---|---|---|---|---|
| A | -0.210154 | 0.931794 | 0.652703 | -1.302574 | -1.148398 | -1.492051 |
| B | -2.356801 | -1.466138 | -0.636439 | 0.526551 | 0.397078 | -0.072874 |
| C | 0.888426 | -0.181628 | 0.746821 | 0.051043 | 0.730502 | -1.45073 |
| D | 0.904716 | -0.325241 | -0.58681 | -0.448008 | -0.283657 | 1.304447 |


### The DataFrame.transpose method
The [transpose()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transpose.html#pandas-dataframe-transpose) method of the DataFrame class returns the same transposed data.

In [47]:
df.transpose()

| | 2020-01-01 | 2020-01-02 | 2020-01-03 | 2020-01-04 | 2020-01-05 | 2020-01-06 |
|---|---|---|---|---|---|---|
| A | -0.210154 | 0.931794 | 0.652703 | -1.302574 | -1.148398 | -1.492051 |
| B | -2.356801 | -1.466138 | -0.636439 | 0.526551 | 0.397078 | -0.072874 |
| C | 0.888426 | -0.181628 | 0.746821 | 0.051043 | 0.730502 | -1.45073 |
| D | 0.904716 | -0.325241 | -0.58681 | -0.448008 | -0.283657 | 1.304447 |
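
To make the effect of transposing concrete, here is a small sketch on a hypothetical 2x3 `frame`: the shape flips, and a value found at (row, column) in the original is found at (column, row) in the transpose.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 frame: row r0 holds 0, 1, 2 and row r1 holds 3, 4, 5.
frame = pd.DataFrame(
    np.arange(6).reshape(2, 3), index=["r0", "r1"], columns=list("ABC")
)
t = frame.T

print(frame.shape, t.shape)                    # (2, 3) (3, 2)
print(frame.loc["r1", "B"], t.loc["B", "r1"])  # 4 4
print(t.equals(frame.transpose()))             # True
```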


### The DataFrame.sort_index method

The [sort_index()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_index.html#pandas-dataframe-sort-index) method of the DataFrame class sorts the data by its row labels or by its column labels.

In [48]:
df.sort_index()

| | A | B | C | D |
|---|---|---|---|---|
| 2020-01-01 | -0.210154 | -2.356801 | 0.888426 | 0.904716 |
| 2020-01-02 | 0.931794 | -1.466138 | -0.181628 | -0.325241 |
| 2020-01-03 | 0.652703 | -0.636439 | 0.746821 | -0.58681 |
| 2020-01-04 | -1.302574 | 0.526551 | 0.051043 | -0.448008 |
| 2020-01-05 | -1.148398 | 0.397078 | 0.730502 | -0.283657 |
| 2020-01-06 | -1.492051 | -0.072874 | -1.45073 | 1.304447 |


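Setting the **axis argument** to 0 or "index" sorts by the row labels, and setting it to 1 or "columns" sorts by the column labels (default: 0). Setting the **ascending argument** to False sorts in descending order (default: True).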

In [50]:
df.sort_index(axis="columns", ascending=False)

| | D | C | B | A |
|---|---|---|---|---|
| 2020-01-01 | 0.904716 | 0.888426 | -2.356801 | -0.210154 |
| 2020-01-02 | -0.325241 | -0.181628 | -1.466138 | 0.931794 |
| 2020-01-03 | -0.58681 | 0.746821 | -0.636439 | 0.652703 |
| 2020-01-04 | -0.448008 | 0.051043 | 0.526551 | -1.302574 |
| 2020-01-05 | -0.283657 | 0.730502 | 0.397078 | -1.148398 |
| 2020-01-06 | 1.304447 | -1.45073 | -0.072874 | -1.492051 |


In [51]:
df.sort_index(axis=0, ascending=False)

| | A | B | C | D |
|---|---|---|---|---|
| 2020-01-06 | -1.492051 | -0.072874 | -1.45073 | 1.304447 |
| 2020-01-05 | -1.148398 | 0.397078 | 0.730502 | -0.283657 |
| 2020-01-04 | -1.302574 | 0.526551 | 0.051043 | -0.448008 |
| 2020-01-03 | 0.652703 | -0.636439 | 0.746821 | -0.58681 |
| 2020-01-02 | 0.931794 | -1.466138 | -0.181628 | -0.325241 |
| 2020-01-01 | -0.210154 | -2.356801 | 0.888426 | 0.904716 |
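
Because `df` is already ordered by date, the default `sort_index()` call leaves it looking unchanged. The sketch below uses a hypothetical `frame` whose row and column labels start out unordered, so the effect of each axis is visible.

```python
import pandas as pd

# Hypothetical frame with unordered row labels (2, 0, 1) and column labels (b, a).
frame = pd.DataFrame({"b": [1, 2, 3], "a": [4, 5, 6]}, index=[2, 0, 1])

by_row = frame.sort_index()                # sorts the row labels
by_col = frame.sort_index(axis="columns")  # sorts the column labels

print(by_row.index.tolist())    # [0, 1, 2]
print(by_col.columns.tolist())  # ['a', 'b']

# sort_index returns a new object; the original order is untouched.
print(frame.index.tolist())     # [2, 0, 1]
```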


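### The DataFrame.sort_values method
The [sort_values()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html#pandas-dataframe-sort-values) method of the DataFrame class sorts the data by its values: either the rows by the values in a given column, or (with axis=1) the columns by the values in a given row.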


In [52]:
df.sort_values(by="B")

| | A | B | C | D |
|---|---|---|---|---|
| 2020-01-01 | -0.210154 | -2.356801 | 0.888426 | 0.904716 |
| 2020-01-02 | 0.931794 | -1.466138 | -0.181628 | -0.325241 |
| 2020-01-03 | 0.652703 | -0.636439 | 0.746821 | -0.58681 |
| 2020-01-06 | -1.492051 | -0.072874 | -1.45073 | 1.304447 |
| 2020-01-05 | -1.148398 | 0.397078 | 0.730502 | -0.283657 |
| 2020-01-04 | -1.302574 | 0.526551 | 0.051043 | -0.448008 |


In [53]:
df.sort_values(by="2020-01-01", axis=1)

| | B | A | C | D |
|---|---|---|---|---|
| 2020-01-01 | -2.356801 | -0.210154 | 0.888426 | 0.904716 |
| 2020-01-02 | -1.466138 | 0.931794 | -0.181628 | -0.325241 |
| 2020-01-03 | -0.636439 | 0.652703 | 0.746821 | -0.58681 |
| 2020-01-04 | 0.526551 | -1.302574 | 0.051043 | -0.448008 |
| 2020-01-05 | 0.397078 | -1.148398 | 0.730502 | -0.283657 |
| 2020-01-06 | -0.072874 | -1.492051 | -1.45073 | 1.304447 |
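
`sort_values()` also accepts `ascending=False` and a list of columns for `by`. The following is a minimal sketch on a hypothetical `frame` with small integer scores, so the resulting order can be verified by eye.

```python
import pandas as pd

# Hypothetical data: two groups with one score per row.
frame = pd.DataFrame(
    {"group": ["x", "y", "x", "y"], "score": [3, 1, 2, 4]}
)

# Descending order by a single column.
desc = frame.sort_values(by="score", ascending=False)
print(desc["score"].tolist())  # [4, 3, 2, 1]

# Sort by group first, then by score within each group.
multi = frame.sort_values(by=["group", "score"])
print(multi.index.tolist())    # [2, 0, 1, 3]
```

The original row labels travel with the rows, which is why `multi.index` shows where each row came from.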


In [54]:
df["A"]

2020-01-01   -0.210154
2020-01-02    0.931794
2020-01-03    0.652703
2020-01-04   -1.302574
2020-01-05   -1.148398
2020-01-06   -1.492051
Freq: D, Name: A, dtype: float64

In [55]:
df.A

2020-01-01   -0.210154
2020-01-02    0.931794
2020-01-03    0.652703
2020-01-04   -1.302574
2020-01-05   -1.148398
2020-01-06   -1.492051
Freq: D, Name: A, dtype: float64
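
`df["A"]` and `df.A` return the same Series here, but the attribute form does not work for every column name. The sketch below uses a hypothetical `frame` to show two cases where bracket access is needed: a name that collides with a DataFrame method, and a name that is not a valid Python identifier.

```python
import pandas as pd

# Hypothetical frame: "count" collides with DataFrame.count, "my col" has a space.
frame = pd.DataFrame({"A": [1, 2], "count": [3, 4], "my col": [5, 6]})

print(frame["A"].equals(frame.A))  # True: both forms give the same Series

# frame.count is the DataFrame.count method, not the column.
print(callable(frame.count))       # True
print(frame["count"].tolist())     # [3, 4]

# A name with a space can only be reached with brackets.
print(frame["my col"].tolist())    # [5, 6]
```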

In [56]:
# Show the first 3 rows
df[0:3]

| | A | B | C | D |
|---|---|---|---|---|
| 2020-01-01 | -0.210154 | -2.356801 | 0.888426 | 0.904716 |
| 2020-01-02 | 0.931794 | -1.466138 | -0.181628 | -0.325241 |
| 2020-01-03 | 0.652703 | -0.636439 | 0.746821 | -0.58681 |


In [57]:
# Show the rows from January 2, 2020 through January 4, 2020
df['20200102':'20200104']

| | A | B | C | D |
|---|---|---|---|---|
| 2020-01-02 | 0.931794 | -1.466138 | -0.181628 | -0.325241 |
| 2020-01-03 | 0.652703 | -0.636439 | 0.746821 | -0.58681 |
| 2020-01-04 | -1.302574 | 0.526551 | 0.051043 | -0.448008 |
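
The two slices above follow different endpoint rules, which is easy to miss. A small sketch with a hypothetical `frame` whose values are simply 0 to 5: an integer slice is positional and excludes the stop position, while a label slice includes both endpoints.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20200101", periods=6)
# Hypothetical frame: one column "v" holding 0..5, indexed by date.
frame = pd.DataFrame({"v": np.arange(6)}, index=dates)

# Integer slice: positions 0, 1, 2 (position 3 is excluded).
positional = frame[0:3]
print(positional["v"].tolist())  # [0, 1, 2]

# Label slice: January 2 through January 4, both included.
by_label = frame["20200102":"20200104"]
print(by_label["v"].tolist())    # [1, 2, 3]
```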


In [58]:
df.loc[dates]

| | A | B | C | D |
|---|---|---|---|---|
| 2020-01-01 | -0.210154 | -2.356801 | 0.888426 | 0.904716 |
| 2020-01-02 | 0.931794 | -1.466138 | -0.181628 | -0.325241 |
| 2020-01-03 | 0.652703 | -0.636439 | 0.746821 | -0.58681 |
| 2020-01-04 | -1.302574 | 0.526551 | 0.051043 | -0.448008 |
| 2020-01-05 | -1.148398 | 0.397078 | 0.730502 | -0.283657 |
| 2020-01-06 | -1.492051 | -0.072874 | -1.45073 | 1.304447 |


In [31]:
df.loc[dates[0]]

A   -0.866703
B   -0.524674
C   -0.780228
D   -0.211863
Name: 2020-01-01 00:00:00, dtype: float64

In [32]:
df.loc[:, ["A", "B"]]

| | A | B |
|---|---|---|
| 2020-01-01 | -0.866703 | -0.524674 |
| 2020-01-02 | 0.384998 | 1.149817 |
| 2020-01-03 | 1.561252 | -0.659028 |
| 2020-01-04 | -0.915777 | 0.896869 |
| 2020-01-05 | -0.343103 | 0.984573 |
| 2020-01-06 | 0.56171 | -0.570133 |


In [33]:
df.loc['20200102':'20200104', ['A', 'B']]

| | A | B |
|---|---|---|
| 2020-01-02 | 0.384998 | 1.149817 |
| 2020-01-03 | 1.561252 | -0.659028 |
| 2020-01-04 | -0.915777 | 0.896869 |


In [34]:
df.loc[dates[0], 'A']

-0.8667026620111296

In [35]:
df.at[dates[0], 'A']

-0.8667026620111296
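
The `loc` and `at` cells above all select by label. As a hedged summary sketch on a hypothetical 3x2 `frame` with known values: a single row label gives a Series, a label slice with a column list gives a DataFrame, and `at` reads exactly one scalar.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20200101", periods=3)
# Hypothetical frame: rows hold (0, 1), (2, 3), (4, 5).
frame = pd.DataFrame(
    np.arange(6).reshape(3, 2), index=dates, columns=["A", "B"]
)

row = frame.loc[dates[0]]                      # one row label -> Series
sub = frame.loc["20200102":"20200103", ["B"]]  # slice + list -> DataFrame
val = frame.at[dates[1], "B"]                  # one row, one column -> scalar

print(row.tolist())  # [0, 1]
print(sub.shape)     # (2, 1)
print(val)           # 3
```

`at` accepts only a single row label and a single column label, whereas `loc` also accepts slices and lists.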

In [36]:
df

| | A | B | C | D |
|---|---|---|---|---|
| 2020-01-01 | -0.866703 | -0.524674 | -0.780228 | -0.211863 |
| 2020-01-02 | 0.384998 | 1.149817 | -1.021689 | 0.008706 |
| 2020-01-03 | 1.561252 | -0.659028 | 2.596894 | 0.90702 |
| 2020-01-04 | -0.915777 | 0.896869 | 0.079691 | 0.393829 |
| 2020-01-05 | -0.343103 | 0.984573 | 0.196222 | 2.237728 |
| 2020-01-06 | 0.56171 | -0.570133 | 2.138998 | 0.703548 |


In [37]:
df.iloc[3] # Select the 4th row as a Series

A   -0.915777
B    0.896869
C    0.079691
D    0.393829
Name: 2020-01-04 00:00:00, dtype: float64

In [38]:
df.iloc[3:5, 0:2] # Select the 4th to 5th rows and the 1st to 2nd columns

| | A | B |
|---|---|---|
| 2020-01-04 | -0.915777 | 0.896869 |
| 2020-01-05 | -0.343103 | 0.984573 |


In [39]:
df.iloc[[1, 2, 4], [0, 2]] # Select the 2nd, 3rd, and 5th rows and the 1st and 3rd columns

| | A | C |
|---|---|---|
| 2020-01-02 | 0.384998 | -1.021689 |
| 2020-01-03 | 1.561252 | 2.596894 |
| 2020-01-05 | -0.343103 | 0.196222 |


In [40]:
df.iloc[1:3, :] # Select the 2nd to 3rd rows, all columns


| | A | B | C | D |
|---|---|---|---|---|
| 2020-01-02 | 0.384998 | 1.149817 | -1.021689 | 0.008706 |
| 2020-01-03 | 1.561252 | -0.659028 | 2.596894 | 0.90702 |


In [41]:
df.iloc[:, 1:3] # Select the 2nd to 3rd columns, all rows

| | B | C |
|---|---|---|
| 2020-01-01 | -0.524674 | -0.780228 |
| 2020-01-02 | 1.149817 | -1.021689 |
| 2020-01-03 | -0.659028 | 2.596894 |
| 2020-01-04 | 0.896869 | 0.079691 |
| 2020-01-05 | 0.984573 | 0.196222 |
| 2020-01-06 | -0.570133 | 2.138998 |


In [42]:
df.iloc[1, 1]

1.1498171779772

In [43]:
df.iat[1, 1]

1.1498171779772
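
The `iloc` and `iat` cells above select by integer position, following the same rules as Python slicing. The sketch below repeats the main patterns on a hypothetical 4x3 `frame` holding 0 to 11, so each result can be checked against the row and column positions.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: rows hold (0,1,2), (3,4,5), (6,7,8), (9,10,11).
frame = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("ABC"))

# Rows at positions 1 and 2, columns at positions 0 and 1 (stops excluded).
block = frame.iloc[1:3, 0:2]
print(block.to_numpy().tolist())   # [[3, 4], [6, 7]]

# Negative positions count from the end, as with Python lists.
print(frame.iloc[-1].tolist())     # [9, 10, 11]

# iat reads a single scalar by position.
print(frame.iat[2, 1])             # 7
```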

In [44]:
df

| | A | B | C | D |
|---|---|---|---|---|
| 2020-01-01 | -0.866703 | -0.524674 | -0.780228 | -0.211863 |
| 2020-01-02 | 0.384998 | 1.149817 | -1.021689 | 0.008706 |
| 2020-01-03 | 1.561252 | -0.659028 | 2.596894 | 0.90702 |
| 2020-01-04 | -0.915777 | 0.896869 | 0.079691 | 0.393829 |
| 2020-01-05 | -0.343103 | 0.984573 | 0.196222 | 2.237728 |
| 2020-01-06 | 0.56171 | -0.570133 | 2.138998 | 0.703548 |


In [45]:
df[df["A"] > 0] # Select the rows where the value in column A is greater than 0

| | A | B | C | D |
|---|---|---|---|---|
| 2020-01-02 | 0.384998 | 1.149817 | -1.021689 | 0.008706 |
| 2020-01-03 | 1.561252 | -0.659028 | 2.596894 | 0.90702 |
| 2020-01-06 | 0.56171 | -0.570133 | 2.138998 | 0.703548 |
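
The filter above works because `df["A"] > 0` produces a boolean Series, and indexing with it keeps the rows marked True. The following sketch extends the idea on a hypothetical `frame`: conditions are combined with `&` (each wrapped in parentheses), and `isin()` tests membership in a list.

```python
import pandas as pd

# Hypothetical data with one numeric and one string column.
frame = pd.DataFrame({"A": [-1, 2, 3, -4], "B": ["x", "y", "x", "y"]})

mask = frame["A"] > 0
print(mask.tolist())            # [False, True, True, False]

positive = frame[mask]
print(positive.index.tolist())  # [1, 2]

# Combine conditions with & and parentheses, not the Python keyword "and".
both = frame[(frame["A"] > 0) & (frame["B"] == "x")]
print(both.index.tolist())      # [2]

# isin keeps the rows whose value appears in the given list.
members = frame[frame["B"].isin(["y"])]
print(members.index.tolist())   # [1, 3]
```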
