<a id=0></a>
# 3. Working with DataFrames

---
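### [1. Creating a DataFrame from a CSV file](#1)
### [2. Extracting data from a DataFrame](#2)
### [3. Updating (overwriting) element values](#3)
### [4. Summary statistics, unique values, max/min, and more](#4)
### [5. Grouping and sorting records](#5)
### [6. Handling duplicates and missing values](#6)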
---

In [1]:
import numpy as np
import pandas as pd

In [2]:
# # When running on Google Colaboratory
# from google.colab import drive
# drive.mount('/drive')

---
<a id=1></a>
[Back to top](#0)

---
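## 1. Creating a DataFrame from a CSV file

* Create a DataFrame from a CSV file and specify the index
* Create a DataFrame from a pkl file and compare it with the CSV case
* Convert dates that were read as object dtype to datetime dtype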
---

Create a DataFrame from a CSV file and specify the index

In [13]:
df = pd.read_csv("./data/data.csv")
df.head(2)

Unnamed: 0.1,Unnamed: 0,Date,Price,Quantity,Width,Height,Quality,Difference,Colors,Shape
0,0,1997-07-05,7370,16,2.420553,6.721355,52.848386,-0.302324,blue,circle
1,1,1997-07-06,960,82,7.616196,2.376375,,-0.004563,blue,circle


In [14]:
df = pd.read_csv("./data/data.csv", index_col=0)
df.head(2)

Unnamed: 0,Date,Price,Quantity,Width,Height,Quality,Difference,Colors,Shape
0,1997-07-05,7370,16,2.420553,6.721355,52.848386,-0.302324,blue,circle
1,1997-07-06,960,82,7.616196,2.376375,,-0.004563,blue,circle


In [15]:
df = pd.read_csv("./data/data_without_index.csv")
df.head(2)


Unnamed: 0,Date,Price,Quantity,Width,Height,Quality,Difference,Colors,Shape
0,1997-07-05,7370,16,2.420553,6.721355,52.848386,-0.302324,blue,circle
1,1997-07-06,960,82,7.616196,2.376375,,-0.004563,blue,circle
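
The extra unnamed column seen above typically appears when a CSV was written together with its index and then read back without `index_col`. The following is a minimal sketch of that round trip; it uses a small made-up frame and an in-memory buffer (both assumptions for illustration) rather than the `./data/data.csv` file itself.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for ./data/data.csv
sample = pd.DataFrame({"Price": [7370, 960], "Quantity": [16, 82]})

# to_csv writes the index as an unnamed first column by default
csv_text = sample.to_csv()

# Without index_col, that column comes back as "Unnamed: 0"
plain = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column is used as the index again
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Writing with index=False avoids the extra column in the first place
no_index = pd.read_csv(io.StringIO(sample.to_csv(index=False)))

print(plain.columns.tolist())
print(indexed.columns.tolist())
print(no_index.columns.tolist())
```

Either fix works: pass `index_col=0` when reading, or write the file with `index=False` when the index carries no information.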


Create a DataFrame from a pkl file and compare it with the CSV case

In [16]:
df_p = pd.read_pickle("./data/data.pkl")
df_p.head(2)

Unnamed: 0,Date,Price,Quantity,Width,Height,Quality,Difference,Colors,Shape
0,1997-07-05,7370,16,2.420553,6.721355,52.848386,-0.302324,blue,circle
1,1997-07-06,960,82,7.616196,2.376375,,-0.004563,blue,circle


In [17]:
df_p.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Date        97 non-null     datetime64[ns]
 1   Price       100 non-null    int32         
 2   Quantity    100 non-null    int32         
 3   Width       95 non-null     float64       
 4   Height      96 non-null     float64       
 5   Quality     94 non-null     float64       
 6   Difference  93 non-null     float64       
 7   Colors      99 non-null     object        
 8   Shape       96 non-null     object        
dtypes: datetime64[ns](1), float64(4), int32(2), object(2)
memory usage: 6.4+ KB


In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Date        97 non-null     object 
 1   Price       100 non-null    int64  
 2   Quantity    100 non-null    int64  
 3   Width       95 non-null     float64
 4   Height      96 non-null     float64
 5   Quality     94 non-null     float64
 6   Difference  93 non-null     float64
 7   Colors      99 non-null     object 
 8   Shape       96 non-null     object 
dtypes: float64(4), int64(2), object(3)
memory usage: 7.2+ KB
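
The two `info()` outputs differ because a pickle stores the DataFrame object together with its dtypes, whereas a CSV is plain text whose dtypes are inferred again on every read. A small sketch of that difference, using a hypothetical two-row frame and a temporary directory (neither is part of the original data):

```python
import io
import os
import tempfile
import pandas as pd

# Hypothetical sample with a datetime column and a 32-bit integer column
sample = pd.DataFrame({
    "Date": pd.to_datetime(["1997-07-05", "1997-07-06"]),
    "Price": pd.Series([7370, 960], dtype="int32"),
})

# Pickle round trip: dtypes are stored with the object
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.pkl")
    sample.to_pickle(path)
    from_pkl = pd.read_pickle(path)

# CSV round trip: everything is text, so dtypes are inferred on read
from_csv = pd.read_csv(io.StringIO(sample.to_csv(index=False)))

print(from_pkl.dtypes)
print(from_csv.dtypes)
```

Note that pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.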


Convert dates that were read as object dtype to datetime dtype

In [19]:
pd.to_datetime(df["Date"])

0    1997-07-05
1    1997-07-06
2    1997-07-07
3    1997-07-08
4           NaT
        ...    
95   1997-10-08
96   1997-10-09
97   1997-10-10
98   1997-10-11
99   1997-10-12
Name: Date, Length: 100, dtype: datetime64[ns]

In [20]:
df["Date"] = pd.to_datetime(df["Date"])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Date        97 non-null     datetime64[ns]
 1   Price       100 non-null    int64         
 2   Quantity    100 non-null    int64         
 3   Width       95 non-null     float64       
 4   Height      96 non-null     float64       
 5   Quality     94 non-null     float64       
 6   Difference  93 non-null     float64       
 7   Colors      99 non-null     object        
 8   Shape       96 non-null     object        
dtypes: datetime64[ns](1), float64(4), int64(2), object(2)
memory usage: 7.2+ KB


In [21]:
df = pd.read_csv("./data/data.csv", index_col=0, parse_dates=["Date"])
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 99
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Date        97 non-null     datetime64[ns]
 1   Price       100 non-null    int64         
 2   Quantity    100 non-null    int64         
 3   Width       95 non-null     float64       
 4   Height      96 non-null     float64       
 5   Quality     94 non-null     float64       
 6   Difference  93 non-null     float64       
 7   Colors      99 non-null     object        
 8   Shape       96 non-null     object        
dtypes: datetime64[ns](1), float64(4), int64(2), object(2)
memory usage: 7.8+ KB
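
The two approaches above (converting after reading, or passing `parse_dates` while reading) can be compared side by side. This is a sketch on a hypothetical three-line CSV held in memory, with one date deliberately left empty:

```python
import io
import pandas as pd

# Hypothetical CSV text with one missing date
csv_text = "Date,Price\n1997-07-05,7370\n,960\n1997-07-07,5490\n"

# Option 1: read first, convert afterwards
raw = pd.read_csv(io.StringIO(csv_text))
converted = pd.to_datetime(raw["Date"])

# Option 2: convert while reading
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# Missing dates become NaT in both cases
print(converted.isna().sum(), parsed["Date"].isna().sum())

# Once the column is datetime, the .dt accessor is available
print(parsed["Date"].dt.month.tolist())
```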


---
<a id=2></a>
[Back to top](#0)

---
## 2. Extracting data from a DataFrame

* Select by column
* Select records and columns with loc and iloc
* Select with a condition
* Notes on combining multiple conditions
* Using filter
* Using query

---

Select by column

In [22]:
df.head(2)

Unnamed: 0,Date,Price,Quantity,Width,Height,Quality,Difference,Colors,Shape
0,1997-07-05,7370,16,2.420553,6.721355,52.848386,-0.302324,blue,circle
1,1997-07-06,960,82,7.616196,2.376375,,-0.004563,blue,circle
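
The cell above only previews the frame, so here is a short sketch of column selection itself. The sample frame is hypothetical; the point is that a single label returns a Series while a list of labels returns a DataFrame.

```python
import pandas as pd

# Hypothetical sample data
sample = pd.DataFrame({
    "Price": [7370, 960, 5490],
    "Quantity": [16, 82, 81],
    "Colors": ["blue", "blue", "red"],
})

one_column = sample["Price"]                  # single label -> Series
one_column_frame = sample[["Price"]]          # list with one label -> DataFrame
two_columns = sample[["Price", "Quantity"]]   # list of labels -> DataFrame

print(type(one_column).__name__, type(one_column_frame).__name__)
print(two_columns.columns.tolist())
```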


Select records and columns with loc and iloc  
Note: loc : [index_label, column_label], iloc : [row_index, column_index]

In [33]:
# Both endpoints (row 3 and the Quantity column) are included
df.loc[0:3, "Date":"Quantity"]

Unnamed: 0,Date,Price,Quantity
0,1997-07-05,7370,16
1,1997-07-06,960,82
2,1997-07-07,5490,81
3,1997-07-08,5291,21


In [37]:
df.loc[[1, 3, 5], ["Date", "Quantity", "Width"]]

Unnamed: 0,Date,Quantity,Width
1,1997-07-06,82,7.616196
3,1997-07-08,21,6.323058
5,1997-07-10,42,8.353025


In [38]:
df.iloc[0:3, 0:3]

Unnamed: 0,Date,Price,Quantity
0,1997-07-05,7370,16
1,1997-07-06,960,82
2,1997-07-07,5490,81


In [40]:
df.iloc[[i for i in range(0, 20, 5)], [4, 5]]

Unnamed: 0,Height,Quality
0,6.721355,52.848386
5,3.207801,47.763399
10,6.909377,47.722053
15,8.172222,42.01556


In [42]:
df.iloc[0:3, df.columns.get_loc("Price")]

0    7370
1     960
2    5490
Name: Price, dtype: int64
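
Because the frame above has index labels 0, 1, 2, ... that coincide with row positions, the difference between `loc` and `iloc` is easy to miss. A sketch with a hypothetical frame whose labels start at 100 makes it visible:

```python
import pandas as pd

# Hypothetical sample whose index labels differ from row positions
sample = pd.DataFrame(
    {"Price": [10, 20, 30, 40], "Quantity": [1, 2, 3, 4]},
    index=[100, 101, 102, 103],
)

by_label = sample.loc[100:102, "Price"]    # labels 100, 101, 102 (end included)
by_position = sample.iloc[0:2, 0]          # positions 0, 1 (end excluded)

print(by_label.tolist())
print(by_position.tolist())
```

`loc` slices by label and includes the end label, while `iloc` slices by position and follows the usual Python half-open convention.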

Select with a condition

In [44]:
df["Height"] >= 9.5

0     False
1     False
2     False
3     False
4     False
      ...  
95    False
96    False
97    False
98    False
99    False
Name: Height, Length: 100, dtype: bool

In [46]:
df[df["Height"]>=9.5]

Unnamed: 0,Date,Price,Quantity,Width,Height,Quality,Difference,Colors,Shape
35,1997-08-09,1628,52,2.439896,9.730106,59.810103,1.476659,blue,triangle
46,1997-08-20,6335,54,9.666548,9.6362,58.051058,-2.452265,,triangle
84,1997-09-27,2162,12,2.799339,9.548653,41.626976,0.981547,red,circle


Notes on combining multiple conditions

In [48]:
df[(df["Height"] >= 9.5) & ((df["Price"] > 3000) | (df["Shape"] == "triangle"))]

Unnamed: 0,Date,Price,Quantity,Width,Height,Quality,Difference,Colors,Shape
35,1997-08-09,1628,52,2.439896,9.730106,59.810103,1.476659,blue,triangle
46,1997-08-20,6335,54,9.666548,9.6362,58.051058,-2.452265,,triangle


In [49]:
condition1 = df["Height"] >= 9.5
condition2 = df["Price"] > 3000
condition3 = df["Shape"] == "triangle"
df[condition1 & (condition2|condition3)]

Unnamed: 0,Date,Price,Quantity,Width,Height,Quality,Difference,Colors,Shape
35,1997-08-09,1628,52,2.439896,9.730106,59.810103,1.476659,blue,triangle
46,1997-08-20,6335,54,9.666548,9.6362,58.051058,-2.452265,,triangle
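
The caveats are about operator precedence: `&` and `|` bind more tightly than comparisons such as `>=`, so each comparison needs its own parentheses, and `&` binds more tightly than `|`, so mixed conditions need explicit grouping. A sketch on hypothetical data chosen so that the grouping changes the result:

```python
import pandas as pd

# Hypothetical sample data
sample = pd.DataFrame({
    "Height": [9.7, 9.6, 9.5, 3.0],
    "Price": [1628, 6335, 2162, 9000],
    "Shape": ["triangle", "triangle", "circle", "triangle"],
})

tall = sample["Height"] >= 9.5
pricey = sample["Price"] > 3000
triangle = sample["Shape"] == "triangle"

# & binds tighter than |, so the grouping changes the result
grouped = sample[tall & (pricey | triangle)]
ungrouped = sample[tall & pricey | triangle]   # parsed as (tall & pricey) | triangle

# ~ negates a mask
not_triangle = sample[tall & ~triangle]

# Python's `and` / `or` do not work element-wise on Series
try:
    sample[tall and pricey]
except ValueError as exc:
    print("ValueError:", exc)

print(grouped.index.tolist(), ungrouped.index.tolist(), not_triangle.index.tolist())
```

Naming each mask, as in the `condition1` / `condition2` / `condition3` cell above, keeps the final expression short and makes the grouping easy to read.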


Using filter  
Note: it is fine to treat filter as a fallback for cases that loc and similar indexers cannot handle  

In [53]:
df.filter(like="0", axis=0).head(3)

Unnamed: 0,Date,Price,Quantity,Width,Height,Quality,Difference,Colors,Shape
0,1997-07-05,7370,16,2.420553,6.721355,52.848386,-0.302324,blue,circle
10,1997-07-15,1785,46,,6.909377,47.722053,-0.343201,green,circle
20,1997-07-25,6496,99,3.492096,7.259557,46.379513,1.200412,green,square
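
`filter` selects on labels only, never on cell values. The sketch below shows its three selection modes (`like`, `items`, `regex`) on a hypothetical frame; the column names are borrowed from the data above, but the values are made up.

```python
import pandas as pd

# Hypothetical sample with index labels 0..20
sample = pd.DataFrame(
    {"Width": range(21), "Height": range(21), "Price": range(21)}
)

# like= keeps labels that contain the substring (labels are compared as strings)
rows_with_zero = sample.filter(like="0", axis=0)

# items= keeps exactly the listed labels
some_columns = sample.filter(items=["Width", "Price"])

# regex= keeps labels matching the pattern (here: names ending in "th" or "ht")
regex_columns = sample.filter(regex="(th|ht)$")

print(rows_with_zero.index.tolist())
print(some_columns.columns.tolist())
print(regex_columns.columns.tolist())
```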


Using query

In [54]:
df.query("9 < Height & Width < 3")

Unnamed: 0,Date,Price,Quantity,Width,Height,Quality,Difference,Colors,Shape
13,1997-07-18,2533,44,1.134735,9.246936,43.915823,0.983504,green,square
35,1997-08-09,1628,52,2.439896,9.730106,59.810103,1.476659,blue,triangle
84,1997-09-27,2162,12,2.799339,9.548653,41.626976,0.981547,red,circle
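
A query string can also refer to local Python variables with the `@` prefix and compare against quoted string literals. The sketch below uses hypothetical data and a hypothetical `max_width` variable, and checks that the query matches the equivalent boolean mask.

```python
import pandas as pd

# Hypothetical sample data
sample = pd.DataFrame({
    "Height": [9.2, 9.7, 9.5, 3.0],
    "Width": [1.1, 2.4, 2.8, 9.9],
    "Shape": ["square", "triangle", "circle", "square"],
})

max_width = 2.5   # local variable referenced with @ inside the query string

by_query = sample.query("Height > 9 and Width < @max_width")
by_mask = sample[(sample["Height"] > 9) & (sample["Width"] < max_width)]

# String comparison inside query: quote the literal
squares = sample.query("Shape == 'square'")

print(by_query.index.tolist(), squares.index.tolist())
```

Inside a query string the comparisons do not need parentheses, which is why `"9 < Height & Width < 3"` works as written above.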


---
<a id=3></a>
[Back to top](#0)

---
## 3. Updating (overwriting) element values

* Update the value of an element
* Update with loc or iloc
* Update with the 1-D or 2-D array shape of the selection in mind
---

Update the value of an element

In [55]:
# Chained indexing is not recommended
# df["Price"][0] = 0

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["Price"][0] = 0


In [56]:
df.head(3)

Unnamed: 0,Date,Price,Quantity,Width,Height,Quality,Difference,Colors,Shape
0,1997-07-05,0,16,2.420553,6.721355,52.848386,-0.302324,blue,circle
1,1997-07-06,960,82,7.616196,2.376375,,-0.004563,blue,circle
2,1997-07-07,5490,81,7.282163,3.677831,51.715512,-0.682343,red,square
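
In the run above the value did change, but chained assignment is unreliable: whether it reaches the original frame depends on the pandas version and on whether the intermediate object is a view or a copy. The sketch below, on hypothetical data, shows a case where the chained form silently does nothing, next to the single `.loc` call that always updates the frame.

```python
import warnings
import pandas as pd

# Hypothetical sample data
sample = pd.DataFrame({"Price": [7370, 960, 5490], "Quantity": [16, 82, 81]})

# Chained indexing: the boolean selection returns a copy, so the assignment
# is applied to that temporary copy and `sample` itself is left unchanged.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # silence the chained-assignment warning
    sample[sample["Price"] > 5000]["Price"] = 0
unchanged = sample["Price"].tolist()

# Single .loc call: row mask and column label in one indexing operation
sample.loc[sample["Price"] > 5000, "Price"] = 0
updated = sample["Price"].tolist()

print(unchanged)
print(updated)
```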


Update with loc or iloc

In [57]:
df.loc[0, 'Price'] = 1000
df.head(2)

Unnamed: 0,Date,Price,Quantity,Width,Height,Quality,Difference,Colors,Shape
0,1997-07-05,1000,16,2.420553,6.721355,52.848386,-0.302324,blue,circle
1,1997-07-06,960,82,7.616196,2.376375,,-0.004563,blue,circle


In [59]:
df.loc[0:2, 'Price'] = 10000
df.head(3)

Unnamed: 0,Date,Price,Quantity,Width,Height,Quality,Difference,Colors,Shape
0,1997-07-05,10000,16,2.420553,6.721355,52.848386,-0.302324,blue,circle
1,1997-07-06,10000,82,7.616196,2.376375,,-0.004563,blue,circle
2,1997-07-07,10000,81,7.282163,3.677831,51.715512,-0.682343,red,square


Update with the 1-D or 2-D array shape of the selection in mind

In [60]:
df.loc[0, ["Width", "Height"]] = 8.88
df.head(3)

Unnamed: 0,Date,Price,Quantity,Width,Height,Quality,Difference,Colors,Shape
0,1997-07-05,10000,16,8.88,8.88,52.848386,-0.302324,blue,circle
1,1997-07-06,10000,82,7.616196,2.376375,,-0.004563,blue,circle
2,1997-07-07,10000,81,7.282163,3.677831,51.715512,-0.682343,red,square


In [63]:
df.loc[0, ["Width", "Height"]] = [1.11, 7.77]
df.head(3)

Unnamed: 0,Date,Price,Quantity,Width,Height,Quality,Difference,Colors,Shape
0,1997-07-05,10000,16,1.11,7.77,52.848386,-0.302324,blue,circle
1,1997-07-06,10000,82,7.616196,2.376375,,-0.004563,blue,circle
2,1997-07-07,10000,81,7.282163,3.677831,51.715512,-0.682343,red,square


In [65]:
df.loc[[0, 1], ["Width", "Height"]] = [4.44, 7.77]
df

Unnamed: 0,Date,Price,Quantity,Width,Height,Quality,Difference,Colors,Shape
0,1997-07-05,10000,16,4.440000,7.770000,52.848386,-0.302324,blue,circle
1,1997-07-06,10000,82,4.440000,7.770000,,-0.004563,blue,circle
2,1997-07-07,10000,81,7.282163,3.677831,51.715512,-0.682343,red,square
3,1997-07-08,5291,21,6.323058,6.335297,58.804605,-0.048195,green,square
4,NaT,5834,43,5.357747,0.902898,51.509484,,red,square
...,...,...,...,...,...,...,...,...,...
95,1997-10-08,491,62,6.158501,6.350937,55.542938,-0.075304,green,square
96,1997-10-09,5992,53,0.453040,3.746126,51.168085,-1.101252,blue,square
97,1997-10-10,3661,99,6.258599,5.031363,48.484440,0.487673,red,circle
98,1997-10-11,6284,41,8.564898,6.586936,58.127088,0.747368,blue,triangle


In [67]:
df.loc[[0, 1], ["Width", "Height"]] = [[1.11, 2.22], [3.33, 4.44]]
df

Unnamed: 0,Date,Price,Quantity,Width,Height,Quality,Difference,Colors,Shape
0,1997-07-05,10000,16,1.110000,2.220000,52.848386,-0.302324,blue,circle
1,1997-07-06,10000,82,3.330000,4.440000,,-0.004563,blue,circle
2,1997-07-07,10000,81,7.282163,3.677831,51.715512,-0.682343,red,square
3,1997-07-08,5291,21,6.323058,6.335297,58.804605,-0.048195,green,square
4,NaT,5834,43,5.357747,0.902898,51.509484,,red,square
...,...,...,...,...,...,...,...,...,...
95,1997-10-08,491,62,6.158501,6.350937,55.542938,-0.075304,green,square
96,1997-10-09,5992,53,0.453040,3.746126,51.168085,-1.101252,blue,square
97,1997-10-10,3661,99,6.258599,5.031363,48.484440,0.487673,red,circle
98,1997-10-11,6284,41,8.564898,6.586936,58.127088,0.747368,blue,triangle
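
The assignments above follow one rule: the right-hand side is broadcast to the shape of the selection. A compact sketch of the three cases (scalar, 1-D list, 2-D list) on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical sample data
sample = pd.DataFrame({
    "Width": [2.4, 7.6, 7.3],
    "Height": [6.7, 2.4, 3.7],
})

cols = ["Width", "Height"]

# Scalar -> every selected cell
sample.loc[0, cols] = 8.88
after_scalar = sample.loc[0, cols].tolist()

# 1-D list on a 2x2 selection -> repeated for each selected row
sample.loc[[0, 1], cols] = [4.44, 7.77]
after_1d = sample.loc[[0, 1], cols].values.tolist()

# 2-D list -> one inner list per selected row
sample.loc[[0, 1], cols] = [[1.11, 2.22], [3.33, 4.44]]
after_2d = sample.loc[[0, 1], cols].values.tolist()

print(after_scalar, after_1d, after_2d)
```

Rows outside the selection (row 2 here) are never touched.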


---
<a id=4></a>
[Back to top](#0)

---
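## 4. Summary statistics, unique values, max/min, and more

---
* Computing summary statistics
* Unique values per column
* Displaying summary statistics all at once
* Records holding the maximum / minimum values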
---

Computing summary statistics

In [68]:
print(f"count : {df['Quantity'].count()}")
print(f"sum : {df['Quantity'].sum()}")
print(f"average : {df['Quantity'].mean()}")
print(f"median : {df['Quantity'].median()}")
print(f"max : {df['Quantity'].max()}")
print(f"min : {df['Quantity'].min()}")
print('=====')
print(f"mode : {df['Quantity'].mode()}")   # most frequent value

count : 100
sum : 5337
average : 53.37
median : 53.0
max : 99
min : 10
=====
mode : 0    71
Name: Quantity, dtype: int64
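
Two details of the output above are worth a sketch: `count()` skips missing values, and `mode()` prints as a Series because several values can tie for most frequent. The series below is hypothetical and built to show both.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a tie for the most frequent value and one NaN
quantity = pd.Series([10, 10, 71, 71, np.nan], name="Quantity")

n_values = quantity.count()      # NaN is not counted
n_rows = len(quantity)           # NaN is counted
modes = quantity.mode()          # Series: ties produce several values
first_mode = modes.iloc[0]       # pick one value explicitly when needed

print(n_values, n_rows)
print(modes.tolist(), first_mode)
```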


Unique values per column

In [71]:
df["Colors"].unique()

array(['blue', 'red', 'green', nan], dtype=object)

In [69]:
df["Quantity"].unique()

array([16, 82, 81, 21, 43, 42, 57, 32, 71, 97, 46, 53, 95, 44, 74, 56, 87,
       12, 10, 14, 99, 23, 36, 18, 88, 24, 51, 86, 60, 72, 61, 13, 52, 38,
       45, 22, 41, 80, 68, 37, 75, 54, 66, 15, 93, 39, 84, 98, 79, 33, 67,
       48, 11, 65, 90, 63, 96, 28, 62], dtype=int64)

In [72]:
df["Quantity"].nunique()

59

In [73]:
df["Quantity"].value_counts()

71    7
53    4
11    4
99    3
68    3
41    3
61    3
36    3
10    3
12    3
37    3
46    2
18    2
81    2
21    2
79    2
60    2
86    2
51    2
95    2
88    2
24    2
32    2
97    2
96    1
84    1
90    1
66    1
65    1
15    1
93    1
39    1
98    1
33    1
28    1
48    1
63    1
67    1
54    1
16    1
72    1
75    1
80    1
43    1
42    1
57    1
44    1
74    1
56    1
87    1
14    1
23    1
82    1
13    1
52    1
38    1
45    1
22    1
62    1
Name: Quantity, dtype: int64

In [75]:
np.array(df["Quantity"].value_counts())

array([7, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)
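
As the `Colors` output shows, `unique()` includes `nan`, whereas `nunique()` and `value_counts()` leave missing values out unless asked. A sketch of the relevant options on a hypothetical series:

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing value
colors = pd.Series(["blue", "red", "blue", np.nan, "green", "blue"], name="Colors")

n_unique = colors.nunique()                          # NaN excluded by default
n_unique_with_nan = colors.nunique(dropna=False)     # NaN counted as a value
counts = colors.value_counts()                       # NaN excluded, sorted by count
counts_with_nan = colors.value_counts(dropna=False)  # NaN gets its own row
shares = colors.value_counts(normalize=True)         # relative frequencies

print(n_unique, n_unique_with_nan)
print(counts_with_nan)
print(shares)
```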

Displaying summary statistics all at once

In [76]:
df.describe()

Unnamed: 0,Price,Quantity,Width,Height,Quality,Difference
count,100.0,100.0,95.0,96.0,94.0,93.0
mean,5370.03,53.37,4.791071,5.234543,50.30581,0.107284
std,2911.617369,27.258825,2.893133,2.814596,6.219802,0.994061
min,164.0,10.0,0.091971,0.050616,40.216753,-2.452265
25%,2824.25,32.75,2.318724,3.242448,45.637994,-0.313024
50%,5574.5,53.0,5.183297,5.489802,50.951457,0.169973
75%,7954.0,72.5,7.03071,7.370669,55.821776,0.784006
max,10000.0,99.0,9.900539,9.730106,59.810103,2.275761


In [78]:
df.describe().T # transpose

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Price,100.0,5370.03,2911.617369,164.0,2824.25,5574.5,7954.0,10000.0
Quantity,100.0,53.37,27.258825,10.0,32.75,53.0,72.5,99.0
Width,95.0,4.791071,2.893133,0.091971,2.318724,5.183297,7.03071,9.900539
Height,96.0,5.234543,2.814596,0.050616,3.242448,5.489802,7.370669,9.730106
Quality,94.0,50.30581,6.219802,40.216753,45.637994,50.951457,55.821776,59.810103
Difference,93.0,0.107284,0.994061,-2.452265,-0.313024,0.169973,0.784006,2.275761
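
By default `describe()` reports numeric columns only, which is why `Colors` and `Shape` are missing from the tables above. A sketch of two commonly used options, on a hypothetical frame:

```python
import pandas as pd

# Hypothetical sample data
sample = pd.DataFrame({
    "Price": [7370, 960, 5490, 5291],
    "Colors": ["blue", "blue", "red", "green"],
})

numeric_only = sample.describe()                   # numeric columns only (default)
everything = sample.describe(include="all")        # adds unique / top / freq rows
custom = sample.describe(percentiles=[0.1, 0.9])   # choose the percentiles shown

print(everything)
print(custom.index.tolist())
```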


Records holding the maximum / minimum values

In [95]:
df["Quantity"].nlargest(5)    # not displayed: a cell shows only its last expression
df["Quantity"].nsmallest(5)

18    10
59    10
92    10
83    11
88    11
Name: Quantity, dtype: int64

In [87]:
large_index = df["Quantity"].nlargest(5).index

In [88]:
small_index = df["Quantity"].nsmallest(5).index

In [85]:
df.loc[large_index]

Unnamed: 0,Date,Price,Quantity,Width,Height,Quality,Difference,Colors,Shape
20,1997-07-25,6496,99,3.492096,7.259557,46.379513,1.200412,green,square
26,1997-07-31,2847,99,6.635018,0.050616,53.934743,-0.157724,blue,triangle
97,1997-10-10,3661,99,6.258599,5.031363,48.48444,0.487673,red,circle
57,1997-08-31,1121,98,2.935918,8.093612,59.012143,0.535492,blue,triangle
9,1997-07-14,8422,97,2.264958,,58.829296,-0.844137,red,circle


In [89]:
df.loc[small_index]

Unnamed: 0,Date,Price,Quantity,Width,Height,Quality,Difference,Colors,Shape
18,1997-07-23,4655,10,8.972158,9.004181,53.660135,-0.431942,red,square
59,1997-09-02,8089,10,9.132406,5.113424,52.636744,1.541371,blue,square
92,1997-10-05,5358,10,0.978342,4.916159,,1.122678,blue,circle
83,1997-09-26,5376,11,7.145959,6.601974,54.124845,,red,circle
88,1997-10-01,5563,11,7.578461,0.143935,47.412843,,blue,triangle


In [96]:
df.loc[df["Width"] > 9]

Unnamed: 0,Date,Price,Quantity,Width,Height,Quality,Difference,Colors,Shape
43,1997-08-17,2712,37,9.539286,9.148644,55.017421,-0.086414,red,circle
45,1997-08-19,9655,51,9.283186,4.281841,42.062477,0.278723,blue,triangle
46,1997-08-20,6335,54,9.666548,9.6362,58.051058,-2.452265,,triangle
53,1997-08-27,1685,93,9.900539,1.40084,58.10764,,red,square
59,1997-09-02,8089,10,9.132406,5.113424,52.636744,1.541371,blue,square
75,1997-09-18,978,41,9.758521,5.163003,42.859834,,green,square
79,1997-09-22,6431,67,9.626484,8.359801,41.682136,-0.137078,blue,circle
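
The two-step pattern above (take the index from `nlargest`, then pass it to `loc`) can usually be shortened, because `DataFrame.nlargest` returns the full rows directly. A sketch on hypothetical data that includes a tie for the maximum:

```python
import pandas as pd

# Hypothetical sample data with a tie for the largest Quantity
sample = pd.DataFrame({
    "Quantity": [99, 10, 99, 97, 11],
    "Price": [6496, 4655, 2847, 8422, 5376],
})

top2 = sample.nlargest(2, "Quantity")                        # rows with the 2 largest values
top_with_ties = sample.nlargest(1, "Quantity", keep="all")   # keep every tied row
row_of_max = sample.loc[sample["Quantity"].idxmax()]         # first row holding the max
rows_of_min = sample[sample["Quantity"] == sample["Quantity"].min()]

print(top2.index.tolist(), top_with_ties.index.tolist())
print(row_of_max["Price"], rows_of_min.index.tolist())
```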


---
<a id=5></a>
[Back to top](#0)

---
## 5. Grouping and sorting records

* Group and extract a specific group
* Sorting
* Compare statistics per group
* Grouping by multiple categories (classes) and working with a MultiIndex
---

Group and extract a specific group

In [97]:
df.groupby("Colors")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000012FDD78FCA0>

In [99]:
df.groupby("Colors").groups

{'blue': [0, 1, 6, 7, 26, 30, 33, 35, 36, 42, 44, 45, 50, 56, 57, 58, 59, 60, 61, 63, 64, 69, 70, 71, 73, 76, 79, 88, 92, 93, 96, 98], 'green': [3, 5, 10, 11, 13, 15, 19, 20, 21, 23, 24, 28, 32, 34, 39, 47, 48, 49, 51, 52, 54, 55, 62, 66, 67, 68, 75, 80, 81, 82, 87, 91, 94, 95, 99], 'red': [2, 4, 8, 9, 12, 14, 16, 17, 18, 22, 25, 27, 29, 31, 37, 38, 40, 41, 43, 53, 65, 72, 74, 77, 78, 83, 84, 85, 86, 89, 90, 97]}

In [100]:
df.groupby("Colors").groups["blue"]

Int64Index([ 0,  1,  6,  7, 26, 30, 33, 35, 36, 42, 44, 45, 50, 56, 57, 58, 59,
            60, 61, 63, 64, 69, 70, 71, 73, 76, 79, 88, 92, 93, 96, 98],
           dtype='int64')

In [116]:
df.groupby("Colors").get_group("blue").head(5)

Unnamed: 0,Date,Price,Quantity,Width,Height,Quality,Difference,Colors,Shape
0,1997-07-05,10000,16,1.11,2.22,52.848386,-0.302324,blue,circle
1,1997-07-06,10000,82,3.33,4.44,,-0.004563,blue,circle
6,1997-07-11,566,57,1.865185,0.407751,52.865764,0.492686,blue,triangle
7,1997-07-12,4526,32,5.908929,6.775644,49.165058,0.088873,blue,circle
26,1997-07-31,2847,99,6.635018,0.050616,53.934743,-0.157724,blue,triangle
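
The `groups` output above has no entry for the record whose `Colors` value is missing, because `groupby` drops NaN keys by default. The sketch below, on hypothetical data, shows how to iterate over groups and how to keep the NaN group with `dropna=False` (available in pandas 1.1 and later).

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing group key
sample = pd.DataFrame({
    "Colors": ["blue", "red", np.nan, "blue", "green"],
    "Price": [7370, 5490, 6335, 960, 5291],
})

# Iterating yields (key, sub-DataFrame) pairs; rows with a NaN key are skipped
sizes = {key: len(group) for key, group in sample.groupby("Colors")}

# dropna=False keeps the NaN key as its own group
sizes_with_nan = sample.groupby("Colors", dropna=False).size()

blue = sample.groupby("Colors").get_group("blue")

print(sizes)
print(sizes_with_nan)
print(blue.index.tolist())
```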


Sorting

In [102]:
df_sorted = df.sort_values(by=["Colors"])
df_sorted

Unnamed: 0,Date,Price,Quantity,Width,Height,Quality,Difference,Colors,Shape
0,1997-07-05,10000,16,1.110000,2.220000,52.848386,-0.302324,blue,circle
35,1997-08-09,1628,52,2.439896,9.730106,59.810103,1.476659,blue,triangle
36,1997-08-10,3656,38,3.930977,8.920466,48.252354,0.700243,blue,square
42,1997-08-16,7613,95,,9.404586,48.579881,0.340922,blue,square
44,1997-08-18,7141,75,3.701587,0.154566,,0.275635,blue,square
...,...,...,...,...,...,...,...,...,...
14,1997-07-19,5411,74,8.773394,2.579416,41.387226,0.473627,red,circle
65,1997-09-08,1095,81,0.359423,4.655980,55.831581,1.551263,red,triangle
16,1997-07-21,6520,87,5.552008,5.296506,40.364437,2.275761,red,square
40,1997-08-14,8892,80,7.224521,2.807724,58.615147,0.295478,red,


In [105]:
df_sorted.rename_axis("ID", axis=0, inplace=True)
df_sorted

Unnamed: 0_level_0,Date,Price,Quantity,Width,Height,Quality,Difference,Colors,Shape
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
0,1997-07-05,10000,16,1.110000,2.220000,52.848386,-0.302324,blue,circle
35,1997-08-09,1628,52,2.439896,9.730106,59.810103,1.476659,blue,triangle
36,1997-08-10,3656,38,3.930977,8.920466,48.252354,0.700243,blue,square
42,1997-08-16,7613,95,,9.404586,48.579881,0.340922,blue,square
44,1997-08-18,7141,75,3.701587,0.154566,,0.275635,blue,square
...,...,...,...,...,...,...,...,...,...
14,1997-07-19,5411,74,8.773394,2.579416,41.387226,0.473627,red,circle
65,1997-09-08,1095,81,0.359423,4.655980,55.831581,1.551263,red,triangle
16,1997-07-21,6520,87,5.552008,5.296506,40.364437,2.275761,red,square
40,1997-08-14,8892,80,7.224521,2.807724,58.615147,0.295478,red,


In [107]:
df_sorted.sort_values(by=["Colors", "ID"], inplace=True, na_position="first", ascending=[True, False])
df_sorted

Unnamed: 0_level_0,Date,Price,Quantity,Width,Height,Quality,Difference,Colors,Shape
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
46,1997-08-20,6335,54,9.666548,9.636200,58.051058,-2.452265,,triangle
98,1997-10-11,6284,41,8.564898,6.586936,58.127088,0.747368,blue,triangle
96,1997-10-09,5992,53,0.453040,3.746126,51.168085,-1.101252,blue,square
93,1997-10-06,5718,28,4.734718,1.732019,,-1.441196,blue,circle
92,1997-10-05,5358,10,0.978342,4.916159,,1.122678,blue,circle
...,...,...,...,...,...,...,...,...,...
12,1997-07-17,7049,95,1.375209,3.410664,58.107013,0.069653,red,circle
9,1997-07-14,8422,97,2.264958,,58.829296,-0.844137,red,circle
8,1997-07-13,5678,71,0.165878,5.120931,50.912336,-1.445610,red,square
4,NaT,5834,43,5.357747,0.902898,51.509484,,red,square


In [108]:
pd.concat([df[df["Colors"].isnull()].sort_index(), df[df["Colors"] == "blue"].sort_index(), 
           df[df["Colors"] == "green"].sort_index(), df[df["Colors"] == "red"].sort_index()])

Unnamed: 0,Date,Price,Quantity,Width,Height,Quality,Difference,Colors,Shape
46,1997-08-20,6335,54,9.666548,9.636200,58.051058,-2.452265,,triangle
0,1997-07-05,10000,16,1.110000,2.220000,52.848386,-0.302324,blue,circle
1,1997-07-06,10000,82,3.330000,4.440000,,-0.004563,blue,circle
6,1997-07-11,566,57,1.865185,0.407751,52.865764,0.492686,blue,triangle
7,1997-07-12,4526,32,5.908929,6.775644,49.165058,0.088873,blue,circle
...,...,...,...,...,...,...,...,...,...
85,1997-09-28,164,65,,5.543541,41.696754,0.330701,red,triangle
86,1997-09-29,8106,90,6.117207,4.196001,59.732792,-0.014718,red,triangle
89,1997-10-02,2127,11,1.160726,0.460026,56.255991,-1.664693,red,triangle
90,1997-10-03,2795,63,0.407288,8.554606,58.944972,-0.006199,red,square
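
The multi-key sort above can be traced on a frame small enough to check by hand. This sketch uses hypothetical data; as in the notebook, the index is named `ID` so that it can be used as a sort key next to a column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the index is named "ID" as in the cells above
sample = pd.DataFrame({
    "Colors": ["red", "blue", np.nan, "blue", "red"],
    "Price": [5490, 7370, 6335, 960, 5834],
}).rename_axis("ID")

# Colors ascending with NaN first, then ID descending within each color.
# A named index level ("ID") can be used as a sort key alongside columns.
ordered = sample.sort_values(
    by=["Colors", "ID"], ascending=[True, False], na_position="first"
)

# sort_index restores the original order; `sample` itself is untouched
restored = ordered.sort_index()

print(ordered.index.tolist())
print(restored.index.tolist())
```

Without `inplace=True`, `sort_values` returns a new frame, which keeps the original available for comparison.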


Compare statistics per group

In [118]:
df.groupby("Colors")["Quality"].mean()

Colors
blue     51.556144
green    48.996105
red      50.408645
Name: Quality, dtype: float64

In [119]:
df.groupby("Colors")["Quality"].agg(["count", "min", "max", "sum", "mean", "median"])


Unnamed: 0_level_0,count,min,max,sum,mean,median
Colors,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
blue,28,41.151175,59.810103,1443.572041,51.556144,52.052751
green,35,40.216753,59.720021,1714.863684,48.996105,47.485416
red,30,40.364437,59.732792,1512.259335,50.408645,52.039936


In [120]:
df.groupby("Colors")["Quality"].describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
Colors,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
blue,28.0,51.556144,5.237486,41.151175,48.497999,52.052751,55.093171,59.810103
green,35.0,48.996105,6.344835,40.216753,43.204312,47.485416,55.964529,59.720021
red,30.0,50.408645,6.757617,40.364437,42.346829,52.039936,55.75575,59.732792
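
When statistics from several columns are needed in one table, named aggregation gives each output column an explicit name. The sketch below uses hypothetical data, and the output names (`quality_mean`, `quality_range`, `price_total`) are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data
sample = pd.DataFrame({
    "Colors": ["blue", "blue", "red", "red"],
    "Quality": [50.0, 54.0, 40.0, 60.0],
    "Price": [100, 300, 200, 400],
})

# Each keyword is an output column: name=(source column, aggregation)
summary = sample.groupby("Colors").agg(
    quality_mean=("Quality", "mean"),
    quality_range=("Quality", lambda s: s.max() - s.min()),
    price_total=("Price", "sum"),
)

print(summary)
```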


Grouping by multiple categories (classes) and working with a MultiIndex

In [126]:
df_G = df.groupby(["Colors", "Shape"]).describe()

In [127]:
df_G.index

MultiIndex([( 'blue',   'circle'),
            ( 'blue',   'square'),
            ( 'blue', 'triangle'),
            ('green',   'circle'),
            ('green',   'square'),
            ('green', 'triangle'),
            (  'red',   'circle'),
            (  'red',   'square'),
            (  'red', 'triangle')],
           names=['Colors', 'Shape'])

In [128]:
df_G.loc["green", "Price"]

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
Shape,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
circle,9.0,5874.888889,3186.919023,869.0,3199.0,7308.0,7729.0,9770.0
square,14.0,4561.857143,2615.834706,491.0,2566.75,5221.0,6268.5,8766.0
triangle,11.0,5364.636364,3578.553458,289.0,2012.5,6365.0,8310.0,9787.0
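
Besides `loc` on the outer level, a MultiIndex result can be selected by full key, sliced on the inner level, or flattened back into ordinary columns. A sketch on a hypothetical four-row frame, using a single statistic to keep the output small:

```python
import pandas as pd

# Hypothetical sample data
sample = pd.DataFrame({
    "Colors": ["green", "green", "green", "blue"],
    "Shape": ["circle", "square", "circle", "circle"],
    "Price": [100, 200, 300, 400],
})

mean_price = sample.groupby(["Colors", "Shape"])["Price"].mean()

green_only = mean_price.loc["green"]                 # outer level -> indexed by Shape
green_circle = mean_price.loc[("green", "circle")]   # full key -> scalar
circles = mean_price.xs("circle", level="Shape")     # select on the inner level
flat = mean_price.reset_index()                      # index levels back to columns
wide = mean_price.unstack()                          # Shape values become columns

print(green_only)
print(circles)
print(wide)
```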


---
<a id=6></a>
[Back to top](#0)

---
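## 6. Handling duplicates and missing values

* Remove duplicate records
* Check for missing values
* Remove records with missing values
* Replace missing values with the mean
* Use scikit-learn's SimpleImputer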
---

Remove duplicate records
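
No cell is given for this step, so the following is only a sketch of one common approach, using `duplicated` and `drop_duplicates` on a hypothetical frame:

```python
import pandas as pd

# Hypothetical sample data with one fully duplicated row
sample = pd.DataFrame({
    "Colors": ["blue", "blue", "red", "blue"],
    "Shape": ["circle", "circle", "square", "square"],
})

dup_mask = sample.duplicated()          # True for repeats of an earlier row
deduped = sample.drop_duplicates()      # keeps the first occurrence

# Judge duplicates on selected columns only, keeping the last occurrence
by_color = sample.drop_duplicates(subset=["Colors"], keep="last")

print(dup_mask.tolist())
print(deduped.index.tolist(), by_color.index.tolist())
```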

Check for missing values
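
As a sketch of how missing values might be inspected (the frame is hypothetical), `isna()` produces a boolean mask that can be summed per column or reduced per row:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data
sample = pd.DataFrame({
    "Width": [2.4, np.nan, 7.3],
    "Height": [6.7, 2.4, np.nan],
    "Price": [7370, 960, 5490],
})

missing_per_column = sample.isna().sum()                # count of NaN in each column
rows_with_missing = sample[sample.isna().any(axis=1)]   # rows with at least one NaN
total_missing = int(sample.isna().sum().sum())

print(missing_per_column)
print(rows_with_missing.index.tolist(), total_missing)
```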

Remove records with missing values
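
A sketch of the main `dropna` variants on a hypothetical frame; which one is appropriate depends on how much data can be given up:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data
sample = pd.DataFrame({
    "Width": [2.4, np.nan, 7.3, np.nan],
    "Height": [6.7, np.nan, np.nan, 6.3],
    "Quality": [52.8, np.nan, 51.7, np.nan],
})

complete = sample.dropna()                     # drop rows with any NaN
need_width = sample.dropna(subset=["Width"])   # only require Width to be present
not_empty = sample.dropna(how="all")           # drop rows where every value is NaN
at_least_two = sample.dropna(thresh=2)         # keep rows with >= 2 non-NaN values

print(complete.index.tolist(), need_width.index.tolist())
print(not_empty.index.tolist(), at_least_two.index.tolist())
```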

Replace missing values with the mean
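
One way to do this with pandas alone is `fillna` with the column mean. The sketch below, on hypothetical data, fills a single column and then all numeric columns at once; non-numeric columns are left as they are.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data
sample = pd.DataFrame({
    "Width": [2.0, np.nan, 4.0],
    "Height": [6.0, 3.0, np.nan],
    "Colors": ["blue", None, "red"],
})

# One column
width_filled = sample["Width"].fillna(sample["Width"].mean())

# All numeric columns at once: fillna with a Series maps each mean to its column
numeric_means = sample.mean(numeric_only=True)
filled = sample.fillna(numeric_means)

print(width_filled.tolist())
print(filled)
```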

Use scikit-learn's SimpleImputer
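
A minimal sketch of mean imputation with `sklearn.impute.SimpleImputer`, assuming scikit-learn is installed and using a hypothetical numeric frame. The imputer returns a NumPy array by default, so the result is wrapped back into a DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical sample data (numeric columns only)
sample = pd.DataFrame({
    "Width": [2.0, np.nan, 4.0],
    "Height": [6.0, 3.0, np.nan],
})

imputer = SimpleImputer(strategy="mean")        # also: "median", "most_frequent", "constant"
imputed_array = imputer.fit_transform(sample)   # learns the column means, then fills NaN

# Wrap the result back into a DataFrame with the original labels
imputed = pd.DataFrame(imputed_array, columns=sample.columns, index=sample.index)

print(imputer.statistics_)
print(imputed)
```

Compared with `fillna`, the fitted imputer remembers the means it learned, so the same values can be applied to new data with `imputer.transform`.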

---
 <a id=4></a>
[Back to top](#0)

---
## End
    
---