# Analysis practice using youtube_dataset

## Importing the data and the required libraries

In [25]:
import pandas as pd
import numpy as np
import gzip
#import pandas_profiling as pdp
from IPython.display import display
import dask.dataframe as dd
import time
from datetime import datetime
import math
from collections import Counter

In [4]:
yt_train = pd.read_csv('yt_bb_classification_train.csv.gz', compression="gzip",names=('youtube_id','timestamp_ms','class_id','class_name','object_presence'))
print(yt_train.shape)
display(yt_train)
display(yt_train.describe())
#print("Number of users =", len(df_members["msno"].unique()))  # left disabled: df_members is not defined above

(8146143, 5)


,youtube_id,timestamp_ms,class_id,class_name,object_presence
0,AAADSgKurQY,13000,19,dog,present
1,AAADSgKurQY,14000,19,dog,present
2,AAADSgKurQY,15000,19,dog,present
3,AAADSgKurQY,16000,19,dog,present
4,AAADSgKurQY,17000,19,dog,present
5,AAADSgKurQY,18000,19,dog,present
6,AAADSgKurQY,19000,19,dog,present
7,AAADSgKurQY,20000,19,dog,present
8,AAADSgKurQY,21000,19,dog,present
9,AAADSgKurQY,22000,19,dog,present


,timestamp_ms,class_id
count,8146143.0,8146143.0
mean,143721.1,8.929339
std,212635.6,7.06286
min,0.0,0.0
25%,29000.0,2.0
50%,76000.0,8.0
75%,180000.0,14.0
max,3586000.0,23.0


In [5]:
yt_validation = pd.read_csv('yt_bb_classification_validation.csv.gz', compression="gzip",names=('youtube_id','timestamp_ms','class_id','class_name','object_presence'))
print(yt_validation.shape)
display(yt_validation)
display(yt_validation.describe())

(1013246, 5)


,youtube_id,timestamp_ms,class_id,class_name,object_presence
0,AACebVo-JXY,283000,3,boat,present
1,AACebVo-JXY,284000,3,boat,present
2,AACebVo-JXY,285000,3,boat,present
3,AACebVo-JXY,286000,3,boat,present
4,AACebVo-JXY,287000,3,boat,present
5,AACebVo-JXY,288000,3,boat,present
6,AACebVo-JXY,289000,3,boat,present
7,AACebVo-JXY,290000,3,boat,present
8,AACebVo-JXY,291000,3,boat,present
9,AACebVo-JXY,292000,3,boat,present


,timestamp_ms,class_id
count,1013246.0,1013246.0
mean,141086.8,8.844676
std,208905.5,7.10311
min,0.0,0.0
25%,28000.0,2.0
50%,75000.0,7.0
75%,179000.0,14.0
max,3556000.0,23.0


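## Overview of the data fields

Dataset used: YouTube-BB (YouTube-BoundingBoxes) Dataset  
For videos uploaded to YouTube, it records the video ID, the object class that appears in the video, and whether that object is present in the frame at each timestamp.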

### youtube_id - (string) 
The YouTube identifier of the video the segment was extracted from. One may view the selected video at http://youtu.be/${youtube_id}.

### timestamp_ms - (integer)   
The time in milliseconds of the classified frame in the video.

### class_id - (integer)  
A numeric identifier for the object class.

### class_name - (string)  
A human-readable name for the object class.

### object_presence - (string)  
Whether or not the object is present in the frame ('present' or 'absent').
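
The field descriptions above suggest a few simple conversions. The following is a minimal sketch on a hypothetical three-row sample (the `absent` row is invented for illustration); the derived column names `timestamp_s`, `is_present`, and `url` are assumptions, not part of the dataset.

```python
import pandas as pd

# Hypothetical sample with the same columns as yt_train.
sample = pd.DataFrame({
    "youtube_id": ["AAADSgKurQY", "AAADSgKurQY", "AACebVo-JXY"],
    "timestamp_ms": [13000, 14000, 283000],
    "class_id": [19, 19, 3],
    "class_name": ["dog", "dog", "boat"],
    "object_presence": ["present", "absent", "present"],
})

# Milliseconds to seconds.
sample["timestamp_s"] = sample["timestamp_ms"] / 1000
# 'present' / 'absent' to a boolean flag.
sample["is_present"] = sample["object_presence"].eq("present")
# Viewing URL built from the pattern given in the field description.
sample["url"] = "http://youtu.be/" + sample["youtube_id"]

print(sample[["timestamp_s", "is_present", "url"]])
```

A boolean flag makes ratios easy to compute later, since the mean of a boolean column is the share of `True` values.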

# Inspecting and transforming the data

## Checking the whole dataset

### Basic statistics

In [40]:
yt_train.describe()

,timestamp_ms,class_id
count,8146143.0,8146143.0
mean,143721.1,8.929339
std,212635.6,7.06286
min,0.0,0.0
25%,29000.0,2.0
50%,76000.0,8.0
75%,180000.0,14.0
max,3586000.0,23.0


### Checking for missing values

In [10]:
yt_train[yt_train.isnull().any(axis=1)]  # axis must be passed by keyword on pandas 2.x

,youtube_id,timestamp_ms,class_id,class_name,object_presence
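
An empty result means no row has a missing value. A per-column count can be easier to read when there are gaps; the sketch below runs on a hypothetical sample that has missing values on purpose, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical sample containing missing values on purpose.
sample = pd.DataFrame({
    "youtube_id": ["a", "b", None],
    "timestamp_ms": [0, 1000, 2000],
    "class_name": ["dog", np.nan, "cat"],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()
# Rows that contain at least one missing value.
rows_with_missing = sample[sample.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```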


### Checking the number of unique values

In [11]:
yt_train.nunique()

youtube_id         253569
timestamp_ms         3587
class_id               24
class_name             24
object_presence         2
dtype: int64
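
Equal unique counts for `class_id` and `class_name` (24 each) suggest a one-to-one mapping but do not prove it. The sketch below checks the pairing directly on a hypothetical sample; on the real data the same two lines could be applied to `yt_train`.

```python
import pandas as pd

# Hypothetical sample; the id/name pairs match rows shown in this notebook.
sample = pd.DataFrame({
    "class_id": [19, 19, 3, 0],
    "class_name": ["dog", "dog", "boat", "person"],
})

# Distinct (class_id, class_name) pairs.
pairs = sample[["class_id", "class_name"]].drop_duplicates()
# One-to-one if neither side repeats among the distinct pairs.
one_to_one = pairs["class_id"].is_unique and pairs["class_name"].is_unique

print(pairs)
print(one_to_one)
```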

### Correlation coefficients

In [20]:
yt_train.corr(numeric_only=True)  # numeric_only (pandas >= 1.5) skips the string columns; required on pandas 2.x

,timestamp_ms,class_id
timestamp_ms,1.0,0.043126
class_id,0.043126,1.0


### Mode

In [44]:
yt_train.mode()

,youtube_id,timestamp_ms,class_id,class_name,object_presence
0,peFIHPWhmQI,15000,0,person,present


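## About youtube_id

### Information per ID  
- Number of rows per ID (youtube_id, count)  
- Mean timestamp of the annotated frames per video (timestamp_ms, mean); this is not the video length  
- Sum of the annotated timestamps per video (timestamp_ms, sum); this is not the total video time  
- Number of unique class IDs (class_id, nunique)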

In [19]:
yt_train.groupby('youtube_id').agg({'youtube_id':'count','timestamp_ms':['mean','sum'],'class_id':'nunique'})

youtube_id (index),youtube_id (count),timestamp_ms (mean),timestamp_ms (sum),class_id (nunique)
--0bLFuriZ4,19,82000.000000,1558000,1
--2LyLmEaq8,38,52000.000000,1976000,1
--4VWx_0Sc4,19,37000.000000,703000,1
--4xkUrLgjA,17,53000.000000,901000,1
--4yByFm8j4,21,51000.000000,1071000,1
--7Jrp2urUc,19,382000.000000,7258000,1
--7KXqgihyw,19,127000.000000,2413000,1
--7L39R9NG0,19,52000.000000,988000,1
--A-XRFwRt8,19,52000.000000,988000,1
--AjlzHEFNs,95,124000.000000,11780000,1


## About timestamp_ms

### Information per timestamp  
- Number of rows per timestamp (timestamp_ms, count)  
- Number of videos per timestamp (youtube_id, nunique)  
- Number of unique class IDs per timestamp (class_id, nunique)  

In [38]:
yt_train.groupby('timestamp_ms').agg({'timestamp_ms':'count','youtube_id':'nunique','class_id':'nunique'})

timestamp_ms (index),timestamp_ms (count),youtube_id (nunique),class_id (nunique)
0,73205,72581,24
1000,73202,72578,24
2000,73197,72573,24
3000,73200,72576,24
4000,73190,72566,24
5000,73170,72546,24
6000,73171,72547,24
7000,73158,72534,24
8000,73151,72527,24
9000,73151,72527,24


## About class_id

### Information per class_id  
- Number of rows per class ID (class_id, count)  
- Number of videos per class ID (youtube_id, nunique)  
- Mean and sum of the timestamps per class ID (timestamp_ms, mean/sum)  

In [54]:
yt_train.groupby('class_id').agg({'class_id':'count','youtube_id':'nunique','timestamp_ms':['mean','sum']})

class_id (index),class_id (count),youtube_id (nunique),timestamp_ms (mean),timestamp_ms (sum)
0,1492159,61885,103155.591328,153924544000
1,381409,9328,112031.467532,42729810000
2,214522,4945,225164.761656,48302795000
3,309155,8793,143364.018049,44321703000
4,364935,11223,173456.651184,63300403000
5,315929,7137,151399.159305,47831385000
6,397821,13722,181212.904799,72090299000
7,577420,26439,91858.993454,53041220000
8,45893,1356,228907.676552,10505260000
9,203868,5633,231945.430377,47286251000


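## About class_name

### Information per class name  
- Class ID for each class name (class_id, mean/nunique)
- Number of rows per class name (class_name, count)  
- Number of videos per class name (youtube_id, nunique)  
- Mean and sum of the timestamps per class name (timestamp_ms, mean/sum)  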

In [61]:
yt_train.groupby('class_name').agg({'class_name':'count','class_id':['mean','nunique'],'youtube_id':'nunique','timestamp_ms':['mean','sum']})

class_name (index),class_name (count),class_id (mean),class_id (nunique),youtube_id (nunique),timestamp_ms (mean),timestamp_ms (sum)
airplane,324170,13,1,7545,192801.363482,62500418000
bear,315929,5,1,7137,151399.159305,47831385000
bicycle,214522,2,1,4945,225164.761656,48302795000
bird,381409,1,1,9328,112031.467532,42729810000
boat,309155,3,1,8793,143364.018049,44321703000
bus,364935,4,1,11223,173456.651184,63300403000
car,258409,23,1,8699,116105.491682,30002704000
cat,577420,7,1,26439,91858.993454,53041220000
cow,397821,6,1,13722,181212.904799,72090299000
dog,453509,19,1,12520,96237.702008,43644664000


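## About object_presence

### Information by object presence
- Number of rows by object presence (object_presence, count)  
- Number of videos by object presence (youtube_id, nunique)  
- Mean and sum of the timestamps by object presence (timestamp_ms, mean/sum)  
- Number of distinct class IDs by object presence (class_id, nunique)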

In [60]:
yt_train.groupby('object_presence').agg({'object_presence':'count','youtube_id':'nunique','timestamp_ms':['mean','sum'],'class_id':'nunique'})

object_presence (index),object_presence (count),youtube_id (nunique),timestamp_ms (mean),timestamp_ms (sum),class_id (nunique)
absent,784166,98376,162372.450476,127326955000,23
present,7361977,242320,141734.458692,1043445825000,24
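
In the table above, 784166 of 8146143 rows (about 9.6%) are `absent`. To see whether that share differs by class, one option is to average a boolean flag per class. This is a sketch on hypothetical data; `absent_share` is an illustrative name.

```python
import pandas as pd

# Hypothetical sample: four dog frames and two boat frames.
sample = pd.DataFrame({
    "class_name": ["dog", "dog", "dog", "dog", "boat", "boat"],
    "object_presence": ["present", "present", "present", "absent",
                        "present", "absent"],
})

# The mean of a boolean flag is the share of True values in each group.
absent_share = (
    sample["object_presence"].eq("absent")
    .groupby(sample["class_name"])
    .mean()
)

print(absent_share)
```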


Summary from the data check  
- Check whether any field contains anomalous values  
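
One way to make this check concrete is a set of per-row rules. The sketch below is only an illustration: the rules are assumptions drawn from the outputs above (class IDs from 0 to 23, presence values `present` or `absent`) plus the common 11-character form of YouTube IDs, and the sample contains one deliberately bad row.

```python
import pandas as pd

# Hypothetical sample; the last row is deliberately invalid.
sample = pd.DataFrame({
    "youtube_id": ["AAADSgKurQY", "AAADSgKurQY", "bad"],
    "timestamp_ms": [13000, 14000, -500],
    "class_id": [19, 19, 99],
    "object_presence": ["present", "absent", "unknown"],
})

# Each column is True where a row violates the (assumed) rule.
checks = pd.DataFrame({
    "bad_id_length": sample["youtube_id"].str.len().ne(11),
    "negative_time": sample["timestamp_ms"].lt(0),
    "class_out_of_range": ~sample["class_id"].between(0, 23),
    "bad_presence": ~sample["object_presence"].isin(["present", "absent"]),
})

# Rows that break at least one rule.
anomalies = sample[checks.any(axis=1)]

print(checks.sum())
print(anomalies)
```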

# Analysis plan
On reflection, this data alone may not be worth analyzing (?)