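# Overview


The notebook performs four steps.

1. Data preprocessing
1. Cosine similarity calculation
1. Finding the indices with the highest similarity
1. CSV output

The input is a CSV file produced by scraping.
The data is preprocessed, cosine similarity is computed, and for each cosmetic product the indices of the most similar products are looked up.
Finally, the result is written to a CSV file so that it can be loaded into a database.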


# 1. Data preprocessing


This step only reads the CSV file.

In [1]:
import pandas as pd

df = pd.read_csv('./top_1000.csv')


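# 2. Cosine similarity calculation

This section computes cosine similarity.
Before that, the functions involved are tried out to check how they behave; cosine similarity is then implemented.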
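
As a reminder, cosine similarity is $\cos(\mathbf{a},\mathbf{b}) = \frac{\mathbf{a}\cdot\mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|}$. For 0/1 ingredient vectors this reduces to $\frac{|A \cap B|}{\sqrt{|A|\,|B|}}$, where $A$ and $B$ are the ingredient sets of two products. The snippet below is a minimal sketch of that special case; the function name `set_cosine` and the two toy ingredient sets are made up for illustration.

```python
import math

def set_cosine(a, b):
    """Cosine similarity of two binary vectors, each given as a set of labels."""
    if not a or not b:
        return 0.0  # convention chosen here for empty sets (assumption)
    return len(a & b) / math.sqrt(len(a) * len(b))

# Hypothetical toy products
product_a = {"water", "glycerin", "talc"}
product_b = {"water", "talc", "silica", "mica"}

# Two shared ingredients out of 3 and 4: 2 / sqrt(3 * 4)
similarity = set_cosine(product_a, product_b)
print(similarity)
```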

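## Verification

### Approaches to cosine similarity

#### 1. Prof. 追川's method


This is the cosine similarity taught in Prof. 追川's class.

The mean rating is computed first, and a hand-written function then calculates the similarity from the cosine distance.

The original data are 5-point movie ratings, so the context differs slightly from ours.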

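``` python
import numpy as np
from scipy.spatial.distance import cosine

# db1 is the ratings DataFrame from the lecture; it is not defined in this notebook.

# Cosine similarity (= 1 - cosine distance)
def similarity_cosine(df, l1, l2):
    # Use only the rows where both columns have a value
    a = df[[l1, l2]].dropna()
    if len(a[l1]) < 2:
        return np.nan
    return 1 - cosine(a[l1], a[l2])

# Compute each row's mean and add it as a mean-rating column
db1['mean'] = db1.mean(axis=1)
# Check
db1

# Make a copy of the DataFrame
db1_adj = db1.copy()
# Subtract the mean rating from every column
for i in db1_adj.columns.drop('mean'):
    db1_adj[i] = db1_adj[i] - db1_adj['mean']
# Drop the mean column
db1_adj = db1_adj.drop('mean', axis=1)
# Check
db1_adj

# Add a similarity row, initialised to NaN
db1_adj.loc['similarity'] = np.nan
# Compute each item's similarity to Item5 and store it in the DataFrame
for i in db1_adj.columns.drop('Item5'):
    db1_adj.at['similarity', i] = similarity_cosine(db1_adj, i, 'Item5')
# Check
db1_adj
```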
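
Because `db1` is not available in this notebook, the following is a small self-contained sketch of the same idea (subtract each row's mean, then take the cosine over rows where both items are rated). The ratings table, its labels, and the helper name `cosine_on_shared_rows` are hypothetical, and the cosine is computed directly with NumPy rather than through `scipy`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings: rows are users, columns are items, NaN = not rated
ratings = pd.DataFrame(
    {"Item1": [5, 3, 4, 3], "Item2": [3, 1, 3, 3], "Item5": [4, 3, 5, np.nan]},
    index=["u1", "u2", "u3", "u4"],
)

# Subtract each user's mean rating (mean() skips NaN entries)
centered = ratings.sub(ratings.mean(axis=1), axis=0)

def cosine_on_shared_rows(df, col_a, col_b):
    """Cosine similarity of two columns, using only rows where both have a value."""
    pair = df[[col_a, col_b]].dropna()
    if len(pair) < 2:
        return np.nan
    a = pair[col_a].to_numpy()
    b = pair[col_b].to_numpy()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else np.nan

# Similarity of every other item to Item5
sims = {c: cosine_on_shared_rows(centered, c, "Item5") for c in ["Item1", "Item2"]}
print(sims)
```

Mean-centering matters for ratings because users use the scale differently; for the 0/1 ingredient vectors used later in this notebook it is not applied.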





#### 2. Other approaches found


Code generated by an AI chat; it looks usable almost as is.
```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MultiLabelBinarizer

# Read the CSV file
df = pd.read_csv('./top_1000.csv')

# Split the ingredients on commas into a list
# (the generated code assumes an `ingredients` column; in top_1000.csv it is `composition`)
df['ingredients'] = df['ingredients'].str.split(',')

# Use MultiLabelBinarizer to turn the ingredients into binary vectors
mlb = MultiLabelBinarizer()
ingredient_matrix = mlb.fit_transform(df['ingredients'])

# Compute cosine similarity
cosine_sim = cosine_similarity(ingredient_matrix)

# Convert the cosine similarity result into a DataFrame
cosine_sim_df = pd.DataFrame(cosine_sim, index=df['id'], columns=df['id'])
```

Another site that may be useful:
* https://www.hinomaruc.com/get-similar-product-using-cosine-similarity-in-python/#toc4



### MultiLabelBinarizer

This section checks the `MultiLabelBinarizer` class used above.

Three products are taken, and we check whether a vector is built with one column per ingredient.

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html
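
Before running it on the scraped data, a minimal sketch on made-up labels shows the expected behaviour: each distinct label becomes one column (in sorted order), and each row gets a 1 where the label is present. The English ingredient names here are hypothetical.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy input: one list of labels per product
toy_products = [["water", "glycerin"], ["talc", "water"]]

toy_mlb = MultiLabelBinarizer()
toy_matrix = toy_mlb.fit_transform(toy_products)

# classes_ holds the column labels; the matrix has one row per product
print(list(toy_mlb.classes_))
print(toy_matrix)
```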

In [2]:
sample_df = pd.read_csv('./top_1000.csv')

sample_df = sample_df.head(3)

print("Ingredients of row 0")
print(sample_df["composition"][0])
print("Ingredients of row 1")
print(sample_df["composition"][1])
print("Ingredients of row 2")
print(sample_df["composition"][2])

Ingredients of row 0
イソドデカン,ポリエチレン,パルミチン酸デキストリン,トリメチルシリルプルラン,ビオチノイルトリペプチド-1,パンテノール,トコフェロール,クオタニウム-18ベントナイト,グリセリン,水,炭酸プロピレン,酸化鉄
Ingredients of row 1
イソドデカン, トリメチルシロキシケイ酸, タルク, メチルトリメチコン, 合成ワックス, ミツロウ, パルミチン酸デキストリン
Ingredients of row 2
シリカ,ホウケイ酸（Ｃａ／Ａｌ）,ジメチコン,真珠層末,ポリクオタニウム－５１,ワイルドタイムエキス,セリシン,      ジステアリルジモニウムクロリド,トリエトキシシリルエチルポリジメチルシロキシエチルジメチコン,イソプロパノール,水,      テトラヒドロテトラメチルシクロテトラシロキサン,ＢＧ,テトラデセン,ステアリン酸,水酸化Ａｌ,クエン酸,      ソルビン酸Ｋ,フェノキシエタノール


In [3]:
from sklearn.preprocessing import MultiLabelBinarizer

# Create a MultiLabelBinarizer instance
mlb = MultiLabelBinarizer()

# Convert the data into a binary matrix
sample_df['composition'] = sample_df['composition'].fillna('').astype(str).str.split(',')
ingredient_matrix = mlb.fit_transform(sample_df['composition'])

# Combine the transformed data with the class names into a new DataFrame
ingredient_df = pd.DataFrame(ingredient_matrix, columns=mlb.classes_)

# Display the new DataFrame
ingredient_df


Unnamed: 0,ジステアリルジモニウムクロリド,ソルビン酸Ｋ,テトラヒドロテトラメチルシクロテトラシロキサン,タルク,トリメチルシロキシケイ酸,パルミチン酸デキストリン,ミツロウ,メチルトリメチコン,合成ワックス,イソドデカン,...,ホウケイ酸（Ｃａ／Ａｌ）,ポリエチレン,ポリクオタニウム－５１,ワイルドタイムエキス,水,水酸化Ａｌ,炭酸プロピレン,真珠層末,酸化鉄,ＢＧ
0,0,0,0,0,0,0,0,0,0,1,...,0,1,0,0,1,0,1,0,1,0
1,0,0,0,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,0,0,0,0,0,0,0,...,1,0,1,1,1,1,0,1,0,1


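Some of the scraped data looks odd, but on the whole the conversion seems to work.

`水` (water) and `イソドデカン` (isododecane) each appear in more than one product, so their columns contain several 1s.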
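
One visible oddity is that the scraped strings mix `,` and `, ` as separators, so `str.split(',')` keeps leading spaces and the same ingredient can end up as two different labels. The following is a minimal sketch of one possible normalisation, using the first two rows printed above; the helper name `split_ingredients` is hypothetical, and applying it would change the matrices shown in this notebook.

```python
def split_ingredients(text):
    """Split a comma-separated ingredient string into trimmed, non-empty labels."""
    if not isinstance(text, str):
        return []  # e.g. NaN from pandas for a missing composition
    return [tok.strip() for tok in text.split(",") if tok.strip()]

raw_1 = ("イソドデカン,ポリエチレン,パルミチン酸デキストリン,トリメチルシリルプルラン,"
         "ビオチノイルトリペプチド-1,パンテノール,トコフェロール,"
         "クオタニウム-18ベントナイト,グリセリン,水,炭酸プロピレン,酸化鉄")
raw_2 = "イソドデカン, トリメチルシロキシケイ酸, タルク, メチルトリメチコン, 合成ワックス, ミツロウ, パルミチン酸デキストリン"

# Shared labels with a plain split versus the trimmed split
shared_raw = set(raw_1.split(",")) & set(raw_2.split(","))
shared_clean = set(split_ingredients(raw_1)) & set(split_ingredients(raw_2))
print(shared_raw)
print(shared_clean)
```

With the raw tokens only `イソドデカン` is shared, which is consistent with the 0.109109 (= 1/√(12·7)) reported for ids 1 and 2 in the similarity matrix; after trimming, `パルミチン酸デキストリン` is shared as well. The helper also returns an empty list for a missing composition, whereas `fillna('').str.split(',')` yields `['']`, which becomes a label of its own.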

### cosine_similarity

This function computes cosine similarity, so there is no need to implement it by hand.

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html

## Implementation

In [4]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MultiLabelBinarizer

collabotive_df = pd.read_csv('./top_1000.csv')

# Split the ingredients on commas into a list
collabotive_df['composition'] = collabotive_df['composition'].fillna('').str.split(',')

# Use MultiLabelBinarizer to turn the ingredients into binary vectors
mlb = MultiLabelBinarizer()
ingredient_matrix = mlb.fit_transform(collabotive_df['composition'])

# Compute cosine similarity
cosine_sim = cosine_similarity(ingredient_matrix)

# Convert the cosine similarity result into a DataFrame
cosine_sim_df = pd.DataFrame(cosine_sim, index=collabotive_df['id'], columns=collabotive_df['id'])


cosine_sim_df


id,1,2,3,4,5,6,7,8,9,10,...,991,992,993,994,995,996,997,998,999,1000
1,1.000000,0.109109,0.066227,0.122474,0.122474,0.122474,0.122474,0.000000,0.280976,0.280976,...,0.160817,0.160817,0.142374,0.158114,0.094916,0.094916,0.132068,0.193649,0.184637,0.184637
2,0.109109,1.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
3,0.066227,0.000000,1.000000,0.194666,0.194666,0.194666,0.194666,0.040555,0.260513,0.260513,...,0.085203,0.085203,0.226294,0.335083,0.150863,0.150863,0.174928,0.205196,0.097823,0.097823
4,0.122474,0.000000,0.194666,1.000000,1.000000,1.000000,1.000000,0.050000,0.344124,0.344124,...,0.131306,0.131306,0.278994,0.464758,0.185996,0.185996,0.172532,0.158114,0.211058,0.211058
5,0.122474,0.000000,0.194666,1.000000,1.000000,1.000000,1.000000,0.050000,0.344124,0.344124,...,0.131306,0.131306,0.278994,0.464758,0.185996,0.185996,0.172532,0.158114,0.211058,0.211058
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
996,0.094916,0.000000,0.150863,0.185996,0.185996,0.185996,0.185996,0.145310,0.320028,0.320028,...,0.183169,0.183169,0.189189,0.330165,1.000000,1.000000,0.150424,0.110282,0.315450,0.315450
997,0.132068,0.000000,0.174928,0.172532,0.172532,0.172532,0.172532,0.026958,0.296862,0.296862,...,0.084955,0.084955,0.200565,0.222738,0.150424,0.150424,1.000000,0.136399,0.195077,0.195077
998,0.193649,0.000000,0.205196,0.158114,0.158114,0.158114,0.158114,0.000000,0.181369,0.181369,...,0.041523,0.041523,0.257325,0.244949,0.110282,0.110282,0.136399,1.000000,0.047673,0.047673
999,0.184637,0.000000,0.097823,0.211058,0.211058,0.211058,0.211058,0.075378,0.276686,0.276686,...,0.237542,0.237542,0.140200,0.233550,0.315450,0.315450,0.195077,0.047673,1.000000,1.000000


In [5]:

# For checking
ingredient_matrix

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])



# 3. Finding the indices with the highest similarity



In [8]:
import numpy as np

maxindices_list = []
maxvalues_list = []

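for j in range(1, len(cosine_sim_df)+1):
    # Collect each similarity value in column j together with its 0-based row position
    similarities = []
    for i, value in enumerate(cosine_sim_df[j]):
        # NOTE: i is a 0-based position while j is a 1-based id, so `i != j` does not
        # skip the product itself, and rounding error keeps values such as
        # 0.9999999999999998 below 1.0
        if i != j and np.abs(value) < 1.0:
            similarities.append((value, i))

    # Sort by similarity value, highest first
    similarities.sort(reverse=True)

    # Take the top 3 similarity values and their positions
    top3_similarities = similarities[:3]

    # Store the top 3 positions and their values
    maxindices_list.append([index for value, index in top3_similarities])
    maxvalues_list.append([value for value, index in top3_similarities])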
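
The loop above compares the 0-based position `i` with the 1-based id `j` and filters on `value < 1.0`, so a product can still come back as its own nearest neighbour (rounding leaves its self-similarity at values such as 0.9999999999999998), and the stored numbers are row positions rather than ids. In the final table further down, for example, id 2 lists `1, 0, 734`, where position 1 is id 2 itself. The following is a minimal sketch of one alternative, not the method used in this notebook: it masks the diagonal by position and maps positions back to ids. The function name `top_k_similar_ids` and the toy matrix are hypothetical.

```python
import numpy as np
import pandas as pd

def top_k_similar_ids(sim_df, k=3):
    """Return {id: [ids of the k most similar other rows]} for a square similarity frame."""
    sim = sim_df.to_numpy(dtype=float, copy=True)
    # Exclude each product's similarity with itself by position, not by value
    np.fill_diagonal(sim, -np.inf)
    # Never ask for more neighbours than there are other rows
    k = min(k, len(sim_df) - 1)
    # argsort is ascending, so negate for descending order; a stable sort keeps ties in row order
    order = np.argsort(-sim, axis=1, kind="stable")[:, :k]
    ids = sim_df.columns.to_numpy()
    return {row_id: ids[pos].tolist() for row_id, pos in zip(sim_df.index.tolist(), order)}

# Hypothetical 3x3 similarity matrix indexed by product id
toy = pd.DataFrame(
    [[1.0, 0.2, 0.9],
     [0.2, 1.0, 0.5],
     [0.9, 0.5, 1.0]],
    index=[1, 2, 3], columns=[1, 2, 3],
)
result = top_k_similar_ids(toy, k=2)
print(result)
```

Products with identical ingredient lists (similarity 1.0) are kept as neighbours in this sketch; whether they should be filtered out is a separate design decision.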


In [14]:
# Check the data
# top3_similarities
# maxvalues_list
# maxindices_list

[[0.3333333333333334, 0.29462782549439487, 0.28306925853614895],
 [0.9999999999999997, 0.1091089451179962, 0.08247860988423225],
 [0.9999999999999998, 0.4525885718428286, 0.4525885718428286],
 [0.9999999999999999, 0.9999999999999999, 0.9999999999999999],
 [0.9999999999999999, 0.9999999999999999, 0.9999999999999999],
 [0.9999999999999999, 0.9999999999999999, 0.9999999999999999],
 [0.9999999999999999, 0.9999999999999999, 0.9999999999999999],
 [0.9999999999999998, 0.29543947393437575, 0.29543947393437575],
 [0.6457925788671689, 0.6457925788671689, 0.585299229428781],
 [0.6457925788671689, 0.6457925788671689, 0.585299229428781],
 [0.6457925788671689, 0.6457925788671689, 0.585299229428781],
 [0.9999999999999998, 0.9999999999999998, 0.9999999999999998],
 [0.9999999999999998, 0.9999999999999998, 0.9999999999999998],
 [0.9999999999999998, 0.9999999999999998, 0.9999999999999998],
 [0.9999999999999998, 0.9999999999999998, 0.9999999999999998],
 [0.9999999999999998, 0.9999999999999998, 0.999999999

# 4. CSV output

In [10]:
# Function that converts a list into a string
def convert_list_to_string(lst):
    return ', '.join(map(str, lst))

# Read the original DataFrame
original_df = pd.read_csv('top_1000.csv')

# Add maxindices_list as a new column
original_df['highest_similarity_ids'] = maxindices_list

# Convert the lists in the highest_similarity_ids column into strings
original_df['highest_similarity_ids'] = original_df['highest_similarity_ids'].apply(convert_list_to_string)

# Write the new DataFrame to a CSV file
original_df.to_csv('product_with_highest_similarity_ids.csv', index=False)
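
Since the list is stored as a single `", "`-joined string, whatever loads the CSV into the database has to split it again. The following is a minimal sketch of that round trip; `parse_id_string` is a hypothetical helper that does not appear in this notebook, and the sample values are taken from the first row of the output table.

```python
def convert_list_to_string(lst):
    """Join list items into one ', '-separated string (same as the cell above)."""
    return ", ".join(map(str, lst))

def parse_id_string(text):
    """Hypothetical inverse: turn '285, 888, 355' back into [285, 888, 355]."""
    return [int(tok) for tok in text.split(",") if tok.strip()]

encoded = convert_list_to_string([285, 888, 355])
decoded = parse_id_string(encoded)
print(encoded)
print(decoded)
```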



In [11]:
original_df

Unnamed: 0,id,company_name,product_name,composition,highest_similarity_ids
0,1,サンプル堂,ロングラスティングマスカラEX,"イソドデカン,ポリエチレン,パルミチン酸デキストリン,トリメチルシリルプルラン,ビオチノイル...","285, 888, 355"
1,2,サンプル堂,ロング&カラードラッシュ,"イソドデカン, トリメチルシロキシケイ酸, タルク, メチルトリメチコン, 合成ワックス, ...","1, 0, 734"
2,3,サンプル堂,ルースパウダー,"シリカ,ホウケイ酸（Ｃａ／Ａｌ）,ジメチコン,真珠層末,ポリクオタニウム－５１,ワイルドタイ...","2, 297, 296"
3,4,SHISEIDO,フローレス ルミエール ラディアンス パーフェクティング クッション ダブルレフィル ウィズ...,"水,ジメチコン,メトキシケイヒ酸エチルヘキシル,メチルトリメチコン,サリチル酸オクチル,ラウ...","172, 171, 170"
4,5,SHISEIDO,フローレス ルミエール ラディアンス パーフェクティング クッション ダブルレフィル ウィズ...,"水,ジメチコン,メトキシケイヒ酸エチルヘキシル,メチルトリメチコン,サリチル酸オクチル,ラウ...","172, 171, 170"
...,...,...,...,...,...
995,996,SHISEIDO,チークスタイリスト PK272,"タルク,窒化ホウ素,ワセリン,シリカ,酸化亜鉛,焼成セリサイト,リンゴ酸ジイソステアリル,テ...","365, 364, 581"
996,997,SHISEIDO,マツイクガールズラッシュ,"水,酢酸ステアリン酸スクロース,アクリレーツコポリマー,マイクロクリスタリンワックス,ＢＧ,...","940, 668, 667"
997,998,SHISEIDO,エマルジョン,"水,グリセリン,ＢＧ,スクワラン,ジメチコン,テトラエチルヘキサン酸ペンタエリスリチル,水添...","370, 773, 772"
998,999,SHISEIDO,スナイプジェルライナー RD550,"トリメチルシロキシケイ酸,メチルトリメチコン,シクロペンタシロキサン,脂肪酸（Ｃ１８－３６）...","843, 842, 841"
