<a href="https://colab.research.google.com/github/dAn-solution/competition/blob/main/Prob_kiva_009.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Kiva: Predicting Crowdfunding Funding Amounts
- Creates the BERT feature data
- Builds BERT features from the 'DESCRIPTION_TRANSLATED' column
- Adapted from [yshr10ic](https://comp.probspace.com/users/yshr10ic/0)'s [baseline implementation using BERT features](https://comp.probspace.com/topics/yshr10ic-Post9f96eb771afe36fb3bf7)
- Runs on a GPU in about 1 hour 40 minutes


### Mounting Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/'My Drive'

Mounted at /content/drive
/content/drive/My Drive


In [None]:
# Install the required library
!pip install -q transformers > /dev/null

In [None]:
# Change the current directory
import os
os.chdir('/content/drive/My Drive/Probdata/kiva/')
print(os.getcwd())

/content/drive/My Drive/Probdata/kiva


In [None]:
class Config():
    root_path = './'
    input_path = os.path.join(root_path, 'input')
    output_path = os.path.join(root_path, 'output')
    result_path = os.path.join(root_path, 'result')
    bert_model_name = 'bert-base-uncased'
    seed = 42
    debug = False

In [None]:
# create dirs

# result_path is used when saving the features, so create it as well
for dir_path in [Config.output_path, Config.result_path]:
    os.makedirs(dir_path, exist_ok=True)

In [None]:
import numpy as np
import pandas as pd
import torch
from tqdm import tqdm
from sklearn.decomposition import PCA

# NLP
import transformers
import ssl
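# Disables HTTPS certificate verification for downloads (insecure; use only in a trusted environment)
ssl._create_default_https_context = ssl._create_unverified_context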

pd.set_option('display.max_columns', None)
pd.options.display.float_format = '{:.5f}'.format

### Loading the data

In [None]:
train_df = pd.read_csv(os.path.join(Config.input_path, 'train.csv'))
if Config.debug:
    train_df = train_df[:1000]
print(train_df.shape)

(91333, 18)


In [None]:
test_df = pd.read_csv(os.path.join(Config.input_path, 'test.csv'))
if Config.debug:
    test_df = test_df[:1000]
print(test_df.shape)

(91822, 17)


### Vectorizing text with BERT

In [None]:
class BertSequenceVectorizer:
    """
    Text feature extraction with a pretrained BERT model
    """
    def __init__(self, model_name='bert-base-uncased', max_len=128):
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.model_name = model_name
        self.tokenizer = transformers.AutoTokenizer.from_pretrained(self.model_name)
        self.model = transformers.AutoModel.from_pretrained(self.model_name)
        self.model = self.model.to(self.device)
        self.max_len = max_len

    def vectorize(self, sentence: str) -> np.ndarray:
        # Tokenize with [CLS]/[SEP] added; returns a plain list of token IDs
        inp = self.tokenizer.encode(sentence)
        len_inp = len(inp)

        if len_inp >= self.max_len:
            # Truncate to max_len (note: this drops the trailing [SEP] token)
            inputs = inp[:self.max_len]
            masks = [1] * self.max_len
        else:
            # Pad with 0, the [PAD] ID of bert-base-uncased
            inputs = inp + [0] * (self.max_len - len_inp)
            masks = [1] * len_inp + [0] * (self.max_len - len_inp)

        inputs_tensor = torch.tensor([inputs], dtype=torch.long).to(self.device)
        masks_tensor = torch.tensor([masks], dtype=torch.long).to(self.device)

        # Inference only: skip gradient tracking to save memory and time
        with torch.no_grad():
            output = self.model(inputs_tensor, attention_mask=masks_tensor)
        seq_out = output['last_hidden_state']

        # Return the [CLS] token embedding; .cpu() is a no-op for CPU tensors
        return seq_out[0][0].cpu().numpy()
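
The padding and truncation step inside `vectorize` can be checked in isolation, without loading the model. The following is a minimal sketch, not the notebook's code: `pad_or_truncate` is a hypothetical helper name, and the IDs 101 (`[CLS]`), 102 (`[SEP]`) and 0 (`[PAD]`) are those of the `bert-base-uncased` vocabulary. The `keep_sep` flag is an assumption added here to illustrate one way of keeping the trailing `[SEP]`, which the plain slice in `vectorize` discards when a text is truncated.

```python
from typing import List, Tuple

CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # bert-base-uncased special-token IDs


def pad_or_truncate(ids: List[int], max_len: int,
                    keep_sep: bool = False) -> Tuple[List[int], List[int]]:
    """Return (input_ids, attention_mask), both exactly max_len long.

    Hypothetical helper mirroring the list logic in `vectorize`.
    """
    if len(ids) >= max_len:
        inputs = ids[:max_len]  # slicing copies, so the caller's list is untouched
        if keep_sep:
            # Assumption: restore [SEP] as the final token after truncation.
            inputs[-1] = SEP_ID
        masks = [1] * max_len
    else:
        n_pad = max_len - len(ids)
        inputs = ids + [PAD_ID] * n_pad
        # Real tokens get mask 1, padding gets mask 0.
        masks = [1] * len(ids) + [0] * n_pad
    return inputs, masks


# Toy sequence: [CLS] + three arbitrary word-piece IDs + [SEP]
toy_ids = [CLS_ID, 7592, 2088, 999, SEP_ID]

padded_ids, padded_mask = pad_or_truncate(toy_ids, max_len=8)
truncated_ids, truncated_mask = pad_or_truncate(toy_ids, max_len=4)
truncated_sep_ids, _ = pad_or_truncate(toy_ids, max_len=4, keep_sep=True)

print(padded_ids, padded_mask)
print(truncated_ids, truncated_mask)
print(truncated_sep_ids)
```

Whether keeping `[SEP]` changes the downstream score is not something this notebook measures; the sketch only makes the difference between the two truncation styles visible.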

In [None]:
def get_bert_feature(train_input_df, test_input_df):
    vectorizer = BertSequenceVectorizer(model_name=Config.bert_model_name)
    train_texts = train_input_df['DESCRIPTION_TRANSLATED'].fillna('')
    test_texts = test_input_df['DESCRIPTION_TRANSLATED'].fillna('')
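    # One forward pass per text; tqdm shows progress for this long-running step
    train_text_vecs = np.array([vectorizer.vectorize(x) for x in tqdm(train_texts)])
    test_text_vecs = np.array([vectorizer.vectorize(x) for x in tqdm(test_texts)])
    # Fix the seed so the PCA projection is reproducible across runs
    pca = PCA(n_components=64, random_state=Config.seed)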
    train_text_vecs = pca.fit_transform(train_text_vecs)
    test_text_vecs = pca.transform(test_text_vecs)

    train_output_df = pd.DataFrame(train_text_vecs, columns=[f'bert_pca_vecs={i:03}' for i in range(train_text_vecs.shape[1])])
    test_output_df = pd.DataFrame(test_text_vecs, columns=[f'bert_pca_vecs={i:03}' for i in range(test_text_vecs.shape[1])])
    train_output_df.index = train_input_df.index
    test_output_df.index = test_input_df.index
    return train_output_df, test_output_df
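
`get_bert_feature` fits PCA on the training vectors only and reuses that projection for the test vectors, so both sets are expressed on the same 64 axes. A minimal sketch of that pattern on synthetic data follows; the random arrays stand in for the 768-dimensional `[CLS]` embeddings, and the sample counts and `random_state` value are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Synthetic stand-ins for [CLS] embeddings (768 dims for bert-base-uncased).
train_vecs = rng.normal(size=(500, 768))
test_vecs = rng.normal(size=(200, 768))

# Fit the projection on train only, then apply the same projection to test.
pca = PCA(n_components=64, random_state=42)
train_reduced = pca.fit_transform(train_vecs)
test_reduced = pca.transform(test_vecs)

print(train_reduced.shape, test_reduced.shape)
# Fraction of the training variance retained by the 64 components.
print(pca.explained_variance_ratio_.sum())
```

Checking `explained_variance_ratio_` on the real embeddings is one way to judge whether 64 components is a reasonable choice.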

In [None]:
train_bert, test_bert = get_bert_feature(train_df, test_df)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


KeyboardInterrupt: ignored

#### Saving the vectorized data

In [None]:
train_bert.to_csv(os.path.join(Config.result_path, 'train_bert_009.csv'))
test_bert.to_csv(os.path.join(Config.result_path, 'test_bert_009.csv'))
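
Because `to_csv` is called without `index=False`, the row index is written as the first, unnamed column. A later notebook would likely need `index_col=0` to recover the original index when reading the files back. Here is a minimal round-trip sketch, assuming a temporary directory and a toy frame in place of the real `result` folder and BERT features:

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for train_bert; column names follow the notebook's pattern.
toy = pd.DataFrame(
    [[0.1, -0.2], [0.3, 0.4]],
    columns=[f'bert_pca_vecs={i:03}' for i in range(2)],
    index=[10, 11],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_bert.csv')
    toy.to_csv(path)                           # index goes into the first column
    restored = pd.read_csv(path, index_col=0)  # read it back as the index

print(restored)
```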