## 코드 설명에 앞서 (느낀점)
1. 케글에 유사대회에서 코드를 긁어와 극히 일부분만을 수정하였습니다. (https://www.kaggle.com/c/spooky-author-identification)

2. nn, cnn, gru 각각의 부분에서 최고의 정확도를 낼 수 있게 수정해주었더니 특정 feature가 과도하게 weight가 들어가면서 오히려 전체 stacking enesmble에서 더 저조한 성적을 보이는 현상이 일어났습니다. 결국엔 어쩔수 없이 수정이 없는 원본이 최종 제출점수가 되어버렸습니다.

3. GPU 사용을 위해 google colab을 사용하였습니다. (하지만 RNN부분에서는 큰 속도향상을 기대할 순 없었습니다)

4. 케글 유사대회에 있는 거의 대부분의 코드를 리뷰하고 실제로 돌려보면서 느낀 것은, 일정 이상의 점수를 내기 위해선 매우 많은 모델들의 앙상블은 필수였으며, 더욱더 높은 점수를 내기 위해서는 해당 데이터에 맞는 text 손질이였습니다. (NLP 1인자 BERT 제외)

5. 다른 중요한 공모전들과 일정이 겹치다보니 많은 시간을 투자할 수 없어서 데이터와 충분히 친해지지 못한 것이 아쉽고, 실제로 코드에도 그 모습이 고스란히 보이는 듯 합니다. 최적의 코드를 공유해드리지 못하여 아쉬움이 많이 남습니다.

## 코드 공유에 앞서 (코드목차)
1. 데이터 전처리

2. MultinomialNB을 이용한 feature 생성

3. CNN을 이용한 feature 생성

4. GRU를 이용한 feature 생성

5. NN을 이용한 feature 생성

6. 최종 stacking ensemble


In [1]:
# 구글 드라이브에 마운트합니다.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
#경로 설정
import os
os.chdir('/content/drive/My Drive/Colab Notebooks/소설작가분류AI경진대회')

# 데이터 전처리

In [3]:
# libraries
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD

train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test_x.csv")

# 단어수(중복 포함)
train_df["num_words"] = train_df["text"].apply(lambda x: len(str(x).split()))
test_df["num_words"] = test_df["text"].apply(lambda x: len(str(x).split()))

# 단어수(중복 제거)
train_df["num_unique_words"] = train_df["text"].apply(lambda x: len(set(str(x).split())))
test_df["num_unique_words"] = test_df["text"].apply(lambda x: len(set(str(x).split())))

# 글자수
train_df["num_chars"] = train_df["text"].apply(lambda x: len(str(x)))
test_df["num_chars"] = test_df["text"].apply(lambda x: len(str(x)))

# stopwords : nltk의 stopwords보다 월등한 성능을 보여줍니다
stopwords = [
    "a", "about", "above", "across", "after", "afterwards", "again", "against",
    "all", "almost", "alone", "along", "already", "also", "although", "always",
    "am", "among", "amongst", "amoungst", "amount", "an", "and", "another",
    "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are",
    "around", "as", "at", "back", "be", "became", "because", "become",
    "becomes", "becoming", "been", "before", "beforehand", "behind", "being",
    "below", "beside", "besides", "between", "beyond", "bill", "both",
    "bottom", "but", "by", "call", "can", "cannot", "cant", "co", "con",
    "could", "couldnt", "cry", "de", "describe", "detail", "do", "done",
    "down", "due", "during", "each", "eg", "eight", "either", "eleven", "else",
    "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone",
    "everything", "everywhere", "except", "few", "fifteen", "fifty", "fill",
    "find", "fire", "first", "five", "for", "former", "formerly", "forty",
    "found", "four", "from", "front", "full", "further", "get", "give", "go",
    "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter",
    "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his",
    "how", "however", "hundred", "i", "ie", "if", "in", "inc", "indeed",
    "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter",
    "latterly", "least", "less", "ltd", "made", "many", "may", "me",
    "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly",
    "move", "much", "must", "my", "myself", "name", "namely", "neither",
    "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone",
    "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on",
    "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our",
    "ours", "ourselves", "out", "over", "own", "part", "per", "perhaps",
    "please", "put", "rather", "re", "same", "see", "seem", "seemed",
    "seeming", "seems", "serious", "several", "she", "should", "show", "side",
    "since", "sincere", "six", "sixty", "so", "some", "somehow", "someone",
    "something", "sometime", "sometimes", "somewhere", "still", "such",
    "system", "take", "ten", "than", "that", "the", "their", "them",
    "themselves", "then", "thence", "there", "thereafter", "thereby",
    "therefore", "therein", "thereupon", "these", "they", "thick", "thin",
    "third", "this", "those", "though", "three", "through", "throughout",
    "thru", "thus", "to", "together", "too", "top", "toward", "towards",
    "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us",
    "very", "via", "was", "we", "well", "were", "what", "whatever", "when",
    "whence", "whenever", "where", "whereafter", "whereas", "whereby",
    "wherein", "whereupon", "wherever", "whether", "which", "while", "whither",
    "who", "whoever", "whole", "whom", "whose", "why", "will", "with",
    "within", "without", "would", "yet", "you", "your", "yours", "yourself",
    "yourselves"]

train_df["num_stopwords"] = train_df["text"].apply(lambda x: len([w for w in str(x).lower().split() if w in stopwords]))
test_df["num_stopwords"] = test_df["text"].apply(lambda x: len([w for w in str(x).lower().split() if w in stopwords]))

# punctuation의 개수
import string
train_df["num_punctuations"] =train_df['text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )
test_df["num_punctuations"] =test_df['text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )

# 대문자로만 이루어진 단어 개수
train_df["num_words_upper"] = train_df["text"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))
test_df["num_words_upper"] = test_df["text"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))

# 첫글자가 대문자인 단어 개수
train_df["num_words_title"] = train_df["text"].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))
test_df["num_words_title"] = test_df["text"].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))

# text 평균 길이
train_df["mean_word_len"] = train_df["text"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))
test_df["mean_word_len"] = test_df["text"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))

In [4]:
train_df.head()

Unnamed: 0,index,text,author,num_words,num_unique_words,num_chars,num_stopwords,num_punctuations,num_words_upper,num_words_title,mean_word_len
0,0,"He was almost choking. There was so much, so m...",3,46,39,240,27,8,0,4,4.23913
1,1,"“Your sister asked for it, I suppose?”",2,7,7,38,2,2,1,2,4.571429
2,2,"She was engaged one day as she walked, in per...",1,57,50,320,28,9,0,4,4.614035
3,3,"The captain was in the porch, keeping himself ...",4,58,49,319,27,18,0,7,4.517241
4,4,"“Have mercy, gentlemen!” odin flung up his han...",3,39,36,228,16,13,0,4,4.871795


In [5]:
# Clean text
from tqdm import tqdm
tqdm.pandas()
punctuation = ['.', '..', '...', ',', ':', ';', '-', '*', '"', '!', '?']
def clean_text(x):
    x.lower()
    for p in punctuation:
        x.replace(p, '')
    return x

train_df['text_cleaned'] = train_df['text'].apply(lambda x: clean_text(x))
test_df['text_cleaned'] = test_df['text'].apply(lambda x: clean_text(x))

def extract_features(df):
    df['len'] = df['text'].apply(lambda x: len(x))
    df['n_words'] = df['text'].apply(lambda x: len(x.split(' ')))
    df['n_.'] = df['text'].str.count('\.')
    df['n_...'] = df['text'].str.count('\...')
    df['n_,'] = df['text'].str.count('\,')
    df['n_:'] = df['text'].str.count('\:')
    df['n_;'] = df['text'].str.count('\;')
    df['n_-'] = df['text'].str.count('\-')
    df['n_?'] = df['text'].str.count('\?')
    df['n_!'] = df['text'].str.count('\!')
    df['n_\''] = df['text'].str.count('\'')
    df['n_"'] = df['text'].str.count('\"')

    # 문장 첫단어 개수
    df['n_The '] = df['text'].str.count('The ')
    df['n_I '] = df['text'].str.count('I ')
    df['n_It '] = df['text'].str.count('It ')
    df['n_He '] = df['text'].str.count('He ')
    df['n_Me '] = df['text'].str.count('Me ')
    df['n_She '] = df['text'].str.count('She ')
    df['n_We '] = df['text'].str.count('We ')
    df['n_They '] = df['text'].str.count('They ')
    df['n_You '] = df['text'].str.count('You ')
    df['n_the'] = df['text_cleaned'].str.count('the ')
    df['n_ a '] = df['text_cleaned'].str.count(' a ')
    df['n_appear'] = df['text_cleaned'].str.count('appear')
    df['n_little'] = df['text_cleaned'].str.count('little')
    df['n_was '] = df['text_cleaned'].str.count('was ')
    df['n_one '] = df['text_cleaned'].str.count('one ')
    df['n_two '] = df['text_cleaned'].str.count('two ')
    df['n_three '] = df['text_cleaned'].str.count('three ')
    df['n_ten '] = df['text_cleaned'].str.count('ten ')
    df['n_is '] = df['text_cleaned'].str.count('is ')
    df['n_are '] = df['text_cleaned'].str.count('are ')
    df['n_ed'] = df['text_cleaned'].str.count('ed ')
    df['n_however'] = df['text_cleaned'].str.count('however')
    df['n_ to '] = df['text_cleaned'].str.count(' to ')
    df['n_into'] = df['text_cleaned'].str.count('into')
    df['n_about '] = df['text_cleaned'].str.count('about ')
    df['n_th'] = df['text_cleaned'].str.count('th')
    df['n_er'] = df['text_cleaned'].str.count('er')
    df['n_ex'] = df['text_cleaned'].str.count('ex')
    df['n_an '] = df['text_cleaned'].str.count('an ')
    df['n_ground'] = df['text_cleaned'].str.count('ground')
    df['n_any'] = df['text_cleaned'].str.count('any')
    df['n_silence'] = df['text_cleaned'].str.count('silence')
    df['n_wall'] = df['text_cleaned'].str.count('wall')

    df.drop(['text_cleaned'], axis=1, inplace=True)

print('Processing train...')
extract_features(train_df)
print('Processing test...')
extract_features(test_df)

Processing train...
Processing test...


pos_tag와 ne_chunk를 이용한 tokenization. 자세한 내용은 https://statkclee.github.io/nlp2/nlp-ner-python.html 에 가면 확인 할 수 있다.

In [6]:
import nltk
nltk.download('words')
nltk.download('punkt')
nltk.download('maxent_ne_chunker')
nltk.download('averaged_perceptron_tagger')

def pos_tag_text(s):
    sents = nltk.sent_tokenize(s)
    res = []
    for sent in sents:
        words = nltk.word_tokenize(sent)
        tag_res = [a[1] for a in nltk.pos_tag(words)]
        res.append(' '.join(tag_res))
    return '. '.join(res)

def ne_text(s):
    sents = nltk.sent_tokenize(s)
    res = []
    for sent in sents:
        words = nltk.word_tokenize(sent)
        tag_res = nltk.pos_tag(words)
        ne_tree = nltk.ne_chunk(tag_res)
        list_res = nltk.tree2conlltags(ne_tree)
        ne_res = [a[2] for a in list_res]
        res.append(' '.join(ne_res))
    return '. '.join(res)

train_df['tag_txt'] = train_df["text"].apply(pos_tag_text)
train_df['ne_txt'] = train_df["text"].apply(ne_text)
test_df['tag_txt'] = test_df["text"].apply(pos_tag_text)
test_df['ne_txt'] = test_df["text"].apply(ne_text)

c_vec3 = CountVectorizer(lowercase=False)
c_vec3.fit(train_df['tag_txt'].values.tolist())
train_cvec3 = c_vec3.transform(train_df['tag_txt'].values.tolist()).toarray()
test_cvec3 = c_vec3.transform(test_df['tag_txt'].values.tolist()).toarray()
print(train_cvec3.shape,test_cvec3.shape)

c_vec4 = CountVectorizer(lowercase=False)
c_vec4.fit(train_df['ne_txt'].values.tolist())
train_cvec4 = c_vec4.transform(train_df['ne_txt'].values.tolist()).toarray()
test_cvec4 = c_vec4.transform(test_df['ne_txt'].values.tolist()).toarray()
print(train_cvec4.shape,test_cvec4.shape)

tf_vec5 = TfidfVectorizer(lowercase=False)
tf_vec5.fit(train_df['tag_txt'].values.tolist())
train_tf5 = tf_vec5.transform(train_df['tag_txt'].values.tolist()).toarray()
test_tf5 = tf_vec5.transform(test_df['tag_txt'].values.tolist()).toarray()
print(train_tf5.shape,test_tf5.shape)

tf_vec6 = TfidfVectorizer(lowercase=False)
tf_vec6.fit(train_df['ne_txt'].values.tolist())
train_tf6 = tf_vec6.transform(train_df['ne_txt'].values.tolist()).toarray()
test_tf6 = tf_vec6.transform(test_df['ne_txt'].values.tolist()).toarray()
print(train_tf6.shape,test_tf6.shape)

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
(54879, 34) (19617, 34)
(54879, 6) (19617, 6)
(54879, 34) (19617, 34)
(54879, 6) (19617, 6)


# MultinomialNB을 이용한 feature 생성

In [7]:
train_Y = train_df['author']
train_id = train_df['index'].values
test_id = test_df['index'].values

# tfidf와 svd 합
tfidf_vec = TfidfVectorizer(ngram_range=(1,3), max_df=0.8,lowercase=False, sublinear_tf=True)
full_tfidf = tfidf_vec.fit_transform(train_df['text'].values.tolist())
train_tfidf = tfidf_vec.transform(train_df['text'].values.tolist())
test_tfidf = tfidf_vec.transform(test_df['text'].values.tolist())
print(train_tfidf.shape,test_tfidf.shape)

# svd1
n_comp = 30
svd_obj = TruncatedSVD(n_components=n_comp, algorithm='arpack')
svd_obj.fit(full_tfidf)
train_svd = pd.DataFrame(svd_obj.transform(train_tfidf))
test_svd = pd.DataFrame(svd_obj.transform(test_tfidf))
print(train_svd.shape,test_svd.shape)

# tfidf char
tfidf_vec2 = TfidfVectorizer(ngram_range=(3,7), analyzer='char',max_df=0.8, sublinear_tf=True)
full_tfidf2 = tfidf_vec2.fit_transform(train_df['text'].values.tolist())
train_tfidf2 = tfidf_vec2.transform(train_df['text'].values.tolist())
test_tfidf2 = tfidf_vec2.transform(test_df['text'].values.tolist())
print(train_tfidf2.shape,test_tfidf2.shape)

# svd2
n_comp = 30
svd_obj = TruncatedSVD(n_components=n_comp, algorithm='arpack')
svd_obj.fit(full_tfidf2)
train_svd2 = pd.DataFrame(svd_obj.transform(train_tfidf2))
test_svd2 = pd.DataFrame(svd_obj.transform(test_tfidf2))
print(train_svd2.shape,test_svd2.shape)


# cnt vec
c_vec = CountVectorizer(ngram_range=(1,3),max_df=0.8, lowercase=False)
c_vec.fit(train_df['text'].values.tolist())
train_cvec = c_vec.transform(train_df['text'].values.tolist())
test_cvec = c_vec.transform(test_df['text'].values.tolist())
print(train_cvec.shape,test_cvec.shape)

# cnt char
c_vec2 = CountVectorizer(ngram_range=(3,7), analyzer='char',max_df=0.8)
c_vec2.fit(train_df['text'].values.tolist())
train_cvec2 = c_vec2.transform(train_df['text'].values.tolist())
test_cvec2 = c_vec2.transform(test_df['text'].values.tolist())
print(train_cvec2.shape,test_cvec2.shape)

from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import log_loss

feat_cnt = 5

def gen_nb_feats(rnd=1):
    help_tfidf_train,help_tfidf_test = np.zeros((54879,5)),np.zeros((19617,5))
    help_tfidf_train2,help_tfidf_test2 = np.zeros((54879,5)),np.zeros((19617,5))
    help_cnt1_train,help_cnt1_test = np.zeros((54879,5)),np.zeros((19617,5))
    help_cnt2_train,help_cnt2_test = np.zeros((54879,5)),np.zeros((19617,5))

    skf = StratifiedKFold(n_splits=feat_cnt, shuffle=True, random_state=23*rnd)
    for train_index, test_index in skf.split(train_tfidf,train_Y):
        # tfidf to nb
        X_train, X_test = train_tfidf[train_index], train_tfidf[test_index]
        y_train, y_test = train_Y[train_index], train_Y[test_index]
        tmp_model = MultinomialNB(alpha=0.025,fit_prior=False)
        tmp_model.fit(X_train,y_train)
        tmp_train_feat = tmp_model.predict_proba(X_test)
        tmp_test_feat = tmp_model.predict_proba(test_tfidf)
        help_tfidf_train[test_index] = tmp_train_feat
        help_tfidf_test += tmp_test_feat/feat_cnt

        # tfidf to nb
        X_train, X_test = train_tfidf2[train_index], train_tfidf2[test_index]
        tmp_model = MultinomialNB(0.025,fit_prior=False)
        tmp_model.fit(X_train,y_train)
        tmp_train_feat = tmp_model.predict_proba(X_test)
        tmp_test_feat = tmp_model.predict_proba(test_tfidf2)
        help_tfidf_train2[test_index] = tmp_train_feat
        help_tfidf_test2 += tmp_test_feat/feat_cnt

        # count vec to nb
        X_train, X_test = train_cvec[train_index], train_cvec[test_index]
        tmp_model = MultinomialNB(0.025,fit_prior=False)
        tmp_model.fit(X_train,y_train)
        tmp_train_feat = tmp_model.predict_proba(X_test)
        tmp_test_feat = tmp_model.predict_proba(test_cvec)
        help_cnt1_train[test_index] = tmp_train_feat
        help_cnt1_test += tmp_test_feat/feat_cnt

        # count vec2 to nb 
        X_train, X_test = train_cvec2[train_index], train_cvec2[test_index]
        tmp_model = MultinomialNB(0.025,fit_prior=False)
        tmp_model.fit(X_train,y_train)
        tmp_train_feat = tmp_model.predict_proba(X_test)
        tmp_test_feat = tmp_model.predict_proba(test_cvec2)
        help_cnt2_train[test_index] = tmp_train_feat
        help_cnt2_test += tmp_test_feat/feat_cnt
    
    help_train_feat = np.hstack([help_tfidf_train,help_tfidf_train2,help_cnt1_train,help_cnt2_train])
    help_test_feat = np.hstack([help_tfidf_test,help_tfidf_test2,help_cnt1_test,help_cnt2_test])

    return help_train_feat,help_test_feat
    
help_train_feat,help_test_feat = gen_nb_feats(1)
help_train_feat2,help_test_feat2 = gen_nb_feats(2)
help_train_feat3,help_test_feat3 = gen_nb_feats(3)

(54879, 2137725) (19617, 2137725)
(54879, 30) (19617, 30)
(54879, 2485843) (19617, 2485843)
(54879, 30) (19617, 30)
(54879, 2137725) (19617, 2137725)
(54879, 2485843) (19617, 2485843)


In [8]:
# libraries for Deep Learning
from keras.layers import Embedding, GRU, Dense, Flatten, Dropout
from keras.models import Sequential, load_model
from keras.callbacks import ModelCheckpoint, EarlyStopping
from keras.layers import Conv1D, GlobalMaxPooling1D, GlobalAveragePooling1D
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from sklearn import preprocessing
from sklearn.metrics import log_loss
import gc

# CNN을 이용한 feature 생성

In [9]:
def get_cnn_feats(rnd=1):
    train_pred, test_pred = np.zeros((54879,5)),np.zeros((19617,5))
    best_val_train_pred, best_val_test_pred = np.zeros((54879,5)),np.zeros((19617,5))
    FEAT_CNT = 5
    NUM_WORDS = 30000
    N = 10
    MAX_LEN = 150
    NUM_CLASSES = 5
    MODEL_P = 'nn_model.h5'
    
    tmp_X = train_df['text']
    tmp_Y = train_df['author']
    tmp_X_test = test_df['text']
    
    tokenizer = Tokenizer(num_words=NUM_WORDS)
    tokenizer.fit_on_texts(tmp_X)

    ttrain_x = tokenizer.texts_to_sequences(tmp_X)
    ttrain_x = pad_sequences(ttrain_x, maxlen=MAX_LEN)
    
    ttest_x = tokenizer.texts_to_sequences(tmp_X_test)
    ttest_x = pad_sequences(ttest_x, maxlen=MAX_LEN)

    lb = preprocessing.LabelBinarizer()
    lb.fit(tmp_Y)

    ttrain_y = lb.transform(tmp_Y)
    skf = StratifiedKFold(n_splits=FEAT_CNT, shuffle=True, random_state=233*rnd)
    for train_index, test_index in skf.split(train_tfidf,tmp_Y):
        model = Sequential()
        model.add(Embedding(NUM_WORDS, N, input_length=MAX_LEN))
        model.add(Conv1D(16,
                         3,
                         padding='valid',
                         activation='relu',
                         strides=1))
        model.add(GlobalAveragePooling1D())
        model.add(Dense(16, activation='relu'))
        model.add(Dropout(0.2))
        model.add(Dense(NUM_CLASSES, activation='softmax'))

        model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

        mc = ModelCheckpoint(filepath=MODEL_P, monitor='val_loss', save_best_only=True, verbose=1)
        es = EarlyStopping(monitor='val_loss', patience=2)

        np.random.seed(42)
        model.fit(ttrain_x[train_index], ttrain_y[train_index], 
                  validation_split=0.1,
                  batch_size=64, epochs=15, 
                  verbose=1,
                  callbacks=[mc,es],
                  shuffle=False
                 )
 
        # feature 생성 1
        train_pred[test_index] = model.predict(ttrain_x[test_index])
        test_pred += model.predict(ttest_x)/feat_cnt
        
        # feature 생성 2
        model = load_model(MODEL_P)
        best_val_train_pred[test_index] = model.predict(ttrain_x[test_index])
        best_val_test_pred += model.predict(ttest_x)/feat_cnt
        
        del model
        gc.collect()
        print('------------------')
        
    return train_pred,test_pred,best_val_train_pred,best_val_test_pred

get_cnn_feats 함수에서 인자는 단순히 seed값 변경의 의미만을 가지기 때문에,
굳이 3번이나 반복해야하나 싶어 feature 생성을 한번만 하였더니 이 또한 정확도 하락에
기여하였습니다. 이해할순 없지만 이러한 앙상블 역시 정확도에 기여하는 것을 볼 수 있습니다.

In [10]:
cnn_train1,cnn_test1,cnn_train2,cnn_test2 = get_cnn_feats(1)
cnn_train3,cnn_test3,cnn_train4,cnn_test4 = get_cnn_feats(2)
cnn_train5,cnn_test5,cnn_train6,cnn_test6 = get_cnn_feats(3)

Epoch 1/15
Epoch 00001: val_loss improved from inf to 1.30919, saving model to nn_model.h5
Epoch 2/15
Epoch 00002: val_loss improved from 1.30919 to 1.08020, saving model to nn_model.h5
Epoch 3/15
Epoch 00003: val_loss improved from 1.08020 to 0.94385, saving model to nn_model.h5
Epoch 4/15
Epoch 00004: val_loss improved from 0.94385 to 0.89431, saving model to nn_model.h5
Epoch 5/15
Epoch 00005: val_loss improved from 0.89431 to 0.87432, saving model to nn_model.h5
Epoch 6/15
Epoch 00006: val_loss improved from 0.87432 to 0.87427, saving model to nn_model.h5
Epoch 7/15
Epoch 00007: val_loss did not improve from 0.87427
Epoch 8/15
Epoch 00008: val_loss did not improve from 0.87427
------------------
Epoch 1/15
Epoch 00001: val_loss improved from inf to 1.27793, saving model to nn_model.h5
Epoch 2/15
Epoch 00002: val_loss improved from 1.27793 to 1.03873, saving model to nn_model.h5
Epoch 3/15
Epoch 00003: val_loss improved from 1.03873 to 0.96008, saving model to nn_model.h5
Epoch 4/15

# GRU를 이용한 feature 생성

In [11]:
def get_gru_feats(rnd=1):
    train_pred, test_pred = np.zeros((54879,5)),np.zeros((19617,5))
    best_val_train_pred, best_val_test_pred = np.zeros((54879,5)),np.zeros((19617,5))
    FEAT_CNT = 5
    NUM_WORDS = 16000
    N = 12
    MAX_LEN = 300
    NUM_CLASSES = 5
    MODEL_P = 'nn_model.h5'
    
    tmp_X = train_df['text']
    tmp_Y = train_df['author']
    tmp_X_test = test_df['text']
    
    tokenizer = Tokenizer(num_words=NUM_WORDS)
    tokenizer.fit_on_texts(tmp_X)

    ttrain_x = tokenizer.texts_to_sequences(tmp_X)
    ttrain_x = pad_sequences(ttrain_x, maxlen=MAX_LEN)
    
    ttest_x = tokenizer.texts_to_sequences(tmp_X_test)
    ttest_x = pad_sequences(ttest_x, maxlen=MAX_LEN)

    lb = preprocessing.LabelBinarizer()
    lb.fit(tmp_Y)

    ttrain_y = lb.transform(tmp_Y)
    skf = StratifiedKFold(n_splits=FEAT_CNT, shuffle=True, random_state=2333*rnd)
    for train_index, test_index in skf.split(ttrain_x,tmp_Y):
        model = Sequential()
        model.add(Embedding(NUM_WORDS, N, input_length=MAX_LEN))
        model.add(GRU(N, dropout=0.2, recurrent_dropout=0.2, return_sequences=True))
        model.add(Flatten())
        model.add(Dense(128, activation='relu'))
        model.add(Dropout(0.2))
        model.add(Dense(NUM_CLASSES, activation='softmax'))
        model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

        mc = ModelCheckpoint(filepath=MODEL_P, monitor='val_loss', save_best_only=True, verbose=1)
        es=EarlyStopping(monitor='val_loss', patience=2)

        np.random.seed(42)
        model.fit(ttrain_x[train_index], ttrain_y[train_index], 
                  validation_split=0.1,
                  batch_size=256, epochs=10, 
                  verbose=1,
                  callbacks=[mc,es],
                  shuffle=False
                 )
        
        # feature 생성 1
        train_pred[test_index] = model.predict(ttrain_x[test_index])
        test_pred += model.predict(ttest_x)/feat_cnt
        
        # feature 생성 2
        model = load_model(MODEL_P)
        best_val_train_pred[test_index] = model.predict(ttrain_x[test_index])
        best_val_test_pred += model.predict(ttest_x)/feat_cnt
        
        del model
        gc.collect()
        print('------------------')
        
    return train_pred,test_pred,best_val_train_pred,best_val_test_pred

뒤늦게 안 사실인데 해당 데이터에서는 GRU보다 LSTM이 마지막 앙상블에서 더 성능이 좋습니다.

In [12]:
gru_train1,gru_test1,gru_train2,gru_test2 = get_gru_feats(1)

Epoch 1/10
Epoch 00001: val_loss improved from inf to 1.00951, saving model to nn_model.h5
Epoch 2/10
Epoch 00002: val_loss improved from 1.00951 to 0.73846, saving model to nn_model.h5
Epoch 3/10
Epoch 00003: val_loss improved from 0.73846 to 0.68376, saving model to nn_model.h5
Epoch 4/10
Epoch 00004: val_loss improved from 0.68376 to 0.66434, saving model to nn_model.h5
Epoch 5/10
Epoch 00005: val_loss improved from 0.66434 to 0.65235, saving model to nn_model.h5
Epoch 6/10
Epoch 00006: val_loss did not improve from 0.65235
Epoch 7/10
Epoch 00007: val_loss did not improve from 0.65235
------------------
Epoch 1/10
Epoch 00001: val_loss improved from inf to 0.99421, saving model to nn_model.h5
Epoch 2/10
Epoch 00002: val_loss improved from 0.99421 to 0.77778, saving model to nn_model.h5
Epoch 3/10
Epoch 00003: val_loss improved from 0.77778 to 0.70057, saving model to nn_model.h5
Epoch 4/10
Epoch 00004: val_loss improved from 0.70057 to 0.68212, saving model to nn_model.h5
Epoch 5/10

# NN을 이용한 feature 생성

In [13]:
# NN은 (https://www.kaggle.com/nzw0301/simple-keras-fasttext-val-loss-0-31)에서 최고의 정확도를 냅니다.
# 하지만 이 코드에 결합할 시 더 나쁜 결과를 내어 적용하진 않았습니다.

def get_nn_feats(rnd=1):
    train_pred, test_pred = np.zeros((54879,5)),np.zeros((19617,5))
    best_val_train_pred, best_val_test_pred = np.zeros((54879,5)),np.zeros((19617,5))
    FEAT_CNT = 10
    NUM_WORDS = 30000
    N = 10
    MAX_LEN = 100
    NUM_CLASSES = 5
    MODEL_P = 'nn_model.h5'
    
    tmp_X = train_df['text']
    tmp_Y = train_df['author']
    tmp_X_test = test_df['text']
    
    tokenizer = Tokenizer(num_words=NUM_WORDS)
    tokenizer.fit_on_texts(tmp_X)

    ttrain_x = tokenizer.texts_to_sequences(tmp_X)
    ttrain_x = pad_sequences(ttrain_x, maxlen=MAX_LEN)
    
    ttest_x = tokenizer.texts_to_sequences(tmp_X_test)
    ttest_x = pad_sequences(ttest_x, maxlen=MAX_LEN)

    lb = preprocessing.LabelBinarizer()
    lb.fit(tmp_Y)

    ttrain_y = lb.transform(tmp_Y)
    skf = StratifiedKFold(n_splits=FEAT_CNT, shuffle=True, random_state=233*rnd)
    for train_index, test_index in skf.split(ttrain_x,tmp_Y):
        model = Sequential()
        model.add(Embedding(NUM_WORDS, N, input_length=MAX_LEN))
        model.add(GlobalAveragePooling1D())
        model.add(Dense(30, activation='relu'))
        model.add(Dropout(0.1))
        model.add(Dense(NUM_CLASSES, activation='softmax'))

        model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

        mc = ModelCheckpoint(filepath=MODEL_P, monitor='val_loss', save_best_only=True, verbose=1)
        es=EarlyStopping(monitor='val_loss', patience=2)

        np.random.seed(42)
        model.fit(ttrain_x[train_index], ttrain_y[train_index], 
                  validation_split=0.3,
                  batch_size=64, epochs=20, 
                  verbose=1,
                  callbacks=[mc,es],
                  shuffle=False
                 )
 
        # feature 생성 1
        train_pred[test_index] = model.predict(ttrain_x[test_index])
        test_pred += model.predict(ttest_x)/feat_cnt
        
        # feature 생성 2
        model = load_model(MODEL_P)
        best_val_train_pred[test_index] = model.predict(ttrain_x[test_index])
        best_val_test_pred += model.predict(ttest_x)/feat_cnt
        
        del model
        gc.collect()
        print('------------------')
        
    return train_pred,test_pred,best_val_train_pred,best_val_test_pred

nn_train1,nn_test1,nn_train2,nn_test2 = get_nn_feats(1)

Epoch 1/20
Epoch 00001: val_loss improved from inf to 1.36640, saving model to nn_model.h5
Epoch 2/20
Epoch 00002: val_loss improved from 1.36640 to 1.15079, saving model to nn_model.h5
Epoch 3/20
Epoch 00003: val_loss improved from 1.15079 to 1.00556, saving model to nn_model.h5
Epoch 4/20
Epoch 00004: val_loss improved from 1.00556 to 0.90711, saving model to nn_model.h5
Epoch 5/20
Epoch 00005: val_loss improved from 0.90711 to 0.83678, saving model to nn_model.h5
Epoch 6/20
Epoch 00006: val_loss improved from 0.83678 to 0.79010, saving model to nn_model.h5
Epoch 7/20
Epoch 00007: val_loss improved from 0.79010 to 0.75538, saving model to nn_model.h5
Epoch 8/20
Epoch 00008: val_loss improved from 0.75538 to 0.73594, saving model to nn_model.h5
Epoch 9/20
Epoch 00009: val_loss improved from 0.73594 to 0.72527, saving model to nn_model.h5
Epoch 10/20
Epoch 00010: val_loss improved from 0.72527 to 0.72456, saving model to nn_model.h5
Epoch 11/20
Epoch 00011: val_loss improved from 0.724

# 최종 stacking ensemble

In [14]:
all_nn_train = np.hstack([gru_train1, gru_train2, 
                        cnn_train1, cnn_train2,cnn_train3, cnn_train4,cnn_train5, cnn_train6,
                        nn_train1,nn_train2
                        ])
all_nn_test = np.hstack([gru_test1, gru_test2, 
                        cnn_test1, cnn_test2,cnn_test3, cnn_test4,cnn_test5, cnn_test6,
                        nn_test1,nn_test2 
                        ])

In [15]:
# 최종 앙상블 데이터
cols_to_drop = ['index', 'text','tag_txt','ne_txt']
train_X = train_df.drop(cols_to_drop+['author'], axis=1).values
test_X = test_df.drop(cols_to_drop, axis=1).values
train_X = np.hstack([train_X,train_svd,train_svd2,train_cvec3,train_cvec4,train_tf5,train_tf6])
test_X = np.hstack([test_X,test_svd,test_svd2,test_cvec3,test_cvec4,test_tf5,test_tf6])

f_train_X = np.hstack([train_X, help_train_feat,help_train_feat2,help_train_feat3,all_nn_train])
f_train_X = np.round(f_train_X,4)
f_test_X = np.hstack([test_X, help_test_feat,help_test_feat2,help_test_feat3,all_nn_test])
f_test_X = np.round(f_test_X,4)
print(f_train_X.shape, f_test_X.shape)

(54879, 303) (19617, 303)


In [16]:
# 최종 앙상블입니다.
def cv_test(k_cnt=3, s_flag = False):
    rnd = 42
    if s_flag:
        kf = StratifiedKFold(n_splits=k_cnt, shuffle=True, random_state=rnd)
    else:
        kf = KFold(n_splits=k_cnt, shuffle=True, random_state=rnd)
    test_pred = None
    weighted_test_pred = None
    org_train_pred = None
    avg_k_score = 0
    reverse_score = 0
    best_loss = 100
    best_single_pred = None
    for train_index, test_index in kf.split(f_train_X,train_Y):
        X_train, X_test = f_train_X[train_index], f_train_X[test_index]
        y_train, y_test = train_Y[train_index], train_Y[test_index]
        params = {
                'colsample_bytree': 0.7,
                'subsample': 0.8,
                'eta': 0.04,
                'max_depth': 3,
                'eval_metric':'mlogloss',
                'objective':'multi:softprob',
                'num_class':5,
                'tree_method':'gpu_hist'
        }
        
        d_train = xgb.DMatrix(X_train, y_train)
        d_valid = xgb.DMatrix(X_test, y_test)
        d_test = xgb.DMatrix(f_test_X)
        
        watchlist = [(d_train, 'train'), (d_valid, 'valid')]
        m = xgb.train(params, d_train, 2000, watchlist, 
                        early_stopping_rounds=50,
                        verbose_eval=200)
        
        train_pred = m.predict(d_train)
        valid_pred = m.predict(d_valid)
        tmp_train_pred = m.predict(xgb.DMatrix(f_train_X))
        
        train_score = log_loss(y_train,train_pred)
        valid_score = log_loss(y_test,valid_pred)
        print('train log loss',train_score,'valid log loss',valid_score)
        avg_k_score += valid_score
        rev_valid_score = 1.0/valid_score
        reverse_score += rev_valid_score
        print('rev',rev_valid_score)
        
        if test_pred is None:
            test_pred = m.predict(d_test)
            weighted_test_pred = test_pred*rev_valid_score
            org_train_pred = tmp_train_pred
            best_loss = valid_score
            best_single_pred = test_pred
        else:
            curr_pred = m.predict(d_test)
            test_pred += curr_pred
            weighted_test_pred += curr_pred*rev_valid_score
            org_train_pred += tmp_train_pred

            if valid_score < best_loss:
                print('BETTER')
                best_loss = valid_score
                best_single_pred = curr_pred

    test_pred = test_pred / k_cnt
    test_pred = np.round(test_pred,4)
    org_train_pred = org_train_pred / k_cnt
    avg_k_score = avg_k_score/k_cnt

    submiss=pd.read_csv("sample_submission.csv")
    submiss['0']=test_pred[:,0]
    submiss['1']=test_pred[:,1]
    submiss['2']=test_pred[:,2]
    submiss['3']=test_pred[:,3]
    submiss['4']=test_pred[:,4]
    submiss.to_csv("xgb_{}.csv".format(k_cnt),index=False)
    print(reverse_score)
    # weigthed
    submiss=pd.read_csv("sample_submission.csv")
    weighted_test_pred = weighted_test_pred / reverse_score
    weighted_test_pred = np.round(weighted_test_pred,4)
    submiss['0']=weighted_test_pred[:,0]
    submiss['1']=weighted_test_pred[:,1]
    submiss['2']=weighted_test_pred[:,2]
    submiss['3']=weighted_test_pred[:,3]
    submiss['4']=weighted_test_pred[:,4]
    submiss.to_csv("weighted_{}.csv".format(k_cnt),index=False)
    # best single
    submiss=pd.read_csv("sample_submission.csv")
    weighted_test_pred = np.round(best_single_pred,4)
    submiss['0']=weighted_test_pred[:,0]
    submiss['1']=weighted_test_pred[:,1]
    submiss['2']=weighted_test_pred[:,2]
    submiss['3']=weighted_test_pred[:,3]
    submiss['4']=weighted_test_pred[:,4]
    submiss.to_csv("single_{}.csv".format(k_cnt),index=False)
    
    # train log loss
    print('local average valid loss',avg_k_score)
    print('train log loss', log_loss(train_Y,org_train_pred))

In [17]:
cv_test(5, True)

[0]	train-mlogloss:1.53927	valid-mlogloss:1.53965
Multiple eval metrics have been passed: 'valid-mlogloss' will be used for early stopping.

Will train until valid-mlogloss hasn't improved in 50 rounds.
[200]	train-mlogloss:0.355055	valid-mlogloss:0.384927
[400]	train-mlogloss:0.312747	valid-mlogloss:0.364205
[600]	train-mlogloss:0.284918	valid-mlogloss:0.35637
[800]	train-mlogloss:0.262457	valid-mlogloss:0.352282
[1000]	train-mlogloss:0.24276	valid-mlogloss:0.349757
[1200]	train-mlogloss:0.225228	valid-mlogloss:0.348044
[1400]	train-mlogloss:0.209108	valid-mlogloss:0.346995
[1600]	train-mlogloss:0.194379	valid-mlogloss:0.346386
Stopping. Best iteration:
[1655]	train-mlogloss:0.190594	valid-mlogloss:0.346087

train log loss 0.1871349243375515 valid log loss 0.34632040740935727
rev 2.887499490660916
[0]	train-mlogloss:1.5395	valid-mlogloss:1.53957
Multiple eval metrics have been passed: 'valid-mlogloss' will be used for early stopping.

Will train until valid-mlogloss hasn't improved in