Description
Source repository: https://github.com/srbhklkrn/depression-therapist-chatbot

Model training code: https://github.com/srbhklkrn/depression-therapist-chatbot/blob/master/model/sent_model_vocab.py
Source file

How it works:

Notes
* The repository links to a ready-made model and contains the training code, but the model files in the repository are placeholders named after a hash sum.
* The repository has no link to the dataset; it only states that the data are tweets from Kaggle. The data files are likewise placeholders named after a hash sum. The main dataset on Kaggle is https://www.kaggle.com/c/tweet-sentiment-extraction
* Training uses a tweet dataset. Each tweet is stripped of hashtags, author mentions and links; the word `not` is temporarily rewritten as `notxxx` so that preprocessing does not discard it. Tweets are then converted to sequences of at most 20 tokens, and longer tweets are split into several sequences.
* The cleaned text is tokenized, and tweets left with fewer than 2 tokens are discarded.
* The vocabulary is saved to a separate file (`vocab_sentiment`).
* The model is a network with a single LSTM layer of 128 units. Sentiment classification accuracy on the test set reaches about 82%.
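
The cleaning steps listed above can be sketched in plain Python. This is a minimal illustration rather than the notebook's code: `TOY_STOP_WORDS` and `toy_preprocess` are hypothetical stand-ins for gensim's `preprocess_string`, which also stems words and strips punctuation.

```python
# Toy stop-word list; an assumption for illustration only
TOY_STOP_WORDS = {"not", "the", "is", "a", "this"}

def strip_mentions_tags_links(text):
    """Drop empty words, @mentions, #hashtags and anything containing 'http'."""
    words = [w for w in text.split(' ')
             if len(w) > 0 and w[0] not in ('@', '#') and 'http' not in w]
    return ' '.join(words).strip()

def toy_preprocess(text):
    """Hypothetical stand-in for preprocess_string: lowercase, split, drop stop words."""
    return [w for w in text.lower().split() if w not in TOY_STOP_WORDS]

def preprocess_keeping_not(text):
    """Rewrite 'not' as 'notxxx' so the stop-word filter keeps it, then restore it."""
    protected = text.replace('not', 'notxxx')
    return [w[:-3] if w.endswith('xxx') else w for w in toy_preprocess(protected)]

cleaned = strip_mentions_tags_links("@user this is not good #sad http://t.co/x")
print(cleaned)                          # mention, hashtag and link removed
print(toy_preprocess(cleaned))          # 'not' is lost to the stop-word filter
print(preprocess_keeping_not(cleaned))  # 'not' survives
```

Because the rewrite is a plain substring substitution, it also touches words that merely contain `not` (for example `nothing`); the sketch shares that behaviour with the notebook.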



Preliminary steps


Model limitations and notes:
* Text processing uses `gensim`: https://radimrehurek.com/gensim/

In [None]:
# Dependencies
import os
import codecs
import sys
import time
import csv
import numpy as np
from gensim.parsing.preprocessing import preprocess_string, strip_punctuation, stem_text
from gensim.corpora.dictionary import Dictionary
# Import Keras through tensorflow.keras so the code does not depend on
# internal module paths such as keras.layers.recurrent
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense, Dropout, LSTM, Embedding
from tensorflow.keras.regularizers import l2
from tensorflow.keras.callbacks import TensorBoard, EarlyStopping, ReduceLROnPlateau, ModelCheckpoint



In [None]:
# Mount Google Drive and work in the notebook directory MyDrive/chats_emotions_and_voises/chat04_depression-therapist-chatbot
BASE_PATH='/content/gdrive/MyDrive/chats_emotions_and_voises/chat04_depression-therapist-chatbot'
from google.colab import drive
drive.mount('/content/gdrive')
if not os.path.exists(BASE_PATH):
    raise ValueError('Data directory not found', BASE_PATH)
%cd $BASE_PATH
if not os.path.exists('Tensorboard'):
    os.makedirs('Tensorboard')
%ls

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
/content/gdrive/MyDrive/chats_emotions_and_voises/chat04_depression-therapist-chatbot
1model_nn.hdf5      data_tweets.csv                  Tensorboard/
chat04_main.ipynb   model_nn.hdf5                    tweet_data/
chat04_study.ipynb  model_nn_ORIGINAL.h5             vocab_sentiment
data_labels.csv     sent_model_vocab_model-ORIGINAL  vocab_sentiment_ORIGINAL


In [None]:
# Helper functions

# Reads the tweet dataset and extracts the tweets and their sentiment labels
def export(type_data='train'):
    print("Extracting data...")
    if type_data.lower() == 'train':
        filename = 'train.csv'
    elif type_data.lower() == 'test':
        filename = 'test.csv'
    else:
        raise ValueError(f"type_data must be 'train' or 'test', got {type_data!r}")
    data_file = codecs.open('tweet_data/' + filename, encoding='ISO-8859-1')
    data = []
    for tweet in data_file.read().split('\n')[:-1]:
        data.append([string for string in tweet.split('"') if string not in [
                    '', ',']])
    data_file.close()
    print(data[:3])
    labels = [(float(tweet[0]) / 4.0) for tweet in data]
    tweets = [tweet[-1] for tweet in data]
    print(f'First 2 tweets from {len(tweets)}\n', tweets[:2], labels[:2])

    print("Preprocessing data (tokenization and removal of insignificant words)...")
    for i, tweet in enumerate(tweets):
        # Drop empty words, words starting with @ or # (accounts and hashtags), and links
        new_tweet = ' '.join([word for word in tweet.split(' ') if len(word)\
                            > 0 and word[0] not in ['@', '#'] and 'http' not\
                            in word]).strip()
        # Rewrite 'not' as 'notxxx' so that preprocess_string does not drop it
        # as a stop word, then strip the 'xxx' suffix from the resulting tokens
        pro_tweet = [word[:-3] if word[-3:] == 'xxx' else word for word in
                    preprocess_string(new_tweet.replace('not', 'notxxx'))]
        #pro_tweet = preprocess_string(new_tweet)
        # Example result
        # new_tweet: Yup!! saw the entire match....reached office late..
        # pro_tweet:  ['yup', 'saw', 'entir', 'match', 'reach', 'offic', 'late']
        if len(pro_tweet) < 2:
            tweets[i] = strip_punctuation(stem_text(new_tweet.lower())).\
                        strip().split()
        else:
            tweets[i] = pro_tweet
        # sys.stdout.write("\r%d tweet(s) pre-processed out of %d\r" % (i + 1, len(tweets)))
        # sys.stdout.flush()
        if (i + 1) % 100000 == 0:
            print(f"{i+1:,} tweet(s) pre-processed out of {len(tweets):,}")

    print("\nCleaning data (dropping tweets with fewer than 2 tokens)...")
    backup_tweets = np.array(tweets, dtype=object)
    backup_labels = np.array(labels, dtype=object)
    tweets = []
    labels = []
    for i, tweet in enumerate(backup_tweets):
        if len(tweet) >= 2:
            tweets.append(tweet)
            labels.append(backup_labels[i])
    del backup_tweets
    del backup_labels

    # Shuffle the dataset
    data = list(zip(tweets, labels))
    np.random.shuffle(data)
    tweets, labels = zip(*data)

    return (tweets, labels)

# Builds the vocabulary and saves it to disk
def create_vocab(tweets):
    print("Building vocabulary...")
    vocab = Dictionary()    
    vocab.add_documents(tweets)
    vocab.save('vocab_sentiment')
    return vocab

# Loads the saved vocabulary or builds a new one
def get_vocab(tweets=None):
    if 'vocab_sentiment' in os.listdir('.'):
        if not tweets:
            print("Loading vocabulary...")
            vocab = Dictionary.load('vocab_sentiment')
            print("Loaded vocabulary")
            return vocab
        # response = raw_input('Vocabulary found. Do you want to load it? (Y/n): ')
        response = input('Vocabulary found. Do you want to load it? (Y/n):')
        if response.lower() in ['n', 'no', 'nah', 'nono', 'nahi', 'nein']:
            if not tweets:
                tweets, labels = export()
                del labels
            return create_vocab(tweets)
        else:
            print("Loading vocabulary...")
            vocab = Dictionary.load('vocab_sentiment')
            print("Loaded vocabulary")
            return vocab
    else:
        if not tweets:
            tweets, labels = export()
            del labels
        return create_vocab(tweets)

# Prepares the data and vocabulary, then replaces words with vocabulary indices
def init_with_vocab(tweets=None, labels=None, vocab=None, type_data='train'):
    if not tweets and not labels:
        if type_data=='train':
            if 'data_tweets.csv' in os.listdir('.') and 'data_labels.csv' in os.listdir('.'):
                # One tweet per row, one token per cell
                with open('data_tweets.csv', newline='') as f:
                    reader = csv.reader(f)
                    tweets = list(reader)
                # One label per row; values are stored as text, so convert back to float
                with open('data_labels.csv', newline='') as f:
                    reader = csv.reader(f)
                    labels = [float(row[0]) for row in reader]
                print(tweets[:2], labels[:2])
            else:
                tweets, labels = export(type_data)
                with open('data_tweets.csv', 'w', newline='') as csvfile:
                    writer= csv.writer(csvfile)
                    writer.writerows(tweets)
                with open('data_labels.csv', 'w', newline='') as csvfile:
                    writer= csv.writer(csvfile)
                    writer.writerows([label] for label in labels)
        else:
            tweets, labels = export(type_data)
    elif tweets and labels:
        pass
    else:
        print("One of tweets or labels given, but not the other")
        return
    if not vocab and type_data == 'train':
        vocab = get_vocab(tweets)
    elif not vocab:
        vocab = get_vocab()

    print("Replacing words with vocabulary numbers...")
    #if type_data == 'train':
        #max_tweet_len = max([len(tweet) for tweet in tweets])
    #else:
        #max_tweet_len = 40 #Empirically obtained :P
    max_tweet_len = 20
    numbered_tweets = []
    numbered_labels = []
    for tweet_num, (tweet, label) in enumerate(zip(tweets, labels)):
        current_tweet = []

        for word in tweet:
            if word in vocab.token2id:
                current_tweet.append(vocab.token2id[word] + 1)

        if len(current_tweet) <= max_tweet_len:
            current_tweet_len = len(current_tweet)
            for i in range(max_tweet_len - current_tweet_len):
                current_tweet.append(0)
            numbered_tweets.append(current_tweet)
            numbered_labels.append(label)

        else:
            while len(current_tweet) > max_tweet_len:
                numbered_tweets.append(current_tweet[:max_tweet_len])
                numbered_labels.append(label)
                current_tweet = current_tweet[max_tweet_len:]
            if len(current_tweet) > 1:
                current_tweet_len = len(current_tweet)
                for i in range(max_tweet_len - current_tweet_len):
                    current_tweet.append(0)
                numbered_tweets.append(current_tweet)
                numbered_labels.append(label)

    print("Replaced words with vocabulary numbers")
    del tweets
    labels = np.array(numbered_labels)
    del numbered_labels
    return (numbered_tweets, labels, len(vocab))
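
The numbering and padding step inside `init_with_vocab` can be isolated as a small sketch. It assumes a plain dict in place of gensim's `vocab.token2id`, and the names `encode` and `pad_or_split` are hypothetical. Index 0 is reserved for padding, which is why every vocabulary id is shifted by one.

```python
def encode(tokens, token2id):
    """Map known tokens to id + 1 (0 is the padding value); unknown tokens are skipped."""
    return [token2id[t] + 1 for t in tokens if t in token2id]

def pad_or_split(ids, max_len=20):
    """Split a sequence into chunks of max_len, zero-padding the last one.

    A trailing remainder of a single id is dropped, as in init_with_vocab.
    """
    chunks = []
    while len(ids) > max_len:
        chunks.append(ids[:max_len])
        ids = ids[max_len:]
    if len(ids) > 1 or not chunks:
        chunks.append(ids + [0] * (max_len - len(ids)))
    return chunks

toy_vocab = {'yup': 0, 'saw': 1, 'match': 2}  # toy vocabulary, an assumption
ids = encode(['yup', 'saw', 'unknown', 'match'], toy_vocab)
print(ids)                                             # [1, 2, 3]
print(pad_or_split(ids, max_len=5))                    # [[1, 2, 3, 0, 0]]
print(pad_or_split([1, 2, 3, 4, 5, 6, 7], max_len=3))  # [[1, 2, 3], [4, 5, 6]]
```

In the notebook every chunk produced from a long tweet inherits that tweet's label, so one tweet can contribute several training samples.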


In [None]:
# Create or load the neural network: an embedding layer, one LSTM layer, a sigmoid dense layer with dropout, and a sigmoid output; the optimizer is Nadam
def create_nn(vocab_len=None, max_tweet_len=None):
    if vocab_len is None:
        print("Error: Vocabulary not initialized")
        return
    if max_tweet_len is None:
        print("Error: Please specify max tweet length")
        return

    nn_model = Sequential()
    nn_model.add(Embedding(input_dim=(vocab_len + 1), output_dim=32,
                           mask_zero=True))
    nn_model.add(LSTM(128))
    nn_model.add(Dense(32, activation='sigmoid', kernel_regularizer=l2(0.05)))
    nn_model.add(Dropout(0.3))
    nn_model.add(Dense(1, activation='sigmoid'))

    nn_model.compile(loss='binary_crossentropy', optimizer='nadam', metrics=[
                     'accuracy'])

    print("Created neural network model")
    return nn_model

def get_nn(vocab_len=None, max_tweet_len=None):
    if 'model_nn.hdf5' in os.listdir('.'):
        # response = raw_input('Neural network model found. Do you want to load'\
        #                     ' it? (Y/n): ')
        response = input('Neural network model found. Do you want to load it? (Y/n): ')
        if response.lower() in ['n', 'no', 'nah', 'nono', 'nahi', 'nein']:
            return create_nn(vocab_len, max_tweet_len)
        else:
            print("Loading model...")
            nn_model = load_model('model_nn.hdf5')
            print("Loaded model")
            return nn_model
    else:
        return create_nn(vocab_len, max_tweet_len)
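
For a sense of the model's size, the trainable parameter counts of the layers in `create_nn` can be worked out by hand. The helper names below are hypothetical and the vocabulary size of 1000 is only an example value; the formulas are the standard ones for embedding, LSTM and dense layers with biases.

```python
def embedding_params(input_dim, output_dim):
    # One output_dim-sized vector per vocabulary index
    return input_dim * output_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

def total_params(vocab_len):
    # Layer sizes follow create_nn; Dropout has no trainable parameters
    return (embedding_params(vocab_len + 1, 32)
            + lstm_params(32, 128)
            + dense_params(128, 32)
            + dense_params(32, 1))

print(lstm_params(32, 128))  # 82432
print(total_params(1000))    # 118625
```

The LSTM and dense layers have a fixed size, so the total grows only with the vocabulary through the embedding table.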

In [None]:
# Main training helpers
def get_data_and_nn(tweets=None, labels=None, nn_model=None):
    vocab_len = None
    if tweets is None and labels is None:
        tweets, labels, vocab_len = init_with_vocab()
    elif tweets is not None and labels is not None:
        pass
    else:
        print("One of tweets or labels given, but not the other")
        return
    if nn_model is None:
        if vocab_len is None:
            # Data was passed in already numbered, so take the size from the saved vocabulary
            vocab_len = len(get_vocab())
        max_tweet_len = max([len(tweet) for tweet in tweets])
        nn_model = get_nn(vocab_len, max_tweet_len)
    return tweets, labels, nn_model

def train_nn(tweets=None, labels=None, nn_model=None):
    tweets, labels, nn_model = get_data_and_nn(tweets, labels, nn_model)

    # Callbacks (extra features)
    tb_callback = TensorBoard(log_dir='./Tensorboard/' + str(time.time()))
    early_stop = EarlyStopping(monitor='loss', min_delta=0.025, patience=6)
    # `min_delta` replaces the deprecated `epsilon` argument
    lr_reducer = ReduceLROnPlateau(monitor='loss', factor=0.5, min_lr=0.00001,
                                patience=2, min_delta=0.1)
    # saver = ModelCheckpoint('model_nn.h5', monitor='val_acc')
    # With metrics=['accuracy'], TensorFlow 2 logs validation accuracy as 'val_accuracy'
    saver = ModelCheckpoint('model_nn.hdf5', monitor='val_accuracy')

    try:
        # nn_model.fit(tweets, labels, epochs=50, batch_size=8192, callbacks= # `validation_split` is only supported for Tensors or NumPy arrays
        #             [tb_callback, early_stop, lr_reducer, saver], 
        #             validation_split=0.2)
        nparray_tweets = np.array(tweets)
        nparray_labels = np.array(labels)
        nn_model.fit(nparray_tweets, nparray_labels, epochs=50, batch_size=8192, callbacks=
                    [tb_callback, early_stop, lr_reducer, saver], 
                    validation_split=0.2, verbose=1)
    except KeyboardInterrupt:
        pass
    # nn_model.save('model_nn.h5')
    nn_model.save('model_nn.hdf5')
    print("Saved model: model_nn.hdf5")
    del tweets
    del labels
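
The `EarlyStopping` callback in `train_nn` stops training once the monitored loss has failed to improve by at least `min_delta` for `patience` consecutive epochs. The following is a simplified pure-Python sketch of that rule, not Keras's actual implementation; `stop_epoch` is a hypothetical helper and the loss values are made up.

```python
def stop_epoch(losses, min_delta=0.025, patience=6):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best - min_delta:
            # Improvement large enough to count: remember it and reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

slow = [0.60, 0.59, 0.585, 0.58, 0.578, 0.577, 0.576, 0.575]
print(stop_epoch(slow))                  # 7
print(stop_epoch([1.0, 0.9, 0.8, 0.7]))  # None
```

Each loss is compared against the last accepted best value, so a loss that decreases only in small steps can still trigger a stop.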
   

In [None]:
# Build the vocabulary, prepare the data and create the model
tweets, labels, nn_model = get_data_and_nn()
print(tweets[:2])
print(labels[:2])

Extracting data...
First 2 tweets
 [['0', '1467810369', 'Mon Apr 06 22:19:45 PDT 2009', 'NO_QUERY', '_TheSpecialOne_', "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D"], ['0', '1467810672', 'Mon Apr 06 22:19:49 PDT 2009', 'NO_QUERY', 'scotthamilton', "is upset that he can't update his Facebook by texting it... and might cry as a result  School today also. Blah!"]]
Preprocessing data...
100,000 tweet(s) pre-processed out of 1,600,000
200,000 tweet(s) pre-processed out of 1,600,000
300,000 tweet(s) pre-processed out of 1,600,000
400,000 tweet(s) pre-processed out of 1,600,000
500,000 tweet(s) pre-processed out of 1,600,000
600,000 tweet(s) pre-processed out of 1,600,000
700,000 tweet(s) pre-processed out of 1,600,000
800,000 tweet(s) pre-processed out of 1,600,000
900,000 tweet(s) pre-processed out of 1,600,000
1,000,000 tweet(s) pre-processed out of 1,600,000
1,100,000 tweet(s) pre-processed out of 1,600,000
1,200,000 

In [None]:
train_nn(tweets, labels, nn_model)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Saved model: model_nn.hdf5


In [None]:
# Evaluate the model on the test set
tweets_test, labels_test, _ = init_with_vocab(type_data='test')
print(nn_model.evaluate(np.array(tweets_test), np.array(labels_test), batch_size=32))


Extracting data...
[['0', '2329087315', 'Thu Jun 25 10:20:06 PDT 2009', 'NO_QUERY', 'mephistolesnc', "@Nakialjackson Aww...that's sad "], ['0', '2329087373', 'Thu Jun 25 10:20:07 PDT 2009', 'NO_QUERY', 'boo_kay', '@vmprfreak '], ['0', '2329087698', 'Thu Jun 25 10:20:07 PDT 2009', 'NO_QUERY', 'swgalibertarian', 'Talking to a GAGOV candidate in about an hour. Time to figure out exactly what questions I want to ask him... SOOO many, not enough time! ']]
First 2 tweets fron 595
 ["@Nakialjackson Aww...that's sad ", '@vmprfreak '] [0.0, 0.0]
Preprocessing data...

Cleaning data...
[list(['aww', 'sad']) list([])] 
 ['kid', 'amp'] 
 [0.0 0.0] 
 [0.0, 0.0]
Loading vocabulary...
Loaded vocabulary
Replacing words with vocabulary numbers...
Replaced words with vocabulary numbers
[0.4067313075065613, 0.8239316344261169]
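
The two numbers returned by `evaluate` are the loss followed by the accuracy. As a rough sketch of how these metrics are defined, assuming a 0.5 decision threshold and made-up predictions (the helper names are hypothetical):

```python
import math

def binary_crossentropy(labels, probs, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

def binary_accuracy(labels, probs, threshold=0.5):
    """Share of predictions that land on the correct side of the threshold."""
    hits = sum(1 for y, p in zip(labels, probs) if (p > threshold) == (y == 1))
    return hits / len(labels)

labels = [1, 0, 0, 1]            # made-up ground truth
probs = [0.9, 0.2, 0.6, 0.4]     # made-up model outputs
print(binary_crossentropy(labels, probs))
print(binary_accuracy(labels, probs))  # 0.5
```

In the notebook the reported loss also includes the L2 penalty from the regularized dense layer, so it is somewhat higher than the cross-entropy alone.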
