# Chatbot Question Answering

The goal of this project is to build a chatbot model that answers client questions.  
The dataset is divided into several sections (categories), and a multi-level classifier first predicts the section of a question and then finds the closest, most reasonable answer.  
Pay attention to text preprocessing, stop-word removal, and sentence vectorization.
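
The two-level idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the helper names `cosine` and `two_stage_answer` and the toy vectors are hypothetical, and the predicted category is passed in by hand instead of coming from a trained classifier.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

def two_stage_answer(query_vec, predicted_category, question_vecs, category_ids, answers):
    """Pick the closest stored question inside the predicted category only."""
    candidates = [i for i, c in enumerate(category_ids) if c == predicted_category]
    if not candidates:  # unknown category: fall back to the whole dataset
        candidates = list(range(len(answers)))
    best = max(candidates, key=lambda i: cosine(query_vec, question_vecs[i]))
    return answers[best], best

# Toy data (hypothetical): three stored questions in two categories
question_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
category_ids = [0, 0, 1]
answers = ["answer A", "answer B", "answer C"]

result_cat0 = two_stage_answer([1.0, 0.9], 0, question_vecs, category_ids, answers)
result_cat1 = two_stage_answer([1.0, 0.9], 1, question_vecs, category_ids, answers)
print(result_cat0, result_cat1)
```

Restricting the search to one category changes the winner in this toy case: the globally closest vector belongs to category 1, so a wrong category prediction would also produce a wrong answer.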

This project consists of the following activities:

1. Phase 1 : Dataset
    * Team Planning
    * Full git project Integration
    * General Project Research
    * Dataset Collection
    * Dataset Preparation
2. Phase 2 : Training
    * Research about NLP model
    * Compose NLP model
        * Stop words Removal
        * Text tokenization
        * Text Preprocessing
        * Question vectorization
        * Find closest vector
    * Ping Pong phase with Dataset labelers
    * Generate more data if needed
    * Fine-tuning of your model


3. Phase 3 : Deployment
    * Perform manual benchmark
    * Model Deploy (Git)
    * Write git Readme.md file
    * Receive Feedback from PM


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import re
import string
from scipy.spatial.distance import cdist
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import precision_recall_fscore_support as score

from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout, Activation
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.callbacks import ModelCheckpoint, EarlyStopping
from keras.utils import to_categorical

from sklearn.metrics import ConfusionMatrixDisplay, plot_confusion_matrix, classification_report, confusion_matrix


from bpemb import BPEmb
import xgboost as xgb

%matplotlib inline

Using TensorFlow backend.


# Load Data

In [2]:
brainster_df = pd.read_csv('dataset/dataset_brainster.csv')

In [3]:
brainster_df

Unnamed: 0,questions,answers,category,category_id
0,Колку време трае академијата за дигитален марк...,Академијата за дигитален маркетинг трае 23 нед...,маркетинг,1
1,Колку трае академијата за дигитален маркетинг?,Академијата за дигитален маркетинг трае 23 нед...,маркетинг,1
2,колку месеци недели е академијата за дигитален...,Академијата за дигитален маркетинг трае 23 нед...,маркетинг,1
3,колку недели е академијата за дигитален маркетинг,Академијата за дигитален маркетинг трае 23 нед...,маркетинг,1
4,колку месеци е академијата за дигитален маркетинг,Академијата за дигитален маркетинг трае 23 нед...,маркетинг,1
...,...,...,...,...
2932,Дали добивам диплома по завршување на академиј...,"Ако успешно го одбраниш завршниот проект, доби...",UX/UI,7
2933,дали имам диплома или сертификат за UX/UI,"Ако успешно го одбраниш завршниот проект, доби...",UX/UI,7
2934,дали ќе имам диплома или сертификат на академи...,"Ако успешно го одбраниш завршниот проект, доби...",UX/UI,7
2935,Како ги одбирате студентите на академијата за ...,По средбата или интервјуто координаторот има с...,UX/UI,7


# Provide and prepare data information

In [4]:
questions = brainster_df.questions
other_col = brainster_df.drop(columns='questions')

In [5]:
print(questions.shape, other_col.shape)

(2937,) (2937, 3)


In [6]:
other_col.head(5)

Unnamed: 0,answers,category,category_id
0,Академијата за дигитален маркетинг трае 23 нед...,маркетинг,1
1,Академијата за дигитален маркетинг трае 23 нед...,маркетинг,1
2,Академијата за дигитален маркетинг трае 23 нед...,маркетинг,1
3,Академијата за дигитален маркетинг трае 23 нед...,маркетинг,1
4,Академијата за дигитален маркетинг трае 23 нед...,маркетинг,1


In [7]:
questions.head(5)

0    Колку време трае академијата за дигитален марк...
1       Колку трае академијата за дигитален маркетинг?
2    колку месеци недели е академијата за дигитален...
3    колку недели е академијата за дигитален маркетинг
4    колку месеци е академијата за дигитален маркетинг
Name: questions, dtype: object

In [8]:
questions_array = questions.to_numpy()
questions_array

array(['Колку време трае академијата за дигитален маркетинг?',
       'Колку трае академијата за дигитален маркетинг?',
       'колку месеци недели е академијата за дигитален маркетинг', ...,
       'дали ќе имам диплома или сертификат на академијата за UX/UI ',
       'Како ги одбирате студентите на академијата за UX/UI?',
       'Кои се придобивките од посета на Академијата за UX/UI?'],
      dtype=object)

# Data Preprocessing

In [9]:
lat_to_cyr = {'kj' : 'ќ', 'gj' : 'ѓ', 'zh' : 'ж', 'ch' : 'ч', 'sh' : 'ш', 'dj' : 'ѓ',
              'a' : 'а', 'b' : 'б', 'c' : 'ц', 'd' : 'д', 'e' : 'е', 'f' : 'ф', 'g' : 'г',
              'h' : 'х', 'i' : 'и', 'j' : 'ј', 'k' : 'к', 'l' : 'л', 'm' : 'м', 'n' : 'н',
              'o' : 'о', 'p' : 'п', 'q' : 'љ', 'r' : 'р', 's' : 'с', 't' : 'т', 'u' : 'у',
              'v' : 'в', 'w' : 'њ', 'x' : 'џ', 'y' : 'ѕ', 'z' : 'з'
             }

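# `text` must be an iterable of strings (for example a list or a NumPy array)

def latin_to_cyrillic(text):
    questions = []
    for question in text:
        # Lowercase once, then replace digraphs ('kj', 'gj', ...) before single
        # letters; this relies on the insertion order of lat_to_cyr
        question = question.lower()
        for key, value in lat_to_cyr.items():
            question = question.replace(key, value)
        questions.append(question)
    return questions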

In [10]:
questions_translated = latin_to_cyrillic(questions_array)
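
The transliteration depends on replacing two-letter combinations before single letters. The sketch below shows why, using a reduced, hypothetical mapping (`digraphs`, `singles`) and a hypothetical helper `transliterate`, not the full `lat_to_cyr` table.

```python
digraphs = {'sh': 'ш', 'ch': 'ч'}
singles = {'s': 'с', 'h': 'х', 'c': 'ц', 'a': 'а', 't': 'т', 'o': 'о'}

def transliterate(text, mapping):
    """Apply the replacements in the mapping's insertion order."""
    text = text.lower()
    for key, value in mapping.items():
        text = text.replace(key, value)
    return text

digraphs_first = {**digraphs, **singles}   # 'sh' is seen before 's' and 'h'
singles_first = {**singles, **digraphs}    # 's' and 'h' are consumed first

good = transliterate("shto", digraphs_first)
bad = transliterate("shto", singles_first)
print(good, bad)
```

With the digraphs first, `shto` becomes `што`. With single letters first, the `s` and `h` are converted separately and the digraph rule never fires.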

In [11]:
stop_words_mkd = pd.read_csv('stop_words.txt').to_numpy()
stop_words_list = []
for words in stop_words_mkd:
    for word in words:
        stop_words_list.append(word)
        
len(stop_words_list)

171

# Define Model Architecture

# Count Vectorizer Model

In [12]:
count_vector_model = CountVectorizer(stop_words=stop_words_list, strip_accents='unicode')

In [13]:
count_vector_features = count_vector_model.fit_transform(questions_translated)
questions_df = pd.DataFrame(data = count_vector_features.todense(), 
                            columns=count_vector_model.get_feature_names())

questions_df.tail(10)



Unnamed: 0,15,16,24,30,автоматско,адс,акадеимјата,академии,академиите,академија,...,јава,јазик,јазици,јак,јупѕтер,јљуерѕ,ља,њарехоусе,њебдривер,њебсите
2927,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2928,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2929,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2930,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2931,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2932,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2933,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2934,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2935,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2936,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
questions_df.shape

(2937, 712)
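
Each of the 2937 rows above counts how often each of the 712 vocabulary terms occurs in one question. A minimal pure-Python sketch of that idea follows. It assumes whitespace tokenization and a hypothetical two-word stop list, whereas `CountVectorizer` uses its own token pattern and the stop words loaded from `stop_words.txt`.

```python
from collections import Counter

def bag_of_words(documents, stop_words):
    """Build a sorted vocabulary and one count row per document."""
    tokenized = [[t for t in doc.lower().split() if t not in stop_words]
                 for doc in documents]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

docs = ["колку трае академијата за маркетинг", "колку чини академијата"]
vocabulary, rows = bag_of_words(docs, stop_words={"за", "колку"})
print(vocabulary)
print(rows)
```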

In [15]:
# display function

def display_all(df):
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000): 
        display(df)

In [16]:
display_all(questions_df.head(1))

Unnamed: 0,15,16,24,30,автоматско,адс,акадеимјата,академии,академиите,академија,академијата,академијта,акадмијата,акредитација,акредитиран,акредитирана,активности,алгоритми,анализа,ангажирани,англиски,аплицира,аплицираат,аплицирам,аппиум,асистирате,атрактивна,аутоматед,ајаџ,бази,банкарски,басицс,бацк,библиотека,биг,бидам,бидат,бирате,боотстрап,браинстер,број,бусинесс,важи,ваучер,ваучери,ваучерот,ваш,ваша,вашата,ваши,ве,веб,вебсајт,веке,видам,видат,видеа,видеата,видео,види,викенд,вирусот,вклучам,вклучува,владата,внимание,возраст,вработување,време,времетрање,врска,врши,вршите,генератион,гит,гледам,години,гоогле,готово,граница,графички,гроут,гроњтх,група,групата,групи,гугл,давате,дале,дата,датум,дебитна,дееп,дел,делот,ден,денови,деновиве,денот,детали,детална,детално,дигитал,дигитален,дигиталниот,дизајн,дизајни,дизајните,диплома,дипломата,добар,добив,добива,добивам,добивање,добиен,добијам,додатни,дознаам,доколку,документ,долго,дополнителни,достапна,достапни,држат,другите,економски,електро,енд,живеам,живо,завршен,завршување,завршувањето,заинтересиран,заинтересирана,замолам,запишам,запишеме,запишуваат,запишување,започнам,започната,започнеме,започнува,здраво,земам,земјава,земјата,знаел,знаење,идејно,избираат,избирате,изведуваат,изработам,изработка,изработуваат,изучува,изучуваат,изучувате,имаат,имам,имаме,имате,инстаграм,инсталирам,инсталирање,инструктори,инструкторите,интеллигенце,интересира,инфо,информации,информација,исконтактирам,искористи,искусни,искуство,исполнувам,ит,кажете,каква,какви,какво,каков,камера,канал,кандидати,кандидатите,кариерен,картичка,ке,керас,кит,клиенти,книги,кого,колкав,колкава,колкави,компании,компетентен,компјутер,компјутери,компјутерот,конкретно,контакт,контактирам,контент,концепирана,копирајтинг,користам,користат,користи,корона,короната,кошта,кратки,кратко,крај,крајниот,крајот,кредит,кредитна,креирање,купам,купи,курс,курсеви,курсевите,лап,лаптоп,лаптопот,ларавел,лаѕоут,леад,леарн,леарнинг,лето,лејаут,линкдин,линкедин,листа,литература,лица,лого,л
уге,мавен,маил,макдеонски,македонија,македонски,маркетинг,марктинг,математика,мачине,машински,машинско,мегународно,ментори,месеци,места,место,минати,мининг,мк,млади,мое,можам,можат,можност,можностите,мокен,молам,моменталната,мрежи,надвор,намаление,намалување,наменет,наменета,наогаат,наогате,направам,направи,напреден,напредна,напредни,напредувал,населба,насочам,настава,наставата,науки,начин,начини,наши,нашите,најважно,најдам,најдобро,најмалку,најмногу,невронски,недели,неделно,некого,некои,немам,неопходно,нетњоркс,неурал,ниво,нлп,нов,нуди,нудите,обврски,област,областа,области,обработуваат,обука,обуката,објектно,одбивте,одбиен,одбираат,одбирате,одвива,одвиваат,одговор,одговоривте,одлука,одлучам,однос,односно,одржува,одржуваат,онлајн,онлине,ооп,опрема,опфака,опфатен,опфатот,опција,општествени,организирате,ориентирано,останати,оф,официјалните,оценка,оценува,оценување,пазарот,пакување,пари,партиципирам,партнер,партнери,пат,пати,пајтон,перспективи,перформанси,плака,плакам,плакање,плакањето,платам,плати,побарам,повеке,податоци,подготвам,подготвителна,подготвителната,подготовка,поддршка,подобра,подразбира,подразбора,подротвителна,позадина,поздрав,познавања,познавање,полна,помагате,помегу,помогне,помогнете,помош,пополнета,попоуст,попуп,попуст,поради,поразговарам,поразговарм,поректите,портфолио,посветува,поседувам,посета,посредувате,посредување,постер,постои,поточно,потребен,потребна,потребни,потребно,почетник,почетници,почетокот,почнала,почнам,почне,почнува,почнуваат,почнувате,појаснување,поњерби,правам,прават,правата,правен,правење,правите,пракса,практична,практични,практично,пратете,прашам,прашања,прашувам,прегледам,предава,предавачи,предавачите,предавања,предавањата,предавање,предвидена,предзнаења,предзнаење,предлог,предложам,предложи,предложиме,предлози,предност,предноста,предностите,препорачате,претставува,претходни,претходно,приватни,придобивките,признаваат,признат,призната,приклучам,примен,принципот,припремам,припреми,пристапам,пристапот,прифакаат,прифакате,прифатите,п
ричина,пријавам,пријават,пријави,пријавување,пријатно,програм,програма,програмата,програмер,програмирал,програмирам,програмирање,програмирањето,програмирње,програмриање,програмски,прогресот,прогрмирање,продолжувате,проект,проекти,проектите,прокети,просториите,профил,процедурата,пхп,пѕтхон,пѕтхорч,работа,работам,работат,работата,работи,работиме,работите,работни,работно,разлика,разликува,распоредот,рати,реал,реален,реални,реацт,регистрација,регистрацијата,резиме,решение,рок,саат,саати,саатот,сабота,сакал,сакала,сакам,самостојно,сајанс,св,свој,сектор,селектирате,селекција,селениум,сео,сертификат,сертификатот,ситуација,ситуацијата,скајп,скопје,следам,следат,следење,следи,следната,следни,следниот,следните,следува,слично,слободни,слушам,снимаат,снимате,снимки,советувам,сопствен,соработка,соработува,соработувате,состојба,софтвер,софтверско,софтњаре,социологија,социјални,спарк,специфични,спремате,сработените,средно,средношколец,средношколка,средношколци,ставам,стапам,старосна,статистика,стацк,стекнува,стипендираат,стипендирање,стипендија,сторителинг,сторѕтеллинг,страна,странство,стратегија,стратегѕ,стручен,студенти,студентите,студиите,сци,сциенце,сцикит,сљл,тек,текст,телефон,телефонски,теми,термин,термините,тестер,тестинг,тестирање,тестнг,технички,техничко,технологии,технологиите,теџт,тиме,тип,типографија,типот,топ,траат,трае,траење,требаат,тубе,убав,уи,уписи,уписите,упишуваат,уплата,уплати,успех,учам,учат,ученик,учесници,учесниците,учество,учествуваат,учествувам,учење,учи,училиште,училна,училница,учиме,учите,уџ,фактура,факултет,фацебоок,феит,фејсбук,финален,финалниот,финансирате,финансирање,финки,фирми,фокусот,фронт,фронтенд,фулл,фуллстацк,фулстак,фундаменталс,функционира,хадооп,хакинг,хацкинг,хтмл,цв,цело,целосно,цена,цената,цовид,цонтент,цопѕњритинг,цсс,чао,час,часа,часови,часовите,чини,школувањето,ѕоу,ѕоутубе,јава,јазик,јазици,јак,јупѕтер,јљуерѕ,ља,њарехоусе,њебдривер,њебсите
0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


# TF-IDF

In [17]:
tf_idf_model = TfidfVectorizer(stop_words=stop_words_list, strip_accents='unicode')

In [18]:
tf_idf_features = tf_idf_model.fit_transform(questions_translated)
questions_df_tfidf = pd.DataFrame(data = tf_idf_features.todense(), 
                            columns=tf_idf_model.get_feature_names())

questions_df_tfidf.tail(10)



Unnamed: 0,15,16,24,30,автоматско,адс,акадеимјата,академии,академиите,академија,...,јава,јазик,јазици,јак,јупѕтер,јљуерѕ,ља,њарехоусе,њебдривер,њебсите
2927,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2928,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2929,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2930,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2931,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2932,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2933,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2934,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2935,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2936,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
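
The TF-IDF weights differ from raw counts because terms that occur in many questions are down-weighted. The sketch below reproduces the weighting that `TfidfVectorizer` applies with its default settings (smoothed idf, L2-normalised rows) on a toy corpus. The helper `tfidf_rows` and the three tokenized questions are hypothetical.

```python
import math

def tfidf_rows(tokenized_docs):
    """TF-IDF with smoothed idf, ln((1 + n) / (1 + df)) + 1, and L2-normalised rows."""
    vocabulary = sorted({t for doc in tokenized_docs for t in doc})
    n_docs = len(tokenized_docs)
    df = {t: sum(t in doc for doc in tokenized_docs) for t in vocabulary}
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for doc in tokenized_docs:
        weights = [doc.count(t) * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows

docs = [["академијата", "трае"], ["академијата", "чини"], ["академијата", "рати"]]
vocabulary, rows = tfidf_rows(docs)
for term, weight in zip(vocabulary, rows[0]):
    print(term, round(weight, 3))
```

In the first toy question, the rarer term `трае` receives a larger weight than `академијата`, which appears in every question.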


# TfidfVectorizer Ngrams

In [19]:
tf_idf_ngram_model = TfidfVectorizer(ngram_range=(1,2), stop_words=stop_words_list, strip_accents='unicode')

In [20]:
tf_idf_ngram_features = tf_idf_ngram_model.fit_transform(questions_translated)

questions_df_tfidf_ngram = pd.DataFrame(data = tf_idf_ngram_features.todense(), 
                            columns=tf_idf_ngram_model.get_feature_names())

questions_df_tfidf_ngram.tail(10)



Unnamed: 0,15,15 рати,16,16 30,24,24 рати,30,30 конкретно,автоматско,автоматско тестирање,...,ља следува,ља слободни,ља стекнува,ља странство,ља уплати,ља училница,њарехоусе,њарехоусе академијата,њебдривер,њебсите
2927,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2928,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2929,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2930,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2931,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2932,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2933,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2934,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2935,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2936,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
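
With `ngram_range=(1,2)` the vocabulary contains pairs of adjacent words (such as `15 рати` above) in addition to single words. The sketch below shows how such features can be generated. The helper `word_ngrams` and the two-word stop list are hypothetical; stop words are removed before the pairs are formed, which is also the order scikit-learn uses.

```python
def word_ngrams(tokens, stop_words, ngram_range=(1, 2)):
    """Return all n-grams (joined with spaces) after stop-word removal."""
    tokens = [t for t in tokens if t not in stop_words]
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = "на колку рати може да плаќам".split()
grams = word_ngrams(tokens, stop_words={"на", "да"})
print(grams)
```

Because filtering happens first, the bigram `може плаќам` spans the removed stop word `да`.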


# Test Part

In [21]:
client = latin_to_cyrillic(["на колку рати може да плаќам на data science"])

CountVectorizer

In [22]:
client_count_feature = count_vector_model.transform(client)

TfidfVectorizer

In [23]:
client_tfidf_feature = tf_idf_model.transform(client)

TfidfVectorizer Ngram

In [24]:
client_tfidf_ngram_feature = tf_idf_ngram_model.transform(client)

# Define distance between two vectors

In [25]:
def distance_vectors(answer, features, client_feature):
    # Cosine similarity between every dataset question and the client question.
    # cosine_similarity returns 0 for all-zero rows instead of producing NaN.
    similarities = cosine_similarity(features, client_feature).ravel()
    # The most similar dataset question has the largest cosine value
    index = int(np.argmax(similarities))
    max_cosine = round(float(similarities[index]), 3)
    return answer.answers[index], index, max_cosine
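
For reference, the cosine similarity used to compare two feature vectors can be written out with one worked case. The vectors in the example are illustrative and are not taken from the dataset.

```latex
\cos(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert},
\qquad
\mathbf{a} = (1, 0, 0),\ \mathbf{b} = (1, 1, 0)
\;\Rightarrow\;
\cos(\mathbf{a}, \mathbf{b}) = \frac{1}{1 \cdot \sqrt{2}} \approx 0.707
```

A value of 1 means the two vectors point in the same direction, and 0 means they share no terms. The expression is undefined when either vector is all zeros, for example when a question consists only of stop words.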

# Example: find the closest dataset question to a user-defined question

In [26]:
client

['на колку рати може да плаќам на дата сциенце']

In [27]:
cv_answer, cv_index, cv_cosine = distance_vectors(other_col, count_vector_features, client_count_feature)

In [28]:
print("Closest answer:", cv_answer)
print('Row number:', cv_index)
print('Cosine similarity:', cv_cosine)

Closest answer: Data Science е интердисциплинарно поле во кое се применуваат научни методи, процеси, алгоритми и системи за извлекување на корисно знаење и информации од структуирани и неструктуирани податоци.  Data Science  подразбира примена на знаење од областса на Machine learning, Python, Big Data, Business Inteligence, SQL, математика и статистика.
Row number: 1742
Cosine similarity: 0.707


In [29]:
tfidf_answer, tfidf_index, tfidf_cosine = distance_vectors(other_col, tf_idf_features, client_tfidf_feature)

In [30]:
print("Closest answer:", tfidf_answer)
print('Row number:', tfidf_index)
print('Cosine similarity:', tfidf_cosine)

Closest answer: За секоја рата добивате соодветна фактура и плаќањето може да се врши на жито сметка на Brainster, и од физичко но и од правно лице
Row number: 2117
Cosine similarity: 0.863


In [31]:
ngram_answer, ngram_index, ngram_cosine = distance_vectors(other_col, tf_idf_ngram_features, client_tfidf_ngram_feature)

In [32]:
print("Closest answer:", ngram_answer)
print('Row number:', ngram_index)
print('Cosine similarity:', ngram_cosine)

Closest answer: Moже да се плаќа на 15 месечни рати без камата
Row number: 864
Cosine similarity: 0.636


# Word Embedding

In [33]:
bpemb_mk = BPEmb(lang="mk", dim=300)

In [34]:
questions_translated[1:5]

['колку трае академијата за дигитален маркетинг?',
 'колку месеци недели е академијата за дигитален маркетинг',
 'колку недели е академијата за дигитален маркетинг',
 'колку месеци е академијата за дигитален маркетинг']

In [35]:
#cos_sim = np.argmax(cosine_similarity([vector],embed_questions))

In [36]:
def result_embed(questions, client_question):
    # Embed every dataset question as the mean of its subword vectors
    embed_questions = []
    for question in questions:
        embed_questions.append(bpemb_mk.embed(question).mean(axis=0))

    # Preprocess and embed the client question (embed one string, not a list)
    prepared_question = latin_to_cyrillic([client_question])
    query_embedding = bpemb_mk.embed(prepared_question[0]).mean(axis=0)

    # cdist returns the cosine distance (1 - cosine similarity), so smaller is closer
    distances = cdist([query_embedding], embed_questions, "cosine")[0]

    # Sort dataset indices from the closest to the farthest question
    results = sorted(enumerate(distances), key=lambda x: x[1])

    print("Question:", prepared_question)
    print("\n======================\n")
    print("\nTop 10 results:\n")

    for idx, distance in results[0:10]:
        print(other_col.answers[idx].strip(),'\n', "(Score: %.4f)" % (1-distance),'\n', "Index: ", idx)


In [46]:
result_embed(questions_translated, 'vo koja biblioteka rabotite deep learning')

Question: ['во која библиотека работите дееп леарнинг']

======================


Top 10 results:

Невронски мрежи е предвидено да се работат во Keras 
 (Score: 0.8691) 
 Index:  1664
Невронски мрежи е предвидено да се работат во Keras 
 (Score: 0.8683) 
 Index:  1894
Да, невронските мрежи се опфатени во модулот за машинско учење 
 (Score: 0.7431) 
 Index:  1652
Да, невронските мрежи се опфатени во модулот за машинско учење 
 (Score: 0.7305) 
 Index:  1882
Да, невронските мрежи се опфатени во модулот за машинско учење 
 (Score: 0.6430) 
 Index:  1651
Да, невронските мрежи се опфатени во модулот за машинско учење 
 (Score: 0.6257) 
 Index:  1881
На академијата за Data science е предвиден 11 неделен модул посветен на Machine Learning кои ги покрива следните области: supervised и unsupervised learning 
 (Score: 0.5671) 
 Index:  1647
Најчесто користена библиотека е Sci-kit Learn 
 (Score: 0.5650) 
 Index:  1896
Најчесто користена библиотека е Sci-kit Learn 
 (Score: 0.5649) 
 Index:  1667
Најчесто кори
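
Each question is turned into a single vector by averaging the vectors of its subword tokens (`.mean(axis=0)` in the code above). A minimal sketch of that pooling step follows, using hypothetical 3-dimensional vectors in place of the 300-dimensional BPEmb ones.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

# Hypothetical subword vectors for a question that was split into two tokens
subword_vectors = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
sentence_vector = mean_pool(subword_vectors)
print(sentence_vector)  # [2.0, 1.0, 1.0]
```

Averaging discards word order, so two questions made of the same tokens in a different order receive the same vector.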

# Classification Model

In [47]:
def dataset_preprocess(X_part, labels):
    embed_questions = []
    for question in X_part:
        embed_questions.append(bpemb_mk.embed(question).mean(axis=0))
    questions_embed = np.array(embed_questions)    
    target = to_categorical(labels)
    
    X_train, X_test, y_train, y_test = train_test_split(questions_embed, target, test_size=0.20, random_state=10)
    
    return X_train, X_test, y_train, y_test
    

In [48]:
X_train, X_test, y_train, y_test = dataset_preprocess(questions_translated, other_col.category_id)

In [49]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(2349, 300) (588, 300) (2349, 8) (588, 8)
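
The targets have 8 columns because `to_categorical` turns each integer `category_id` into a one-hot row. The sketch below shows the same encoding in plain Python. The helper `one_hot` is hypothetical and, like `to_categorical`, infers the number of classes as the largest label plus one.

```python
def one_hot(labels, num_classes=None):
    """Encode integer labels as one-hot rows."""
    if num_classes is None:
        num_classes = max(labels) + 1
    return [[1.0 if i == label else 0.0 for i in range(num_classes)]
            for label in labels]

encoded = one_hot([1, 7, 0])
for row in encoded:
    print(row)
```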


In [50]:
model = Sequential()

model.add(Dense(256, input_dim=300))
model.add(Activation('relu'))

model.add(Dense(128))
model.add(Activation('relu'))

model.add(Dense(64))
model.add(Activation('relu'))

model.add(Dense(32))
model.add(Activation('relu'))

model.add(Dense(16))
model.add(Activation('relu'))

model.add(Dense(10))
model.add(Activation('relu'))

model.add(Dense(8))
model.add(Activation('softmax'))

model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 256)               77056     
_________________________________________________________________
activation_1 (Activation)    (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 128)               32896     
_________________________________________________________________
activation_2 (Activation)    (None, 128)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 64)                8256      
_________________________________________________________________
activation_3 (Activation)    (None, 64)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 32)               
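
The `Param #` column can be checked by hand: a `Dense` layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The short sketch below applies that to the layer sizes defined in the model. The helper `dense_params` is hypothetical, and the values for the layers cut off in the summary above are computed here, not copied from the output.

```python
# Input dimension followed by the units of each Dense layer in the model
layer_sizes = [300, 256, 128, 64, 32, 16, 10, 8]

def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

params = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(params)
print(params)        # [77056, 32896, 8256, 2080, 528, 170, 88]
print(total_params)  # 121074
```

The first three values match the `dense_1`, `dense_2`, and `dense_3` rows of the summary.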

In [51]:
model.compile(loss='categorical_crossentropy', optimizer='Adadelta', metrics=['accuracy'])

In [52]:
my_callback1 = ModelCheckpoint('best_model.pt', verbose=1, save_best_only=True, mode='max', monitor='val_accuracy')
#my_callback2 = EarlyStopping(patience=7)

my_callbacks = [my_callback1]

In [53]:
model.fit(X_train, y_train,
    epochs=50,
    verbose=1,
    callbacks=my_callbacks,
    validation_data=(X_test, y_test),
    shuffle=True)

Train on 2349 samples, validate on 588 samples
Epoch 1/50

Epoch 00001: val_accuracy improved from -inf to 0.74490, saving model to best_model.pt
Epoch 2/50

Epoch 00002: val_accuracy did not improve from 0.74490
Epoch 3/50

Epoch 00003: val_accuracy improved from 0.74490 to 0.97449, saving model to best_model.pt
Epoch 4/50

Epoch 00004: val_accuracy improved from 0.97449 to 0.98980, saving model to best_model.pt
Epoch 5/50

Epoch 00005: val_accuracy did not improve from 0.98980
Epoch 6/50

Epoch 00006: val_accuracy did not improve from 0.98980
Epoch 7/50

Epoch 00007: val_accuracy did not improve from 0.98980
Epoch 8/50

Epoch 00008: val_accuracy did not improve from 0.98980
Epoch 9/50

Epoch 00009: val_accuracy did not improve from 0.98980
Epoch 10/50

Epoch 00010: val_accuracy did not improve from 0.98980
Epoch 11/50

Epoch 00011: val_accuracy did not improve from 0.98980
Epoch 12/50

Epoch 00012: val_accuracy did not improve from 0.98980
Epoch 13/50

Epoch 00013: val_accuracy did n


Epoch 00040: val_accuracy did not improve from 0.98980
Epoch 41/50

Epoch 00041: val_accuracy did not improve from 0.98980
Epoch 42/50

Epoch 00042: val_accuracy did not improve from 0.98980
Epoch 43/50

Epoch 00043: val_accuracy did not improve from 0.98980
Epoch 44/50

Epoch 00044: val_accuracy did not improve from 0.98980
Epoch 45/50

Epoch 00045: val_accuracy did not improve from 0.98980
Epoch 46/50

Epoch 00046: val_accuracy did not improve from 0.98980
Epoch 47/50

Epoch 00047: val_accuracy did not improve from 0.98980
Epoch 48/50

Epoch 00048: val_accuracy did not improve from 0.98980
Epoch 49/50

Epoch 00049: val_accuracy did not improve from 0.98980
Epoch 50/50

Epoch 00050: val_accuracy did not improve from 0.98980


<keras.callbacks.callbacks.History at 0x1f67167a828>

In [54]:
model.load_weights('best_model.pt')

In [55]:
predict_question = model.predict(X_test)

In [56]:
# Final evaluation of the model (callbacks are not needed for evaluation)
scores = model.evaluate(X_test, y_test, verbose=1)
print("Accuracy relu activation: %.2f%%\n" % (scores[1]*100))

Accuracy relu activation: 98.98%
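
The checkpoint was selected by monitoring `val_accuracy` on `(X_test, y_test)`, and the same set is used for this final evaluation, so the reported accuracy may be optimistic. One possible remedy is a separate validation split. The sketch below is a suggestion under stated assumptions, not part of the original pipeline: the helper `three_way_split` and the 10% / 20% fractions are hypothetical.

```python
import random

def three_way_split(n_samples, val_fraction=0.1, test_fraction=0.2, seed=10):
    """Shuffle sample indices and split them into train, validation and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_fraction)
    n_val = int(n_samples * val_fraction)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(2937)
print(len(train_idx), len(val_idx), len(test_idx))
```

The validation indices would then feed `validation_data` during training, leaving the test indices untouched until the final evaluation.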



In [57]:
# Convert predicted probabilities and one-hot targets to class indices
y_pred_relu = np.argmax(predict_question, axis=1)
y_test_array = np.argmax(y_test, axis=1)

relu_accuracy = (y_pred_relu == y_test_array).mean()

In [58]:
precision, recall, relu_fscore, support = score(y_test_array, y_pred_relu, average='macro')
print(np.round(relu_accuracy, 3), np.round(relu_fscore, 3))


0.99 0.991
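
The second number is the macro-averaged F1 score: F1 is computed per class and then averaged with equal weight, so small categories count as much as large ones. A minimal sketch of that calculation on toy labels follows. The helper `macro_f1` and the label lists are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
toy_macro_f1 = macro_f1(y_true, y_pred)
print(round(toy_macro_f1, 3))
```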


In [None]:
labels = ['општо', 'маркетинг', 'дизајн', 'front-end програмирање', 
         'full-stack програмирање', 'Data Science', 'софтверско тестирање', 'UX/UI']

cr_we_relu = classification_report(y_test_array, y_pred_relu)
print('Classification Report: relu \n', cr_we_relu)
disp = ConfusionMatrixDisplay(confusion_matrix=confusion_matrix(y_test_array, y_pred_relu),
                              display_labels=labels)

disp = disp.plot(xticks_rotation='vertical')

plt.show()

In [None]:
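def class_predict(client_question):
    # Apply the same Latin-to-Cyrillic preprocessing used for the training questions
    prepared_question = latin_to_cyrillic([client_question])[0]
    client_question_embed = bpemb_mk.embed(prepared_question).mean(axis=0)
    question_reshape = client_question_embed.reshape(1, 300)
    # Use a distinct name so the function itself is not shadowed
    predicted_class = np.argmax(model.predict(question_reshape))
    print(labels[predicted_class])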

In [None]:
class_predict('kolku pari e akademijata za full-stack')

In [None]:
plt.hist(y_test_array, bins=8, color='g' )  
plt.ylabel('test_sample')
plt.xlabel('category')
plt.show()

# Make Benchmark and provide info

# Make Summary about your results