# Automatic question tagging system

**The main goal**: build a tag identifier that predicts tags from the question text.

The dataset contains the text of 10% of the questions and answers from the Stack Overflow programming Q&A website.

It is organized as three tables:

**Questions** contains the title, body, creation date, closed date (if applicable), score, and owner ID for all non-deleted Stack Overflow questions whose Id is a multiple of 10.  
**Answers** contains the body, creation date, score, and owner ID for each of the answers to these questions. The ParentId column links back to the Questions table.  
**Tags** contains the tags on each of these questions.
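
The relationship between the three tables can be illustrated with a few made-up rows: `Answers.ParentId` and `Tags.Id` both refer to `Questions.Id`. The frames below are toy data, not rows from the dataset, and the column names `Answers` and `Tags` in the result are chosen only for this sketch.

```python
import pandas as pd

# Toy stand-ins for the three CSV files (all values are invented).
questions = pd.DataFrame({"Id": [80, 90], "Title": ["Q about flex", "Q about svn"]})
answers = pd.DataFrame({"ParentId": [80, 80, 90], "Body": ["a1", "a2", "a3"]})
tags = pd.DataFrame({"Id": [80, 80, 90], "Tag": ["flex", "air", "svn"]})

# Collapse the one-to-many tables to one row per question.
answers_per_q = answers.groupby("ParentId")["Body"].agg(" ".join)
tags_per_q = tags.groupby("Id")["Tag"].agg(list)

# DataFrame.join is a left join by default, so unanswered questions would be kept.
combined = (
    questions.set_index("Id")
    .join(answers_per_q.rename("Answers"))
    .join(tags_per_q.rename("Tags"))
)
print(combined)
```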

The main steps:  
1. Data preprocessing:  
1.1. Data cleaning: remove HTML tags, numbers, datetimes, and special characters from the title and body (regex, nltk, BeautifulSoup)  
1.2. Stop-word removal (nltk)  
1.3. Tokenization (nltk)  
1.4. Lemmatization (WordNetLemmatizer)  
2. Vectorization (CountVectorizer/TfidfVectorizer; tune parameters such as `ngram_range` and `max_features`)  
3. Create a classifier (Multinomial Naive Bayes, linear SVM, KNN, and others)  
4. Tune the classifier parameters (GridSearchCV)  
5. Estimate classifier performance  
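
Steps 1.1 to 1.3 can be sketched with the standard library alone. This is a minimal illustration, not the preprocessing used later in this notebook: the `clean_text` helper and the tiny `STOP_WORDS` set are hypothetical stand-ins for the nltk and BeautifulSoup tooling named above, and lemmatization is left out.

```python
import html
import re

# Hypothetical, deliberately tiny stop-word list; nltk ships a much larger one.
STOP_WORDS = {"a", "an", "the", "is", "in", "to", "of", "i", "how", "do"}

TAG_RE = re.compile(r"<[^>]+>")          # anything that looks like an HTML tag
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # digits, punctuation, other symbols


def clean_text(raw: str) -> list[str]:
    """Strip HTML, lowercase, drop non-letters and stop words, then tokenize."""
    # Strip tags before unescaping so escaped text such as &lt;b&gt;
    # is not mistaken for markup.
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)
    text = text.lower()
    text = NON_LETTER_RE.sub(" ", text)
    return [token for token in text.split() if token not in STOP_WORDS]


tokens = clean_text("<p>How do I parse <code>JSON</code> in Python 3?</p>")
print(tokens)  # ['parse', 'json', 'python']
```

Dropping every non-letter also erases the characters that distinguish tags such as `c#` and `c++`, so a real cleaning step may want to protect those tokens first.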

Use your knowledge to build a machine learning pipeline that gives the most accurate predictions. The metric to maximize is **ROC_AUC**; the expected value is ROC_AUC > 0.6.
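
ROC AUC is a ranking metric: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by brute force and micro-averages over labels by pooling every (question, tag) cell. The helper names and the toy scores are made up for illustration.

```python
from itertools import product


def pairwise_auc(y_true: list[int], y_score: list[float]) -> float:
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for pos, neg in product(positives, negatives):
        if pos > neg:
            wins += 1.0
        elif pos == neg:
            wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


def micro_auc(y_true_rows, y_score_rows) -> float:
    """Micro average: flatten the label matrix and score all cells together."""
    flat_true = [t for row in y_true_rows for t in row]
    flat_score = [s for row in y_score_rows for s in row]
    return pairwise_auc(flat_true, flat_score)


# Toy data: 3 questions x 2 tags (hypothetical scores).
y_true = [[1, 0], [0, 1], [1, 1]]
y_score = [[0.9, 0.2], [0.4, 0.7], [0.6, 0.3]]
print(micro_auc(y_true, y_score))  # 0.875 (7 of 8 pairs ranked correctly)
```

A value of 0.5 corresponds to random ranking, so the target of 0.6 asks for a modest improvement over chance.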

In [1]:
import pandas as pd
from tqdm import tqdm
import spacy
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from skmultilearn.problem_transform import BinaryRelevance
from skmultilearn.problem_transform import ClassifierChain
from skmultilearn.problem_transform import LabelPowerset
from skmultilearn.adapt import MLkNN
import neattext.functions as nfx
tqdm.pandas()

In [3]:
question = pd.read_csv('data/Questions.csv', encoding='latin')

In [4]:
answer = pd.read_csv('data/Answers.csv', encoding='latin')

In [8]:
tags = pd.read_csv('data/Tags.csv', encoding='latin')

In [6]:
question = question.drop(['OwnerUserId', 'CreationDate', 'ClosedDate', 'Score'], axis=1)
question

Unnamed: 0,Id,Title,Body
0,80,SQLStatement.execute() - multiple queries in o...,<p>I've written a database generation script i...
1,90,Good branching and merging tutorials for Torto...,<p>Are there any really good tutorials explain...
2,120,ASP.NET Site Maps,<p>Has anyone got experience creating <strong>...
3,180,Function for creating color wheels,<p>This is something I've pseudo-solved many t...
4,260,Adding scripting functionality to .NET applica...,<p>I have a little game written in C#. It uses...
...,...,...,...
1264211,40143210,URL routing in PHP (MVC),<p>I am building a custom MVC project and I ha...
1264212,40143300,Bigquery.Jobs.Insert - Resumable Upload?,<p>The API docs show that you should be able t...
1264213,40143340,Obfuscating code in android studio,<p>Under minifyEnabled I changed from false to...
1264214,40143360,How to fire function after v-model change?,<p>I have input which I use to filter my array...


In [7]:
question = question.set_index('Id')
question

Unnamed: 0_level_0,Title,Body
Id,Unnamed: 1_level_1,Unnamed: 2_level_1
80,SQLStatement.execute() - multiple queries in o...,<p>I've written a database generation script i...
90,Good branching and merging tutorials for Torto...,<p>Are there any really good tutorials explain...
120,ASP.NET Site Maps,<p>Has anyone got experience creating <strong>...
180,Function for creating color wheels,<p>This is something I've pseudo-solved many t...
260,Adding scripting functionality to .NET applica...,<p>I have a little game written in C#. It uses...
...,...,...
40143210,URL routing in PHP (MVC),<p>I am building a custom MVC project and I ha...
40143300,Bigquery.Jobs.Insert - Resumable Upload?,<p>The API docs show that you should be able t...
40143340,Obfuscating code in android studio,<p>Under minifyEnabled I changed from false to...
40143360,How to fire function after v-model change?,<p>I have input which I use to filter my array...


# Data preparation

In [8]:
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])

def lemmatize(text: str) -> list:
    doc = nlp(text)
    lemmas = [token.lemma_ for token in doc]
    return lemmas

In [9]:
def data_preprocessing(text: str) -> str:
    text = nfx.remove_urls(nfx.remove_html_tags(text))
    text = nfx.remove_emails(text)
    text = nfx.remove_emojis(text)
    text = nfx.remove_numbers(text)
    text = nfx.remove_special_characters(text)
    text = nfx.remove_stopwords(text)
    text = nfx.remove_punctuations(text, most_common=False)
    text = nfx.remove_multiple_spaces(text)
    text = text.lower()

    lemmas = lemmatize(text)

    return ' '.join(lemmas)

In [10]:
merged_answers = answer[['ParentId', 'Body']].groupby(['ParentId'])['Body'].progress_apply(' '.join).reset_index()
answer = merged_answers
del merged_answers

100%|██████████| 1102568/1102568 [00:26<00:00, 41663.21it/s]


In [11]:
answer = answer.set_index('ParentId')
answer

Unnamed: 0_level_0,Body
ParentId,Unnamed: 1_level_1
80,<p>I wound up using this. It is a kind of a ha...
90,"<p><a href=""http://svnbook.red-bean.com/"">Vers..."
120,<p>The Jeff Prosise version from MSDN magazine...
180,<p>I've read somewhere the human eye can't dis...
260,"<p>Yes, I thought about that, but I soon figur..."
...,...
40142860,<p>It's faster and more reliable to work with ...
40142900,"<p>It's not you, it's LinkedIn. See others com..."
40142910,<p>Try add <code>retrun false</code> in the <c...
40142940,<p>Here's how you can do it:</p>\n\n<pre><code...


In [12]:
question = question.join(answer, rsuffix='_answer')

In [14]:
question

Unnamed: 0_level_0,Title,Body,Body_answer
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
80,SQLStatement.execute() - multiple queries in o...,<p>I've written a database generation script i...,<p>I wound up using this. It is a kind of a ha...
90,Good branching and merging tutorials for Torto...,<p>Are there any really good tutorials explain...,"<p><a href=""http://svnbook.red-bean.com/"">Vers..."
120,ASP.NET Site Maps,<p>Has anyone got experience creating <strong>...,<p>The Jeff Prosise version from MSDN magazine...
180,Function for creating color wheels,<p>This is something I've pseudo-solved many t...,<p>I've read somewhere the human eye can't dis...
260,Adding scripting functionality to .NET applica...,<p>I have a little game written in C#. It uses...,"<p>Yes, I thought about that, but I soon figur..."
...,...,...,...
40143210,URL routing in PHP (MVC),<p>I am building a custom MVC project and I ha...,
40143300,Bigquery.Jobs.Insert - Resumable Upload?,<p>The API docs show that you should be able t...,
40143340,Obfuscating code in android studio,<p>Under minifyEnabled I changed from false to...,
40143360,How to fire function after v-model change?,<p>I have input which I use to filter my array...,


In [13]:
dataset = pd.DataFrame()

In [14]:
dataset['q_and_a'] = question['Title'] + ' ' + question['Body'] + ' ' + question['Body_answer'].fillna('')

In [15]:
dataset

Unnamed: 0_level_0,q_and_a
Id,Unnamed: 1_level_1
80,SQLStatement.execute() - multiple queries in o...
90,Good branching and merging tutorials for Torto...
120,ASP.NET Site Maps <p>Has anyone got experience...
180,Function for creating color wheels <p>This is ...
260,Adding scripting functionality to .NET applica...
...,...
40143210,URL routing in PHP (MVC) <p>I am building a cu...
40143300,Bigquery.Jobs.Insert - Resumable Upload? <p>Th...
40143340,Obfuscating code in android studio <p>Under mi...
40143360,How to fire function after v-model change? <p>...


In [15]:
dataset = dataset['q_and_a'].progress_apply(data_preprocessing)
dataset

100%|██████████| 1264216/1264216 [6:50:10<00:00, 51.37it/s]   


Id
80          sqlstatement execute multiple query statement ...
90          good branching merge tutorial tortoisesvn good...
120         asp net site map get experience create sql bas...
180         function create color wheel pseudo solve time ...
260         add script functionality net application littl...
                                  ...                        
40143210    url route php mvc building custom mvc project ...
40143300    bigquery job insert resumable upload api doc a...
40143340    obfuscating code android studio minifyenable c...
40143360    fire function v model change input use filter ...
40143380    npm run mocha test file cache run mocha test n...
Name: q_and_a, Length: 1264216, dtype: object

In [16]:
dataset.to_csv('data/preprocessed_data1_2.csv', sep=',', encoding='utf-8')

In [3]:
dataset = pd.read_csv('data/preprocessed_data1_2.csv', sep=',', index_col='Id')

In [4]:
dataset.head()

Unnamed: 0_level_0,q_and_a
Id,Unnamed: 1_level_1
80,sqlstatement execute multiple query statement ...
90,good branching merge tutorial tortoisesvn good...
120,asp net site map get experience create sql bas...
180,function create color wheel pseudo solve time ...
260,add script functionality net application littl...


In [7]:
tags = tags[tags['Tag'].str.strip().astype(bool)]

In [8]:
tags = tags[tags['Tag'].notna()]

In [33]:
tags

Unnamed: 0,Id,Tag
0,80,flex
1,80,actionscript-3
2,80,air
3,90,svn
4,90,tortoisesvn
...,...,...
3750989,40143360,javascript
3750990,40143360,vue.js
3750991,40143380,npm
3750992,40143380,mocha


In [9]:
merged_tags = tags.groupby(['Id'])['Tag'].progress_apply(', '.join).reset_index()

100%|██████████| 1264214/1264214 [00:43<00:00, 28834.25it/s]


In [10]:
merged_tags = merged_tags.set_index('Id')
merged_tags

Unnamed: 0_level_0,Tag
Id,Unnamed: 1_level_1
80,"flex, actionscript-3, air"
90,"svn, tortoisesvn, branch, branching-and-merging"
120,"sql, asp.net, sitemap"
180,"algorithm, language-agnostic, colors, color-space"
260,"c#, .net, scripting, compiler-construction"
...,...
40143210,"php, .htaccess"
40143300,google-bigquery
40143340,"android, android-studio"
40143360,"javascript, vue.js"


In [11]:
dataset = dataset.join(merged_tags, how='right')

In [12]:
dataset

Unnamed: 0_level_0,q_and_a,Tag
Id,Unnamed: 1_level_1,Unnamed: 2_level_1
80,sqlstatement execute multiple query statement ...,"flex, actionscript-3, air"
90,good branching merge tutorial tortoisesvn good...,"svn, tortoisesvn, branch, branching-and-merging"
120,asp net site map get experience create sql bas...,"sql, asp.net, sitemap"
180,function create color wheel pseudo solve time ...,"algorithm, language-agnostic, colors, color-space"
260,add script functionality net application littl...,"c#, .net, scripting, compiler-construction"
...,...,...
40143210,url route php mvc building custom mvc project ...,"php, .htaccess"
40143300,bigquery job insert resumable upload api doc a...,google-bigquery
40143340,obfuscating code android studio minifyenable c...,"android, android-studio"
40143360,fire function v model change input use filter ...,"javascript, vue.js"


In [13]:
dataset.to_csv('data/preprocessed_data_2.csv', sep=',', encoding='utf-8')

In [3]:
data = pd.read_csv('data/preprocessed_data_2.csv', delimiter=',')

In [4]:
data = data.set_index('Id')
data

Unnamed: 0_level_0,q_and_a,Tag
Id,Unnamed: 1_level_1,Unnamed: 2_level_1
80,sqlstatement execute multiple query statement ...,"flex, actionscript-3, air"
90,good branching merge tutorial tortoisesvn good...,"svn, tortoisesvn, branch, branching-and-merging"
120,asp net site map get experience create sql bas...,"sql, asp.net, sitemap"
180,function create color wheel pseudo solve time ...,"algorithm, language-agnostic, colors, color-space"
260,add script functionality net application littl...,"c#, .net, scripting, compiler-construction"
...,...,...
40143210,url route php mvc building custom mvc project ...,"php, .htaccess"
40143300,bigquery job insert resumable upload api doc a...,google-bigquery
40143340,obfuscating code android studio minifyenable c...,"android, android-studio"
40143360,fire function v model change input use filter ...,"javascript, vue.js"


In [5]:
data['Tag'] = data['Tag'].progress_apply(lambda row: row.split(', '))

100%|██████████| 1264214/1264214 [00:02<00:00, 498269.08it/s]


In [6]:
data.head()

Unnamed: 0_level_0,q_and_a,Tag
Id,Unnamed: 1_level_1,Unnamed: 2_level_1
80,sqlstatement execute multiple query statement ...,"[flex, actionscript-3, air]"
90,good branching merge tutorial tortoisesvn good...,"[svn, tortoisesvn, branch, branching-and-merging]"
120,asp net site map get experience create sql bas...,"[sql, asp.net, sitemap]"
180,function create color wheel pseudo solve time ...,"[algorithm, language-agnostic, colors, color-s..."
260,add script functionality net application littl...,"[c#, .net, scripting, compiler-construction]"


In [9]:
tags = tags.sort_values('Tag').dropna()

In [10]:
tags.tail()

Unnamed: 0,Id,Tag
3367736,36801350,zynq
2863261,32140970,zynq
2699353,30507030,zynq
3687053,39632310,zypper
3349419,36638390,zypper


In [11]:
tags_frequency = tags.value_counts(subset=['Tag'])

In [12]:
tags_list = tags['Tag'].unique()

In [13]:
# Note: despite the name, the slice below keeps the 100 most frequent tags.
tags_top_200 = [tag for tag in tags_list if tag in tags_frequency[:100]]

In [14]:
tags_top_200

['.htaccess',
 '.net',
 'ajax',
 'algorithm',
 'android',
 'angularjs',
 'apache',
 'api',
 'arrays',
 'asp.net',
 'asp.net-mvc',
 'bash',
 'c',
 'c#',
 'c++',
 'class',
 'codeigniter',
 'cordova',
 'css',
 'css3',
 'database',
 'django',
 'eclipse',
 'entity-framework',
 'excel',
 'excel-vba',
 'facebook',
 'file',
 'forms',
 'function',
 'git',
 'google-chrome',
 'google-maps',
 'hibernate',
 'html',
 'html5',
 'image',
 'ios',
 'iphone',
 'java',
 'javascript',
 'jquery',
 'json',
 'jsp',
 'laravel',
 'linq',
 'linux',
 'list',
 'matlab',
 'maven',
 'mongodb',
 'multithreading',
 'mysql',
 'node.js',
 'objective-c',
 'oracle',
 'osx',
 'performance',
 'perl',
 'php',
 'postgresql',
 'python',
 'python-2.7',
 'python-3.x',
 'qt',
 'r',
 'regex',
 'rest',
 'ruby',
 'ruby-on-rails',
 'ruby-on-rails-3',
 'scala',
 'shell',
 'sockets',
 'spring',
 'sql',
 'sql-server',
 'sql-server-2008',
 'sqlite',
 'string',
 'swift',
 'swing',
 'symfony2',
 'tsql',
 'twitter-bootstrap',
 'uitableview'

In [15]:
data.head()

Unnamed: 0_level_0,q_and_a,Tag
Id,Unnamed: 1_level_1,Unnamed: 2_level_1
80,sqlstatement execute multiple query statement ...,"[flex, actionscript-3, air]"
90,good branching merge tutorial tortoisesvn good...,"[svn, tortoisesvn, branch, branching-and-merging]"
120,asp net site map get experience create sql bas...,"[sql, asp.net, sitemap]"
180,function create color wheel pseudo solve time ...,"[algorithm, language-agnostic, colors, color-s..."
260,add script functionality net application littl...,"[c#, .net, scripting, compiler-construction]"


In [16]:
columns = ['q_and_a', *tags_top_200]

In [None]:
columns

In [17]:
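rows = []  # filled in place by add_flags, one list per question


def add_flags(row):
    """Append [text, flag_1, ..., flag_n] for one question to `rows`."""
    new_row = [row['q_and_a']]
    for t_ in tags_top_200:
        # 1 if the question carries this frequent tag, else 0
        new_row.append(1 if t_ in row['Tag'] else 0)
    rows.append(new_row)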

In [18]:
data.progress_apply(add_flags, axis=1)

100%|██████████| 1264214/1264214 [09:47<00:00, 2151.73it/s]


Id
80          None
90          None
120         None
180         None
260         None
            ... 
40143210    None
40143300    None
40143340    None
40143360    None
40143380    None
Length: 1264214, dtype: object
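
The row-by-row flagging above took almost ten minutes. One possible alternative, sketched here on toy data, is scikit-learn's `MultiLabelBinarizer` with a fixed `classes` list, which builds the indicator matrix in a single call; tags outside the list are ignored with a warning. The `top_tags` list and the sample rows are illustrative, and this is not what the notebook ran.

```python
import warnings

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

top_tags = ["c#", ".net", "sql"]  # illustrative subset of frequent tags
toy = pd.DataFrame(
    {
        "q_and_a": ["add script net application", "asp net site map"],
        "Tag": [["c#", ".net", "scripting"], ["sql", "asp.net"]],
    }
)

mlb = MultiLabelBinarizer(classes=top_tags)
with warnings.catch_warnings():
    # Tags missing from `classes` trigger a UserWarning; silence it here.
    warnings.simplefilter("ignore", UserWarning)
    flags = mlb.fit_transform(toy["Tag"])

# One 0/1 column per tag, in the order given by `classes`.
flag_frame = pd.DataFrame(flags, columns=mlb.classes_, index=toy.index)
result = pd.concat([toy[["q_and_a"]], flag_frame], axis=1)
print(result)
```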

In [19]:
separated_tags_frame = pd.DataFrame(data=rows, columns=columns, dtype=float)


In [20]:
separated_tags_frame

Unnamed: 0,q_and_a,.htaccess,.net,ajax,algorithm,android,angularjs,apache,api,arrays,...,visual-studio,visual-studio-2010,wcf,web-services,windows,winforms,wordpress,wpf,xcode,xml
0,sqlstatement execute multiple query statement ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,good branching merge tutorial tortoisesvn good...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,asp net site map get experience create sql bas...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,function create color wheel pseudo solve time ...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,add script functionality net application littl...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1264209,url route php mvc building custom mvc project ...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1264210,bigquery job insert resumable upload api doc a...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1264211,obfuscating code android studio minifyenable c...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1264212,fire function v model change input use filter ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [21]:
separated_tags_frame.to_csv('data/q&a_separated_tags_2.csv', sep=',', encoding='utf-8')

In [3]:
final_dataset = pd.read_csv('data/q&a_separated_tags_2.csv', sep=',', index_col=0)

In [4]:
final_dataset

Unnamed: 0,q_and_a,.htaccess,.net,ajax,algorithm,android,angularjs,apache,api,arrays,...,visual-studio,visual-studio-2010,wcf,web-services,windows,winforms,wordpress,wpf,xcode,xml
0,sqlstatement execute multiple query statement ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,good branching merge tutorial tortoisesvn good...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,asp net site map get experience create sql bas...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,function create color wheel pseudo solve time ...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,add script functionality net application littl...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1264209,url route php mvc building custom mvc project ...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1264210,bigquery job insert resumable upload api doc a...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1264211,obfuscating code android studio minifyenable c...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1264212,fire function v model change input use filter ...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [5]:
seed = 99

In [6]:
final_dataset = final_dataset.sample(frac=0.3, random_state=seed)

In [7]:
final_dataset

Unnamed: 0,q_and_a,.htaccess,.net,ajax,algorithm,android,angularjs,apache,api,arrays,...,visual-studio,visual-studio-2010,wcf,web-services,windows,winforms,wordpress,wpf,xcode,xml
34539,display html inside text area rail know questi...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
963446,rail join exclude certain id follow query pull...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
578607,matlab mex file crash window debug mex file co...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
519952,jquery ajax work call ashx handler asp button ...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
729187,pattern fall object particular level note refe...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
118083,update div net mvc jquery net mvc jquery creat...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1183485,jquery fire occur page modal want right functi...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1088404,java web start start sporadic machine find sol...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
261362,determine tomorrow date j try determine tomorr...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    final_dataset['q_and_a'],
    final_dataset[final_dataset.columns[1:]],
    test_size=0.3,
    random_state=seed
)
del final_dataset

In [19]:
X_train.shape[0], X_test.shape[0], y_train.shape[0], y_test.shape[0]

(265484, 113780, 265484, 113780)

In [22]:
from sklearn.metrics import f1_score, roc_auc_score

In [10]:
vectorizer = TfidfVectorizer(min_df=300)
vectorizer.fit(X_train)
X_train = vectorizer.transform(X_train)
X_test = vectorizer.transform(X_test)

In [43]:
def build_model(model, mlb_estimator, xtrain, ytrain, xtest, ytest):
    """Fit a multi-label wrapper around `model` and score it on the test split."""
    clf = mlb_estimator(model)
    clf.fit(xtrain, ytrain)
    clf_predictions = clf.predict(xtest)  # sparse 0/1 indicator matrix
    score_f1 = f1_score(ytest, clf_predictions, average='micro', zero_division=0)
    # Note: ROC AUC is computed here from hard 0/1 predictions, not from scores.
    score_roc = roc_auc_score(ytest.to_numpy(), clf_predictions.toarray(), average='micro')
    result = {"f1 score": score_f1, "roc auc score": score_roc}
    return result
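
`build_model` passes hard 0/1 predictions to `roc_auc_score`, which reduces each ROC curve to a single operating point. The metric is more commonly computed from continuous scores. The sketch below shows both variants side by side on synthetic data, using scikit-learn's `OneVsRestClassifier` as a stand-in for the skmultilearn wrappers; the data and variable names are invented, and the printed values say nothing about the Stack Overflow dataset.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)

# Synthetic non-negative "term counts" and two binary labels derived from them.
X = rng.poisson(2.0, size=(300, 6))
y = np.column_stack([
    (X[:, 0] + X[:, 1] > X[:, 2] + X[:, 3]).astype(int),
    (X[:, 4] > X[:, 5]).astype(int),
])
X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]

clf = OneVsRestClassifier(MultinomialNB()).fit(X_tr, y_tr)

hard = clf.predict(X_te)        # 0/1 decisions at a fixed threshold
soft = clf.predict_proba(X_te)  # per-label probabilities, shape (n_samples, n_labels)

auc_hard = roc_auc_score(y_te, hard, average="micro")
auc_soft = roc_auc_score(y_te, soft, average="micro")
print(f"AUC from hard predictions: {auc_hard:.3f}")
print(f"AUC from probabilities:    {auc_soft:.3f}")
```

The scores reported in this notebook use the hard-prediction variant, so they are not directly comparable with ROC AUC values computed from probabilities.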

 *ClassifierChain->MultinomialNB*

In [35]:
build_model(MultinomialNB(),ClassifierChain,X_train,y_train, X_test, y_test)

{'f1 score': 0.4370116545210597, 'roc auc score': 0.7070007072073735}

 *LabelPowerset->MultinomialNB*

In [34]:
build_model(MultinomialNB(),LabelPowerset,X_train,y_train, X_test, y_test)

{'f1 score': 0.43222260516243716, 'roc auc score': 0.6503465817925771}

 *BinaryRelevance->MultinomialNB*

In [36]:
build_model(MultinomialNB(),BinaryRelevance,X_train,y_train, X_test, y_test)

{'f1 score': 0.44881146974391506, 'roc auc score': 0.6734211273003048}
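
The three wrappers above differ in how they turn a multi-label target into something a single-label classifier can fit. Label Powerset, for instance, treats every distinct combination of tags as one class. A minimal pure-Python sketch of that mapping on toy label rows follows; the function name is made up, and this is not skmultilearn's implementation.

```python
def label_powerset(rows: list[tuple[int, ...]]) -> tuple[list[int], dict[tuple[int, ...], int]]:
    """Map each distinct label combination to an integer class id."""
    combo_to_class: dict[tuple[int, ...], int] = {}
    classes = []
    for row in rows:
        # Assign the next free id the first time a combination is seen.
        class_id = combo_to_class.setdefault(row, len(combo_to_class))
        classes.append(class_id)
    return classes, combo_to_class


# Toy indicator rows for three tags, e.g. (python, sql, html).
y = [(1, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0), (0, 0, 0)]
classes, mapping = label_powerset(y)
print(classes)       # [0, 1, 0, 2, 3]: one multi-class target instead of three binary ones
print(len(mapping))  # 4 classes for the base classifier to separate
```

With 100 tags the number of distinct combinations can grow large, which can make this transformation costly in both time and memory.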

*MLkNN*

In [15]:
mlknn_classifier = MLkNN()
mlknn_classifier.fit(X_train, y_train.to_numpy())

MLkNN()

In [16]:
predicts = mlknn_classifier.predict(X_test)

In [20]:
roc_auc_score(y_test, predicts.toarray(), average='micro')

0.6760656510423764

In [17]:
f1_score(y_test, predicts, average='micro')

0.46006032661850405

In [37]:
build_model(GaussianNB(), ClassifierChain,X_train,y_train, X_test, y_test)

{'f1 score': 0.17267681723297937, 'roc auc score': 0.8143920608786583}

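_The remaining GaussianNB runs did not complete: the LabelPowerset run below was interrupted manually and the BinaryRelevance cell was never executed._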

In [38]:
build_model(GaussianNB(), LabelPowerset,X_train,y_train, X_test, y_test)

KeyboardInterrupt: 

In [None]:
build_model(GaussianNB(), BinaryRelevance,X_train,y_train, X_test, y_test)

In [15]:
from sklearn.model_selection import GridSearchCV
parameters = {'k': range(1,3), 's': [0.5, 0.7, 1.0]}
score = 'f1_micro'
clf = GridSearchCV(MLkNN(), parameters, scoring=score)
clf.fit(X_train, y_train.to_numpy())
(clf.best_params_, clf.best_score_)

KeyboardInterrupt: 

In [15]:
from sklearn.svm import SVC

In [None]:
build_model(SVC(random_state=seed),ClassifierChain,X_train,y_train, X_test, y_test)

In [None]:
build_model(SVC(random_state=seed),LabelPowerset,X_train,y_train, X_test, y_test)

In [None]:
build_model(SVC(random_state=seed),BinaryRelevance,X_train,y_train, X_test, y_test)

In [9]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipe = Pipeline(
    steps=[
        ('TfidfVectorizer', TfidfVectorizer(min_df=200)),
        ('LabelPowerset', LabelPowerset())
    ]
)

In [10]:
params_grid = {
    'TfidfVectorizer__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'LabelPowerset__classifier': [MultinomialNB()]
}

In [25]:
from sklearn.model_selection import KFold

cv = KFold(n_splits=5, shuffle=True, random_state=seed)

In [12]:
pipeline_gc = GridSearchCV(
    pipe,
    param_grid=params_grid,
    scoring='f1_micro',  # scorer name must be a string
    cv=cv,               # the shuffled 5-fold splitter defined above
    n_jobs=1,
    return_train_score=True,
    error_score='raise',
    verbose=True
)

In [13]:
pipeline_gc.fit(X_train, y_train)

Fitting 5 folds for each of 3 candidates, totalling 15 fits


MemoryError: Unable to allocate 14.0 GiB for an array with shape (212387, 8829) and data type float64
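
The size in the error message follows directly from the array shape: a dense `float64` array needs 8 bytes per element. A quick check of the arithmetic, using only the numbers reported above:

```python
rows, cols = 212_387, 8_829  # shape from the MemoryError message
bytes_per_float64 = 8

dense_bytes = rows * cols * bytes_per_float64
dense_gib = dense_bytes / 2**30
print(f"{dense_gib:.1f} GiB")  # 14.0 GiB
```

212,387 rows is four fifths of the 265,484 training rows, which matches one training fold of the 5-fold search. Keeping the matrix sparse end to end would avoid this allocation; whether the skmultilearn wrapper converts its input to a dense array appears to depend on its `require_dense` setting, which is worth checking in the library documentation.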