# Retrieval of Relevant Answers to Stack Overflow Queries by Ranking the Answers

Stack Overflow is the Q&A forum most used by software developers, and it is a valuable resource for novice programmers trying to resolve their queries. According to the 2019 Stack Overflow survey, almost 97% of developers visited the site in the preceding year, and about 50 million people visit it each month to learn, share, and build their careers. Despite these advantages, the forum has a few drawbacks. Because of the scale of collaboration, the same question exists in many different forms, and each question usually has more than one answer, so many users find it difficult to get relevant answers to their queries. To overcome these difficulties, we study the semantic similarity between questions and the answers available on Stack Overflow, and we rank the answers to the questions most similar to a search query to produce the most appropriate answer set for the user.

We divided this project into three parts:

1. Question Processing
2. Answer Processing
3. Question and Answer Relevance Mapping

**Question Processing:** In this phase, we compare the probe question with the Stack Overflow question corpus to find the most similar questions in the corpus.

**Answer Processing:** In this phase, the answers associated with the selected questions are filtered and grouped into answer clusters.

**Question and Answer Relevance Mapping:** Each answer cluster is compared with the search question using cosine similarity to measure the relevance between the question and the answers.
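
For reference, the cosine measure used in all three phases can be written as follows. This is the standard definition; $q$ and $a$ are notation introduced here for the embedding of the search question and the embedding of a candidate question or answer.

$$
\mathrm{sim}(q, a) = \frac{q \cdot a}{\lVert q \rVert \, \lVert a \rVert},
\qquad
\mathrm{dist}(q, a) = 1 - \mathrm{sim}(q, a)
$$

SciPy's `cdist(..., "cosine")` returns $\mathrm{dist}$, which is why the question-matching code reports `1 - distance` as its score.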

## Question Processing

### Data Extraction 

In [2]:
import pandas as pd
import warnings

# Ignore warnings
warnings.filterwarnings('ignore')
quest = pd.read_csv("./Questions.csv", encoding='latin-1')
quest.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,Score,Title,Body
0,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...
1,502,147.0,2008-08-02T17:01:58Z,27,Get a preview JPEG of a PDF on Windows?,<p>I have a cross-platform (Python) applicatio...
2,535,154.0,2008-08-02T18:43:54Z,40,Continuous Integration System for a Python Cod...,<p>I'm starting work on a hobby project with a...
3,594,116.0,2008-08-03T01:15:08Z,25,cx_Oracle: How do I iterate over a result set?,<p>There are several ways to iterate over a re...
4,683,199.0,2008-08-03T13:19:16Z,28,Using 'in' to match an attribute of Python obj...,<p>I don't remember whether I was dreaming or ...


**DATA PREPROCESSING**

In [3]:
#Filtering WH questions 
Wh_df = quest[quest.Title.str.lower().str.startswith('wh')]
Wh_df.shape

(22722, 6)

In [8]:
#Converting text to lower cased tokens
#Convert Accented Characters
from gensim.utils import simple_preprocess
def simple_process(question_list):
    for each in question_list:
        yield(simple_preprocess(str(each), deacc = True, max_len = 20, min_len = 1 ))
Question_Title = list(Wh_df['Title'])
Question_data = list(simple_process(Question_Title))

In [9]:
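#Removing stopwords
from nltk.corpus import stopwords
#Requires the NLTK stopword list: run nltk.download('stopwords') once if it is missing
#A set gives constant-time membership checks
stop_words = set(stopwords.words("english"))
def rem_stop(question):
    return [word for word in question if word not in stop_words ]

#Stopword-free tokens, used for the word2vec features
Question_data_filtered1 = [rem_stop(question) for question in Question_data ]
#Full lower-cased titles (stopwords kept), used for the BERT sentence embeddings
Question_data_filtered = [ ' '.join(question) for question in Question_data ]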
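
The preprocessing steps above can be sketched without external libraries. This is only a rough stand-in for `simple_preprocess` and the NLTK stopword list: the regular expression, the tiny `STOP_WORDS` set, and the sample title are assumptions made for illustration.

```python
import re
import unicodedata

# Tiny illustrative stopword set (assumption); the notebook uses NLTK's English list.
STOP_WORDS = {"how", "to", "of", "in", "my", "the"}

def tokenize(text):
    """Lowercase, strip accents, and keep runs of ASCII letters as tokens."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.findall(r"[a-z]+", no_accents)

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword set."""
    return [tok for tok in tokens if tok not in STOP_WORDS]

# Hypothetical title with an accented character to exercise the de-accenting step.
title = "How to find the versión of Python installed in my system?"
tokens = tokenize(title)
content_tokens = remove_stopwords(tokens)
print(content_tokens)  # ['find', 'version', 'python', 'installed', 'system']
```

Keeping both the full token list and the stopword-free list mirrors the notebook, where the two forms feed different embedding models.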

#### Feature Extraction - Word2Vec

In [11]:
#Loading the stack overflow trained word2vec model
from gensim.models.keyedvectors import KeyedVectors
word_vect = KeyedVectors.load_word2vec_format("./SO_vectors_200.bin", binary=True)

In [14]:
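#Handling unseen words - the pretrained model does not contain every word in the titles, so words missing from its vocabulary are removed
#Note: .vocab is the gensim < 4.0 API; gensim >= 4.0 exposes the vocabulary as word_vect.key_to_index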
def unseen_words(question):
    return [word for word in question if word in word_vect.vocab ]
word2vec_list = [unseen_words(question) for question in Question_data_filtered1 ]
word_evec_trans = [word_vect[question] for question in word2vec_list if question != []]
word2vec_list

[['best', 'way', 'distribute', 'python', 'command', 'line', 'tools'],
 ['best',
  'way',
  'grab',
  'parse',
  'command',
  'line',
  'arguments',
  'passed',
  'python',
  'script'],
 ['java', 'python', 'garbage', 'collection', 'methods', 'different'],
 ['python', 'date', 'time', 'conversion', 'seem', 'wrong'],
 ['best', 'way', 'duplicate', 'fork', 'windows'],
 ['learn', 'pypy', 'translation', 'function'],
 ['refactoring', 'tools', 'use', 'python'],
 ['best', 'way', 'use', 'web', 'services', 'python'],
 ['python', 'iter', 'mapping', 'return', 'iterkeys', 'instead', 'iteritems'],
 ['instance', 'variable'],
 ['double', 'star', 'star', 'python', 'parameters'],
 ['easiest', 'way', 'read', 'foxpro', 'dbf', 'file', 'python'],
 ['class', 'methods', 'python'],
 ['best', 'way', 'return', 'multiple', 'values', 'function', 'python'],
 ['best', 'way', 'bit', 'field', 'manipulation', 'python'],
 ['tuple', 'useful'],
 ['find',
  'time',
  'space',
  'complexity',
  'built',
  'sequence',
  'types'

In [30]:
#Sentence Embedding
import numpy as np
def sentence_emb(question,word_vect,dim = 200):
    question_embedding = np.zeros(dim)
    valid_words = 0
    for word in question:
        if word in word_vect.vocab:
            valid_words += 1
            question_embedding += word_vect[word]
    if valid_words > 0:
        return question_embedding/valid_words
    else:
        return question_embedding
all_title_embeddings = []
i = 0
for question in word2vec_list:
    if question != []:
        list_val = list(sentence_emb(question, word_vect))
        all_title_embeddings.append(list_val) 
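
The sentence embedding above is a mean of word vectors. A minimal sketch of the same idea, assuming a plain dictionary `toy_vectors` in place of the pretrained model (the vectors and the 3-dimensional size are made up for illustration):

```python
def mean_embedding(tokens, vectors, dim):
    """Average the vectors of the known tokens; return zeros if none are known."""
    total = [0.0] * dim
    hits = 0
    for tok in tokens:
        vec = vectors.get(tok)
        if vec is None:
            continue  # out-of-vocabulary word, skipped as in sentence_emb
        hits += 1
        total = [t + v for t, v in zip(total, vec)]
    if hits == 0:
        return total
    return [t / hits for t in total]

# Hypothetical 3-dimensional word vectors.
toy_vectors = {"python": [1.0, 0.0, 2.0], "version": [3.0, 2.0, 0.0]}
emb = mean_embedding(["python", "version", "foxpro"], toy_vectors, dim=3)
print(emb)  # [2.0, 1.0, 1.0]
```

Dividing by the number of matched words rather than the title length keeps unknown words from shrinking the vector.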

In [36]:
#Identifying empty values - to merge the processed data we need the position of each empty title in the original list
empty_values = [i for i, question in enumerate(word2vec_list) if question == []]

In [39]:
#Removing empty values from data - empty values corresponds to titles with no proper details
Wh_df_1 = Wh_df.copy(deep = True)
Wh_df_1 = Wh_df_1.drop(Wh_df_1.index[empty_values])
ids_fil = Wh_df_1.index
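
Dropping the empty titles only keeps the DataFrame aligned with the embeddings if each empty title is identified by its position in the original list. A small sketch of that bookkeeping, with hypothetical token lists and row labels:

```python
# Hypothetical tokenised titles; two of them lost every word during filtering.
token_lists = [["python", "version"], [], ["tuple", "useful"], [], ["class", "methods"]]
# Hypothetical DataFrame index labels, one per title.
row_labels = [469, 502, 535, 594, 683]

# enumerate yields the position in the original list, including the empty entries.
empty_positions = [pos for pos, toks in enumerate(token_lists) if not toks]
empty_set = set(empty_positions)

# Labels and token lists that survive, still aligned one-to-one.
kept_labels = [lab for pos, lab in enumerate(row_labels) if pos not in empty_set]
kept_tokens = [toks for toks in token_lists if toks]

print(empty_positions)  # [1, 3]
print(kept_labels)      # [469, 535, 683]
```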

#### Feature Extraction - BERT Embedding

In [45]:
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer('bert-base-nli-mean-tokens')

In [46]:
#Converting text to vectors 
question_embeddings = embedder.encode(Question_data_filtered)

**Search Question Processing**

**Text Similarity - Probe Question & Question Corpus**

In [0]:
query = ["How to find version of python installed in my system"]
#Bert Encoding
query_embedding = list(embedder.encode(query))
ids = Wh_df.index
#Creating dataframe that holds all processed titles with index 
df_bert = pd.DataFrame({'pos':ids, 'en_ques':Question_data_filtered})
#Cosine distance between the query and every title (distance = 1 - similarity)
import scipy.spatial
distances = scipy.spatial.distance.cdist([list(query_embedding[0])], question_embeddings, "cosine")[0]
results = zip(range(len(distances)), distances)
results = sorted(results, key=lambda x: x[1])
#Fetching the top 5 similar questions
frames = []
for idx, distance in results[0:5]:
    index_val = list(df_bert[df_bert['en_ques'] == Question_data_filtered[idx]]['pos'])
    matches = Wh_df[Wh_df.index.isin(index_val)]
    frames.append(pd.DataFrame({'Id': matches['Id'], 'Title': matches['Title'], 'Score': 1-distance }))
#DataFrame.append was removed in pandas 2.0, so the per-question frames are concatenated instead
df_res = pd.concat(frames)
df_res



Unnamed: 0,Id,Title,Score
83866,8917885,Which version of Python do I have installed?,0.943389
375196,29458249,Which version of Pip to use with my Python ins...,0.931776
417133,31678261,Where is a program that will let me execute Py...,0.925464
24906,3049569,Where do I put utility functions in my Python ...,0.921216
180429,17098004,which version of python is used when I run it ...,0.91991
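
The retrieval step amounts to scoring every corpus vector against the query and keeping the best few. A library-free sketch, assuming toy 2-dimensional vectors in place of the real embeddings (`cosine_similarity` and `top_k` are names introduced here):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query_vec, corpus_vecs, k):
    """Return (position, similarity) pairs for the k most similar corpus vectors."""
    scored = [(idx, cosine_similarity(query_vec, vec)) for idx, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical corpus of three embeddings and one query.
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
best = top_k([2.0, 0.0], corpus, k=2)
print([idx for idx, _ in best])  # [0, 2]
```

Sorting by similarity in descending order is equivalent to the notebook's ascending sort on cosine distance.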


In [0]:
#Word2vec Encoding
#Search Question Preprocessing
Question_data_query = list(simple_process(query))
Question_data_filtered1_query = [rem_stop(question) for question in Question_data_query ]
word2vec_list_query = [unseen_words(question) for question in Question_data_filtered1_query ]
query_emb = list(sentence_emb(word2vec_list_query[0], word_vect))
#Removing empty values from data - empty values correspond to titles with no usable words
empty_list = {ind for ind, question in enumerate(word2vec_list) if question == []}
ques = [question for inds, question in enumerate(Question_data_filtered) if inds not in empty_list]
df_w2v = pd.DataFrame({'pos':ids_fil, 'en_ques':ques})
#Cosine Similarity
distances = scipy.spatial.distance.cdist([list(query_emb)], all_title_embeddings, "cosine")[0]
results = zip(range(len(distances)), distances)
results = sorted(results, key=lambda x: x[1])
#Top 5 similar questions
frames = [df_res]
for idx, distance in results[0:5]:
    #list(...) turns the matching positions into plain labels that isin() can compare
    index_val = list(df_w2v[df_w2v['en_ques'] == ques[idx]]['pos'])
    matches = Wh_df_1[Wh_df_1.index.isin(index_val)]
    frames.append(pd.DataFrame({'Id': matches['Id'], 'Title': matches['Title'], 'Score': 1-distance }))
    print(ques[idx].strip(), "(Score: %.4f)" % (1-distance))
#Dataframe holding the combined results of the word2vec & BERT findings
df_res = pd.concat(frames)
#Ranking the data based on the similarity score
df_res = df_res.reset_index(drop = True)   
df_res = df_res.sort_values(['Score'], ascending = False)
#If both models return the same question, the lower-scored duplicate is hidden in the displayed table
dupl_val = df_res['Id'].duplicated()
df_res[~dupl_val]


which version of python do i have installed (Score: 0.9019)
why python m venv myenv installs older version of pip into myenv than any version of pip i can find anywhere on the system (Score: 0.8874)
which version of pip to use with my python installs (Score: 0.8613)
where do i find the xml dom python package for the python and i have a suse x version of linux version (Score: 0.8544)
which version of pydev should i install with python (Score: 0.8405)


Unnamed: 0,Id,Title,Score
0,8917885,Which version of Python do I have installed?,0.943389
1,29458249,Which version of Pip to use with my Python ins...,0.931776
2,31678261,Where is a program that will let me execute Py...,0.925464
3,3049569,Where do I put utility functions in my Python ...,0.921216
4,17098004,which version of python is used when I run it ...,0.91991
6,29689514,Why 'python3 -m venv myenv' installs older ver...,0.88743
8,9027337,where do I find the xml.dom python package for...,0.854427
9,23300560,which version of pydev should I install with p...,0.840455
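
Combining the two models' results reduces to keeping the best score per question Id and then applying the relevance threshold. A small sketch of that logic; `merge_ranked` is a name introduced here, and the scores are rounded from the tables above:

```python
def merge_ranked(*result_lists):
    """Merge (question_id, score) lists, keeping the highest score per question."""
    best = {}
    for results in result_lists:
        for qid, score in results:
            if qid not in best or score > best[qid]:
                best[qid] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

# Two hits per model, scores rounded to four decimals for illustration.
bert_hits = [(8917885, 0.9434), (29458249, 0.9318)]
w2v_hits = [(8917885, 0.9019), (29689514, 0.8874)]

merged = merge_ranked(bert_hits, w2v_hits)
# Keep only questions that clear the 0.90 relevance threshold.
parent_ids = [qid for qid, score in merged if score >= 0.90]
print(parent_ids)  # [8917885, 29458249]
```

Deduplicating before thresholding guarantees that each parent question appears only once in the list passed to the answer-processing phase.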


In [0]:
#Fetching the parent question Ids whose score is at least 0.90
Parent_id = list(df_res[df_res['Score'] >= 0.90]['Id'])
Parent_id

[8917885, 29458249, 31678261, 3049569, 17098004, 8917885]

In [0]:
if len(Parent_id) == 0:
    print("Could you please add more information or be more specific? We could not find similar matches for your question.")

## Answer Processing

### Data Extraction 

In [0]:
import pandas as pd
ans_main_df = pd.read_csv("./Answers.csv", encoding='latin-1')

**DATA PREPROCESSING**

In [0]:
#Finding tags present in the body tag
'''
from bs4 import BeautifulSoup
def tags(ans_text):
    soup = BeautifulSoup(ans_text, "html.parser")
    tags = [tag.name for tag in soup.find_all()]
    return tags
tags_uniq = []
tags_uniq.extend(ans_main_df['Body'].apply(tags))   
import itertools
tags_u = list(itertools.chain.from_iterable(tags_uniq))
set(tags_u)
'''


In [0]:
#Parse the HTML body of each answer and remove the code, image and URL elements
#Extract the remaining text from the <p> and <pre> elements
from bs4 import BeautifulSoup
remove_tags = ['a','img','code']
req_tags = ['p','pre']
def code_img_url(text_data):
    soup = BeautifulSoup(text_data, "html.parser")
    #decompose() deletes each unwanted element together with its contents
    for tag in soup.find_all(remove_tags):
        tag.decompose()
    para = ' '.join([tex.text for tex in soup.find_all(req_tags) if tex.text != "" ])
    return para
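
The same filtering idea can be sketched with the standard library's `html.parser` instead of BeautifulSoup. This is an illustrative variant rather than the notebook's method: it keeps all text outside `<a>` and `<code>` elements rather than only the text of `<p>` and `<pre>`, and the class name, helper name, and sample HTML are assumptions.

```python
from html.parser import HTMLParser

class ProseExtractor(HTMLParser):
    """Collect text that is not nested inside <a> or <code> elements."""
    SKIP = {"a", "code"}  # <img> carries no text, so it needs no handling

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep non-blank text only while outside every skipped element.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def prose_only(html):
    parser = ProseExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

# Hypothetical answer body: a code block followed by a sentence containing a link.
sample = "<pre><code>python -V\n</code></pre><p><a href='#'>--version</a> may also work</p>"
text = prose_only(sample)
print(text)  # may also work
```

The example shows why many processed answers end up short: once code and links are removed, little natural-language text remains.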

In [0]:
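#Filter answers for the top similar questions
#.copy() makes equi_ans an independent frame, so adding a column does not raise SettingWithCopyWarning
equi_ans = ans_main_df[ans_main_df['ParentId'].isin(Parent_id)].copy()
equi_ans['Processed_body'] = equi_ans['Body'].apply(code_img_url) 
equi_ans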

Unnamed: 0,Id,OwnerUserId,CreationDate,ParentId,Score,Body,Processed_body
69358,3049575,34211.0,2010-06-15T22:36:48Z,3049569,4,<p>If you don't want to make it a member of th...,If you don't want to make it a member of the ...
69591,3058446,290340.0,2010-06-17T02:02:12Z,3049569,8,<p><strong>Make it a static function...</stron...,Make it a static function... Your definition w...
193393,8917907,1118101.0,2012-01-18T21:45:51Z,8917885,230,<pre><code>python -V\n</code></pre>\n\n<p><a h...,may also work (introduced in version 2.5)
193394,8917909,4714.0,2012-01-18T21:45:52Z,8917885,73,<p>Python 2.5+:</p>\n\n<pre><code>python --ver...,Python 2.5+: Python 2.4-:
193395,8917910,118.0,2012-01-18T21:45:54Z,8917885,22,<p>At a command prompt type:</p>\n\n<pre><code...,At a command prompt type:
193396,8917940,675637.0,2012-01-18T21:47:31Z,8917885,19,<p>When I open <code>Python (command line)</co...,When I open the first thing it tells me is th...
366142,17098298,1983854.0,2013-06-13T22:31:52Z,17098004,1,"<p>To know which version is used by default, t...","To know which version is used by default, type..."
366147,17098416,1598412.0,2013-06-13T22:41:43Z,17098004,0,<pre><code>python --version\n</code></pre>\n\n...,Then head on over to your .bashrc (should be i...
454252,20896732,3155933.0,2014-01-03T04:47:53Z,8917885,31,<p>in a Python IDE just copy and paste in the ...,in a Python IDE just copy and paste in the fol...
666378,29458415,707650.0,2015-04-05T14:37:45Z,29458249,2,"<ol>\n<li><p>Yes, it is safe. Python uses this...","Yes, it is safe. Python uses this naming like ..."


In [0]:
#Data Preprocessing
Ans_body = list(equi_ans['Processed_body'])
Ans_data = list(simple_process(Ans_body))

In [0]:
#Vectorizing
#TF-IDF embedding
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
#Answers that are empty after preprocessing are skipped
Ans_data_filtered = [ ' '.join(ans) for ans in Ans_data if ans != [] ]
vectorz = TfidfVectorizer()
tfidf_ans = vectorz.fit_transform(Ans_data_filtered)
#SVD decomposition to get dense vectors (TruncatedSVD defaults to n_components=2)
tsvd = TruncatedSVD()
transformed = tsvd.fit_transform(tfidf_ans)
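
To make the TF-IDF weighting concrete, here is a minimal pure-Python sketch. It assumes whitespace tokenisation, raw term counts, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2-normalised rows; `tfidf_vectors` and the two sample documents are introduced for illustration and do not reproduce every `TfidfVectorizer` option.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, rows) where each row is an L2-normalised TF-IDF vector."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(x * x for x in row))
        rows.append([x / norm for x in row] if norm else row)
    return vocab, rows

vocab, rows = tfidf_vectors(["python version", "python command prompt"])
print(vocab)  # ['command', 'prompt', 'python', 'version']
```

In the first row, `version` receives a larger weight than `python` because it occurs in only one of the two documents.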

**Hierarchical Clustering**

In [0]:
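#Index labels of the answers that still have text after preprocessing
ans_id = equi_ans[~(equi_ans['Processed_body'] == "")].index
import scipy.spatial.distance as distance
import scipy.cluster.hierarchy as hierarchy
#Pairwise cosine distances between the answer vectors
D_tf = distance.pdist(transformed,'cosine')
#Agglomerative tree; linkage() uses single linkage by default
L_tf= hierarchy.linkage(D_tf)
#Flat clusters cut from the tree with the inconsistency criterion at threshold 0.85
cls_tf = hierarchy.fcluster(L_tf,0.85,criterion = 'inconsistent')
df_cls_ans = pd.DataFrame({'pos':ans_id, 'Cluster_tf':cls_tf})
ans_final = pd.concat([equi_ans,df_cls_ans.set_index('pos')], axis = 1)
#Cluster results of tf-idf vectorized model
ans_final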

Unnamed: 0,Id,OwnerUserId,CreationDate,ParentId,Score,Body,Processed_body,Cluster_tf
69358,3049575,34211.0,2010-06-15T22:36:48Z,3049569,4,<p>If you don't want to make it a member of th...,If you don't want to make it a member of the ...,5.0
69591,3058446,290340.0,2010-06-17T02:02:12Z,3049569,8,<p><strong>Make it a static function...</stron...,Make it a static function... Your definition w...,4.0
193393,8917907,1118101.0,2012-01-18T21:45:51Z,8917885,230,<pre><code>python -V\n</code></pre>\n\n<p><a h...,may also work (introduced in version 2.5),2.0
193394,8917909,4714.0,2012-01-18T21:45:52Z,8917885,73,<p>Python 2.5+:</p>\n\n<pre><code>python --ver...,Python 2.5+: Python 2.4-:,1.0
193395,8917910,118.0,2012-01-18T21:45:54Z,8917885,22,<p>At a command prompt type:</p>\n\n<pre><code...,At a command prompt type:,5.0
193396,8917940,675637.0,2012-01-18T21:47:31Z,8917885,19,<p>When I open <code>Python (command line)</co...,When I open the first thing it tells me is th...,4.0
366142,17098298,1983854.0,2013-06-13T22:31:52Z,17098004,1,"<p>To know which version is used by default, t...","To know which version is used by default, type...",4.0
366147,17098416,1598412.0,2013-06-13T22:41:43Z,17098004,0,<pre><code>python --version\n</code></pre>\n\n...,Then head on over to your .bashrc (should be i...,4.0
454252,20896732,3155933.0,2014-01-03T04:47:53Z,8917885,31,<p>in a Python IDE just copy and paste in the ...,in a Python IDE just copy and paste in the fol...,3.0
666378,29458415,707650.0,2015-04-05T14:37:45Z,29458249,2,"<ol>\n<li><p>Yes, it is safe. Python uses this...","Yes, it is safe. Python uses this naming like ...",3.0
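
The grouping idea behind hierarchical clustering can be sketched in pure Python. Note the simplification: this sketch cuts a single-linkage clustering at a plain distance cutoff, not with the `'inconsistent'` criterion used in the notebook, and it assumes non-zero vectors. The function names, toy points, and cutoff of 0.2 are illustrative.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def single_linkage_clusters(vectors, cutoff):
    """Label points so that any chain of pairs closer than cutoff shares a cluster."""
    parent = list(range(len(vectors)))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine_distance(vectors[i], vectors[j]) <= cutoff:
                parent[find(i)] = find(j)

    # Number the clusters 1, 2, ... in order of first appearance.
    numbering = {}
    return [numbering.setdefault(find(i), len(numbering) + 1) for i in range(len(vectors))]

# Two pairs of nearly parallel toy vectors.
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = single_linkage_clusters(points, cutoff=0.2)
print(labels)  # [1, 1, 2, 2]
```

Merging every pair within the cutoff and taking connected groups gives the same partition as cutting a single-linkage tree at that distance.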


In [0]:
#Word2vec Embedding
#Data Preprocessing
word2vec_list_ans = [unseen_words(question) for question in Ans_data ]
all_ans_embeddings = []
for ans in word2vec_list_ans:
    if ans != []:
        list_val = list(sentence_emb(ans, word_vect))
        all_ans_embeddings.append(list_val)
#Clustering 
D1_ans = distance.pdist(all_ans_embeddings,'cosine')
L1_ans= hierarchy.linkage(D1_ans)
cls1_ans = hierarchy.fcluster(L1_ans,0.85,criterion = 'inconsistent')
df_cls_w2v_ans = pd.DataFrame({'pos':ans_id, 'Cluster_w2v':cls1_ans})
ans_final = pd.concat([ans_final,df_cls_w2v_ans.set_index('pos')], axis = 1)
#Cluster results of word2vec vectorized model
ans_final

Unnamed: 0,Id,OwnerUserId,CreationDate,ParentId,Score,Body,Processed_body,Cluster_tf,Cluster_w2v
69358,3049575,34211.0,2010-06-15T22:36:48Z,3049569,4,<p>If you don't want to make it a member of th...,If you don't want to make it a member of the ...,5.0,3.0
69591,3058446,290340.0,2010-06-17T02:02:12Z,3049569,8,<p><strong>Make it a static function...</stron...,Make it a static function... Your definition w...,4.0,3.0
193393,8917907,1118101.0,2012-01-18T21:45:51Z,8917885,230,<pre><code>python -V\n</code></pre>\n\n<p><a h...,may also work (introduced in version 2.5),2.0,3.0
193394,8917909,4714.0,2012-01-18T21:45:52Z,8917885,73,<p>Python 2.5+:</p>\n\n<pre><code>python --ver...,Python 2.5+: Python 2.4-:,1.0,1.0
193395,8917910,118.0,2012-01-18T21:45:54Z,8917885,22,<p>At a command prompt type:</p>\n\n<pre><code...,At a command prompt type:,5.0,2.0
193396,8917940,675637.0,2012-01-18T21:47:31Z,8917885,19,<p>When I open <code>Python (command line)</co...,When I open the first thing it tells me is th...,4.0,4.0
366142,17098298,1983854.0,2013-06-13T22:31:52Z,17098004,1,"<p>To know which version is used by default, t...","To know which version is used by default, type...",4.0,3.0
366147,17098416,1598412.0,2013-06-13T22:41:43Z,17098004,0,<pre><code>python --version\n</code></pre>\n\n...,Then head on over to your .bashrc (should be i...,4.0,3.0
454252,20896732,3155933.0,2014-01-03T04:47:53Z,8917885,31,<p>in a Python IDE just copy and paste in the ...,in a Python IDE just copy and paste in the fol...,3.0,3.0
666378,29458415,707650.0,2015-04-05T14:37:45Z,29458249,2,"<ol>\n<li><p>Yes, it is safe. Python uses this...","Yes, it is safe. Python uses this naming like ...",3.0,3.0


In [0]:
#Bert Embedding 
#Encoding the data
Ans_embeddings = embedder.encode(Ans_data_filtered)
D_bert_ans = distance.pdist(Ans_embeddings,'cosine')
L_bert_ans= hierarchy.linkage(D_bert_ans)
cls_bert_ans = hierarchy.fcluster(L_bert_ans,0.85,criterion = 'inconsistent')
df_cls_bert_Ans = pd.DataFrame({'pos':ans_id, 'Cluster_bert':cls_bert_ans})
ans_final = pd.concat([ans_final,df_cls_bert_Ans.set_index('pos')], axis = 1)
#Cluster results 
ans_final

Unnamed: 0,Id,OwnerUserId,CreationDate,ParentId,Score,Body,Processed_body,Cluster_tf,Cluster_w2v,Cluster_bert
69358,3049575,34211.0,2010-06-15T22:36:48Z,3049569,4,<p>If you don't want to make it a member of th...,If you don't want to make it a member of the ...,5.0,3.0,11.0
69591,3058446,290340.0,2010-06-17T02:02:12Z,3049569,8,<p><strong>Make it a static function...</stron...,Make it a static function... Your definition w...,4.0,3.0,4.0
193393,8917907,1118101.0,2012-01-18T21:45:51Z,8917885,230,<pre><code>python -V\n</code></pre>\n\n<p><a h...,may also work (introduced in version 2.5),2.0,3.0,11.0
193394,8917909,4714.0,2012-01-18T21:45:52Z,8917885,73,<p>Python 2.5+:</p>\n\n<pre><code>python --ver...,Python 2.5+: Python 2.4-:,1.0,1.0,2.0
193395,8917910,118.0,2012-01-18T21:45:54Z,8917885,22,<p>At a command prompt type:</p>\n\n<pre><code...,At a command prompt type:,5.0,2.0,1.0
193396,8917940,675637.0,2012-01-18T21:47:31Z,8917885,19,<p>When I open <code>Python (command line)</co...,When I open the first thing it tells me is th...,4.0,4.0,1.0
366142,17098298,1983854.0,2013-06-13T22:31:52Z,17098004,1,"<p>To know which version is used by default, t...","To know which version is used by default, type...",4.0,3.0,8.0
366147,17098416,1598412.0,2013-06-13T22:41:43Z,17098004,0,<pre><code>python --version\n</code></pre>\n\n...,Then head on over to your .bashrc (should be i...,4.0,3.0,10.0
454252,20896732,3155933.0,2014-01-03T04:47:53Z,8917885,31,<p>in a Python IDE just copy and paste in the ...,in a Python IDE just copy and paste in the fol...,3.0,3.0,3.0
666378,29458415,707650.0,2015-04-05T14:37:45Z,29458249,2,"<ol>\n<li><p>Yes, it is safe. Python uses this...","Yes, it is safe. Python uses this naming like ...",3.0,3.0,4.0


The clusters were evaluated manually for various search questions; the BERT-embedded model produced the best results.

In [0]:
ans_final = ans_final[['Id', 'OwnerUserId', 'CreationDate', 'ParentId', 'Score', 'Body',
       'Processed_body','Cluster_bert']]

## Question and Answer Relevance Mapping

In [0]:
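#Cosine distance between query and answer corpus
#Note: cdist returns cosine distance (1 - cosine similarity), so the Similarity_Score column holds distances as returned; 1 - distances_ans would give similarities as in the question-matching step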
df_ans = pd.DataFrame()
import scipy.spatial
distances_ans = scipy.spatial.distance.cdist([list(query_embedding[0])], Ans_embeddings, "cosine")[0]
df_cls_bert_Ans_ques = pd.DataFrame({'pos':ans_id, 'Similarity_Score':distances_ans})
ans_final = pd.concat([ans_final,df_cls_bert_Ans_ques.set_index('pos')], axis = 1)
ans_final = ans_final.dropna()
ans_final

Unnamed: 0,Id,OwnerUserId,CreationDate,ParentId,Score,Body,Processed_body,Cluster_bert,Similarity_Score
69358,3049575,34211.0,2010-06-15T22:36:48Z,3049569,4,<p>If you don't want to make it a member of th...,If you don't want to make it a member of the ...,11.0,0.577834
69591,3058446,290340.0,2010-06-17T02:02:12Z,3049569,8,<p><strong>Make it a static function...</stron...,Make it a static function... Your definition w...,4.0,0.327658
193393,8917907,1118101.0,2012-01-18T21:45:51Z,8917885,230,<pre><code>python -V\n</code></pre>\n\n<p><a h...,may also work (introduced in version 2.5),11.0,0.670012
193394,8917909,4714.0,2012-01-18T21:45:52Z,8917885,73,<p>Python 2.5+:</p>\n\n<pre><code>python --ver...,Python 2.5+: Python 2.4-:,2.0,0.290339
193395,8917910,118.0,2012-01-18T21:45:54Z,8917885,22,<p>At a command prompt type:</p>\n\n<pre><code...,At a command prompt type:,1.0,0.640489
193396,8917940,675637.0,2012-01-18T21:47:31Z,8917885,19,<p>When I open <code>Python (command line)</co...,When I open the first thing it tells me is th...,1.0,0.667392
366142,17098298,1983854.0,2013-06-13T22:31:52Z,17098004,1,"<p>To know which version is used by default, t...","To know which version is used by default, type...",8.0,0.416672
366147,17098416,1598412.0,2013-06-13T22:41:43Z,17098004,0,<pre><code>python --version\n</code></pre>\n\n...,Then head on over to your .bashrc (should be i...,10.0,0.490607
454252,20896732,3155933.0,2014-01-03T04:47:53Z,8917885,31,<p>in a Python IDE just copy and paste in the ...,in a Python IDE just copy and paste in the fol...,3.0,0.1703
666378,29458415,707650.0,2015-04-05T14:37:45Z,29458249,2,"<ol>\n<li><p>Yes, it is safe. Python uses this...","Yes, it is safe. Python uses this naming like ...",4.0,0.240963


In [0]:
#Aggregating the scores to get combined results for each of the cluster values
agg_val = {}
agg_val['Score'] = 'sum'
agg_val['Cluster_bert'] = 'first' 
agg_val['Similarity_Score'] = 'mean'
Clus_score = ans_final[['Score','Cluster_bert','Similarity_Score']].groupby('Cluster_bert').agg(agg_val)
Clus_score = Clus_score.reset_index(drop = True)
print(Clus_score)

    Score  Cluster_bert  Similarity_Score
0      45           1.0          0.697442
1      73           2.0          0.275012
2      34           3.0          0.176688
3      11           4.0          0.281017
4       1           5.0          0.267977
5       0           6.0          0.313341
6      11           7.0          0.366153
7       1           8.0          0.416672
8       6           9.0          0.318808
9       0          10.0          0.490607
10    234          11.0          0.623923


In [0]:
#Ranking methodology - score * similarity score
#Most answers consist largely of code chunks, which leaves little natural-language text from which to judge relevance. 
#To overcome this issue, we made use of the Score feature available in Answers.csv. 
#The Score value is the user-assigned score for each answer to the question. 
#We combined the similarity score with this Score value to get an overall score for each cluster. 
#The cluster with the maximum overall score is ranked as the top answer output. 
Ranking_score_bert = Clus_score['Score'] * Clus_score['Similarity_Score']
Clus_score['Ranking_score_bert'] = Ranking_score_bert
Clus_score_bert = Clus_score.sort_values(['Ranking_score_bert'], ascending = False)
id_ans_bert = list(ans_final[ans_final['Cluster_bert'] == Clus_score_bert['Cluster_bert'].iloc[0]]['Id'])
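
The ranking rule can be sketched on plain dictionaries. The assumptions here are that `similarity` is a true cosine similarity (1 - cosine distance) and that the votes, clusters, and values are made-up illustrative data; `rank_clusters` is a name introduced for this sketch.

```python
from collections import defaultdict

def rank_clusters(answers):
    """Score each cluster as (sum of votes) * (mean similarity), best first."""
    votes = defaultdict(int)
    sims = defaultdict(list)
    for ans in answers:
        votes[ans["cluster"]] += ans["votes"]
        sims[ans["cluster"]].append(ans["similarity"])
    ranking = {c: votes[c] * (sum(sims[c]) / len(sims[c])) for c in votes}
    return sorted(ranking.items(), key=lambda item: item[1], reverse=True)

# Hypothetical answers: cluster label, vote score, similarity to the query.
answers = [
    {"cluster": 1, "votes": 22, "similarity": 0.40},
    {"cluster": 1, "votes": 18, "similarity": 0.30},
    {"cluster": 2, "votes": 200, "similarity": 0.35},
    {"cluster": 3, "votes": 5, "similarity": 0.90},
]
ranked = rank_clusters(answers)
print([cluster for cluster, _ in ranked])  # [2, 1, 3]
```

Multiplying by the vote total lets a heavily upvoted answer outrank a textually closer one, which is the stated intent when most of an answer's content is code.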

In [0]:
#Top Ranked Cluster results
Top_answers_bert = ans_main_df[ans_main_df['Id'].isin(id_ans_bert) ]
Top_answers_bert = Top_answers_bert.sort_values(['Score'], ascending = False)
Top_answers_bert

Unnamed: 0,Id,OwnerUserId,CreationDate,ParentId,Score,Body
193393,8917907,1118101.0,2012-01-18T21:45:51Z,8917885,230,<pre><code>python -V\n</code></pre>\n\n<p><a h...
69358,3049575,34211.0,2010-06-15T22:36:48Z,3049569,4,<p>If you don't want to make it a member of th...


In [0]:
#Expanded version of Top Answers
for text in Top_answers_bert['Body']:
    soup = BeautifulSoup(text, "html.parser")
    para = ' '.join([tex.text for tex in soup.find_all()])
    print(para)

python -V
 python -V
 http://docs.python.org/using/cmdline.html#generic-options http://docs.python.org/using/cmdline.html#generic-options --version may also work (introduced in version 2.5) --version
If you don't want to make it a member of the Table class you could put it into a utilities module. Table utilities


**Future Scope:**

- Extend the research to question types other than WH-questions.
- Build models that represent the entire Stack Overflow corpus, with results specific to the various programming domains.
- Build an interactive Q&A search engine and a related API for wider use.
