One of the biggest complaints from Stack Overflow users is the number of duplicate questions, a problem that lowers the quality of the site. Using NLP techniques, we want to find out whether duplicate questions really are a problem.

In [1]:
import pandas as pd 

In [2]:
# load the data
questions_df = pd.read_csv('data/Questions_sample.csv', encoding="ISO-8859-1", usecols=['Id', 'Score', 'Title', 'Body'])

In [3]:
questions_df.head(10)


Unnamed: 0,Id,Score,Title,Body
0,37914700,1,How to assign variable inside of the string in...,<p>I wanna store sql query in XML and execute ...
1,26870250,1,Server not receiving in socket communication,<p>I'm trying to make a Java program in which ...
2,20454920,0,HTML tag input text line-height not working,<p>I have this :</p>\n\n<pre><code>&lt;form me...
3,6731940,0,Message Sent to Deallocated Instance,<p>I'm using TouchXML to parse an element in i...
4,28003210,3,Need SQL help ranking users by combining data ...,<p>users</p>\n\n<pre><code> id | first_name...
5,20542020,0,popen Hangs and cause CPU usage to 100,<p>I have a code that use popen to execute a s...
6,31910720,0,Is it possible to set default ttl for all keys...,"<p>I've read redis config <a href=""https://raw..."
7,10281700,0,Validating that each project imports our Commo...,"<p>I run a fairly-large team (5 solutions, ~15..."
8,27214890,1,Search for two specific elements in multidimen...,<p>consider the following vector</p>\n\n<pre><...
9,9420130,1,Loading a particular frame in Delphi 6 causes ...,<p>I have a frame that never had any problems ...


In [4]:
questions_df.tail(10)


Unnamed: 0,Id,Score,Title,Body
19990,8894560,2,mySql to get last records of duplicate entries,<blockquote>\n <p><strong>Possible Duplicate:...
19991,24492600,2,Search in numpy object arrays,<p>I've got a <strong>Numpy Object Array</stro...
19992,30974440,1,Rarely occurring ConcurrentModificationExcepti...,<p>Sometimes i'm experiencing the following ex...
19993,35253200,0,Typescript move code to multiple files,<p>I have code like that in my <code>app.ts</c...
19994,1303560,0,Jquery Fade out Transparency In IE,<p>I'm working on a site and I have run into a...
19995,16499640,0,GAE datastore put and get,<p>Hi I'm just playing around with the datasto...
19996,12738910,0,Rotating Image script not keeping preset classes,<p>Im using a script to rotate through various...
19997,18043500,1,Why does time difference measurement with `std...,<p>I measured several time differences by usin...
19998,30372760,0,How to change background of non-visible button...,<p>My code works only for visible views. </p>\...
19999,2757040,0,"RPC for java/python with rest support, HTML mo...",<p>Here's my set of requirements: I'm looking ...


In [5]:
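# clean the data
import re
import string
import contractions

def normalize_text(s):
    # Lowercase so that matching is case-insensitive
    s = s.lower()
    return s

def remove_html_tags(text):
    # Drop code samples entirely (<pre> and <code> blocks), then strip any remaining tags
    text = re.sub('<pre>.*?</pre>', '', text, flags=re.DOTALL)
    text = re.sub('<code>.*?</code>', '', text, flags=re.DOTALL)
    text = re.sub('<[^>]+>', '', text, flags=re.DOTALL)
    # Newlines are removed without inserting a space, so text on adjacent lines is joined
    return text.replace("\n", "")

def remove_punctuation(text):
    # Delete every character listed in string.punctuation
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

def remove_contractions(text):
    # Expand contractions, e.g. "I'm" -> "I am"
    return contractions.fix(text)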

    

In [6]:
questions_df['Body'] = questions_df['Body'].apply(remove_html_tags)
questions_df['Body'] = questions_df['Body'].apply(remove_contractions)
questions_df['Body'] = questions_df['Body'].apply(normalize_text)
questions_df['Title'] = questions_df['Title'].apply(remove_html_tags)
questions_df['Title'] = questions_df['Title'].apply(remove_contractions)
questions_df['Title'] = questions_df['Title'].apply(normalize_text)



With the data cleaned, we can start analyzing the similarity between questions.

In [7]:
%pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In this function, we compute the similarity between questions. To do so, we use scikit-learn, whose cosine_similarity function computes the cosine similarity between vectors. To obtain those vectors, we use the TfidfVectorizer class, which turns a list of documents into a matrix of TF-IDF (term frequency-inverse document frequency) weights.

We create a new column called SimilarTo, which holds a dict mapping the Id of each similar question to its similarity score. If no similar question is found, SimilarTo is an empty dict.
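
As a rough illustration of the TF-IDF plus cosine similarity idea, here is a minimal sketch on three made-up titles. The sentences are invented for the example and are not taken from the dataset; only TfidfVectorizer and cosine_similarity come from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three made-up titles (illustrative only, not from the dataset)
docs = [
    "how to reverse a list in python",
    "python reverse list order",
    "center a div with css",
]

# Each document becomes one row of a sparse TF-IDF matrix
tfidf = TfidfVectorizer(stop_words='english').fit_transform(docs)

# sim[i, j] is the cosine similarity between documents i and j
sim = cosine_similarity(tfidf)

print(sim.shape)    # one row and one column per document
print(sim[0, 1])    # the two "reverse a list" titles share most terms
print(sim[0, 2])    # no shared terms after stop-word removal
```

Documents that share many weighted terms score close to 1, while documents with no terms in common score 0, which is why a threshold on this value can act as a duplicate detector.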

In [8]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def find_similar_questions(df, threshold):
    # Create a vectorizer to convert the text data into numerical vectors
    vectorizer = TfidfVectorizer(stop_words='english')
    
    # Convert the text data into vectors
    vectors = vectorizer.fit_transform(df['Title'] + ' ' + df['Body'])
    
    # Compute the pairwise cosine similarity matrix between all questions
    cosine_sim = cosine_similarity(vectors)
    
    # Initialize an empty list to store the similar questions for each question
    similar_to_values = []
    
    # Iterate over each question
    for i, question in enumerate(df.itertuples()):
        # i is the row position (not the index label), matching cosine_sim rows and df.iloc
        question_id = question.Id
        
        # Get the similarity scores of the current question with all other questions
        similarity_scores = list(enumerate(cosine_sim[i]))
        
        # Remove the similarity score of the current question with itself
        similarity_scores = [(j, score) for j, score in similarity_scores if j != i]
        
        # Sort the similarity scores in descending order
        similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
        
        # Filter the similarity scores by the threshold
        similar_scores = [(df.iloc[j]['Id'], score) for j, score in similarity_scores if score >= threshold]
        
        # If the current question has similar questions above the threshold
        if similar_scores:
            # Add the similar questions to the list
            similar_questions = {j: score for j, score in similar_scores}
            similar_to_values.append(similar_questions)
        else:
            # If the current question has no similar questions above the threshold, add an empty dictionary to the list
            similar_to_values.append({})
    
    # Add the similar questions list as a new column to the DataFrame
    df['SimilarTo'] = similar_to_values
    
    return df

In [9]:
# (takes more than 2 minutes)
similar_questions_df = find_similar_questions(questions_df, 0.7)



In [10]:
similar_questions_df.head(10)

Unnamed: 0,Id,Score,Title,Body,SimilarTo
0,37914700,1,how to assign variable inside of the string in...,i want to store sql query in xml and execute i...,{}
1,26870250,1,server not receiving in socket communication,i am trying to make a java program in which th...,{}
2,20454920,0,html tag input text line-height not working,i have this :very simple.the text is in the md...,{}
3,6731940,0,message sent to deallocated instance,i am using touchxml to parse an element in ios...,{}
4,28003210,3,need sql help ranking users by combining data ...,usersservice1service2here is what my result wo...,{}
5,20542020,0,popen hangs and because cpu usage to 100,i have a code that use popen to execute a scri...,{}
6,31910720,0,is it possible to set default ttl for all keys...,i have read redis config document but cannot f...,{}
7,10281700,0,validating that each project imports our commo...,"i run a fairly-large team (5 solutions, ~150 p...",{}
8,27214890,1,search for two specific elements in multidimen...,consider the following vectorwhich is a three ...,{}
9,9420130,1,loading a particular frame in delphi 6 causes ...,i have a frame that never had any problems bef...,{}


In [11]:

# keep only the questions with at least one match (non-empty dict)
filtered_similar_questions_df = similar_questions_df.loc[similar_questions_df['SimilarTo'].map(bool)]
filtered_similar_questions_df.head(15)


Unnamed: 0,Id,Score,Title,Body,SimilarTo
61,9154270,1,"vim move line up,down,left,right","in netbeans with ctrl+left, ctrl+up, ctrl+righ...",{23025930: 0.7233997503802955}
222,22959390,0,i have a questions regarding matlab matrix,suppose i have this matrix:i want this matrix ...,"{10901520: 0.7192533941946715, 17597310: 0.710..."
1110,30419690,3,cutting posixlt types by hour in r,suppose i have the following datawhat i am try...,"{28018320: 0.7409016313750877, 30072410: 0.730..."
1171,18515230,7,javascript map in leaflet how to refresh,i have a basic geojson program in javascript b...,{27646070: 0.7168251160581205}
1559,9396590,9,is md5 decryption possible?,possible duplicate: is it possible to decry...,{8910050: 0.7004291615260755}
1732,14834780,16,relation between akka and scala.actors in 2.10,"the scala 2.10 release notes says this: ""akka...","{9714150: 0.9562011319016563, 21018500: 0.8471..."
1830,19772820,0,error handling when reading from buffer,i have a tablet sending a 5 byte message to a ...,{30072410: 0.7260943092325658}
2089,24158810,0,kohana 3.1 create imagejpeg or imagepng captcha,i am trying to put a simple captcha in kohana ...,{10632520: 0.8351285685110357}
2414,16050620,0,call external page & load dynamic content into...,"well, i understand there is a way to change co...",{16007330: 0.7697189645496809}
2966,9533830,0,sinatra rack middleware hijacks '/' root url,i am trying to use a sinatra app as middleware...,{39895110: 0.7125851373919012}


In [12]:
filtered_similar_questions_df.tail(15)

Unnamed: 0,Id,Score,Title,Body,SimilarTo
13041,28018320,0,join in codeigniter returns multiple values,i have a problem joining two tables. in my fir...,{30419690: 0.7409016313750877}
13060,17597310,0,octave/matlab: create new matrix based on exis...,"in octave/matlab, say i have:how would i make ...",{22959390: 0.7104194925966363}
13095,10901520,0,how to loading two parts of the same file to t...,i have one file (file.csv) filled with integer...,{22959390: 0.7192533941946715}
13781,23025930,-1,windows how to use the right ctrl key as the l...,i am wondering what should i do to let the win...,{9154270: 0.7233997503802955}
14669,1143130,1,dropdownlist items in list,can anyone tell me shortest way to add all ite...,{30772210: 0.7298205377060533}
14946,30583960,0,"best practices for rebase, squashing and incor...",i am new to rebasing. i have already pushed s...,{35104610: 0.7054175562614433}
15750,16007330,0,ajax load div of one page called from another ...,i am facing with some glitches regarding the a...,{16050620: 0.7697189645496809}
16670,21018500,2,"facebook video upload exception ""(oauthexcepti...",my wpf application i am using facebook c# sdk ...,"{9714150: 0.8787714108752057, 14834780: 0.8471..."
17363,30072410,-1,timestamp splitting incorrect - javascript,i have a timestamp loaded from mssql into a we...,"{30419690: 0.7304779143096128, 19772820: 0.726..."
17589,16920880,3,restkit .20.1 nested array with relationship m...,i am having trouble with restkit and mapping t...,"{9714150: 0.8166481829329381, 14834780: 0.7877..."


In [13]:
# .size counts cells (rows x columns): 205 cells / 5 columns = 41 questions
filtered_similar_questions_df.size

205

In [16]:
# percentage of questions that have at least one similar question
len(filtered_similar_questions_df) / len(questions_df) * 100

0.20500000000000002

After testing our function on different samples, we found that the percentage of similar questions relative to the total is very low (about 0.2% at a threshold of 0.7 in the run above). This suggests either that duplicate questions are not as big a problem as we imagined, or that reducing the dataset to less than 10% of the original influenced the result: a duplicate is only detected when both questions of the pair end up in the sample.

Another important point is that the function ignores the similarity between answers. A question can be flagged as a duplicate even when its answers differ, which is a problem for a user looking for a different answer to their question. Taking answers into account would also hurt performance, since we would have to compute the similarity between all answers to the similar questions, increasing running time and resource usage. Even with less than 10% of the original dataset, the time and resources needed to compute the similarity are already very high: using more than 25,000 questions exceeds 16 GB of RAM.
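
The memory pressure described above can be illustrated with a back-of-the-envelope sketch. A dense n x n float64 matrix needs n² x 8 bytes, so roughly 4.7 GiB for 25,000 questions, before counting any intermediate copies or the Python lists built inside the loop. One possible workaround, shown here as an assumption and not as what this notebook does, is to compute the similarities one block of rows at a time and keep only the pairs above the threshold. The names dense_matrix_gib, similar_pairs_chunked and chunk_size are hypothetical, and the three documents are made up.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def dense_matrix_gib(n_questions, bytes_per_value=8):
    """Size in GiB of a dense n x n similarity matrix (float64 by default)."""
    return n_questions ** 2 * bytes_per_value / 2 ** 30


def similar_pairs_chunked(vectors, threshold, chunk_size=1000):
    """Yield (i, j, score) for i != j with score >= threshold, one row block at a time."""
    n = vectors.shape[0]
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Only a (chunk_size x n) block of similarities is held in memory at once
        block = cosine_similarity(vectors[start:stop], vectors)
        rows, cols = np.nonzero(block >= threshold)
        for r, c in zip(rows, cols):
            i = start + r
            if i != c:  # skip the similarity of a question with itself
                yield int(i), int(c), float(block[r, c])


print(round(dense_matrix_gib(25_000), 2))  # GiB for the full dense matrix

# Made-up documents, only to exercise the function
docs = ["reverse a list in python", "python reverse list order", "center a div with css"]
vectors = TfidfVectorizer(stop_words='english').fit_transform(docs)
pairs = list(similar_pairs_chunked(vectors, threshold=0.7, chunk_size=2))
print(pairs)
```

This trades a single large allocation for many small ones. The total number of comparisons is unchanged, so it addresses memory but not running time.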

From this study, we draw the following conclusions:

Stack Overflow is known for its very strict moderation, which may be one reason for the low percentage of duplicate questions, since many duplicates are closed by moderators or even deleted by the user who posted them.

Computing the similarity between all pairs of questions is not efficient, because the running time is very high. One possible solution is to use the questions' tags as a filter and then compute the similarity only between questions that share similar tags.

Filtering on question score or closing date could also be used, since duplicate questions tend to have a lower score and to be closed more quickly. However, the original question may have a high score while the duplicate has a low one, which is a problem for this approach: the original question would not make it into the filtered dataset.
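
The tag-based filtering suggested in the conclusions could look roughly like the sketch below. It assumes a hypothetical Tags column holding a list of tag strings per question, which the sample loaded in this notebook does not contain, and the function name similar_within_tags and the toy rows are invented for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def similar_within_tags(df, threshold):
    """Compare each question only with questions that share at least one tag.

    Assumes df has 'Id', 'Title', 'Body' and a hypothetical 'Tags' column
    containing a list of tag strings for each question.
    """
    # Fit once on the whole corpus so IDF weights are shared across tag groups
    vectors = TfidfVectorizer(stop_words='english').fit_transform(df['Title'] + ' ' + df['Body'])
    ids = df['Id'].tolist()
    similar = {question_id: {} for question_id in ids}

    # One row per (question, tag); 'pos' is the row position inside `vectors`
    exploded = df.assign(pos=range(len(df))).explode('Tags')
    for _, group in exploded.groupby('Tags'):
        pos = group['pos'].unique()
        if len(pos) < 2:
            continue  # nothing to compare inside a single-question group
        sim = cosine_similarity(vectors[pos])
        for a in range(len(pos)):
            for b in range(len(pos)):
                if a != b and sim[a, b] >= threshold:
                    similar[ids[pos[a]]][ids[pos[b]]] = float(sim[a, b])
    return similar


# Toy data: questions 1 and 3 have identical text but share no tag
toy = pd.DataFrame({
    'Id': [1, 2, 3],
    'Title': ['reverse a list', 'reverse list order', 'reverse a list'],
    'Body': ['in python', 'python', 'in python'],
    'Tags': [['python'], ['python', 'list'], ['arrays']],
})
result = similar_within_tags(toy, threshold=0.6)
print(result)
```

In the toy data, questions 1 and 2 are matched through their shared tag, while question 3 is never compared with question 1 despite having the same text. That is the trade-off of this kind of filtering: fewer comparisons, at the cost of missing duplicates that were tagged differently.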