One of the most common complaints from Stack Overflow users is the number of duplicate questions, a problem that lowers the quality of the site. Using NLP techniques, we want to understand whether duplicate questions really are a problem.

In [1]:
import pandas as pd 

In [2]:
# load the data
questions_df = pd.read_csv('data/Questions.csv', encoding="ISO-8859-1", usecols=['Id', 'Score', 'Title', 'Body'], nrows=20000)

In [3]:
questions_df.head(10)


Unnamed: 0,Id,Score,Title,Body
0,80,26,SQLStatement.execute() - multiple queries in o...,<p>I've written a database generation script i...
1,90,144,Good branching and merging tutorials for Torto...,<p>Are there any really good tutorials explain...
2,120,21,ASP.NET Site Maps,<p>Has anyone got experience creating <strong>...
3,180,53,Function for creating color wheels,<p>This is something I've pseudo-solved many t...
4,260,49,Adding scripting functionality to .NET applica...,<p>I have a little game written in C#. It uses...
5,330,29,Should I use nested classes in this case?,<p>I am working on a collection of classes use...
6,470,13,Homegrown consumption of web services,<p>I've been writing a few web services for a ...
7,580,21,Deploying SQL Server Databases from Test to Live,<p>I wonder how you guys manage deployment of ...
8,650,79,Automatically update version number,<p>I would like the version property of my app...
9,810,9,Visual Studio Setup Project - Per User Registr...,<p>I'm trying to maintain a Setup Project in <...


In [4]:
questions_df.tail(10)


Unnamed: 0,Id,Score,Title,Body
19990,1114270,8,Android Screen Timeout,<p>I know its possible to use a wakelock to h...
19991,1114310,4,When should I break into GUI/game development?,<p>I am a hobbyist console C++ developer. I ha...
19992,1114340,0,Add values of a function to listbox when click...,<p>I have two checkboxes and one listbox. I as...
19993,1114400,3,Using generics for arrays,<p>Is it possible to use generics for arrays?<...
19994,1114420,0,"Iphone, objective-c how to make a Jump method ...",<p>I have this IBAction that is suppose to mak...
19995,1114470,0,"Trim all chars off file name after first ""_""",<p>I'd like to trim these purchase order file ...
19996,1114540,7,Xcode question: Quickly jump to a particular s...,<p>What is the quickest way to jump to a parti...
19997,1114550,3,Serializing a generic collection with XMLSeria...,<p>Why won't XMLSerializer process my generic ...
19998,1114580,1,Using Yahoo Fire Eagle on Grails / Java,<p>Has anyone implemented the Yahoo Fire Eagle...
19999,1114600,1,How to share code & xib files between iPhone a...,<p>I'm in the process of creating an app. I'd...


In [5]:
# clean the data
import re
import string
import contractions

def normalize_text(s):
    """Lowercase the text."""
    return s.lower()

def remove_html_tags(text):
    """Drop <pre>/<code> blocks, strip the remaining HTML tags and remove newlines."""
    text = re.sub(r'<pre>.*?</pre>', '', text, flags=re.DOTALL)
    text = re.sub(r'<code>.*?</code>', '', text, flags=re.DOTALL)
    text = re.sub(r'<[^>]+>', '', text)
    return text.replace("\n", "")

def remove_punctuation(text):
    """Remove every ASCII punctuation character (defined here but not applied below)."""
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

def remove_contractions(text):
    """Expand contractions such as "I've" into "I have"."""
    return contractions.fix(text)

    

In [6]:
questions_df['Body'] = questions_df['Body'].apply(remove_html_tags)
questions_df['Body'] = questions_df['Body'].apply(remove_contractions)
questions_df['Body'] = questions_df['Body'].apply(normalize_text)
questions_df['Title'] = questions_df['Title'].apply(remove_html_tags)
questions_df['Title'] = questions_df['Title'].apply(remove_contractions)
questions_df['Title'] = questions_df['Title'].apply(normalize_text)
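
One side effect of remove_html_tags is that tags and newlines are replaced with an empty string, so the last word of one paragraph is glued to the first word of the next. A minimal sketch of a variant that inserts a space instead is shown below; strip_html_keep_spaces and the sample string are hypothetical and are not used anywhere else in this notebook.

```python
import re

def strip_html_keep_spaces(text):
    """Hypothetical variant of remove_html_tags that keeps word boundaries."""
    # Drop code samples entirely, as the original function does
    text = re.sub(r'<pre>.*?</pre>', ' ', text, flags=re.DOTALL)
    text = re.sub(r'<code>.*?</code>', ' ', text, flags=re.DOTALL)
    # Replace the remaining tags with a space so adjacent paragraphs stay separated
    text = re.sub(r'<[^>]+>', ' ', text)
    # Collapse runs of whitespace (including newlines) into single spaces
    return re.sub(r'\s+', ' ', text).strip()

# Made-up sample, only for illustration
sample = "<p>First sentence.</p>\n<p>Second sentence.</p>"
print(strip_html_keep_spaces(sample))
```

Switching to a variant like this would change the tokens seen by the vectorizer, so the similarity scores shown below would not be reproduced exactly.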



With the data cleaned, we can start analyzing the similarity between the questions.

In [7]:
%pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In this function, we compute the similarity between the questions. For that we use scikit-learn's cosine_similarity function, which computes the cosine similarity between vectors (here, between every pair of rows of a matrix). To obtain those vectors we use the TfidfVectorizer class, which turns a list of documents into a matrix of TF-IDF (term frequency-inverse document frequency) weights.

We create a new column called SimilarTo, which holds a dict mapping the Id of each similar question to its similarity score. If no question reaches the threshold, SimilarTo is an empty dict.
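
To make the two ingredients concrete, here is a hand-rolled sketch using only the standard library. It is not what scikit-learn runs internally: tokenization is a plain whitespace split, there is no stop-word removal, and the idf formula is the smoothed form that scikit-learn documents for its default settings, assumed here for illustration. The helper names tfidf_vectors and cosine and the three toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each document to a {term: tf-idf weight} dict (whitespace tokenization)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Number of documents in which each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Assumed smoothed idf: ln((1 + n) / (1 + df)) + 1
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy documents, only for illustration
docs = ["lock file perl", "best way lock file perl", "creating color wheels"]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # shares three terms: strictly between 0 and 1
print(cosine(vecs[0], vecs[2]))  # shares no terms: 0.0
```

Documents that share rare terms end up with a cosine close to 1, and documents with no terms in common get exactly 0, which is why a fixed threshold can be used to flag likely duplicates.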

In [8]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def find_similar_questions(df, threshold):
    # Create a vectorizer to convert the text data into numerical vectors
    vectorizer = TfidfVectorizer(stop_words='english')
    
    # Convert the text data into vectors
    vectors = vectorizer.fit_transform(df['Title'] + ' ' + df['Body'])
    
    # Compute the pairwise cosine similarity matrix between all questions
    cosine_sim = cosine_similarity(vectors)
    
    # Initialize an empty list to store the similar questions for each question
    similar_to_values = []
    
    # Iterate over each question
    for i, question in enumerate(df.itertuples()):
        # i is the row position in cosine_sim (question.Index would only match it for a default RangeIndex)
        question_id = question.Id
        
        # Get the similarity scores of the current question with all other questions
        similarity_scores = list(enumerate(cosine_sim[i]))
        
        # Remove the similarity score of the current question with itself
        similarity_scores = [(j, score) for j, score in similarity_scores if j != i]
        
        # Sort the similarity scores in descending order
        similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
        
        # Filter the similarity scores by the threshold
        similar_scores = [(df.iloc[j]['Id'], score) for j, score in similarity_scores if score >= threshold]
        
        # If the current question has similar questions above the threshold
        if similar_scores:
            # Add the similar questions to the list
            similar_questions = {j: score for j, score in similar_scores}
            similar_to_values.append(similar_questions)
        else:
            # If the current question has no similar questions above the threshold, add an empty dictionary to the list
            similar_to_values.append({})
    
    # Add the similar questions list as a new column to the DataFrame
    df['SimilarTo'] = similar_to_values
    
    return df
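
The per-row Python loop above builds and sorts a list of every other question for each row, which gets slow as the number of questions grows. A possible alternative, sketched here under the assumption that the dense similarity matrix already fits in memory (at 20000 questions it holds 20000 x 20000 float64 values, roughly 3.2 GB), is to let NumPy find the entries above the threshold in one pass. The function similar_pairs and the small matrix are hypothetical; unlike the loop above, the matches are not sorted by score.

```python
import numpy as np

def similar_pairs(sim, ids, threshold):
    """Return {id: {other_id: score}} for every off-diagonal entry >= threshold."""
    sim = np.asarray(sim, dtype=float).copy()
    # Never match a question with itself
    np.fill_diagonal(sim, -np.inf)
    # Row and column positions of all entries that reach the threshold
    rows, cols = np.nonzero(sim >= threshold)
    result = {question_id: {} for question_id in ids}
    for r, c in zip(rows, cols):
        result[ids[r]][ids[c]] = float(sim[r, c])
    return result

# Hand-made 3x3 similarity matrix, only for illustration
sim = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])
ids = [80, 90, 120]
matches = similar_pairs(sim, ids, 0.7)
print(matches)
```

The result has the same shape as the SimilarTo column: one dict per question, empty when nothing reaches the threshold.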

In [9]:
similar_questions_df = find_similar_questions(questions_df, 0.7)



In [10]:
similar_questions_df.head(10)

Unnamed: 0,Id,Score,Title,Body,SimilarTo
0,80,26,sqlstatement.execute() - multiple queries in o...,i have written a database generation script in...,{}
1,90,144,good branching and merging tutorials for torto...,are there any really good tutorials explaining...,{}
2,120,21,asp.net site maps,has anyone got experience creating sql-based a...,{}
3,180,53,function for creating color wheels,this is something i have pseudo-solved many ti...,{}
4,260,49,adding scripting functionality to .net applica...,i have a little game written in c#. it uses a ...,{}
5,330,29,should i use nested classes in this case?,i am working on a collection of classes used f...,{}
6,470,13,homegrown consumption of web services,i have been writing a few web services for a ....,{}
7,580,21,deploying sql server databases from test to live,i wonder how you guys manage deployment of a d...,{}
8,650,79,automatically update version number,i would like the version property of my applic...,{}
9,810,9,visual studio setup project - per user registr...,i am trying to maintain a setup project in (y...,{}


In [11]:

filtered_similar_questions_df = similar_questions_df.loc[similar_questions_df['SimilarTo'].map(bool)]
filtered_similar_questions_df.head(15)


Unnamed: 0,Id,Score,Title,Body,SimilarTo
273,26800,11,xpath and selecting a single node,i am using xpath in .net to parse an xml docum...,{100500: 0.729711367101728}
380,34920,23,how do i lock a file in perl?,what is the best way to create a lock on a fil...,{410270: 0.7096728625667104}
450,41300,86,emacs in windows,how do you run emacs in windows?what is the be...,{189490: 0.8048765186300925}
603,52550,75,"what does the comma operator , do in c?",what does the operator do in c?,{149500: 0.8821434755082395}
682,58510,181,"using .net, how can you find the mime type of ...",i am looking for a simple way to get a mime ty...,{1029740: 0.7648153937223412}
840,68640,18,can you have a class in a struct?,is it possible in c# to have a struct with a m...,{646890: 0.7106321066353134}
1199,100500,6,how do you bind in xaml to a dynamic xpath?,i have a list box that displays items based on...,{26800: 0.729711367101728}
1821,149500,39,what does the comma operator do?,what does the following code do in c/c++?,{52550: 0.8821434755082395}
2298,184710,515,what is the difference between a deep copy and...,what is the difference between a deep copy and...,{647260: 0.7025272966836916}
2364,189490,78,where can i find my .emacs file for emacs runn...,i tried looking for the .emacs file for my win...,{41300: 0.8048765186300925}


In [12]:
filtered_similar_questions_df.tail(15)

Unnamed: 0,Id,Score,Title,Body,SimilarTo
15331,899090,44,linq - where not exists,what is the equivalent of following statement ...,{423840: 0.7278074110816055}
15462,905410,5,.net or java based small desktop app,i had posted a question a few days ago and tha...,{897770: 0.8275055204413637}
16210,938620,2,how to check a popup menu item?,how to check a popup menu item?,{631580: 0.7398198301443908}
16358,945620,20,how to use a wsdl file to create a wcf proxy?,i have an old wsdl file and i want to use wcf ...,{950150: 0.7497796585898421}
16457,950150,87,how to use a wsdl file to create a wcf service...,i have an old wsdl file and i want to create a...,{945620: 0.7497796585898421}
17160,985280,157,can i call an overloaded constructor from anot...,can i call an overloaded constructor from anot...,{829870: 0.7036307419323536}
17564,1004560,4,rhino mocks & compact framework,i have been experimenting with rhino mocks for...,{466520: 0.721053429572297}
18097,1029740,158,get mime type from filename extension,how can i get the mime type from a file extens...,{58510: 0.7648153937223412}
18245,1036380,18,can i still develop 32-bit applications using ...,i am wondering if i can still develop 32-bit a...,{771240: 0.7046698351298859}
18377,1041520,0,apache rewriterule not working without page # ...,i have a rewrite rule set up in my .htaccess f...,{1042370: 0.7989176173577902}
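
Because cosine similarity is symmetric, each match shows up twice in the filtered table (for example, 945620 points to 950150 and 950150 points back to 945620). When counting likely duplicates, it may help to collapse these into unordered pairs. The sketch below assumes the SimilarTo values have been collected into a plain dict keyed by question Id; unique_pairs is a hypothetical helper, and the sample entries are copied from the output above.

```python
def unique_pairs(similar_to_by_id):
    """Collapse {id: {other_id: score}} into {(low_id, high_id): score}."""
    pairs = {}
    for question_id, matches in similar_to_by_id.items():
        for other_id, score in matches.items():
            # Order the two Ids so (a, b) and (b, a) share one key
            key = (min(question_id, other_id), max(question_id, other_id))
            pairs[key] = score
    return pairs

# Entries copied from the filtered output above
sample = {
    945620: {950150: 0.7497796585898421},
    950150: {945620: 0.7497796585898421},
    905410: {897770: 0.8275055204413637},
}
pairs = unique_pairs(sample)
print(len(pairs))
```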


Let's pick an arbitrary question and check its similarity with the other questions.

In [26]:
filtered_similar_questions_df.loc[filtered_similar_questions_df['Id'] == 905410].Title.values[0]


'.net or java based small desktop app'

In [27]:
filtered_similar_questions_df.loc[filtered_similar_questions_df['Id'] == 897770].Title.values[0]


'.net or java based small desktop app'

In [28]:
filtered_similar_questions_df.loc[filtered_similar_questions_df['Id'] == 905410].Body.values[0]


'i had posted a question a few days ago and thanks a lot to those who already responded. i am reposting the question because it seemed like i needed to clarify our requirements. so here it goes in more detail.i am trying to get a very small desktop app built - something that can be downloaded by people very quickly. i am trying to decide whether i shd build it in .net or java. i have two objectives: 1. very quick download 2. targeting the largest set of users (in that order)i know java will be cross platform, but if a lot of windows users do not have jre installed on their computers, i am told they will need to download some 15mb of jre software to make this app run whereas .net could be pre-installed in most windows machines.as mentioned, a small, very quickly downloadable app is more important to me than a cross platform app. so i want to go with the platform that is pre-installed in the highest number of computers, so that my users just download my app without also requiring an addi

In [29]:
filtered_similar_questions_df.loc[filtered_similar_questions_df['Id'] == 897770].Body.values[0]


'i am trying to get a very small desktop app built - something that can be downloaded by people very quickly. i am trying to decide whether i should build it in .net or java. i know java will be cross platform, but i want to know if a lot of windows users do not have jre installed on their computers, in which case i am told they will need to download some 15mb of jre software to make this app run whereas .net will automatically be pre-installed in most windows machines. does anyone know what percentage of windows users do not have jre on their machines? and what % age of windows users have .net pre-installed? ps. the decision for us is: if a large number of windows users have jre, then go for java, if not, then go for .net. '

We can see that the question with Id 905410 is indeed similar to the question with Id 897770. Reading the text, we can see that the question was even posted twice, with the same title, by the same person, but with different bodies.