<h1>Get Data</h1>

<h2>Data Collection</h2>

Updated on a quarterly basis, this BigQuery dataset includes an archive of Stack Overflow content, including posts, votes, tags, and badges. This dataset is updated to mirror the Stack Overflow content on the Internet Archive.
More info about the dataset is given at: https://www.kaggle.com/stackoverflow/stackoverflow

To collect the data we need to gather Questions and Answer that were posted on Stack Overflow. Thus what we need are the following:

- Title
- Question body
- Answers for that question
- Votes for each answers

We decide to restrict the query only to questions that has 'python' has a tag, due to the abundance of Q&A in Stack Overflow, to perform better test and try to give more precise answers.
However this process can be done over other argument just by changing the LIKE '%python%' word.

In [1]:
import pandas as pd
import numpy as np
import spacy
import preprocessing
from csv import reader 
import tfidf
import scipy

pd.reset_option("^display")

<h2>Load Data</h2>

In [2]:
df = pd.read_csv('DB/Original_data.csv', index_col=[0])

#df = df.sample(frac = 0.2)

print('Missing Values:')
df.isna().sum()

Missing Values:


id         0
title      0
body       0
tags       0
answers    0
score      0
dtype: int64

<h2>Manipolate dataframe</h2>

In [3]:
#create column with answers aggregate by title of the questions
df2 = df 
func = lambda x: "\n".join(x)
df2 = df2.groupby('id')["answers"].agg([("answers",func)])



In [4]:
# concat answers for each post beacause there are a record with question duplicated for each answer to that question 


grouped = df.groupby(['id','title', 'body','tags'],as_index=False).agg("sum","score")

grouped_df = pd.DataFrame(grouped)
grouped_df = pd.merge(grouped_df, df2, left_on='id', right_on='id', how='left')
grouped_df



Unnamed: 0,id,title,body,tags,score,answers
0,742,Class views in Django,"<p><a href=""http://www.djangoproject.com/"" rel...",python|django|views|oop,75,"<p>I needed to use class based views, but I wa..."
1,773,How do I use itertools.groupby()?,<p>I haven't been able to find an understandab...,python|itertools,1081,<p>A neato trick with groupby is to run length...
2,1829,How do I make a menu that does not require the...,<p>I've got a menu in Python. That part was ea...,python,22,<p>The reason msvcrt fails in IDLE is because...
3,2933,Create a directly-executable cross-platform GU...,<p>Python works on multiple platforms and can ...,python|user-interface|deployment|tkinter|relea...,410,<p>PySimpleGUI wraps tkinter and works on Pyth...
4,5102,How do you set up Python scripts to work in Ap...,<p>I tried to follow a couple of googled up tu...,python|apache|apache2,27,"<p>Yes, mod_python is pretty confusing to set ..."
...,...,...,...,...,...,...
297600,69060217,QLineSeries can't append the QDateTime().toMse...,<p>When I append the datetime and value to lin...,python|pyside6|qchart,0,<p>below is error screen\nI have no idea what'...
297601,69060327,Problem with the print statement in Python,<p>What it's wrong with this code?</p>\n<pre><...,python,3,"<p>I couldnt comment, so I am posting as answe..."
297602,69060696,How do I extract two dates from a string in py...,<pre><code>y = &quot;The program will run from...,python|regex|web-scraping|re|findall,1,<p>You should try <code>findall</code> instead...
297603,69060948,Python Input and Output?,<p>I am working on learning python and I for s...,python|input|output,1,<p>in this line <code>numTickets = input(&quot...


<h2>Preprocessing Part</h2>

In the preprocessing process here before we calling a function from the preprocessing file that remove the html tags and the part of text to not consider (for example the codes sections) and then we using other function to clean the text with NLP technics.

<h3>Manipolating answers</h3>

In [5]:
#Removing tags
answers = grouped_df["answers"]
preprocessing.remove_tags(answers)

#Clearing text 
answers_processed = answers.apply(lambda x: preprocessing.clear_text(x))


In [6]:
answers_processed.head()
answers_processed.isna().sum() 
grouped_df['answers_processed'] = answers_processed

<h3>Manipolating questions</h3>

In [7]:
#Merge title with body 
questions = grouped_df["body"]
preprocessing.remove_tags(questions)
questions

0           view points to a function, which can be a pr...
1          I haven't been able to find an understandable...
2          I've got a menu in Python. That part was easy...
3          Python works on multiple platforms and can be...
4          I tried to follow a couple of googled up tuto...
                                ...                        
297600     When I append the datetime and value to lines...
297601     What it's wrong with this code?      This is ...
297602        returns:      This gives me the first date...
297603     I am working on learning python and I for som...
297604     To solve the problem, I implemented the follo...
Name: body, Length: 297605, dtype: object

In [8]:
#Clearing text 
questions_processed = questions.apply(lambda x: preprocessing.clear_text(x))
grouped_df['questions_processed'] = questions_processed
questions_processed

0         view points function problem want change bit f...
1         able find understandable explanation actually ...
2         got menu python part easy using get selection ...
3         python works multiple platforms used desktop w...
4         tried follow couple googled tutorials setting ...
                                ...                        
297600    append datetime value lineseries error occured...
297601    wrong code python code running windows see any...
297602            returns gives first date get second date 
297603    working learning python reason output get alwa...
297604    solve problem implemented following code sortc...
Name: body, Length: 297605, dtype: object

<h3>Manipolating titles</h3>
Create a column only for the processed title of the questions 

In [9]:
processed_title = grouped_df.title.apply(lambda x: preprocessing.clear_text(x))
grouped_df['processed_title'] = processed_title
processed_title 

0                                       class views django 
1                                    use itertools groupby 
2         make menu require user press enter make select...
3         create directly executable cross platform gui ...
4                           set python scripts work apache 
                                ...                        
297600       qlineseries append qdatetime tomsecssincepoch 
297601                      problem print statement python 
297602                     extract two dates string python 
297603                                 python input output 
297604           dutch national flag python implementation 
Name: title, Length: 297605, dtype: object

Drop columns that are not utils anymore

In [10]:
#post_corpus = processed_title + '\n '+ questions_processed + '\n ' + answers_processed
grouped_df.drop("answers", axis=1, inplace=True)
grouped_df.drop("body", axis=1, inplace=True)
#grouped_df["post_corpus"] = post_corpus
grouped_df["questions"] = questions
grouped_df

Unnamed: 0,id,title,tags,score,answers_processed,questions_processed,processed_title,questions
0,742,Class views in Django,python|django|views|oop,75,needed use class based views wanted able use f...,view points function problem want change bit f...,class views django,"view points to a function, which can be a pr..."
1,773,How do I use itertools.groupby()?,python|itertools,1081,neato trick groupby run length encoding one li...,able find understandable explanation actually ...,use itertools groupby,I haven't been able to find an understandable...
2,1829,How do I make a menu that does not require the...,python,22,reason msvcrt fails idle idle accessing librar...,got menu python part easy using get selection ...,make menu require user press enter make select...,I've got a menu in Python. That part was easy...
3,2933,Create a directly-executable cross-platform GU...,python|user-interface|deployment|tkinter|relea...,410,pysimplegui wraps tkinter works python also ru...,python works multiple platforms used desktop w...,create directly executable cross platform gui ...,Python works on multiple platforms and can be...
4,5102,How do you set up Python scripts to work in Ap...,python|apache|apache2,27,yes mod python pretty confusing set httpd conf...,tried follow couple googled tutorials setting ...,set python scripts work apache,I tried to follow a couple of googled up tuto...
...,...,...,...,...,...,...,...,...
297600,69060217,QLineSeries can't append the QDateTime().toMse...,python|pyside6|qchart,0,error screen idea wrong installed python insta...,append datetime value lineseries error occured...,qlineseries append qdatetime tomsecssincepoch,When I append the datetime and value to lines...
297601,69060327,Problem with the print statement in Python,python,3,couldnt comment posting answer need put print ...,wrong code python code running windows see any...,problem print statement python,What it's wrong with this code? This is ...
297602,69060696,How do I extract two dates from a string in py...,python|regex|web-scraping|re|findall,1,try instead output,returns gives first date get second date,extract two dates string python,returns: This gives me the first date...
297603,69060948,Python Input and Output?,python|input|output,1,line got user input got convert input like see...,working learning python reason output get alwa...,python input output,I am working on learning python and I for som...


<h3>Filter Tags</h3>
Filter out the tags, selecting only the 30 most common for better processing, so we can have less variability in the data.

In [11]:
# Convert raw text data of tags into lists
grouped_df["tags"] = grouped_df["tags"].apply(lambda x: x.split('|'))   

# Make a dictionary to count the frequencies for all tags
tag_freq_dict = {}

for tags in grouped_df["tags"]:
    for tag in tags:
        #Remove tags python, python2.7 e python3 for further processing 
        if "python" not in tag :
            
            if tag not in tag_freq_dict:
                tag_freq_dict[tag] = 0
            else:
                tag_freq_dict[tag] += 1
        else:
            tags.remove(tag)
            
grouped_df["tags"]


0                                      [django, views, oop]
1                                               [itertools]
2                                                        []
3         [user-interface, deployment, tkinter, release-...
4                                         [apache, apache2]
                                ...                        
297600                                    [pyside6, qchart]
297601                                                   []
297602                   [regex, web-scraping, re, findall]
297603                                      [input, output]
297604          [if-statement, dutch-national-flag-problem]
Name: tags, Length: 297605, dtype: object

In [12]:
#Selecting the most common number of tags in our database 
import heapq
most_common_tags = heapq.nlargest(30, tag_freq_dict, key=tag_freq_dict.get)
most_common_tags

['dataframe',
 'numpy',
 'pandas',
 'dictionary',
 'django',
 'matplotlib',
 'list',
 'beautifulsoup',
 'tkinter',
 'flask',
 'keras',
 'csv',
 'javascript',
 'django-models',
 'web-scraping',
 'scipy',
 'scikit-learn',
 'for-loop',
 'datetime',
 'django-rest-framework',
 'sqlalchemy',
 'selenium-webdriver',
 'pyspark',
 'machine-learning',
 'pip',
 'loops',
 'string',
 'deep-learning',
 'scrapy',
 'tensorflow']

Select only the data with at least one of the most common tags

In [13]:
final_indices = []
for i,tags in enumerate(grouped_df["tags"].values.tolist()):
    if len(set(tags).intersection(set(most_common_tags)))>0:   # The minimum length for common tags should be 1
        final_indices.append(i)

final_data = grouped_df.iloc[final_indices]

final_data 

Unnamed: 0,id,title,tags,score,answers_processed,questions_processed,processed_title,questions
0,742,Class views in Django,"[django, views, oop]",75,needed use class based views wanted able use f...,view points function problem want change bit f...,class views django,"view points to a function, which can be a pr..."
3,2933,Create a directly-executable cross-platform GU...,"[user-interface, deployment, tkinter, release-...",410,pysimplegui wraps tkinter works python also ru...,python works multiple platforms used desktop w...,create directly executable cross platform gui ...,Python works on multiple platforms and can be...
8,19339,Transpose/Unzip Function (inverse of zip)?,"[list, matrix, transpose]",954,since returns tuples trick seems clever useful...,list item tuples like convert lists first cont...,transpose unzip function inverse zip,I have a list of 2-item tuples and I'd like t...
9,21961,Date/time conversion using time.mktime seems w...,[datetime],12,local time fancy time tuple incidentally seem ...,return number seconds since epoch since giving...,date time conversion using time mktime seems w...,should return the number of seconds sinc...
24,36139,How to sort a list of strings?,"[string, sorting]",691,proper way sort strings previous example work ...,best way creating alphabetically sorted list p...,sort list strings,What is the best way of creating an alphabeti...
...,...,...,...,...,...,...,...,...
297594,69059605,How to check if a column of DataFrame contain ...,"[pandas, dataframe, types]",2,try even outputs create subset dataframe check...,example got data frame gt gt question filter e...,check column dataframe contain float type,"For example, I've got this data frame. ..."
297596,69059907,How to increase/change a value in a dataset wi...,"[pandas, for-loop]",0,create boolean auxiallry column multiply value...,want create loop iterates data probability end...,increase change value dataset condition python,What I want to do is: create a for loop that ...
297597,69059948,Check if elements in numpy array are between c...,[numpy],2,direct way add axis broadcast think broadcasti...,numpy array arbitrary size numpy array exactly...,check elements numpy array columns column nump...,I have a 1-D numpy array with arbitrary size...
297598,69060114,Pytorch geometric: how to explain the input in...,"[machine-learning, graph, pytorch, data-science]",0,okay think got understanding output dimensions...,reading pytorch geometric documentation page c...,pytorch geometric explain input code snippet,I am reading PyTorch geometric documentation ...


In [14]:
# Normalize numeric data for the scores
final_data['score'] = (final_data['score'] - final_data['score'].mean()) / (final_data['score'].max() - final_data['score'].min())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final_data['score'] = (final_data['score'] - final_data['score'].mean()) / (final_data['score'].max() - final_data['score'].min())


In [15]:
# Combine the lists back into text data
final_data['tags'] = final_data['tags'].apply(lambda x: '|'.join(x))

final_data

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final_data['tags'] = final_data['tags'].apply(lambda x: '|'.join(x))


Unnamed: 0,id,title,tags,score,answers_processed,questions_processed,processed_title,questions
0,742,Class views in Django,django|views|oop,0.006875,needed use class based views wanted able use f...,view points function problem want change bit f...,class views django,"view points to a function, which can be a pr..."
3,2933,Create a directly-executable cross-platform GU...,user-interface|deployment|tkinter|release-mana...,0.039978,pysimplegui wraps tkinter works python also ru...,python works multiple platforms used desktop w...,create directly executable cross platform gui ...,Python works on multiple platforms and can be...
8,19339,Transpose/Unzip Function (inverse of zip)?,list|matrix|transpose,0.093733,since returns tuples trick seems clever useful...,list item tuples like convert lists first cont...,transpose unzip function inverse zip,I have a list of 2-item tuples and I'd like t...
9,21961,Date/time conversion using time.mktime seems w...,datetime,0.000650,local time fancy time tuple incidentally seem ...,return number seconds since epoch since giving...,date time conversion using time mktime seems w...,should return the number of seconds sinc...
24,36139,How to sort a list of strings?,string|sorting,0.067745,proper way sort strings previous example work ...,best way creating alphabetically sorted list p...,sort list strings,What is the best way of creating an alphabeti...
...,...,...,...,...,...,...,...,...
297594,69059605,How to check if a column of DataFrame contain ...,pandas|dataframe|types,-0.000338,try even outputs create subset dataframe check...,example got data frame gt gt question filter e...,check column dataframe contain float type,"For example, I've got this data frame. ..."
297596,69059907,How to increase/change a value in a dataset wi...,pandas|for-loop,-0.000536,create boolean auxiallry column multiply value...,want create loop iterates data probability end...,increase change value dataset condition python,What I want to do is: create a for loop that ...
297597,69059948,Check if elements in numpy array are between c...,numpy,-0.000338,direct way add axis broadcast think broadcasti...,numpy array arbitrary size numpy array exactly...,check elements numpy array columns column nump...,I have a 1-D numpy array with arbitrary size...
297598,69060114,Pytorch geometric: how to explain the input in...,machine-learning|graph|pytorch|data-science,-0.000536,okay think got understanding output dimensions...,reading pytorch geometric documentation page c...,pytorch geometric explain input code snippet,I am reading PyTorch geometric documentation ...


In [16]:
#Check if the final data has some null values 
final_data.isna().sum()

final_data = final_data.dropna()

final_data = final_data[final_data['processed_title'].notna()]
final_data 

Unnamed: 0,id,title,tags,score,answers_processed,questions_processed,processed_title,questions
0,742,Class views in Django,django|views|oop,0.006875,needed use class based views wanted able use f...,view points function problem want change bit f...,class views django,"view points to a function, which can be a pr..."
3,2933,Create a directly-executable cross-platform GU...,user-interface|deployment|tkinter|release-mana...,0.039978,pysimplegui wraps tkinter works python also ru...,python works multiple platforms used desktop w...,create directly executable cross platform gui ...,Python works on multiple platforms and can be...
8,19339,Transpose/Unzip Function (inverse of zip)?,list|matrix|transpose,0.093733,since returns tuples trick seems clever useful...,list item tuples like convert lists first cont...,transpose unzip function inverse zip,I have a list of 2-item tuples and I'd like t...
9,21961,Date/time conversion using time.mktime seems w...,datetime,0.000650,local time fancy time tuple incidentally seem ...,return number seconds since epoch since giving...,date time conversion using time mktime seems w...,should return the number of seconds sinc...
24,36139,How to sort a list of strings?,string|sorting,0.067745,proper way sort strings previous example work ...,best way creating alphabetically sorted list p...,sort list strings,What is the best way of creating an alphabeti...
...,...,...,...,...,...,...,...,...
297594,69059605,How to check if a column of DataFrame contain ...,pandas|dataframe|types,-0.000338,try even outputs create subset dataframe check...,example got data frame gt gt question filter e...,check column dataframe contain float type,"For example, I've got this data frame. ..."
297596,69059907,How to increase/change a value in a dataset wi...,pandas|for-loop,-0.000536,create boolean auxiallry column multiply value...,want create loop iterates data probability end...,increase change value dataset condition python,What I want to do is: create a for loop that ...
297597,69059948,Check if elements in numpy array are between c...,numpy,-0.000338,direct way add axis broadcast think broadcasti...,numpy array arbitrary size numpy array exactly...,check elements numpy array columns column nump...,I have a 1-D numpy array with arbitrary size...
297598,69060114,Pytorch geometric: how to explain the input in...,machine-learning|graph|pytorch|data-science,-0.000536,okay think got understanding output dimensions...,reading pytorch geometric documentation page c...,pytorch geometric explain input code snippet,I am reading PyTorch geometric documentation ...


Eliminate null values if presents

In [17]:
final_data = final_data[final_data['processed_title'].notna()]

final_data = final_data[final_data['questions_processed'].notna()]

final_data = final_data[final_data['answers_processed'].notna()]

In [18]:
final_data['post_corpus'] = final_data['processed_title'] + final_data['questions_processed'] + final_data['answers_processed']

In [19]:
# Save the data
final_data.to_csv('DB/Preprocessed_data.csv', index=False)