# Stack Overflow dataset analysis

This notebook uses the MSR2021Replication scripts to analyze other Stack Overflow datasets, using the Mallet tool to cluster a dataset into a predetermined number of topics. It aims to make those scripts simpler to use, easier to understand, and applicable to other datasets.


## Install python libraries

The first step is to install the libraries used by the scripts.

In [None]:
!pip install -r notebook/requirements.txt

In [None]:
import nltk

# 'word_tokenize' is a function, not a downloadable resource; it relies on 'punkt'
nltk.download('punkt')
nltk.download('stopwords')

Add the `notebook/` folder to the Python path so that its scripts can be imported here.

In [1]:
import sys
sys.path.insert(0, 'notebook/')

## Export variables

To point the scripts at your own data, set the environment variables below: the path to the raw dataset, the output folder, and the number of topics.


In [2]:
# Export path to the raw dataset
%env DATASET_PATH=./tcc/so_questions.csv

# Export the output path
%env OUTPUT_PATH=./output

# Export the number of topics
%env TOPICS_NUM=15


env: DATASET_PATH=./tcc/so_questions.csv
env: OUTPUT_PATH=./output
env: TOPICS_NUM=15
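
As a rough sketch of how a script could pick up these settings, assuming they are read through `os.environ`: the helper name `load_settings` and the fallback defaults are hypothetical and are not taken from the MSR2021Replication scripts.

```python
import os


def load_settings(env=None):
    """Read the notebook's settings from environment variables.

    Hypothetical helper: the real scripts may read these values differently.
    """
    env = os.environ if env is None else env
    return {
        "dataset_path": env.get("DATASET_PATH", "./tcc/so_questions.csv"),
        "output_path": env.get("OUTPUT_PATH", "./output"),
        # Environment variables are always strings, so convert explicitly.
        "topics_num": int(env.get("TOPICS_NUM", "15")),
    }


# Pass a plain dict here so the example does not depend on the real environment.
settings = load_settings(
    {
        "DATASET_PATH": "./tcc/so_questions.csv",
        "OUTPUT_PATH": "./output",
        "TOPICS_NUM": "15",
    }
)
print(settings["topics_num"])
```

Converting `TOPICS_NUM` to an integer early means a malformed value fails here rather than later, inside the Mallet step.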


## Prepare dataset for Mallet

The following scripts clean the Stack Overflow dataset and prepare the documents on which the Mallet tool will run its topic-modeling algorithm.


In [3]:
from clean_stackoverflow_data import clean_so_data
from export_so_to_mallet import export_to_mallet

print('Cleaning dataset...')
clean_so_data()

print('Exporting documents to Mallet...')
export_to_mallet()
print('Done!')


Folder ./output/so_data/ created!
Cleaning dataset...
./tcc/so_questions.csv
Loaded CSV!
Removed HTML tags!
Removed stopwords!
Folder ./output/processed/ created!
Saved new csv!
Exporting documents to Mallet...
Done!
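
The log above reports two cleaning steps: removing HTML tags and removing stopwords. A minimal sketch of those two steps follows. It assumes a regex-based tag stripper and a tiny hand-written stopword set; `clean_text` and `STOPWORDS` are illustrative names, and the actual `clean_so_data` implementation may work differently.

```python
import re
from html import unescape

# Tiny illustrative stopword list (assumption). The notebook downloads NLTK's
# English stopwords, which is a much larger list.
STOPWORDS = {"a", "an", "the", "is", "to", "in", "i", "how", "my", "do"}

# Matches anything that looks like an HTML tag, e.g. <p> or </code>.
TAG_RE = re.compile(r"<[^>]+>")


def clean_text(raw_html):
    """Strip HTML tags, lowercase, and drop stopwords from one question body."""
    text = unescape(TAG_RE.sub(" ", raw_html))
    tokens = text.lower().split()
    return " ".join(t for t in tokens if t not in STOPWORDS)


cleaned = clean_text("<p>How do I center a <code>View</code> in React Native?</p>")
print(cleaned)
```

Replacing each tag with a space rather than an empty string keeps words on either side of a tag from being glued together.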


## Run the Mallet Tool

The next step is to run the Mallet tool. The Mallet commands use the environment variables set at the beginning of this notebook.

In [7]:
!mallet/mallet-2.0.8/bin/mallet import-dir --input $OUTPUT_PATH/so_data/ --output $OUTPUT_PATH/so.mallet --keep-sequence --remove-stopwords --extra-stopwords extra_stop_words/so.txt

Labels = 
   ./output/so_data/


In [8]:
!mallet/mallet-2.0.8/bin/mallet train-topics --random-seed 100 --input $OUTPUT_PATH/so.mallet --num-topics $TOPICS_NUM --optimize-interval 20 --output-state $OUTPUT_PATH/so-topic-state.gz --output-topic-keys $OUTPUT_PATH/so_keys.txt --output-doc-topics $OUTPUT_PATH/so_composition.txt --diagnostics-file $OUTPUT_PATH/so_diagnostics.xml


Mallet LDA: 15 topics, 4 topic bits, 1111 topic mask
Data loaded.
max tokens: 2249
total tokens: 10416218
<10> LL/token: -9,22266
<20> LL/token: -8,65279
<30> LL/token: -8,50522
<40> LL/token: -8,44217

0	0,33333	native react location map code react-native latitude longitude user work create make library marker find mobile http mapview web implement 
1	0,33333	const test animation error detox rctview true animated.view react duration type null animated import start transform code element svg function 
2	0,33333	const import return state export dispatch store redux action date default type function case connect error reducer console.log provider react 
3	0,33333	style view text center height width image color backgroundcolor flex touchableopacity justifycontent const fontsize alignitems onpress source row flexdirection scrollview 
4	0,33333	react error native react-native project android file ios build expo run version path device work issue running pod code found 
5	0,33333	item data k

<200> LL/token: -8,33591
<210> LL/token: -8,33503
[beta: 0,01705] 
<220> LL/token: -8,31239
<230> LL/token: -8,28166
[beta: 0,01756] 
<240> LL/token: -8,27068

0	0,2362	native react location map code react-native library latitude longitude mobile web marker http api create user ios google link android 
1	0,07465	const test animation true false detox rctview animated.view height width amp animated null return duration svg transform view start position 
2	0,11241	const import state return dispatch export store redux action date default type case function connect reducer provider mapstatetoprops error console.log 
3	0,21803	style view text center width height color image touchableopacity backgroundcolor flex onpress const justifycontent alignitems fontsize source import react row 
4	0,26633	error android react native react-native project file build ios run version expo device path running pod work issue command code 
5	0,16718	item data key flatlist index const array return list div rende

[beta: 0,01819] 
<400> LL/token: -8,16964
<410> LL/token: -8,16806
[beta: 0,01822] 
<420> LL/token: -8,16804
<430> LL/token: -8,16651
[beta: 0,01823] 
<440> LL/token: -8,16666

0	0,01865	location const map latitude longitude marker amp mapview coordinate true error region null lat position object key console.log return code 
1	0,02225	const true test animation false rctview detox animated.view width height amp view return null animated duration svg function start transform 
2	0,0365	const import state return dispatch export store redux action date default type function case reducer connect mapstatetoprops provider console.log error 
3	0,10123	style view text center width height color image backgroundcolor touchableopacity flex onpress import const justifycontent alignitems fontsize source react button 
4	0,10068	error android react-native react project native build file run version expo ios device path pod command running xcode failed install 
5	0,0679	item data flatlist index const ke

[beta: 0,01832] 
<600> LL/token: -8,16129
<610> LL/token: -8,16187
[beta: 0,01832] 
<620> LL/token: -8,16125
<630> LL/token: -8,16072
[beta: 0,01834] 
<640> LL/token: -8,16089

0	0,01653	location const latitude map longitude marker amp mapview error coordinate true region null lat position console.log object return key code 
1	0,02132	const true animation test false height width rctview detox animated.view amp return view animated function duration null start svg position 
2	0,03436	const import state return export dispatch action store redux date default function type case reducer connect error mapstatetoprops console.log provider 
3	0,0934	style view text center width height color image touchableopacity backgroundcolor flex import onpress const justifycontent alignitems fontsize react source button 
4	0,07089	error android react-native project build react run native file version path pod ios command running expo device failed xcode install 
5	0,06228	item data flatlist index key cons

[beta: 0,01835] 
<800> LL/token: -8,15969
<810> LL/token: -8,15987
[beta: 0,01835] 
<820> LL/token: -8,15924
<830> LL/token: -8,15961
[beta: 0,01835] 
<840> LL/token: -8,15955

0	0,01607	location const latitude longitude map marker mapview amp error coordinate true region null position lat console.log object return key code 
1	0,02086	const true animation test false width height rctview detox animated.view amp view return animated function duration null svg start position 
2	0,03426	const import state return export dispatch redux store action date default function type case reducer connect mapstatetoprops console.log provider error 
3	0,09224	style view text center height width color image touchableopacity backgroundcolor flex import onpress const justifycontent react alignitems fontsize source button 
4	0,06487	error android project react-native build run react file native version path pod command ios failed running xcode device install debug 
5	0,06175	item data flatlist index key co

[beta: 0,01837] 
<1000> LL/token: -8,15953

Total time: 12 minutes 32 seconds
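
The `LL/token` lines in the training log give a quick way to check whether the model has settled. The sketch below extracts them, assuming the log has been captured as text; `parse_ll_per_token` is a hypothetical helper and is not part of Mallet or the replication scripts. It handles the decimal comma seen in the output above.

```python
import re

# Matches lines such as "<10> LL/token: -9,22266" (decimal comma or point).
LL_RE = re.compile(r"^<(\d+)> LL/token: (-?\d+[.,]\d+)")


def parse_ll_per_token(log_text):
    """Return (iteration, log-likelihood per token) pairs from a Mallet log."""
    points = []
    for line in log_text.splitlines():
        match = LL_RE.match(line.strip())
        if match:
            # Normalize the decimal comma before converting to float.
            value = float(match.group(2).replace(",", "."))
            points.append((int(match.group(1)), value))
    return points


sample = """<10> LL/token: -9,22266
<20> LL/token: -8,65279
[beta: 0,01705]
<1000> LL/token: -8,15953"""

points = parse_ll_per_token(sample)
print(points[0], points[-1])
```

In the run above, the value rises from about -9.22 at iteration 10 to about -8.16 at iteration 1000, and it changes little after roughly iteration 400.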


## Parse results

The following scripts parse the Mallet output and place all questions from the same topic into one file, resulting in one file per topic.

In [3]:
from parse_topics_composition import parse_topics
from unite_topics_in_one_file import unite_questions_documents_by_topic

print('Parsing topics...')
parse_topics()
print('Uniting questions by topic...')
unite_questions_documents_by_topic()
print('Done!')

Parsing topics...
Uniting questions by topic...
Folder ./output/topics created!


FileNotFoundError: [Errno 2] No such file or directory: './output/topics/topic_1.csv'
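
To illustrate the grouping that the parsing step describes, here is a minimal sketch under two loudly stated assumptions. First, each row of the doc-topics file is tab-separated as document index, document name, then one weight per topic. Second, each question is assigned to its highest-weighted topic. The helper names and the sample rows are made up for illustration; the real `parse_topics` and `unite_questions_documents_by_topic` scripts may use a different rule.

```python
from collections import defaultdict


def dominant_topic(line):
    """Return (document name, index of the highest-weighted topic) for one row.

    Assumed row layout: doc index, doc name, then one weight per topic,
    all tab-separated.
    """
    fields = line.rstrip("\n").split("\t")
    name = fields[1]
    # Accept a decimal comma as well as a decimal point.
    weights = [float(w.replace(",", ".")) for w in fields[2:]]
    best = max(range(len(weights)), key=weights.__getitem__)
    return name, best


def group_by_topic(lines):
    """Map each topic index to the documents whose dominant topic it is."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and a possible header row
        name, topic = dominant_topic(line)
        groups[topic].append(name)
    return dict(groups)


# Made-up rows with three topics, for illustration only.
rows = [
    "0\tfile:/output/so_data/1.txt\t0,10\t0,70\t0,20",
    "1\tfile:/output/so_data/2.txt\t0,60\t0,30\t0,10",
    "2\tfile:/output/so_data/3.txt\t0,05\t0,80\t0,15",
]
groups = group_by_topic(rows)
print(groups)
```

Under this grouping a topic can end up with no documents at all (topic 2 in the sample rows), so any later step that opens one file per topic needs to handle a missing file.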