# STAGE 0 - Download all files into a custom local repository

### Input
- **git_url**: the GitHub URL 'https://github.com/andrisimonetti/SLR-toolkit_2' (do not change it)
- **your_repository**: the local path where the files from GitHub will be downloaded

### Output
- A new folder with the name of the GitHub repository will be created. It contains all the files needed to run the code below.

In [None]:
from git import Repo
git_url = 'https://github.com/andrisimonetti/SLR-toolkit_2'
your_repository = 'your/local/path'
Repo.clone_from(git_url, your_repository)

# STAGE 1 - Dataset creation

## Phase 1: Create the data frame from a file downloaded from Scopus or Web of Science

### Input
 - **filename**: name of the file downloaded from Web of Science, with .txt extension. The file must be in the same folder as this Notebook.

### Output
 - "Dataset_input.xlsx" file, a DataFrame in which each row represents a paper and the columns are:
     - id of text; First author; Authors; Article title; Abstract; Source title; Publication year; Journal abbreviation; Journal ISO abbreviation; References; Total number of citations; Number of internal citations; References internal id; DOI.

In [None]:
import preprocessing_functions as prp

In [None]:
prp.read_wos_txt(filename='savedrecs12.txt')
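
The relationship between the 'References internal id' and 'Number of internal citations' columns can be illustrated with a minimal sketch. It assumes that an internal citation is counted once for every paper in the dataset that lists the document among its internal references; the record layout and the function name `count_internal_citations` are hypothetical and are not taken from `preprocessing_functions`.

```python
from collections import Counter

# Hypothetical toy records standing in for rows of Dataset_input.xlsx
papers = [
    {"text_id": "d1", "references_internal_id": ""},
    {"text_id": "d2", "references_internal_id": "d1"},
    {"text_id": "d3", "references_internal_id": "d1 d2"},
]

def count_internal_citations(records):
    """Count, for each paper, how many papers in the dataset cite it."""
    counts = Counter()
    for record in records:
        # set() guards against the same reference being listed twice in one paper
        for cited_id in set(record["references_internal_id"].split()):
            counts[cited_id] += 1
    # Counter returns 0 for papers that are never cited
    return {record["text_id"]: counts[record["text_id"]] for record in records}

internal = count_internal_citations(papers)
print(internal)  # {'d1': 2, 'd2': 1, 'd3': 0}
```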

## Phase 2: Add the top journal flags
### Inputs:
 - **filename**: file listing the top journals. The default file is 'TopJournal_list.txt'; its list was taken from https://journalranking.org and includes all journals ranked 3, 4, or 4*. To use a different list, create a .txt file with one journal name per line, as in 'TopJournal_list.txt'.


 - **df_file**: name of the file with .xlsx extension. The file is a DataFrame with a column named 'source_title' that holds the name of the journal where the paper was published.

### Output:
- 'Dataset_input.xlsx' file, the input DataFrame with a new column ('TOPJ') that indicates whether the journal is a top journal or not.

In [None]:
# assumes add_top_journal is provided by preprocessing_functions (imported above as prp)
prp.add_top_journal(filename="TopJournal_list.txt", df_file='Dataset_input.xlsx')
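
As a rough illustration of what the flagging step amounts to, the sketch below matches a journal name against a list read one name per line. The helper names `load_journal_list` and `top_journal_flag` are hypothetical, and the case-insensitive comparison is an assumption; the toolkit's own matching rule may differ.

```python
def load_journal_list(lines):
    """Normalise journal names read one per line (blank lines are skipped)."""
    return {line.strip().lower() for line in lines if line.strip()}

def top_journal_flag(source_title, top_journals):
    """Return 'Y' if the journal is in the list, otherwise 'N'."""
    return "Y" if source_title.strip().lower() in top_journals else "N"

# Hypothetical content standing in for the lines of a .txt journal list
example_lines = ["Academy of Management Journal\n", "\n", "Journal of Finance\n"]
top_journals = load_journal_list(example_lines)

# Assumption: names are compared case-insensitively
flags = [top_journal_flag(title, top_journals)
         for title in ["JOURNAL OF FINANCE", "Some Other Journal"]]
print(flags)  # ['Y', 'N']
```

With a real list, the lines could be obtained with `open('TopJournal_list.txt', encoding='utf-8')` and passed to `load_journal_list` unchanged.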

## Phase 3: Clean the texts
### Input:
- **filename**: name of the file with .xlsx extension. The file is the DataFrame created in the previous step, named 'Dataset_input.xlsx'.
    - Note: each text (abstract) should be a single string

### Output:
- 'Dataset_input.xlsx' file, the input DataFrame with a new column ('clean_text') containing the cleaned version of the abstract

In [None]:
# assumes preprocess is provided by preprocessing_functions (imported above as prp)
prp.preprocess(filename='Dataset_input.xlsx')
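
To give an idea of what a cleaned abstract might look like, here is a minimal sketch under stated assumptions: it lowercases the text, keeps alphabetic tokens only, and drops a tiny illustrative stop-word list. The function `clean_text` and the stop-word list are hypothetical. Stemming, which the toolkit appears to apply (the topic definitions report words in stemmed form), is left out here.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to", "we"}

def clean_text(abstract):
    """Lowercase, keep alphabetic tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", abstract.lower())
    return " ".join(token for token in tokens if token not in STOP_WORDS)

cleaned = clean_text("We study the diffusion of innovations in 25 firms.")
print(cleaned)  # study diffusion innovations firms
```

The result is a single string with words separated by spaces, which is the shape the 'clean_text' column is expected to have in Stage 2.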

# STAGE 2 - Analysis

# Phase 3: Topic definition and document-topic associations
##  Input values required to run the analysis:

- **file_name**: name of the file created in Stage 1 ("Dataset_Input.xlsx").
    - If you want to create your data frame skipping Stage 1, note that the input data frame must have the following columns:
        - 'text_id': the identifier of the document, e.g. 'd1'.
        - 'Total number of citations': total number of citations.
        - 'Number of internal citations': number of citations within the dataset.
        - 'References internal id': string of 'text_id' separated by space reporting the documents that appear as references, e.g. "d1 d2 d3 d4 d5".
        - "TopJ": flag ('Y' or 'N') indicating whether the publication belongs to a top journal or not.
        - "clean_text": pre-processed text as string; words are separated by space.
    


- **method_w**: correction method for the multiple testing problem: 'bonf' (Bonferroni correction) or 'fdr' (False Discovery Rate, Benjamini and Hochberg, 1995); the default value is 'bonf'.


- **soglia_w**: value of the threshold used to construct the Statistically Validated Network of words; the default value is 0.01.


- **soglia_d**: value of the threshold for document-topic associations with the Bonferroni correction method; the default value is 0.01.


- **soglia_0**: value of the threshold used to calculate the associations between documents and the 'General' topic with the Bonferroni correction method; the default value is 0.01.
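
The difference between the two options of **method_w** can be sketched as follows. This is a generic, textbook version of the Bonferroni and Benjamini-Hochberg procedures, not the code in `toolkit_functions`; the function names and the example p-values are hypothetical, and how the toolkit counts the number of tests is not shown here.

```python
def bonferroni_reject(p_values, alpha=0.01):
    """Reject H0 for p-values at or below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

def benjamini_hochberg_reject(p_values, alpha=0.01):
    """Benjamini-Hochberg step-up procedure controlling the FDR at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= k / m * alpha
    max_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            max_rank = rank
    # Reject every hypothesis ranked at or below that rank
    reject = [False] * m
    for index in order[:max_rank]:
        reject[index] = True
    return reject

# Hypothetical p-values, one per tested pair of words
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
bonf = bonferroni_reject(p_values, alpha=0.05)
fdr = benjamini_hochberg_reject(p_values, alpha=0.05)
print(sum(bonf), sum(fdr))  # 1 2
```

In this example Bonferroni validates one link and the FDR procedure validates two, which reflects the general pattern: 'fdr' is less conservative and tends to produce a denser network than 'bonf' at the same threshold.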


## Outputs:
- **svn_words.txt**: file with the edge list of the Statistically Validated Network of words; each row represents a link between two nodes ('source' and 'target') with its p-value and Pearson correlation coefficient. The file consists of four columns separated by '\t':
    - source, target, p-value, weight (Pearson correlation coefficient)
    
    
- **topic_definition.xlsx**: data frame describing the topics found as communities of words in the Statistically Validated Network. The data frame has the following columns:
    - 'topic': the identifier of the topic, e.g. 'topic_1'
    - 'word': the word in stemmed form
    - 'modularity contribution': the importance of the node (word) within the community (topic) in terms of modularity contribution
    - 'original words': the list of words before the stemming procedure


- **Topic_Document_association.xlsx**: data frame describing the associations between documents and topics; 'topic_0' represents the 'General' topic.  The data frame has the following columns:
    - 'text_id': the identifier of the document, e.g. 'd1'
    - 'topic': the identifier of the topic, e.g. 'topic_1'
    - 'p-value': the p-value of the over-expression of the topic within the document
    - 'correlation': Pearson correlation coefficient of the over-expression of the topic within the document.

In [None]:
import toolkit_functions as tlk

In [None]:
tlk.run_analysis(file_name='Dataset_Input.xlsx', method_w='fdr', soglia_w=0.01,
                 soglia_d=0.01, soglia_0=0.01)
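
Once the analysis has run, the edge list in 'svn_words.txt' can be inspected with the standard library alone. The sketch below assumes the four tab-separated columns described above and a header row; whether the real file has a header is an assumption, so the hypothetical helper `read_edge_list` takes a `has_header` switch. The sample content is invented for illustration.

```python
import csv
import io

def read_edge_list(handle, has_header=True):
    """Parse tab-separated rows of source, target, p-value, weight."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header row
    edges = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        source, target, p_value, weight = row
        edges.append((source, target, float(p_value), float(weight)))
    return edges

# Hypothetical file content; with the real file use open('svn_words.txt', newline='')
sample = io.StringIO("source\ttarget\tp-value\tweight\ninnov\tfirm\t1e-05\t0.42\n")
edges = read_edge_list(sample)

# The link with the highest Pearson correlation coefficient
strongest = max(edges, key=lambda edge: edge[3])
print(strongest)  # ('innov', 'firm', 1e-05, 0.42)
```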

# Phase 4: Statistics of topics selected

## Input values required:

- **file_name1**: file with .xlsx extension. The file "Dataset_Input.xlsx" used as input in Phase 3 of Stage 2 ("Topic definition and document-topic associations").


- **file_name2**: file with .xlsx extension. The file "Topic_Document_association.xlsx" obtained in Phase 3 of Stage 2.


- **file_name3**: file with .xlsx extension. The file "topic_definition.xlsx" obtained in Phase 3 of Stage 2.


- **file_name4**: file with .xlsx extension storing the topic labels assigned in STAGE 2 - Step 2. The file must be a data frame with two columns, 'topic' and 'label'. The column 'topic' contains the topic ids as in the file "topic_definition.xlsx". The column 'label' contains the name you assigned to each selected topic. The file 'topic_label_example.xlsx' is provided as an example.

## Outputs:

- **'stats_topic.xlsx'**: data frame describing the topics and some related statistics.

In [None]:
tlk.run_stats(file_name1='Dataset_Input.xlsx', file_name2='Topic_Document_association.xlsx', 
              file_name3='topic_definition.xlsx', file_name4='topic_label_example.xlsx')



# Phase 5: Plot the topics selected

## Input:
- **df**: file with .xlsx extension. The output file 'stats_topic.xlsx' of Phase 4.
- **name**: name of the file to save the plot.

## Output:
- scatter-plot of documents along two dimensions: 'ratio of citations' (x-axis) and 'ratio of top journals' (y-axis).

In [None]:
tlk.plot_stats_1(df='stats_topic.xlsx', name='topic_overview_1.pdf')

## Input:
- **df**: file with .xlsx extension. The output file 'stats_topic.xlsx' of Phase 4.
- **name**: name of the file to save the plot.

## Output:
- scatter-plot of documents along two dimensions: 'number of internal citations' (x-axis) and 'total number of citations' (y-axis).

In [None]:
# use a different file name so the first plot is not overwritten
tlk.plot_stats_2(df='stats_topic.xlsx', name='topic_overview_2.pdf')