# STAGE 0 - Python Setup

- Run the cell below to check whether the following packages are installed; any missing package is installed with pip:
    - numpy, pandas, sklearn (scikit-learn), scipy, networkx, matplotlib, tqdm, xlsxwriter

In [None]:
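import importlib
import subprocess
import sys

# Import names whose pip distribution name is different
PIP_NAMES = {'sklearn': 'scikit-learn'}

def verify_package(package_name):
    """Import the package; install it with pip if the import fails."""
    try:
        # Import the package to check whether it is installed
        importlib.import_module(package_name)
        print(f"Package '{package_name}' is already installed.")
    except ImportError:
        print(f"Package '{package_name}' is not installed. Installing...")
        # 'pip install sklearn' fails: the distribution is called 'scikit-learn'
        pip_name = PIP_NAMES.get(package_name, package_name)
        subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name])

list_packages = ['numpy', 'pandas', 'sklearn', 'scipy', 'networkx', 'matplotlib', 'tqdm', 'xlsxwriter']
for package in list_packages:
    verify_package(package)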


### Only for cleaning the texts
- Run the cell below to check whether the following packages are installed; any missing package is installed with pip:
    - spacy, nltk, unidecode

In [None]:
list_packages_2 = ['spacy', 'nltk', 'unidecode']
for package in list_packages_2:
    verify_package(package)

## Run the code below to import the functions for the analysis

In [None]:
import toolkit_functions as tlk



# STAGE 1 - Dataset creation

## Phase 1: Create the data frame from a file downloaded from Scopus or Web of Science

### Input from Web of Science
 - **filename**: name of the file downloaded from Web of Science, with .txt extension. The file must be in the same folder as this notebook.

### Output
 - **Dataset_input.xlsx** file: a data frame in which each row represents a paper and the columns are:
     - text id; Authors; Article title; Abstract; Source title; Publication year; Journal abbreviation; Journal ISO abbreviation; References; Total number of citations; Number of internal citations; References internal id; DOI.

In [None]:
tlk.read_wos_txt(filename='your_file.txt')

## Or

### Input from Scopus
 - **filename**: name of the file downloaded from Scopus, with .csv extension. The file must be in the same folder as this notebook.

### Output
 - **Dataset_input.xlsx** file: a data frame in which each row represents a paper and the columns are:
     - text id; Authors; Article title; Abstract; Source title; Publication year; Journal abbreviation; References; Total number of citations; Number of internal citations; References internal id; DOI.

In [None]:
tlk.read_scopus_csv(filename='your_file.csv')

# Phase 2: Add the top journal flags
### Inputs:
 - **filename**: text file listing the top journals. The default file is 'TopJournal_list.txt'; its list was taken from https://journalranking.org and includes all journals ranked 3, 4, or 4* in the ABS ranking. To use a different list, create a .txt file with one journal name per line, as in 'TopJournal_list.txt'.


 - **df_file**: name of the .xlsx file containing the data frame created in the previous step ('Dataset_input.xlsx'). The data frame must have a column named 'source_title' with the name of the journal in which each paper was published.

### Output:
- **Dataset_input.xlsx** file: the input data frame with a new column ('TOPJ') indicating whether the journal is a top journal.

In [None]:
tlk.add_top_journal(filename='TopJournal_list.txt', df_file='Dataset_input.xlsx')
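
As a rough illustration of what the flagging step involves, the sketch below marks each journal name with "Y" or "N" depending on whether it appears in a top-journal list. It is not the toolkit's implementation: the function name `flag_top_journals`, the "Y"/"N" values as output of this step, and the case-insensitive matching are assumptions, and the journal names are example data only.

```python
def flag_top_journals(source_titles, top_journal_lines):
    """Return one "Y"/"N" flag per source title."""
    # Normalise names so that case and surrounding spaces do not matter
    # (an assumption of this sketch, not necessarily the toolkit's rule).
    top = {line.strip().lower() for line in top_journal_lines if line.strip()}
    return ["Y" if title.strip().lower() in top else "N" for title in source_titles]


# Example data: lines as they would be read from a one-journal-per-line .txt file
top_list = ["Academy of Management Journal\n", "Strategic Management Journal\n", "\n"]
titles = [
    "Strategic Management Journal",
    "Some Other Journal",
    "academy of management journal ",
]

print(flag_top_journals(titles, top_list))  # ['Y', 'N', 'Y']
```

If your journal names differ from the list only in punctuation or abbreviations, exact matching of this kind would miss them, so it is worth checking a few rows of the new column by hand.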

# Phase 3: Clean the texts
### Input:
- **filename**: name of the .xlsx file containing the data frame created in the previous step ('Dataset_input.xlsx').
    - Note: each text (abstract) must be a single string.

### Output:
- **Dataset_input.xlsx** file: the input data frame with a new column ('clean_text') containing the cleaned version of the abstract.

In [None]:
tlk.preprocess(filename='Dataset_input.xlsx')
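
To give an idea of the kind of normalisation a cleaning step performs, here is a minimal sketch that uses only the standard library. It is not what `tlk.preprocess` does (the toolkit relies on spacy, nltk and unidecode, and the sketch does no stemming or stop-word removal); the function name `basic_clean` is hypothetical.

```python
import re
import unicodedata


def basic_clean(text):
    """Lowercase a text, strip accents, and keep only space-separated letters."""
    # Decompose accented characters and drop the combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase and replace every run of non-letters with a single space
    letters_only = re.sub(r"[^a-z]+", " ", without_accents.lower())
    # Collapse whitespace so that words are separated by single spaces
    return " ".join(letters_only.split())


print(basic_clean("Café owners' R&D spending rose in 2020."))
# cafe owners r d spending rose in
```

The result is a single string with words separated by spaces, which is the shape the later stages expect in the 'clean_text' column.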

# STAGE 2 - Analysis

## Run the code below to import the functions for the analysis

In [None]:
import toolkit_functions as tlk

# Phase 3: Topic definition and document-topic associations
##  Input values required to run the analysis:

- **file_name**: name of the file created in Stage 1 ('Dataset_input.xlsx').
    - If you created your data frame without running Stage 1, it must have the following columns:
        - 'text_id': the identifier of the document, e.g. "d1".
        - 'Total number of citations': total number of citations.
        - 'clean_text': pre-processed text as a single string, with words separated by spaces.
    


- **method**: correction method for the multiple-testing problem: 'bonf' (Bonferroni correction) or 'fdr' (False Discovery Rate; Benjamini and Hochberg, 1995). The default value is 'fdr'.


- **threshold**: threshold used to construct the Statistically Validated Network of words; the default value is 0.01.
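
The difference between the two correction methods can be sketched as follows. This is a generic illustration of the Bonferroni and Benjamini-Hochberg procedures, assuming that `threshold` is the significance level being corrected; the function names, the non-strict inequalities and the p-values are assumptions, not taken from the toolkit.

```python
def bonferroni_validated(p_values, threshold=0.01):
    """Indices of the tests with p <= threshold / number_of_tests."""
    n = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= threshold / n]


def fdr_validated(p_values, threshold=0.01):
    """Benjamini-Hochberg procedure: indices of the validated tests."""
    n = len(p_values)
    # Test indices sorted by increasing p-value
    order = sorted(range(n), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) such that p_(k) <= k * threshold / n
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * threshold / n:
            k_max = rank
    # All tests up to that rank are validated
    return sorted(order[:k_max])


p = [0.0001, 0.003, 0.005, 0.03, 0.5]  # invented p-values, one per candidate link
print(bonferroni_validated(p))  # [0]
print(fdr_validated(p))         # [0, 1, 2]
```

With the same threshold, Bonferroni validates only the first link, while the FDR procedure validates three, which is why 'fdr' generally yields a denser network.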


## Outputs:
- **SVN words.txt** file: the edge list of the Statistically Validated Network of words; each row represents a link between two nodes ('source' and 'target') with its p-value and Pearson correlation coefficient. The file has four tab-separated columns:
    - source, target, p-values, weight (Pearson correlation coefficient).
    
    
- **Topic definition.xlsx** file: a data frame describing the topics found as communities of words in the Statistically Validated Network. The data frame has the following columns:
    - 'topic': the identifier of the topic, e.g. "topic_1".
    - 'word': the word in stemmed form.
    - 'modularity contribution': the importance of the node (word) within the community (topic) in terms of modularity contribution.
    - 'original words': the list of words before the stemming procedure.


- **Topic Document association.xlsx** file: a data frame describing the associations between documents and topics; "topic_0" represents the 'General' topic. The data frame has the following columns:
    - 'text_id': the identifier of the document, e.g. "d1".
    - 'Article title': the title of the paper.
    - 'topic': the identifier of the topic, e.g. "topic_1".
    - 'p-value': the p-value of the over-expression of the topic within the document.
    - 'correlation': Pearson correlation coefficient of the over-expression of the topic within the document.
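
If you want to inspect the edge list outside the toolkit, a tab-separated file of this kind can be parsed with the standard `csv` module. The sketch below assumes that the file starts with a header row naming the four columns (if it does not, pass `fieldnames=` to `csv.DictReader`); the function name `read_edges` and the two sample rows are invented for illustration.

```python
import csv
import io


def read_edges(handle):
    """Parse a tab-separated edge list into dicts with numeric p-value and weight."""
    reader = csv.DictReader(handle, delimiter="\t")
    edges = []
    for row in reader:
        edges.append({
            "source": row["source"],
            "target": row["target"],
            "p_value": float(row["p-values"]),
            "weight": float(row["weight"]),
        })
    return edges


# Invented sample standing in for the contents of the edge-list file
sample = io.StringIO(
    "source\ttarget\tp-values\tweight\n"
    "innov\ttechnolog\t1e-06\t0.42\n"
    "market\tcustom\t3e-05\t0.31\n"
)
edges = read_edges(sample)
strongest = max(edges, key=lambda e: e["weight"])
print(strongest["source"], strongest["target"])  # innov technolog
```

To read the real file, replace the `io.StringIO` object with `open('SVN words.txt', newline='')`.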

In [None]:
tlk.run_analysis(file_name='Dataset_input.xlsx', method='fdr', threshold=0.01)

# Phase 4: Statistics of the selected topics

## Input values required:

- **file_name1**: file with .xlsx extension: the 'Dataset_input.xlsx' file created in Stage 1 and used as input for the topic analysis.
    - If you created your data frame without running Stage 1, it must have the following columns:
        - 'text_id': the identifier of the document, e.g. "d1".
        - 'Total number of citations': total number of citations.
        - 'Number of internal citations': number of citations within the dataset.
        - 'TopJ': flag ("Y" or "N") indicating whether the publication appeared in a top journal.


- **file_name2**: file with .xlsx extension: the 'Topic Document association.xlsx' file produced by `tlk.run_analysis`.


- **file_name3**: file with .xlsx extension: the 'Topic definition.xlsx' file produced by `tlk.run_analysis`.


- **file_name4**: file with .xlsx extension that stores the labels you assign to the selected topics. The file must be a data frame with two columns, 'topic' and 'label'. The 'topic' column contains the topic ids as they appear in 'Topic definition.xlsx'; the 'label' column contains the name you assigned to each selected topic. The file 'topic_label_example.xlsx' is an example.

## Outputs:

- **stats_topic.xlsx** file: a data frame describing the topics and some related statistics.

In [None]:
tlk.run_stats(file_name1='Dataset_input.xlsx', file_name2='Topic Document association.xlsx',
              file_name3='Topic definition.xlsx', file_name4='topic_label_example.xlsx')

# Phase 5: Selection of articles

## Input values required:

- **file_name1**: file with .xlsx extension: the 'Dataset_input.xlsx' file created in Stage 1 and used as input for the topic analysis.
    - If you created your data frame without running Stage 1, it must have the following columns:
        - 'text_id': the identifier of the document, e.g. "d1".
        - 'Total number of citations': total number of citations.
        - 'Number of internal citations': number of citations within the dataset.
        - 'TopJ': flag ("Y" or "N") indicating whether the publication appeared in a top journal.


- **file_name2**: file with .xlsx extension: the 'Topic Document association.xlsx' file produced by `tlk.run_analysis`.


- **file_name3**: file with .xlsx extension that stores the labels you assign to the selected topics. The file must be a data frame with two columns, 'topic' and 'label'. The 'topic' column contains the topic ids as they appear in 'Topic definition.xlsx'; the 'label' column contains the name you assigned to each selected topic. The file 'topic_label_example.xlsx' is an example.


- **selection**: the selection criterion. Use 'broad' to select the papers with at least one citation in the Scopus or Web of Science database, or 'narrow' to select the papers with at least one citation within the downloaded dataset. The default value is 'broad'.



## Outputs:

- **Final Dataset selection.xlsx** file: a data frame containing only the articles in the final selection that meet your filters.

In [None]:
tlk.dataset_selection(file_name1='Dataset_input.xlsx', file_name2='Topic Document association.xlsx',
                      file_name3='topic_label_example.xlsx', selection='broad')
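
The citation part of the two criteria can be illustrated with a small sketch. It covers only the 'broad' versus 'narrow' filter, not the topic-based part of the selection, and it is not the toolkit's code: the function name `select_papers` and the three example records are hypothetical, while the column names are those listed above.

```python
def select_papers(papers, selection="broad"):
    """Keep the papers with at least one citation under the chosen criterion."""
    columns = {
        "broad": "Total number of citations",      # citations in Scopus / Web of Science
        "narrow": "Number of internal citations",  # citations within the dataset
    }
    if selection not in columns:
        raise ValueError("selection must be 'broad' or 'narrow'")
    column = columns[selection]
    return [paper for paper in papers if paper[column] >= 1]


# Hypothetical records, one dict per paper
papers = [
    {"text_id": "d1", "Total number of citations": 12, "Number of internal citations": 3},
    {"text_id": "d2", "Total number of citations": 4, "Number of internal citations": 0},
    {"text_id": "d3", "Total number of citations": 0, "Number of internal citations": 0},
]

print([p["text_id"] for p in select_papers(papers, "broad")])   # ['d1', 'd2']
print([p["text_id"] for p in select_papers(papers, "narrow")])  # ['d1']
```

Every paper cited within the dataset is also cited in the source database, so the 'narrow' selection is always a subset of the 'broad' one.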

# STAGE 3 - Plots

# Phase 6: Plot the selected topics

## Input:
- **filename**: file with .xlsx extension. The output file 'stats_topic.xlsx' of Phase 4.
- **name**: name of the file to save the plot.

## Output:
- Scatter-plot of topics along two dimensions: 'ratio of citations' (x-axis) and 'ratio of top journals' (y-axis). The 'ratio of citations' is the 'Average number of citations within the dataset' divided by the 'Average number of citations'. The 'ratio of top journals' is the 'Number of papers from top journals over-expressed' divided by the 'Number of papers over-expressed'.

In [None]:
tlk.plot_stats_1(filename='stats_topic.xlsx', name='topic_overview_1.pdf')
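
The two coordinates of this plot are simple ratios, as the following worked example shows. The function name `topic_ratios` and the numbers are invented for illustration; the sketch assumes both denominators are non-zero.

```python
def topic_ratios(avg_internal_citations, avg_citations, top_journal_papers, papers):
    """Return (ratio of citations, ratio of top journals) for one topic."""
    # x-axis: average citations within the dataset / average citations overall
    ratio_citations = avg_internal_citations / avg_citations
    # y-axis: over-expressed papers from top journals / all over-expressed papers
    ratio_top_journals = top_journal_papers / papers
    return ratio_citations, ratio_top_journals


# Invented values for a single topic
x, y = topic_ratios(
    avg_internal_citations=2.5,
    avg_citations=10.0,
    top_journal_papers=6,
    papers=24,
)
print(x, y)  # 0.25 0.25
```

A topic towards the right of the plot is cited mostly from within the dataset, and a topic towards the top is published mostly in top journals.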

## Input:
- **filename**: file with .xlsx extension. The output file 'stats_topic.xlsx' of Phase 4.
- **name**: name of the file to save the plot.

## Output:
- Scatter-plot of topics along two dimensions: 'Average number of citations within the dataset' (x-axis) and 'Average number of citations' (y-axis).

In [None]:
tlk.plot_stats_2(filename='stats_topic.xlsx', name='topic_overview_2.pdf')