# 0. If you want to create your own DataFrame
## The input DataFrame must have the following columns:

- "text_id": id associated to each document
    - example: [d1, d2, d3, ..., d100, ...]
- "tot_cit": total number of citations
- "internal_cit": number of citations within the dataset
- "references_internal_id": space-separated string of the text_id values of the documents that cite the document
    - example: "d1 d2 d3 d4 d5 .."
- "TopJ": boolean indicating whether the publication belongs to a Top Journal
- "clean_text": preprocessed text as a string; words are separated by spaces

### The file 'Dataset_Input_example.xlsx' provides an example
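
As a minimal sketch of the column layout listed above, the following builds a toy DataFrame with pandas and checks that no required column is missing. The records are invented for illustration, and `missing_columns` is a hypothetical helper, not part of the toolkit.

```python
import pandas as pd

# Column names taken from the list above.
REQUIRED_COLUMNS = [
    "text_id", "tot_cit", "internal_cit",
    "references_internal_id", "TopJ", "clean_text",
]

# Hypothetical toy records; every value is made up for illustration.
records = [
    {"text_id": "d1", "tot_cit": 12, "internal_cit": 2,
     "references_internal_id": "d2 d3", "TopJ": True,
     "clean_text": "network community detection topic"},
    {"text_id": "d2", "tot_cit": 5, "internal_cit": 1,
     "references_internal_id": "d3", "TopJ": False,
     "clean_text": "citation analysis journal ranking"},
    {"text_id": "d3", "tot_cit": 0, "internal_cit": 0,
     "references_internal_id": "", "TopJ": False,
     "clean_text": "word network validation"},
]

df = pd.DataFrame(records, columns=REQUIRED_COLUMNS)


def missing_columns(frame):
    """Return the required columns that are absent from `frame`."""
    return [col for col in REQUIRED_COLUMNS if col not in frame.columns]


print(missing_columns(df))  # an empty list means the layout is complete
```

Once the check passes, the frame can be written with `df.to_excel("Dataset_Input.xlsx", index=False)`, which requires an Excel writer such as openpyxl to be installed.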

# 1. Save the DataFrame in a file named Dataset_Input.xlsx

# 2. Run the code below
## Input values required to run the analysis:

- **file_name**: name of the file created above ("Dataset_Input.xlsx") or in the 'Dataset_creation' step. The file 'Dataset_Input_example.xlsx' is an example.


- **method_w**: correction method for the multiple testing problem: 'bonf' (Bonferroni correction) or 'fdr' (False Discovery Rate, Benjamini and Hochberg, 1995); the default value is 'bonf'.


- **soglia_w**: value of the threshold used to construct the Statistically Validated Network of words; the default value is 0.01.


- **soglia_d**: value of the threshold for document-topic associations with the Bonferroni correction method; the default value is 0.01.


- **soglia_0**: value of the threshold used to calculate the associations between documents and the 'General' topic with the Bonferroni correction method; the default value is 0.01.
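
To illustrate how the two options of **method_w** differ, here is a minimal sketch of the textbook Bonferroni and Benjamini-Hochberg decision rules applied to a list of p-values. It is not the toolkit's own implementation; the function names and the toy p-values are assumptions made for this example.

```python
def bonferroni_mask(p_values, threshold=0.01):
    """Validate a test when p <= threshold / number_of_tests."""
    n = len(p_values)
    return [p <= threshold / n for p in p_values]


def fdr_mask(p_values, threshold=0.01):
    """Benjamini-Hochberg step-up rule.

    Sort the p-values, find the largest rank k such that
    p_(k) <= k * threshold / n, and validate the k smallest p-values.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * threshold / n:
            k_max = rank
    mask = [False] * n
    for idx in order[:k_max]:
        mask[idx] = True
    return mask


# Hypothetical p-values, one per tested word pair.
p_values = [0.03, 0.2, 0.001, 0.02]
print(bonferroni_mask(p_values, threshold=0.05))
print(fdr_mask(p_values, threshold=0.05))
```

With the same threshold, the FDR rule never validates fewer tests than Bonferroni, so 'fdr' generally yields a denser network of words than 'bonf'.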


## Outputs:
- **svn_words.txt**: file with the edge list of the Statistically Validated Network of words; each row represents a link between two nodes ('source' and 'target') with its p-value and Pearson correlation coefficient. The file consists of four columns separated by '\t':
    - source, target, p-values, weight (Pearson correlation coefficient)
    
    
- **topic_definition.xlsx**: data frame describing the topics found as communities of words in the Statistically Validated Network.


- **Topic_Document_association.xlsx**: data frame describing the associations between documents and topics; 'topic_0' represents the 'General' topic.
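
The four-column, tab-separated layout of svn_words.txt can be parsed with the standard library, as in the sketch below. It assumes the file has no header row, which the description above does not confirm; `read_edges` is a hypothetical helper and the sample rows are invented.

```python
import csv
import io


def read_edges(handle):
    """Parse rows of the form: source <TAB> target <TAB> p-value <TAB> weight."""
    edges = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 4:
            continue  # skip blank or malformed rows
        source, target, p_value, weight = row
        edges.append((source, target, float(p_value), float(weight)))
    return edges


# Invented sample standing in for the real file content.
sample = io.StringIO(
    "network\tgraph\t1e-08\t0.42\n"
    "citation\tjournal\t3e-05\t0.31\n"
)
edges = read_edges(sample)
print(edges)
```

To read the real output after the analysis has run, pass an open file instead of the sample, for example `with open("svn_words.txt", newline="") as fh: edges = read_edges(fh)`. If the file does start with a header row, skip the first line before parsing.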

In [None]:
import toolkit_functions as tlk

In [None]:
tlk.run_analysis(file_name='Dataset_Input_example.xlsx', method_w='bonf', soglia_w=0.05,
                 soglia_d=0.01, soglia_0=0.01)

# 3. Run the code below

## Input values required:

- **file_name1**: file with .xlsx extension. The file used as input to the function 'run_analysis' in the code above; the example file is 'Dataset_Input_example.xlsx'.

- **file_name2**: file with .xlsx extension. The file "Topic_Document_association.xlsx" (output of the function 'run_analysis') containing the document-topic associations.

- **file_name3**: file with .xlsx extension. The file "topic_definition.xlsx" (output of the function 'run_analysis'), as passed in the call below.

- **file_name4**: file with .xlsx extension. The file created to store the topic labels. The file must be a data frame with two columns called 'topic' and 'label'. The column 'topic' contains the topic ids as in the file "topic_definition.xlsx" (output of the function 'run_analysis'). The column 'label' contains the name you assigned to each selected topic. The file 'topic_label_example.xlsx' is an example.
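
A label file with the two required columns could be prepared along these lines. The topic ids and label names are placeholders: the real ids must be copied from "topic_definition.xlsx", and their exact format is an assumption here.

```python
import pandas as pd

# Placeholder ids and names; replace them with the ids found in
# topic_definition.xlsx and the labels you chose for those topics.
labels = pd.DataFrame(
    {
        "topic": ["topic_1", "topic_2"],
        "label": ["Network methods", "Citation analysis"],
    }
)

print(labels)
```

The frame can then be saved with `labels.to_excel("my_topic_labels.xlsx", index=False)`, where the file name is hypothetical, and that name passed to 'run_stats' as the label file.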

## Outputs:

- **stats_topic.xlsx**: data frame describing the topics and some related statistics.

In [None]:
tlk.run_stats(file_name1='Dataset_Input_example.xlsx', file_name2='Topic_Document_association.xlsx', 
              file_name3='topic_definition.xlsx', file_name4='topic_label_example.xlsx')

