# Molecular Targets Discovery Pipeline (MTP)

## Gemini API

https://ai.google.dev/gemini-api/docs

#### API key - Free of charge

https://aistudio.google.com/app/apikey

### Google Enable API

  - You are about to enable 'Generative Language API'.

https://ai.google.dev/gemini-api/docs/oauth

### Google Python projects

#### Gemini API Text Implementation

https://github.com/RepellentSpy/Gemini-API-Text-Implementation/tree/main

#### gemini-api 0.1.6

https://pypi.org/project/gemini-api/


#### Gemini-API

https://github.com/dsdanielpark/Gemini-API

## LLM - Large Language Model

### Gemini flash

gemini-1.5-flash-latest

### Gemini

#### Gemini consensus  (mtp 12)

#### Run all (mtp 08a)

   - mtp_covid_08a_digital_curation_run_all()

### i_dfp
  - i_dfp == 0: N DEGs/DAPs
  - i_dfp == 1: N genes at the middle of the table
  - i_dfp == 2: N genes at the end of the table
  - i_dfp == 3: N genes not in the table (fixed)

### Expectation
  - i_dfp == 0: n_Yes >> n_No
  - i_dfp == 1: n_No > n_Yes
  - i_dfp == 2: n_No >> n_Yes
  - i_dfp == 3: n_Yes ~ 0

<h2 style="color:violet;">Read_gemini()</h2>

  - Read Gemini's answers
    - set a model: gem.set_gemini_num_model(chosen_model)
    - df = gem.read_gemini(question_name, run=run, verbose=True)
    - columns:
      - pathway_id
      - pathway
      - fdr
      - curation
      - response_explain
      - score_explain
      - question
      - disease
      - case
      - s_case
      - pathway_found


![](figures/read_gemini.png)

<h2 style="color:violet;">Semantic reproducibility definition</h2>

  - reproducibility: all answers must be the same for all semantic variation questions
  - given a run and a model
    - all 4 semantic questions must result in the same answer Yes or No to be reproducible
    - run_gemini_calc_all_semantic_reproducibility(run_list)
      - for each chosen_model: run_gemini_calc_model_semantic_reproducibility()
        - given a model
        - loop over all cases
          - for each question and i_dfp: concatenate the question_{i_dfp} columns
          - self.calc_llm_model_reproducibility(dfn)
    - open_gemini_reproducibility_run_model()
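
The reproducibility rule above can be sketched as a small check. This is only an illustration, not the pipeline's own code: the function name `is_semantically_reproducible` is hypothetical, and it assumes each answer has already been reduced to the string "Yes" or "No".

```python
from typing import Sequence


def is_semantically_reproducible(answers: Sequence[str]) -> bool:
    """Return True when all semantic variants of a question got the same answer.

    Assumption: each answer is already reduced to the string "Yes" or "No".
    """
    return len(answers) > 0 and len(set(answers)) == 1


# Hypothetical answers to the 4 semantic questions for two pathways
# (same run, same model).
stable = ["Yes", "Yes", "Yes", "Yes"]
unstable = ["Yes", "No", "Yes", "Yes"]

print(is_semantically_reproducible(stable))    # True
print(is_semantically_reproducible(unstable))  # False
```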

### Semantic reproducibility for all runs, all models

![](figures/semantic_reproducibility.png)

### Semantic reproducibility for a given run and model

  - given a run and a model

![](./figures/semantic_reproducibility_example.png)

<h2 style="color:cyan;">Consensus definition</h2>

  - majority vote
  - if tie: Doubt

#### Pseudocode
  - n(Yes) > n(No): Yes
  - n(No) > n(Yes): No
  - n(Yes) == n(No): Doubt
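
A minimal Python sketch of the pseudocode above. The function name `consensus` is an assumption made for illustration, and answers other than "Yes" and "No" are simply ignored here, which may differ from the pipeline's handling.

```python
from collections import Counter
from typing import Iterable


def consensus(answers: Iterable[str]) -> str:
    """Majority vote over Yes/No answers; a tie yields "Doubt"."""
    counts = Counter(answers)
    n_yes, n_no = counts["Yes"], counts["No"]
    if n_yes > n_no:
        return "Yes"
    if n_no > n_yes:
        return "No"
    return "Doubt"


print(consensus(["Yes", "Yes", "No", "Yes"]))  # Yes
print(consensus(["Yes", "No", "No", "Yes"]))   # Doubt
```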

<h3 style="color:violet;">Semantic Consensus definition</h3>

  - given a run and a case
  - calc the consensus among the answers of all Gemini models

<h2 style="color:cyan;">Gemini Reproducibility</h2>

  - given a run
  - comparing 2 models
    - all 4 semantic questions must result in the same answer Yes or No
    - run_gemini_calc_all_reproducibility()
    - open_gemini_reproducibility_run_model()

### PubMed Search (mtp 11)

  - mtp_covid_11a_query_find_papers_in_pubmed
    - pubmed_lib
      - pub.run_case_pathway_pubmed_search(case=case, with_gender=with_gender, ..)
         - self.set_dates(inidate, enddate)
         - terms_and_or, terms_not2 = self.case_to_terms()
         - ret = self.build_df_enr_pubmed_search_table(verbose=verbose)
         - df_pmid = self.run_terms_loop_pathways()
         - save summaries:
           - self.calc_summary_by_pmid(df_pmid)
           - self.calc_summary_by_pathway(df_pmid)

<h2 style="color:cyan;">Methods</h2>

<h4 style="color:yellow;">run_gemini_consensus_statitics_all_models()</h4>
<h4 style="color:yellow;">open_gemini_consensus_statitics_run_all_models()</h4>

  - dfpiv = run_gemini_consensus_statitics_all_models(run=run, force=force, verbose=verbose)
    - calc the pivot table with 4 semantic questions times n models
  - dfpiv = gem.open_gemini_consensus_statitics_run_all_models(run=run)
    - return the pivot table with 4 semantic questions times n models

<h4 style="color:yellow;">summary_stat_dpiv()</h4>

  - dfsumm = gem.summary_stat_dpiv(dfpiv, verbose=verbose)
    - return summary pivot table
    - by case, model, i_dfp, semantic question
       - totals of Yes, Possible, Low, and No

  - report_gemini

  - run_gemini_calc_all_semantic_reproducibility

  - run_gemini_consensus_statitics

  - summary_stat_dpiv

  - stat_between_dfp_using_dpiv

  - gemini_summary_consensus_statitics

  - get_2_models

  - compare_2_runs_unanimous_mean

  - bar_plot_change_opinion

  - bar_plot_yes_no_opinion

  - run_all_comparing_geminis_by_runs

  - compare_2_runs_total_answers
    - given 2 runs: run01 and run02
    - the method sums all yes, possible, low evidence, and no
    - for all cases
    - it returns:
       - dftot: total table
       - dfstat: statistics table comparing the 2 runs

  - compare_2_models_venn_diagram
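
The totals described for compare_2_runs_total_answers can be illustrated with the standard library. This is a sketch under assumptions: the helper names `total_answers` and `compare_totals` are hypothetical, answers are assumed to be the strings "Yes", "Possible", "Low" and "No", and only the total table is approximated (the statistics table is not reproduced).

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

CATEGORIES = ("Yes", "Possible", "Low", "No")


def total_answers(answers: Iterable[str]) -> Dict[str, int]:
    """Count how often each answer category occurs in one run."""
    counts = Counter(answers)
    return {cat: counts[cat] for cat in CATEGORIES}


def compare_totals(run_a: Iterable[str], run_b: Iterable[str]) -> Dict[str, Tuple[int, int, int]]:
    """Per category: (total in run A, total in run B, difference B - A)."""
    tot_a, tot_b = total_answers(run_a), total_answers(run_b)
    return {cat: (tot_a[cat], tot_b[cat], tot_b[cat] - tot_a[cat]) for cat in CATEGORIES}


# Hypothetical answers pooled over all cases for two runs.
run01 = ["Yes", "Yes", "Low", "No"]
run02 = ["Yes", "Possible", "No", "No"]
for category, row in compare_totals(run01, run02).items():
    print(category, row)
```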


#### for COVID

```
test=False
save_file=True
force=False
verbose=False

for case in case_list:
    for with_gender in [True, False]:
        print(">>>",  case, with_gender)

        terms_not_param = ['NOT', 'MERS', 'SARS-CoV-1']
        terms1_param = ["OR", 'COVID', 'SARS-CoV-2']
        connective_param = 'AND'

    
        _ = pub.run_case_pathway_pubmed_search(case=case, with_gender=with_gender...)

    print("")
print("-------------- end --------------")
```

#### case_to_terms_covid

```
	def case_to_terms_covid(self, verbose:bool=False) -> (List, List):

		if self.case == 'g1_female':
			term_list = ['female', 'asymptomatic'] if self.with_gender else ['asymptomatic']
			not_list = ['severe', 'intensive', 'outpatient']

		elif self.case == 'g1_male':
			term_list = ['male', 'asymptomatic'] if self.with_gender else ['asymptomatic']
			not_list = ['severe', 'intensive', 'outpatient']

		elif self.case == 'g2a_female':
			term_list = ['female', 'mild'] if self.with_gender else ['mild']
			not_list = ['severe', 'intensive']

		elif self.case == 'g2a_male':
			term_list = ['male', 'mild']  if self.with_gender else ['mild']
			not_list = ['severe', 'intensive']

		elif self.case == 'g2b_female':
			term_list = ['female', 'moderate', 'outpatient'] if self.with_gender else ['moderate', 'outpatient']
			not_list = ['mild', 'asymptomatic', 'severe', 'intensive']

		elif self.case == 'g2b_male':
			term_list = ['male', 'moderate', 'outpatient'] if self.with_gender else ['moderate', 'outpatient']
			not_list = ['mild', 'asymptomatic', 'severe', 'intensive']

		elif self.case == 'g3_female_adult':
			term_list = ['female', 'severe'] if self.with_gender else ['severe']
			not_list = ['elder', 'outpatient', 'mild', 'moderate', 'asymptomatic']

		elif self.case == 'g3_male_adult':
			term_list = ['male', 'severe'] if self.with_gender else ['severe']
			not_list = ['elder', 'outpatient', 'mild', 'moderate', 'asymptomatic']

		elif self.case == 'g3_female_elder':
			term_list = ['female', 'elder', 'severe'] if self.with_gender else ['elder', 'severe']
			not_list = ['outpatient', 'mild', 'moderate', 'asymptomatic']

		elif self.case == 'g3_male_elder':
			term_list = ['male', 'elder', 'severe'] if self.with_gender else ['elder', 'severe']
			not_list = ['outpatient', 'mild', 'moderate', 'asymptomatic']

		else:
			print(f"Error: could not define the case {self.case} for {self.prefix}")
			term_list = []
			not_list = []

		'''	return term_list + term_not '''
		return term_list, not_list + ['child', 'neonat', 'newborn']
```

#### for Medulloblastoma

```
test=False
save_file=False
force=False
verbose=False

for case in case_list:
    print(">>>",  case,)

    terms_not_param = ['NOT', 'COVID', 'SARS-CoV']
    terms1_param = ['medulloblastoma']
    connective_param = 'AND'
    
    _ = pub.run_case_pathway_pubmed_search(case=case, with_gender=False, terms1=terms1_param, 
                                           terms_not=terms_not_param, connective=connective_param, 
                                           test=test, save_file=save_file, force=force, verbose=verbose)
    print("")
print("-------------- end --------------")
```

#### case_to_terms_medulloblastoma

```
	def case_to_terms_medulloblastoma(self, verbose:bool=False) -> (List, List):

		if self.case == 'WNT':
			term_list = ['WNT']
		elif self.case == 'SHH':
			term_list = ['OR', 'SHH', 'Hedgehog']
		elif self.case == 'G3':
			term_list = ['OR', 'Group 3', 'G3']
		elif self.case == 'G4':
			term_list = ['OR', 'Group 4', 'G4']
		else:
			print(f"Error: could not define the case {self.case} for {self.prefix}")
			term_list = []

		'''	return term_list + term_not '''
		return term_list, []
```        
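
The term lists above follow a convention where an optional first element ('OR', 'NOT') names the Boolean operator. How pubmed_lib actually assembles the final query is not shown here, so the following is only a sketch of one plausible assembly: the helpers `strip_operator`, `group` and `build_query` are hypothetical, lists without a leading operator are assumed to be joined with AND, and excluded terms are assumed to be OR-joined inside a single NOT group.

```python
from typing import List, Tuple

OPERATORS = ("AND", "OR", "NOT")


def strip_operator(terms: List[str], default_op: str) -> Tuple[str, List[str]]:
    """Split an optional leading operator keyword from the remaining terms."""
    if terms and terms[0] in OPERATORS:
        return terms[0], terms[1:]
    return default_op, terms


def group(terms: List[str], op: str) -> str:
    """Join terms with an operator, quoting multi-word terms."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    if len(quoted) == 1:
        return quoted[0]
    return "(" + f" {op} ".join(quoted) + ")"


def build_query(terms1: List[str], case_terms: List[str],
                not_terms: List[str], connective: str = "AND") -> str:
    """Assumed assembly: <terms1> <connective> <case terms> NOT (<excluded terms>)."""
    parts = []
    for raw in (terms1, case_terms):
        op, terms = strip_operator(raw, "AND")
        if terms:
            parts.append(group(terms, op))
    query = f" {connective} ".join(parts)
    _, excluded = strip_operator(not_terms, "NOT")
    if excluded and query:
        query += " NOT " + group(excluded, "OR")
    return query


print(build_query(["medulloblastoma"], ["OR", "Group 3", "G3"],
                  ["NOT", "COVID", "SARS-CoV"]))
# medulloblastoma AND ("Group 3" OR G3) NOT (COVID OR SARS-CoV)
```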

#### All PubMed

  - dfpub = pub.merge_all_pubmeds()

### Gemini x PubMed: save both and only one

  - save_pubmed_x_gemini_both_and_only_one()
    - compare_pubmed_x_gemini()
      - pub.fname_pubmed_x_gemini%(case, i_dfp, run, pub.gem.gemini_model)
      - instead of
        - run_gemini_consensus_statitics_all_models(sel_ptw_pubmed)
        - dfpiv2 = open_gemini_consensus_statitics_run_filter_idfp_consensus(sel_ptw_pubmed=sel_ptw_pubmed)
        - dfpiva = open_gemini_consensus_statitics_run_all_models(run=run, sel_ptw_pubmed=sel_ptw_pubmed, verbose=verbose)
        - gemini_summary_consensus_statitics(sel_ptw_pubmed)
        - compare_2_models_venn_diagram(sel_ptw_pubmed)
        - get_2_models(sel_ptw_pubmed)
        - compare_2_runs_unanimous_mean(sel_ptw_pubmed)
        - compare_2_runs_total_answers(sel_ptw_pubmed)
        - report_gemini(sel_ptw_pubmed)
        - all of these take the parameter sel_ptw_pubmed: bool = False
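
The "both and only one" split can be pictured with plain set operations. This is an illustrative sketch, not the body of save_pubmed_x_gemini_both_and_only_one(): the function name `both_and_only_one` and the pathway identifiers are placeholders.

```python
from typing import Dict, Iterable, Set


def both_and_only_one(gemini_ids: Iterable[str], pubmed_ids: Iterable[str]) -> Dict[str, Set[str]]:
    """Split pathway ids into: found by both, only by Gemini, only by PubMed."""
    gem, pub = set(gemini_ids), set(pubmed_ids)
    return {
        "both": gem & pub,
        "only_gemini": gem - pub,
        "only_pubmed": pub - gem,
    }


# Placeholder pathway ids, for illustration only.
split = both_and_only_one(["ptw_A", "ptw_B", "ptw_C"], ["ptw_B", "ptw_C", "ptw_D"])
for name, ids in split.items():
    print(name, sorted(ids))
```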


### Gemini counts
  - Count Yes and No per model, run versus iq and i_dfp:
    - 2 iq have PubMed inside the search (pubmed=True) and 2 have not
    - i_dfp: 0 to 3, 0=enriched, 1=middle, 2=end of the table, and 3=out of enriched table

## Gemini Reproducibility

  - 2 types:
    - Hard reproducibility: comparing all question-answers
    - Soft reproducibility: comparing consensus

### Hard reproducibility (comparing each question answers)

  - Comparing two runs - run-run reproducibility (RRR)
  - Comparing two models - inter-model reproducibility (IMR)
  - both exclude i_dfp=3 (pathways randomly selected from outside the table)

### Soft reproducibility (comparing consensuses)

  - One Model Consensus Reproducibility (OMCR)
    - comparing all 4 DSSP - given one run, one model
  - comparing 2 runs - run-run consensus reproducibility (RRCR)
    - 2 runs, all models, all cases, all i_dfp - compare question's answer 
  - comparing 2 models - inter-model consensus reproducibility (IMCR)
    - given one run, comparing 2 models, all cases, all i_dfp
  - all models, comparing 2 runs - all-models consensus reproducibility (AMCR)
    - 2 runs, comparing all models consensus, all cases, all i_dfp
  - Run-run all-models answers reproducibility (RRAMAR)
    - 2 runs, all-models, count Yes, Possible, Low, No answers.
  - comparing unanimous between runs  (compare_2_runs_unanimous_mean)
    - given one run, comparing unanimous and not consensuses, all cases, all i_dfp
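
For the inter-model comparison (IMCR), the methods section distinguishes a flexible match (equal consensus) from a not flexible one (equal consensus, n_yes and n_no). A minimal sketch of that distinction follows; the names `PathwayResult`, `same_result` and `agreement_rate` are hypothetical, and the per-pathway data layout is an assumption.

```python
from typing import Dict, NamedTuple


class PathwayResult(NamedTuple):
    consensus: str  # "Yes", "No" or "Doubt"
    n_yes: int
    n_no: int


def same_result(a: PathwayResult, b: PathwayResult, flexible: bool = True) -> bool:
    """Flexible: equal consensus. Not flexible: equal consensus, n_yes and n_no."""
    if flexible:
        return a.consensus == b.consensus
    return a == b  # tuple equality compares all three fields


def agreement_rate(model0: Dict[str, PathwayResult],
                   model1: Dict[str, PathwayResult],
                   flexible: bool = True) -> float:
    """Fraction of shared pathways on which the two models agree."""
    shared = sorted(set(model0) & set(model1))
    if not shared:
        return float("nan")
    hits = sum(same_result(model0[p], model1[p], flexible) for p in shared)
    return hits / len(shared)


# Placeholder pathways for one run, one case and one i_dfp.
m0 = {"ptw_A": PathwayResult("Yes", 4, 0), "ptw_B": PathwayResult("No", 1, 3)}
m1 = {"ptw_A": PathwayResult("Yes", 3, 1), "ptw_B": PathwayResult("No", 1, 3)}
print(agreement_rate(m0, m1, flexible=True))   # 1.0
print(agreement_rate(m0, m1, flexible=False))  # 0.5
```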

### Soft reproducibility Statistics
  - confusion table

### Soft reproducibility methods

  - Methods:
    - Calculate 4DSSP and consensus:
      - calc_dfpiv_semantic_consensus_run_per_model()

  - Inter-model consensus reproducibility
  - method: run_all_inter_model_soft_consensus_repro()
    - flexible ~consensus, not flexible ~equal consensus, n_yes, n_no
    - for each run, case, i_dfp
      - run_inter_model_soft_consensus_repro()
        - dfpiv0 = self.open_dfpiv_semantic_consensus_run_per_model(run=run, chosen_model=chosen_model0, verbose=verbose)
        - dfpiv1 = self.open_dfpiv_semantic_consensus_run_per_model(run=run, chosen_model=chosen_model1, verbose=verbose)
          - filter case and i_dfp
          - flexible: equal consensus
          - not flexible: equal consensus, n_yes, n_no

     - Summary:
       - summary_stat_dfpiv_all_models()

    - One Model Consensus Reproducibility (OMCR)
      - TODO: review

     - All models consensus reproducibility (AMCR)
       - calc_gemini_dfpiva_all_models_one_run()
         - many models x 4 DSSP, n_yes, n_no, (all model) consensus, unanimous
       - open_gemini_dfpiva_all_models_one_run()
       - Comparing:
         - run_inter_model_soft_consensus_venn()
         - calc_stat_gemini_compare_2_models

     - Unanimous Reproducibility:
       - calc_all_semantic_unanimous_repro()
           - for each run
           - for each model
           - calc_run_model_4DSSQ()
             - for all cases and i_dfp (all pathways)
                - read gemini table
                  - build the 4 DSSQuestions
                  - calc unanimous
               - Unanimous: all 4 questions are Yes or No
       - For each run/model:
         - calc: mu_unanimous, std_unanimous, n (number of pathways)
         - save in file:
            - analytics: gemini_unanimous_consistency_for_<disease>_model_<name>_run_<run>_<suffix>.tsv
            - summary:   gemini_unanimous_consistency_stats_for_<disease>_<suffix>.tsv
    - open_run_model_4DSSQ()
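
The per run/model summary (mu_unanimous, std_unanimous, n) can be sketched as below. This is an illustration under assumptions: the function name `unanimous_stats` is hypothetical, a pathway counts as unanimous when its 4 answers are identical, and the sample standard deviation is used because the notes do not say which estimator the pipeline applies.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def unanimous_stats(answers_per_pathway: Sequence[Sequence[str]]) -> Tuple[float, float, int]:
    """Return (mu_unanimous, std_unanimous, n) over all pathways.

    Each pathway contributes 1.0 if its answers are all identical, else 0.0.
    Assumption: sample standard deviation (statistics.stdev).
    """
    flags = [1.0 if len(set(answers)) == 1 else 0.0 for answers in answers_per_pathway]
    n = len(flags)
    mu = mean(flags) if n else float("nan")
    sd = stdev(flags) if n > 1 else 0.0
    return mu, sd, n


# Hypothetical answers to the 4 questions for four pathways.
table = [
    ["Yes", "Yes", "Yes", "Yes"],
    ["Yes", "No", "Yes", "Yes"],
    ["No", "No", "No", "No"],
    ["No", "No", "Yes", "No"],
]
mu, sd, n = unanimous_stats(table)
print(f"mu_unanimous={mu:.3f} std_unanimous={sd:.3f} n={n}")
```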

#### Soft reproducibility other methods
  - Venn diagrams:
     - venn_diagram_between_2models


### Comparing Gemini x PubMed x Reviewers
  - only selected cases and pathways
  - run01, model 1.5-flash
  - root data: colaboracoes/project/pubgem

## To review

### Gemini Reproducibility
  - Compare two models in one run (compare_2_models_one_run)
     - venn_diagram_between_2models
     - new pathways found: FN discovery
  - semantic reproducibility per model:
     - run_gemini_calc_all_semantic_reproducibility()
     - open_gemini_semantic_reproducibility_run_model()
  - comparing runs
  - comparing models
  - comparing unanimous between runs  (compare_2_runs_unanimous_mean)

### Consensus
  - calc_gemini_summary_consensus_statitics
  - compare_2_runs_total_answers

### Statistics - confusion table
  - Positive: i_dfp==0
  - Negative: i_dfp==1
  - pseudo negative1: i_dfp==2
  - pseudo negative2: i_dfp==3
  - calc:
    - TP, FP, TN, FN, Sensitivity, Specificity, Accuracy, and F1-score
    - TN1, FN1
    - TN2, FN2
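
The listed metrics follow the standard confusion-table definitions. The sketch below assumes that a Yes on a positive pathway (i_dfp==0) counts as TP and a No as FN, and that a Yes on a negative pathway counts as FP and a No as TN. The function name `confusion_metrics` and the example counts are hypothetical.

```python
from typing import Dict


def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard metrics derived from a 2x2 confusion table."""

    def ratio(num: float, den: float) -> float:
        # Avoid division by zero when a row or column of the table is empty.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),            # TP / (TP + FN)
        "specificity": ratio(tn, tn + fp),            # TN / (TN + FP)
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),        # 2TP / (2TP + FP + FN)
    }


# Made-up counts, for illustration only.
for name, value in confusion_metrics(tp=40, fp=10, tn=30, fn=20).items():
    print(f"{name}: {value:.3f}")
```

The same function can be reused for the pseudo-negative sets by passing TN1/FN1 or TN2/FN2 in place of TN/FN, if that matches how those counts are defined in the pipeline.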

### Merge all PubMed with/without gender filter
  - COVID-19 is dependent on gender, MB is not.
  - with gender: True or False for COVID-19 and False for MB
  - PubMed searches with and without the gender filter return different results
  - i_dfps = [0] for selected data and [0,1,2,3] for all data

#### Method - pub.merge_all_pubmeds:  
  - for all cases
    - get the pathways
    - get the reactome term table
    - for each pathway - term:
      - search for pmids in PubMed
  - one run (dummy)
  - one chosen_model (3 dummy)
  - for gender True and False
  - for all cases, iqs, and i_dfps
    - get the pathways
    - get the reactome term table
    - for each pathway - term:
      - search for pmids in PubMed
  - PubMed search on 2025/01/02
