# My functions to implement:

## Classes RelationalQueryProcessor and TriplestoreQueryProcessor
Add the following method to both classes:

def is_publication_in_db(pub_id : str) : boolean
It returns True if the publication identified by the input id is included in the database, False otherwise.

## Class RelationalQueryProcessor
The `is_publication_in_db` method takes the publication identifier as input.
It first checks that `pub_id` is a string; if it is not, it raises a `ValueError`. Otherwise it opens a connection to the relational database and runs the query.
The query selects the matching id (from the `id` column) in three tables (`JournalArticleTable`, `BookChapterTable`, and `ProceedingsPaperTable`) and combines the three SELECT statements with the UNION operator.
UNION merges the results of multiple SELECT statements into a single result set and drops duplicate rows.
The method returns True if `pub_id` exists in any of the three tables, and False otherwise.

In [None]:
#Class RelationalQueryProcessor
# Requires at the top of the module: import sqlite3 as sql3
    def is_publication_in_db(self, pub_id):
        # Reject non-string identifiers before touching the database
        if not isinstance(pub_id, str):
            raise ValueError("The input parameter is not a string!")

        # Look for the id in each publication table; UNION merges the three result sets
        query = "SELECT id FROM JournalArticleTable WHERE id = ? UNION SELECT id FROM BookChapterTable WHERE id = ? UNION SELECT id FROM ProceedingsPaperTable WHERE id = ?;"
        with sql3.connect(self.getDbPath()) as qrdb:
            cur = qrdb.cursor()
            cur.execute(query, (pub_id, pub_id, pub_id))
            result = cur.fetchall()

        # An empty result list means the id is in none of the three tables
        return bool(result)
            
#Testing
#query_anita = z.is_publication_in_db("doi:10.1162/qss_a_00112")
#print("is_publication_in_db Query\n", query_anita) --> Returns True

#query_anita = z.is_publication_in_db("doi:10.1016/j.cirpj.2018.06.002")
#print("is_publication_in_db Query\n", query_anita) --> Returns False
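The same check can be tried without the project database. The following is a minimal sketch, assuming a toy schema in which each of the three tables has only an `id` column; the in-memory database, the `build_demo_db` helper, and the standalone function signature are illustrative assumptions, not the project's actual code.

```python
import sqlite3

TABLES = ("JournalArticleTable", "BookChapterTable", "ProceedingsPaperTable")

QUERY = (
    "SELECT id FROM JournalArticleTable WHERE id = ? "
    "UNION SELECT id FROM BookChapterTable WHERE id = ? "
    "UNION SELECT id FROM ProceedingsPaperTable WHERE id = ?;"
)


def build_demo_db():
    """Create an in-memory database with a toy version of the three tables."""
    con = sqlite3.connect(":memory:")
    for table in TABLES:
        con.execute(f"CREATE TABLE {table} (id TEXT PRIMARY KEY)")
    # Only one publication is stored, in the journal article table
    con.execute("INSERT INTO JournalArticleTable VALUES ('doi:10.1162/qss_a_00112')")
    con.commit()
    return con


def is_publication_in_db(con, pub_id):
    """Return True if pub_id appears in any of the three tables."""
    if not isinstance(pub_id, str):
        raise ValueError("The input parameter is not a string!")
    cur = con.execute(QUERY, (pub_id, pub_id, pub_id))
    # One matching row is enough to answer the question
    return cur.fetchone() is not None


con = build_demo_db()
print(is_publication_in_db(con, "doi:10.1162/qss_a_00112"))          # True
print(is_publication_in_db(con, "doi:10.1016/j.cirpj.2018.06.002"))  # False
```

Using `fetchone()` instead of `fetchall()` is a small design choice: the method only needs to know whether at least one row exists.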

## Class TriplestoreQueryProcessor
The method first retrieves the triplestore endpoint URL by calling `getEndpointUrl()`. It then checks that the provided identifier (`doi`) is a string and raises a `ValueError` if it is not.
It builds a SPARQL query that uses the identifier as a match condition: the query selects the publication whose type is `fabio:Expression` and whose `schema:identifier` equals the provided DOI.
It returns True if the query result is not empty (the publication exists in the database) and False otherwise.

In [None]:
#Class TriplestoreQueryProcessor
    def is_publication_in_db(self, doi):
        endpoint = self.getEndpointUrl()
        # Check that the identifier is a string
        if not isinstance(doi, str):
            raise ValueError("pub_id must be a string")

        # Doubled braces are literal braces for str.format; {pub_id} is the placeholder
        query = """
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        PREFIX schema: <https://schema.org/>
        PREFIX fabio: <http://purl.org/spar/fabio/>

        SELECT ?publication
        WHERE {{
            ?publication rdf:type fabio:Expression ;
                         schema:identifier "{pub_id}" .
        }}
        """
        # get is assumed to come from sparql_dataframe and to return a pandas DataFrame
        result = get(endpoint, query.format(pub_id=doi), True)
        return not result.empty
    
#Testing
#Q_Anita = grp_qp.is_publication_in_db("doi:10.1007/s00521-020-05491-5")
#print(Q_Anita) #returns true

#Q_Anita = grp_qp.is_publication_in_db("doi:10.1162/qss_a_00112")
#print(Q_Anita) #returns false
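Because the identifier is pasted directly into the query text, a value containing a double quote or a backslash would produce an invalid query. The following is a minimal sketch of one way to guard against that, assuming a hypothetical helper named `sparql_string_literal` (not part of the project or of any library) that escapes the value before it is inserted.

```python
def sparql_string_literal(value):
    """Return value as a double-quoted SPARQL string literal (hypothetical helper)."""
    if not isinstance(value, str):
        raise ValueError("pub_id must be a string")
    escaped = (
        value.replace("\\", "\\\\")  # backslashes first, so later escapes are not doubled
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


QUERY_TEMPLATE = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX schema: <https://schema.org/>
PREFIX fabio: <http://purl.org/spar/fabio/>

SELECT ?publication
WHERE {{
    ?publication rdf:type fabio:Expression ;
                 schema:identifier {literal} .
}}
"""


def build_query(pub_id):
    """Fill the template with a safely quoted identifier."""
    return QUERY_TEMPLATE.format(literal=sparql_string_literal(pub_id))


print(build_query("doi:10.1007/s00521-020-05491-5"))
```

The resulting string can be passed to the same `get(endpoint, query, True)` call used in the method.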

## Class GenericQueryProcessor
def compute_h_index(author_id : str) : int
It returns a non-negative integer that is the maximum value 'h' such that the author identified by the input id has published 'h' papers that have each been cited at least 'h' times.

def remove_duplicates(l1 : list[Publication], l2 : list[Publication]) : list[Publication]
It takes two lists of publications as input and returns a new list containing the union of the publications in both lists, with duplicates removed.

### compute_h_index
To implement the `compute_h_index` method, I've implemented a `count_citations` method in both the `RelationalQueryProcessor` and `TriplestoreQueryProcessor` classes:

In [None]:
#Class RelationalQueryProcessor
    def count_citations(self, ref_doi):
        # Count the rows of ReferencesTable whose cited DOI is ref_doi
        with sql3.connect(self.getDbPath()) as qrdb:
            cur = qrdb.cursor()
            query = "SELECT COUNT(*) FROM ReferencesTable WHERE ref_doi = ?"
            cur.execute(query, (ref_doi,))
            # COUNT(*) always yields exactly one row, so fetchone() is never None
            num_cit = cur.fetchone()[0]

        return num_cit
#query_h = z.count_citations("doi:10.1162/qss_a_00023")
#print("count_citations Query\n", query_h)

In [None]:
#Class TriplestoreQueryProcessor
    def count_citations(self, ref_doi):
        endpoint = self.getEndpointUrl()
        query = f"""
        PREFIX schema: <https://schema.org/>

        SELECT (COUNT(?publication) as ?numCitations)
        WHERE {{ 
        ?publication schema:citation ?o.
        FILTER (regex (str(?o), "{ref_doi}", "i")) 
        }}

        """
        result = get(endpoint, query, True)
        num_citations = 0
        if not result.empty and 'numCitations' in result.columns:
            num_citations = int(result['numCitations'].iloc[0])

        return num_citations
#Q_index = grp_qp.count_citations("doi:10.1093/nar/gkz997")
#print(Q_index)
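One caveat about the regex filter in the triplestore query: the DOI is used as an unescaped, unanchored pattern, so a shorter DOI can match a longer one and `.` matches any character. SPARQL's `regex` follows XPath regular expression syntax rather than Python's, but both share these two behaviours, so the effect can be illustrated with Python's `re` module. This is a sketch under assumptions: the stored value (a base URL followed by the DOI) and both function names are hypothetical.

```python
import re


def loose_match(doi, stored):
    """Mimic an unanchored, case-insensitive regex filter with the raw DOI as pattern."""
    return re.search(doi, stored, re.IGNORECASE) is not None


def strict_match(doi, stored):
    """Escape the DOI and anchor it at the end of the stored value."""
    return re.search(re.escape(doi) + "$", stored, re.IGNORECASE) is not None


# Hypothetical stored value: a base URL followed by the DOI
stored = "https://example.org/doi:10.1162/qss_a_00023"

# A shorter, different DOI is found as a substring of the stored one
print(loose_match("doi:10.1162/qss_a_0002", stored))    # True
print(strict_match("doi:10.1162/qss_a_0002", stored))   # False
print(strict_match("doi:10.1162/qss_a_00023", stored))  # True

# The unescaped "." in the DOI matches any character
print(loose_match("doi:10.1162/qss_a_00023", "doi:10x1162/qss_a_00023"))  # True
```

If over-matching turns out to affect the counts, a stricter filter in the query itself (for example one based on SPARQL's `STRENDS` and `LCASE` functions) may be worth considering, depending on how the citation values are actually stored.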

The two `count_citations` methods provide a way to count the citations received by the publication with a given DOI.
The `compute_h_index` method is then defined in the `GenericQueryProcessor` class, together with a helper, `get_citations`, that retrieves the number of citations for a given publication ID.
`compute_h_index` first retrieves the publications associated with the given `author_id` using the `getPublicationsByAuthorId` method. It then collects the number of citations of each publication using `get_citations` and stores them in a list called `citations`. After sorting this list in descending order, it walks through it and increments `h_index` for as long as the publication at position `i + 1` has at least `i + 1` citations.
The H-index is the highest number `h_index` such that at least `h_index` publications have at least `h_index` citations each.
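The definition can be checked on plain numbers, independently of any database. The following is a minimal sketch, assuming the citation counts are already available as a list of integers; the standalone function name `h_index` is an illustrative choice.

```python
def h_index(citation_counts):
    """Return the h-index for a list of per-publication citation counts."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            # The top `rank` publications all have at least `rank` citations
            h = rank
        else:
            break
    return h


print(h_index([10, 8, 5, 4, 3]))  # 4: four publications have at least 4 citations
print(h_index([25, 8, 5, 3, 3]))  # 3: the fourth publication has only 3 citations
print(h_index([]))                # 0: no publications
```

Keeping the computation separate from the data retrieval makes it possible to test it without a working `getPublicationsByAuthorId` query.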

The `get_citations` method retrieves the number of citations for a given publication ID. It checks the form of `publication_id`: if it starts with "doi:", the ID is treated as a Digital Object Identifier and the method loops over the processors in `queryProcessor`, calling `count_citations` on the first `RelationalQueryProcessor` or `TriplestoreQueryProcessor` it finds and returning that count. Other kinds of identifiers are not handled yet.

In [None]:
#Class GenericQueryProcessor
    def compute_h_index(self, author_id):
        publications = self.getPublicationsByAuthorId(author_id)
        citations = []
        #print("publications:", publications)

        # Collect citations for each publication
        for publication in publications:
            citations.append(self.get_citations(publication.id))  # get_citations is a method to get citations of a publication

        citations.sort(reverse=True)
        h_index = 0
        for i, citation in enumerate(citations):
            if citation >= i + 1:
                h_index += 1
            else:
                break

        return h_index

    def get_citations(self, publication_id):
        # Only DOI-style identifiers are handled; anything else counts as 0 citations
        # so that compute_h_index can still sort the list (None would raise a TypeError)
        if not publication_id.startswith("doi:"):
            return 0

        # Return the count from the first supported processor in the list
        for processor in self.queryProcessor:
            if isinstance(processor, (RelationalQueryProcessor, TriplestoreQueryProcessor)):
                return processor.count_citations(publication_id)

        # No supported processor is available
        return 0

#h_index = generic.compute_h_index("0000-0001-5506-523X")
#print("H-index for the author:", h_index)

PROBLEMS: it does not run, probably because the problem lies within the `getPublicationsByAuthorId` query.

### remove_duplicates
The purpose of this method is to take two lists (`l1` and `l2`), combine them, and remove any duplicate elements.
The method first concatenates the two input lists `l1` and `l2` into a single list called `combined_list` using the `+` operator.
It then initializes an empty set called `seen_ids`. A Python set stores each distinct element only once, so when the method iterates over `combined_list` and adds each element to `seen_ids`, elements that are already present are not added again. This works only if the elements are hashable and duplicates compare as equal (as strings do); objects that do not define `__eq__` and `__hash__` are compared by identity, so two separate objects describing the same publication would both be kept.
After all elements have been added, the method converts the set back to a list with the `list()` function, creating a new list called `list_seen_ids`.
Finally, the method returns `list_seen_ids`, which contains only the unique elements of the two original lists. Because sets are unordered, the order of the returned list is not guaranteed to match the input order.

In [None]:
#Class GenericQueryProcessor
    def remove_duplicates(self, l1, l2):
        combined_list = l1 + l2  # Combine both lists
        seen_ids = set()
        for publication in combined_list:
            seen_ids.add(publication)
        
        list_seen_ids = list(seen_ids)
        return list_seen_ids
    
#remove_duplicates_method = generic.remove_duplicates(['doi:10.1162/qss_a_00023', 'doi:10.1038/sdata.2016.18'], ['doi:10.1007/s11192-020-03397-6', 'doi:10.1080/19386389.2021.1999156', 'doi:10.1038/sdata.2016.18'])
#print(remove_duplicates_method)

PROBLEM: here I wonder whether I should run an additional query to obtain the lists, given that this method has to work inside the generic processor.
Should I use the `publication.id` values?
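One possible answer is to key the deduplication on the identifier rather than on the objects themselves. The following is a minimal sketch, assuming each publication exposes its identifier as an `id` attribute (as in the `publication.id` access used by `compute_h_index`); the `Publication` dataclass here is a hypothetical stand-in for the project's class, which may expose the identifier differently.

```python
from dataclasses import dataclass


@dataclass
class Publication:
    """Hypothetical stand-in for the project's Publication class."""
    id: str
    title: str = ""


def remove_duplicates(l1, l2):
    """Return the union of l1 and l2, keeping the first publication seen for each id."""
    unique = {}
    for publication in l1 + l2:
        # setdefault stores the publication only if its id has not been seen yet
        unique.setdefault(publication.id, publication)
    return list(unique.values())


a = [Publication("doi:10.1162/qss_a_00023"), Publication("doi:10.1038/sdata.2016.18")]
b = [Publication("doi:10.1007/s11192-020-03397-6"), Publication("doi:10.1038/sdata.2016.18")]

print([p.id for p in remove_duplicates(a, b)])
# ['doi:10.1162/qss_a_00023', 'doi:10.1038/sdata.2016.18', 'doi:10.1007/s11192-020-03397-6']
```

A dictionary keyed on the id keeps the original order of first appearance and does not require the publication objects to be hashable. No extra query is needed in this sketch: the method works on whatever lists it receives.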