## Literature to be classified

To run the old experiments, you have to switch back to the old openai version. You are on a compatible version if you can import `cosine_similarity` from `openai.embeddings_utils`.
This is only needed for the embedding-based experiments. The other experiments should work with the current version of openai.

In [5]:
%pip freeze

accelerate==0.18.0
aiofiles @ file:///home/conda/feedstock_root/build_artifacts/aiofiles_1664378549280/work
aiohttp==3.8.4
aiosignal==1.3.1
aiosqlite @ file:///home/conda/feedstock_root/build_artifacts/aiosqlite_1671461885930/work
alembic @ file:///home/conda/feedstock_root/build_artifacts/alembic_1678309044210/work
altair @ file:///home/conda/feedstock_root/build_artifacts/altair_1675180856922/work
anyio @ file:///home/conda/feedstock_root/build_artifacts/anyio_1666191106763/work/dist
appdirs==1.4.4
argon2-cffi @ file:///home/conda/feedstock_root/build_artifacts/argon2-cffi_1640817743617/work
argon2-cffi-bindings @ file:///home/conda/feedstock_root/build_artifacts/argon2-cffi-bindings_1666850883579/work
asttokens @ file:///home/conda/feedstock_root/build_artifacts/asttokens_1670263926556/work
async-generator==1.10
async-timeout==4.0.2
attrs @ file:///home/conda/feedstock_root/build_artifacts/attrs_1671632566681/work
Babel @ file:///home/conda/feedstock_root/build_artifacts/babel_16777

In [20]:
# scikit-learn is the PyPI package name; the `sklearn` alias is deprecated
%pip install openai scikit-learn openai[embeddings]
# update openai
%pip install openai --upgrade
import pandas as pd
from typing import Dict
from os import path, makedirs
from tqdm.notebook import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer, ZeroShotClassificationPipeline
from difflib import get_close_matches
import torch
import random
pd.set_option("display.max_rows", 50)

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


## Dataset Helper Classes

To use the created datasets seamlessly, a generic adapter base class (`ZeroShotDataset`) and one dataset-specific subclass per dataset are used.
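
Every adapter exposes the same layout: the text in the first column, followed by one 0/1 column per label. The following is a minimal sketch of that layout and of the single-label filter, assuming a toy frame with a hypothetical `type` column in which multiple labels are separated by `|` (the real CSV files may encode labels differently).

```python
import pandas as pd

# Toy data, invented for illustration only.
raw = pd.DataFrame(
    {
        "abstract": ["paper a", "paper b", "paper c"],
        "type": ["robotics", "machine learning", "robotics|machine learning"],
    }
)

# One-hot encode the label column and drop the original column.
prepared = pd.concat(
    [raw, raw["type"].str.get_dummies(sep="|")], axis=1
).drop("type", axis=1)

# Keep only rows that carry exactly one label.
single_label = prepared.loc[prepared.sum(axis=1, numeric_only=True) == 1]

# All columns after the text column are labels.
labels = list(prepared.columns[1:])
print(labels)
print(single_label)
```

The row with two labels is dropped by the filter, which mirrors what `get_single_label_data` does for the multi-label datasets.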

In [21]:
class ZeroShotDataset:
    def __init__(
        self,
        source_file,
        is_multi_label=False,
        custom_hypothesis="This text is about {}",
    ) -> None:
        self.source_file = source_file
        self._prepared_data: pd.DataFrame = None
        self.is_multi_label = is_multi_label
        self.custom_hypothesis = custom_hypothesis
        self._prepare()

    def _prepare(self):
        self._prepared_data = pd.DataFrame()

    def get_prepared_data(self):
        return self._prepared_data

    @staticmethod
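    def zeros_like(df):
        # Copy the frame and zero every label column; the first column (the text) is kept.
        # Assigning the column directly avoids writing through `.values`, which can be
        # read-only when pandas copy-on-write is enabled.
        new_df = df.copy()
        for col in new_df.columns[1:]:
            new_df[col] = 0
        return new_df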

    def get_empty_data(self, single_label=True):
        return self.zeros_like(
            self.get_single_label_data() if single_label else self._prepared_data
        )

    def get_labels(self, single_label=True):
        return (
            list(self.get_single_label_data().iloc[:, 1:].columns)
            if single_label
            else list(self._prepared_data.iloc[:, 1:].columns)
        )

    def get_single_label_data(self):
        return self._prepared_data.loc[
            (self._prepared_data.sum(axis=1, numeric_only=True) == 1)
        ]


class AiInIsDataCollectionDataset(ZeroShotDataset):
    def __init__(self, custom_hypothesis="This text is about {}") -> None:
        super().__init__(
            "/home/jovyan/work/data/overviews/ai_in_is_data_collection_techniques_solutions.csv",
            is_multi_label=True,
            custom_hypothesis=custom_hypothesis,
        )

    def _prepare(self):
        df = pd.read_csv(self.source_file, sep=";", encoding="ISO-8859-1")
        df = df.iloc[:, 3:]
        df = df.rename(
            columns={"ml subset selection": "machine learning subset selection"}
        )
        self._prepared_data = df

    def get_single_label_data(self):
        df = super().get_single_label_data()
        df = df.drop("prediction markets", axis=1)
        return df


class AiInIsTypesDataset(ZeroShotDataset):
    def __init__(self, custom_hypothesis="This text is about {}") -> None:
        super().__init__(
            "/home/jovyan/work/data/overviews/ai_in_is_types_solutions.csv",
            is_multi_label=False,
            custom_hypothesis=custom_hypothesis,
        )

    def _prepare(self):
        df = pd.read_csv(self.source_file, sep=";", encoding="ISO-8859-1")
        df = df.iloc[:, 2:]
        df = pd.concat([df, df.type.str.get_dummies()], axis=1).drop("type", axis=1)
        self._prepared_data = df


class AiInIsDataCollectionFullTextDataset(ZeroShotDataset):
    def __init__(self, custom_hypothesis="This text is about {}") -> None:
        super().__init__(
            "/home/jovyan/work/data/overviews/ai_in_is_data_collection_techniques_solutions.pkl",
            is_multi_label=True,
            custom_hypothesis=custom_hypothesis,
        )

    def _prepare(self):
        df = pd.read_pickle(self.source_file)
        df = df.iloc[:, 3:]
        df = df.drop("abstract", axis=1)
        df = df.rename(columns={"text": "abstract"})
        # reorder abstract to first column
        df = df[["abstract"] + [col for col in df.columns if col != "abstract"]]
        df = df.rename(
            columns={"ml subset selection": "machine learning subset selection"}
        )
        self._prepared_data = df

    def get_single_label_data(self):
        df = super().get_single_label_data()
        df = df.drop("prediction markets", axis=1)
        return df


class AiInIsTypesFullTextDataset(ZeroShotDataset):
    def __init__(self, custom_hypothesis="This text is about {}") -> None:
        super().__init__(
            "/home/jovyan/work/data/overviews/ai_in_is_types_solutions.pkl",
            is_multi_label=False,
            custom_hypothesis=custom_hypothesis,
        )

    def _prepare(self):
        df = pd.read_pickle(self.source_file)
        df = df[['text', 'type']]
        df = df.rename(columns={"text": "abstract"})
        df = pd.concat([df, df.type.str.get_dummies()], axis=1).drop("type", axis=1)
        self._prepared_data = df


class DataCompletenessInHealthcareDataset(ZeroShotDataset):
    def __init__(self, custom_hypothesis="This text is about {}") -> None:
        super().__init__(
            "/home/jovyan/work/data/overviews/data_completeness_in_healthcare_solutions.csv",
            is_multi_label=True,
            custom_hypothesis=custom_hypothesis,
        )

    def _prepare(self):
        df = pd.read_csv(self.source_file, sep=";", encoding="ISO-8859-1")
        df = df.iloc[:, 2:]
        self._prepared_data = df


class MachineLearningInBusinessDataset(ZeroShotDataset):
    def __init__(self, custom_hypothesis="This text is about {}") -> None:
        super().__init__(
            "/home/jovyan/work/data/overviews/ml_in_business_solutions.csv",
            is_multi_label=False,
            custom_hypothesis=custom_hypothesis,
        )

    def _prepare(self):
        df = pd.read_csv(self.source_file, sep=";", encoding="ISO-8859-1")
        df = df.iloc[:, 1:]
        df = pd.concat([df, df.category.str.get_dummies()], axis=1).drop(
            "category", axis=1
        )
        self._prepared_data = df


class SearchEngineAdvertisingDataset(ZeroShotDataset):
    def __init__(self, custom_hypothesis="This text is about {}") -> None:
        super().__init__(
            "/home/jovyan/work/data/overviews/sea_topics_solutions.csv",
            is_multi_label=False,
            custom_hypothesis=custom_hypothesis,
        )

    def _prepare(self):
        df = pd.read_csv(self.source_file, sep=";", encoding="ISO-8859-1")
        df = df.iloc[:, 2:]
        df = pd.concat([df, df.topic.str.get_dummies()], axis=1).drop("topic", axis=1)
        self._prepared_data = df


class OverviewWiTheoryTypeDataset(ZeroShotDataset):
    def __init__(self, custom_hypothesis="This text is about {}") -> None:
        super().__init__(
            "/home/jovyan/work/data/overviews/overview_WI.csv",
            is_multi_label=False,
            custom_hypothesis=custom_hypothesis,
        )

    def _prepare(self):
        df = pd.read_csv(self.source_file, sep=";", encoding="ISO-8859-1")
        df = df.iloc[:, 5:7]
        df = pd.concat([df, df.theory_type.str.get_dummies()], axis=1).drop(
            "theory_type", axis=1
        )
        self._prepared_data = df


class OverviewWiResearchMethodDataset(ZeroShotDataset):
    def __init__(self, custom_hypothesis="This text is about {}") -> None:
        super().__init__(
            "/home/jovyan/work/data/overviews/overview_WI.csv",
            is_multi_label=False,
            custom_hypothesis=custom_hypothesis,
        )

    def _prepare(self):
        df = pd.read_csv(self.source_file, sep=";", encoding="ISO-8859-1")
        df = df[["abstract", "research method"]]
        df = pd.concat([df, df["research method"].str.get_dummies()], axis=1).drop(
            "research method", axis=1
        )
        self._prepared_data = df


class OverviewMisqTheoryTypesDataset(ZeroShotDataset):
    def __init__(self, custom_hypothesis="This text is about {}") -> None:
        super().__init__(
            "/home/jovyan/work/data/overviews/overview_MISQ.csv",
            is_multi_label=False,
            custom_hypothesis=custom_hypothesis,
        )

    def _prepare(self):
        df = pd.read_csv(self.source_file, sep=";", encoding="ISO-8859-1")
        df = df[["abstract", "type_mod"]]
        df = pd.concat([df, df["type_mod"].str.get_dummies()], axis=1).drop(
            "type_mod", axis=1
        )
        self._prepared_data = df


class OverviewEswaMlApproachDataset(ZeroShotDataset):
    def __init__(self, custom_hypothesis="This text is about {}") -> None:
        super().__init__(
            "/home/jovyan/work/data/overviews/overview_eswa.csv",
            is_multi_label=False,
            custom_hypothesis=custom_hypothesis,
        )

    def _prepare(self):
        df = pd.read_csv(self.source_file, sep=";", encoding="ISO-8859-1")
        df = df[["abstract", "ml_approach"]]
        df = pd.concat([df, df["ml_approach"].str.get_dummies()], axis=1).drop(
            "ml_approach", axis=1
        )
        self._prepared_data = df

We initialize every dataset once and keep the instances in a dictionary. The datasets are small enough to hold in memory, and the dictionary makes it easy to loop over them.

If you want to customize the hypothesis at a later step, do so when initializing the dataset.
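
In NLI-based zero-shot classification, each candidate label is inserted into the hypothesis template, and the resulting sentence is scored against the text. The sketch below only shows how a template expands; `build_hypotheses`, the custom template, and the two labels are illustrative assumptions and not part of the notebook.

```python
def build_hypotheses(template: str, labels: list[str]) -> list[str]:
    """Fill the `{}` placeholder of a hypothesis template once per label."""
    return [template.format(label) for label in labels]


default_template = "This text is about {}"
# Hypothetical custom template, chosen only for illustration.
custom_template = "The research method used in this paper is {}"
labels = ["survey", "experiment"]

default_hypotheses = build_hypotheses(default_template, labels)
custom_hypotheses = build_hypotheses(custom_template, labels)
print(default_hypotheses)
print(custom_hypotheses)
```

A custom template is passed to a dataset through its constructor, for example `AiInIsTypesDataset(custom_hypothesis=custom_template)`.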

In [22]:
datasets: Dict[str, ZeroShotDataset] = {
    "ai_in_is_data_collection_techniques": AiInIsDataCollectionDataset(),
    "ai_in_is_types": AiInIsTypesDataset(),
    "ai_in_is_data_collection_techniques_fulltext": AiInIsDataCollectionFullTextDataset(),
    "ai_in_is_types_fulltext": AiInIsTypesFullTextDataset(),
    "data_completeness_in_healthcare": DataCompletenessInHealthcareDataset(),
    "machine_learning_in_business": MachineLearningInBusinessDataset(),
    "search_engine_advertising": SearchEngineAdvertisingDataset(),
    "overview_wi_theory_type": OverviewWiTheoryTypeDataset(),
    "overview_wi_research_method": OverviewWiResearchMethodDataset(),
    "overview_misq_theory_types": OverviewMisqTheoryTypesDataset(),
    "overview_eswa_ml_approach": OverviewEswaMlApproachDataset()
}

In [23]:
fulltext_datasets = {
    "ai_in_is_data_collection_techniques_fulltext": AiInIsDataCollectionFullTextDataset(),
    "ai_in_is_types_fulltext": AiInIsTypesFullTextDataset()
}

## Classification Helper Classes
To streamline the classification code, a result class (`ClassificationResult`) and a classifier class with static methods (`UniversalClassifier`) are defined.

In [13]:
#from openai.embeddings_utils import cosine_similarity, get_embedding
from sklearn.metrics import PrecisionRecallDisplay, classification_report
from openai import OpenAI
import time

client = OpenAI()
import numpy as np



class ClassificationResult:
    def __init__(self, result:pd.DataFrame, dataset:ZeroShotDataset, single_label = True) -> None:
        self.dataset = dataset
        self.result = result
        self.single_label = single_label

    def get_df(self):
        return self.result

    def get_df_idxmax(self, threshold = 0.99):
        # Binarize the scores: a label becomes 1 if its score reaches the threshold
        # or if it is the highest-scoring label of its row; everything else becomes 0.
        copy = self.result.copy(deep=True)
        scores = copy.iloc[:, 1:].astype(float)
        binary = (scores >= threshold).astype(int)
        for rowIndex, best_label in scores.idxmax(axis=1).items():
            binary.loc[rowIndex, best_label] = 1
        # write the 0/1 values back without chained indexing
        copy[binary.columns] = binary
        return copy

    def get_test_vs_pred(self):
        results = self.result.copy(deep=True)
        dataset = self.dataset.get_single_label_data() if self.single_label else self.dataset.get_prepared_data()
        results['y_test'] = dataset.idxmax(axis=1, numeric_only=True)
        results['y_pred'] = results.idxmax(axis=1, numeric_only=True)
        return results[['abstract', 'y_test', 'y_pred']]

    def save_tvp_to_pkl(self, path):
        self.get_test_vs_pred().to_pickle(path)

    def save_to_pkl(self, path):
        self.result.to_pickle(path)


class UniversalClassifier:
    @staticmethod
    def classify(model_name_or_path, dataset: ZeroShotDataset):
        df = dataset.get_empty_data(single_label = True)
        labels = dataset.get_labels(single_label = True)
        weights = model_name_or_path
        device = "cuda:0" if torch.cuda.is_available() else "cpu"
        model = AutoModelForSequenceClassification.from_pretrained(weights)
        tokenizer = AutoTokenizer.from_pretrained(weights)
        classifier = ZeroShotClassificationPipeline(
            model=model,
            tokenizer=tokenizer,
            candidate_labels=labels,
            return_all_scores=True,
            device=device,
            hypothesis_template=dataset.custom_hypothesis
            )
        div = 10
        o = 0
        print(f"Using prompt: {dataset.custom_hypothesis}")
        for index, row in tqdm(df.iterrows(), total=len(df.index)):
            result = classifier(row["abstract"], multi_label=False)
            for i in range(0, int(len(labels)), 1):
                df.loc[index, result["labels"][i]] = result["scores"][i]
            if index % div == 0:
                print(f"Currently on row: {o}")
                o = o + 10
        return ClassificationResult(df, dataset)
    
    @staticmethod
    def classify_randomly(dataset: ZeroShotDataset):
        df = dataset.get_empty_data(single_label = True)
        labels = dataset.get_labels(single_label = True)
        for index, row in tqdm(df.iterrows(), total=len(df.index)):
            for label in labels:
                df.loc[index, label] = random.uniform(0, 1)
        return ClassificationResult(df, dataset)
        
    @staticmethod
    def classify_using_openai(model_name: str, dataset: ZeroShotDataset):
        df = dataset.get_empty_data(single_label = True)
        labels = dataset.get_labels(single_label = True)
        print("Generating label embeddings...")
        # openai.embeddings_utils (get_embedding, cosine_similarity) is not available in the
        # current openai client, so the embeddings are requested through the client and the
        # cosine similarity is computed with NumPy.
        label_embeddings = [
            client.embeddings.create(input="This text is about " + label, model=model_name).data[0].embedding
            for label in labels
        ]
        print("done")

        def label_score(abstract_embedding, label_embedding):
            a, b = np.asarray(abstract_embedding), np.asarray(label_embedding)
            return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        
        def softmax(x):
            """Compute softmax values for a set of scores in x."""
            e_x = np.exp(x - np.max(x))
            return e_x / e_x.sum()

        for index, row in tqdm(df.iterrows(), total=len(df.index)):
            print("Generating abstract embedding...")
            # the response is an object, so the embedding is read via attributes
            abstract_embedding = client.embeddings.create(input=row["abstract"], model=model_name).data[0].embedding
            print("done")

            scores = [label_score(abstract_embedding, label_embeddings[i]) for i in range(0, int(len(labels)), 1)]
            scores = softmax(scores)

            for i in range(0, int(len(labels)), 1):
                df.loc[index, labels[i]] = scores[i]
            
        return ClassificationResult(df, dataset)
        
    @staticmethod
    def classify_using_generative_model(model_name: str, dataset: ZeroShotDataset, model_provider="openai"):
        df = dataset.get_empty_data(single_label = True)
        labels = dataset.get_labels(single_label = True)
        
        label_choices = '["' + '", "'.join(label for label in labels) + '"]'
        prompt_template = f"""You are given an array of topic label strings. You are also given a paper. You need to choose the topic that best describes the  paper. 
            -----
            Topics: {label_choices}
            Make sure you categorize the paper by only choosing exactly one topic. Only name that topic. Name it exactly as in the array above. Do not say anything else, but the exact name of the topic. Do not use abbreviations. Do not use numbers. Do not use punctuation. Do not use capital letters. Do not use any other characters.
            You can not choose multiple or no topics.
            -----
            Paper:

        """
        for index, row in tqdm(df.iterrows(), total=len(df.index)):
            prompt = prompt_template
            prompt += row["abstract"] + "\n\n"
            prompt += "Topic: \n\n"
            # call openai gpt4 chat api to get the answer
            # then match the answer to the label

            response = client.chat.completions.create(model=model_name,
            messages=[
                {"role": "system", "content": "You are a scientific assistant trying to help a scientist categorize a set of papers."},
                {"role": "user", "content": prompt},
            ])
            answer = response.choices[0].message.content
            try:
                # fuzzy match answer to choice
                target = get_close_matches(answer, labels, n=1)[0]
                # write to the matched label, not the raw answer, so no new column is created
                df.loc[index, target] = 1
            except Exception:
                print(f"Answer {answer} not found in labels")
        
        return ClassificationResult(df, dataset)
    
    @staticmethod
    def classify_using_generative_model_fulltext(model_name: str, dataset: ZeroShotDataset, model_provider="openai"):
        df = dataset.get_empty_data(single_label = True)
        labels = dataset.get_labels(single_label = True)
        
        label_choices = '["' + '", "'.join(label for label in labels) + '"]'
        prompt_template = f"""You are given an array of topic label strings. You are also given a paper. You need to choose the topic that best describes the paper. 
            -----
            Make sure you categorize the paper by only choosing the best-fitting topic. Only name that topic. NAME IT EXACTLY AS IN THE FOLLOWING ARRAY:
            -----
            `{label_choices}`
            -----

            Do not say anything else, but the exact name of the topic. Do not use abbreviations. Do not use numbers. Do not use punctuation. Do not use capital letters. Do not use any other characters.
            You can not choose multiple or no topics.
            For example, a choice can be "{labels[0]}". It cannot be "I think this is about {labels[0]}". Or "{labels[0]} subset selection". It has to be exactly "{labels[0]}".

            -----
            Topics to choose from: `{label_choices}`
            -----

            Paper:

        """
        for index, row in tqdm(df.iterrows(), total=len(df.index)):
            prompt = prompt_template
            prompt += row["abstract"] + "\n\n"
            prompt += "=================================================================\n\n"
            prompt += "Topic: \n\n"
            # Call the OpenAI chat API, then fuzzy-match the answer to a label.
            # Retry if the call fails or if the answer matches no label.
            max_attempts = 5
            for attempt in range(max_attempts):
                answer = None
                try:
                    response = client.chat.completions.create(model=model_name,
                    messages=[
                        {"role": "system", "content": "You are a scientific assistant trying to help a scientist categorize a set of papers."},
                        {"role": "user", "content": prompt},
                    ],
                    temperature=0)
                    answer = response.choices[0].message.content
                except Exception as e:
                    print(f"Attempt {attempt + 1} failed: {e}")
                    if attempt < max_attempts - 1:
                        time.sleep(1)  # wait for 1 second before retrying
                    else:
                        print("Max attempts reached. Moving to the next item.")
                    continue
                # fuzzy match answer to choice; an empty answer matches nothing
                matches = get_close_matches(answer, labels, n=1) if answer else []
                if matches:
                    df.loc[index, matches[0]] = 1
                    break
                print(f"Answer {answer} not found in labels")
                if attempt == max_attempts - 1:
                    print(f"Answer {answer} failed {max_attempts} times. Skipping.")
                
        
        return ClassificationResult(df, dataset)
    
    @staticmethod
    def classify_multi_label(model_name_or_path, dataset: ZeroShotDataset):
        if not dataset.is_multi_label:
            raise ValueError("Dataset is not multi label")
        df = dataset.get_prepared_data()
        labels = dataset.get_labels()
        return df
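
The generative classifiers map the model's free-text answer back to a label with `difflib.get_close_matches`. The following is a minimal sketch of that matching step in isolation. The helper `match_answer` and the lowercasing and stripping of the answer are assumptions added for illustration; the notebook passes the raw answer.

```python
from difflib import get_close_matches
from typing import Optional


def match_answer(answer: str, labels: list[str], cutoff: float = 0.6) -> Optional[str]:
    """Return the label closest to a free-text answer, or None if none is close enough."""
    normalized = answer.strip().lower()
    # cutoff=0.6 is the default similarity threshold of get_close_matches.
    matches = get_close_matches(normalized, labels, n=1, cutoff=cutoff)
    return matches[0] if matches else None


labels = ["expert systems", "machine learning", "robotics"]
matched = match_answer("Machine Learning.", labels)
unmatched = match_answer("zzz", labels)
print(matched, unmatched)
```

Returning `None` instead of indexing into an empty list makes the "no match" case explicit, so the caller can decide whether to retry or skip the row.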

In [15]:
from IPython.display import clear_output
def classify_all_datasets(model_name_or_path, datasets: Dict[str, ZeroShotDataset]):
    results: Dict[str, ClassificationResult] = {}
    for key, dataset in datasets.items():
        print(f"Classifying {key}")
        if model_name_or_path == "random":
            results[key] = UniversalClassifier.classify_randomly(dataset)
        elif model_name_or_path == "text-embedding-ada-002":
            results[key] = UniversalClassifier.classify_using_openai("text-embedding-ada-002", dataset)
        elif model_name_or_path == "gpt-3.5-turbo" or model_name_or_path == "gpt-4":
            results[key] = UniversalClassifier.classify_using_generative_model(model_name_or_path, dataset)
        elif model_name_or_path == "gpt-4-0125-preview":
            results[key] = UniversalClassifier.classify_using_generative_model_fulltext(model_name_or_path, dataset)
        else:
            results[key] = UniversalClassifier.classify(model_name_or_path, dataset)
        clear_output(wait=True)
    print("Done classifiying all datasets")
    return results


In [16]:
def write_results_to_files(model_name: str, results: Dict[str, ClassificationResult], custom_prompt: bool):
    custom = "custom_prompt" if custom_prompt else "default_prompt"
    for key, result in results.items():
        if not path.exists(f"/home/jovyan/work/data/results/{model_name}/{custom}"):
            makedirs(f"/home/jovyan/work/data/results/{model_name}/{custom}")
        result.save_to_pkl(f"/home/jovyan/work/data/results/{model_name}/{custom}/{key}.pkl")
        result.save_tvp_to_pkl(f"/home/jovyan/work/data/results/{model_name}/{custom}/{key}_tvp.pkl")
    print("Done writing all results")
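
The directory layout that `write_results_to_files` produces can be sketched without touching the real results folder. The helper `result_paths` below is hypothetical, and a temporary directory stands in for `/home/jovyan/work/data/results`. Note that a model name containing a slash, such as `facebook/bart-large-mnli`, results in nested directories.

```python
import os
import tempfile


def result_paths(base_dir: str, model_name: str, key: str, custom_prompt: bool) -> tuple[str, str]:
    """Create the target directory and return the score file and test-vs-pred file paths."""
    prompt_dir = "custom_prompt" if custom_prompt else "default_prompt"
    target_dir = os.path.join(base_dir, model_name, prompt_dir)
    # exist_ok=True replaces the separate path.exists check.
    os.makedirs(target_dir, exist_ok=True)
    return (
        os.path.join(target_dir, f"{key}.pkl"),
        os.path.join(target_dir, f"{key}_tvp.pkl"),
    )


with tempfile.TemporaryDirectory() as base_dir:
    scores_path, tvp_path = result_paths(
        base_dir, "facebook/bart-large-mnli", "ai_in_is_types", custom_prompt=False
    )
    relative_scores = os.path.relpath(scores_path, base_dir).replace(os.sep, "/")
    relative_tvp = os.path.relpath(tvp_path, base_dir).replace(os.sep, "/")
    directory_created = os.path.isdir(os.path.dirname(scores_path))

print(relative_scores)
print(relative_tvp)
```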

## Classification
Now we can easily run classifications for each model and store the results.

### Random

In [7]:
results_random = classify_all_datasets("random", datasets)

Done classifiying all datasets


In [8]:
write_results_to_files("random", results_random, False)

Done writing all results
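
The random classifier serves as a baseline: with uniformly drawn scores, each of the k labels is equally likely to receive the highest score, so the expected single-label accuracy is 1/k regardless of the true label distribution. The simulation below is a sketch of that reasoning. The function name, the sample size, and the uniformly drawn true labels are assumptions and do not use the notebook's datasets.

```python
import random


def random_baseline_accuracy(n_labels: int, n_samples: int, seed: int = 0) -> float:
    """Simulate argmax over uniform random scores and return the hit rate."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        true_label = rng.randrange(n_labels)
        scores = [rng.uniform(0, 1) for _ in range(n_labels)]
        predicted = scores.index(max(scores))
        hits += predicted == true_label
    return hits / n_samples


accuracy = random_baseline_accuracy(n_labels=5, n_samples=20_000)
print(f"simulated accuracy: {accuracy:.3f}, expected: {1 / 5:.3f}")
```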


### OpenAI ADA Embeddings

In [20]:
results_openai = classify_all_datasets("text-embedding-ada-002", datasets)

Done classifiying all datasets


In [21]:
results_openai.get("ai_in_is_types").get_df()

Unnamed: 0,abstract,expert systems,machine learning,machine vision,natural language processing,robotics
0,Decision strategies in dynamic environments do...,0.206604,0.203087,0.197554,0.198516,0.194238
1,"In this paper, we explore how keyword ambiguit...",0.201265,0.202121,0.196594,0.204644,0.195376
2,As crowdsourced user-generated content becomes...,0.202369,0.202187,0.200055,0.198925,0.196464
3,The goal of a review article is to present the...,0.203921,0.199633,0.198321,0.200456,0.197670
4,We provide a background discussion of group su...,0.207261,0.199338,0.197569,0.200015,0.195817
...,...,...,...,...,...,...
78,Cognitive computing systems (CCS) are a new cl...,0.205280,0.199675,0.198109,0.200933,0.196003
79,A significant recent technological development...,0.202642,0.199231,0.199139,0.198245,0.200743
80,Changing demands in society and the limited ca...,0.201027,0.197686,0.197669,0.197204,0.206414
81,Robotics application has provided a fruitful c...,0.198866,0.197757,0.197942,0.198189,0.207246
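
The scores in the table above are all close to 0.2. A likely reason is that the cosine similarities between an abstract and the five label sentences differ only by a few hundredths, and a softmax over such close values is almost uniform. The sketch below illustrates the effect with hypothetical similarity values. The temperature scaling at the end is an assumption for illustration and is not part of the notebook.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical cosine similarities for five labels (not measured values).
similarities = [0.80, 0.78, 0.76, 0.77, 0.75]

probabilities = softmax(similarities)
print([round(p, 3) for p in probabilities])

# Dividing by a small temperature spreads the values apart before the softmax.
temperature = 0.01
sharpened = softmax([s / temperature for s in similarities])
print([round(p, 3) for p in sharpened])
```

The ranking of the labels is the same in both cases, so the predicted label (the argmax) does not change. Only the spread of the reported scores does.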


In [22]:
write_results_to_files("openai", results_openai, False)

Done writing all results


### OpenAI gpt-3.5-turbo

In [7]:
results_gpt35 = classify_all_datasets("gpt-3.5-turbo", datasets)

Done classifiying all datasets


In [8]:
write_results_to_files("gpt-3.5-turbo", results_gpt35, False)

Done writing all results


### OpenAI gpt-4

In [16]:
results_gpt4 = classify_all_datasets("gpt-4", datasets)

Done classifiying all datasets


In [17]:
write_results_to_files("gpt-4", results_gpt4, False)

Done writing all results


### OpenAI gpt-4-0125-preview (GPT-4 Turbo) full text

In [17]:
results_gpt45_fulltext = classify_all_datasets("gpt-4-0125-preview", fulltext_datasets)

Done classifiying all datasets


In [18]:
write_results_to_files("gpt-4-0125-preview_fulltext", results_gpt45_fulltext, False)

Done writing all results
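
The full-text classifier retries a failed API call up to five times with a fixed pause of one second. A generic version of this retry pattern is sketched below. The helper `call_with_retries`, the exponential backoff, and the injectable `sleep` function are assumptions for illustration; the notebook itself waits a constant second between attempts.

```python
import time


def call_with_retries(operation, max_attempts=5, delay_seconds=1.0, sleep=time.sleep):
    """Call operation() until it succeeds; return None after max_attempts failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as error:
            print(f"Attempt {attempt} failed: {error}")
            if attempt < max_attempts:
                # Double the waiting time after every failed attempt.
                sleep(delay_seconds * 2 ** (attempt - 1))
    return None


# Stand-in for an API call that fails twice before it succeeds.
calls = {"count": 0}


def flaky_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "machine learning"


# Replace the real sleep with a no-op so the example runs instantly.
result = call_with_retries(flaky_call, sleep=lambda seconds: None)
print(result, calls["count"])
```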


In [19]:
results_gpt45_fulltext.get("ai_in_is_data_collection_techniques_fulltext").get_df()

Unnamed: 0,abstract,data mining,experiment,observation,casual mapping,documentation,survey,sample analysis,machine learning subset selection,literature analysis,workshop,focus group,questionnaire
0,Journal of Strategic Information Systems 29 (2...,1,0,0,0,0,0,0,0,0,0,0,0
1,Association for Information Systems Associatio...,0,0,0,0,0,0,0,1,0,0,0,0
2,International Journal of Information Managemen...,0,0,0,0,0,0,0,0,1,0,0,0
3,International Journal of Information Managemen...,0,0,0,0,0,0,0,0,1,0,0,0
4,Association for Information Systems Associatio...,0,0,0,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
83,Association for Information Systems Associatio...,0,0,0,0,0,0,0,0,1,0,0,0
84,International Journal of Information Managemen...,0,0,0,0,0,0,0,0,1,0,0,0
86,Journal of the Association for Information Sys...,0,0,0,1,0,0,0,0,0,0,0,0
87,European Journal of Information Systems ISSN: ...,1,0,0,0,0,0,0,0,0,0,0,0


In [13]:
results_gpt45_fulltext.get("ai_in_is_types_fulltext").get_test_vs_pred()

Unnamed: 0,abstract,y_test,y_pred
0,This article was downloaded by: [132.187.247.3...,machine learning,machine learning
1,RESEARCH ARTICLE EXAMINING THE IMPACT OF KEYWO...,machine learning,natural language processing
2,Journal of Management Information Systems ISSN...,machine learning,machine learning
3,"View metadata, citation and similar papers at ...",machine learning,machine learning
4,Association for Information Systems Associatio...,machine learning,machine learning
...,...,...,...
83,Journal of Strategic Information Systems 29 (2...,expert systems,machine learning
84,Association for Information Systems AIS Electr...,machine learning,machine learning
86,Association for Information Systems AIS Electr...,expert systems,expert systems
87,Association for Information Systems AIS Electr...,natural language processing,expert systems
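
A table of `y_test` and `y_pred` values like the one above can be summarized as an accuracy and as counts per (true, predicted) pair. The sketch below uses only the standard library and a handful of hypothetical pairs, not the actual results. `sklearn.metrics.classification_report`, which the notebook imports, gives a fuller per-label report when called with the two columns.

```python
from collections import Counter


def accuracy(pairs: list[tuple[str, str]]) -> float:
    """Share of pairs in which the predicted label equals the true label."""
    return sum(true == predicted for true, predicted in pairs) / len(pairs)


def confusion_counts(pairs: list[tuple[str, str]]) -> Counter:
    """Count how often each (true label, predicted label) combination occurs."""
    return Counter(pairs)


# Hypothetical (y_test, y_pred) pairs for illustration.
pairs = [
    ("machine learning", "machine learning"),
    ("machine learning", "natural language processing"),
    ("expert systems", "expert systems"),
    ("expert systems", "machine learning"),
]

score = accuracy(pairs)
counts = confusion_counts(pairs)
print(score)
for (true, predicted), count in counts.items():
    print(f"{true} -> {predicted}: {count}")
```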


### facebook/bart-large-mnli

In [64]:
results_bart = classify_all_datasets("facebook/bart-large-mnli", datasets)

Done classifiying all datasets


In [65]:
write_results_to_files("facebook/bart-large-mnli", results_bart, False)
results_bart.get("ai_in_is_types").get_df()

Done writing all results


Unnamed: 0,abstract,expert systems,machine learning,machine vision,natural language processing,robotics
0,Decision strategies in dynamic environments do...,0.318108,0.243883,0.157560,0.145677,0.134772
1,"In this paper, we explore how keyword ambiguit...",0.180206,0.408019,0.091432,0.172996,0.147347
2,As crowdsourced user-generated content becomes...,0.219335,0.222063,0.163376,0.132300,0.262927
3,The goal of a review article is to present the...,0.170091,0.474460,0.137656,0.108397,0.109397
4,We provide a background discussion of group su...,0.278235,0.268429,0.185017,0.190553,0.077766
...,...,...,...,...,...,...
78,Cognitive computing systems (CCS) are a new cl...,0.364837,0.190017,0.151184,0.117930,0.176032
79,A significant recent technological development...,0.268274,0.199704,0.177915,0.135982,0.218126
80,Changing demands in society and the limited ca...,0.197931,0.098834,0.114749,0.089272,0.499214
81,Robotics application has provided a fruitful c...,0.084645,0.053668,0.065209,0.050806,0.745672


### microsoft/deberta-large-mnli

In [66]:
results_deberta = classify_all_datasets("microsoft/deberta-large-mnli", datasets)

Done classifiying all datasets


In [18]:
write_results_to_files("microsoft/deberta-large-mnli", results_deberta, False)
results_deberta.get("ai_in_is_types").get_test_vs_pred()

Done writing all results


Unnamed: 0,abstract,y_test,y_pred
0,Decision strategies in dynamic environments do...,machine learning,machine learning
1,"In this paper, we explore how keyword ambiguit...",machine learning,machine learning
2,As crowdsourced user-generated content becomes...,machine learning,expert systems
3,The goal of a review article is to present the...,machine learning,machine learning
4,We provide a background discussion of group su...,machine learning,expert systems
...,...,...,...
78,Cognitive computing systems (CCS) are a new cl...,expert systems,expert systems
79,A significant recent technological development...,expert systems,natural language processing
80,Changing demands in society and the limited ca...,robotics,robotics
81,Robotics application has provided a fruitful c...,robotics,robotics


### navteca/nli-deberta-v3-large

In [68]:
results_deberta_v3_large = classify_all_datasets("navteca/nli-deberta-v3-large", datasets)

Done classifiying all datasets


In [69]:
write_results_to_files("navteca/nli-deberta-v3-large", results_deberta_v3_large, False)
results_deberta_v3_large.get("ai_in_is_types").get_test_vs_pred()

Done writing all results


Unnamed: 0,abstract,y_test,y_pred
0,Decision strategies in dynamic environments do...,machine learning,machine learning
1,"In this paper, we explore how keyword ambiguit...",machine learning,machine learning
2,As crowdsourced user-generated content becomes...,machine learning,expert systems
3,The goal of a review article is to present the...,machine learning,machine learning
4,We provide a background discussion of group su...,machine learning,expert systems
...,...,...,...
78,Cognitive computing systems (CCS) are a new cl...,expert systems,expert systems
79,A significant recent technological development...,expert systems,expert systems
80,Changing demands in society and the limited ca...,robotics,robotics
81,Robotics application has provided a fruitful c...,robotics,robotics
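
The `y_test` and `y_pred` columns can be compared directly to get a quick agreement figure. The following is a minimal sketch that uses only the nine rows displayed above for `navteca/nli-deberta-v3-large`; the elided rows are not included, so the number is not the accuracy on the full dataset. The helper `accuracy` is a hypothetical name and not part of the notebook's adapter classes.

```python
# (y_test, y_pred) pairs copied from the rows displayed above;
# rows hidden behind "..." are deliberately left out.
shown_rows = [
    ("machine learning", "machine learning"),  # row 0
    ("machine learning", "machine learning"),  # row 1
    ("machine learning", "expert systems"),    # row 2
    ("machine learning", "machine learning"),  # row 3
    ("machine learning", "expert systems"),    # row 4
    ("expert systems", "expert systems"),      # row 78
    ("expert systems", "expert systems"),      # row 79
    ("robotics", "robotics"),                  # row 80
    ("robotics", "robotics"),                  # row 81
]


def accuracy(pairs):
    """Return the fraction of (y_test, y_pred) pairs whose labels match."""
    if not pairs:
        raise ValueError("need at least one (y_test, y_pred) pair")
    return sum(truth == pred for truth, pred in pairs) / len(pairs)


shown_accuracy = accuracy(shown_rows)
print(f"agreement on the displayed rows: {shown_accuracy:.3f}")
```

The same function works on the full column pair, for example by passing `zip(df["y_test"], df["y_pred"])` converted to a list, assuming the frame returned by `get_test_vs_pred()` has those two columns as shown.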


### MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli
This is a DeBERTa-v3-large model fine-tuned on several NLI datasets (MNLI, FEVER-NLI, ANLI, LingNLI and WANLI, as the model name indicates).
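
NLI checkpoints are commonly used for zero-shot classification by pairing the text with one hypothesis per candidate label and normalizing the resulting entailment logits across the labels. The sketch below illustrates only that normalization step for the single-label case. The logits are made up for illustration, and it is an assumption that `classify_all_datasets` produces its scores this way.

```python
import math

LABELS = [
    "expert systems",
    "machine learning",
    "machine vision",
    "natural language processing",
    "robotics",
]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical entailment logits, one per candidate label (not real model output).
entailment_logits = [0.2, 3.1, -1.0, 1.5, -0.7]

# After normalization the scores sum to 1, one score per label.
scores = dict(zip(LABELS, softmax(entailment_logits)))
for label, score in scores.items():
    print(f"{label}: {score:.6f}")
```

Under this assumption each row of a score table sums to 1, so a high score for one label necessarily lowers the scores of the others.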

In [70]:
results_deberta_v3_multiple_nli = classify_all_datasets("MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli", datasets)

Done classifiying all datasets


In [71]:
write_results_to_files("MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli", results_deberta_v3_multiple_nli, False)
results_deberta_v3_multiple_nli.get("ai_in_is_types").get_df()

Done writing all results


Unnamed: 0,abstract,expert systems,machine learning,machine vision,natural language processing,robotics
0,Decision strategies in dynamic environments do...,0.014331,0.975825,0.002980,0.004462,0.002402
1,"In this paper, we explore how keyword ambiguit...",0.000409,0.749501,0.000277,0.249505,0.000307
2,As crowdsourced user-generated content becomes...,0.030451,0.841644,0.054114,0.042688,0.031103
3,The goal of a review article is to present the...,0.000407,0.998199,0.000270,0.000871,0.000254
4,We provide a background discussion of group su...,0.460630,0.172689,0.115432,0.164355,0.086893
...,...,...,...,...,...,...
78,Cognitive computing systems (CCS) are a new cl...,0.476004,0.465385,0.006932,0.002801,0.048878
79,A significant recent technological development...,0.190563,0.694969,0.012045,0.036492,0.065931
80,Changing demands in society and the limited ca...,0.000096,0.000217,0.000087,0.000095,0.999505
81,Robotics application has provided a fruitful c...,0.000161,0.000387,0.000128,0.005503,0.993820
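
A score table like the one above can be turned into a single predicted label per abstract by taking the highest-scoring column. This is a minimal sketch, assuming that the `y_pred` values shown earlier are derived this way; `predict_label` is a hypothetical helper, and the score rows are copied from rows 0, 4 and 80 of the table above.

```python
LABELS = [
    "expert systems",
    "machine learning",
    "machine vision",
    "natural language processing",
    "robotics",
]

# Scores copied from the table above, in the column order of LABELS.
score_rows = {
    0: [0.014331, 0.975825, 0.002980, 0.004462, 0.002402],
    4: [0.460630, 0.172689, 0.115432, 0.164355, 0.086893],
    80: [0.000096, 0.000217, 0.000087, 0.000095, 0.999505],
}


def predict_label(scores, labels=LABELS):
    """Return the label whose score is highest (argmax over the row)."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return labels[best_index]


predictions = {row: predict_label(scores) for row, scores in score_rows.items()}
for row, label in predictions.items():
    print(row, label)
```

With pandas, the equivalent over the whole frame would be `df[LABELS].idxmax(axis=1)`, assuming the label columns are named as in the table header.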


### ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli
This is a RoBERTa-large model fine-tuned on several NLI datasets (SNLI, MNLI, FEVER-NLI and ANLI rounds R1 to R3, as the model name indicates).

In [72]:
results_roberta_multiple_nli = classify_all_datasets("ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli", datasets)

Done classifiying all datasets


In [73]:
write_results_to_files("ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli", results_roberta_multiple_nli, False)
results_roberta_multiple_nli.get("ai_in_is_types").get_df()

Done writing all results


Unnamed: 0,abstract,expert systems,machine learning,machine vision,natural language processing,robotics
0,Decision strategies in dynamic environments do...,0.180805,0.391275,0.113873,0.205523,0.108524
1,"In this paper, we explore how keyword ambiguit...",0.054871,0.562287,0.001617,0.380366,0.000859
2,As crowdsourced user-generated content becomes...,0.197237,0.444590,0.107506,0.179597,0.071069
3,The goal of a review article is to present the...,0.002689,0.990853,0.001616,0.003928,0.000913
4,We provide a background discussion of group su...,0.231693,0.209247,0.143200,0.285426,0.130433
...,...,...,...,...,...,...
78,Cognitive computing systems (CCS) are a new cl...,0.214624,0.407225,0.108701,0.145379,0.124071
79,A significant recent technological development...,0.214967,0.347290,0.120095,0.180028,0.137620
80,Changing demands in society and the limited ca...,0.000271,0.000099,0.000131,0.000259,0.999239
81,Robotics application has provided a fruitful c...,0.000105,0.000055,0.000073,0.000818,0.998949


### HiTZ/A2T_RoBERTa_SMFA_TACRED-re

In [74]:
results_a2t_roberta = classify_all_datasets("HiTZ/A2T_RoBERTa_SMFA_TACRED-re", datasets)

Done classifiying all datasets


In [75]:
write_results_to_files("HiTZ/A2T_RoBERTa_SMFA_TACRED-re", results_a2t_roberta, False)
results_a2t_roberta.get("ai_in_is_types").get_df()

Done writing all results


Unnamed: 0,abstract,expert systems,machine learning,machine vision,natural language processing,robotics
0,Decision strategies in dynamic environments do...,0.228389,0.294942,0.137782,0.216315,0.122572
1,"In this paper, we explore how keyword ambiguit...",0.077280,0.614546,0.058503,0.212142,0.037529
2,As crowdsourced user-generated content becomes...,0.208278,0.499715,0.101939,0.127396,0.062671
3,The goal of a review article is to present the...,0.137962,0.531959,0.106012,0.154173,0.069894
4,We provide a background discussion of group su...,0.276580,0.191234,0.110175,0.311802,0.110209
...,...,...,...,...,...,...
78,Cognitive computing systems (CCS) are a new cl...,0.266770,0.192379,0.127241,0.236437,0.177173
79,A significant recent technological development...,0.281129,0.194929,0.136427,0.243570,0.143945
80,Changing demands in society and the limited ca...,0.090107,0.045991,0.042145,0.084307,0.737450
81,Robotics application has provided a fruitful c...,0.173011,0.091570,0.082711,0.281083,0.371625


### roberta-large-mnli

In [33]:
roberta = classify_all_datasets("roberta-large-mnli", datasets)

Done classifiying all datasets


In [34]:
write_results_to_files("roberta-large-mnli", roberta, False)

Done writing all results


### microsoft/deberta-v2-xlarge-mnli

In [76]:
deberta_xlarge_mnli = classify_all_datasets("microsoft/deberta-v2-xlarge-mnli", datasets)

Done classifiying all datasets


In [77]:
write_results_to_files("microsoft/deberta-v2-xlarge-mnli", deberta_xlarge_mnli, False)
deberta_xlarge_mnli.get("ai_in_is_types").get_df()

Done writing all results


Unnamed: 0,abstract,expert systems,machine learning,machine vision,natural language processing,robotics
0,Decision strategies in dynamic environments do...,0.210652,0.541127,0.080211,0.087786,0.080225
1,"In this paper, we explore how keyword ambiguit...",0.111620,0.531525,0.032564,0.281008,0.043283
2,As crowdsourced user-generated content becomes...,0.321236,0.227839,0.153520,0.196164,0.101241
3,The goal of a review article is to present the...,0.174215,0.331188,0.137291,0.204729,0.152577
4,We provide a background discussion of group su...,0.273100,0.255145,0.127467,0.177173,0.167115
...,...,...,...,...,...,...
78,Cognitive computing systems (CCS) are a new cl...,0.227443,0.318585,0.153937,0.231605,0.068430
79,A significant recent technological development...,0.218063,0.363940,0.162866,0.160425,0.094706
80,Changing demands in society and the limited ca...,0.047352,0.049484,0.038652,0.048892,0.815619
81,Robotics application has provided a fruitful c...,0.042440,0.040224,0.025584,0.045936,0.845815


### microsoft/deberta-v2-xxlarge-mnli

In [7]:
deberta_xxlarge_mnli = classify_all_datasets("microsoft/deberta-v2-xxlarge-mnli", datasets)

Done classifiying all datasets


In [8]:
write_results_to_files("microsoft/deberta-v2-xxlarge-mnli", deberta_xxlarge_mnli, False)
deberta_xxlarge_mnli.get("ai_in_is_types").get_df()

Done writing all results


Unnamed: 0,abstract,expert systems,machine learning,machine vision,natural language processing,robotics
0,Decision strategies in dynamic environments do...,0.212600,0.592036,0.051081,0.071838,0.072445
1,"In this paper, we explore how keyword ambiguit...",0.094339,0.612682,0.030935,0.227502,0.034543
2,As crowdsourced user-generated content becomes...,0.276758,0.397463,0.092163,0.143439,0.090177
3,The goal of a review article is to present the...,0.110784,0.654980,0.063850,0.109248,0.061137
4,We provide a background discussion of group su...,0.405864,0.178150,0.110076,0.142716,0.163194
...,...,...,...,...,...,...
78,Cognitive computing systems (CCS) are a new cl...,0.541924,0.153522,0.094765,0.119251,0.090539
79,A significant recent technological development...,0.183962,0.345094,0.104869,0.153358,0.212718
80,Changing demands in society and the limited ca...,0.068116,0.036214,0.028929,0.036757,0.829984
81,Robotics application has provided a fruitful c...,0.027505,0.034943,0.020999,0.028267,0.888286


### scibert-mnli

In [9]:
scibert_mnli = classify_all_datasets("/home/jovyan/work/models/allenai/scibert_scivocab_uncased/mnli", datasets)

Done classifiying all datasets


In [10]:
write_results_to_files("allenai/scibert_scivocab_uncased-mnli", scibert_mnli, False)
scibert_mnli.get("ai_in_is_types").get_df()

Done writing all results


Unnamed: 0,abstract,expert systems,machine learning,machine vision,natural language processing,robotics
0,Decision strategies in dynamic environments do...,0.177797,0.239959,0.201675,0.182584,0.197985
1,"In this paper, we explore how keyword ambiguit...",0.085988,0.510749,0.098612,0.180279,0.124372
2,As crowdsourced user-generated content becomes...,0.065110,0.755211,0.058485,0.052027,0.069166
3,The goal of a review article is to present the...,0.101933,0.555004,0.136140,0.106693,0.100229
4,We provide a background discussion of group su...,0.699392,0.061473,0.071530,0.066179,0.101425
...,...,...,...,...,...,...
78,Cognitive computing systems (CCS) are a new cl...,0.177310,0.195647,0.183347,0.226458,0.217238
79,A significant recent technological development...,0.270454,0.259545,0.126717,0.110870,0.232414
80,Changing demands in society and the limited ca...,0.039095,0.137429,0.089432,0.070536,0.663508
81,Robotics application has provided a fruitful c...,0.055840,0.079833,0.121326,0.076021,0.666980


### SSCI-SciBERT-e4-mnli

In [11]:
ssci_scibert_mnli = classify_all_datasets("/home/jovyan/work/models/KM4STfulltext/SSCI-SciBERT-e4/mnli", datasets)

Done classifiying all datasets


In [12]:
write_results_to_files("KM4STfulltext/SSCI-SciBERT-e4-mnli", ssci_scibert_mnli, False)
ssci_scibert_mnli.get("ai_in_is_types").get_df()

Done writing all results


Unnamed: 0,abstract,expert systems,machine learning,machine vision,natural language processing,robotics
0,Decision strategies in dynamic environments do...,0.207216,0.196519,0.202526,0.188978,0.204760
1,"In this paper, we explore how keyword ambiguit...",0.176081,0.309032,0.181722,0.228957,0.104208
2,As crowdsourced user-generated content becomes...,0.176551,0.262218,0.204926,0.191258,0.165047
3,The goal of a review article is to present the...,0.192120,0.211258,0.192309,0.212388,0.191925
4,We provide a background discussion of group su...,0.280801,0.195622,0.180832,0.193128,0.149618
...,...,...,...,...,...,...
78,Cognitive computing systems (CCS) are a new cl...,0.210971,0.195710,0.211619,0.195195,0.186505
79,A significant recent technological development...,0.300224,0.235931,0.174174,0.161995,0.127676
80,Changing demands in society and the limited ca...,0.038930,0.052027,0.079528,0.054796,0.774719
81,Robotics application has provided a fruitful c...,0.188384,0.177914,0.152097,0.187599,0.294006
