code-kern-ai · jhoetter · May 7, 2022 · May 7, 2022
diff --git a/.gitignore b/.gitignore
@@ -1,3 +1,5 @@
+.vscode/
+
 # Jupyter
 *.ipynb
 

diff --git a/README.md b/README.md
@@ -1,153 +1,66 @@
-[![Python 3.8](https://img.shields.io/badge/python-3.8-blue.svg)](https://www.python.org/downloads/release/python-380/)
+![kern-python](https://uploads-ssl.webflow.com/61e47fafb12bd56b40022a49/62766400bd3c57b579d289bf_kern-python%20Banner.png)
+[![Python 3.9](https://img.shields.io/badge/python-3.9-blue.svg)](https://www.python.org/downloads/release/python-390/)
 
-# onetask API for Python
+# Kern AI API for Python
 
-This is the official Python SDK for onetask, your IDE for programmatic data labeling.
+This is the official Python SDK for Kern AI, your IDE for programmatic data enrichment and management.
 
 ## Installation
 
-You can use pip to install the library:
-
-`$ pip install onetask`
-
-Alternatively, you can clone the repository and run the setup.py script:
-
-`$ python setup.py install`
+You can set up this library either by running `$ pip install kern-python-client`, or by cloning this repository and running `$ pip install -r requirements.txt` inside it.
 
 ## Usage
-
-The SDK currently offers the following functions:
-- registering local Python functions as labeling functions in our system (you can of course also develop such functions within our web system)
-- _experimental_: generating embeddings for your attributes, e.g. free texts or structured data containing categories and numbers
-- _experimental_: autogenerating labeling functions from manually labeled records in your project
-- _experimental_: topic modeling using BERT embeddings that you have registered in your project
-
-All of this is also documented in the [onetask Documentation](https://onetask.readme.io/reference/getting-started), with additional screenshots to guide you through the process.
-
-
-### Instantiating a Client object
-
-You begin by creating a `Client` object. The `Client` will generate and store a session token for you based on your user name, password, and project id. The project id can be found in the URL, e.g. https://app.beta.onetask.ai/app/projects/**03f7d82c-f14c-4f0f-a1ff-59533bab30cc**/overview. Simply copy and paste this into the following pattern:
+Once you have installed the package, you can access the application from any Python terminal as follows:
 
 ```python
-from onetask import Client
+from kern import Client
 
 username = "your-username"
 password = "your-password"
-project_id = "your-project-id"
-stage = "beta" # if you have onetask on local, you can also set stage to "local"
-client = Client(username, password, project_id, stage)
-```
-
-Once you correctly instantiated your Client, you can start using it for the various functions provided in the SDK. 
+project_id = "your-project-id" # can be found in the URL of the web application
 
-
-### Registering local Python labeling functions
-
-You can register functions e.g. from your local Jupyter Notebook using our SDK. When doing so, please always ensure that your labeling functions:
-- return label names that also exist in your project definition
-- have exactly one parameter; we execute labeling functions on a record-basis
-- If you need an import statement in your labeling functions, please check if it is given in the [whitelisted libraries](https://onetask.readme.io/reference/whitelisted-libraries). If you need a library that we have not yet whitelisted, feel free to reach out to us.
-
-An example to register your custom labeling function is as follows:
-```python
-def my_labeling_function(record):
-  """
-  Detect a list of values in the records that tend to occur in urgent messages.
-  """
-  keywords = ["asap", "as soon as possible", "urgent"]
-
-  message_lower = record["message"].lower()
-  for keyword in keywords:
-    if keyword in message_lower:
-      return "Urgent"
+client = Client(username, password, project_id)
+# if you run the application locally, please use the following instead:
+# client = Client(username, password, project_id, uri="http://localhost:4455")
 ```
 
-You can then enter them using the client:
-
+Now, you can easily fetch the data from your project:
 ```python
-client.register_lf(my_labeling_function)
+df = client.fetch_export()
 ```
 
-The labeling function is then automatically executed once registered, where you can always change and re-run it.
+The returned `df` follows this schema:
+- all your record attributes are stored as columns, e.g. `headline` or `running_id` if you uploaded records like `{"headline": "some text", "running_id": 1234}`
+- three columns per labeling task:
+  - `<attribute_name|None>__<labeling_task_name>__MANUAL`: the manually set labels of your records
+  - `<attribute_name|None>__<labeling_task_name>__WEAK SUPERVISION`: the weakly supervised labels of your records
+  - `<attribute_name|None>__<labeling_task_name>__WEAK SUPERVISION_confidence`: the probabilities of your weakly supervised labels
 
-### Generating embeddings (experimental)
+With the `client`, you can easily integrate your data into any kind of system, be it a custom implementation, an AutoML system or a plain data analytics framework 🚀
 
-One of the main features of onetask is to apply both Weak Supervision and Active Learning jointly. To build the best possible Active Learning Weak Sources, you can generate embeddings for your attributes using the SDK. To do so, you have to first upload your data in our web application and select a unique attribute (see our [documentation](https://onetask.readme.io/reference/create-your-project) for further reference on how to set this up).
+## Roadmap
+- [ ] Register information sources via wrappers
+- [ ] Fetch project statistics
 
-Once this is done, you can easily generate embedding files. Imagine you have the following attributes in your records:
-- `headline`: an english text describing e.g. the news of a paper (e.g. _"5 footballers that should have won the ballon d'or"_, ...)
-- `running_id`: a unique identifier for each headline, i.e. a simple number (e.g. 1, 2, 3, ...)
 
-You can then call the client object to generate an embedding file using a dictionary of attribute/configuration string pairs:
-```python
-client.generate_embeddings({"headline": "distilbert-base-uncased"})
-```
-
-This will generate an embedding JSON-file as follows:
-
-```json
-[
-  {
-    "running_id": 1,
-    "distilbert-base-uncased": [0.123456789, "..."]
-  },
-  {
-    "running_id": 2,
-    "distilbert-base-uncased": [0.234567891, "..."]
-  },
-]
-```
+If you want to have something added, feel free to open an [issue](https://github.com/code-kern-ai/kern-python/issues).
 
-You can upload this file to your project in the overview tab of your project.
+## Contributing
+Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are **greatly appreciated**.
 
-The following configuration strings are available to configure how your attributes are embedded:
-| Configuration String | Data Type                     | Explanation                                                                                  |
-|----------------------|-------------------------------|----------------------------------------------------------------------------------------------|
-| identity             | integer, float                | No transformation                                                                            |
-| onehot               | category (low-entropy string) | one-hot encodes attribute                                                                    |
-| bow                  | string                        | Bag of Words transformation                                                                  |
-| boc                  | string                        | Bag of Characters transformation                                                             |
-| _huggingface_        | string                        | Huggingface-based transformation. You can use any available [huggingface](https://huggingface.co/) configuration string |
+If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement".
+Don't forget to give the project a star! Thanks again!
 
-If you want to embed multiple attributes (which makes sense e.g. when you have structured data), you can provide multiple key/value pairs in your input dictionary. The resulting embeddings will be concatenated into one vector.
-
-
-### Autogenerating labeling functions (experimental)
-
-As you manually label data, onetask can help you to analyze both the explicitic and implicit data patterns. Our first approach for explicit pattern detection is to find regular expressions in free text attributes you provide. They are being mined using linguistic analysis, therefore you need to provide a spacy nlp object for the respective language of your free text.
-
-If you have an english free text, you can implement the mining as follows:
-```python
-import spacy 
-# you need to also download the en_core_web_sm file 
-# using $ python -m spacy download en_core_web_sm
-
-nlp = spacy.load("en_core_web_sm")
-lf_df = client.generate_regex_labeling_functions(nlp, "headline")
-```
-
-This creates a DataFrame containing mined regular expressions. You can display them in a convenient way:
-
-```python
-client.display_generated_labeling_functions(lf_df)
-```
-
-**Caution**: The quality and quantity of mined regular expressions heavily depends on how much data you have labeled and how diverse your dataset is. We have tested the feature on various datasets and found it to be very helpful. If you have problems autogenerating labeling functions in your project, contact us.
-
-
-### Topic Modeling using BERT embeddings (experimental)
-
-As onetask lets you put insights of explorative analysis into programmatic data labeling, we also provide topic modeling. We use the [BERTopic](https://github.com/MaartenGr/BERTopic) library for topic modeling, and provide an easy access to your projects data and embeddings. Once you uploaded BERT embeddings to your project (such that can be created using a huggingface configuration string), you can create a topic model:
-
-```python
-topic_model = client.model_topics("headline", "distilbert-base-uncased")
-```
+1. Fork the Project
+2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
+3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
+4. Push to the Branch (`git push origin feature/AmazingFeature`)
+5. Open a Pull Request
 
-The `topic_model` provides various methods to explore the different keywords and topics. You can also find further documentation [here](https://maartengr.github.io/BERTopic/api/bertopic.html).
+And please don't forget to leave a ⭐ if you like the work! 
 
-## Outlook and Feature Requests
-In the near future, we'll extend the Python SDK to include programmatic imports and exports, data access, and many more. If you have any requests, feel free to [contact us](https://www.onetask.ai/contact-us).
+## License
+Distributed under the MIT License. See LICENSE.txt for more information.
 
-## Support
-If you need help, feel free to join our Slack Community channel. It is currently only available via invitation.
+## Contact
+This library is developed and maintained by [kern.ai](https://github.com/code-kern-ai). If you want to provide us with feedback or have some questions, don't hesitate to contact us. We're super happy to help ✌️
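Review note on the export schema described in the README above: the three label columns per task can be told apart by splitting the column name on `__`. The following is a minimal sketch and not part of this change. `LabelColumn` and `parse_label_column` are hypothetical names, and the sketch assumes that attribute and task names never contain `__` and that the literal string `None` marks a task that is not bound to an attribute.

```python
from typing import NamedTuple, Optional

# Suffixes taken from the README's column schema.
LABEL_SUFFIXES = ("MANUAL", "WEAK SUPERVISION", "WEAK SUPERVISION_confidence")


class LabelColumn(NamedTuple):
    attribute: Optional[str]  # None if the task is not bound to an attribute
    task: str
    kind: str


def parse_label_column(column: str) -> Optional[LabelColumn]:
    """Return the parts of a label column name, or None for a plain attribute column."""
    parts = column.split("__")
    if len(parts) != 3 or parts[2] not in LABEL_SUFFIXES:
        return None
    attribute, task, kind = parts
    # Assumption: the literal string "None" stands for "no attribute".
    return LabelColumn(None if attribute == "None" else attribute, task, kind)
```

Applied to `df.columns`, this would separate plain record attributes (where the function returns `None`) from label columns, for example to select all `MANUAL` columns before training a model.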
diff --git a/kern/__init__.py b/kern/__init__.py
@@ -0,0 +1,48 @@
+# -*- coding: utf-8 -*-
+
+from wasabi import msg
+import pandas as pd
+from kern import authentication, api_calls, settings, exceptions
+from typing import Optional
+
+
+class Client:
+    """Client object which can be used to directly address the Kern AI API.
+
+    Args:
+        user_name (str): Your username for the application.
+        password (str): The respective password. Do not share this!
+        project_id (str): The ID of your project. This can be found in the URL of an active project.
+        uri (str, optional): Link to the host of the application. Defaults to "https://app.kern.ai".
+
+    Raises:
+        exceptions.UnauthorizedError: If your credentials are incorrect, an exception is raised.
+    """
+
+    def __init__(
+        self, user_name: str, password: str, project_id: str, uri="https://app.kern.ai"
+    ):
+        settings.set_base_uri(uri)
+        self.session_token = authentication.create_session_token(
+            user_name=user_name, password=password
+        )
+        if self.session_token is not None:
+            msg.good("Logged in to system.")
+        else:
+            msg.fail(f"Could not log in at {uri}. Please check username and password.")
+            raise exceptions.get_api_exception_class(401)
+        self.project_id = project_id
+
+    def fetch_export(self, num_samples: Optional[int] = None) -> pd.DataFrame:
+        """Collects the export data of your project (i.e. the same data you would get by exporting in the web app).
+
+        Args:
+            num_samples (Optional[int], optional): If set, only the first `num_samples` records are collected. Defaults to None.
+
+        Returns:
+            pd.DataFrame: DataFrame containing your record data. For more details, see https://docs.kern.ai
+        """
+        url = settings.get_export_url(self.project_id, num_samples=num_samples)
+        api_response = api_calls.get_request(url, self.session_token)
+        df = pd.read_json(api_response)
+        return df
diff --git a/kern/api_calls.py b/kern/api_calls.py
@@ -0,0 +1,53 @@
+# -*- coding: utf-8 -*-
+from json.decoder import JSONDecodeError
+import pkg_resources
+from kern import exceptions
+import requests
+from typing import Any, Dict
+
+try:
+    version = pkg_resources.get_distribution("kern-python-client").version
+except pkg_resources.DistributionNotFound:
+    version = "noversion"
+
+
+def post_request(url: str, body: Dict[str, Any], session_token: str) -> str:
+    headers = _build_headers(session_token)
+    response = requests.post(url=url, json=body, headers=headers)
+    return _handle_response(response)
+
+
+def get_request(url: str, session_token: str) -> str:
+    headers = _build_headers(session_token)
+    response = requests.get(url=url, headers=headers)
+    return _handle_response(response)
+
+
+def _build_headers(session_token: str) -> Dict[str, str]:
+    return {
+        "Content-Type": "application/json",
+        "User-Agent": f"python-sdk-{version}",
+        "Authorization": f"Bearer {session_token}",
+    }
+
+
+def _handle_response(response: requests.Response) -> Any:
+    status_code = response.status_code
+    if status_code == 200:
+        json_data = response.json()
+        return json_data
+    else:
+        try:
+            json_data = response.json()
+            error_code = json_data.get("error_code")
+            error_message = json_data.get("error_message")
+        except ValueError:  # json and simplejson decode errors both subclass ValueError
+            error_code = 500
+            error_message = "The server was unable to process the provided data."
+
+        exception = exceptions.get_api_exception_class(
+            status_code=status_code,
+            error_code=error_code,
+            error_message=error_message,
+        )
+        raise exception
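Review note on `_handle_response`: the error branch calls `.get` on whatever `response.json()` returns, which would fail if the server answered with a JSON list or a bare string. Below is a minimal sketch of a more tolerant extraction step. It is an illustration rather than the SDK's implementation: `extract_error` and `FakeResponse` are hypothetical names, and `FakeResponse` only mimics the two members of `requests.Response` used here.

```python
import json
from typing import Any, Optional, Tuple

FALLBACK_MESSAGE = "The server was unable to process the provided data."


class FakeResponse:
    """Hypothetical stand-in for requests.Response, for illustration only."""

    def __init__(self, status_code: int, text: str):
        self.status_code = status_code
        self.text = text

    def json(self) -> Any:
        return json.loads(self.text)


def extract_error(response: Any) -> Tuple[Optional[Any], Optional[str]]:
    """Return (error_code, error_message), falling back to 500 on unusable bodies."""
    try:
        payload = response.json()
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return 500, FALLBACK_MESSAGE
    if not isinstance(payload, dict):
        # Valid JSON, but not an object with error fields.
        return 500, FALLBACK_MESSAGE
    return payload.get("error_code"), payload.get("error_message")
```

The returned pair could then be passed to `exceptions.get_api_exception_class` together with the status code, as the diff already does.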
diff --git a/kern/authentication.py b/kern/authentication.py
@@ -0,0 +1,27 @@
+# -*- coding: utf-8 -*-
+from kern import settings
+import requests
+
+
+def create_session_token(user_name: str, password: str) -> str:
+    headers = {"Accept": "application/json"}
+    action_url = (
+        requests.get(settings.get_authentication_url(), headers=headers)
+        .json()
+        .get("ui")
+        .get("action")
+    )
+    session_token = (
+        requests.post(
+            action_url,
+            headers=headers,
+            json={
+                "method": "password",
+                "password": password,
+                "password_identifier": user_name,
+            },
+        )
+        .json()
+        .get("session_token")
+    )
+    return session_token
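Review note on `create_session_token`: the chained `.get("ui").get("action")` raises an `AttributeError` when the first response has no `ui` object, so the user never reaches the friendlier "Could not log in" message in `Client.__init__`. A minimal sketch of a defensive lookup follows. The helper names are hypothetical, and the payload shapes are assumed only from the keys used in the diff (`ui`, `action`, `session_token`).

```python
from typing import Any, Dict, Optional


def extract_action_url(flow: Dict[str, Any]) -> Optional[str]:
    """Return flow["ui"]["action"] if present and a string, else None."""
    ui = flow.get("ui")
    if not isinstance(ui, dict):
        return None
    action = ui.get("action")
    return action if isinstance(action, str) else None


def extract_session_token(login_result: Dict[str, Any]) -> Optional[str]:
    """Return the session token if present and a string, else None."""
    token = login_result.get("session_token")
    return token if isinstance(token, str) else None
```

With helpers like these, a missing `ui` block would yield `None`, which the existing `if self.session_token is not None` check in `Client.__init__` could turn into the intended login error.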
diff --git a/onetask/exceptions.py → kern/exceptions.py b/onetask/exceptions.py → kern/exceptions.py
@@ -1,45 +1,44 @@
 # -*- coding: utf-8 -*-
 from typing import Optional
 
-
-class ClientError(Exception):
+# https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#client_error_responses
+class SDKError(Exception):
     def __init__(self, message: Optional[str] = None):
         if message is None:
-            message = "Please check the documentation."
+            message = (
+                "Please check the SDK documentation at https://docs.kern.ai/reference."
+            )
         super().__init__(message)
 
 
-class ParameterError(ClientError):
+# 401 Unauthorized
+class UnauthorizedError(SDKError):
     pass
 
 
-# https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#client_error_responses
-class APIError(Exception):
-    def __init__(self, message: Optional[str] = None):
-        if message is None:
-            message = "Please check the API reference."
-        super().__init__(message)
-
-
-# 401 Unauthorized
-class UnauthorizedError(APIError):
+# 404 Not Found
+class NotFoundError(SDKError):
     pass
 
 
 # 500 Server Error
-class InternalServerError(APIError):
+class InternalServerError(SDKError):
     pass
 
 
-RESPONSE_CODES_API_EXCEPTION_MAP = {401: UnauthorizedError, 500: InternalServerError}
+RESPONSE_CODES_API_EXCEPTION_MAP = {
+    401: UnauthorizedError,
+    404: NotFoundError,
+    500: InternalServerError,
+}
 
 
 def get_api_exception_class(
     status_code: int,
     error_code: Optional[str] = None,
     error_message: Optional[str] = None,
-) -> APIError:
-    exception_or_dict = RESPONSE_CODES_API_EXCEPTION_MAP.get(status_code, APIError)
+) -> SDKError:
+    exception_or_dict = RESPONSE_CODES_API_EXCEPTION_MAP.get(status_code, SDKError)
     if isinstance(exception_or_dict, dict):
         exception_class = exception_or_dict.get(error_code, exception_or_dict["*"])
     else: