Merged
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,3 +1,5 @@
.vscode/

# Jupyter
*.ipynb

163 changes: 38 additions & 125 deletions README.md
@@ -1,153 +1,66 @@
[![Python 3.8](https://img.shields.io/badge/python-3.8-blue.svg)](https://www.python.org/downloads/release/python-380/)
![kern-python](https://uploads-ssl.webflow.com/61e47fafb12bd56b40022a49/62766400bd3c57b579d289bf_kern-python%20Banner.png)
[![Python 3.9](https://img.shields.io/badge/python-3.9-blue.svg)](https://www.python.org/downloads/release/python-390/)

# onetask API for Python
# Kern AI API for Python

This is the official Python SDK for onetask, your IDE for programmatic data labeling.
This is the official Python SDK for Kern AI, your IDE for programmatic data enrichment and management.

## Installation

You can use pip to install the library:

`$ pip install onetask`

Alternatively, you can clone the repository and run the setup.py script:

`$ python setup.py install`
You can set up this library either by running `$ pip install kern-python-client`, or by cloning this repository and running `$ pip install -r requirements.txt` inside it.

## Usage

The SDK currently offers the following functions:
- registering local Python functions as labeling functions in our system (you can of course also develop such functions within our web system)
- _experimental_: generating embeddings for your attributes, e.g. free texts or structured data containing categories and numbers
- _experimental_: autogenerating labeling functions from manually labeled records in your project
- _experimental_: topic modeling using BERT embeddings that you have registered in your project

All of this is also documented in the [onetask Documentation](https://onetask.readme.io/reference/getting-started), with additional screenshots to guide you through the process.


### Instantiating a Client object

You begin by creating a `Client` object. The `Client` will generate and store a session token for you based on your user name, password, and project id. The project id can be found in the URL, e.g. https://app.beta.onetask.ai/app/projects/**03f7d82c-f14c-4f0f-a1ff-59533bab30cc**/overview. Simply copy and paste this into the following pattern:
Once you have installed the package, you can access the application from any Python terminal as follows:

```python
from onetask import Client
from kern import Client

username = "your-username"
password = "your-password"
project_id = "your-project-id"
stage = "beta" # if you have onetask on local, you can also set stage to "local"
client = Client(username, password, project_id, stage)
```

Once you correctly instantiated your Client, you can start using it for the various functions provided in the SDK.
project_id = "your-project-id" # can be found in the URL of the web application


### Registering local Python labeling functions

You can register functions e.g. from your local Jupyter Notebook using our SDK. When doing so, please always ensure that your labeling functions:
- return label names that also exist in your project definition
- have exactly one parameter; we execute labeling functions on a record-basis
- If you need an import statement in your labeling functions, please check if it is given in the [whitelisted libraries](https://onetask.readme.io/reference/whitelisted-libraries). If you need a library that we have not yet whitelisted, feel free to reach out to us.

An example to register your custom labeling function is as follows:
```python
def my_labeling_function(record):
    """
    Detect a list of values in the records that tend to occur in urgent messages.
    """
    keywords = ["asap", "as soon as possible", "urgent"]

    message_lower = record["message"].lower()
    for keyword in keywords:
        if keyword in message_lower:
            return "Urgent"
client = Client(username, password, project_id)
# if you run the application locally, please use the following instead:
# client = Client(username, password, project_id, uri="http://localhost:4455")
```

You can then enter them using the client:

Now, you can easily fetch the data from your project:
```python
client.register_lf(my_labeling_function)
df = client.fetch_export()
```

The labeling function is then automatically executed once registered, where you can always change and re-run it.
The `df` follows this scheme:
- all your record attributes are stored as columns, e.g. `headline` or `running_id` if you uploaded records like `{"headline": "some text", "running_id": 1234}`
- per labeling task, three columns:
  - `<attribute_name|None>__<labeling_task_name>__MANUAL`: the manually set labels of your records
  - `<attribute_name|None>__<labeling_task_name>__WEAK SUPERVISION`: the weakly supervised labels of your records
  - `<attribute_name|None>__<labeling_task_name>__WEAK SUPERVISION_confidence`: the probabilities of your weakly supervised labels
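The column naming scheme above can be taken apart programmatically. The following is a minimal sketch, not part of the SDK: `parse_label_column` and `LabelColumn` are hypothetical helpers, `sentiment` is a made-up labeling task name, and the code assumes that attribute and task names never contain a double underscore and that record-level tasks use the literal prefix `None`.

```python
from typing import NamedTuple, Optional

# The three column suffixes described in the scheme above.
SOURCES = ("MANUAL", "WEAK SUPERVISION", "WEAK SUPERVISION_confidence")


class LabelColumn(NamedTuple):
    attribute: Optional[str]  # None for tasks that are not bound to an attribute
    task: str
    source: str


def parse_label_column(column: str) -> Optional[LabelColumn]:
    """Split an export column name into its parts.

    Returns None for plain record attributes such as `headline`.
    Assumption: names themselves never contain "__".
    """
    parts = column.split("__")
    if len(parts) != 3 or parts[2] not in SOURCES:
        return None
    attribute, task, source = parts
    return LabelColumn(None if attribute == "None" else attribute, task, source)


# Hypothetical column names, as they might appear in `df.columns`.
columns = [
    "headline",
    "running_id",
    "None__sentiment__MANUAL",
    "None__sentiment__WEAK SUPERVISION",
    "None__sentiment__WEAK SUPERVISION_confidence",
]
label_columns = {column: parse_label_column(column) for column in columns}
```

With a real export, the same helper could be applied to `df.columns` to separate record attributes from label columns before passing the data on.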

### Generating embeddings (experimental)
With the `client`, you can easily integrate your data into any kind of system, be it a custom implementation, an AutoML system, or a plain data analytics framework 🚀

One of the main features of onetask is to apply both Weak Supervision and Active Learning jointly. To build the best possible Active Learning Weak Sources, you can generate embeddings for your attributes using the SDK. To do so, you have to first upload your data in our web application and select a unique attribute (see our [documentation](https://onetask.readme.io/reference/create-your-project) for further reference on how to set this up).
## Roadmap
- [ ] Register information sources via wrappers
- [ ] Fetch project statistics

Once this is done, you can easily generate embedding files. Imagine you have the following attributes in your records:
- `headline`: an english text describing e.g. the news of a paper (e.g. _"5 footballers that should have won the ballon d'or"_, ...)
- `running_id`: a unique identifier for each headline, i.e. a simple number (e.g. 1, 2, 3, ...)

You can then call the client object to generate an embedding file using a dictionary of attribute/configuration string pairs:
```python
client.generate_embeddings({"headline": "distilbert-base-uncased"})
```

This will generate an embedding JSON-file as follows:

```json
[
{
"running_id": 1,
"distilbert-base-uncased": [0.123456789, "..."]
},
{
"running_id": 2,
"distilbert-base-uncased": [0.234567891, "..."]
},
]
```
If you want to have something added, feel free to open an [issue](https://github.com/code-kern-ai/kern-python/issues).

You can upload this file to your project in the overview tab of your project.
## Contributing
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are **greatly appreciated**.

The following configuration strings are available to configure how your attributes are embedded:
| Configuration String | Data Type | Explanation |
|----------------------|-------------------------------|----------------------------------------------------------------------------------------------|
| identity | integer, float | No transformation |
| onehot | category (low-entropy string) | one-hot encodes attribute |
| bow | string | Bag of Words transformation |
| boc | string | Bag of Characters transformation |
| _huggingface_ | string | Huggingface-based transformation. You can use any available [huggingface](https://huggingface.co/) configuration string |
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement".
Don't forget to give the project a star! Thanks again!

If you want to embed multiple attributes (which makes sense e.g. when you have structured data), you can provide multiple key/value pairs in your input dictionary. The resulting embeddings will be concatenated into one vector.


### Autogenerating labeling functions (experimental)

As you manually label data, onetask can help you analyze both the explicit and implicit data patterns. Our first approach for explicit pattern detection is to find regular expressions in the free text attributes you provide. They are mined using linguistic analysis, so you need to provide a spaCy nlp object for the respective language of your free text.

If you have an english free text, you can implement the mining as follows:
```python
import spacy
# you need to also download the en_core_web_sm file
# using $ python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")
lf_df = client.generate_regex_labeling_functions(nlp, "headline")
```

This creates a DataFrame containing mined regular expressions. You can display them in a convenient way:

```python
client.display_generated_labeling_functions(lf_df)
```

**Caution**: The quality and quantity of mined regular expressions heavily depends on how much data you have labeled and how diverse your dataset is. We have tested the feature on various datasets and found it to be very helpful. If you have problems autogenerating labeling functions in your project, contact us.


### Topic Modeling using BERT embeddings (experimental)

As onetask lets you put insights of explorative analysis into programmatic data labeling, we also provide topic modeling. We use the [BERTopic](https://github.com/MaartenGr/BERTopic) library for topic modeling, and provide easy access to your project's data and embeddings. Once you have uploaded BERT embeddings to your project (which can be created using a huggingface configuration string), you can create a topic model:

```python
topic_model = client.model_topics("headline", "distilbert-base-uncased")
```
1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

The `topic_model` provides various methods to explore the different keywords and topics. You can also find further documentation [here](https://maartengr.github.io/BERTopic/api/bertopic.html).
And please don't forget to leave a ⭐ if you like the work!

## Outlook and Feature Requests
In the near future, we'll extend the Python SDK to include programmatic imports and exports, data access, and many more. If you have any requests, feel free to [contact us](https://www.onetask.ai/contact-us).
## License
Distributed under the MIT License. See LICENSE.txt for more information.

## Support
If you need help, feel free to join our Slack Community channel. It is currently only available via invitation.
## Contact
This library is developed and maintained by [kern.ai](https://github.com/code-kern-ai). If you want to provide us with feedback or have some questions, don't hesitate to contact us. We're super happy to help ✌️
48 changes: 48 additions & 0 deletions kern/__init__.py
@@ -0,0 +1,48 @@
# -*- coding: utf-8 -*-

from wasabi import msg
import pandas as pd
from kern import authentication, api_calls, settings, exceptions
from typing import Optional


class Client:
    """Client object which can be used to directly address the Kern AI API.

    Args:
        user_name (str): Your username for the application.
        password (str): The respective password. Do not share this!
        project_id (str): The link to your project. This can be found in the URL in an active project.
        uri (str, optional): Link to the host of the application. Defaults to "https://app.kern.ai".

    Raises:
        exceptions.get_api_exception_class: If your credentials are incorrect, an exception is raised.
    """

    def __init__(
        self, user_name: str, password: str, project_id: str, uri="https://app.kern.ai"
    ):
        settings.set_base_uri(uri)
        self.session_token = authentication.create_session_token(
            user_name=user_name, password=password
        )
        if self.session_token is not None:
            msg.good("Logged in to system.")
        else:
            msg.fail(f"Could not log in at {uri}. Please check username and password.")
            raise exceptions.get_api_exception_class(401)
        self.project_id = project_id

    def fetch_export(self, num_samples: Optional[int] = None) -> pd.DataFrame:
        """Collects the export data of your project (i.e. the same data if you would export in the web app).

        Args:
            num_samples (Optional[int], optional): If set, only the first `num_samples` records are collected. Defaults to None.

        Returns:
            pd.DataFrame: DataFrame containing your record data. For more details, see https://docs.kern.ai
        """
        url = settings.get_export_url(self.project_id, num_samples=num_samples)
        api_response = api_calls.get_request(url, self.session_token)
        df = pd.read_json(api_response)
        return df
53 changes: 53 additions & 0 deletions kern/api_calls.py
@@ -0,0 +1,53 @@
# -*- coding: utf-8 -*-
from json.decoder import JSONDecodeError
import pkg_resources
from kern import exceptions
import requests
from typing import Any, Dict

try:
    version = pkg_resources.get_distribution("kern-python-client").version
except pkg_resources.DistributionNotFound:
    version = "noversion"


def post_request(url: str, body: Dict[str, Any], session_token: str) -> str:
    headers = _build_headers(session_token)
    response = requests.post(url=url, json=body, headers=headers)
    return _handle_response(response)


def get_request(url: str, session_token: str) -> str:
    headers = _build_headers(session_token)
    response = requests.get(url=url, headers=headers)
    return _handle_response(response)


def _build_headers(session_token: str) -> Dict[str, str]:
    return {
        "Content-Type": "application/json",
        "User-Agent": f"python-sdk-{version}",
        "Authorization": f"Bearer {session_token}",
    }


def _handle_response(response: requests.Response) -> str:
    status_code = response.status_code
    if status_code == 200:
        json_data = response.json()
        return json_data
    else:
        try:
            json_data = response.json()
            error_code = json_data.get("error_code")
            error_message = json_data.get("error_message")
        except JSONDecodeError:
            error_code = 500
            error_message = "The server was unable to process the provided data."

        exception = exceptions.get_api_exception_class(
            status_code=status_code,
            error_code=error_code,
            error_message=error_message,
        )
        raise exception
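The error path of `_handle_response` can be illustrated without a network call. The following is a minimal sketch under stated assumptions: `FakeResponse` and `handle` are hypothetical stand-ins, the exception classes are local copies that only mirror the names in `kern/exceptions.py`, and passing the error message straight to the exception constructor is an assumption, since the rest of `get_api_exception_class` is not shown in this diff.

```python
import json
from typing import Any


# Local stand-ins that mirror the class names in kern/exceptions.py.
class SDKError(Exception):
    pass


class UnauthorizedError(SDKError):
    pass


class NotFoundError(SDKError):
    pass


class InternalServerError(SDKError):
    pass


EXCEPTION_MAP = {401: UnauthorizedError, 404: NotFoundError, 500: InternalServerError}


class FakeResponse:
    """Hypothetical replacement for requests.Response, for illustration only."""

    def __init__(self, status_code: int, text: str):
        self.status_code = status_code
        self.text = text

    def json(self) -> Any:
        return json.loads(self.text)


def handle(response: FakeResponse) -> Any:
    """Return the parsed body on 200, otherwise raise a mapped exception."""
    if response.status_code == 200:
        return response.json()
    try:
        error_message = response.json().get("error_message")
    except json.JSONDecodeError:
        # The body was not valid JSON, so fall back to a generic message.
        error_message = "The server was unable to process the provided data."
    # Unknown status codes fall back to the base class.
    exception_class = EXCEPTION_MAP.get(response.status_code, SDKError)
    raise exception_class(error_message)
```

Because every mapped class derives from `SDKError`, callers could catch the base class for a blanket handler or a specific subclass such as `NotFoundError` for targeted recovery.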
27 changes: 27 additions & 0 deletions kern/authentication.py
@@ -0,0 +1,27 @@
# -*- coding: utf-8 -*-
from kern import settings
import requests


def create_session_token(user_name: str, password: str) -> str:
    headers = {"Accept": "application/json"}
    action_url = (
        requests.get(settings.get_authentication_url(), headers=headers)
        .json()
        .get("ui")
        .get("action")
    )
    session_token = (
        requests.post(
            action_url,
            headers=headers,
            json={
                "method": "password",
                "password": password,
                "password_identifier": user_name,
            },
        )
        .json()
        .get("session_token")
    )
    return session_token
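`create_session_token` performs a two-step login: it first fetches a login flow and reads `ui.action` from it, then posts the credentials to that action URL. The chained `.get("ui").get("action")` raises an `AttributeError` if the first response has no `ui` key. The following is a minimal sketch of a more defensive variant of the two pure steps; `extract_action_url` and `build_login_body` are hypothetical helpers, not part of this PR, and the example URL is made up.

```python
from typing import Any, Dict, Optional


def extract_action_url(flow: Dict[str, Any]) -> Optional[str]:
    """Return flow["ui"]["action"], or None if either key is missing."""
    ui = flow.get("ui")
    if not isinstance(ui, dict):
        return None
    return ui.get("action")


def build_login_body(user_name: str, password: str) -> Dict[str, str]:
    """Build the JSON body that is posted to the action URL."""
    return {
        "method": "password",
        "password": password,
        "password_identifier": user_name,
    }


# Hypothetical flow payloads: one well-formed, one without a "ui" key.
good_flow = {"ui": {"action": "https://example.invalid/login?flow=abc"}}
bad_flow = {"error": "flow expired"}

action_url = extract_action_url(good_flow)
missing_url = extract_action_url(bad_flow)
body = build_login_body("your-username", "your-password")
```

Returning `None` for a malformed flow would let the caller raise a meaningful error instead of an `AttributeError` from deep inside the call chain.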
35 changes: 17 additions & 18 deletions onetask/exceptions.py → kern/exceptions.py
@@ -1,45 +1,44 @@
# -*- coding: utf-8 -*-
from typing import Optional


class ClientError(Exception):
# https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#client_error_responses
class SDKError(Exception):
    def __init__(self, message: Optional[str] = None):
        if message is None:
            message = "Please check the documentation."
            message = (
                "Please check the SDK documentation at https://docs.kern.ai/reference."
            )
        super().__init__(message)


class ParameterError(ClientError):
# 401 Unauthorized
class UnauthorizedError(SDKError):
    pass


# https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#client_error_responses
class APIError(Exception):
    def __init__(self, message: Optional[str] = None):
        if message is None:
            message = "Please check the API reference."
        super().__init__(message)


# 401 Unauthorized
class UnauthorizedError(APIError):
# 404 Not Found
class NotFoundError(SDKError):
    pass


# 500 Server Error
class InternalServerError(APIError):
class InternalServerError(SDKError):
    pass


RESPONSE_CODES_API_EXCEPTION_MAP = {401: UnauthorizedError, 500: InternalServerError}
RESPONSE_CODES_API_EXCEPTION_MAP = {
    401: UnauthorizedError,
    404: NotFoundError,
    500: InternalServerError,
}


def get_api_exception_class(
    status_code: int,
    error_code: Optional[str] = None,
    error_message: Optional[str] = None,
) -> APIError:
    exception_or_dict = RESPONSE_CODES_API_EXCEPTION_MAP.get(status_code, APIError)
) -> SDKError:
    exception_or_dict = RESPONSE_CODES_API_EXCEPTION_MAP.get(status_code, SDKError)
    if isinstance(exception_or_dict, dict):
        exception_class = exception_or_dict.get(error_code, exception_or_dict["*"])
    else: