# Python SDK course

## 1. Introduction to Python SDK

- Introduction to Python SDK
- Documentation & GitHub repository
- Install using pip

**The Python Software Development Kit (SDK)**
The Cognite Python SDK requires Python 3.5+ and provides access to the Cognite Data Fusion API from applications written in the Python language. For detailed information, see the [Cognite Python SDK Documentation](https://docs.cognite.com/dev/guides/sdk/python/).

### Install the python-sdk package

In [None]:
! pip install cognite-sdk

## 2. Authentication

- Creating CogniteClient using different methods
  - Interactive login
  - Using Device code
  - Using client ID & client secret
- Checking the login status

You can authenticate the Python SDK with Azure AD by using a token retrieved when a user authenticates or with a static client secret for long-running jobs like extractors or calculations.

### Prerequisites

* Install the Microsoft Authentication Library (MSAL) for Python.

In [None]:
! pip install msal

* You need to specify the values for the following configuration parameters:
 * `Tenant ID` - the ID of the Azure AD tenant where the user is registered.
 * `Client ID` - the ID of the application in Azure AD.
 * `Cluster` - the cluster where your CDF project is installed, for example `api` or `westeurope-1`.
 * `CDF project` - the name of the CDF project.

If you don't know which values to use for these variables, contact your CDF administrator or Cognite Support.

You can set the values for these parameters directly here, or read them from environment variables or from a file.

In [None]:
TENANT_ID="**"
CLIENT_ID="**"
CDF_CLUSTER="api"
COGNITE_PROJECT="ds-basics" # Note : The code in this notebook mostly works with this project data
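
As an alternative to hard-coding the values, the same four parameters can be read from environment variables. The following is a minimal sketch: the helper `load_cdf_config` and the environment variable names are assumptions for illustration (they simply mirror the notebook variables above) and are not part of the SDK.

```python
import os


def load_cdf_config(env=None):
    """Return the connection parameters, read from environment variables.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    The variable names are hypothetical and mirror the notebook variables.
    """
    env = os.environ if env is None else env
    # TENANT_ID and CLIENT_ID have no sensible default, so fail early if absent
    missing = [key for key in ("TENANT_ID", "CLIENT_ID") if not env.get(key)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {
        "TENANT_ID": env["TENANT_ID"],
        "CLIENT_ID": env["CLIENT_ID"],
        # Fall back to the values used in this notebook when not set
        "CDF_CLUSTER": env.get("CDF_CLUSTER", "api"),
        "COGNITE_PROJECT": env.get("COGNITE_PROJECT", "ds-basics"),
    }
```

Calling `load_cdf_config()` without arguments reads from the real process environment; the returned dictionary can then be unpacked into the variables used in the rest of the notebook.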

Also set the following derived variables, which are used to obtain the token:

In [None]:
SCOPES = [f"https://{CDF_CLUSTER}.cognitedata.com/.default"]

AUTHORITY_HOST_URI = "https://login.microsoftonline.com"
AUTHORITY_URI = AUTHORITY_HOST_URI + "/" + TENANT_ID
PORT = 53000

TOKEN_URL = f"https://login.microsoftonline.com/{TENANT_ID}/oauth2/v2.0/token"

### Authenticate with user credentials
You can authenticate the Python SDK with Azure AD by using a token retrieved with user credentials.

#### Interactive Login
You can obtain the token by letting the user sign in interactively through a browser. Use this interactive login and token refresh flow to access CDF when you're running short-lived scripts or working in Jupyter.

In [None]:
from cognite.client import CogniteClient
from msal import PublicClientApplication

In [None]:
app = PublicClientApplication(client_id=CLIENT_ID, authority=AUTHORITY_URI)
creds = app.acquire_token_interactive(scopes=SCOPES, port=PORT)

In [None]:
client = CogniteClient(
    token_url=creds["id_token_claims"]["iss"],
    token=creds["access_token"],
    token_client_id=creds["id_token_claims"]["aud"],
    client_name="my-client-interactive",
    project=COGNITE_PROJECT)

#### Using Device Code (NOTE: this code currently gives errors)
If a browser is not available, for example when you are logged in to a terminal, you can authenticate with user credentials using the device code flow.

In [None]:
app = PublicClientApplication(client_id=CLIENT_ID, authority=AUTHORITY_URI)
device_flow = app.initiate_device_flow(scopes=SCOPES)
print(device_flow["message"])  # print device code to screen
creds = app.acquire_token_by_device_flow(flow=device_flow)

In [None]:
creds

In [None]:
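client = CogniteClient(
    token_url=creds["id_token_claims"]["iss"],
    token=creds["access_token"],
    token_client_id=creds["id_token_claims"]["aud"],
    client_name="my-client-device-code",  # identifies this client, as in the interactive example
    project=COGNITE_PROJECT)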

### Authenticate with client secret

In [None]:
import os
CLIENT_SECRET = os.getenv("CLIENT_SECRET")  # store secret in env variable

In [None]:
from getpass import getpass
CLIENT_SECRET = getpass("Enter the Client Secret: ")  # Enter the client secret interactively here

In [None]:
client = CogniteClient(
    token_url=TOKEN_URL,
    token_client_id=CLIENT_ID,
    token_client_secret=CLIENT_SECRET,
    token_scopes=SCOPES,
    project=COGNITE_PROJECT,
    base_url=f"https://{CDF_CLUSTER}.cognitedata.com",
    client_name="client_secret_test_script",
    debug=False,
)

### Check the login status

In [None]:
client.login.status()

Let's also alias the client as `c` for later reference:

In [None]:
c = client

### (Code note to be shown) Authenticate using API-key

In [None]:
# client_apikey = CogniteClient(api_key=getpass("Enter the Open Industrial Data API-KEY: "),
#                        project="publicdata", client_name="cdf_client_public_data", debug=False)

# c = client_apikey

## 3. List operation

**Code Pattern**
`client.<cdf_resource_type>.list()`
where **cdf_resource_type** = { data_sets, assets, time_series, events, files, labels, sequences, relationships ... }


### List the CDF Resource types

In [None]:
c.data_sets.list()

In [None]:
c.assets.list(limit=5)

In [None]:
c.time_series.list(limit=5)

In [None]:
c.events.list(limit=5)

In [None]:
c.files.list(limit=5)

Similar code for listing other resource types


```
c.labels.list()
```
```
c.sequences.list()
```
```
c.relationships.list()
```


There are no labels in the Publicdata project; you may need to create some dummy labels first.

In [None]:
c.labels.list(limit=5)

In [None]:
c.sequences.list(limit=5)

In [None]:
c.relationships.list(limit=5)

### Filter the list results

using label filter

In [None]:
from cognite.client.data_classes import LabelFilter

In [None]:
c.assets.list(labels=LabelFilter(contains_all=["EQUIPMENT_VALVE"]),limit=5)

In [None]:
c.assets.list(labels=LabelFilter(contains_any=["EQUIPMENT_PUMP", "EQUIPMENT_VALVE"]),limit=5)

using metadata filter

In [None]:
# First get some metadata keys to inspect
c.assets.list(limit=5).to_pandas()['metadata'][4]

In [None]:
# Get the assets list satisfying metadata filter
c.assets.list(metadata={'ELC_STATUS_ID': '1211'},limit=5)

Other filters

In [None]:
c.data_sets.list(write_protected=False)

In [None]:
c.labels.list(limit=5, name="EQUIPMENT_VALVE")

In [None]:
c.assets.list(root=True)

In [None]:
c.time_series.list(is_step=True,limit=5)

In [None]:
c.events.list(start_time={"max": 1500000000},limit=5)
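
Timestamps in filters such as `start_time` are expressed in milliseconds since the Unix epoch (the insert examples later in this notebook use the same unit). Below is a small sketch for converting between `datetime` objects and milliseconds. `to_ms` and `from_ms` are hypothetical helper names, and treating naive datetimes as UTC is an assumption of this sketch.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_ms(dt):
    """Convert a datetime to integer milliseconds since the Unix epoch."""
    if dt.tzinfo is None:
        # Assumption: naive datetimes are interpreted as UTC
        dt = dt.replace(tzinfo=timezone.utc)
    return (dt - EPOCH) // timedelta(milliseconds=1)


def from_ms(ms):
    """Convert milliseconds since the Unix epoch to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


print(to_ms(datetime(2018, 1, 1)))
print(from_ms(1500000000))
```

Read as milliseconds, the value `1500000000` used in the filter above is 18 January 1970, so a more recent cut-off would typically be built with something like `to_ms(datetime(2018, 1, 1))`.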

### Iterate over the list

In [None]:
for data_set in c.data_sets:
    print(data_set.name) # do something with the data_set

When the list is too big, use the **chunk_size** parameter to fetch it in chunks:

In [None]:
for data_set_list in c.data_sets(chunk_size=5):
    print([x.name for x in data_set_list]) # do something with the list
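
To illustrate what "getting the list in chunks" means, here is a plain-Python sketch of chunked iteration over any iterable. It is not the SDK's implementation; `chunked` is a hypothetical helper and the data set names are made up for the example.

```python
from itertools import islice


def chunked(items, chunk_size):
    """Yield successive lists of at most `chunk_size` items from `items`."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return  # the underlying iterator is exhausted
        yield chunk


# Made-up names standing in for a long list of resources
names = ["data-set-%d" % i for i in range(7)]
for chunk in chunked(names, chunk_size=3):
    print(chunk)  # process one chunk at a time
```

Processing one chunk at a time keeps memory use bounded, since only `chunk_size` items are held at once.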

## 4. Searching in CDF

- Search the CDF resource types
- Filter the search results

**Code Pattern**

`client.<cdf_resource_type>.search()`

### Fuzzy Search on one field

In [None]:
c.assets.search(name="23-HA-9114",limit=5)

In [None]:
c.time_series.search(name="VAL_23-PDT-92501",limit=5)

In [None]:
c.files.search(name=".pdf")

Similarly for sequences and events

`c.sequences.search(name="some name")`

`c.events.search(description="some description")`

### Multi-field fuzzy search

In [None]:
c.assets.search(query="Discharge Cooler",limit=5)

### Exact search on one field (e.g. name, description etc)

In [None]:
c.assets.search(filter={"name": "23-HA-9114"})

Get all time series for the above asset:

In [None]:
c.time_series.search(filter={"asset_ids":[5192617294065915]})

In [None]:
# First get an example type to filter on
example_type = c.events.list()[0].type
# Filter the events of that type
c.events.search(filter={"type":example_type})

### Multiple filters in search function

In [None]:
c.assets.list(limit=5)

In [None]:
c.assets.search(name="23-HA-9114",filter={"parent_ids": [2001559427541439]})

### Filter asset search using Label Filter

In [None]:
from cognite.client.data_classes import AssetFilter

In [None]:
c.assets.search(name="LER13",filter=AssetFilter(labels=LabelFilter(contains_all=["EQUIPMENT_VALVE"])),limit=5)

## 5. Retrieve the CDF resource types and data

### Code Pattern
`client.<cdf_resource_type>.retrieve()`

`client.<cdf_resource_type>.retrieve_multiple()`

### Retrieve the CDF resource types

single item - using id

In [None]:
# id of dataset
c.data_sets.retrieve(id=6847140037409299)

In [None]:
# id of timeseries
c.datapoints.retrieve(id=947391658441, start="2w-ago", end="now")

In [None]:
# Get an example sequence id
c.sequences.list()[0].id

In [None]:
# Get all rows for the sequence
c.sequences.data.retrieve(id=752497012302, start=0, end=None)

single item - using external_id

In [None]:
c.data_sets.retrieve(external_id = "AIR")

multiple items - using ids

In [None]:
c.data_sets.list()

In [None]:
c.data_sets.retrieve_multiple(ids=[784506703157529, 5382870207800064])

multiple items - using external_ids

In [None]:
c.data_sets.retrieve_multiple(external_ids=["AIR", "BLUEPRINT_APP_DATASET"], ignore_unknown_ids=True)

In [None]:
from datetime import datetime
# Don't put a long datetime range, as it'll fetch all the raw datapoints, which can be huge
c.datapoints.retrieve(external_id=['pi:163657','pi:163658'],start=datetime(2018,1,1),end=datetime(2018,1,2))

### Retrieve all items related to an asset

In [None]:
# get the asset object for "23-HA-9114"
asset_obj = c.assets.retrieve(id = 5192617294065915)

Asset Subtree

In [None]:
c.assets.retrieve_subtree(id = 5192617294065915)

Children

In [None]:
asset_obj.children()

Parent

In [None]:
asset_obj.parent()

Events

In [None]:
asset_obj.events()

Files

In [None]:
asset_obj.files()

Sequences

In [None]:
asset_obj.sequences()

Timeseries

In [None]:
asset_obj.time_series()

### Retrieve data

#### Retrieve Raw Data

In [None]:
# Keep the datetime range short: this fetches every raw datapoint in the range, which can be a huge amount of data
c.datapoints.retrieve(external_id=['pi:163657','pi:163658'],start=datetime(2018,1,1),end=datetime(2018,1,2))

#### Retrieve Aggregated data

In [None]:
c.datapoints.retrieve_dataframe(external_id=['pi:163657','pi:163658'],
                    start="2w-ago",
                    end="now",
                    aggregates=["average","sum"],
                    granularity="1h")
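
To make the `granularity="1h"` argument concrete: aggregates are computed per time bucket. The sketch below groups raw `(timestamp_ms, value)` pairs into hourly buckets and computes a sum and a count for each bucket in plain Python. It only illustrates the idea of bucketing and is not how CDF computes its aggregates; in particular, the service's `average` may be defined differently from a simple mean, so it is left out here. `bucket_sum_count` is a hypothetical helper.

```python
from collections import defaultdict

HOUR_MS = 3_600_000  # one hour in milliseconds


def bucket_sum_count(datapoints, granularity_ms=HOUR_MS):
    """Group (timestamp_ms, value) pairs into fixed-size buckets.

    Returns {bucket_start_ms: {"sum": ..., "count": ...}} ordered by bucket start.
    """
    buckets = defaultdict(lambda: {"sum": 0.0, "count": 0})
    for timestamp_ms, value in datapoints:
        # Round the timestamp down to the start of its bucket
        bucket_start = timestamp_ms - timestamp_ms % granularity_ms
        buckets[bucket_start]["sum"] += value
        buckets[bucket_start]["count"] += 1
    return dict(sorted(buckets.items()))


raw = [(0, 1.0), (1_800_000, 2.0), (3_600_000, 5.0)]
print(bucket_sum_count(raw))
```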

Retrieve latest datapoint before a particular time (last or any past time point)

In [None]:
c.datapoints.retrieve_latest(external_id='pi:163657', before="2d-ago")[0]
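
Strings such as `"2w-ago"` and `"2d-ago"` express a point in time relative to now. The sketch below shows one way such a string could be turned into a `timedelta`. It is an illustration only, not the SDK's own parser: `parse_time_ago` is a hypothetical helper, and the set of unit letters (`s`, `m`, `h`, `d`, `w`) is an assumption extrapolated from the `d` and `w` examples used in this notebook.

```python
import re
from datetime import timedelta

# Assumed unit letters and the timedelta keyword each one maps to
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days", "w": "weeks"}


def parse_time_ago(text):
    """Return how far back in time a string like '2w-ago' or 'now' points."""
    if text == "now":
        return timedelta(0)
    match = re.fullmatch(r"(\d+)([smhdw])-ago", text)
    if match is None:
        raise ValueError("Unsupported time-shift string: %r" % text)
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


print(parse_time_ago("2w-ago"))
print(parse_time_ago("2d-ago"))
```

Subtracting the result from the current time gives the absolute timestamp the string refers to.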

#### Retrieve Sequence rows

In [None]:
c.sequences.list()[0]

In [None]:
c.sequences.data.retrieve_dataframe(id=752497012302, start=0, end=None) # end=None to fetch all rows, can use a limit e.g. end=100

### Plot the data points

In [None]:
c.datapoints.retrieve(external_id='pi:163657', start="2w-ago", end="now").plot()

### Download files

Download files to disk

In [None]:
c.files.search(filter={"source":"CDF Vision"})

In [None]:
! mkdir my_directory
c.files.download(directory="my_directory", id=[30813179433802,442095693369785])

Download a single file to a specific path

In [None]:
c.files.download_to_path("my_directory/my_downloaded_file.jpg", id=30813179433802)

## 6. Create various resource types [TBD on ds-basics project]

Create timeseries, datasets, labels, events etc.

**Code Pattern**

`client.<cdf_resource_type>.create()`

Dataset

In [None]:
from cognite.client.data_classes import DataSet
data_sets = [DataSet(name="1st level"), DataSet(name="2nd level")]
res = c.data_sets.create(data_sets)

Labels

In [None]:
from cognite.client.data_classes import LabelDefinition
labels = [LabelDefinition(external_id="ROTATING_EQUIPMENT", name="Rotating equipment"), LabelDefinition(external_id="PUMP", name="pump")]
res = c.labels.create(labels)

Assets

In [None]:
from cognite.client.data_classes import Asset
assets = [Asset(name="asset1"), Asset(name="asset2")]
res = c.assets.create(assets)

Timeseries

In [None]:
from cognite.client.data_classes import TimeSeries
ts = c.time_series.create(TimeSeries(name="my ts"))

Sequences

In [None]:
from cognite.client.data_classes import Sequence
column_def = [{"valueType":"STRING","externalId":"user","description":"some description"}, {"valueType":"DOUBLE","externalId":"amount"}]
seq = c.sequences.create(Sequence(external_id="my_sequence", columns=column_def))

Files metadata

In [None]:
from cognite.client.data_classes import FileMetadata
file_metadata = FileMetadata(name="MyFile")
res = c.files.create(file_metadata)

Relationships

In [None]:
from cognite.client.data_classes import Relationship
flowrel1 = Relationship(external_id="flow_1", source_external_id="source_ext_id", source_type="asset", target_external_id="target_ext_id", target_type="event", confidence=0.1, data_set_id=1234)
flowrel2 = Relationship(external_id="flow_2", source_external_id="source_ext_id", source_type="asset", target_external_id="target_ext_id", target_type="event", confidence=0.1, data_set_id=1234)
res = c.relationships.create([flowrel1,flowrel2])

Create Asset with Labels

In [None]:
from cognite.client.data_classes import Asset, Label
asset = Asset(name="my_pump", labels=[Label(external_id="PUMP")])
res = c.assets.create(asset)

Create Asset Hierarchy

In [None]:
from cognite.client.data_classes import Asset
assets = [Asset(external_id="root", name="root"), Asset(external_id="child1", parent_external_id="root", name="child1"), Asset(external_id="child2", parent_external_id="root", name="child2")]
res = c.assets.create_hierarchy(assets)
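
In a hierarchy, each child refers to its parent through `parent_external_id`, so a parent has to exist before its children can be attached to it. The sketch below orders a list of assets so that parents always come before their children. It uses plain dictionaries rather than SDK objects, and `order_parents_first` is a hypothetical helper, not something the SDK exposes.

```python
def order_parents_first(assets):
    """Return the assets reordered so every parent precedes its children.

    Each asset is a dict with "external_id" and an optional "parent_external_id".
    Parents that are not part of the list are assumed to exist already.
    """
    by_id = {asset["external_id"]: asset for asset in assets}
    ordered, done = [], set()

    def visit(asset, path=()):
        ext_id = asset["external_id"]
        if ext_id in done:
            return
        if ext_id in path:
            raise ValueError("Cycle detected at %r" % ext_id)
        parent_id = asset.get("parent_external_id")
        if parent_id is not None and parent_id in by_id:
            # Place the parent first, remembering the path to detect cycles
            visit(by_id[parent_id], path + (ext_id,))
        done.add(ext_id)
        ordered.append(asset)

    for asset in assets:
        visit(asset)
    return ordered


assets = [
    {"external_id": "child1", "parent_external_id": "root"},
    {"external_id": "root"},
    {"external_id": "child2", "parent_external_id": "root"},
]
print([a["external_id"] for a in order_parents_first(assets)])
```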

Upload files and folders

Single file

In [None]:
res = c.files.upload("/path/to/file", name="my_file")

All files in a directory

In [None]:
res = c.files.upload("/path/to/my/directory")

## 7. Update various resource types [TBD on ds-basics project]
**Code Pattern**

`client.<cdf_resource_type>.update()`

Full Update

In [None]:
data_set = c.data_sets.retrieve(id=1)
data_set.description = "New description"
res = c.data_sets.update(data_set)

Partial Update

In [None]:
from cognite.client.data_classes import DataSetUpdate
my_update = DataSetUpdate(id=1).description.set("New description").metadata.remove(["key"])
res = c.data_sets.update(my_update)
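
Conceptually, a full update sends the whole retrieved object back, while a partial update names only the fields to change and leaves everything else untouched. The sketch below mimics the partial case on an ordinary dictionary. It makes no SDK calls; `apply_partial_update` is a hypothetical helper and the stored resource is a made-up example.

```python
import copy


def apply_partial_update(resource, set_fields=None, remove_metadata_keys=()):
    """Return a copy of `resource` with only the named changes applied."""
    updated = copy.deepcopy(resource)  # leave the caller's object untouched
    updated.update(set_fields or {})
    metadata = updated.get("metadata") or {}
    for key in remove_metadata_keys:
        metadata.pop(key, None)  # ignore keys that are not present
    return updated


stored = {
    "id": 1,
    "name": "1st level",
    "description": "Old description",
    "metadata": {"key": "value", "other": "kept"},
}
print(apply_partial_update(stored, {"description": "New description"}, ["key"]))
```

Fields that are not mentioned in the update, such as `name` here, keep their stored values.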

Partial Update (set a field and add a metadata key)

In [None]:
from cognite.client.data_classes import AssetUpdate
my_update = AssetUpdate(id=1).description.set("New description").metadata.add({"key": "value"})
res1 = c.assets.update(my_update)

Partial Update (clear a field that is already set)

In [None]:
another_update = AssetUpdate(id=1).description.set(None)
res2 = c.assets.update(another_update)



---



In [None]:
res = c.time_series.retrieve(id=1)
res.description = "New description"
res = c.time_series.update(res)

In [None]:
res = c.sequences.retrieve(id=1)
res.description = "New description"
res = c.sequences.update(res)

In [None]:
event = c.events.retrieve(id=1)
event.description = "New description"
res = c.events.update(event)

In [None]:
file_metadata = c.files.retrieve(id=1)
file_metadata.description = "New description"
res = c.files.update(file_metadata)

In [None]:
rel = c.relationships.retrieve(external_id="flow_1")
rel.confidence = 0.75
res = c.relationships.update(rel)

## 8. Insert the data in CDF [TBD on ds-basics project]

**Code Pattern**
`client.<cdf_resource_type>.insert()`

### Insert the Datapoints/Rows

In [None]:
# with datetime objects
datapoints = [(datetime(2018,1,1), 1000), (datetime(2018,1,2), 2000)]
c.datapoints.insert(datapoints, id=1)
# with ms since epoch
datapoints = [(150000000000, 1000), (160000000000, 2000)]
c.datapoints.insert(datapoints, id=2)  # id=2 is a placeholder for a second time series

In [None]:
data = [(1, ['pi',3.14]), (2, ['e',2.72]) ]
c.sequences.data.insert(column_external_ids=["col_a","col_b"], rows=data, id=1)
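
The `rows` argument above is a list of `(row_number, [values])` tuples, with the values in the same order as `column_external_ids`. When the source data is a list of records keyed by column name, it can be reshaped as in this sketch. `to_sequence_rows` is a hypothetical helper, not part of the SDK.

```python
def to_sequence_rows(records, column_external_ids, first_row=1):
    """Turn a list of dict records into (row_number, [values]) tuples.

    Values are emitted in the order given by `column_external_ids`.
    """
    return [
        (first_row + offset, [record[column] for column in column_external_ids])
        for offset, record in enumerate(records)
    ]


records = [
    {"col_a": "pi", "col_b": 3.14},
    {"col_a": "e", "col_b": 2.72},
]
rows = to_sequence_rows(records, ["col_a", "col_b"])
print(rows)
```

The resulting `rows` list has the same shape as the `data` variable in the cell above.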

### Insert the Dataframe

In [None]:
import numpy as np
import pandas as pd
from datetime import datetime, timedelta

ts_id = 123
x = pd.DatetimeIndex([datetime(2018, 1, 1) + timedelta(days=d) for d in range(100)])
y = np.random.normal(0, 1, 100)
df = pd.DataFrame({ts_id: y}, index=x)
c.datapoints.insert_dataframe(df)

In [None]:
c.sequences.data.insert_dataframe(df*2, id=123)

### Insert the datapoints in multiple timeseries

In [None]:
datapoints = []
# with datetime objects and id
datapoints.append({"id": 1, "datapoints": [(datetime(2018,1,1), 1000), (datetime(2018,1,2), 2000)]})
# with ms since epoch and externalId
datapoints.append({"externalId": "abc", "datapoints": [(150000000000, 1000), (160000000000, 2000)]})
c.datapoints.insert_multiple(datapoints)
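
When readings for several time series arrive as one flat stream, they first need to be grouped into the per-series structure shown above. The following sketch does that grouping in plain Python. `group_datapoints` is a hypothetical helper, and the external IDs and values are made up for the example.

```python
from collections import defaultdict


def group_datapoints(records):
    """Group (external_id, timestamp_ms, value) records by time series.

    Returns a list of {"externalId": ..., "datapoints": [(timestamp_ms, value), ...]}
    with the datapoints of each series sorted by timestamp.
    """
    grouped = defaultdict(list)
    for external_id, timestamp_ms, value in records:
        grouped[external_id].append((timestamp_ms, value))
    return [
        {"externalId": external_id, "datapoints": sorted(points)}
        for external_id, points in grouped.items()
    ]


records = [
    ("abc", 160000000000, 2000),
    ("xyz", 150000000000, 10),
    ("abc", 150000000000, 1000),
]
print(group_datapoints(records))
```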

## 9. Delete the data in CDF [TBD on ds-basics project]

### Delete using ids

In [None]:
c.sequences.data.delete(id=0, rows=[1,2,42])

### Delete using a Range

In [None]:
c.datapoints.delete_range(start="1w-ago", end="now", id=1)

In [None]:
c.sequences.data.delete_range(id=0, start=0, end=None)

### Delete multiple ranges

In [None]:
ranges = [{"id": 1, "start": "2d-ago", "end": "now"},
          {"externalId": "abc", "start": "2d-ago", "end": "now"}]
c.datapoints.delete_ranges(ranges)

### Delete various resource types

In [None]:
c.labels.delete(external_id=["big_pump", "small_pump"])

In [None]:
c.assets.delete(id=[1,2,3], external_id="3")

In [None]:
c.time_series.delete(id=[1,2,3], external_id="3")

In [None]:
c.sequences.delete(id=[1,2,3], external_id="3")

In [None]:
c.events.delete(id=[1,2,3], external_id="3")

In [None]:
c.files.delete(id=[1,2,3], external_id="3")

In [None]:
c.relationships.delete(external_id=["a","b"])

## 10. Use-case (WIP)

- Example dataset uploading and creation of various resource types
- Retrieve, explore and update the data
- Cleanup

Steps in this section are required to solve the exercises at the end.

#### Create a dataset

In [None]:
from cognite.client.data_classes import DataSet

In [None]:
dataset_name = 'world_info'

In [None]:
data_set_list = [DataSet(name=dataset_name,external_id=dataset_name,write_protected=True)]
res = c.data_sets.create(data_set_list)

In [None]:
c.data_sets.list()

In [None]:
ds_id = c.data_sets.retrieve(external_id=dataset_name).id
ds_id

### Country Wise Data

Country Wise UN Location Codes
https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes/blob/master/all/all.csv

In [None]:
! wget https://raw.githubusercontent.com/lukes/ISO-3166-Countries-with-Regional-Codes/master/all/all.csv

In [None]:
import pandas as pd

In [None]:
world_df = pd.read_csv("all.csv")

In [None]:
world_df.head()

### Create Assets and Hierarchy

#### Create the Root Asset

In [None]:
from cognite.client.data_classes import Asset

In [None]:
root_asset = Asset(external_id='world', name='World', description='World asset used as root for all countries', data_set_id=ds_id)

In [None]:
c.assets.create(root_asset)

#### Update the asset details

In [None]:
from cognite.client.data_classes import AssetUpdate

In [None]:
# Update the name of an Asset
name_update = AssetUpdate(external_id="world").name.set("Global")
res = c.assets.update(name_update)

#### Create Region assets

In [None]:
# List with all regions
reg_list = world_df['region'].dropna().unique().tolist()

In [None]:
reg_list

In [None]:
region_assets = []
for region in reg_list:
    region_assets.append(Asset(external_id=region+'_test', name=region, parent_external_id='world', data_set_id=ds_id))

In [None]:
c.assets.create(region_assets)

#### Create Country Assets under Region assets

In [None]:
df_country = world_df.groupby('name').first().reset_index()[['name','region']].dropna()
country_region = list(zip(df_country.name, df_country.region))

In [None]:
country_assets = []

for pair in country_region:
    country_assets.append(Asset(external_id=pair[0]+'_test', name=pair[0], parent_external_id=pair[1]+'_test', data_set_id=ds_id))

In [None]:
c.assets.create(country_assets)

### Add Data for each country

#### Add timeseries

World Population Data over the years https://data.worldbank.org/indicator/SP.POP.TOTL

##### Create the Time Series objects in CDF

In [None]:
from cognite.client.data_classes import TimeSeries

In [None]:
timeseries = []

for index, row in df_country.iterrows():
    external_id = row['name']+"_population"
    asset = c.assets.retrieve(external_id = row['name']+'_test')
    if asset is not None:
        timeseries.append(TimeSeries(external_id=external_id+'_test',name=external_id, asset_id = asset.id,data_set_id=ds_id))

In [None]:
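from cognite.client.exceptions import CogniteAPIError

for t in timeseries:
    try:
        c.time_series.create(t)
    except CogniteAPIError as err:
        # Report the failure (e.g. a duplicate external_id) instead of silently ignoring it
        print(f"Could not create {t.external_id}: {err}")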

##### Download & prepare the data

In [None]:
# Install the worldbank API
! pip install wbgapi

In [None]:
import wbgapi as wb

In [None]:
population_df = wb.data.DataFrame('SP.POP.TOTL', time=range(2000, 2020), labels=True,skipBlanks=True, columns='series').reset_index()

##### Insert the data in time series
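
This step is not filled in yet. As a starting point, the sketch below turns one country's yearly values into the `(datetime, value)` tuples that the datapoint insert examples in section 8 use. The exact layout of `population_df` is not shown in this notebook, so the input is assumed to be a plain mapping from year to value; the optional `YR` prefix on the keys is also an assumption, `yearly_values_to_datapoints` is a hypothetical helper, and the numbers are placeholders rather than real population figures.

```python
from datetime import datetime


def yearly_values_to_datapoints(values_by_year):
    """Convert {year: value} into a sorted list of (datetime, value) tuples.

    Keys may be integers (2000) or strings such as "2000" or "YR2000".
    Missing values (None) are skipped.
    """
    datapoints = []
    for key, value in values_by_year.items():
        if value is None:
            continue
        text = str(key)
        year = int(text[2:] if text.startswith("YR") else text)
        # Stamp each yearly value at 1 January of that year
        datapoints.append((datetime(year, 1, 1), value))
    return sorted(datapoints)


placeholder_values = {"YR2001": 120, "YR2000": 100, "YR2002": None}
print(yearly_values_to_datapoints(placeholder_values))
```

The resulting list could then be passed to `c.datapoints.insert(...)` for the matching time series, for example the one whose external ID ends in `_population_test`.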

#### Add some files

The UN publishes a PDF profile for each country, for example https://unctadstat.unctad.org/CountryProfile/GeneralProfile/en-GB/012/index.html
To get the PDF file for a country, pass its country code to this URL:
https://unctadstat.unctad.org/CountryProfile/GeneralProfile/en-GB/{CountryCode}/GeneralProfile{CountryCode}.pdf
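
A small sketch of filling in the URL template for a given country. It assumes that `{CountryCode}` is the numeric country code zero-padded to three digits, as the `012` in the example URL suggests; `build_profile_url` is a hypothetical helper, and the sketch only builds the URL without downloading anything.

```python
PROFILE_URL = (
    "https://unctadstat.unctad.org/CountryProfile/GeneralProfile/en-GB/"
    "{code}/GeneralProfile{code}.pdf"
)


def build_profile_url(country_code):
    """Return the profile PDF URL for a numeric country code (int or str)."""
    # Assumption: codes are zero-padded to three digits, e.g. 12 -> "012"
    code = str(int(country_code)).zfill(3)
    return PROFILE_URL.format(code=code)


print(build_profile_url(12))
```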

#### Add events

Events data, e.g. all disasters by country: https://public.emdat.be/data

#### Add Labels

Create labels such as "Cold" or "Hot" climate countries.

#### Add Relationships

Create Relationships such as "Neighbours"

### Delete the data ( Cleanup )

How to delete recursively and effectively.
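
One way to think about recursive cleanup is to delete the deepest assets first, so that no asset is removed while it still has children. The sketch below computes such an order from a mapping of external ID to parent external ID. It is plain Python with no SDK calls, and `deletion_order` is a hypothetical helper. The SDK documentation also describes a `recursive` option on `assets.delete`; check whether the version you use supports it before relying on it.

```python
def deletion_order(parent_by_id):
    """Return external IDs ordered deepest-first (children before parents).

    `parent_by_id` maps each external ID to its parent's external ID,
    or to None for a root asset.
    """

    def depth(ext_id):
        level, seen = 0, {ext_id}
        parent = parent_by_id.get(ext_id)
        while parent is not None:
            if parent in seen:
                raise ValueError("Cycle detected at %r" % parent)
            seen.add(parent)
            level += 1
            parent = parent_by_id.get(parent)
        return level

    return sorted(parent_by_id, key=depth, reverse=True)


parents = {
    "world": None,
    "Europe_test": "world",
    "Norway_test": "Europe_test",
    "Asia_test": "world",
}
print(deletion_order(parents))
```

The external IDs in the example follow the naming used for the region and country assets created earlier in this section.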