TI vs TM1py which is a better ETL process? #573
Comments
Hi @pal-16, thank you for providing these stats. Very interesting! Can you please do one more test with 250k or 500k? As a "Pythonista" and TM1py developer I may be biased, but here is my take on this:
from TM1py import TM1Service

with TM1Service(address="", port=12354, ssl=True, user="admin", password="apple") as tm1_source:
    with TM1Service(address="", port=12297, ssl=True, user="admin", password="apple") as tm1_target:
        dimension = tm1_source.dimensions.get(dimension_name="Financial Year")
        tm1_target.dimensions.update_or_create(dimension)
Any other opinions on this one? |
Interesting topic !
Python and tm1py certainly have advantages, including cross-model work (while
other tools like Jedox - very similar to TM1 - support this natively within
the software).
I always thought that TI would be the fastest way compared to tm1py or
other REST-based tools.
A few points regarding TI vs. Python:
With respect to the example of Marius on updating dimensions between
instances: we should know what the dimensions.update_or_create method does.
Does it bring over subsets ? Attributes ? Dimension properties ?
Hierarchies (PA-speak) within the dimension ? Security settings ? Etc.
While there are ready-made methods that make a number of things much
easier, it also involves learning Python as well as knowing which
methods to use, what they do / do not do. We all know
DimensionElementInsert and AttrPutS kind of functions, so starting from
what one knows is usually how it is done.
Python also involves installations.
But definitely tm1py is a very welcome asset in the TM1 landscape so to
speak.
…------
Best regards / Beste groeten,
Wim Gielis
MS Excel MVP 2011-2014
https://www.wimgielis.com <http://www.wimgielis.be>
On Thu 15 Jul 2021 at 09:56, Marius Wirtz ***@***.***> wrote:
Hi @pal-16 <https://github.com/pal-16>,
thank you for providing these stats. Very interesting! Can you please do
one more test with 250k or 500k?
It's not unusual to have such large dimensions in TM1. And loading very
large dimensions can be a bottleneck.
As a "Pythonista" and TM1py developer I may be biased, but here is my take
on this:
I think while TI is easier and to many old-school TM1'ers more familiar,
Python is the better choice for ETL due to the following reasons:
- Python is a proper programming language that allows you to express
logic in efficient ways using modern data structures (lists, tuples,
dictionaries, etc.) and language features (classes, functions, etc.), not to
mention automated tests.
- Python's standard library and third-party extensions (pandas, numpy,
etc.) go way beyond the scope of what TI can do.
- Contrary to Turbo Integrator, a TM1py script does not run within the
scope of a TM1 instance!
It is therefore not more complex to interact with n TM1 instances than
it is to interact with 1 instance from the script. For instance to load a
dimension from instance A to instance B is very simple in python and very
hard in TI. Sample:
from TM1py import TM1Service

with TM1Service(address="", port=12354, ssl=True, user="admin", password="apple") as tm1_source:
    with TM1Service(address="", port=12297, ssl=True, user="admin", password="apple") as tm1_target:
        dimension = tm1_source.dimensions.get(dimension_name="Financial Year")
        tm1_target.dimensions.update_or_create(dimension)
- TI is limited in terms of the data sources it can connect to. With
python, you can connect to almost any source system seamlessly.
Any other opinions on this one?
|
Yeah. When it comes to writing cell-level data, TI is still the fastest option. The sample above updates the dimension with all hierarchies with all elements and edges and attributes. Yes. Python involves an installation, though not necessarily on the machine that is running TM1.
Thanks :) |
Okay, I will add my two cents here... :-) TI is only faster because it ignores everything as it runs in GOD-MODE... That is faster, but from a maintenance and security point of view it is a nightmare... Python and tm1py (actually the REST API in general, but unfortunately no other language offers a package like tm1py), like @MariusWirtz said, open TM1 up to all kinds of modern technology, be it git or JSON or CI/CD or DevOps or ML or AI... the list goes on. Looking forward to your comments. ;-) |
I'm nowhere near as in-the-loop as I used to be with matters relating to TM1, but how can an 'out-of-process' ETL be faster than an 'in-process' ETL? Is TM1Py usually run on the same machine as TM1? If not, then these metrics are not applicable at all and are downright misleading. Even if it is running on the same box, the TM1Py library will introduce a socket-related lag which won't be in TI due to it being 'in-process'. Let me know what I'm missing here... |
Faster way to load data: So should I go with TI? So should I go with TM1py? Conclusion: |
The lag you mention would only be applicable in testing if you were using the same "method" of processing. In many cases the TI-based line-by-line method is slower than a query followed by a table transformation operation, for example.
What we are really missing, and desperately need, is an ETL REST endpoint. One that allows us to use 3P ETL tools.
On Jul 15, 2021 9:05 PM, Ben Hill ***@***.***> wrote:
I'm nowhere near as in-the-loop as I used to be with matters relating to TM1, but how can an 'out-of-process' ETL be faster than an 'in-process' ETL?
Is TM1Py usually run on the same machine as TM1? If not, then these metrics are not applicable at all and are downright misleading. Even if it is running on the same box, the TM1Py library will introduce a socket-related lag which won't be in TI due to it being 'in-process'.
Let me know what I'm missing here...
|
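To make the row-by-row versus set-based distinction above concrete, here is a minimal sketch in plain Python with no TM1 connection. The record layout and the name `aggregate_records` are assumptions made purely for illustration. The only point is that collapsing source rows by their target coordinates before loading reduces the number of cell writes, whichever tool performs the final write.

```python
from collections import defaultdict


def aggregate_records(records):
    """Sum values per coordinate tuple so each target cell is written once."""
    totals = defaultdict(float)
    for *coordinates, value in records:
        totals[tuple(coordinates)] += value
    return dict(totals)


# Hypothetical source rows: (year, account, value)
rows = [
    ("2021", "Sales", 100.0),
    ("2021", "Sales", 50.0),
    ("2021", "Costs", 30.0),
    ("2022", "Sales", 70.0),
]

cells = aggregate_records(rows)
print(len(rows), "source rows ->", len(cells), "cells to write")
```

Whether this pre-aggregation pays off depends on how many source rows map to the same cell; with one row per cell it saves nothing.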
Hi Ryan, That makes sense as a way in which TM1Py would be faster - one could optimize the amount of data to be added into TM1 prior to actioning it against the dimensions / cube. Also, I totally agree that the flexibility Python provides here is perfect for complex scenarios that merge multiple data queries, potentially from multiple places. That said, the initial benchmarks are lacking context: is this a like-for-like comparison, with both systems running the same data through the same methodology, or an edge case where each engine takes a different methodology based on its unique capabilities? Regardless of methodology, TM1Py needs to talk to the TM1 REST API over a network interface (even if on the same machine), adding a delay based on the size of the data being sent: the more data input, the more delay is added over the TI approach, which includes no such lag. This is why I'm strongly in favor of in-process ETL. I don't think 3P ETL tools are the answer (unless they are embedded); I think a better TI scripting language/engine would be the answer. @pal-16 Can we see the source code for this benchmark? |
@pal-16 It would be interesting to see the code of the two benchmarks, as sometimes you can be comparing apples and oranges. Both TI and TM1Py are great tools for interacting with TM1. As a default I would still stick to TI when dealing with "standard" data sources such as ODBC and flat files. TI is a little clunky but it is very good at what it does and is super fast. There are some great tricks in TM1Py to improve performance, but TI does run in-process. That means there isn't any overhead in terms of parsing HTTP requests and JSON, and it has direct access to the data stored in memory. TI is also "compiled", so you don't require parsing of the code after it has been saved. TM1Py is great for stuff that you can't do or that is hard in TI. There are more and more web-based sources, and dealing with JSON (or XML) in TI can be painful. TM1Py can also be great if you have Python expertise: Python is a very nice language and there are lots of tutorials and an endless list of libraries. There are lots of great examples of how TM1Py has opened up a whole new world because it enables so many things that aren't possible with TI. In summary, both are great but have their own sweet spots; it isn't a matter of better but instead of what fits the job. |
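On the apples-and-oranges concern: a small timing harness makes it easier to run both approaches under the same conditions and report more than a single run. The sketch below is only an illustration. `benchmark` and `sample_load` are hypothetical names, and the placeholder workload stands in for whatever triggers the TI process or the TM1py load against the same data on the same server.

```python
import statistics
import time


def benchmark(func, repeats=5):
    """Run func several times and summarise wall-clock timings in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return {
        "runs": repeats,
        "best": min(timings),
        "median": statistics.median(timings),
    }


def sample_load():
    # Placeholder workload (assumption); replace with the real load call.
    sum(range(100_000))


result = benchmark(sample_load, repeats=3)
print(result)
```

Reporting the median next to the best run helps separate the cost of the method from one-off effects such as a cold cache or a busy server.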
Very interesting points. Thanks, everyone, for sharing your thoughts and expertise! Regarding the stats: while I would love to challenge the code, the results don't surprise me, to be honest. An IBM employee familiar with the TM1 engine recently told me that dimension updates through REST should already be faster than through TI. @MODLR
Agree. Perhaps the cases can be split into three groups.
|
@MODLR I kinda think the TM1 REST API is already the answer. Ultimately everyone prefers different languages and technologies (and it changes over time too!) and REST caters to that. I would rather have IBM focussing on making the REST API as fast and feature-rich and robust as possible than have them inventing a new language or integrating one fixed scripting language into the server. |
@rclapp |
Still, having basic improvements in TI like...
functions
or collections
or regular expressions
or a decent Replace function, Left/Right ...
or a process template to pick from a list
or common snippets
or a library that every developer now does on his/her own (a Bedrock
light for instance)
or IRR / NPV / ...
or ... (I could go on a long time)
shouldn't be too hard, should it? It's 2021 already, not 1995.
It cannot / shouldn't be the case that we need to resort to outside
tools to make sure this can be done.
For example, I regularly write "function" logic in rules and ask
for the result with a series of CellPutN/S and CellGetN/S calls. Hello IBM, it's
2021!
…------
Best regards / Beste groeten,
Wim Gielis
MS Excel MVP 2011-2014
https://www.wimgielis.com <http://www.wimgielis.be>
On Fri 16 Jul 2021 at 12:26, Marius Wirtz ***@***.***> wrote:
I think a better TI scripting language/engine would be the answer.
@MODLR <https://github.com/MODLR>
interesting thought! What do you have in mind?
I kinda think the TM1 REST API is already the answer. Ultimately everyone
prefers different languages and technologies (and it changes over time
too!) and REST caters to that.
I would rather have IBM focussing on making the REST API as fast and
feature-rich and robust as possible than have them inventing a new language
or integrating one fixed scripting language into the server.
|
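For a sense of scale on the wishlist above, this is roughly what IRR / NPV look like once functions are available. It is a minimal sketch in plain Python: the names `npv` and `irr`, the search interval and the sample cash flows are all assumptions for illustration, and the bisection only works when NPV changes sign once on that interval.

```python
def npv(rate, cashflows):
    """Net present value; cashflows[0] occurs at time 0."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cashflows))


def irr(cashflows, low=-0.99, high=10.0, tolerance=1e-9):
    """Internal rate of return by bisection on the NPV function."""
    if npv(low, cashflows) * npv(high, cashflows) > 0:
        raise ValueError("NPV does not change sign on the search interval")
    while high - low > tolerance:
        mid = (low + high) / 2
        # Keep the half of the interval that still contains the sign change
        if npv(low, cashflows) * npv(mid, cashflows) <= 0:
            high = mid
        else:
            low = mid
    return (low + high) / 2


# Hypothetical project: one outflow followed by three equal inflows
flows = [-1000.0, 500.0, 500.0, 500.0]
print(round(npv(0.10, flows), 2))
print(round(irr(flows), 4))
```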
I am with @MariusWirtz on this one. REST-based is the future. TI will slowly but surely be deprecated. Afterwards you can use either Python or any other language. Besides performance there is no real need for TI, because more or less all other languages on the planet have more flexibility. And for a pure data dump into the server, IBM will maybe provide something. |
So, @lotsaram, when will Bedrock move from TI to tm1py? ;) |
Today + 21916 days 😜
|
I don't think REST can be the future, well, at least not as we know it today. It was never intended to retrieve/send terabytes of data.
On Jul 16, 2021 7:09 AM, Christoph Hein ***@***.***> wrote:
I am with @MariusWirtz<https://github.com/MariusWirtz> on this one. REST-based is the future. TI will slowly but surely be deprecated. Afterwards you can use either Python or any other language. Besides performance there is no real need for TI, because more or less all other languages on the planet have more flexibility. And for a pure data dump into the server, IBM will maybe provide something.
|
Yes, we are working to replace CCC with it. However, I am more interested in an endpoint that can access the underlying trie structure directly; that way we can use things like AWS Glue.
On Jul 16, 2021 6:38 AM, Marius Wirtz ***@***.***> wrote:
What we are really missing, and desperately need, is an ETL REST endpoint. One that allows us to use 3P ETL tools.
@rclapp<https://github.com/rclapp>
Have you looked into Apache Airflow? It's perhaps more workflow management than classic ETL but I imagine it could go really well with TM1 and TM1py. @scrambldchannel<https://github.com/scrambldchannel> did some pioneering work on this.
https://scrambldchannel.github.io/airflow-tm1.html#airflow-tm1
|
This is the feedback we need to provide to IBM regarding the REST API! |
I would love to learn more about how you use it today. Didn't know about AWS Glue yet. Will check it out! |
Probably an edge case, but I would expect a one-liner to add an element to a
dimension, like the current DimensionElementInsert( dim, '', name, type ); in
TI.
If it is any different, that is already one problem; it should be a
one-liner as it is now.
The newish Hierarchy* functions are also being picked up very slowly (I only use
them in TI when I need to), so I would only change if really, really needed.
On Fri 16 Jul 2021 at 14:53, Marius Wirtz ***@***.***> wrote:
I don't think REST can be the future, well, at least not as we know it
today. It was never intended to retrieve/send terabytes of data.
This is the feedback we need to provide to IBM regarding the REST API!
Loading terabytes of data is somewhat of an edge case though 🙃
--
…------
Best regards / Beste groeten,
Wim Gielis
MS Excel MVP 2011-2014
https://www.wimgielis.com <http://www.wimgielis.be>
|
I agree with Ryan: REST API as a technology should mainly serve end-user
requests, and not work as an ETL solution. The REST API is too verbose to do
good and efficient ETL.
Don't misunderstand: we love TM1py, and it is great as the glue between data
science applications and several other applications, but TM1 would need a
proper API to work with the core.
|
Completely disagree. Who said it's not efficient? Exchanging data through JSON is not per se inefficient. For dimension updates, it is more or less on par with TI (according to the stats above and according to what we hear from IBM). For data, you must not look at the throughput rate (e.g. update 100k cells per second) but at the runtime of an allocation or something. You will see that in many cases with REST we are already faster than TI at the bottom line.
Are you suggesting to rather wait for a "proper API" and not use REST for loads? I remember doing a project with SQL a while ago. We were dealing with massive data quantities and struggling to load them into SQL fast enough. Ultimately we found out: the fastest way to load into MSSQL was a bulk insert from CSV files. In TM1 we are currently exactly in the same situation! We can use REST / TM1py for everything but if you are really dealing with terabytes, just create CSV files on the server and use bulk mode / TI for the very last step ( |
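The "create CSV files and bulk-load them" route mentioned above can be sketched with the standard library alone. This is an illustration under assumptions: the cell layout (a dict keyed by coordinate tuples), the name `cells_to_csv`, and the use of an in-memory buffer in place of a file on the TM1 server are all hypothetical, and the TI process that would read the file is not shown.

```python
import csv
import io


def cells_to_csv(cells, stream):
    """Write {(element, ...): value} cells as one CSV row per cell."""
    writer = csv.writer(stream)
    for coordinates, value in cells.items():
        writer.writerow([*coordinates, value])


# Hypothetical cells keyed by (year, account)
cells = {("2021", "Sales"): 150.0, ("2021", "Costs"): 30.0}

buffer = io.StringIO()  # stand-in for a file the TM1 server can read
cells_to_csv(cells, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a real file instead of a buffer, the `csv` module documentation recommends opening it with `newline=""` so line endings are not translated twice.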
The vast majority of TM1 models out there can just suffice with what we now
have in TM1.
You know, updating dimensions, loading data, transferring data from 1 cube
to another.
TI is certainly sufficient in terms of possibilities and speed, not in
terms of ease of use (debatable) or code structures or whatever I noted
earlier in this topic.
How often do we need to go beyond TI? Very rarely. It's with edge cases
like ARIMA models or IRR or working across TM1 models or other statistical
excursions or joining SQL statements, ... that you would need to deviate
from TI.
Then we can supplement TI with tm1py. I'm happy to do that.
Making other scripting languages and REST the de facto standard would
certainly not be my preference. Not even if that new tool is on par
regarding speed.
So for me:
- default: TI
- very much appreciated: surrounding developments in tools like tm1py for when TI
won't cut it (not often, in my experience), but that's not the focus
I built a few useful scripts in tm1py, like counting users, but I'm not going to
move away from TI, knowing very well that TI lacks essential things that it
should have received a long time ago.
…------
Best regards / Beste groeten,
Wim Gielis
MS Excel MVP 2011-2014
https://www.wimgielis.com <http://www.wimgielis.be>
On Fri 16 Jul 2021 at 15:27, Marius Wirtz ***@***.***> wrote:
REST API is too verbose to do good and efficient ETL
Completely disagree. Who said it's not efficient?
Exchanging data through JSON is not per se inefficient. For dimension
updates, it is more or less on par with TI (according to the stats above
and according to what we hear from IBM).
For data, you must not look at the throughput rate (e.g. update 100k cells
per second) but at the runtime of an allocation or something. You will see
that in many cases with REST we are already faster than TI at the bottom
line.
Are there even more efficient ways to exchange data than JSON? Yes, and
the TM1 REST API is eventually going to offer them and TM1py is going to
implement them.
TM1 would need a proper API to work with the core.
Are you suggesting to rather wait for a "proper API" and not use REST for
loads?
Doesn't make sense IMO. A bird in the hand is worth two in the bush and
IBM has communicated multiple times that REST is the way to go forward in
terms of APIs.
I remember doing a project with SQL a while ago. We were dealing with
massive data quantities and struggling to load them into SQL fast enough.
Ultimately we found out: *the fastest way to load into MSSQL was a bulk
insert from CSV files.*
In TM1 we are currently exactly in the same situation! We can use REST /
TM1py for everything but if you are really dealing with terabytes, just
create CSV files on the server and use bulk mode / TI.
I had to do this only once in my life. My experience: 95% of the time REST
is fast enough. You may also look into multi-threading TM1py if REST isn't
fast enough.
|
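The multi-threading suggestion quoted above boils down to splitting the cells into chunks and sending the chunks concurrently. The sketch below shows that pattern with the standard library only. `write_in_parallel`, `chunked` and `fake_write` are hypothetical names, and `fake_write` is a stand-in for a real REST write (for example a wrapper around a TM1py cell-write call); it is not a TM1py function.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def write_in_parallel(cells, write_chunk, chunk_size=2, workers=4):
    """Send chunks of cells through write_chunk on a pool of threads."""
    chunks = chunked(list(cells.items()), chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps chunk order and re-raises any exception from a worker
        return list(pool.map(lambda chunk: write_chunk(dict(chunk)), chunks))


def fake_write(chunk):
    # Hypothetical stand-in for a REST write; returns the number of cells sent.
    return len(chunk)


cells = {("2021", f"Account {i}"): float(i) for i in range(5)}
written = write_in_parallel(cells, fake_write)
print(written)
```

Threads suit this case because the work is dominated by waiting on the network. Whether each thread needs its own session, and how many parallel writes a given server tolerates, are questions this sketch does not answer.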
Marius,
The ODBCOutput function in TI is rather slow (for large data volumes) if we
do it record by record in the Data tab, for instance.
Bulk insert of a CSV into SQL is much faster. So that then happens in the Epilog tab
and does not make us leave TI, does it?
|
@wimgielis
from TM1py import TM1Service, Element

with TM1Service(address="", port=12354, ssl=True, user="admin", password="apple") as tm1:
    tm1.elements.add_elements(
        dimension_name="d2",
        hierarchy_name="d2",
        elements=[Element("e11", "Numeric"), Element("e12", "Numeric")])
|
I would be interested too! |
Thanks. Back in that project, we weren't writing from TM1 to SQL but writing from Java to SQL. Please don't ask why this architecture.... And yes TM1 was coming after SQL :) |
Allow me to disagree here.
I could go on. :-) BTW: Awesome discussion here. Loving it! We should get Hubert in on that. |
Christoph,
Adding users... how often do you do that? :-) My sales colleagues would want
to see it every day at every customer, but reality is different, no? ;-)
Data cleaning: I agree it can be much better, with reusable functions,
regex, etc. That relates to the code reuse that you brought up as well.
Switching values: how does tm1py help here? Assuming we already have TI
and possibly Bedrock (and if you add your own libraries), I do wonder what tm1py
brings to the table here.
…------
Best regards / Beste groeten,
Wim Gielis
MS Excel MVP 2011-2014
https://www.wimgielis.com <http://www.wimgielis.be>
On Fri 16 Jul 2021 at 15:52, Christoph Hein ***@***.***> wrote:
The vast majority of TM1 models out there can just suffice with what we
now have in TM1.
Allow me to disagree here.
- Every form of data cleaning is just very painful in TI.
- Adding users is painful because I only can add one user at a time
and only add the user to one group per line, etc.
- Reusing code is nearly impossible (just check the length they have
to go in bedrock)
- Switching values from one element to the other is painful
I could go on. :-)
BTW: Awesome discussion here. Loving it! We should get Hubert in on that.
|
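As an illustration of the data-cleaning point discussed above, here is what a reusable cleaning function with regular expressions can look like in Python. The cleaning rules and the name `clean_element_name` are assumptions chosen for the example; real naming rules would depend on the model.

```python
import re


def clean_element_name(raw):
    """Normalise a raw source label into a tidy element name."""
    name = raw.strip()
    name = re.sub(r"\s+", " ", name)      # collapse runs of whitespace
    name = re.sub(r"[^\w \-]", "", name)  # drop punctuation except - and _
    return name


# Hypothetical labels as they might arrive from a source system
raw_names = ["  Sales   EMEA ", "Cost#Center!", "R&D - Europe"]
cleaned = [clean_element_name(n) for n in raw_names]
print(cleaned)
```

Because the rules live in one function, they can be covered by automated tests and reused across loads, which is the code-reuse argument made earlier in the thread.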
We have lots of systems where new users are added automatically. That could definitely be fewer lines of code in Python. ;-) If you have complex logic for switching values where you have to iterate over the whole cube, a pandas DataFrame could be very helpful to speed things up and make it more transparent. |
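A rough sketch of the value-switching idea, written with plain dictionaries so it runs without pandas or a TM1 connection. The cell layout, the element names and the function `move_values` are hypothetical; with pandas the same step would be a column replace followed by a group-by sum.

```python
from collections import defaultdict


def move_values(cells, position, old_element, new_element):
    """Re-key cells so values on old_element land on new_element instead."""
    moved = defaultdict(float)
    for coordinates, value in cells.items():
        if coordinates[position] == old_element:
            coordinates = (
                coordinates[:position] + (new_element,) + coordinates[position + 1:]
            )
        moved[coordinates] += value  # add to anything already on the target
    return dict(moved)


# Hypothetical cells keyed by (year, account)
cells = {
    ("2021", "Old Account"): 40.0,
    ("2021", "New Account"): 10.0,
    ("2022", "Old Account"): 5.0,
}
result = move_values(cells, position=1, old_element="Old Account", new_element="New Account")
print(result)
```

Note the design choice: values moved onto an element that already holds data are added to it rather than overwriting it. The opposite behaviour may be what a given model needs.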
Switching gears slightly; it would be great if TI did modernize, but even if it doesn't, there are about 3 fundamental features/improvements that are missing that could be added today.
1) Data Duplicate Function: Copy all data from element a to element b for example, where it does not require a record wise operation.
2) Cube Calculations Expressed with Rules: Imagine that you could write rule syntax in TI, and have the resulting values written to the cube. CubeRunStaticCalc(cube, rules). No row wise operations. Just a onetime rule execution. Makes drivers * cost pool just as quick as a merge in pandas. Likewise you could instantly convert rules on a cube to true values without having to export and import.
3) More efficient zero out: this one is painful. Why the server must traverse all cells to make them 0 seems crazy to me.
|
Thank you to everyone for giving their opinion. This was a really insightful discussion. In the beginning, Marius suggested trying with 250k+ data, but I don't have that much data to try with, and the code would be difficult to share as it is company-specific. However, I followed the documentation of this repository closely and carried out my analysis by building a hierarchical dimension and adding elements and element attributes to it. TM1py is really an excellent open source project, where each issue is discussed and solved. Thank you. |
In the MODLR platform (a TM1-like competitor) we embedded JavaScript, this means it runs 'In-Process' so it will be more efficient than anything which works over the REST API and also it affords us the benefits of a modern language - Arrays, Objects, Functions, Timers, Template Literals, Try-Catch. We also have the stats language R embedded as a secondary option and could add Python if it was requested enough. JavaScript is also the most commonly known language / most frequently used as it's in practically every website. We also have some handy utility functions which make life easy for developers -
So you can imagine how this would reduce the number of lines of code to maintain. Honestly, I would love to see TM1 with a powerful embedded language like JavaScript V8 Engine (from Google - used inside Chrome etc and is open source). As per your comment on REST API updates, REST API based ETL can never be as fast as an in-process language so besides using it for edge cases, I don't see a REST ETL becoming the go-to for standard builds. When I was reviewing other platforms I looked at Jedox and at the time their ETL was out-of-process (not sure about now) and therefore REST API based however it was an order-of-magnitude slower than TI as a result. Since there were no alternatives which could compete with TM1 I started my own. |
Hi @MODLR, We should keep the discussion to TM1 rather than talking about other products 😀. |
Agreed! |
Hello @MariusWirtz
Creation and update times are written in the cells. Other than time performance, what other factors would you suggest to determine the better ETL process?