Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TI vs TM1py which is a better ETL process? #573

Closed
pal-16 opened this issue Jul 15, 2021 · 36 comments
Closed

TI vs TM1py which is a better ETL process? #573

pal-16 opened this issue Jul 15, 2021 · 36 comments
Labels

Comments

@pal-16
Copy link

pal-16 commented Jul 15, 2021

Hello @MariusWirtz
image

Creation | Updation time is written in the cells. Other than time performance what other factors would you suggest to determine a better ETL process?

@MariusWirtz
Copy link
Collaborator

Hi @pal-16,

thank you for providing these stats. Very interesting! Can you please do one more test with 250k or 500k?
It's not unusual to have such large dimensions in TM1. And loading very large dimensions can be a bottleneck.

As a "Pythonista" and TM1py developer I may be biased, but here is my take on this:
I think while TI is easier and to many old-school TM1'ers more familiar, Python is the better choice for ETL due to the following reasons:

  • Python is a proper programming language that allows you to express logic in efficient ways using modern data structures (lists, tuples, dictionaries, etc.). features (classes, functions, etc.), and not to mention automated tests.
  • Python's standard library and third-party extensions (pandas, numpy, etc.) go way beyond the scope of what TI can do.
  • Contrary to Turbo Integrator, a TM1py script does not run within the scope of a TM1 instance!
    It is therefore not more complex to interact with n TM1 instances than it is to interact with 1 instance from the script. For instance to load a dimension from instance A to instance B is very simple in python and very hard in TI. Sample:
from TM1py import TM1Service

with TM1Service(address="", port=12354, ssl=True, user="admin", password="apple") as tm1_source:
    with TM1Service(address="", port=12297, ssl=True, user="admin", password="apple") as tm1_target:
        dimension = tm1_source.dimensions.get(dimension_name="Financial Year")

        tm1_target.dimensions.update_or_create(dimension)
  • TI is limited in terms of the data sources it can connect to. With python, we you connect to almost any source system seamlessly.

Any other opinions on this one?

@wimgielis
Copy link
Contributor

wimgielis commented Jul 15, 2021 via email

@MariusWirtz
Copy link
Collaborator

@wimgielis,

Yeah. When it comes to writing cell-level data, TI is still the fastest option.
In TM1py we try to work around it with the use_ti option in the write method but due to the overhead, it doesn't eliminate the difference entirely.

The sample above updates the dimension with all hierarchies with all elements and edges and attributes.
Attribute values need to be transferred separately. The same goes for the security stuff and subsets.

Yes. Python involves an installation, though not necessarily on the machine that is running TM1.
And yes you have to learn python but the same goes for TI. Mastering TI is not trivial.

But definitely tm1py is a very welcome asset in the TM1 landscape so to speak

Thanks :)

@scrumthing
Copy link

Okay, I will add my two cents here... :-)

TI is only faster because it ignores everything as it runs in GOD-MODE... That is faster but from a maintenance and security point of view it is a nightmare...
Yes we all know how TI works. But only because we have always done it that way. And most of the time it is not really the right way. Just because TI expects you to work with your data line by line does not make it the best approach. Most definitely there are a lot of use cases where working with complete datasets instead of a single line makes for a better solution. 1,000 lines of code in metadata or data because you have to think of every variant of data that could be in the source is not how it should be done.

Python and tmpy1 (actually the rest api in general but unfortunately no other language offers a package like tm1py) like @MariusWirtz said opens tm1 up to all kinds of modern technology. Be it git or json or CI/CD or DevOps or ML or AI... the list goes on.
So the question is not what is better. The only question is what is tm1py missing that you are not using it in every project.

Looking forward to your comments. ;-)

@VentureHill
Copy link

I'm no way near as in-the-loop as I used to be with matters relating to TM1, but how can an 'out-of-process' ETL be faster than an 'in-process' ETL?

Is TM1Py usually ran on the same machine as TM1? If-not then these metrics are not applicable at all and are downright misleading. Even if it is running on the same box, the TM1Py library will introduce a socket related lag which wont be in TI due to it being 'in-process'.

Let me know what I'm missing here...

@rkvinoth
Copy link
Contributor

rkvinoth commented Jul 16, 2021

Faster way to load data:
It's obviously a TI process because it runs in Admin mode as @scrumthing pointed out.

So should I go with TI?
Well, no! It totally depends on the use case. I have experience converting TI processes that ran allocations in 20 minutes to TM1py based scripts which can complete in 1.5 minutes.

So should I go with TM1py?
Well, no! It totally depends on the use case. I have tried loading large files into cubes or exporting large dimensions/attributes. In these type of situations TI is the best. But with the help of new features in TM1py like unbound processes, you can now achieve these things in TM1py (unless your requirement is too complex).

Conclusion:
You cannot arrive at the right answer with out performing some experiments. Even experienced python people here would definitely try to solve things using different wats and see which is performing better.
IMO, TM1py is an excellent addition to our field. Let's embrace it and understand how to use it better. @MariusWirtz has been providing wonderful support for all the queries to a greater extent (no matter how dumb the questions can be).

@rclapp
Copy link
Collaborator

rclapp commented Jul 16, 2021 via email

@VentureHill
Copy link

VentureHill commented Jul 16, 2021

The related lag you mention would only be applicable in testing if you were using the same "method" of processing. In many cases the TI based line by line method is slower than then a query followed by a table transformation operation for example. What we are really missing, and desperately need, is an ETL REST endpoint. One that allows us to use 3P ETL tools.

Hi Ryan,

That makes sense as a way in which the TM1Py would be faster - one could optimize the amount of data to be added into TM1 prior to actioning it against the dimensions / cube. Also, I totally agree that flexibility Py will provide here is perfect for complex scenarios with merging multiple data queries potentially from multiple places.

That said, the initial benchmarks are lacking context, is this a like-for-like comparison - both systems running the same data through using the same methodology or an edge case where each engine is taking a different methodology based on its unique capabilities.

Regardless of methodology, TM1Py needs to talk to the TM1 Rest API over a network interface (even if on the same machine) adding a delay based on the size of the data being sent, the more data input, the more added delay over the TI approach which includes no such lag. This is why I'm strongly in favor of in-process ETL, I don't think 3P ETL Tools are the answer (unless they are embedded), I think a better TI scripting language/engine would be the answer.

@pal-16 Can we see the source code for this benchmark?

@cubewise-tryan
Copy link
Member

@pal-16 It would be interesting to see the code of the two benchmarks as sometimes you can be comparing apples and oranges.

Both TI and TM1Py are great tools for interacting with TM1. As as default I would still stick to TI when dealing with "standard" data sources such as ODBC and flat files. TI is a little clunky but it is very good at what it does and is super fast. There are some great tricks in TM1Py to improve performance but TI does run in-process. That means there isn't any overhead in terms of parsing HTTP requests and JSON and it has direct access to the data stored in memory. TI is also "compiled" so you don't require parsing of the code after it has been saved.

TM1Py is great for stuff that you can't do or that is hard in TI. There are more and more web based sources and dealing with JSON (or XML) in TI can be painful. TM1Py can also be great if you have Python expertise, Python is a very nice language and there are lots tutorials and an endless list of libraries. There are lots of great examples of how TM1Py has opened up a whole new world because it enables so many things that aren't possible with TI.

In summary, both are great but have their own sweet spots, it isn't a matter of better but instead what fits the job.

@MariusWirtz
Copy link
Collaborator

MariusWirtz commented Jul 16, 2021

Very interesting points. Thanks, everyone for sharing your thoughts and expertise!

regarding the stats, while I would love to challenge the code, but the results don't surprise me, to be honest. An IBM employee familiar with the TM1 engine, recently told me that dimension updates through REST should already be faster than through TI.

@MODLR
From my experience, a lot of TM1py implementations do run on the same machine as TM1. So if the stats are based on that assumption they are not misleading per se.

In summary, both are great but have their own sweet spots, it isn't a matter of better but instead what fits the job.

Agree. Perhaps the cases can be split into three groups.

  • I think there is a range of common TM1 problems that TI can address faster and easier than TM1py. Like for instance reading data from one cube to another with a simple transformation. There is no point in bringing the data into python and then back into TM1.

  • When it comes to dealing with SQL and flat files I think it makes sense to default to Turbo Integrator.
    However, depending on the complexity of the script and your expertise in Python it can make sense to write a python script instead.
    While a simple CSV to cube load will perform faster in TI, we know of more than one example with complex calculations involving multiples cubes that perform better in Python!
    I understand that this is due to TI's record by record processing (imagine 3 AttrS + 2 CellGetN +1 CellPutN statements for every record) while python is slicing chunks out of a few cubes at the beginning, doing its thing in python, and then writing back one chunk to TM1 at the end.

  • For anything that goes beyond the scope of TI (data integration from the cloud, forecasting, complex calculations, multi-instance logic, etc.) I think it makes sense to default to TM1py for the moment.

@MariusWirtz
Copy link
Collaborator

I think a better TI scripting language/engine would be the answer.

@MODLR
interesting thought! What do you have in mind?

I kinda think the TM1 REST API is already the answer. Ultimately everyone prefers different languages and technologies (and it changes over time too!) and REST caters to that.

I would rather have IBM focussing on making the REST API as fast and feature-rich and robust as possible than have them inventing a new language or integrating one fixed scripting language into the server.

@MariusWirtz
Copy link
Collaborator

What we are really missing, and desperately need, is an ETL REST endpoint. One that allows us to use 3P ETL tools.

@rclapp
Have you looked into Apache Airflow? It's perhaps more workflow management than classic ETL but I imagine it could go really well with TM1 and TM1py. @scrambldchannel did some pioneering work on this.
https://scrambldchannel.github.io/airflow-tm1.html#airflow-tm1

@wimgielis
Copy link
Contributor

wimgielis commented Jul 16, 2021 via email

@scrumthing
Copy link

I am with @MariusWirtz on this one. Rest-based is the future. TI will slowly but surely be deprecated. Afterwards you can either use Python or any other language. Besides performance there is no real need for ti because more or less all other languages on the planet have more flexibility. And for a pure data dump into the server IBM will maybe provide something.

@AlexanderDvoynev
Copy link

So, @lotsaram, when will Bedrock move from TI to tm1py? ;)

@wimgielis
Copy link
Contributor

wimgielis commented Jul 16, 2021 via email

@rclapp
Copy link
Collaborator

rclapp commented Jul 16, 2021 via email

@rclapp
Copy link
Collaborator

rclapp commented Jul 16, 2021 via email

@MariusWirtz
Copy link
Collaborator

I don't think REST can be the future, well at least not now we know it today. It was never intended to retrieve/send terabytes of data.

This is the feedback we need to provide to IBM regarding the REST API!
Loading terabytes of data is somewhat of an edge case though 🙃

@MariusWirtz
Copy link
Collaborator

Yes we are working to replace CCC with it. However, I am more interested an endpoint that can access the underlying trie structure directly, that way we can use things like AWS Glue.

I would love to learn more about how you use it today. Didn't know about AWS Glue yet. Will check it out!

@wimgielis
Copy link
Contributor

wimgielis commented Jul 16, 2021 via email

@zsoltmoravcsik
Copy link

zsoltmoravcsik commented Jul 16, 2021 via email

@MariusWirtz
Copy link
Collaborator

MariusWirtz commented Jul 16, 2021

REST API is too verbose to do good and efficient ETL

Completely disagree. Who said it's not efficient?

Exchanging data through JSON is not per se inefficient. For dimension updates, it is more or less on par with TI (according to the stats above and according to what we hear from IBM).

For data, you must not look at the throughput rate (e.g. update 100k cells per second) but at the runtime of an allocation or something. You will see that in many cases with REST we are already faster than TI at the bottom line.
Are there even more efficient ways to exchange data than JSON? Yes, and the TM1 REST API is eventually going to offer them and TM1py is going to implement them.

TM1 would need a proper API to work with the core.

Are you suggesting to rather wait for a "proper API" and not use REST for loads?
Doesn't make sense IMO. A bird in the hand is worth two in the bush and IBM has communicated multiple times that REST is the way to go forward in terms of APIs.

I remember doing a project with SQL a while ago. We were dealing with massive data quantities and struggling to load them into SQL fast enough. Ultimately we found out: the fastest way to load into MSSQL was a bulk insert from CSV files.

In TM1 we are currently exactly in the same situation! We can use REST / TM1py for everything but if you are really dealing with terabytes, just create CSV files on the server and use bulk mode / TI for the very last step (CellPutN).
I had to do this only once in my life. My experience: 95% of the time REST is fast enough. You may also look into multi-threading TM1py if REST isn't fast enough.

@wimgielis
Copy link
Contributor

wimgielis commented Jul 16, 2021 via email

@wimgielis
Copy link
Contributor

wimgielis commented Jul 16, 2021 via email

@MariusWirtz
Copy link
Collaborator

Probably an edge case but I would assume a oneliner to add an element to a dimension like currently DimensionElementInsert( dim, ‘’, name, type ); in TI.

@wimgielis
In the ElementService there are functions for that purpose: add_elements, add_edges, add_element_attributes

from TM1py import TM1Service, Element

with TM1Service(address="", port=12354, ssl=True, user="admin", password="apple") as tm1:
    tm1.elements.add_elements(
        dimension_name="d2",
        hierarchy_name="d2",
        elements=[Element("e11", "Numeric"), Element("e12", "Numeric")])

@scrumthing
Copy link

Yes we are working to replace CCC with it. However, I am more interested an endpoint that can access the underlying trie structure directly, that way we can use things like AWS Glue.

I would love to learn more about how you use it today. Didn't know about AWS Glue yet. Will check it out!

I would be interested too!

@MariusWirtz
Copy link
Collaborator

The ODBCOutput function in TI is rather slow (for large data volumes) if we do it record by record in the Data tab for instance. Bulk insert of a csv of SQL is much faster. So that is then the Epilog tab and does not make use leave TI, does it ?

Thanks. Back in that project, we weren't writing from TM1 to SQL but writing from Java to SQL. Please don't ask why this architecture.... And yes TM1 was coming after SQL :)

@scrumthing
Copy link

The vast majority of TM1 models out there can just suffice with what we now have in TM1.

Allow me to disagree here.

  • Every form of data cleaning is just very painful in TI.
  • Adding users is painful because I only can add one user at a time and only add the user to one group per line, etc.
  • Reusing code is nearly impossible (just check the length they have to go in bedrock)
  • Switching values from one element to the other is painful

I could go on. :-)

BTW: Awesome discussion here. Loving it! We should get Hubert in on that.

@wimgielis
Copy link
Contributor

wimgielis commented Jul 16, 2021 via email

@scrumthing
Copy link

We have lots of systems where automatically new users are added. Could definitely be less lines of code in Python. ;-)

If you have a complex logic for switching values where you have to iterate over the whole cube a pandas dataframe could be very helpful to speed things up and makes it more transparent.

@rclapp
Copy link
Collaborator

rclapp commented Jul 16, 2021 via email

@pal-16
Copy link
Author

pal-16 commented Jul 18, 2021

Thank you to everyone for giving their opinion. This was really an insightful discussion. Actually, in the beginning, Marius pointed to try with 250k+ data but I don't have that much data to try with and the code would be difficult to share as it company-specific. However, I have completely followed the documentation of this repository and carried out my analysis for building a hierarchical dimension, adding elements and elements attributes to it with the help of this. TM1py is really an excellent open source project I found out personally where each issue is discussed and solved. Indeed, Thank you.

@VentureHill
Copy link

VentureHill commented Jul 18, 2021

I think a better TI scripting language/engine would be the answer.

@MODLR
interesting thought! What do you have in mind?

I kinda think the TM1 REST API is already the answer. Ultimately everyone prefers different languages and technologies (and it changes over time too!) and REST caters to that.

I would rather have IBM focussing on making the REST API as fast and feature-rich and robust as possible than have them inventing a new language or integrating one fixed scripting language into the server.

In the MODLR platform (a TM1-like competitor) we embedded JavaScript, this means it runs 'In-Process' so it will be more efficient than anything which works over the REST API and also it affords us the benefits of a modern language - Arrays, Objects, Functions, Timers, Template Literals, Try-Catch. We also have the stats language R embedded as a secondary option and could add Python if it was requested enough.

JavaScript is also the most commonly known language / most frequently used as it's in practically every website.

We also have some handy utility functions which make life easy for developers -

  • notification.email
  • notification.sms
  • dimension.createOrWipe
  • hierarchy.createOrWipe
  • alias.createOrWipe
  • cube.wipe(cubeName, e1, e2, eN)
    Process function documentation - https://docs.modlr.co/process-functions

So you can imagine how this would reduce the number of lines of code to maintain.

Honestly, I would love to see TM1 with a powerful embedded language like JavaScript V8 Engine (from Google - used inside Chrome etc and is open source).

As per your comment on REST API updates, REST API based ETL can never be as fast as an in-process language so besides using it for edge cases, I don't see a REST ETL becoming the go-to for standard builds. When I was reviewing other platforms I looked at Jedox and at the time their ETL was out-of-process (not sure about now) and therefore REST API based however it was an order-of-magnitude slower than TI as a result. Since there were no alternatives which could compete with TM1 I started my own.

@cubewise-tryan
Copy link
Member

Hi @MODLR,

We should keep the discussion to TM1 rather than talking about other products 😀.

@scrumthing
Copy link

Hi @MODLR,

We should keep the discussion to TM1 rather than talking about other products 😀.

Agreed!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

10 participants