False predictions when importing several files at once #77

Closed
gety9 opened this issue Nov 15, 2018 · 13 comments

@gety9

gety9 commented Nov 15, 2018

Guys hi,

This is more a question about usage than an issue; I hope you can explain. I've read the Quick Start and Documentation, but since I am using several importers and one of them has two "modes" (credit card and checking), I can't work out how to apply the directions provided.

I have the following folder structure:

/downloads/
/office/
	at.beancount
	at.import
	/importers/	
		__init__.py
		/paypal/
			__init__.py
		/chase/
			__init__.py

at.beancount looks like this
/paypal/__init__.py/ and chase/__init__.py like this

Using bean-extract -e at.beancount at.import ../Downloads/ > temp.beancount
gives me temp.beancount file similar to this

Then I manually put in the correct accounts and get this.

I'd like to automate this last manual part with smart_importer. As far as I understand, I don't need @PredictPayees(), only @PredictPostings(). But I can't work out which importer file to insert them into (at.import, or /chase/__init__.py and /paypal/__init__.py) and where exactly :) A Python programmer helped me with the importers, but he is no longer available, so I have to figure it out on my own.

@johannesjh
Collaborator

hi,

I think what you want is to apply the @PredictPostings() decorator to your existing importers, in order to enhance them with machine learning. The easiest way to do this is to apply the decorator directly to your existing importer classes, which in your case are defined in the __init__.py files.

For example, /chase/__init__.py/ before applying the decorator:

class Importer(importer.ImporterProtocol):
  # ...

...and after applying the decorator:

@PredictPostings()
class Importer(importer.ImporterProtocol):
  # ...

That's it.

Note: You may prefer alternative methods of applying the decorators so that you can still unit-test the undecorated importer classes. But the simple solution above is sufficient to get you started.

@gety9
Author

gety9 commented Nov 15, 2018

@johannesjh

thank you for your reply.

Do I need to use any additional commands (to train the model)?

Right now I am using bean-extract -e at.beancount at.import ../Downloads/ > temp.beancount
Or is this command enough, and will it use at.beancount as training data?

@johannesjh
Collaborator

No additional commands needed. Data from at.beancount are used for training.

Technically, the decorator wraps the importer's extract(self, file, existing_entries=None) method. When the extract method is invoked through bean-extract or fava, the importer grabs the existing_entries and uses them as training data.
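
To illustrate the wrapping idea, here is a minimal sketch. It is not smart_importer's actual code: `predict_postings_sketch`, `ToyImporter`, and the dict-based entries are hypothetical stand-ins, and the "model" is just a payee-to-account lookup built from `existing_entries`.

```python
import functools


def predict_postings_sketch(cls):
    """Hypothetical class decorator that wraps cls.extract (illustration only)."""
    original_extract = cls.extract

    @functools.wraps(original_extract)
    def extract(self, file, existing_entries=None):
        # "Train" on the ledger entries handed in by the caller.
        known = {e["payee"]: e["account"] for e in (existing_entries or [])}
        entries = original_extract(self, file, existing_entries)
        # Fill in a predicted account where the payee was seen before.
        for entry in entries:
            entry.setdefault("account", known.get(entry["payee"]))
        return entries

    cls.extract = extract
    return cls


@predict_postings_sketch
class ToyImporter:
    def extract(self, file, existing_entries=None):
        # A real importer would parse `file`; this one returns a fixed row.
        return [{"payee": "GODADDY.COM", "amount": "-19.95 USD"}]


ledger = [{"payee": "GODADDY.COM", "account": "Expenses:Business:IS:Hosting"}]
result = ToyImporter().extract("PaypalIS.CSV", existing_entries=ledger)
print(result[0]["account"])
```

The point of the sketch is only that the training data arrives as an argument to extract(), which is why no separate training command is needed.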

@johannesjh
Collaborator

did it work? can we close this issue?

@gety9
Author

gety9 commented Nov 17, 2018

@johannesjh

I've made it work. Surprisingly, for some import files it works perfectly, but for others not at all; I provide examples below.

I am still getting a warning, and I'm not sure whether it's important:
bean-extract -e at.beancount at.import ../Downloads/ > tmp181116.beancount

2 [main] python3.6m 15556 child_info_fork::abort: address space needed by '_superlu.cpython-36m-x86_64-cygwin.dll' (0x400000) is already occupied
/usr/lib/python3.6/site-packages/sklearn/externals/joblib/_multiprocessing_helpers.py:38: UserWarning: [Errno 11] Resource temporarily unavailable.  joblib will operate in serial mode
  warnings.warn('%s.  joblib will operate in serial mode' % (e,))
/usr/lib/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp

I have 5 Chase import files and 2 PayPal import files: https://puu.sh/C31l7/851b3a0da3.png
For some import files, like this PayPal one (DownloadAT) https://puu.sh/C31fH/1edb32c280.png, I am getting perfect results.
For others, like this PayPal one (DownloadIS) https://puu.sh/C31ax/bbb87441a1.png, I am not getting correct results; for some reason it puts 3 accounts in each transaction.
I'm not sure what the reason for this is.

So basically, in my case pretty much all predictions in some import files are right, and in others they are wrong. The "wrongness" is that the model puts extra accounts in transactions. Here is an example where pretty much all transaction predictions for the import file have 4 accounts: https://puu.sh/C32nU/7e18db510f.png

@gety9
Author

gety9 commented Nov 17, 2018

It seems I found the pattern: the first file an importer processes gets very accurate predictions.
But for all subsequent import files handled by that importer, the predictions are incorrect.
For example, let's say we have 5 Chase import files:
ChaseXXX1_Activity_20181115.CSV
ChaseXXX2_Activity_20181115.CSV
ChaseXXX3_Activity_20181115.CSV
ChaseXXX4_Activity_20181115.CSV
ChaseXXX5_Activity_20181115.CSV

then predictions for ChaseXXX1_Activity_20181115.CSV will be correct, and for all the other ones incorrect, including ChaseXXX2_Activity_20181115.CSV.

But if I delete ChaseXXX1_Activity_20181115.CSV, so that we now have 4 files:
ChaseXXX2_Activity_20181115.CSV
ChaseXXX3_Activity_20181115.CSV
ChaseXXX4_Activity_20181115.CSV
ChaseXXX5_Activity_20181115.CSV

then predictions for ChaseXXX2_Activity_20181115.CSV will be correct, and for the others incorrect.

If we use several importers and have the following import files:
ChaseXXX1_Activity_20181115.CSV
ChaseXXX2_Activity_20181115.CSV
ChaseXXX3_Activity_20181115.CSV
ChaseXXX4_Activity_20181115.CSV
ChaseXXX5_Activity_20181115.CSV
PaypalAT.CSV
PaypalIS.CSV

then predictions for ChaseXXX1_Activity_20181115.CSV and PaypalAT.CSV will be correct, and for all others incorrect.

Could you suggest what the problem is and how it could be solved?

As per your suggestion, I've applied the smart importers like this

@johannesjh
Collaborator

johannesjh commented Nov 25, 2018

Hm, difficult to say.

  • What do you mean by saying the predictions are incorrect? In which way are they incorrect?
  • Are there any differences between the CSV files, regarding their content (e.g., are they for the same account or for different accounts), regarding what training data should be used, and regarding your expectation about correct vs. incorrect predictions?
  • How do you start the import, e.g., through beancount's commandline api or through fava? When you start the import, do you tell the program to import several files at once?

I can, for now, only guess, but one idea for an explanation is this: Is it possible that your program (beancount or fava) re-uses importer instances when it is told to import several files? Such re-use would make perfect sense for regular importers, but smart importers could end up using false training data.

EDIT, Note:
I have always imported just one file with each importer, and I never experienced such problems.
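
To make the suspicion concrete, here is a toy reproduction of how a reused instance could carry stale state from one file to the next. `LeakyImporter` is hypothetical; it does not reflect how beancount, fava, or smart_importer are actually implemented, only the general failure pattern being guessed at.

```python
class LeakyImporter:
    """Hypothetical importer that keeps training state on the instance."""

    def __init__(self):
        self.training = []  # survives across extract() calls

    def extract(self, file, existing_entries=None):
        # State is extended on every call and never reset.
        self.training.extend(existing_entries or [])
        suggestion = [e["account"] for e in self.training]
        # Rows from this file leak into the state used for the next file.
        self.training.append({"account": f"Assets:{file}"})
        return suggestion


ledger = [{"account": "Expenses:Hosting"}]

# One shared instance handles two files in a row.
shared = LeakyImporter()
first = shared.extract("PaypalAT", ledger)
second = shared.extract("PaypalIS", ledger)

# A fresh instance handles the second file on its own.
fresh = LeakyImporter().extract("PaypalIS", ledger)

print(first)
print(second)
print(fresh)
```

In this toy, the first file and the fresh instance see only the ledger, while the second call on the shared instance also sees leftovers from the first call.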

@gety9
Author

gety9 commented Nov 27, 2018

1 "What do you mean by saying the predictions are incorrect? In which way are they incorrect?"

They're completely off. Here is an example:

If the correct transaction is

2017-10-20 * "GODADDY.COM" "Order: #38070"
  Assets:Paypal:IS                                              -19.95 USD
  Expenses:Business:IS:Hosting

then the prediction can be

2017-10-20 * "GODADDY.COM" "Order: #38070"
  Assets:Paypal:AT                                              
  Expenses:Business:AT:Advertisement
  Assets:Paypal:IS                                              -19.95 USD

2 "Are there any differences between the CSV files, regarding their content (e.g., are they for the same account or for different accounts), regarding what training data should be used, and regarding your expectation about correct vs. incorrect predictions?"

The files are pretty much the same; it's the order in which they appear in the downloads folder that matters.
If I have a file that gets processed (predicted) incorrectly, then once I rename it so it goes first, or import only that file, it gets processed (predicted) correctly.

3 "How do you start the import, e.g., through beancount's commandline api or through fava? When you start the import, do you tell the program to import several files at once?"

I am using the command line, for example:
bean-extract -e at.beancount at.import ../Downloads/ > tmp181126.beancount

"do you tell the program to import several files at once?"
Yes, importing several at once (usually 7 files); at.beancount looks like this

4 "I can, for now, only guess, but one idea for an explanation is this: Is it possible that your program (beancount or fava) re-uses importer instances when it is told to import several files? Such re-use would make perfect sense for regular importers, but smart importers could end up using false training data."

That's what I think too. It works correctly when I place 1 file in the downloads folder. I just wanted to make it work with all 7 files, but it's OK. Not a big deal, I will just import them 1 at a time.
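
The one-file-at-a-time workaround can be sketched as a small helper that builds one bean-extract command per file instead of pointing it at the whole folder. `per_file_commands` is a hypothetical helper, and the default ledger and config names are simply the ones used in this thread.

```python
def per_file_commands(files, ledger="at.beancount", config="at.import"):
    """Build one bean-extract command per import file, in sorted order."""
    return [
        ["bean-extract", "-e", ledger, config, str(f)]
        for f in sorted(files)
    ]


commands = per_file_commands(["PaypalIS.CSV", "PaypalAT.CSV"])
for cmd in commands:
    # Each command starts a separate process, so nothing is shared between files.
    print(" ".join(cmd))
```

Each command could then be run with subprocess.run and the outputs concatenated into one temporary beancount file.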

@johannesjh
Collaborator

johannesjh commented Nov 29, 2018

Thank you for sharing this information. I think we now have sufficiently narrowed down the problem: False predictions when importing several files at once.

Next steps: This will need some debugging to confirm the suspicion that importer instances are cached and re-used, which leads to false training data being used for the predictions.

@johannesjh changed the title from "Question on usage" to "False predictions when importing several files at once" Nov 29, 2018
@johannesjh added the bug label Nov 29, 2018
@yagebu mentioned this issue Dec 1, 2018
@yagebu closed this as completed in #78 Dec 8, 2018
@gety9
Author

gety9 commented Dec 9, 2018

@johannesjh hi,

So it should work correctly using hooks? Could you please explain how to apply them (hooks) to my sample file described in OP #77 (comment) ?

@yagebu
Member

yagebu commented Dec 9, 2018

@gety9: Instead of applying the decorators to the importer classes, you should apply the hooks to importer instances as outlined in the README

@johannesjh
Collaborator

...for your import configuration, this roughly translates to:

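# Import path as given in the smart_importer README.
from smart_importer import apply_hooks, PredictPayees, PredictPostings

chase_importer = chase.Importer(...)
paypal_importer = paypal.Importer(...)

CONFIG = [
    apply_hooks(chase_importer, [PredictPostings(), PredictPayees()]),
    apply_hooks(paypal_importer, [PredictPostings(), PredictPayees()])
]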
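
For intuition about what applying hooks to an instance means, here is a rough, self-contained sketch. `apply_hooks_sketch`, `predict_account`, and `ToyImporter` are hypothetical names, and the real smart_importer hooks may have a different signature; the sketch only shows an instance's extract() being wrapped so that every call works from the entries passed to that call.

```python
def apply_hooks_sketch(importer, hooks):
    """Hypothetical stand-in: wrap one instance's extract() with hooks."""
    original_extract = importer.extract

    def extract(file, existing_entries=None):
        entries = original_extract(file, existing_entries)
        for hook in hooks:
            # Each hook receives the entries given to this call; nothing is cached.
            entries = hook(entries, existing_entries or [])
        return entries

    importer.extract = extract
    return importer


def predict_account(entries, existing_entries):
    """Toy hook: copy the account of a previously seen payee."""
    known = {e["payee"]: e["account"] for e in existing_entries}
    return [dict(e, account=known.get(e["payee"])) for e in entries]


class ToyImporter:
    def extract(self, file, existing_entries=None):
        # A real importer would parse `file`; this one uses its name as payee.
        return [{"payee": file}]


importer = apply_hooks_sketch(ToyImporter(), [predict_account])
ledger = [{"payee": "GODADDY.COM", "account": "Expenses:Hosting"}]
with_ledger = importer.extract("GODADDY.COM", ledger)
without_ledger = importer.extract("GODADDY.COM", [])
print(with_ledger, without_ledger)
```

The second call gets no prediction even though the same instance was reused, because no state from the first call is kept.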

@gety9
Author

gety9 commented Dec 16, 2018

@johannesjh @yagebu thank you guys!


3 participants