Skip to content

Potter/outlook Outlook Connector#939

Merged
ryannikolaidis merged 26 commits into
mainfrom
potter/outlook
Jul 26, 2023
Merged

Potter/outlook Outlook Connector#939
ryannikolaidis merged 26 commits into
mainfrom
potter/outlook

Conversation

@potter-potter
Copy link
Copy Markdown
Contributor

@potter-potter potter-potter commented Jul 16, 2023

Adds Outlook connector
Hasn't been tested on email other than the devops one.
Does not download attachments.

changed file name in Onedrive example to match the rest.
changed documentation to correct the pip install.
Also notice change from or to and in main.py line 874. OneDrive and Outlook share credentials and it was triggering the Onedrive connector. Hopefully we can have an and instead of an or or else we'll need a different approach.

from unstructured.ingest.logger import logger
from unstructured.utils import requires_dependencies

MAX_NUM_EMAILS = 10000
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what drives this max number? does the library trigger paging otherwise? is there a generator or some alternative so we can avoid the upper limit?

Copy link
Copy Markdown
Contributor Author

@potter-potter potter-potter Jul 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The library triggers paging as the default. Which is 10. Supposedly there is a timeout at some point. But consider that this call is just getting mail objects (It's not downloading), it should be pretty fast.

10000 is my random pick of a lot of emails.

At the point that it starts downloading, it does it with separate calls, so there shouldn't be a timeout. Though it would be hard for us to test with just a couple emails...

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, makes sense. is there any reason we need to limit this at all?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't use .top() (line 209) it will default to 10. if we use .top() with nothing in it. I think it always defaults to 10 also. I will look in the documentation.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like top needs an parameter number. At least there is no default in the documentation. Without .top you only get 10 documents

Copy link
Copy Markdown
Contributor

@ryannikolaidis ryannikolaidis Jul 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's so bizarre! issue is that's the pre-filtered number, so that will include you're entire inbox (I'm guessing trash too?). meaning it could get capped before fetching the messages of interest. I guess I'm tempted to make this something much bigger, like 10M. kind of a bummer we have to query all of the full messages and hold them in memory, unlike other connectors where each doc is responsible for actually fetching content, bc that does have me slightly worried about things blowing up and waiting for a giant call before processing. guessing there weren't any reasonable options here? Ideally would be nice to list all message IDs and then individually fetch each per doc..

fwiw it looks like there's a way to filter in that original query by mailbox, right? if that's correct, we should probably do that for the sake of reducing memory and request demands and skirting around the max fetch a bit.

Copy link
Copy Markdown
Contributor Author

@potter-potter potter-potter Jul 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ooh. let me look into that. I wasn't thinking it was a big deal. But now that you mention deleted emails, 10k could get filled up pretty quick.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll look deeper into this. For some reason I am recalling that a few of the root folders they allow "Inbox" as the ID (and "Sent Items" and a few others). There may have been a reason that this did not work though in the long run for non standard folders. I can't totally remember. But I will definitely do some exploration.

Copy link
Copy Markdown
Contributor Author

@potter-potter potter-potter Jul 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. It will only take folder IDs in that mail_folder[]. BUT we could get all the folders and subfolders first (and their ids) and then collect the mail after that we get the folder ids. It will only get mail in the exact folder id and not in the subfolders until you run it again with the subfolder id. So we'll iterate through all folder/subfolder ids. Then we don't have to do any filtering later. And we will only get the relevant emails in the relevant folders.

Glad you thought of looking deeper into this.

Should be a reasonably easy fix.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Thanks for digging here.

Comment on lines +181 to +187
for folder in get_root_folders:
self.root_folders[folder.display_name].append(folder.id)
if folder.get_property("childFolderCount") > 0:
root_folders_with_subfolders.append(folder.id)

for folder in root_folders_with_subfolders:
self.recurse_folders(folder, self.root_folders)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have folders and subfolders in our example? or does everything get flattened? probably okay if so since the file names are by id anyways, but want to make sure we're testing the nesting logic.

Copy link
Copy Markdown
Contributor Author

@potter-potter potter-potter Jul 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do have folders and subfolders in the test account. And I have tested double folders, etc...

Everything is returned flattened for simplicity. The way the api lists parent folders was less than ideal, so this was my simple solution.

I'd be happy to look at subfolders in the partitioned return if you think it's important.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, I think that's fine. As long as getting the filename is ultimately unique and deterministic, since we currently rely on that to determine if we should process a file on subsequent runs.

Copy link
Copy Markdown
Contributor

@ryannikolaidis ryannikolaidis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pushed a tiny change to update onedrive secrets while we're making changes there.

just a couple quick questions, otherwise this looks great!

Copy link
Copy Markdown
Contributor

@ryannikolaidis ryannikolaidis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! appreciate the catch/fix on the email partitioning!

@ryannikolaidis ryannikolaidis enabled auto-merge (squash) July 26, 2023 01:15
@ryannikolaidis ryannikolaidis disabled auto-merge July 26, 2023 01:16
Comment thread CHANGELOG.md
Comment on lines +9 to +10
### Fixes

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
### Fixes
### Fixes
* Fixes issue with email partitioning where From field was being assigned the To field value.

@ryannikolaidis
Copy link
Copy Markdown
Contributor

@potter-potter can you update this to latest, bump version (accordingly), and review quick suggestion for changelog?

@ryannikolaidis ryannikolaidis enabled auto-merge (squash) July 26, 2023 03:38
@ryannikolaidis ryannikolaidis merged commit f7e46af into main Jul 26, 2023
@ryannikolaidis ryannikolaidis deleted the potter/outlook branch July 26, 2023 04:09
MthwRobinson added a commit that referenced this pull request Aug 3, 2023
* don't push

* enhancement: improve json detection by detect_filetype (#971)

* update regex pattern

* improve json regex pattern checks and add test file

* update file name

* update tests and formatting

* update changelog and version

* refactor: simplifies JSON detection and add tests (#975)

* refactor json detection

* version and changelog

* fix mock in test

* feat: adds Outlook connector (#939)

* bonus: fixes issue with email partitioning where From field was being assigned the To field value.

* Roman/expose dpi param (#966)

* Bump inference version

* Pass through the dpi param if available

* Update CHANGELOG

* Check dpi param passed in via unit test

* Bump inference version

* Fix unit test around file info to work on mac as well

* chore: cleanup changelog for 0.8.2 (#976)

* Update `partition_via_api` to not post a strategy value if not user specified (#967)

* remove default strategy

* working on test

* fixed test, coordinates param needed to be included

* nits

* update changelog

* lint

* update requirements

* build(release): cut 0.8.4 release (#979)

* feat: add document date for remaining file types (#930) (#969)

* feat: add document date for remaining file types (#930)

* feat: add functions for getting modification date

* feat: add date field to metadata from csv file

* feat: add tests for csv patition

* feat: add date field to metadata from html file

* feat: add tests for html partition

* fix: return file name onlyif possible

* feat: add csv tests

* fix: renaming

* feat: add filed metadata_date  as date of last mod

* feat: add tests for partition_docx

* feat: add filed metadata_date  to .doc file

* feat: add tests for partition_doc

* feat: add metadata_date  to .epub file

* feat: add tests for partition_epub

* fix: fix test mocking

* feat: add metadata_date for image partition

* feat: add test for image partition

* feat: add coorrdinate system argument

* feat: add date to element metadata

* feat: add metadata_date for JSON partition

* feat: add test for JSON partition

* fix: rename variable

* feat: add metadata_date for md partition

* feat: add test for md partition

* feat: update doc string

* feat: add metadata_date for .odt partition

* feat: update .odt string

* feat: add metadata_date for .org partition

* feat: add tests for .org partition

* feat: add metadata_date for .pdf partition

* feat: add tests for .pdf partition

* feat: add metadata_date for .pptx partition

* feat: add metadata_date for .ppt partition

* feat: add tests for .ppt partition

* feat: add tests for .pptx partition

* feat: add metadata_date for .rst partition

* feat: add tests for .rst partition

* fix: get modification date after file checking

* feat: add tests for .rtf partition

* feat: add tests for .rtf partition

* feat: add metadata_date for .txt partition

* fix: rename argument

* feat: add tests for .txt partition

* feat: update doc string rst patrition function

* feat: add metadata_date for .tsv partition

* feat: add tests for .tsv partition

* feat: add metadata_date for .xlsx partition

* feat: add tests for .xlsx partition

* fix: clean up

* feat: add tests for .xml partition

* feat: add tests for .xml partition

* fix: use `or ` instead of `if`

* fix: fix epub tests

* fix: remove not used code

* fix: add try block for getting file name

* fix: applying linter changes

* fix: fix test_partition_file

* feat: add metadata_date for email

* feat: add test for email partition

* feat: add metadata_date for msg

* feat: add tests for msg partition

* feat: update CHANGELOG file

* fix: update partitions doc string

* don't push

* fix: clean up code

* linting, linting, linting

* remove unnecessary example doc

* update version and changelog

* ingest-test-fixtures-update

* set metadata date in test

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>

* ingest-test-fixtures-update

* Update ingest test fixtures (#970)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* Revert "Update ingest test fixtures (#970)"

This reverts commit 1d182ae.

* remove date from metadata in outputs

* update docstring ordering

* remove print

* remove print

* remove print

* linting, linting, linting

* fix version and test

* fix changelog

* fix changelog

* update version

---------

Co-authored-by: kravetsmic <79907559+kravetsmic@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* Chore: add uns api repo unittests (#954)

* stage

* git clone

* ci ignore markdown file

* make install

* use env instead

* remove md

* add script

* wrong env value

* add note

* maybe don't rm

* no cd../

---------

Co-authored-by: cragwolfe <crag@unstructured.io>

* fix: handling for empty tables in word docs and powerpoints (#982)

* fix table index error

* changelog and version

* fix: only download nltk packages if necessary (#985)

* fix: only download nltk if necessary

* changelog and version

* Chore: Pass table support  param to partition image (#973)

* add param and test in image table extraction

* version and changelog

* need to publish this one for api repo

* add new param skip_infer_table_types

* use warning

* clean up with mapping

* add test for tsv

* fix test fail

* weird change from merge

* doc nit

* don't use mapping

* correct conflict

* Update pip in makefile (#981)

* update pip in makefile

* merge and update requirements

* update version

* update outlook requirements

* chore: remove debug printing (#988)

* fix: correct nltk download arg order (#991)

* fix: correct download order to nltk args

* add smoke test for tokenizers

* Chore: put back function `split_by_paragraph` (#992)

* put back function

* not really fixes

* don't push

* fix: clean up code

* fix: clean up

* fix: clean up

* feat: add document date for remaining file types (#930) (#969)

* feat: add document date for remaining file types (#930)

* feat: add functions for getting modification date

* feat: add date field to metadata from csv file

* feat: add tests for csv patition

* feat: add date field to metadata from html file

* feat: add tests for html partition

* fix: return file name onlyif possible

* feat: add csv tests

* fix: renaming

* feat: add filed metadata_date  as date of last mod

* feat: add tests for partition_docx

* feat: add filed metadata_date  to .doc file

* feat: add tests for partition_doc

* feat: add metadata_date  to .epub file

* feat: add tests for partition_epub

* fix: fix test mocking

* feat: add metadata_date for image partition

* feat: add test for image partition

* feat: add coorrdinate system argument

* feat: add date to element metadata

* feat: add metadata_date for JSON partition

* feat: add test for JSON partition

* fix: rename variable

* feat: add metadata_date for md partition

* feat: add test for md partition

* feat: update doc string

* feat: add metadata_date for .odt partition

* feat: update .odt string

* feat: add metadata_date for .org partition

* feat: add tests for .org partition

* feat: add metadata_date for .pdf partition

* feat: add tests for .pdf partition

* feat: add metadata_date for .pptx partition

* feat: add metadata_date for .ppt partition

* feat: add tests for .ppt partition

* feat: add tests for .pptx partition

* feat: add metadata_date for .rst partition

* feat: add tests for .rst partition

* fix: get modification date after file checking

* feat: add tests for .rtf partition

* feat: add tests for .rtf partition

* feat: add metadata_date for .txt partition

* fix: rename argument

* feat: add tests for .txt partition

* feat: update doc string rst patrition function

* feat: add metadata_date for .tsv partition

* feat: add tests for .tsv partition

* feat: add metadata_date for .xlsx partition

* feat: add tests for .xlsx partition

* fix: clean up

* feat: add tests for .xml partition

* feat: add tests for .xml partition

* fix: use `or ` instead of `if`

* fix: fix epub tests

* fix: remove not used code

* fix: add try block for getting file name

* fix: applying linter changes

* fix: fix test_partition_file

* feat: add metadata_date for email

* feat: add test for email partition

* feat: add metadata_date for msg

* feat: add tests for msg partition

* feat: update CHANGELOG file

* fix: update partitions doc string

* don't push

* fix: clean up code

* linting, linting, linting

* remove unnecessary example doc

* update version and changelog

* ingest-test-fixtures-update

* set metadata date in test

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>

* ingest-test-fixtures-update

* Update ingest test fixtures (#970)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* Revert "Update ingest test fixtures (#970)"

This reverts commit 1d182ae.

* remove date from metadata in outputs

* update docstring ordering

* remove print

* remove print

* remove print

* linting, linting, linting

* fix version and test

* fix changelog

* fix changelog

* update version

---------

Co-authored-by: kravetsmic <79907559+kravetsmic@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* Roman/ingest refactor (#978)

* Pull out s3 code as subcommand

* Pull out dropbox code as subcommand

* Pull out azure code as subcommand

* Pull out fsspec code as subcommand

* Pull out github code as subcommand

* Pull out gitlab code as subcommand

* Pull out reddit code as subcommand

* Pull out slack code as subcommand

* Pull out discord code as subcommand

* Pull out wikipedia code as subcommand

* Pull out gdrive code as subcommand

* Pull out biomed code as subcommand

* rename parameters

* Pull out onedrive code as subcommand

* Pull out outlook code as subcommand

* Pull out local code as subcommand

* Pull out elasticsearch code as subcommand

* Pull out confluence code as subcommand

* Drop previous main file

* update changelog

* Add back in mp.Pool

* Fix mypy issues with click

* Make sure all tests run with verbose flag

* refactor approach to dynamically add common options to each subcommand, scrub logging of options for sensitive data

* Pull out some more shared options

* Support running code via python as well as cli

* update ingest readme and move it to the ingest folder

* update usage in connector docs

* move local command arg in test

* Seperate out cli code from logic running unstructured

* Make some cli fields required rather than optional

* rename process -> processor

* Improve logger to avoid duplicate handlers

---------

Co-authored-by: Ryan Nikolaidis <1208590+ryannikolaidis@users.noreply.github.com>

* feat: adds Box connector (#996)

* chore: rename Element's "date" field to "last_modified" (#997)

Change the Element's date field name to the more specific last_modified so there is less room for confusion of what that field represents.

* don't push

* feat: add document date for remaining file types (#930) (#969)

* feat: add document date for remaining file types (#930)

* feat: add functions for getting modification date

* feat: add date field to metadata from csv file

* feat: add tests for csv patition

* feat: add date field to metadata from html file

* feat: add tests for html partition

* fix: return file name onlyif possible

* feat: add csv tests

* fix: renaming

* feat: add filed metadata_date  as date of last mod

* feat: add tests for partition_docx

* feat: add filed metadata_date  to .doc file

* feat: add tests for partition_doc

* feat: add metadata_date  to .epub file

* feat: add tests for partition_epub

* fix: fix test mocking

* feat: add metadata_date for image partition

* feat: add test for image partition

* feat: add coorrdinate system argument

* feat: add date to element metadata

* feat: add metadata_date for JSON partition

* feat: add test for JSON partition

* fix: rename variable

* feat: add metadata_date for md partition

* feat: add test for md partition

* feat: update doc string

* feat: add metadata_date for .odt partition

* feat: update .odt string

* feat: add metadata_date for .org partition

* feat: add tests for .org partition

* feat: add metadata_date for .pdf partition

* feat: add tests for .pdf partition

* feat: add metadata_date for .pptx partition

* feat: add metadata_date for .ppt partition

* feat: add tests for .ppt partition

* feat: add tests for .pptx partition

* feat: add metadata_date for .rst partition

* feat: add tests for .rst partition

* fix: get modification date after file checking

* feat: add tests for .rtf partition

* feat: add tests for .rtf partition

* feat: add metadata_date for .txt partition

* fix: rename argument

* feat: add tests for .txt partition

* feat: update doc string rst patrition function

* feat: add metadata_date for .tsv partition

* feat: add tests for .tsv partition

* feat: add metadata_date for .xlsx partition

* feat: add tests for .xlsx partition

* fix: clean up

* feat: add tests for .xml partition

* feat: add tests for .xml partition

* fix: use `or ` instead of `if`

* fix: fix epub tests

* fix: remove not used code

* fix: add try block for getting file name

* fix: applying linter changes

* fix: fix test_partition_file

* feat: add metadata_date for email

* feat: add test for email partition

* feat: add metadata_date for msg

* feat: add tests for msg partition

* feat: update CHANGELOG file

* fix: update partitions doc string

* don't push

* fix: clean up code

* linting, linting, linting

* remove unnecessary example doc

* update version and changelog

* ingest-test-fixtures-update

* set metadata date in test

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>

* ingest-test-fixtures-update

* Update ingest test fixtures (#970)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* Revert "Update ingest test fixtures (#970)"

This reverts commit 1d182ae.

* remove date from metadata in outputs

* update docstring ordering

* remove print

* remove print

* remove print

* linting, linting, linting

* fix version and test

* fix changelog

* fix changelog

* update version

---------

Co-authored-by: kravetsmic <79907559+kravetsmic@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* feat: add document date for remaining file types (#930) (#969)

* feat: add document date for remaining file types (#930)

* feat: add functions for getting modification date

* feat: add date field to metadata from csv file

* feat: add tests for csv patition

* feat: add date field to metadata from html file

* feat: add tests for html partition

* fix: return file name onlyif possible

* feat: add csv tests

* fix: renaming

* feat: add filed metadata_date  as date of last mod

* feat: add tests for partition_docx

* feat: add filed metadata_date  to .doc file

* feat: add tests for partition_doc

* feat: add metadata_date  to .epub file

* feat: add tests for partition_epub

* fix: fix test mocking

* feat: add metadata_date for image partition

* feat: add test for image partition

* feat: add coorrdinate system argument

* feat: add date to element metadata

* feat: add metadata_date for JSON partition

* feat: add test for JSON partition

* fix: rename variable

* feat: add metadata_date for md partition

* feat: add test for md partition

* feat: update doc string

* feat: add metadata_date for .odt partition

* feat: update .odt string

* feat: add metadata_date for .org partition

* feat: add tests for .org partition

* feat: add metadata_date for .pdf partition

* feat: add tests for .pdf partition

* feat: add metadata_date for .pptx partition

* feat: add metadata_date for .ppt partition

* feat: add tests for .ppt partition

* feat: add tests for .pptx partition

* feat: add metadata_date for .rst partition

* feat: add tests for .rst partition

* fix: get modification date after file checking

* feat: add tests for .rtf partition

* feat: add tests for .rtf partition

* feat: add metadata_date for .txt partition

* fix: rename argument

* feat: add tests for .txt partition

* feat: update doc string rst patrition function

* feat: add metadata_date for .tsv partition

* feat: add tests for .tsv partition

* feat: add metadata_date for .xlsx partition

* feat: add tests for .xlsx partition

* fix: clean up

* feat: add tests for .xml partition

* feat: add tests for .xml partition

* fix: use `or ` instead of `if`

* fix: fix epub tests

* fix: remove not used code

* fix: add try block for getting file name

* fix: applying linter changes

* fix: fix test_partition_file

* feat: add metadata_date for email

* feat: add test for email partition

* feat: add metadata_date for msg

* feat: add tests for msg partition

* feat: update CHANGELOG file

* fix: update partitions doc string

* don't push

* fix: clean up code

* linting, linting, linting

* remove unnecessary example doc

* update version and changelog

* ingest-test-fixtures-update

* set metadata date in test

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>

* ingest-test-fixtures-update

* Update ingest test fixtures (#970)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* Revert "Update ingest test fixtures (#970)"

This reverts commit 1d182ae.

* remove date from metadata in outputs

* update docstring ordering

* remove print

* remove print

* remove print

* linting, linting, linting

* fix version and test

* fix changelog

* fix changelog

* update version

---------

Co-authored-by: kravetsmic <79907559+kravetsmic@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* fix: removie prints

* remove unused file

* fix: apply linter

* feat: add post processing filter_element_types

* feat: add tests for filter_element_types

* feat: update changelog

* feat: add doc string for filter_element_types

* fix: change the version

* feat: update documentation

* bump dev version number

* cleanup changelog

* linting, linting, linting

---------

Co-authored-by: John <43506685+Coniferish@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: David Potter <potterdavidm@gmail.com>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
jasonbot pushed a commit that referenced this pull request Aug 8, 2023
* don't push

* enhancement: improve json detection by detect_filetype (#971)

* update regex pattern

* improve json regex pattern checks and add test file

* update file name

* update tests and formatting

* update changelog and version

* refactor: simplifies JSON detection and add tests (#975)

* refactor json detection

* version and changelog

* fix mock in test

* feat: adds Outlook connector (#939)

* bonus: fixes issue with email partitioning where From field was being assigned the To field value.

* Roman/expose dpi param (#966)

* Bump inference version

* Pass through the dpi param if available

* Update CHANGELOG

* Check dpi param passed in via unit test

* Bump inference version

* Fix unit test around file info to work on mac as well

* chore: cleanup changelog for 0.8.2 (#976)

* Update `partition_via_api` to not post a strategy value if not user specified (#967)

* remove default strategy

* working on test

* fixed test, coordinates param needed to be included

* nits

* update changelog

* lint

* update requirements

* build(release): cut 0.8.4 release (#979)

* feat: add document date for remaining file types (#930) (#969)

* feat: add document date for remaining file types (#930)

* feat: add functions for getting modification date

* feat: add date field to metadata from csv file

* feat: add tests for csv patition

* feat: add date field to metadata from html file

* feat: add tests for html partition

* fix: return file name onlyif possible

* feat: add csv tests

* fix: renaming

* feat: add filed metadata_date  as date of last mod

* feat: add tests for partition_docx

* feat: add filed metadata_date  to .doc file

* feat: add tests for partition_doc

* feat: add metadata_date  to .epub file

* feat: add tests for partition_epub

* fix: fix test mocking

* feat: add metadata_date for image partition

* feat: add test for image partition

* feat: add coorrdinate system argument

* feat: add date to element metadata

* feat: add metadata_date for JSON partition

* feat: add test for JSON partition

* fix: rename variable

* feat: add metadata_date for md partition

* feat: add test for md partition

* feat: update doc string

* feat: add metadata_date for .odt partition

* feat: update .odt string

* feat: add metadata_date for .org partition

* feat: add tests for .org partition

* feat: add metadata_date for .pdf partition

* feat: add tests for .pdf partition

* feat: add metadata_date for .pptx partition

* feat: add metadata_date for .ppt partition

* feat: add tests for .ppt partition

* feat: add tests for .pptx partition

* feat: add metadata_date for .rst partition

* feat: add tests for .rst partition

* fix: get modification date after file checking

* feat: add tests for .rtf partition

* feat: add tests for .rtf partition

* feat: add metadata_date for .txt partition

* fix: rename argument

* feat: add tests for .txt partition

* feat: update doc string rst patrition function

* feat: add metadata_date for .tsv partition

* feat: add tests for .tsv partition

* feat: add metadata_date for .xlsx partition

* feat: add tests for .xlsx partition

* fix: clean up

* feat: add tests for .xml partition

* feat: add tests for .xml partition

* fix: use `or ` instead of `if`

* fix: fix epub tests

* fix: remove not used code

* fix: add try block for getting file name

* fix: applying linter changes

* fix: fix test_partition_file

* feat: add metadata_date for email

* feat: add test for email partition

* feat: add metadata_date for msg

* feat: add tests for msg partition

* feat: update CHANGELOG file

* fix: update partitions doc string

* don't push

* fix: clean up code

* linting, linting, linting

* remove unnecessary example doc

* update version and changelog

* ingest-test-fixtures-update

* set metadata date in test

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>

* ingest-test-fixtures-update

* Update ingest test fixtures (#970)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* Revert "Update ingest test fixtures (#970)"

This reverts commit 1d182ae.

* remove date from metadata in outputs

* update docstring ordering

* remove print

* remove print

* remove print

* linting, linting, linting

* fix version and test

* fix changelog

* fix changelog

* update version

---------

Co-authored-by: kravetsmic <79907559+kravetsmic@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* Chore: add uns api repo unittests (#954)

* stage

* git clone

* ci ignore markdown file

* make install

* use env instead

* remove md

* add script

* wrong env value

* add note

* maybe don't rm

* no cd../

---------

Co-authored-by: cragwolfe <crag@unstructured.io>

* fix: handling for empty tables in word docs and powerpoints (#982)

* fix table index error

* changelog and version

* fix: only download nltk packages if necessary (#985)

* fix: only download nltk if necessary

* changelog and version

* Chore: Pass table support  param to partition image (#973)

* add param and test in image table extraction

* version and changelog

* need to publish this one for api repo

* add new param skip_infer_table_types

* use warning

* clean up with mapping

* add test for tsv

* fix test fail

* weird change from merge

* doc nit

* don't use mapping

* correct conflict

* Update pip in makefile (#981)

* update pip in makefile

* merge and update requirements

* update version

* update outlook requirements

* chore: remove debug printing (#988)

* fix: correct nltk download arg order (#991)

* fix: correct download order to nltk args

* add smoke test for tokenizers

* Chore: put back function `split_by_paragraph` (#992)

* put back function

* not really fixes

* don't push

* fix: clean up code

* fix: clean up

* fix: clean up

* feat: add document date for remaining file types (#930) (#969)

* feat: add document date for remaining file types (#930)

* feat: add functions for getting modification date

* feat: add date field to metadata from csv file

* feat: add tests for csv patition

* feat: add date field to metadata from html file

* feat: add tests for html partition

* fix: return file name onlyif possible

* feat: add csv tests

* fix: renaming

* feat: add filed metadata_date  as date of last mod

* feat: add tests for partition_docx

* feat: add filed metadata_date  to .doc file

* feat: add tests for partition_doc

* feat: add metadata_date  to .epub file

* feat: add tests for partition_epub

* fix: fix test mocking

* feat: add metadata_date for image partition

* feat: add test for image partition

* feat: add coorrdinate system argument

* feat: add date to element metadata

* feat: add metadata_date for JSON partition

* feat: add test for JSON partition

* fix: rename variable

* feat: add metadata_date for md partition

* feat: add test for md partition

* feat: update doc string

* feat: add metadata_date for .odt partition

* feat: update .odt string

* feat: add metadata_date for .org partition

* feat: add tests for .org partition

* feat: add metadata_date for .pdf partition

* feat: add tests for .pdf partition

* feat: add metadata_date for .pptx partition

* feat: add metadata_date for .ppt partition

* feat: add tests for .ppt partition

* feat: add tests for .pptx partition

* feat: add metadata_date for .rst partition

* feat: add tests for .rst partition

* fix: get modification date after file checking

* feat: add tests for .rtf partition

* feat: add tests for .rtf partition

* feat: add metadata_date for .txt partition

* fix: rename argument

* feat: add tests for .txt partition

* feat: update doc string rst patrition function

* feat: add metadata_date for .tsv partition

* feat: add tests for .tsv partition

* feat: add metadata_date for .xlsx partition

* feat: add tests for .xlsx partition

* fix: clean up

* feat: add tests for .xml partition

* feat: add tests for .xml partition

* fix: use `or ` instead of `if`

* fix: fix epub tests

* fix: remove not used code

* fix: add try block for getting file name

* fix: applying linter changes

* fix: fix test_partition_file

* feat: add metadata_date for email

* feat: add test for email partition

* feat: add metadata_date for msg

* feat: add tests for msg partition

* feat: update CHANGELOG file

* fix: update partitions doc string

* don't push

* fix: clean up code

* linting, linting, linting

* remove unnecessary example doc

* update version and changelog

* ingest-test-fixtures-update

* set metadata date in test

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>

* ingest-test-fixtures-update

* Update ingest test fixtures (#970)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* Revert "Update ingest test fixtures (#970)"

This reverts commit 1d182ae.

* remove date from metadata in outputs

* update docstring ordering

* remove print

* remove print

* remove print

* linting, linting, linting

* fix version and test

* fix changelog

* fix changelog

* update version

---------

Co-authored-by: kravetsmic <79907559+kravetsmic@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* Roman/ingest refactor (#978)

* Pull out s3 code as subcommand

* Pull out dropbox code as subcommand

* Pull out azure code as subcommand

* Pull out fsspec code as subcommand

* Pull out github code as subcommand

* Pull out gitlab code as subcommand

* Pull out reddit code as subcommand

* Pull out slack code as subcommand

* Pull out discord code as subcommand

* Pull out wikipedia code as subcommand

* Pull out gdrive code as subcommand

* Pull out biomed code as subcommand

* rename parameters

* Pull out onedrive code as subcommand

* Pull out outlook code as subcommand

* Pull out local code as subcommand

* Pull out elasticsearch code as subcommand

* Pull out confluence code as subcommand

* Drop previous main file

* update changelog

* Add back in mp.Pool

* Fix mypy issues with click

* Make sure all tests run with verbose flag

* refactor approach to dynamically add common options to each subcommand, scrub logging of options for sensitive data

* Pull out some more shared options

* Support running code via python as well as cli

* update ingest readme and move it to the ingest folder

* update usage in connector docs

* move local command arg in test

* Seperate out cli code from logic running unstructured

* Make some cli fields required rather than optional

* rename process -> processor

* Improve logger to avoid duplicate handlers

---------

Co-authored-by: Ryan Nikolaidis <1208590+ryannikolaidis@users.noreply.github.com>

* feat: adds Box connector (#996)

* chore: rename Element's "date" field to "last_modified" (#997)

Change the Element's date field name to the more specific last_modified so there is less room for confusion of what that field represents.

* don't push

* feat: add document date for remaining file types (#930) (#969)

* feat: add document date for remaining file types (#930)

* feat: add functions for getting modification date

* feat: add date field to metadata from csv file

* feat: add tests for csv patition

* feat: add date field to metadata from html file

* feat: add tests for html partition

* fix: return file name onlyif possible

* feat: add csv tests

* fix: renaming

* feat: add filed metadata_date  as date of last mod

* feat: add tests for partition_docx

* feat: add filed metadata_date  to .doc file

* feat: add tests for partition_doc

* feat: add metadata_date  to .epub file

* feat: add tests for partition_epub

* fix: fix test mocking

* feat: add metadata_date for image partition

* feat: add test for image partition

* feat: add coorrdinate system argument

* feat: add date to element metadata

* feat: add metadata_date for JSON partition

* feat: add test for JSON partition

* fix: rename variable

* feat: add metadata_date for md partition

* feat: add test for md partition

* feat: update doc string

* feat: add metadata_date for .odt partition

* feat: update .odt string

* feat: add metadata_date for .org partition

* feat: add tests for .org partition

* feat: add metadata_date for .pdf partition

* feat: add tests for .pdf partition

* feat: add metadata_date for .pptx partition

* feat: add metadata_date for .ppt partition

* feat: add tests for .ppt partition

* feat: add tests for .pptx partition

* feat: add metadata_date for .rst partition

* feat: add tests for .rst partition

* fix: get modification date after file checking

* feat: add tests for .rtf partition

* feat: add tests for .rtf partition

* feat: add metadata_date for .txt partition

* fix: rename argument

* feat: add tests for .txt partition

* feat: update doc string rst patrition function

* feat: add metadata_date for .tsv partition

* feat: add tests for .tsv partition

* feat: add metadata_date for .xlsx partition

* feat: add tests for .xlsx partition

* fix: clean up

* feat: add tests for .xml partition

* feat: add tests for .xml partition

* fix: use `or ` instead of `if`

* fix: fix epub tests

* fix: remove not used code

* fix: add try block for getting file name

* fix: applying linter changes

* fix: fix test_partition_file

* feat: add metadata_date for email

* feat: add test for email partition

* feat: add metadata_date for msg

* feat: add tests for msg partition

* feat: update CHANGELOG file

* fix: update partitions doc string

* don't push

* fix: clean up code

* linting, linting, linting

* remove unnecessary example doc

* update version and changelog

* ingest-test-fixtures-update

* set metadata date in test

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>

* ingest-test-fixtures-update

* Update ingest test fixtures (#970)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* Revert "Update ingest test fixtures (#970)"

This reverts commit 1d182ae.

* remove date from metadata in outputs

* update docstring ordering

* remove print

* remove print

* remove print

* linting, linting, linting

* fix version and test

* fix changelog

* fix changelog

* update version

---------

Co-authored-by: kravetsmic <79907559+kravetsmic@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* feat: add document date for remaining file types (#930) (#969)

* feat: add document date for remaining file types (#930)

* feat: add functions for getting modification date

* feat: add date field to metadata from csv file

* feat: add tests for csv patition

* feat: add date field to metadata from html file

* feat: add tests for html partition

* fix: return file name onlyif possible

* feat: add csv tests

* fix: renaming

* feat: add filed metadata_date  as date of last mod

* feat: add tests for partition_docx

* feat: add filed metadata_date  to .doc file

* feat: add tests for partition_doc

* feat: add metadata_date  to .epub file

* feat: add tests for partition_epub

* fix: fix test mocking

* feat: add metadata_date for image partition

* feat: add test for image partition

* feat: add coorrdinate system argument

* feat: add date to element metadata

* feat: add metadata_date for JSON partition

* feat: add test for JSON partition

* fix: rename variable

* feat: add metadata_date for md partition

* feat: add test for md partition

* feat: update doc string

* feat: add metadata_date for .odt partition

* feat: update .odt string

* feat: add metadata_date for .org partition

* feat: add tests for .org partition

* feat: add metadata_date for .pdf partition

* feat: add tests for .pdf partition

* feat: add metadata_date for .pptx partition

* feat: add metadata_date for .ppt partition

* feat: add tests for .ppt partition

* feat: add tests for .pptx partition

* feat: add metadata_date for .rst partition

* feat: add tests for .rst partition

* fix: get modification date after file checking

* feat: add tests for .rtf partition

* feat: add tests for .rtf partition

* feat: add metadata_date for .txt partition

* fix: rename argument

* feat: add tests for .txt partition

* feat: update doc string rst patrition function

* feat: add metadata_date for .tsv partition

* feat: add tests for .tsv partition

* feat: add metadata_date for .xlsx partition

* feat: add tests for .xlsx partition

* fix: clean up

* feat: add tests for .xml partition

* feat: add tests for .xml partition

* fix: use `or ` instead of `if`

* fix: fix epub tests

* fix: remove not used code

* fix: add try block for getting file name

* fix: applying linter changes

* fix: fix test_partition_file

* feat: add metadata_date for email

* feat: add test for email partition

* feat: add metadata_date for msg

* feat: add tests for msg partition

* feat: update CHANGELOG file

* fix: update partitions doc string

* don't push

* fix: clean up code

* linting, linting, linting

* remove unnecessary example doc

* update version and changelog

* ingest-test-fixtures-update

* set metadata date in test

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>

* ingest-test-fixtures-update

* Update ingest test fixtures (#970)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* Revert "Update ingest test fixtures (#970)"

This reverts commit 1d182ae.

* remove date from metadata in outputs

* update docstring ordering

* remove print

* remove print

* remove print

* linting, linting, linting

* fix version and test

* fix changelog

* fix changelog

* update version

---------

Co-authored-by: kravetsmic <79907559+kravetsmic@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* fix: removie prints

* remove unused file

* fix: apply linter

* feat: add post processing filter_element_types

* feat: add tests for filter_element_types

* feat: update changelog

* feat: add doc string for filter_element_types

* fix: change the version

* feat: update documentation

* bump dev version number

* cleanup changelog

* linting, linting, linting

---------

Co-authored-by: John <43506685+Coniferish@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: David Potter <potterdavidm@gmail.com>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants