feat: add string_splitter_transformer (#53)
premsrii authored Jan 20, 2023
1 parent b8bf874 commit fdf89e1
Showing 7 changed files with 270 additions and 147 deletions.
1 change: 1 addition & 0 deletions docs/README.md
@@ -69,6 +69,7 @@ poetry install
[`String transformer`](https://chrislemke.github.io/sk-transformers/API-reference/transformer/string_transformer/)|[`PhoneTransformer`](https://chrislemke.github.io/sk-transformers/API-reference/transformer/string_transformer/#sk_transformers.string_transformer.PhoneTransformer)|Transforms a phone number into multiple features.|
[`String transformer`](https://chrislemke.github.io/sk-transformers/API-reference/transformer/string_transformer/)|[`StringSimilarityTransformer`](https://chrislemke.github.io/sk-transformers/API-reference/transformer/string_transformer/#sk_transformers.string_transformer.StringSimilarityTransformer)|Calculates the similarity between two strings using the `gestalt pattern matching` algorithm from the `SequenceMatcher` class.|
[`String transformer`](https://chrislemke.github.io/sk-transformers/API-reference/transformer/string_transformer/)|[`StringSlicerTransformer`](https://chrislemke.github.io/sk-transformers/API-reference/transformer/string_transformer/#sk_transformers.string_transformer.StringSlicerTransformer)|Slices all entries of specified string features using the slice() function.|
[`String transformer`](https://chrislemke.github.io/sk-transformers/API-reference/transformer/string_transformer/)|[`StringSplitterTransformer`](https://chrislemke.github.io/sk-transformers/API-reference/transformer/string_transformer/#sk_transformers.string_transformer.StringSplitterTransformer)|Splits a string column into multiple columns based on the occurrence of a character.|

## Usage
Let's assume you want to use some method from NumPy's mathematical functions to sum up the values of column `foo` and column `bar`. You could
26 changes: 25 additions & 1 deletion examples/playground.ipynb
@@ -664,6 +664,30 @@
"transformer.fit_transform(X)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### [`StringSplitterTransformer`](https://chrislemke.github.io/sk-transformers/API-reference/transformer/string_transformer/#sk_transformers.string_transformer.StringSplitterTransformer)\n",
"\n",
"Uses the pandas `str.split` method to split a column of strings into multiple columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sk_transformers import StringSplitterTransformer\n",
"\n",
"X = pd.DataFrame({\"foo\": [\"a_b\", \"c_d\", \"e_f\"], \"bar\": [\"g*h*i\", \"j*k*l\", \"m*n*o\"]})\n",
"transformer = StringSplitterTransformer([(\"foo\", \"_\", 2), (\"bar\", \"*\", 3)])\n",
"transformer.fit_transform(X)"
]
},
{
"attachments": {},
"cell_type": "markdown",
@@ -689,7 +713,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:25:29) [Clang 14.0.6 ]"
"version": "3.10.8"
},
"vscode": {
"interpreter": {
157 changes: 80 additions & 77 deletions poetry.lock

Large diffs are not rendered by default.

138 changes: 69 additions & 69 deletions requirements.txt
@@ -370,49 +370,49 @@ executing==1.2.0 ; python_version >= "3.8" and python_version < "3.11" \
fastjsonschema==2.16.2 ; python_version >= "3.8" and python_version < "3.11" \
--hash=sha256:01e366f25d9047816fe3d288cbfc3e10541daf0af2044763f3d0ade42476da18 \
--hash=sha256:21f918e8d9a1a4ba9c22e09574ba72267a6762d47822db9add95f6454e51cc1c
fastparquet==2022.12.0 ; python_version >= "3.8" and python_version < "3.11" \
--hash=sha256:01508f83a3952de94b927fdeabf908d9d39608e67270dcd1c9a3f5d56b0baa03 \
--hash=sha256:0eb862ab28e3ac9479e06faa5153357e60ca54869126519fbbda557b725b0a0c \
--hash=sha256:14cc3c50fca76df6a7aa58f69376e77f547865bd946a5405b70b32e8a3126ec7 \
--hash=sha256:21a41e67e713a10bb87ff8ea946661b465266265bc8c34f5969a55dad450a4e4 \
--hash=sha256:27f511c62e47ab9a3fe49ed9f426f886f388d4aa4df3eb7e747353f18c78cac2 \
--hash=sha256:2ab2ae2e6dd22e23a31586daced4fe2128ddea8e4abb25c685fce34cc36a537f \
--hash=sha256:2f0c0af41ed731976b90bacfd763f1b2d90d7653a1bab9fa8c6261f8380a6744 \
--hash=sha256:35ad000a4e969b87931f52d37cd5a9241f0daa3d9fb4be2f52f95e592c5ba9cc \
--hash=sha256:3e2927999ef7f2a94b09a00c2f61e1bb86d24ca3d2a259bba449eef938b049f8 \
--hash=sha256:45371a5393e27c8823d05c5f857d1354c947c5fb2fc9f314006e50bba3d1d506 \
--hash=sha256:46c3f482106742623eadcfbcd18597758e7210a1d811aa1606c555e8c3aec2b6 \
--hash=sha256:540eafe9a811736f09dc77712dfd1adde197020e521159ae95eed65315afc03d \
--hash=sha256:5f4ec6e91600626ad4a60476bbdda154807458ef54141d64ad5e2a7b436a5d0f \
--hash=sha256:6c32349355652ab35a082d79e76e809ec0a9b68546d321a0003dda762fb40a18 \
--hash=sha256:7257c7dc1f39bd27fc5f964d1fd017dc378e972dc35278aafe1468fd2e364a97 \
--hash=sha256:76cf692c2390cef70cccd7122f4aca0a2dcd9a53b0b5622e9f4b9b0e6f9da63c \
--hash=sha256:82458131bb7619b23ab670d0fa3c8379663ce59042a4281820ca5c8bb3f72374 \
--hash=sha256:8cb46f8282f5a23684b1c31380c6cec6987233dbaa55b81cb75c1f4cc82a0111 \
--hash=sha256:8f6a42c67398c4b9d502d5f8e39d569d0cca30fa21bbc230f436915be3ed086d \
--hash=sha256:a0d252e9342458d6c2574103ac0224b3a9bf44da6ebb5e08509da84d25cbffa3 \
--hash=sha256:a23be3c71c2e21cbb87cccafc08458815d5d61921e3a05659469e8f071fe0cc2 \
--hash=sha256:a9b0e6227d676a5394a8a3f46c0105db31e3013d6c64fb9901f36294bf5d5eb3 \
--hash=sha256:aa69af2509d0a5a822abf26fc53247940b2c2f4cb614c6eff7ce8118dd61d665 \
--hash=sha256:ac746c782fbb8a8ec6155da8467e02a65ba781d652a5347c986b24466c7d17c6 \
--hash=sha256:b7be4935c57ce4db9ec5cf7703a15312a3b722179f1645c3976b36660a25c11c \
--hash=sha256:b7c95a7e1bc2bc91b72704a2869f99f10334b5c40a365bce1b61f46ceffc7e3c \
--hash=sha256:b95a48d68e11c9b3f0a8ee206dfa264e1337ea233ee1f012d0c98b0dfca31c0f \
--hash=sha256:cf2d772db1bc2dfad641eb7b75adc1e09fe31a59536dbef619f7bdb8f3f0f1d8 \
--hash=sha256:dcde27d4461e094036e01c21df833b2274afd025311b631ca635fa11dc06e83e \
--hash=sha256:e6f0b4c04db5d465f9e47f29ee40c213cd41c29ed290807de6953963d855205d \
--hash=sha256:eb7e2b452e36fbf9b8ed71be7e5200c7e5e4f9f8bd50c697f26638d95c18413c \
--hash=sha256:eea0f24d786ece9cae64a2ddcf06b504436229af08f04c0ec76210cd428eb79c \
--hash=sha256:f47d1652044966abdfa6b235a8bd586461ece984575fae1b49a609128c795d42
fastparquet==2023.1.0 ; python_version >= "3.8" and python_version < "3.11" \
--hash=sha256:00347f09a060852ce4330ce678c638977faf6fdb5c29caf89ad5651e0f0d7621 \
--hash=sha256:02e0619a86e9e328373cbfb22fceb8e4054b6d32badffb565ff21d7a3566ed38 \
--hash=sha256:09e2bcca0d95867b0364637a02d032844a496a47c2a2926e007a126e2bc25f55 \
--hash=sha256:0db7578d62945e4b9b6e983afc0f15fe9d82f47f76ebc3cdbd713c5fadd4ea84 \
--hash=sha256:1096fdebb87a9630b69bd7c68185783a337d01c1cd24916b1489ecb82b55cefb \
--hash=sha256:11adc51b17af433db8486b9be959c806034d44184e073249bd3285db85dc768e \
--hash=sha256:201b05ececa2e2e607230039cea6f9e0027837e8e273c8ad83886f10699bc9c9 \
--hash=sha256:3bcf1e969a42f8dabedca2cb255e7649d0725eafebf1e897450d84af504a5c70 \
--hash=sha256:3d138a35979d72e4e2e1c06a6f275ea8b8885d1484e791fa7ad148af3aca8878 \
--hash=sha256:3feb1758b7b746e92d7aef64a013a0402a5919ff0147803276bc40e102141815 \
--hash=sha256:4ebcaa49b57d4f11112160e80f3feab1a36af68072e415672da985930c66c3a2 \
--hash=sha256:568406db0c7fc37179e468503221e526a4945e553d145fbf1f6344b5b3a8c8e6 \
--hash=sha256:5efcb6e0280fe8e103e8a5f6bf4a5ecd32915d3f9959a4e85f64661c7cbecede \
--hash=sha256:72fcd440472a4acfda2ab2007c2c23de37bce33ad4c609ab095aeb00012e699c \
--hash=sha256:74ebaff8f4f7922f44953161770c44a88b61dccd3cc11393f20856e34c3cf05c \
--hash=sha256:76dd48cb568c4596baded551251f870a3690a43893e29653baf26062549b82b3 \
--hash=sha256:851fa21b1df421d8acadfd10025d7721c46c2182d4a64cef9a3811fa4a25a2eb \
--hash=sha256:8fdfc1adcbc0ea1d05f9ac3576cf12732189c54e4b1c9d38da990dc36d9cc348 \
--hash=sha256:92252538823da2bf958d2f2edd14e3864ae296d28f5be24e07eb685b4b08bed2 \
--hash=sha256:97b978b90037d312d673dfd2e2c17cca85c692eaa9373f44856b1d5ed48a8cec \
--hash=sha256:98065a55bfbedddcc237a791109ea9b3ac3e8008318e4c8e7b39227219494e4b \
--hash=sha256:993079d95120ab234b7bfae200c3b7f56b16df4e284c62353a466dbfce951d23 \
--hash=sha256:a104fede9b113e079a9e480242de809b0eacb95d718d20c3a9e14a65cffd4031 \
--hash=sha256:b8256c56bea62d43fd26307f68fd2ad281a1b21478b64a94bb94a01681a97583 \
--hash=sha256:c00c47cce430204f4e7c007f84e420feada5676a6e752e093ca039cab5fa7370 \
--hash=sha256:c3a1ae4dbd079bc4b195249a0791a187c45b9b1802af947167c8d76a01cd8a79 \
--hash=sha256:c40fe744c478c64105dae97b1bdf10709c5f730f12fbeaa719a6714513c4eb7e \
--hash=sha256:cb3c6406e086db3bf5835a62e46626111928e50bad5bfe56e63d40d293303be1 \
--hash=sha256:cec14b87d5f721ba85e0fc0797a9adfb751d8e501863b5c587da09c2e65f2095 \
--hash=sha256:dea4af358ff2b55101d7708e9309283ec6dacd99d42b7060d79d5c1227bfa079 \
--hash=sha256:e873286445e0850a5f044c71b9c3f55279a1fbd7b7e39590c866f24de5ce850f \
--hash=sha256:fafd22c2a799ae9f3fcc6c1763d2480da3d47199beb6c8667b04d688a5507905 \
--hash=sha256:fc6af1f2b2f9c29f1e61097fa7a8adcbf568815dea787ed2d2590d1ec8467826
feature-engine==1.5.2 ; python_version >= "3.8" and python_version < "3.11" \
--hash=sha256:218edefcb34394562ee779fe5b883c11f602727d13dccd39521912dd75ba1cb8 \
--hash=sha256:d8fbe773d6b43dfc1eb051995256406e6346728711bc9728ab4418adad80e23f
filelock==3.9.0 ; python_version >= "3.8" and python_version < "3.11" \
--hash=sha256:7b319f24340b51f55a2bf7a12ac0755a9b03e718311dac567a0f4f7fabd2f5de \
--hash=sha256:f58d535af89bb9ad5cd4df046f741f8553a418c01a7856bf0d173bbc9f6bd16d
fsspec==2022.11.0 ; python_version >= "3.8" and python_version < "3.11" \
--hash=sha256:259d5fd5c8e756ff2ea72f42e7613c32667dc2049a4ac3d84364a7ca034acb8b \
--hash=sha256:d6e462003e3dcdcb8c7aa84c73a228f8227e72453cd22570e2363e8844edfe7b
fsspec==2023.1.0 ; python_version >= "3.8" and python_version < "3.11" \
--hash=sha256:b833e2e541e9e8cde0ab549414187871243177feb3d344f9d27b25a93f5d8139 \
--hash=sha256:fbae7f20ff801eb5f7d0bedf81f25c787c0dfac5e982d98fa3884a9cde2b5411
fst-pso==1.8.1 ; python_version >= "3.8" and python_version < "3.11" \
--hash=sha256:b3d16ec27b0b4d36b35b306af40c00cd0b34e5e0a9e30a71ed02490e8954a26a
fuzzytm==2.0.5 ; python_version >= "3.8" and python_version < "3.11" \
@@ -1211,35 +1211,35 @@ spacy-legacy==3.0.11 ; python_version >= "3.8" and python_version < "3.11" \
spacy-loggers==1.0.4 ; python_version >= "3.8" and python_version < "3.11" \
--hash=sha256:e050bf2e63208b2f096b777e494971c962ad7c1dc997641c8f95c622550044ae \
--hash=sha256:e6f983bf71230091d5bb7b11bf64bd54415eca839108d5f83d9155d0ba93bf28
spacy==3.4.4 ; python_version >= "3.8" and python_version < "3.11" \
--hash=sha256:07a10999a3e37f896758a92c2eed263638bcbf2747dc3a4aeea929aaa20ea28c \
--hash=sha256:0bb7d53f1a780bb8cc1b27a81e02e8b9bc71abb959f4dc13c21af4041fdd2c7a \
--hash=sha256:10643c6d335a02805f6676738a3e992323cfd9438115cc253435e5053dc93824 \
--hash=sha256:15e5c41d408d1d30d8f3dd8e4eed9ed28e6174e011b8d61c1345981562e2e8f5 \
--hash=sha256:1b7791a6c0592615b0566001596cc48c72325d1b97e46e574c91bff34f4e3f4c \
--hash=sha256:1f4736fea2630e696422dfe38bfb3d0a7864bc6a9072d6e49a906af46870e36e \
--hash=sha256:29d6bb428a6bb19e026d8bbb9d4385c25b21e1ce51fcaabadfb5599b2390a79c \
--hash=sha256:2f1edbecfde9c11b17e87768bb5f2c33948fb1e3bf54b2197031ff9053607277 \
--hash=sha256:31e9a637960b60c1bb7a36a187271425717e97c14e9d1df613dc4efeffefcbec \
--hash=sha256:486228cfa7ced18ec99008388028bd2329262ab8108e7c19252c1a67b2801909 \
--hash=sha256:498bf01e8c7ab601c3f8d6c51497817b40a3322a3967c032536b18ce9ea26d0a \
--hash=sha256:4ade19c1e676cac2546f268db22bc5eba08d12beafabe80f1b9f06028b3a0b52 \
--hash=sha256:66eaf4764e95699934cbd8f38717b283db185c896cfd3d1fb1ad5c6552e8b3c9 \
--hash=sha256:71f9449ffadef85b048c9735ee235da5dca9d0a87038dba6d4ed20c5188e0f13 \
--hash=sha256:8979dbd3594c5c268cedad53f456a3ec3a0a2b78a1199788aacedcd68eef3a00 \
--hash=sha256:8a495b0fc00910fb5c1fbe64fdbfe1d3c11b09f421d1ae4e30cdb4c2388a91e4 \
--hash=sha256:95f880c6fea57d51c448ad84f96d79d8758e5e18bdbaaee060c15af11641079b \
--hash=sha256:9ccbede9be470c5d795168bf3be41fc86e18892a9247a742b394ba866c005391 \
--hash=sha256:a21187ad4c44e166dc3deed23992ea1a74d731c9a6bdd9fca306d455181577fa \
--hash=sha256:aa027e69ef9fe42c8b02b940872e5bde0ce1bf66b6bf488c6493e3ce660c4b3a \
--hash=sha256:bcb7a213178c298b95532075d6dddfb374bbe56ef8d2687212763b4583048da2 \
--hash=sha256:c1a5ce5c9b19cdfb4469079e710e72bb09c3cab855f21ef6a614b84c765e0311 \
--hash=sha256:ddeb5d725b6fa9c9009b1ff645db8f5caab9ed8956ee3a84b8379951caad1d36 \
--hash=sha256:e500cf2cb5f1849461a7928fa269703756069bdfb71559065240af6d0208b08c \
--hash=sha256:e6d98511dc8a88d3a96bcae13971a284459362076738c85053d1a3791f6cde92 \
--hash=sha256:e782c8a7c4805cc1b34ed2b11f72a5cf2b9851e20f7afe3e97caf206f19f761b \
--hash=sha256:f2cad9c5543f03b3375c252e4dd45670ee8ed99c925dca15eadab5084fd1b033 \
--hash=sha256:f7044dca3542579ea1e3ac6cdd821640c2f65dd0c56230688f36e15aca1b8217
spacy==3.5.0 ; python_version >= "3.8" and python_version < "3.11" \
--hash=sha256:02c54fd297c8e9b91da0198e2619fb66a36b5a49d2e429ce9ab7fd85918d2e8e \
--hash=sha256:09045de05a378c4c6e7f209fc31995d22c88d4af1e036f086425d4febf63d542 \
--hash=sha256:0d67a02ecb3abcaa273031d111d5e41276460e6484d191b82336092059663c55 \
--hash=sha256:1aa6fd598cf9b6e9d671fb9f80ef1bcb24d69c34cdbf38a6626d265d1060474b \
--hash=sha256:1f95c739a8a9b84131aaa4af70f4a98e9cbb7d81576f12f49bbd20cf709b85c2 \
--hash=sha256:28c762da4bcb2849de7f180476a6a942df4ae7997e5aaf4f0670b61a3204ad89 \
--hash=sha256:42fbe255828e129af7a5b7807a13ea2d84b3b4388e8517713e0f4e8807b320c5 \
--hash=sha256:5315ab53a1dc04bf2f0f8f6677bb1f93c75aa2e049f006ae0d53851870625d65 \
--hash=sha256:5af92ed98229fcd0327af73644e0bca510ef6cd9211cad15dd04530f8fe947f3 \
--hash=sha256:657350719742e925d305e66f35f4af29642f039b9556aa4c510de1ebb09f6913 \
--hash=sha256:65f7927e36d9d520e88457b1ee0b0aafce5f0267f1cd66cf840378f91f01447e \
--hash=sha256:705f43746e415b1b9ea518530d0d4e5c1cb09526f459960a6c96497ae1ccb716 \
--hash=sha256:7630b4c1268b18da1a0abd23e2662f48fbf4c36ee223526d3c49de140d4d2e1e \
--hash=sha256:7c2884f102847aeacc366838d89879faf54d7f3ff9cc53220ee02deedb7e2c33 \
--hash=sha256:7ca247b654348d3a97e490f9a3950e1039d11ad66d3efc409ca32ecc8371da94 \
--hash=sha256:90ccb5e675ae6dc1aa1ad187432f4e4483c90a8d2d07f7a1ea77b582d637984c \
--hash=sha256:92d23532380deb077164466e89064e0e5366732ac971af158c36eae8490d32bf \
--hash=sha256:9d28110437e0382d76852f9e45925bfd5cecaf43e26628cfdcd0c2f61b23d57c \
--hash=sha256:a709387d833c0e88d4fccfe17e16421de050b8ae22a81c509f6e98e3b178a164 \
--hash=sha256:bb6b7f79552ca1d39d3e7d415beb7cbb85313ab11dc58fa963410ae99c125578 \
--hash=sha256:c855f5b2826c21dbf2e61308f8e1beec5939f67950f8ecf95abfd42621297d5a \
--hash=sha256:ced60f84c412d69ee4634d642316e012e1fd63142c9b5877b03e6a44997228a8 \
--hash=sha256:d6ad4940e4e9591fa8f3ce289d28a49d3b9a8e7d32fb1352d197a95f46d6c6c4 \
--hash=sha256:d7d3ed664964aff6b15fcefbd7302dac3e0d3b06cf3b86c9130fdcdfafa56d0c \
--hash=sha256:e27e938fca23b87bab978a6098a30ad7d0974d10a630d2f5ba43103eefba4d06 \
--hash=sha256:f1d1c2069fb85447f647baf7c02886a9b63c2f40d75bb9479c921598b8acf8a2 \
--hash=sha256:f81ddd6475b59ba62ddaf72fcdc940873a6688ad1866f308fb722c1ae63fa2a5 \
--hash=sha256:fe20127012992778804d93f75ce9370d588573072639c3832cec38f54bf7e4a5
srsly==2.4.5 ; python_version >= "3.8" and python_version < "3.11" \
--hash=sha256:04d0b4cd91e098cdac12d2c28e256b1181ba98bcd00e460b8e42dee3e8542804 \
--hash=sha256:0f9abb7857f9363f1ac52123db94dfe1c4af8959a39d698eff791d17e45e00b6 \
@@ -1424,9 +1424,9 @@ urllib3==1.26.14 ; python_version >= "3.8" and python_version < "3.11" \
virtualenv==20.17.1 ; python_version >= "3.8" and python_version < "3.11" \
--hash=sha256:ce3b1684d6e1a20a3e5ed36795a97dfc6af29bc3970ca8dab93e11ac6094b3c4 \
--hash=sha256:f8b927684efc6f1cc206c9db297a570ab9ad0e51c16fa9e45487d36d1905c058
wasabi==0.10.1 ; python_version >= "3.8" and python_version < "3.11" \
--hash=sha256:c8e372781be19272942382b14d99314d175518d7822057cb7a97010c4259d249 \
--hash=sha256:fe862cc24034fbc9f04717cd312ab884f71f51a8ecabebc3449b751c2a649d83
wasabi==1.1.1 ; python_version >= "3.8" and python_version < "3.11" \
--hash=sha256:32e44649d99a64e08e40c1c96cddb69fad460bd0cc33802a53cab6714dfb73f8 \
--hash=sha256:f5ee7c609027811bd16e620f2fd7a7319466005848e41b051a62053ab8fd70d6
wcwidth==0.2.6 ; python_version >= "3.8" and python_version < "3.11" \
--hash=sha256:a5220780a404dbe3353789870978e472cfe477761f06ee55077256e509b156d0
webencodings==0.5.1 ; python_version >= "3.8" and python_version < "3.11" \
1 change: 1 addition & 0 deletions src/sk_transformers/__init__.py
@@ -24,4 +24,5 @@
PhoneTransformer,
StringSimilarityTransformer,
StringSlicerTransformer,
StringSplitterTransformer,
)
62 changes: 62 additions & 0 deletions src/sk_transformers/string_transformer.py
@@ -389,3 +389,65 @@ def transform(self, X: pd.DataFrame) -> pd.DataFrame:
            X[feature] = [x[slice(*slice_args)] for x in X[feature]]

        return X


class StringSplitterTransformer(BaseTransformer):
    """Uses the pandas `str.split` method to split a column of strings into
    multiple columns.

    Example:
    ```python
    import pandas as pd
    from sk_transformers import StringSplitterTransformer

    X = pd.DataFrame({"foo": ["a_b", "c_d", "e_f"], "bar": ["g*h*i", "j*k*l", "m*n*o"]})
    transformer = StringSplitterTransformer([("foo", "_", 2), ("bar", "*", 3)])
    transformer.fit_transform(X)
    ```
    ```
       foo    bar foo_part_1 foo_part_2 bar_part_1 bar_part_2 bar_part_3
    0  a_b  g*h*i          a          b          g          h          i
    1  c_d  j*k*l          c          d          j          k          l
    2  e_f  m*n*o          e          f          m          n          o
    ```

    Args:
        features (List[Tuple[str, str, int]]): A list of tuples where
            the first element is the name of the feature,
            the second element is the string separator,
            and the third element is the number of parts (new columns)
            to split the feature into.
    """

    def __init__(
        self,
        features: List[
            Tuple[
                str,
                str,
                int,
            ]
        ],
    ) -> None:
        super().__init__()
        self.features = features

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        """Splits the strings based on a separator character.

        Args:
            X (pandas.DataFrame): DataFrame to transform.

        Returns:
            pandas.DataFrame: DataFrame with one additional column for
                each split part of the string.
        """

        X = check_ready_to_transform(self, X, [feature[0] for feature in self.features])

        for column, separator, n_parts in self.features:
            split_column_names = [f"{column}_part_{i+1}" for i in range(n_parts)]
            # `n` limits the number of splits, so `n_parts - 1` splits yield
            # at most `n_parts` parts, one per new column.
            X[split_column_names] = X[column].str.split(
                separator, n=n_parts - 1, expand=True
            )

        return X
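
The relationship between the number of splits and the number of resulting parts can be illustrated without pandas. The following is a minimal sketch that uses only Python's built-in `str.split`, not the pandas implementation from this commit. The helper names `split_into_parts` and `split_columns` are hypothetical and not part of sk-transformers, and padding short rows with empty strings is a choice made for this sketch only.

```python
from typing import Dict, List, Tuple


def split_into_parts(value: str, separator: str, parts: int) -> List[str]:
    """Split `value` into exactly `parts` pieces (hypothetical helper)."""
    # At most `parts - 1` splits are needed to obtain `parts` pieces;
    # any remaining separators stay in the last piece.
    pieces = value.split(separator, parts - 1)
    # Pad short results so that every row yields the same number of columns.
    return pieces + [""] * (parts - len(pieces))


def split_columns(
    table: Dict[str, List[str]], features: List[Tuple[str, str, int]]
) -> Dict[str, List[str]]:
    """Return a copy of `table` with `<column>_part_<i>` columns added."""
    result = {name: list(values) for name, values in table.items()}
    for column, separator, parts in features:
        rows = [split_into_parts(v, separator, parts) for v in table[column]]
        for i in range(parts):
            result[f"{column}_part_{i + 1}"] = [row[i] for row in rows]
    return result


table = {"foo": ["a_b", "c_d", "e_f"], "bar": ["g*h*i", "j*k*l", "m*n*o"]}
out = split_columns(table, [("foo", "_", 2), ("bar", "*", 3)])
print(out["foo_part_1"], out["bar_part_3"])
```

In this sketch, a value with more separators than expected, such as `"a_b_c"` split into two parts, keeps the remainder in the last part (`"b_c"`) instead of producing an extra column.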