e_utils

Miscellaneous utilities I sometimes use.

Installing

pip install git+https://github.com/ebravofm/e_utils.git

filename()

Takes any string and returns a new string that is valid for naming files.

  • Removes accents and tildes.
  • Removes invalid characters.
  • Replaces spaces (' ') with underscores ('_').
  • Keeps only the first 255 characters.
>>> s = "La vie d'Adèle (2003).mp4"
>>> filename(s)

'La_vie_dAdele_(2003).mp4'
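
The four steps listed above can be sketched with the standard library alone. This is only an illustration, not the package's actual code: the name `sketch_filename` and the allow-list of characters treated as valid are assumptions made for this example.

```python
import re
import unicodedata


def sketch_filename(s, max_len=255):
    """Hypothetical re-implementation of the steps listed for filename()."""
    # Decompose accented characters (e.g. "è" -> "e" + combining accent)
    # and drop the combining marks, which removes accents and tildes.
    decomposed = unicodedata.normalize("NFKD", s)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))

    # Replace spaces with underscores.
    underscored = no_accents.replace(" ", "_")

    # Assumption: "valid" means ASCII letters, digits, and _ . ( ) -
    # Everything else (quotes, slashes, colons, ...) is removed.
    cleaned = re.sub(r"[^A-Za-z0-9_.()\-]", "", underscored)

    # Keep only the first max_len characters.
    return cleaned[:max_len]


print(sketch_filename("La vie d'Adèle (2003).mp4"))  # La_vie_dAdele_(2003).mp4
```

Truncating after the other steps means the 255-character limit applies to the final name rather than to the raw input.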

clean_df()

Takes a pandas DataFrame or Series and normalizes every value by applying the following vectorized operations:

  • Applies lower case
  • Strips leading and trailing spaces
  • Removes accents and tildes
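
As a rough sketch of what these three operations do to a single value, the helper below uses only the standard library. The name `normalize_text` is hypothetical and this is not the package's implementation, which is described as operating on a whole DataFrame or Series at once.

```python
import unicodedata


def normalize_text(value):
    """Hypothetical per-value version of the steps listed for clean_df()."""
    # Apply lower case.
    lowered = value.lower()

    # Strip leading and trailing spaces.
    stripped = lowered.strip()

    # Remove accents and tildes: decompose each character, then drop
    # the combining marks that carried the accent.
    decomposed = unicodedata.normalize("NFKD", stripped)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


print(normalize_text("  José Ñandú "))  # jose nandu
```

With pandas, a function like this could be applied to a Series of strings through `series.map(normalize_text)`.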

homog_lev()

Takes a pandas DataFrame or Series and homogenizes its values by matching similar strings and giving them the same name. The function compares the strings to each other using Levenshtein distance; by default, strings that are a single one-character edit apart are treated as the same. It computes the Levenshtein distance matrix, groups the strings into clusters, and renames every element in a cluster to the same value. Clustering is done with scikit-learn's DBSCAN.

>>> import pandas as pd
>>> from e_utils.utils import clean_df
>>> from e_utils.utils import homog_lev

>>> series
0        Bad Bunny
1         bad buny
2      bag bunny
3            Ozuna
4     De La Ghetto
5      de la geto
6     Daddy Yankee
7      dade yankee
8        Nicky Jam
9        nicky jam
10        J Balvin
11        jbalvin
12          Maluma
13          maluma
14        Anuel AA

>>> series2 = clean_df(series)
>>> series2 = homog_lev(series2, eps=3)
>>> pd.concat([series, series2.str.title()], axis=1, keys=['*Original*', '*Fixed*'])

      *Original*       *Fixed*
0      Bad Bunny     Bad Bunny
1       bad buny     Bad Bunny
2    bag bunny       Bad Bunny
3          Ozuna         Ozuna
4   De La Ghetto  De La Ghetto
5    de la geto   De La Ghetto
6   Daddy Yankee  Daddy Yankee
7    dade yankee  Daddy Yankee
8      Nicky Jam     Nicky Jam
9      nicky jam     Nicky Jam
10      J Balvin      J Balvin
11      jbalvin       J Balvin
12        Maluma        Maluma
13        maluma        Maluma
14      Anuel AA      Anuel Aa
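
The idea behind homog_lev() (distance matrix, then clusters, then one name per cluster) can be illustrated without any dependencies. This is a sketch under stated assumptions, not the package's code: instead of scikit-learn's DBSCAN it groups strings by plain connected components, which is what DBSCAN reduces to when every point may be a core point (`min_samples=1`), and it assumes each cluster is renamed after its first member. The names `levenshtein` and `homogenize_sketch` are hypothetical.

```python
def levenshtein(a, b):
    """Count the single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def homogenize_sketch(values, eps=1):
    """Rename each string to the first member of its cluster."""
    n = len(values)
    # Pairwise Levenshtein distance matrix.
    dist = [[levenshtein(x, y) for y in values] for x in values]

    # Flood fill: strings linked by a chain of pairs at distance <= eps
    # end up in the same cluster, labeled by the cluster's first index.
    labels = [None] * n
    for start in range(n):
        if labels[start] is not None:
            continue
        labels[start] = start
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] is None and dist[i][j] <= eps:
                    labels[j] = start
                    stack.append(j)

    return [values[label] for label in labels]


names = ["bad bunny", "bad buny", "bag bunny", "ozuna", "j balvin", "jbalvin"]
print(homogenize_sketch(names, eps=1))
# ['bad bunny', 'bad bunny', 'bad bunny', 'ozuna', 'j balvin', 'j balvin']
```

Because clusters are built from chains of neighbors, two strings further apart than `eps` can still be merged when intermediate spellings connect them, so a larger `eps` makes the grouping noticeably more aggressive.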
